Apache Avro Tutorial for Beginners

by TechVidvan Team

Apache Avro is a language-neutral data serialization system developed by Doug Cutting. In this Avro tutorial, you will learn the basics of Apache Avro. Avro is a preferred tool for serializing data in Apache Hadoop.

If you don’t know what data serialization is, then don’t worry. The Avro Tutorial first explains what data serialization is and how we use Apache Avro for serializing data in Apache Hadoop.

Later on, this Avro Tutorial covers the core concepts of Apache Avro in detail.

What is Data Serialization?

Data serialization is the technique of converting data into a text or binary format. It is useful for transferring data over a network and for storing data persistently.

There are various systems available for data serialization. Avro is one of them, and it is a schema-based serialization system.
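To make the idea concrete, here is a minimal sketch in Python that serializes the same record once as text (JSON) and once as binary. The record and the binary layout are assumptions made up for this illustration; this is not Avro's own format.

```python
import json
import struct

# Hypothetical record used only for illustration.
record = {"id": 27, "name": "foo"}

# Text serialization: JSON repeats field names and punctuation in every message.
text_bytes = json.dumps(record).encode("utf-8")

# Binary serialization with an assumed layout agreed on in advance:
# 8-byte big-endian integer, 2-byte length prefix, then the UTF-8 bytes.
name_bytes = record["name"].encode("utf-8")
binary_bytes = struct.pack(">qH", record["id"], len(name_bytes)) + name_bytes

# Deserialization reverses the process using the same agreed layout.
rec_id, name_len = struct.unpack_from(">qH", binary_bytes, 0)
name = binary_bytes[10:10 + name_len].decode("utf-8")
restored = {"id": rec_id, "name": name}

print(len(text_bytes), len(binary_bytes), restored == record)
```

The binary form is shorter because the layout, rather than the message, carries the structure. A schema-based system such as Avro follows the same principle, with the schema playing the role of the agreed layout.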

Let us now learn about Apache Avro in detail.

Introduction to Apache Avro

Avro is a language-neutral data serialization system developed by Doug Cutting, the creator of Hadoop.
Because Hadoop’s Writable classes lack language portability, Avro is helpful: it works with data formats that can be processed by multiple languages.
Apache Avro is a preferred tool for serializing data in Hadoop.
It is a schema-based system.
A language-independent schema is always associated with Avro read and write operations.
Avro serializes data together with its schema, so the data is self-describing.
It serializes data into a compact binary format, which any application that has the schema can deserialize.
Avro uses JSON to declare data structures.
At present, Apache Avro supports languages such as C, C++, C#, Java, Python, and Ruby.

Avro Schemas

Apache Avro relies heavily on schemas. When Avro data is read, the schema that was used to write it is always present.

This allows each datum to be written with no per-value overhead, which makes serialization both fast and small. It also makes Avro easy to use from dynamic and scripting languages, because the data, together with its schema, is fully self-describing.

When Avro data is stored in a file, its schema is stored with it, so any program can process the file later. If the program reading the data expects a different schema, the difference can be resolved because both schemas are present.

When we use Apache Avro in RPC, the client and the server exchange schemas in a connection handshake. Because each side then has the other's full schema, correspondences between same-named fields, extra fields, missing fields, and so on can be resolved.
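The resolution between a writer's schema and a reader's schema can be sketched as follows. This is a simplified, from-scratch illustration in Python, not the Avro library's implementation: it matches fields by name and falls back to the reader's default, and it ignores type promotion and aliases. The `resolve` function and both schemas are hypothetical.

```python
def resolve(record, writer_schema, reader_schema):
    """Project a record written with writer_schema onto reader_schema by field name."""
    written = {field["name"] for field in writer_schema["fields"]}
    resolved = {}
    for field in reader_schema["fields"]:
        name = field["name"]
        if name in written:
            # Same-named field: take the writer's value.
            resolved[name] = record[name]
        elif "default" in field:
            # Field missing from the writer: use the reader's default.
            resolved[name] = field["default"]
        else:
            raise ValueError(f"no value or default for field {name!r}")
    # Fields that only the writer knows about are simply dropped.
    return resolved


# Hypothetical schemas: the reader dropped "age" and added "email".
writer_schema = {"type": "record", "name": "User", "fields": [
    {"name": "name", "type": "string"},
    {"name": "age", "type": "int"},
]}
reader_schema = {"type": "record", "name": "User", "fields": [
    {"name": "name", "type": "string"},
    {"name": "email", "type": "string", "default": "unknown"},
]}

resolved = resolve({"name": "Asha", "age": 30}, writer_schema, reader_schema)
print(resolved)
```

Because both schemas are available at read time, the reader never has to guess which fields were actually written.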

Apache Avro schemas are defined in JSON. This makes Avro easier to implement in languages that already have JSON libraries.
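As an illustration, here is what a small record schema might look like, loaded with Python's standard JSON library. The `Employee` record, its namespace, and its fields are made-up names for this example.

```python
import json

# A hypothetical Avro record schema, written as plain JSON.
schema_json = """
{
  "type": "record",
  "name": "Employee",
  "namespace": "example.avro",
  "fields": [
    {"name": "name", "type": "string"},
    {"name": "age", "type": "int"},
    {"name": "email", "type": ["null", "string"], "default": null}
  ]
}
"""

# Any JSON library can read the schema; no Avro-specific tooling is needed here.
schema = json.loads(schema_json)
field_names = [field["name"] for field in schema["fields"]]
print(schema["name"], field_names)
```

The `email` field uses a union of `null` and `string`, a common way to declare an optional field.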

Features of Apache Avro

It provides rich data structures. For example, we can create a record that contains an array, an enumerated type, and a sub-record.
Avro serializes data into a compact, fast, binary data format.
Apache Avro has implementations in many languages, such as C, C++, C#, Java, Python, and Ruby.
Apache Avro creates a binary structured format that is both compressible and splittable, so it can be used efficiently as input to Hadoop MapReduce jobs.
Avro schemas are defined in JSON, which makes Avro easier to implement in languages that already have JSON libraries.
Apache Avro creates a self-describing file called an Avro Data File, which stores the data along with its schema.
We do not have to compress Avro files manually. Avro data files support built-in compression codecs, such as deflate, that are applied to blocks of data.
Avro is also useful in Remote Procedure Calls (RPCs). During RPC, the client and the server exchange schemas in the connection handshake.

Avro Data Types

Apache Avro supports two categories of data types. They are:

1. Primitive type: Apache Avro supports eight primitive types. They are null, boolean, int, long, float, double, bytes, and string.

2. Complex type: Apache Avro supports 6 kinds of complex types. They are arrays, enums, fixed, maps, records, and unions.
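The sketch below shows one hypothetical record schema that uses every complex type, plus a small helper that classifies each field. The `Order` schema and the `kind` helper are assumptions made for this illustration. The helper is deliberately simplified: it does not handle references to previously defined named types.

```python
import json

PRIMITIVES = {"null", "boolean", "int", "long", "float", "double", "bytes", "string"}
COMPLEX = {"record", "enum", "array", "map", "fixed"}  # unions are written as JSON arrays

# Hypothetical schema: a record whose fields cover the other complex types.
schema = json.loads("""
{
  "type": "record",
  "name": "Order",
  "fields": [
    {"name": "id", "type": "long"},
    {"name": "tags", "type": {"type": "array", "items": "string"}},
    {"name": "status", "type": {"type": "enum", "name": "Status", "symbols": ["NEW", "SHIPPED"]}},
    {"name": "checksum", "type": {"type": "fixed", "name": "MD5", "size": 16}},
    {"name": "attributes", "type": {"type": "map", "values": "string"}},
    {"name": "note", "type": ["null", "string"], "default": null}
  ]
}
""")


def kind(schema):
    """Classify a schema as 'primitive', 'union', or one of the other complex types."""
    if isinstance(schema, list):
        return "union"
    type_name = schema["type"] if isinstance(schema, dict) else schema
    if not isinstance(type_name, str):
        return kind(type_name)
    if type_name in PRIMITIVES:
        return "primitive"
    if type_name in COMPLEX:
        return type_name
    raise ValueError(f"unsupported type in this sketch: {type_name!r}")


kinds = {field["name"]: kind(field["type"]) for field in schema["fields"]}
print(kind(schema), kinds)
```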

General Working of Avro

To use Avro, we follow the workflow below.

1. We have to create Avro schemas as per our data.

2. Read the Avro schemas into our program. We can do this in two ways:

a. By generating a class corresponding to the Avro schema: we compile the schema using Avro, which generates a class that corresponds to the schema.
b. By using the parser library: we read the schema directly at run time, without generating any code.

3. Serialize the data by using the serialization API provided by Apache Avro. In Java, this API is in the package org.apache.avro.specific when we use generated classes, and in org.apache.avro.generic when we use the parser approach.

4. Finally, deserialize the data by using the deserialization API provided by Apache Avro. It lives in the same packages: org.apache.avro.specific for generated classes and org.apache.avro.generic for the parser approach.
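The two ways of reading a schema in step 2 can be compared with a rough Python analogy. This is not Avro's schema compiler or parser: it uses only the standard library, and the `Employee` schema and the `AVRO_TO_PYTHON` mapping are assumptions made for this sketch. The generic approach treats a record as a plain dictionary keyed by field name, while the generated-class approach gives each field a named, typed attribute.

```python
import json
from dataclasses import asdict, make_dataclass

# Assumed mapping from a few Avro primitive type names to Python types.
AVRO_TO_PYTHON = {"string": str, "int": int, "long": int,
                  "float": float, "double": float, "boolean": bool, "bytes": bytes}

schema = json.loads("""
{"type": "record", "name": "Employee",
 "fields": [{"name": "name", "type": "string"},
            {"name": "age", "type": "int"}]}
""")

# Way (b), generic: the parsed schema is used at run time, records are dicts.
generic_record = {"name": "Asha", "age": 30}

# Way (a), specific: a class is generated from the schema ahead of use.
Employee = make_dataclass(
    schema["name"],
    [(field["name"], AVRO_TO_PYTHON[field["type"]]) for field in schema["fields"]],
)
specific_record = Employee(name="Asha", age=30)

# Both forms carry the same data; only the programming model differs.
print(asdict(specific_record) == generic_record)
```

Generated classes give compile-time checks and readable accessors in statically typed languages, while the generic approach suits tools that must handle schemas they have never seen before.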

Comparison of Avro with Different Systems

Besides Avro, various other data serialization mechanisms are used with Apache Hadoop, such as Protocol Buffers, Sequence Files, and Thrift. Apache Avro provides functionality similar to that of Thrift, Protocol Buffers, and so on.

However, Avro differs from these systems in the following fundamental aspects.

Dynamic typing: In Apache Avro, the data is always accompanied by a schema, which allows full processing of that data without code generation, static data types, and so on. This aids the construction of generic data-processing systems and languages.

Untagged data: Because the schema is present when Avro data is read, much less type information needs to be encoded with the data. This results in a smaller serialization size.

No manually-assigned field IDs: When data is processed, both the old schema and the new schema are always present, so differences can be resolved by using the field names.
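The "untagged data" point above can be illustrated with a from-scratch sketch of Avro's binary encoding for a small subset of types. In that encoding, int and long values use variable-length zig-zag coding, a string is a length followed by UTF-8 bytes, and a record is just its fields' encodings concatenated in schema order. The function names are hypothetical, and this is not the Avro library's API. The sample record mirrors the worked example in the Avro specification.

```python
def encode_long(n):
    """Zig-zag, then variable-length encode a signed integer (64-bit range)."""
    z = (n << 1) ^ (n >> 63)
    out = bytearray()
    while z >= 0x80:
        out.append((z & 0x7F) | 0x80)  # low 7 bits, continuation bit set
        z >>= 7
    out.append(z)
    return bytes(out)


def decode_long(buf, pos):
    """Decode one zig-zag varint starting at pos; return (value, new position)."""
    shift = 0
    z = 0
    while True:
        byte = buf[pos]
        pos += 1
        z |= (byte & 0x7F) << shift
        if not byte & 0x80:
            break
        shift += 7
    return (z >> 1) ^ -(z & 1), pos


def encode_record(record, schema):
    """Concatenate field values in schema order; no names or tags are written."""
    out = bytearray()
    for field in schema["fields"]:
        value = record[field["name"]]
        if field["type"] in ("int", "long"):
            out += encode_long(value)
        elif field["type"] == "string":
            data = value.encode("utf-8")
            out += encode_long(len(data)) + data
        else:
            raise NotImplementedError(field["type"])
    return bytes(out)


def decode_record(buf, schema):
    """The schema alone tells the reader how to interpret the bytes."""
    pos = 0
    record = {}
    for field in schema["fields"]:
        if field["type"] in ("int", "long"):
            record[field["name"]], pos = decode_long(buf, pos)
        elif field["type"] == "string":
            length, pos = decode_long(buf, pos)
            record[field["name"]] = buf[pos:pos + length].decode("utf-8")
            pos += length
        else:
            raise NotImplementedError(field["type"])
    return record


schema = {"type": "record", "name": "test", "fields": [
    {"name": "a", "type": "long"},
    {"name": "b", "type": "string"},
]}
record = {"a": 27, "b": "foo"}

encoded = encode_record(record, schema)
print(encoded.hex())  # 3606666f6f
print(decode_record(encoded, schema) == record)
```

The five output bytes contain only the values. Field names and types live in the schema, which is why a reader cannot interpret Avro data without it.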

Advantages of Apache Avro

1. Avro encodes data in a binary format, so the size stored and transmitted is small.

2. The schema is not repeated with every record; it is stored once per file or exchanged once in the RPC handshake, so each record is smaller.

3. It supports schema evolution. When we change a schema, readers that use the old or the new schema can usually keep working, so little or no functionality breaks.

4. It provides implementation support in many languages such as Java, C#, C++, etc.

5. Apache Avro creates a binary structured format that is both compressible and splittable.

Disadvantages of Apache Avro

1. For the C# implementation of Avro, we need .NET 4.5 in order to make the best use of it.

2. Serialization can potentially be slower.

3. A schema is required for both reading and writing data.

Summary

In short, Apache Avro is a data serialization system developed by Doug Cutting. It is mainly used for data serialization in Hadoop. It provides rich data structures and relies heavily on schemas.

Apache Avro has implementations in many languages, such as C, C++, C#, Java, Python, and Ruby. Avro schemas are defined in JSON, which aids implementation in languages that already have JSON libraries.

This Avro Tutorial article has explained the core concepts of Avro. I hope you enjoyed learning!