Apache Avro Fundamentals Cheatsheet
Personal cheatsheet for Apache Avro. Covers schema design, type system, doc fields, JSON-to-Avro conversion, schema evolution, Java serialization, Confluent Schema Registry integration, compatibility modes, and Spark Avro.
1. What Is Avro
Apache Avro is a binary serialization format that requires a schema to read or write data. The schema is always present, either embedded in the file or looked up from a registry.
| Avro | JSON | Protobuf | Parquet | |
|---|---|---|---|---|
| Format | Binary | Text | Binary | Binary columnar |
| Schema required | yes | no | yes (.proto) | yes (embedded) |
| Schema evolution | yes (built-in) | n/a | yes | limited |
| Row vs columnar | Row | Row | Row | Columnar |
| Kafka use | yes (primary) | yes | yes | no (file format) |
| Readable without schema | no | yes | no | no |
| Self-describing files | yes (.avro embeds schema) | yes | no | yes |
2. Schema Design
Avro schemas are written in JSON. A schema describes a single type.
2.1. Primitive Types
| Type | Java equivalent | Notes |
|---|---|---|
"null" | null | Only value is null |
"boolean" | boolean | |
"int" | int | 32-bit signed |
"long" | long | 64-bit signed |
"float" | float | 32-bit IEEE 754 |
"double" | double | 64-bit IEEE 754 |
"bytes" | ByteBuffer | Sequence of 8-bit bytes |
"string" | String | Unicode char sequence |
2.2. Record
The most common top-level type. Every field has a name and type.
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
{
"type": "record",
"name": "User",
"namespace": "com.example.events",
"doc": "Represents a registered user.",
"fields": [
{
"name": "id",
"type": "long",
"doc": "Unique user identifier."
},
{
"name": "name",
"type": "string",
"doc": "Full display name of the user."
},
{
"name": "email",
"type": ["null", "string"],
"default": null,
"doc": "Email address. Null if not provided."
},
{
"name": "role",
"type": {
"type": "enum",
"name": "UserRole",
"symbols": ["ADMIN", "USER", "GUEST"],
"doc": "Access role of the user."
},
"default": "USER"
},
{
"name": "tags",
"type": {
"type": "array",
"items": "string"
},
"default": [],
"doc": "Free-form tags associated with the user."
},
{
"name": "metadata",
"type": {
"type": "map",
"values": "string"
},
"default": {},
"doc": "Key-value metadata pairs."
},
{
"name": "address",
"type": {
"type": "record",
"name": "Address",
"fields": [
{"name": "city", "type": "string"},
{"name": "country", "type": "string"}
]
},
"doc": "Physical address of the user."
}
]
}
2.3. Union (Nullable Fields)
Unions are expressed as a JSON array of types. The first type in the union is the default type.
1
2
3
"type": ["null", "string"] // nullable string - default must be null
"type": ["null", "long"] // nullable long
"type": ["string", "int"] // string or int - default is string
Convention: Always put "null" first for nullable fields so you can set "default": null.
2.4. Logical Types
Built on top of primitives - add semantic meaning:
1
2
3
4
5
6
{"type": "long", "logicalType": "timestamp-millis"} // epoch ms
{"type": "long", "logicalType": "timestamp-micros"} // epoch us
{"type": "int", "logicalType": "date"} // days since epoch
{"type": "int", "logicalType": "time-millis"} // ms since midnight
{"type": "bytes", "logicalType": "decimal", "precision": 10, "scale": 2} // BigDecimal
{"type": "string","logicalType": "uuid"}
2.5. doc Fields
Every schema, record, and field can have a "doc" string. It is stored in the schema but ignored during serialization. Use it to document business meaning.
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
{
"type": "record",
"name": "OrderEvent",
"doc": "Emitted whenever an order transitions to a new status. Consumed by the billing and fulfillment services.",
"fields": [
{
"name": "orderId",
"type": "string",
"doc": "UUID of the order. Maps to orders.id in the orders database."
},
{
"name": "totalAmountCents",
"type": "long",
"doc": "Total order value in cents (USD). Divide by 100 for display."
}
]
}
doc fields are particularly useful for data catalog tools like Collibra or DataHub: you can export Avro schemas and import the doc strings as business glossary descriptions, data lineage annotations, or data dictionary entries. This makes Avro the source of truth for both technical schema and business metadata.
3. JSON to Avro Schema Conversion
Common patterns when converting a JSON payload to an Avro schema:
| JSON | Avro |
|---|---|
"name": "Ryo" (string) | {"name": "name", "type": "string"} |
"age": 25 (integer) | {"name": "age", "type": "int"} |
"price": 9.99 (float) | {"name": "price", "type": "double"} |
"active": true (boolean) | {"name": "active", "type": "boolean"} |
"note": null or sometimes null | {"name": "note", "type": ["null", "string"], "default": null} |
"tags": ["a", "b"] (array) | {"name": "tags", "type": {"type": "array", "items": "string"}, "default": []} |
"meta": {"k": "v"} (object) | {"name": "meta", "type": {"type": "map", "values": "string"}, "default": {}} |
| Nested object | Nested "type": "record" |
| Enum-like string field | {"type": "enum", "symbols": [...]} if values are fixed |
| Monetary value | "long" (store as cents) or "bytes" with decimal logicalType |
| Timestamp | "long" with timestamp-millis logicalType |
| UUID | "string" with uuid logicalType |
Example:
1
2
3
4
5
6
7
8
9
// JSON payload
{
"userId": "abc-123",
"amount": 49.99,
"currency": "USD",
"createdAt": "2024-01-15T10:30:00Z",
"tags": ["premium", "promo"],
"refundReason": null
}
1
2
3
4
5
6
7
8
9
10
11
12
13
14
// Avro schema
{
"type": "record",
"name": "PaymentEvent",
"namespace": "com.example.payments",
"fields": [
{"name": "userId", "type": "string"},
{"name": "amountCents", "type": "long", "doc": "Amount in cents. Original JSON had float 'amount'."},
{"name": "currency", "type": "string"},
{"name": "createdAt", "type": {"type": "long", "logicalType": "timestamp-millis"}},
{"name": "tags", "type": {"type": "array", "items": "string"}, "default": []},
{"name": "refundReason", "type": ["null", "string"], "default": null}
]
}
4. Schema Evolution
Avro supports schema evolution - the schema used to write data (writer schema) and the schema used to read it (reader schema) can differ.
4.1. What Is and Isn’t Allowed
Safe changes (backward compatible - new schema can read old data):
- Add a field with a default value
- Remove a field that had a default value
- Change a field type to a wider type (
int->long,float->double)
Forward compatible (old schema can read new data):
- Add a field that the old reader can ignore (old reader skips unknown fields)
- Remove a field - old reader uses its default for the missing field
Breaking changes (avoid):
- Add a field without a default (old data has no value for it)
- Remove a field that had no default (reader can’t fill in the missing value)
- Rename a field without an alias
- Change a field to an incompatible type (
string->int) - Change enum symbols
4.2. Aliases for Renames
Avro supports field aliases to handle renames without breaking compatibility:
1
2
3
4
5
{
"name": "fullName",
"aliases": ["name"], // previously called "name"
"type": "string"
}
5. Java: GenericRecord
Use when you don’t want to generate Java classes. Schema is loaded at runtime.
1
2
3
4
5
<dependency>
<groupId>org.apache.avro</groupId>
<artifactId>avro</artifactId>
<version>1.11.3</version>
</dependency>
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
// Parse schema
Schema schema = new Schema.Parser().parse(new File("user.avsc"));
// Write
GenericRecord user = new GenericData.Record(schema);
user.put("id", 1L);
user.put("name", "Ryo");
user.put("email", null);
// Serialize to bytes
ByteArrayOutputStream out = new ByteArrayOutputStream();
DatumWriter<GenericRecord> writer = new GenericDatumWriter<>(schema);
Encoder encoder = EncoderFactory.get().binaryEncoder(out, null);
writer.write(user, encoder);
encoder.flush();
byte[] bytes = out.toByteArray();
// Deserialize from bytes
DatumReader<GenericRecord> reader = new GenericDatumReader<>(schema);
Decoder decoder = DecoderFactory.get().binaryDecoder(bytes, null);
GenericRecord deserialized = reader.read(null, decoder);
System.out.println(deserialized.get("name")); // "Ryo"
// Write to .avro file (includes schema in file header)
DataFileWriter<GenericRecord> fileWriter = new DataFileWriter<>(writer);
fileWriter.create(schema, new File("users.avro"));
fileWriter.append(user);
fileWriter.close();
// Read .avro file
DataFileReader<GenericRecord> fileReader =
new DataFileReader<>(new File("users.avro"), new GenericDatumReader<>());
for (GenericRecord r : fileReader) {
System.out.println(r);
}
6. Java: SpecificRecord (Code Generation)
Generate Java classes from .avsc schema files using the Avro Maven plugin. The generated class implements SpecificRecord and has typed getters/setters.
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
<!-- pom.xml -->
<plugin>
<groupId>org.apache.avro</groupId>
<artifactId>avro-maven-plugin</artifactId>
<version>1.11.3</version>
<executions>
<execution>
<phase>generate-sources</phase>
<goals><goal>schema</goal></goals>
<configuration>
<sourceDirectory>${project.basedir}/src/main/avro</sourceDirectory>
<outputDirectory>${project.build.directory}/generated-sources/avro</outputDirectory>
</configuration>
</execution>
</executions>
</plugin>
Place schema files in src/main/avro/. Run mvn generate-sources - the plugin generates a Java class per record.
1
2
3
4
5
6
7
8
9
10
11
12
13
// Generated class: com.example.events.User
User user = User.newBuilder()
.setId(1L)
.setName("Ryo")
.setEmail(null)
.setRole(UserRole.USER)
.setTags(List.of("premium"))
.setMetadata(Map.of("source", "web"))
.build();
// Serialize
DatumWriter<User> writer = new SpecificDatumWriter<>(User.class);
// ... same encode/decode pattern as GenericRecord
SpecificRecord is preferred for production code - compile-time type safety and IDE support. GenericRecord is useful for generic ETL tools or when the schema is not known at compile time.
7. Avro with Confluent Schema Registry
Add the Confluent Avro serializer dependency, then point producers and consumers at the Schema Registry URL. The serializer handles schema registration and the 5-byte wire format header automatically.
1
2
3
4
5
<dependency>
<groupId>io.confluent</groupId>
<artifactId>kafka-avro-serializer</artifactId>
<version>7.6.0</version>
</dependency>
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
// Producer
props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class);
props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, KafkaAvroSerializer.class);
props.put("schema.registry.url", "http://schema-registry:8081");
// Optional: auto-register schema on first produce
props.put("auto.register.schemas", true);
KafkaProducer<String, User> producer = new KafkaProducer<>(props);
producer.send(new ProducerRecord<>("users", "key1", userSpecificRecord));
// Consumer
props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class);
props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, KafkaAvroDeserializer.class);
props.put("schema.registry.url", "http://schema-registry:8081");
props.put("specific.avro.reader", true); // return SpecificRecord instead of GenericRecord
KafkaConsumer<String, User> consumer = new KafkaConsumer<>(props);
The serializer automatically:
- Registers the schema with Schema Registry (if
auto.register.schemas=true) - Prepends the 5-byte magic header (0x00 + schema ID) to each message
The deserializer automatically:
- Reads the schema ID from the message header
- Fetches the writer schema from Schema Registry
- Resolves writer schema vs reader schema for evolution
8. Schema Registry REST API
The Schema Registry exposes a REST API for registering schemas, checking compatibility, and managing subjects.
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
REGISTRY=http://schema-registry:8081
# List all subjects
curl $REGISTRY/subjects
# Register a schema
curl -X POST $REGISTRY/subjects/users-value/versions \
-H "Content-Type: application/vnd.schemaregistry.v1+json" \
-d '{"schema": "{\"type\":\"record\",\"name\":\"User\",\"fields\":[{\"name\":\"id\",\"type\":\"long\"},{\"name\":\"name\",\"type\":\"string\"}]}"}'
# Returns: {"id": 1}
# Get latest schema for a subject
curl $REGISTRY/subjects/users-value/versions/latest
# Get schema by global ID
curl $REGISTRY/schemas/ids/1
# Check compatibility before registering
curl -X POST $REGISTRY/compatibility/subjects/users-value/versions/latest \
-H "Content-Type: application/vnd.schemaregistry.v1+json" \
-d '{"schema": "..."}'
# Returns: {"is_compatible": true}
# Update compatibility mode for a subject
curl -X PUT $REGISTRY/config/users-value \
-H "Content-Type: application/vnd.schemaregistry.v1+json" \
-d '{"compatibility": "FULL"}'
# Get global default compatibility mode
curl $REGISTRY/config
# Delete a subject (all versions)
curl -X DELETE $REGISTRY/subjects/users-value
9. Avro with Spark
Add spark-avro as a dependency to read and write .avro files or use Avro serialization in Spark Structured Streaming.
1
2
3
4
5
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-avro_2.12</artifactId>
<version>3.5.0</version>
</dependency>
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
// Read .avro files
Dataset<Row> df = spark.read()
.format("avro")
.load("hdfs:///data/users/*.avro");
// Write .avro files
df.write()
.format("avro")
.mode("overwrite")
.save("hdfs:///output/users/");
// Write with a specific schema (useful for evolution control)
df.write()
.format("avro")
.option("avroSchema", schemaJson) // JSON string of the Avro schema
.save("hdfs:///output/users/");
9.1. from_avro / to_avro for Kafka + Spark Streaming
from_avro deserializes a binary Avro column into a struct; to_avro does the reverse. Pass the schema as a JSON string, or use the Confluent variant with a registry URL.
1
2
3
4
5
6
7
8
9
10
11
12
import org.apache.spark.sql.avro.functions.*;
// Kafka message value is raw Avro bytes (with Schema Registry 5-byte header stripped)
String userSchemaJson = new String(Files.readAllBytes(Path.of("user.avsc")));
Dataset<Row> parsed = kafkaStream
.select(from_avro(col("value"), userSchemaJson).as("user"))
.select("user.*");
// Serialize back to Avro bytes for writing to Kafka
Dataset<Row> serialized = df
.select(to_avro(struct(col("id"), col("name"))).as("value"));
For Schema Registry integration with Spark Streaming, use the Confluent from_avro/to_avro variant that accepts a registry URL and subject instead of an inline schema string:
1
2
3
4
5
6
7
8
9
10
// Confluent's spark-avro with Schema Registry support
import io.confluent.spark.sql.SchemaRegistryAvroData;
Map<String, String> opts = Map.of(
"schema.registry.url", "http://schema-registry:8081",
"subject", "users-value"
);
Dataset<Row> parsed = kafkaStream
.select(from_avro(col("value"), "users-value", opts).as("user"))
.select("user.*");
10. Data Catalogs & Collibra
doc fields in Avro schemas serve as the primary place to embed business metadata alongside technical schema definitions.
A typical workflow for integration with a data catalog like Collibra or DataHub:
- Engineers define Avro schemas with detailed
docfields (field descriptions, business terms, data owner, PII flags) - A CI job runs on schema changes and exports the schemas from Schema Registry via the REST API
- The exported schemas are parsed and the
docstrings are pushed to the catalog’s API as business attribute descriptions or glossary mappings - The catalog links the technical schema (from Avro) to the business glossary (curated in Collibra)
This makes the Avro schema the single source of truth for both producers/consumers and the data governance layer, avoiding documentation drift.
Example doc conventions to support this:
1
2
3
4
5
{
"name": "customerId",
"type": "string",
"doc": "PII:true | Owner:CRM-team | GlossaryTerm:Customer.ID | Description: UUID of the customer as stored in the CRM system."
}
Comments powered by Disqus.