pyspark.sql.avro.functions.from_avro#

pyspark.sql.avro.functions.from_avro(data, jsonFormatSchema, options=None)[source]#

Converts a binary column of Avro format into its corresponding catalyst value. The specified schema must match the read data, otherwise the behavior is undefined: it may fail or return arbitrary result. To deserialize the data with a compatible and evolved schema, the expected Avro schema can be set via the option avroSchema.

New in version 3.0.0.

Changed in version 3.5.0: Supports Spark Connect.

Parameters

dataColumn or str: the binary column.
jsonFormatSchemastr: the avro schema in JSON string format.
optionsdict, optional: options to control how the Avro record is parsed.

Notes

Avro is built-in but external data source module since Spark 2.4. Please deploy the application as per the deployment section of “Apache Avro Data Source Guide”.

Examples

>>> from pyspark.sql import Row
>>> from pyspark.sql.avro.functions import from_avro, to_avro
>>> data = [(1, Row(age=2, name='Alice'))]
>>> df = spark.createDataFrame(data, ("key", "value"))
>>> avroDf = df.select(to_avro(df.value).alias("avro"))
>>> avroDf.collect()
[Row(avro=bytearray(b'\x00\x00\x04\x00\nAlice'))]

>>> jsonFormatSchema = '''{"type":"record","name":"topLevelRecord","fields":
...     [{"name":"avro","type":[{"type":"record","name":"value","namespace":"topLevelRecord",
...     "fields":[{"name":"age","type":["long","null"]},
...     {"name":"name","type":["string","null"]}]},"null"]}]}'''
>>> avroDf.select(from_avro(avroDf.avro, jsonFormatSchema).alias("value")).collect()
[Row(value=Row(avro=Row(age=2, name='Alice')))]