How to read an Avro file in PySpark and extract values?

How can I read a twitter.avro file in PySpark and extract values from it?

rdd=sc.textFile("twitter.avsc")

works fine (the .avsc schema file is plain JSON text, so textFile can read it).

But when I do

rdd1=sc.textFile("twitter.avro")
rdd1.collect()

I get the following output:

['Obj\x01\x02\x16avro.schema\x04{"type": "record", "name": "episodes", "namespace": "testing.hive.avro.serde", "fields": [{"name": "title", "type": "string", "doc": "episode title"}, {"name": "air_date", "type": "string", "doc": "initial date"}, {"name": "doctor", "type": "int", "doc": "main actor playing the Doctor in episode"}]}\x00kR\x03LS\x17m|]Z^{0\x10\x04"The Eleventh Hour\x183 April 2010\x16"The Doctor\'s Wife\x1614 May 2011\x16&Horror of Fang Rock 3 September 1977\x08$An Unearthly Child 23 November 1963\x02*The Mysterious Planet 6 September 1986\x0c\x08Rose\x1a26 March 2005\x12.The Power of the Daleks\x1e5 November 1966\x04\x14Castrolava\x1c4 January 1982', 'kR\x03LS\x17m|]Z^{0']

Is there a Python library to read this format?



1 answer


You have to read Avro files through an Avro FileInputFormat. sc.textFile treats the file as plain text, which is why you get the raw container bytes shown above.

Unfortunately I don't use Python myself, so I can only point you to a solution. Take a look at this example: https://github.com/apache/spark/blob/master/examples/src/main/python/avro_inputformat.py



The most interesting part of it is this:

# path points at the .avro file; conf is either None or, to apply a
# specific reader schema, {"avro.schema.input.key": <schema JSON string>}.
avro_rdd = sc.newAPIHadoopFile(
    path,
    "org.apache.avro.mapreduce.AvroKeyInputFormat",   # Avro input format
    "org.apache.avro.mapred.AvroKey",                 # key class
    "org.apache.hadoop.io.NullWritable",              # value class
    keyConverter="org.apache.spark.examples.pythonconverters.AvroWrapperToJavaConverter",
    conf=conf)
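
Each element of avro_rdd then arrives on the Python side as a (key, value) pair, where the key is a dict holding the deserialized record. A minimal sketch of extracting values from it, assuming the episodes schema visible in your output (the field name "title" comes from that schema, so adjust it to your data):

# Each element is (record_dict, None); keep only the record dicts.
records = avro_rdd.map(lambda kv: kv[0])

# e.g. [{'title': 'The Eleventh Hour', 'air_date': '3 April 2010', ...}, ...]
print(records.collect())

# Pull a single field out of every record.
titles = records.map(lambda record: record["title"]).collect()

Note that the AvroWrapperToJavaConverter class ships with the Spark examples, not Spark core, so the examples jar has to be on the driver classpath, e.g. spark-submit --driver-class-path /path/to/spark-examples.jar avro_inputformat.py twitter.avro (the jar path here is an assumption; use the one from your Spark distribution).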

