How to read an Avro file in PySpark and extract values?

How can I read a twitter.avro file in PySpark and extract values from it?

rdd=sc.textFile("twitter.avsc")

works fine (the .avsc schema file is plain JSON text, so textFile can read it).

But when I do

rdd1=sc.textFile("twitter.avro")
rdd1.collect()

I get the following output:

['Obj\x01\x02\x16avro.schema\x04{"type": "record", "name": "episodes", "namespace": "testing.hive.avro.serde", "fields": [{"name": "title", "type": "string", "doc": "episode title"}, {"name": "air_date", "type": "string", "doc": "initial date"}, {"name": "doctor", "type": "int", "doc": "main actor playing the Doctor in episode"}]}\x00kR\x03LS\x17m|]Z^{0\x10\x04"The Eleventh Hour\x183 April 2010\x16"The Doctor\'s Wife\x1614 May 2011\x16&Horror of Fang Rock 3 September 1977\x08$An Unearthly Child 23 November 1963\x02*The Mysterious Planet 6 September 1986\x0c\x08Rose\x1a26 March 2005\x12.The Power of the Daleks\x1e5 November 1966\x04\x14Castrolava\x1c4 January 1982', 'kR\x03LS\x17m|]Z^{0']

Is there a Python library to read this format?



1 answer


You have to read Avro files through an Avro FileInputFormat. sc.textFile treats the file as plain text, which is why you get the raw container bytes shown above.

Unfortunately I don't use Python myself, so I can only point you to a solution. Take a look at this example: https://github.com/apache/spark/blob/master/examples/src/main/python/avro_inputformat.py



The most interesting part of it is this:

# path points at the .avro file; conf is either None or, to apply a
# specific reader schema, {"avro.schema.input.key": <schema JSON string>}.
avro_rdd = sc.newAPIHadoopFile(
    path,
    "org.apache.avro.mapreduce.AvroKeyInputFormat",   # Avro input format
    "org.apache.avro.mapred.AvroKey",                 # key class
    "org.apache.hadoop.io.NullWritable",              # value class
    keyConverter="org.apache.spark.examples.pythonconverters.AvroWrapperToJavaConverter",
    conf=conf)
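
Each element of avro_rdd then arrives on the Python side as a (key, value) pair, where the key is a dict holding the deserialized record. A minimal sketch of extracting values from it, assuming the episodes schema visible in your output (the field name "title" comes from that schema, so adjust it to your data):

# Each element is (record_dict, None); keep only the record dicts.
records = avro_rdd.map(lambda kv: kv[0])

# e.g. [{'title': 'The Eleventh Hour', 'air_date': '3 April 2010', ...}, ...]
print(records.collect())

# Pull a single field out of every record.
titles = records.map(lambda record: record["title"]).collect()

Note that the AvroWrapperToJavaConverter class ships with the Spark examples, not Spark core, so the examples jar has to be on the driver classpath, e.g. spark-submit --driver-class-path /path/to/spark-examples.jar avro_inputformat.py twitter.avro (the jar path here is an assumption; use the one from your Spark distribution).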

