How to read an Avro file in PySpark and extract values?
How can I read twitter.avro files in PySpark and extract values from them?
rdd = sc.textFile("twitter.avsc")
works fine (the .avsc schema file is plain JSON text).
But when I do
rdd1 = sc.textFile("twitter.avro")
rdd1.collect()
I get the following output:
['Obj\x01\x02\x16avro.schema\x04{"type": "record", "name": "episodes", "namespace": "testing.hive.avro.serde", "fields": [{"name": "title", "type": "string", "doc": "episode title"}, {"name": "air_date", "type": "string", "doc": "start date"}, {"name": "doctor", "type": "int", "doc": "the main actor playing the Doctor in the episode"}]}\x00kR\x03LS\x17m|]Z^{0\x10\x04"The Eleventh Hour\x183 April 2010\x16"The Doctor\'s Wife\x1614 May 2011\x16&Horror of Fang Rock 3 September 1977\x08$An Unearthly Child 23 November 1963\x02*The Mysterious Planet 6 September 1986\x0c\x08Rose\x1a26 March 2005\x12.The Power of the Daleks\x1e5 November 1966\x04\x14Castrolava\x1c4 January 1982', 'kR\x03LS\x17m|]Z^{0']
Is there a Python library to read this format?
You have to use an Avro-specific InputFormat for Avro files: textFile just splits raw bytes on newlines, so it cannot decode the binary Avro container, which is why you see the raw file contents.
Unfortunately I don't use Python myself, so I can only point you to a solution. Take a look at this example: https://github.com/apache/spark/blob/master/examples/src/main/python/avro_inputformat.py
The most interesting part of it is this:
avro_rdd = sc.newAPIHadoopFile(
    path,
    "org.apache.avro.mapreduce.AvroKeyInputFormat",  # Avro-aware input format
    "org.apache.avro.mapred.AvroKey",                # key class (the Avro record)
    "org.apache.hadoop.io.NullWritable",             # value class (unused, data is in the key)
    keyConverter="org.apache.spark.examples.pythonconverters.AvroWrapperToJavaConverter",
    conf=conf)  # conf is optional; the example uses it to pass a reader schema
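Each element of the resulting RDD is a (record, None) pair, where the record arrives as a plain Python dict keyed by field name. As a minimal sketch of extracting values (assuming the "episodes" schema shown in the question; "twitter.avro" and the field names are taken straight from the post, not verified against your data):

from pyspark import SparkContext

sc = SparkContext(appName="readAvro")

# Read the Avro container through the Hadoop input format.
avro_rdd = sc.newAPIHadoopFile(
    "twitter.avro",
    "org.apache.avro.mapreduce.AvroKeyInputFormat",
    "org.apache.avro.mapred.AvroKey",
    "org.apache.hadoop.io.NullWritable",
    keyConverter="org.apache.spark.examples.pythonconverters.AvroWrapperToJavaConverter")

# Each element is (record_dict, None); keep the dict and read fields by name.
records = avro_rdd.map(lambda pair: pair[0])
print(records.map(lambda r: (r["title"], r["air_date"])).collect())

Note that the AvroWrapperToJavaConverter class lives in the Spark examples jar, so it has to be on the driver classpath, e.g. ./bin/spark-submit --driver-class-path /path/to/spark-examples.jar your_script.py (that is how the linked example's docstring says to run it).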