Problem reading a gzip file directly in PySpark

I can read a tar.gz file without errors, but I see a lot of gibberish in the final output. This is what I used (in PySpark):

from operator import add

# Word count over the lines Spark reads from the archive
lines = sc.textFile("abc.tar.gz")
count = lines.flatMap(lambda x: x.split(' ')).map(lambda x: (x, 1)).reduceByKey(add)
print(count.collect())

      

My output contains a lot of \x00\x00 sequences. Any ideas?
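A likely explanation, assuming abc.tar.gz is a genuine tar archive: textFile undoes the gzip compression but does not unpack the tar container, so the tar headers and their NUL padding end up in the lines. The sketch below reproduces that without Spark, using only the standard library. The helper names tar_gz_text_lines and make_sample_tar_gz, the file name sample.txt, and the UTF-8 encoding are all assumptions made for illustration.

```python
import gzip
import io
import tarfile


def tar_gz_text_lines(raw_bytes, encoding="utf-8"):
    """Return the text lines of every regular file inside a .tar.gz payload."""
    lines = []
    with tarfile.open(fileobj=io.BytesIO(raw_bytes), mode="r:gz") as archive:
        for member in archive:
            if not member.isfile():
                continue  # skip directories, links and other non-file entries
            data = archive.extractfile(member).read()
            lines.extend(data.decode(encoding).splitlines())
    return lines


def make_sample_tar_gz():
    """Build a tiny in-memory .tar.gz holding one hypothetical file, sample.txt."""
    payload = b"hello world\nhello spark\n"
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w:gz") as archive:
        info = tarfile.TarInfo(name="sample.txt")
        info.size = len(payload)
        archive.addfile(info, io.BytesIO(payload))
    return buffer.getvalue()


sample = make_sample_tar_gz()

# Undoing only the gzip layer leaves the raw tar stream, NUL padding included.
gunzipped_only = gzip.decompress(sample)
has_nul_padding = b"\x00" in gunzipped_only

# Unpacking the tar layer as well yields clean text lines.
clean_lines = tar_gz_text_lines(sample)
```

In PySpark, one untested way to apply the same helper would be sc.binaryFiles("abc.tar.gz").flatMap(lambda kv: tar_gz_text_lines(kv[1])), since binaryFiles yields (path, bytes) pairs. Each archive is then read whole by a single task, so this only suits archives that fit in memory.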
