Problem reading gzip file directly in Pyspark

I can read a tar.gz file without errors, but I see a lot of gibberish in the final output. This is what I used (in pyspark):

from operator import add

lines = sc.textFile("abc.tar.gz")
count = lines.flatMap(lambda x: x.split(' ')).map(lambda x: (x, 1)).reduceByKey(add)
print(count.collect())

My output has a lot of \x00\x00 characters in it. Any ideas?
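
A likely cause, offered as a sketch and not a confirmed diagnosis: sc.textFile picks a gzip codec from the .gz extension and decompresses the file, but it does not unpack the tar layer underneath. The tar headers and their NUL padding are then read as ordinary text, which would explain the \x00 runs. One possible workaround is to read the archive as raw bytes and unpack it with Python's tarfile module. The helper name tar_gz_lines and the UTF-8 encoding are assumptions made for illustration, and this approach loads each archive fully into memory on one executor.

```python
import io
import tarfile


def tar_gz_lines(payload):
    """Yield text lines from every regular file inside a .tar.gz byte string.

    tar_gz_lines is a hypothetical helper name; UTF-8 content is assumed.
    """
    with tarfile.open(fileobj=io.BytesIO(payload), mode="r:gz") as archive:
        for member in archive:
            if not member.isfile():
                continue  # skip directories, links, and other non-file entries
            handle = archive.extractfile(member)
            for raw in handle:
                # Decode each line and drop the trailing newline characters.
                yield raw.decode("utf-8", errors="replace").rstrip("\r\n")


# Possible usage in the pyspark shell, where sc is the shell's SparkContext:
#   lines = sc.binaryFiles("abc.tar.gz").flatMap(lambda kv: tar_gz_lines(kv[1]))
# The word count from the question can then run on `lines` unchanged.
```

If the archive holds a single text file, repacking it as a plain .gz (without tar) should also let sc.textFile read it directly.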

apache-spark pyspark

Manish Ranjan, June 17, 2015 at 7:12

No one has answered this question yet
