How to open Commoncrawl.org WARC.GZ S3 Data in Spark

I want to access Common Crawl files in the Amazon public datasets bucket from a Spark shell. The files are in WARC.GZ format.

// Path to a WARC.GZ file in the Common Crawl public data set on S3
val filenameList = List("s3://<ID>:<SECRET>@aws-publicdatasets.s3.amazonaws.com/common-crawl/crawl-data/CC-MAIN-2014-41/segments/1410657102753.15/warc/CC-MAIN-20140914011142-00000-ip-10-196-40-205.us-west-1.compute.internal.warc.gz")

// TODO: implement functionality to read the WARC.GZ file here
// One file name per partition, so each file can be read in parallel
val loadedFiles = sc.parallelize(filenameList, filenameList.length).mapPartitions(i => i)
loadedFiles.foreach(f => f.take(1))


My idea is to use a function that reads the WARC.GZ format inside the mapPartitions call. Is this a good approach? I ask because I am fairly new to the Spark platform and wanted to implement a small demo application using a small piece of the Common Crawl corpus. I saw mapPartitions being used in a thread here.
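To make the idea concrete, here is a rough sketch of what I have in mind. It assumes the credentials embedded in the URL are accepted by Hadoop's S3 filesystem layer (I have not verified this), and it only peeks at the plain-text header lines of the first record; GZIPInputStream may only decompress the first gzip member of a multi-member WARC.GZ file, so a real application would need a proper WARC reader:

import java.util.zip.GZIPInputStream
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import scala.io.Source

// Open each S3 path inside the partition and read the first few lines,
// which for a WARC file are the plain-text WARC record headers
val firstRecords = sc.parallelize(filenameList, filenameList.length).mapPartitions { paths =>
  paths.map { p =>
    val path = new Path(p)
    val fs = path.getFileSystem(new Configuration())
    val in = new GZIPInputStream(fs.open(path))
    try Source.fromInputStream(in).getLines().take(10).mkString("\n")
    finally in.close()
  }
}
firstRecords.take(1).foreach(println)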

I first tried to open the file directly from my own computer using sc.textFile("s3://....").take(1), which resulted in an access error. Are the Amazon S3 public dataset files only accessible from EC2 instances?
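One variant I am considering, assuming the standard Hadoop s3n configuration keys apply here, is to set the credentials on the SparkContext's Hadoop configuration instead of embedding them in the URL:

// Sketch: configure S3 credentials via the Hadoop configuration rather than
// the URL. <ID> and <SECRET> are placeholders for a real AWS key pair.
sc.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", "<ID>")
sc.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", "<SECRET>")

// Read the first line of one WARC.GZ file through the s3n:// scheme
val firstLine = sc.textFile("s3n://aws-publicdatasets/common-crawl/crawl-data/CC-MAIN-2014-41/segments/1410657102753.15/warc/CC-MAIN-20140914011142-00000-ip-10-196-40-205.us-west-1.compute.internal.warc.gz").take(1)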

1 answer


There is sample code from the Analysis of Vulnerability in Web Domains project that shows how to access WARC files from Spark, since Spark supports the Hadoop InputFormat interface. The code itself is hosted on GitHub.
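As a rough sketch of the general pattern (the class names below are placeholders, not the actual API of that project): Spark can read any Hadoop InputFormat via newAPIHadoopFile, so with a WARC InputFormat library on the classpath it would look something like this:

import org.apache.hadoop.io.LongWritable

// Read WARC records through Spark's support for Hadoop new-API InputFormats.
// WarcInputFormat and WarcWritable are assumed to be provided by a WARC
// library on the classpath; the names here are illustrative only.
val warcRecords = sc.newAPIHadoopFile(
  "s3n://aws-publicdatasets/common-crawl/crawl-data/CC-MAIN-2014-41/segments/1410657102753.15/warc/CC-MAIN-20140914011142-00000-ip-10-196-40-205.us-west-1.compute.internal.warc.gz",
  classOf[WarcInputFormat],   // splits the .warc.gz file into WARC records
  classOf[LongWritable],      // key: byte offset of the record
  classOf[WarcWritable])      // value: the parsed WARC record

println(warcRecords.count())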



We hope to soon provide an example in the Common Crawl GitHub repository showing how this is done for Hadoop in both Python and Java.
