How to process / extract .pst files using Hadoop MapReduce

I am currently using MAPI tools (a Microsoft .NET library) and then the Apache Tika libraries to process and extract PST files from an Exchange server, but this approach does not scale.

How can I process / extract PST files using the MR approach? Is there any tool or library available in Java that I can use in my MR jobs? Any help would be great.

Jpst Lib internally uses: PstFile pstFile = new PstFile(java.io.File)

The problem is that the Hadoop API has nothing equivalent to java.io.File.


The following option is always available, but it is not efficient:

  File tempFile = File.createTempFile("myfile", ".tmp");
  fs.moveToLocalFile(new Path (<HDFS pst path>) , new Path(tempFile.getAbsolutePath()) );
  PstFile pstFile = new PstFile(tempFile);

      



2 answers


Take a look at Behemoth (http://digitalpebble.blogspot.com/2011/05/processing-enron-dataset-using-behemoth.html). It combines Tika and Hadoop.

I have also written my own Hadoop + Tika jobs. The pattern:



  • Wrap all PST files in sequence files or Avro files.
  • Write a map-only job that reads the PST files from the Avro files and writes them to local disk.
  • Run Tika over the files.
  • Write the Tika output back to a sequence file.
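The "write to local disk, then parse" step above can be sketched roughly as follows. This is only a minimal illustration using the Java standard library: the names LocalSpill and withLocalFile are hypothetical, and the parser argument stands in for whatever library needs a java.io.File (for example a call that wraps new PstFile(file) or a Tika parse). The Hadoop and Tika wiring is deliberately left out.

```java
import java.io.File;
import java.io.IOException;
import java.nio.file.Files;
import java.util.function.Function;

// Hypothetical helper, not part of Hadoop, Tika, or any PST library.
final class LocalSpill {

    private LocalSpill() {}

    // Writes the record bytes to a local temp file, hands that file to the
    // given parser, and always deletes the temp file afterwards.
    static <T> T withLocalFile(byte[] payload, String suffix,
                               Function<File, T> parser) throws IOException {
        // Created in the JVM's default temp directory (java.io.tmpdir).
        File tmp = File.createTempFile("record-", suffix);
        try {
            Files.write(tmp.toPath(), payload);
            return parser.apply(tmp);
        } finally {
            // Clean up so that long-running tasks do not fill the local disk.
            Files.deleteIfExists(tmp.toPath());
        }
    }
}
```

Inside a mapper, the payload would typically be the bytes of one record value (for instance from BytesWritable.copyBytes()), and the parser lambda would have to wrap any checked exceptions thrown by the PST library, since Function does not allow them. Note that a large PST file still has to fit both in memory and on the task node's local disk with this approach.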

Hope that helps.



It is not possible to process a PST file directly in a mapper. After a lot of analysis and debugging, it turned out that the API is not exposed in a way that allows this: it requires a local file system to store the extracted PST content and cannot store it directly on HDFS. That is the bottleneck. In addition, none of these APIs (the libraries that extract and process PST files) is free.



What we can do is extract the PST content outside HDFS first, and then process the extracted output with MR jobs.







