Need help running a MapReduce WordCount job with data stored on Amazon S3

I am trying to run the MapReduce WordCount example on a text file that I have saved in my bucket on Amazon S3. I have set up all the required authentication for the Hadoop MapReduce framework to communicate with Amazon, but I keep getting this error. Any idea why this is happening?

13/01/20 13:22:15 ERROR security.UserGroupInformation:
PriviledgedActionException as:root
cause:org.apache.hadoop.mapred.InvalidInputException: Input path does
not exist: s3://name-bucket/test.txt
Exception in thread "main"
org.apache.hadoop.mapred.InvalidInputException: Input path does not
exist: s3://name-bucket/test.txt
    at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:197)
    at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:208)
    at org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:989)
    at org.apache.hadoop.mapred.JobClient.writeSplits(JobClient.java:981)
    at org.apache.hadoop.mapred.JobClient.access$600(JobClient.java:174)
    at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:897)
    at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:850)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:416)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1121)
    at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:850)
    at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:824)
    at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1261)
    at org.myorg.WordCount.main(WordCount.java:55)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:616)
    at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
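
For context, the authentication I mentioned is wired into the job configuration roughly like this (a sketch only; the fs.s3.* property names are the standard Hadoop ones for the s3:// scheme, and the key values are placeholders):

    import org.apache.hadoop.mapred.JobConf;

    // Inside the WordCount driver's main(): credentials for the s3:// scheme.
    // Property names are the standard Hadoop ones; the key values are placeholders.
    JobConf conf = new JobConf(WordCount.class);
    conf.set("fs.s3.awsAccessKeyId", "YOUR_ACCESS_KEY");
    conf.set("fs.s3.awsSecretAccessKey", "YOUR_SECRET_KEY");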

      

1 answer


You actually need to replace the s3 protocol with s3n. They are two different filesystems with different properties:

  • s3n is the S3 Native Filesystem: a native filesystem for reading and writing regular files on S3. The advantage of this filesystem is that you can access files on S3 that were written with other tools, and conversely, other tools can access files written using Hadoop. The downside is the 5 GB file size limit imposed by S3, so it is not suitable as a replacement for HDFS (which supports very large files).
  • s3 is the S3 Block FileSystem: a block-based filesystem backed by S3. Files are stored as blocks, just like in HDFS, which makes renames efficient. This filesystem requires you to dedicate a bucket to it: you should not use an existing bucket containing files, or write other files to the same bucket. Files stored in this filesystem can be larger than 5 GB, but they are not interoperable with other S3 tools.


(source)

In your case, your bucket is probably using the s3n filesystem; I believe that is the default, and most buckets in use are s3n as well. Therefore you should use s3n://name-bucket/test.txt.
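
For illustration, here is a minimal sketch of a classic old-API WordCount driver with the corrected scheme (the mapper/reducer are the standard word-count ones; the output path and credential values are placeholders, not taken from your setup):

    import java.io.IOException;
    import java.util.Iterator;
    import java.util.StringTokenizer;

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reducer;
    import org.apache.hadoop.mapred.Reporter;

    public class WordCount {

        // Standard word-count mapper: emit (word, 1) for every token in the line.
        public static class Map extends MapReduceBase
                implements Mapper<LongWritable, Text, Text, IntWritable> {
            private final static IntWritable one = new IntWritable(1);
            private final Text word = new Text();

            public void map(LongWritable key, Text value,
                            OutputCollector<Text, IntWritable> output, Reporter reporter)
                    throws IOException {
                StringTokenizer itr = new StringTokenizer(value.toString());
                while (itr.hasMoreTokens()) {
                    word.set(itr.nextToken());
                    output.collect(word, one);
                }
            }
        }

        // Standard word-count reducer: sum the counts for each word.
        public static class Reduce extends MapReduceBase
                implements Reducer<Text, IntWritable, Text, IntWritable> {
            public void reduce(Text key, Iterator<IntWritable> values,
                               OutputCollector<Text, IntWritable> output, Reporter reporter)
                    throws IOException {
                int sum = 0;
                while (values.hasNext()) {
                    sum += values.next().get();
                }
                output.collect(key, new IntWritable(sum));
            }
        }

        public static void main(String[] args) throws Exception {
            JobConf conf = new JobConf(WordCount.class);
            conf.setJobName("wordcount");

            // Credentials for the native S3 filesystem (values are placeholders).
            conf.set("fs.s3n.awsAccessKeyId", "YOUR_ACCESS_KEY");
            conf.set("fs.s3n.awsSecretAccessKey", "YOUR_SECRET_KEY");

            conf.setOutputKeyClass(Text.class);
            conf.setOutputValueClass(IntWritable.class);
            conf.setMapperClass(Map.class);
            conf.setReducerClass(Reduce.class);

            // The key change: s3n:// instead of s3://, so Hadoop reads the file
            // as a regular S3 object through the native filesystem.
            FileInputFormat.setInputPaths(conf, new Path("s3n://name-bucket/test.txt"));
            FileOutputFormat.setOutputPath(conf, new Path("s3n://name-bucket/wordcount-output"));

            JobClient.runJob(conf);
        }
    }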
