Accessing Google Cloud Storage using the Hadoop FileSystem API

On my machine, I configured Hadoop's core-site.xml to recognize the gs:// scheme and added gcs-connector-1.2.8.jar as a Hadoop library. I can run hadoop fs -ls gs://mybucket/ and get the expected results.
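For reference, the binding that such a core-site.xml entry establishes can also be set programmatically; a minimal sketch, using the connector's documented fs.gs.impl property, with the project id as a placeholder:

import org.apache.hadoop.conf.Configuration;

// Bind the gs:// scheme to the GCS connector's Hadoop FileSystem implementation.
// The project id value below is a placeholder.
Configuration conf = new Configuration();
conf.set("fs.gs.impl", "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem");
conf.set("fs.gs.project.id", "my-project-id");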

However, when I try to do the equivalent from Java using:

Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(conf);
FileStatus[] status = fs.listStatus(new Path("gs://mybucket/"));


I get a listing of files under the local HDFS root instead of gs://mybucket/, but with the paths prefixed by gs://mybucket. If I call conf.set("fs.default.name", "gs://mybucket"); on the Configuration before getting fs, then I can see the files in GCS.
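In other words, the workaround looks roughly like this (bucket name is a placeholder):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Make gs://mybucket the default FileSystem before asking for the default instance.
Configuration conf = new Configuration();
conf.set("fs.default.name", "gs://mybucket");
FileSystem fs = FileSystem.get(conf);
FileStatus[] status = fs.listStatus(new Path("gs://mybucket/"));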

My questions are:
1. Is this the expected behavior?
2. Is there any downside to using the Hadoop FileSystem API as opposed to the Google Cloud Storage client API?





1 answer


As for your first question, "expected" is debatable, but I think I can at least explain it. When FileSystem.get() is used, the default FileSystem is returned, and by default that is HDFS. My guess is that the HDFS client (DistributedFileSystem) has code that automatically prepends the scheme and authority to every path in the filesystem.
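A small sketch of what that looks like in practice (the HDFS URI in the comments is a placeholder; it depends on the fs.default.name / fs.defaultFS value in your configuration):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

// With an unmodified Configuration, FileSystem.get() resolves the default
// scheme, which typically points at HDFS rather than GCS.
Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(conf);
System.out.println(fs.getUri());   // e.g. hdfs://namenode:8020 (placeholder)
System.out.println(fs.getClass()); // e.g. org.apache.hadoop.hdfs.DistributedFileSystem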

Instead of using FileSystem.get(conf), try:

FileSystem gcsFs = new Path("gs://mybucket/").getFileSystem(conf);

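Put together, a minimal, self-contained sketch of that approach (bucket name and class name are placeholders):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ListGcsBucket {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();      // picks up core-site.xml, including the gs:// bindings
    Path bucket = new Path("gs://mybucket/");      // placeholder bucket
    FileSystem gcsFs = bucket.getFileSystem(conf); // resolve the FileSystem from the path's scheme
    for (FileStatus status : gcsFs.listStatus(bucket)) {
      System.out.println(status.getPath());
    }
  }
}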



On the downside, one could argue that if you need direct access to the storage objects, you will end up writing code against the storage APIs anyway (and there are things that do not translate well into the Hadoop FS API, e.g. object composition, complex object-write preconditions beyond simple overwrite protection, etc.).

I am admittedly biased (I work on the team), but if you are going to use GCS from Hadoop MapReduce, from Spark, etc., the GCS connector for Hadoop should be a fairly safe bet.









