How do I get data from HDFS? Hive?
Keep in mind that Hive is a batch processing system, which under the hood converts SQL queries into a series of MapReduce jobs with staged builds in between. Hive is also a high-latency system: depending on your data sizes, you are looking at minutes to hours to process a complex query.
So, if you want to display the results of your MapReduce job on a website, it is highly recommended to export the results back to an RDBMS using Sqoop and serve them from there.
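For example, a minimal Sqoop export might look like the following; the MySQL host, database, credentials, table name, and export directory are all placeholder values for your own setup:
# push a results directory from HDFS into an existing MySQL table
sqoop export \
  --connect jdbc:mysql://dbhost/webapp \
  --username dbuser -P \
  --table wordcount_results \
  --export-dir /user/hive/warehouse/wordcount_results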
Or, if the data itself is huge and cannot be exported back to a DBMS, another option to consider is a NoSQL system like HBase.
Welcome to Hadoop!
I highly recommend watching Cloudera Essentials for Apache Hadoop | Chapter 5: The Hadoop Ecosystem to learn about the different ways of moving data into and out of your HDFS cluster. The video is easy to follow and describes the advantages/disadvantages of each tool, but this outline should give you the basics of the Hadoop ecosystem:
- Flume - Data integration and flat-file import into HDFS. Designed for asynchronous data streams (such as log files). Distributed, scalable, and extensible. Supports various endpoints. Allows pre-processing of data before loading into HDFS.
- Sqoop - Bidirectional transfer of structured data between an RDBMS and HDFS. Allows incremental imports into HDFS. The RDBMS must support JDBC or ODBC.
- Hive - A SQL-like interface to Hadoop. Requires a table structure. Accessible via JDBC and/or ODBC.
- HBase - Allows interactive access to HDFS. Sits on top of HDFS and applies structure to the data. Allows random reads and scales horizontally with the cluster. Limited query language; only allows get/put/scan operations (can be used together with Hive and/or Impala); see the shell sketch after this list. Indexes data by row key only. Does not use the MapReduce paradigm.
- Impala - Like Hive, a high-performance SQL engine for querying huge amounts of data stored in HDFS. Does not use MapReduce. A good alternative to Hive.
- Pig - A data flow language for transforming large datasets. Schemas are resolved at runtime. PigServer (Java API) allows programmatic access.
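To make the HBase point above concrete, here is what get/put/scan access looks like in the HBase shell. This is just a sketch: the table name 'wordcounts' and column family 'cf' are made-up names, and the table is assumed to already exist:
# inside `hbase shell`: write a cell, read it back, scan by row key
put 'wordcounts', 'gloubiboulga', 'cf:count', '42'
get 'wordcounts', 'gloubiboulga'
scan 'wordcounts', {LIMIT => 10}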
Note: I am assuming that the data you are trying to read already exists in HDFS. However, some of these products in the Hadoop ecosystem might still be useful for your application or as a general reference, which is why I've included them.
If you only want to get data out of HDFS, then yes, you can do that through Hive. However, you will benefit most from it if your data is already structured (e.g., into columns).
Let's take an example: your MapReduce job produced a CSV file named wordcount.csv with two columns: a word and a count. This CSV file is on HDFS.
Now suppose you want to know the count of the word "gloubiboulga". You can achieve this with the following queries:
-- two columns, matching the CSV: a word and its count
CREATE TABLE data
(
  word STRING,
  `count` INT  -- backticked because count is a reserved word in recent Hive versions
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
-- the file is on HDFS, so no LOCAL keyword
LOAD DATA INPATH '/wordcount.csv'
OVERWRITE INTO TABLE data;
SELECT word, `count` FROM data WHERE word = 'gloubiboulga';
Note that although HiveQL looks very much like SQL, there are still a few quirks you will need to learn.
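As a usage aside: if you save those statements to a script file (the name wordcount.hql below is hypothetical), the standard Hive CLI can run them non-interactively:
# run a script file
hive -f wordcount.hql
# or run a single query inline
hive -e "SELECT word FROM data WHERE word = 'gloubiboulga'"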