Why is HDFS not recommended for low latency applications?

I'm new to Hadoop and HDFS, and it confuses me why HDFS is not preferred for applications that require low latency. In a big data scenario, we distribute data across commodity hardware, so shouldn't data access be faster?

+3




3 answers


Hadoop is, at its core, a batch processing system designed for storing and analyzing structured, unstructured, and semi-structured data.

The Hadoop MapReduce framework is relatively slow because it is designed to support a wide variety of formats and structures and huge volumes of data.

We shouldn't say HDFS itself is slow, since the HBase NoSQL database and MPP databases like Impala and HAWQ sit on top of HDFS. These data sources are faster because they do not go through MapReduce to retrieve and process data.
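As a rough illustration of that difference, here is a minimal Java sketch of an HBase point read (the table name, row key, and column are made up for the example); a Get is served directly by the region server that owns the row, without starting a MapReduce job:

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBasePointRead {
    public static void main(String[] args) throws IOException {
        Configuration conf = HBaseConfiguration.create();
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("users"))) { // hypothetical table
            // A Get is routed to the single region server that owns this row key,
            // so only that row is read: no MapReduce job, no full scan.
            Get get = new Get(Bytes.toBytes("user#42")); // hypothetical row key
            Result result = table.get(get);
            byte[] value = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("email")); // hypothetical column
            System.out.println(value == null ? "not found" : Bytes.toString(value));
        }
    }
}
```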



The slowness is due to the nature of MapReduce-based execution: it produces a lot of intermediate data, and a lot of data is exchanged between nodes, resulting in heavy disk and network I/O latency. In addition, it has to persist a lot of data to disk for synchronization between phases so that it can recover jobs from failures. There is also no way in MapReduce to cache all or a subset of the data in memory.

Apache Spark is also a batch processing system, but it is relatively faster than Hadoop MapReduce because it caches much of the input data in memory as RDDs and keeps intermediate data in memory as well, eventually writing the data to disk on completion or whenever required.
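Here is a minimal Java sketch of that idea (the HDFS path and filter strings are placeholders); calling cache() keeps the RDD in executor memory after the first pass so later actions do not re-read it from disk:

```java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class SparkCacheSketch {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("cache-sketch");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            // Read the input once from HDFS (path is a placeholder).
            JavaRDD<String> lines = sc.textFile("hdfs:///data/events.log");

            // cache() marks the RDD to be kept in executor memory after it is
            // first computed, so later actions reuse it instead of re-reading HDFS.
            lines.cache();

            long errors = lines.filter(line -> line.contains("ERROR")).count();  // first pass: reads HDFS, fills the cache
            long warnings = lines.filter(line -> line.contains("WARN")).count(); // second pass: served from memory
            System.out.println("errors=" + errors + ", warnings=" + warnings);
        }
    }
}
```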

+1




There is also the fact that HDFS, as a file system, is optimized for large chunks of data. A single block is typically 64-128 MB instead of the more usual 0.5-4 KB. Therefore, even for small operations there will be significant delay when reading from or writing to disk. Add to that its distributed nature and you get significant overhead (indirection, synchronization, replication, etc.) compared to a traditional file system.
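For reference, a small sketch of how a client can set the block size through the standard dfs.blocksize property (the 256 MB value and NameNode address are just examples); the point is that files are managed in blocks of this size no matter how small the individual reads and writes are:

```java
import java.io.IOException;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockSizeSketch {
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        // dfs.blocksize controls the HDFS block size for newly written files.
        // Typical values are 64-256 MB, orders of magnitude larger than the
        // ~4 KB blocks of a local file system.
        conf.setLong("dfs.blocksize", 256L * 1024 * 1024);

        // The NameNode URI and path are placeholders for this sketch.
        try (FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), conf);
             FSDataOutputStream out = fs.create(new Path("/tmp/example.txt"))) {
            // Even this tiny write is tracked as a block of the configured size
            // in NameNode metadata (only the actual bytes occupy disk space).
            out.writeUTF("a very small record");
        }
    }
}
```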



This is in terms of HDFS, which I read to be your main question. Hadoop as a data processing framework has its own set of tradeoffs and inefficiencies (better explained in @hserus's answer), but they mostly target the same niche: robust bulk processing.

0




Low latency or real-time applications usually need specific data. They need quick access to a small amount of data that a user or application is waiting for.

HDFS is designed for storing big data in a distributed environment that provides fault tolerance and high availability. The actual locations of the data are known only to the Namenode. HDFS stores data more or less randomly across the Datanodes, and it splits data files into smaller, fixed-size blocks. So retrieving the specific data a real-time application needs becomes slow, because of network latency, the distribution of the data, and the filtering required to find that data. On the other hand, this layout helps MapReduce and batch jobs, because the executable program is shipped to the machine that stores the data locally (the data locality principle).
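To make that read path concrete, here is a sketch using the standard Hadoop FileSystem API (the NameNode address and file path are placeholders); even a tiny read involves a NameNode metadata lookup followed by streaming from a Datanode:

```java
import java.io.IOException;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadSketch {
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        // Only the Namenode knows which Datanodes hold the blocks of a file,
        // so every read starts with a metadata lookup there.
        try (FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), conf);
             FSDataInputStream in = fs.open(new Path("/data/events/part-00000"))) {
            // open() asks the Namenode for block locations, then the client
            // connects to a Datanode and streams the bytes: two round trips
            // before the first byte arrives, which is why small,
            // latency-sensitive reads are comparatively expensive on HDFS.
            byte[] buffer = new byte[4096];
            int read = in.read(buffer);
            System.out.println("read " + read + " bytes");
        }
    }
}
```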

0








