Big Data Processing in a Data Warehouse

I am learning big data concepts. As I understand it, big data technology is essential for handling unstructured data and high volumes. In a typical big data architecture for a data warehouse (DW), source data is ingested through Hadoop (HDFS and MapReduce), the relevant unstructured information is converted into valid business information, and the result is finally loaded into the DW or data mart through ETL processing (alongside the existing data processing).

However, I would like to know what new techniques, dimensional models, or storage requirements the DW needs as a result of big data, since most of the tutorials and resources I have found cover only Hadoop and not the DW. How does introducing big data affect an organization's predefined reporting and ad hoc analysis, given this large data volume?



2 answers


This is a very broad question, but I will try to provide some answers.

Hadoop can be a data source, a data warehouse, or a "data lake", that is, a repository of data from which warehouses and marts can be drawn.

The line between Hadoop and RDBMS data warehouses is becoming increasingly blurred. As SQL-on-Hadoop becomes a reality, interacting with Hadoop-based data is getting easier. To be queried efficiently, though, the data must have structure.

Some examples of Hadoop / DW interactions:

  • Microsoft Analytics Platform System, with PolyBase providing interoperability between SQL Server and Hadoop
  • Impala (Cloudera), Stinger (Hortonworks) and others providing SQL-on-Hadoop
  • Actian and Vertica (HP) providing RDBMS-compatible MPP on Hadoop

However, Hadoop DW is still immature. It is not yet on par with an RDBMS-based DW: it lacks many security and operational features, and its SQL capabilities are limited. Think about your needs before going down this path.



Another question you should ask is whether you really need this type of platform. Any RDBMS can handle 3-5 TB of data. SQL Server and PostgreSQL are two examples of platforms that will handle a DW on commodity hardware with little administration.

Those same RDBMSs can handle 100 TB workloads, but they require much more care and tuning at this scale.

MPP RDBMS appliances handle data workloads in the petabyte range, with less administrative and operational overhead as they scale. I doubt you will get to this scale, very few companies do :) You can opt for an MPP appliance for a much smaller amount of data if the speed of complex queries is your most important factor. For this reason, I have seen MPP appliances deployed for data volumes of 5 TB.

Depending on the loading technique, you will likely find that RDBMS-based DWs load faster than Hadoop. For example, I am loading hundreds of thousands of rows per second into PostgreSQL and slightly fewer into SQL Server. It takes significantly longer to achieve the same result in Hadoop, as I have to ingest the file, register it in Hive and convert it to Parquet to get the same level of query performance. Over time I expect this to change in favor of Hadoop, but it is not quite there yet.

You mentioned dimensional modeling. If your star schema consists of transactional fact tables and SCD0-SCD1 dimensions, so that processing is insert-only, you may have success with SQL-on-Hadoop. If you need to update facts (accumulating snapshots) or dimensions (SCD2, SCD3), you may struggle with both capability and performance: many implementations do not yet support UPDATE queries, and those that do are slow.
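To make the insert-only versus update distinction concrete, here is a minimal sketch. It uses Python's built-in sqlite3 purely as a stand-in for a SQL engine; the table and column names (fact_sales, dim_customer, valid_from and so on) are hypothetical and not taken from any particular warehouse. The point is only that a transactional fact load needs INSERT alone, while an SCD2 change has to UPDATE an existing row before inserting the new version.

```python
import sqlite3

# In-memory SQLite is only a stand-in here; all names below are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE fact_sales (sale_id INTEGER, customer_key INTEGER, amount REAL);
    CREATE TABLE dim_customer (
        customer_key INTEGER PRIMARY KEY AUTOINCREMENT,
        customer_id  TEXT,
        city         TEXT,
        valid_from   TEXT,
        valid_to     TEXT,
        is_current   INTEGER
    );
""")

def load_sales(rows):
    """Transactional fact load: append-only, no existing row is touched."""
    conn.executemany("INSERT INTO fact_sales VALUES (?, ?, ?)", rows)

def apply_scd2_change(customer_id, new_city, change_date):
    """SCD2 change: expire the current row (UPDATE), then add a new version (INSERT)."""
    conn.execute(
        "UPDATE dim_customer SET valid_to = ?, is_current = 0 "
        "WHERE customer_id = ? AND is_current = 1",
        (change_date, customer_id),
    )
    conn.execute(
        "INSERT INTO dim_customer "
        "(customer_id, city, valid_from, valid_to, is_current) "
        "VALUES (?, ?, ?, NULL, 1)",
        (customer_id, new_city, change_date),
    )

apply_scd2_change("C1", "Oslo", "2014-01-01")    # first version, nothing to expire
apply_scd2_change("C1", "Bergen", "2014-06-01")  # expires Oslo, adds Bergen
load_sales([(1, 1, 10.0), (2, 2, 25.0)])

history = conn.execute(
    "SELECT city, valid_to, is_current FROM dim_customer ORDER BY customer_key"
).fetchall()
```

On an engine without UPDATE support, the first statement in apply_scd2_change has no direct equivalent, which is where the capability problem shows up.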

Sorry there isn't a simple "Do it!" answer, but it's a tricky topic in an immature field. Hope these comments help you think.



Building a data lake and building a data warehouse are not the same process. Dimensional modeling in the traditional sense starts with identifying business processes and designing a star schema, whereas with a data lake you make no assumptions about the business process. Data lakes collect data at the most granular level possible, and you then explore it to find the business process. You can learn more about data lakes in "Introduction to Enterprise Data Lake - Myths and Miracles".
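As a rough illustration of that difference, the sketch below keeps raw events at their finest grain and applies structure only when they are read. The event fields (ts, user, action, sku, amount) and the daily_sales helper are hypothetical, chosen just for this example; a real lake would hold files in HDFS or similar rather than a Python list.

```python
import json
from collections import defaultdict

# Hypothetical raw events, stored as-is with no business process assumed up front.
raw_lines = [
    '{"ts": "2014-05-01T10:00:00", "user": "u1", "action": "view", "sku": "A"}',
    '{"ts": "2014-05-01T10:05:00", "user": "u1", "action": "buy", "sku": "A", "amount": 20.0}',
    '{"ts": "2014-05-02T09:00:00", "user": "u2", "action": "buy", "sku": "B", "amount": 5.0}',
]

def daily_sales(lines):
    """Apply structure at read time: keep purchases, total amount per (day, sku)."""
    totals = defaultdict(float)
    for line in lines:
        event = json.loads(line)
        if event["action"] == "buy":
            day = event["ts"][:10]  # date part of the ISO timestamp
            totals[(day, event["sku"])] += event["amount"]
    return dict(totals)

sales = daily_sales(raw_lines)
```

The "view" event is ignored by this particular reading but remains in the raw data, so a different business process discovered later could still use it.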


