Ingesting Oracle data into Hadoop in real time

I have a requirement to ingest data from an Oracle database into Hadoop in real time.

What's the best way to achieve this in Hadoop?

+3




3 answers


The key issue here is getting data out of the Oracle DB in real time. This is usually called Change Data Capture, or CDC. The complete solution depends on how you handle this part.

Other things that are relevant to this answer are:

  • What is the purpose of the data and what are you going to do with it?
    • Just store it as plain HDFS files and run ad hoc queries with something like Impala?
    • Store it in HBase for use by other applications?
    • Feed it into a CEP solution such as Storm?
    • ...
  • What tools are familiar to your team?
    • Do you prefer a DIY approach, gluing together existing open source tools and coding the missing pieces?
    • Or do you prefer a data integration tool like Informatica?


Coming back to CDC, there are three different approaches:

  • Simple: if you don't need true real time and there is a way to identify new data with an SQL query that is fast enough for the required data latency, you can run that query over and over again and ingest its results (the exact method depends on the purpose, the size of each chunk, and your preferred tools).
  • Hard: roll your own CDC solution: read the database logs, parse them into a series of inserts / updates / deletes, and replay them in Hadoop.
  • Expensive: buy a CDC solution that does it for you (like GoldenGate or Attunity).
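
The simple approach can be sketched roughly as follows. This is only an illustration under stated assumptions: the table `orders` and its columns `id` and `payload` are hypothetical, the key is assumed to increase monotonically, and `sqlite3` stands in for Oracle so the example runs anywhere. The `:last_id` named bind style is also accepted by the common Oracle Python drivers, but check your driver's documentation.

```python
import sqlite3

# Hypothetical table and column names. The query assumes a monotonically
# increasing key (a last-modified timestamp would work similarly).
POLL_SQL = "SELECT id, payload FROM orders WHERE id > :last_id ORDER BY id"


def fetch_new_rows(conn, last_id):
    """Return (rows, new_high_water_mark) for rows with id > last_id."""
    cur = conn.cursor()
    cur.execute(POLL_SQL, {"last_id": last_id})
    rows = cur.fetchall()
    cur.close()
    # Advance the high-water mark only if something new arrived.
    new_mark = rows[-1][0] if rows else last_id
    return rows, new_mark


def demo():
    # sqlite3 is a stand-in for the Oracle connection in this sketch.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, payload TEXT)")
    conn.executemany("INSERT INTO orders VALUES (?, ?)", [(1, "a"), (2, "b")])
    first, mark = fetch_new_rows(conn, 0)      # first poll sees rows 1 and 2
    conn.execute("INSERT INTO orders VALUES (3, 'c')")
    second, mark = fetch_new_rows(conn, mark)  # second poll sees only row 3
    conn.close()
    return first, second, mark
```

In a real job you would call `fetch_new_rows` in a loop with a sleep matching your latency target, persist the high-water mark, and write each batch to HDFS or HBase. Note that a query like this only sees inserted rows; it does not detect deletes, and it sees updates only if you poll on a last-modified column instead.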
+4




Expanding a bit on what @Nickolay mentioned, there are a few options, but saying which one is best would be too opinion-based.

Tungsten (open source)

Tungsten Replicator is an open source replication engine that supports many different extractor and applier modules. Data can be extracted from MySQL, Oracle and Amazon RDS and applied to transactional stores including MySQL, Oracle and Amazon RDS; NoSQL stores like MongoDB; and data warehouse stores like Vertica, Hadoop, and Amazon RDS.

Oracle GoldenGate



Oracle GoldenGate is a comprehensive software suite for real-time data integration and replication in heterogeneous IT environments. The product suite provides high availability solutions, real-time data integration, transactional change data capture, data replication, transformation and verification between operational and analytical enterprise systems. It provides a handler for HDFS.

Dell SharePlex

The SharePlex™ connector for Hadoop® loads and continuously replicates changes from an Oracle® database to a Hadoop® cluster. This gives you all the benefits of maintaining a real-time or near real-time copy of the source tables.

+2




Apache Sqoop is a tool for transferring bulk data from any DBMS with a JDBC connection (Oracle is supported) to Hadoop HDFS.
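
Sqoop runs as a batch job, so it gives you periodic incremental loads rather than true real time. As a rough sketch, the helper below builds the argument list for a Sqoop 1 incremental append import. The helper name, JDBC URL, table, column, and target directory are all hypothetical, and credential options such as `--username` and `--password-file` are left out for brevity.

```python
def sqoop_incremental_args(jdbc_url, table, check_column, last_value, target_dir):
    """Build argv for a Sqoop 1 incremental 'append' import (illustrative only)."""
    return [
        "sqoop", "import",
        "--connect", jdbc_url,
        "--table", table,
        "--target-dir", target_dir,
        # Import only rows whose check column is greater than the last value.
        "--incremental", "append",
        "--check-column", check_column,
        "--last-value", str(last_value),
    ]


# Example with made-up values; pass the list to subprocess.run() to execute it.
args = sqoop_incremental_args(
    "jdbc:oracle:thin:@//dbhost:1521/ORCL", "ORDERS", "ID", 100, "/data/orders"
)
```

Scheduling this every few minutes and feeding the last imported value back into the next run gives a micro-batch pipeline; whether that latency is acceptable depends on your requirements.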

0








