Sqoop lastmodified incremental load not working with updated records

I am working on an incremental Sqoop job to load data from MySQL to HDFS. Following are the scenarios.

Scenario 1: The following entries were inserted into the sample table in MySQL.

select * from sample;
+-----+--------+--------+---------------------+
| id  | policy | salary | updated_time        |
+-----+--------+--------+---------------------+
| 100 |      1 |   4567 | 2017-08-02 01:58:28 |
| 200 |      2 |   3456 | 2017-08-02 01:58:29 |
| 300 |      3 |   2345 | 2017-08-02 01:58:29 |
+-----+--------+--------+---------------------+


Below is the structure of the sample table in MySQL:

create table sample (id int not null primary key, policy int, salary int, updated_time TIMESTAMP DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP);


I am trying to import this into HDFS by creating a Sqoop job as shown below:

sqoop job --create incjob -- import --connect jdbc:mysql://localhost/retail_db --username root -P --table sample --merge-key id --split-by id --target-dir /user/cloudera --append --incremental lastmodified --check-column updated_time -m 1
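For completeness, a saved job like this only has its definition stored by --create; it is executed with the following command (incjob is the job name from the command above):

sqoop job --exec incjob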


After executing the Sqoop job, below are the output records in HDFS.

$ hadoop fs -cat /user/cloudera/par*
100,1,4567,2017-08-02 01:58:28.0
200,2,3456,2017-08-02 01:58:29.0
300,3,2345,2017-08-02 01:58:29.0


Scenario 2: Several new records were inserted and some existing records were updated in the sample table. The table now looks like this:

select * from sample;
+-----+--------+--------+---------------------+
| id  | policy | salary | updated_time        |
+-----+--------+--------+---------------------+
| 100 |      6 |   5638 | 2017-08-02 02:01:09 |
| 200 |      2 |   7654 | 2017-08-02 02:01:10 |
| 300 |      3 |   2345 | 2017-08-02 01:58:29 |
| 400 |      4 |   1234 | 2017-08-02 02:01:17 |
| 500 |      5 |   6543 | 2017-08-02 02:01:18 |
+-----+--------+--------+---------------------+
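For illustration, statements roughly like the following would produce this state; the updated_time column is bumped automatically by the ON UPDATE CURRENT_TIMESTAMP clause in the table definition (values are simply taken from the tables above and run through the mysql client as a sketch):

mysql -u root -p retail_db -e "
  UPDATE sample SET policy = 6, salary = 5638 WHERE id = 100;
  UPDATE sample SET salary = 7654 WHERE id = 200;
  INSERT INTO sample (id, policy, salary) VALUES (400, 4, 1234), (500, 5, 6543);
"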


After running the same Sqoop job again, below are the entries in HDFS.

hadoop fs -cat /user/cloudera/par*
100,1,4567,2017-08-02 01:58:28.0
200,2,3456,2017-08-02 01:58:29.0
300,3,2345,2017-08-02 01:58:29.0
100,6,5638,2017-08-02 02:01:09.0
200,2,7654,2017-08-02 02:01:10.0
400,4,1234,2017-08-02 02:01:17.0
500,5,6543,2017-08-02 02:01:18.0


Here, the records updated in MySQL are inserted as new records in HDFS instead of updating the existing HDFS records. I have used both the --merge-key and --append switches in my sqoop job command. Can anyone help me with this problem?





2 answers


You are using --merge-key, --append, and --incremental lastmodified together. That is not correct. Here is what each option does:

  • --incremental append mode: appends new data to an existing dataset in HDFS. Use append mode when importing a table to which new rows are continually being added with increasing row id values.

  • --incremental lastmodified mode: use this when rows of the source table can be updated, and each such update sets the last-modified column to the current timestamp.

  • --merge-key: the merge tool runs a MapReduce job that takes two directories as input, a newer dataset and an older one, specified with --new-data and --onto respectively. The output of the job is placed in the HDFS directory given by --target-dir (a standalone invocation is sketched after this list).

  • --last-value (value): specifies the maximum value of the check column from the previous import. If you run the import from the command line rather than as a saved Sqoop job, you have to pass --last-value yourself; a saved job records it for you.
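For reference, a standalone merge invocation looks roughly like this (a sketch only: the directory names and the sample.jar / sample record class produced by an earlier import or codegen run are placeholders, not taken from the question):

sqoop merge --new-data /user/cloudera/new_data \
  --onto /user/cloudera/old_data \
  --target-dir /user/cloudera/merged \
  --jar-file sample.jar \
  --class-name sample \
  --merge-key id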

In your case, there are some new records and some existing records are being updated, so you need to use lastmodified mode.



Your Sqoop command should be:

sqoop job --create incjob -- import --connect jdbc:mysql://localhost/retail_db --username root -P --table sample --merge-key id --target-dir /user/cloudera --incremental lastmodified --check-column updated_time -m 1

Since you specified only one mapper (-m 1), there is no need for --split-by.
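To check what Sqoop will use as the starting point for the next run, you can inspect the saved job; its stored properties include the last saved value of the check column (incjob is the job name from the question):

sqoop job --show incjob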





  • I understand that you are trying to update existing records in HDFS whenever there is a change in the MySQL source table.
  • You should use --append only when you do not want the changed source records to replace the existing records in HDFS.
  • Another approach is to load the changed records into a separate directory as delta_records and then merge them with base_records; a sketch follows below. See the Hortonworks article for more details.
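A rough sketch of that approach, assuming placeholder directories base_records and delta_records under /user/cloudera and reusing the sample.jar / sample record class generated by an earlier import (both placeholders, and the --last-value timestamp is only illustrative):

# 1. Pull only the changed rows into a separate delta directory
#    (the delta directory should not already exist, otherwise Sqoop
#     will insist on --append or --merge-key)
sqoop import --connect jdbc:mysql://localhost/retail_db --username root -P \
  --table sample --target-dir /user/cloudera/delta_records \
  --incremental lastmodified --check-column updated_time \
  --last-value "2017-08-02 01:58:29" -m 1

# 2. Reconcile the delta with the base records using the merge tool
sqoop merge --new-data /user/cloudera/delta_records \
  --onto /user/cloudera/base_records \
  --target-dir /user/cloudera/merged_records \
  --jar-file sample.jar --class-name sample \
  --merge-key id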










