Is it possible to use multi-threaded JDBC insertion?

I am currently working on a Java project for which I need to populate a large (for me) MySQL database. I have to build a web scraper using Jsoup and store the results in the database. I estimate I will have roughly 1,500,000 to 2,000,000 records. In my first test I simply inserted the records in a loop, and it took one week to insert about a third of them, which seems too slow. Is it possible to make this process multithreaded, so that I can split the records into three sets of, say, 500,000 records each and insert them concurrently into one database (specifically, one table)?

+3




5 answers


Multithreading won't help you here. You will simply move the contention bottleneck from your application server to the database.

Try using batch inserts instead; they are usually much faster for this kind of bulk load. See "3.4 Creating Batch Updates" in the JDBC tutorial.
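A minimal sketch of what a batch insert could look like with plain JDBC. The `pages` table, its `url` and `title` columns, the class and method names, and the batch size are all assumptions made for illustration; none of them come from the question.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.List;

public class BatchInsertSketch {
    // Hypothetical table and columns, chosen only for this example.
    static final String SQL = "INSERT INTO pages (url, title) VALUES (?, ?)";

    /**
     * Inserts all rows, sending them to the server in groups of batchSize
     * and committing once per group. Returns the number of batches executed.
     */
    static int insertAll(Connection conn, List<String[]> rows, int batchSize)
            throws SQLException {
        boolean previousAutoCommit = conn.getAutoCommit();
        conn.setAutoCommit(false); // one commit per batch instead of per row
        int batches = 0;
        try (PreparedStatement ps = conn.prepareStatement(SQL)) {
            int pending = 0;
            for (String[] row : rows) {
                ps.setString(1, row[0]);
                ps.setString(2, row[1]);
                ps.addBatch(); // queue the row locally, no round trip yet
                if (++pending == batchSize) {
                    ps.executeBatch(); // send the queued rows together
                    conn.commit();
                    batches++;
                    pending = 0;
                }
            }
            if (pending > 0) { // flush the final, partially filled batch
                ps.executeBatch();
                conn.commit();
                batches++;
            }
        } catch (SQLException e) {
            conn.rollback(); // discard the batch that failed
            throw e;
        } finally {
            conn.setAutoCommit(previousAutoCommit);
        }
        return batches;
    }
}
```

Calling `insertAll(conn, rows, 1000)` with a connection from `DriverManager.getConnection(...)` would be the typical usage. With MySQL Connector/J, the `rewriteBatchedStatements=true` connection property is commonly suggested so that batches are sent as multi-row inserts; it is worth checking against the driver version in use.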



Edit: As @Jon commented, you need to decouple fetching the web pages from inserting them into the database; otherwise the whole process will run at the speed of the slowest operation. You can have multiple threads fetching web pages and adding the extracted data to a queue, and then a single thread draining the queue into the database using batch inserts.
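The decoupling described in the edit could be sketched roughly as follows: several fetcher threads put records on a bounded queue, and a single writer thread drains it in batches. The class name, the `batchSink` callback (where a JDBC batch insert would go), the queue capacity, and the placeholder fetch step are assumptions for illustration only.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

public class ScrapeQueueSketch {
    // Unique sentinel object; compared by identity so no real record can match it.
    private static final String POISON = new String("POISON");

    /** Returns the number of records handed to batchSink. */
    static int run(List<String> urls, int fetchers, int batchSize,
                   Consumer<List<String>> batchSink) throws InterruptedException {
        // Bounded queue: fetchers block when the writer falls behind.
        BlockingQueue<String> queue = new LinkedBlockingQueue<>(10_000);
        ExecutorService pool = Executors.newFixedThreadPool(fetchers);
        for (String url : urls) {
            pool.execute(() -> {
                // Placeholder for the real fetch-and-parse step (e.g. with Jsoup).
                String record = "record for " + url;
                try {
                    queue.put(record);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
        }

        int[] written = {0};
        Thread writer = new Thread(() -> {
            List<String> batch = new ArrayList<>(batchSize);
            try {
                while (true) {
                    String rec = queue.take();
                    if (rec == POISON) {
                        break; // all fetchers have finished
                    }
                    batch.add(rec);
                    if (batch.size() == batchSize) {
                        batchSink.accept(new ArrayList<>(batch));
                        written[0] += batch.size();
                        batch.clear();
                    }
                }
                if (!batch.isEmpty()) { // flush the last partial batch
                    batchSink.accept(new ArrayList<>(batch));
                    written[0] += batch.size();
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        writer.start();

        pool.shutdown();                          // no more fetch tasks
        pool.awaitTermination(1, TimeUnit.HOURS); // wait for fetchers to finish
        queue.put(POISON);                        // then tell the writer to stop
        writer.join();
        return written[0];
    }
}
```

In a real run, `batchSink` would wrap a JDBC batch insert on a connection owned by the writer thread alone. A production version would also need to handle a failing sink, which this sketch does not.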

+4




Just make sure no two (or more) threads use the same connection at the same time; a connection pool takes care of this. c3p0 and Apache DBCP come to mind.
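As a rough illustration of the "one connection per thread" rule, the sketch below has each worker borrow its own connection from a `javax.sql.DataSource` and return it when done. It assumes the pool (c3p0, DBCP, or any other) is exposed as a `DataSource`; the `pages` table, the `url` column, and the class and method names are hypothetical.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import javax.sql.DataSource;

public class PooledInsertSketch {
    // Hypothetical table and column, chosen only for this example.
    static final String SQL = "INSERT INTO pages (url) VALUES (?)";

    /** Inserts each chunk on its own pooled connection; returns rows submitted. */
    static int insertInParallel(DataSource pool, List<List<String>> chunks, int threads)
            throws Exception {
        ExecutorService exec = Executors.newFixedThreadPool(threads);
        try {
            List<Future<Integer>> results = new ArrayList<>();
            for (List<String> chunk : chunks) {
                results.add(exec.submit(() -> insertChunk(pool, chunk)));
            }
            int total = 0;
            for (Future<Integer> f : results) {
                total += f.get(); // rethrows any failure from the worker
            }
            return total;
        } finally {
            exec.shutdown();
        }
    }

    static int insertChunk(DataSource pool, List<String> chunk) throws SQLException {
        // The connection is borrowed here and never shared with another thread;
        // closing it hands it back to the pool.
        try (Connection conn = pool.getConnection();
             PreparedStatement ps = conn.prepareStatement(SQL)) {
            for (String url : chunk) {
                ps.setString(1, url);
                ps.addBatch();
            }
            ps.executeBatch();
            return chunk.size();
        }
    }
}
```

Whether parallel writers actually help depends on where the bottleneck is, so it is worth measuring before and after.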



+1




You can partition your records and insert them that way, but you may need to consider other factors as well.

Are you making a network round trip for each INSERT? If so, latency might be the real enemy. Try batching those queries to reduce network traffic.

Do you have transactions? If so, the size of the rollback log might be an issue.

I would recommend profiling your application server and database server to see where your time is being spent. You can spend a lot of time guessing the root cause.

+1




You can insert these records from different threads, provided they use different primary key values.

You should also take a look at Spring Batch, which I believe will be useful in your case.

+1




I think a multi-threaded approach is fine for your problem, but you should use a connection pool, for example C3P0 or the Tomcat 7 connection pool, for better performance.

Another solution is to use a batch framework such as Spring Batch; there are other batch utilities as well.

Another solution is to use a PL/SQL procedure that takes a structure as its input parameter.

0








