Database Performance Recommendations for Batch Importing Large Datasets

I am creating a database web application using Java and Hibernate's JPA implementation. The app keeps track of objects, and it also needs to batch-import objects from a legacy source.

For example, let's say we are tracking people. The database has tables called Person and Address. There are corresponding JPA entities and DAO classes.

In addition to the JPA layer, a service layer handles various operations. One operation is importing a potentially large dataset from an external source (such as a phone book). For each person, the service must check whether that person already exists in the database, then create or update the person as needed. Each person has an address, so the corresponding address and cross-reference must also be created.

My problem is that this operation can be slow for large datasets. My current algorithm:

for (Person person : allPersons)
{
    // check if person exists in database
    // check if address exists in database
    // create or update person and address as necessary
}

      

What would you recommend to improve performance?

Off the top of my head, I can think of:

  • Modify the import logic to retrieve and store data using bulk queries. For example, instead of checking whether each person exists inside the for loop, send all of the person keys to the database in a single query and process each person in memory (see the sketch after this list).
  • Add my own caching to the DAO classes.
  • Use an external caching solution (like memcached).
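
For what it's worth, here is a rough sketch of what I mean by option #1. It assumes a Person entity with a natural key field externalId, a plain JPA EntityManager, and a hypothetical copyFrom helper; none of these names come from my real code:

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import javax.persistence.EntityManager;

public class PersonImporter {

    private final EntityManager em;

    public PersonImporter(EntityManager em) {
        this.em = em;
    }

    public void importAll(List<Person> allPersons) {
        // Collect every key up front instead of querying once per person.
        List<String> keys = new ArrayList<String>();
        for (Person p : allPersons) {
            keys.add(p.getExternalId());
        }

        // One query fetches every already-known person; index them by key.
        // (For very large imports the key list itself may need to be chunked.)
        List<Person> found = em
                .createQuery("select p from Person p where p.externalId in :keys", Person.class)
                .setParameter("keys", keys)
                .getResultList();
        Map<String, Person> existing = new HashMap<String, Person>();
        for (Person p : found) {
            existing.put(p.getExternalId(), p);
        }

        // Decide in memory whether each incoming person is an insert or an update.
        for (Person incoming : allPersons) {
            Person current = existing.get(incoming.getExternalId());
            if (current == null) {
                em.persist(incoming);
            } else {
                current.copyFrom(incoming); // hypothetical helper that copies the fields
            }
        }
    }
}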

I can always go with #1 by restructuring the logic to minimize queries. The downside is that my service layer then has to know a lot about the DAO layer, and its implementation ends up dictated by the database layer below it. There are other problems as well, such as using too much memory. This pull-everything-from-the-database-then-process-in-memory approach feels very home-grown and runs counter to off-the-shelf solutions like JPA. I am curious what others would do in this case.

Edit: Caching won't help, as each person requested in the loop is different.

2 answers


There are two solutions I have found that work. One is to process a chunk at a time and restart the session after each chunk finishes. I have tried using the flush and clear methods on the session, but they don't always behave the way you would expect. Starting and stopping a transaction per chunk seems to work best.
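
Roughly, the chunking looks like this; the chunk size and the saveOrUpdate call are just placeholders, assuming a plain Hibernate SessionFactory:

import java.util.List;
import org.hibernate.Session;
import org.hibernate.SessionFactory;
import org.hibernate.Transaction;

public class ChunkedImport {

    private static final int CHUNK_SIZE = 500; // illustrative value

    private final SessionFactory sessionFactory;

    public ChunkedImport(SessionFactory sessionFactory) {
        this.sessionFactory = sessionFactory;
    }

    public void importAll(List<Person> allPersons) {
        for (int start = 0; start < allPersons.size(); start += CHUNK_SIZE) {
            List<Person> chunk =
                    allPersons.subList(start, Math.min(start + CHUNK_SIZE, allPersons.size()));

            // A fresh session and transaction per chunk keeps the first-level
            // cache small and makes it easy to recover from a failed chunk.
            Session session = sessionFactory.openSession();
            Transaction tx = session.beginTransaction();
            try {
                for (Person person : chunk) {
                    session.saveOrUpdate(person);
                }
                tx.commit();
            } catch (RuntimeException e) {
                tx.rollback();
                throw e;
            } finally {
                session.close();
            }
        }
    }
}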



If performance is a major concern, just break down and do it in plain JDBC. Hibernate adds too much overhead when batching large datasets where memory and performance are important.
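
If you go that route, the usual PreparedStatement batching pattern is something like the following; the table and column names are made up for illustration:

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.List;
import javax.sql.DataSource;

public class JdbcBatchImport {

    private final DataSource dataSource;

    public JdbcBatchImport(DataSource dataSource) {
        this.dataSource = dataSource;
    }

    public void insertAll(List<Person> persons) throws SQLException {
        String sql = "insert into PERSON (EXTERNAL_ID, NAME) values (?, ?)";
        try (Connection con = dataSource.getConnection();
             PreparedStatement ps = con.prepareStatement(sql)) {
            con.setAutoCommit(false);
            int count = 0;
            for (Person p : persons) {
                ps.setString(1, p.getExternalId());
                ps.setString(2, p.getName());
                ps.addBatch();
                // Send the accumulated batch to the database periodically.
                if (++count % 1000 == 0) {
                    ps.executeBatch();
                }
            }
            ps.executeBatch();
            con.commit();
        }
    }
}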

Your approach will result in too many individual database queries; it looks like roughly 4n + 1. If possible, I would write a query (possibly in raw SQL) that checks for a person + address in just one shot.
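
Something along these lines (sketched in JPQL here rather than raw SQL, and assuming Person has an externalId key and a mapped address association) would tell you in one round trip which people and addresses already exist:

import java.util.List;
import javax.persistence.EntityManager;

public class ExistenceCheck {

    // Each returned row holds the external key plus the person and address ids
    // (null when missing), so the import loop can decide insert vs. update in memory.
    public List<Object[]> findExisting(EntityManager em, List<String> keys) {
        return em.createQuery(
                "select p.externalId, p.id, a.id "
              + "from Person p left join p.address a "
              + "where p.externalId in :keys", Object[].class)
            .setParameter("keys", keys)
            .getResultList();
    }
}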

You might want to work with a StatelessSession instead of the standard Hibernate session. Since it has no first-level cache, it should have lower memory requirements.

http://www.hibernate.org/hib_docs/reference/en/html/batch-statelesssession.html
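
A minimal sketch, assuming you have the SessionFactory at hand; a StatelessSession does not track entities, so inserts and updates have to be issued explicitly:

import java.util.List;
import org.hibernate.SessionFactory;
import org.hibernate.StatelessSession;
import org.hibernate.Transaction;

public class StatelessImport {

    public void importAll(SessionFactory sessionFactory, List<Person> persons) {
        StatelessSession session = sessionFactory.openStatelessSession();
        Transaction tx = session.beginTransaction();
        try {
            for (Person p : persons) {
                session.insert(p); // or session.update(p) for rows that already exist
            }
            tx.commit();
        } catch (RuntimeException e) {
            tx.rollback();
            throw e;
        } finally {
            session.close();
        }
    }
}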



If that doesn't work for you, you need to take a look at the batch parameters in Hibernate:

http://www.hibernate.org/hib_docs/reference/en/html/batch.html
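
The pattern from that chapter boils down to enabling hibernate.jdbc.batch_size and flushing/clearing the session periodically; the batch size of 20 below is just an illustrative value:

import java.util.List;
import org.hibernate.Session;
import org.hibernate.SessionFactory;
import org.hibernate.Transaction;
import org.hibernate.cfg.Configuration;

public class BatchSettingsExample {

    public SessionFactory buildFactory() {
        // hibernate.jdbc.batch_size tells Hibernate to group statements into
        // JDBC batches instead of sending them one at a time.
        return new Configuration()
                .configure()
                .setProperty("hibernate.jdbc.batch_size", "20")
                .buildSessionFactory();
    }

    public void save(SessionFactory factory, List<Person> persons) {
        Session session = factory.openSession();
        Transaction tx = session.beginTransaction();
        int i = 0;
        for (Person p : persons) {
            session.save(p);
            if (++i % 20 == 0) {
                session.flush(); // push the current batch of inserts to the database
                session.clear(); // evict managed entities so memory stays bounded
            }
        }
        tx.commit();
        session.close();
    }
}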
