Performance recommendations for importing large datasets
I have a function that allows users to import contacts (email and name). Some users import files with about 70,000 contacts. The files can be xls or csv. This is what I have now:
- The user selects the file (from their computer) from which they want to import contacts.
- I save the file to the server and create a database entry with a link to the file location.
- Amazon SQS is used to process this in the background.
- The first time the job runs, I process the file, keeping only the lines with an email address and the name (if found). The data is saved to a json file in the same location and cached. Then I release the job back to the queue.
- The contacts are now ready to import. I take 1000 contacts in each job and store each contact in its own row in the database. I am using array_slice to skip the contacts in the json file that were already imported. The skip count is saved in the database.
- When there are no contacts left, the task is deleted and everything is done.
This is pretty much the whole process. I also have a check (database search) to check for duplicates. Only unique email addresses are allowed.
The problem is that the job seems to be taking too long and I am getting timeouts. This leads to the import taking a long time.
So my question is, is there anything I can do better?
Let me know if you need anything else. I don't have much experience with big data and many users.
EDIT: I don't need any code. What I need to know is: is this a server problem? Maybe moving the database to its own server will do the trick? Or should I use a different approach?
EDIT 2: The user can see the import progress. So I need to count the contacts, and to do that I first filter out the lines without an email address. I also trim the email and name columns. Once I had done this, it was easier for me to save the new dataset to a JSON file.
EDIT 3: Timeouts happen when saving users to the database, not in the initial processing and json file creation.
EDIT 4: One way to speed up a job is to split the data into chunks from the beginning (in the first processing step). This way I don't need to handle the skip count and I don't need to use array_slice on a large dataset. Also, now that I think about it, it's silly to store it in a json file and then cache it. Why not cache the array from the start?
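The chunk-up-front idea from EDIT 4 could look roughly like the sketch below. It is written in Python purely for illustration (the thread itself is about PHP), and the names `clean_rows` and `chunked`, the `email`/`name` keys, the lowercasing of emails, and the sample rows are all assumptions, not part of the original setup. It filters out rows without an email, trims the fields, drops duplicates within the file, and yields fixed-size chunks so that no later job needs a skip count.

```python
from typing import Dict, Iterable, Iterator, List, Optional

Row = Dict[str, Optional[str]]


def clean_rows(rows: Iterable[Row]) -> Iterator[Dict[str, str]]:
    """Keep rows that have an email, trim fields, drop duplicates in the file."""
    seen = set()
    for row in rows:
        # Assumption: emails are compared case-insensitively.
        email = (row.get("email") or "").strip().lower()
        if not email or "@" not in email:
            continue  # no usable email address on this line
        if email in seen:
            continue  # duplicate inside the same file
        seen.add(email)
        yield {"email": email, "name": (row.get("name") or "").strip()}


def chunked(items: Iterable[Dict[str, str]], size: int) -> Iterator[List[Dict[str, str]]]:
    """Yield lists of at most `size` items; each list can become one queued job."""
    chunk: List[Dict[str, str]] = []
    for item in items:
        chunk.append(item)
        if len(chunk) == size:
            yield chunk
            chunk = []
    if chunk:
        yield chunk


# Hypothetical parsed rows standing in for the uploaded xls/csv file.
rows = [
    {"email": " a@example.com ", "name": "Ann"},
    {"email": "", "name": "No Email"},
    {"email": "A@example.com", "name": "Ann again"},
    {"email": "b@example.com", "name": None},
    {"email": "c@example.com", "name": " Cy "},
]

chunks = list(chunked(clean_rows(rows), 2))
total = sum(len(c) for c in chunks)  # total count for the progress display
```

Each chunk could then be cached or queued under its own key, and the total is known before any database insert starts.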
I take 1000 contacts in each job and store each contact in its own row in the database.
I ran into this problem earlier. In my case I needed to import about 50,000 employee attendance records, and I solved it using parallelization. You may have noticed this yourself, which is why you take 1000 contacts in each job. The real problem is the timeout we run into when a single job takes on that much work, correct?
So my solution is to create more child processes to do the job. If I create one job for 1000 imports, it takes more time and runs slower. So I create 100 queued jobs, each importing 100 records, and I run them together. This increases CPU usage, which is not a problem if you have a high-specification machine.
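A minimal sketch of the "100 jobs of 100 records, run together" idea follows. It is in Python for illustration only, and it uses a thread pool and an in-memory dict as stand-ins for the real queue workers and the contacts table. The names `import_batch` and `stored`, the worker count of 8, and the generated sample contacts are assumptions.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

stored = {}  # stand-in for the contacts table, keyed by email
lock = threading.Lock()  # protects `stored` across worker threads


def import_batch(batch):
    """Import one small batch and return how many new contacts were stored."""
    inserted = 0
    for contact in batch:
        with lock:
            if contact["email"] not in stored:  # only unique emails are allowed
                stored[contact["email"]] = contact["name"]
                inserted += 1
    return inserted


# Hypothetical input: 10,000 contacts split into 100 batches of 100.
contacts = [
    {"email": f"user{i}@example.com", "name": f"User {i}"} for i in range(10_000)
]
batch_size = 100
batches = [
    contacts[i:i + batch_size] for i in range(0, len(contacts), batch_size)
]

# Run several batches at once; each batch is short enough to finish quickly.
with ThreadPoolExecutor(max_workers=8) as pool:
    results = list(pool.map(import_batch, batches))

total_inserted = sum(results)
```

The point of the small batch size is that each unit of work stays well under the timeout. How many batches can usefully run at once depends on the CPU and on what the database can absorb.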
- Create more queued jobs for the import.
- Don't use too many loops.
- If possible, store the data in memcached, because this will speed up your process. I think you have the same idea. Read about APC as well.
You can read how to store your data in memory here. Hope this helps you a little :)
Is your php program waiting for this task to complete? That won't work. It will time out, as you have noticed.
You need to organize your operation so that your php program starts a job on AWS SQS and then informs your user that the job has started and will be done after a while. Set user expectations low ("done in 15 minutes") and then exceed them (5 minutes), not vice versa.
You will then need a separate operation to query the job status to see if it has been completed. You can do this by organizing a job to update the row in the table when it is done.
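The status-row approach might be sketched as follows. This is Python with an in-memory SQLite database for illustration only; the `import_jobs` table, its columns, and the helper names `start_job`, `record_progress`, and `job_status` are assumptions rather than anything from the original setup.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE import_jobs ("
    " id INTEGER PRIMARY KEY,"
    " total INTEGER NOT NULL,"
    " processed INTEGER NOT NULL DEFAULT 0,"
    " status TEXT NOT NULL DEFAULT 'queued')"
)


def start_job(total):
    """Called by the web request: create the status row, then enqueue the job."""
    cur = conn.execute("INSERT INTO import_jobs (total) VALUES (?)", (total,))
    conn.commit()
    return cur.lastrowid


def record_progress(job_id, count):
    """Called by a background worker after it has imported `count` contacts."""
    # Both SET expressions see the row's values from before this UPDATE.
    conn.execute(
        "UPDATE import_jobs"
        " SET processed = processed + ?,"
        "     status = CASE WHEN processed + ? >= total"
        "                   THEN 'done' ELSE 'running' END"
        " WHERE id = ?",
        (count, count, job_id),
    )
    conn.commit()


def job_status(job_id):
    """Called by the separate status request that the user's page polls."""
    row = conn.execute(
        "SELECT status, processed, total FROM import_jobs WHERE id = ?",
        (job_id,),
    ).fetchone()
    return {"status": row[0], "processed": row[1], "total": row[2]}


job_id = start_job(2500)
first = job_status(job_id)
record_progress(job_id, 1000)
middle = job_status(job_id)
record_progress(job_id, 1000)
record_progress(job_id, 500)
final = job_status(job_id)
```

The web request returns as soon as `start_job` has run, and the page reads `job_status` on a timer, so no single request has to outlive the import.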