Speeding up heavy processing of data retrieved from the database

So I am working on a project in which I have to read a large database (large for me) of 10 million records. I cannot filter them, because I have to consider all of them individually. For each record I have to apply a formula and then write the result to multiple files, depending on conditions specific to that record.

I have implemented several algorithms, and the whole processing takes 2-3 days to complete. That is the problem: I am trying to optimize a process that already takes this long, and 1 day would be acceptable.

So far I've tried indexes on the database and threads (to process while writing, instead of waiting on I/O). I cannot get a shorter time.

I am using Django, and because of its lazy querysets I cannot measure how long it really takes before the data actually starts being processed. I would also like to know whether I can start processing the data as soon as I receive it, rather than waiting for all of it to be loaded into memory first. It could also be my understanding of how write operations work in Python. Finally, it may be that I simply need a better machine (I doubt it; I have 4 cores and 4 GB of RAM, which should give reasonable speed).
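For example, I suspect something like Django's QuerySet.iterator() might be what I need so that rows are streamed instead of cached, but I have not verified that it actually helps in my case. Roughly what I imagine:

# My guess: iterate without populating the queryset cache, so processing
# can start as soon as rows come back from the database.
for each_passport in Passport.objects.all().iterator():
    handle(each_passport)  # handle() is a placeholder for the formula + file writes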

Any ideas? I really appreciate the feedback. :)

Edit: Code

Explanation:

The records I was talking about are customer identifiers (passports), and the conditions are the agreements between the company's different terminals (countries). The processing itself is hashing.

The first strategy tries to process the entire database at once. First there is some preparation for the conditional part of the algorithm (the agreements between countries), then one big membership check against the set.

Since I was trying to improve it on my own, for the second strategy I tried to solve the problem piece by piece, querying the records country by country (fetching the records belonging to a country and writing those covered by its agreements to the corresponding files).

The threaded strategy is not shown, because it was written for one specific country and gave terrible results compared to the non-threaded one. My intuition is that the bottleneck is memory and SQL.

import os

# PROJECT_ROOT, sorted_nicely, get_agreements, generate_hash and the Passport
# model come from the rest of the project.
def create_all_files(strategy=0):
    if strategy == 0:
        # Strategy 0: load every passport and check it against every agreement set.
        set_countries_agreements = set()
        file_countries = open(os.path.join(PROJECT_ROOT, 'list_countries'))
        set_countries_temp = set(line.strip() for line in file_countries)
        file_countries.close()
        set_countries = sorted_nicely(set_countries_temp)

        for each_country in set_countries:
            set_agreements = frozenset(get_agreements(each_country))
            set_countries_agreements.add(set_agreements)

        print("All agreements obtained")

        set_passports = Passport.objects.all()

        print("All passports obtained")

        for each_passport in set_passports:
            for each_agreement in set_countries_agreements:
                for each_country in each_agreement:
                    if each_passport.nationality == each_country:
                        # Append the hash to the file named after the first country in the agreement set
                        with open(os.path.join(PROJECT_ROOT, 'generated_indexes/%s' % next(iter(each_agreement))), "a") as f:
                            f.write(generate_hash(each_passport.nationality + "<" + each_passport.id_passport, each_country) + "\n")
                    print(".")
                print("_")
            print("-")
        print("~")

    if strategy == 1:
        # Strategy 1: process the database country by country.
        file_countries = open(os.path.join(PROJECT_ROOT, 'list_countries'))
        set_countries_temp = set(line.strip() for line in file_countries)
        file_countries.close()
        set_countries = sorted_nicely(set_countries_temp)

        while len(set_countries) != 0:
            country = set_countries.pop()
            list_countries = get_agreements(country)
            list_passports = Passport.objects.filter(nationality=country)
            for each_passport in list_passports:
                for each_country in list_countries:
                    # Open, append and close the index file for every single hash written
                    with open(os.path.join(PROJECT_ROOT, 'generated_indexes/%s' % each_country), "a") as f:
                        f.write(generate_hash(each_passport.nationality + "<" + each_passport.id_passport, each_country) + "\n")
                        print("r")
                print("c")
            print("p")
        print("P")

      

2 answers


What you are describing in your question is an ETL process. I suggest you use an ETL tool.

To mention an ETL tool for Python, I can point to pygrametl, written by Christian Thomsen; in my opinion it works well and its performance is impressive. Check it out and come back with your results.



I cannot post this answer without mentioning MapReduce. This programming model can fit your requirements if you plan to distribute the task across nodes.
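Even without a cluster, the shape of the model applies to your task. A rough single-machine sketch (reusing get_agreements, generate_hash, PROJECT_ROOT and the Passport model from your code; this is only an illustration of the map and reduce phases, not an actual distributed job):

import os
from collections import defaultdict

# Map phase: each passport is mapped to (target_country, hash_line) pairs.
def map_passport(passport):
    for agreement_country in get_agreements(passport.nationality):
        line = generate_hash(passport.nationality + "<" + passport.id_passport,
                             agreement_country) + "\n"
        yield agreement_country, line

# Reduce phase: group the lines by target country and write each group once.
def reduce_to_files(pairs):
    grouped = defaultdict(list)
    for country, line in pairs:
        grouped[country].append(line)
    for country, lines in grouped.items():
        with open(os.path.join(PROJECT_ROOT, 'generated_indexes/%s' % country), "a") as f:
            f.writelines(lines)

The map phase is the part that could be distributed across nodes; the reduce phase stays a simple group-and-write.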



It looks like you have one file per country to which you append hashes. Instead of opening and closing those file handles 10 million+ times, you should open each of them once at the start and close them all at the end.

countries = {}  # country -> open file handle

# Open one handle per country up front instead of once per record.
with open(os.path.join(PROJECT_ROOT, 'list_countries')) as country_file:
    for line in country_file:
        country = line.strip()
        countries[country] = open(os.path.join(PROJECT_ROOT, 'generated_indexes/%s' % country), "a")

for country in countries:
    agreements = get_agreements(country)
    for passport in Passport.objects.filter(nationality=country):
        for agreement in agreements:
            countries[agreement].write(generate_hash(passport.nationality + "<" + passport.id_passport, agreement) + "\n")

# Close every handle once all records have been written.
for country, file in countries.items():
    file.close()

      



I don't know how big the list of objects returned by

Passport.objects.filter(nationality=country)

will be; if it is massive and memory becomes a problem, you will need to start thinking about chunking/paginating the query using limits.
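A minimal sketch of that chunking, using plain queryset slicing (which translates to SQL LIMIT/OFFSET); the chunk size here is just a placeholder to tune:

CHUNK_SIZE = 10000  # hypothetical batch size, tune to available memory

def passports_in_chunks(country):
    start = 0
    while True:
        # Only one chunk of rows is held in memory at a time.
        chunk = list(Passport.objects.filter(nationality=country)[start:start + CHUNK_SIZE])
        if not chunk:
            break
        for passport in chunk:
            yield passport
        start += CHUNK_SIZE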

You are using sets for your list of countries and their agreements. If that is because the file containing the list of countries is not guaranteed to be unique, the dictionary solution might end up opening another handle for the same file. This can be avoided by adding a simple check for whether the country is already a key in countries.
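That check can be as simple as:

for line in country_file:
    country = line.strip()
    if country not in countries:  # skip duplicates already opened
        countries[country] = open(os.path.join(PROJECT_ROOT, 'generated_indexes/%s' % country), "a")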
