Python iteration order on set

I am parsing two large files (size order Gb), each containing keys

and corresponding values

. Some keys

are shared between two files, but with different corresponding ones values

. For each of the files, I want to write to a new file keys*

and the corresponding one values

, and keys*

represents the keys present in both files1 and file2. I don't care about the order key

in the output, but it should be in exactly the same order in the two files.

File 1:

key1
value1-1
key2
value1-2
key3
value1-3

      

File2:

key1
value2-1
key5
value2-5
key2
value2-2

      

A valid result would be:

Partial file 1:

key1
value1-1
key2
value1-2

      

Partial file 2:

key1
value2-1
key2
value2-2

      

Another valid output:

Partial file 1:

key2
value1-2
key1
value1-1

      

Partial file 2:

key2
value2-2
key1
value2-1

      

Invalid output (keys of different order in file 1 and file 2):

Partial file 1:

key2
value1-2
key1
value1-1

      

Partial file 2:

key1
value2-1
key2
value2-2

      

The last precision is that the sizes of the values โ€‹โ€‹are much larger than the sizes of the keys.

What I'm going to do is:

  • For each input file, parse and submit dict

    (call it file_index

    ) with keys corresponding to the keys in the file and values โ€‹โ€‹corresponding to the offset where the key was found in the input file.

  • Calculate Intersection

    good_keys = file1_index.viewkeys() & file2_index.viewkeys()
    
          

  • do something like (pseudocode):

    for each file:
        for good_key in good_keys:
            offset = file_index[good_key]
            go to offset in input_file
            get corresponding value
            write (key, value) to output file
    
          

Does iterating over the same set guarantee me the exact order (assuming it's the same set: I won't change it between two iterations), or do I need to convert the set to a list first, and iterate over the list?

-1


source to share


3 answers


Python dicts and sets are stable, meaning if you iterate over them without changing them, they are guaranteed to give you the same ordering. This is from the dicts documentation :



Keys and values โ€‹โ€‹are repeated in an arbitrary order, which is non-random, varies by Python implementations, and depends on the history of dictionary attachments and deletions. If the keys, values, and representations of the elements are repeated without any intermediate changes in the dictionary, the order of the elements will correspond directly.

+6


source


Iterating over an unmodified set will always give you the same ordering. The order is informed by the current values โ€‹โ€‹and their insertion history.

See Why is the order in dictionaries and collections arbitrary? if you're wondering why this is.



Note that if you want to change your files in place, this will only work if your entries are of a fixed size. Files cannot be updated somewhere in the middle, where this update consists of fewer or more characters than the characters you replaced.

The data in files is like magnetic tape, you have to splicate longer or shorter chunks to replace the data in the middle, but you cannot do that with a file. You will need to rewrite whatever follows the replaced key-value pair to make everything else match.

+3


source


As already pointed out, dicts and sets are stable and maintain the same order as long as you don't change it. If you need a specific order, you can use OrderedDict

In the collection libraries docs:

>>> from collections import OrderedDict

>>> # regular unsorted dictionary
>>> d = {'banana': 3, 'apple':4, 'pear': 1, 'orange': 2}

>>> # dictionary sorted by key -- OrderedDict(sorted(d.items()) also works
>>> OrderedDict(sorted(d.items(), key=lambda t: t[0]))
OrderedDict([('apple', 4), ('banana', 3), ('orange', 2), ('pear', 1)])

>>> # dictionary sorted by value
>>> OrderedDict(sorted(d.items(), key=lambda t: t[1]))
OrderedDict([('pear', 1), ('orange', 2), ('banana', 3), ('apple', 4)])

>>> # dictionary sorted by length of the key string
>>> OrderedDict(sorted(d.items(), key=lambda t: len(t[0])))
OrderedDict([('pear', 1), ('apple', 4), ('orange', 2), ('banana', 3)])

      

0


source







All Articles