Python: optimize finding duplicate-value indexes in a list
I have a list of 18,000 unique IDs. Each ID is a concatenation of the letters A, B, C and D. I wrote some code that groups the IDs by ID[0:-1] and records the index positions of the duplicate IDs. It works, but it takes a very long time: 110 secs for 18,000 IDs. Do you have an idea how to speed up my code?
import time

a = ['1CDABCABDA', '1CDABCABDB', '1CDABCABDD', '1BCABCCCAA', '1DDAABBBBA', '1BCABCCCAD']
startTime = time.time()
b = [i[0:-1] for i in a]
b = list(set(b))
result = range(len(b))
for i in result:
    result[i] = [b[i], []]
    for j in xrange(len(a)):
        if b[i] == a[j][0:-1]:
            result[i][1].append(j)
endTime = time.time()
print endTime - startTime, 'secs !'
Output:
>>> [['1CDABCABD', [0, 1, 2]], ['1DDAABBBB', [4]], ['1BCABCCCA', [3, 5]]]
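The quadratic cost comes from the nested loops: the inner loop rescans the whole list once per unique prefix, so the work grows as O(u * n). As a sketch (in Python 3 syntax, unlike the Python 2 code above), the same grouping can be built in a single pass over the list:

```python
a = ['1CDABCABDA', '1CDABCABDB', '1CDABCABDD',
     '1BCABCCCAA', '1DDAABBBBA', '1BCABCCCAD']

groups = {}
for index, item in enumerate(a):
    prefix = item[:-1]                      # group key: the ID minus its last letter
    groups.setdefault(prefix, []).append(index)

# Same shape as the output above: [[prefix, [indexes]], ...]
result = [[prefix, indexes] for prefix, indexes in groups.items()]
```

Each ID is touched exactly once, so the runtime is O(n) instead of O(u * n).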
For a more pythonic solution to such problems, use collections.defaultdict:
>>> from collections import defaultdict
>>> d = defaultdict(list)
>>> new = [i[:-1] for i in a]
>>> for i, j in enumerate(new):
...     d[j].append(i)
...
>>> d
defaultdict(<type 'list'>, {'1CDABCABD': [0, 1, 2], '1DDAABBBB': [4], '1BCABCCCA': [3, 5]})
>>> d.items()
[('1CDABCABD', [0, 1, 2]), ('1DDAABBBB', [4]), ('1BCABCCCA', [3, 5])]
Note that the defaultdict approach is a linear (O(n)) solution and is more efficient than sorted followed by itertools.groupby, which costs O(n log n).
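For reference, the same defaultdict approach in Python 3 syntax (a sketch; the interpreter session above is Python 2):

```python
from collections import defaultdict

a = ['1CDABCABDA', '1CDABCABDB', '1CDABCABDD',
     '1BCABCCCAA', '1DDAABBBBA', '1BCABCCCAD']

d = defaultdict(list)            # missing keys start out as empty lists
for i, prefix in enumerate(item[:-1] for item in a):
    d[prefix].append(i)          # one dict lookup per ID -> O(n) total
```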
Also, you can just use the method dict.setdefault:
>>> d={}
>>> for i,j in enumerate(new):
...     d.setdefault(j, []).append(i)
...
>>> d
{'1CDABCABD': [0, 1, 2], '1DDAABBBB': [4], '1BCABCCCA': [3, 5]}
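dict.setdefault(key, default) returns the existing value for key, or inserts default and returns it when the key is missing, which is what makes the one-line append work. A small sketch of that behaviour:

```python
d = {}
first = d.setdefault('k', [])    # 'k' is missing: inserts [] and returns it
first.append(1)
second = d.setdefault('k', [])   # 'k' is present: returns the same list
second.append(2)
# d is now {'k': [1, 2]}, and first and second are the same object
```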
For comparison, here is a timeit benchmark showing the dict approach is roughly 4x faster than the sorted/groupby approach:
from timeit import timeit

s1 = """
from itertools import groupby
a = ['1CDABCABDA', '1CDABCABDB', '1CDABCABDD', '1BCABCCCAA', '1DDAABBBBA', '1BCABCCCAD']
key = lambda i: a[i][:-1]
indexes = sorted(range(len(a)), key=key)
result = [[x, list(y)] for x, y in groupby(indexes, key=key)]
"""
s2 = """
a = ['1CDABCABDA', '1CDABCABDB', '1CDABCABDD', '1BCABCCCAA', '1DDAABBBBA', '1BCABCCCAD']
new = [i[:-1] for i in a]
d = {}
for i, j in enumerate(new):
    d.setdefault(j, []).append(i)
d.items()
"""
print ' first: ', timeit(stmt=s1, number=100000)
print 'second : ', timeit(stmt=s2, number=100000)
result:
first: 0.949549913406
second : 0.250894069672
This is how to do it with groupby in Python:
from itertools import groupby
a = ['1CDABCABDA', '1CDABCABDB', '1CDABCABDD', '1BCABCCCAA', '1DDAABBBBA', '1BCABCCCAD']
key = lambda i: a[i][:-1]
indexes = sorted(range(len(a)), key=key)
result = [[x, list(y)] for x, y in groupby(indexes, key=key)]
Output:
[['1BCABCCCA', [3, 5]], ['1CDABCABD', [0, 1, 2]], ['1DDAABBBB', [4]]]
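One caveat worth keeping in mind: groupby only merges adjacent items with equal keys, which is why the indexes are sorted first. A toy sketch (the two-letter IDs here are made-up illustrations):

```python
from itertools import groupby

data = ['AA', 'BA', 'AB']        # hypothetical toy IDs
key = lambda s: s[:-1]

# Without sorting, the two 'A*' items are not adjacent,
# so 'A' shows up as two separate groups.
unsorted_keys = [k for k, _ in groupby(data, key=key)]

# Sorting by the same key makes equal keys adjacent first.
sorted_keys = [k for k, _ in groupby(sorted(data, key=key), key=key)]
```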
You are looking for:
>>> d = {}
>>> for ind, elem in enumerate(a):
...     d.setdefault(elem[0:-1], []).append(ind)
>>> print d
{'1CDABCABD': [0, 1, 2], '1DDAABBBB': [4], '1BCABCCCA': [3, 5]}
This solution is very similar to Kasra's optimized code, but slightly faster. The difference is where the slicing is done, though I am not sure why that makes it any faster:
from timeit import timeit

s1 = """
a = ['1CDABCABDA', '1CDABCABDB', '1CDABCABDD', '1BCABCCCAA',
     '1DDAABBBBA', '1BCABCCCAD']
d = {}
for ind, elem in enumerate(a):
    d.setdefault(elem[0:-1], []).append(ind)
"""
s2 = """
a = ['1CDABCABDA', '1CDABCABDB', '1CDABCABDD', '1BCABCCCAA', '1DDAABBBBA', '1BCABCCCAD']
new = [i[:-1] for i in a]
d = {}
for i, j in enumerate(new):
    d.setdefault(j, []).append(i)
"""
print 'Kasra time/my time: %s' % str(timeit(stmt=s2, number=100000) / timeit(stmt=s1, number=100000))
Kasra time/my time: 1.24058060531