Python: optimizing finding the index and value of duplicates in a list

I have a list with 18,000 unique IDs. Each ID is a concatenation of the letters A, B, C and D. I wrote some code that groups the IDs by ID[0:-1] and returns the index positions of the duplicate IDs.

This works well, but it takes a very long time: 110 secs for 18,000 IDs. Do you have an idea how to speed up my code?

a = ['1CDABCABDA', '1CDABCABDB', '1CDABCABDD', '1BCABCCCAA', '1DDAABBBBA', '1BCABCCCAD']

import time

startTime = time.time()
b = [i[0:-1] for i in a]
b = list(set(b))


result = range(len(b))
it = 0
for i in result:
    result[i] = [b[i], []]
    for j in xrange(len(a)):
        if b[i] == a[j][0:-1]:
            result[i][1].append(j)

endTime =  time.time()

print endTime - startTime, 'secs !'


Output:

>>> [['1CDABCABD', [0, 1, 2]], ['1DDAABBBB', [4]], ['1BCABCCCA', [3, 5]]]
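For a sense of scale, the nested loop above compares every unique prefix against every full ID, so the work grows quadratically with the list size. A rough back-of-the-envelope count (assuming, hypothetically, that nearly all 18,000 prefixes are distinct):

```python
n_ids = 18000
n_prefixes = 18000               # worst case: almost every prefix is distinct
comparisons = n_prefixes * n_ids # one slice-and-compare per (prefix, ID) pair
print(comparisons)               # hundreds of millions of operations
```

Each of those comparisons also slices a fresh string, which is why the runtime climbs to minutes.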

4 answers


For a more Pythonic way to solve such problems, use collections.defaultdict:

>>> from collections import defaultdict
>>> new = [i[:-1] for i in a]
>>> d = defaultdict(list)
>>> for i, j in enumerate(new):
...     d[j].append(i)
... 
>>> d
defaultdict(<type 'list'>, {'1CDABCABD': [0, 1, 2], '1DDAABBBB': [4], '1BCABCCCA': [3, 5]})
>>> d.items()
[('1CDABCABD', [0, 1, 2]), ('1DDAABBBB', [4]), ('1BCABCCCA', [3, 5])]


Note that the defaultdict approach is a linear-time solution, and is more efficient than the combination of sorted and itertools.groupby.

Also you can just use the method dict.setdefault:

>>> d={}
>>> for i,j in enumerate(new):
...   d.setdefault(j,[]).append(i)
... 
>>> d
{'1CDABCABD': [0, 1, 2], '1DDAABBBB': [4], '1BCABCCCA': [3, 5]}
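A side note for readers on Python 3: print becomes a function, items() returns a view, and (since 3.7) dicts preserve insertion order, so the same setdefault pattern yields the groups in first-seen order:

```python
a = ['1CDABCABDA', '1CDABCABDB', '1CDABCABDD', '1BCABCCCAA', '1DDAABBBBA', '1BCABCCCAD']

d = {}
for i, s in enumerate(a):
    d.setdefault(s[:-1], []).append(i)   # group each index by its prefix

pairs = list(d.items())                  # items() is a view in Python 3; materialize it
print(pairs)
```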

For comparison, here is a benchmark showing that the dict-based approach is roughly 4x faster than the groupby solution:

from timeit import timeit

s1="""
from itertools import groupby
a = ['1CDABCABDA', '1CDABCABDB', '1CDABCABDD', '1BCABCCCAA', '1DDAABBBBA', '1BCABCCCAD']
key = lambda i: a[i][:-1]
indexes = sorted(range(len(a)), key=key)
result = [[x, list(y)] for x, y in groupby(indexes, key=key)]
"""
s2="""
a = ['1CDABCABDA', '1CDABCABDB', '1CDABCABDD', '1BCABCCCAA', '1DDAABBBBA', '1BCABCCCAD']
new=[i[:-1] for i in a]
d={}
for i,j in enumerate(new):
   d.setdefault(j,[]).append(i)
d.items()
    """


print ' first: ' ,timeit(stmt=s1, number=100000)
print 'second : ',timeit(stmt=s2, number=100000)


result:

 first:  0.949549913406
second :  0.250894069672
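The benchmark above is Python 2; a Python 3 rewrite is sketched below (exact timings depend on the machine, so only the relative ordering is meaningful):

```python
from timeit import timeit

setup = """
a = ['1CDABCABDA', '1CDABCABDB', '1CDABCABDD', '1BCABCCCAA',
     '1DDAABBBBA', '1BCABCCCAD']
"""

s_groupby = """
from itertools import groupby
key = lambda i: a[i][:-1]
indexes = sorted(range(len(a)), key=key)
result = [[x, list(y)] for x, y in groupby(indexes, key=key)]
"""

s_dict = """
d = {}
for i, j in enumerate(a):
    d.setdefault(j[:-1], []).append(i)
"""

t_groupby = timeit(stmt=s_groupby, setup=setup, number=10000)
t_dict = timeit(stmt=s_dict, setup=setup, number=10000)
print(t_groupby, t_dict)
```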


Here is how to do this effectively with groupby in Python:

from itertools import groupby
a = ['1CDABCABDA', '1CDABCABDB', '1CDABCABDD', '1BCABCCCAA', '1DDAABBBBA', '1BCABCCCAD']
key = lambda i: a[i][:-1]
indexes = sorted(range(len(a)), key=key)
result = [[x, list(y)] for x, y in groupby(indexes, key=key)]

Output:

[['1BCABCCCA', [3, 5]], ['1CDABCABD', [0, 1, 2]], ['1DDAABBBB', [4]]]
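Worth noting: groupby only merges runs of *adjacent* items with equal keys, which is exactly why this answer sorts the indexes first. A small illustration with a made-up three-element sample:

```python
from itertools import groupby

vals = ['ABX', 'CDX', 'ABY']     # hypothetical data; the key is the prefix
key = lambda s: s[:-1]

# Without sorting, the two 'AB' items are not adjacent and split into two groups.
unsorted_keys = [k for k, _ in groupby(vals, key=key)]
# After sorting by the same key, equal prefixes become adjacent and merge.
sorted_keys = [k for k, _ in groupby(sorted(vals, key=key), key=key)]
print(unsorted_keys, sorted_keys)
```

The required sort is also what makes this approach O(n log n), versus the linear dict-based solutions.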


Alternative solution that doesn't use other modules:

grouped = {}
for i, j in enumerate(a):    
    itm = grouped.get(j[0:-1], [])
    itm.append(i)    
    grouped[j[0:-1]] = itm

print [[k, v] for k, v in grouped.items()] # [['1CDABCABD', [0, 1, 2]], ['1DDAABBBB', [4]], ['1BCABCCCA', [3, 5]]]
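One small tweak on the loop above: it slices j[0:-1] twice per iteration. Computing the prefix once and using setdefault collapses the body to one line; a sketch with the same output:

```python
a = ['1CDABCABDA', '1CDABCABDB', '1CDABCABDD', '1BCABCCCAA', '1DDAABBBBA', '1BCABCCCAD']

grouped = {}
for i, s in enumerate(a):
    prefix = s[:-1]                          # slice once per iteration
    grouped.setdefault(prefix, []).append(i)
print(grouped)
```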


You are looking for:

>>> d = {}
>>> for ind, elem in enumerate(a):
...     d.setdefault(elem[0:-1], []).append(ind)
... 
>>> print d
{'1CDABCABD': [0, 1, 2], '1DDAABBBB': [4], '1BCABCCCA': [3, 5]}


This solution is very similar to Kasra's optimized code, but slightly faster. The difference is where the slicing is done: here the prefix is sliced inside the loop rather than building an intermediate list first, though I'm not sure why that comes out faster:

from timeit import timeit

s1 = """
a = ['1CDABCABDA', '1CDABCABDB', '1CDABCABDD', '1BCABCCCAA',
      '1DDAABBBBA', '1BCABCCCAD']
d = {}
for ind, elem in enumerate(a):
    d.setdefault(elem[0:-1], []).append(ind)
"""

s2="""
a = ['1CDABCABDA', '1CDABCABDB', '1CDABCABDD', '1BCABCCCAA', '1DDAABBBBA', '1BCABCCCAD']
new=[i[:-1] for i in a]
d={}
for i,j in enumerate(new):
   d.setdefault(j,[]).append(i)
"""

print 'Kasra time/my time: %s' % str(timeit(stmt=s2, number=100000)/timeit(stmt=s1, number=100000))

Kasra time/my time: 1.24058060531 
