Python: optimize finding duplicate-value indexes in a list
I have a list of 18,000 unique IDs. Each ID is a concatenation of the letters A, B, C and D. I wrote some code that groups the IDs by ID[0:-1] and records the index positions of the duplicate IDs. It works, but it takes a very long time: 110 secs for 18,000 IDs. Do you have an idea how to speed up my code?
import time

a = ['1CDABCABDA', '1CDABCABDB', '1CDABCABDD', '1BCABCCCAA', '1DDAABBBBA', '1BCABCCCAD']
startTime = time.time()
b = [i[0:-1] for i in a]
b = list(set(b))
result = range(len(b))
for i in result:
    result[i] = [b[i], []]
    for j in xrange(len(a)):
        if b[i] == a[j][0:-1]:
            result[i][1].append(j)
endTime = time.time()
print endTime - startTime, 'secs !'
Output:
>>> [['1CDABCABD', [0, 1, 2]], ['1DDAABBBB', [4]], ['1BCABCCCA', [3, 5]]]
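The quadratic cost comes from the nested loops: the inner loop rescans the whole list once per unique prefix, so the work grows as O(u * n). As a sketch (in Python 3 syntax, unlike the Python 2 code above), the same grouping can be built in a single pass over the list:

```python
a = ['1CDABCABDA', '1CDABCABDB', '1CDABCABDD',
     '1BCABCCCAA', '1DDAABBBBA', '1BCABCCCAD']

groups = {}
for index, item in enumerate(a):
    prefix = item[:-1]                      # group key: the ID minus its last letter
    groups.setdefault(prefix, []).append(index)

# Same shape as the output above: [[prefix, [indexes]], ...]
result = [[prefix, indexes] for prefix, indexes in groups.items()]
```

Each ID is touched exactly once, so the runtime is O(n) instead of O(u * n).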
For a more pythonic solution to such problems, use collections.defaultdict:
>>> from collections import defaultdict
>>> d = defaultdict(list)
>>> new = [i[:-1] for i in a]
>>> for i, j in enumerate(new):
...     d[j].append(i)
...
>>> d
defaultdict(<type 'list'>, {'1CDABCABD': [0, 1, 2], '1DDAABBBB': [4], '1BCABCCCA': [3, 5]})
>>> d.items()
[('1CDABCABD', [0, 1, 2]), ('1DDAABBBB', [4]), ('1BCABCCCA', [3, 5])]
Note that the defaultdict approach is a linear (O(n)) solution and is more efficient than sorted followed by itertools.groupby, which costs O(n log n).
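For reference, the same defaultdict approach in Python 3 syntax (a sketch; the interpreter session above is Python 2):

```python
from collections import defaultdict

a = ['1CDABCABDA', '1CDABCABDB', '1CDABCABDD',
     '1BCABCCCAA', '1DDAABBBBA', '1BCABCCCAD']

d = defaultdict(list)            # missing keys start out as empty lists
for i, prefix in enumerate(item[:-1] for item in a):
    d[prefix].append(i)          # one dict lookup per ID -> O(n) total
```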
Also, you can just use the method dict.setdefault:
>>> d={}
>>> for i,j in enumerate(new):
...     d.setdefault(j, []).append(i)
...
>>> d
{'1CDABCABD': [0, 1, 2], '1DDAABBBB': [4], '1BCABCCCA': [3, 5]}
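dict.setdefault(key, default) returns the existing value for key, or inserts default and returns it when the key is missing, which is what makes the one-line append work. A small sketch of that behaviour:

```python
d = {}
first = d.setdefault('k', [])    # 'k' is missing: inserts [] and returns it
first.append(1)
second = d.setdefault('k', [])   # 'k' is present: returns the same list
second.append(2)
# d is now {'k': [1, 2]}, and first and second are the same object
```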
For comparison, here is a timeit benchmark showing the dict approach is roughly 4x faster than the sorted/groupby approach:
from timeit import timeit

s1 = """
from itertools import groupby
a = ['1CDABCABDA', '1CDABCABDB', '1CDABCABDD', '1BCABCCCAA', '1DDAABBBBA', '1BCABCCCAD']
key = lambda i: a[i][:-1]
indexes = sorted(range(len(a)), key=key)
result = [[x, list(y)] for x, y in groupby(indexes, key=key)]
"""
s2 = """
a = ['1CDABCABDA', '1CDABCABDB', '1CDABCABDD', '1BCABCCCAA', '1DDAABBBBA', '1BCABCCCAD']
new = [i[:-1] for i in a]
d = {}
for i, j in enumerate(new):
    d.setdefault(j, []).append(i)
d.items()
"""
print ' first: ', timeit(stmt=s1, number=100000)
print 'second : ', timeit(stmt=s2, number=100000)
result:
first: 0.949549913406
second : 0.250894069672
This is how to do it with groupby in Python:
from itertools import groupby
a = ['1CDABCABDA', '1CDABCABDB', '1CDABCABDD', '1BCABCCCAA', '1DDAABBBBA', '1BCABCCCAD']
key = lambda i: a[i][:-1]
indexes = sorted(range(len(a)), key=key)
result = [[x, list(y)] for x, y in groupby(indexes, key=key)]
Output:
[['1BCABCCCA', [3, 5]], ['1CDABCABD', [0, 1, 2]], ['1DDAABBBB', [4]]]
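One caveat worth keeping in mind: groupby only merges adjacent items with equal keys, which is why the indexes are sorted first. A toy sketch (the two-letter IDs here are made-up illustrations):

```python
from itertools import groupby

data = ['AA', 'BA', 'AB']        # hypothetical toy IDs
key = lambda s: s[:-1]

# Without sorting, the two 'A*' items are not adjacent,
# so 'A' shows up as two separate groups.
unsorted_keys = [k for k, _ in groupby(data, key=key)]

# Sorting by the same key makes equal keys adjacent first.
sorted_keys = [k for k, _ in groupby(sorted(data, key=key), key=key)]
```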
You are looking for:
>>> d = {}
>>> for ind, elem in enumerate(a):
...     d.setdefault(elem[0:-1], []).append(ind)
>>> print d
{'1CDABCABD': [0, 1, 2], '1DDAABBBB': [4], '1BCABCCCA': [3, 5]}
This solution is very similar to Kasra's optimized code, but slightly faster. The difference is where the slicing is done, though I am not sure why that makes it any faster:
from timeit import timeit

s1 = """
a = ['1CDABCABDA', '1CDABCABDB', '1CDABCABDD', '1BCABCCCAA',
     '1DDAABBBBA', '1BCABCCCAD']
d = {}
for ind, elem in enumerate(a):
    d.setdefault(elem[0:-1], []).append(ind)
"""
s2 = """
a = ['1CDABCABDA', '1CDABCABDB', '1CDABCABDD', '1BCABCCCAA', '1DDAABBBBA', '1BCABCCCAD']
new = [i[:-1] for i in a]
d = {}
for i, j in enumerate(new):
    d.setdefault(j, []).append(i)
"""
print 'Kasra time/my time: %s' % str(timeit(stmt=s2, number=100000) / timeit(stmt=s1, number=100000))
Kasra time/my time: 1.24058060531