Best way to filter a list of dictionaries in Python
I have a list of dictionaries that have a structure like this:
log = [{'user_id': 'id1', 'action': 'action1', 'timestamp': 'time1'},
{'user_id': 'id2', 'action': 'action2', 'timestamp': 'time2'},
...]
and sorted by timestamp value.
I would like to remove consecutive identical actions performed by the same user, leaving only the first one. For example, if I have the following list:
log = [{'user_id': 'id1', 'action': 'action1', 'timestamp': 'time1'},
{'user_id': 'id1', 'action': 'action1', 'timestamp': 'time2'},
{'user_id': 'id1', 'action': 'action1', 'timestamp': 'time3'},
{'user_id': 'id2', 'action': 'action2', 'timestamp': 'time4'},
{'user_id': 'id3', 'action': 'action2', 'timestamp': 'time5'},
{'user_id': 'id3', 'action': 'action2', 'timestamp': 'time6'},
{'user_id': 'id1', 'action': 'action1', 'timestamp': 'time7'},
{'user_id': 'id1', 'action': 'action1', 'timestamp': 'time8'}]
I would like to get this list as a result:
log = [{'user_id': 'id1', 'action': 'action1', 'timestamp': 'time1'},
{'user_id': 'id2', 'action': 'action2', 'timestamp': 'time4'},
{'user_id': 'id3', 'action': 'action2', 'timestamp': 'time5'},
{'user_id': 'id1', 'action': 'action1', 'timestamp': 'time7'}]
I am currently doing it like this:
def merge_actions(log):
    merged_log = []
    merged_log.append(log[0])
    for i in range(1, len(log)):
        if log[i]['user_id'] == log[i-1]['user_id']:
            if log[i]['action'] == log[i-1]['action']:
                continue
        merged_log.append(log[i])
    return merged_log
Is there a better way to do this?
If you use itertools.groupby and group by 'user_id' and 'action', you can grab the first item from each group.
>>> [next(group) for key, group in itertools.groupby(log, key = lambda i: (i['user_id'], i['action']))]
[{'timestamp': 'time1', 'action': 'action1', 'user_id': 'id1'},
{'timestamp': 'time4', 'action': 'action2', 'user_id': 'id2'},
{'timestamp': 'time5', 'action': 'action2', 'user_id': 'id3'},
{'timestamp': 'time7', 'action': 'action1', 'user_id': 'id1'}]
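For reference, here is a self-contained sketch of the same idea. It assumes `operator.itemgetter` as the key function in place of the lambda, and the function name `first_of_each_run` is made up for illustration. Since `groupby` only groups adjacent items, the later run by `id1` stays a separate entry.

```python
from itertools import groupby
from operator import itemgetter

def first_of_each_run(log):
    """Keep the first entry of every run of consecutive entries
    that share the same (user_id, action) pair."""
    # itemgetter with two names returns a (user_id, action) tuple
    key = itemgetter('user_id', 'action')
    return [next(group) for _, group in groupby(log, key=key)]

log = [
    {'user_id': 'id1', 'action': 'action1', 'timestamp': 'time1'},
    {'user_id': 'id1', 'action': 'action1', 'timestamp': 'time2'},
    {'user_id': 'id1', 'action': 'action1', 'timestamp': 'time3'},
    {'user_id': 'id2', 'action': 'action2', 'timestamp': 'time4'},
    {'user_id': 'id3', 'action': 'action2', 'timestamp': 'time5'},
    {'user_id': 'id3', 'action': 'action2', 'timestamp': 'time6'},
    {'user_id': 'id1', 'action': 'action1', 'timestamp': 'time7'},
    {'user_id': 'id1', 'action': 'action1', 'timestamp': 'time8'},
]

result = first_of_each_run(log)
print([d['timestamp'] for d in result])  # ['time1', 'time4', 'time5', 'time7']
```

Unlike the loop in the question, this also returns an empty list for an empty log instead of raising an IndexError.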
Use itertools.groupby to group consecutive actions of the same user and then take the first element of each group:
import itertools

def merge_actions(log):
    return [next(group) for key, group in itertools.groupby(log, lambda l: (l['user_id'], l['action']))]
If you need to use a loop, you just need to keep track of the last key you saw:
it = iter(log)
start = next(it)
od, prev = [start], start["user_id"]
for d in it:
    k = d["user_id"]
    if prev != k:
        od.append(d)
        prev = k
print(od)
[{'action': 'action1', 'timestamp': 'time1', 'user_id': 'id1'},
{'action': 'action2', 'timestamp': 'time4', 'user_id': 'id2'},
{'action': 'action2', 'timestamp': 'time5', 'user_id': 'id3'},
{'action': 'action1', 'timestamp': 'time7', 'user_id': 'id1'}]
If the same user can perform different actions in a row, check both keys:
it = iter(log)
start = next(it)
od, prev, act = [start], start["user_id"], start["action"]
for d in it:
    k1, k2 = d["user_id"], d["action"]
    if prev != k1 or k2 != act:
        od.append(d)
        prev, act = k1, k2
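The two-key loop can also be wrapped in a function. This is only a sketch under a couple of assumptions: the name `drop_consecutive_repeats` is hypothetical, and a sentinel object stands in for "no previous key", so the first `next(it)` call (which raises StopIteration on an empty log) is not needed.

```python
def drop_consecutive_repeats(log):
    """Return the entries of log, skipping any entry whose
    (user_id, action) pair equals that of the entry just before it."""
    result = []
    prev_key = object()  # sentinel: never equal to a real (user_id, action) tuple
    for entry in log:
        key = (entry['user_id'], entry['action'])
        if key != prev_key:
            result.append(entry)
            prev_key = key
    return result

sample = [
    {'user_id': 'id1', 'action': 'action1', 'timestamp': 'time1'},
    {'user_id': 'id1', 'action': 'action1', 'timestamp': 'time2'},
    {'user_id': 'id1', 'action': 'action2', 'timestamp': 'time3'},
    {'user_id': 'id2', 'action': 'action2', 'timestamp': 'time4'},
]

print([d['timestamp'] for d in drop_consecutive_repeats(sample)])
# ['time1', 'time3', 'time4']
```

The `time3` entry survives because the action changed even though the user did not.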
Here's a tricky attempt at using groupby:
from itertools import groupby
a = [{'user_id': 'id1', 'action': 'action1', 'timestamp': 'time1'},
{'user_id': 'id1', 'action': 'action1', 'timestamp': 'time2'},
{'user_id': 'id1', 'action': 'action1', 'timestamp': 'time3'},
{'user_id': 'id2', 'action': 'action2', 'timestamp': 'time4'},
{'user_id': 'id3', 'action': 'action2', 'timestamp': 'time5'},
{'user_id': 'id3', 'action': 'action2', 'timestamp': 'time6'},
{'user_id': 'id1', 'action': 'action1', 'timestamp': 'time7'},
{'user_id': 'id1', 'action': 'action1', 'timestamp': 'time8'}]
for u, grps in groupby(a, lambda d: d['user_id']):
    d_with_first_ts = sorted(grps, key = lambda user_dict: user_dict['timestamp'])[0]
    print('User: {}; Dict with first timestamp = {}'.format(u, d_with_first_ts))
You will get the following results:
User: id1; Dict with first timestamp = {'timestamp': 'time1', 'action': 'action1', 'user_id': 'id1'}
User: id2; Dict with first timestamp = {'timestamp': 'time4', 'action': 'action2', 'user_id': 'id2'}
User: id3; Dict with first timestamp = {'timestamp': 'time5', 'action': 'action2', 'user_id': 'id3'}
User: id1; Dict with first timestamp = {'timestamp': 'time7', 'action': 'action1', 'user_id': 'id1'}
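Note that this version groups by 'user_id' only, so it matches the expected output for the sample data but would merge different actions by the same user. A small sketch with made-up sample data (the names `sample`, `by_user` and `by_user_and_action` are for illustration only) shows the difference between the two grouping keys:

```python
from itertools import groupby

# Hypothetical data: id1 performs two different actions in a row.
sample = [
    {'user_id': 'id1', 'action': 'action1', 'timestamp': 'time1'},
    {'user_id': 'id1', 'action': 'action2', 'timestamp': 'time2'},
    {'user_id': 'id2', 'action': 'action2', 'timestamp': 'time3'},
]

# Grouping on user_id alone collapses both of id1's entries into one group.
by_user = [next(g)['timestamp']
           for _, g in groupby(sample, key=lambda d: d['user_id'])]

# Grouping on (user_id, action) keeps id1's second, different action.
by_user_and_action = [next(g)['timestamp']
                      for _, g in groupby(sample, key=lambda d: (d['user_id'], d['action']))]

print(by_user)             # ['time1', 'time3']
print(by_user_and_action)  # ['time1', 'time2', 'time3']
```

Because the log is already sorted by timestamp, taking `next()` of each group gives the earliest entry without sorting the group again.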