Python: how to group a dataset by a repeated value in one column
Let's say I have a numpy array like this:
import numpy as np
x = np.array(
    [[100, 14, 12, 15],
     [100, 21, 16, 11],
     [100, 19, 10, 13],
     [160, 24, 15, 12],
     [160, 43, 12, 65],
     [160, 17, 53, 23],
     [300, 15, 17, 11],
     [300, 66, 23, 12],
     [300, 44, 70, 19]])
The original array is much larger, so my question is: is there a way to bin or group the rows based on the value in the first column? e.g.:
{'100': [[14, 12, 15], [21, 16, 11], [19, 10, 13]],
 '160': [[24, 15, 12], [43, 12, 65], [17, 53, 23]],
 '300': [[15, 17, 11], [66, 23, 12], [44, 70, 19]]}
We are talking about a large dataset, so performance matters, and ideally the grouped data should stay as NumPy arrays. This answer lists two NumPy approaches.
Approach # 1
Here's one approach using np.unique
to get the start index of each group of rows, and then a dictionary comprehension to build the output dictionary -
unq, idx = np.unique(x[:,0], return_index=1)
idx1 = np.r_[idx,x.shape[0]]
dict_out = {unq[i]:x[idx1[i]:idx1[i+1],1:] for i in range(len(unq))}
This assumes that the first column is sorted, as suggested by the question title - ...repeated value in one column
. If it is not, we need to use x[:,0].argsort()
to sort the rows of x
and then proceed.
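To make that sorting step concrete, here is a minimal sketch with a small hypothetical unsorted array (the data is made up for illustration):

```python
import numpy as np

# Hypothetical unsorted input: sort rows by the first column before grouping
x = np.array([[300, 1, 2, 3],
              [100, 4, 5, 6],
              [300, 7, 8, 9]])
x = x[x[:, 0].argsort(kind='stable')]  # stable sort preserves within-group row order
print(x[:, 0].tolist())  # [100, 300, 300]
```

A stable sort matters here: it keeps rows with equal keys in their original relative order, so groups come out in the same row order as the input.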
Example input, output -
In [41]: x
Out[41]:
array([[100, 14, 12, 15],
       [100, 21, 16, 11],
       [100, 19, 10, 13],
       [160, 24, 15, 12],
       [160, 43, 12, 65],
       [160, 17, 53, 23],
       [300, 15, 17, 11],
       [300, 66, 23, 12],
       [300, 44, 70, 19]])
In [42]: dict_out
Out[42]:
{100: array([[14, 12, 15],
        [21, 16, 11],
        [19, 10, 13]]),
 160: array([[24, 15, 12],
        [43, 12, 65],
        [17, 53, 23]]),
 300: array([[15, 17, 11],
        [66, 23, 12],
        [44, 70, 19]])}
Approach # 2
Here's another approach that avoids np.unique
altogether to further improve performance -
idx1 = np.concatenate(([0],np.flatnonzero(x[1:,0] != x[:-1,0])+1, [x.shape[0]]))
dict_out = {x[i,0]:x[i:j,1:] for i,j in zip(idx1[:-1], idx1[1:])}
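As a quick sanity check of the boundary-detection trick: comparing each key to its predecessor flags the rows where a new group starts. A small runnable sketch on a subset of the sample data:

```python
import numpy as np

x = np.array([[100, 14, 12, 15],
              [100, 21, 16, 11],
              [160, 24, 15, 12],
              [300, 15, 17, 11]])

# Indices where the first column changes, plus both ends of the array
idx1 = np.concatenate(([0], np.flatnonzero(x[1:, 0] != x[:-1, 0]) + 1, [x.shape[0]]))
print(idx1)  # [0 2 3 4] - group boundaries

# Adjacent boundary pairs delimit each group's row slice
dict_out = {x[i, 0]: x[i:j, 1:] for i, j in zip(idx1[:-1], idx1[1:])}
print(sorted(dict_out))  # [100, 160, 300]
```

Each consecutive pair in idx1 is a half-open row range for one group, which is why the comprehension zips idx1 against itself shifted by one.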
Runtime test
Approaches -
# @COLDSPEED soln
from collections import defaultdict
def defaultdict_app(x):
    data = defaultdict(list)
    for l in x:
        data[l[0]].append(l[1:])
    return data
# @David Z soln-1
import pandas as pd
def pandas_groupby_app(x):
    df = pd.DataFrame(x)
    return {key: group.iloc[:,1:] for key, group in df.groupby(0)}
# @David Z soln-2
import itertools as it
def groupby_app(x):
    return {key: list(map(list, group)) for key, group in
            it.groupby(x, lambda row: row[0])}
# Proposed in this post
def numpy_app1(x):
    unq, idx = np.unique(x[:,0], return_index=1)
    idx1 = np.r_[idx, x.shape[0]]
    return {unq[i]: x[idx1[i]:idx1[i+1], 1:] for i in range(len(unq))}
# Proposed in this post
def numpy_app2(x):
    idx1 = np.concatenate(([0], np.flatnonzero(x[1:,0] != x[:-1,0])+1, [x.shape[0]]))
    return {x[i,0]: x[i:j,1:] for i,j in zip(idx1[:-1], idx1[1:])}
Timing -
In [84]: x = np.random.randint(0,100,(10000,4))
In [85]: x[:,0].sort()
In [86]: %timeit defaultdict_app(x)
...: %timeit pandas_groupby_app(x)
...: %timeit groupby_app(x)
...: %timeit numpy_app1(x)
...: %timeit numpy_app2(x)
...:
100 loops, best of 3: 4.43 ms per loop
100 loops, best of 3: 15 ms per loop
100 loops, best of 3: 12.1 ms per loop
1000 loops, best of 3: 310 µs per loop
10000 loops, best of 3: 75.6 µs per loop
Since you tagged this as pandas, you can do this using the DataFrame
groupby()
functionality. First, create a DataFrame
from the original array
import pandas as pd
df = pd.DataFrame(x)
and group on the first column; then you can iterate over the resulting GroupBy
object to get the sub-frames whose rows share the same value in the first column.
{key: group for key, group in df.groupby(0)}
Of course, each group
here still contains the key column. You can remove it using indexing:
{key: group.iloc[:,1:] for key, group in df.groupby(0)}
and if you want to convert the sub-frames back to NumPy arrays, use group.iloc[:,1:].values
. (If you want them to be lists of lists, as your question shows, it's not hard to write a function to convert, but it would probably be more efficient to keep the data in Pandas, or at least NumPy, if you can.)
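Putting those pieces together, a minimal end-to-end sketch on a subset of the sample data:

```python
import numpy as np
import pandas as pd

x = np.array([[100, 14, 12, 15],
              [100, 21, 16, 11],
              [160, 24, 15, 12]])
df = pd.DataFrame(x)
# Drop the key column and convert each sub-frame back to a NumPy array
out = {key: group.iloc[:, 1:].values for key, group in df.groupby(0)}
print(out[100])
```

Note that unlike itertools.groupby below, pandas groupby() does not require the key column to be pre-sorted.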
An alternative, which avoids Pandas entirely (if you have a reason for that), is the plain old iterative approach using groupby()
from itertools
import itertools as it
{key: list(map(list, group))
for key, group in it.groupby(x, lambda row: row[0])}
This again includes the key in the resulting rows, but you can truncate it using indexing
{key: list(map(lambda a: list(a)[1:], group))
for key, group in it.groupby(x, lambda row: row[0])}
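One caveat: itertools.groupby only merges runs of consecutive equal keys, so if the first column is not already sorted, sort it first. A minimal sketch with a hypothetical unsorted array:

```python
import itertools as it
import numpy as np

x = np.array([[160, 24, 15, 12],
              [100, 14, 12, 15],
              [100, 21, 16, 11]])
# groupby only groups *consecutive* runs, so sort by the key column first
x = x[x[:, 0].argsort(kind='stable')]
out = {key: [list(row[1:]) for row in group]
       for key, group in it.groupby(x, lambda row: row[0])}
print(out)
```

Without the sort, the two 100-keyed rows would land in a group that overwrites nothing but splits the key across separate groupby runs, and the dict comprehension would silently keep only the last run per key.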
You can make the code cleaner by using a function groupby_transform()
from the more-itertools module (which is not included in the Python standard library):
import more_itertools as mt
{key: list(group) for key, group in mt.groupby_transform(
x, lambda row: row[0], lambda row: list(row[1:])
)}
Disclosure: I implemented the groupby_transform()
function in more-itertools
You can group your data using collections.defaultdict
and a loop.
from collections import defaultdict
data = defaultdict(list)
for l in x:
    data[l[0]].append(list(l[1:]))  # list() so the values print as plain lists
print(dict(data))
Output:
{100: [[14, 12, 15], [21, 16, 11], [19, 10, 13]],
160: [[24, 15, 12], [43, 12, 65], [17, 53, 23]],
300: [[15, 17, 11], [66, 23, 12], [44, 70, 19]]}
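If you later need NumPy arrays per group rather than lists, the accumulated lists can be stacked back; a small sketch building on the same defaultdict pattern:

```python
import numpy as np
from collections import defaultdict

x = np.array([[100, 14, 12, 15],
              [100, 21, 16, 11],
              [160, 24, 15, 12]])
data = defaultdict(list)
for l in x:
    data[l[0]].append(list(l[1:]))
# Stack each group's accumulated rows back into a 2-D array
arrays = {k: np.array(v) for k, v in data.items()}
print(arrays[100].shape)  # (2, 3)
```

This works whether or not the key column is sorted, since defaultdict accumulates by key rather than by position.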
I think you want something like this:
ls_dict = {}
for ls in x:
    key = ls[0]
    value = list(ls[1:])
    if key in ls_dict:
        ls_dict[key].append(value)
    else:
        ls_dict[key] = [value]
print(ls_dict)
Output
{100: [[14, 12, 15], [21, 16, 11], [19, 10, 13]], 160: [[24, 15, 12], [43, 12, 65], [17, 53, 23]], 300: [[15, 17, 11], [66, 23, 12], [44, 70, 19]]}
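The if/else bookkeeping above can be folded into one line with dict.setdefault, a stdlib-only variant of the same loop:

```python
import numpy as np

x = np.array([[100, 14, 12, 15],
              [100, 21, 16, 11],
              [160, 24, 15, 12]])
ls_dict = {}
for ls in x:
    # setdefault creates the empty list the first time a key is seen
    ls_dict.setdefault(ls[0], []).append(list(ls[1:]))
print(ls_dict)
```

setdefault returns the existing list for a known key and inserts (and returns) the empty list for a new one, so both branches collapse into a single append.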