Python: how to group a dataset by a repeated value in one column
Let's say I have a numpy array like this:
import numpy as np
x = np.array(
    [[100, 14, 12, 15],
     [100, 21, 16, 11],
     [100, 19, 10, 13],
     [160, 24, 15, 12],
     [160, 43, 12, 65],
     [160, 17, 53, 23],
     [300, 15, 17, 11],
     [300, 66, 23, 12],
     [300, 44, 70, 19]])
The original array is much larger, so my question is: is there a way to bin or group the rows based on the value in the first column? e.g.:
{'100': [[14, 12, 15], [21, 16, 11], [19, 10, 13]],
 '160': [[24, 15, 12], [43, 12, 65], [17, 53, 23]],
 '300': [[15, 17, 11], [66, 23, 12], [44, 70, 19]]}
We are talking about a large dataset, so performance matters, and ideally the grouped data should stay as NumPy arrays. This answer lists two NumPy approaches.
Approach # 1
Here's one approach using np.unique
to get the start index of each group of rows, and then a dictionary comprehension to build the output dictionary -
unq, idx = np.unique(x[:,0], return_index=1)
idx1 = np.r_[idx,x.shape[0]]
dict_out = {unq[i]:x[idx1[i]:idx1[i+1],1:] for i in range(len(unq))}
This assumes that the first column is sorted, as suggested by the question title - ...repeated value in one column
. If it is not, we need to use x[:,0].argsort()
to sort the rows of x
and then proceed.
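To make that sorting step concrete, here is a minimal sketch with a small hypothetical unsorted array (the data is made up for illustration):

```python
import numpy as np

# Hypothetical unsorted input: sort rows by the first column before grouping
x = np.array([[300, 1, 2, 3],
              [100, 4, 5, 6],
              [300, 7, 8, 9]])
x = x[x[:, 0].argsort(kind='stable')]  # stable sort preserves within-group row order
print(x[:, 0].tolist())  # [100, 300, 300]
```

A stable sort matters here: it keeps rows with equal keys in their original relative order, so groups come out in the same row order as the input.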
Example input, output -
In [41]: x
Out[41]:
array([[100, 14, 12, 15],
       [100, 21, 16, 11],
       [100, 19, 10, 13],
       [160, 24, 15, 12],
       [160, 43, 12, 65],
       [160, 17, 53, 23],
       [300, 15, 17, 11],
       [300, 66, 23, 12],
       [300, 44, 70, 19]])
In [42]: dict_out
Out[42]:
{100: array([[14, 12, 15],
        [21, 16, 11],
        [19, 10, 13]]),
 160: array([[24, 15, 12],
        [43, 12, 65],
        [17, 53, 23]]),
 300: array([[15, 17, 11],
        [66, 23, 12],
        [44, 70, 19]])}
Approach # 2
Here's another approach that avoids np.unique
altogether to further improve performance -
idx1 = np.concatenate(([0],np.flatnonzero(x[1:,0] != x[:-1,0])+1, [x.shape[0]]))
dict_out = {x[i,0]:x[i:j,1:] for i,j in zip(idx1[:-1], idx1[1:])}
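As a quick sanity check of the boundary-detection trick: comparing each key to its predecessor flags the rows where a new group starts. A small runnable sketch on a subset of the sample data:

```python
import numpy as np

x = np.array([[100, 14, 12, 15],
              [100, 21, 16, 11],
              [160, 24, 15, 12],
              [300, 15, 17, 11]])

# Indices where the first column changes, plus both ends of the array
idx1 = np.concatenate(([0], np.flatnonzero(x[1:, 0] != x[:-1, 0]) + 1, [x.shape[0]]))
print(idx1)  # [0 2 3 4] - group boundaries

# Adjacent boundary pairs delimit each group's row slice
dict_out = {x[i, 0]: x[i:j, 1:] for i, j in zip(idx1[:-1], idx1[1:])}
print(sorted(dict_out))  # [100, 160, 300]
```

Each consecutive pair in idx1 is a half-open row range for one group, which is why the comprehension zips idx1 against itself shifted by one.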
Runtime test
Approaches -
# @COLDSPEED soln
from collections import defaultdict
def defaultdict_app(x):
    data = defaultdict(list)
    for l in x:
        data[l[0]].append(l[1:])
    return data
# @David Z soln-1
import pandas as pd
def pandas_groupby_app(x):
    df = pd.DataFrame(x)
    return {key: group.iloc[:,1:] for key, group in df.groupby(0)}
# @David Z soln-2
import itertools as it
def groupby_app(x):
    return {key: list(map(list, group)) for key, group in
            it.groupby(x, lambda row: row[0])}
# Proposed in this post
def numpy_app1(x):
    unq, idx = np.unique(x[:,0], return_index=1)
    idx1 = np.r_[idx, x.shape[0]]
    return {unq[i]: x[idx1[i]:idx1[i+1], 1:] for i in range(len(unq))}
# Proposed in this post
def numpy_app2(x):
    idx1 = np.concatenate(([0], np.flatnonzero(x[1:,0] != x[:-1,0])+1, [x.shape[0]]))
    return {x[i,0]: x[i:j,1:] for i,j in zip(idx1[:-1], idx1[1:])}
Timing -
In [84]: x = np.random.randint(0,100,(10000,4))
In [85]: x[:,0].sort()
In [86]: %timeit defaultdict_app(x)
...: %timeit pandas_groupby_app(x)
...: %timeit groupby_app(x)
...: %timeit numpy_app1(x)
...: %timeit numpy_app2(x)
...:
100 loops, best of 3: 4.43 ms per loop
100 loops, best of 3: 15 ms per loop
100 loops, best of 3: 12.1 ms per loop
1000 loops, best of 3: 310 µs per loop
10000 loops, best of 3: 75.6 µs per loop
Since you tagged this as pandas, you can do this using the DataFrame
groupby()
functionality. First, create a DataFrame
from the original array
import pandas as pd
df = pd.DataFrame(x)
and group on the first column; then you can iterate over the resulting GroupBy
object to get the sub-frames whose rows share the same value in the first column.
{key: group for key, group in df.groupby(0)}
Of course, each group
here still contains the key column. You can remove it using indexing:
{key: group.iloc[:,1:] for key, group in df.groupby(0)}
and if you want to convert the sub-frames back to NumPy arrays, use group.iloc[:,1:].values
. (If you want them to be lists of lists, as your question shows, it's not hard to write a function to convert, but it would probably be more efficient to keep the data in Pandas, or at least NumPy, if you can.)
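Putting those pieces together, a minimal end-to-end sketch on a subset of the sample data:

```python
import numpy as np
import pandas as pd

x = np.array([[100, 14, 12, 15],
              [100, 21, 16, 11],
              [160, 24, 15, 12]])
df = pd.DataFrame(x)
# Drop the key column and convert each sub-frame back to a NumPy array
out = {key: group.iloc[:, 1:].values for key, group in df.groupby(0)}
print(out[100])
```

Note that unlike itertools.groupby below, pandas groupby() does not require the key column to be pre-sorted.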
An alternative, which avoids Pandas entirely (if you have a reason for that), is the plain old iterative approach using groupby()
from itertools
import itertools as it
{key: list(map(list, group))
for key, group in it.groupby(x, lambda row: row[0])}
This again includes the key in the resulting rows, but you can truncate it using indexing
{key: list(map(lambda a: list(a)[1:], group))
for key, group in it.groupby(x, lambda row: row[0])}
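One caveat: itertools.groupby only merges runs of consecutive equal keys, so if the first column is not already sorted, sort it first. A minimal sketch with a hypothetical unsorted array:

```python
import itertools as it
import numpy as np

x = np.array([[160, 24, 15, 12],
              [100, 14, 12, 15],
              [100, 21, 16, 11]])
# groupby only groups *consecutive* runs, so sort by the key column first
x = x[x[:, 0].argsort(kind='stable')]
out = {key: [list(row[1:]) for row in group]
       for key, group in it.groupby(x, lambda row: row[0])}
print(out)
```

Without the sort, the two 100-keyed rows would land in a group that overwrites nothing but splits the key across separate groupby runs, and the dict comprehension would silently keep only the last run per key.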
You can make the code cleaner by using a function groupby_transform()
from the more-itertools module (which is not included in the Python standard library):
import more_itertools as mt
{key: list(group) for key, group in mt.groupby_transform(
x, lambda row: row[0], lambda row: list(row[1:])
)}
Disclosure: I implemented the groupby_transform()
function in more-itertools
You can group your data using collections.defaultdict
and a loop.
from collections import defaultdict
data = defaultdict(list)
for l in x:
    data[l[0]].append(list(l[1:]))  # list() so the values print as plain lists
print(dict(data))
Output:
{100: [[14, 12, 15], [21, 16, 11], [19, 10, 13]],
160: [[24, 15, 12], [43, 12, 65], [17, 53, 23]],
300: [[15, 17, 11], [66, 23, 12], [44, 70, 19]]}
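If you later need NumPy arrays per group rather than lists, the accumulated lists can be stacked back; a small sketch building on the same defaultdict pattern:

```python
import numpy as np
from collections import defaultdict

x = np.array([[100, 14, 12, 15],
              [100, 21, 16, 11],
              [160, 24, 15, 12]])
data = defaultdict(list)
for l in x:
    data[l[0]].append(list(l[1:]))
# Stack each group's accumulated rows back into a 2-D array
arrays = {k: np.array(v) for k, v in data.items()}
print(arrays[100].shape)  # (2, 3)
```

This works whether or not the key column is sorted, since defaultdict accumulates by key rather than by position.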
I think you want something like this:
ls_dict = {}
for ls in x:
    key = ls[0]
    value = list(ls[1:])
    if key in ls_dict:
        ls_dict[key].append(value)
    else:
        ls_dict[key] = [value]
print(ls_dict)
Output
{100: [[14, 12, 15], [21, 16, 11], [19, 10, 13]], 160: [[24, 15, 12], [43, 12, 65], [17, 53, 23]], 300: [[15, 17, 11], [66, 23, 12], [44, 70, 19]]}
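The if/else bookkeeping above can be folded into one line with dict.setdefault, a stdlib-only variant of the same loop:

```python
import numpy as np

x = np.array([[100, 14, 12, 15],
              [100, 21, 16, 11],
              [160, 24, 15, 12]])
ls_dict = {}
for ls in x:
    # setdefault creates the empty list the first time a key is seen
    ls_dict.setdefault(ls[0], []).append(list(ls[1:]))
print(ls_dict)
```

setdefault returns the existing list for a known key and inserts (and returns) the empty list for a new one, so both branches collapse into a single append.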