Converting one int to multiple bool columns in pandas

Background

I have a DataFrame with a column of integers. Each integer encodes a series of features (bit flags) that are either present or not present in that row.

I want these features to be named columns in my DataFrame.

Problem

My current solution uses far too much memory and is extremely slow. How can I make it more memory-efficient?

import pandas as pd
df = pd.DataFrame({'some_int':range(5)})
df['some_int'].astype(int).apply(bin).str[2:].str.zfill(4).apply(list).apply(pd.Series).rename(columns=dict(zip(range(4), ["f1", "f2", "f3", "f4"])))

  f1 f2 f3 f4
0  0  0  0  0
1  0  0  0  1
2  0  0  1  0
3  0  0  1  1
4  0  1  0  0

      

It seems to be .apply(pd.Series) that slows it down. Everything else is pretty quick until I add this step.

I cannot skip it, because a plain list will not create a DataFrame.


3 answers


Here's a vectorized NumPy approach -

import numpy as np

def num2bin(nums, width):
    return ((nums[:,None] & (1 << np.arange(width-1,-1,-1)))!=0).astype(int)

      

Example run -

In [70]: df
Out[70]: 
   some_int
0         1
1         5
2         3
3         8
4         4

In [71]: pd.DataFrame(num2bin(df.some_int.values, 4),
    ...:              columns=["f1", "f2", "f3", "f4"])
Out[71]: 
   f1  f2  f3  f4
0   0   0   0   1
1   0   1   0   1
2   0   0   1   1
3   1   0   0   0
4   0   1   0   0
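
Since the question asks for bool columns and lower memory use, the same mask could also be kept as a boolean array instead of being cast to int. A minimal sketch, assuming a hypothetical helper name num2bool (not part of the original answer) and the column names f1..f4 from the question:

```python
import numpy as np
import pandas as pd

def num2bool(nums, width):
    # Hypothetical variant of num2bin: keep the boolean mask instead of casting to int
    powers = 1 << np.arange(width - 1, -1, -1)   # [8, 4, 2, 1] for width=4
    return (nums[:, None] & powers) != 0

df = pd.DataFrame({'some_int': [1, 5, 3, 8, 4]})

# Build the flag columns on the same index so they can be joined back to df
flags = pd.DataFrame(num2bool(df['some_int'].to_numpy(), 4),
                     columns=['f1', 'f2', 'f3', 'f4'],
                     index=df.index)
result = df.join(flags)
```

Here result holds the original some_int column next to four bool columns, so the flags can be used directly for filtering, e.g. result[result['f4']].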

      

Explanation

1) Inputs:

In [98]: nums = np.array([1,5,3,8,4])

In [99]: width = 4

      

2) Get the powers of 2 for each bit position:

In [100]: (1 << np.arange(width-1,-1,-1))
Out[100]: array([8, 4, 2, 1])

      



3) Convert nums to a 2D (column) version of the array, because we will next do element-wise bit-ANDing between it and the powers of 2 in a vectorized manner, following the rules of broadcasting:

In [101]: nums[:,None]
Out[101]: 
array([[1],
       [5],
       [3],
       [8],
       [4]])

In [102]: nums[:,None] & (1 << np.arange(width-1,-1,-1))
Out[102]: 
array([[0, 0, 0, 1],
       [0, 4, 0, 1],
       [0, 0, 2, 1],
       [8, 0, 0, 0],
       [0, 4, 0, 0]])

      

To understand the bit-ANDing, consider the number 5 from nums and its bit-ANDing against each of the powers of 2, [8,4,2,1]:

In [103]: 5 & 8    # 0101 & 1000
Out[103]: 0

In [104]: 5 & 4    # 0101 & 0100
Out[104]: 4

In [105]: 5 & 2    # 0101 & 0010
Out[105]: 0

In [106]: 5 & 1    # 0101 & 0001
Out[106]: 1

      

Thus, we see that there is no intersection with [8,2], whereas for the others we get non-zeros.

4) In the final step, look for matches (non-zeros) by comparing against 0, which turns them into True and leaves the rest as False, resulting in a boolean array; then convert to int dtype:

In [107]: matches = nums[:,None] & (1 << np.arange(width-1,-1,-1))

In [108]: matches!=0
Out[108]: 
array([[False, False, False,  True],
       [False,  True, False,  True],
       [False, False,  True,  True],
       [ True, False, False, False],
       [False,  True, False, False]], dtype=bool)

In [109]: (matches!=0).astype(int)
Out[109]: 
array([[0, 0, 0, 1],
       [0, 1, 0, 1],
       [0, 0, 1, 1],
       [1, 0, 0, 0],
       [0, 1, 0, 0]])
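
As a quick sanity check of the steps above, the broadcast result can be compared against the digits that np.binary_repr produces for each number. A small sketch using the same nums and width as in the explanation (the names powers, got and expected are only for illustration):

```python
import numpy as np

nums = np.array([1, 5, 3, 8, 4])
width = 4

# Broadcasting: a (5, 1) column of numbers AND-ed with a (4,) row of powers of 2
powers = 1 << np.arange(width - 1, -1, -1)
got = ((nums[:, None] & powers) != 0).astype(int)

# Reference: digit by digit from the zero-padded binary string of each number
expected = np.array([[int(c) for c in np.binary_repr(n, width=width)]
                     for n in nums])

print(np.array_equal(got, expected))   # True
```

Note that the mask only looks at the lowest width bits, so width has to be large enough for the largest value in nums.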

      

Runtime test

In [58]: df = pd.DataFrame({'some_int':range(100000)})

# @jezrael soln-1
In [59]: %timeit pd.DataFrame(df['some_int'].astype(int).apply(bin).str[2:].str.zfill(4).apply(list).values.tolist())
1 loops, best of 3: 198 ms per loop

# @jezrael soln-2
In [60]: %timeit pd.DataFrame([list('{:20b}'.format(x)) for x in df['some_int'].values])
10 loops, best of 3: 154 ms per loop

# @jezrael soln-3
In [61]: %timeit pd.DataFrame(df['some_int'].apply(lambda x: list('{:20b}'.format(x))).values.tolist())
10 loops, best of 3: 132 ms per loop

# @MaxU soln-1
In [62]: %timeit pd.DataFrame([list(np.binary_repr(x, width=20)) for x in df.some_int.values])
1 loops, best of 3: 193 ms per loop

# @MaxU soln-2
In [64]: %timeit df.some_int.apply(lambda x: pd.Series(list(np.binary_repr(x, width=20))))
1 loops, best of 3: 11.8 s per loop

# Proposed in this post
In [65]: %timeit pd.DataFrame( num2bin(df.some_int.values, 20))
100 loops, best of 3: 5.64 ms per loop
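
The timings above cover speed only. For the memory side of the question, one possible way to compare the approaches is DataFrame.memory_usage. This is a rough sketch, not a measurement from the original post; it assumes a smaller frame of 1000 rows, and the names as_str and as_bool are made up for the example:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'some_int': range(1000)})
width = 20

# String-digit frame, as produced by the format/binary_repr based solutions
as_str = pd.DataFrame([list('{:020b}'.format(x)) for x in df['some_int'].values])

# Boolean frame from the bit-mask approach (no cast to int)
powers = 1 << np.arange(width - 1, -1, -1)
as_bool = pd.DataFrame((df['some_int'].to_numpy()[:, None] & powers) != 0)

# deep=True also counts the string objects stored in the columns
str_bytes = as_str.memory_usage(deep=True).sum()
bool_bytes = as_bool.memory_usage(deep=True).sum()
print(str_bytes, bool_bytes)
```

The boolean frame stores one byte per cell, so it should come out far smaller than the frame of single-character strings; the exact numbers depend on the pandas version.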

      



You can use the numpy.binary_repr function:

In [336]: df.some_int.apply(lambda x: pd.Series(list(np.binary_repr(x, width=4)))) \
            .add_prefix('f')
Out[336]:
  f0 f1 f2 f3
0  0  0  0  0
1  0  0  0  1
2  0  0  1  0
3  0  0  1  1
4  0  1  0  0

      



or

In [346]: pd.DataFrame([list(np.binary_repr(x, width=4)) for x in df.some_int.values],
     ...:              columns=np.arange(1,5)) \
     ...:   .add_prefix('f')
     ...:
Out[346]:
  f1 f2 f3 f4
0  0  0  0  0
1  0  0  0  1
2  0  0  1  0
3  0  0  1  1
4  0  1  0  0
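
Note that both variants produce columns holding the strings '0' and '1', not numbers. If actual bool columns are wanted, one possible follow-up is to compare against '1'. A sketch under the assumption that every cell is exactly '0' or '1' (the names digits and flags are only for illustration):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'some_int': range(5)})

# Same construction as the second variant above: one string digit per cell
digits = pd.DataFrame([list(np.binary_repr(x, width=4)) for x in df.some_int.values],
                      columns=np.arange(1, 5)).add_prefix('f')

# Comparing each cell with the string '1' yields True/False columns
flags = digits == '1'
```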

      



I think you need:

a = pd.DataFrame(df['some_int'].astype(int)
                               .apply(bin)
                               .str[2:]
                               .str.zfill(4)
                               .apply(list).values.tolist(), columns=["f1","f2","f3","f4"])
print (a)
  f1 f2 f3 f4
0  0  0  0  0
1  0  0  0  1
2  0  0  1  0
3  0  0  1  1
4  0  1  0  0

      

Another solution, thanks to Jon Clements and ayhan:

a = pd.DataFrame(df['some_int'].apply(lambda x: list('{:04b}'.format(x))).values.tolist(), 
                 columns=['f1', 'f2', 'f3', 'f4'])
print (a)
  f1 f2 f3 f4
0  0  0  0  0
1  0  0  0  1
2  0  0  1  0
3  0  0  1  1
4  0  1  0  0

      

A slightly changed version, using a list comprehension:

a = pd.DataFrame([list('{:04b}'.format(x)) for x in df['some_int'].values], 
                  columns=['f1', 'f2', 'f3', 'f4'])
print (a)
  f1 f2 f3 f4
0  0  0  0  0
1  0  0  0  1
2  0  0  1  0
3  0  0  1  1
4  0  1  0  0

      

Timings

df = pd.DataFrame({'some_int':range(100000)})

In [80]: %timeit pd.DataFrame(df['some_int'].astype(int).apply(bin).str[2:].str.zfill(20).apply(list).values.tolist())
1 loop, best of 3: 231 ms per loop

In [81]: %timeit pd.DataFrame([list('{:020b}'.format(x)) for x in df['some_int'].values])
1 loop, best of 3: 232 ms per loop

In [82]: %timeit pd.DataFrame(df['some_int'].apply(lambda x: list('{:020b}'.format(x))).values.tolist())
1 loop, best of 3: 222 ms per loop

In [83]: %timeit pd.DataFrame([list(np.binary_repr(x, width=20)) for x in df.some_int.values])
1 loop, best of 3: 343 ms per loop

In [84]: %timeit df.some_int.apply(lambda x: pd.Series(list(np.binary_repr(x, width=20))))
1 loop, best of 3: 16.4 s per loop

In [87]: %timeit pd.DataFrame( num2bin(df.some_int.values, 20))
100 loops, best of 3: 11.4 ms per loop

      
