Add a "flag" column indicating whether an identifier has specific values in another column
The DataFrame looks like this:
In [1]: df
Out[2]:
   userid  type
0       1     1
1       1     2
2       2     1
3       3     1
4       3     2
5       3     3
Now I want to add a column indicating whether each userid has all of a given set of values in the type column (for example, both type 1 and type 2). This is what I want to get:
In [1]: df
Out[2]:
   userid  type  has_type_12
0       1     1            1
1       1     2            1
2       2     1            0
3       3     1            1
4       3     2            1
5       3     3            1
Is there a quick way to do this?
Edit: I overlooked the case where a userid, such as 3, can have additional types, such as 3 or 4. In that case I would still like has_type_12 = 1 for userid 3. I have updated the input and desired output above accordingly.
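For anyone who wants to reproduce this, the sample frame can be built roughly as follows. This is a minimal sketch that assumes nothing beyond the column names and values shown above.

```python
import pandas as pd

# Sample data copied from the question: userid 3 has an extra type (3)
# in addition to the two types of interest (1 and 2).
df = pd.DataFrame({
    'userid': [1, 1, 2, 3, 3, 3],
    'type':   [1, 2, 1, 1, 2, 3],
})
print(df)
```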
Use groupby + transform with sets:
cats = [1, 2]
df['has_type_12'] = df.groupby('userid')['type'] \
                      .transform(lambda x: set(x) >= set(cats)) \
                      .astype(int)
print(df)
   userid  type  has_type_12
0       1     1            1
1       1     2            1
2       2     1            0
3       3     1            1
4       3     2            1
5       3     3            1
Another solution with a double any (if there are only a few categories):
cats = [1, 2]
df['has_type_12'] = df.groupby('userid')['type'] \
                      .transform(lambda x: ((x == 1).any()) & ((x == 2).any())) \
                      .astype(int)
print(df)
   userid  type  has_type_12
0       1     1            1
1       1     2            1
2       2     1            0
3       3     1            1
4       3     2            1
5       3     3            1
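When the list of categories grows, chaining one any per category gets unwieldy. A possible alternative, sketched here and not taken from the answers in this thread, filters rows with isin and counts the distinct matching types per user with nunique. The sample frame is rebuilt so the snippet runs on its own, and the name `hits` is just an illustrative choice.

```python
import pandas as pd

df = pd.DataFrame({
    'userid': [1, 1, 2, 3, 3, 3],
    'type':   [1, 2, 1, 1, 2, 3],
})
cats = [1, 2]

# Keep only rows whose type is one of the wanted categories, then count
# how many distinct wanted types each user has.
hits = df.loc[df['type'].isin(cats)].groupby('userid')['type'].nunique()

# A user qualifies when every wanted category was seen. Users with no
# matching rows are missing from `hits`, so map() yields NaN; treat it as 0.
df['has_type_12'] = (
    df['userid'].map(hits).fillna(0).eq(len(set(cats))).astype(int)
)
print(df)
```

This avoids calling a Python lambda once per group, which may help on large frames, though that has not been measured here.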
When applied to sets, the >= operator checks whether the right-hand side is a subset of the left-hand side. I use the Series method ge as a proxy for >=.
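The set comparison itself can be checked in plain Python, independent of pandas. A minimal sketch, using the per-user type sets from the question (the variable names are illustrative only):

```python
wanted = {1, 2}

user1_types = {1, 2}      # exactly the wanted types
user2_types = {1}         # type 2 is missing
user3_types = {1, 2, 3}   # wanted types plus an extra one

# a >= b is True when every element of b is also in a (b is a subset of a).
print(user1_types >= wanted)
print(user2_types >= wanted)
print(user3_types >= wanted)

# The operator form is equivalent to the issuperset method.
print(user3_types.issuperset(wanted))
```

Only user 2 fails the check, which matches the 0 in the desired output.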
Using groupby
m = df.groupby('userid').type.apply(set)
df.assign(
    has_type_12=df.userid.map(m).ge({1, 2}).astype(int)
)
   userid  type  has_type_12
0       1     1            1
1       1     2            1
2       2     1            0
3       3     1            1
4       3     2            1
5       3     3            1
Using defaultdict
from collections import defaultdict
d = defaultdict(set)
[d[k].add(v) for k, v in zip(df.userid.values.tolist(), df.type.values.tolist())];
df.assign(has_type_12=df.userid.map(d).ge({1, 2}).astype(int))
   userid  type  has_type_12
0       1     1            1
1       1     2            1
2       2     1            0
3       3     1            1
4       3     2            1
5       3     3            1
Timing
Larger data
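import numpy as np
import pandas as pd

# 100,000 rows with about 1,000 distinct users and 100 distinct types
np.random.seed([3,1415])
df = pd.DataFrame(dict(
    userid=np.random.randint(1000, size=100000),
    type=np.random.randint(100, size=100000)
))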
%%timeit
d = defaultdict(set)
[d[k].add(v) for k, v in zip(df.userid.values.tolist(), df.type.values.tolist())];
df.userid.map(d).ge({1, 2}).astype(int)
10 loops, best of 3: 55.6 ms per loop
%%timeit
m = df.groupby('userid').type.apply(set)
df.userid.map(m).ge({1, 2}).astype(int)
10 loops, best of 3: 76.1 ms per loop
%timeit df.groupby('userid')['type'] \
          .transform(lambda x: set(x) >= set(cats)) \
          .astype(int)
1 loop, best of 3: 206 ms per loop