TypeError: unsupported operand type(s) for -: 'str' and 'str' in Python 3.x Anaconda
I am trying to count the number of instances per hour in a large dataset. The code below works fine on Python 2.7, but I had to update to the latest Python 3.x with all packages updated in Anaconda. When I try to execute the program, I get the following str error.
Code:
import pandas as pd
from datetime import datetime,time
import numpy as np
fn = r'00_input.csv'
cols = ['UserId', 'UserMAC', 'HotspotID', 'StartTime', 'StopTime']
df = pd.read_csv(fn, header=None, names=cols)
df['m'] = df.StopTime + df.StartTime
df['d'] = df.StopTime - df.StartTime
# 'start' and 'end' for the reporting DF: `r`
# which will contain equal intervals (1 hour in this case)
start = pd.to_datetime(df.StartTime.min(), unit='s').date()
end = pd.to_datetime(df.StopTime.max(), unit='s').date() + pd.Timedelta(days=1)
# building reporting DF: `r`
freq = '1H' # 1 Hour frequency
idx = pd.date_range(start, end, freq=freq)
r = pd.DataFrame(index=idx)
r['start'] = (r.index - pd.datetime(1970,1,1)).total_seconds().astype(np.int64)
# 1 hour in seconds, minus one second (so that we will not count it twice)
interval = 60*60 - 1
r['LogCount'] = 0
r['UniqueIDCount'] = 0
for i, row in r.iterrows():
    # intervals overlap test
    # https://en.wikipedia.org/wiki/Interval_tree#Overlap_test
    # i've slightly simplified the calculations of m and d
    # by getting rid of division by 2,
    # because it can be done eliminating common terms
    u = df[np.abs(df.m - 2*row.start - interval) < df.d + interval].UserID
    r.ix[i, ['LogCount', 'UniqueIDCount']] = [len(u), u.nunique()]
r['Date'] = pd.to_datetime(r.start, unit='s').dt.date
r['Day'] = pd.to_datetime(r.start, unit='s').dt.weekday_name.str[:3]
r['StartTime'] = pd.to_datetime(r.start, unit='s').dt.time
r['EndTime'] = pd.to_datetime(r.start + interval + 1, unit='s').dt.time
#r.to_csv('results.csv', index=False)
#print(r[r.LogCount > 0])
#print (r['StartTime'], r['EndTime'], r['Day'], r['LogCount'], r['UniqueIDCount'])
rout = r[['Date', 'StartTime', 'EndTime', 'Day', 'LogCount', 'UniqueIDCount'] ]
#print rout
rout.to_csv('o_1_hour.csv', index=False, header=False)
What should I change so that the program runs without errors?
Error:
  File "C:\Program Files\Anaconda3\lib\site-packages\pandas\core\ops.py", line 686, in <lambda>
    lambda x: op(x, rvalues))
TypeError: unsupported operand type(s) for -: 'str' and 'str'
I would appreciate any help. Thanks in advance.
I think you need to change to header=0 so that the first row of the file is used as the header; the column names are then replaced by the list cols.
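To illustrate why the header setting matters, here is a minimal sketch. It assumes the real CSV starts with a header row, which the question does not confirm; the sample rows and the in-memory file are made up for illustration.

```python
import io
import pandas as pd

# Hypothetical sample file; it assumes the real CSV starts with a header row.
csv_text = (
    "UserId,UserMAC,HotspotID,StartTime,StopTime\n"
    "1,aa:bb,h1,1475280000,1475283000\n"
    "2,cc:dd,h2,1475281800,1475285400\n"
)
cols = ['UserId', 'UserMAC', 'HotspotID', 'StartTime', 'StopTime']

# header=None: the header row is read as data, so the time columns hold strings.
raw = pd.read_csv(io.StringIO(csv_text), header=None, names=cols)
raw_is_numeric = pd.api.types.is_numeric_dtype(raw['StartTime'])

# header=0: the header row is consumed and replaced by the names in cols.
df = pd.read_csv(io.StringIO(csv_text), header=0, names=cols)
df_is_numeric = pd.api.types.is_numeric_dtype(df['StartTime'])

# With numeric columns the subtraction from the question works.
durations = (df.StopTime - df.StartTime).tolist()
print(raw_is_numeric, df_is_numeric, durations)
```

If the file has no header row at all, header=None is the right setting and the strings must come from somewhere else in the data.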
If there is still a problem, you need to_numeric, because some values in StartTime and StopTime are strings. They are parsed to NaN, replaced by 0, and finally the column is converted to int:
cols = ['UserId', 'UserMAC', 'HotspotID', 'StartTime', 'StopTime']
df = pd.read_csv('canada_mini_unixtime.csv', header=0, names=cols)
#print (df)
df['StartTime'] = pd.to_numeric(df['StartTime'], errors='coerce').fillna(0).astype(int)
df['StopTime'] = pd.to_numeric(df['StopTime'], errors='coerce').fillna(0).astype(int)
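To see what this conversion does to mixed input, here is a small sketch on a made-up Series (the values are hypothetical, not taken from the asker's file):

```python
import pandas as pd

# Made-up column mixing numeric strings with a stray non-numeric value.
s = pd.Series(['1475280000', 'StartTime', '1475283600'])

coerced = pd.to_numeric(s, errors='coerce')   # non-numeric entries become NaN
cleaned = coerced.fillna(0).astype('int64')   # NaN -> 0, then back to integers

print(cleaned.tolist())
```

One thing to watch, depending on the data: a row filled with 0 makes df.StartTime.min() return 0, so the hourly range built later would start in 1970. Dropping such rows with dropna() instead of filling them may be preferable.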
The following part stays unchanged:
df['m'] = df.StopTime + df.StartTime
df['d'] = df.StopTime - df.StartTime
start = pd.to_datetime(df.StartTime.min(), unit='s').date()
end = pd.to_datetime(df.StopTime.max(), unit='s').date() + pd.Timedelta(days=1)
freq = '1H' # 1 Hour frequency
idx = pd.date_range(start, end, freq=freq)
r = pd.DataFrame(index=idx)
r['start'] = (r.index - pd.datetime(1970,1,1)).total_seconds().astype(np.int64)
# 1 hour in seconds, minus one second (so that we will not count it twice)
interval = 60*60 - 1
r['LogCount'] = 0
r['UniqueIDCount'] = 0
ix is deprecated in the latest version of pandas, so use loc with the column name in []:
for i, row in r.iterrows():
    # intervals overlap test
    # https://en.wikipedia.org/wiki/Interval_tree#Overlap_test
    # i've slightly simplified the calculations of m and d
    # by getting rid of division by 2,
    # because it can be done eliminating common terms
    u = df.loc[np.abs(df.m - 2*row.start - interval) < df.d + interval, 'UserId']
    r.loc[i, ['LogCount', 'UniqueIDCount']] = [len(u), u.nunique()]
r['Date'] = pd.to_datetime(r.start, unit='s').dt.date
r['Day'] = pd.to_datetime(r.start, unit='s').dt.weekday_name.str[:3]
r['StartTime'] = pd.to_datetime(r.start, unit='s').dt.time
r['EndTime'] = pd.to_datetime(r.start + interval + 1, unit='s').dt.time
print (r)
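For newer pandas releases, where pd.datetime and .dt.weekday_name are no longer available, a minimal sketch of the same hourly count on made-up data might look like this. The sample sessions, the list-based accumulation, and the explicit form of the overlap test (StartTime < s + interval and StopTime > s, which is algebraically the same as the abs() comparison above) are assumptions for illustration, not part of the original answer.

```python
import numpy as np
import pandas as pd

# Made-up sessions in Unix seconds (hypothetical data, not the asker's file).
df = pd.DataFrame({
    'UserId': [1, 2, 1],
    'StartTime': [1475280000, 1475281800, 1475287200],
    'StopTime': [1475283000, 1475285400, 1475289000],
})

# Reporting range: midnight of the first day up to midnight after the last day.
start = pd.to_datetime(df.StartTime.min(), unit='s').normalize()
end = pd.to_datetime(df.StopTime.max(), unit='s').normalize() + pd.Timedelta(days=1)

# One row per hour; 'start' is the hour's first second as a Unix timestamp.
r = pd.DataFrame(index=pd.date_range(start, end, freq=pd.Timedelta(hours=1)))
r['start'] = (r.index - pd.Timestamp('1970-01-01')).total_seconds().astype(np.int64)

# 1 hour in seconds, minus one second (so that boundaries are not counted twice)
interval = 60 * 60 - 1

log_counts, unique_counts = [], []
for s in r['start']:
    # A session overlaps the hour [s, s + interval] if it starts before the
    # hour ends and stops after the hour begins.
    overlaps = (df.StartTime < s + interval) & (df.StopTime > s)
    users = df.loc[overlaps, 'UserId']
    log_counts.append(len(users))
    unique_counts.append(users.nunique())

r['LogCount'] = log_counts
r['UniqueIDCount'] = unique_counts
r['Day'] = r.index.day_name().str[:3]   # replacement for .dt.weekday_name

print(r[r.LogCount > 0])
```

Collecting the counts in lists and assigning them once avoids writing into r row by row, but the loop itself is still the same one-pass-per-hour approach as above.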
The line df['d'] = df.StopTime - df.StartTime tries to subtract one string from another. I don't know what your data looks like, but most likely you want to parse StopTime and StartTime as dates. Try
df = pd.read_csv(fn, header=None, names=cols, parse_dates=[3,4])
instead of
df = pd.read_csv(fn, header=None, names=cols)
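If the two columns hold Unix timestamps in seconds, as the pd.to_datetime(..., unit='s') calls in the question suggest, one possible alternative is to convert them explicitly after reading. This is only a sketch under that assumption; the sample rows are made up.

```python
import io
import pandas as pd

# Hypothetical header-less sample with Unix-second timestamps.
csv_text = (
    "1,aa:bb,h1,1475280000,1475283000\n"
    "2,cc:dd,h2,1475281800,1475285400\n"
)
cols = ['UserId', 'UserMAC', 'HotspotID', 'StartTime', 'StopTime']
df = pd.read_csv(io.StringIO(csv_text), header=None, names=cols)

# Interpret the integers as seconds since 1970-01-01.
df['StartTime'] = pd.to_datetime(df['StartTime'], unit='s')
df['StopTime'] = pd.to_datetime(df['StopTime'], unit='s')

# Subtracting datetimes yields Timedelta values instead of raising a TypeError.
df['d'] = df.StopTime - df.StartTime
print(df[['StartTime', 'StopTime', 'd']])
```

The rest of the asker's code works on integer seconds, so it would need adapting if the columns are kept as datetimes.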