Why is my pandas dataframe not updating its values as they change?
I'm trying to change every line in my "tweet_text" Series object, but when I print the Series after making the changes in the for loop, I get the same lines as before the loop. How can I fix this?
import pandas as pd
import re
import string
df = pd.read_csv('sample-tweets.csv',
                 names=['Tweet_Date', 'User_ID', 'Tweet_Text', 'Favorites', 'Retweets', 'Tweet_ID'])
sum_df = df[['User_ID', 'Tweet_ID', 'Tweet_Text']].copy()
sum_df.set_index(['User_ID'])
# print sum_df
tweet_text = df.ix[:, 2]
print type(tweet_text)
# efficiency could be improved by using the translate method
# regex = re.compile('[%s]' % re.escape(string.punctuation))
for tweet in tweet_text:
    tweet = re.sub('https://t.co/[a-zA-Z0-9]*', "", tweet)
    tweet = re.sub('@[a-zA-Z0-9]*', '', tweet)
    tweet = re.sub('#[a-zA-Z0-9]*', '', tweet)
    tweet = re.sub(r'\$[a-zA-Z0-9]*', '', tweet)
    tweet = ''.join(i for i in tweet if not i.isdigit())
    tweet = tweet.replace('"', '')
    tweet = re.sub(r'[\(\[].*?[\)\]]', '', tweet) # takes out everything between parentheses also, fix this
    # gets rid of all punctuation and emojis
    tweet = "".join(l for l in tweet if l not in string.punctuation)
    tweet = re.sub(r'[^\x00-\x7F]+',' ', tweet)
    # gets rid of all extra spacing
    tweet = tweet.lower()
    tweet = tweet.strip()
    tweet = " ".join(tweet.split())
    count = count + 1
    # print tweet
print tweet_text
This happens because, for starters, tweet_text is a copy of the column df.ix[:, 2]. Secondly, a for loop is not the pandas way of iterating over a Series; you should use apply().
To update your code, move everything inside the loop into a function:
def parse_tweet(tweet):
    # everything from the loop body goes here
    return tweet
Then instead of:
tweet_text = df.ix[:, 2]
do:
df.iloc[:, 2] = df.iloc[:, 2].apply(parse_tweet)
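Putting the pieces together, a minimal sketch of the apply() approach might look like the following. It assumes Python 3 and uses a small invented DataFrame in place of sample-tweets.csv. The body of parse_tweet keeps only a few of the substitutions from the question, and the column is addressed by its name rather than by position, which is equivalent here.

```python
import re
import string

import pandas as pd


def parse_tweet(tweet):
    """Return a cleaned copy of one tweet (a subset of the question's steps)."""
    tweet = re.sub(r'https://t\.co/[a-zA-Z0-9]*', '', tweet)  # strip t.co links
    tweet = re.sub(r'[@#][a-zA-Z0-9]*', '', tweet)            # strip mentions and hashtags
    tweet = ''.join(ch for ch in tweet if ch not in string.punctuation)
    tweet = tweet.lower()
    return ' '.join(tweet.split())                            # collapse extra whitespace


# Hypothetical stand-in for the CSV in the question.
df = pd.DataFrame({
    'User_ID': [1, 2],
    'Tweet_ID': [10, 11],
    'Tweet_Text': ['Hello @bob, see https://t.co/abc123 #fun',
                   '  So   MUCH  spacing!  '],
})

# apply() returns a new Series; assigning it back is what updates the frame.
df['Tweet_Text'] = df['Tweet_Text'].apply(parse_tweet)
print(df['Tweet_Text'].tolist())
```

The key point is the assignment on the left-hand side: without it, the cleaned Series would be computed and then thrown away, just like the rebinding in the original loop.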
BTW, don't use the ix indexer, as it is deprecated and will be removed in future versions of pandas. Use loc or iloc instead.
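As a rough illustration of the two replacements, loc selects by label and iloc selects by integer position. The frame below is invented for the example (the column names are borrowed from the question, and the string index is hypothetical, chosen only to make the label/position difference visible).

```python
import pandas as pd

df = pd.DataFrame(
    {'User_ID': [1, 2], 'Tweet_ID': [10, 11], 'Tweet_Text': ['first', 'second']},
    index=['a', 'b'],  # hypothetical string index
)

by_position = df.iloc[:, 2]         # third column, by integer position
by_label = df.loc[:, 'Tweet_Text']  # same column, by label

print(by_position.equals(by_label))
print(df.iloc[1, 2], df.loc['b', 'Tweet_Text'])
```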
Python strings are immutable. Inside the loop you only rebind the variable tweet to a new string; the DataFrame itself is never updated.
You just need to re-insert the updated value back into your frame. An example of a simple fix:
for i, tweet in enumerate(tweet_text):
    tweet = re.sub('https://t.co/[a-zA-Z0-9]*', "", tweet)
    tweet = re.sub('@[a-zA-Z0-9]*', '', tweet)
    # ...
    # update dataframe (i is a position, so use the positional indexer)
    df.iloc[i, 2] = tweet
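The difference between rebinding the loop variable and writing the value back can be reproduced in a few lines. This is only a sketch, assuming Python 3 and an invented two-row Series rather than the question's data.

```python
import pandas as pd

# Invented sample data, purely for illustration.
tweets = pd.Series(['Hello WORLD', 'Foo BAR'])

# Rebinding the loop variable creates a new string; the Series is unaffected.
for tweet in tweets:
    tweet = tweet.lower()
unchanged = tweets.tolist()

# Writing the new value back by position is what actually updates the Series.
for i, tweet in enumerate(tweets):
    tweets.iloc[i] = tweet.lower()
updated = tweets.tolist()

print(unchanged)
print(updated)
```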