Why is my pandas DataFrame not updating its values when I change them?

I'm trying to change every row in my tweet_text Series object, but when I print the Series after making the changes in the for loop, I get the same rows as before the loop. How can I fix this?

import pandas as pd
import re
import string

df = pd.read_csv('sample-tweets.csv',
                 names=['Tweet_Date', 'User_ID', 'Tweet_Text', 'Favorites', 'Retweets', 'Tweet_ID'])

sum_df = df[['User_ID', 'Tweet_ID', 'Tweet_Text']].copy()
sum_df.set_index(['User_ID'])
# print sum_df

tweet_text = df.ix[:, 2]
print type(tweet_text)

# efficiency could be improved by using the translate method
# regex = re.compile('[%s]' % re.escape(string.punctuation))

for tweet in tweet_text:
    tweet = re.sub('https://t.co/[a-zA-Z0-9]*', "", tweet)
    tweet = re.sub('@[a-zA-Z0-9]*', '', tweet)
    tweet = re.sub('#[a-zA-Z0-9]*', '', tweet)
    tweet = re.sub(r'\$[a-zA-Z0-9]*', '', tweet)  # escape $ so it matches a literal dollar sign
    tweet = ''.join(i for i in tweet if not i.isdigit())
    tweet = tweet.replace('"', '')
    tweet = re.sub(r'[\(\[].*?[\)\]]', '', tweet)  # takes out everything between parentheses also, fix this

    # gets rid of all punctuation and emojis
    tweet = "".join(l for l in tweet if l not in string.punctuation)
    tweet = re.sub(r'[^\x00-\x7F]+',' ', tweet)

    # gets rid of all extra spacing
    tweet = tweet.lower()
    tweet = tweet.strip()
    tweet = " ".join(tweet.split())

    count = count + 1
    # print tweet

print tweet_text

      



2 answers


This happens because tweet_text is a copy of the column df.ix[:, 2], for starters, and nothing in the loop writes the cleaned text back to it. Secondly, a for loop is not the pandas way of iterating over a Series; you should use apply().

To update your code, move everything inside the loop into a function:

def parse_tweet(tweet):
    ## everything from loop goes here
    return tweet

      

Then instead of:



tweet_text = df.ix[:, 2]

      

do:

df.iloc[:, 2] = df.iloc[:, 2].apply(parse_tweet)
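
Put together, the approach might look like the sketch below. The cleaning steps mirror the loop in the question, with one assumption on my part: the $ is placed in a character class so that it matches a literal dollar sign instead of the end of the string. The two sample tweets are made up for illustration, and the column is addressed by name rather than by position 2.

```python
import re
import string

import pandas as pd


def parse_tweet(tweet):
    """Clean a single tweet; the steps follow the loop body in the question."""
    tweet = re.sub(r'https://t\.co/[a-zA-Z0-9]*', '', tweet)  # t.co links
    # Mentions, hashtags and cashtags; inside [...] the $ is a literal character.
    tweet = re.sub(r'[@#$][a-zA-Z0-9]*', '', tweet)
    tweet = ''.join(ch for ch in tweet if not ch.isdigit())
    tweet = re.sub(r'[\(\[].*?[\)\]]', '', tweet)  # bracketed text
    tweet = ''.join(ch for ch in tweet if ch not in string.punctuation)
    tweet = re.sub(r'[^\x00-\x7F]+', ' ', tweet)  # non-ASCII, e.g. emoji
    return ' '.join(tweet.lower().split())  # lowercase, collapse whitespace


# Hypothetical sample data standing in for sample-tweets.csv.
df = pd.DataFrame({
    'User_ID': [1, 2],
    'Tweet_ID': [10, 20],
    'Tweet_Text': ['Hello @bob, check https://t.co/abc123 #wow!',
                   'Buy $TSLA now 100%'],
})

# apply() returns a new Series; assigning it back is what updates the frame.
df['Tweet_Text'] = df['Tweet_Text'].apply(parse_tweet)
print(df['Tweet_Text'].tolist())
```

The key point is the assignment on the left-hand side: apply() on its own changes nothing either, it only returns the cleaned values.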

      

BTW, don't use the .ix indexer: it is deprecated and was removed in pandas 1.0. Use .loc (label-based) or .iloc (position-based) instead.
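
As a rough guide to replacing .ix, the sketch below selects the same column once by position and once by label. The small frame and its index labels 100 and 200 are invented for the example.

```python
import pandas as pd

# Made-up frame; the index labels are deliberately not 0 and 1.
df = pd.DataFrame(
    {'User_ID': [7, 8], 'Tweet_ID': [10, 20], 'Tweet_Text': ['a', 'b']},
    index=[100, 200],
)

by_position = df.iloc[:, 2]           # third column, whatever it is called
by_label = df.loc[:, 'Tweet_Text']    # column with this name

first_by_position = df.iloc[0, 2]            # first row, third column
first_by_label = df.loc[100, 'Tweet_Text']   # row labelled 100

print(by_position.equals(by_label), first_by_position, first_by_label)
```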



Python strings are immutable. Inside the loop you only rebind the variable tweet to a new string; you never update the actual DataFrame.
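
A minimal illustration of that point, using a made-up Series rather than the tweet data:

```python
import pandas as pd

s = pd.Series(['Hello World', 'FOO bar'])

for text in s:
    # lower() builds a new string and binds it to the local name `text`;
    # the element stored in the Series is not touched.
    text = text.lower()

after_loop = s.tolist()
print(after_loop)  # still the original strings
```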

You just need to re-insert the updated value back into your frame. An example of a simple fix:



for i, tweet in enumerate(tweet_text):
    tweet = re.sub('https://t.co/[a-zA-Z0-9]*', "", tweet)
    tweet = re.sub('@[a-zA-Z0-9]*', '', tweet)

    # ...

    # update dataframe
    df.iloc[i, 2] = tweet
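
For reference, here is a self-contained sketch of the same write-back loop. The data and the string index are invented for illustration, and only two of the cleaning steps are shown. Because enumerate() yields positions, the sketch writes back with the position-based .iloc indexer and looks up the column position with columns.get_loc().

```python
import re

import pandas as pd

# Made-up data; the string index shows that row labels need not be 0..n-1.
df = pd.DataFrame(
    {'User_ID': [1, 2], 'Tweet_ID': [10, 20],
     'Tweet_Text': ['Hi @bob', 'See https://t.co/xyz']},
    index=['a', 'b'],
)

text_col = df.columns.get_loc('Tweet_Text')  # integer position of the column

# Iterate over a plain list so the frame can be modified safely in the loop.
for i, tweet in enumerate(df['Tweet_Text'].tolist()):
    tweet = re.sub(r'https://t\.co/[a-zA-Z0-9]*', '', tweet)
    tweet = re.sub(r'@[a-zA-Z0-9]*', '', tweet)
    df.iloc[i, text_col] = tweet.strip()  # write the result back by position

print(df['Tweet_Text'].tolist())
```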

      
