Count the number of words in each line with Pig

I have a set of tweets with many different fields:

raw_tweets = LOAD 'input.tsv' USING PigStorage('\t') AS (tweet_id, text, 
in_reply_to_status_id, favorite_count, source, coordinates, entities, 
in_reply_to_screen_name, in_reply_to_user_id, retweet_count, is_retweet, 
retweet_of_id, user_id_id, lang, created_at, event_id_id, is_news);

      

I want to find the most common words for each date. I managed to group the texts by date:

r1 = FOREACH raw_tweets GENERATE SUBSTRING(created_at,0,10) AS a, REPLACE 
(LOWER(text),'([^a-z\\s]+)','') AS b;
r2 = group r1 by a;
r3 = foreach r2 generate group as a, r1 as b;
r4 = foreach r3 generate a, FLATTEN(BagToTuple(b.b));

      

It now looks like this:

(date text text3)
(date2 text2)

      

I removed the special characters, so only "real" words appear in the text. Example:

2017-06-18 the plants are green the dog is black there are words this is
2017-06-19 more words and even more words another phrase begins here

      

I want the result to look like this:

2017-06-18 the are is
2017-06-19 more words and

      

I don't need the counts themselves. I just want to show the most common words for each date; if several words are tied for the most occurrences, show all of them.



1 answer


While I'm sure there is a way to do this entirely in Pig, it will probably be more complicated than necessary.

UDFs are the way to go in my opinion. Python is just one option; I'll show it because it is quick to register with Pig.

For example,

input.tsv

2017-06-18  the plants are green the dog is black there are words this is
2017-06-19  more words and even more words another phrase begins here

      

py_udfs.py

from collections import Counter
from operator import itemgetter

# outputSchema is supplied by Pig when the script is registered with jython
@outputSchema("y:bag{t:tuple(word:chararray,count:int)}")
def word_count(sentence):
    ''' Does a word count of a sentence and orders common words first '''
    # Counter tallies each whitespace-separated word
    words = Counter(sentence.split())
    # Sort the (word, count) pairs by count, highest first
    return sorted(words.items(), key=itemgetter(1), reverse=True)

      



script.pig

REGISTER 'py_udfs.py' USING jython AS py_udfs;
A = LOAD 'input.tsv' USING PigStorage('\t') as (created_at:chararray,sentence:chararray);
B = FOREACH A GENERATE created_at, py_udfs.word_count(sentence);
DUMP B;

      

Output

(2017-06-18,{(is,2),(the,2),(are,2),(green,1),(black,1),(words,1),(this,1),(plants,1),(there,1),(dog,1)})
(2017-06-19,{(more,2),(words,2),(here,1),(another,1),(begins,1),(phrase,1),(even,1),(and,1)})
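
The output above still lists every word. To keep only the words tied for the highest count, which is what the question asks for, the counting logic could be narrowed roughly as follows. This is a minimal sketch in plain Python: the name top_words is made up for illustration, and it is not wired into Pig here. Using it as a UDF would presumably need an @outputSchema decorator of its own, along the lines of the one on word_count.

```python
from collections import Counter

def top_words(sentence):
    """Return the words tied for the highest count, in first-seen order."""
    tokens = sentence.split()
    counts = Counter(tokens)
    if not counts:
        return []
    best = max(counts.values())
    # Walk the tokens again so ties keep the order in which they first appear.
    result = []
    for word in tokens:
        if counts[word] == best and word not in result:
            result.append(word)
    return result

print(top_words("the plants are green the dog is black there are words this is"))
```

For the first sample line this keeps the, are and is, each of which occurs twice. For the second line it keeps only more and words, since every other word occurs once.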

      

If you are doing text analysis, I would suggest:

  • Removing stop words
  • Lemmatization / stemming
  • Using Apache Spark
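
As a rough illustration of the stop-word suggestion, the filtering can happen before the count. The STOP_WORDS set below is a hypothetical, deliberately tiny list chosen for the sample data, and count_content_words is an illustrative name, not part of any library; a real list would be much longer.

```python
from collections import Counter

# Hypothetical stop-word list for illustration only.
STOP_WORDS = frozenset(["the", "is", "are", "this", "there", "and"])

def count_content_words(sentence, stop_words=STOP_WORDS):
    """Count words after dropping stop words, most common first."""
    kept = [w for w in sentence.lower().split() if w not in stop_words]
    return Counter(kept).most_common()

print(count_content_words("more words and even more words another phrase begins here"))
```

With the first sample line, the three most frequent words (the, are, is) all disappear, which shows why this step matters when looking for the most common words.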