Count the number of words in each line with Pig

Question

I have a set of tweets with many different fields:

raw_tweets = LOAD 'input.tsv' USING PigStorage('\t') AS (tweet_id, text, 
in_reply_to_status_id, favorite_count, source, coordinates, entities, 
in_reply_to_screen_name, in_reply_to_user_id, retweet_count, is_retweet, 
retweet_of_id, user_id_id, lang, created_at, event_id_id, is_news);

I want to find the most common words for each date. I managed to group the texts by date:

r1 = FOREACH raw_tweets GENERATE SUBSTRING(created_at,0,10) AS a, REPLACE 
(LOWER(text),'([^a-z\\s]+)','') AS b;
r2 = group r1 by a;
r3 = foreach r2 generate group as a, r1 as b;
r4 = foreach r3 generate a, FLATTEN(BagToTuple(b.b));

It now looks like this:

(date text text3)
(date2 text2)

I removed the special characters, so only "real" words appear in the text. Example:

2017-06-18 the plants are green the dog is black there are words this is
2017-06-19 more words and even more words another phrase begins here

I want the result to look like

2017-06-18 the are is
2017-06-19 more words and

I don't care how many times a word appears; I just want to show the most common ones. If two words appear the same number of times, show both.

hadoop apache-pig

Daniela, June 18 '17 at 18:54

1 answer

cricket_007 · Answer 1 · 2017-06-19T03:48:39+0000

While I'm sure there is a way to do this entirely in Pig, it will probably be more complicated than necessary.

UDFs are the way to go, in my opinion. Python is just one option; I'll show it because it is quick to register with Pig.

For example,

input.tsv

2017-06-18  the plants are green the dog is black there are words this is
2017-06-19  more words and even more words another phrase begins here

py_udfs.py

from collections import Counter

# outputSchema is made available by Pig when this file is registered USING jython.
@outputSchema("y:bag{t:tuple(word:chararray,count:int)}")
def word_count(sentence):
    ''' Does a word count of a sentence and orders common words first '''
    if sentence is None:
        # Null input row: return an empty bag instead of raising
        return []
    # most_common() returns (word, count) pairs sorted by count, descending
    return Counter(sentence.split()).most_common()

script.pig

REGISTER 'py_udfs.py' USING jython AS py_udfs;
A = LOAD 'input.tsv' USING PigStorage('\t') as (created_at:chararray,sentence:chararray);
B = FOREACH A GENERATE created_at, py_udfs.word_count(sentence);
DUMP B;  -- in the Grunt shell, \d B is the shorthand for this

Output

(2017-06-18,{(is,2),(the,2),(are,2),(green,1),(black,1),(words,1),(this,1),(plants,1),(there,1),(dog,1)})
(2017-06-19,{(more,2),(words,2),(here,1),(another,1),(begins,1),(phrase,1),(even,1),(and,1)})
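
The output above lists every word with its count. If you only want the most frequent words, keeping ties, a minimal sketch in plain Python could look like this. Note that top_words is a hypothetical helper I'm introducing for illustration, not part of the UDF above; to call it from Pig it would presumably need its own @outputSchema decorator, as in py_udfs.py.

```python
from collections import Counter

def top_words(sentence):
    """Return the words tied for the highest count, in order of first appearance.

    Hypothetical helper for illustration; not part of the original UDF.
    """
    tokens = sentence.split()
    counts = Counter(tokens)
    if not counts:
        return []
    best = max(counts.values())
    seen = set()
    result = []
    for w in tokens:
        # Keep each top-count word once, preserving the order it first appeared in
        if counts[w] == best and w not in seen:
            seen.add(w)
            result.append(w)
    return result
```

For the first sample line this keeps only the words that occur twice (the, are, is), since every other word occurs once.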

If you are doing text analysis, I would suggest:

Removing stop words
Lemmatization / stemming
Using Apache Spark
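
As a rough illustration of the first suggestion, stop words can be filtered out before counting. This is only a sketch: STOP_WORDS is a tiny, made-up list and count_content_words is a hypothetical name; a real job would more likely take its stop-word list from an NLP library.

```python
from collections import Counter

# Hypothetical, deliberately tiny stop-word list for illustration only.
STOP_WORDS = frozenset(["the", "are", "is", "this", "there", "and"])

def count_content_words(sentence):
    """Count the words in a sentence, skipping anything in STOP_WORDS."""
    return Counter(w for w in sentence.split() if w not in STOP_WORDS)
```

With the sample data, words such as "the" and "is" no longer dominate the counts, so the remaining words say more about what the tweets are actually about.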
