Pig Latin: filtering records based on values in a bag
I am new to Pig Latin. I have a data file that looks like this (message, email, session, spam type).
For simplicity, the sample below uses only spam / not-spam. In the real data this field usually takes one of around 100 different values.
message1 user1@email 12345 spam
message2 user1@email 12345 spam
message3 user1@email 12345 not-spam
message10 user2@email 90879 not-spam
message11 user2@email 90879 not-spam
What I need: if any message from a user is marked as spam, filter out all of that user's messages. For the input above, the output would be:
message10 user2@email 90879 not-spam
message11 user2@email 90879 not-spam
The other 3 messages are removed because they belong to the same user / session.
I am trying to solve this using grouping and a nested FOREACH. Any help is appreciated.
DATA = LOAD './spamdata' using PigStorage() as (message:chararray, mailid:chararray, session:long, spamType:chararray);
GDATA = GROUP DATA BY (mailid,session);
GDATA looks like
GDATA: {group: (mailid: chararray,session: long),DATA: {(message: chararray,mailid: chararray,session: long,spamType: chararray)}}
All I need is to drop every group whose bag contains at least one record of type "spam".
You can try something like this:
DATA = LOAD....;
-- (mailid, session) pairs that have at least one spam message
S = FOREACH (FILTER DATA BY spamType == 'spam') GENERATE mailid, session;
SPAM = DISTINCT S;
-- rows without a spam match get nulls in the SPAM columns
JOINED = JOIN DATA BY (mailid, session) LEFT OUTER, SPAM BY (mailid, session);
RES = FOREACH (FILTER JOINED BY SPAM::mailid IS NULL)
      GENERATE $0 AS message, $1 AS mailid, $2 AS session, $3 AS spamType;
DUMP RES;
The idea is to first identify the users who are spammers. After a left outer join of the original dataset with that list, the non-spammers are the rows that have no match on the right-hand side of the join (i.e. SPAM::mailid is null).
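Since the question mentions grouping and a nested FOREACH, the same result can probably also be reached without a join. The following is an untested sketch, assuming DATA is loaded with the schema from the question; the alias names MARKED, CLEAN, rows and spamRows are made up for illustration.

```pig
DATA = LOAD './spamdata' USING PigStorage()
       AS (message:chararray, mailid:chararray, session:long, spamType:chararray);

-- one group (and one bag of records) per user / session
GDATA = GROUP DATA BY (mailid, session);

-- for each group, keep the full bag and a second bag holding only its spam records
MARKED = FOREACH GDATA {
    SPAM_ROWS = FILTER DATA BY spamType == 'spam';
    GENERATE DATA AS rows, SPAM_ROWS AS spamRows;
};

-- keep only the groups whose spam bag is empty
CLEAN = FILTER MARKED BY IsEmpty(spamRows);

-- turn the surviving bags back into individual records
RES = FOREACH CLEAN GENERATE FLATTEN(rows);
DUMP RES;
```

With the sample data, the user1@email group has a non-empty spamRows bag and is dropped as a whole, while the user2@email group is kept. With around 100 spam types in the real data, the condition in the nested FILTER would need to be adjusted to whatever counts as spam.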