Pig Latin: filtering records based on values in a bag

I am new to Pig Latin. I have a data file that looks like this (message, email, user/session ID, spam type):

For simplicity, I've only used spam / not-spam here. In the real data this field takes around 100 different values.

message1  user1@email  12345      spam
message2  user1@email  12345      spam
message3  user1@email  12345      not-spam

message10  user2@email  90879      not-spam
message11  user2@email  90879      not-spam

      

What I need: if any message from a user is marked as spam, filter out all of that user's messages. For the input above, the output should look like this:

message10  user2@email  90879      not-spam
message11  user2@email  90879      not-spam

      

The other 3 messages are dropped because they belong to the same user / session as a spam message.

I am trying to solve this with GROUP and a nested FOREACH. Any help is appreciated.

DATA = LOAD './spamdata' using PigStorage() as (message:chararray, mailid:chararray,  session:long, spamType:chararray);
GDATA = GROUP DATA BY (mailid,session);

      

The schema of GDATA looks like this:

GDATA: {group: (mailid: chararray,session: long),DATA: {(message: chararray,mailid: chararray,session: long,spamType: chararray)}}

      

All I need is to drop every group whose bag contains at least one item of type "spam", and keep the rows of the remaining groups.

1 answer


You could try something like this:

DATA = LOAD....;
S =  FOREACH (FILTER DATA BY spamType == 'spam') GENERATE mailid, session;
SPAM = DISTINCT S;
JOINED = JOIN DATA BY (mailid, session) LEFT OUTER, SPAM BY (mailid, session);

RES = FOREACH (FILTER JOINED by SPAM::mailid is null)
  GENERATE $0 AS message, $1 AS mailid, $2 AS session, $3 AS spamType;

dump RES;

      



The idea is to first identify the (mailid, session) pairs that have at least one spam message. After left-outer-joining the original dataset against that list, the non-spammers are exactly the rows with no match on the right-hand side of the join (that is, where SPAM::mailid is null).
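If you would rather stay with the GROUP and nested FOREACH approach from the question, a minimal sketch could look like the following. It has not been run against your data. It assumes the file is tab-delimited (the PigStorage() default) and that 'spam' is the only value that should disqualify a user. COUNTED, CLEAN, spam_rows and spam_count are names made up for this example.

```
DATA = LOAD './spamdata' USING PigStorage() AS (message:chararray, mailid:chararray, session:long, spamType:chararray);
GDATA = GROUP DATA BY (mailid, session);

-- For each (mailid, session) group, count the spam rows with a nested FILTER
-- and carry that count along with every original row of the group.
COUNTED = FOREACH GDATA {
    spam_rows = FILTER DATA BY spamType == 'spam';
    GENERATE FLATTEN(DATA), COUNT(spam_rows) AS spam_count;
};

-- Keep only the rows whose group contained no spam at all.
CLEAN = FILTER COUNTED BY spam_count == 0;
RES = FOREACH CLEAN GENERATE message, mailid, session, spamType;

DUMP RES;
```

With the sample input from the question, this should leave only the two user2@email rows. If several of the roughly 100 real spam types count as spam, the condition in the nested FILTER would need to list them (or test for the non-spam value with != instead).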
