Apache Pig - parsing a URL string into a map

Question

I am new to Pig and am wondering about parsing logs. I am currently extracting the important tags from my URL string via REGEX_EXTRACT, but I think I should convert the whole string to a map. I'm working on a sample dataset using 0.10, but I'm starting to get really lost. My URL string can actually contain repeated tags, so my map should be a map with bags as values. Then I could just write any follow-up jobs using flatten.

Here is my test data. The last row shows my problem with duplicate tags.

`pig -x local`
grunt> cat test.log
test1   user=3553&friend=2042&system=262
test2   user=12523&friend=26546&browser=firfox
test2   user=205&friend=3525&friend=353

I am using TOKENIZE to create an inner bag.

grunt> A = load 'test.log' as (f:chararray, url:chararray);
grunt> B = foreach A generate f, TOKENIZE(url,'&') as attr;
grunt> describe B;
B: {f: chararray,attr: {tuple_of_tokens: (token: chararray)}}

grunt> dump B;
(test1,{(user=3553),(friend=2042),(system=262)})
(test2,{(user=12523),(friend=26546),(browser=firfox)})
(test2,{(user=205),(friend=3525),(friend=353)})

I am using nested foreach on these relations, but I think it has some limitations that I am not aware of.

grunt> C = foreach B {
>> D = foreach attr generate STRSPLIT($0,'=');
>> generate f, D as taglist;
>> }

grunt> dump C;
(test1,{((user,3553)),((friend,2042)),((system,262))})
(test2,{((user,12523)),((friend,26546)),((browser,firfox))})
(test2,{((user,205)),((friend,3525)),((friend,353))})

grunt> G = foreach C {
>> H = foreach taglist generate TOMAP($0.$0, $0.$1) as tagmap;
>> generate f, H as alltags;
>> }

grunt> describe G;
G: {f: chararray,alltags: {tuple_of_tokens: (tagmap: map[])}}

grunt> dump G;
(test1,{([user#3553]),([friend#2042]),([system#262])})
(test2,{([user#12523]),([friend#26546]),([browser#firfox])})
(test2,{([user#205]),([friend#3525]),([friend#353])})

grunt> MAPTEST = foreach G generate f, flatten(alltags.tagmap);
grunt> describe MAPTEST;
MAPTEST: {f: chararray,null::tagmap: map[]}

grunt> res = foreach MAPTEST generate $1#'user';
grunt> dump res;
(3553)
()
()
(12523)
()
()
(205)
()
()

grunt> res = foreach MAPTEST generate $1#'friend';
grunt> dump res;
()
(2042)
()
()
(26546)
()
()
(3525)
(353)

So it's not terrible. I think it is close, but not perfect. My bigger problem is that I need to group the tags, since there are 2 "friend" tags on the last row, at least before I add them to the map.

grunt> dump C;
(test1,{((user,3553)),((friend,2042)),((system,262))})
(test2,{((user,12523)),((friend,26546)),((browser,firfox))})
(test2,{((user,205)),((friend,3525)),((friend,353))})

I tried to put a group inside a nested foreach, but it throws an error.

grunt> G = foreach C {
>> H = foreach taglist generate *;
>> I = group H by $1;
>> generate I;
>> }
2013-01-18 14:56:31,434 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1200:   <line 34, column 10>  Syntax error, unexpected symbol at or near 'H'

Does anyone have any ideas on how to get closer to turning this URL string into a map of bags? I figured there would be a Pig macro or something, since this looks like a common use case. Any ideas are greatly appreciated.

apache-pig

jeff 18 jan. 13 at 22:02

2 answers

I thought I would update this in case someone tries to do this in the future. I never found a pure Pig way, so I went the full UDF route. Unfortunately I am not a programmer by profession, so the Java examples lost me a bit. But I managed to hack together a Python UDF that has worked so far. It still needs cleaning up to handle errors and whatnot, but it is usable now. I'm sure there is a better way to do this in Java.

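#!/usr/bin/python

# outputSchema is expected to be provided by Pig when this script is registered as a Jython UDF.
@outputSchema("tagmap:map[{(value:chararray)}]")
def inst_url_parse(url_query):
        # Split the query string into key=value tokens.
        query_vals = url_query.split("&")
        url_query_map = {}
        for each_val in query_vals:
                # Split on the first "=" only, so a value may itself contain "=".
                kv = each_val.split("=", 1)
                if len(kv) != 2:
                        # Skip tokens without "=" instead of raising IndexError.
                        continue
                if kv[0] in url_query_map:
                        url_query_map[kv[0]].append(kv[1])
                else:
                        url_query_map[kv[0]] = [kv[1]]

        return url_query_map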

I really like having our URL query stored this way, since each key can have 0, 1, or N values. Downstream jobs just call flatten(tagmap#'key') in a foreach, which is pretty painless compared to what I was doing before. We can develop much faster using this. We also store the data in HCatalog as

querymap<string, array<string>>

and it seems to work fine for querying and viewing from Hive using LATERAL VIEW. Who knew?
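To check the expected shape outside Pig, the same key-to-list-of-values structure can be sketched in standalone Python 3. This is only an illustration and not the UDF itself: parse_query is a made-up name, and it leans on urllib.parse.parse_qs from the standard library, which also percent-decodes keys and values (the UDF above does not).

```python
from urllib.parse import parse_qs


def parse_query(url_query):
    """Return a dict mapping each key to the list of all its values."""
    # keep_blank_values=True keeps a key such as "friend=" with an empty string
    return parse_qs(url_query, keep_blank_values=True)


if __name__ == "__main__":
    # Two rows from the sample test.log, including the repeated "friend" tag
    for row in ("user=3553&friend=2042&system=262",
                "user=205&friend=3525&friend=353"):
        print(parse_query(row))
```

Each key maps to a list even when it occurs once, which matches the map<string, array<string>> layout described above.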

Sorry if this is too opinionated for a Q&A site.

jeff 02 Feb At 2:57 am

reo katoa · Accepted Answer · 2013-01-18T23:11:01+0000

Good news and bad news. The good news is that this is pretty straightforward. The bad news is that you won't be able to achieve what I assume is your ideal (all tag/value pairs in a single map) without resorting to a UDF.

A few tips first: FLATTEN the result of STRSPLIT so that you don't have a useless level of nesting in your tuples, and FLATTEN again inside the nested foreach so that you don't need to do it later. Also, STRSPLIT has an optional third argument to specify the maximum number of output strings. Use this to guarantee a schema for its output. Here's the modified version of your script:

A = load 'test.log' as (f:chararray, url:chararray);
B = foreach A generate f, TOKENIZE(url,'&') as attr;
C = foreach B {
    D = foreach attr generate FLATTEN(STRSPLIT($0,'=',2)) AS (key:chararray, val:chararray);
    generate f, FLATTEN(D);
};
E = foreach (group C by (f, key)) generate group.f, TOMAP(group.key, C.val);
dump E;

Output:

(test1,[user#{(3553)}])
(test1,[friend#{(2042)}])
(test1,[system#{(262)}])
(test2,[user#{(12523),(205)}])
(test2,[friend#{(26546),(3525),(353)}])
(test2,[browser#{(firfox)}])

Once you are done splitting out the tags and values, group by the tag as well to get your bag of values. Then put this into a map. Note that this assumes that if you have two rows with the same ID (test2, here), you want to combine them. If that is not the case, you will need to create a unique identifier for each row.
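As a rough analogy in plain Python (not Pig code), the grouping step can be pictured like this. The sketch takes (id, key, value) triples shaped like the flattened relation C and collects the values per (id, key) pair, which is what grouping by (f, key) does. group_values is a hypothetical helper name, and Python lists stand in for Pig bags, whose ordering Pig does not guarantee.

```python
from collections import defaultdict


def group_values(rows):
    """Collect values per (id, key) pair, similar to GROUP C BY (f, key)."""
    grouped = defaultdict(list)
    for row_id, key, val in rows:
        grouped[(row_id, key)].append(val)
    return dict(grouped)


# Flattened (f, key, val) triples taken from the sample test.log
rows = [
    ("test1", "user", "3553"), ("test1", "friend", "2042"), ("test1", "system", "262"),
    ("test2", "user", "12523"), ("test2", "friend", "26546"), ("test2", "browser", "firfox"),
    ("test2", "user", "205"), ("test2", "friend", "3525"), ("test2", "friend", "353"),
]

for (row_id, key), vals in group_values(rows).items():
    # One single-key map per (id, key) group, like the TOMAP call
    print(row_id, {key: vals})
```

Because both test2 rows share the same ID, their values end up in the same group, which is the combining behaviour noted above.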

Unfortunately, there seems to be no way to combine maps without resorting to a UDF, but it should be about the simplest UDF possible. Something like this (untested):

import java.io.IOException;
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;

import org.apache.pig.EvalFunc;
import org.apache.pig.data.DataBag;
import org.apache.pig.data.Tuple;

public class COMBINE_MAPS extends EvalFunc<Map> {
    @Override
    @SuppressWarnings("unchecked")
    public Map<String, Object> exec(Tuple input) throws IOException {
        if (input == null || input.size() != 1) { return null; }

        // Input tuple is a singleton containing the bag of maps
        DataBag b = (DataBag) input.get(0);

        // Create map that we will construct and return
        Map<String, Object> m = new HashMap<String, Object>();

        // Iterate through the bag, adding the elements from each map
        Iterator<Tuple> iter = b.iterator();
        while (iter.hasNext()) {
            Tuple t = iter.next();
            m.putAll((Map<String, Object>) t.get(0));
        }

        return m;
    }
}

With that UDF, you can do the following:

F = foreach (group E by f) generate COMBINE_MAPS(E.$1);

Note that in this UDF, if any of the input maps overlap in their keys, one will overwrite the other, and there is no way to tell in advance which will "win". If that might be a problem, you need to add some kind of error-checking code to the UDF.
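The overwrite behaviour, and one possible alternative, can be illustrated in plain Python (an analogy only, not Pig or the Java UDF). Here combine_overwrite mimics putAll, where a later map replaces the earlier value for the same key, while combine_concat is a hypothetical variant that concatenates the value lists instead. Unlike a Pig bag, the Python list below has a fixed order, so the "winner" is predictable here.

```python
def combine_overwrite(maps):
    """Mimic Map.putAll: a later map replaces earlier values for a key."""
    merged = {}
    for m in maps:
        merged.update(m)
    return merged


def combine_concat(maps):
    """Hypothetical alternative: keep every value by concatenating lists."""
    merged = {}
    for m in maps:
        for key, vals in m.items():
            merged.setdefault(key, []).extend(vals)
    return merged


# Example maps with an overlapping "friend" key
maps = [{"friend": ["26546"]}, {"friend": ["3525", "353"]}, {"user": ["205"]}]
print(combine_overwrite(maps))  # "friend" keeps only the last list seen
print(combine_concat(maps))     # "friend" keeps all three values
```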
