Escaping special characters in Apache Pig data

I am using Apache Pig to process some data.
Several lines in my dataset contain special characters (e.g. #,{}[]).


This Pig book says that these characters cannot be escaped.

So how can I process my data without removing special characters?

I was thinking about replacing them but would like to avoid it.




3 answers

Have you tried loading your data? There is no way to escape these characters when they are part of the values in a tuple, bag, or map, but there is no problem loading them as part of a string. Just declare that field as type chararray.


The only problem you will face is when your lines contain the character that Pig uses as the field separator, for example if you load with USING PigStorage(',') and your lines contain commas. But as long as you don't tell Pig to parse the field as a map, the characters #, [ and ] will be handled just fine.



The easiest way:

lines = LOAD 'inputLocation' USING TextLoader() AS (unparsedString:chararray);


The TextLoader just reads each line of input into a string no matter what's inside that line. Then you can use your own parsing logic.
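
As a rough illustration of what "your own parsing logic" could look like, here is a minimal sketch in plain Java. The class and method names (RawLineParser, splitFields) are hypothetical and not part of Pig, and the tab delimiter is only an assumption about the data. Logic like this could be wrapped in a Pig UDF, but the sketch uses only the standard library so it can be run on its own.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Pig: splits one raw line on a chosen delimiter.
class RawLineParser {

    // Split on the literal delimiter. Pattern.quote keeps characters such as
    // '[' or '{' from being read as regex syntax, and the -1 limit keeps
    // trailing empty fields instead of dropping them.
    static List<String> splitFields(String line, String delimiter) {
        return Arrays.asList(line.split(Pattern.quote(delimiter), -1));
    }

    public static void main(String[] args) {
        // Assumed sample: a tab-separated line whose fields contain #, {}, [] and commas.
        String line = "id#1\t{a,b}\t[x#y]";
        for (String field : splitFields(line, "\t")) {
            System.out.println(field);
        }
    }
}
```

Because the line is treated as an ordinary string, the special characters pass through untouched. Only the delimiter you pick has any meaning.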



When writing a loader function, instead of returning tuples that carry e.g. maps as strings (and thus relying on Utf8StorageConverter later to get the conversion right):

// The map is passed as raw bytes and only converted to a map later.
Tuple tuple = tupleFactory.newTuple(1);
tuple.set(0, new DataByteArray("[age#22, name#joel]"));


you can build a Java map and set it on the tuple directly:

HashMap<String, Object> map = new HashMap<String, Object>(2);
map.put("age", 22);
map.put("name", "joel");
tuple.set(0, map);


This is especially useful if you have to do the parsing at load time anyway.
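
For illustration, the step from the raw text to a Java map might look like the sketch below. MapFieldParser and parseMap are hypothetical names, not part of Pig, and the rules are assumptions made for this example: entries are separated by commas, keys and values by #, whole numbers become Integer and everything else stays a String. It does not handle values that themselves contain , or #, and it is not meant to reproduce Pig's own conversion rules.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of Pig: turns "[age#22, name#joel]" into a Java map.
class MapFieldParser {

    static Map<String, Object> parseMap(String text) {
        String body = text.trim();
        if (!body.startsWith("[") || !body.endsWith("]")) {
            throw new IllegalArgumentException("Expected [key#value, ...] but got: " + text);
        }
        // Drop the surrounding brackets.
        body = body.substring(1, body.length() - 1).trim();

        Map<String, Object> map = new LinkedHashMap<>();
        if (body.isEmpty()) {
            return map;
        }
        for (String entry : body.split(",")) {
            // The first '#' separates the key from the value.
            int hash = entry.indexOf('#');
            if (hash < 0) {
                throw new IllegalArgumentException("Missing '#' in entry: " + entry);
            }
            String key = entry.substring(0, hash).trim();
            String value = entry.substring(hash + 1).trim();
            map.put(key, toValue(value));
        }
        return map;
    }

    // Assumption for this sketch: whole numbers become Integer, anything else stays a String.
    private static Object toValue(String value) {
        try {
            return Integer.valueOf(value);
        } catch (NumberFormatException e) {
            return value;
        }
    }

    public static void main(String[] args) {
        System.out.println(parseMap("[age#22, name#joel]"));
    }
}
```

Inside a loader, the returned map could then be passed to tuple.set(0, map) in the same way as the hand-built map above.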


