Escaping special characters in Apache Pig data

I am using Apache Pig to process some data.
Several lines in my dataset contain special characters such as (#,{}[]).


The book "Programming Pig" says that you cannot escape these characters.

So how can I process my data without removing special characters?

I was thinking about replacing them but would like to avoid it.



Have you tried loading your data? There is no way to escape these characters when they are part of the values in a tuple, bag, or map, but there is no problem at all loading them as part of a string. Just declare that field as type chararray.


The only problem you will face is if your lines contain the character that Pig uses as a field delimiter, for example if you load with USING PigStorage(',') and your lines contain commas. But as long as you don't tell Pig to parse a field as a map, the characters #, [ and ] will be handled just fine.



The easiest way:

-- "input" is a reserved keyword in Pig Latin, so the relation is named "lines" here
lines = LOAD 'inputLocation' USING TextLoader() AS (unparsedString:chararray);


The TextLoader just reads each line of input into a string no matter what's inside that line. Then you can use your own parsing logic.
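As a rough illustration of what such parsing logic could look like, here is a minimal sketch in plain Java. The class name LineSplitter and the choice of a tab as delimiter are assumptions made for the example, not part of Pig; the same logic could be wrapped in a custom UDF that takes the chararray produced by TextLoader.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Pig: splits one raw line into fields.
class LineSplitter {
    // Splits on a single delimiter character. Every other character,
    // including #, {, }, [, ] and commas, is copied through unchanged.
    static List<String> split(String line, char delimiter) {
        List<String> fields = new ArrayList<>();
        int start = 0;
        for (int i = 0; i < line.length(); i++) {
            if (line.charAt(i) == delimiter) {
                fields.add(line.substring(start, i));
                start = i + 1;
            }
        }
        // Add the last field (it may be empty).
        fields.add(line.substring(start));
        return fields;
    }
}
```

Because the split only looks at the delimiter you chose, a line such as "a#1<TAB>{b,c}<TAB>[d]" comes back as three fields with the special characters intact.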



When writing a loader function, instead of returning tuples that contain e.g. maps as a String (and thus relying later on Utf8StorageConverter to get the conversion right):

Tuple tuple = tupleFactory.newTuple( 1 );
tuple.set(0, new DataByteArray("[age#22, name#joel]"));


you can build a Java Map and set it on the tuple directly:

HashMap<String, Object> map = new HashMap<String, Object>(2);
map.put("age", 22);
map.put("name", "joel");
tuple.set(0, map);


This is especially useful if you have to do the parsing at load time anyway.
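A minimal sketch of that idea, assuming a made-up raw record format of "key=value;key=value" (the format and the class name RecordParser are hypothetical and only serve the example). The resulting map could then be passed to tuple.set(0, map) as above, so Pig never has to interpret #, [ or ] inside the values.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, not part of Pig: turns one raw record into a Java Map.
class RecordParser {
    // Parses "key=value;key=value". Values are kept as plain strings,
    // so characters such as #, [ and ] inside them need no escaping.
    static Map<String, Object> parse(String record) {
        Map<String, Object> map = new HashMap<>();
        for (String pair : record.split(";")) {
            int eq = pair.indexOf('=');
            if (eq < 0) {
                continue; // skip malformed or empty pairs
            }
            map.put(pair.substring(0, eq).trim(), pair.substring(eq + 1).trim());
        }
        return map;
    }
}
```

In this sketch every value stays a String; a real loader would convert values to the types its schema declares.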



