Escaping special characters in Apache Pig data
I am using Apache Pig to process some data.
Several lines in my dataset contain special characters, i.e. (#,{}[]).
The Programming Pig book says that you cannot escape these characters.
So how can I process my data without removing special characters?
I was thinking about replacing them but would like to avoid it.
Thanks.
Have you tried loading your data? There is no way to escape these characters when they are part of values in a tuple, bag, or map, but there is no problem loading them as part of a string. Just declare the field as type chararray.
The only problem you will face is if your lines contain the character that Pig uses as a field separator, for example if you load with USING PigStorage(',') and your lines contain commas. But as long as you don't tell Pig to parse a field as a map, #, [ and ] will be handled just fine.
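To illustrate the separator issue, here is a minimal sketch in plain Java. It is not Pig's actual code: DelimiterSketch and splitFields are hypothetical names, and the sample line is made up. It only mimics the general idea of splitting a line on a single delimiter character, to show that #, [, ], { and } pass through untouched while a comma inside the data breaks a comma-delimited load.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Pig: splits a line on one delimiter
// character and gives no special meaning to any other character.
final class DelimiterSketch {

    static List<String> splitFields(String line, char delimiter) {
        List<String> fields = new ArrayList<>();
        int start = 0;
        for (int i = 0; i < line.length(); i++) {
            if (line.charAt(i) == delimiter) {
                fields.add(line.substring(start, i));
                start = i + 1;
            }
        }
        fields.add(line.substring(start)); // last field, possibly empty
        return fields;
    }

    public static void main(String[] args) {
        String line = "joel\t[tag#1],{x}"; // made-up sample record

        // Tab as separator: two fields, special characters kept intact.
        System.out.println(splitFields(line, '\t'));

        // Comma as separator: the comma inside the data splits the record
        // in the wrong place.
        System.out.println(splitFields(line, ','));
    }
}
```

So if your data may contain commas, picking a separator that never occurs in the values (a tab, for instance) avoids the problem.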
When writing a loader function, instead of returning tuples with e.g. maps as a String (and hence relying on Utf8StorageConverter later to get the conversion right):
Tuple tuple = tupleFactory.newTuple( 1 );
tuple.set(0, new DataByteArray("[age#22, name#joel]"));
you can build a Java map and set it on the tuple directly:
HashMap<String, Object> map = new HashMap<String, Object>(2);
map.put("age", 22);
map.put("name", "joel");
tuple.set(0, map);
This is especially useful if you need to do parsing at load time anyway.