Importing a generic constants file declared in Pig into other Pig files
Purpose: Define constants (%declare and %default) in a constants.pig file for code modularity and import them into other Pig files.
As per the docs (http://pig.apache.org/docs/r0.12.0/cont.html#import-macros), %declare and %default are valid macro statements.
Problem: Pig cannot find the declared parameter.
Pig file: constants.pig
%declare ACTIVE_VALUES 'UK';
Pig file: a.pig
IMPORT 'constants.pig';
A = LOAD 'a.csv' using PigStorage(',') AS (country_code:chararray, country_name:chararray);
B = FILTER A BY country_code == '$ACTIVE_VALUES';
dump B;
Input: a.csv
IN,India
US,United States
UK,United Kingdom
Error:
Error before Pig is launched
----------------------------
ERROR 2997: Encountered IOException. org.apache.pig.tools.parameters.ParameterSubstitutionException: Undefined parameter : ACTIVE_VALUES
java.io.IOException: org.apache.pig.tools.parameters.ParameterSubstitutionException: Undefined parameter : ACTIVE_VALUES
at org.apache.pig.impl.PigContext.doParamSubstitution(PigContext.java:414)
at org.apache.pig.Main.runParamPreprocessor(Main.java:810)
at org.apache.pig.Main.run(Main.java:588)
at org.apache.pig.Main.main(Main.java:170)
Caused by: org.apache.pig.tools.parameters.ParameterSubstitutionException: Undefined parameter : ACTIVE_VALUES
at org.apache.pig.tools.parameters.PreprocessorContext.substitute(PreprocessorContext.java:355)
at org.apache.pig.tools.parameters.PreprocessorContext.substitute(PreprocessorContext.java:303)
at org.apache.pig.tools.parameters.PigFileParser.input(PigFileParser.java:67)
at org.apache.pig.tools.parameters.PigFileParser.Parse(PigFileParser.java:43)
at org.apache.pig.tools.parameters.ParameterSubstitutionPreprocessor.parsePigFile(ParameterSubstitutionPreprocessor.java:95)
at org.apache.pig.tools.parameters.ParameterSubstitutionPreprocessor.genSubstitutedFile(ParameterSubstitutionPreprocessor.java:76)
at org.apache.pig.impl.PigContext.doParamSubstitution(PigContext.java:410)
... 3 more
My understanding of IMPORT is that the contents of the imported Pig file are executed and made available to the calling Pig script. If so, the declared parameter should be available in the script that imports the file.
Any input or thoughts on having a generic Pig script file that declares constants and is imported into other Pig files to achieve code modularity?
Update:
A JIRA issue has already been raised for this; see the link below for details.
The IMPORT keyword is used to import macros, not constants. %declare and %default are preprocessor statements, and their scope is the rest of the script in which they appear. If you declare a parameter in one script and then import that script from another, it won't work, because the parameter is out of scope in the importing script.
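To make the scoping point concrete, here is a toy model in Python of a single-file parameter preprocessor. It is only a sketch and not Pig's actual implementation: the function name substitute and its regular expressions are made up for illustration. It mirrors what the stack trace in the question suggests, namely that substitution runs over the text of a.pig on its own, so a %declare that lives in another file is never seen.

```python
import re

def substitute(script_text):
    """Toy single-file preprocessor (hypothetical, not Pig's real code).

    Collects %declare lines from this text only, then replaces $NAME
    references. An IMPORT line is left untouched, so nothing declared
    in the imported file is visible here.
    """
    params = {}
    out = []
    for line in script_text.splitlines():
        m = re.match(r"\s*%declare\s+(\w+)\s+'([^']*)'\s*;?\s*$", line)
        if m:
            # Remember the parameter; the %declare line itself is dropped
            params[m.group(1)] = m.group(2)
            continue

        def repl(match):
            name = match.group(1)
            if name not in params:
                raise KeyError("Undefined parameter : " + name)
            return params[name]

        out.append(re.sub(r"\$(\w+)", repl, line))
    return "\n".join(out)
```

In this model, a script that only contains the IMPORT line followed by a '$ACTIVE_VALUES' reference fails with an undefined-parameter error, while the same reference preceded by the %declare line in the same text is substituted.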
Both statements are valid in a macro file if you use the declared variable inside the macro. If you need to define constants outside of a script for modularity, use a parameter file:
ACTIVE_VALUES = 'UK'
And then run your Pig script like this:
pig -param_file your_params_file.properties -f your_script.pig
If you really want to use IMPORT, you can create a macro that takes care of filtering with this constant value:
%declare ACTIVE_VALUES 'UK';
DEFINE my_custom_filter(A) RETURNS B {
    $B = FILTER $A BY $0 == '$ACTIVE_VALUES';
};
Then import it as you did in your script, but instead of calling FILTER directly, call your own macro:
IMPORT 'macro.pig';
A = LOAD 'a.csv' using PigStorage(',') AS (country_code:chararray, country_name:chararray);
B = my_custom_filter(A);
dump B;
Although it is a bit of a hack, another possible solution is to use a Python controller script that concatenates the two files before compiling them. You can read about controllers in the Pig documentation on embedded Pig.
This is roughly what it might look like, and it should disturb your current structure the least:
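#!/usr/bin/python
# Needed for the 'with' statement on Jython 2.5; harmless on later versions
from __future__ import with_statement
from org.apache.pig.scripting import Pig

def readfile(f):
    """Return the lines of file f without their trailing newlines."""
    out = []
    with open(f, 'r') as infile:
        for line in infile:
            out.append(line.rstrip('\n'))
    return out

constants = readfile('constants.pig')
script = readfile('a.pig')

# Compile the concatenation of the constants and the script
P = Pig.compile('\n'.join(constants + script))

# Run (no parameters are bound here)
result = P.bind({}).runSingle()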
However, you can also pass the variables you want to change in the dictionary that is the argument to the bind method. This is the same process as parameter substitution, and I recommend doing it this way.
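A minimal sketch of that bind-based approach follows. It assumes the constants live in a hypothetical constants.properties file in the same NAME = 'value' format as the parameter file shown earlier, and that a.pig no longer contains the IMPORT line. The helper names parse_params and run_with_params are made up for illustration; Pig.compileFromFile, bind and runSingle are the embedded Pig calls.

```python
def parse_params(lines):
    """Parse NAME = 'value' lines into a dict, skipping blanks and # comments."""
    params = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith('#'):
            continue
        name, _, value = line.partition('=')
        # Drop surrounding whitespace and one layer of quotes around the value
        params[name.strip()] = value.strip().strip('\'"')
    return params


def run_with_params(script_path, params):
    """Compile the Pig script and bind $NAME references from the params dict."""
    # Imported here because this module only exists when run through Pig (Jython)
    from org.apache.pig.scripting import Pig
    compiled = Pig.compileFromFile(script_path)
    return compiled.bind(params).runSingle()

# Example use inside a controller run with "pig controller.py" (file names assumed):
#   params = parse_params(open('constants.properties'))
#   stats = run_with_params('a.pig', params)
```

Keeping the constants in one plain file and binding them from the controller means every script can share them without relying on IMPORT to carry parameters across files.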