Wrong line number in parse Exception
I have a simple language defined in pyparsing. The parsing works well, but the error messages show the wrong line number. Here is the main part of the code:
communications = Group( Suppress(CaselessLiteral("communications")) + op + ZeroOrMore(communicationList) + cl + semicolon)
language = Suppress(CaselessLiteral("language")) + (CaselessLiteral("cpp")|CaselessLiteral("python")) + semicolon
componentContents = communications.setResultsName('communications') & language.setResultsName('language') & gui.setResultsName('gui') & options.setResultsName('options')
component = Suppress(CaselessLiteral("component")) + identifier.setResultsName("name") + op + componentContents.setResultsName("properties") + cl + semicolon
CDSL = idslImports.setResultsName("imports") + component.setResultsName("component")
It reports the correct line number only up to component, but for any error inside the component (i.e. in componentContents) it just gives the line number where the component starts. For example, this is a sample of the parsed text:
import "/robocomp/interfaces/IDSLs/Test.idsl";
Component publish
{
Communications
{
requires test;
implements test;
};
language python;
};
Here, if I leave out the semicolon after python or after test, it reports (line:4, col:1), i.e. at the {.
This is how pyparsing behaves, not a bug, and it needs some extra help to report the error where you expect it.
When pyparsing fails to match somewhere inside a complex expression, it unwinds its parse stack back to the last point where a complete alternative expression could still match. You know that once "component" has matched, any failure after it must be an error in the component definition, but pyparsing does not. So when a failure occurs after the opening keyword, pyparsing backs up and reports that the expression starting with that keyword could not be matched.
In a grammar of commands like this, the keywords are usually unambiguous. For example, after matching "component", anything that is not an identifier followed by a body in braces is an error. You can tell pyparsing not to back up past "component" by replacing the "+" operator after it with the "-" operator.
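As a minimal sketch of that difference, the toy grammar below is built twice, once with "+" and once with "-" after each leading keyword, and then fed a block with a missing semicolon. The `block` and `requires` keywords, the `demo` name, and the helpers `build` and `error_line` are made up for this illustration and are not part of your grammar.

```python
from pyparsing import (CaselessKeyword, Group, ParseBaseException, Suppress,
                       Word, ZeroOrMore, alphas, alphanums)

LBRACE, RBRACE, SEMI = map(Suppress, "{};")
name = Word(alphas, alphanums + "_")
BLOCK, REQUIRES = map(CaselessKeyword, ["block", "requires"])


def build(strict):
    """Return a toy grammar; strict=True uses '-' after each leading keyword."""
    if strict:
        item = Group(REQUIRES - name + SEMI)
        block = Group(BLOCK - name + LBRACE + ZeroOrMore(item) + RBRACE)
    else:
        item = Group(REQUIRES + name + SEMI)
        block = Group(BLOCK + name + LBRACE + ZeroOrMore(item) + RBRACE)
    return ZeroOrMore(block)


def error_line(parser, text):
    """Parse text and return the line number of the reported error, or None."""
    try:
        parser.parseString(text, parseAll=True)
    except ParseBaseException as err:
        return err.lineno
    return None


# The semicolon after "requires b" is missing.
text = "block demo {\n  requires a;\n  requires b\n}\n"

loose_line = error_line(build(strict=False), text)
strict_line = error_line(build(strict=True), text)
print("error line with '+':", loose_line)
print("error line with '-':", strict_line)
```

With "+", the failed block is silently abandoned by the outer ZeroOrMore and the error is reported back at line 1. With "-", the failure is raised as a ParseSyntaxException, which the repetition does not swallow, so the reported line falls inside the block, at or just after the missing semicolon.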
Looking at your grammar, I'll step back and write a short BNF for it (always good practice):
communications ::= 'communications' '{' communicationList* '}' ';'
language ::= 'language' ('cpp' | 'python') ';'
componentContents ::= communications | language | gui | options
component ::= 'component' identifier '{' componentContents+ '}' ';'
CDSL ::= idslImports component
When a grammar contains keywords, I always recommend using Keyword or CaselessKeyword, not Literal or CaselessLiteral. The Literal classes do not enforce word boundaries, so if I were to use Literal("no") as part of the grammar, it could match the leading "no" of "not", "none", "nothing", etc.
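A quick way to see the word-boundary difference is the sketch below. The `matches` helper is a hypothetical name introduced only for this check.

```python
from pyparsing import Keyword, Literal, ParseException


def matches(expr, text):
    """Return True if expr matches at the start of text, else False."""
    try:
        expr.parseString(text)
    except ParseException:
        return False
    return True


lit_hit = matches(Literal("no"), "nothing")    # Literal ignores word boundaries
kw_hit = matches(Keyword("no"), "nothing")     # Keyword requires a word boundary
kw_exact = matches(Keyword("no"), "no thanks")

print(lit_hit, kw_hit, kw_exact)
```

Here the Literal accepts the "no" at the front of "nothing", while the Keyword only matches when "no" stands as a whole word.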
This is how I would implement this BNF. (I am using the shorthand form of setResultsName, which I find keeps the grammar clearer.):
from pyparsing import (CaselessKeyword, Group, Suppress, ZeroOrMore,
                       pyparsing_common)

LBRACE,RBRACE,SEMI = map(Suppress, "{};")
identifier = pyparsing_common.identifier
# keywords - extend as needed
(IMPORT, COMMUNICATIONS, LANGUAGE, COMPONENT, CPP,
PYTHON, REQUIRES, IMPLEMENTS) = map(CaselessKeyword, """
IMPORT COMMUNICATIONS LANGUAGE COMPONENT CPP PYTHON
REQUIRES IMPLEMENTS""".split())
# keyword-leading expressions, use '-' operator to prevent backtracking once significant keyword is parsed
communicationItem = Group((REQUIRES | IMPLEMENTS) - identifier + SEMI)
communications = Group( COMMUNICATIONS.suppress() - LBRACE + ZeroOrMore(communicationItem) + RBRACE + SEMI)
language = Group(LANGUAGE.suppress() - (CPP | PYTHON) + SEMI)
# gui, options and idslImports come from the rest of your grammar (not shown here)
componentContents = communications('communications') & language('language') & gui('gui') & options('options')
component = Group(COMPONENT - identifier("name") + Group(LBRACE + componentContents + RBRACE)("properties") + SEMI)
CDSL = idslImports("imports") + component("component")
Parsing your sample with:
sample = """\
Component publish
{
Communications
{
requires test;
implements test;
};
language python;
};
"""
component.runTests([sample])
gives:
[['COMPONENT', 'publish', [[['REQUIRES', 'test'], ['IMPLEMENTS', 'test']], ['PYTHON']]]]
[0]:
  ['COMPONENT', 'publish', [[['REQUIRES', 'test'], ['IMPLEMENTS', 'test']], ['PYTHON']]]
  - name: 'publish'
  - properties: [[['REQUIRES', 'test'], ['IMPLEMENTS', 'test']], ['PYTHON']]
    - communications: [['REQUIRES', 'test'], ['IMPLEMENTS', 'test']]
      [0]:
        ['REQUIRES', 'test']
      [1]:
        ['IMPLEMENTS', 'test']
    - language: ['PYTHON']
(By the way, I like your use of the "&" operator to match the various contents in any order with the pyparsing class Each. I think this makes for a friendlier and more robust parser. It turns out that Each has a slight conflict with the "-" operator; I'll have to fix that in the next version.)
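For readers who have not used "&" before, here is a small sketch of Each matching its parts in either order. The `owner` clause and the sample inputs are invented for the illustration; only the `language` clause resembles the grammar above, and its value is simplified to any word.

```python
from pyparsing import CaselessKeyword, Group, Suppress, Word, alphas

SEMI = Suppress(";")
LANGUAGE, OWNER = map(CaselessKeyword, ["language", "owner"])

language = Group(LANGUAGE.suppress() + Word(alphas) + SEMI)
owner = Group(OWNER.suppress() + Word(alphas) + SEMI)

# '&' builds an Each: both parts are required, in any order.
body = language("language") & owner("owner")

first = body.parseString("language python; owner alice;", parseAll=True)
second = body.parseString("owner alice; language python;", parseAll=True)

print(first["language"][0], first["owner"][0])
print(second["language"][0], second["owner"][0])
```

Both inputs give the same named results, so the code that consumes the parse does not need to care about the order in which the clauses were written.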