Antlr4 is slow on RaspberryPi

We are trying to parse a custom language on a Raspberry Pi B using ANTLR4 (Python 2 target). However, it is too slow to do anything serious: parsing a few lines takes about ten seconds. This is my code:

Transposeur.py:

# -*- coding:Utf-8 -*-

from antlr4 import *
from TransposeurLexer import TransposeurLexer
from TransposeurParser import TransposeurParser
import sys
from Listener import Listener

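def transpose(file_path):
  # Build the lexer -> token stream -> parser pipeline for the input file.
  input_stream = FileStream(file_path)
  lexer = TransposeurLexer(input_stream)
  stream = CommonTokenStream(lexer)
  parser = TransposeurParser(stream)
  # Parse from the start rule `myfile`.
  tree = parser.myfile()
  # Walk the parse tree; the listener collects its results in `array`.
  listener = Listener()
  walker = ParseTreeWalker()
  walker.walk(listener, tree)
  return listener.array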

      

Transposeur.g4:

grammar Transposeur;

myfile: block+;

block: title
     | paragraph
     ;

title: firstTitle
     | secondTitle
    ;

firstTitle: '#' ' '? unit+ newline;
secondTitle: '##' ' '? unit+ newline;

paragraph: unit+ newline;

unit: low+
    | upper
    | (low | cap)* cap (low | cap)*
    | ponctuation
    | number
    | space
    ;

upper: cap cap+;
number: digit+;

low: LOW;
cap: CAP;
newline: NEWLINE;
ponctuation: SPACE? PONCT;
space: SPACE;
digit: DIGIT;

LOW: [a-z] | 'ç' | 'é' | 'è' | 'à' | 'â' | 'ê' | 'ù' | 'î' | 'ô' | 'û' | 'ë' | 'ï' | 'ü' | 'œ';
CAP: [A-Z];
NEWLINE: '\r'? '\n';
SPACE: ' ';
DIGIT: [0-9];
PONCT: ',' | '!' | '?' | ';' | '.' | ':';

      

The statement that takes the time is tree = parser.myfile(). Is there a way to make things faster?



1 answer


I first suspected the problem was low+ vs. (low | cap)* ..., where the parser might have to look arbitrarily far ahead to decide which alternative applies.

I think the real problem is that unit+ is ambiguous with respect to low+. Given the text for a unit consisting of:

      aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa

      

(fifty a's), you can parse it as:

  • one unit whose low+ is all the a's,
  • unit unit, with the first low+ being any prefix and the second low+ the rest of the a's (49 ways to place the cut),
  • unit unit unit, with the first low+ being any prefix, the last low+ any remaining suffix, and the middle low+ the characters in between (far more possibilities),
  • unit unit unit unit ...

So I think this part of your grammar is highly ambiguous, and ANTLR is exploring a huge number of alternatives trying to pick one. You're probably lucky that ANTLR is fast enough to finish at all :-}
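To get a feel for how quickly the number of possible parses grows, here is a small sketch that counts the ways a run of n lowercase letters can be cut into consecutive low+ units. The function names are made up for illustration, and the code only counts candidate parses under unit+ with unit: low+; it says nothing about how many of them ANTLR actually visits.

```python
def splits_into_k_units(n, k):
    """Ways to cut n letters into exactly k non-empty consecutive units.

    This is the number of ways to choose k-1 cut points among the
    n-1 gaps between letters, i.e. the binomial C(n-1, k-1).
    """
    result = 1
    for i in range(1, k):
        # After step i, result == C(n-1, i); the division is always exact.
        result = result * (n - i) // i
    return result


def total_splits(n):
    """Ways to cut n letters into any number (1..n) of units."""
    return sum(splits_into_k_units(n, k) for k in range(1, n + 1))


if __name__ == "__main__":
    for k in (1, 2, 3, 4):
        print("%d unit(s): %d ways" % (k, splits_into_k_units(50, k)))
    print("any number of units: %d ways" % total_splits(50))
```

For fifty letters the total equals 2 to the power 49, since each of the 49 gaps between letters is either a cut or not.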



You will have the same problem with unit+ and upper (== cap cap+).

It is not clear to me what part of the structure you really need to capture. It looks to me like you just want a string. Try rewriting it as:

unit: low
    | cap
    | ponctuation
    | number
    | space
  ; 

      

Better yet, define the unit this way:

unit: LOW | CAP | PONCT | DIGIT | SPACE ;
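If the listener still needs words, all-caps runs, and numbers, one option is to rebuild them after parsing instead of in the grammar. The sketch below is only an assumption about what the Listener might want: char_kind and group_units are hypothetical helpers, the input is assumed to be the characters of the matched units as a unicode string, and the category names simply mirror the rule names of the original grammar.

```python
from itertools import groupby

PONCT_CHARS = ",!?;.:"  # same characters as the PONCT lexer rule


def char_kind(ch):
    """Classify a single character (hypothetical helper)."""
    if ch.isdigit():
        return "number"
    if ch == " ":
        return "space"
    if ch in PONCT_CHARS:
        return "ponctuation"
    if ch.isalpha():
        return "word"
    return "other"


def group_units(chars):
    """Merge consecutive characters of the same kind into one item each."""
    groups = []
    for kind, run in groupby(chars, key=char_kind):
        text = "".join(run)
        # Two or more capitals in a row correspond to the old `upper` rule.
        if kind == "word" and len(text) >= 2 and text.isupper():
            kind = "upper"
        groups.append((kind, text))
    return groups


if __name__ == "__main__":
    print(group_units(u"Hello WORLD 42!"))
```

Because every character is consumed by exactly one unit and grouping happens in a single pass afterwards, there is only one way to read each line, so the parser has nothing to choose between.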

      
