Optimizing String Parsing with Python

I have a string in a form 'AB(AB(DDC)C)A(BAAC)DAB(ABC)'


  • Each symbol represents the element ( A

    , B

    , C

    or D

  • Between the parentheses on the right, there is a child of each element (which may not be present).

In the example, having 'AB(AB(DDC)C)A(BAAC)DA'

, the top level would be AB (AB (DDC) C) A (BAAC) DA[A, B, A, D, A]

, and the corresponding children would be [None, AB(DDC)C, BAAC, None, None]

. Children must also be parsed recursively.

I have implemented the solution here:

def parse_string(string):

    i = 0                                                                       
    parsed = []                                                                 

    while i < len(string):                                                      
        if string[i] in ('A', 'B', 'C', 'D'):                                        
            parsed.append([string[i], None])                                    
            i += 1                                                              
        elif string[i] == '(':                                                  
            open_brakets = 1                                                    
            i += 1                                                              
            j = i                                                               
            while open_brakets:                                                 
                if string[j] == '(':                                            
                    open_brakets += 1                                           
                elif string[j] == ')':                   
                    open_brakets -= 1                    
                j += 1
            # Parse the children as well
            parsed[-1][-1] = parse_string(string[i:j - 1])       
            i = j                                                               
            i += 1                                                              

    return parsed

print parse_string('AB(AB(DDC)C)A(BAAC)DAB(ABC)') 


Although I think this is a bit ugly and I'm sure it's not very efficient.

I wonder if there is a way to do this with Python in a cleaner / faster / more elegant way? External libraries are allowed (especially if they are written in C !: -P).


Other examples of lines that should work:


In general, the length of a line is not limited by the number of children, their length, or the number of nested levels.


source to share

3 answers

If you need to parse the inner brackets recursively as well:

def parse_tree(tree, string, start=0):
    index = start
    while index < len(string):
        current = string[index]
        if current == "(":
            child = tree[-1][1]
            child_parsed = parse_tree(child, string, index+1)
            index += child_parsed + 2 # adds 2 for the parentheses
        elif current == ")":
            tree.append((current, []))
            index += 1
    return index - start
tree = []
print(parse_tree(tree, 'abc(abc(defg)d)de(f)gh'))


How this works can be thought of as a state machine. The state machine accepts node definitions until it sees open parentheses, in which it pushes a new context (like calling a recursive function) onto the parsing stack to parse the contents of the parentheses. When parsing an internal context, the closed parentheses invoke the context.

Another alternative that can scale better if you have more complex grammars is to use a parsing library such as PyParsing:

from pyparsing import OneOrMore, Optional, oneOf, alphas, Word, Group, Forward, Suppress, Dict

# define the grammar
nodes = Forward()
nodeName = oneOf(list(alphas))
nodeChildren = Suppress('(') + Group(nodes) + Suppress( ')')
node = Group(nodeName + Optional(nodeChildren))
nodes <<= OneOrMore(node)



Partial libraries like PyParsing allow you to define an easy-to-read declarative grammar.

Answer the original non-recursive parsing . One way to do this is with itertools (only accumulates since Python 3.2 and up, the itertools docs have a pure python implementation to accumulate for use in older versions). This avoids using indexes:

from itertools import takewhile, accumulate
PARENS_MAP = {'(': 1, ')': -1}
def parse_tree(tree, string):
    string = iter(string)
    while string:
        current = next(string)
        if current == "(":
            child = iter(string)
            child = ((c, PARENS_MAP.get(c, 0)) for c in child)
            child = accumulate(child, lambda a,b: (b[0], a[1]+b[1]))
            child = takewhile(lambda c: c[1] >= 0, child)
            child = (c[0] for c in child)
            tree[-1][1] = "".join(child)
            tree.append([current, None])


I'm not entirely sure if this is faster or more elegant, but I think using explicit indices is much easier to write, understand and modify.



You can use regex

to parse text.

As a more general line, consider the following line:



You can use re.findall

to find an external template:

>>> re.findall(r'(?<=\))\w+(?=\(|$)|^\w+(?=\()',s)
['AB', 'A', 'DAB', 'DDD']


And use this regex with re.split

to concatenate strings within parentheses:

>>> re.split(r'(?<=\))\w+(?=\(|$)|^\w+(?=\()',s)
['', '(AB(DDC)C)', '(BAAC)', '(ABC)', '']


Bribes explains the previous regex :

This regex has 2 parts that are combined with a pip ( |

) token that works like a boolean or


  • (?<=\))\w+(?=\(|$)


this regex will match any combination of word characters ( \w+

) that precede )

and then (

or $

, that $ is the end of the line modifier that matches the end of the line.

Note using $

for case DDD


  1. ^\w+(?=\()


this regex will match any combination of word characters that appear at the beginning of the line (the modifier ^

will match the beginning of the line) and then(



As for the slightly ugly, it is in the eye of the beholder.

In terms of speed, it will be difficult for you to improve the code.

ADDED: This is how I could do it in C ++. You can adapt to Python if you like it. This shows how to do it with recursion. Top-level function topLevel("blah blah")


bool seeLetter(char* &s){
    if (*s=='A' || *s=='B' || *s=='C' || *s=='D'){
         return true;
    } else {
         return false;

bool seeChar(char* &s, char c){
    if (*s == c){s++; return true;}
    return false;

bool seeList(char* &s){
        if (seeLetter(s)){
        } else if (seeChar(s, '(')){
            if (!seeList(s)) return false;
            if (!seeChar(s, ')')) return false;
        } else break;
    return true;

bool topLevel(char* &s){
    if (!seeList(s)) return false;
    return (*s == '\0'); // check for garbage at end




All Articles