Errors when trying to remove parentheses from text in Python

I worked on a little bit of code to take a bunch of histograms from other files and put them together. To make sure the legend displays correctly, I tried to take the titles of these original histograms and strip out some information that is no longer needed.

The section I don't need takes the form (A mass = 200 GeV). I had no problem removing what's inside the parentheses. Unfortunately, everything I've tried for the parentheses themselves either has no effect, breaks the code that removes the text, or throws errors.

I've tried the suggestions from "Remove brackets and text in a file using Python" and "How to remove text in parentheses using regex?".

My current attempt gives me this error:

'str' object cannot be interpreted as an integer

      

This is the section of code:

histo_name = ''

# this is a list of things we do not want to show up in our legend keys
REMOVE_LIST = ["(A mass = 200 GeV)"]

# these two lines use the re module to remove things from a piece of text
# that are specified in the remove list
remove = '|'.join(REMOVE_LIST)
regex = re.compile(r'\b('+remove+r')\b')

# Creating the correct name for the stacked histogram
for histo in histos:

    if histo == histos[0]:

        # place_holder contains the edited string we want to set the
        # histogram title to
        place_holder = regex.sub('', str(histo.GetName()))
        histo_name += str(place_holder)
        histo.SetTitle(histo_name)

    else:

        place_holder = regex.sub(r'\(\w*\)', '', str(histo.GetName()))
        histo_name += ' + ' + str(place_holder)
        histo.SetTitle(histo_name)

      

The if/else bit is only there because the first histogram I pass in is a special case: I want it to keep its own name, and the rest to be appended in sequence, hence the " + ". It is probably not relevant, but I thought I would include it.

Apologies if I did something really obvious, I'm still learning!



2 answers


From the Python docs: to match the literals '(' or ')', use \( or \), or enclose them inside a character class: [(], [)].

So, use one of the above patterns instead of plain parentheses in your regex, e.g. REMOVE_LIST = [r"\(A mass = 200 GeV\)"] (a raw string, so the backslashes reach the regex engine unchanged).
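
As a quick check of the two spellings mentioned above, here is a minimal sketch. The sample text is made up for illustration; only the `re` calls are standard.

```python
import re

# Hypothetical histogram title, used only for illustration
text = "hist (A mass = 200 GeV)"

# Option 1: backslash-escaped parentheses
escaped = re.compile(r"\(A mass = 200 GeV\)")
# Option 2: each parenthesis inside a character class
char_class = re.compile(r"[(]A mass = 200 GeV[)]")

result_escaped = escaped.sub("", text)
result_char_class = char_class.sub("", text)

# Both patterns remove the parentheses together with their contents
print(repr(result_escaped))
print(repr(result_char_class))
```

Either form should behave the same here; the backslash version is usually the easier one to read.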



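EDIT: The problem is the use of \b in the regex. According to the docs linked above, \b matches only at the boundary between a word character and a non-word character. Parentheses are non-word characters, so when "(" is preceded by a space, or ")" is followed by a space or the end of the string, the pattern cannot match. My seemingly working example: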

import re

# Test input
myTestString = "someMess (A mass = 200 GeV) and other mess (remove me if you can)"
replaceWith = "HEY THERE FRIEND"

# What to remove
removeList = [r"\(A mass = 200 GeV\)", r"\(remove me if you can\)"]

# Build the regex
remove = r'(' + '|'.join(removeList) + r')'
regex = re.compile(remove)

# Try it!
out = regex.sub(replaceWith, myTestString)

# See if it worked
print(out)
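
A side note on the TypeError from the question: it appears to come from the call in the else branch rather than from the pattern itself. A compiled pattern's sub method takes (repl, string, count=0), so passing three positional arguments puts the histogram name into count. A minimal sketch, assuming a made-up name in place of histo.GetName():

```python
import re

regex = re.compile(r'\(A mass = 200 GeV\)')
name = "ttbar (A mass = 200 GeV)"  # hypothetical stand-in for histo.GetName()

# Three positional arguments: repl=r'\(\w*\)', string='', count=name.
# count must be an integer, so this raises TypeError.
try:
    regex.sub(r'\(\w*\)', '', name)
    error_message = None
except TypeError as exc:
    error_message = str(exc)
print(error_message)

# Intended call on a compiled pattern: replacement first, then the text
cleaned = regex.sub('', name)
print(repr(cleaned))
```

The three-argument form (pattern, repl, string) belongs to the module-level re.sub, not to the compiled pattern.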

      



There are 2 problems you are facing:

  • You are concatenating your strings into a regex pattern without escaping them.
  • You are using word boundaries, but some of your entries start/end with a non-word character (so r"\)\b" will never match a ")" that is followed by a space or by the end of the string).
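
The two problems above can be seen side by side in a small sketch. The sample text is hypothetical; the entry is the one from the question.

```python
import re

text = "Signal (A mass = 200 GeV) sample"  # hypothetical title
entry = "(A mass = 200 GeV)"

# Problem 1: unescaped, the parentheses become a capturing group,
# so only the inner text is matched and the literal ( ) are left behind.
unescaped = re.compile(entry)
# Problem 2: escaped but wrapped in \b. There is no word character
# next to "(" or ")", so no word boundary exists there.
bounded = re.compile(r'\b' + re.escape(entry) + r'\b')
# Escaped and without word boundaries: matches the literal text.
plain = re.compile(re.escape(entry))

found_unescaped = unescaped.findall(text)
found_bounded = bounded.findall(text)
found_plain = plain.findall(text)

print(found_unescaped)  # inner text only
print(found_bounded)    # empty list
print(found_plain)      # full parenthesised entry
```

This would also explain why removing the text inside the parentheses worked in the question while the parentheses themselves stayed.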

This fixes the first problem, but not the second (it only finds More+[fun]+text):

import re

REMOVE_LIST = ["(A mass = 200 GeV)", "More+[fun]+text"]
# Escape every entry so regex metacharacters are treated literally
remove = '|'.join([re.escape(x) for x in REMOVE_LIST])
ptrn = r'\b(?:'+remove+r')\b'
print(ptrn)
regex = re.compile(ptrn)
print(regex.findall("Now, (A mass = 200 GeV) and More+[fun]+text inside"))

      



You will need a smarter way to build your pattern. Like this:

import re
REMOVE_LIST = ["(A mass = 200 GeV)", "More+[fun]+text"]

# Group the escaped entries by whether they start and/or end with a word character
remove_with_boundaries = '|'.join([re.escape(x) for x in REMOVE_LIST if re.match(r'\w', x) and re.search(r'\w$', x)])
remove_with_no_boundaries = '|'.join([re.escape(x) for x in REMOVE_LIST if not re.match(r'\w', x) and not re.search(r'\w$', x)])
remove_with_right_boundaries = '|'.join([re.escape(x) for x in REMOVE_LIST if not re.match(r'\w', x) and re.search(r'\w$', x)])
remove_with_left_boundaries = '|'.join([re.escape(x) for x in REMOVE_LIST if re.match(r'\w', x) and not re.search(r'\w$', x)])

# Collect only the non-empty groups so the pattern never starts with a stray '|'
parts = []
if remove_with_boundaries:
    parts.append(r'\b(?:' + remove_with_boundaries + r')\b')
if remove_with_left_boundaries:
    parts.append(r'\b(?:' + remove_with_left_boundaries + r')')
if remove_with_right_boundaries:
    parts.append(r'(?:' + remove_with_right_boundaries + r')\b')
if remove_with_no_boundaries:
    parts.append(r'(?:' + remove_with_no_boundaries + r')')
ptrn = '|'.join(parts)

print(ptrn)
regex = re.compile(ptrn)
print(regex.findall("Now, (A mass = 200 GeV) and More+[fun]+text inside"))

      

See IDEONE demo

For the two entries ["(A mass = 200 GeV)", "More+[fun]+text"] as input, the generated regex is \b(?:More\+\[fun\]\+text)\b|(?:\(A\ mass\ \=\ 200\ GeV\)) (on Python 3.7 and later, re.escape leaves = unescaped, so it is printed as = rather than \=), and the output is ['(A mass = 200 GeV)', 'More+[fun]+text'].
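
To tie this back to the legend titles in the question, here is a rough sketch of using an escaped pattern with sub and joining the cleaned names with " + ". The list of names is invented, and clean_title is a hypothetical helper; plain strings stand in for histo.GetName(), so no ROOT objects are needed to run it.

```python
import re

REMOVE_LIST = ["(A mass = 200 GeV)"]
# Entries here start and end with non-word characters, so no \b is used
pattern = re.compile('|'.join(re.escape(x) for x in REMOVE_LIST))

def clean_title(name):
    """Hypothetical helper: drop unwanted fragments, then tidy whitespace."""
    return ' '.join(pattern.sub('', name).split())

# Hypothetical names standing in for histo.GetName() results
names = ["ttbar (A mass = 200 GeV)", "WJets (A mass = 200 GeV)", "ZZ"]

legend_title = ' + '.join(clean_title(n) for n in names)
print(legend_title)
```

Splitting and re-joining on whitespace is one way to get rid of the space left behind where the parenthesised part used to be.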
