Regular expression to replace strings in source code

I'm looking for a regex that replaces every string literal in the original source with a constant value such as "string". It also has to allow for escaping of the string-delimiter character, which is written as a doubled delimiter (e.g. "he said ""hello""").

To clarify, I'll give some examples of input and expected output:

input: print("hello world, how are you?")
output: print("string")

input: print("hello" + "world")
output: print("string" + "string")

# here the tricky part:
input: print("He told her ""how you doin?"", and she said ""I'm fine, thanks""")
output: print("string")

      

I am working in Python but I think it is language agnostic.

EDIT: According to one of the answers, this requirement may not be suitable for a regex. I'm not sure whether that is true, as I'm not an expert. Put into words, I want to find the character sequences between double quotes, where even-length runs of contiguous double quotes are ignored. That sounds to me like something a DFA can recognize.

Thanks.



3 answers


If you are parsing Python code, save yourself the trouble and let the standard library's parser module do the heavy lifting.
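As a rough illustration of leaning on the standard library for Python source, here is a minimal sketch that uses the tokenize module rather than the parser module named above (parser was removed in Python 3.10, so tokenize is swapped in here). The function name mask_string_tokens is made up for this example, and it assumes every string literal sits on a single line; f-strings on newer Python versions are tokenized differently and are left untouched.

```python
import io
import tokenize


def mask_string_tokens(source, placeholder='"string"'):
    """Replace every single-line STRING token with a placeholder (hypothetical helper)."""
    lines = source.splitlines(keepends=True)
    tokens = list(tokenize.generate_tokens(io.StringIO(source).readline))
    # Walk the tokens backwards so that earlier column offsets stay valid
    # while later parts of the same line are being rewritten.
    for tok in reversed(tokens):
        if tok.type == tokenize.STRING and tok.start[0] == tok.end[0]:
            row = tok.start[0] - 1  # token rows are 1-based
            line = lines[row]
            lines[row] = line[:tok.start[1]] + placeholder + line[tok.end[1]:]
    return "".join(lines)


print(mask_string_tokens('print("hello" + "world")\n'), end="")
# prints: print("string" + "string")
```

Note that doubled quotes are not an escape in Python itself, so the tokenizer sees the third example from the question as several adjacent literals and this sketch would emit one placeholder per literal.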

If you're writing your own parser for some custom language, it's terribly tempting to start by just hacking together a bunch of regular expressions, but don't. You will dig yourself into an unmaintainable mess. Read up on parsing techniques and do it right (Wikipedia can help).



This regex does the trick for all three of your examples:

re.sub(r'"(?:""|[^"])+"', '"string"', original)
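To check that claim, here is a small self-contained sketch that runs the pattern above over the three examples from the question. The names QUOTED, QUOTED_OR_EMPTY and mask are made up for this illustration, and the "*" variant is my own assumption, not part of the original answer.

```python
import re

# Pattern from the answer above: an opening quote, then one or more of
# either a doubled quote ("") or any non-quote character, then a closing quote.
QUOTED = re.compile(r'"(?:""|[^"])+"')

# Hypothetical variant (not from the original answer): "*" instead of "+"
# so that an empty literal such as "" is also matched.
QUOTED_OR_EMPTY = re.compile(r'"(?:""|[^"])*"')


def mask(source, pattern=QUOTED):
    """Replace every match of the pattern with the constant "string"."""
    return pattern.sub('"string"', source)


examples = [
    'print("hello world, how are you?")',
    'print("hello" + "world")',
    'print("He told her ""how you doin?"", and she said ""I\'m fine, thanks""")',
]
for line in examples:
    print(mask(line))
# prints:
# print("string")
# print("string" + "string")
# print("string")
```

One caveat, as far as I can tell: because of the "+", an empty literal is not matched on its own, so an input like x = "" + "a" gets mangled. Passing QUOTED_OR_EMPTY as the pattern appears to handle that case.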

      



Maybe:

re.sub(r"[^\"]\"[^\"].*[^\"]\"[^\"]",'"string"',input)

      

EDIT:

No, it won't work for the final example.

I don't think your requirements are regular: they cannot be matched by a regular expression. This is because, at the heart of it, you need to match any odd number of " characters grouped together, since that is your delimiter.

I think you will have to do it manually, counting the " characters.
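For what it's worth, the manual approach can be sketched as a small hand-written scanner. This is only an illustration under the question's rules (a doubled quote inside a literal is an escaped quote); the function name mask_quoted is invented here, and an unterminated literal is simply copied through unchanged.

```python
def mask_quoted(source, placeholder='"string"'):
    """Replace each double-quoted literal with a placeholder (hypothetical helper)."""
    out = []
    i = 0
    n = len(source)
    while i < n:
        ch = source[i]
        if ch != '"':
            # Outside a literal: copy the character through unchanged.
            out.append(ch)
            i += 1
            continue
        # Found an opening quote: scan forward for the closing quote.
        j = i + 1
        while j < n:
            if source[j] == '"':
                if j + 1 < n and source[j + 1] == '"':
                    j += 2  # doubled quote: an escaped quote, still inside the literal
                    continue
                break  # a lone quote closes the literal
            j += 1
        if j >= n:
            # Unterminated literal: keep the remaining text as it is.
            out.append(source[i:])
            break
        out.append(placeholder)
        i = j + 1
    return "".join(out)


print(mask_quoted('print("He told her ""how you doin?"", and she said ""I\'m fine, thanks""")'))
# prints: print("string")
```

The scanner only ever looks one character ahead, so it never needs to count how many quotes it has seen in total.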



There's a very good string-matching regex recipe on ActiveState. If it doesn't work right for your last example, it should be a fairly trivial extension to group adjacent quoted strings.
