Regular expression that excludes characters inside parentheses

I have strings of the following kinds:

BILL SMITH (USA)
WINTHROP (FR)
LORD AT WAR (GB)
KIM SMITH

      

The match must satisfy the following restrictions:

1. The name is all caps.
2. The name is 2 to 18 characters long.
3. The match must not end with whitespace or a carriage return.
4. The country abbreviation inside the parentheses must be excluded.
5. Some names have no country in parentheses, and they must be matched as well.

After applying my regex, I would like to get the following:

BILL SMITH (USA) => BILL SMITH
WINTHROP (FR) => WINTHROP
LORD AT WAR (GB) => LORD AT WAR
KIM SMITH => KIM SMITH

      

I came up with the following regex, but I am not getting any matches:

String.scan(\([A-Z \s*]{1,18})(^?!(\([A-Z]{1,3}\)))\)

      

I've been banging my head about this for a while, so if anyone can point out a bug I would appreciate it.

UPDATE:

I have received great answers; however, so far none of the regex solutions meets all the constraints. The tricky part seems to be that some lines have a country in parentheses and some don't. In one case, the strings without a country did not match; in the other, the match returned the correct name along with the country abbreviation, minus its parentheses. (See the comments on the second answer.) One point of clarification: every name I need to match starts at the beginning of its string. Not sure if this helps or not. Thanks again for your help.



3 answers


The biggest mistake is that you wrote (^?!...) where you meant (?=...). The former means "an optional start-of-line anchor, followed by a literal ! and then ..., all inside a capture group"; the latter means "a position in the string that is followed by ...". Correcting this, along with a few other tweaks, and adding the requirement that the name end with a letter, we get:

([A-Z\s]{1,17}[A-Z])(?=\s*\([A-Z]{1,3}\))
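To see how the lookahead behaves on the sample strings, here is a minimal sketch. It assumes the pattern is written as one capture group for the name followed by a lookahead for the country code; the constant and variable names are made up for illustration.

```ruby
# Lookahead version: the name is captured only when a "(XX)" country code follows.
# WITH_COUNTRY, samples and lookahead_results are illustrative names, not from the answer.
WITH_COUNTRY = /([A-Z\s]{1,17}[A-Z])(?=\s*\([A-Z]{1,3}\))/

samples = ["BILL SMITH (USA)", "WINTHROP (FR)", "LORD AT WAR (GB)", "KIM SMITH"]

# String#[] with a regex and a group number returns that capture, or nil if nothing matched.
lookahead_results = samples.map { |s| s[WITH_COUNTRY, 1] }

p lookahead_results
```

Because the lookahead requires a parenthesized code, a string such as KIM SMITH produces no match here, which is the gap described in the question's update.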

      




Update based on the OP's comments: since the name always starts at the beginning of the string, you can use \A to anchor your pattern there. Then you can get rid of the lookahead assertion. This:

\A[A-Z][A-Z\s]{0,16}[A-Z]

      

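matches the start of the string, followed by an uppercase letter, followed by up to 16 characters that are uppercase letters or whitespace, and then a final uppercase letter. Because the match must end with a letter, the trailing space and the parenthesized country code are left out.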
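As a quick check of the anchored pattern against the four sample strings, a sketch along these lines could be used (the constant and variable names are only for illustration):

```ruby
# Anchored version: no lookahead needed, the match simply stops at the last capital letter.
# NAME, inputs and anchored_results are illustrative names, not from the answer.
NAME = /\A[A-Z][A-Z\s]{0,16}[A-Z]/

inputs = ["BILL SMITH (USA)", "WINTHROP (FR)", "LORD AT WAR (GB)", "KIM SMITH"]

# String#[] with a regex returns the matched text, or nil when there is no match.
anchored_results = inputs.map { |s| s[NAME] }

p anchored_results
```

Note that this pattern does not reject longer names; it would return the longest prefix of at most 18 characters that ends in a capital letter, so a separate length check may still be wanted.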



Here's one solution:

^((?:[A-Z]|\s){2,18}?)(?:\s\([A-Z]+\))?$

      



Take a look at it on Rubular. Note that the limit of 18 characters applies to the part before the parentheses; I'm not sure exactly how you want it to behave. If you want to make sure that the whole string is no more than 18 characters, I suggest you simply check the length separately, e.g. unless line.length <= 18 ...

Similarly, if you want to make sure there is no whitespace at the end, I recommend using line.strip. This will greatly reduce the complexity of the required Regexp and make your code more readable.

Edit: this now also works when the name is not followed by parentheses.
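Combining the suggestions above (strip first, then match) might look like the sketch below. The helper name extract_name is hypothetical, and the pattern is written with a plain lazy {2,18}? quantifier, which is an assumption about the intended behavior rather than a quote from the answer.

```ruby
# Name (2 to 18 capitals/whitespace, matched lazily), optionally followed by " (XX)".
# NAME_WITH_OPTIONAL_COUNTRY and extract_name are illustrative names.
NAME_WITH_OPTIONAL_COUNTRY = /^((?:[A-Z]|\s){2,18}?)(?:\s\([A-Z]+\))?$/

def extract_name(line)
  stripped = line.strip                      # drop trailing whitespace and newlines first
  stripped[NAME_WITH_OPTIONAL_COUNTRY, 1]    # first capture group, or nil if no match
end

p extract_name("BILL SMITH (USA)\n")
p extract_name("KIM SMITH")
```

Stripping before matching keeps the whitespace handling out of the regex, which is the readability point made above.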



You can also just use gsub to remove the part(s) you don't want. To remove everything in parentheses, along with any whitespace before it, you can use:

str.gsub(/\s*\([^)]*\)/, '')
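Applied to the sample strings from the question, the substitution could be sketched like this (the variable names are only for illustration):

```ruby
# Remove any "(...)" group together with the whitespace in front of it.
# raw_lines and cleaned are illustrative names, not from the answer.
raw_lines = ["BILL SMITH (USA)", "WINTHROP (FR)", "LORD AT WAR (GB)", "KIM SMITH"]

cleaned = raw_lines.map { |str| str.gsub(/\s*\([^)]*\)/, '') }

p cleaned
```

This approach only deletes the parenthesized part; it does not check the all-caps or length restrictions, so those would need a separate test if they matter.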

      
