Regular expression to match special characters EXCEPT hyphen(s) mixed with number(s)

We are currently using [^a-zA-Z0-9] with Java's replaceAll function to strip special characters from a string. It has come to our attention that we need to allow hyphen(s) when they are mixed with number(s).

Examples for which hyphens will not be matched:

  • 1-2-3
  • -1-23-4562
  • - 1 --- 2--3 --- 4 -
  • - 9 - - 7
  • 425-12-3456

Examples for which hyphens will be matched:

  • - A - B - C
  • wali-showcase
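To make the starting point concrete, here is a minimal sketch of what the current pattern does to one example from each list. The class and method names are made up for illustration; only the pattern itself comes from the question.

```java
final class CurrentStripDemo {
    // The pattern the question says is in use today: it removes every
    // character that is not an ASCII letter or digit, hyphens included.
    static String stripAll(String input) {
        return input.replaceAll("[^a-zA-Z0-9]", "");
    }

    public static void main(String[] args) {
        // Hyphens are removed here, although they should be kept.
        System.out.println(stripAll("425-12-3456"));
        // The hyphen is removed here, which is the desired behavior.
        System.out.println(stripAll("wali-showcase"));
    }
}
```

The first call shows the problem: the hyphens in a purely numeric token are stripped along with everything else.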

We think we have formulated a regex to match the latter criterion using this SO question as a reference, but we have no idea how to combine it with the original regex [^a-zA-Z0-9].

We want to do this for a Lucene search string because of the way Lucene's standard tokenizer works at index time:

It splits words at hyphens unless there is a number in the token, in which case the whole token is interpreted as a product number and is not split.

+3
java regex




4 answers


You cannot do this with a single regex. (Well ... maybe in Perl.)

(edit: Okay, you can do this with a variable-length negative lookbehind, which Java seems to be able to do (almost uniquely!); see Cyborgx37's answer. Regardless, imo, you shouldn't be doing this with a single regex. :))

What you can do is split the string into words and process each word separately. My Java is pretty awful, so here is some Python instead:



import re

# Example input; in practice this is the raw search string
string = 'wal-mart 1-2-3'

# Precompile some regex
looks_like_product_number = re.compile(r'\A[-0-9]+\Z')
not_wordlike = re.compile(r'[^a-zA-Z0-9]')
not_wordlike_or_hyphen = re.compile(r'[^-a-zA-Z0-9]')

# Split on anything that is not a letter, number, or hyphen -- BUT dots
# must be followed by whitespace
words = re.split(r'(?:[^-.a-zA-Z0-9]|[.]\s)+', string)

stripped_words = []
for word in words:
    if '-' in word and not looks_like_product_number.match(word):
        stripped_word = not_wordlike.sub('', word)
    else:
        # Product number; allow dashes
        stripped_word = not_wordlike_or_hyphen.sub('', word)

    stripped_words.append(stripped_word)

# pass_to_lucene is a placeholder for whatever hands the query to Lucene
pass_to_lucene(' '.join(stripped_words))

      

When I run this with 'wal-mart 1-2-3', I get back 'walmart 1-2-3'.

But honestly, the code above reproduces most of what the Lucene tokenizer is already doing. I think you'd be better off just copying StandardTokenizer into your own project and modifying it to do what you want.
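Since the question is about Java, here is a rough Java port of the per-word approach, offered as a sketch rather than a drop-in solution. The class name QueryCleaner and the method name clean are invented for this example. The regexes are the same as in the Python, and the only intentional difference is that empty words are dropped before joining.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

final class QueryCleaner {
    // A "product number" here means a word made only of digits and hyphens.
    private static final Pattern PRODUCT_NUMBER = Pattern.compile("[-0-9]+");
    private static final Pattern NOT_WORDLIKE = Pattern.compile("[^a-zA-Z0-9]");
    private static final Pattern NOT_WORDLIKE_OR_HYPHEN = Pattern.compile("[^-a-zA-Z0-9]");
    // Split on anything that is not a letter, digit, hyphen, or dot;
    // a dot only splits when it is followed by whitespace.
    private static final Pattern SPLITTER = Pattern.compile("(?:[^-.a-zA-Z0-9]|[.]\\s)+");

    static String clean(String query) {
        List<String> cleaned = new ArrayList<>();
        for (String word : SPLITTER.split(query)) {
            boolean stripHyphens =
                    word.contains("-") && !PRODUCT_NUMBER.matcher(word).matches();
            Pattern junk = stripHyphens ? NOT_WORDLIKE : NOT_WORDLIKE_OR_HYPHEN;
            String stripped = junk.matcher(word).replaceAll("");
            if (!stripped.isEmpty()) {   // assumption: empty words are not wanted
                cleaned.add(stripped);
            }
        }
        return String.join(" ", cleaned);
    }

    public static void main(String[] args) {
        System.out.println(clean("wal-mart 1-2-3")); // walmart 1-2-3
    }
}
```

Note that, as in the Python, a word only counts as a product number when it consists entirely of digits and hyphens, which is narrower than "contains a digit".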

+2




Have you tried this:



[^a-zA-Z0-9-]

+1




This problem is tricky because Java doesn't allow unbounded repetition in lookbehinds, which is basically what you need here. Hence the 100-character limit, which you can increase if you expect words to be longer.

This should work:

(?<![0-9]\S{0,100})[^a-zA-Z](?!\S{0,100}[0-9])|(?<=[0-9]\S{0,100})[^a-zA-Z0-9-](?=\S{0,100}[0-9])

      

A simple replaceAll() with this expression should handle it.

For example, consider this input:

--9-+-a--7 wal-mart

      

Applying the above expression, with each match replaced by a zero-length string, produces the following result:

--9--a--7 walmart

      

You can try it here: http://fiddle.re/ynyu

Note that this expression depends on words being separated by whitespace (spaces, tabs, line feeds, etc.). Other characters, such as commas and semicolons, will cause the expression to treat two words as one. For example, "--- 9-a-0-, wal-mart" will be treated as one word.

EDIT The last paragraph from my previous edit was wrong. If you want to treat other characters as delimiters, I recommend replacing them with a space in a first pass (for example, replacing ',' with ' ').

I am primarily a .NET programmer, otherwise I would give you full Java code to use this pattern.
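For reference, a minimal Java sketch of how this pattern could be wired up. The class and method names are hypothetical; the pattern is the one above with backslashes doubled for a Java string literal.

```java
import java.util.regex.Pattern;

final class BoundedLookbehindDemo {
    // Same expression as above. The {0,100} bounds keep the lookbehinds
    // finite, so tokens longer than 100 characters are not fully covered.
    private static final Pattern SPECIALS = Pattern.compile(
            "(?<![0-9]\\S{0,100})[^a-zA-Z](?!\\S{0,100}[0-9])"
            + "|(?<=[0-9]\\S{0,100})[^a-zA-Z0-9-](?=\\S{0,100}[0-9])");

    static String strip(String input) {
        // Replace every match with the empty string.
        return SPECIALS.matcher(input).replaceAll("");
    }

    public static void main(String[] args) {
        System.out.println(strip("--9-+-a--7 wal-mart")); // --9--a--7 walmart
    }
}
```

Compiling the pattern once into a static field avoids recompiling it on every call, which String.replaceAll would otherwise do.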

+1




Forgive me for posting a second answer instead of editing the first, but I'm not entirely sure whether the goal is to remove dashes only where they are immediately surrounded by letters, or to remove dashes only in strings that contain no numbers at all. This is the solution for the latter case. My other answer covers the former case.

This pattern

String newValue = myString.replaceAll("[^\\sA-Za-z0-9\\-]|((?<!\\S*\\d)-(?!\\S*\\d))", "");

      

should do it. There are two main parts joined by an or (|). The first part matches every character that is not a letter, digit, whitespace, or dash, since we want to strip those characters no matter what. The second half of the or matches any dash that has no digit anywhere before it in the token and no digit anywhere after it in the token (i.e., no digits in the token at all, where a token consists of non-whitespace, or \S, characters). This is achieved with a negative lookbehind and a negative lookahead. We do make use of the fact that Java supports variable width in these lookaheads/lookbehinds. Of course, the replacement is just an empty string.

I have to admit, although the syntax for using regex is painful in Java (in the case where you have to use Pattern.compile, etc.), at least the engine supports some nice features. Though perhaps not as good as .NET according to Eevee.

I agree with the others, however, that this is not something you would usually want to do in a single regex. I don't know your specific situation, but a simple check to determine whether a token is a product number, followed by applying the correct pattern, would be much more readable.
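As a sketch of that more readable alternative, the following assumes that "product number" means any whitespace-separated token containing at least one digit, which is the same rule the regex above encodes. The class and method names are invented for illustration.

```java
import java.util.regex.Pattern;
import java.util.stream.Collectors;

final class TokenwiseStripper {
    private static final Pattern WHITESPACE = Pattern.compile("\\s+");
    private static final Pattern HAS_DIGIT = Pattern.compile("\\d");

    static String strip(String input) {
        return WHITESPACE.splitAsStream(input.trim())
                // Tokens with a digit keep their dashes; all others lose them.
                .map(token -> HAS_DIGIT.matcher(token).find()
                        ? token.replaceAll("[^A-Za-z0-9-]", "")
                        : token.replaceAll("[^A-Za-z0-9]", ""))
                .filter(token -> !token.isEmpty())
                .collect(Collectors.joining(" "));
    }

    public static void main(String[] args) {
        System.out.println(strip("wal-mart 425-12-3456")); // walmart 425-12-3456
    }
}
```

Unlike the single-regex version, this collapses runs of whitespace into single spaces, which is usually harmless for a search string.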

+1








