Regular expression to match special characters, excluding hyphen(s) mixed with number(s)
We are currently using [^a-zA-Z0-9] in Java's replaceAll function to strip special characters from a string. It has come to our attention that we need to keep hyphen(s) when they are mixed with number(s).
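To make the problem concrete, here is a minimal sketch of the current behavior. The class and method names are made up for illustration; only the character class comes from the text above.

```java
final class CurrentStripDemo {
    // The regex currently in use: removes everything except ASCII letters and digits.
    static String strip(String input) {
        return input.replaceAll("[^a-zA-Z0-9]", "");
    }

    public static void main(String[] args) {
        System.out.println(strip("wali-showcase")); // prints walishowcase (hyphen removed, as wanted)
        System.out.println(strip("425-12-3456"));   // prints 425123456 (hyphens removed, not wanted)
    }
}
```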
Examples for which hyphens will not be matched:
- 1-2-3
- -1-23-4562
- -1---2--3---4-
- -9--7
- 425-12-3456
Examples for which hyphens will be matched:
- -A-B-C
- wali-showcase
We think we have formulated a regex to match the latter criterion using this SO question as a reference, but we have no idea how to combine it with the original regex [^a-zA-Z0-9].
We want to do this to a Lucene search string because of the way Lucene's standard tokenizer works when indexing:
It splits words at hyphens unless there is a number in the token, in which case the entire token is interpreted as a product number and is not split.
You cannot do this with a single regex. (Well ... maybe in Perl.)
(Edit: OK, you can do this with variable-length negative lookbehind, which Java, it seems, can do (nearly uniquely!); see Cyborgx37's answer. Regardless, IMO, you shouldn't be doing this with a single regex. :))
What you can do is split the string into words and process each word separately. My Java is pretty awful, so here's some reliable Python:
import re

# Precompile some regexes
looks_like_product_number = re.compile(r'\A[-0-9]+\Z')
not_wordlike = re.compile(r'[^a-zA-Z0-9]')
not_wordlike_or_hyphen = re.compile(r'[^-a-zA-Z0-9]')

string = 'wal-mart 1-2-3'  # example input

# Split on anything that's not a letter, number, or hyphen -- BUT dots
# must be followed by whitespace
words = re.split(r'(?:[^-.a-zA-Z0-9]|[.]\s)+', string)

stripped_words = []
for word in words:
    if '-' in word and not looks_like_product_number.match(word):
        # Not a product number; strip everything, dashes included
        stripped_word = not_wordlike.sub('', word)
    else:
        # Product number; allow dashes
        stripped_word = not_wordlike_or_hyphen.sub('', word)
    stripped_words.append(stripped_word)

# pass_to_lucene is a placeholder for whatever hands the string to Lucene
pass_to_lucene(' '.join(stripped_words))
When I run this with 'wal-mart 1-2-3', I get back 'walmart 1-2-3'.
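Since the question is about Java, the same split-and-process idea might look like the sketch below. This is a loose port, not a tested drop-in: the class name HyphenStripper and the method strip are invented here, and unlike the Python it drops words that end up empty instead of passing them through.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

final class HyphenStripper {
    // Same four expressions as the Python version.
    private static final Pattern LOOKS_LIKE_PRODUCT_NUMBER = Pattern.compile("[-0-9]+");
    private static final Pattern NOT_WORDLIKE = Pattern.compile("[^a-zA-Z0-9]");
    private static final Pattern NOT_WORDLIKE_OR_HYPHEN = Pattern.compile("[^-a-zA-Z0-9]");
    // Split on anything that is not a letter, digit, hyphen, or dot; a dot only
    // separates words when it is followed by whitespace.
    private static final Pattern SEPARATORS = Pattern.compile("(?:[^-.a-zA-Z0-9]|[.]\\s)+");

    static String strip(String input) {
        List<String> stripped = new ArrayList<>();
        for (String word : SEPARATORS.split(input)) {
            boolean productNumber = LOOKS_LIKE_PRODUCT_NUMBER.matcher(word).matches();
            String cleaned = (word.contains("-") && !productNumber)
                    ? NOT_WORDLIKE.matcher(word).replaceAll("")            // strip dashes too
                    : NOT_WORDLIKE_OR_HYPHEN.matcher(word).replaceAll(""); // keep dashes
            if (!cleaned.isEmpty()) {
                stripped.add(cleaned); // assumption: empty words are not worth keeping
            }
        }
        return String.join(" ", stripped);
    }

    public static void main(String[] args) {
        System.out.println(strip("wal-mart 1-2-3")); // prints walmart 1-2-3
    }
}
```

As in the Python, a token counts as a product number only if it consists solely of digits and hyphens, so a mixed token such as a1-b still loses its hyphen.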
But honestly, the code above reproduces most of what the Lucene tokenizer is already doing. I think you'd be better off just copying StandardTokenizer into your own project and modifying it to do what you want.
Have you tried this:
[^a-zA-Z0-9-]
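A quick sketch of what this character class does when passed to replaceAll; the class and method names are illustrative only. It keeps every hyphen, so number-like tokens survive, but hyphens between letters are kept as well.

```java
final class KeepAllHyphensDemo {
    static String strip(String input) {
        // The trailing '-' inside the class is a literal hyphen, so hyphens are never removed.
        return input.replaceAll("[^a-zA-Z0-9-]", "");
    }

    public static void main(String[] args) {
        System.out.println(strip("1-2-3"));    // prints 1-2-3
        System.out.println(strip("wal-mart")); // prints wal-mart
    }
}
```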
This problem is tricky because Java doesn't allow unbounded repetition in lookbehinds, which is basically what you want. I worked around it by imposing a 100-character limit, which you can increase if you expect words to be longer.
This should work:
(?<![0-9]\S{0,100})[^a-zA-Z](?!\S{0,100}[0-9])|(?<=[0-9]\S{0,100})[^a-zA-Z0-9-](?=\S{0,100}[0-9])
A simple replaceAll() with this expression should handle it.
For example, consider this input:
--9-+-a--7 wal-mart
Applying the above expression, with each match replaced by a zero-length string, produces the following result:
--9--a--7 walmart
You can try it here: http://fiddle.re/ynyu
Note that this expression depends on words being separated by whitespace (spaces, tabs, line feeds, etc.). Other characters, such as commas and semicolons, will cause the expression to treat two words as one. For example, "---9-a-0-,wal-mart" will be treated as one word.
EDIT: The last paragraph from my previous edit was wrong. If you want to treat other characters as delimiters, I recommend replacing them with a space in a first pass (for example, replacing ',' with ' ').
I am primarily a .NET programmer, otherwise I would give you full Java code to use this pattern.
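For readers who want to try the expression from Java, here is a minimal sketch. The pattern is the one given above with backslashes doubled for a Java string literal; the class name, the method names, and the choice of ',' and ';' as extra delimiters are assumptions made for illustration.

```java
import java.util.regex.Pattern;

final class BoundedLookaroundStripper {
    // The expression from this answer, split over two lines for readability.
    private static final Pattern SPECIALS = Pattern.compile(
            "(?<![0-9]\\S{0,100})[^a-zA-Z](?!\\S{0,100}[0-9])"
            + "|(?<=[0-9]\\S{0,100})[^a-zA-Z0-9-](?=\\S{0,100}[0-9])");

    // Hypothetical extra delimiters, turned into spaces in a first pass.
    private static final Pattern DELIMITERS = Pattern.compile("[,;]");

    static String strip(String input) {
        return SPECIALS.matcher(input).replaceAll("");
    }

    static String stripWithDelimiters(String input) {
        String spaced = DELIMITERS.matcher(input).replaceAll(" ");
        return strip(spaced);
    }

    public static void main(String[] args) {
        System.out.println(strip("--9-+-a--7 wal-mart"));               // prints --9--a--7 walmart
        System.out.println(strip("---9-a-0-,wal-mart"));                // prints ---9-a-0-,wal-mart
        System.out.println(stripWithDelimiters("---9-a-0-,wal-mart"));  // prints ---9-a-0- walmart
    }
}
```

One thing worth checking against your own data: because [^a-zA-Z] in the first alternative also matches whitespace, a space between two digit-free words is removed too, so "foo bar" comes back as "foobar".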
Forgive me for posting a second answer instead of editing the first, but I'm not entirely sure whether the goal is to remove dashes only where they are immediately surrounded by letters, or to remove dashes only in tokens that contain no digits at all. This is a solution for the latter case; my other answer covers the former.
This pattern
String newValue = myString.replaceAll("[^\\sA-Za-z0-9\\-]|((?<!\\S*\\d)-(?!\\S*\\d))", "");
should do it. There are two main pieces joined by an or (|). The first piece matches every character that is not whitespace, a letter, a digit, or a dash, since we want to strip those characters no matter what. The second half of the or is meant to match any dash that has no digit anywhere before it in the token and none anywhere after it in the token (i.e., no digits in the token at all, where tokens consist of all non-whitespace, or \S, characters). This is accomplished with negative lookbehind and lookahead. We do make use of the fact that Java supports variable width in these lookaheads/lookbehinds. Of course, the replacement is just an empty string.
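Support for unbounded quantifiers such as \S* inside a lookbehind differs between Java versions, so the sketch below uses a bounded variant. It is not the exact expression above: it assumes tokens of at most 100 characters and puts the digit first in the lookbehind, (?<!\d\S{0,100}), so that the assertion looks for a digit anywhere earlier in the token, as the description says. The class and method names are made up.

```java
import java.util.regex.Pattern;

final class DigitFreeDashStripper {
    private static final Pattern STRIP = Pattern.compile(
            // Always strip: anything that is not whitespace, a letter, a digit, or a dash.
            "[^\\sA-Za-z0-9\\-]"
            // Also strip a dash with no digit earlier (within 100 chars) or later in its token.
            + "|(?<!\\d\\S{0,100})-(?!\\S*\\d)");

    static String strip(String input) {
        return STRIP.matcher(input).replaceAll("");
    }

    public static void main(String[] args) {
        System.out.println(strip("wal-mart 1-2-3"));      // prints walmart 1-2-3
        System.out.println(strip("--9-+-a--7 wal-mart")); // prints --9--a--7 walmart
    }
}
```

Whitespace is left alone by both alternatives, so word boundaries are preserved.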
I have to admit, although the syntax for using regex is painful in Java (in the case where you have to use Pattern.compile, etc.), at least the engine supports some nice features. Though perhaps not as good as .NET according to Eevee.
I agree with the others, however, that this is not something you would normally do in a single regex. I don't know your specific situation, but a simple branch that determines whether a token is a product number and then applies the appropriate pattern would be much more readable.
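A minimal sketch of that branch-per-token idea follows. It assumes that any token containing a digit counts as a product number, which is one reading of the question, and that tokens are separated by whitespace; the class and method names are invented for the example.

```java
final class PerTokenStripper {
    static String strip(String input) {
        StringBuilder out = new StringBuilder();
        for (String token : input.trim().split("\\s+")) {
            // Assumption: a token with at least one digit is a product number.
            boolean hasDigit = token.matches(".*[0-9].*");
            String cleaned = hasDigit
                    ? token.replaceAll("[^a-zA-Z0-9-]", "") // product number: keep dashes
                    : token.replaceAll("[^a-zA-Z0-9]", ""); // ordinary word: strip dashes too
            if (cleaned.isEmpty()) {
                continue; // nothing left of this token
            }
            if (out.length() > 0) {
                out.append(' ');
            }
            out.append(cleaned);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(strip("--9-+-a--7 wal-mart")); // prints --9--a--7 walmart
    }
}
```

Runs of whitespace collapse to a single space here, which may or may not matter for a search string.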