Regular expression problem in Java
I am trying to create a regex for the replaceAll method in Java. The test string is abXYabcXYZ and the pattern is abc. I want to replace every character that is not part of the pattern with +. For example, the string abXYabcXYZ with the pattern [^(abc)] should return ++++abc+++, but in my case it returns ab++abc+++.
public static String plusOut(String str, String pattern) {
    pattern = "[^(" + pattern + ")]" + "".toLowerCase();
    return str.toLowerCase().replaceAll(pattern, "+");
}

public static void main(String[] args) {
    String text = "abXYabcXYZ";
    String pattern = "abc";
    System.out.println(plusOut(text, pattern));
}
When I try to replace the pattern itself with +, there is no problem: abXYabcXYZ with the pattern (abc) returns abxy+xyz. The pattern (^(abc)) returns the string without any replacement.
Is there any other way to write NOT(regex), or to treat a group of characters as a word?
What you are trying to achieve is quite tricky with regular expressions, as there is no way to express "replace strings that do not match". You will have to use a "positive" pattern that says what needs to be matched, not what must not be matched.
Also, you want to replace each character with a replacement character, so you need to make sure your pattern matches a single character. Otherwise, you will replace whole strings with one character, returning a shorter string.
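A minimal sketch of that length effect. The class name SingleCharDemo and the simplified character class [^abc] are assumptions chosen only for illustration; this does not solve the original problem, it only shows why the pattern has to match one character at a time.

```java
class SingleCharDemo {
    // Replaces each single character outside {a, b, c} with one '+',
    // so the result has the same length as the input.
    static String perCharacter(String s) {
        return s.replaceAll("[^abc]", "+");
    }

    // Replaces each whole run of such characters with one '+',
    // so the result is shorter than the input.
    static String perRun(String s) {
        return s.replaceAll("[^abc]+", "+");
    }

    public static void main(String[] args) {
        System.out.println(perCharacter("abxyabcxyz")); // ab++abc+++
        System.out.println(perRun("abxyabcxyz"));       // ab+abc+
    }
}
```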
For the toy example, you can use negative lookaheads and lookbehinds to accomplish this, but this can be trickier for real-life examples with longer or more complex strings, since you have to consider each character in your string separately, as well as its context.
Here is a pattern for "not abc":
[^abc]|a(?!bc)|(?<!a)b|b(?!c)|(?<!ab)c
It consists of five subpatterns joined by "or" (|), each of which matches a single character:
- [^abc] matches any character except a, b or c
- a(?!bc) matches a if it is not followed by bc
- (?<!a)b matches b if it is not preceded by a
- b(?!c) matches b if it is not followed by c
- (?<!ab)c matches c if it is not preceded by ab
The idea is to match every character that is not in your target word abc, plus every character of the word that, according to its context, is not part of an occurrence of the word. The context can be checked using negative lookaheads (?!...) and negative lookbehinds (?<!...).
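As a quick check, the pattern above can be dropped into replaceAll. This is only a sketch: the class name NotAbcDemo is made up for illustration, and the pattern is hard-coded for the word abc, so it does not generalize to other words.

```java
class NotAbcDemo {
    // Hard-coded for the target word "abc" only.
    static final String NOT_ABC = "[^abc]|a(?!bc)|(?<!a)b|b(?!c)|(?<!ab)c";

    // Lowercases the input (as in the question) and replaces every
    // character that is not part of an "abc" occurrence with '+'.
    static String plusOut(String str) {
        return str.toLowerCase().replaceAll(NOT_ABC, "+");
    }

    public static void main(String[] args) {
        System.out.println(plusOut("abXYabcXYZ")); // ++++abc+++
        System.out.println(plusOut("cabcab"));     // +abc++
    }
}
```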
You can imagine that this method will fail if you have a target word that contains one character more than once, for example the word example. It is difficult to express "match e if it is neither followed by x nor preceded by l".
Especially for dynamic patterns, it is much easier to do a positive search and then, in a second step, replace every character that was not matched, as others have suggested.
[^...] will match a single character that is not one of the characters listed in ...
So your pattern "[^(abc)]" says "match a single character that is not a, b, c, or a left or right parenthesis"; and indeed, this is what happens in your test.
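A short sketch that makes this visible (the class name CharClassDemo is hypothetical): inside a character class the parentheses are ordinary characters, so they are protected from replacement just like a, b and c.

```java
class CharClassDemo {
    // Applies the question's pattern: the class excludes '(', 'a', 'b', 'c', ')'.
    static String replace(String s) {
        return s.replaceAll("[^(abc)]", "+");
    }

    public static void main(String[] args) {
        System.out.println(replace("abxyabcxyz")); // ab++abc+++
        System.out.println(replace("a(b)cXY"));    // a(b)c++
    }
}
```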
It's hard to say "replace all characters that are not part of the string abc" in one trivial regex. What you could do instead to achieve what you want is some nasty thing, like
while the input string still contains "abc"
    find the next occurrence of "abc"
    append to the output a string containing as many "+"s as there are characters before the "abc"
    append "abc" to the output string
    skip, in the input string, to a position just after the "abc" found
append to the output a string containing as many "+"s as there are characters left in the input
or perhaps, if the input alphabet is limited, you can use regular expressions to do something like
replace all occurrences of "abc" with a single character that does not occur anywhere in the existing string
replace all other characters with "+"
replace all occurrences of the target character with "abc"
which will be more readable but may not perform as well
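Both outlines can be sketched in Java roughly as follows. The class and method names are invented for illustration, the word is assumed to be non-empty, and the second method additionally assumes that the placeholder character U+E000 never occurs in the input.

```java
class PlusOutSketch {
    // Assumption: this private-use character never appears in the input.
    private static final char PLACEHOLDER = '\uE000';

    // First outline: scan with indexOf and pad the gaps with '+'.
    static String viaScan(String str, String word) {
        StringBuilder out = new StringBuilder(str.length());
        int pos = 0;
        int hit = word.isEmpty() ? -1 : str.indexOf(word, pos);
        while (hit >= 0) {
            for (int i = pos; i < hit; i++) {
                out.append('+');           // characters before the occurrence
            }
            out.append(word);              // the occurrence itself
            pos = hit + word.length();     // skip just past it
            hit = str.indexOf(word, pos);
        }
        for (int i = pos; i < str.length(); i++) {
            out.append('+');               // characters left after the last occurrence
        }
        return out.toString();
    }

    // Second outline: protect the word with a placeholder, replace the rest, restore.
    static String viaPlaceholder(String str, String word) {
        String marker = String.valueOf(PLACEHOLDER);
        return str.replace(word, marker)             // literal replacement, no regex
                  .replaceAll("[^" + marker + "]", "+")
                  .replace(marker, word);
    }

    public static void main(String[] args) {
        System.out.println(viaScan("abxyabcxyz", "abc"));        // ++++abc+++
        System.out.println(viaPlaceholder("abxyabcxyz", "abc")); // ++++abc+++
    }
}
```

The scanning version makes a single pass and needs no spare character; the placeholder version is shorter but depends on the assumption about the input alphabet.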
Negating a regular expression is usually difficult. I think you can use negative lookarounds. Perhaps something like this:
String pattern = "(?<!ab).(?!abc)";
I have not tested it, so it may not work for degenerate cases. And the performance can be terrible. It is probably best to use a multi-stage algorithm.
Edit: No. I think it won't work for every case. You will most likely spend more time debugging the regular expression than you would spend writing the algorithm out with some extra code.
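A small sketch (hypothetical class name LookaroundAttempt) that runs this pattern on the question's input shows how it goes wrong: characters right after any "ab" or right before an "abc" are skipped, and parts of the word itself get replaced.

```java
class LookaroundAttempt {
    // The pattern from this answer, applied as-is.
    static String replace(String s) {
        return s.replaceAll("(?<!ab).(?!abc)", "+");
    }

    public static void main(String[] args) {
        // The desired result would be ++++abc+++.
        System.out.println(replace("abxyabcxyz")); // ++xy++c+++
    }
}
```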
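Try to solve it without regex:
String out = "";
int i;
// Walk through every position where the pattern could still fit.
for (i = 0; i < text.length() - pattern.length() + 1; ) {
    if (text.substring(i, i + pattern.length()).equals(pattern)) {
        // Pattern found here: copy it and jump past it.
        out += pattern;
        i += pattern.length();
    } else {
        // No pattern at this position: replace this one character.
        out += "+";
        i++;
    }
}
// The remaining tail is too short to contain the pattern.
for (; i < text.length(); i++) {
    out += "+";
}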
Instead of one replaceAll, you can always try something like:
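// Assumes JUnit's @Test and assertEquals are available in the enclosing test class.
import java.util.regex.Matcher;
import java.util.regex.Pattern;

@Test
public void testString() {
    final String in = "abXYabcXYabcHIH";
    final String expected = "xxxxabcxxabcxxx";
    String result = replaceUnwanted(in);
    assertEquals(expected, result);
}

private String replaceUnwanted(final String in) {
    // Group 1: everything before the next "abc"; group 2: the "abc" itself.
    final Pattern p = Pattern.compile("(.*?)(abc)");
    final Matcher m = p.matcher(in);
    final StringBuilder out = new StringBuilder();
    int end = 0;
    while (m.find()) {
        out.append(m.group(1).replaceAll(".", "x"));
        out.append(m.group(2));
        end = m.end();
    }
    // Replace whatever follows the last "abc", so that no trailing
    // characters (including a stray 'a') are dropped.
    out.append(in.substring(end).replaceAll(".", "x"));
    return out.toString();
}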
Instead of using replaceAll(...), I would go for the Pattern/Matcher approach:
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Main {

    public static String plusOut(String str, String pattern) {
        StringBuilder builder = new StringBuilder();
        String regex = String.format("((?:(?!%s).)++)|%s", pattern, pattern);
        Matcher m = Pattern.compile(regex).matcher(str.toLowerCase());
        while (m.find()) {
            builder.append(m.group(1) == null ? pattern : m.group().replaceAll(".", "+"));
        }
        return builder.toString();
    }

    public static void main(String[] args) {
        String text = "abXYabcXYZ";
        String pattern = "abc";
        System.out.println(plusOut(text, pattern));
    }
}
Note that you need to use Pattern.quote(...) if your String pattern contains regex metacharacters.
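A sketch of how the quoting could be wired in. The class name QuotedPlusOut is hypothetical, the word is assumed to be non-empty, and Pattern.DOTALL is added here (it is not in the code above) so that line breaks are counted as characters too.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class QuotedPlusOut {
    static String plusOut(String str, String word) {
        // Treat the word literally, even if it contains characters such as '+' or '.'.
        String quoted = Pattern.quote(word);
        // Either a run of characters at which the word does not start, or the word itself.
        Pattern p = Pattern.compile("((?:(?!" + quoted + ").)++)|" + quoted, Pattern.DOTALL);
        Matcher m = p.matcher(str);
        StringBuilder out = new StringBuilder(str.length());
        while (m.find()) {
            if (m.group(1) == null) {
                out.append(word);              // matched the word: keep it
            } else {
                for (int i = 0; i < m.group(1).length(); i++) {
                    out.append('+');           // matched a run: one '+' per character
                }
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Without quoting, "1+1" would be read as "one or more 1s, then a 1".
        System.out.println(plusOut("1+1=2, 1+1", "1+1")); // 1+1++++1+1
    }
}
```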
Edit: I hadn't seen that a Pattern/Matcher approach was already suggested by toolkit (although slightly different)...