How to remove duplicate characters in a string using regex?
I need to remove duplicate characters from a string. I tried:
outputString = str.replaceAll("(.)(?=.*\\1)", "");
This removes the duplicates, but it keeps the last occurrence of each character rather than the first, so the remaining characters are not in the order I want, as shown below.
Input
haih
Output
aih
But I need the output to be hai. That is, the order in which the characters first appear in the string should not change. Below are the expected results for some inputs.
Input
aaaassssddddd
Output
asd
Input
cdddddggggeeccc
Output
cdge
How can this be achieved?
Your code keeps the last occurrence of each character, so how about reversing the string before and after the replacement?
outputString = new StringBuilder(str).reverse().toString();
// outputString is now hiah
outputString = outputString.replaceAll("(.)(?=.*\\1)", "");
// outputString is now iah
outputString = new StringBuilder(outputString).reverse().toString();
// outputString is now hai
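The three steps above can be wrapped into a single method. This is only a minimal sketch: the class name ReverseDedup and the method name keepFirstOccurrences are hypothetical, and it assumes single-line input, since . does not match line terminators by default.

```java
class ReverseDedup {
    // Hypothetical helper: reverse, drop every character that occurs
    // again later in the reversed string, then reverse back.
    // Assumes the input contains no line terminators.
    static String keepFirstOccurrences(String str) {
        String reversed = new StringBuilder(str).reverse().toString();
        String deduped = reversed.replaceAll("(.)(?=.*\\1)", "");
        return new StringBuilder(deduped).reverse().toString();
    }

    public static void main(String[] args) {
        System.out.println(keepFirstOccurrences("haih"));            // hai
        System.out.println(keepFirstOccurrences("aaaassssddddd"));   // asd
        System.out.println(keepFirstOccurrences("cdddddggggeeccc")); // cdge
    }
}
```

The inputs in main are the ones from the question, so the printed results can be compared directly with the expected outputs listed there.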
Overview
This is possible with the Oracle implementation, but I would not recommend this answer for many reasons:
- It relies on a bug in the implementation, which interprets *, + or {n,} as {0, 0x7FFFFFFF}, {1, 0x7FFFFFFF}, {n, 0x7FFFFFFF} respectively, and this allows the look-behind to contain such quantifiers. Since it relies on a bug, there is no guarantee that it will work the same way in the future.
- It is an unmaintainable mess. Written as normal code, anyone with basic knowledge of Java can read it, but using the regex in this answer limits the number of people who can understand the code at a glance to those who understand regex and its implementation in depth.
Therefore, this answer is for educational purposes only and not for use in production code.
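For comparison, the "normal code" alternative could look something like the sketch below. It is not taken from the original answer: the class name PlainDedup and the method name removeDuplicates are hypothetical, and it works on UTF-16 char values, so characters outside the Basic Multilingual Plane are not treated as single units.

```java
import java.util.HashSet;
import java.util.Set;

class PlainDedup {
    // Hypothetical helper: keep only the first occurrence of each char,
    // preserving the order in which characters first appear.
    static String removeDuplicates(String input) {
        Set<Character> seen = new HashSet<>();
        StringBuilder out = new StringBuilder();
        for (char c : input.toCharArray()) {
            // Set.add returns false if the char was already seen.
            if (seen.add(c)) {
                out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(removeDuplicates("haih"));            // hai
        System.out.println(removeDuplicates("cdddddggggeeccc")); // cdge
    }
}
```

This runs in a single pass over the string and does not depend on any regex engine behavior.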
Solution
Below is a one-line replaceAll regex solution:
String output = input.replaceAll("(.)(?=(.*))(?<=(?=\\1.*?\\1\\2$).+)","");
The regex without Java string escaping:
(.)(?=(.*))(?<=(?=\1.*?\1\2$).+)
What we want to do is look behind to see whether the same character has appeared before. The capturing group (.) first captures the current character, and the look-behind group checks whether this character has appeared earlier. So far so good.
However, since back-references such as \1 have no obvious length, they cannot appear directly in the look-behind.
This is where we use the bug: the look-behind reaches back as far as the beginning of the string, and a look-ahead placed inside it is what allows the back-reference, as you can see in (?<=(?= ... ).+).
This is not the end of the problem. While the non-assertion pattern .+ inside the look-behind cannot advance past the position right after the character matched by (.), the look-ahead inside it can. As a simple test:
"haaaaaaaaa".replaceAll("h(?<=(?=(.*)).*)","$1")
> "aaaaaaaaaaaaaaaaaa"
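To make sure the search does not spill over past the current character, I capture the rest of the string with the look-ahead (?=(.*)) and use it to "mark" the current position in (?=\\1.*?\\1\\2$). Because \2 has to match right up to the end of the string, the second \1 is forced to be the current character, so the first \1 must be an earlier occurrence of it.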
Can this be done in one replacement without using look-behind?
I think this is impossible. We need to differentiate the first appearance of a character from subsequent appearances of the same character. While we can do this for one fixed character (for example a), the problem requires us to do this for all characters in the string.
For your information, this removes all subsequent occurrences of a fixed character (h is used here):
.replaceAll("^([^h]*h[^h]*)|(?!^)\\Gh+([^h]*)","$1$2")
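As a quick check of the fixed-character regex above, here is a minimal sketch that applies it to a couple of strings. The class name FixedCharDemo and the method name dropRepeatedH are hypothetical; the pattern and replacement are copied unchanged.

```java
class FixedCharDemo {
    // Hypothetical helper: keeps the first 'h' and removes every later one.
    // The first alternative consumes everything up to and including the
    // first 'h' plus the non-'h' run after it; the second alternative uses
    // \G to continue exactly where the previous match ended, dropping a run
    // of 'h' and keeping the non-'h' run that follows.
    static String dropRepeatedH(String input) {
        return input.replaceAll("^([^h]*h[^h]*)|(?!^)\\Gh+([^h]*)", "$1$2");
    }

    public static void main(String[] args) {
        System.out.println(dropRepeatedH("haih"));   // hai
        System.out.println(dropRepeatedH("ahbhch")); // ahbc
        System.out.println(dropRepeatedH("abc"));    // abc (no 'h', nothing matches)
    }
}
```

In each match only one of the two groups participates, and Java substitutes an empty string for the group that did not, which is why the replacement can always be $1$2.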
To do this for multiple characters, we have to keep track of whether a character has appeared earlier or not, both across matches and for all characters. The regex above shows the "across matches" part, but the other condition makes this more or less impossible.
We obviously cannot do this in a single match either, since subsequent occurrences can be non-contiguous and arbitrary in number.