Removing special characters from a Java string
I am trying to remove emoji and other special characters from raw text in Java and could not find a way. The text comes from a free-text box on a website, so it can contain literally anything. It comes from an external source whose settings I cannot change, so I have to clean it up on my side. Some examples:
1) belem ๐บ must be → belem
2) Ariana ๐ should be → Ariana
3) Harlem ๐ should be → Harlem
4) Yz ๐ณ๏ธ๐ must be → Yz
5) ใ ใ ใ ใ ใฏ 7 ๅ ใฏ ่ฆ ใซ ่ก ใ ใ ๐๐ should be → ใ ใ ใ ใ ใฏ 7 ๅ ใฏ ่ฆ ใซ ่ก ใ ใ
6) ุฏู ู ุงุฒุฑู ูุทูู ุงุฒุฑู ๐๐ต๐ต๐ต๐ต should be → ุฏู ู ุงุฒุฑู ูุทูู ุงุฒุฑู
Any help please?
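If by "special characters" you mean emoji, most of them lie outside the Basic Multilingual Plane and are stored in a Java String as surrogate pairs. Try filtering by code point: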
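    static String removeSpecial(String s) {
        // Keep only code points below U+D800. This drops every supplementary
        // character (stored as a surrogate pair) and also U+E000..U+FFFF.
        int[] r = s.codePoints()
                .filter(c -> c < Character.MIN_SURROGATE)
                .toArray();
        return new String(r, 0, r.length);
    }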
Test it with:
    String[] testStrs = {
        "belem ๐บ",
        "Ariana ๐",
        "Harlem ๐",
        "Yz ๐ณ๏ธโ๐",
        "ใใใใใฏ7ๅใฏ่ฆใซ่กใใ๐๐",
        "ุฏู ู ุงุฒุฑู ูุทูู ุงุฒุฑู ๐๐ต๐ต๐ต๐ต"
    };
    for (String s : testStrs)
        System.out.println(removeSpecial(s));
Results:
    belem
    Ariana
    Harlem
    Yz โ
    ใใใใใฏ7ๅใฏ่ฆใซ่กใใ
    ุฏู ู ุงุฒุฑู ูุทูู ุงุฒุฑู
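Note that the "Yz" result still carries a leftover character: the joiner between the two emoji is an ordinary BMP code point, so the filter keeps it. Below is a minimal sketch of a variant that also drops the zero width joiner (U+200D) and the variation selectors (U+FE00..U+FE0F), and trims the trailing space. The class name `SupplementaryStripper`, the helper `isUnwanted`, and the sample string are hypothetical and not part of the answer above; the extra code points to drop are an assumption about what the question wants.

```java
final class SupplementaryStripper {

    // Assumption: "unwanted" means supplementary code points plus the two
    // kinds of invisible BMP code points that commonly accompany emoji.
    static boolean isUnwanted(int cp) {
        return Character.isSupplementaryCodePoint(cp) // above U+FFFF
                || cp == 0x200D                       // zero width joiner
                || (cp >= 0xFE00 && cp <= 0xFE0F);    // variation selectors
    }

    static String strip(String s) {
        int[] kept = s.codePoints()
                .filter(cp -> !isUnwanted(cp))
                .toArray();
        // trim() removes the space left behind when the emoji ended the text
        return new String(kept, 0, kept.length).trim();
    }

    public static void main(String[] args) {
        // Hypothetical sample: "Yz " followed by U+1F3F3, U+FE0F, U+200D, U+1F308
        String sample = "Yz \uD83C\uDFF3\uFE0F\u200D\uD83C\uDF08";
        System.out.println("[" + strip(sample) + "]"); // prints [Yz]
    }
}
```

Unlike the `c < Character.MIN_SURROGATE` filter, this keeps U+E000..U+FFFF, so fullwidth forms such as U+FF17 survive. One caveat: removing every supplementary code point also removes rare CJK ideographs (for example Extension B, U+20000 and up), which may matter for Japanese or Chinese input.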
You can try this regex, which matches most emoji in a string:
    regex = "[\\ud83c\\udc00-\\ud83c\\udfff]|[\\ud83d\\udc00-\\ud83d\\udfff]|[\\u2600-\\u27ff]"
Then remove the matches with the replaceAll() method:
    String text = "ใใใใใฏ7ๅใฏ่ฆใซ่กใใ๐๐ ";
    String regex = "[\\ud83c\\udc00-\\ud83c\\udfff]|[\\ud83d\\udc00-\\ud83d\\udfff]|[\\u2600-\\u27ff]";
    System.out.println(text.replaceAll(regex, ""));
Output:
    ใใใใใฏ7ๅใฏ่ฆใซ่กใใ
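The ranges in that regex stop at U+1F7FF, so newer emoji (for example those in the U+1F900 block) and the invisible joiner and variation-selector characters are not matched. Here is a rough sketch of the same idea written with `\x{...}` code point escapes and a precompiled `Pattern`. The class name `EmojiRegex`, the method `removeEmojis`, and the widened range U+1F000..U+1FAFF are assumptions made for illustration, not an exhaustive emoji definition.

```java
import java.util.regex.Pattern;

final class EmojiRegex {

    // Compiled once and reused; \x{h...h} names a code point directly,
    // so no surrogate escapes are needed.
    private static final Pattern EMOJI = Pattern.compile(
            "["
            + "\\x{1F000}-\\x{1FAFF}" // assumed range: pictographs, emoticons, flags, newer symbols
            + "\\x{2600}-\\x{27FF}"   // same BMP symbol range as the regex above
            + "\\x{FE00}-\\x{FE0F}"   // variation selectors
            + "\\x{200D}"             // zero width joiner
            + "]");

    static String removeEmojis(String text) {
        // trim() drops the space left behind by a trailing emoji
        return EMOJI.matcher(text).replaceAll("").trim();
    }

    public static void main(String[] args) {
        // Hypothetical sample: "Ariana " followed by U+1F60D
        System.out.println("[" + removeEmojis("Ariana \uD83D\uDE0D") + "]"); // prints [Ariana]
    }
}
```

A whitelist of ranges like this leaves other supplementary characters (such as rare CJK ideographs) untouched, at the cost of needing an update whenever new emoji blocks are added.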