Removing special characters from a Java string

I am trying to remove symbols and special characters from raw text in Java and cannot find a way to do it. The text is taken from a free-text box on a website, so it can contain literally anything. It comes from an external source whose settings I cannot change, so I have to clean it up on my side. Some examples:

1) belem 🍺 must be → belem

2) Ariana 👑 should be → Ariana

3) Harlem 🌊 should be → Harlem

4) Yz 🏳️🌈 must be → Yz

5) こ こ さ け は 7 回 は 見 に 行 く ぞ 👍💟 should be → こ こ さ け は 7 回 は 見 に 行 く ぞ

6) دمي ازرق وطني ازرق 💙🔵🔵🔵🔵 should be → دمي ازرق وطني ازرق

Any help please?

+3




3 answers


If by "special characters" you mean characters stored as surrogate pairs (code points outside the Basic Multilingual Plane), try this.

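static String removeSpecial(String s) {
    // Keep only code points below U+D800. This drops every supplementary
    // (surrogate-pair) character, and also BMP code points from U+D800 up.
    int[] r = s.codePoints()
        .filter(c -> c < Character.MIN_SURROGATE)
        .toArray();
    // Rebuild the string from the surviving code points.
    return new String(r, 0, r.length);
}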

      

and, to test it:



String[] testStrs = {
    "belem 🍺",
    "Ariana 👑",
    "Harlem 🌊",
    "Yz 🏳️‍🌈",
    "ここさけは7回は見に行くぞ👍💟",
    "دمي ازرق وطني ازرق 💙🔵🔵🔵🔵"
};

for (String s : testStrs)
    System.out.println(removeSpecial(s));

      

Results:

belem 
Ariana 
Harlem 
Yz ‍
ここさけは7回は見に行くぞ
دمي ازرق وطني ازرق 
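
In the results above, the `Yz` line still ends with an invisible zero-width joiner (U+200D), and most lines keep a trailing space. Below is a minimal sketch of one way to tidy that up, assuming it is acceptable to drop U+200D and the variation selectors U+FE00..U+FE0F and to trim surrounding whitespace. The class name `SupplementaryStripper` and the method `strip` are hypothetical, not part of the answer above.

```java
class SupplementaryStripper {
    // Hypothetical helper: removes supplementary code points, variation
    // selectors (U+FE00..U+FE0F) and the zero-width joiner (U+200D),
    // then trims leading and trailing whitespace.
    static String strip(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        s.codePoints()
            .filter(Character::isBmpCodePoint)      // drop surrogate-pair characters
            .filter(c -> c < 0xFE00 || c > 0xFE0F)  // drop variation selectors
            .filter(c -> c != 0x200D)               // drop zero-width joiner
            .forEach(sb::appendCodePoint);
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        // "Yz " + white flag + U+FE0F + U+200D + rainbow
        String flag = "Yz \uD83C\uDFF3\uFE0F\u200D\uD83C\uDF08";
        System.out.println("[" + strip(flag) + "]"); // prints "[Yz]"
    }
}
```

Unlike the `c < Character.MIN_SURROGATE` filter, `Character.isBmpCodePoint` keeps BMP characters in U+E000..U+FFFF, such as fullwidth forms. Dropping U+200D is a judgment call: some scripts use it for shaping, so it may need to stay for certain inputs.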

      

+2




You can try this regex which will find all emojis in a string:

String regex = "[\\ud83c\\udc00-\\ud83c\\udfff]|[\\ud83d\\udc00-\\ud83d\\udfff]|[\\u2600-\\u27ff]";

      

Then remove all emojis from the string with the replaceAll() method:



String text = "ここさけは7回は見に行くぞ👍💟 ";
String regex = "[\\ud83c\\udc00-\\ud83c\\udfff]|[\\ud83d\\udc00-\\ud83d\\udfff]|[\\u2600-\\u27ff]";
System.out.println(text.replaceAll(regex, ""));

      

Output

ใ“ใ“ใ•ใ‘ใฏ7ๅ›žใฏ่ฆ‹ใซ่กŒใใž 

      

+2




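Use the character class for whitespace together with the POSIX character class \p{Alnum}, and turn on Unicode mode with the (?U) flag so that \p{Alnum} means "any letter or number from any language". Without the flag, Java's \p{Alnum} matches only ASCII letters and digits, which would also strip the Japanese and Arabic text:

str = str.replaceAll("(?U)[^\\s\\p{Alnum}]", "");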
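
In Java's regex engine, `\p{Alnum}` is ASCII-only unless the `UNICODE_CHARACTER_CLASS` flag is set, either through `Pattern.compile` or inline as `(?U)`. The sketch below compares the two behaviours; `AlnumSketch`, `asciiOnly`, and `unicodeAware` are hypothetical names used only for illustration.

```java
class AlnumSketch {
    // Default mode: \p{Alnum} matches only [A-Za-z0-9].
    static String asciiOnly(String s) {
        return s.replaceAll("[^\\s\\p{Alnum}]", "");
    }

    // (?U) enables UNICODE_CHARACTER_CLASS, so \p{Alnum} matches
    // alphabetic characters and digits from any script.
    static String unicodeAware(String s) {
        return s.replaceAll("(?U)[^\\s\\p{Alnum}]", "");
    }

    public static void main(String[] args) {
        // Hiragana "koko", a space, the digit 7, a space, then U+1F451 (crown)
        String input = "\u3053\u3053 7 \uD83D\uDC51";
        System.out.println("[" + asciiOnly(input) + "]");    // hiragana is removed too
        System.out.println("[" + unicodeAware(input) + "]"); // only the crown is removed
    }
}
```

Either variant also removes punctuation, so this approach fits only when letters, digits, and whitespace are all that should survive.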

      

0

