Scala - unescape Unicode String without Apache
I have the string "b\u00f4lovar" and I was wondering whether it is possible to unescape it without using Commons Lang. The call below works, but I'm running into a problem in some environments and I would like to keep that to a minimum (i.e. it works on my machine, but doesn't work in production).
StringEscapeUtils.unescapeJava(variables.getOrElse("name", ""))
How can I unescape it without the Apache library?
Thanks in advance.
Unicode escapes only
If you want to unescape only sequences of the form \u0000, then simply do it with a single regex replacement:
def unescapeUnicode(str: String): String =
  """\\u+([0-9a-fA-F]{4})""".r.replaceAllIn(str,
    m => Integer.parseInt(m.group(1), 16).toChar match {
      case '\\' => """\\"""
      case '$' => """\$"""
      case c => c.toString
    })
And the result:
scala> unescapeUnicode("b\\u00f4lovar \\u30B7")
res1: String = bôlovar シ
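The u+ in the pattern accepts more than one u after the backslash, which the Java and Scala language specifications both allow in a Unicode escape. A small check, repeating the function above inside a hypothetical wrapper object (UnicodeMarkerDemo is a name made up for this sketch) so that it compiles on its own:

```scala
object UnicodeMarkerDemo {
  // Same function as above; the u+ part matches one or more 'u' characters.
  def unescapeUnicode(str: String): String =
    """\\u+([0-9a-fA-F]{4})""".r.replaceAllIn(str,
      m => Integer.parseInt(m.group(1), 16).toChar match {
        case '\\' => """\\"""
        case '$' => """\$"""
        case c => c.toString
      })

  def main(args: Array[String]): Unit = {
    // A single marker and a repeated marker decode to the same character.
    println(unescapeUnicode("\\u0061"))
    println(unescapeUnicode("\\uuu0061"))
  }
}
```

If repeated markers should be rejected instead, dropping the + from the pattern leaves them untouched in the output.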
We have to handle the characters $ and \ separately, because the method java.util.regex.Matcher.appendReplacement treats them specially in the replacement string. Without that handling, the replacement fails:
def wrongUnescape(str: String): String =
  """\\u([0-9a-fA-F]{4})""".r.replaceAllIn(str,
    m => Integer.parseInt(m.group(1), 16).toChar.toString)
scala> wrongUnescape("\\u00" + Integer.toString('$', 16))
java.lang.IllegalArgumentException: Illegal group reference: group index is missing
at java.util.regex.Matcher.appendReplacement(Matcher.java:819)
... 46 elided
scala> wrongUnescape("\\u00" + Integer.toString('\\', 16))
java.lang.IllegalArgumentException: character to be escaped is missing
at java.util.regex.Matcher.appendReplacement(Matcher.java:809)
... 46 elided
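An alternative to matching on the two special characters by hand is to let the library quote every replacement string. This is a minimal sketch, not the approach used above: it assumes scala.util.matching.Regex.quoteReplacement (which delegates to java.util.regex.Matcher.quoteReplacement) is acceptable, and the names QuotedUnescape and unescapeUnicodeQuoted are hypothetical.

```scala
import scala.util.matching.Regex

object QuotedUnescape {
  private val UnicodeEscape: Regex = """\\u+([0-9a-fA-F]{4})""".r

  // Decodes the four hex digits, then quotes the resulting character so that
  // appendReplacement treats a dollar sign or a backslash literally.
  def unescapeUnicodeQuoted(str: String): String =
    UnicodeEscape.replaceAllIn(str, m =>
      Regex.quoteReplacement(Integer.parseInt(m.group(1), 16).toChar.toString))

  def main(args: Array[String]): Unit = {
    // The two inputs that made wrongUnescape throw.
    println(unescapeUnicodeQuoted("\\u0024"))
    println(unescapeUnicodeQuoted("\\u005c"))
  }
}
```

Quoting every match costs a little more than the explicit match on two characters, but it does not depend on remembering which characters the replacement syntax reserves.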
All escapes
Unicode escapes are a bit special: they are not part of string literals, but part of the program code. There is a separate phase that replaces Unicode escapes with characters:
scala> Integer.toString('a', 16)
res2: String = 61
scala> val \u0061 = "foo"
a: String = foo
scala> // first \u005c is replaced with a backslash, and then \t is replaced with a tab.
scala> "\u005ct"
res3: String = " "
The Scala library has a function, StringContext.treatEscapes, that supports all the normal escape sequences from the language specification.
So, if you want to support Unicode escapes and all the normal Scala escapes, you can unescape as follows:
def unescape(str: String): String =
  StringContext.treatEscapes(unescapeUnicode(str))
scala> unescape("\\u0061\\n\\u0062")
res4: String =
a
b
scala> unescape("\\u005ct")
res5: String = " "
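One thing to keep in mind with untrusted input: StringContext.treatEscapes throws StringContext.InvalidEscapeException when it meets a backslash that does not start a known escape. The following is a rough sketch of a wrapper that returns an Option instead; SafeUnescape and unescapeOption are hypothetical names, and returning None is only one possible policy. On Scala 2.13 and later, treatEscapes is marked deprecated in favour of StringContext.processEscapes, so the call may produce a deprecation warning there.

```scala
object SafeUnescape {
  // Same Unicode-only step as earlier in this answer.
  def unescapeUnicode(str: String): String =
    """\\u+([0-9a-fA-F]{4})""".r.replaceAllIn(str,
      m => Integer.parseInt(m.group(1), 16).toChar match {
        case '\\' => """\\"""
        case '$' => """\$"""
        case c => c.toString
      })

  // Returns None when the standard library rejects an escape sequence,
  // instead of letting the exception propagate to the caller.
  def unescapeOption(str: String): Option[String] =
    try Some(StringContext.treatEscapes(unescapeUnicode(str)))
    catch { case _: StringContext.InvalidEscapeException => None }

  def main(args: Array[String]): Unit = {
    println(unescapeOption("\\u0061\\n\\u0062")) // valid escapes
    println(unescapeOption("\\q"))               // backslash-q is not a known escape
  }
}
```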