Scala - unescape Unicode String without Apache

Question

I have the string "b\u00f4lovar" and I was wondering whether it is possible to unescape it without using Commons Lang. That approach works, but I'm running into a problem in some environments and would like to keep dependencies to a minimum (i.e., it works on my machine but not in production).

StringEscapeUtils.unescapeJava(variables.getOrElse("name", ""))

How can I unescape it without the Apache library?

Thanks in advance.

scala unicode

placplacboom Apr 16 15 at 20:59

1 answer

Kolmar · Accepted Answer · 2015-04-16T22:14:29+0000

Unicode escapes only

If you want to unescape only sequences of the form \u0000, then it is easy to do with a single regex replacement:

def unescapeUnicode(str: String): String =
  """\\u+([0-9a-fA-F]{4})""".r.replaceAllIn(str,
    m => Integer.parseInt(m.group(1), 16).toChar match {
      case '\\' => """\\"""
      case '$' => """\$"""
      case c => c.toString
    })

And the result

scala> unescapeUnicode("b\\u00f4lovar \\u30B7")
res1: String = bôlovar シ

We have to handle the characters $ and \ separately, because they are treated specially by the java.util.regex.Matcher.appendReplacement method. Without that handling, the replacement fails:

def wrongUnescape(str: String): String =
  """\\u([0-9a-fA-F]{4})""".r.replaceAllIn(str,
    m => Integer.parseInt(m.group(1), 16).toChar.toString)

scala> wrongUnescape("\\u00" + Integer.toString('$', 16))
java.lang.IllegalArgumentException: Illegal group reference: group index is missing
  at java.util.regex.Matcher.appendReplacement(Matcher.java:819)
  ... 46 elided

scala> wrongUnescape("\\u00" + Integer.toString('\\', 16))
java.lang.IllegalArgumentException: character to be escaped is missing
   at java.util.regex.Matcher.appendReplacement(Matcher.java:809)
   ... 46 elided
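The manual handling of $ and \ can also be delegated to java.util.regex.Matcher.quoteReplacement, which escapes both characters in a replacement string. The following is a minimal sketch of that variant; the name unescapeUnicodeQuoted is hypothetical and not part of the original answer:

```scala
import java.util.regex.Matcher

// Hypothetical variant of unescapeUnicode: instead of matching on '\' and '$'
// by hand, let Matcher.quoteReplacement escape them in the replacement string.
def unescapeUnicodeQuoted(str: String): String =
  """\\u+([0-9a-fA-F]{4})""".r.replaceAllIn(str,
    m => Matcher.quoteReplacement(
      Integer.parseInt(m.group(1), 16).toChar.toString))
```

With this version, unescapeUnicodeQuoted("\\u0024") should return "$" and unescapeUnicodeQuoted("\\u005c") a single backslash, rather than throwing. The trade-off is one extra call per match in exchange for not having to remember which characters are special.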

All escape characters

Unicode character escapes are a bit special: they are not part of string literals but of the program source itself. A separate phase replaces Unicode escapes with the corresponding characters:

scala> Integer.toString('a', 16)
res2: String = 61

scala> val \u0061 = "foo"
a: String = foo

scala> // first \u005c is replaced with a backslash, and then \t is replaced with a tab.
scala> "\u005ct"
res3: String = "    "

The Scala library has a function, StringContext.treatEscapes, that supports all the normal escape sequences from the language specification.

So, if you want to support Unicode escapes and all the normal Scala escapes, you can unescape as follows:

def unescape(str: String): String =
  StringContext.treatEscapes(unescapeUnicode(str))

scala> unescape("\\u0061\\n\\u0062")
res4: String =
a
b

scala> unescape("\\u005ct")
res5: String = "    "
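One thing to keep in mind if the input comes from outside the program: StringContext.treatEscapes throws an exception when it meets a backslash sequence it does not recognise (for example \q). A minimal sketch of a non-throwing wrapper follows; the name unescapeOption is hypothetical, and unescapeUnicode is repeated only so the snippet is self-contained:

```scala
import scala.util.Try

// Repeated from the first section so this sketch compiles on its own.
def unescapeUnicode(str: String): String =
  """\\u+([0-9a-fA-F]{4})""".r.replaceAllIn(str,
    m => Integer.parseInt(m.group(1), 16).toChar match {
      case '\\' => """\\"""
      case '$' => """\$"""
      case c => c.toString
    })

// Hypothetical helper: returns None instead of throwing when the input
// contains an escape sequence that treatEscapes rejects.
def unescapeOption(str: String): Option[String] =
  Try(StringContext.treatEscapes(unescapeUnicode(str))).toOption
```

Here unescapeOption("\\u0061\\n\\u0062") should yield Some of "a", a newline and "b", while unescapeOption("\\q") should yield None. Whether to swallow the error like this or let it propagate depends on how the caller wants to treat malformed input.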
