Scala - unescape Unicode String without Apache

I have the string "b\u00f4lovar" and I was wondering whether it is possible to unescape it without using Commons Lang. The call below works on my machine, but the dependency causes problems in some environments (it fails in production), so I would like to keep dependencies to a minimum.

StringEscapeUtils.unescapeJava(variables.getOrElse("name", ""))

      

How can I unescape it without the Apache library?

Thanks in advance.

1 answer


Unicode escapes only

If you only want to unescape sequences of the form \uXXXX, a single regex replacement is enough:

def unescapeUnicode(str: String): String =
  """\\u+([0-9a-fA-F]{4})""".r.replaceAllIn(str,
    m => Integer.parseInt(m.group(1), 16).toChar match {
      case '\\' => """\\"""
      case '$' => """\$"""
      case c => c.toString
    })

      

And the result

scala> unescapeUnicode("b\\u00f4lovar \\u30B7")
res1: String = bôlovar シ

      

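The characters $ and \ have to be handled separately because java.util.regex.Matcher.appendReplacement treats them specially in the replacement string: $ starts a group reference and \ escapes the next character. Without that handling, the replacement fails: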

def wrongUnescape(str: String): String =
  """\\u([0-9a-fA-F]{4})""".r.replaceAllIn(str,
    m => Integer.parseInt(m.group(1), 16).toChar.toString)

scala> wrongUnescape("\\u00" + Integer.toString('$', 16))
java.lang.IllegalArgumentException: Illegal group reference: group index is missing
  at java.util.regex.Matcher.appendReplacement(Matcher.java:819)
  ... 46 elided

scala> wrongUnescape("\\u00" + Integer.toString('\\', 16))
java.lang.IllegalArgumentException: character to be escaped is missing
   at java.util.regex.Matcher.appendReplacement(Matcher.java:809)
   ... 46 elided
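
As an alternative to matching $ and \ by hand, the standard library's Regex.quoteReplacement can escape the replacement text for you. This is a minimal sketch rather than part of the original answer; the name unescapeUnicodeQuoted is made up for illustration:

```scala
import scala.util.matching.Regex

// Hypothetical variant of unescapeUnicode (the name is an assumption).
// Regex.quoteReplacement escapes backslash and dollar in the replacement
// text, so appendReplacement sees them as literal characters.
def unescapeUnicodeQuoted(str: String): String =
  """\\u+([0-9a-fA-F]{4})""".r.replaceAllIn(str, m =>
    Regex.quoteReplacement(Integer.parseInt(m.group(1), 16).toChar.toString))
```

It should behave like unescapeUnicode for the inputs shown above, including the escaped forms of $ and \ that break wrongUnescape.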

      



All escape sequences

Unicode escapes are a bit special: at least in the Scala version used for this session, they are not part of string literals but of the program source itself. The compiler replaces Unicode escapes with the corresponding characters in a separate, earlier phase:

scala> Integer.toString('a', 16)
res2: String = 61

scala> val \u0061 = "foo"
a: String = foo

scala> // first \u005c is replaced with a backslash, and then \t is replaced with a tab.
scala> "\u005ct"
res3: String = "    " 

      

The Scala library has a function, StringContext.treatEscapes, that supports all the normal escape sequences from the language specification.

So, if you want to support Unicode escapes and all the normal Scala escapes, you can unescape as follows:

def unescape(str: String): String =
  StringContext.treatEscapes(unescapeUnicode(str))

scala> unescape("\\u0061\\n\\u0062")
res4: String =
a
b

scala> unescape("\\u005ct")
res5: String = "    "
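
Note that StringContext.treatEscapes throws an exception (a subclass of IllegalArgumentException) when the input contains an escape it does not recognise, such as \q. If the strings come from untrusted input, one option is to wrap the call. The sketch below is an assumption about how you might do that, and safeUnescape is a hypothetical name; unescapeUnicode is repeated only so the snippet stands alone:

```scala
import scala.util.Try

// Same regex approach as unescapeUnicode above, repeated to keep this self-contained.
def unescapeUnicode(str: String): String =
  """\\u+([0-9a-fA-F]{4})""".r.replaceAllIn(str,
    m => Integer.parseInt(m.group(1), 16).toChar match {
      case '\\' => """\\"""
      case '$'  => """\$"""
      case c    => c.toString
    })

// Hypothetical helper (not from the original answer): returns None instead of
// throwing when the input contains an escape that treatEscapes rejects.
def safeUnescape(str: String): Option[String] =
  Try(StringContext.treatEscapes(unescapeUnicode(str))).toOption
```

If your Scala version flags treatEscapes as deprecated, StringContext.processEscapes is the replacement suggested by the deprecation message.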

      
