Removing punctuation marks from text in Scala - Spark
This is a sample of my data:
case time (especially it purse), read manual care, follow care instructions make stays waterproof -- example, inspect rubber seals doors (especially battery/memory card door open time)
xm "life support" picture . flip part bit flimsy guessing won't long . sound great altec speaker dock it! chance back base (xm3020) . traveling bag connect laptop extra speaker . amount paid ($25).
I want to remove all punctuation except the period (.), and also remove words of length <= 2. For example, my expected result:
case time especially its purse read manual care follow care instructions . make stays waterproof example inspect rubber seals doors especially batterymemory card door open time
life support picture . flip part bit flimsy guessing wont long . sound great altec speaker dock chance back base xm3020 . traveling bag connect laptop extra speaker . amount paid $25 .
This should be implemented in Scala. I've tried:
replaceAll( """\\W\s""", "")
replaceAll(""""[^a-zA-Z\.]""", "")
but neither works. Can anyone help me?
Looking at the regex javadoc ( http://docs.oracle.com/javase/7/docs/api/java/util/regex/Pattern.html ), we can see that the character class for punctuation is \p{Punct}, and that we can subtract characters from a character class using an intersection such as [a-z&&[^def]]. From there, it is easy to define a regex that removes all punctuation except the period:
s.replaceAll("""[\p{Punct}&&[^.]]""", "")
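As a quick check of the intersection syntax, here is a minimal sketch. The object name `PunctDemo` and the helper `stripPunct` are made up for illustration; only `String.replaceAll` and the `\p{Punct}` class come from the JDK.

```scala
object PunctDemo {
  // \p{Punct} intersected with "anything but a period":
  // every ASCII punctuation character except '.'
  private val punctExceptPeriod = """[\p{Punct}&&[^.]]"""

  // Hypothetical helper: delete the matched characters, keep everything else as is.
  def stripPunct(s: String): String = s.replaceAll(punctExceptPeriod, "")
}
```

With this sketch, `PunctDemo.stripPunct("amount paid ($25).")` should give `amount paid 25.`: the parentheses and the `$` are removed, and the period stays.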
Removing words with size <= 2 can be done like this:
s.replaceAll("""\b\p{IsLetter}{1,2}\b""", "")
Combining the two, this gives:
s.replaceAll("""([\p{Punct}&&[^.]]|\b\p{IsLetter}{1,2}\b)\s*""", "")
Note how I added \s* to remove excess spaces.
Also, note that the above expression removes the '$' entirely, because '$' belongs to \p{Punct} (by default this is the POSIX ASCII punctuation class, not the Unicode punctuation category). If this is undesirable (as your expected result seems to indicate), please clarify what you consider to be punctuation. For example, you could treat only the characters ?!:() as punctuation to remove, leaving the period out of the class so that it is kept:
s.replaceAll("""([?!:()]|\b\p{IsLetter}{1,2}\b)\s*""", "")
Alternatively, you can simply add "$" to the list of characters that are not treated as punctuation, alongside the period:
s.replaceAll("""([\p{Punct}&&[^.$]]|\b\p{IsLetter}{1,2}\b)\s*""", "")
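One thing to watch with the single-pass patterns: `replaceAll` scans the original string, so once the apostrophe in "won't" is matched, the leftover "t" looks like a one-letter word and is removed too ("won't long" comes out as "wonlong"). A possible workaround, sketched here under the assumption that two passes are acceptable, is to strip punctuation first and only then drop short words. `TwoPassCleanup` and `clean` are hypothetical names.

```scala
object TwoPassCleanup {
  // Pass 1: ASCII punctuation, minus the period and the dollar sign.
  private val punctuation = """[\p{Punct}&&[^.$]]"""
  // Pass 2: whole words of one or two letters, plus any whitespace after them.
  private val shortWord = """\b\p{IsLetter}{1,2}\b\s*"""

  def clean(s: String): String =
    s.replaceAll(punctuation, "") // "won't" becomes "wont" before lengths are checked
      .replaceAll(shortWord, "")  // "it " is dropped, "wont" and "xm3020" are kept
      .trim
}
```

For example, `TwoPassCleanup.clean("guessing won't long . dock it! chance (xm3020) . paid ($25).")` should give `guessing wont long . dock chance xm3020 . paid $25.`. Because `\s*` also matches line breaks, a short word at the end of a line would pull the next line up; swapping it for `[ \t]*` is one way to avoid that.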
Try this, it will work:
val str = """
|case time (especially it purse), read manual care, follow care instructions make stays waterproof -- example, inspect rubber seals doors (especially battery/memory card door open time)
|xm "life support" picture . flip part bit flimsy guessing won't long . sound great altec speaker dock it! chance back base (xm3020) . traveling bag connect laptop extra speaker . amount paid ($25).
""".stripMargin('|')
println(str)
val pat = """[^\w\s\.\$]"""
val pat2 = """\s\w{2}\s"""
println(str.replaceAll(pat, "").replaceAll(pat2, ""))
OUTPUT:
case time especially its purse read manual care follow care instructions make stays waterproof example inspect rubber seals doors especially batterymemory card door open time
life support picture . flip part bit flimsy guessing wont long . sound great altec speaker dockchance back base xm3020 . traveling bag connect laptop extra speaker . amount paid $25.
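The "dockchance" in that output comes from `pat2`: `\s\w{2}\s` consumes the space on both sides of "it", and it only matches words of exactly two characters. A hedged variant is sketched below; it keeps the first pattern and swaps in a word-boundary pattern that is an assumption of this sketch, not part of the answer above. Letters are used instead of `\w` so that the "25" in "$25" is not treated as a short word.

```scala
object ShortWordVariant {
  // Same idea as `pat` above: drop anything that is not a word character,
  // whitespace, a period or a dollar sign.
  private val punctuation = """[^\w\s.$]"""
  // Hypothetical replacement for `pat2`: one- or two-letter words,
  // consuming only the whitespace that follows them.
  private val shortWord = """\b[a-zA-Z]{1,2}\b\s*"""

  def clean(s: String): String =
    s.replaceAll(punctuation, "").replaceAll(shortWord, "")
}
```

With this variant, `ShortWordVariant.clean("speaker dock it! chance back")` should give `speaker dock chance back`, with the space between "dock" and "chance" preserved.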