How to get Scala string split to match Python

I am using the Spark shell and PySpark to count the words in an article. A Scala flatMap over line.split(" ") and Python's split() produce different word counts (Scala's is higher). I tried split(" +") and split("\\W+") in the Scala code, but I can't get the count down to the same value as Python's.

Does anyone know which pattern will exactly match Python's behavior?



1 answer

Python's str.split() has special behavior when called with no separator (from the Python documentation):

runs of consecutive whitespace are regarded as a single separator, and the result will contain no empty strings at the start or end if the string has leading or trailing whitespace. Consequently, splitting an empty string or a string consisting of just whitespace with a None separator returns [].


For example, ' 1 2 3 '.split() returns ['1', '2', '3'].
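The edge cases described in the quoted documentation can be checked directly in Python (a quick sketch):

```python
# str.split() with no arguments: runs of whitespace collapse into one
# separator, and leading/trailing whitespace yields no empty strings.
print(' 1  2   3  '.split())  # ['1', '2', '3']
print('   '.split())          # []
print(''.split())             # []

# Contrast with an explicit single-space separator, which keeps
# every empty string between consecutive spaces:
print(' 1  2   3  '.split(' '))  # ['', '1', '', '2', '', '', '3', '', '']
```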

The easiest way to reproduce this exactly in Scala is probably:

scala> """\S+""".r.findAllIn(" 1  2   3  ").toList
res0: List[String] = List(1, 2, 3)

scala> """\S+""".r.findAllIn("   ").toList
res1: List[String] = List()

scala> """\S+""".r.findAllIn("").toList
res2: List[String] = List()
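For reference, the \S+ approach mirrors what Python's own re.findall would do: extracting maximal runs of non-whitespace agrees with str.split() on all of these inputs (a sketch):

```python
import re

# re.findall(r'\S+', s) extracts maximal runs of non-whitespace
# characters, which is exactly what str.split() returns by default.
for s in (' 1  2   3  ', '   ', ''):
    assert re.findall(r'\S+', s) == s.split()

print(re.findall(r'\S+', ' 1  2   3  '))  # ['1', '2', '3']
```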


Another way is to trim() the string first and then split on whitespace:


scala> " 1  2   3  ".trim().split("""\s+""")
res3: Array[String] = Array(1, 2, 3)


But this does not match Python for empty strings:

scala> "".trim().split("""\s+""")
res4: Array[String] = Array("")


In Scala, split() on an empty string returns an array with one element (the empty string), while in Python the result is a list with zero elements.
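That one-element difference is exactly what inflates the Scala word count: every blank line contributes one "word" under a Scala-style split but zero under Python's str.split(). This can be demonstrated from Python alone, since re.split shares Scala's empty-string behavior (a sketch with made-up sample lines):

```python
import re

lines = ['hello world', '', '  spark  ']

# Python's default split: the blank line contributes 0 words.
python_count = sum(len(line.split()) for line in lines)

# Scala-style split: re.split, like Scala's split, returns [''] for
# an empty string, so the blank line contributes 1 "word".
scala_like_count = sum(len(re.split(r'\s+', line.strip())) for line in lines)

print(python_count)      # 3
print(scala_like_count)  # 4
```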


