How can I speed up text processing with scalaz-stream?

How can I speed up the following scalaz-stream code? It currently takes about 5 minutes to process 70MB of text, so I'm probably doing something completely wrong; the equivalent plain Scala version takes only a few seconds.


  val converter2: Task[Unit] = {
    val docSep = "~~~"
    io.linesR("myInput.txt")
      .flatMap { line =>
        val words = line.split(" ")
        if (words.length == 0 || words(0) != docSep) Process(line)
        else Process(docSep, words.tail.mkString(" "))
      }
      .split(_ == docSep)
      .filter(_ != Vector())
      .map(lines => lines.head + ": " + lines.tail.mkString(" "))
      .intersperse("\n")
      .pipe(text.utf8Encode)
      .to(io.fileChunkW("correctButSlowOutput.txt"))
      .run
  }
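For comparison, the plain-Scala equivalent alluded to above could look something like the following sketch (the `PlainConverter` name is hypothetical, and it assumes, like the stream version, that `~~~` only acts as a separator when it starts a line):

```scala
object PlainConverter {
  val docSep = "~~~"

  // Split lines into documents: a line starting with docSep begins a new
  // document, with the remainder of that line used as the document header.
  def convert(lines: Seq[String]): Seq[String] = {
    val docs = scala.collection.mutable.ArrayBuffer.empty[Vector[String]]
    var current = Vector.empty[String]
    for (line <- lines) {
      val words = line.split(" ")
      if (words.nonEmpty && words(0) == docSep) {
        if (current.nonEmpty) docs += current
        current = Vector(words.tail.mkString(" "))
      } else {
        current = current :+ line
      }
    }
    if (current.nonEmpty) docs += current
    // Render each document as "header: body".
    docs.toSeq.map(d => d.head + ": " + d.tail.mkString(" "))
  }
}
```

Running this over an in-memory `Seq` of lines sidesteps the per-element overhead of the stream entirely, which is one way to measure how much of the 5 minutes is framework cost rather than I/O.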

2 answers


The following is based on the chunking suggestion by @user1763729. It feels clunky, though, and it is just as slow as the original version.

  val converter: Task[Unit] = {
    val docSep = "~~~"
    io.linesR("myInput.txt")
      .intersperse("\n") // handle empty documents (chunkBy has to switch from true to false)
      .zipWithPrevious // chunkBy cuts only *after* the predicate turns false
      .chunkBy {
        case (Some(prev), line) =>
          val words = line.split(" ")
          words.length == 0 || words(0) != docSep
        case (None, line) => true
      }
      .map(_.map(_._1.getOrElse(""))) // get previous element
      .map(_.filter(!Set("", "\n").contains(_)))
      .map(lines => lines.head.split(" ").tail.mkString(" ") + ": " + lines.tail.mkString(" "))
      .intersperse("\n")
      .pipe(text.utf8Encode)
      .to(io.fileChunkW("stillSlowOutput.txt"))
      .run
  }


EDIT:



Actually, just reading the file (no writing or processing) already takes 1.5 minutes, so I think there is no hope of speeding this up.

  val converter: Task[Unit] = {
    io.linesR("myInput.txt")
      .pipe(text.utf8Encode)
      .run
  }
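As a baseline for that read-only measurement, the same file can be read with nothing but the standard library (the path is a placeholder); if this is fast, the slowness is in the streaming layer rather than in disk I/O:

```scala
import scala.io.Source

// Read every line of a text file and count them, using only scala.io.
// This gives a framework-free lower bound on how long "just reading" takes.
def countLines(path: String): Long = {
  val src = Source.fromFile(path)
  try src.getLines().size.toLong
  finally src.close()
}
```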

I think you could just use one of the chunking process1 methods. If you want many lines merged concurrently into your output format, decide whether ordered output matters and use a pipe in combination with merge or tee; this also makes the logic reusable. Since you are doing very little processing per line, you are probably dominated by per-element overhead, so you need to make each unit of work big enough that the overhead doesn't swamp it.
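The batching idea can be sketched in plain Scala (the helper name is hypothetical; in scalaz-stream the analogous step would be a chunking process1 combinator): instead of paying overhead per line, group lines into fixed-size batches and process each batch as one unit of work.

```scala
// Sketch: amortize per-element overhead by handling lines in fixed-size
// batches. The last batch may be smaller than batchSize.
def processInBatches(lines: Iterator[String], batchSize: Int)
                    (f: Seq[String] => String): Iterator[String] =
  lines.grouped(batchSize).map(f)
```

With a batch size in the thousands, the fixed cost of each pipeline step is spread over many lines instead of being paid once per line.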
