Function that returns a list of maps while iterating over a string (k-mer counting)

I am working on creating a k-mer frequency counter (similar to word counting in Hadoop) written in Scala. I'm new to Scala, but I have some programming experience.

The input is a text file containing a gene sequence, and my task is to get the frequency of each k-mer, where k is a specified substring length.

For example, the sequence AGCTTTC has three 5-mers (AGCTT, GCTTT, CTTTC).

I have parsed the input and created one huge string that represents the entire sequence. Newlines would throw off the k-mer counting, because the end of one line should still form a k-mer with the beginning of the next line.

Now I am trying to write a function that will generate a list of maps, List[Map[String, Int]], with which it should be easy to use Scala's groupBy function to get the count of each k-mer.

import scala.io.Source

object Main {
  def main(args: Array[String]) {

    // Get all of the lines from the input file
    val input = Source.fromFile("input.txt").getLines.toArray

    // Create one huge string which contains all the lines but the first
    val lines = input.tail.mkString.replace("\n","")

    val mappedKmers: List[Map[String,Int]] = getMappedKmers(5, lines)

  }

  def getMappedKmers(k: Int, seq: String): List[Map[String, Int]] = {
    for (i <- 0 until seq.length - k) {
      Map(seq.substring(i, i+k), 1) // Map the k-mer to a count of 1
    }
  }
}

      

A couple of questions:

  • How do I create / generate a List[Map[String,Int]]?
  • How would you do it?

Any help and / or advice is definitely appreciated!

1 answer


You're pretty close - there are three fairly minor issues with your code.

The first is that for (i <- whatever) foo(i) is syntactic sugar for whatever.foreach(i => foo(i)), which means you are not actually doing anything with the contents of whatever. You want for (i <- whatever) yield foo(i), which is sugar for whatever.map(i => foo(i)) and returns the transformed collection.
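To make the difference concrete, here is a minimal sketch contrasting the two forms. The names ForYieldDemo, withoutYield, and withYield are made up for illustration and are not part of the question's code.

```scala
object ForYieldDemo {
  // Without yield: desugars to foreach, so each result is thrown away
  // and the method can only return Unit.
  def withoutYield(xs: List[Int]): Unit =
    for (x <- xs) x * 2

  // With yield: desugars to map, so the doubled values are collected
  // and returned.
  def withYield(xs: List[Int]): List[Int] =
    for (x <- xs) yield x * 2

  def main(args: Array[String]): Unit = {
    println(withoutYield(List(1, 2, 3))) // prints ()
    println(withYield(List(1, 2, 3)))    // prints List(2, 4, 6)
  }
}
```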

The second problem is that 0 until seq.length - k is a Range, not a List, so even after adding the yield, the result still will not match the declared return type.

The third problem is that Map(k, v) tries to create a map with two key-value pairs, k and v. You want Map(k -> v) or Map((k, v)), each of which makes it explicit that you are passing a single pair as the argument.
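As a small illustration (MapPairDemo and its fields are hypothetical names), both spellings build the same one-entry map:

```scala
object MapPairDemo {
  // Both forms pass a single (String, Int) tuple to Map.apply.
  val arrow: Map[String, Int] = Map("AGCTT" -> 1)
  val tuple: Map[String, Int] = Map(("AGCTT", 1))

  def main(args: Array[String]): Unit = {
    println(arrow == tuple) // prints true
    println(arrow("AGCTT")) // prints 1
  }
}
```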

So the following should work:



def getMappedKmers(k: Int, seq: String): IndexedSeq[Map[String, Int]] = {
  // `to` is inclusive, so the final k-mer is kept (`until` would drop it)
  for (i <- 0 to seq.length - k) yield {
    Map(seq.substring(i, i + k) -> 1) // Map the k-mer to a count of 1
  }
}

      

You can also convert either the range or the entire result to a list with .toList if you would prefer a list at the end.
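A sketch of the list-returning variant might look like the following. KmerListDemo and getMappedKmersList are illustrative names, and the range uses the inclusive `to` so that the last k-mer of the sequence is included.

```scala
object KmerListDemo {
  // Hypothetical variant that matches the originally declared return type.
  // `0 to seq.length - k` is inclusive, so the last window is not skipped;
  // if seq is shorter than k the range is empty and so is the result.
  def getMappedKmersList(k: Int, seq: String): List[Map[String, Int]] =
    (for (i <- 0 to seq.length - k)
      yield Map(seq.substring(i, i + k) -> 1)).toList

  def main(args: Array[String]): Unit = {
    println(getMappedKmersList(5, "AGCTTTC"))
    // prints List(Map(AGCTT -> 1), Map(GCTTT -> 1), Map(CTTTC -> 1))
  }
}
```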

It is worth noting, incidentally, that the sliding method on Seq (also available on String) does exactly what you want:

scala> "AGCTTTC".sliding(5).foreach(println)
AGCTT
GCTTT
CTTTC

      

I would definitely suggest something like "AGCTTTC".sliding(5).toList.groupBy(identity) for real code.
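Note that groupBy(identity) yields a Map[String, List[String]], so one more step is needed to turn each group into a count. A minimal sketch, assuming a helper called countKmers (a made-up name), could be:

```scala
object KmerCountDemo {
  // Slide a window of width k over seq, group equal windows together,
  // then replace each group with its size.
  def countKmers(k: Int, seq: String): Map[String, Int] =
    seq.sliding(k)
      .filter(_.length == k) // sliding emits one short window if seq is shorter than k
      .toList
      .groupBy(identity)
      .map { case (kmer, hits) => kmer -> hits.size }

  def main(args: Array[String]): Unit = {
    val counts = countKmers(2, "AGAGT")
    println(counts("AG")) // prints 2
    println(counts("GT")) // prints 1
  }
}
```

This skips the intermediate List[Map[String, Int]] entirely, since the windows themselves already serve as grouping keys.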
