How to split a file into blocks defined by a keyword

Suppose I have a large text file like:

variableStep chrom=chr1
sometext1
sometext1
sometext1
variableStep chrom=chr2
sometext2
variableStep chrom=chr3
sometext3
sometext3
sometext3
sometext3


I would like to split this file into 3 files: file 1 has content

sometext1
sometext1
sometext1

file 2 has content

sometext2


and file 3 has content

sometext3
sometext3
sometext3
sometext3


Note that none of the "sometext1", "sometext2", or "sometext3" lines will contain the word "variableStep".

I can do this in Python by iterating over the lines and opening a new file descriptor every time I encounter "variableStep" at the beginning of a line, then writing the subsequent lines to it. I'm wondering, though, whether this can be done on the command line. Note that the real files are massive (several GB), so reading all the content in one go is not feasible.

Thanks!



2 answers


This will create file1, file2, etc. with the desired content:

awk '/variableStep/{close(f); f="file" ++c; next} {print > f}' file

How it works

  • /variableStep/{close(f); f="file" ++c; next}

    Every time we reach a line containing variableStep, we close the last file we used, assign f the name of the next file to use, and then skip the remaining commands and move on to the next line. c is a counter that gives us the number for the current file; ++ increments it every time we build a new filename.

  • {print > f}

    Every other line is printed to the file named by the current value of the variable f.



Since this processes the file line by line, it should be fine even for massive files.

The first output file looks like this:

$ cat file1
sometext1
sometext1
sometext1

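If you'd rather name each output file after its chromosome (chr1.txt, chr2.txt, ...) instead of a running counter, a small variant of the same idea works. This is just a sketch, assuming the header lines always have the form variableStep chrom=chrN:

awk '/^variableStep/{close(f); split($2, a, "="); f = a[2] ".txt"; next} {print > f}' file

Here split($2, a, "=") breaks the second field chrom=chr1 at the =, so a[2] is chr1 and the following block is written to chr1.txt.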


You didn't ask for an awk or perl solution; you tagged your question bash. So here it is.

while IFS= read -r line; do
  if [[ $line =~ ^variableStep ]]; then
    outputfile="file-${line##*chr}.txt"   # "##*chr" strips everything up to the last "chr", leaving just N
    continue
  fi
  if [ -n "$outputfile" ]; then
    echo "$line" >> "$outputfile"
  fi
done < inputfile.txt


This skips lines at the beginning of the file until it encounters one matching the pattern used to determine the output file name. It assumes that for chrom=chrN you want the output saved to file-N.txt. Adjust to taste.

Like the awk solution above, this processes the data as a stream, line by line, so the size of the input you feed it doesn't matter. You can even use either of these solutions to process the stdout of whatever generates this data, though in that case you would probably want to adapt the awk solution to close its output files after each write.
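Given the sample input above, a run might look like this (split.sh is a hypothetical name for a script containing the loop above):

$ bash split.sh
$ ls file-*.txt
file-1.txt  file-2.txt  file-3.txt
$ cat file-1.txt
sometext1
sometext1
sometext1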



If keeping the file numbers in sync with the chromosome numbers isn't important, you can simplify things a bit. For example:

n=0
while IFS= read -r line; do
  case "$line" in
    variableStep*) ((n++)); continue ;;
  esac
  echo "$line" >> "file-${n}.txt"
done < inputfile.txt


In this example, we evaluate the contents of the line using case pattern matching instead of a regular expression in an if expression. Pattern matching is typically faster than regular expression matching; if that matters to you, benchmark both on your own data.
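One quick way to do that (the script names here are hypothetical; save each loop as its own script first):

rm -f file-*.txt                 # both loops append, so clear old output first
time bash split_regex.sh         # the version using [[ $line =~ ^variableStep ]]
rm -f file-*.txt
time bash split_case.sh          # the version using case pattern matching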
