How to split a file into blocks defined by a keyword
Suppose I have a large text file like:
variableStep chrom=chr1
sometext1
sometext1
sometext1
variableStep chrom=chr2
sometext2
variableStep chrom=chr3
sometext3
sometext3
sometext3
sometext3
I would like to split this file into 3 files: file 1 has content
sometext1
sometext1
sometext1
file 2 has content
sometext2
and file 3 has content
sometext3
sometext3
sometext3
sometext3
Note that none of the "sometext1" "sometext2" "sometext3" will have the word "variableStep".
I can do this in Python by simply iterating over the lines and opening a new file descriptor, writing subsequent lines to it, every time I encounter "variableStep" at the beginning of a line. However, I'm wondering if this can be done on the command line. Note that the real files are massive (several GBs), so reading all the content in one go is not feasible.
Thanks!
This will create file1, file2, etc. with the desired content:
awk '/variableStep/{close(f); f="file" ++c;next} {print>f;}' file
How it works

- /variableStep/{close(f); f="file" ++c;next}
  Every time we reach a line containing variableStep, we close the last file used, assign f the name of the next file to use, and then skip the remaining commands and move on to the next line. c is a counter giving us the number for the current file; it is incremented (++c) every time we create a new filename.

- print>f
  For all other lines, we print them to the file named by the variable f.
Since this processes the file line by line, it should be fine even for massive files.
The first output file looks like this:
$ cat file1
sometext1
sometext1
sometext1
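To see the whole thing end to end, here is a minimal sketch that recreates the sample input from the question (using input.txt as an assumed file name) and runs the awk one-liner on it:

```shell
# Recreate the sample input from the question (input.txt is an assumed name)
cat > input.txt <<'EOF'
variableStep chrom=chr1
sometext1
sometext1
sometext1
variableStep chrom=chr2
sometext2
variableStep chrom=chr3
sometext3
sometext3
sometext3
sometext3
EOF

# Split it into file1, file2, file3
awk '/variableStep/{close(f); f="file" ++c; next} {print>f;}' input.txt

# Inspect the results
cat file2        # prints: sometext2
wc -l < file3    # file3 has 4 lines
```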
You did not ask for an awk or perl solution; you tagged your question bash. So here it is.
while IFS= read -r line; do
  if [[ $line =~ ^variableStep ]]; then
    outputfile="file-${line##*chr}.txt"
    continue
  fi
  if [ -n "$outputfile" ]; then
    echo "$line" >> "$outputfile"
  fi
done < inputfile.txt
This skips lines at the beginning of the file until it encounters one containing the pattern used to derive the output file name. It assumes that for chrom=chrN
you want to save the output to file-N.txt
. Salt to taste.
Like John's awk solution, this processes the data line by line, so it doesn't matter what file size you feed it. You can even use either of these solutions to process the stdout of whatever generates this data, but if you do that, you will probably want to tweak the awk solution to close its output file after each write.
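That tweak might look like the following sketch: instead of keeping one file open per block, it reopens (with ">>", append) and closes the output file on every line, so the data is flushed to disk immediately. This assumes the input file is named input.txt and that no stale file1/file2 output is lying around from a previous run (append mode would otherwise add to it):

```shell
# Recreate a small sample input (input.txt is an assumed name)
printf '%s\n' 'variableStep chrom=chr1' sometext1 sometext1 sometext1 \
    'variableStep chrom=chr2' sometext2 > input.txt

# Remove stale output, since ">>" appends to existing files
rm -f file1 file2

# Open in append mode, write one line, close again, for every data line.
# Slower (one open/close per line), but nothing stays buffered.
awk '/variableStep/{++c; next} {f = "file" c; print >> f; close(f)}' input.txt
```

The trade-off is one extra open/close system call pair per line, which only matters if throughput, rather than prompt flushing, is your concern.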
If keeping the file numbers in sync with the chromosome numbers is not important, you can simplify this a bit. For example:
n=0
while IFS= read -r line; do
  case "$line" in
    variableStep*) ((n++)); continue ;;
  esac
  echo "$line" >> "file-${n}.txt"
done < inputfile.txt
In this example, we match the line with a case pattern instead of a regular expression in an if test. Typically, pattern matching is faster than regular-expression matching; if that matters to you, benchmark both yourself.