Replace text on first line in huge tab delimited txt file

I have a huge text file (19 GB in size); it is a genetic data file with variables and observations.
The first line contains the variable names, and they are structured as follows:

id1.var1 id1.var2 id1.var3 id2.var1 id2.var2 id2.var3 

      

I need to replace id1, id2, etc. with the corresponding values from another text file (about 7k lines). The IDs are not in any particular order, and the file is structured like this:

oldId newIds
id1 rs004
id2 rs135

      

I did some searching on Google and couldn't find a language that would do the following:

  • read the first line
  • replace ids with new ids
  • remove the first line from the original file and replace it with a new one.

Is this a good approach or is there a better one?
What's the best language for this?
We have people with experience in Python, VBScript and Perl.

+3




3 answers


In-place replacement is possible in almost any language (I'm sure about Python and Perl), as long as the replacement line is the same length as the original or can be made the same length by padding with spaces. Otherwise you have to rewrite the entire file.



Open the file for reading and writing (r+ mode in Python or C; w+ would truncate the file), read the first line, prepare the new line, seek to position 0 in the file, write the new line, and close the file.
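
A minimal Python sketch of the steps above, assuming the new header is no longer than the old one. The function name `rewrite_header_in_place` and the `transform` callback are hypothetical, introduced only for illustration. Note that the padding spaces end up attached to the last column name, so whatever reads the file afterwards would need to tolerate or strip them.

```python
import os
import tempfile


def rewrite_header_in_place(path, transform):
    """Overwrite the first line of `path` without touching the rest.

    `transform` (hypothetical callback) maps the old header, as bytes
    without its line ending, to the new header. Returns False and leaves
    the file unchanged if the new header is longer than the old one.
    """
    # "r+b" keeps the existing content; "w+b" would truncate the file.
    with open(path, "r+b") as f:
        old = f.readline()
        body = old.rstrip(b"\r\n")
        line_ending = old[len(body):]
        new = transform(body)
        if len(new) > len(body):
            return False  # cannot fit; the whole file would need rewriting
        f.seek(0)
        # Pad with spaces so the line occupies exactly the same bytes.
        f.write(new.ljust(len(body), b" ") + line_ending)
    return True


# Demo on a small temporary file (made-up data).
fd, demo_path = tempfile.mkstemp()
os.close(fd)
with open(demo_path, "wb") as f:
    f.write(b"id100.var1\tid200.var1\n1\t2\n")
fits = rewrite_header_in_place(demo_path, lambda h: h.replace(b"id200", b"rs13"))
with open(demo_path, "rb") as f:
    print(fits, f.read())
os.remove(demo_path)
```

If the function returns False, the new header does not fit and the file has to be rewritten in full, as the answer says.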

+4




I suggest you use the Tie::File module, which maps the lines of a text file to a Perl array and makes rewriting the header line, and the lines after it, a simple job.

This program demonstrates. It first reads all the old/new IDs into a hash and then maps the data file with Tie::File. The first line of the file (in $file[0]) is modified using a substitution, and then the array is untied to rewrite and close the file.



You will need to change the filenames from the ones I used. Also note that I have assumed the IDs are always "word" characters (alphanumeric and underscore) followed by a period, and contain no spaces. Of course, you should back up your file before modifying it, and test the program on a smaller file before updating the real thing.

use strict;
use warnings;

use Tie::File;

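# Build the old-ID => new-ID lookup from the mapping file.
my %ids;
open my $fh, '<', 'newids.txt' or die $!;
while (<$fh>) {
  my ($old, $new) = split;
  $ids{$old} = $new;
}
close $fh;

# Tie the data file to an array; $file[0] is the header line.
tie my @file, 'Tie::File', 'datafile.txt' or die $!;
# Replace each word that precedes a period with its new ID, if one is mapped
# (the // operator needs Perl 5.10 or later).
$file[0] =~ s<(\w+)(?=\.)><$ids{$1} // $1>eg;
untie @file;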

      

+3




This should be pretty easy. I would use Python as I am a Python fan. Structure:

  • Read the mapping file and save the mapping (in Python, use a dictionary).

  • Read the data file one line at a time, rename the variable names, and output the edited line.

You really can't edit the file in place... hmm, I think you could if every new variable name were exactly the same length as the old one. But for programming convenience and safety, it is best to always write a new output file and then delete the original. This means you will need at least 20 GB of free disk space before you start, but that shouldn't be a problem.

Here's a Python program that shows you how. I used your example data to generate test files and it seems to work.

#!/usr/bin/python

import re
import sys

try:
    fname_idmap, fname_in, fname_out = sys.argv[1:]
except ValueError:
    print("Usage: remap_ids <id_map_file> <input_file> <output_file>")
    sys.exit(1)

# pattern to match an ID, only as a complete word (do not match inside another id)
# match start of line or whitespace, then non-period, non-whitespace characters up to a literal period
pat_id = re.compile(r"(^|\s)([^.\s]+)\.")

idmap = {}

def remap_id(m):
    before_word = m.group(1)
    word = m.group(2)
    if word in idmap:
        return before_word + idmap[word] + "."
    else:
        return m.group(0)  # return full matched string unchanged

def replace_ids(line, idmap):
    return re.sub(pat_id, remap_id, line)

with open(fname_idmap, "r") as f:
    next(f)  # discard first line with column header: "oldId newIds"
    for line in f:
        key, value = line.split()
        idmap[key] = value

with open(fname_in, "r") as f_in, open(fname_out, "w") as f_out:
    for line in f_in:
        line = replace_ids(line, idmap)
        f_out.write(line)
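
Since only the first line needs renaming, a variant could remap just the header and copy the remaining lines through unchanged in large binary chunks, which avoids running the regex over 19 GB of data. This is only a sketch under stated assumptions: the names `remap_header` and `remap_file`, the 16 MB chunk size, and the UTF-8 header encoding are all hypothetical choices, not part of the program above.

```python
import os
import re
import shutil
import tempfile

# An ID is a run of non-space, non-period characters at the start of a
# field, followed by a literal period.
ID_PATTERN = re.compile(r"(^|\s)([^.\s]+)\.")


def remap_header(header, idmap):
    """Return `header` with each known ID replaced; unknown IDs are kept."""
    def repl(m):
        return m.group(1) + idmap.get(m.group(2), m.group(2)) + "."
    return ID_PATTERN.sub(repl, header)


def remap_file(fname_in, fname_out, idmap, chunk_size=16 * 1024 * 1024):
    """Write a copy of `fname_in` whose first line has remapped IDs."""
    with open(fname_in, "rb") as f_in, open(fname_out, "wb") as f_out:
        # Assumption: the header line is valid UTF-8 (plain ASCII works too).
        header = f_in.readline().decode("utf-8")
        f_out.write(remap_header(header, idmap).encode("utf-8"))
        # Copy everything after the header byte for byte.
        shutil.copyfileobj(f_in, f_out, chunk_size)


# Demo with made-up data in a temporary directory.
demo_dir = tempfile.mkdtemp()
demo_in = os.path.join(demo_dir, "in.txt")
demo_out = os.path.join(demo_dir, "out.txt")
with open(demo_in, "w") as f:
    f.write("id1.var1\tid1.var2\tid2.var1\n0.5\t1\t2\n")
remap_file(demo_in, demo_out, {"id1": "rs004", "id2": "rs135"})
with open(demo_out) as f:
    print(f.read())
shutil.rmtree(demo_dir)
```

Once the output has been checked, `os.replace(fname_out, fname_in)` could swap the new file in for the original.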

      

+1

