How can I do transparent gzip uncompress from both stdin and files in perl?

I've written some scripts to handle FASTA / FASTQ files (like fastx-length.pl), but I would like to make them more generic so that they accept both compressed and uncompressed files, as command line parameters and on stdin (so the scripts "just work" when you throw random files at them). It's pretty common for me to work with both uncompressed and compressed files (like compressed read files and uncompressed assembled genomes), and wrapping arguments in things like <(zcat file.fastq.gz) gets very annoying.

Here's an example snippet from the fastx-length.pl script:

...
my @lengths = ();
my $inQual = 0; # false
my $seqID = "";
my $qualID = "";
my $seq = "";
my $qual = "";
while(<>){
  chomp; chomp; # double chomp for Windows CR/LF on Linux machines
  if(!$inQual){
    if(/^(>|@)((.+?)( .*?\s*)?)$/){
      my $newSeqID = $2;
      my $newShortID = $3;
      if($seqID){
        printf("%d %s\n", length($seq), $seqID);
        push(@lengths, length($seq));
      }
...

      

I see that IO::Uncompress::Gunzip supports transparent decompression via its Transparent option:

If this parameter is set and the input file / buffer is not compressed data, the module will still allow it to be read.

Also, if the input file / buffer contains compressed data and uncompressed data appears immediately after it, setting this parameter will cause this module to treat the entire file / buffer as one data stream.

In principle, I would like to slot the transparent decompression into the diamond operator, between opening each file and reading a line from the input files. Does anyone know how I can do this?

+3



3 answers


I often use:

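die("Usage: prog.pl [file [...]]\n") if @ARGV == 0 && -t STDIN;
push(@ARGV, "-") unless @ARGV; # "-" makes two-argument open() read STDIN
for my $fn (@ARGV) {
    # pick a decompressor from the file extension; anything else is opened directly
    open(FH, $fn =~ /\.gz$/? "gzip -dc $fn |" : $fn =~ /\.bz2$/? "bzip2 -dc $fn |" : $fn)
        || die("Cannot open $fn: $!\n");
    print while (<FH>);
    close(FH);
}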

      



This strategy only works when you have gzip, etc. installed and the files are named with the proper file extensions, but once you meet those requirements, it handles different file types in the same run. As for -t STDIN, which tests whether STDIN is attached to a terminal, see the description here.
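One caveat with the snippet above is that the file name is interpolated into a shell command, so names containing spaces or shell metacharacters can misbehave. Below is a minimal sketch of the same extension-based dispatch using the list form of a piped open, which bypasses the shell. The helper name open_by_extension is hypothetical, and gzip and bzip2 are still assumed to be on the PATH:

```perl
use strict;
use warnings;

# Hypothetical helper: return a read handle for $fn, chosen by file extension.
sub open_by_extension {
    my ($fn) = @_;
    my $fh;
    if ($fn eq '-') {
        $fh = \*STDIN;                       # "-" means standard input
    } elsif ($fn =~ /\.gz$/) {
        # list-form pipe open: no shell is involved, so odd file names are safe
        open($fh, '-|', 'gzip', '-dc', $fn)
            or die "Cannot run gzip on $fn: $!\n";
    } elsif ($fn =~ /\.bz2$/) {
        open($fh, '-|', 'bzip2', '-dc', $fn)
            or die "Cannot run bzip2 on $fn: $!\n";
    } else {
        open($fh, '<', $fn) or die "Cannot open $fn: $!\n";
    }
    return $fh;
}

# To default to STDIN, push "-" onto @ARGV first, as in the snippet above.
for my $fn (@ARGV) {
    my $fh = open_by_extension($fn);
    print while <$fh>;
    close($fh) unless $fn eq '-';
}
```

Like the original, this still trusts the extension, so compressed data arriving on stdin is passed through untouched.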

+5



This is something I have wanted to do for a long time. Only recently have I learned to do it reliably.

The approach does not require any file naming conventions. Instead, it checks for the gzip magic number, which is 0x1f8b. This requires reading the first two bytes of each file as a binary stream and checking (with a really nifty function called unpack) whether they match the gzip magic number. This seems to work for me:

$ echo "hi world" | gzip -c > hi_world.gz
$ echo "hi world" > hi_world.txt
$ echo "hi world" | gzip -c > not_a_gz_file
$ perl testgz.pl hi_world.gz hi_world.txt not_a_gz_file
hi_world.gz is gzipped!
hi_world.txt is not gzipped :(
not_a_gz_file is gzipped!

      



The content of testgz.pl is shown below. Please excuse my Perl. It's been a while ...

# testgz.pl
use strict;
use warnings;

my $GZIP_MAGIC_NUMBER = "1f8b";   # hex digits of the first two bytes
my $GZIP_MAGIC_NUMBER_LENGTH = 2; # in bytes

for my $arg (@ARGV){
    if(is_gzipped($arg)){
        print "$arg is gzipped!\n";
    } else{
        print "$arg is not gzipped :(\n";
    }
}


# read the first two bytes of the file and test them against the magic number
sub is_gzipped{
    my $file_name = shift;
    open(my $fh, "<", $file_name)
      or die "Can't open < $file_name: $!";
    binmode($fh); # raw bytes, no newline or encoding translation
    my $line = "";
    read($fh, $line, $GZIP_MAGIC_NUMBER_LENGTH);
    close($fh);
    return is_line_gzipped($line);
}

sub is_line_gzipped{
    my $line = shift;
    my $is_gzipped = 0;
    if (length($line) >= $GZIP_MAGIC_NUMBER_LENGTH){
        my $magic_number = unpack("H4", $line);
        # string comparison: with == both sides would be numified before comparing
        $is_gzipped = 1 if($magic_number eq $GZIP_MAGIC_NUMBER);
    }
    return $is_gzipped;
}

      

In response to the question, I would suggest checking the file you are about to open with the function is_gzipped and then choosing how to open it based on the result.
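As a rough illustration of that suggestion, the sketch below sniffs the two magic bytes and then returns either a gunzipping handle or the plain file handle. The helper name open_maybe_gzipped is hypothetical, and pairing the check with IO::Uncompress::Gunzip is an assumption borrowed from the question rather than part of testgz.pl. Because it rewinds with seek, it only suits regular files, not pipes or stdin:

```perl
use strict;
use warnings;
use IO::Uncompress::Gunzip qw($GunzipError);

# Hypothetical helper: check for the gzip magic number, then open accordingly.
sub open_maybe_gzipped {
    my ($file_name) = @_;
    open(my $fh, '<:raw', $file_name)
        or die "Can't open < $file_name: $!";
    my $got = read($fh, my $magic, 2);
    die "Can't read $file_name: $!" unless defined $got;
    # rewind so the reader sees the file from the start (regular files only)
    seek($fh, 0, 0) or die "Can't rewind $file_name: $!";
    if ($got == 2 && $magic eq "\x1f\x8b") {
        my $z = IO::Uncompress::Gunzip->new($fh)
            or die "gunzip failed on $file_name: $GunzipError\n";
        return $z;     # reads decompressed lines
    }
    return $fh;        # not gzipped: read the file as it is
}
```

Either return value can be read with the usual `while (my $line = <$handle>)` loop, so the calling code does not need to know which kind of file it was given.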

+2



I think I am mostly struggling to tease apart the different bits of the diamond operator. I found some help in the Compress::Zlib documentation, which struck me as close to what I wanted to do, except that it tries to uncompress everything (eventually producing garbage output for uncompressed files):

use strict ;
use warnings ;
use Compress::Zlib ;

# use stdin if no files supplied
@ARGV = '-' unless @ARGV ;

foreach my $file (@ARGV) {
    my $buffer ;

    my $gz = gzopen($file, "rb") 
         or die "Cannot open $file: $gzerrno\n" ;

    print $buffer while $gz->gzread($buffer) > 0 ;

    die "Error reading from $file: $gzerrno" . ($gzerrno+0) . "\n" 
        if $gzerrno != Z_STREAM_END ;

    $gz->gzclose() ;
}

      

Here's my modification to switch to IO::Uncompress::Gunzip and get transparent decompression:

#!/usr/bin/perl
use strict;
use warnings;

use IO::Uncompress::Gunzip qw(gunzip $GunzipError);

# use stdin if no files supplied
@ARGV = '-' unless @ARGV;

foreach my $file (@ARGV) {
    # Transparent => 1 lets uncompressed input pass through unchanged
    my $z = IO::Uncompress::Gunzip->new($file, Transparent => 1)
        or die "gunzip failed: $GunzipError\n";
    while(<$z>){
        print;
    }
    close($z);
}

      

This seems to only work for reading and writing files (like zcat), but I have yet to test this in my scripts.
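For use inside a line-by-line script such as fastx-length.pl, one possible way to stand in for the diamond operator is a small iterator that walks the inputs and hands back one line at a time. This is an untested-in-production sketch: the helper name make_line_reader is hypothetical, and it assumes every input can be opened by IO::Uncompress::Gunzip with the Transparent option:

```perl
use strict;
use warnings;
use IO::Uncompress::Gunzip qw($GunzipError);

# Hypothetical helper: returns a closure that yields the next line from the
# given inputs in order, roughly like <> but gunzipping where needed.
sub make_line_reader {
    my @inputs = @_ ? @_ : ('-');   # '-' is the module's alias for stdin
    my $z;                          # handle for the input currently being read
    return sub {
        while (1) {
            if (!$z) {
                return unless @inputs;          # no inputs left: end of data
                my $file = shift @inputs;
                $z = IO::Uncompress::Gunzip->new($file, Transparent => 1)
                    or die "gunzip failed on $file: $GunzipError\n";
            }
            my $line = $z->getline();
            return $line if defined $line;
            $z->close();                        # current input exhausted
            undef $z;                           # move on to the next one
        }
    };
}

# Usage sketch, in place of while(<>){ ... }:
#   my $next_line = make_line_reader(@ARGV);
#   while (defined($_ = $next_line->())) { chomp; chomp; ... }
```

Files made of several concatenated gzip members would additionally need the module's MultiStream option, which is off by default.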

0

