Splitting a very long (4GB) string with newlines

I have a file that should contain JSON objects, one per line. Unfortunately, something went wrong when the file was generated, and the JSON objects are separated only by a space, not a newline.

I need to fix this by replacing each instance of } { with }\n{.

Should be easy for sed or Perl, right?

sed -e "s/}\s{/}\n{/g" file.in > file.out

perl -pe "s/}\s{/}\n{/g" file.in > file.out

But file.in is actually 4.4 GB, which seems to be causing problems for both of these solutions.

The sed command runs for a while, but file.out is only 335 MB and covers only about the first tenth of the input file, cut off in the middle of a line; it is as if sed simply quit partway through the stream. My guess is that it tries to load the entire 4.4 GB file into memory, runs out after about 300 MB of stack space, and kills itself silently.

The Perl command fails with a segmentation fault:

[1] 2904 segmentation fault perl -pe "s/}\s{/}\n{/g" file.in > file.out

What else should I try?



5 answers


Perl's default input record separator is \n, but you can change it to any character you want. For this problem you can use { (octal 173):



perl -0173 -pe 's/}\s{/}\n{/g' file.in > file.out
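
For reference, here is what that one-liner does, written out as a standalone script (a sketch of the same record-separator approach; it reads STDIN and writes STDOUT):

use strict;
use warnings;

# The effect of -0173: make '{' (octal 173) the input record separator,
# so Perl reads one '{'-terminated chunk at a time instead of the whole file.
local $/ = '{';

while (my $record = <STDIN>) {
    $record =~ s/}\s{/}\n{/g;   # turn "} {" into "}\n{" within this record
    print $record;
}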

      



Unlike the earlier solutions, this one correctly handles } { inside a JSON string, e.g. {"x":"} {"}. It uses JSON::XS's incremental parser to split the stream into complete objects and prints them one per line.



use strict;
use warnings;
use feature qw( say );

use JSON::XS qw( );

use constant READ_SIZE => 64*1024*1024;   # read 64 MiB at a time

my $j_in  = JSON::XS->new->utf8;   # decodes the UTF-8 bytes we feed it
my $j_out = JSON::XS->new;

binmode STDIN;
binmode STDOUT, ':encoding(UTF-8)';

while (1) {
   my $rv = sysread(\*STDIN, my $block, READ_SIZE);
   die($!) if !defined($rv);
   last if !$rv;

   # Feed the new bytes to the incremental parser ...
   $j_in->incr_parse($block);

   # ... and emit every complete object parsed so far, one per line.
   while (my $o = $j_in->incr_parse()) {
      say $j_out->encode($o);
   }
}

# Anything left in the parser's buffer other than whitespace is malformed input.
die("Bad data") if $j_in->incr_text !~ /^\s*\z/;

      



Another record-separator trick: set the input record separator $/ to } { so that -l chomps it off each record, and set the output record separator $\ to }\n{ so that print puts the objects back with a newline between them; $\ is undefined at eof so nothing extra is appended after the last object:

perl -ple 'BEGIN{$/=qq/} {/;$\=qq/}\n{/}undef$\ if eof' <input >output
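
For example:

$ printf '{"a":1} {"b":2} {"c":3}' | perl -ple 'BEGIN{$/=qq/} {/;$\=qq/}\n{/}undef$\ if eof'
{"a":1}
{"b":2}
{"c":3}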

      



You can read the input in fixed-size chunks and process them one at a time. The trick is to hold the most recent chunk back in a buffer, so a } { pair that straddles a read boundary is still matched:

use strict;
use warnings;

binmode(STDIN);
binmode(STDOUT);
my $CHUNK = 0x2000;   # 8 KiB
my $buffer = '';

# Append each read to whatever is still buffered, so a "} {" spanning
# two reads is visible to the substitution below.
while( sysread(STDIN, $buffer, $CHUNK, length($buffer))) {
  # \s also matches \n, so re-scanning already-fixed text is harmless.
  $buffer =~ s/\}\s\{/}\n{/sg;
  if( length($buffer) > $CHUNK) {       # more than one chunk buffered
    syswrite( STDOUT, $buffer, $CHUNK); # write the FIRST buffered chunk
    substr($buffer,0,$CHUNK,'');        # and remove it from the buffer
  }
}
syswrite( STDOUT, $buffer) if length($buffer);  # flush the remainder
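
The buffer never grows past about two chunks (16 KiB here), so memory use stays constant regardless of file size. Run it as a filter like the other scripts, e.g. perl chunked.pl < file.in > file.out (script name assumed).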

      



Assuming your input does not contain } { pairs in other contexts that you don't want replaced, all you need is:

awk -v RS='} {' '{ORS=(RT ? "}\n{" : "\n")} 1'

      

For example:

$ printf '{foo} {bar}' | awk -v RS='} {' '{ORS=(RT ? "}\n{" : "\n")} 1'
{foo}
{bar}

      

The above uses GNU awk (for its multi-character RS and the RT variable) and will work on an input file of any size, since it does not read the entire file into memory at once; it holds only a single } {-separated "line" at a time.
