Splitting a very long (4GB) string with newlines
I have a file that is supposed to contain JSON objects, one per line. Unfortunately, a miscommunication when the file was created left the JSON objects separated by only a space instead of a newline.
I need to fix this by replacing each instance of "} {" with "}\n{".
Should be easy for sed or Perl, right?
sed -e "s/}\s{/}\n{/g" file.in > file.out
perl -pe "s/}\s{/}\n{/g" file.in > file.out
But file.in is actually 4.4 GB, which seems to be causing problems for both of these solutions.
The sed command runs for a while, but file.out comes out at only 335 MB, roughly the first 1/10th of the input, cut off in the middle of the line. Perl gets almost exactly as far as sed before quitting mid-stream. My guess is that each is trying to load the entire 4.4 GB "line" into memory, runs out of stack space at around 300 MB, and silently kills itself.
The Perl command fails with the following message:
[1] 2904 segmentation fault perl -pe "s/}\s{/}\n{/g" file.in > file.out
What else should I try?
Unlike the earlier solutions, this one correctly handles "} {" occurring inside a JSON string value, such as {"x":"} {"}.
use strict;
use warnings;
use feature qw( say );
use JSON::XS qw( );

use constant READ_SIZE => 64*1024*1024;

my $j_in  = JSON::XS->new->utf8;
my $j_out = JSON::XS->new;

binmode STDIN;
binmode STDOUT, ':encoding(UTF-8)';

while (1) {
    my $rv = sysread(\*STDIN, my $block, READ_SIZE);
    die($!) if !defined($rv);
    last if !$rv;

    $j_in->incr_parse($block);               # Feed the new bytes to the parser.
    while (my $o = $j_in->incr_parse()) {    # Pull out each complete object.
        say $j_out->encode($o);
    }
}

die("Bad data") if $j_in->incr_text !~ /^\s*\z/;
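The parse-rather-than-pattern-match idea is not Perl-specific. Here is a sketch of the same approach in Python using the standard library's json.JSONDecoder.raw_decode; the function name split_objects and the sample input are mine, and a real 4 GB file would still need chunked reading on top of this:

```python
import json

def split_objects(data):
    """Decode concatenated JSON objects and re-emit them one per line.
    Because this actually parses the JSON, a "} {" inside a string
    value is left alone."""
    dec = json.JSONDecoder()
    out = []
    idx, n = 0, len(data)
    while idx < n:
        while idx < n and data[idx].isspace():  # skip inter-object whitespace
            idx += 1
        if idx >= n:
            break
        obj, idx = dec.raw_decode(data, idx)    # parse one complete object
        out.append(json.dumps(obj))
    return "\n".join(out)

print(split_objects('{"x":"} {"} {"y":1}'))
```

Note that raw_decode raises ValueError on malformed input, which plays the same role as the "Bad data" check above.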
You can read the input in blocks/chunks and process them one at a time.
use strict;
use warnings;
binmode(STDIN);
binmode(STDOUT);

my $CHUNK  = 0x2000;   # 8 KiB
my $buffer = '';

while (sysread(STDIN, $buffer, $CHUNK, length($buffer))) {
    $buffer =~ s/\}\s\{/}\n{/sg;
    if (length($buffer) > $CHUNK) {            # more than one chunk buffered
        syswrite(STDOUT, $buffer, $CHUNK);     # write FIRST of buffered chunks
        substr($buffer, 0, $CHUNK, '');        # remove it from the buffer
    }
}
syswrite(STDOUT, $buffer) if length($buffer);
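The same carry-a-tail technique can be sketched in Python. This is an illustrative version, not the answer's code: it replaces only a literal "} {" (the Perl matches any single whitespace via \s), and the small demo at the bottom uses chunk=8 to force a pair to straddle a chunk boundary:

```python
import io

def stream_replace(infile, outfile, chunk=0x2000):
    """Replace "} {" with "}\n{" without holding the whole stream in memory.
    The tail of the buffer is kept across reads, so a pair that straddles
    a chunk boundary is still seen (and replaced) in one piece."""
    buf = b""
    while True:
        block = infile.read(chunk)
        if not block:
            break
        buf += block
        buf = buf.replace(b"} {", b"}\n{")
        if len(buf) > chunk:            # more than one chunk buffered:
            outfile.write(buf[:chunk])  # flush the first chunk
            buf = buf[chunk:]           # keep the tail for the next read
    outfile.write(buf)                  # flush whatever remains

# Tiny in-memory demo; chunk=8 splits a "} {" across two reads.
out = io.BytesIO()
stream_replace(io.BytesIO(b'{"a":1} {"b":2} {"c":3}'), out, chunk=8)
print(out.getvalue())
```

The key invariant is that the replacement runs on the full buffer before anything is flushed, so no pair is ever cut in half at a write boundary.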
Assuming your input does not contain } { pairs in any other context that you don't want replaced, all you need is:
awk -v RS='} {' '{ORS=(RT ? "}\n{" : "\n")} 1'
eg.
$ printf '{foo} {bar}' | awk -v RS='} {' '{ORS=(RT ? "}\n{" : "\n")} 1'
{foo}
{bar}
The above uses GNU awk for its multi-character RS and RT support. It works on input files of any size, since it never reads the whole file into memory at once; it only holds one } {-delimited "record" at a time.