How can I join lines in a CSV file when one of the fields has a newline?

Question

If I have a comma delimited file like:

foo, bar, n
, a, bc, d
one, two, three
, a, bc, d

And I want to join the lines wherever "\n," occurs, to create this:

foo, bar, n, a, bc, d
one, two, three, a, bc, d

What is the regex trick? I thought if (/\n,/) would catch this.

Also, do I need to do anything special for a UTF-8 encoded file?

Finally, a Groovy solution would be helpful as well.

perl newline csv groovy

anon 10 nov. '08 at 18:14

5 answers

This works for me:

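use strict;
use warnings;

# Slurp the whole file so the regex can match across line boundaries.
open(my $fh, '<', 'test.txt') or die "Cannot open test.txt: $!";
my $s = do { local $/; <$fh> };
close($fh);

# Remove every newline that is immediately followed by a comma.
$s =~ s/\n,/,/g;
print $s;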

$ cat test.txt
foo,bar,n
,a,bc,d
one,two,three
,a,bc,d
$ perl test.pl 
foo,bar,n,a,bc,d
one,two,three,a,bc,d

Greg Hewgill 10 nov. '08 at 18:27

Here's a Groovy version. Depending on your requirements, there are some nuances it may not handle (for example, quoted strings that contain commas). It also assumes six fields per record and would need to change if a newline can occur in the middle of a field rather than always at the end.

def input = """foo,bar,n
,a,bc,d
one,two,three
,a,bc,d"""

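// Non-capturing groups (?:...) keep each match a plain String; with capturing
// groups, newer Groovy versions yield a List of groups per match instead.
def answer = (input =~ /(?:.*\n?,){5}.*(?:\n|$)/).inject ("") { ans, match  ->
    ans << match.replaceAll("\n","") << "\n"
}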

assert answer.toString() == 
"""foo,bar,n,a,bc,d
one,two,three,a,bc,d
"""

Ted Naleid 11 nov. '08 at 5:26

This may be too simplistic (it doesn't handle the general case), but:

def input = """foo,bar,n
,a,bc,d
one,two,three
,a,bc,d"""

def last
input.eachLine {
    if(it.startsWith(',')) {
        last += it;
        return;
    }
    if(last)
        println last;
    last = it
}
println last

This prints:

foo,bar,n,a,bc,d
one,two,three,a,bc,d

Bob Herrmann 12 nov. '08 at 3:27

This primarily addresses your UTF-8 encoding question.

Depending on the specific encoding, you may also need to look for null bytes. If the advice above didn't work for you, replacing 's/\n,/,/g' with 's/\c@?\n(\c@?)/$1/g' might work without breaking the encoding, although it may be safer to do this iteratively (applying 's/\c@?\n(\c@?)/$1/' to each line, rather than concatenating the lines and applying it globally). This is really a hack and not a substitute for real Unicode support, but if you just need a quick fix or have guarantees about the encoding, it might help.
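
As an alternative to matching null bytes by hand, Perl's I/O layers can decode the file so that the regex operates on characters rather than raw bytes. The following is only a sketch: it assumes the file is UTF-8 unless told otherwise, and the helper name join_continuations and the file names are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: join every line that starts with a comma onto the
# previous line. Input and output go through an encoding layer, so the
# regex sees decoded characters instead of raw bytes.
sub join_continuations {
    my ($in_path, $out_path, $encoding) = @_;
    $encoding ||= 'UTF-8';    # assumption: pass another name if the file differs

    open(my $in,  "<:encoding($encoding)", $in_path)
        or die "Cannot read $in_path: $!";
    open(my $out, ">:encoding($encoding)", $out_path)
        or die "Cannot write $out_path: $!";

    my $pending;              # the logical line currently being assembled
    while (my $line = <$in>) {
        chomp $line;
        if (defined $pending && $line =~ /^,/) {
            $pending .= $line;    # continuation: glue onto the previous line
            next;
        }
        print {$out} "$pending\n" if defined $pending;
        $pending = $line;
    }
    print {$out} "$pending\n" if defined $pending;

    close($in);
    close($out) or die "Cannot close $out_path: $!";
    return;
}

# Usage: perl join.pl test.txt joined.txt
join_continuations(@ARGV) if @ARGV >= 2;
```

Because this works line by line, it never holds the whole file in memory, which matches the "do it iteratively" suggestion above.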

Jeff Bragg 12 nov. at 23:04

Michael Carman · Accepted Answer · 2008-11-10T18:26:38+0000

You should use Text::CSV_XS instead. It supports embedded newlines as well as Unicode files. You need to provide the correct parameters when creating the parser, so be sure to read the documentation carefully.
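
A minimal sketch of that approach, assuming Text::CSV_XS is installed from CPAN and that the field containing the newline is quoted in the file (a CSV parser cannot tell an unquoted newline from a record separator). The helper name flatten_csv and the file names are hypothetical.

```perl
use strict;
use warnings;
use Text::CSV_XS;

# Hypothetical helper: read a CSV file whose quoted fields may contain
# newlines, strip those newlines, and write one record per line.
sub flatten_csv {
    my ($in_path, $out_path) = @_;

    # binary => 1 permits newlines and non-ASCII characters inside fields;
    # auto_diag => 1 reports parse errors instead of failing silently.
    my $parser = Text::CSV_XS->new({ binary => 1, auto_diag => 1 });
    # A separate writer object so the output line ending is explicit.
    my $writer = Text::CSV_XS->new({ binary => 1, eol => "\n" });

    open(my $in,  '<:encoding(UTF-8)', $in_path)
        or die "Cannot read $in_path: $!";
    open(my $out, '>:encoding(UTF-8)', $out_path)
        or die "Cannot write $out_path: $!";

    while (my $row = $parser->getline($in)) {
        s/\R//g for @$row;              # drop line breaks embedded in any field
        $writer->print($out, $row);     # eol is appended by the writer
    }

    close($in);
    close($out) or die "Cannot close $out_path: $!";
    return;
}

# Usage: perl flatten.pl test.csv flat.csv
flatten_csv(@ARGV) if @ARGV >= 2;
```

Unlike the regex approaches, this also copes with quoted fields that contain commas, since the parser rather than a pattern decides where each field ends.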
