How can I join lines in a CSV file when one of the fields has a newline?

If I have a comma delimited file like:

foo, bar, n
, a, bc, d
one, two, three
, a, bc, d

And I want to join each line that starts with a comma onto the previous line (that is, remove the \n in \n,) to create this:

foo, bar, n, a, bc, d
one, two, three, a, bc, d

What is the regex trick? I thought if (/\n,/) would catch this.

Also, do I need to do anything special for a UTF-8 encoded file?

Finally, a Groovy solution would be helpful as well.



5 answers


You should use Text::CSV_XS instead. It supports embedded newlines as well as Unicode files. You need to supply the right options when creating the parser, so be sure to read the documentation carefully.
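For illustration, here is a minimal sketch of what that might look like. It assumes the fields containing newlines are quoted, since that is the only place a CSV parser will accept them; the sample in the question is unquoted, so the data below is a hypothetical quoted variant. The helper name parse_csv_text is made up for this example; binary and auto_diag are documented Text::CSV_XS constructor options.

```perl
use strict;
use warnings;
use Text::CSV_XS;

# Parse CSV text in which a quoted field may contain a newline.
# Returns a reference to a list of rows (each row is an array reference).
# parse_csv_text is a hypothetical helper name, not part of the module.
sub parse_csv_text {
    my ($text) = @_;

    # binary => 1 allows newlines inside quoted fields;
    # auto_diag => 1 reports parse errors instead of failing silently.
    my $csv = Text::CSV_XS->new({ binary => 1, auto_diag => 1 });

    # Read from an in-memory filehandle. For a real UTF-8 file, open it
    # with '<:encoding(UTF-8)' instead.
    open(my $fh, '<', \$text) or die "Cannot open in-memory handle: $!";
    my @rows;
    while (my $row = $csv->getline($fh)) {
        push @rows, $row;
    }
    close($fh);
    return \@rows;
}

# Hypothetical sample: the third field of each record spans two lines.
my $sample = qq{foo,bar,"n\n,a",bc,d\none,two,"three\n,a",bc,d\n};
my $rows   = parse_csv_text($sample);
printf "%d fields\n", scalar @$_ for @$rows;
```

Each physical pair of lines comes back as a single row, with the newline kept inside the third field.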





This works for me:



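use strict;
use warnings;

# Slurp the whole file so the regex can see across line boundaries.
open(my $fh, '<', 'test.txt') or die "Cannot open test.txt: $!";
my $s = do { local $/; <$fh> };
close($fh);

# A newline directly followed by a comma marks a continuation line.
$s =~ s/\n,/,/g;
print $s;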

$ cat test.txt
foo,bar,n
,a,bc,d
one,two,three
,a,bc,d
$ perl test.pl 
foo,bar,n,a,bc,d
one,two,three,a,bc,d
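As for why if (/\n,/) never fires: when a file is read one line at a time, the newline is the last character of each line and the comma is the first character of the next, so no single line ever contains both. A line-by-line variant can check for a leading comma instead. The sketch below is one way to do that; the function name join_continuation_lines is made up for this example. On UTF-8: newline and comma are single ASCII bytes that never occur inside a multi-byte UTF-8 sequence, so the substitution is safe even on raw bytes, but the usage part opens the file with an explicit encoding layer so the data is handled as characters.

```perl
use strict;
use warnings;

# Join every line that starts with a comma onto the previous line.
# Takes an input filehandle and returns the joined records as a list.
# (join_continuation_lines is a hypothetical name for this sketch.)
sub join_continuation_lines {
    my ($fh) = @_;
    my @records;
    while (my $line = <$fh>) {
        chomp $line;
        if ($line =~ /^,/ && @records) {
            $records[-1] .= $line;    # continuation of the previous record
        } else {
            push @records, $line;     # start of a new record
        }
    }
    return @records;
}

# Usage sketch: pass the file name as the first argument.
# Input is decoded and output is encoded as UTF-8.
if (@ARGV) {
    open(my $in, '<:encoding(UTF-8)', $ARGV[0])
        or die "Cannot open $ARGV[0]: $!";
    binmode(STDOUT, ':encoding(UTF-8)');
    print "$_\n" for join_continuation_lines($in);
    close($in);
}
```

Unlike the slurping version, this keeps only the joined records in memory, not the raw file plus the result. It assumes Unix line endings; a file with CRLF endings would leave a carriage return before each joined comma.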

      



Here's a Groovy version. Depending on the requirements, there are some nuances that this may not capture (for example, quoted strings, which may contain commas). It would also need to change if a newline can occur in the middle of a field rather than always at the end.

def input = """foo,bar,n
,a,bc,d
one,two,three
,a,bc,d"""

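// Assumes six fields (five commas) per record. The groups are non-capturing
// so that each match is passed to the closure as a String, not a List.
def answer = (input =~ /(?:.*\n?,){5}.*(?:\n|$)/).inject ("") { ans, match  ->
    ans << match.replaceAll("\n","") << "\n"
}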

assert answer.toString() == 
"""foo,bar,n,a,bc,d
one,two,three,a,bc,d
"""

      



This may be too simple (it doesn't handle the general case), but:

def input = """foo,bar,n
,a,bc,d
one,two,three
,a,bc,d"""

def last
input.eachLine {
    if(it.startsWith(',')) {
        last += it;
        return;
    }
    if(last)
        println last;
    last = it
}
println last

      

which prints:

foo,bar,n,a,bc,d
one,two,three,a,bc,d

      



This is primarily the answer to your UTF-8 encoding question.

Depending on the specific encoding, you may also need to account for null bytes. If the tip above didn't work for you, replacing 's/\n,/,/g' with 's/\c@?\n(\c@?)/$1/g' might work without breaking the encoding, although it may be safer to do this iteratively (applying 's/\c@?\n(\c@?)/$1/' to each line, rather than concatenating them and applying it globally). This is really a hack and not a substitute for real Unicode support, but if you just need a quick fix or have guarantees about the encoding, it might help.
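For comparison, a sketch of the non-hack route: decode the bytes to characters, apply the plain substitution, and encode again. Null bytes appear next to ASCII characters in encodings such as UTF-16, not in UTF-8 itself, so this example assumes UTF-16LE input. The function name join_lines_in_encoded_bytes is made up for illustration; decode and encode come from the core Encode module.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Decode raw bytes, join continuation lines, and re-encode.
# $encoding is any name Encode understands, e.g. 'UTF-16LE' or 'UTF-8'.
# (join_lines_in_encoded_bytes is a hypothetical name for this sketch.)
sub join_lines_in_encoded_bytes {
    my ($bytes, $encoding) = @_;
    my $text = decode($encoding, $bytes);    # bytes -> characters
    $text =~ s/\n,/,/g;                      # same regex as the simple case
    return encode($encoding, $text);         # characters -> bytes
}

# Usage: build a small UTF-16LE sample, fix it, and show the result.
my $bytes = encode('UTF-16LE', "foo,bar,n\n,a,bc,d\n");
my $fixed = join_lines_in_encoded_bytes($bytes, 'UTF-16LE');
print decode('UTF-16LE', $fixed);    # prints: foo,bar,n,a,bc,d
```

Because the regex runs on decoded characters, it never has to know where the null bytes sit in the encoded form.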
