Convert CSV to JSON in bash

I'm trying to convert a CSV file to JSON.

Here are two lines:

-21.3214077;55.4851413;Ruizia cordata
-21.3213078;55.4849803;Cossinia pinnata

      

I would like to get something like:

"occurrences": [
    {
        "position": [-21.3214077, 55.4851413],
        "taxo": {
            "espece": "Ruizia cordata"
        }
    },
    ...
]

      

Here's my script:

echo '"occurences": [ '
cat se.csv | while read -r line
do
    IFS=';' read -r -a array <<< $line;
    echo -n -e '{ "position": [' ${array[0]}
    echo -n -e ',' ${array[1]} ']'
    echo -e ', "taxo": {"espece":"' ${array[2]} '"'
done
echo "]";

      

I am getting really weird results:

   "occurences": [ 
 ""position": [ -21.3214077, 55.4851413 ], "taxo": {"espece":" Ruizia cordata
 ""position": [ -21.3213078, 55.4849803 ], "taxo": {"espece":" Cossinia pinnata

      

What's wrong with my code?


The right tool for the job is jq.

jq -Rsn '
  {"occurrences":
    [inputs
     | . / "\n"
     | (.[] | select(length > 0) | . / ";") as $input
     | {"position": [$input[0], $input[1]], "taxo": {"espece": $input[2]}}]}
' <se.csv

      

Given your input, this emits:

{
  "occurrences": [
    {
      "position": [
        "-21.3214077",
        "55.4851413"
      ],
      "taxo": {
        "espece": "Ruizia cordata"
      }
    },
    {
      "position": [
        "-21.3213078",
        "55.4849803"
      ],
      "taxo": {
        "espece": "Cossinia pinnata"
      }
    }
  ]
}

      




By the way, a less-buggy version of your original script might look like this:

#!/usr/bin/env bash

items=( )
while IFS=';' read -r lat long pos _; do
  printf -v item '{ "position": [%s, %s], "taxo": {"espece": "%s"}}' "$lat" "$long" "$pos"
  items+=( "$item" )
done <se.csv

IFS=','
printf '{"occurrences": [%s]}\n' "${items[*]}"

      

Note:

  • It's absolutely pointless to use cat to pipe into a loop (and there's good reason not to); instead, we use a redirection (<) to open the file directly as the loop's stdin.
  • read can be passed a list of destination variables, so there's no need to read into an array (or to read into a string, generate a here-string, and read from that into an array). The _ at the end ensures that extra columns are discarded (by putting them in the dummy variable named _) rather than being appended to pos.
  • "${items[*]}" generates a string by concatenating the elements of items with the character in IFS; we can thus use it to ensure that commas are present in the output only where they're needed.
  • printf is used in preference to echo, as advised in the APPLICATION USAGE section of the POSIX specification for echo.
  • This is still buggy, as it generates JSON via string concatenation. Don't use it.
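
The "${items[*]}" join described above can be illustrated in isolation (the array contents here are made up for the demo):

```shell
# "${items[*]}" joins the array elements with the first character of IFS
items=( '{"a": 1}' '{"b": 2}' '{"c": 3}' )
IFS=','
joined="${items[*]}"
printf '{"list": [%s]}\n' "$joined"   # -> {"list": [{"a": 1},{"b": 2},{"c": 3}]}
```

Because the comma comes from the join rather than from each element, there is no trailing comma to strip afterwards.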

Here is an article on the topic: https://infiniteundo.com/post/99336704013/convert-csv-to-json-with-jq

It also uses jq, but with a slightly different approach, using split() and map().



jq --slurp --raw-input \
   'split("\n") | .[1:] | map(split(";")) |
      map({
         "position": [.[0], .[1]],
         "taxo": {
             "espece": .[2]
          }
      })' \
  input.csv > output.json

      

However, it does not handle escaping of the delimiter.



The accepted answer uses jq to parse the input. This works, but jq doesn't handle escaping, i.e. input from a CSV generated by Excel or similar tools is quoted like this:

foo,"bar,baz",gaz

and will lead to wrong output, as jq will see 4 fields, not 3.

One option is to use tab-separated values instead of commas (assuming your input contains no tabs!), together with the accepted answer's approach.
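
That failure mode doesn't even need jq to demonstrate; a naive comma split in bash breaks the quoted field the same way:

```shell
# Naive splitting of a quoted CSV line on commas
line='foo,"bar,baz",gaz'
IFS=',' read -r -a fields <<<"$line"
echo "${#fields[@]}"             # prints 4: the quoted field was split in two
printf '%s\n' "${fields[@]}"
```

Any splitter that looks only at the delimiter character, jq's included, makes the same mistake.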

Another option is to combine tools and use the best tool for each part: a CSV parser to read the input and convert it to JSON, and jq to convert that JSON into the target format.

The Python-based csvkit parses CSV intelligently and comes with a tool, csvjson, that does a much better job of converting CSV to JSON. Its output can then be piped through jq to convert csvkit's flat JSON into the target form.

With the data provided by the OP, getting the desired output is as simple as:

csvjson --no-header-row se.csv |
  jq '.[] | {occurrences: [{ position: [.a, .b], taxo: {espece: .c}}]}'

      

Note that csvjson automatically detects ; as the delimiter and, with no header row in the input, assigns the JSON keys as a, b and c.

The same applies in the other direction: csvkit can read JSON or newline-delimited JSON and intelligently write CSV via in2csv.



If you want to go crazy, you can write a parser in jq itself. Here is my implementation, which can be thought of as the inverse of the @csv filter. Add it to your .jq file.

def do_if(pred; update):
    if pred then update else . end;
def _parse_delimited($_delim; $_quot; $_nl; $_skip):
    [($_delim, $_quot, $_nl, $_skip)|explode[]] as [$delim, $quot, $nl, $skip] |
    [0,1,2,3,4,5] as [$s_start,$s_next_value,$s_read_value,$s_read_quoted,$s_escape,$s_final] |
    def _append($arr; $value):
        $arr + [$value];
    def _do_start($c):
        if $c == $nl then
            [$s_start, null, null, _append(.[3]; [""])]
        elif $c == $delim then
            [$s_next_value, null, [""], .[3]]
        elif $c == $quot then
            [$s_read_quoted, [], [], .[3]]
        else
            [$s_read_value, [$c], [], .[3]]
        end;
    def _do_next_value($c):
        if $c == $nl then
            [$s_start, null, null, _append(.[3]; _append(.[2]; ""))]
        elif $c == $delim then
            [$s_next_value, null, _append(.[2]; ""), .[3]]
        elif $c == $quot then
            [$s_read_quoted, [], .[2], .[3]]
        else
            [$s_read_value, [$c], .[2], .[3]]
        end;
    def _do_read_value($c):
        if $c == $nl then
            [$s_start, null, null, _append(.[3]; _append(.[2]; .[1]|implode))]
        elif $c == $delim then
            [$s_next_value, null, _append(.[2]; .[1]|implode), .[3]]
        else
            [$s_read_value, _append(.[1]; $c), .[2], .[3]]
        end;
    def _do_read_quoted($c):
        if $c == $quot then
            [$s_escape, .[1], .[2], .[3]]
        else
            [$s_read_quoted, _append(.[1]; $c), .[2], .[3]]
        end;
    def _do_escape($c):
        if $c == $nl then
            [$s_start, null, null, _append(.[3]; _append(.[2]; .[1]|implode))]
        elif $c == $delim then
            [$s_next_value, null, _append(.[2]; .[1]|implode), .[3]]
        else
            [$s_read_quoted, _append(.[1]; $c), .[2], .[3]]
        end;
    def _do_final($c):
        .;
    def _do_finalize:
        if .[0] == $s_start then
            [$s_final, null, null, .[3]]
        elif .[0] == $s_next_value then
            [$s_final, null, null, _append(.[3]; [""])]
        elif .[0] == $s_read_value then
            [$s_final, null, null, _append(.[3]; _append(.[2]; .[1]|implode))]
        elif .[0] == $s_read_quoted then
            [$s_final, null, null, _append(.[3]; _append(.[2]; .[1]|implode))]
        elif .[0] == $s_escape then
            [$s_final, null, null, _append(.[3]; _append(.[2]; .[1]|implode))]
        else # .[0] == $s_final
            .
        end;
    reduce explode[] as $c (
        [$s_start,null,null,[]];
        do_if($c != $skip;
            if .[0] == $s_start then
                _do_start($c)
            elif .[0] == $s_next_value then
                _do_next_value($c)
            elif .[0] == $s_read_value then
                _do_read_value($c)
            elif .[0] == $s_read_quoted then
                _do_read_quoted($c)
            elif .[0] == $s_escape then
                _do_escape($c)
            else # .[0] == $s_final
                _do_final($c)
            end
        )
    )
    | _do_finalize[3][];
def parse_delimited($delim; $quot; $nl; $skip):
    _parse_delimited($delim; $quot; $nl; $skip);
def parse_delimited($delim; $quot; $nl):
    parse_delimited($delim; $quot; $nl; "\r");
def parse_delimited($delim; $quot):
    parse_delimited($delim; $quot; "\n");
def parse_delimited($delim):
    parse_delimited($delim; "\"");
def parse_csv:
    parse_delimited(",");

      

For your data, you can change the separator to semicolons.

$ cat se.csv
-21.3214077;55.4851413;Ruizia cordata
-21.3213078;55.4849803;Cossinia pinnata
$ jq -R 'parse_delimited(";")' se.csv
[
  "-21.3214077",
  "55.4851413",
  "Ruizia cordata"
]
[
  "-21.3213078",
  "55.4849803",
  "Cossinia pinnata"
]

      

This will work fine for most inputs, parsing one line at a time, but if your data contains literal newlines, you will need to read the whole file in as a single string:

$ cat input.csv
Year,Make,Model,Description,Price
1997,Ford,E350,"ac, abs, moon",3000.00
1999,Chevy,"Venture ""Extended Edition""","",4900.00
1999,Chevy,"Venture ""Extended Edition, Very Large""",,5000.00
1996,Jeep,Grand Cherokee,"MUST SELL!
air, moon roof, loaded",4799.00
$ jq -Rs 'parse_csv' input.csv
[
  "Year",
  "Make",
  "Model",
  "Description",
  "Price"
]
[
  "1997",
  "Ford",
  "E350",
  "ac, abs, moon",
  "3000.00"
]
[
  "1999",
  "Chevy",
  "Venture \"Extended Edition\"",
  "",
  "4900.00"
]
[
  "1999",
  "Chevy",
  "Venture \"Extended Edition, Very Large\"",
  "",
  "5000.00"
]
[
  "1996",
  "Jeep",
  "Grand Cherokee",
  "MUST SELL!\nair, moon roof, loaded",
  "4799.00"
]

      



Since the jq solutions don't handle CSV escaping, first-row column names, commented-out rows and other common CSV features, I extended the CSV Cruncher tool to allow reading CSV and writing it back out as JSON. It's not exactly "Bash", but then, neither is jq :)

This is primarily a CSV-as-SQL processing application, so it's not entirely trivial, but here's the trick:

./crunch -in myfile.csv -out output.csv --json -sql 'SELECT * FROM myfile'

      

It also allows output as one JSON object per line or as a valid JSON array. See the documentation.

It is of beta quality, so all feedback and feature requests are welcome.



In general, if your jq has the inputs built-in filter (available since jq 1.5), then it is better to use it than the -s command-line option.

Here, in any case, is a solution using inputs. The solution is also variable-free.

{"occurrences":
  [inputs
   | select(length > 0)
   | . / ";"
   | {"position": [.[0], .[1]], 
      "taxo": {"espece": .[2]}} ]}
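
For reference, the filter above expects raw line-at-a-time input, so it must be run with -n (no automatic input) and -R (raw input). A complete invocation, recreating the sample file from the question, might look like this:

```shell
# Recreate the question's sample input
printf '%s\n' '-21.3214077;55.4851413;Ruizia cordata' \
              '-21.3213078;55.4849803;Cossinia pinnata' > se.csv

# -n: don't consume input up front; -R: read raw strings,
# so "inputs" yields each line of se.csv as a string
out=$(jq -nR '{"occurrences":
  [inputs
   | select(length > 0)
   | . / ";"
   | {"position": [.[0], .[1]],
      "taxo": {"espece": .[2]}} ]}' se.csv)
printf '%s\n' "$out"
```

Note that the positions come out as strings; converting them to numbers would take an explicit tonumber.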

      

SSV, CSV, and all that

The above assumes, of course, that the file has semicolon-separated fields on each line, and that there are none of the complications associated with CSV files.

If the fields are strictly delimited by a single character, then jq should have no problem handling them. Otherwise, it might be better to use a tool that can reliably convert to the TSV (tab-separated values) format, which jq can handle directly.
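
One way to do that conversion, sketched here under the assumption that no field contains a tab, is a simple tr pass before the data reaches jq or read:

```shell
# Turn ;-delimited input into TSV (assumes no tabs inside fields)
printf '%s\n' '-21.3214077;55.4851413;Ruizia cordata' > se_demo.csv
tr ';' '\t' < se_demo.csv > se_demo.tsv

# TSV then splits reliably, e.g. with read:
IFS=$'\t' read -r lat long species < se_demo.tsv
printf '%s\n' "$species"   # -> Ruizia cordata
```

The file names here are made up for the demo; substitute your real input.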

0


source


For the sake of completeness: Xidel, together with some XQuery magic, can do this too:

xidel -s input.csv --xquery '
  {
    "occurrences":for $x in tokenize($raw,"\n") let $a:=tokenize($x,";") return {
      "position":[
        $a[1],
        $a[2]
      ],
      "taxo":{
        "espece":$a[3]
      }
    }
  }
'

      

{
  "occurrences": [
    {
      "position": ["-21.3214077", "55.4851413"],
      "taxo": {
        "espece": "Ruizia cordata"
      }
    },
    {
      "position": ["-21.3213078", "55.4849803"],
      "taxo": {
        "espece": "Cossinia pinnata"
      }
    }
  ]
}

      
