Unmarshal a JSON stream (not newline-separated)

I want to turn a JSON stream into a stream of objects. This is easy to do with newline-delimited JSON; see the Go docs: https://golang.org/pkg/encoding/json/#Decoder.Buffered

However, I need to create a stream from JSON arrays like this:

        [{"Name": "Ed", "Text": "Knock knock."},
        {"Name": "Sam", "Text": "Who's there?"},
        {"Name": "Ed", "Text": "Go fmt."},
        {"Name": "Sam", "Text": "Go fmt who?"},
        {"Name": "Ed", "Text": "Go fmt yourself!"}]

      

What is an efficient way to do this?

I considered this method:

  • Drop the outer square brackets
  • When there are matching top-level curly braces, unmarshal the string between the braces (inclusive) to get one top-level object at a time.

I don't want to do this due to the performance impact of scanning each part of the string twice.

The best alternative I can think of is to copy the source code of the decoder in Go's encoding/json package and modify it so that it returns a Reader that emits one object at a time. But that seems like too much work for such a simple requirement.

Is there a better way to decode a stream that is a JSON array?

EDIT

I need to parse JSON with nested objects and arbitrary structure.



3 answers


You can use a streaming parser, for example the megajson scanner:

package main

import (
    "fmt"
    "strings"

    "github.com/benbjohnson/megajson/scanner"
)

func main() {
    // our incoming data
    rdr := strings.NewReader(`[
        {"Name": "Ed", "Text": "Knock knock."},
        {"Name": "Sam", "Text": "Who's there?"},
        {"Name": "Ed", "Text": "Go fmt."},
        {"Name": "Sam", "Text": "Go fmt who?"},
        {"Name": "Ed", "Text": "Go fmt yourself!"}
    ]`)

    // we want to create a list of these
    type Object struct {
        Name string
        Text string
    }
    objects := make([]Object, 0)

    // scan the JSON as we read
    s := scanner.NewScanner(rdr)

    // this is how we keep track of where we are parsing the JSON
    // if you needed to support nested objects you would need to
    // use a stack here ([]state{}) and push / pop each time you
    // see a brace
    var state struct {
        inKey   bool
        lastKey string
        object  Object
    }
    for {
        tok, data, err := s.Scan()
        if err != nil {
            break
        }

        switch tok {
        case scanner.TLBRACE:
            // just saw '{' so start a new object
            state.inKey = true
            state.lastKey = ""
            state.object = Object{}
        case scanner.TRBRACE:
            // just saw '}' so store the object
            objects = append(objects, state.object)
        case scanner.TSTRING:
            // for `key: value`, we just parsed 'key'
            if state.inKey {
                state.lastKey = string(data)
            } else {
                // now we are on `value`
                if state.lastKey == "Name" {
                    state.object.Name = string(data)
                } else {
                    state.object.Text = string(data)
                }
            }
            state.inKey = !state.inKey
        }
    }
    fmt.Println(objects)
}

      



This is probably as efficient as you can get, but it takes a lot of manual handling.



Let's assume the JSON stream looks like this:

{"Name": "Ed", "Text": "Knock knock."}{"Name": "Sam", "Text": "Who there?"}{"Name": "Ed", "Text": "Go fmt."}

      



Here is one idea, in pseudocode:

1: skip leading whitespace
2: if the first char is not '{', throw an error
3: load some chars and find the first '}'
    4: if found, try json.Unmarshal() on everything up to and including that '}'
        5: if unmarshal fails, load more chars and find the next '}'
             6: redo step 4

      



Below is an implementation that is already working in my project:

package json

import (
    "bytes"
    j "encoding/json"
    "errors"
    "io"
    "strings"
)

// Stream represents a stream of concatenated JSON objects
type Stream struct {
    stream *bytes.Buffer
    object *bytes.Buffer
    scrap  *bytes.Buffer
}

// NewStream returns a Stream that reads from src
func NewStream(src []byte) *Stream {
    return &Stream{
        stream: bytes.NewBuffer(src),
        object: new(bytes.Buffer),
        scrap:  new(bytes.Buffer),
    }
}

// Read returns the next JSON object from the stream, or io.EOF when done
func (s *Stream) Read() ([]byte, error) {
    var obj []byte

    for {
        // read a rune from stream
        r, _, err := s.stream.ReadRune()
        switch err {
        case nil:
        case io.EOF:
            if strings.TrimSpace(s.object.String()) != "" {
                return nil, errors.New("Invalid JSON")
            }

            fallthrough
        default:
            return nil, err
        }

        // write the rune to object buffer
        if _, err := s.object.WriteRune(r); err != nil {
            return nil, err
        }

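        if r == '}' {
            // check whether the bytes buffered so far form valid JSON
            err := j.Compact(s.scrap, s.object.Bytes())
            s.scrap.Reset()
            if err != nil {
                // not a complete object yet, keep reading
                continue
            }

            // copy the bytes out: Bytes() aliases the buffer, which is
            // overwritten by later writes after Reset
            obj = append([]byte(nil), s.object.Bytes()...)
            s.object.Reset()

            break
        }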
    }

    return obj, nil
}

      

Usage:

func process(src []byte) error {
    s := json.NewStream(src)

    for {
        obj, err := s.Read()
        switch err {
        case nil:
        case io.EOF:
            return nil 
        default:
            return err 
        }   

        // now you can try to decode obj into a struct/map/...
        // mixed streams are also supported, e.g.:
        a := new(TypeOne)
        b := new(TypeTwo)
        if err := j.Unmarshal(obj, a); err == nil && a.Error != "" {
             // it is a TypeOne object
        } else if err := j.Unmarshal(obj, b); err == nil && b.ID != "" {
             // it is a TypeTwo object
        } else {
             // unknown type
        }
    }

    return nil
}

      
