
Is json.Decoder broken in golang?!

TL;DR: No.

Although unintuitive, json.Decoder isn’t actually broken! However, some of its behavior can appear wrong when it is used incorrectly. (Tread carefully.)

In this post, I will dive into the src code of json.Decoder and explore how it works; then I will make sense of these observed “incorrect” behaviors.

The “Issue”

Let us start with a simple demonstration of the problem. Consider the following:

package main
import (
	"encoding/json"
	"fmt"
	"strings"
	"github.com/alecthomas/repr"
)
func main() {
	bad_json := `
{
  "hello": ["foobar"]
}", "foobaz"],
  "world": ["some other str"],
}`
	var fields map[string][]string
	reader := strings.NewReader(bad_json)
	if err := json.NewDecoder(reader).Decode(&fields); err != nil {
		fmt.Println("Error:", err)
	}
	repr.Println(fields)
}

(playground)

This code is fairly straightforward - given some JSON, attempt to decode it into a map of string slices. If decoding fails, print the error.

Take a look at the JSON string more closely:

	bad_json := `
{
  "hello": ["foobar"]
}", "foobaz"],
  "world": ["some other str"],
}`

This JSON string is clearly malformed. In particular, the issue is here:

}", "foobaz"],

We have a closing brace } which ends the top-level object and shouldn’t be followed by anything, but it is!

As such, we would expect any attempt at parsing to fail. And indeed, json.Unmarshal-ing this dude expectedly fails:

package main

import (
	"fmt"
	"encoding/json"
)

func main() {
	bad_json := `
{
  "hello": ["foobar"]
}", "foobaz"],
  "world": ["some other str"],
}`

	var fields map[string][]string
	err := json.Unmarshal([]byte(bad_json), &fields)
	fmt.Println(err)
}

(playground)

invalid character '"' after top-level value
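As a side note, the error that json.Unmarshal returns here carries the position of the problem. Below is a minimal sketch of how one might pull that position out, assuming the error is a *json.SyntaxError (which it is for this kind of input). The helper name syntaxErrorOffset is made up for illustration, and the input is a shortened, single-line version of bad_json.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
)

// syntaxErrorOffset is a hypothetical helper: it unmarshals input and, if the
// failure is a *json.SyntaxError, returns the byte offset reported with it.
func syntaxErrorOffset(input string) (int64, bool) {
	var fields map[string][]string
	err := json.Unmarshal([]byte(input), &fields)

	var syntaxErr *json.SyntaxError
	if errors.As(err, &syntaxErr) {
		return syntaxErr.Offset, true
	}
	return 0, false
}

func main() {
	// Shortened version of bad_json: a complete object followed by junk.
	input := `{"hello": ["foobar"]}", "foobaz"]`
	if offset, ok := syntaxErrorOffset(input); ok {
		fmt.Println("syntax error reported at byte offset", offset)
	}
}
```

The point is simply that Unmarshal validates the whole input before it decodes anything, so the junk after the closing } cannot slip through.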

<whomp></whomp>

However, when running this “bad” json string against the json.NewDecoder().Decode(...) method, it is surprising to see that there is no error!!! Instead, everything from the offending line onward is simply ignored!

Here’s an excerpt from the codeblock above:

// ... set up code here
var fields map[string][]string
reader := strings.NewReader(bad_json)
if err := json.NewDecoder(reader).Decode(&fields); err != nil {
	// => We EXPECT this
	fmt.Println("Error:", err)
}
// But instead, we get this! 
repr.Println(fields)

Executing our code yields the following output:

map[string][]string{
  "hello": []string{
    "foobar",
  },
}

(playground)

W.T.F!

Ok, so - this is unexpected!

We would expect the err != nil condition to be true, causing the fmt.Println(...) line to run and leaving fields empty.

To find the answer, we’ll need to go spelunking into golang’s json.Decoder source code and implementation.

(Heads up, others have also warned about the behavior we are seeing here. However we will demonstrate by the end of this post that the observed behavior is in fact expected (once we have a better understanding of what it is doing) and not really wrong.)

Alright. Let’s do this.

Grokking the json pkg

Our interest primarily lies in two structs within the src/encoding/json pkg: scanner and Decoder.

The scanner struct is an internal mechanism used by the Decoder to parse a JSON string. It defines a collection of state transition functions and transition values that manage tracking various phases of parsing the JSON string itself.

For example, consider the following func in scanner.go#L263 to better understand how transition functions return transition values.

func stateBeginString(s *scanner, c byte) int {
	// ... non relevant code lines
	if c == '"' {
		s.step = stateInString
		return scanBeginLiteral
	}
	// ... non relevant code lines
}

This is an example of a “state transition function”. It marks the beginning of parsing a value in JSON that starts with " character.

Notice here if our input char (c byte) is ", we update our step attribute to the next state transition function, in this case stateInString. Furthermore, we return a new transition value: scanBeginLiteral (indicating that our scanner is currently in the process of interpreting a token literal such as a number or a string in our JSON string).

The transition values are primarily used by code that actually calls the scanner.step transition functions, such as Decoder or decodeState, to understand the current state of parsing by the scanner.
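To make the “step function returns a transition value and swaps in the next step function” idea concrete, here is a toy sketch of the same pattern. It is not the real scanner: every name in it (toyScanner, toyBeginString, toyContinue and so on) is invented for illustration, and it only understands a single quoted string with no escapes.

```go
package main

import "fmt"

// Toy transition values. These are hypothetical stand-ins, not the real
// constants from encoding/json.
const (
	toyContinue     = iota // uninteresting byte
	toyBeginLiteral        // saw the opening quote
	toyEndLiteral          // saw the closing quote
	toyError               // byte not allowed in the current state
)

// toyScanner holds the current state transition function, like scanner.step.
type toyScanner struct {
	step func(s *toyScanner, c byte) int
}

// toyBeginString expects the opening quote of a string.
func toyBeginString(s *toyScanner, c byte) int {
	if c == '"' {
		s.step = toyInString
		return toyBeginLiteral
	}
	return toyError
}

// toyInString consumes bytes until the closing quote.
func toyInString(s *toyScanner, c byte) int {
	if c == '"' {
		s.step = toyDone
		return toyEndLiteral
	}
	return toyContinue
}

// toyDone rejects anything after the closing quote.
func toyDone(s *toyScanner, c byte) int {
	return toyError
}

// scanAll feeds input to the scanner one byte at a time and collects the
// transition value returned for each byte.
func scanAll(input string) []int {
	s := &toyScanner{step: toyBeginString}
	var out []int
	for i := 0; i < len(input); i++ {
		out = append(out, s.step(s, input[i]))
	}
	return out
}

func main() {
	fmt.Println(scanAll(`"hi"`)) // prints [1 0 0 2]
}
```

The caller never inspects the scanner's internals; it only looks at the returned transition values, which is the same relationship Decoder has with the real scanner.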

Here’s the full list of transition values defined and returned by scanner state transition functions:

const (
	// Continue.
	scanContinue     = iota // uninteresting byte
	scanBeginLiteral        // end implied by next result != scanContinue
	scanBeginObject         // begin object
	scanObjectKey           // just finished object key (string)
	scanObjectValue         // just finished non-last object value
	scanEndObject           // end object (implies scanObjectValue if possible)
	scanBeginArray          // begin array
	scanArrayValue          // just finished array value
	scanEndArray            // end array (implies scanArrayValue if possible)
	scanSkipSpace           // space byte; can skip; known to be last "continue" result

	// Stop.
	scanEnd   // top-level value ended *before* this byte; known to be first "stop" result
	scanError // hit an error, scanner.err.
)

The Decoder struct, which is public, manages an instance of scanner as an attribute and tracks the state of JSON parsing (using the transition values such as scanEnd or scanEndArray). The added twist here (and this is significant) is that Decoder loads a portion of the JSON string into a buf attribute and the scanner processes the JSON string in portions, reading chars one at a time from buf.

In addition to tracking the scanner, the Decoder struct also manages an instance of decodeState, which is the mechanism used to actually unmarshal data read from the JSON via the scanner.

For the purposes of our exploration, we will largely ignore the decodeState struct and instead focus primarily on the specifics of scanner and a few methods of the Decoder.

Tracing bad_json through json.Decoder

Hopefully that was a good (but brief) introduction to how the json decoder generally works to parse strings. Let us now apply this understanding to our initial example:

	bad_json := `
{
  "hello": ["foobar"]
}", "foobaz"],
  "world": ["some other str"],
}`

This poorly formatted JSON string is processed with json.Decoder like so (repeating from example on top):

var fields map[string][]string
reader := strings.NewReader(bad_json)
if err := json.NewDecoder(reader).Decode(&fields); err != nil {
	fmt.Println("Error:", err)
}

Let’s start at json.NewDecoder and given our new (high level) understanding of how the decoding process generally works, attempt to pinpoint exactly why/how the Decode method defies our expectations and does NOT raise an err when processing the bad_json string.

1. json.NewDecoder instantiates a json.Decoder struct

To begin, json.NewDecoder creates a new Decoder struct with an r attribute that manages an io.Reader instance.

func NewDecoder(r io.Reader) *Decoder {
	return &Decoder{r: r}
}

(sauce)

From the golang src, here’s the full definition of type Decoder:

// A Decoder reads and decodes JSON values from an input stream.
type Decoder struct {
	r       io.Reader
	buf     []byte
	d       decodeState
	scanp   int   // start of unread data in buf
	scanned int64 // amount of data already scanned
	scan    scanner
	err     error

	tokenState int
	tokenStack []int
}

(sauce)

For the purposes of this analysis, we really only care about the following fields:

type Decoder struct {
	scanp   int   // start of unread data in buf
	scan    scanner
	err     error
}

scanp

This is an index that we advance from position 0 to the length of our buffer. As we advance this index, we read a single character from the buffer and analyze it via the scanner state machine.

scan

An instance of the internal json.scanner struct. At each value of scanp, we read the character from the buffer and pass it into the scanner.step func, which processes state transitions such as stateBeginString or stateInString (more on this shortly).

err

We expect err to be NOT nil when we run into malformed JSON. Clearly as it stands from our observations, err IS nil which is the problem.
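The scanp field is unexported, so we can't print it directly. Assuming Go 1.15 or later, though, the exported Decoder.InputOffset method gives a rough view of the same position in the input. The sketch below records it after each successful Decode; the helper name offsetsAfterEachDecode and the sample input are made up for illustration.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// offsetsAfterEachDecode decodes values from input until Decode fails and
// records Decoder.InputOffset after every successful call.
func offsetsAfterEachDecode(input string) []int64 {
	dec := json.NewDecoder(strings.NewReader(input))
	var offsets []int64
	for {
		var v interface{}
		if err := dec.Decode(&v); err != nil {
			// io.EOF or a real error; either way we stop here.
			return offsets
		}
		offsets = append(offsets, dec.InputOffset())
	}
}

func main() {
	// Two objects separated by a single space: 7 bytes, 1 space, 7 bytes.
	fmt.Println(offsetsAfterEachDecode(`{"a":1} {"b":2}`)) // prints [7 15]
}
```

Each offset sits right after a closing }, which hints at where a single call to Decode stops.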

Ok so - now we have an instance of Decoder available to us that knows to read data from our bad_json into an internal buffer. Now, let’s look at how the Decoder.Decode(...) method of our Decoder struct commences parsing our JSON data.

2. Decoder.Decode(v) internally calls Decoder.readValue()

Consider the following snippet from the Decoder.Decode() implementation below:

func (dec *Decoder) Decode(v interface{}) error {
	// ... non relevant code lines

	// Read whole value into buffer.
	n, err := dec.readValue()
	if err != nil {
		return err
	}
	// ... non relevant code lines

	return err
}

(sauce)

We are only focusing on the method call relevant to our current analysis - there are conditionals checked before readValue is executed, which are simply sanity checks for various poorly formatted string states. These conditional checks work as expected so for the purposes of this analysis they are “uninteresting”.

In short, dec.readValue() uses the dec.scan (which, recall, is an instance of the json.scanner struct) attribute to process chars in the buffer one by one until an error or a scanEnd transition value state is reached. (This is the focus of the next section)

Similarly, assuming dec.readValue() does not generate an error, there is some additional work done by the Decode method (this is the part indicated as “non relevant code lines” above) that actually relies on the decodeState struct to unmarshal the bytes read and processed by the scanner.

Having looked at Decode, let’s now look at the src for readValue to understand how the scanner is used to process data in our JSON string and more importantly, generate errors for invalid JSON.

3. Decoder.readValue processes chars w/scanner.step()

At this point it is clear that whatever the “issue” is here with the behavior we observe, it must be in readValue. This func is somewhat long so let’s only look at the relevant lines:

func (dec *Decoder) readValue() (int, error) {
	dec.scan.reset()

	scanp := dec.scanp
	var err error
Input:
	// help the compiler see that scanp is never negative, so it can remove
	// some bounds checks below.
	for scanp >= 0 {

		// Look in the buffer for a new value.
		for ; scanp < len(dec.buf); scanp++ {
			c := dec.buf[scanp]
			dec.scan.bytes++
			switch dec.scan.step(&dec.scan, c) {
			case scanEnd:
				// scanEnd is delayed one byte so we decrement
				// the scanner bytes count by 1 to ensure that
				// this value is correct in the next call of Decode.
				dec.scan.bytes--
				break Input
			case scanEndObject, scanEndArray:
				// scanEnd is delayed one byte.
				// We might block trying to get that byte from src,
				// so instead invent a space byte.
				if stateEndValue(&dec.scan, ' ') == scanEnd {
					scanp++
					break Input
				}
			case scanError:
				dec.err = dec.scan.err
				return 0, dec.scan.err
			}
		}

		// ... non relevant code lines
	}
	// ... non relevant code lines
}

(full sauce)

Upon inspection of this snippet, it is clear that readValue() generates an error if our scanner.step func returns a scanError transition value.

Because we can clearly see that the parsing of our bad_json does not raise an error, it must be that the scanEnd or scanEndObject/scanEndArray state is reached before the “bad” formatting is processed by the scanner.

Still, let’s be sure. Let’s do one final exercise and run through our bad_json to prove to ourselves that indeed state scanEnd or scanEndObject/scanEndArray is reached before a scanError state can be processed.

4. Stepping through Decoder.readValue with bad_json

For convenience, here is bad_json again:

	bad_json := `
{
  "hello": ["foobar"]
}", "foobaz"],
  "world": ["some other str"],
}`

Let’s start from the beginning. We just instantiated a new Decoder using json.NewDecoder.

When we instantiate a Decoder struct here, we expect the following initial attributes:

// which character in the JSON string are we currently considering?
dec.scanp => 0

// initial scanner
dec.scan => scanner{}

(There are others, but these are the two we care about for now).

We then call Decode to start processing our bad_json string. The first line in readValue is:

dec.scan.reset()

This method importantly sets our scanner.step state transition function to stateBeginValue.

After some initialization steps, we end up at the big for loop in readValue. To make things easier to grok, let’s look at (only) the for loop again:

// Look in the buffer for a new value.
for ; scanp < len(dec.buf); scanp++ {
	c := dec.buf[scanp]
	dec.scan.bytes++
	switch dec.scan.step(&dec.scan, c) {
	case scanEnd:
		// scanEnd is delayed one byte so we decrement
		// the scanner bytes count by 1 to ensure that
		// this value is correct in the next call of Decode.
		dec.scan.bytes--
		break Input
	case scanEndObject, scanEndArray:
		// scanEnd is delayed one byte.
		// We might block trying to get that byte from src,
		// so instead invent a space byte.
		if stateEndValue(&dec.scan, ' ') == scanEnd {
			scanp++
			break Input
		}
	case scanError:
		dec.err = dec.scan.err
		return 0, dec.scan.err
	}
}

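scanp here starts out as a copy of our dec.scanp attribute, which is initially 0. Strictly speaking, bad_json begins with a newline and has some indentation, all of which the scanner skips as whitespace. To keep the walkthrough short we ignore that whitespace and treat { from bad_json as our first char.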

We call dec.scan.step and pass in a reference to the scanner and c = '{' as args. If none of the transition values handled by the switch is returned (scanEnd, scanEndObject, scanEndArray or scanError), the loop continues and we advance to the next character (in this case \n).

5. bad_json iteration table

To better understand this code as it steps through the chars in bad_json, let’s consider a table that describes the “state” of Decoder and Decoder.scan as we iterate.


Here are the column definitions:

row

Mainly so that we can look closely at a few rows of this table.

ch

The current character we are considering within bad_json

scanp

The index corresponding to our current char.

step

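This reflects the state transition function before the call (now), the transition value it returns (ret), and the new value of scanner.step afterwards (next).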

step.parseState

The scanner also maintains a stack to keep track of opening and closing brackets. For instance, if we encounter a { as we parse a JSON string, we push parseObjectKey (an iota constant, sauce), which indicates that we are currently inside {} and have encountered only the left side.
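The parseState stack is the piece that decides when a top-level value is finished, so here is a stripped-down sketch of just that idea. It is a toy, not the real scanner: the name endOfFirstValue is invented, it tracks raw bracket bytes rather than parse states, and it does not check that bracket kinds match.

```go
package main

import "fmt"

// endOfFirstValue returns the index just past the byte that closes the first
// top-level object or array in input, or -1 if no such byte is found.
// Brackets inside string literals are ignored.
func endOfFirstValue(input string) int {
	var stack []byte
	inString := false
	for i := 0; i < len(input); i++ {
		c := input[i]
		if inString {
			switch c {
			case '\\':
				i++ // skip the escaped byte
			case '"':
				inString = false
			}
			continue
		}
		switch c {
		case '"':
			inString = true
		case '{', '[':
			stack = append(stack, c) // push: we are one level deeper
		case '}', ']':
			if len(stack) == 0 {
				return -1 // closing bracket with nothing open
			}
			stack = stack[:len(stack)-1] // pop
			if len(stack) == 0 {
				return i + 1 // stack is empty: top-level value is complete
			}
		}
	}
	return -1
}

func main() {
	input := `{"a":["b"]}", "c"]`
	end := endOfFirstValue(input)
	fmt.Println(end, input[:end]) // prints 11 {"a":["b"]}
}
```

Once the stack empties, the toy stops looking, no matter what follows. Keep that in mind while reading the table.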


To interpret the table below, consider the following approach:

1: from stream.go#L103, determine the state transition function (the transition function corresponds to now under the step col).

switch dec.scan.step(&dec.scan, c) {

In the case of Row 1, it would be:

switch dec.scan.step(&dec.scan, '{') {

2: Find the step function in scanner.go. Walk through it with c = '{' or c = '\n', etc. (whatever is in the ch col).

In the case of Row 1, our step function would be:

// stateBeginValue is the state at the beginning of the input.
func stateBeginValue(s *scanner, c byte) int {
	if isSpace(c) {
		return scanSkipSpace
	}
	switch c {
	case '{':
		s.step = stateBeginStringOrEmpty
		return s.pushParseState(c, parseObjectKey, scanBeginObject)
	// ... more cases here, not relevant
	}
	// ... more logic here, not relevant
}

(sauce)

3: If, in scanner src, pushParseState or popParseState is called within the transition func, expect col step.parseState to reflect the value added or removed.

In the case of Row 1, pushParseState is called with parseObjectKey so we end up with one item in that array.

4: Once the step function has completed, it should have a return value (the transition value) and a new step function (could be same as now). Expect col step’s ret to correspond with the returned transition value and next to correspond to the new value of scanner.step

In the case of Row 1, our returned transition value is scanBeginObject and our new step function is stateBeginStringOrEmpty.

Rinse and repeat until scanEnd or scanError is encountered.

(Note that below \s is just shorthand for ' ')

row | ch | scanp | step | step.parseState
--- | --- | --- | --- | ---
1 | { | 0 | now: stateBeginValue; ret: scanBeginObject; next: stateBeginStringOrEmpty | parseObjectKey
2 | \n | 1 | now: stateBeginStringOrEmpty; ret: scanSkipSpace; next: stateBeginStringOrEmpty | parseObjectKey
3 | " | 2 | now: stateBeginStringOrEmpty; ret: scanBeginLiteral; next: stateInString | parseObjectKey
4 | h | 3 | now: stateInString; ret: scanContinue; next: stateInString | parseObjectKey
5 | ... | ... | ... | ...
6 | " | 8 | now: stateInString; ret: scanContinue; next: stateEndValue | parseObjectKey
7 | : | 9 | now: stateEndValue; ret: scanObjectKey; next: stateBeginValue | parseObjectKey
8 | \s | 10 | now: stateBeginValue; ret: scanSkipSpace; next: stateBeginValue | parseObjectKey
9 | [ | 11 | now: stateBeginValue; ret: scanBeginArray; next: stateBeginValueOrEmpty | parseObjectKey, parseArrayValue
10 | " | 12 | now: stateBeginValueOrEmpty; ret: scanBeginLiteral; next: stateInString | parseObjectKey, parseArrayValue
11 | ... | ... | ... | ...
12 | ] | 20 | now: stateEndValue; ret: scanEndArray; next: stateEndValue | parseObjectKey
13 | \s | - | now: stateEndValue; ret: scanSkipSpace; next: stateEndValue | parseObjectKey
14 | } | 21 | now: stateEndValue; ret: scanEndObject; next: stateEndValue | (empty)
15 | \s | - | now: stateEndValue; ret: scanEnd; next: stateEndTop | (empty)

6. Making sense of the final iteration steps

Hopefully, that table is helpful in grokking how the decoding logic works. In particular, let’s zoom in on the final 4 rows (12-15).

Row 12 + 13

Let’s look at the stateEndValue func:

// stateEndValue is the state after completing a value,
// such as after reading `{}` or `true` or `["x"`.
func stateEndValue(s *scanner, c byte) int {
	// ... not relevant currently
	ps := s.parseState[n-1]
	switch ps {
	// ... other cases here, not relevant
	case parseArrayValue:
		if c == ',' {
			s.step = stateBeginValue
			return scanArrayValue
		}
		if c == ']' {
			s.popParseState()
			return scanEndArray
		}
		return s.error(c, "after array element")
	}
	return s.error(c, "")
}

(sauce)

Here’s the row:

row | ch | scanp | step | step.parseState
--- | --- | --- | --- | ---
12 | ] | 20 | now: stateEndValue; ret: scanEndArray; next: stateEndValue | parseObjectKey

Because c = ']', we can see that s.popParseState() is called (which removes parseArrayValue from our parseState stack).

Importantly, we are returned scanEndArray, which - from readValue() - we see requires special handling:

// Look in the buffer for a new value.
for ; scanp < len(dec.buf); scanp++ {
	c := dec.buf[scanp]
	dec.scan.bytes++
	switch dec.scan.step(&dec.scan, c) {
	// ... other cases here, not relevant
	case scanEndObject, scanEndArray:
		// scanEnd is delayed one byte.
		// We might block trying to get that byte from src,
		// so instead invent a space byte.
		if stateEndValue(&dec.scan, ' ') == scanEnd {
			scanp++
			break Input
		}
	case scanError:
		dec.err = dec.scan.err
		return 0, dec.scan.err
	}
}

This feels a bit like cheating but essentially, since scanEndArray is returned by our step function, we explicitly call stateEndValue to test and see if we have reached scanEnd.

Let’s look at the relevant lines of stateEndValue to understand the logic:

// stateEndValue is the state after completing a value,
// such as after reading `{}` or `true` or `["x"`.
func stateEndValue(s *scanner, c byte) int {
	n := len(s.parseState)
	if n == 0 {
		// Completed top-level before the current byte.
		s.step = stateEndTop
		s.endTop = true
		return stateEndTop(s, c)
	}
	if isSpace(c) {
		s.step = stateEndValue
		return scanSkipSpace
	}
	// ... ignore logic here, not relevant right now
	return s.error(c, "")
}

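If our parseState stack were empty, it would return scanEnd, which would complete our loop. Otherwise, since we pass in c = ' ' here, we “short circuit” in the isSpace branch: the step transition function stays stateEndValue, scanSkipSpace is returned, and the loop carries on with the next character.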

Row 14 + 15

The final two iterations are very similar to the previous two iterations. The key difference is for the last iteration (row 15), our parseState stack ends up being empty. As such, we actually enter into the n == 0 condition:

// excerpt from stateEndValue
if n == 0 {
	// Completed top-level before the current byte.
	s.step = stateEndTop
	s.endTop = true
	return stateEndTop(s, c)
}

and return the coveted scanEnd transition value. In our readValue() func:

switch dec.scan.step(&dec.scan, c) {
// ... other cases here, not relevant
case scanEndObject, scanEndArray:
	// scanEnd is delayed one byte.
	// We might block trying to get that byte from src,
	// so instead invent a space byte.
	if stateEndValue(&dec.scan, ' ') == scanEnd {
		scanp++
		break Input
	}
// ...
}

the stateEndValue(&dec.scan, ' ') == scanEnd condition is satisfied and we break out of our loop.
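One way to watch this early exit happen is to hand the Decoder a reader that gives out a single byte per Read call and then count how many bytes were requested. This is only a sketch: oneByteReader and bytesConsumedByDecode are hypothetical names, and the input is a simplified stand-in for bad_json.

```go
package main

import (
	"encoding/json"
	"fmt"
	"io"
)

// oneByteReader hands out its data one byte per Read call, so pos tells us
// exactly how many bytes the consumer has asked for.
type oneByteReader struct {
	data string
	pos  int // number of bytes handed out so far
}

func (r *oneByteReader) Read(p []byte) (int, error) {
	if r.pos >= len(r.data) {
		return 0, io.EOF
	}
	if len(p) == 0 {
		return 0, nil
	}
	p[0] = r.data[r.pos]
	r.pos++
	return 1, nil
}

// bytesConsumedByDecode runs a single Decode over input and reports how many
// bytes were pulled from the underlying reader, plus Decode's error.
func bytesConsumedByDecode(input string) (int, error) {
	r := &oneByteReader{data: input}
	var v interface{}
	err := json.NewDecoder(r).Decode(&v)
	return r.pos, err
}

func main() {
	n, err := bytesConsumedByDecode(`{"a":1} trailing garbage`)
	fmt.Println(n, err) // prints 7 <nil>
}
```

The closing } is the seventh byte of the input, and Decode never asks the reader for an eighth. With a strings.Reader the whole input lands in the buffer in one read, but the scanner still stops at the same spot.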

In other words, the Decode() method does NOT error because we stop processing our JSON string before the scanner has a chance to encounter the malformed portion of our bad_json string!

Expounding on this a bit: when that final } is read, Decoder says: “ok! I’m done. I have a completed scan.” For this reason, it never actually even tries to read the rest of the line even though there are plenty more characters left to read.

Additionally, note that we could get it to return an error btw, if we call Decode again - since now it will start at " (the first char after }) and it will definitely see and raise an error then.
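Here is a small sketch of that second call. It also uses Decoder.Buffered to peek at what the first Decode left unread. The helper name decodeTwice is made up, and the input is a single-line version of bad_json.

```go
package main

import (
	"encoding/json"
	"fmt"
	"io"
	"strings"
)

// decodeTwice calls Decode twice on the same Decoder. It returns the bytes
// still sitting in the Decoder's buffer after the first call, and the error
// from the second call.
func decodeTwice(input string) (rest string, second error) {
	dec := json.NewDecoder(strings.NewReader(input))

	var fields map[string][]string
	if err := dec.Decode(&fields); err != nil {
		return "", err
	}

	// Buffered is only valid until the next Decode, so read it now.
	unread, err := io.ReadAll(dec.Buffered())
	if err != nil {
		return "", err
	}

	// The second Decode starts right after the closing brace.
	second = dec.Decode(&fields)
	return string(unread), second
}

func main() {
	rest, err := decodeTwice(`{"hello": ["foobar"]}", "foobaz"]`)
	fmt.Printf("unread after first Decode: %q\n", rest)
	fmt.Println("second Decode error:", err)
}
```

Note that the second error need not be a syntax error. The leftover text happens to start with a valid JSON string, so with a map target the complaint is about the type instead.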

Furthermore, if the offending line did not start with that closing }, Decoder.Decode would actually catch the problem on the very first call.

Basically, Decode will always respect the closing } because it is expecting to read a stream of data - meaning it expects JSON of the form:

{...}
{...}
{...}

or even:

{...}{...}{...}

Therefore, using Decode to read a single JSON document is not ideal. However, it is worth noting that if we really wanted to, we still could use Decode to parse a single doc - we simply must alter how we leverage Decode to fully read our document body.
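For the single-document case, one possible approach is to call Decode a second time and insist that it reports io.EOF. The sketch below does that. The names decodeSingle, acceptsSingleDocument and errTrailingData are hypothetical, and this is just one way to do it, not an official recipe.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
	"io"
	"strings"
)

// errTrailingData is a hypothetical sentinel for "something followed the
// first JSON value".
var errTrailingData = errors.New("unexpected data after top-level JSON value")

// decodeSingle decodes exactly one JSON value from r into v and fails if
// anything other than whitespace follows it.
func decodeSingle(r io.Reader, v interface{}) error {
	dec := json.NewDecoder(r)
	if err := dec.Decode(v); err != nil {
		return err
	}

	// A second Decode must hit the end of the input.
	var extra interface{}
	switch err := dec.Decode(&extra); err {
	case io.EOF:
		return nil
	case nil:
		return errTrailingData // a second, valid value was present
	default:
		return fmt.Errorf("%w: %v", errTrailingData, err)
	}
}

// acceptsSingleDocument reports whether input holds exactly one JSON object
// of the shape used throughout this post.
func acceptsSingleDocument(input string) bool {
	var fields map[string][]string
	return decodeSingle(strings.NewReader(input), &fields) == nil
}

func main() {
	fmt.Println(acceptsSingleDocument(`{"hello": ["foobar"]}`))             // prints true
	fmt.Println(acceptsSingleDocument(`{"hello": ["foobar"]}", "foobaz"]`)) // prints false
}
```

The second Decode targets an interface{} on purpose: leftover text that happens to be valid JSON then decodes cleanly and is still rejected as trailing data.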

Personally, I think the proper way to use Decode really ought to be like so:

package main
import (
	"encoding/json"
	"fmt"
	"strings"
	"io"
)

func readJsonStream(jsonStr string, fields map[string][]string) error {
	reader := strings.NewReader(jsonStr)
	dec := json.NewDecoder(reader)
	for {
		err := dec.Decode(&fields)
		if err == io.EOF {
			break
		}
		if err != nil {
			fmt.Println(err)
			fmt.Println(fields)
			return err
		}
	}
	fmt.Println(fields)
	return nil
}

func main() {
	ok_json := `
{
  "hello": ["foobar"]
}{"world": ["some other str"]}`
	fmt.Println("++++++++++++++++++ ok json +++++++++++++++++++")
	_ = readJsonStream(ok_json, map[string][]string{})
	
	bad_json := `
{
  "hello": ["foobar"]
}", "foobaz"],
  "world": ["some other str"],
}`	
	fmt.Println("++++++++++++++++++ bad json +++++++++++++++++++")
	_ = readJsonStream(bad_json, map[string][]string{})

	seemingly_bad_but_not_json := `
{}{"world": ["some other str"]}{}{}{}{"hello": ["foobar"]}`	
	fmt.Println("++++++++++++++++++ not actually bad json +++++++++++++++++++")
	_ = readJsonStream(seemingly_bad_but_not_json, map[string][]string{})
}

(playground)

output:

++++++++++++++++++ ok json +++++++++++++++++++
map[hello:[foobar] world:[some other str]]
++++++++++++++++++ bad json +++++++++++++++++++
json: cannot unmarshal string into Go value of type map[string][]string
map[hello:[foobar]]
++++++++++++++++++ not actually bad json +++++++++++++++++++
map[hello:[foobar] world:[some other str]]

Here, we continuously decode in a loop, breaking only when io.EOF is reached or a non-nil error is discovered. In fact, with this refactor, we actually can safely use Decoder to parse any arbitrary JSON strings!

Final Remarks

In short - json.Decoder is not meant to be a standalone JSON unmarshal-er. Use json.Unmarshal for that. However, for streaming JSON tasks it has a few nice features that are quite useful. And assuming your input is indeed streaming JSON and you keep calling Decode until io.EOF, it does not actually silently ignore invalid syntax.

Personally, I would update this comment in the docs:

A Decoder reads and decodes JSON values from an input stream.

to include some more context and background about expected usage of Decode. Based on current documentation, it is not unreasonable to assume Decoder might be used interchangeably with Unmarshal and then be surprised by the unexpected behavior.

But, after all is said and done: json.Decoder’s behavior is not actually a bug. If anything, it’s a feature!
