JSON Path

Today, JSON is a widely used format for representing data structures. Together with its encoding/decoding rules, it specifies data types that are supported by most modern programming languages and platforms.

JSON Path provides basic functions for referencing and manipulating deeply nested JSON data structures.

Hat Open provides libraries implementing this functionality:

- Python: hat-json
- JavaScript: @hat-open/util

Definitions

The following definitions describe JSON Data, JSON Path and the operations based on these data types. Mathematical notation is used only as a "neutral" tool for describing data structures and operations without relying on any particular programming language or paradigm. The definitions themselves are not strict; they should be taken as guidelines for implementing JSON Path libraries.

Data

JSON Data types can be defined as the set $\mathit{Data}$:

$$\mathit{Data} = \mathit{Constant} \cup \mathit{Number} \cup \mathit{String} \cup \mathit{Array} \cup \mathit{Object}$$

where:

$\mathit{Constant}$

$\mathit{Constant} = \{\mathit{null}, \mathit{true}, \mathit{false}\}$

Constant values represented by the literals null, true and false.

$\mathit{Number}$

$\mathit{Number} = \mathbb{R}$

Real numbers (JSON doesn't distinguish between integers and floating point values).

$\mathit{String}$

$\mathit{String} = (c_{1}, \ldots, c_{n}), n \geq 0, c_{i} \in \text{Unicode characters}$

Sequence of zero or more Unicode characters, including escape sequences.

$\mathit{Array}$

$\mathit{Array} = (a_{1}, \ldots, a_{n}), n \geq 0, a_{i} \in \mathit{Data}$

Ordered sequence of zero or more elements which are themselves JSON Data.

$\mathit{Object}$

$\mathit{Object} = \{(k_{1}, v_{1}), \ldots, (k_{n}, v_{n})\}, n \geq 0, k_{i} \in \mathit{String}, v_{i} \in \mathit{Data}$

Associative collection of key/value pairs where keys are strings and values are any JSON Data.

Path

JSON Path is a reference to a part of composite JSON data. It is itself represented as JSON Data and can be defined as the set $\mathit{Path}$:

$$\mathit{Path} = \mathit{Integer} \cup \mathit{String} \cup \mathit{PArray}$$

where:

$$\begin{matrix} \mathit{Integer} & = \mathbb{N}_{0} \\ \mathit{PArray} & = (a_{1}, \ldots, a_{n}), n \geq 0, a_{i} \in \mathit{Path} \end{matrix}$$

In the following definitions, we use the operator $\&$ for a reference to data and the operator $*$ for the value of referenced data.

The algorithm used as the basis for resolving path references can be represented by the function $\operatorname{ref}$:

$$\operatorname{ref}(\mathit{data}, \mathit{path}) = \begin{cases} \operatorname{ref}_{\mathit{int}}(\mathit{data}, \mathit{path}) & \mathit{path} \in \mathit{Integer} \\ \operatorname{ref}_{\mathit{str}}(\mathit{data}, \mathit{path}) & \mathit{path} \in \mathit{String} \\ \operatorname{ref}_{\mathit{arr}}(\mathit{data}, \mathit{path}) & \mathit{path} \in \mathit{PArray} \end{cases}$$

where:

$$\mathit{data} \in \mathit{Data}, \mathit{path} \in \mathit{Path}$$

Using different data types as paths makes it possible to reference data in different data structures:

$\mathit{path} \in \mathit{Integer}$

$\operatorname{ref}_{\mathit{int}}(\mathit{data}, \mathit{path}) = \begin{cases} \& a_{\mathit{path} + 1} & \mathit{data} \in \mathit{Array}, \mathit{data} = (a_{1}, \ldots, a_{n}), \mathit{path} < n \\ \& \mathit{null} & \text{otherwise} \end{cases}$

Integer paths are used for referencing elements of an array. If the referenced element doesn't exist or the provided data is not an array, the neutral null element is referenced.

$\mathit{path} \in \mathit{String}$

$\operatorname{ref}_{\mathit{str}}(\mathit{data}, \mathit{path}) = \begin{cases} \& v_{i} & \mathit{data} \in \mathit{Object}, \mathit{data} = \{(k_{1}, v_{1}), \ldots, (k_{n}, v_{n})\}, \mathit{path} = k_{i} \\ \& \mathit{null} & \text{otherwise} \end{cases}$

String paths reference object entries based on the object's keys. If the referenced key doesn't exist or the provided data is not an object, the neutral null element is referenced.

$\mathit{path} \in \mathit{PArray}$

$\operatorname{ref}_{\mathit{arr}}(\mathit{data}, \mathit{path}) = \begin{cases} \& \mathit{data} & \mathit{path} = \emptyset \\ \operatorname{ref}(* \operatorname{ref}(\mathit{data}, a_{1}), (a_{2}, \ldots, a_{n})) & \mathit{path} = (a_{1}, \ldots, a_{n}) \end{cases}$

Array paths are used for composition of other paths. Each array element is applied recursively to the result of the previous path application.
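
The recursive definition can be sketched in Python. This is only an illustration under stated assumptions, not the hat-json implementation: Python has no explicit reference and dereference operators, so the sketch returns the referenced value directly, and the function name `ref` is simply borrowed from the definition.

```python
from typing import Any


def ref(data: Any, path: Any) -> Any:
    """Resolve `path` against `data`, following the recursive definition."""
    # bool is a subclass of int in Python, so it is excluded explicitly
    if isinstance(path, int) and not isinstance(path, bool):
        # integer path: array element, or null if out of range / not an array
        if isinstance(data, list) and 0 <= path < len(data):
            return data[path]
        return None
    if isinstance(path, str):
        # string path: object entry, or null if missing / not an object
        if isinstance(data, dict) and path in data:
            return data[path]
        return None
    if isinstance(path, list):
        # array path: the empty path references the data itself, otherwise
        # the first element is applied and the rest is resolved recursively
        if not path:
            return data
        return ref(ref(data, path[0]), path[1:])
    raise ValueError(f"invalid path: {path!r}")


data = {"a": [1, 2, {"b": True}, []]}
print(ref(data, ["a", [2, ["b"]]]))  # True
print(ref(data, ["a", 4]))           # None
```

Raising `ValueError` for values outside of the path definition is a choice made for this sketch; the definitions do not prescribe error handling.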

Normalization

Each path can be normalized, that is, represented as an array of strings and integers:

$$\mathit{NPath} = (a_{1}, \ldots, a_{n}), n \geq 0, a_{i} \in \mathit{Integer} \cup \mathit{String}$$

Path normalization is defined as the function $\operatorname{norm}$, where $\cup$ applied to normalized paths denotes concatenation of sequences:

$$\begin{matrix} \operatorname{norm} \colon \mathit{Path} \to \mathit{NPath} \\ \operatorname{norm}(\mathit{path}) = \begin{cases} (\mathit{path}) & \mathit{path} \in \mathit{Integer} \cup \mathit{String} \\ \emptyset & \mathit{path} \in \mathit{PArray}, \mathit{path} = \emptyset \\ \operatorname{norm}(p_{1}) \cup \operatorname{norm}((p_{2}, \ldots, p_{n})) & \mathit{path} \in \mathit{PArray}, \mathit{path} = (p_{1}, \ldots, p_{n}) \end{cases} \end{matrix}$$

When used as an argument to the $\operatorname{ref}$ function, a normalized path is equivalent to its original non-normalized form:

$$\operatorname{ref}(\mathit{data}, \mathit{path}) = \operatorname{ref}(\mathit{data}, \operatorname{norm}(\mathit{path}))$$

This property of normalized paths is useful when implementing path functions. By normalizing a path prior to its usage, an implementation of $\operatorname{ref}$ can be based on sequential reduction of the provided data instead of recursive application.
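
A minimal Python sketch of normalization followed by sequential reduction is shown here. The helper names `norm`, `step` and `get` and the type alias `Step` are assumptions made for this example; the actual library implementations may be structured differently.

```python
from functools import reduce
from typing import Any, List, Union

Step = Union[int, str]


def norm(path: Any) -> List[Step]:
    """Flatten an arbitrarily nested path into a list of integers and strings."""
    if isinstance(path, list):
        flat: List[Step] = []
        for part in path:
            flat.extend(norm(part))  # concatenate normalized parts in order
        return flat
    if isinstance(path, str) or (isinstance(path, int)
                                 and not isinstance(path, bool)):
        return [path]
    raise ValueError(f"invalid path: {path!r}")


def step(data: Any, key: Step) -> Any:
    """Apply a single integer or string path element to data."""
    if isinstance(key, str):
        return data[key] if isinstance(data, dict) and key in data else None
    return data[key] if isinstance(data, list) and 0 <= key < len(data) else None


def get(data: Any, path: Any) -> Any:
    """Resolve a path by reducing data over the normalized path elements."""
    return reduce(step, norm(path), data)


data = {"a": [1, 2, {"b": True}, []]}
print(norm(["a", [2, ["b"]]]))       # ['a', 2, 'b']
print(get(data, ["a", [2, ["b"]]]))  # True
```

Because every element of a normalized path is an integer or a string, `step` never has to recurse, and the whole traversal becomes a single left fold over the path.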

Functions

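$\operatorname{get}$

$$\begin{matrix} \operatorname{get} \colon \mathit{Data} \times \mathit{Path} \to \mathit{Data} \\ \operatorname{get}(\mathit{data}, \mathit{path}) = \mathit{value} \end{matrix}$$

The function $\operatorname{get}$ is used for obtaining the part of the $\mathit{data}$ structure referenced by $\mathit{path}$.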

Examples:

```
data = {"a": [1, 2, {"b": true}, []]}

get(data, []) = {"a": [1, 2, {"b": true}, []]}
get(data, "a") = [1, 2, {"b": true}, []]
get(data, ["a", 0]) = 1
get(data, ["a", 2, "b"]) = true
get(data, ["a", [2, ["b"]]]) = true
get(data, [[], [[]]]) = {"a": [1, 2, {"b": true}, []]}
get(data, 0) = null
get(data, "b") = null
get(data, ["a", 4]) = null
```

$\operatorname{set}$

$\begin{matrix} \operatorname{set} \colon \mathit{Data} \times \mathit{Path} \times \mathit{Data} \to \mathit{Data} \\ \operatorname{set}(\mathit{data}, \mathit{path}, \mathit{value}) = \mathit{data}' \end{matrix}$

The function $\operatorname{set}$ is used for creating a new data structure $\mathit{data}'$. The difference between $\mathit{data}$ and $\mathit{data}'$ is in the part of the data structure referenced by $\mathit{path}$. In $\mathit{data}'$, this part is replaced with $\mathit{value}$.

Edge cases:
- array index out of bound
  
  If an integer path references an array whose length is less than or equal to the path, additional null elements are created so that the referenced array element can be set to the provided value.
- object key not available
  
  If a string path references an object which doesn't contain an entry with a key equal to the path, a new entry is created.
- path type doesn't match data type
  
  If an integer path references data which is not an array, the data is replaced with an empty array and the previously described array index out of bound edge case is applied.
  
  If a string path references data which is not an object, the data is replaced with an empty object and the previously described object key not available edge case is applied.

Examples:
```
data = {"a": [1, 2, {"b": true}, []]}

set(data, ["a", 2, "b"], false) = {"a": [1, 2, {"b": false}, []]}
set(data, "a", 42) = {"a": 42}
set(data, ["a", [3], 0], 42) = {"a": [1, 2, {"b": true}, [42]]}
set(data, ["a", [3], 1], 42) = {"a": [1, 2, {"b": true}, [null, 42]]}
set(data, [], 42) = 42
set(null, [1, "a", 2], 42) = [null, {"a": [null, null, 42]}]
```
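
The edge cases above can be sketched in Python as follows. This is a hedged illustration rather than the hat-json implementation: the helpers `_flatten` and `_set_flat` are hypothetical names, and path elements are assumed to be non-negative integers or strings as in the path definition.

```python
from typing import Any, List, Union


def _flatten(path: Any) -> List[Union[int, str]]:
    """Normalize a path into a flat list of integers and strings."""
    if isinstance(path, list):
        return [step for part in path for step in _flatten(part)]
    return [path]


def _set_flat(data: Any, steps: List[Union[int, str]], value: Any) -> Any:
    if not steps:
        # the empty path references the data itself
        return value
    key, rest = steps[0], steps[1:]
    if isinstance(key, str):
        # data which is not an object is replaced with an empty object
        obj = dict(data) if isinstance(data, dict) else {}
        obj[key] = _set_flat(obj.get(key), rest, value)
        return obj
    # data which is not an array is replaced with an empty array
    arr = list(data) if isinstance(data, list) else []
    # pad with nulls so that the element at index `key` exists
    arr.extend([None] * (key + 1 - len(arr)))
    arr[key] = _set_flat(arr[key], rest, value)
    return arr


def set_(data: Any, path: Any, value: Any) -> Any:
    """Return new data with the part referenced by `path` replaced."""
    return _set_flat(data, _flatten(path), value)


data = {"a": [1, 2, {"b": True}, []]}
print(set_(None, [1, "a", 2], 42))  # [None, {'a': [None, None, 42]}]
```

Only the containers along the path are shallow-copied, so the provided data is left unchanged while the untouched parts are shared between the old and the new structure.
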

$\operatorname{remove}$

$\begin{matrix} \operatorname{remove} \colon \mathit{Data} \times \mathit{Path} \to \mathit{Data} \\ \operatorname{remove}(\mathit{data}, \mathit{path}) = \mathit{data}' \end{matrix}$

The function $\operatorname{remove}$ is used for creating a new data structure $\mathit{data}'$ based on the provided $\mathit{data}$. The difference between $\mathit{data}$ and $\mathit{data}'$ is in the part of the data structure referenced by $\mathit{path}$. In $\mathit{data}'$, this part is omitted.

In the following edge cases, $\mathit{data}'$ is the same as $\mathit{data}$:
- array index out of bound
- object key not available
- path type doesn't match data type

Examples:
```
data = {"a": [1, 2, {"b": true}, []]}

remove(data, ["a", 1]) = {"a": [1, {"b": true}, []]}
remove(data, []) = null
remove(data, ["a", 2, "b"]) = {"a": [1, 2, {}, []]}
remove(data, "b") = {"a": [1, 2, {"b": true}, []]}
```
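
A possible Python sketch of this behavior, including the edge cases, is given here. The helper names `_flatten` and `_remove_flat` are hypothetical and the code is not taken from hat-json; it only mirrors the description and the examples above.

```python
from typing import Any, List, Union


def _flatten(path: Any) -> List[Union[int, str]]:
    """Normalize a path into a flat list of integers and strings."""
    if isinstance(path, list):
        return [step for part in path for step in _flatten(part)]
    return [path]


def _remove_flat(data: Any, steps: List[Union[int, str]]) -> Any:
    if not steps:
        # removing the data referenced by the empty path leaves null
        return None
    key, rest = steps[0], steps[1:]
    if isinstance(key, str):
        # edge cases: data is not an object or the key is not available
        if not isinstance(data, dict) or key not in data:
            return data
        if not rest:
            return {k: v for k, v in data.items() if k != key}
        return {**data, key: _remove_flat(data[key], rest)}
    # edge cases: data is not an array or the index is out of bound
    if not isinstance(data, list) or not 0 <= key < len(data):
        return data
    if not rest:
        return data[:key] + data[key + 1:]
    return data[:key] + [_remove_flat(data[key], rest)] + data[key + 1:]


def remove(data: Any, path: Any) -> Any:
    """Return new data without the part referenced by `path`."""
    return _remove_flat(data, _flatten(path))


data = {"a": [1, 2, {"b": True}, []]}
print(remove(data, ["a", 1]))  # {'a': [1, {'b': True}, []]}
```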

With these basic functions, other specialized functions can be defined. An example of a commonly used derived function is $\operatorname{change}$:

$$\begin{matrix} \operatorname{change} \colon \mathit{Data} \times \mathit{Path} \times (\mathit{Data} \to \mathit{Data}) \to \mathit{Data} \\ \operatorname{change}(\mathit{data}, \mathit{path}, f) = \operatorname{set}(\mathit{data}, \mathit{path}, f(\operatorname{get}(\mathit{data}, \mathit{path}))) \end{matrix}$$

where $f$ is an arbitrary data transformation function:

$$f \colon \mathit{Data} \to \mathit{Data}$$

It should be noted that all of these functions are "pure functions" that shouldn't make in-place changes to the provided data arguments. Implementations usually take this into account by optimizing reuse of shared data.

Characteristics

Some of the interesting characteristics of the JSON Path approach to referencing JSON Data are:

full JSON Data coverage

Paths enable operations on all kinds of JSON Data without additional constraints on structural complexity or used data types.

get/set operations

The same path instances can be used for both retrieving and changing referenced data. This is the result of a single path reference resolving algorithm, used as the basis for both the get and set implementations.

flexible path composition

Support for path normalization provides opportunities for composing multiple path parts into a single path.

Example:

p1 = [ ..first-path.. ]
p2 = [ ..second-path.. ]
p3 = [ ..third-path.. ]

[p1, p2, p3] ≅ [p1, [p2, [p3]]] ≅ [p1, [p2, p3]] ≅ [[p1, p2], p3]
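
The equivalence of these compositions can be checked with a short Python sketch. The `flatten` helper and the concrete path parts (`"config"`, `"devices"`, `"address"`, `"port"`) are hypothetical values chosen only for illustration.

```python
from typing import Any, List, Union


def flatten(path: Any) -> List[Union[int, str]]:
    """Normalize a path into a flat list of integers and strings."""
    if isinstance(path, list):
        return [step for part in path for step in flatten(part)]
    return [path]


p1 = ["config", "devices"]
p2 = [0]
p3 = ["address", "port"]

# the same path parts grouped in four different ways
variants = [
    [p1, p2, p3],
    [p1, [p2, [p3]]],
    [p1, [p2, p3]],
    [[p1, p2], p3],
]

normalized = [flatten(variant) for variant in variants]
print(normalized[0])  # ['config', 'devices', 0, 'address', 'port']
```

All four variants normalize to the same flat path, so path parts can be nested in whatever grouping is convenient at the call site.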

safe retrieval of deeply nested optional elements

In case of complex array paths, if part of referenced data is not available, path traversal can be short-circuited without additional repetitive checking.

Example:
```
data = {'a': {'b': {'c': 123}}}
path = ['a', 'd', 'c']
get(data, path) == null
```

JSON Path is a subset of JSON Data

This property enables easy serialization and exchange of paths. Also, all path functions can be used for operations on paths themselves.
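
As a small illustration of this property, a path can be passed through Python's standard `json` module without any custom encoder. The concrete path value is just an example.

```python
import json

# a nested (non-normalized) path is ordinary JSON data
path = ["a", [2, ["b"]]]

encoded = json.dumps(path)     # '["a", [2, ["b"]]]'
decoded = json.loads(encoded)  # structurally equal to the original path

print(decoded == path)  # True
```
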

implementation simplicity

With representation of paths as JSON Data and normalization into single "flat" array, no additional parsing is required and implementation can be based on optimal short-circuited iteration. This enables efficient implementations in wide range of modern programming languages and platforms.

Python implementation

The Python implementation of the JSON Path functions is available as part of the hat-json library.

The function signatures are similar to the abstract definitions of the JSON Path functions. Notable differences are:

- the get function accepts an alternative neutral value that is used instead of null
- the set function is named set_ to avoid a name clash with the built-in set

```
import typing

Array = typing.List['Data']
Object = typing.Dict[str, 'Data']
Data = typing.Union[None, bool, int, float, str, Array, Object]
Path = typing.Union[int, str, typing.List['Path']]


def get(data: Data, path: Path, default: typing.Optional[Data] = None) -> Data:
    ...


def set_(data: Data, path: Path, value: Data) -> Data:
    ...


def remove(data: Data, path: Path) -> Data:
    ...
```

JavaScript implementation

The JavaScript implementation of the JSON Path functions is available as part of the @hat-open/util library.

This implementation provides the full functionality of the JSON Path definition with some changes to the API itself. Most of these changes are made to enable a more functional programming style:

- all functions are curried
- remove is renamed to omit
- the order of arguments is changed

```
// get : Path -> Data -> Data
function get(path, data) {
    // return value
}

// change : Path -> (Data -> Data) -> Data -> Data
function change(path, fn, data) {
    // return new data
}

// set : Path -> Data -> Data -> Data
function set(path, value, data) {
    // return new data
}

// omit : Path -> Data -> Data
function omit(path, data) {
    // return new data
}
```

Comparison to other JSON Data functions

Referencing parts of deeply nested complex JSON Data structures is a well-known problem. Many different applications and libraries try to provide a solution to it.

To compare the previously described JSON Path to the alternatives, we can group other implementations based on some of their significant characteristics:

string-based paths

Some libraries use paths encoded as strings. Usually, these encodings consist of custom rules that try to mimic XPath or JavaScript notation.

The main benefit of this approach is a condensed path definition, which is usually well suited for use as a command line argument to applications.

Drawbacks of this approach are:
- additional path string decoder
- variety of custom non-standard notations
- difficult composition of path segments

Some of the notable implementations:
- JSONPath
- jq
- lodash (with limited array based composition)
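
To make the first drawback concrete, here is a sketch of a decoder for a hypothetical dot-separated notation. The notation and the `decode` function are invented for this example and do not correspond to any of the listed implementations.

```python
from typing import List, Union


def decode(text: str) -> List[Union[int, str]]:
    """Decode a hypothetical dot-separated path string into a flat path."""
    if text == "":
        return []
    # purely numeric segments are treated as array indexes
    return [int(part) if part.isdecimal() else part
            for part in text.split(".")]


print(decode("a.2.b"))  # ['a', 2, 'b']
```

Even this small decoder has to make notation-specific decisions: an object key such as "2" or a key containing a dot cannot be expressed without additional escaping rules, while a path represented as JSON Data needs no decoding at all.
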

lenses

The use of lens functions is an approach popularized by the Haskell Lens library. It is based on functions that can be used as references to parts of composite data.

The advantage of lenses is mostly associated with a functional programming style and the possibility of composing lenses by function composition.

Drawbacks of this approach are:
- tightly dependent on specific programming language function definitions
- not appropriate for serialization

Some of the notable implementations:
- ramda.js