Little Big Detail #3: Detecting Maps

little-big-details

Little Big Details are the subtle niceties that quicktype takes care of when turning your JSON into lovely, readable code. This Little Big Detail is about how quicktype decides whether a JSON object should be represented as a class or a map.

The problem

Suppose you're writing an app that uses the Bitcoin blockchain and it downloads data like this:

{
    "0000000000000000000e222e4e7afc29c49f6398783a94c846dee2e13c6408f5": {
        "size": 969709,
        "height": 510599,
        "difficulty": 3007383866429.732,
        "previous": "000000000000000000552a7783efd39eaa1c5ff6789e21a0bbe7547bc454fced"
	},
	"000000000000000000552a7783efd39eaa1c5ff6789e21a0bbe7547bc454fced": {
	    "size": 991394,
        "height": 510598,
        "difficulty": 3007383866429.732,
        "previous": "00000000000000000043aba4c065d4d92aec529566287ebec5fe9010246c9589"
	},
	"00000000000000000043aba4c065d4d92aec529566287ebec5fe9010246c9589": {
        "size": 990527,
        "height": 510597,
        "difficulty": 3007383866429.732,
        "previous": "00000000000000000009025b9e95911a4dc050de129ea4eb5e40ef280751a0cb"
	}
}

How would you represent this data as a type in your program? Here’s a naive translation of the JSON into Swift types:

struct Blocks {
  let _0000000000000000000e222e4e7afc29c49f6398783a94c846dee2e13c6408f5: Block
  let _00000000000000000043aba4c065d4d92aec529566287ebec5fe9010246c9589: Block
  let _00000000000000000009025b9e95911a4dc050de129ea4eb5e40ef280751a0cb: Block
}

struct Block {
    let size, height: Int
    let difficulty: Double
    let previous: String
}

Is this the Swift you would write? No, of course not! How about:

typealias Blocks = [String: Block] // a.k.a. Dictionary<String, Block>

struct Block {
    let size, height: Int
    let difficulty: Double
    let previous: String
}

That's more like it! We're faced with this decision because JSON syntax doesn’t differentiate maps from classes—they both just look like:

{ "x": …, "y": …, "z": … }

quicktype must decide which JSON represents a class (fixed properties) and which JSON represents a map (dynamic keys) so it can generate the same code you would in the example above. In this post we’ll explain how quicktype uses Markov chains to do this. If you’d rather skip to examples of quicktype’s map detection in action, check out our map detection playground.

Simplistic heuristics for map detection

A first observation about maps is that their values are almost always the same type or "homogeneous". In fact, in statically-typed programming languages maps must be homogeneous. Our first heuristic for map detection could be:

If all values in a JSON object have the same type, consider it a map.

Unfortunately this heuristic fails immediately:

{
    "name": "Jon",
    "pet": "Garfield"
}

You'd probably want a class for that data even though all values are strings.

Another observation is that maps often have more keys than classes have properties; if a JSON object has 100 properties, all of which are strings, it's probably a map, so let's add another condition to our heuristic:

If all values in a JSON object have the same type and it has at least 20 properties, consider it a map.

This was the previous heuristic quicktype used. It was adequate for large, regular samples but failed whenever a map had fewer than 20 keys, which happens all the time.

A heuristic that decides like a human programmer does

Let's come back to our original Bitcoin example. Most developers would correctly deduce that the type for each block should be a class, but that the outer object should be a map. Our previous heuristic would fail in this case, since there are only three blocks in the map.

How are we as human programmers able to infer that this data is a map? One strong clue is that the names of the properties (e.g. 000000000000000...c846dee2e13c6408f5) don't look like class property names. If only we could teach quicktype to look for this clue. Enter Markov chains!

What's a Markov chain?

If you’re a developer, you may know Markov chains as a way of generating apparently meaningful nonsense, but they have many other important applications.

A Markov chain is a set of states and transitions among those states, where each transition has a probability. Transitions occur randomly, weighted by the probabilities.

For example, let's use a Markov chain to model what I’ll eat for lunch. On most days I eat sushi, but now and then I'll have tacos. We can model this with two states: 🍣 and 🌮. The transitions between these states give the probabilities for which food I'll eat the next day. There are four possible transitions:

  • 🍣 → 🍣: if I eat sushi today, what's the likelihood I'll eat sushi tomorrow? Let's say 85%.

  • 🍣 → 🌮: if I eat sushi today, how likely am I to have tacos tomorrow? This must be 15%, since I only eat either sushi or tacos, and we already know what the probability is for sushi.

  • 🌮 → 🍣: if I eat tacos today, what’s the likelihood I’ll eat sushi tomorrow? Let's say 60%.

  • 🌮 → 🌮: by similar reasoning, this must be 40%.

Diagram of the sushi/taco Markov chain

If you run this model you could get this sequence:

🍣 🍣 🍣 🍣 🍣 🍣 🍣 🍣 🌮 🌮 🍣 🍣 🍣 🍣 🍣 🍣 🍣 🌮 🌮 🌮 🍣 🌮 🌮 🌮 🌮 🍣 🍣 🍣

Yum! So how does this relate to detecting maps versus classes in JSON? What do you notice about the following sequence:

🌮 🌮 🍣 🌮 🌮 🍣 🌮 🌮 🌮 🌮 🍣 🌮 🌮 🌮 🌮 🍣 🍣 🌮 🌮 🌮 🍣 🌮 🌮 🌮 🌮 🌮 🌮 🍣

If you had to guess, would you say it's more or less likely that this sequence was produced by the same Markov chain? In other words, is it more or less likely that the second sequence is what I ate for lunch last month?

Markov chains in quicktype

Consider a Markov chain for English words that looks at three letters at a time. Our state is the first two letters of any three-letter sequence, and our transitions give the probabilities for each possible third letter, given the first two.

For example, if the two letters are qu, then there is a high-probability transition for the letter i because many English words contain qui (e.g. acquire, colloquial, equip). On the other hand, the probability of a transition for j is zero, because no English word contains quj. What's the state that qu transitions to for i? It's ui, since we look at two characters at a time.

Part of the English Markov chain

Rather than using the Markov chain to generate pseudo-English words, we can feed it a sequence of letters, look at the probabilities of the transitions that would have produced that sequence, then use these probabilities to judge whether the sequence "behaves" like an English word. We do this by comparing the average[1] transition probability to what we would expect from an English word.

Explore transition probabilities by typing in the widget below. The color of each letter corresponds to its probability given its two predecessors (green is likely, red is unlikely, and the first letters are white because they don't have predecessors). The outer color gives the average probability—our judgement for whether the word is English:


Actually, the Markov chain used for this widget[2] is not built for detecting English, but rather for detecting class property names. In fact, it's the same Markov chain quicktype uses to judge whether a JSON property name is a class property or a map key. JSON objects usually have more than one property, though, so quicktype averages the probabilities of all property names in a given JSON object. Also, assuming JSON objects with fewer properties are more likely to be classes, quicktype allows property names to be a bit “weirder” when there are fewer of them[3].

How did I arrive at the probability threshold for determining class property names? Ideally I'd have access to all the JSON that ever has been, and ever will be produced, and enough computing power to tally up all probabilities, but for now I wrote a little script that generates a million simulated property names using an English dictionary and a list of acronyms:

pseudoscarusEEPROM
Undisappointed12
Wmv Outbabble
complexionlessDingledangle
marchpane_gb_5
horsebacker79
_canineTentacle
perigynium caq 8
noctambulousGuardable
_INDICANT_AV_9
...

See map detection in action in our map detection playground.

What if it fails?

If quicktype still fails to identify your JSON object correctly, here's what you can do:

  • If your object should be a class but quicktype thinks it's a map, you can disable map detection:
    Demo showing how to turn off map detection

  • If quicktype thinks your object is a class when it should be a map, a bit more work is needed. The easiest way to correct this is to trick quicktype by making the property names look "weird":
    Demo showing how to turn a class into a map

If you want total control over the types generated by quicktype, you can always output JSON Schema, correct the schema, then use the schema as the input to quicktype. This is the workflow we recommend for using quicktype in general.

Future improvements

Here are a few areas where we can improve map detection even further, and we’d love help from contributors on any of them:

  • quicktype should take the length of JSON property names into account. If a property name is a whole English sentence, that object is likely not a class.

  • English is not the only language in the world! Right now, Mandarin is Greek to quicktype (and so is Greek). Maybe quicktype needs to classify the JSON's "language" first and run a different Markov chain depending on the language? Or maybe a single Markov chain trained on more languages could do the trick? Where do we get training data?

  • Another clue in the Bitcoin JSON is that the map values are complex (Bitcoin objects in this case). Some classes have relatively many string properties, but it's less common to find classes with lots of array, class, or map properties.

  • If quicktype sees more than one sample for the same type, and they have few shared property names, that makes that type more likely to be a map.

  • Some common forms of map keys are not easy for a Markov chain to detect, but can be covered easily with regular expressions. Email addresses are an example.

If you’re interested in working on these problems, please come talk to us on Slack and check out quicktype on GitHub. Thanks!


  1. With probabilities we can't just take the arithmetic mean of our n individual probabilities—we need the geometric mean, which is the nth root of their product. ↩︎

  2. Explore the Markov widget further at github.com/quicktype/markov-react. ↩︎

  3. The details of that are not very scientific. ↩︎

Mark

Mark