quicktype under the hood

The Pipeline

quicktype does its magic in three stages, much like a classic compiler:

  1. Read: quicktype reads the input and converts it to an internal representation. The formats it can read so far are JSON and JSON Schema. If the input is JSON, it also does some more involved type processing here.

  2. Simplify: this stage only runs for JSON input. In this stage, quicktype tries to simplify the internal representation it constructed in the Read stage. So far, it does two things: unify classes that look similar, and detect when it's better to represent a class as a map.

  3. Render: renders the simplified internal representation as source code in Java, C#, Swift, TypeScript, Go, JSON Schema,[1] etc.

Types

Before exploring these stages in detail, let's discuss the internal representation. Here's quicktype's definition of what a JSON value's type can be:

data IRType
    = IRNoInformation
    | IRAnyType
    | IRNull
    | IRInteger
    | IRDouble
    | IRBool
    | IRString
    | IRArray IRType
    | IRClass Int
    | IRMap IRType
    | IRUnion IRUnionRep

Most of these correspond directly to JSON types:

  • IRNull, IRBool, and IRString represent null values, booleans, and strings, respectively.

  • IRInteger and IRDouble are for numbers. JSON doesn't distinguish between integers and doubles semantically, but many applications do, so quicktype does, too.

  • IRArray is for arrays with elements of a specific type, for example [1, 2, 3, 4, 5]. In JSON, arrays are not homogeneous–each element can be any type–but quicktype keeps arrays homogeneous if possible. Once it can't do that any longer (e.g. for the array [1, true, "foo"]), it uses IRUnion, which we discuss below.

  • IRClass represents JSON objects, such as { "name": "Frank", "age": 34 }, with a specific set of property names and types. It's weird that it has an associated integer and nothing else–we'll get into that later.

  • IRMap also represents JSON objects, but only where quicktype has determined (in the Simplify stage) that the property names are probably not fixed, and that it's better to consider it a map from strings to a fixed type, like in this U.S. climate data sample, where data is rendered as a map type.

  • quicktype uses IRNoInformation when it knows nothing about a type. This currently only happens with empty arrays: when quicktype encounters an empty array it represents it as IRArray IRNoInformation. Before quicktype renders output, it replaces all IRNoInformations with IRAnyType.

  • The permitted values for IRAnyType can be anything, hence the name. The array type IRArray IRAnyType, for example, will accept any combination of element types, such as in [1, true, "foo"]. IRAnyType can come about not only via IRNoInformation, but also when reading JSON Schema, which can express the concept directly.

  • Finally, IRUnion describes values that can be any one type within a set of types (e.g. an integer or a string). When quicktype encounters heterogeneous JSON values, it creates an IRUnion rather than simply inferring a more general type like object. Heterogeneous values are extremely common in JSON, and this approach preserves more type information. We'll see how those IRUnions come about when we talk about transforming the input into the internal representation.
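To make the IRNoInformation replacement concrete, here is a minimal sketch of such a pass. It is not quicktype's actual code: the function name replaceNoInformation is hypothetical, and IRUnion is modelled as a plain array of member types because IRUnionRep is not shown here.

```purescript
module NoInformationSketch where

import Prelude

-- Simplified stand-in for IRType. Assumption: unions are a plain array of
-- member types rather than quicktype's real IRUnionRep.
data IRType
  = IRNoInformation
  | IRAnyType
  | IRNull
  | IRInteger
  | IRDouble
  | IRBool
  | IRString
  | IRArray IRType
  | IRClass Int
  | IRMap IRType
  | IRUnion (Array IRType)

derive instance eqIRType :: Eq IRType

-- Replace every IRNoInformation with IRAnyType, recursing into arrays, maps
-- and unions. Class properties live in the class table, so a full pass would
-- have to apply the same function to every table entry as well.
replaceNoInformation :: IRType -> IRType
replaceNoInformation IRNoInformation = IRAnyType
replaceNoInformation (IRArray t) = IRArray (replaceNoInformation t)
replaceNoInformation (IRMap t) = IRMap (replaceNoInformation t)
replaceNoInformation (IRUnion ts) = IRUnion (map replaceNoInformation ts)
replaceNoInformation t = t

-- The type of an empty array, before and after the pass.
emptyArrayBefore :: IRType
emptyArrayBefore = IRArray IRNoInformation

emptyArrayAfter :: IRType
emptyArrayAfter = replaceNoInformation emptyArrayBefore
```

Here emptyArrayAfter is IRArray IRAnyType, which is the type a renderer would see for an empty array.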

The Class Graph

Many data formats contain some self-referential parts. Let's say you have some JSON data for a family tree. Each object describing a person could have a biologicalMother field, the value of which would also be a person. The type for person, therefore, is self-referential. quicktype's internal representation is able to represent such self-referential types. PureScript (the language quicktype is implemented in) cannot construct self-referential values[2], so quicktype does this indirectly through a table. Each slot in the table stores the data for a class, and classes are referred to via integer indices in that table.[3]

Entries in that class table look like this:

newtype IRClassData = IRClassData
    { names :: Named (Set String)
    , properties :: Map String IRType
    }

names is a value used to name classes, which we will cover in a future blog post. properties is a map with the IRTypes for all the class's properties.

As an example, let's represent a classic binary tree as an IRType. The equivalent PureScript type would be

newtype Tree = Tree
    { data :: Int
    , left :: Maybe Tree
    , right :: Maybe Tree
    }

An example tree in JSON:

{
  "data": 31415,
  "left": { "data": 9265, "left": null, "right": null },
  "right": null
}

left and right are either null or a tree, so they have to be IRUnions of IRNull and an IRClass. The Int of that class is the index of the table slot with the IRClassData of the Tree class itself, so if we put the Tree class at index 123, its IRType would be IRClass 123, and its IRClassData (at slot 123 in the table) would look like this (IRUnion details omitted, and Map syntax simplified):

IRClassData
    { names: ...
    , properties:
          { "data": IRInteger
          , "left": IRUnion [ IRNull, IRClass 123 ]
          , "right": IRUnion [ IRNull, IRClass 123 ]
          }
    }

Finally, pulling all of the types we've discussed so far together is IRGraph:

newtype IRGraph = IRGraph
    { classes :: Seq Entry
    , toplevels :: Map String IRType
    }

classes is the class table we just covered[4], and toplevels holds one or more top-level types: the top-level types of your JSON sample data or JSON Schema. The quicktype.io web app currently allows only a single top-level type, but it's possible to have any number of top-levels when using the quicktype CLI. Note that the top-level types don't have to be IRClasses; quicktype will happily accept an array as a top-level input, or even a primitive type, like a boolean[5].
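As a rough illustration of how the class table and the integer indices fit together, the sketch below builds a one-entry graph for the Tree example and follows an index back into the table. It makes several simplifying assumptions: the table is a plain Array instead of Seq Entry, names is left out, unions are an array of member types, and the Tree class sits at index 0 rather than 123 because it is the only entry. Graph, ClassData, and propertyType are hypothetical names.

```purescript
module ClassGraphSketch where

import Prelude

import Data.Array ((!!))
import Data.Map (Map)
import Data.Map as Map
import Data.Maybe (Maybe(..))
import Data.Tuple (Tuple(..))

-- Only the constructors the Tree example needs.
data IRType
  = IRNull
  | IRInteger
  | IRClass Int
  | IRUnion (Array IRType)

derive instance eqIRType :: Eq IRType

-- Simplified table entry: just the properties, no names.
type ClassData = { properties :: Map String IRType }

type Graph =
  { classes :: Array ClassData
  , toplevels :: Map String IRType
  }

-- The Tree class refers to itself through its own table index, 0.
treeGraph :: Graph
treeGraph =
  { classes:
      [ { properties: Map.fromFoldable
            [ Tuple "data" IRInteger
            , Tuple "left" (IRUnion [ IRNull, IRClass 0 ])
            , Tuple "right" (IRUnion [ IRNull, IRClass 0 ])
            ]
        }
      ]
  , toplevels: Map.fromFoldable [ Tuple "Tree" (IRClass 0) ]
  }

-- Type of a property of the class stored at the given table index, if the
-- index and the property both exist.
propertyType :: Graph -> Int -> String -> Maybe IRType
propertyType graph index name = do
  cls <- graph.classes !! index
  Map.lookup name cls.properties

-- Following "left" from Tree leads back to Tree itself.
leftIsRecursive :: Boolean
leftIsRecursive =
  propertyType treeGraph 0 "left" == Just (IRUnion [ IRNull, IRClass 0 ])
```

No value in treeGraph contains itself; the cycle exists only through the index, which is what lets a language without self-referential values describe a recursive type.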

Read

The Read stage is simple except for one detail. For the most part, it just converts JSON values into their corresponding IRTypes as described above. The one complication arises when it encounters arrays that have elements of more than one type. To solve the problem, it "unifies" the element types. These are the rules of unification:

  • Unifying any type T with itself gives that same type T.

  • Unifying IRInteger and IRDouble gives IRDouble.

  • Arrays are unified with other arrays by unifying their element types.

  • Classes are unified with other classes by building the union of their properties. In cases where the classes have properties with the same name but with different types, the types of those properties are unified.

  • Unifying IRNoInformation with any type T gives that same type T. Since IRNoInformation is the element type of empty arrays, it would seem like IRNoInformation would never have to be unified with any other type. Consider, however, the array [ [1, 2, 3], [] ]. The first element array will have type IRArray IRInteger, and the second one IRArray IRNoInformation. These two types have to be unified now to get a result type for the whole array, and per the IRArray rule it's therefore necessary to unify IRInteger and IRNoInformation. The result of that is IRInteger per this very rule, so the result for the whole array is IRArray IRInteger.

  • Unifying IRAnyType with any type T results in IRAnyType. Unifying two types always generalizes, and IRAnyType can't be generalized any further.

  • Unifying two IRUnions: IRUnions cannot directly contain other IRUnions, so the catch-all rule at the end of this list cannot be applied to them. Instead, quicktype forms a new IRUnion containing the union of the types found in the two IRUnions; however, in our current implementation, an IRUnion can contain at most one class, one array, and one map[6]. When unifying two unions where both contain, for example, an array, the unified union will contain one array that's the result of unifying the two arrays.

  • Unifying any other pair of types produces an IRUnion containing both of them.
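The rules above can be sketched as a single function. This is a simplified illustration, not quicktype's implementation: classes and maps are left out because unifying them needs the class table, IRUnion is modelled as an array of non-union members, and the helper names addMember and inferElementType are hypothetical. The rules do not say how integers and doubles combine once they are inside a union, so this sketch keeps them as separate members.

```purescript
module UnifySketch where

import Prelude

import Data.Array (cons, uncons)
import Data.Foldable (foldl)
import Data.Maybe (Maybe(..))

data IRType
  = IRNoInformation
  | IRAnyType
  | IRNull
  | IRInteger
  | IRDouble
  | IRBool
  | IRString
  | IRArray IRType
  | IRUnion (Array IRType) -- assumption: members are never IRUnions

derive instance eqIRType :: Eq IRType

unify :: IRType -> IRType -> IRType
unify IRNoInformation t = t
unify t IRNoInformation = t
unify IRAnyType _ = IRAnyType
unify _ IRAnyType = IRAnyType
unify IRInteger IRDouble = IRDouble
unify IRDouble IRInteger = IRDouble
unify (IRArray a) (IRArray b) = IRArray (unify a b)
unify (IRUnion xs) (IRUnion ys) = IRUnion (foldl addMember xs ys)
unify (IRUnion xs) b = IRUnion (addMember xs b)
unify a (IRUnion ys) = IRUnion (addMember ys a)
unify a b
  | a == b = a -- a type unified with itself is unchanged
  | otherwise = IRUnion [ a, b ] -- catch-all: build a two-member union

-- Add one non-union type to a union's members. An existing array member
-- absorbs a new array by unification, so a union holds at most one array;
-- a member that is already present is not added twice.
addMember :: Array IRType -> IRType -> Array IRType
addMember members t = case uncons members of
  Nothing -> [ t ]
  Just { head, tail } -> case head, t of
    IRArray a, IRArray b -> cons (IRArray (unify a b)) tail
    _, _
      | head == t -> members
      | otherwise -> cons head (addMember tail t)

-- Element type of an array, given the types of its elements. Starting from
-- IRNoInformation makes an empty array come out as IRNoInformation.
inferElementType :: Array IRType -> IRType
inferElementType = foldl unify IRNoInformation

-- Element types of [ [1, 2, 3], [] ].
nestedExample :: IRType
nestedExample = inferElementType [ IRArray IRInteger, IRArray IRNoInformation ]

-- Element types of [1, 2, "foo", "bar", 3].
mixedExample :: IRType
mixedExample =
  inferElementType [ IRInteger, IRInteger, IRString, IRString, IRInteger ]
```

With these definitions nestedExample evaluates to IRArray IRInteger and mixedExample to IRUnion [ IRInteger, IRString ].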

The way IRUnion currently works makes it impossible to express some type constraints. For example, quicktype can represent the type of arrays that contain either integers or strings (Array<int | string> in TypeScript), such as

[1, 2, "foo", "bar", 3]

but it cannot express the type "either an array of integers or an array of strings" (Array<int> | Array<string> in TypeScript). This was a design choice to keep things simple for now, but we plan to enable this later.

Simplify

We only run Simplify when generating code based on JSON sample data rather than JSON Schema; when generating code from JSON Schema, quicktype assumes that the user wants that exact schema, so there's no room for interpretation.

Currently, quicktype performs two transformations in this stage:

  • Unifying similar classes. This transformation considers two classes A and B similar if at least three-fourths of A's properties are also in B with the same type, and vice versa. Similar classes are unified into a single one[7], via the unification algorithm discussed above.

  • Transforming classes into maps. quicktype has a simple heuristic for deciding when to represent a class as a map: if the class has 20 or more properties, and all its properties are of the same type, then it becomes a map. There are more heuristics we want to implement.
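Both heuristics are small enough to sketch directly. The code below is an illustration under stated assumptions, not quicktype's code: classes are reduced to their property maps, the names similar and asMapType are hypothetical, and the three-fourths test is done in integer arithmetic (4 × shared ≥ 3 × total) to avoid rounding.

```purescript
module SimplifySketch where

import Prelude

import Data.Array (filter, fromFoldable, length, range, uncons)
import Data.Foldable (all)
import Data.Map (Map)
import Data.Map as Map
import Data.Maybe (Maybe(..))
import Data.Tuple (Tuple(..))

-- Only the constructors the examples need.
data IRType
  = IRInteger
  | IRDouble
  | IRString
  | IRMap IRType

derive instance eqIRType :: Eq IRType

type Properties = Map String IRType

-- Number of properties of `a` that also occur in `b` with the same type.
sharedCount :: Properties -> Properties -> Int
sharedCount a b = length (filter sameInB entries)
  where
  entries :: Array (Tuple String IRType)
  entries = Map.toUnfoldable a

  sameInB (Tuple name t) = Map.lookup name b == Just t

-- At least three-fourths of `a`'s properties are in `b` with the same type.
mostlyContainedIn :: Properties -> Properties -> Boolean
mostlyContainedIn a b = 4 * sharedCount a b >= 3 * Map.size a

-- Similarity has to hold in both directions.
similar :: Properties -> Properties -> Boolean
similar a b = mostlyContainedIn a b && mostlyContainedIn b a

-- A class with 20 or more properties that all share one type becomes a map
-- to that type; anything else stays a class (Nothing).
asMapType :: Properties -> Maybe IRType
asMapType props = case uncons types of
  Just { head, tail } ->
    if length types >= 20 && all (_ == head) tail then Just (IRMap head)
    else Nothing
  Nothing -> Nothing
  where
  types :: Array IRType
  types = fromFoldable (Map.values props)

person :: Properties
person = Map.fromFoldable
  [ Tuple "name" IRString
  , Tuple "age" IRInteger
  , Tuple "email" IRString
  , Tuple "id" IRInteger
  ]

personWithPhone :: Properties
personWithPhone = Map.insert "phone" IRString person

-- Twenty properties named p1 .. p20, all doubles.
readings :: Properties
readings =
  Map.fromFoldable (map (\i -> Tuple ("p" <> show i) IRDouble) (range 1 20))
```

In this example person and personWithPhone count as similar (4 of 4 and 4 of 5 properties shared), while asMapType readings gives Just (IRMap IRDouble) and asMapType person gives Nothing.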

Render

Each renderer targets a different language, but there are many commonalities among them. A renderer for a particular target language is a record of type Renderer. It includes metadata like the target language name; contains strategies for naming types in the target language; and, most importantly, each renderer has a function for producing the output. Here's what it looks like for C#:

renderer =
    { name: "C#"
    , aceMode: "csharp"
    , extension: "cs"
    , doc: csharpDoc
    , options: [listOption.specification]
    , transforms:
        { nameForClass: simpleNamer nameForClass
        , nextName: \s -> "Other" <> s
        , forbiddenNames
        , topLevelName: noForbidNamer csNameStyle
        , unions: Just
            { predicate: unionIsNotSimpleNullable
            , properName: simpleNamer nameForType
            , nameFromTypes: simpleNamer (unionNameIntercalated csNameStyle "Or")
            }
        }
    }

  • name is the renderer's name in the UI.

  • aceMode is the name of the syntax highlighting mode for the code editor in the UI, Ace.

  • extension is the file name extension used for the language.

  • doc is the rendering function, to be discussed in a future blog post.

  • options is an array of customization options for this renderer.

  • transforms.nameForClass is a function for naming classes, given the name that comes from the original JSON or JSON Schema.

  • transforms.nextName produces a new name if a name is already taken. quicktype is not very smart about this yet, and just prepends "Other" to the original name.

  • transforms.forbiddenNames are names that must not be used for types from JSON. Those are usually keywords and names of common types in the target language.

  • transforms.topLevelName is a naming function for top-level types.

  • transforms.unions is a Maybe that can be Nothing for renderers that don't treat IRUnions specially. quicktype generates C# classes to "emulate" unions, for example, but in TypeScript it expresses them directly with the native union types, so this field isn't needed in the TypeScript renderer. If we do have unions, it contains:

    • transforms.unions.predicate decides which unions require special treatment, like generating classes for them. C#, for example, has a Nullable type which allows expressing the union of IRBool and IRNull directly as "bool?", and reference types like classes, arrays, and strings are always nullable, so "emulated" unions are not needed in those cases, hence predicate will return false for them.

    • transforms.unions.properName and transforms.unions.nameFromTypes are naming functions for those unions that do get special treatment. The former is used when quicktype has inferred a name for the union, the latter when it hasn't, in which case this particular naming function will produce something like "IntOrString".
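To illustrate how nextName and forbiddenNames might interact, here is a minimal sketch of picking a usable name. The nextName function is the one from the C# renderer record above; freshName and the idea of passing forbidden and already-used names as one set are assumptions made for this example, since the post does not show how quicktype tracks names.

```purescript
module NamingSketch where

import Prelude

import Data.Set (Set)
import Data.Set as Set

-- Same fallback as the C# renderer: prepend "Other" to the name.
nextName :: String -> String
nextName s = "Other" <> s

-- Hypothetical helper: keep applying nextName until the candidate is neither
-- forbidden nor already taken. It terminates because every step makes the
-- name longer and the set is finite.
freshName :: Set String -> String -> String
freshName unavailable candidate
  | Set.member candidate unavailable =
      freshName unavailable (nextName candidate)
  | otherwise = candidate

-- Forbidden and already-used names, merged into one set for this sketch.
unavailableNames :: Set String
unavailableNames = Set.fromFoldable [ "Object", "Person", "OtherPerson" ]

secondPerson :: String
secondPerson = freshName unavailableNames "Person"
```

Here secondPerson comes out as "OtherOtherPerson", which shows why prepending "Other" is a simple strategy rather than a smart one.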

We'll see how these rendering components work together to produce valid source code in future posts. If you're interested, please subscribe to our RSS feed, check out our GitHub repo, or simply say "👋" on Slack.


  1. Yes, that means you can make quicktype produce JSON Schema from a JSON Schema input. Our test suite actually checks this configuration and requires that input JSON Schemas be identical to the outputs. ↩︎

  2. PureScript supports self-referential types, such as linked lists (List), but quicktype doesn't represent JSON types as PureScript types. It represents them as PureScript values (of type IRType), so it needs self-referential values. ↩︎

  3. We could have done that for all IRTypes instead of just for classes, which would make quicktype able to represent recursive array types, for example. Whichever way we do it when we revise the internal representation, the implementation details of this indirection shouldn't be exposed anymore. ↩︎

  4. For the purposes of this blog post please pretend Entry is the same as IRClassData. ↩︎

  5. Swift's JSONSerialization only accepts objects and arrays as top-level values, so only IRClass, IRArray, and IRMap will work there. ↩︎

  6. Allowing both maps and classes in the same IRUnion is a bug. ↩︎

  7. We look for groups of similar classes rather than just pairs, although we're not rigorous about it. ↩︎

Mark
