The WhatEvery1Says Manifest Schema

Introduction

The WhatEvery1Says Schema (hereafter "WE1S Schema") is a set of recommendations for the construction of manifest documents for the WE1S project. Manifests are documents which describe resources available to the WhatEvery1Says ecosystem. A manifest consists of:

  • Metadata that describes the structure and contents of resources (broadly defined)
  • Pointers to other resources including other manifests and data files

Pointers may be provided as:

  • Remote resources, referenced by URL
  • "Metapaths" which indicate hierarchically-arranged relationships between resources in the WE1S ecosystem
  • Inline resources such as data included directly in the manifest

WE1S manifests can be used for a variety of purposes. They may include metadata describing a publication, a process, a set of data, or an output of some procedure. Their primary intent is to help humans document and keep track of their workflow.

Manifests are designed to be read easily by humans but parsed just as easily by programming languages. By default, manifests are stored in JSON format. Manifests may be standalone files ("JSON files"), or they may be stored in a database. They may also contain data themselves.

The JSON Format

Each JSON document contains a JSON object containing a series of comma-separated key-value pairs enclosed in curly brackets ({}).

{
  "keyword1": "value",
  "keyword2": "value"
}

These pairs are inherently unordered so, to give them sequence, they may be placed in an array, designated by square brackets ([]):

{
  "keyword": [
    {"keyword1": "value"},
    {"keyword2": "value"}
  ]
}

For multiple properties, it may be useful to construct a more elaborate structure consisting of arrays of objects:

{
  "keyword1": [
    {"sequence": 1},
    {"keyword3": "value"}
  ],
  "keyword2": [
    {"sequence": 2},
    {"keyword4": "value"}
  ]
}

Sequential order is assumed for the values of arrays; items within objects have no inherent order.
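
The ordering rule can be checked from a programming language. The following is a minimal Python sketch using only the standard json module; the sample documents are illustrative placeholders and are not taken from an actual WE1S manifest.

```python
import json

# Two objects with the same pairs in a different order are equal once parsed,
# because items within objects have no inherent order.
obj_a = json.loads('{"keyword1": "value", "keyword2": "value"}')
obj_b = json.loads('{"keyword2": "value", "keyword1": "value"}')
objects_equal = obj_a == obj_b

# Two arrays with the same items in a different order are not equal,
# because the sequence of an array is significant.
arr_a = json.loads('[{"keyword1": "value"}, {"keyword2": "value"}]')
arr_b = json.loads('[{"keyword2": "value"}, {"keyword1": "value"}]')
arrays_equal = arr_a == arr_b

print(objects_equal)  # True
print(arrays_equal)   # False
```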

A JSON document can have an unlimited number of key-value pairs. The WE1S schema places restrictions on what keywords can be used to document resources and makes recommendations for structuring manifests in a consistent manner. The schema is based on JSON Schema, a vocabulary that allows you to annotate and validate JSON documents. An excellent overview is provided in Michael Droettboom's Understanding JSON Schema tutorial. WE1S manifests follow the syntax of JSON Schema so that they always have a valid (i.e. predictable) format. This ensures maximum interoperability for a variety of uses.

Manifests in the WE1S Ecosystem

The WE1S ecosystem consists of a framework of data, tools, and resources which are meant to be used together, with manifests used to control and describe workflow. Manifests may be stored in a standard operating system's hierarchical file storage system. However, WE1S employs the MongoDB database to manage and search the large number of files generated by the project. Because MongoDB stores its records in a JSON-like format, it is an ideal medium for working with WE1S manifests. MongoDB also allows the project to implement a "materialized path" data model, which mirrors the characteristics of hierarchical file storage. Each manifest is given a "materialized path" property called a metapath which is similar to an operating system's file path. The similarity to an actual file path is deliberate; it allows human readers to see directly from the manifest where the file lives within the project ecosystem. The metapath is a useful property for importing manifests to and exporting them from the database in an intuitive manner.

Metapaths, the Database Structure, and Data Packages

"Metapaths" are equivalent to operating system file/folder paths, except that they do not indicate actual locations within the local file hierarchy. Instead, they serve to model conceptually the relationships of resources in a manner similar to local file storage. Because metapaths reference nodes above the level of a given manifest, content can be easily queried in these higher nodes. A given manifest thus effectively inherits content from manifests above it along the same metapath.

The concept of the metapath is formalised as the metapath property in the WE1S schema. A metapath is a string with the following additional constraints:

  • A metapath MUST be a Unix-style POSIX path, except that it uses , as a separator, rather than /.
  • Absolute paths equivalent to Unix-style '/' and relative parent paths equivalent to Unix-style ../ MUST NOT be used, and implementations SHOULD NOT support these path types.
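
The constraints above could be checked with a short function. This is a sketch only, not part of the WE1S tooling: the function name is_valid_metapath and the sample metapaths are hypothetical, and the rejection of slashes and of empty segments is an assumption that goes beyond what the schema states.

```python
SEPARATOR = ","

def is_valid_metapath(metapath):
    """Return True if the value satisfies the metapath constraints."""
    if not isinstance(metapath, str) or metapath == "":
        return False
    # Assumption: a slash suggests the wrong separator was used.
    if "/" in metapath:
        return False
    # A leading separator would be the equivalent of an absolute Unix path.
    if metapath.startswith(SEPARATOR):
        return False
    segments = metapath.split(SEPARATOR)
    # ".." would be the equivalent of a relative parent path.
    if ".." in segments:
        return False
    # Assumption: doubled or trailing separators (empty segments) are rejected.
    return all(segment != "" for segment in segments)

print(is_valid_metapath("Corpus,new_york_times,RawData"))
print(is_valid_metapath(",Corpus,new_york_times"))
```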

The reason why , is specified as the separator instead of / is that MongoDB is the assumed storage medium for the WE1S project. MongoDB searches documents using regex patterns, which use / as a delimiter. A choice must be made whether to store the metapath separator as a comma and convert it to a slash for display purposes or to store it as a slash and convert it to a comma every time a database query is made. Since the metapath does not represent a real file location, the former strategy seemed the better of the two solutions.
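
The store-as-comma, display-as-slash strategy might look like the following Python sketch. The helper names (to_display, ancestors, descendants_pattern) and the sample metapath are hypothetical; they illustrate the idea and are not drawn from the WE1S codebase.

```python
import re

def to_display(metapath):
    """Convert a stored comma-separated metapath to a slash-separated form."""
    return metapath.replace(",", "/")

def ancestors(metapath):
    """List the metapaths of every node above the given one, nearest last."""
    segments = metapath.split(",")
    return [",".join(segments[:i]) for i in range(1, len(segments))]

def descendants_pattern(metapath):
    """Build a regex source string matching metapaths below the given node."""
    return "^" + re.escape(metapath) + ","

stored = "Corpus,new_york_times,RawData"
print(to_display(stored))
print(ancestors(stored))

# The pattern matches nodes below "Corpus,new_york_times" but not the node itself.
pattern = descendants_pattern("Corpus,new_york_times")
print(bool(re.match(pattern, stored)))
```

Because the stored separator is a comma, the pattern contains no slashes that would need escaping. With MongoDB, a pattern string of this kind could presumably be supplied to the $regex query operator on the metapath field.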

The WE1S ecosystem consists of four inter-related database-like structures which can be referenced through the metapath property.

  1. Corpus: The storage category for all data, including primary source material, transformed data, the results of analysis, and related documents.
  2. Sources: The storage category for metadata about all source material used to compile the data in Corpus.
  3. Processes: The storage category for metadata describing the procedures used to collect and analyse the data in Corpus.
  4. Scripts: The storage category for files containing code used to implement the procedures described in Processes where these procedures were not implemented using external tools or scripts.

Individual projects may create other database-like structures as needed. At present, WE1S has only fully developed and employed the Corpus and Sources schemas.

The WE1S schema builds on the Frictionless Data notion of a data package. A data package is a special type of manifest (called datapackage.json) used to containerise data and associated resources. When data is exported from the WE1S database, it will ideally be exported in the form of a data package, with content in subfolders corresponding to the database-like categories described above.

Conventions

Language

Manifests storing information in these four categories can be considered to belong to different manifest "types", depending on their function or the nature of their content. Other types of manifests are used specifically for storing certain forms of data, e.g. raw or processed, or to create branching structures in the metapath hierarchy.

The key words MUST, MUST NOT, REQUIRED, SHALL, SHALL NOT, SHOULD, SHOULD NOT, RECOMMENDED, MAY, and OPTIONAL in this document are to be interpreted as described in RFC 2119.

Manifests of all types are REQUIRED to contain certain common properties. Other REQUIRED and OPTIONAL properties will depend on the manifest type. The REQUIRED and OPTIONAL properties for each manifest type are described in the Specification.

Formatting Dates

In the WE1S Schema, dates are given as strings in date (YYYY-MM-DD) or datetime (e.g. 2017-09-16T12:49:05Z) format.

{
  "date": "2017-09-16"
}

Multiple dates may be given in an array.

{
  "date": [
    "2017-09-16",
    "2017-09-16T12:49:05Z"
  ]
}

If it is necessary to specify the format, the date may be given as an object containing text and format properties:

{
  "date": [
    {
      "text": "2017-09-16",
      "format": "date"
    },
    {
      "text": "2017-09-16T12:49:05Z",
      "format": "datetime"
    }
  ]
}

Date ranges can be specified with an object with the keyword range. The object MUST contain a start property and MAY contain an end property. Both MUST have string values as in the example below:

{
  "date": {
    "range": {
      "start": "2017-09-16",
      "end": "2018-09-16"
    }
  }
}

start and end values may also be expressed as objects containing text and format properties.
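
A consumer of manifests has to accept each of the forms shown above. The following Python sketch is one possible way to normalise them; the function names parse_single and parse_date are hypothetical, and the two strptime patterns assume that dates and datetimes always follow the exact layouts used in the examples.

```python
from datetime import datetime

# Assumption: only these two layouts occur; the trailing Z is matched literally,
# so the resulting datetime objects carry no timezone information.
FORMATS = {"date": "%Y-%m-%d", "datetime": "%Y-%m-%dT%H:%M:%SZ"}

def parse_single(value):
    """Parse a date given as a string or as an object with text and format."""
    if isinstance(value, dict):
        return datetime.strptime(value["text"], FORMATS[value["format"]])
    # No format given: try each known layout in turn.
    for pattern in FORMATS.values():
        try:
            return datetime.strptime(value, pattern)
        except ValueError:
            continue
    raise ValueError("unrecognised date: %r" % (value,))

def parse_date(value):
    """Return a list of datetimes, or a (start, end) tuple for a range."""
    if isinstance(value, list):
        return [parse_single(item) for item in value]
    if isinstance(value, dict) and "range" in value:
        date_range = value["range"]
        start = parse_single(date_range["start"])  # start is required
        end = parse_single(date_range["end"]) if "end" in date_range else None
        return (start, end)
    return [parse_single(value)]

print(parse_date("2017-09-16"))
print(parse_date({"range": {"start": "2017-09-16", "end": "2018-09-16"}}))
```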