Specification
Author | Scott Kleinman |
Created | 1 May 2018 |
Updated | 18 March 2021 |
JSON Schema | See the GitHub repository |
Version | 2.0.1 |
Manifest
A manifest is a JSON file which describes the nature of a resource of any type. A manifest MUST be a valid JSON object
as defined in RFC 4627. The name of the manifest file must be the value of its name
property followed by the .json
extension. An exception is made for the manifest of a Frictionless Data data package, which for compatibility must be called datapackage.json
.
Manifests MUST contain certain REQUIRED properties and MAY contain any number of OPTIONAL properties. Adherence to the WE1S specification does not imply that additional, non-specified properties cannot be used: a manifest MAY include any number of properties in addition to those described as REQUIRED and OPTIONAL properties. For example, if you were storing
time series data and wanted to list the temporal coverage of the data in a
source, you could add a property temporal
(cf. Dublin Core terms-temporal):
"temporal": {
"name": "19th Century",
"start": "1800-01-01",
"end": "1899-12-31"
}
This flexibility enables specific communities to extend the schema for the data they manage.
A Global property is a JSON keyword which is available to all manifests, regardless of type. Global properties can either be REQUIRED in all manifests or OPTIONAL in all manifests. Additional REQUIRED and OPTIONAL properties for specific manifest types are discussed individually under separate headings.
Global REQUIRED Properties
name
A short URL-usable (and preferably human-readable) name of
the package. This MUST be lower-case and contain only alphanumeric characters
along with ".", "_" or "-" characters. It will function as a unique
identifier and therefore SHOULD be unique at the level of the terminal node in the metapath
property (also globally REQUIRED). The value of name
taken together with the value of metapath
should form a globally unique identifier.
The name
SHOULD be invariant, meaning that it SHOULD NOT change when a manifest is updated, unless the new manifest should be considered a
distinct manifest, e.g. due to significant changes in structure or
interpretation. Version distinction SHOULD be left to the version
property. As
a corollary, the name
also SHOULD NOT include an indication of time range
covered.
metapath
A string
providing a "materialised" path representing the location of the this manifest relative to its root. A metapath takes the form of a POSIX file path, except that the normal /
delimiter is replaced with a comma.
namespace
A string
representing the WE1S namespace and version number (e.g. "we1sv2.0"). The presence of namespace
ensures that applications be designed to be handle legacy materials as the schema changes over time.
Note
The value of namespace
should ultimately be replaced by an object
consisting of a name
and a url
to the location of the WE1S JSON schema file. For example:
"namespace": {
"name": "we1sv2.0",
"url": "https://github.com/whatevery1says/manifest/tree/master/schema/manifest.json"
}
title
A string
providing a title or one sentence description for this manifest.
Global OPTIONAL Properties
The following are commonly used properties that the manifest MAY contain:
id
A property reserved for globally unique identifiers, typically conforming to some external schema. Examples of identifiers that are unique include UUIDs and DOIs.
While at the level of the specification, global uniqueness cannot be validated, consumers using the id
property MUST ensure identifiers are globally unique.
Because WE1S assumes that manifests will be stored in a MongoDB database, the value of id
MAY be filled by MongoDB's auto-generated primary key ObjectId. See discussion of the _id
property below.
_id
The _id
property is generated by MongoDB's ObjectId method for all records stored in the database. The 12-byte ObjectId value consists of:
- a 4-byte value representing the seconds since the Unix epoch,
- a 3-byte machine identifier,
- a 2-byte process id, and
- a 3-byte counter, starting with a random value.
In most cases, the _id
will not be used for database interactions since most queries will be expected to return manifests specified by the name
property along a specific metapath
.
Whilst all manifests stored in the database will automatically gain an _id
value, it may be necessary to keep this separate from the value of id
if the latter is used to store an external identifier such as a DOI. In these cases, the _id
may be written to exported manifests files where it will live as a redundant property.
description
A description of the package. The description MUST be Markdown formatted string
— this also allows for simple plain text as plain text is itself valid markdown. The first paragraph (up to the first double line break) should be usable as summary information for the package.
version
A string
identifying the version of the package. It should conform to the Semantic Versioning requirements and should roughly follow the Frictionless Data Data Package Version pattern.
The value of the image property MUST be a string
pointing to the location of the image. The string must be either a URL or a metapath
value, typically something like Corpus,collection_name,Related,image_file
.
shortTitle
A string
providing a shortened or alternative version of the manifest's title
value.
label
A string
providing an abbreviated or other identifier for the manifest which can be used in graphs and other displays where space is limited.
notes
An array
of text strings which can contain extended prose commentary about the manifest's content. Individual notes MUST be formatted in Markdown.
keywords
An array
of string
keywords to assist users searching for the
manifest using terms from a controlled vocabulary or some other method of classification.
image
An image to use when displaying the manifest, for instance, in a list of manifests. The value of the image property MUST be a string
pointing to the location of the image. The string must be a fully qualified HTTP address, a relative POSIX path, or a metapath
to a storage location in a database.
updated
The updated
property is used to describe changes made to a manifest after its initial creation. The value of the property MUST be an array
of objects
. Each object
MUST contain the change
and date
properties and MAY contain a contributors
property. The change
property must be a string
. The date
property follows the standard pattern described under Formatting Dates and the contributors
property follows the specifications described under collection
manifests.
Important
The updated
property describes ONLY changes to the manifest document in which it is include. It does not apply to changes in linked data. For instance, if additional data files are added to an existing collection
. The change
property may indicate this, but it may not be possible to implement a procedure to recover the prior state of the data. This may affect the reproducibility of certain processes. Recommended methods of addressing this include creating a project based on the collection prior to the change, creating a separate collection for the changed data, or implementing changes along different metapath
branches of the original collection
. The addition of new branches can be described using the updated
property in the collection
manifest.
Sources
Manifests
Sources
manifests contain bibliographical information about the sources (typically publications) of data in the WE1S Corpus
. All Sources
manifests are stored in the WE1S Sources
database and will therefore have the metapath
value Sources
.
REQUIRED Properties
The following globally REQUIRED properties MUST be included in every Sources
manifest: name
, title
, namespace
.
OPTIONAL Properties
The following globally OPTIONAL properties MAY be included in a Sources
manifest: id
, _id
, description
, version
, keyword
, image
, shortTitle
, label
, notes
, updated
.
The following additional OPTIONAL properties MAY also be included in a Sources
manifest.
publisher
A string
containing the name of the source's publisher.
webpage
A string
containing a URL for the source. This is an alternative for publisher
since sometimes that information will not be available for web pages.
authors
An array
containing the names of the author(s) of a publication (e.g. a book). The array
can consist of string
values or objects
. In general, string
values should be standard full representations of authors' names since it is expected that this information will be queried by regex. However, an object
can be used to indicate collective authorship using the group
and organization
properties:
"authors": [
{
"group": "Summer Research Camp 2018",
"organization": "UC Santa Barbara"
}
]
Objects
can also be used to encode specific parts of author names such as surname
and forename
. However, WE1S currently has no standard nomenclature for parts of names.
date
An array
containing the date of publication or range of publication dates where known. Dates should be given in ISO 8601 date (YYYY-MM-DD
) or datetime (e.g. 2017-09-16T12:49:05Z
) format, wherever sufficient information is known. For further information, see the section on Formatting Dates.
edition
A string
indicating the edition number (e.g. "2nd") or medium (e.g. "print" or "online") of the publication. WE1S currently has no controlled vocabulary for edition
values.
contentType
A string
representing the nature or genre of the data (e.g. "newspaper"). WE1S does not currently have a controlled vocabulary for the value of this field.
country
A string
value taken from the ISO 3166-1 ALPHA-2 country codes.
language
A string
value taken from the ISO 639-2 list of language codes. If multiple languages are required, an array
of strings
can be supplied.
citation
An object
containing a bibliographic citation for the source, generally one that is intended for display.
Citations MUST contain a schema
property indicating the style guidelines followed in formatting the citation. This may be either a string
value like "Chicago, 17th edition" or, preferably, a URL to a schema description website (e.g. "https://github.com/citation-style-language/schema").
A citation MAY contain a text
property containing a fully-formatted string
citation. Formatting must be given in Markdown.
A citation MAY contain an object
with field names corresponding to a schema such as the Citation Style Language. For example:
{
"citation": {
"schema": "https://github.com/citation-style-language/schema",
"fields": {
/* A partial citation */
"title": "The Cambridge Companion to Textual Scholarship",
"publisher": "Cambridge University Press",
"ISBN": "978-0-521-60329-4"
}
}
}
There is no standard JSON schema for citations, but WE1S recommends the Citation Style Language. CSL is used by Zotero, which can export its records as JSON objects that can be inserted in Sources
manifests.
Data
Manifests
A Data
manifest is a JSON file which contains inline data or, points to data in another location.
Data
manifests containerise data files, generally, but not always textual data. All WE1S Data
manifests are stored in the Corpus
database and therefore will have metapath
values beginning with Corpus
.
The illustrative example below points to data available at a certain URL:
{
"name": "an_article",
"title": "Title of the Article",
"path": "http://example.com/an-article.txt"
}
Local data resources are assumed to belong to data sets called collections
. A Data
manifest can contain "inline" data by referencing the manifest's location relative to a collection
in the metapath
property. The data itself is given in the data
property.
{
"name": "an_article",
"title": "An Article",
"metapath": "Corpus,collection_name,RawData",
"data": "This is the text of the article."
}
A variation of this points to an actual data file using the path
property:
{
"name": "an_article",
"title": "An Article",
"metapath": "Corpus,collection_name,RawData,txt",
"path": "Corpus,collection_name,RawData,txt,an_article.txt"
}
Data Nodes and Property Inheritance
The final node of the metapath
will either be a manifest containing inline data or a pointer to a data file. Parent nodes for this manifest will represent "branches" within the collection
. Some properties specified by parent nodes are inherited by their children. For instance, the OCR
property applied to the RawData
node of a collection
will apply to all the individual data files classified as RawData
. This inheritance can be overriden by including the OCR
property in individual data manifests. In the discussion below, inheritable properties are discussed in the context of manifest types where they are most likely to be used. This does not mean that they are unavailable for use in other types of data manifests.
REQUIRED properties
Data
manifests MUST have the following optional properties: name
, title
, namespace
, metapath
.
OPTIONAL properties
The following globally OPTIONAL properties MAY be included in a Data
manifest: id
, _id
, authors
, description
, version
, keyword
, image
, shortTitle
, label
, notes
, updated
.
A Data
manifest MAY contain any of the following additional properties:
path
A string
indicating the location of the data file. The path
value has the following additional constraints:
- It MUST either be a URL or a POSIX path
- URLs MUST be fully qualified. MUST be using either http or https scheme. (Absence of a scheme indicates MUST be a POSIX path)
- POSIX paths (unix-style with
/
as separator) are supported for referencing local files, with the security restraint that they MUST be relative siblings or children of the descriptor. Absolute paths (/
) and relative parent paths (../
) MUST NOT be used, and implementations SHOULD NOT support these path types. - The value of
path
MUST end in a file name.
The path
property differs from the metapath
in that the latter indicates the parent container for the data file and does not directly indicate the storage location of the file. The path
value provides the full path to the storage location.
format
A string
providing the standard file extension for the type of resource (e.g. "csv", "xls", "json", etc.).
mediatype
A string
providing the mediatype/mimetype of the resource, e.g. "text/csv", "application/vnd.ms-excel". A list of common media types can be found on Wikipedia.
encoding
A string
providing the specific character encoding of the resource's data file. The values should be one of the "Preferred MIME Names" for a character encoding registered with IANA). If no value for this key is specified then the default is UTF-8.
Collection
Manifests
WE1S data sets are assumed to be stored in the Corpus
database as collections
. A collection
is defined by a Collection
manifest containing metadata. Here is an example:
{
"name": "name_of_collection",
"title": "Collection Name",
"metapath": "Corpus",
}
Important
MongoDB also uses the term collection
to define a table-like database structure. This usage should be distinguished from the WE1S usage, where a collection refers to a nameable body of documents.
Collection
manifests can contain several branch nodes:
RawData
: Used for data in its source formProcessedData
: Used for data that has been transformed by one or more processesMetadata
: Used for metadata files which were collected along with the data.Outputs
: Used for data files produced from analytic processesRelated
: Used for files such as documentation associated with the data set
These subcategories are created by placing child nodes along the collection
metapath
: Corpus,collection_name,RawData
, Corpus,collection_name,ProcessedData
, etc.
It is possible to use the same technique to create sub-branches of data. For instance, Corpus,collection_name,ProcessedData,lower_case
, Corpus,collection_name,ProcessedData,stopwords_removed
.
In database storage, collection
manifests are simply floating manifests which hold metadata relevant to all manifests and data further along the metapath
. When data is exported from the Corpus
database, it is typically assembled into a Frictionless Data data package with a datapackage.json
manifest in the containing folder and subfolders corresponding to the branches along the metapath
. Typically the branch manifest will be placed alongside each subfolder to preserve the metadata relevant to the files in that folder.
The following sections detail the REQUIRED and OPTIONAL metadata for each type of Collection
manifest.
Collection
Nodes
A collection
is defined by a manifest that serves as the root node of the collection.
REQUIRED Properties
Manifests serving as root nodes of collections
MUST include the following properties: name
, title
, namespace
, and metapath
. In addition, Collection
manifests MUST include created
, sources
, and contributors
properties as detailed below.
created
An array
containing the date or dates on which the collection
was created. Dates should be given in ISO 8601 date (YYYY-MM-DD
) or datetime (e.g. 2017-09-16T12:49:05Z
) format, wherever sufficient information is known. For further information, see the section on Formatting Dates.
sources
An array
containing the published sources for this collection
. The value of sources
MUST be an array of objects
, typically from the Sources
database. Each source object MUST have a title
and path
properties and MAY have an email
property. Example:
"sources": [
{
"title": "World Bank and OECD",
"path": "http://data.worldbank.org/indicator/NY.GDP.MKTP.CD"
}
]
title
: title of the source (e.g. document or organization name)
path
: A url-or-path string, that is a fully qualified HTTP address, or a relative POSIX path.
email
: An email address
In most cases, the path
will point to a manifest in the Sources
database, but other possibilities are allowed for projects not using a database to store sources.
contributors
The people or organizations who contributed to the harvesting, downloading, collecting, or assembling the collection
. The value of the contributors
property MUST be an array
. Each entry is a Contributor and MUST be an object
. A Contributor MUST have a title
property and MAY contain path
, email
, role
, group
, and organization
properties. An example of the object
structure is as follows:
{
contributors: [
{
"title": "Joe Bloggs",
"email": "joe@bloggs.com",
"path": "http://www.bloggs.com",
"role": "author"
}
]
}
title
: Astring
containing the name/title of the contributor (name for person, name/title of organization).path
: Astring
containing a fully qualified http URL pointing to a relevant location online for the contributor.email
: Astring
containing an email address.role
: Astring
describing the role of the contributor. It MUST be one of:author
,publisher
,maintainer
,wrangler
, andcontributor
. Defaults tocontributor
.- Note on semantics: use of the
author
property does not imply that that person was the original creator of the data in thecollection
- merely that they created the data. group
: Astring
describing a smaller body of contributors within anorganization
.organization
: Astring
describing the organization to which this contributor is affiliated.
OPTIONAL Properties
The following globally OPTIONAL properties MAY be included in a collection
manifest: id
, _id
, description
, version
, keyword
, image
, shortTitle
, label
, notes
, updated
.
The following additional OPTIONAL properties MAY also be included in a collection
manifest.
workstation
A string
value providing information about the environment in which the data was collected (e.g. "Windows 8.1"). There is currently no controlled vocabulary for this property.
queryTerms
An array
providing keywords used to define the scope of the collection
(typically used in an API query to collect the data). The value can be used to query the Corpus
for data matching a particular description.
processes
An array
containing embedded processes or paths to separate process
manifests. Both types follow the same schema, described under Processes
.
RawData
Nodes
RawData
manifests serve as root nodes for all data in the collection
that has not been transformed by any processing steps after collection.
REQUIRED Properties
RawData
manifests MUST include the following properties: name
, title
, namespace
, and metapath
. The metapath
MUST be Corpus,collection_name,RawData
.
OPTIONAL Properties
The following globally OPTIONAL properties MAY be included in a RawData
manifest: id
, _id
, description
, version
, keyword
, image
, shortTitle
, label
, notes
, updated
.
The following additional OPTIONAL properties MAY also be included in a RawData
manifest.
documentType
A string
providing a description of the nature of the data accoring to some controlled vocabulary (e.g. "bag of words", "table data). WE1S does not currently have a standard controlled vocabulary for this property.
relationships
An array
of strings
or objects
. The schema below uses the relationships property to describe the data as being a part of another collection
("collection1") combined with material from a third collectio
n ("collection2"). Terms from Dublin Core are used in this example, but it is possible to use other terms from any controlled vocabulary.
"relationships": [
{"isPartOf": "Corpus,collection1,"},
{"hasPart": "Corpus,collection2"}
]
OCR
A Boolean
to indicate whether the data has been digitized using Optical Character Recognition. If omitted, the default value is false
.
licenses
The license(s) under which the package is provided.
Warning
The license
property is not legally binding and does not guarantee the package is licensed under the terms defined in this property.
licenses
MUST be an array. Each item in the array is a License. Each MUST be an object
. The object
MUST contain a name
property and/or a path
property. It MAY contain a title
property.
Here is an example:
"licenses": [
{
"name": "ODC-PDDL-1.0",
"path": "http://opendatacommons.org/licenses/pddl/",
"title": "Open Data Commons Public Domain Dedication and License v1.0"
}
]
name
: The name MUST be an Open Definition license IDpath
: A url-or-path string, that is a fully qualified HTTP address, or a relative POSIX path.title
: A human-readable title.
Omission of the licenses
property assumes a single value with "Free Culture" as the name
and an empty path
.
Further information about applying licenses can be found in the Frictionless Data documentation.
ProcessedData
Nodes
ProcessedData
manifests serve as root nodes for all data in the collection
that has been transformed or processed after it was collected.
REQUIRED Properties
ProcessedData
manifests MUST include the following properties: name
, title
, namespace
, and metapath
. The metapath
MUST be Corpus,collection_name,ProcessedData
. In addition, ProcessedData
manifests MUST include the processes
property.
processes
An array
containing processes used in the transformation of the RawData
source material. The array
MAY contain a list of paths to separate process manifests OR inline descriptions of the processes. In the latter case, the descriptions MUST be objects
conforming to the schema described for Processes
manifests detailed below.
OPTIONAL Properties
The following globally OPTIONAL properties MAY be included in a ProcessedData
manifest: id
, _id
, description
, version
, keyword
, image
, shortTitle
, label
, notes
, updated
.
The following additional properties MAY be included in a ProcessedData
manifest: format
, mediatype
, encoding
, and documentType
. These will be inherited by all data along the ProcessedData
metapath unless overridden in individual data manifests.
Metadata
Manifests
Metadata
manifests define metapath
routes for documents containing metadata that may have been acquired along with the raw data.
REQUIRED Properties
Metadata
manifests MUST include the following properties: name
, title
, namespace
, and metapath
. The metapath
MUST be Corpus,collection_name,Metadata
.
OPTIONAL Properties
The following globally OPTIONAL properties MAY be included in a Metadata
manifest: id
, _id
,description
, version
, keyword
, image
, shortTitle
, label
, notes
, updated
.
The following additional properties MAY be included in a Metadata
manifest: format
, mediatype
, encoding
, and documentType
. These will be inherited by all data along the Metadata
metapath unless overridden in individual data manifests.
Outputs
Manifests
Outputs
manifests define metapath
routes for data and metadata generated through WE1S analytic processes. It is important to note that storing materials along the Outputs
metapath makes them a permanent part of the collection
.
REQUIRED Properties
Outputs
manifests MUST include the following properties: name
, title
, namespace
, and metapath
. The metapath
MUST be Corpus,collection_name,Outputs
.
OPTIONAL Properties
The following globally OPTIONAL properties MAY be included in a Outputs
manifest: id
, _id
,description
, version
, keyword
, image
, shortTitle
, label
, notes
, updated
.
The following additional properties MAY be included in a Outputs
manifest: format
, mediatype
, encoding
, and documentType
. These will be inherited by all data along the Outputs
metapath unless overridden in individual data manifests.
Related
Manifests
Related
manifests define metapath
routes for documents (typically files) such as documentation which are archived for reference.
REQUIRED Properties
Related
manifests MUST include the following properties: name
, title
, namespace
, and metapath
. The metapath
MUST be Corpus,collection_name,Related
.
OPTIONAL Properties
The following globally OPTIONAL properties MAY be included in a Related
manifest: id
, _id
,description
, version
, keyword
, image
, shortTitle
, label
, notes
, updated
.
The following additional properties MAY be included in a Related
manifest: format
, mediatype
, encoding
, and documentType
. These will be inherited by all data along the Related
metapath unless overridden in individual data manifests.
Data Packages
Frictionless Data data packages are the default export format for data stored in the WE1S database. Every effort has been made to make the WE1S manifest schema compatible with the Frictionless Data specification. The major difference is that Frictionless Data requires that all resources be listed in the resources
array found in the datapackage.json
file. A shorthand is to include only paths to the higher node manifests (e.g. RawData
, ProcessedData
) stored in the same directory. An application can then reference data in subfolders using information contained therein. To do this, the application must be able to implement the WE1S schema. Generic data package tools may not be able to locate the data out of the box.
Processes
Manifests
A Processes
manifest documents the ways in which data is modified by analytic or other processes. It is primarily a method of recording the steps a user has taken in executing a workflow and is a means by which those steps can be duplicated. All Processes
manifests are stored in the WE1S Processes
database.
REQUIRED Properties
Processes
manifests MUST include the following properties: name
, title
, namespace
, and metapath
. The metapath
MUST begin with Processes
. In addition, Processes
manifests MUST include the following properties:
steps
An array
of JSON objects providing each step in the Processes
as described under step
Manifests. Alternatively, the value of steps
can be an array
of string
paths to other Processes
manifests or step
manifests.
date
An array
containing a date or dates indicating when the steps detailed in the manifest were implemented. The value should be a string
containing the date in date (YYYY-MM-DD
) or datetime (e.g. 2017-09-16T12:49:05Z
) format. See the Formatting Dates section for further details.
Important
The date
property is only REQUIRED for inline processes in Collection
manifests since it would not make sense for re-usable processes in the Processes
database. namespace
and metapath
are not required for inline processes
.
contributors
The people or organizations who contributed to the implementing the processes described in the manifest. The value of the contributors
property MUST be an array
. Each entry is a Contributor and MUST be an object
. A Contributor MUST have a title
property and MAY contain path
, email
, role
, group
, and organization
properties. For further discussion of conventions, see the description of the contributors
under Collection
manifests.
OPTIONAL Properties
The following globally OPTIONAL properties MAY be included in a Processes
manifest: id
, _id
, description
, version
, keyword
, image
, shortTitle
, label
, notes
, updated
.
The following additional properties MAY be included in a Processes
manifest.
created
A string
containing the date when the process was created. The value should be a string
containing the date in date (YYYY-MM-DD
) or datetime (e.g. 2017-09-16T12:49:05Z
) format. For further details, see the Formatting Dates section.
source
If the Processes
manifest is dependent on a particular collection
or other set of source data, the source
property should be used to indicate the location of the data. The value MUST be a string
containing a URL, local path, or metapath
to the data manifest or files. If it does not include the name of the data's collection
, this should be mentioned in the Processes
manifest's description
. The source
property is unnecessary for inline processes
embedded in a collection
since the collection
is assumed to be the data source.
Step
Manifests
A Step
manifest describes the workflow parameters of a single step in a process
. It MUST be an object
. Step
manifests MAY be embedded within or referenced from a Processes
manifest.
REQUIRED Properties
Step
manifests MUST include the following properties: name
, title
, namespace
, and metapath
. The metapath
MUST begin with Processes,process_name,Steps
. In addition, Step
manifests MUST include the following properties:
description
A string
describing the processing step. Whereas description
is OPTIONAL for many manifests, it is required for a Step
manifest.
implementation
A string
describing the means by which the processing step was implemented. There is currently no controlled vocabulary for this property. Possible values are "script", "tool", or "API".
OPTIONAL Properties
The following globally OPTIONAL properties MAY be included in a Step
manifest: id
, _id
, version
, keyword
, image
, shortTitle
, label
, notes
, updated
.
The following additional properties MAY be included in a Step
manifest.
path
A string
providing a reference to the location of the tool or script. If it is a tool, this can be a URL to the tool's website. If the step involves a script, the reference should be a URL to the script's repository or a metapath
to the script's manifest.
options
An array
containing information about the configuration of the tool or script used in the step. Each option
MUST be an object
. The option
name should be given as the "argument" and the "value" should be the option setting. This structure is ideally suited for command-line tools, but the JSON object can contain fields like the following to record a sample configuration file.
{
"settings.cfg": "Sample config file:..."
}
Likewise, you might have
{
"api": "http://api.nytimes.com/svc/search/v2/articlesearch"
}
for an API query with further arguments for the query terms.
outputs
An array
of paths to the root node where all the step
's outputs are stored.
instructions
A string
containing instructions for implementing the step
. Although instructions MAY also be put in the notes
and description
fields, an explicit instructions
field may be helpful in some instances.
Scripts
Manifests
Scripts
manifests include information about external software and tools, as well as scripts authored by WE1S staff. Scripts
manifests make extensive use of the metapath
property to create branching structures for standard types of scripted procedures:
- collecting
- preprocessing
- analysis
- visualization
Whilst these branches are expected in WE1S project, others may be used as necessary.
Each branch MAY have child branches for different tools. However, WE1S expects sub-branches dividing scripts by language. Hence a possible metapath
would be Scripts,preprocessing,python,strip_tags
. If the manifest is for a tool or external program, the last item in the branch will be the manifest containing metadata about the tool or program. If it is a WE1S script, the manifest MAY additionally contain the code of the script itself inline.
REQUIRED Properties
Script
manifests MUST include the following properties: name
, title
, namespace
, and metapath
. The metapath
MUST begin with Scripts
. In addition, Script
manifests MUST include the following properties:
contributors
The people or organizations who contributed to the creation of the tool or script. The value of the contributors
property MUST be an array
. Each entry is a Contributor and MUST be an object
. A Contributor MUST have a title
property and MAY contain path
, email
, role
and organization
properties. An example of the object
structure is as follows:
{
contributors: [
{
"title": "Joe Bloggs",
"email": "joe@bloggs.com",
"path": "http://www.bloggs.com",
"role": "author"
}
]
}
title
: Astring
containing the name/title of the contributor (name for person, name/title of organization).path
: Astring
containing a fully qualified http URL pointing to a relevant location online for the contributoremail
: Astring
containing an email address.role
: Astring
describing the role of the contributor. It MUST be one of:author
,publisher
,maintainer
,wrangler
, andcontributor
. Defaults tocontributor
.- Note on semantics: use of the
author
property does not imply that that person was the original creator of the data in the data package — merely that they created the data. organization
: Astring
describing the organization this to which this contributor is affiliated.
OPTIONAL Properties
The following globally OPTIONAL properties MAY be included in a Scripts
manifest: id
, _id
, description
, version
, keyword
, image
, shortTitle
, label
, notes
, updated
.
The following additional properties MAY be included in a Scripts
manifest.
created
The date when the script was authored or last updated. The value should be a string
containing the date in date (YYYY-MM-DD
) or datetime (e.g. 2017-09-16T12:49:05Z
) format. See Formatting Dates for further information on how to represent the value of the created
property.
path
For external scripts, the URL of the location where the script was accessed for the creation of the manifest.
accessed
For external scripts, such as those stored on GitHub, the accessed
property indicates when the script location was accessed for the creation of the manifest. The value should be a string
containing the date in date (YYYY-MM-DD
) or datetime (e.g. 2017-09-16T12:49:05Z
) format. See Formatting Dates for further information on how to represent the value of the accessed
property.
script
A string
copy of the script code. Line breaks in the encode should be given as \n
, and all quotation marks should be properly escaped.
Conventions
Formatting Dates
The date
, created
, and accessed
properties all contain dates which should be given according to the following conventions:
- Individual dates MUST be
strings
in date (YYYY-MM-DD
) or datetime (e.g.2017-09-16T12:49:05Z
) format.
{
"date": "2017-09-16"
}
- Multiple dates may be given in an
array
.
{
"date": [
"2017-09-16",
"2017-09-16T12:49:05Z"
]
}
- If it is necessary to specify the format, the date may be given as an
object
containingtext
andformat
properties:
{
"date": [
{
"text": "2017-09-16",
"format": "date"
},
{
"text": "2017-09-16T12:49:05Z",
"format": "datetime"
}
]
}
- Date ranges can be specified with an
object
with the keywordrange
. Theobject
MUST contain astart
property and MAY contain anend
property. Both MUST havestring
values as in the example below:
{
"date": {
"range": {
"start": "2017-09-16",
"end": "2018-09-16"
}
}
}
start
andend
values may also be expressed asobjects
containingtext
andformat
properties.
WE1S Projects
A "project" is a containerised set of manifests and data that can be stored and manipulated outside the database. They may in turn be stored in a separate Projects
database for future reference.
In form, a project is a Frictionless Data data package built from the metapath
contents of its resources. It consists of the following:
- A folder containing a
datapackage.json
file with the following limitation: theresources
property must contain the pathsSources
,Corpus
,Processes
, andScripts
AND only these paths. The folder should also contain subfolders with these same names. - Each subfolder SHOULD contain at least one manifest file. For instance, the
Corpus
folder might contain aCollection
manifest callednew_york_times.json
. This SHOULD have a corresponding folder called with the same name minus the file extension (e.g.new_york_times
). - Additional subfolders and manifests should be created at the next level of the file hierarchy for each resource added to the data package. For instance, if there is a
Corpus,new_york_times,RawData
manifest added, the data package folder should contain aRawData.json
and aCorpus/new_york_times/RawData
folder inside theCorpus/new_york_times
folder.
Note that this does not exactly follow the Frictionless Data specificiation because not all resources are listed in the datapackage.json
file. However, it is possible to create a complete list of resources programmatically by recursively listing the contents of the folders or querying properties in the manifests contained therein.
Project Manifests
The following REQUIRED properties MUST be included in every Projects
manifest: name
, title
, namespace
, content
, contributors
, created
.
The created
and contributors
properties are the same as found in collection
manifests. The content
property MUST contain a BSON-formatted .zip
archive, the name of which must the same value as the name
property. The other properties are globally REQUIRED properties.
In addition, Projects
manifests MAY contain the following OPTIONAL properties: id
, _id
, description
, version
, keyword
, image
, shortTitle
, label
, notes
, updated
, webpage
, contentType
, citation
. The last three are the same as found in Sources
manifests.
In general, project resources can be reconstructed by iterating through the project's folders and subfolders. However, for some applications this can be an inconvenience, especially if the project is archived as a zip file. In these cases, the Projects
manifest MAY contain a resources
property. This MUST be an array of strings or objects. By default, the array will contain strings corresponding to the paths to the individual resources, whether within the project's file structure or on the internet:
"resources": [
"Corpus/collection_name/RawData",
"http://example.com/resource-path.csv"
]
Paths ending in folders are assumed to be parents of all files and subfolders contained therein. However, it is up to the individual application to parse them recursively or filter the data as necessary.
The example above may alternatively be represented as an array of objects with the path
property:
"resources": [
{
"path": "Corpus/collection_name/RawData"
},
{
"path": "http://example.com/resource-path.csv"
}
]
Object representation of resources is most useful when the resources contain other methods of accessing content such as database queries. In this case, the db_query
property may be used with a platform
property.
"resources": [
{
"db_query": "Corpus/collection_name",
"platform": "MongoDB"
}
]