Dump-Things
version: unreleased
This is a knowledge base/graph dump specification, and a companion of the Things schema and its derivatives and extensions.
It defines a data structure for dumping arbitrarily complex information, expressed in these data models, in a version-controllable fashion directly on a filesystem.
The data structure also allows for storing related information, such as the schema specifications themselves.
Example
/metadata                    | arbitrarily named root directory
  /.dumpthings.yaml          | global configuration
  /myschema-v3-fmta          | data record collection, compliant with one particular schema
    /.dumpthings.yaml        | configuration for a data record collection
    /Person                  | data model record collection (for one particular class)
      /<id>.<format>         | an individual data record
Directory structure
Three types of directories are distinguished:
- root directory
- data record collection compliant with one particular schema
- data record collection for one particular data model
Root directory
All files and directories described in this specification are contained in an arbitrarily named root directory.
The root directory must contain a .dumpthings.yaml configuration file.
Data record collection (for one schema)
Data record collection directories are arbitrarily named.
Any number of record collection directories can coexist in a single dump.
The corresponding directories must be direct subdirectories of the dump's root directory.
Each collection directory must contain a .dumpthings.yaml configuration file, specifying (among other things) the schema this particular collection is compliant with.
Data model record collection (for one data model/class)
Data model record collection directories must be named after the class that defines the respective model in the given schema. Any number of data model directories can coexist in a record collection. The corresponding directories must be direct subdirectories of the record collection's root directory. Depending on the choice of record file identifier type, this directory may contain an arbitrary number of additional subdirectory levels.
To avoid issues with file system compatibility, it is recommended to use simple "PascalCase" class names in a schema. A future version of this specification may support a mapping from class names to directory names. Even so, for compatibility with code generators, it is recommended to keep class names aligned with established conventions across programming languages.
Configuration file
All configuration files must be named .dumpthings.yaml.
The configuration file format is YAML.
However, the content of a configuration file is a simple key/value mapping, keeping it feasible to read configuration files without dedicated YAML IO libraries.
Two different types of configuration files are distinguished, depending on their location in the directory structure:
- dump-global configuration
- record collection configuration
Dump-global (top-level) configuration file
This file must declare the configuration type to be collections.
It must also declare compliance with a major version of this specification.
Here is an example of a complete and valid configuration:
type: collections
version: 1
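Because the configuration is a flat key/value mapping, a file like the one above could be read without a YAML library. The following Python sketch illustrates that idea; the function name read_flat_config is hypothetical, all values are kept as strings, and inline comments or nested YAML structures are deliberately not handled.

```python
def read_flat_config(text):
    """Parse a flat 'key: value' mapping. This is not a general YAML parser."""
    config = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        # Skip blank lines and full-line comments.
        if not line or line.startswith("#"):
            continue
        # Split on the first colon only, so URL values keep their own colons.
        key, sep, value = line.partition(":")
        if not sep:
            raise ValueError(f"not a key/value line: {raw_line!r}")
        value = value.strip()
        # Remove one pair of matching surrounding quotes, if present.
        if len(value) >= 2 and value[0] == value[-1] and value[0] in "'\"":
            value = value[1:-1]
        config[key.strip()] = value
    return config


config = read_flat_config("type: collections\nversion: 1\n")
print(config)
```

Note that the version value is returned as the string "1" here, whereas a full YAML parser would return an integer.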
Record collection configuration file
This file must declare the configuration type to be records.
It must also declare compliance with a major version of this specification.
The configuration must declare the schema all data records are compliant with.
A schema must be identified either by a URL or by a local relative path.
A relative path must be given in POSIX conventions.
A relative path must not point outside the collection directory (i.e., must not contain .. path elements).
The configuration must declare the file format of all record files.
The format label can be arbitrary (i.e., there is no official list), but it must match the file name extension of all record files.
The configuration must declare a mapping function that is used to compute file path and file name from a record's identifier (id).
The mapping function label must match one of the methods listed under ID mapping methods.
Here is an example of a complete and valid configuration:
type: records
version: 1
schema: .schema.yaml
format: yaml
idfx: digest-md5
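The requirements above could be checked roughly as follows. This is a minimal Python sketch, not a reference validator: the function name check_record_config is hypothetical, the key names are taken from the example above, treating any value with a URL scheme as a URL is an assumption, and rejecting absolute paths is an interpretation of "relative path".

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse

# Labels listed under "Record ID mapping methods" in this specification.
KNOWN_ID_METHODS = {
    "digest-md5", "digest-md5-p3",
    "digest-sha1", "digest-sha1-p3",
    "after-last-colon",
}


def check_record_config(config):
    """Return a list of problems found in a record collection configuration."""
    problems = []
    if config.get("type") != "records":
        problems.append("type must be 'records'")
    if "version" not in config:
        problems.append("version is missing")
    schema = config.get("schema")
    if not schema:
        problems.append("schema is missing")
    elif not urlparse(schema).scheme:
        # Assumption: without a URL scheme, the value is a relative POSIX path.
        path = PurePosixPath(schema)
        if path.is_absolute() or ".." in path.parts:
            problems.append("schema path must stay inside the collection directory")
    if not config.get("format"):
        problems.append("format is missing")
    if config.get("idfx") not in KNOWN_ID_METHODS:
        problems.append("idfx must name a known ID mapping method")
    return problems


example = {
    "type": "records",
    "version": 1,
    "schema": ".schema.yaml",
    "format": "yaml",
    "idfx": "digest-md5",
}
print(check_record_config(example))
```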
Record ID mapping methods
A record ID mapping method is an algorithm that takes the identifier (id) of a data record on a Thing, and produces a file name (with a potential path-prefix) for the record to be stored at (or where it can be read from).
Record IDs are processed in their literal form, with no implied preprocessing or resolution before they are passed to a transformation method.
The full record file name is created by appending .<format> to the returned file name, where <format> is the format value declared in the collection configuration file.
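As an illustration of how the pieces fit together, the following Python sketch composes the location of a record file relative to the dump's root directory. The function name record_file_path and its parameters are hypothetical; the directory names are taken from the example at the top of this document, and the mapped name is assumed to have been produced by one of the methods listed below.

```python
from pathlib import PurePosixPath


def record_file_path(collection_dir, class_name, mapped_name, fmt):
    """Compose a record file path relative to the dump's root directory."""
    # mapped_name may already contain a "/" (a path-prefix), which
    # PurePosixPath treats as an additional directory level.
    return PurePosixPath(collection_dir) / class_name / f"{mapped_name}.{fmt}"


path = record_file_path(
    "myschema-v3-fmta", "Person", "900/150983cd24fb0d6963f7d28e17f72", "yaml"
)
print(path)  # myschema-v3-fmta/Person/900/150983cd24fb0d6963f7d28e17f72.yaml
```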
The following methods are recognized:
digest-md5
Returns the hex digest of the MD5 hash of the record identifier.
digest-md5-p3
Like digest-md5, but after the first three characters of the hex digest a POSIX directory separator (/) is inserted.
This method can help to limit the number of records per directory.
digest-sha1
Like digest-md5, but using a SHA1 hash.
digest-sha1-p3
Like digest-md5-p3, but using a SHA1 hash.
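The four digest-based methods above can be sketched in Python with the standard hashlib module. The function names are hypothetical, and encoding the identifier as UTF-8 before hashing is an assumption, since this specification does not name an encoding.

```python
import hashlib


def digest_md5(record_id):
    # Assumption: the identifier is encoded as UTF-8 before hashing.
    return hashlib.md5(record_id.encode("utf-8")).hexdigest()


def digest_sha1(record_id):
    return hashlib.sha1(record_id.encode("utf-8")).hexdigest()


def split_after_three(hex_digest):
    # Insert a POSIX directory separator after the first three characters.
    return f"{hex_digest[:3]}/{hex_digest[3:]}"


def digest_md5_p3(record_id):
    return split_after_three(digest_md5(record_id))


def digest_sha1_p3(record_id):
    return split_after_three(digest_sha1(record_id))


print(digest_md5("abc"))     # 900150983cd24fb0d6963f7d28e17f72
print(digest_md5_p3("abc"))  # 900/150983cd24fb0d6963f7d28e17f72
```

With three hexadecimal characters as the directory name, the "-p3" variants distribute records over at most 4096 subdirectories.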
after-last-colon
All characters prior to, and including, the last colon (:) are stripped.
The remainder of the identifier is returned as the file name.
This method can be useful for generating human-readable, recognizable file names.
When, for example, identifiers take the form of orcid:... or doi:..., the resulting file names represent ORCIDs or DOIs.
This method does not safeguard against non-portable filenames, nor does it guarantee uniqueness of filenames (e.g. in case of identical remainders, but different, stripped prefixes).
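A minimal Python sketch of after-last-colon follows, together with a demonstration of the two caveats just mentioned. The function name and the example identifiers are illustrative only; returning the identifier unchanged when it contains no colon is an assumption, as this specification does not cover that case.

```python
def after_last_colon(record_id):
    # rpartition splits at the last colon; when there is no colon, the
    # whole identifier ends up in the last element (assumed behavior).
    return record_id.rpartition(":")[2]


# Different prefixes with identical remainders map to the same file name.
first = after_last_colon("orcid:0000-0000-0000-0000")
second = after_last_colon("other:0000-0000-0000-0000")
print(first == second)  # True

# The remainder is not sanitized: a "/" in it acts as a path separator.
print(after_last_colon("doi:10.1000/182"))  # 10.1000/182
```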
Notes on IO implementations
The things schema supports inlining other Thing records in a Thing's relations slot.
Any "write" IO implementation for this dump format should extract such inlined records, and store them into dedicated files -- one for each individual record.
Previously inlined records would then be referenced by their id only.
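The extraction step described above might look like the following Python sketch. It rests on assumptions that should be checked against the actual schema: records are modeled as plain dicts, and the relations slot is modeled as a list whose entries are either an id string (a reference) or a dict with an id key (an inlined record). The function name extract_inlined is hypothetical.

```python
def extract_inlined(record):
    """Return a flat list of records, with inlined relations replaced by ids."""
    flat = []

    def visit(rec):
        refs = []
        for rel in rec.get("relations", []):
            if isinstance(rel, dict):
                # Inlined record: extract it (recursively), keep only its id.
                visit(rel)
                refs.append(rel["id"])
            else:
                # Already a plain id reference.
                refs.append(rel)
        # Build a copy so the input record is not modified.
        stripped = {k: v for k, v in rec.items() if k != "relations"}
        if refs:
            stripped["relations"] = refs
        flat.append(stripped)

    visit(record)
    return flat


nested = {
    "id": "ex:alice",
    "relations": [{"id": "ex:bob", "name": "Bob"}, "ex:carol"],
}
for rec in extract_inlined(nested):
    print(rec)
```

Each dict in the returned list would then be written to its own file, at the location computed from its id by the collection's ID mapping method.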