Metadata schema for describing a distribution of a dataset
This schema is centered on the concept of a Distribution: a specific representation of a dataset, in the form of a (collection of) file(s), in a specific format.
Rather than modeling a Distribution as a property of a Dataset (more abstract, format-flexible), this schema is focused on the inverse, is_distribution_of, to link a concrete distribution to its dataset concept.
This choice of directionality makes this schema particularly suitable for systems with metadata capabilities that are limited to or focused on annotation of concrete files in a storage system.
"Open world" attitude
Relationships
The schema is built on three foundational classes (taken from PROV-O):
Agent
Activity
Entity
Linking concrete instances of these classes via qualified relationships is the key pattern promoted by this schema.
For example, describing the relationship of a Distribution (entity) to the Resource (entity) it is a distribution of is done by defining the Resource instance within the relation property declaration of the Distribution.
The relationship is then qualified by referencing the Resource instance (by its identifier) via the Distribution's is_distribution_of property.
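The pattern described above can be sketched in Python with a record mirrored as plain dictionaries. This is an illustration under assumptions, not the schema's definitive serialization: the keys id and name, the list layout of relation, and the exthisds: identifiers are hypothetical; only relation and is_distribution_of are names taken from the description above.

```python
# Hypothetical record layout: "id", "name", and the list form of "relation"
# are assumptions for illustration, not confirmed by the schema.
distribution = {
    "id": "exthisds:distribution-1",
    # The related Resource is defined inline, inside the Distribution record.
    "relation": [
        {"id": "exthisds:dataset", "name": "Example dataset"},
    ],
    # The relationship is qualified by referencing the Resource's identifier.
    "is_distribution_of": "exthisds:dataset",
}


def resolve_reference(record, prop):
    """Return the inline-defined object that record[prop] points to, or None."""
    target_id = record.get(prop)
    if target_id is None:
        return None
    for related in record.get("relation", []):
        if related.get("id") == target_id:
            return related
    return None


resource = resolve_reference(distribution, "is_distribution_of")
```

Defining the object once and referencing it by identifier keeps the definition and the qualification of the relationship separate, so several properties could point at the same inline definition.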
When there is no suitable, dedicated property available to characterize a relationship, the qualified_relation property can be used.
Here, the object (entity) is related to the subject (entity) via a declaration of the kind of influence the object had on the subject by means of particular roles.
Roles can be any kind of (external) identifier, thereby enabling arbitrary precision and fit to specialized use cases, without a need to inflate the number of properties in the schema.
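A role-based qualified relation might look like the following Python sketch. The keys entity and had_role are assumptions modeled on PROV-O naming, and the role IRI is a placeholder; the schema's actual slot names may differ.

```python
# Hypothetical layout: "entity" and "had_role" are assumed key names,
# and the role IRI is a placeholder under example.org.
publication = {
    "id": "exthisns:paper",
    "qualified_relation": [
        {
            # The object of the relationship, referenced by identifier.
            "entity": "exthisds:dataset",
            # Roles are plain identifiers, so no new schema property is needed.
            "had_role": ["https://example.org/roles/describes"],
        },
    ],
}


def roles_for(record, object_id):
    """Collect all roles that object_id plays in relation to record."""
    roles = []
    for relation in record.get("qualified_relation", []):
        if relation.get("entity") == object_id:
            roles.extend(relation.get("had_role", []))
    return roles
```

Because a role is only an identifier, a more specific relationship can be expressed by swapping in a different role IRI, without any change to the record structure.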
Relationships between other combinations of the three foundational classes can also be specified.
For example, influences of agents on an entity can be declared via was_attributed_to and qualified_attribution, using the same pattern.
TODO: how to declare relationships when no dedicated support for a particular type combination exists.
Types
Properties that are used as containers to define related objects support the declaration of specific subtypes of the respective range-defining class.
For example, was_attributed_to accepts any Agent, but specialized classes may be more suitable in particular contexts.
Such a derived class can be indicated via the meta_type property.
If declared, it is then used for data structure definition and validation for this particular record.
Independent of this structure-focused type declaration, the type property can be used to detail the semantics of an object.
For example, a scientific journal can be sufficiently described using the basic Entity schema class, but it is still useful to declare its type to be, for example, obo:NCIT_C93226 (peer-reviewed scientific journal).
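The difference between the two declarations can be sketched as follows. The class registry, the Person class, and the required fields are hypothetical and exist only for this illustration; the point is that meta_type selects the class used for structural validation, while type carries semantics and leaves validation untouched.

```python
# Hypothetical registry: class names and required fields are illustrative
# assumptions, not the schema's actual definitions.
REQUIRED_FIELDS = {
    "Entity": {"id"},
    "Agent": {"id"},
    "Person": {"id", "name"},  # assumed specialization of Agent
}


def structural_class(record, default):
    """meta_type, when declared, overrides the range-defining default class."""
    return record.get("meta_type", default)


def missing_fields(record, default):
    """Return the required fields that the record lacks, sorted by name."""
    required = REQUIRED_FIELDS[structural_class(record, default)]
    return sorted(required - set(record))


journal = {
    "id": "exthisns:journal",
    # "type" states semantics only; it does not affect validation here.
    "type": "obo:NCIT_C93226",
}
author = {"id": "exthisns:jane", "meta_type": "Person"}
```

In this sketch, the journal validates as a plain Entity despite its semantic type, whereas the author record is checked against the stricter, assumed Person class.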
Custom properties
The schema provides a limited set of classes and properties that aim to capture a wide range of use cases in a generic fashion that balances schema complexity and applicability to particular scenarios.
Whenever more specialized properties are required and desired for detailing an Agent, Activity, or Entity, the has_property property and the associated Property and QuantitativeProperty classes can be used.
For example, the Publication schema class does not offer detailed bibliographic properties focused on scientific journal publications, because it aims to capture any kind of publication equally well.
Arbitrary custom properties can be defined by declaring the property type, the associated value, and the range (type of the value).
Here is an example that declares the number of pages (of a journal article):
property:
  - type: bibo:numPages
    name: Number of pages
    value: "17"
    range: xsd:nonNegativeInteger
The associated Property class is permissive.
Properties can be declared without any type definitions (and just a name or description instead).
This approach works best for property values with simple data types.
However, value can also be xsd:anyURI to reference arbitrary (externally defined) concepts.
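How a consumer might interpret such a property can be sketched in Python. The range key follows the prose above; the conversion table is an assumption that covers only two XSD datatypes, and properties without a declared range are returned as plain strings.

```python
# Minimal sketch: only two XSD datatypes are handled. The conversion table
# is an assumption for illustration, not part of the schema.
def parse_non_negative_integer(text):
    """Parse text as an integer and reject negative values."""
    number = int(text)
    if number < 0:
        raise ValueError(f"expected a non-negative integer, got {text!r}")
    return number


CONVERTERS = {
    "xsd:nonNegativeInteger": parse_non_negative_integer,
    "xsd:anyURI": str,  # kept as-is; no dereferencing is attempted
}


def typed_value(prop):
    """Convert prop["value"] according to prop["range"], if one is declared."""
    converter = CONVERTERS.get(prop.get("range"), str)
    return converter(prop["value"])


num_pages = {
    "type": "bibo:numPages",
    "name": "Number of pages",
    "value": "17",
    "range": "xsd:nonNegativeInteger",
}
untyped = {"name": "Note", "value": "printed on recycled paper"}
```

The fallback to str mirrors the permissive nature of the Property class: a property with just a name and a value remains usable, only without type-aware handling.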
Identifiers
Identifiers play a key role in this schema. Any Agent, Activity, or Entity has to have an IRI.
This IRI makes it possible to refer to definitions across potentially detached metadata documents (for example attached to individual files in a storage system that has no metadata capabilities beyond annotation of individual files).
The schema makes no assumption about the nature of these identifiers, beyond them being IRIs. For many use cases, suitable identifiers and registries readily exist: DOI for publications, ROR for research organizations, and ORCID for researchers, to name a few. These can and should be used whenever possible. However, sometimes no identifiers are available, and there are no resources for properly establishing a persistent identification scheme. For such cases, the schema provides three built-in prefixes that map to exemplary IRI prefixes. These prefixes can be used to establish an implicit identification scheme that is local to a particular scope.
exthisns: A custom umbrella namespace relevant in the context of a dataset
exthisds: A custom namespace that is unique to the particular dataset that is being described
exthisdsver: A custom namespace that is unique to the particular version of the dataset that is being described
The exthisns namespace is the most important one.
It can be used to declare and refer to definitions of an Agent, Activity, or Entity using a "localized" identification concept for which no setup for dereferenceable IRIs exists (yet).
One example is the identification of people in a consortium that spans multiple organizations, where a global identifier like ORCID cannot be required.
Likewise, exthisds and exthisdsver can be used as abstract (yet undetermined) namespace references in stored metadata records.
This can be useful when a suitable scheme for persistent dataset and/or dataset version identifiers does not yet exist at the time such records are created.
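The role of these prefixes as placeholders can be illustrated with a small Python sketch of prefix expansion. The IRI prefixes used here are hypothetical example.org values; the mappings actually shipped with the schema are not reproduced.

```python
# Placeholder mappings: the IRIs below are assumptions under example.org,
# not the prefixes defined by the schema.
PREFIXES = {
    "exthisns": "https://example.org/ns/",
    "exthisds": "https://example.org/this-dataset/",
    "exthisdsver": "https://example.org/this-dataset-version/",
}


def expand(identifier, prefixes=PREFIXES):
    """Expand prefix:local into a full IRI; leave unknown prefixes unchanged."""
    prefix, separator, local = identifier.partition(":")
    if separator and prefix in prefixes:
        return prefixes[prefix] + local
    return identifier


# Once a persistent namespace exists, only the mapping has to change;
# the stored records keep their prefixed identifiers.
later = dict(PREFIXES, exthisds="https://example.org/datasets/1234/")
```

Here, expand("exthisds:dataset", later) resolves against the new namespace, while records written earlier remain untouched.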
Identifiers for data entities and contextual entities
This schema does not require, but enables, a formal/visible distinction of identifiers for data entities (e.g., files) and contextual entities (e.g., people, licenses, grants), as done, for example, in the RO-Crate specification.
For example, in order to identify a file in a distribution of a particular version of a dataset, the identifier exthisdsver:./some/file.txt can be used.
The ./ would indicate a data entity.
The exthisdsver declares the scope of this localized identifier to be that of this particular version of the dataset.
A custom license can be assigned an identifier exthisdsver:#customlicense.
The # would indicate a contextual entity.
Again, the exthisdsver declares the scope of this localized identifier to be that of this particular version of the dataset.
This makes it possible to declare custom license terms for a particular data distribution at hand, without having to be concerned with the analysis of term changes across versions that would require a new identifier.
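The convention described in this section can be checked mechanically. The following Python sketch classifies an identifier by the start of its local part; the function name and the returned labels are hypothetical, and identifiers that follow neither convention are reported as unspecified, since the schema does not require the distinction.

```python
def entity_kind(identifier):
    """Classify a (possibly prefixed) identifier by its local part.

    "./" marks a data entity and "#" a contextual entity; anything else
    is reported as "unspecified".
    """
    _prefix, separator, local = identifier.partition(":")
    if not separator:
        local = identifier  # no prefix present, inspect the whole string
    if local.startswith("./"):
        return "data"
    if local.startswith("#"):
        return "contextual"
    return "unspecified"
```

For the two identifiers used above, entity_kind would report exthisdsver:./some/file.txt as a data entity and exthisdsver:#customlicense as a contextual entity.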