The DDI-RDF Discovery Vocabulary is a draft specification of the DDI Alliance.
This specification is produced by the subgroup on Disco (chair Joachim Wackerow) of the RDF Vocabularies Working Group, a working group at the DDI Alliance.
Resources:
The namespace for all terms in this ontology is: http://rdf-vocabulary.ddialliance.org/discovery#
Normative formats of the DDI-RDF Discovery Vocabulary specification are
There is also a non-canonical RDF/XML version of the Turtle file.
Open issues are discussed on the issue tracker.
A detailed overview of the Disco vocabulary is available as a LODE view or as a web view using the web application Web-based Visualization of Ontologies.
For a detailed explanation of DDI terms please refer to section 2.
This specification is designed to support the discovery of microdata sets and related metadata using RDF technologies in the Web of Linked Data. Many archives and other organizations have large amounts of data, sometimes publicly available, but often confidential in nature, requiring applications for access. Many such organizations use the Data Documentation Initiative standard, which is a proven and highly detailed XML metadata format for describing rectangular data sets of this type. This vocabulary makes use of the DDI specification to create a simplified version of this model for the discovery of data files.
The data holdings of data archives are often collected by researchers, and only afterwards disseminated by archives. Other data-producing organizations such as research centers and statistical agencies are also increasingly interested in the DDI standards for documenting their own microdata. In general terms, most DDI metadata describes data sets for the social, behavioural, and economic sciences. This data is fairly consistent in format, consisting of rectangular data files with columns containing variables for a set of cases, contained in the rows. It is often collected by survey, although in some cases may come from administrative sources, sensors, or registers.
This vocabulary is intended not only for use by the research data community, but also by any others needing an RDF vocabulary for describing this type of rectangular data. This vocabulary will provide a useful model for describing some of the data sets now being published by open government initiatives, by providing a rich metadata structure for them. While the data sets may be available (typically as CSV files) the metadata which accompanies them is not necessarily coherent, making the discovery of these data sets difficult. This vocabulary would help to overcome this difficulty by allowing for the creation of standard queries to programmatically identify data sets, whether made available by government or held within a data archive.
Disco could be used to discover datasets by searching for specific questions, topics, and geographical coverage. Depending on the complexity of the search or of the data portal, one could use parts of Disco, the complete vocabulary, or Disco together with related vocabularies. The document [Scenarios] by Vompras, Gregory, Bosch, Capadisli, and Wackerow describes typical use cases for the DDI-RDF Discovery Vocabulary. The section Use Cases and Example Queries of the Appendix illustrates additional discovery use cases with several SPARQL queries.
Statistical domain experts (core members of the DDI Alliance Technical Implementation Committee, representatives of national statistical institutes, national data archives) and Linked Open Data community members have selected the DDI elements which are seen as most important to solve problems associated with use cases in the area of data discovery. Section 2 gives an overview of the conceptual model. More detailed descriptions of all the properties are given in the specification and two conference papers [Linked-Statistical-Data] [DDI-RDF-Discovery-Vocabulary]. Disco is intended to provide a means of describing microdata with the metadata essential for discovery. Existing DDI-XML instances can be transformed into this RDF format and thereby exposed as Linked Data. The reverse transformation is not intended, because Disco defines components, and reuses components of other RDF vocabularies, that only make sense in the Linked Data field.
The Data Documentation Initiative standards are produced and maintained by a member-based consortium of global scope, the DDI Alliance. The Alliance is currently housed at the Interuniversity Consortium for Political and Social Research (ICPSR) at the University of Michigan and has more than 30 member institutions. The standards have been under development for more than ten years, and are in widespread use among data archives and libraries, producers of research data, secure data centers, and statistical agencies.
There are two major versions of DDI (both serialized in XML format): the “Codebook” version, which allows for holding general information about a study, along with its data dictionary; and the “Lifecycle” version of DDI, which allows for the description of more complex multi-wave studies, throughout the data lifecycle, from study conception through data collection and processing.
This vocabulary contains a selection of the major types of metadata defined by these two versions in a highly simplified form, for the purposes of discovery. The XML Codebook and Lifecycle versions of DDI are very broad: these standards contain hundreds of metadata elements, providing enough information to programmatically work with the data files for such functions as the automatic creation of databases, and transformations between statistical packages. DDI in both versions is generally used to describe data found in ASCII files, whether positional files with fixed-width fields or files using a delimited format such as CSV.
It is difficult to claim that there is a single agreed conceptual model for describing research data in the social, behavioural, and economic sciences—there is a wide range of models and terms. However, the issues faced in this area have been the subject of discussion within the DDI community for many years, and the DDI model represents the best consensus which exists today. As such, it gives us a good basis for creating a vocabulary which will be recognizable to researchers familiar with this type of data.
The Discovery Vocabulary (Disco) is aligned to several other metadata vocabularies used in the RDF community. Disco is designed to be used in conjunction with other vocabularies.
The Data Catalog Vocabulary (DCAT) is a W3C standard for describing catalogs of datasets, and we map to it in two places: our LogicalDataSet is a subclass of DCAT’s Dataset, and our DataFile is a subclass of DCAT’s Distribution. DCAT makes few assumptions about the kind of datasets being described, and focuses on general metadata about the datasets (mostly using Dublin Core), and on different ways of distributing and accessing the dataset, including availability of the dataset in multiple formats. Combining terms from both DCAT and the Discovery Vocabulary can be useful for a number of reasons:
DCAT is richer for the description of collections and catalogs. Disco supports richer descriptions of groups of datasets or individual datasets.
In this spec, some of our examples are partially based on DCAT (and we will indicate when this is the case).
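As a minimal sketch of how the two vocabularies might be combined, the following Turtle describes a catalog with DCAT and the data set and its file with Disco. The `ex:` namespace, all resource names, the titles, and the download URL are hypothetical and chosen only for illustration; the DCAT types on the data set and the file are already implied by the subclass relationships and are repeated only for consumers that do not apply inference.

```turtle
@prefix dcat:    <http://www.w3.org/ns/dcat#> .
@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix disco:   <http://rdf-vocabulary.ddialliance.org/discovery#> .
@prefix ex:      <http://example.org/catalog/> .   # hypothetical namespace

# The catalog level is described with DCAT.
ex:Catalog_1 a dcat:Catalog ;
    dcterms:title "Example data archive catalog"@en ;
    dcat:dataset ex:Dataset_1 .

# The data set level is described with Disco (and, redundantly, DCAT).
ex:Dataset_1 a disco:LogicalDataSet, dcat:Dataset ;
    dcterms:title "Person records"@en ;
    dcat:distribution ex:Datafile_1 ;
    disco:dataFile ex:Datafile_1 .

# The physical file is both a Disco DataFile and a DCAT Distribution.
ex:Datafile_1 a disco:DataFile, dcat:Distribution ;
    dcterms:format "ascii" ;
    dcat:downloadURL <http://example.org/files/persons.dat> .
```

A DCAT-only client can follow `dcat:dataset` and `dcat:distribution`, while a Disco-aware client can continue from the same resources to variables, questions, and universes.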
The Data Cube vocabulary is a W3C standard for representing data cubes, that is, multidimensional aggregate data. Data cubes are often generated by tabulating or aggregating record-level datasets. For example, if an observation in a census data cube indicates the population of a certain age group in a certain region is 12345, then this fact was obtained by aggregating that number of individual records from a record-level (or “microdata”) dataset. The Discovery Vocabulary contains a property “aggregation” (pointing from a Disco data set to a Data Cube dataset) that indicates that a Cube dataset was derived by tabulating a record-level dataset.
Data Cube provides for the description of the structure of such cubes, but also for the representation of the cube data itself, that is, the observations that make up the cube dataset. This is not the case for the Discovery Vocabulary, which only describes the structure of a dataset, but is not concerned with representing the actual data in it. The actual data is assumed to sit in a data file (e.g., a CSV file, or in a proprietary statistics package file format) that is not represented in RDF.
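The link between a record-level data set and a derived cube could look roughly like the sketch below. The `ex:` resources and titles are hypothetical; only the `aggregation` property and its direction (from the Disco data set to the Data Cube dataset) are taken from the description above. The cube's structure and observations are omitted.

```turtle
@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix disco:   <http://rdf-vocabulary.ddialliance.org/discovery#> .
@prefix qb:      <http://purl.org/linked-data/cube#> .
@prefix ex:      <http://example.org/data/> .   # hypothetical namespace

# Record-level (microdata) data set, described with Disco.
ex:CensusMicrodata a disco:LogicalDataSet ;
    dcterms:title "Census person records (microdata)"@en ;
    disco:aggregation ex:PopulationCube .

# Aggregate data set derived by tabulating the microdata, described with Data Cube.
ex:PopulationCube a qb:DataSet ;
    dcterms:title "Population by age group and region"@en .
```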
The interplay of Data Cube and Disco needs further exploration regarding the relationship of aggregate data, aggregation methods, and the underlying microdata. The goal would be to drill down to the related microdata based on a search resulting in aggregate data. On the one hand, aggregate data are often easily available and give a quick overview. On the other hand, microdata enable more detailed analyses.
The use of formal statistical classifications is very common in research data sets—these are treated in our vocabulary as SKOS concepts, but in some cases those working with formal statistical classifications may desire more expressive capability than SKOS provides. To support such users, the DDI Alliance also publishes XKOS, a vocabulary which extends SKOS to allow for a more complete description of such classifications. While the use of XKOS is not required by this vocabulary, the two are designed to work in complementary fashion.
More details on the relationship to Data Cube, DCAT and XKOS as well as to other vocabularies are provided in Section 9.
To understand the DDI Discovery Vocabulary, there are a few central classes, which can serve as entry points. The first of these is the Study class. A Study in our model represents the process by which a data set was generated or collected. Literal properties include information about the funding, organizational affiliation, abstract, title, version, and other such high-level information. In some cases, where data collection is cyclic or on-going, data sets may be released as a StudyGroup, where each cycle or "wave" of the data collection activity produces one or more data sets. This is typical for longitudinal studies, panel studies, and other types of "series" (to use the DDI term). In this case, a number of Study objects would be collected into a single StudyGroup.
Data sets have two representations in our model: a logical representation, which describes the contents of the data set, and a physical representation, which is a distributed file holding that data. It is possible to format data files in many different ways, even if the logical content is the same. In our model the LogicalDataSet represents the content of the file (its organization into a set of variables (Variable)). The LogicalDataSet is an extension (subclass) of the dcat:Dataset class. Physical, distributed files are represented by the class DataFile (not depicted in the diagram), which is itself an extension (subclass) of dcat:Distribution.
The contents of the data set are described using the Variable class. Variables (Variable) provide a definition of the column in a rectangular data file, and can associate it with a particular Concept and a Question (the Question in the Questionnaire which was used to collect the data). Variables (Variable) are related to a representation of some form, which may be a set of codes and categories (a "codelist") or may be one of the other normal data types (dateTime, numeric, textual, etc.). Codes and Categories are represented using SKOS concepts and concept schemes.
Data is collected about a specific phenomenon, typically involving some target population, and focusing on the analysis of a particular type of subject. These are respectively represented by the Universe class and the AnalysisUnit class. If, for example, the adult population of Finland is being studied, the AnalysisUnit would be individuals or persons and the Universe would be the adult population of Finland.
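The Finland example might be written down as in the following sketch. The `ex:` namespace, the resource names, and the study title are hypothetical; the properties are the ones used in the worked example later in this document.

```turtle
@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix disco:   <http://rdf-vocabulary.ddialliance.org/discovery#> .
@prefix skos:    <http://www.w3.org/2004/02/skos/core#> .
@prefix ex:      <http://example.org/finland/> .   # hypothetical namespace

ex:Study_FI a disco:Study ;
    dcterms:title "Example survey of the adult population of Finland"@en ;
    disco:universe ex:Universe_FI ;
    disco:analysisUnit ex:AnalysisUnit_Person .

# The population being studied.
ex:Universe_FI a disco:Universe ;
    skos:definition "The adult population of Finland."@en .

# The type of subject on which the analysis focuses.
ex:AnalysisUnit_Person a disco:AnalysisUnit ;
    skos:definition "Individuals or persons"@en .
```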
Bosch, Cyganiak, Wackerow, and Zapilko give a detailed overview of
the DDI-RDF Discovery Vocabulary in a full paper written for the Dublin
Core conference [Linked-Statistical-Data].
We have a sample of a survey which has been documented using DDI XML: the 1980 Argentine National Population and Housing Census. For this example we use the version disseminated by IPUMS, which provides internationally harmonized census data to make it more useful for cross-border research. Thus, this data set is produced by two organizations: the Argentine National Institute of Statistics and Censuses, and the Minnesota Population Center hosted at the University of Minnesota.
To give some idea of what is contained in the metadata set, we will use some screen shots from OpenMetadata Survey Catalog, a portal which indexes the DDI files to facilitate searching, and reflects the contents in a fashion which is easy to view. Follow this link for the information about this DDI file at the OpenMetadata Survey Catalog.
Figure 2 shows us the overview page for this study, giving us some basic information: title, identifier for the study, data producers, year, country, and a link to the access policies. If we look at the right-hand panel, we see an outline of the metadata contents of the file, including information about the questionnaire used, sampling methodology, and data collection activities, as well as the two data files, which contain detailed information about their variables.
Not all of this information is useful in a data discovery scenario—sampling and data collection methodologies are not typically indexed for searches. Information about the questionnaire is, as is detailed information about the variables contained in the files. We will look more closely at the metadata of primary interest for our discovery scenario.
Using RDF and the DDI Discovery Vocabulary, the study can also be described in triples: an instance of type Study is given the title and the identifier; also, the two data producers are linked and further described.
The year and country are described in the form of a temporal and spatial coverage of the study.
Also, the topics of the study are represented.
The study instance further contains an abstract.
Since a study is a versionable object in DDI, we attach a version to it.
A study is further described using additional information which is described in the following Example 1.
# We will use the namespace 'ddi' in all of our examples.
ddi:Study_1 a disco:Study;
    dcterms:title "National Population and Housing Census, 1980"@en;
    dcterms:identifier "ARG_1980_PHC_v01_A_IPUMS";
    dcterms:creator [
        rdfs:label "Minnesota Population Center"@en;
        skos:notation "MPC";
        org:memberOf [
            rdfs:label "University of Minnesota"@en;
        ];
    ];
    dcterms:creator [
        rdfs:label "Argentine National Institute of Statistics and Censuses"@en;
    ];
    dcterms:temporal [
        a dcterms:PeriodOfTime;
        disco:startDate "1980-10-22"^^xsd:date;
        disco:endDate "1980-10-22"^^xsd:date;
        rdfs:comment "The interviews take place on the expected census day. In some areas the enumeration took place the following day because of access problems due to heavy rains.";
    ];
    dcterms:spatial [
        # This is the DC-strictly compatible way to do it
        a dcterms:Location;
        rdfs:label "Argentina, national coverage"@en;
    ];
    # Only a subset of subjects mentioned in the original file
    dcterms:subject [ skos:definition "Technical Variables -- HOUSEHOLD"@en; ];
    dcterms:subject [ skos:definition "Group Quarters Variables -- HOUSEHOLD"@en; ];
    dcterms:abstract "IPUMS-International is an effort to inventory, preserve, harmonize, and disseminate census microdata from around the world. The project has collected the world's largest archive of publicly available census samples. The data are coded and documented consistently across countries and over time to facilitate comparative research. IPUMS-International makes these data available to qualified researchers free of charge through a web dissemination system. The IPUMS project is a collaboration of the Minnesota Population Center, National Statistical Offices, and international data archives. Major funding is provided by the U.S. National Science Foundation and the Demographic and Behavioral Sciences Branch of the National Institute of Child Health and Human Development. Additional support is provided by the University of Minnesota Office of the Vice President for Research, the Minnesota Population Center, and Sun Microsystems.";
    owl:versionInfo "Version 1.0. This version contains selected variables from the original census microdata plus harmonized variables from the IPUMS International data base."@en;
    disco:universe ddi:Universe_1;
    disco:instrument ddi:Questionnaire_1;
    disco:product ddi:Dataset_1;
    disco:analysisUnit ddi:AnalysisUnit_1;
    disco:kindOfData ddi:KindOfData_1;
    # stdyInfo/notes currently not represented.
    disco:variable ddi:AR80A401, ddi:AR80A402, ddi:AR80A404, ddi:AR80A407, ddi:AR80A411.
While the sampling methodology may not be of great interest for those searching for data, one field within this section is: the “universe”, that is, the population being studied. Figure 3 gives us an example of this information.
Thus, the study refers to a specific universe.
ddi:Universe_1 a disco:Universe;
    skos:definition "All the population in the national territory at the moment the census is carried out."@en .

Using a type of instrument (a questionnaire), the study produced a dataset. The dataset has access rights. The dataset has a concrete data file (physical representation or distributed file) populated by certain variables.
ddi:Dataset_1 a disco:LogicalDataSet;
    disco:instrument ddi:Questionnaire_1;
    dcterms:accessRights ddi:AccessRights_1;
    disco:dataFile ddi:Datafile_1;
    disco:variable ddi:AR80A401, ddi:AR80A402, ddi:AR80A404, ddi:AR80A407, ddi:AR80A411.

ddi:AccessRights_1 a dcterms:RightsStatement;
    dcterms:description "IPUMS-International distributes integrated microdata of individuals and households only by agreement ... designed to extend this record.";
    rdfs:seeAlso <http://microdata.worldbank.org/index.php/catalog/442/accesspolicy>.
Figure 4 shows us the information about access policies, which typically is of interest to those searching for data.
The Unit of Analysis and Kind of Data further describe the study.
ddi:AnalysisUnit_1 a disco:AnalysisUnit ;
    skos:definition "Dwelling, quarter dwelling, census household, and population"@en .

ddi:KindOfData_1 a skos:Concept ;
    rdfs:label "Census/enumeration data [cen]"@en .
In some cases we may have a lot of information about the questionnaires used, and it is very common to search for data by the text of the question used to collect it. Sometimes there will be a PDF of a questionnaire, and sometimes question text may be linked to individual variables within a file. In this case, we have only a textual description of the set of forms used in the census (Figure 5).
The following example illustrates three questions. Each question has a text.
ddi:Questionnaire_1 a disco:Questionnaire;
    disco:question ddi:QuestionGender;
    disco:question ddi:QuestionAge;
    disco:question ddi:QuestionCitizenship.

ddi:QuestionGender a disco:Question;
    disco:questionText "2. Is the person a man or a woman? [] Man, [] Woman"@en.

ddi:QuestionAge a disco:Question;
    disco:questionText "3. What is his or her age? _ _ Mark the age in completed years at the date of the census for those younger than one year old mark 00. For those younger than 10 years old, mark 01, 02, 03, etc. For those older than 99 years old, mark 99."@en.

ddi:QuestionCitizenship a disco:Question;
    disco:questionText "6. [Immigration status] Only for persons who have usual residence in Argentina and were born in another country. [Questions 6A and 6B asked only of persons born outside Argentina and who currently reside in Argentina.] B. Are you a naturalized citizen of Argentina? [] Yes [] No [] Unanswered"@en.
In Figure 6 we see the list of variables contained in the data file. For each of these we will also have a detailed view, showing the codes and categories used to encode the actual responses in the variables (Figure 7).
Any variable has a text and is based on a variable definition.
Please note that the Turtle example describes the variable labels from the screenshot above and the references to the related represented variable and question.
ddi:AR80A401 a disco:Variable;
    dcterms:identifier "AR80A401";
    skos:prefLabel "Sex"@en, "Sexe"@fr;
    dcterms:description "This variable indicates the person's gender."@en;
    disco:basedOn ddi:SexVD;
    disco:question ddi:QuestionGender.

ddi:AR80A402 a disco:Variable;
    dcterms:identifier "AR80A402";
    dcterms:description "This variable indicates the person's age in years."@en;
    skos:prefLabel "Age"@en, "Âge"@fr;
    disco:basedOn ddi:AgeVD;
    disco:question ddi:QuestionAge.

ddi:AR80A407 a disco:Variable;
    dcterms:identifier "AR80A407";
    dcterms:description "This variable indicates whether or not the person is a naturalized citizen of Argentina."@en;
    skos:prefLabel "Citizenship"@en, "Citoyenneté"@fr;
    disco:basedOn ddi:CitizenshipVD;
    disco:question ddi:QuestionCitizenship.

Any variable definition has a representation defining the possible values of a variable. Also, a variable definition has its own universe (which may be the same as that of the study or possibly narrower) and (DDI) concepts further describing the variable.
ddi:SexVD a disco:RepresentedVariable;
    disco:universe ddi:UniversePerson;
    disco:representation ddi:SexRepr;
    disco:concept ddi:IpumsC1;
    skos:prefLabel "Sex"@en, "Sexe"@fr;
    dcterms:description "Sex data element"@en.

ddi:SexRepr a skos:ConceptScheme, disco:Representation;
    skos:hasTopConcept ddi:SexM, ddi:SexF.

ddi:SexM a skos:Concept;
    skos:notation "1";
    skos:prefLabel "Male"@en, "Homme"@fr;
    skos:inScheme ddi:SexRepr.

ddi:SexF a skos:Concept;
    skos:notation "2";
    skos:prefLabel "Female"@en, "Femme"@fr;
    skos:inScheme ddi:SexRepr.

ddi:AgeVD a disco:RepresentedVariable;
    disco:universe ddi:UniversePerson;
    disco:representation ddi:AgeRepr;
    disco:concept ddi:IpumsC1;
    skos:prefLabel "Age"@en, "Âge"@fr;
    dcterms:description "Age data element"@en.

ddi:AgeRepr a skos:ConceptScheme, disco:Representation;
    skos:hasTopConcept ddi:Age0, ddi:Age1, ddi:Age99.

ddi:Age0 a skos:Concept;
    skos:notation "0";
    skos:prefLabel "0";
    skos:inScheme ddi:AgeRepr.

ddi:Age1 a skos:Concept;
    skos:notation "1";
    skos:prefLabel "1";
    skos:inScheme ddi:AgeRepr.

# ...

ddi:Age99 a skos:Concept;
    skos:notation "99";
    skos:prefLabel "99";
    skos:inScheme ddi:AgeRepr.

ddi:CitizenshipVD a disco:RepresentedVariable;
    disco:universe ddi:UniverseNonArgentines;
    disco:representation ddi:CitizenshipRepr;
    disco:concept ddi:IpumsC2;
    skos:prefLabel "Citizenship"@en;
    dcterms:description "Citizenship data element"@en.

ddi:CitizenshipRepr a skos:ConceptScheme, disco:Representation;
    skos:hasTopConcept ddi:CYes, ddi:CNo, ddi:CUnknown, ddi:CNIU.

ddi:CYes a skos:Concept;
    skos:notation "1";
    skos:prefLabel "Yes";
    skos:inScheme ddi:CitizenshipRepr.

ddi:CNo a skos:Concept;
    skos:notation "2";
    skos:prefLabel "No";
    skos:inScheme ddi:CitizenshipRepr.

ddi:CUnknown a skos:Concept;
    skos:notation "8";
    skos:prefLabel "Unknown";
    skos:inScheme ddi:CitizenshipRepr.

ddi:CNIU a skos:Concept;
    skos:notation "9";
    skos:prefLabel "NIU (not in universe)";
    skos:inScheme ddi:CitizenshipRepr.

Any universe of a variable definition is a subset of the universe of the entire study. In our example, two questions address the universe of persons, and the third question addresses a specific subset of the universe of persons.
# skos:broader points from the narrower universe to the universe that contains it.
ddi:UniversePerson a disco:Universe;
    skos:definition "All persons."@en ;
    skos:broader ddi:Universe_1.

ddi:UniverseNonArgentines a disco:Universe;
    skos:definition "Foreign-born persons who reside in Argentina."@en ;
    skos:broader ddi:Universe_1;
    skos:broader ddi:UniversePerson.
At the bottom of the screen showing the variable detail, we can see that the variable for roofing material is associated with a high-level concept, “Dwelling characteristics variables.” (Figure 8.)
In Disco, DDI concepts can be hierarchically structured.

ddi:IpumsCS a skos:ConceptScheme;
    skos:hasTopConcept ddi:IpumsC1.

ddi:IpumsC1 a skos:Concept;
    skos:prefLabel "Demographic Variables - PERSON"@en, "Variables démographiques - PERSONNE"@fr;
    skos:inScheme ddi:IpumsCS.

ddi:IpumsC2 a skos:Concept;
    skos:prefLabel "Nativity and Birthplace Variables -- PERSON"@en;
    skos:inScheme ddi:IpumsCS.

The variable within a data file can be described using category statistics. In the following example, absolute and relative frequencies of the variable categories are described. This variable represents the sex of the respondent. The variable is represented by a code list, and each category statistics resource points to one code in that list.
ddi:CatStatistics_1 a disco:CategoryStatistics;
    disco:frequency 13314444;
    disco:percentage 49.97;
    disco:statisticsCategory ddi:SexM;
    disco:statisticsDataFile ddi:Datafile_1.

ddi:CatStatistics_2 a disco:CategoryStatistics;
    disco:frequency 1336270;
    disco:statisticsCategory ddi:SexF;
    disco:statisticsDataFile ddi:Datafile_1.
Next we find some general information about the data files produced by this study (Figure 9).
Finally, the data file more concretely describes the actual physical file.
ddi:Datafile_1 a disco:DataFile;
    dcterms:identifier "ARG1900-P-H.dat";
    dcterms:description "Person records"@en;
    disco:caseQuantity 2667714;
    dcterms:format "ascii";
    dcterms:provenance "Minnesota Population Center"@en;
    owl:versionInfo "Version 1.0, IPUMS sample"@en;
    dcterms:spatial [
        # This is the DC-strictly compatible way to do it
        a dcterms:Location;
        rdfs:label "Argentina, national coverage"@en
    ];
    dcterms:temporal "PeriodOfTime"@en;
    dcterms:subject "To be defined"@en.
A simple Study supports the stages of the full data lifecycle in a modular manner. A Study represents the process by which a data set was generated or collected. Literal properties include information about the funding, organizational affiliation, abstract, title, version, and other such high-level information. The key criteria for a study are: a single conceptual model (e.g. survey research concept), a single instrument (e.g. questionnaire) made up of one or more parts (e.g. employer survey, worker survey), and a single logical data structure of the initial raw data (multiple data files can be created from this, such as a public use microdata file or aggregate data files).
In some cases, where data collection is cyclic or on-going, data sets may be released as a StudyGroup, where each cycle or "wave" of the data collection activity produces one or more data sets. This is typical for longitudinal studies, panel studies, and other types of "series" (to use the DDI term). In this case, a number of Study objects would be collected into a single StudyGroup.
Studies (Study) may be contained in at most 1 StudyGroup, and groups of studies (StudyGroup) may include 0 to n studies.
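A study group with two waves could be sketched as follows. Note that the property name `disco:inGroup` is an assumption: this section does not name the property that links a Study to its StudyGroup, so it should be checked against the vocabulary's term list. The `ex:` namespace and all titles are hypothetical.

```turtle
@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix disco:   <http://rdf-vocabulary.ddialliance.org/discovery#> .
@prefix ex:      <http://example.org/panel/> .   # hypothetical namespace

ex:Panel a disco:StudyGroup ;
    dcterms:title "Example household panel"@en .

# Each wave is its own Study and belongs to at most one StudyGroup.
# disco:inGroup is an assumed property name (see the note above).
ex:Wave1 a disco:Study ;
    dcterms:title "Example household panel, wave 1"@en ;
    disco:inGroup ex:Panel .

ex:Wave2 a disco:Study ;
    dcterms:title "Example household panel, wave 2"@en ;
    disco:inGroup ex:Panel .
```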
Studies (Study) may have 0 to n instrument relationships to instruments (Instrument). Particular instruments (Instrument), however, are connected with exactly 1 Study.
Studies (Study) may have dataFile connections with 0 to n data files (DataFile), and data files (DataFile) must have 1 to n dataFile relationships to studies (Study).
Studies (Study) are associated with 0 to n variables (Variable) using the object property variable. On the other hand, variables (Variable) must be related to 1 to n studies (Study).
Studies (Study) may have 0 to n logical data sets (LogicalDataSet) (product), and logical data sets (LogicalDataSet) must have 1 to n product relationships to studies (Study).
Studies (Study) or groups of studies (StudyGroup) (the union of Study and StudyGroup) may have different datatype properties. Studies (Study) or groups of studies (StudyGroup) may have an abstract (dcterms:abstract), a title (dcterms:title), a subtitle (subtitle), an alternative title (dcterms:alternative), a purpose (purpose), and information about the date and time from which the Study is publicly available (dcterms:available).
Studies (Study) or groups of studies (StudyGroup) may have multiple object properties.
The object properties kindOfData and dcterms:subject point to skos:Concepts. kindOfData describes, with a string or a term from a controlled vocabulary, the kind of data documented in the logical product(s) of a Study. Examples include survey data, census/enumeration data, administrative data, measurement data, assessment data, demographic data, voting data, etc.
Coverage describes the temporal, spatial and topical coverage of a study. Coverage specifies the population from which observations for a particular topic can be drawn.
You can use dcterms:subject to describe the topical coverage of studies (Study) and groups of studies (StudyGroup).
The object property ddiFile points to foaf:Documents, which are the DDI-XML files containing further descriptions of the Study or the StudyGroup.
Use dcterms:temporal for temporal coverages related to the union of studies (Study) and groups of studies (StudyGroup).
For the spatial coverage use dcterms:spatial.
The cardinalities of all the object properties are 0 to n in both directions. The only exception is that studies (Study) and groups of studies (StudyGroup) may have 0 or 1 kindOfData relationships to skos:Concepts.
Creators (dcterms:creator), contributors (dcterms:contributor), and publishers (dcterms:publisher) of studies (Study) and groups of studies (StudyGroup) are foaf:Agents, which are either foaf:Persons or org:Organizations whose members are foaf:Persons. Studies (Study) or groups of studies (StudyGroup) may be funded by (fundedBy) foaf:Agents. The object property fundedBy is defined as a sub-property of dcterms:contributor. The cardinalities of these object properties are always 0 to n in both directions.
foaf:Agents may have roles such as analyst, data modeler, programmer, and co-investigator. These roles are represented using skos:Concepts. foaf:Agents and skos:Concepts are related by disco:hadRole. Roles can be defined (skos:definition), identified (skos:notation), and described (skos:prefLabel).
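The agent and role pattern described above might be expressed as in the following sketch. All `ex:` resources, names, and labels are hypothetical, and the sketch assumes that disco:hadRole points from the agent to the role concept.

```turtle
@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix disco:   <http://rdf-vocabulary.ddialliance.org/discovery#> .
@prefix foaf:    <http://xmlns.com/foaf/0.1/> .
@prefix org:     <http://www.w3.org/ns/org#> .
@prefix rdfs:    <http://www.w3.org/2000/01/rdf-schema#> .
@prefix skos:    <http://www.w3.org/2004/02/skos/core#> .
@prefix ex:      <http://example.org/agents/> .   # hypothetical namespace

ex:Study_1 a disco:Study ;
    dcterms:creator ex:Person_1 ;
    disco:fundedBy ex:Funder_1 .      # fundedBy is a sub-property of dcterms:contributor

ex:Person_1 a foaf:Person ;
    foaf:name "Jane Doe" ;
    disco:hadRole ex:RoleAnalyst .    # assumed direction: agent -> role

ex:Funder_1 a org:Organization ;
    rdfs:label "Example Research Council"@en .

# The role is a SKOS concept that is defined, identified, and labelled.
ex:RoleAnalyst a skos:Concept ;
    skos:notation "analyst" ;
    skos:prefLabel "Analyst"@en ;
    skos:definition "Person who analyses the collected data."@en .
```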
Universe is the total membership or population of a defined class of people, objects or events. A population is the number of statistical units sharing at least one common property which is of interest in a statistical analysis. There are two types of population, target population and survey population. A target population is the population outlined in the survey objectives about which information is to be sought. A survey population (also known as the coverage of the survey) is the population from which information can be obtained in the survey.
AnalysisUnit is defined as follows: The process of collecting data focuses on the analysis of a particular type of subject. If, for example, the adult population of Finland is being studied, the AnalysisUnit would be individuals or persons.
Studies (Study) and groups of studies (StudyGroup) must have 1 to n universes (Universe), and 1 particular Universe may be in a universe relationship with 0 to n studies (Study) or groups of studies (StudyGroup). Universe is a sub-class of skos:Concept. For universes (Universe) you can state definitions using skos:definition.
The union of Study and StudyGroup may have 0 or 1 AnalysisUnit reached by the object property analysisUnit, and a specific AnalysisUnit may be in an analysisUnit relationship to 0 to n studies (Study) or groups of studies (StudyGroup). AnalysisUnit is specified as a sub-class of skos:Concept.
In DDI, many entities hold particular identifiers. These can be identifiers for different versions of DDI, but also persistent identifiers, e.g. for persons or organizations, that are encoded in a particular identifier scheme such as ORCID or FundRef. In general, such identifiers can be added to each entity in DDI-RDF, since every entity is defined as an rdfs:Resource.
General metadata elements which can be used on every resource in a DDI-RDF description include:
skos:prefLabel (rdf:langString): the preferred label of this element
adms:identifier (rdfs:Resource, adms:Identifier): the identifier of this element
Each Disco resource must have an identifier (see figure below). The identifier is stated using the object property adms:identifier pointing from any rdfs:Resource to 1 to n identifiers (adms:Identifier). The class adms:Identifier can include the actual identifier itself and information on the identifier scheme, its version, and its agency.
ddi:Study_1 a disco:Study;
    dcterms:title "National Population and Housing Census, 1980"@en;
    adms:identifier [
        a adms:Identifier;
        skos:notation "us:ddi:us.mpc:ARG_1980_PHC_v01_A_IPUMS:1";
        adms:schemaAgency "DDI Alliance"@en
    ];
    dcterms:creator [
        rdfs:label "Minnesota Population Center"@en;
        skos:notation "MPC";
        adms:identifier [
            a adms:Identifier;
            skos:notation "us.mpc";
            adms:schemaAgency "DDI Alliance"@en
        ]
    ].
See section 'Asset Description Metadata Schema (ADMS)' for more information about the reuse of ADMS for representing identifiers.
Use of the owl:versionInfo property is recommended to indicate the version number and/or additional versioning text of entities.
Any entity can have version information. As you can see in the next UML class diagram, the property owl:versionInfo has rdfs:Resource as domain. As a consequence, each DDI object can have attached versioning information. However, the most typical cases are:
Since the Discovery Vocabulary only covers a subset of an original DDI-XML file, it may be worthwhile to have a relationship to the original DDI-XML file. Such a relationship can be represented using dcterms:relation. This way, every element can be related to any foaf:Document. The cardinalities are 0 to n in both directions.
So far, we can use the general property dcterms:relation for relations between publications and studies. The domain of dcterms:relation is rdfs:Resource and the range is foaf:Document.
Other kinds of relations could be primaryLiterature and secondaryLiterature.
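As a minimal sketch of such a relation, the following Turtle fragment links a study to the DDI-XML document it was derived from. The ex: namespace, the document URL, and the title are hypothetical and only serve the illustration.

```turtle
@prefix disco:   <http://rdf-vocabulary.ddialliance.org/discovery#> .
@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix foaf:    <http://xmlns.com/foaf/0.1/> .
@prefix ex:      <http://example.org/> .   # hypothetical namespace

# The study points to the original DDI-XML file
ex:Study_1 a disco:Study ;
    dcterms:relation <http://example.org/ddi-xml/study_1.xml> .

# The related resource is typed as a foaf:Document
<http://example.org/ddi-xml/study_1.xml> a foaf:Document ;
    dcterms:title "Original DDI-XML instance of Study 1"@en .
```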
Every logical dataset may have access rights statements and licensing information attached to it. For those purposes, the Dublin Core properties dcterms:accessRights and dcterms:license are used.
Access rights are defined in a dcterms:RightsStatement object, which may reference an external document stating the access rights in more detail (rdfs:seeAlso).
Descriptions (dcterms:description) and labels (skos:prefLabel) can be assigned to a dcterms:RightsStatement:
ddi:Dataset_1 a disco:LogicalDataSet ;
    dcterms:accessRights ddi:AccessRights_1 .
ddi:AccessRights_1 a dcterms:RightsStatement ;
    dcterms:description "Everybody may access this document." ;
    rdfs:seeAlso <http://www.example.org/access.html> .
License information is captured in a dcterms:LicenseDocument, which is a sub-class of dcterms:RightsStatement:
ddi:Dataset_1 a disco:LogicalDataSet ;
    dcterms:license ddi:License_1 .
ddi:License_1 a dcterms:LicenseDocument ;
    dcterms:description "Published under Open Content License." ;
    skos:prefLabel "OCL 1.0" ;
    rdfs:seeAlso <http://opencontent.org/opl.shtml> .
Logical data sets (LogicalDataSet) may have dcterms:accessRights relationships to dcterms:RightsStatement and dcterms:license connections with dcterms:LicenseDocument. dcterms:RightsStatement is associated with foaf:Document using the object property rdfs:seeAlso.
The multiplicities for these object properties are in any case 0 to n.
Coverage comprehends the key features of the scope of the data (e.g. geographic product occupation).
Studies (Study), logical datasets, and data files may have a spatial, temporal, and topical coverage.
Unlike in DDI-XML, there is no dedicated Coverage type in DDI-RDF. The comprehensive description by spatial, temporal, and topical coverage is directly attached to the respective study, logical dataset, and datafile (using DCMI terms).
For spatial coverage, dcterms:spatial is used, pointing to any geographic location (dcterms:Location):
ddi:Study_1 dcterms:spatial <http://sws.geonames.org/2921044/> .
In this example, Geonames is used to refer to a spatial region, in this case, the country Germany. Geonames provides URIs for continents, countries, regions, and cities, among others, and is therefore a possible option to use for describing spatial coverage.
For temporal coverage, dcterms:temporal is used, pointing to dcterms:PeriodOfTime.
Labels can be attached to time periods (skos:prefLabel). It is also possible to define start (startDate) and end dates (endDate). Please note that these properties are a feature at risk, since the domain is not Disco. Maintainers of the domain ontology may introduce their own properties for this purpose.
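A labelled time period with start and end dates might look like the sketch below. It assumes that startDate and endDate live in the disco: namespace and take xsd:date values; neither assumption is confirmed by an example in this section, and the ex: namespace and label are hypothetical.

```turtle
@prefix disco:   <http://rdf-vocabulary.ddialliance.org/discovery#> .
@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix skos:    <http://www.w3.org/2004/02/skos/core#> .
@prefix xsd:     <http://www.w3.org/2001/XMLSchema#> .
@prefix ex:      <http://example.org/> .   # hypothetical namespace

ex:Study_1 dcterms:temporal [
    a dcterms:PeriodOfTime ;
    skos:prefLabel "January 2012"@en ;
    # feature at risk: property names and datatypes are assumed
    disco:startDate "2012-01-01"^^xsd:date ;
    disco:endDate   "2012-01-31"^^xsd:date
] .
```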
A possible way to describe temporal coverage is the use of the W3C time ontology:
ddi:Study_1 dcterms:temporal [
    a time:Interval ;
    time:hasBeginning [ time:inXSDDateTime "2012-01-01T00:00:00+01:00"^^xsd:dateTime ];
    time:hasEnd [ time:inXSDDateTime "2012-01-31T23:59:59+01:00"^^xsd:dateTime ]
] .
This example describes a study that was conducted between January 1st and January 31st, 2012.
Topical coverage can be expressed using dcterms:subject. DDI-RDF foresees the use of skos:Concept for the description of topical coverage:
ddi:Study_1 dcterms:subject [ a skos:Concept ; skos:prefLabel "Alcohol consumption" ] .
The multiplicities for each of the three object properties dcterms:subject, dcterms:temporal, and dcterms:spatial are in any case 0 to n.
The following elements from Dublin Core may be used to describe general metadata of DDI-RDF elements (see the DC definitions for more detailed descriptions):
dcterms:abstract (used with Study): an abstract of the study
dcterms:alternative (used with Study): an alternative name for the study
dcterms:available (used with Study): the date (or date range) at which this study became or will become available
dcterms:title (used with Study, LogicalDataSet): the element’s title
dcterms:description (used with RepresentedVariable, DataFile, Instrument, Variable, dcterms:RightsStatement): a human readable description of the element
dcterms:provenance (used with DataFile): defines the provenance information for the data file. The object is a dcterms:ProvenanceStatement.
Data sets have two representations in our model: a logical representation, which describes the contents of the data set, and a physical representation, which is a distributed file holding that data. It is possible to format data files in many different ways, even if the logical content is the same. In our model the LogicalDataSet represents the content of the file (its organization into a set of variables (Variable)). The LogicalDataSet is an extension of the dcat:Dataset class. Physical, distributed files are represented by the class DataFile, which is itself an extension of dcat:Distribution. DescriptiveStatistics, i.e. SummaryStatistics as well as CategoryStatistics, are associated with data files (DataFile) by the object property statisticsDataFile. Descriptive statistics simply describe what the data shows. See also the entry on descriptive statistics in the OECD glossary of statistical terms.
Logical data sets (LogicalDataSet) and data files (DataFile) are connected using the object property dataFile. A specific logical data set (LogicalDataSet) may be linked to 0 to n data files (DataFile), and a particular DataFile may be connected with 0 to n logical data sets (LogicalDataSet) via dataFile.
DescriptiveStatistics are associated with data files (DataFile) by the object property statisticsDataFile. A concrete DescriptiveStatistics object may have statisticsDataFile relationships to multiple (0 - n) data files (DataFile). Data files (DataFile), in turn, may have 0 to n statisticsDataFile relations to DescriptiveStatistics instances.
Each study has a set of logical metadata (LogicalDataSet) associated with the processing of data, at the time of collection or later during cleaning and re-coding.
LogicalDataSet represents the microdata dataset. LogicalDataSet is defined as a sub-class of dcat:Dataset.
You can state a title (dcterms:title) and a flag indicating whether the microdata dataset is publicly available (isPublic).
You can specify access rights (dcterms:accessRights) and licenses (dcterms:license) for microdata datasets.
For a LogicalDataSet the three dimensions of coverage can be specified: spatial (dcterms:spatial), temporal (dcterms:temporal), and topical (dcterms:subject).
The cardinalities of the object properties dcterms:spatial, dcterms:temporal, dcterms:subject, dcterms:accessRights, and dcterms:license are 0 to n.
Microdata datasets may have instrument associations to multiple (0 - n) instruments (Instrument), and instruments (Instrument) are connected with multiple (0 - n) logical data sets (LogicalDataSet).
Each LogicalDataSet has exactly 1 universe (Universe), and one specific Universe may be in multiple (0 - n) universe relations to logical data sets (LogicalDataSet).
Logical data sets (LogicalDataSet) may contain (variable) 0 to n variables (Variable), and variables (Variable) must be contained in 1 to n logical data sets (LogicalDataSet).
Logical data sets (LogicalDataSet) can be aggregated (aggregation) to 0 to n data sets (qb:DataSet), and data sets (qb:DataSet) can be aggregations of 0 to n logical data sets (LogicalDataSet).
Finally, logical data sets (LogicalDataSet) refer to 0 to n data files (DataFile) using the object property dataFile, and data files (DataFile) may be linked to 0 to n logical data sets (LogicalDataSet).
The class qb:DataSet is defined in the RDF Data Cube Vocabulary. 0 to n data sets (qb:DataSet) may point to multiple (0 - n) variables (Variable) (inputVariable). Please note that this property is a feature at risk, since the domain is not Disco. Maintainers of the domain ontology may introduce their own property for this purpose.
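The link between microdata and aggregated data could be expressed roughly as follows. The directions of disco:aggregation and disco:inputVariable follow the prose above; the ex: namespace and all resource names are hypothetical.

```turtle
@prefix disco: <http://rdf-vocabulary.ddialliance.org/discovery#> .
@prefix qb:    <http://purl.org/linked-data/cube#> .
@prefix ex:    <http://example.org/> .   # hypothetical namespace

# The logical data set is aggregated to a data cube
ex:Dataset_1 a disco:LogicalDataSet ;
    disco:variable ex:VarSex, ex:VarAge ;
    disco:aggregation ex:Cube_PopulationBySexAndAge .

# The cube names the microdata variables it was computed from
# (inputVariable is a feature at risk)
ex:Cube_PopulationBySexAndAge a qb:DataSet ;
    disco:inputVariable ex:VarSex, ex:VarAge .
```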
Just like there is the caseQuantity data property on DataFile, there is also the data property variableQuantity on DataFile and LogicalDataSet.
This is useful (1) when no variable-level information is available and (2) when only a stub of the RDF is requested, e.g. when returning basic information on a study or file, we do not need to return information on potentially hundreds or thousands of variable references or metadata.
ddi:Dataset_1 a disco:LogicalDataSet;
    dcterms:accessRights ddi:AccessRights_1;
    disco:dataFile ddi:Datafile_1;
    disco:instrument ddi:Questionnaire_1;
    disco:variable ddi:AR80A401, ddi:AR80A402, ddi:AR80A404, ddi:AR80A407, ddi:AR80A411.
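For comparison, a stub of the same kind of description that omits the variable references and only reports counts might look like the sketch below. The ex: namespace and the title are hypothetical; the value 5 simply mirrors the five variables listed in the example above.

```turtle
@prefix disco:   <http://rdf-vocabulary.ddialliance.org/discovery#> .
@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix ex:      <http://example.org/> .   # hypothetical namespace

# Stub: no disco:variable links, only the number of variables
ex:Dataset_1 a disco:LogicalDataSet ;
    dcterms:title "Example census sample"@en ;
    disco:dataFile ex:Datafile_1 ;
    disco:variableQuantity 5 .

# The data file can carry both counts
ex:Datafile_1 a disco:DataFile ;
    disco:caseQuantity 2667714 ;
    disco:variableQuantity 5 .
```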
The collected data result in the microdata represented by the DataFile.
Data sets have a logical representation, which describes the contents of the data set, and a physical representation, which is a distributed file holding that data. It is possible to format data files in many different ways, even if the logical content is the same.
Data files (DataFile), which are also instances of dcmitype:Dataset as well as dcat:Distribution, represent all the physical distributed data files containing the microdata datasets.
ddi:Datafile_1 a disco:DataFile;
    dcterms:identifier "ARG1900-P-H.dat";
    dcterms:description "Person records"@en;
    disco:caseQuantity 2667714;
    dcterms:format "ascii";
    dcterms:provenance "Minnesota Population Center"@en;
    owl:versionInfo "Version 1.0, IPUMS sample"@en;
    dcterms:spatial [
        # This is the DC-strictly compatible way to do it
        a dcterms:Location;
        rdfs:label "Argentina, national coverage"@en
    ];
    dcterms:temporal "PeriodOfTime"@en;
    dcterms:subject "To be defined"@en.
It is possible to describe data files (DataFile) using dcterms:description. Case quantities (disco:caseQuantity) and versions (owl:versionInfo) of data files (DataFile) can also be stated.
Using the object property dcterms:format, the formats of data files (DataFile) can be defined. Data files (DataFile) must have exactly 1 dcterms:format relationship to an instance of the class dcterms:MediaTypeOrExtent, which is a sub-class of skos:Concept. Specific formats can be assigned to multiple (0 - n) data files (DataFile).
Provenance information can be assigned to data files (DataFile). Data files (DataFile) may have multiple (0 - n) dcterms:provenance relationships to dcterms:ProvenanceStatement. A dcterms:ProvenanceStatement, in turn, may be the object of 0 to n dcterms:provenance relations from data files (DataFile).
The topical, spatial, and temporal coverage of data files (DataFile) is realized by the object properties dcterms:subject, dcterms:spatial, and dcterms:temporal, all with the cardinalities 0 to n on both sides.
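Since dcterms:format and dcterms:provenance are described here as object properties, a description that uses resources instead of plain literals could be sketched as below. The ex: namespace, the resource names, and the labels are hypothetical.

```turtle
@prefix disco:   <http://rdf-vocabulary.ddialliance.org/discovery#> .
@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix skos:    <http://www.w3.org/2004/02/skos/core#> .
@prefix rdfs:    <http://www.w3.org/2000/01/rdf-schema#> .
@prefix ex:      <http://example.org/> .   # hypothetical namespace

ex:Datafile_1 a disco:DataFile ;
    dcterms:format ex:FormatAscii ;          # exactly one format
    dcterms:provenance ex:Provenance_1 .     # 0 to n provenance statements

# The format is a concept and can be shared by many data files
ex:FormatAscii a dcterms:MediaTypeOrExtent ;
    skos:prefLabel "ascii" .

ex:Provenance_1 a dcterms:ProvenanceStatement ;
    rdfs:label "Minnesota Population Center"@en .
```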
An overview of the microdata can be given either by the descriptive statistics or by the aggregated data. DescriptiveStatistics may be minimal, maximal, and mean values, as well as absolute and relative frequencies. qb:DataSet originates from the RDF Data Cube Vocabulary, an approach to map the SDMX information model to an ontology. A qb:DataSet represents aggregated data (also known as macrodata) such as multi-dimensional tables. Aggregated data are derived from microdata by statistics on groups, or aggregates such as counts, means, or frequencies.
SummaryStatistics pointing to variables and CategoryStatistics pointing to categories and codes are both descriptive statistics.
DescriptiveStatistics may have statisticsDataFile relations to 0 to n data files (DataFile), and data files (DataFile) may be in 0 to n statisticsDataFile relations to DescriptiveStatistics individuals.
SummaryStatistics point to 0 to n variables (Variable) using the object property statisticsVariable. Variables (Variable), in turn, may be in 0 to n of such relationships to SummaryStatistics objects.
CategoryStatistics may be connected with 0 to n skos:Concepts using the property statisticsCategory, and skos:Concepts representing codes (values) and categories (value labels) may be in 0 to n of such relationships.
SummaryStatistics and CategoryStatistics may have a weightedBy relation to a Variable. A statistical weight is an amount given to increase or decrease the importance of an item.
ddi:CatStatistics_1 a disco:CategoryStatistics;
    disco:frequency 13314444;
    disco:percentage 49.97;
    disco:statisticsCategory ddi:SexM;
    disco:statisticsDataFile ddi:Datafile_1.
ddi:CatStatistics_2 a disco:CategoryStatistics;
    disco:frequency 1336270;
    disco:statisticsCategory ddi:SexF;
    disco:statisticsDataFile ddi:Datafile_1.
Available category statistics types are frequency, percentage, and cumulativePercentage.
Available summary statistics types are organized in the controlled vocabulary SummaryStatisticsType. Each summary statistics type is a skos:Concept. Particular summary statistics types are included into a disco:SummaryStatistics class with the property disco:summaryStatisticType. The particular value is modelled with rdf:value. More information on the SKOS representation of the controlled vocabulary SummaryStatisticsType can be found at the DDI-controlled-vocabularies project page.
There are two possibilities to define new types of summary statistics.
First, the term 'other' with a new value can be used in association with the existing vocabulary.
Second, a new vocabulary can be defined. In the ISSP example below, the term 'other' is used in class issp:XYZ_17, though not included in the following tables.
There are two properties which describe details of a category or summary statistic value, computationBase and weightedBy.
computationBase expresses whether the cases which are the basis of the computation of a statistics value are valid, invalid, or the total of both.
In statistics, missing data (i.e. invalid data), or missing values, occur when no data value is stored for the variable in an observation. Missing data are a common occurrence and can have a significant effect on the conclusions that can be drawn from the data.
The usage of computationBase for frequency differs from the usage for the percentage statistics and the summary statistics. A distinction regarding computationBase doesn’t apply to frequency as category statistic.
The following table describes the details of usage of computationBase depending on the respective statistics type.
Table 1: Description of Statistics of Valid/Invalid Cases
Statistics Type | computationBase: valid | computationBase: invalid | computationBase: total | computationBase: not used |
Category Statistics Type | | | | |
frequency | n/a | n/a | n/a | ++ |
percentage | ++ | + | ++ | n/a |
cumulativePercentage | ++ | + | ++ | n/a |
Summary Statistics Type | | | | |
percentage | ++ | + | n/a | n/a |
Any other summary statistics type | ++ | + | ++ | n/a |
Legend: ++ used frequently, + rarely used, n/a not applicable
weightedBy defines the weight variable of a category or summary statistic computation or value. It can also be used to indicate that a weight variable is used but the related variable is not known. weightedBy may be assigned to a category statistic value or to a summary statistic value.
Table 2: Description of Statistics of Non-weighted/Weighted Variables
Statistics Value of ... | Value of weightedBy |
unweighted variable | not used |
weighted variable (weight variable is not known) | Reference to blank node |
weighted variable (weight variable is known) | Reference to weight variable |
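The three cases of Table 2 can be illustrated with the following Turtle sketch. The ex: namespace, the resource names, and the numeric values are hypothetical and chosen only to show the three shapes of a weightedBy statement.

```turtle
@prefix disco: <http://rdf-vocabulary.ddialliance.org/discovery#> .
@prefix ex:    <http://example.org/> .   # hypothetical namespace

# Unweighted variable: weightedBy is not used
ex:Stat_Unweighted a disco:CategoryStatistics ;
    disco:statisticsCategory ex:Category_1 ;
    disco:percentage 60.6 ;
    disco:computationBase "total" .

# Weighted, weight variable not known: reference to a blank node
ex:Stat_WeightUnknown a disco:CategoryStatistics ;
    disco:statisticsCategory ex:Category_1 ;
    disco:percentage 61.2 ;
    disco:computationBase "total" ;
    disco:weightedBy [] .

# Weighted, weight variable known: reference to the weight variable
ex:Stat_WeightKnown a disco:CategoryStatistics ;
    disco:statisticsCategory ex:Category_1 ;
    disco:percentage 61.2 ;
    disco:computationBase "total" ;
    disco:weightedBy ex:WeightVariable_1 .
```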
The following example shows different categories of an ISSP data set and the values of the related summary and category statistics. Each category is defined as a skos:Concept and the name used is issp:Category_X, which is the corresponding category value in the frequency table above (see Figure 23, second column).
The category issp:Category_1 is the category with the code 1 (skos:notation "1") and the category label ‘Yes, have partner; live in same household’ (skos:prefLabel "Yes, have partner; live in same household"), and it is valid (disco:isValid true). Please note that the property isValid is a feature at risk, since the domain is not Disco. Maintainers of the domain ontology may introduce their own property for this or a similar purpose.
issp:XYZ_1 defines the frequency (disco:frequency 15893) of the category issp:Category_1 (disco:statisticsCategory issp:Category_1).
@prefix issp: <http://www.issp.org/> .
@prefix ddi-cv: <http://rdf-vocabulary.ddialliance.org/DDICV#> .
# The ddicv-sumstats: prefix used below is not declared in this excerpt.

issp:Category_1 a skos:Concept;
    skos:notation "1";
    skos:prefLabel "Yes, have partner; live in same household";
    disco:isValid true.
issp:Category_3 a skos:Concept;
    skos:prefLabel "valid total";
    disco:isValid true.
issp:Category_2 a skos:Concept;
    skos:notation "0";
    skos:prefLabel "Not available (GB))";
    disco:isValid false.
issp:Category_4 a skos:Concept;
    skos:prefLabel "missing total";
    disco:isValid false.

issp:XYZ_1 a disco:CategoryStatistics;
    disco:statisticsCategory issp:Category_1;
    disco:frequency 15893.
issp:XYZ_2 a disco:CategoryStatistics;
    disco:statisticsCategory issp:Category_2;
    disco:frequency 936.
issp:XYZ_3 a disco:CategoryStatistics;
    disco:statisticsCategory issp:Category_1;
    disco:percentage 60.6;
    disco:computationBase "total".
issp:XYZ_4 a disco:CategoryStatistics;
    disco:statisticsCategory issp:Category_2;
    disco:percentage 3.6;
    disco:computationBase "total";
    disco:weightedBy issp:WeightVariable_1.
issp:XYZ_5 a disco:CategoryStatistics;
    disco:statisticsCategory issp:Category_1;
    disco:percentage 63.7;
    disco:computationBase "validOnly".
issp:XYZ_6 a disco:CategoryStatistics;
    disco:statisticsCategory issp:Category_1;
    disco:cumulativePercentage 63.7;
    disco:computationBase "validOnly".

# optional: harmonized CategoryStatistics resource if computationBase and category are the same
issp:XYZ_7 a disco:CategoryStatistics;
    disco:statisticsCategory issp:Category_1;
    disco:percentage 63.7;
    disco:cumulativePercentage 63.7;
    disco:computationBase "validOnly".

# SummaryStatistics of variable PARTLIV
issp:XYZ_8 a disco:SummaryStatistics;
    disco:statisticsVariable issp:PARTLIV;
    disco:summaryStatisticType ddicv-sumstats:ValidCases;
    rdf:value "24965".
issp:XYZ_9 a disco:SummaryStatistics;
    disco:statisticsVariable issp:PARTLIV;
    disco:summaryStatisticType ddicv-sumstats:PercentOfValidCases;
    rdf:value "95.2".
issp:XYZ_10 a disco:SummaryStatistics;
    disco:statisticsVariable issp:PARTLIV;
    disco:summaryStatisticType ddicv-sumstats:InvalidCases;
    rdf:value "1251".
issp:XYZ_11 a disco:SummaryStatistics;
    disco:statisticsVariable issp:PARTLIV;
    disco:summaryStatisticType ddicv-sumstats:PercentOfInvalidCases;
    rdf:value "4.8".

# SummaryStatistics of variable WRKHRS
issp:XYZ_12 a disco:SummaryStatistics;
    disco:statisticsVariable issp:WRKHRS;
    disco:summaryStatisticType ddicv-sumstats:ValidCases;
    rdf:value "14237".
issp:XYZ_13 a disco:SummaryStatistics;
    disco:statisticsVariable issp:WRKHRS;
    disco:summaryStatisticType ddicv-sumstats:Minimum;
    rdf:value "1".
issp:XYZ_14 a disco:SummaryStatistics;
    disco:statisticsVariable issp:WRKHRS;
    disco:summaryStatisticType ddicv-sumstats:Maximum;
    rdf:value "96".
issp:XYZ_15 a disco:SummaryStatistics;
    disco:statisticsVariable issp:WRKHRS;
    disco:summaryStatisticType ddicv-sumstats:ArithmeticMean;
    rdf:value "41.74".
issp:XYZ_16 a disco:SummaryStatistics;
    disco:statisticsVariable issp:WRKHRS;
    disco:summaryStatisticType ddicv-sumstats:StandardDeviation;
    rdf:value "14.265".

# SummaryStatistics of variable WRKHRS not included in the tables
issp:XYZ_17 a disco:SummaryStatistics;
    disco:statisticsVariable issp:WRKHRS;
    disco:summaryStatisticType ddicv-sumstats:Other;
    rdfs:label "Gini Coefficient";
    rdf:value "0.63".
# minimum
# -------
missy:Minimum a disco:SummaryStatistics;
    disco:statisticsVariable missy:PB100;
    disco:summaryStatisticType ddicv-sumstats:Minimum;
    rdf:value "1".

# maximum
# -------
missy:Maximum a disco:SummaryStatistics;
    disco:statisticsVariable missy:PB100;
    disco:summaryStatisticType ddicv-sumstats:Maximum;
    rdf:value "4".

# arithmetic mean
# ---------------
missy:Mean a disco:SummaryStatistics;
    disco:statisticsVariable missy:PB100;
    disco:summaryStatisticType ddicv-sumstats:ArithmeticMean;
    rdf:value "2.17".

# standard deviation
# ------------------
missy:StandardDeviation a disco:SummaryStatistics;
    disco:statisticsVariable missy:PB100;
    disco:summaryStatisticType ddicv-sumstats:StandardDeviation;
    rdf:value "0.9061".

# valid cases
# -----------
missy:ValidCases a disco:SummaryStatistics ;
    disco:statisticsVariable missy:PB100 ;
    disco:summaryStatisticType ddicv-sumstats:ValidCases;
    rdf:value "470950".

# percent of valid cases
# ----------------------
missy:PercentOfValidCases a disco:SummaryStatistics;
    disco:statisticsVariable missy:PB100;
    disco:summaryStatisticType ddicv-sumstats:PercentOfValidCases;
    rdf:value "99.1".

# invalid cases
# -------------
missy:InvalidCases a disco:SummaryStatistics;
    disco:statisticsVariable missy:PB100;
    disco:summaryStatisticType ddicv-sumstats:InvalidCases;
    rdf:value "4195".

# percent of invalid cases
# ------------------------
missy:PercentOfInvalidCases a disco:SummaryStatistics ;
    disco:statisticsVariable missy:PB100 ;
    disco:summaryStatisticType ddicv-sumstats:PercentOfInvalidCases ;
    rdf:value "0.9" .

# total cases
# -----------
missy:TotalCases a disco:SummaryStatistics;
    disco:statisticsVariable missy:PB100;
    disco:summaryStatisticType ddicv-sumstats:NumberOfCases;
    rdf:value "475145".

# codes and categories
# --------------------
missy:1 a skos:Concept ;
    skos:notation "1" ;
    skos:prefLabel "January,February,March" ;
    disco:isValid true .
missy:Missing a skos:Concept ;
    skos:notation "M" ;
    skos:prefLabel "Missing" ;
    disco:isValid false .

# valid cases
# -----------
missy:CS1 a disco:CategoryStatistics ;
    disco:statisticsCategory missy:1 ;
    disco:frequency 102710 ;
    disco:percentage 21.6 ;
    disco:cumulativePercentage 21.8 ;
    disco:computationBase "valid" .

# invalid cases
# -------------
missy:CS2 a disco:CategoryStatistics ;
    disco:statisticsCategory missy:Missing ;
    disco:frequency 4195 ;
    disco:percentage 0.9 ;
    disco:computationBase "invalid" .
The contents of the data set are described using the Variable class. Variables (Variable) provide a definition of the column in a rectangular data file, and can associate it with a concept and a question (Question). Variables (Variable) are related to a Representation of some form, which may be a set of codes and categories (a "codelist") or one of the other normal data types (dateTime, numeric, textual, etc.). Codes and categories are represented using skos:Concept and skos:ConceptScheme. Variable definitions (RepresentedVariable) encompass study-independent, re-usable parts of variables like occupation classification.
Variables (Variable) may be based on (basedOn) 0 or 1 variable definitions (RepresentedVariable), and variable definitions (RepresentedVariable) can be in 0 to n basedOn relationships to variables (Variable).
Both variables (Variable) and variable definitions (RepresentedVariable) have representation object properties with the class Representation as range. Variables (Variable) must have exactly 1 representation, and variable definitions (RepresentedVariable) may have 0 to n representation connections to Representation. Conversely, representations have 0 to n links to variable definitions (RepresentedVariable) and to variables (Variable).
Variables (Variable) as well as variable definitions (RepresentedVariable) both have 1 connection to the concept which should be measured. Concepts have 0 to n relationships to variables (Variable) and variable definitions (RepresentedVariable) using the object property concept.
Disco variables are in line with statistical variables, where experiments examine the relationship between variables. In the RDF Data Cube vocabulary, variables are used as dimensions, measures, or attributes to identify and describe observations.
Variables (Variable) provide a definition of the column in a rectangular data file. A Variable is a characteristic of a unit being observed. A variable might be the answer to a question, have an administrative source, or be derived from other variables (e.g. age group derived from age).
RepresentedVariables encompass study-independent, re-usable parts of variables like occupation classification.
Variables (Variable) can be described (dcterms:description), skos:notation is used to associate names with variables, and labels can be assigned to variables via the datatype property skos:prefLabel.
Variable definitions (RepresentedVariable) can also be described using dcterms:description. Labels can be assigned to variable definitions (RepresentedVariable) via the datatype property skos:prefLabel.
Variables (Variable) may be based on (basedOn) 0 or 1 RepresentedVariable. basedOn also connects variable definitions (RepresentedVariable) with 0 to n variables (Variable).
Variables (Variable) and variable definitions (RepresentedVariable) are connected with exactly 1 skos:Concept via concept. A skos:Concept has this connection to 0 to n variables (Variable) and variable definitions (RepresentedVariable).
Variables (Variable) are represented by 1 Representation, and variable definitions (RepresentedVariable) are represented by multiple (0 - n) representations (Representation). Representations (Representation) may be linked to 0 to n variables (Variable) and their definitions.
Variables (Variable) may have (question) 0 or more questions (Question), and questions (Question) may be associated with 0 to n variables (Variable).
The property universe is used to link 1 Universe to 0 to n variables (Variable) and 0 to n universes (Universe) to 0 to n variable definitions (RepresentedVariable).
The following example illustrates the three variables Sex, Age and Citizenship.
ddi:AR80A401 a disco:Variable;
    dcterms:identifier "AR80A401";
    skos:prefLabel "Sex"@en, "Sexe"@fr;
    dcterms:description "This variable indicates the person's gender."@en;
    disco:basedOn ddi:SexVD;
    disco:question ddi:QuestionGender.
ddi:AR80A402 a disco:Variable;
    dcterms:identifier "AR80A402";
    dcterms:description "This variable indicates the person's age in years."@en;
    skos:prefLabel "Age"@en, "Âge"@fr;
    disco:basedOn ddi:AgeVD;
    disco:question ddi:QuestionAge.
ddi:AR80A407 a disco:Variable;
    dcterms:identifier "AR80A407";
    dcterms:description "This variable indicates whether or not the person is a naturalized citizen of Argentina."@en;
    skos:prefLabel "Citizenship"@en, "Citoyenneté"@fr;
    disco:basedOn ddi:CitizenshipVD;
    disco:question ddi:QuestionCitizenship.
The three variables refer to universes, representations, and concepts through their RepresentedVariable.
ddi:SexVD a disco:RepresentedVariable;
    disco:universe ddi:UniversePerson;
    disco:representation ddi:SexRepr;
    disco:concept ddi:IpumsC1;
    skos:prefLabel "Sex"@en, "Sexe"@fr;
    dcterms:description "Sex data element"@en.
ddi:AgeVD a disco:RepresentedVariable;
    disco:universe ddi:UniversePerson;
    disco:representation ddi:AgeRepr;
    disco:concept ddi:IpumsC1;
    skos:prefLabel "Age"@en, "Âge"@fr;
    dcterms:description "Age data element"@en.
ddi:CitizenshipVD a disco:RepresentedVariable;
    disco:universe ddi:UniverseNonArgentines;
    disco:representation ddi:CitizenshipRepr;
    disco:concept ddi:IpumsC2;
    skos:prefLabel "Citizenship"@en;
    dcterms:description "Citizenship data element"@en.
The Representation of a variable is the combination of a value domain, datatype, and, if necessary, a unit of measure or a character set. A Representation is one of a set of values to which a numerical measure or a category from a classification can be assigned (e.g. income, age, and sex: male coded as 1).
Questions (Question, via responseDomain), variables (Variable, via representation), and variable definitions (RepresentedVariable, via representation) may have representations.
Representation is defined as a sub-class of the union of rdfs:Datatype (e.g. numeric or textual values), skos:ConceptScheme, and skos:OrderedCollection, since, for example, a question may have as its response domain a mixture of a numeric response domain containing numeric values (rdfs:Datatype), an unordered code response domain (skos:ConceptScheme), and an ordered code response domain (skos:OrderedCollection).
Questions (Question) must have 1 to n representations (via responseDomain), variables (Variable) must have exactly 1 Representation, and variable definitions (RepresentedVariable) may have 0 to n representations (Representation).
Each Representation can be in 0 to n such relationships with questions (Question), variables (Variable), and variable definitions (RepresentedVariable).
The following example shows the representations of the three previously introduced variables Sex, Age and Citizenship. All of them refer to the particular concepts.
ddi:SexRepr a skos:ConceptScheme, disco:Representation;
    skos:hasTopConcept ddi:SexM, ddi:SexF.
ddi:AgeRepr a skos:ConceptScheme, disco:Representation;
    skos:hasTopConcept ddi:Age0, ddi:Age1, ddi:Age99.
ddi:CitizenshipRepr a skos:ConceptScheme, disco:Representation;
    skos:hasTopConcept ddi:CYes, ddi:CNo, ddi:CUnknown, ddi:CNIU.
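The representations above are all unordered code lists (skos:ConceptScheme). An ordered code list could be sketched with skos:OrderedCollection and skos:memberList as shown below. The ex: namespace, the resource names, and the agreement scale are hypothetical.

```turtle
@prefix disco: <http://rdf-vocabulary.ddialliance.org/discovery#> .
@prefix skos:  <http://www.w3.org/2004/02/skos/core#> .
@prefix ex:    <http://example.org/> .   # hypothetical namespace

# Ordered code representation: the order of the RDF list carries the ranking
ex:AgreementRepr a skos:OrderedCollection, disco:Representation ;
    skos:memberList ( ex:Agree ex:Neutral ex:Disagree ) .

ex:Agree a skos:Concept ;
    skos:notation "1" ;
    skos:prefLabel "Agree"@en .
ex:Neutral a skos:Concept ;
    skos:notation "2" ;
    skos:prefLabel "Neither agree nor disagree"@en .
ex:Disagree a skos:Concept ;
    skos:notation "3" ;
    skos:prefLabel "Disagree"@en .
```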
DDI concepts, hierarchies of DDI concepts, code values, and category labels are represented by skos:Concepts.
SKOS defines the term skos:Concept, which is a unit of knowledge created by a unique combination of characteristics. In the context of statistical (meta)data, concepts are abstract summaries, general notions, knowledge of a whole set of behaviours, attitudes or characteristics which are seen as having something in common. Concepts may be associated with variables and questions.
A skos:ConceptScheme, also defined within the SKOS namespace, is a set of metadata describing statistical concepts.
skos:Concept is reused to a large extent to represent DDI concepts, codes, and categories. DDI concepts can be described using skos:definition. Furthermore, code values (skos:notation) and category labels (skos:prefLabel) can be described.
Hierarchies of DDI concepts can be built using the object properties skos:broader and skos:narrower. The domains and the ranges of skos:broader and skos:narrower are skos:Concept. The cardinalities are 0 to n in both directions.
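As a minimal sketch of such a hierarchy, the following Turtle relates one broader concept to two narrower ones. The ex: namespace and the concept names are hypothetical and are not part of the specification; only the SKOS terms are taken from the text above.

```turtle
@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix ex:   <http://example.org/concepts/> .   # hypothetical namespace

# A broad concept pointing down to two narrower concepts (illustrative names).
ex:Demography a skos:Concept ;
    skos:prefLabel "Demography"@en ;
    skos:narrower ex:Age, ex:Sex .

# Each narrower concept points back up to its broader concept.
ex:Age a skos:Concept ;
    skos:prefLabel "Age"@en ;
    skos:broader ex:Demography .

ex:Sex a skos:Concept ;
    skos:prefLabel "Sex"@en ;
    skos:broader ex:Demography .
```

Stating both directions is optional; skos:broader and skos:narrower are inverse properties, so either one is sufficient for a consumer that applies SKOS inference.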
skos:Concepts may be organized in 0 to n skos:ConceptSchemes by means of skos:inScheme. skos:ConceptSchemes may have multiple (0 - n) skos:Concepts as parts. The top concepts of a specific skos:ConceptScheme are indicated by skos:hasTopConcept, pointing to 0 to n top skos:Concepts. A specific skos:Concept may be the top concept of multiple (0 - n) skos:ConceptSchemes.
ddi:SexRepr a skos:ConceptScheme, disco:Representation ;
    rdfs:label "Code list for Sex (SEX) - codelist class"@en ;
    rdfs:comment "This code list provides the gender."@en ;
    skos:hasTopConcept ddi:SexM, ddi:SexF .

ddi:SexM a skos:Concept ;
    skos:notation "1" ;
    skos:prefLabel "Male"@en, "Homme"@fr ;
    skos:inScheme ddi:SexRepr .

ddi:SexF a skos:Concept ;
    skos:notation "2" ;
    skos:prefLabel "Female"@en, "Femme"@fr ;
    skos:inScheme ddi:SexRepr .
@prefix issp: <http://www.issp.org/> .

issp:Category_1 a skos:Concept ; skos:notation "1" ; skos:prefLabel "Yes, have partner; live in same household" ; disco:isValid true .
issp:Category_2 a skos:Concept ; skos:notation "2" ; skos:prefLabel "Yes, have partner; don't live in same household" ; disco:isValid true .
issp:Category_3 a skos:Concept ; skos:notation "3" ; skos:prefLabel "No partner" ; disco:isValid true .
issp:Category_4 a skos:Concept ; disco:isValid true .
issp:Category_5 a skos:Concept ; skos:notation "0" ; skos:prefLabel "Not available (GB)" ; disco:isValid false .
issp:Category_6 a skos:Concept ; skos:notation "7" ; skos:prefLabel "Refused" ; disco:isValid false .
issp:Category_7 a skos:Concept ; skos:notation "9" ; skos:prefLabel "No answer" ; disco:isValid false .
issp:Category_8 a skos:Concept ; disco:isValid false .
Please note that only codes and categories are part of the Turtle example.
In DDI, variables, logical data sets, questions, and categories are typically arranged in a particular order. To preserve this order, skos:OrderedCollections are used. For example, a collection of variables is represented as a skos:OrderedCollection containing multiple variables (each represented as a skos:Concept) in a skos:memberList.
The following example shows an ordered collection of categories represented using abbreviated and complete syntax forms.
@prefix issp: <http://www.issp.org/> .

issp:XYZ_1 a disco:Variable ;
    skos:notation "PARTLIV" ;
    skos:prefLabel "Living in steady partnership" ;
    disco:representation issp:OrderedCollection_1 .

# abbreviated syntax:
issp:OrderedCollection_1 rdf:type skos:OrderedCollection ;
    skos:memberList ( issp:Category_1 issp:Category_2 issp:Category_3 issp:Category_4
                      issp:Category_5 issp:Category_6 issp:Category_7 issp:Category_8 ) .

# complete syntax:
issp:OrderedCollection_1 rdf:type skos:OrderedCollection ;
    skos:memberList
        [ rdf:first issp:Category_1 ; rdf:rest
        [ rdf:first issp:Category_2 ; rdf:rest
        [ rdf:first issp:Category_3 ; rdf:rest
        [ rdf:first issp:Category_4 ; rdf:rest
        [ rdf:first issp:Category_5 ; rdf:rest
        [ rdf:first issp:Category_6 ; rdf:rest
        [ rdf:first issp:Category_7 ; rdf:rest
        [ rdf:first issp:Category_8 ; rdf:rest rdf:nil ] ] ] ] ] ] ] ] .
If no order inside a collection of variables and questions is necessary, they are represented as unordered skos:ConceptSchemes
.
The classes Variable, LogicalDataSet, and Question are defined as sub-classes of skos:Concept
.
The data collection produces the datasets in a data catalog.
In some cases, where data collection is cyclic or on-going, data sets may be released as a StudyGroup,
where each cycle or "wave" of the data collection activity produces one or more data sets.
The data for the study are collected by an instrument.
The purpose of an Instrument
,
i.e. an interview, a questionnaire or another entity used as a means of data collection,
is in the case of a survey to record the flow of a questionnaire, its use of questions, and additional component parts.
A questionnaire contains a flow of questions.
A Question
is designed to get information upon a subject, or sequence of subjects, from a respondent.
Instruments (Instrument) can be labeled using skos:prefLabel and described using dcterms:description. Instruments (Instrument) may have multiple (0 - n) external documentations (externalDocumentation), which are of the type foaf:Document. A foaf:Document may be the external documentation of 0 to n instruments (Instrument).
Questionnaires (Questionnaire) are special instruments having at least 1 (1 - n) collection mode (collectionMode), which is a skos:Concept. A specific collection mode can be associated with 0 to n questionnaires (Questionnaire). Questionnaires (Questionnaire) must contain 1 to n questions (Question) using the object property question. Particular questions (Question) may be contained in 0 to n questionnaires (Questionnaire).
The following example illustrates a questionnaire with three example questions. The questions are defined in the next section.
ddi:Questionnaire_1 a disco:Questionnaire ;
    disco:question ddi:QuestionGender ;
    disco:question ddi:QuestionAge ;
    disco:question ddi:QuestionCitizenship .
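The questionnaire above only lists its questions. A fuller description, sketched here under the cardinalities stated earlier, could also carry a label, a description, external documentation, and the required collection mode. All ex: resources and literal values are hypothetical placeholders.

```turtle
@prefix disco:   <http://rdf-vocabulary.ddialliance.org/discovery#> .
@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix foaf:    <http://xmlns.com/foaf/0.1/> .
@prefix skos:    <http://www.w3.org/2004/02/skos/core#> .
@prefix ex:      <http://example.org/> .   # hypothetical namespace

ex:Questionnaire_2 a disco:Questionnaire ;
    skos:prefLabel "Household questionnaire"@en ;
    dcterms:description "Paper questionnaire administered to each household."@en ;
    disco:externalDocumentation ex:InterviewerManual ;   # 0 to n documents
    disco:collectionMode ex:FaceToFace ;                 # at least 1 collection mode
    disco:question ex:QuestionGender .                   # at least 1 question

ex:InterviewerManual a foaf:Document .

ex:FaceToFace a skos:Concept ;
    skos:prefLabel "Face-to-face interview"@en .

ex:QuestionGender a disco:Question .
```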
A Question is designed to get information upon a subject, or sequence of subjects, from a respondent.
Questions (Question) have a question text (questionText), a label (skos:prefLabel), exactly 1 universe (Universe), multiple (1 - n) concepts (concept), and at least 1 response domain (responseDomain). Representations (Representation) may have 0 to n responseDomain relations to questions (Question). Particular universes (Universe) may be connected with 0 to n questions (Question). skos:Concepts are associated with 0 to n questions (Question).
ddi:QuestionGender a disco:Question ;
    disco:questionText "2. Is the person a man or a woman? [] Man, [] Woman"@en .

ddi:QuestionAge a disco:Question ;
    disco:questionText "3. What is his or her age? _ _ Mark the age in completed years at the date of the census for those younger than one year old mark 00. For those younger than 10 years old, mark 01, 02, 03, etc. For those older than 99 years old, mark 99."@en .

ddi:QuestionCitizenship a disco:Question ;
    disco:questionText "6. [Immigration status] Only for persons who have usual residence in Argentina and were born in another country. [Questions 6A and 6B asked only of persons born outside Argentina and who currently reside in Argentina.] B. Are you a naturalized citizen of Argentina? [] Yes [] No [] Unanswered"@en .
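The example above gives only the question texts. The following sketch shows how a single question could additionally be linked to its label, universe, concept, and response domain, as described in the cardinalities above. The ex: namespace, the resource names, and the literal values are assumptions made for illustration.

```turtle
@prefix disco: <http://rdf-vocabulary.ddialliance.org/discovery#> .
@prefix skos:  <http://www.w3.org/2004/02/skos/core#> .
@prefix ex:    <http://example.org/> .   # hypothetical namespace

ex:QuestionGender a disco:Question ;
    skos:prefLabel "Gender"@en ;
    disco:questionText "Is the person a man or a woman?"@en ;
    disco:universe ex:AllPersons ;          # exactly 1 universe
    disco:concept ex:ConceptSex ;           # 1 to n concepts
    disco:responseDomain ex:SexRepr .       # at least 1 response domain

ex:AllPersons a disco:Universe ;
    skos:definition "All persons enumerated in the census."@en .

ex:ConceptSex a skos:Concept ;
    skos:prefLabel "Sex"@en .

# An unordered code response domain, modelled as a concept scheme.
ex:SexRepr a skos:ConceptScheme, disco:Representation ;
    skos:hasTopConcept ex:SexM, ex:SexF .

ex:SexM a skos:Concept ; skos:notation "1" ; skos:prefLabel "Male"@en .
ex:SexF a skos:Concept ; skos:notation "2" ; skos:prefLabel "Female"@en .
```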
Widely accepted and adopted vocabularies are reused to a large extent. Many features of DDI can be addressed by classes and properties of other vocabularies, such as: describing metadata for citation purposes using the DCMI Metadata Terms (DCMI) [DCMI], describing catalogues of datasets using the Data Catalog Vocabulary (DCAT) [DCAT], describing aggregate data like multi-dimensional tables using the RDF Data Cube Vocabulary [RDF Data Cube Vocabulary], describing formal statistical classifications using the SKOS Extension for Statistics (XKOS) [XKOS], describing arbitrary (real, hypothesized, virtual, fictional) objects, processes and their attributes [SIO], and delineating code lists, category schemes, mappings between them, and concepts like topics using the Simple Knowledge Organization System (SKOS) [SKOS]. Furthermore, the external vocabularies Friend of a Friend (FOAF) [FOAF], the Organization Ontology (ORG) [ORG], the Asset Description Metadata Schema (ADMS) [ADMS], and the PROV Ontology (PROV-O) [PROV-O] are used. Whenever terms from other vocabularies are used within the Disco context, these terms are not re-defined but only applied for the purposes of Disco.
Disco distinguishes between required, recommended, and optional vocabularies that are reused. Required vocabularies contain classes and properties that are required in order to represent particular aspects of Disco completely. Recommended vocabularies hold classes and properties that are recommended for representing particular aspects of Disco. Finally, optional vocabularies contain classes and properties that may support the modelling of particular aspects of Disco. Which of these are needed strongly depends on the extent to which, and the purpose for which, data is represented in Disco. Terms of optional vocabularies are not necessarily required for representing DDI metadata in Disco.
Required vocabularies are:
DCMI is reused in order to describe general metadata of Disco constructs such as a study abstract (dcterms:abstract), a study or dataset title (dcterms:title), a human readable description of a Disco construct (dcterms:description), provenance information for a data file (dcterms:provenance), or the date (or date range) at which a study will become available (dcterms:available).
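A minimal sketch of how these DCMI terms might be attached to a study and a data file follows. The ex: resources and all literal values are hypothetical; the use of xsd:dateTime for dcterms:available follows the property tables of this specification.

```turtle
@prefix disco:   <http://rdf-vocabulary.ddialliance.org/discovery#> .
@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix rdfs:    <http://www.w3.org/2000/01/rdf-schema#> .
@prefix xsd:     <http://www.w3.org/2001/XMLSchema#> .
@prefix ex:      <http://example.org/> .   # hypothetical namespace

ex:Study_1 a disco:Study ;
    dcterms:title "Example Household Survey"@en ;
    dcterms:abstract "A survey of household composition and income."@en ;
    dcterms:available "2015-01-01T00:00:00Z"^^xsd:dateTime .

ex:DataFile_1 a disco:DataFile ;
    dcterms:description "Anonymised public-use file."@en ;
    # The provenance statement is given as a blank node with a plain label.
    dcterms:provenance [
        a dcterms:ProvenanceStatement ;
        rdfs:label "Deposited by the original investigators."@en
    ] .
```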
skos:Concept is reused to a large extent to represent DDI concepts, codes, and categories. SKOS defines the term skos:Concept, which is a unit of knowledge created by a unique combination of characteristics. In the context of statistical (meta)data, concepts are abstract summaries, general notions, knowledge of a whole set of behaviours, attitudes or characteristics which are seen as having something in common. skos:Concepts may be associated with variables, variable definitions, and questions and are reused to a large extent to represent DDI concepts (skos:prefLabel), codes (skos:notation), and category labels (skos:prefLabel). skos:Concepts may be organized in skos:ConceptSchemes (skos:inScheme), sets of metadata describing statistical concepts. Hierarchies of DDI concepts can be built using the object properties skos:broader and skos:narrower. Topical coverage can be expressed using dcterms:subject. Disco foresees the use of skos:Concept for the description of topical coverage. Spatial, temporal, and topical coverage are directly attached to studies, logical datasets, and datafiles. Universes and AnalysisUnits are also skos:Concepts. Therefore the properties defined for skos:Concept can be reused. kindOfData, pointing to a skos:Concept, describes, with a string or a term from a controlled vocabulary, the kind of data documented in the logical product(s) of a Study. Using dcterms:format, the formats of DataFiles can be defined.
skos:Concept.
skos:notation with skos:Concept as domain.
skos:prefLabel and the domain class skos:Concept to describe category values.
skos:definition pointing from skos:Concept classes.
skos:broader and skos:narrower. The domains and the ranges of skos:broader and skos:narrower are skos:Concept.
skos:ConceptSchemes: skos:Concepts may be organized in skos:ConceptSchemes by means of skos:inScheme. The top concept in a specific ConceptScheme is indicated by skos:hasTopConcept pointing to top skos:Concept.
dcterms:subject. DDI-RDF foresees the use of skos:Concept for the description of topical coverage. Spatial, temporal, and topical coverage are directly attached to studies, logical datasets, and datafiles.
CategoryStatistics: CategoryStatistics like frequencies and percentages are associated with the respective category using the object property statisticsCategory. skos:Concept represents categories.
Questions (Question) are associated with concepts via the object property concept.
Universe: Each universe is also a skos:Concept. Therefore the properties defined for skos:Concept can be reused for universes.
Questionnaires (Questionnaire) may have multiple collection modes which are represented by skos:Concept.
Variable definitions are associated with concepts via the object property concept.
Variables (Variable) are linked to concepts via the object property concept.
kindOfData describes the kind of data documented in the logical product(s) of a Study. Examples include survey data, census/enumeration data, administrative data, measurement data, assessment data, demographic data, voting data, etc. The range of kindOfData is skos:Concept.
Using dcterms:format, the formats of data files (DataFile) can be defined. Data files (DataFile) must have exactly 1 dcterms:format relationship to an instance of the class dcterms:MediaTypeOrExtent, which is a sub-class of skos:Concept.
AnalysisUnit: Each analysis unit is also a skos:Concept. Therefore the properties defined for skos:Concept can be reused for analysis units.
DCAT is a W3C standard for describing catalogs of datasets. DCAT makes few assumptions about the kind of datasets being described, and focuses on general metadata about the datasets (mostly using Dublin Core), and on different ways of distributing and accessing the dataset, including availability of the dataset in multiple formats. Combining terms from both DCAT and Disco can be useful for a number of reasons:
The LogicalDataSet is an extension of dcat:Dataset. Physical, distributed files are represented by the DataFile, which is itself an extension of dcat:Distribution.
ddi:DataCatalog_1 a dcat:Catalog ;
    dcat:record ddi:EuropeanStudy ;
    dcat:dataset ddi:EuropeanDataset .

ddi:EuropeanStudy a dcat:CatalogRecord ;
    a disco:Study ;
    foaf:primaryTopic ddi:EuropeanDataset ;
    disco:product ddi:EuropeanDataset .

# A "/" inside a prefixed name must be escaped in Turtle.
ddi:EuropeanDataset a dcat:Dataset ;
    a disco:LogicalDataSet ;
    dcat:theme ddi:topics\/WellBeing ;
    dcat:theme ddi:topics\/PoliticalAttitudes ;
    dcat:keyword "Europe"@en ;
    dcat:keyword "Politics"@en .
Within the context of Disco, FOAF as well as ORG are reused. Creators (dcterms:creator), contributors (dcterms:contributor), and publishers (dcterms:publisher) of Studies and StudyGroups are foaf:Agents which are either foaf:Persons or org:Organizations whose members are foaf:Persons. Studies and StudyGroups may be funded by (disco:fundedBy) foaf:Agents. The object property disco:fundedBy is defined as sub-property of dcterms:contributor.
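The relationships described above could be written as in the following sketch. The ex: namespace and all names are hypothetical; the example assumes that the creator is a person who is a member of the publishing organization.

```turtle
@prefix disco:   <http://rdf-vocabulary.ddialliance.org/discovery#> .
@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix foaf:    <http://xmlns.com/foaf/0.1/> .
@prefix org:     <http://www.w3.org/ns/org#> .
@prefix skos:    <http://www.w3.org/2004/02/skos/core#> .
@prefix ex:      <http://example.org/> .   # hypothetical namespace

ex:Study_1 a disco:Study ;
    dcterms:creator ex:JaneDoe ;
    dcterms:publisher ex:ExampleArchive ;
    disco:fundedBy ex:ExampleFunder .      # sub-property of dcterms:contributor

ex:JaneDoe a foaf:Person ;
    foaf:name "Jane Doe" ;
    org:memberOf ex:ExampleArchive .

ex:ExampleArchive a org:Organization ;
    skos:prefLabel "Example Data Archive"@en .

ex:ExampleFunder a org:Organization ;
    skos:prefLabel "Example Research Council"@en .
```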
Persons and organizations in particular may hold one or more persistent identifiers of particular schemes and agencies (e.g. ORCID, FundRef) that are not covered by the specific IDs of Disco. In order to include those identifiers and to distinguish between multiple identifiers for the same resource, ADMS is utilized. As a profile of DCAT, ADMS aims to describe semantic assets, i.e. reusable metadata and reference data. An adms:Identifier can be attached to an rdfs:Resource by using the property adms:identifier. That identifier can carry properties that define the particular identifier itself, but also its scheme, version and managing agency. Although adms:Identifier is utilized primarily for describing identifiers of persons and organizations, it may be attached to instances of all classes in Disco.
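A hedged sketch of attaching such an identifier to a person follows. The ex: resources are hypothetical and the identifier value is a placeholder, not a real ORCID iD. The properties skos:notation and adms:schemaAgency are taken from the ADMS vocabulary as understood here; check them against the ADMS version in use.

```turtle
@prefix adms: <http://www.w3.org/ns/adms#> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix ex:   <http://example.org/> .   # hypothetical namespace

ex:JaneDoe a foaf:Person ;
    adms:identifier ex:JaneDoe_ORCID .

ex:JaneDoe_ORCID a adms:Identifier ;
    skos:notation "0000-0000-0000-0000" ;   # placeholder value
    adms:schemaAgency "ORCID" .             # agency managing the scheme
```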
In order to represent detailed provenance information of Web data and metadata, classes and properties of PROV-O can be used. Thus, it can be used as a natural vocabulary to attach provenance information to Disco metadata. Terms of PROV-O are organized among three main classes: prov:Entity, prov:Activity and prov:Agent. While classes of Disco can be represented either as entities or agents, particular processes for, e.g. creating, maintaining and accessing data can be modeled as activities. Properties like prov:wasGeneratedBy, prov:hadPrimarySource, prov:wasInvalidatedBy, or prov:wasDerivedFrom describe the relationship between classes for the generation of data in more detail. In order to link from a disco:Study to its original DDI XML file, the property prov:wasDerivedFrom can be used. Moreover, PROV-O allows for representing versioning information by e.g., using the terms prov:Revision, prov:hadGeneration and prov:hadUsage.
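For instance, the link from a study to its original DDI XML file could be sketched as follows. The ex: namespace and the file IRI are hypothetical.

```turtle
@prefix disco: <http://rdf-vocabulary.ddialliance.org/discovery#> .
@prefix prov:  <http://www.w3.org/ns/prov#> .
@prefix foaf:  <http://xmlns.com/foaf/0.1/> .
@prefix ex:    <http://example.org/> .   # hypothetical namespace

# The Disco description of the study was derived from a DDI XML instance.
ex:Study_1 a disco:Study, prov:Entity ;
    prov:wasDerivedFrom <http://example.org/ddi/study_1.xml> .

<http://example.org/ddi/study_1.xml> a prov:Entity, foaf:Document .
```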
The RDF Data Cube Vocabulary is a W3C standard for representing data cubes, that is, multidimensional aggregate data.
A qb:DataSet
represents aggregate data such as multi-dimensional tables.
Aggregate data is derived from microdata by statistics on groups, or aggregates such as counts, means, or frequencies.
Data cubes are often generated by tabulating or aggregating unit-record datasets.
For example, if an observation in a census data cube indicates the population of a certain age group in a certain region is 12345,
then this fact was obtained by aggregating that number of individual records from a unit-record dataset.
Disco contains a property “aggregation” that indicates that a Cube dataset was derived by tabulating a unit-record dataset.
Data Cube provides for the description of the structure of such cubes, but also for the representation of the cube data itself,
that is, the observations that make up the cube dataset [Semantic Statistics]. This is not the case for Disco,
which only describes the structure of a dataset, but is not concerned with representing the actual data in it.
The actual data are assumed to sit in a data file (e.g. a CSV file, or in a proprietary statistical package file format) that is not represented in RDF.
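A minimal sketch of the aggregation link follows, using the domains and ranges given in the property tables of this specification (aggregation from LogicalDataSet to qb:DataSet, inputVariable from qb:DataSet to Variable). The ex: resources are hypothetical.

```turtle
@prefix disco: <http://rdf-vocabulary.ddialliance.org/discovery#> .
@prefix qb:    <http://purl.org/linked-data/cube#> .
@prefix skos:  <http://www.w3.org/2004/02/skos/core#> .
@prefix ex:    <http://example.org/> .   # hypothetical namespace

# The unit-record (microdata) dataset and the cube tabulated from it.
ex:CensusMicrodata a disco:LogicalDataSet ;
    disco:variable ex:VarAge ;
    disco:aggregation ex:PopulationByAgeCube .

# The cube names the microdata variable that was used as input.
ex:PopulationByAgeCube a qb:DataSet ;
    disco:inputVariable ex:VarAge .

ex:VarAge a disco:Variable ;
    skos:prefLabel "Age"@en .
```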
@prefix prov: <http://www.w3.org/ns/prov#> .

ddi:AggregatedDataSet a prov:Entity ;
    prov:wasDerivedFrom ddi:MicrodataDataSet .

ddi:MicrodataDataSet a prov:Entity .
@prefix prov: <http://www.w3.org/ns/prov#> .

ddi:AggregatedDataSet a prov:Entity ;
    prov:wasDerivedFrom ddi:MicrodataDataSet ;
    prov:wasGeneratedBy ddi:AggregationActivity ;
    prov:qualifiedDerivation [
        a prov:Derivation ;
        prov:entity ddi:MicrodataDataSet ;
        prov:hadActivity ddi:AggregationActivity
    ] .

ddi:AggregationActivity a prov:Activity .

ddi:MicrodataDataSet a prov:Entity .
The use of formal statistical classifications is very common in research datasets - these are treated in Disco as SKOS concepts, but in some cases those working with formal statistical classifications may desire more expressive capability than SKOS provides. To support such users, the DDI Alliance also develops XKOS, a vocabulary which extends SKOS to allow for a more complete description of such classifications [eXtended Knowledge Organization System]. While the use of XKOS is not required by this vocabulary, the two are designed to work in complementary fashion. SKOS properties may be substituted by additional XKOS properties.
XKOS extends SKOS with two main objectives: the first one is to allow the description of statistical classifications, the second one is to introduce refinements of the semantic properties defined in SKOS. The semantic properties extend the possible relations that can be applied between pairs of skos:Concepts. SKOS allows the following relations: skos:broader, skos:narrower, and skos:related. The first two are hierarchical relations, one in each direction. In Disco, these SKOS properties may be substituted by additional XKOS properties like xkos:generalizes, xkos:hasPart, xkos:caused, xkos:previous, and xkos:next.
One query typically posed by social science researchers is to find all the datasets (disco:LogicalDataSet) which use a specific statistical classification (skos:ConceptScheme) like ISCO (International Standard Classification of Occupations) or ANZSIC (Australian and New Zealand Standard Industrial Classification). It is also possible to query the semantic relationships which are defined for statistical classifications using XKOS properties. By means of these properties, not only hierarchical relations can be queried but also, for example, part-of relationships (xkos:hasPart), more general (xkos:generalizes) and more specific (xkos:specializes) concepts, and positions of concepts in lists (xkos:previous, xkos:next).
The Semanticscience Integrated Ontology (SIO) provides a simple, integrated ontology of types and relations for rich description of objects, processes and their attributes. A sio:SIO_000367 (Variable) represents a value that may change within the scope of a given operation or set of operations. For instance, in the context of mathematics or statistics, a sio Variable is an information content entity that can be used to indicate the independent, dependent, or control variables of a study or experiment. The similarity between sio Variable and disco:Variable is that both are associated with a concept, e.g. Sex, Age and Citizenship.
The main intention of Disco is to provide an RDF representation of DDI resources for discovery purposes in the Linked Data web. Nevertheless, this section provides bidirectional mappings between Disco and DDI Lifecycle (DDI-L), which allow an easy adoption of the DDI Discovery Vocabulary for existing DDI metadata. XSLTs for converting any XML output of DDI Codebook (DDI-C) and DDI-L are available at the DDI-RDF-tools project page.
Official Mapping Document
There is also an official document containing all bidirectional mappings between Disco and DDI-L: official mapping document. These mapping tables will be transformed into the official specification in the form of a Turtle file and in the form of HTML tables in this HTML specification.
Bidirectional Mappings between Disco and DDI-L
In order to avoid inconsistencies (as mapping tables may change over time), we only offer mappings between Disco and the concrete version DDI 3.1 of DDI-L. There are various mapping documents between DDI 3.1 and other DDI versions (like DDI 3.2 and DDI 2.1) on the DDI Alliance website.
Mappings between Disco and DDI 4
DDI 4 will be the next model-driven specification of DDI including mappings to multiple representations such as RDF, XML, relational databases, and Java. DDI 4 should have a clear mapping from DDI-XML 3.2. We assume that all items used in Disco will have a clear mapping to DDI-XML 3.2, and these items in DDI-XML 3.2 will have a clear mapping to items in the DDI 4 model (therefore to a representation in OWL/RDF as well). If the latter should not be possible, then a mapping of items in DDI-XML 3.2 to DDI 4 XML and DDI 4 RDF should be possible.
Turtle File Containing Mappings in RDF
The mappings are defined within a separate Turtle file:
skos:notation a rdfs:Class, owl:Class ;
    disco:mapping [
        a disco:Mapping ;
        disco:ddi-L-XPath "//l:Variable/l:VariableName" ;
        disco:ddi-L-Documentation "http://www.ddialliance.org/Specification/DDI-Lifecycle/3.1/XMLSchema/FieldLevelDocumentation/logicalproduct_xsd/elements/Variable.html" ;
        disco:context "skos:notation represents variable label" ;
        disco:context "SELECT ?notation WHERE { ?notation rdfs:domain ?variable. ?variable a disco:Variable. }"
    ] .
# | property | domain class | range class | DDI-L | description | DDI-L Documentation |
#1 | disco:AnalysisUnit | r:AnalysisUnit | ||||
#2 | disco:RepresentedVariable | |||||
#3 | disco:DataFile | |||||
#4 | disco:DescriptiveStatistics | |||||
#5 | disco:SummaryStatistics | |||||
#6 | disco:CategoryStatistics | p:CategoryStatistics | ||||
#7 | disco:Instrument | d:Instrument | ||||
#8 | disco:LogicalDataSet | |||||
#9 | disco:Question | d:QuestionItem | d:MultipleQuestionItem | ||||
#10 | disco:responseDomain | |||||
#11 | disco:Questionnaire | d:Instrument | The instrument of the study | |||
#12 | disco:Study | s:StudyUnit | ||||
#13 | disco:StudyGroup | |||||
#14 | disco:Variable | //l:Variable |
# | property | domain class | range class | DDI-L | description | DDI-L Documentation |
#1 | skos:ConceptScheme | //l:Variable/l:CodeScheme | Variables can have a coded representation |
# | property | domain class | range class | DDI-L | description | DDI-L Documentation |
#1 | disco:analysisUnit | |||||
#2 | disco:basedOn | |||||
#3 | disco:collectionMode | |||||
#4 | disco:variable | |||||
#5 | disco:concept | //l:Variable/l:ConceptReference | Variable has a concept | |||
#6 | disco:concept | //d:QuestionItem/r:ConceptReference | Question is defined by concept | |||
#7 | " | |||||
#8 | disco:aggregation | |||||
#9 | disco:dataFile | |||||
#10 | disco:ddiFile | |||||
#11 | disco:externalDocumentation | |||||
#12 | disco:fundedBy | |||||
#13 | disco:inGroup | |||||
#14 | disco:inputVariable | |||||
#15 | disco:instrument | //d:DataCollection/[d:QuestionItem d:MultipleQuestionItem] | The instrument of the study questionnaire | |||
#16 | disco:kindOfData | |||||
#17 | disco:product | |||||
#18 | disco:question | //l:Variable/l:QuestionReference | Variable can have a question | |||
#19 | disco:question | //[d:QuestionItem d:MultipleQuestionItem] | Questions in a questionnaire | |||
#20 | disco:representation | //l:Variable/l:Representation/l:CodeRepresentation/[r:CodeSchemeReference l:NumericRepresentation l:TextRepresentation l:DateTimeRepresentation] | Variables can have a representation | |||
#21 | disco:statisticsCategory | |||||
#22 | disco:statisticsDataFile | |||||
#23 | disco:statisticsVariable | |||||
#24 | disco:weightedBy | |||||
#25 | disco:universe | disco:universe | Variable can have a concept |
# | property | domain class | range class | DDI-L | description | DDI-L Documentation |
#1 | dcterms:identifier | //l:Variable/l:VariableName | dcterms:identifier represents variable label | |||
#2 | skos:prefLabel | //l:Variable/r:Label | skos:prefLabel represents the label of the variable | |||
#3 | skos:prefLabel | //d:QuestionItem/d:QuestionItemName | Name of question |
# | property | domain class | range class | DDI-L | description | DDI-L Documentation |
#1 | skos:notation | //l:Variable/l:VariableName | skos:notation represents variable label | DDI-L Documentation | ||
#2 | disco:frequency | p:CaseQuantity | ||||
#3 | disco:isPublic | |||||
#4 | disco:isValid | |||||
#5 | disco:questionText | d:QuestionText | ||||
#6 | disco:percentage | |||||
#7 | disco:computationBase | |||||
#8 | disco:cumulativePercentage | |||||
#9 | disco:purpose | s:Purpose | ||||
#10 | disco:subtitle | r:SubTitle | ||||
#11 | disco:standardDeviation | |||||
#12 | disco:numberOfCases | |||||
#13 | disco:maximum | |||||
#14 | disco:mean | |||||
#15 | disco:median | |||||
#16 | disco:minimum | |||||
#17 | disco:mode | |||||
#18 | disco:startDate |
# | property | domain class | range class | DDI-L | description | DDI-L Documentation |
#1 | skos:notation | //l:Variable/l:VariableName | skos:notation represents variable label | DDI-L Documentation | ||
#2 | skos:notation | skos:notation represents code |
# | property | domain class | range class | DDI-C | DDI-L |
1 | universe | union of Study and StudyGroup | Universe | X | X |
2 | dcterms:subject | union of Study and StudyGroup | skos:Concept | X | |
3 | dcterms:temporal | union of Study and StudyGroup | dcterms:PeriodOfTime | ||
4 | dcterms:spatial | union of Study and StudyGroup | dcterms:Location | ||
5 | kindOfData | union of Study and StudyGroup | skos:Concept | X | |
6 | analysisUnit | union of Study and StudyGroup | AnalysisUnit | ||
7 | dcterms:abstract | union of Study and StudyGroup | rdf:langString | X | X |
8 | dcterms:alternative | union of Study and StudyGroup | rdf:langString | X | X |
9 | dcterms:available | union of Study and StudyGroup | xsd:dateTime | X | |
10 | dcterms:title | union of Study and StudyGroup | rdf:langString | X | X |
11 | purpose | union of Study and StudyGroup | rdf:langString | X | |
12 | subtitle | union of Study and StudyGroup | rdf:langString | X | X |
13 | ddiFile | union of Study and StudyGroup | foaf:Document | ||
14 | fundedBy | union of Study and StudyGroup | foaf:Agent | ||
15 | dcterms:creator | union of Study and StudyGroup | foaf:Agent | X | |
16 | dcterms:contributor | union of Study and StudyGroup | foaf:Agent | ||
17 | dcterms:publisher | union of Study and StudyGroup | foaf:Agent | - | X |
18 | instrument | Study | Instrument | X | |
19 | inGroup | Study | StudyGroup | X | |
20 | dataFile | Study | DataFile | X | |
21 | variable | Study | Variable | X | X |
22 | product | Study | LogicalDataSet | X | |
23 | owl:versionInfo | Study | |||
24 | skos:definition | Universe | rdf:langString | X |
# | property | domain class | range class | DDI-C | DDI-L |
1 | adms:identifier | disco:Study | adms:Identifier | X | |
2 | adms:identifier | disco:StudyGroup | adms:Identifier | ||
3 | adms:identifier | disco:AnalysisUnit | adms:Identifier | ||
4 | adms:identifier | disco:Universe | adms:Identifier | ||
5 | adms:identifier | disco:LogicalDataSet | adms:Identifier | ||
6 | adms:identifier | disco:DataFile | adms:Identifier | X | |
7 | adms:identifier | disco:DescriptiveStatistics | adms:Identifier | ||
8 | adms:identifier | disco:SummaryStatistics | adms:Identifier | ||
9 | adms:identifier | disco:CategoryStatistics | adms:Identifier | ||
10 | adms:identifier | disco:Variable | adms:Identifier | X | |
11 | adms:identifier | disco:RepresentedVariable | adms:Identifier | ||
12 | adms:identifier | disco:Question | adms:Identifier | ||
13 | adms:identifier | disco:Instrument | adms:Identifier | ||
14 | adms:identifier | disco:Questionnaire | adms:Identifier | ||
15 | skos:prefLabel | rdfs:Resource | rdf:langString | ||
16 | dcterms:relation | rdfs:Resource | foaf:Document | ||
17 | dcterms:description | dcterms:RightsStatement | rdf:langString | ||
18 | skos:prefLabel | dcterms:RightsStatement | rdf:langString | ||
19 | rdfs:seeAlso | dcterms:RightsStatement | foaf:Document | ||
20 | skos:prefLabel | dcterms:PeriodOfTime | rdf:langString | ||
21 | startDate | dcterms:PeriodOfTime | xsd:date | ||
22 | endDate | dcterms:PeriodOfTime | xsd:date | ||
23 | skos:prefLabel | dcterms:MediaTypeOrExtent | rdf:langString | ||
24 | org:memberOf | foaf:Person | org:Organization |
# | property | domain class | range class | DDI-C | DDI-L |
1 | instrument | LogicalDataSet | Instrument | ||
2 | dataFile | LogicalDataSet | DataFile | ||
3 | aggregation | LogicalDataSet | qb:DataSet | ||
4 | variable | LogicalDataSet | Variable | ||
5 | universe | LogicalDataSet | Universe | X | |
6 | dcterms:title | LogicalDataSet | rdf:langString | X | |
7 | isPublic | LogicalDataSet | xsd:boolean | ||
8 | dcterms:accessRights | LogicalDataSet | dcterms:RightsStatement | X | |
9 | dcterms:license | LogicalDataSet | dcterms:LicenseDocument | ||
10 | inputVariable | qb:DataSet | Variable | ||
11 | caseQuantity | DataFile | xsd:nonNegativeInteger | X | |
12 | dcterms:description | DataFile | rdf:langString | ||
13 | owl:versionInfo | DataFile | string | X | |
14 | dcterms:temporal | DataFile | dcterms:PeriodOfTime | ||
15 | dcterms:spatial | DataFile | dcterms:Location | X | |
16 | dcterms:provenance | DataFile | dcterms:ProvenanceStatement | ||
17 | dcterms:subject | DataFile | skos:Concept | ||
18 | dcterms:format | DataFile | dcterms:MediaTypeOrExtent | ||
19 | statisticsDataFile | DescriptiveStatistics | DataFile | ||
20 | statisticsVariable | SummaryStatistics | Variable | ||
21 | invalidCases | SummaryStatistics | xsd:nonNegativeInteger | ||
22 | maximum | SummaryStatistics | xsd:decimal | ||
23 | mean | SummaryStatistics | xsd:decimal | ||
24 | median | SummaryStatistics | xsd:decimal | ||
25 | minimum | SummaryStatistics | xsd:decimal | ||
26 | mode | SummaryStatistics | xsd:decimal | ||
27 | standardDeviation | SummaryStatistics | xsd:decimal | ||
28 | validCases | SummaryStatistics | xsd:nonNegativeInteger | ||
29 | weightedInvalidCases | SummaryStatistics | xsd:nonNegativeInteger | ||
30 | weightedMean | SummaryStatistics | xsd:decimal | ||
31 | weightedMedian | SummaryStatistics | xsd:decimal | ||
32 | weightedMode | SummaryStatistics | xsd:decimal | ||
33 | weightedValidCases | SummaryStatistics | xsd:nonNegativeInteger | ||
34 | statisticsCategory | CategoryStatistics | skos:Concept | ||
35 | cumulativePercentage | CategoryStatistics | xsd:decimal | ||
36 | frequency | CategoryStatistics | xsd:nonNegativeInteger | ||
37 | percentage | CategoryStatistics | xsd:decimal | ||
38 | weightedCumulativePercentage | CategoryStatistics | xsd:decimal | ||
39 | weightedFrequency | CategoryStatistics | xsd:nonNegativeInteger | ||
40 | weightedPercentage | CategoryStatistics | xsd:decimal |
# | property | domain class | range class | DDI-C | DDI-L |
1 | skos:inScheme | skos:Concept | skos:ConceptScheme | ||
2 | skos:hasTopConcept | skos:ConceptScheme | skos:Concept | ||
3 | skos:broader | skos:Concept | skos:Concept | X | |
4 | skos:narrower | skos:Concept | skos:Concept | ||
5 | skos:definition | skos:Concept | rdf:langString | ||
6 | skos:notation | skos:Concept | rdfs:Literal | X | |
7 | skos:prefLabel | skos:Concept | rdf:langString | ||
8 | question | Variable | Question | X | |
9 | universe | Variable | Universe | X | X |
10 | analysisUnit | Variable | AnalysisUnit | ||
11 | concept | Variable | skos:Concept | X | |
12 | representation | Variable | Representation | ||
13 | basedOn | Variable | RepresentedVariable | ||
14 | dcterms:description | Variable | rdf:langString | X | |
15 | skos:notation | Variable | rdfs:Literal | X | |
16 | skos:prefLabel | Variable | rdf:langString | X | |
17 | concept | RepresentedVariable | skos:Concept | ||
18 | universe | RepresentedVariable | Universe | ||
19 | representation | RepresentedVariable | Representation | ||
20 | dcterms:description | RepresentedVariable | rdf:langString | ||
21 | skos:prefLabel | RepresentedVariable | rdf:langString |
# | property | domain class | range class | DDI-C | DDI-L |
1 | universe | Question | Universe | X | X |
2 | concept | Question | skos:Concept | X | |
3 | responseDomain | Question | Representation | ||
4 | questionText | Question | rdf:langString | X | |
5 | skos:prefLabel | Question | rdf:langString | X | |
6 | question | Questionnaire | Question | ||
7 | collectionMode | Questionnaire | skos:Concept | ||
8 | externalDocumentation | Instrument | foaf:Document | ||
9 | dcterms:description | Instrument | rdf:langString | X | |
10 | skos:prefLabel | Instrument | rdf:langString | X |
# | property | domain class | range class | mapping |
1 | universe | union of Study and StudyGroup | Universe | /codeBook/stdyDscr/stdyInfo/sumDscr/universe |
2 | dcterms:subject | union of Study and StudyGroup | skos:Concept | |
3 | dcterms:temporal | union of Study and StudyGroup | dcterms:PeriodOfTime | |
4 | dcterms:spatial | union of Study and StudyGroup | dcterms:Location | |
5 | kindOfData | union of Study and StudyGroup | skos:Concept | |
6 | analysisUnit | union of Study and StudyGroup | AnalysisUnit | |
7 | dcterms:abstract | union of Study and StudyGroup | rdf:langString | /codeBook/stdyDscr/stdyInfo/abstract |
8 | dcterms:alternative | union of Study and StudyGroup | rdf:langString | /codeBook/stdyDscr/citation/altTitl |
9 | dcterms:available | union of Study and StudyGroup | xsd:dateTime | |
10 | dcterms:title | union of Study and StudyGroup | rdf:langString | /codeBook/stdyDscr/citation/titl |
11 | purpose | union of Study and StudyGroup | rdf:langString | |
12 | subtitle | union of Study and StudyGroup | rdf:langString | /codeBook/stdyDscr/citation/subTitl |
13 | ddiFile | union of Study and StudyGroup | foaf:Document | |
14 | fundedBy | union of Study and StudyGroup | foaf:Agent | |
15 | dcterms:creator | union of Study and StudyGroup | foaf:Agent | |
16 | dcterms:contributor | union of Study and StudyGroup | foaf:Agent | |
17 | dcterms:publisher | union of Study and StudyGroup | foaf:Agent | |
18 | instrument | Study | Instrument | |
19 | inGroup | Study | StudyGroup | |
20 | dataFile | Study | DataFile | |
21 | variable | Study | Variable | /codeBook/dataDscr/var/@id |
22 | product | Study | LogicalDataSet | |
23 | owl:versionInfo | Study | ||
24 | skos:definition | Universe | rdf:langString |
# | property | domain class | range class | mapping |
1 | adms:identifier | disco:Study | adms:Identifier | |
2 | adms:identifier | disco:StudyGroup | adms:Identifier | |
3 | adms:identifier | disco:AnalysisUnit | adms:Identifier | |
4 | adms:identifier | disco:Universe | adms:Identifier | |
5 | adms:identifier | disco:LogicalDataSet | adms:Identifier | |
6 | adms:identifier | disco:DataFile | adms:Identifier | |
7 | adms:identifier | disco:DescriptiveStatistics | adms:Identifier | |
8 | adms:identifier | disco:SummaryStatistics | adms:Identifier | |
9 | adms:identifier | disco:CategoryStatistics | adms:Identifier | |
10 | adms:identifier | disco:Variable | adms:Identifier | |
11 | adms:identifier | disco:RepresentedVariable | adms:Identifier | |
12 | adms:identifier | disco:Question | adms:Identifier | |
13 | adms:identifier | disco:Instrument | adms:Identifier | |
14 | adms:identifier | disco:Questionnaire | adms:Identifier | |
15 | skos:prefLabel | rdfs:Resource | rdf:langString | |
16 | dcterms:relation | rdfs:Resource | foaf:Document | |
17 | dcterms:description | dcterms:RightsStatement | rdf:langString | |
18 | skos:prefLabel | dcterms:RightsStatement | rdf:langString | |
19 | rdfs:seeAlso | dcterms:RightsStatement | foaf:Document | |
20 | skos:prefLabel | dcterms:PeriodOfTime | rdf:langString | |
21 | startDate | dcterms:PeriodOfTime | xsd:date | |
22 | endDate | dcterms:PeriodOfTime | xsd:date | |
23 | skos:prefLabel | dcterms:MediaTypeOrExtent | rdf:langString | |
24 | org:memberOf | foaf:Person | org:Organization |
# | property | domain class | range class | mapping |
1 | instrument | LogicalDataSet | Instrument | |
2 | dataFile | LogicalDataSet | DataFile | |
3 | aggregation | LogicalDataSet | qb:DataSet | |
4 | variable | LogicalDataSet | Variable | |
5 | universe | LogicalDataSet | Universe | /codeBook/stdyDscr/stdyInfo/sumDscr/universe |
6 | dcterms:title | LogicalDataSet | rdf:langString | |
7 | isPublic | LogicalDataSet | xsd:boolean | |
8 | dcterms:accessRights | LogicalDataSet | dcterms:RightsStatement | |
9 | dcterms:license | LogicalDataSet | dcterms:LicenseDocument | |
10 | inputVariable | qb:DataSet | Variable | |
11 | caseQuantity | DataFile | xsd:nonNegativeInteger | |
12 | dcterms:description | DataFile | rdf:langString | |
13 | owl:versionInfo | DataFile | string | |
14 | dcterms:temporal | DataFile | dcterms:PeriodOfTime | |
15 | dcterms:spatial | DataFile | dcterms:Location | |
16 | dcterms:provenance | DataFile | dcterms:ProvenanceStatement | |
17 | dcterms:subject | DataFile | skos:Concept | |
18 | dcterms:format | DataFile | dcterms:MediaTypeOrExtent | |
19 | statisticsDataFile | DescriptiveStatistics | DataFile | |
20 | statisticsVariable | SummaryStatistics | Variable | |
21 | invalidCases | SummaryStatistics | xsd:nonNegativeInteger | |
22 | maximum | SummaryStatistics | xsd:decimal | |
23 | mean | SummaryStatistics | xsd:decimal | |
24 | median | SummaryStatistics | xsd:decimal | |
25 | minimum | SummaryStatistics | xsd:decimal | |
26 | mode | SummaryStatistics | xsd:decimal | |
27 | standardDeviation | SummaryStatistics | xsd:decimal | |
28 | validCases | SummaryStatistics | xsd:nonNegativeInteger | |
29 | weightedInvalidCases | SummaryStatistics | xsd:nonNegativeInteger | |
30 | weightedMean | SummaryStatistics | xsd:decimal | |
31 | weightedMedian | SummaryStatistics | xsd:decimal | |
32 | weightedMode | SummaryStatistics | xsd:decimal | |
33 | weightedValidCases | SummaryStatistics | xsd:nonNegativeInteger | |
34 | statisticsCategory | CategoryStatistics | skos:Concept | |
35 | cumulativePercentage | CategoryStatistics | xsd:decimal | |
36 | frequency | CategoryStatistics | xsd:nonNegativeInteger | |
37 | percentage | CategoryStatistics | xsd:decimal | |
38 | weightedCumulativePercentage | CategoryStatistics | xsd:decimal | |
39 | weightedFrequency | CategoryStatistics | xsd:nonNegativeInteger | |
40 | weightedPercentage | CategoryStatistics | xsd:decimal |
# | property | domain class | range class | mapping |
1 | skos:inScheme | skos:Concept | skos:ConceptScheme | |
2 | skos:hasTopConcept | skos:ConceptScheme | skos:Concept | |
3 | skos:broader | skos:Concept | skos:Concept | |
4 | skos:narrower | skos:Concept | skos:Concept | |
5 | skos:definition | skos:Concept | rdf:langString | |
6 | skos:notation | skos:Concept | rdfs:Literal | |
7 | skos:prefLabel | skos:Concept | rdf:langString | |
8 | question | Variable | Question | |
9 | universe | Variable | Universe | /codeBook/stdyDscr/stdyInfo/sumDscr/universe |
10 | analysisUnit | Variable | AnalysisUnit | |
11 | concept | Variable | skos:Concept | |
12 | representation | Variable | Representation | |
13 | basedOn | Variable | RepresentedVariable | |
14 | dcterms:description | Variable | rdf:langString | |
15 | skos:notation | Variable | rdfs:Literal | |
16 | skos:prefLabel | Variable | rdf:langString | |
17 | concept | RepresentedVariable | skos:Concept | |
18 | universe | RepresentedVariable | Universe | |
19 | representation | RepresentedVariable | Representation | |
20 | dcterms:description | RepresentedVariable | rdf:langString | |
21 | skos:prefLabel | RepresentedVariable | rdf:langString |
# | property | domain class | range class | mapping |
1 | universe | Question | Universe | /codeBook/stdyDscr/stdyInfo/sumDscr/universe |
2 | concept | Question | skos:Concept | |
3 | responseDomain | Question | Representation | |
4 | questionText | Question | rdf:langString | |
5 | skos:prefLabel | Question | rdf:langString | |
6 | question | Questionnaire | Question | |
7 | collectionMode | Questionnaire | skos:Concept | |
8 | externalDocumentation | Instrument | foaf:Document | |
9 | dcterms:description | Instrument | rdf:langString | |
10 | skos:prefLabel | Instrument | rdf:langString |
# | property | domain class | range class | mapping |
1 | universe | union of Study and StudyGroup | Universe | /ddi:DDIInstance/s:StudyUnit/r:UniverseReference/r:ID |
2 | dcterms:subject | union of Study and StudyGroup | skos:Concept | /ddi:DDIInstance/s:StudyUnit/r:TopicalCoverage/r:Subject |
3 | dcterms:temporal | union of Study and StudyGroup | dcterms:PeriodOfTime | |
4 | dcterms:spatial | union of Study and StudyGroup | dcterms:Location | |
5 | kindOfData | union of Study and StudyGroup | skos:Concept | /ddi:DDIInstance/s:StudyUnit/r:KindOfData |
6 | analysisUnit | union of Study and StudyGroup | AnalysisUnit | /ddi:DDIInstance/s:StudyUnit/r:AnalysisUnit |
7 | dcterms:abstract | union of Study and StudyGroup | rdf:langString | /ddi:DDIInstance/s:StudyUnit/s:Abstract/r:Content |
8 | dcterms:alternative | union of Study and StudyGroup | rdf:langString | /ddi:DDIInstance/s:StudyUnit/r:Citation/r:AlternateTitle |
9 | dcterms:available | union of Study and StudyGroup | xsd:dateTime | /ddi:DDIInstance/s:StudyUnit/r:Embargo/r:Date/r:SimpleDate |
10 | dcterms:title | union of Study and StudyGroup | rdf:langString | /ddi:DDIInstance/s:StudyUnit/r:Citation/r:Title |
11 | purpose | union of Study and StudyGroup | rdf:langString | /ddi:DDIInstance/s:StudyUnit/s:Purpose/r:Content |
12 | subtitle | union of Study and StudyGroup | rdf:langString | /ddi:DDIInstance/s:StudyUnit/r:Citation/r:SubTitle |
13 | ddiFile | union of Study and StudyGroup | foaf:Document | |
14 | fundedBy | union of Study and StudyGroup | foaf:Agent | /ddi:DDIInstance/s:StudyUnit/r:FundingInformation |
15 | dcterms:creator | union of Study and StudyGroup | foaf:Agent | /ddi:DDIInstance/s:StudyUnit/r:Citation/r:Creator |
16 | dcterms:contributor | union of Study and StudyGroup | foaf:Agent | /ddi:DDIInstance/s:StudyUnit/r:Citation/r:Contributor |
17 | dcterms:publisher | union of Study and StudyGroup | foaf:Agent | /ddi:DDIInstance/s:StudyUnit/r:Citation/r:Publisher |
18 | instrument | Study | Instrument | /ddi:DDIInstance/s:StudyUnit/d:DataCollection/@id |
19 | inGroup | Study | StudyGroup | //s:StudyUnit/ancestor::g:Group[1]/@id |
20 | dataFile | Study | DataFile | //s:StudyUnit/pi:PhysicalInstance/@id |
21 | variable | Study | Variable | /ddi:DDIInstance/s:StudyUnit//l:Variable/@id |
22 | product | Study | LogicalDataSet | //s:StudyUnit/l:LogicalProduct/@id |
23 | owl:versionInfo | Study | ||
24 | skos:definition | Universe | rdf:langString | c:Universe/c:HumanReadable |
# | property | domain class | range class | mapping |
1 | adms:identifier | disco:Study | adms:Identifier | /ddi:DDIInstance/s:StudyUnit/@id |
2 | adms:identifier | disco:StudyGroup | adms:Identifier | |
3 | adms:identifier | disco:AnalysisUnit | adms:Identifier | |
4 | adms:identifier | disco:Universe | adms:Identifier | |
5 | adms:identifier | disco:LogicalDataSet | adms:Identifier | |
6 | adms:identifier | disco:DataFile | adms:Identifier | //pi:PhysicalInstance/pi:DataFileIdentification |
7 | adms:identifier | disco:DescriptiveStatistics | adms:Identifier | |
8 | adms:identifier | disco:SummaryStatistics | adms:Identifier | |
9 | adms:identifier | disco:CategoryStatistics | adms:Identifier | |
10 | adms:identifier | disco:Variable | adms:Identifier | //l:Variable/l:VariableName |
11 | adms:identifier | disco:RepresentedVariable | adms:Identifier | |
12 | adms:identifier | disco:Question | adms:Identifier | |
13 | adms:identifier | disco:Instrument | adms:Identifier | |
14 | adms:identifier | disco:Questionnaire | adms:Identifier | |
15 | skos:prefLabel | rdfs:Resource | rdf:langString | |
16 | dcterms:relation | rdfs:Resource | foaf:Document | |
17 | dcterms:description | dcterms:RightsStatement | rdf:langString | |
18 | skos:prefLabel | dcterms:RightsStatement | rdf:langString | |
19 | rdfs:seeAlso | dcterms:RightsStatement | foaf:Document | |
20 | skos:prefLabel | dcterms:PeriodOfTime | rdf:langString | |
21 | startDate | dcterms:PeriodOfTime | xsd:date | |
22 | endDate | dcterms:PeriodOfTime | xsd:date | |
23 | skos:prefLabel | dcterms:MediaTypeOrExtent | rdf:langString | |
24 | org:memberOf | foaf:Person | org:Organization |
# | property | domain class | range class | mapping |
1 | instrument | LogicalDataSet | Instrument | |
2 | dataFile | LogicalDataSet | DataFile | |
3 | aggregation | LogicalDataSet | qb:DataSet | |
4 | variable | LogicalDataSet | Variable | |
5 | universe | LogicalDataSet | Universe | |
6 | dcterms:title | LogicalDataSet | rdf:langString | //l:LogicalProduct/r:Label |
7 | isPublic | LogicalDataSet | xsd:boolean | |
8 | dcterms:accessRights | LogicalDataSet | dcterms:RightsStatement | ancestor::s:StudyUnit/a:Archive/a:DefaultAccess/a:AccessConditions |
9 | dcterms:license | LogicalDataSet | dcterms:LicenseDocument | |
10 | inputVariable | qb:DataSet | Variable | |
11 | caseQuantity | DataFile | xsd:nonNegativeInteger | //pi:PhysicalInstance/pi:GrossFileStructure/pi:CaseQuantity |
12 | dcterms:description | DataFile | rdf:langString | |
13 | owl:versionInfo | DataFile | string | //pi:PhysicalInstance/@version |
14 | dcterms:temporal | DataFile | dcterms:PeriodOfTime | |
15 | dcterms:spatial | DataFile | dcterms:Location | pi:PhysicalInstance/r:Coverage/r:SpatialCoverage/@id | pi:PhysicalInstance/r:Coverage/r:SpatialCoverageReference/r:ID |
16 | dcterms:provenance | DataFile | dcterms:ProvenanceStatement | |
17 | dcterms:subject | DataFile | skos:Concept | |
18 | dcterms:format | DataFile | dcterms:MediaTypeOrExtent | |
19 | statisticsDataFile | DescriptiveStatistics | DataFile | |
20 | statisticsVariable | SummaryStatistics | Variable | |
21 | invalidCases | SummaryStatistics | xsd:nonNegativeInteger | |
22 | maximum | SummaryStatistics | xsd:decimal | |
23 | mean | SummaryStatistics | xsd:decimal | |
24 | median | SummaryStatistics | xsd:decimal | |
25 | minimum | SummaryStatistics | xsd:decimal | |
26 | mode | SummaryStatistics | xsd:decimal | |
27 | standardDeviation | SummaryStatistics | xsd:decimal | |
28 | validCases | SummaryStatistics | xsd:nonNegativeInteger | |
29 | weightedInvalidCases | SummaryStatistics | xsd:nonNegativeInteger | |
30 | weightedMean | SummaryStatistics | xsd:decimal | |
31 | weightedMedian | SummaryStatistics | xsd:decimal | |
32 | weightedMode | SummaryStatistics | xsd:decimal | |
33 | weightedValidCases | SummaryStatistics | xsd:nonNegativeInteger | |
34 | statisticsCategory | CategoryStatistics | skos:Concept | |
35 | cumulativePercentage | CategoryStatistics | xsd:decimal | |
36 | frequency | CategoryStatistics | xsd:nonNegativeInteger | |
37 | percentage | CategoryStatistics | xsd:decimal | |
38 | weightedCumulativePercentage | CategoryStatistics | xsd:decimal | |
39 | weightedFrequency | CategoryStatistics | xsd:nonNegativeInteger | |
40 | weightedPercentage | CategoryStatistics | xsd:decimal |
# | property | domain class | range class | mapping |
1 | skos:inScheme | skos:Concept | skos:ConceptScheme | |
2 | skos:hasTopConcept | skos:ConceptScheme | skos:Concept | |
3 | skos:broader | skos:Concept | skos:Concept | c:Universe/c:SubUniverse/@id |
4 | skos:narrower | skos:Concept | skos:Concept | |
5 | skos:definition | skos:Concept | rdf:langString | c:Universe/c:UniverseName |
6 | skos:notation | skos:Concept | rdfs:Literal | c:Universe/c:MachineReadable [skos:notation is only used to represent codes] |
7 | skos:prefLabel | skos:Concept | rdf:langString | c:Universe/r:Label [skos:prefLabel is only used to represent categories] |
8 | question | Variable | Question | //l:Variable/r:QuestionReference/r:ID |
9 | universe | Variable | Universe | //l:Variable/r:UniverseReference/r:ID |
10 | analysisUnit | Variable | AnalysisUnit | |
11 | concept | Variable | skos:Concept | //l:Variable/r:ConceptReference/r:ID |
12 | representation | Variable | Representation | |
13 | basedOn | Variable | RepresentedVariable | |
14 | dcterms:description | Variable | rdf:langString | //l:Variable/r:Description |
15 | skos:notation | Variable | rdfs:Literal | //l:Variable/l:VariableName |
16 | skos:prefLabel | Variable | rdf:langString | //l:Variable/r:Label |
17 | concept | RepresentedVariable | skos:Concept | |
18 | universe | RepresentedVariable | Universe | |
19 | representation | RepresentedVariable | Representation | |
20 | dcterms:description | RepresentedVariable | rdf:langString | |
21 | skos:prefLabel | RepresentedVariable | rdf:langString |
# | property | domain class | range class | mapping |
1 | universe | Question | Universe | //l:Variable/r:UniverseReference/r:ID |
2 | concept | Question | skos:Concept | //l:Variable/r:ConceptReference/r:ID |
3 | responseDomain | Question | Representation | |
4 | questionText | Question | rdf:langString | //d:QuestionItem | d:MultipleQuestionItem/d:QuestionText/d:LiteralText/d:Text |
5 | skos:prefLabel | Question | rdf:langString | //d:QuestionItem/d:QuestionItemName | d:MultipleQuestionItem/d:MultipleQuestionItemName |
6 | question | Questionnaire | Question | |
7 | collectionMode | Questionnaire | skos:Concept | |
8 | externalDocumentation | Instrument | foaf:Document | |
9 | dcterms:description | Instrument | rdf:langString | d:Instrument/r:Description |
10 | skos:prefLabel | Instrument | rdf:langString | d:Instrument/r:Label |
The Microdata Information System (MISSY) is an online service platform that provides systematically structured metadata for official statistics. This includes data documentation at the study and variable level (6 series, 73 studies, 121 data sets, 22,719 variables, and 6,481 questions) as well as documentation materials, tools, and further information. We developed
We use Disco as the core data model and extend it with a project-specific data model, because Disco does not meet all of our project requirements. We provide open-source reference implementations of Disco and of the project-specific data model in Java; see Software Resources below. Because instances of these data models may be physically stored in multiple formats, such as DDI-XML, Disco, relational databases, and Java, we offer persistence implementations for each of these models according to their individual persistence APIs. Diverse export routines (e.g., to Disco and DDI-Lifecycle) are available to enable the reuse of metadata in other systems.
Disco provides a detailed structure for describing data for discovery purposes in the Semantic Web. DDI XML repositories can thus be transformed to Disco and published on the Linked Data Web. For other purposes, such as preservation, exchange, replication, or a metadata-driven approach, the DDI XML specifications Lifecycle and Codebook provide a richer structure. Examples are advanced missing-value description and the notion of a conceptual variable (derived from GSIM, the Generic Statistical Information Model).
disco:Study
disco:variable (Domain: disco:Study -> Range: disco:Variable)
disco:inGroup (Domain: disco:Study -> Range: disco:StudyGroup)
disco:product (Domain: disco:Study -> Range: disco:LogicalDataSet)
disco:StudyGroup
disco:AnalysisUnit
Sub Class of: skos:Concept
disco:Universe
Sub Class of: skos:Concept
disco:LogicalDataSet
Sub Class of: http://www.w3.org/ns/dcat#Dataset
disco:variable (Domain: disco:LogicalDataSet -> Range: disco:Variable)
disco:aggregation (Domain: disco:LogicalDataSet -> Range: http://purl.org/linked-data/cube#DataSet)
disco:isPublic (Domain: disco:LogicalDataSet -> Range: xsd:boolean)
disco:variableQuantity (Domain: disco:LogicalDataSet -> Range: xsd:nonNegativeInteger)
disco:DataFile
Sub Class of: http://www.w3.org/ns/dcat#Distribution
disco:caseQuantity (Domain: disco:DataFile -> Range: xsd:nonNegativeInteger)
disco:variableQuantity (Domain: disco:DataFile -> Range: xsd:nonNegativeInteger)
disco:DescriptiveStatistics
disco:statisticsDataFile (Domain: disco:DescriptiveStatistics -> Range: disco:DataFile)
disco:SummaryStatistics
Sub Class of: disco:DescriptiveStatistics
disco:statisticsVariable (Domain: disco:SummaryStatistics -> Range: disco:Variable)
disco:summaryStatisticsType (Domain: disco:SummaryStatistics -> Range: skos:Concept)
disco:weightedBy (Domain: disco:SummaryStatistics -> Range: disco:Variable)
disco:CategoryStatistics
Sub Class of: disco:DescriptiveStatistics
disco:statisticsCategory (Domain: disco:CategoryStatistics -> Range: skos:Concept)
disco:weightedBy (Domain: disco:CategoryStatistics -> Range: disco:Variable)
disco:frequency (Domain: disco:CategoryStatistics -> Range: xsd:nonNegativeInteger)
disco:percentage (Domain: disco:CategoryStatistics -> Range: xsd:decimal)
disco:computationBase (Domain: disco:CategoryStatistics -> Range: rdf:langString)
disco:cumulativePercentage (Domain: disco:CategoryStatistics -> Range: xsd:decimal)
disco:Representation
disco:RepresentedVariable
disco:Variable
disco:basedOn (Domain: disco:Variable -> Range: disco:RepresentedVariable)
disco:Question
disco:responseDomain (Domain: disco:Question -> Range: disco:Representation)
disco:questionText (Domain: disco:Question -> Range: rdf:langString)
disco:Instrument
disco:externalDocumentation (Domain: disco:Instrument -> Range: foaf:Document)
disco:Questionnaire
Sub Class of: disco:Instrument
disco:collectionMode (Domain: disco:Questionnaire -> Range: skos:Concept)
disco:Question
disco:universe (Domain: disco:Study, disco:StudyGroup, disco:RepresentedVariable, disco:Variable, disco:Question, disco:LogicalDataSet -> Range: disco:Universe)
disco:concept (Domain: disco:RepresentedVariable, disco:Question, disco:Variable -> Range: skos:Concept)
disco:question (Domain: disco:Variable, disco:Questionnaire -> Range: disco:Question)
The following figure shows the object properties between the most important classes of the DDI-RDF Discovery Vocabulary. Additionally, the cardinalities of these object properties and class hierarchies are visualized.
A scalable version of this diagram can be found here.
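The object properties listed above can be followed in a single query. The following is a minimal sketch, not part of the specification, that walks from a study to its logical data sets and data files and reports the number of cases where it is recorded. It assumes that titles are attached with dcterms:title and that the prefix IRIs shown are the ones in use.

```sparql
PREFIX disco:   <http://rdf-vocabulary.ddialliance.org/discovery#>
PREFIX dcterms: <http://purl.org/dc/terms/>

SELECT ?studyTitle ?dataSetTitle ?dataFile ?cases
WHERE {
  # Study -> LogicalDataSet via disco:product
  ?study a disco:Study ;
         dcterms:title ?studyTitle ;
         disco:product ?dataSet .
  # LogicalDataSet -> DataFile via disco:dataFile
  ?dataSet a disco:LogicalDataSet ;
           disco:dataFile ?dataFile .
  # Titles and case counts are optional, so missing values do not drop rows.
  OPTIONAL { ?dataSet dcterms:title ?dataSetTitle . }
  OPTIONAL { ?dataFile disco:caseQuantity ?cases . }
}
ORDER BY ?studyTitle
```

The OPTIONAL clauses keep data files in the result even when no case quantity has been published for them.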
Vompras, Gregory, Bosch, Capadisli, and Wackerow [Scenarios] have written a paper describing typical use cases associated with the DDI-RDF Discovery Vocabulary. This specification does not contain the full list of possible use cases; the complete list can be found in that paper. We now show a few representative use cases associated with the DDI-RDF Discovery Vocabulary.
Find studies about climate change from the year 2000 onward.
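SELECT ?studyTitle ?studyAbstract ?logicalDataSetTitle
WHERE {
  ?study a disco:Study ;
         dcterms:title ?studyTitle ;
         dcterms:abstract ?studyAbstract ;
         dcterms:subject [ skos:prefLabel ?subjectLabel ] ;
         dcterms:temporal [ disco:startDate ?date ] ;
         disco:product ?logicalDataSet .
  ?logicalDataSet a disco:LogicalDataSet ;
                  dcterms:title ?logicalDataSetTitle .
  # Compare the lexical form so that language-tagged labels also match.
  FILTER (str(?subjectLabel) = "Climate Change")
  # disco:startDate holds an xsd:date, so compare against a date literal.
  FILTER (?date >= "2000-01-01"^^xsd:date)
}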
Find titles of data sets which are publicly available under the Canadian Data Liberation Initiative Community policy. Optionally give links to the rights statement and the license.
SELECT ?logicalDataSetTitle ?rightsStatementURL ?licenseDocument
WHERE {
  ?logicalDataSet a disco:LogicalDataSet ;
                  dcterms:title ?logicalDataSetTitle ;
                  disco:isPublic ?isPublic ;
                  dcterms:accessRights ?rightsStatement .
  ?rightsStatement skos:prefLabel ?rightsStatementLabel .
  # disco:isPublic is an xsd:boolean, so compare with the boolean literal true.
  FILTER ( ?isPublic = true && str(?rightsStatementLabel) = "Data Liberation Initiative Community" )
  OPTIONAL { ?rightsStatement rdfs:seeAlso ?rightsStatementURL . }
  OPTIONAL { ?logicalDataSet dcterms:license ?licenseDocument . }
}
Find all studies with questions about commuting to work.
SELECT ?studyTitle ?studyAbstract
WHERE {
  ?study a disco:Study ;
         disco:instrument ?questionnaire ;
         dcterms:title ?studyTitle ;
         dcterms:abstract ?studyAbstract .
  ?questionnaire disco:question ?question .
  ?question disco:questionText ?questionText .
  # Match case-insensitively on the lexical form of the question text.
  FILTER (regex(str(?questionText), "commut.*work", "i"))
}
Find study groups where the study uses the species variable and has a variable defined as Bufo alvarius.
SELECT ?studyGroupTitle ?studyGroupAbstract
WHERE {
  ?study a disco:Study ;
         disco:inGroup ?studyGroup ;
         disco:variable ?variable .
  ?studyGroup dcterms:title ?studyGroupTitle ;
              dcterms:abstract ?studyGroupAbstract .
  # Concepts are resources, so match on their labels rather than on the concept itself.
  ?variable disco:concept ?variableConcept .
  ?variableConcept skos:prefLabel ?variableConceptLabel .
  FILTER (regex(str(?variableConceptLabel), "species", "i"))
  ?variable disco:basedOn ?representedVariable .
  ?representedVariable disco:concept ?representedVariableConcept .
  ?representedVariableConcept skos:prefLabel ?representedVariableConceptLabel .
  FILTER (regex(str(?representedVariableConceptLabel), "Bufo alvarius", "i"))
}
Within the context of Disco, we reuse other well-established and accepted vocabularies wherever this is possible and reasonable. DCMI, FOAF, ORG, ADMS, and PROV-O form one block of complementary vocabularies, and their use is shown in one combined use case. DCMI is used to describe general metadata, FOAF and ORG are used to describe persons and organizations, ADMS is used for the persistent identification of objects such as persons and organizations, and PROV-O is used to provide provenance information. A typical scenario within the social sciences community could be the following one:
ddi:EuropeanStudy
    a disco:Study ;
    disco:product ddi:EuropeanDataSet ;
    disco:fundedBy ddi:GESIS .

ddi:John
    a foaf:Person ;
    a prov:Agent ;
    adms:identifier [ a adms:Identifier ] ;
    prov:actedOnBehalfOf ddi:DERI ;
    org:memberOf ddi:GESIS .

ddi:EuropeanDataSet
    a disco:LogicalDataSet ;
    a prov:Entity ;
    disco:aggregation ddi:AggregatedEuropeanDataSet .

ddi:AggregatedEuropeanDataSet
    a qb:DataSet ;
    a prov:Entity .

# In PROV-O, an activity is associated with an agent and generates an entity.
ddi:AggregationActivity
    a prov:Activity ;
    prov:used ddi:EuropeanDataSet ;
    prov:generated ddi:AggregatedEuropeanDataSet ;
    prov:wasAssociatedWith ddi:John .

ddi:DERI
    a prov:Agent ;
    a org:Organization ;
    adms:identifier [ a adms:Identifier ] .

ddi:GESIS
    a org:Organization ;
    adms:identifier [ a adms:Identifier ] .
-----
SELECT ?person
WHERE {
  ?person rdf:type foaf:Person .
  ?person org:memberOf ?gesis .
  ?gesis a org:Organization .
  ?allbus a disco:StudyGroup .
  ?allbus dcterms:creator ?person .
}
-----
SELECT ?organization ?person
WHERE {
  ?euSILC rdf:type disco:Study .
  # Each branch binds only its own variable, which avoids a cross product of persons and organizations.
  { ?euSILC dcterms:contributor ?person .
    ?person rdf:type foaf:Person . }
  UNION
  { ?euSILC dcterms:contributor ?organization .
    ?organization rdf:type org:Organization . }
}
-----
SELECT ?identifierOrganization ?identifierPerson
WHERE {
  ?euLFS rdf:type disco:Study .
  { ?euLFS dcterms:publisher ?person .
    ?person rdf:type foaf:Person .
    ?person rdf:type foaf:Agent .
    ?person adms:identifier ?identifierPerson . }
  UNION
  { ?euLFS dcterms:publisher ?organization .
    ?organization rdf:type org:Organization .
    ?organization rdf:type foaf:Agent .
    ?organization adms:identifier ?identifierOrganization . }
}
XKOS extends SKOS with two main objectives: first, to allow the description of statistical classifications; second, to introduce refinements of the semantic properties defined in SKOS. The semantic properties extend the possible relations that can be applied between pairs of skos:Concepts. SKOS provides the relations skos:broader, skos:narrower, and skos:related. The first two are hierarchical relations, one in each direction. In Disco, these SKOS properties may be substituted by additional XKOS properties like xkos:generalizes, xkos:hasPart, xkos:caused, xkos:previous, and xkos:next.
A typical question from social science researchers is to find all the datasets (disco:LogicalDataSet) that use a specific statistical classification (skos:ConceptScheme) such as ISCO (International Standard Classification of Occupations) or ANZSIC (Australian and New Zealand Industry Classification). It is also possible to query the semantic relationships that are defined for statistical classifications using XKOS properties. With these properties, queries can cover not only hierarchical relations but also, for example, part-of relationships (xkos:hasPart), more general (xkos:generalizes) and more specific (xkos:specializes) concepts, and positions of concepts in lists (xkos:previous, xkos:next).
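A query of this first kind could look like the following sketch. It rests on two assumptions that are not stated in this section: that a variable's disco:representation points directly at the skos:ConceptScheme of the classification, and that the classification is published under the hypothetical IRI shown.

```sparql
PREFIX disco:   <http://rdf-vocabulary.ddialliance.org/discovery#>
PREFIX dcterms: <http://purl.org/dc/terms/>
PREFIX skos:    <http://www.w3.org/2004/02/skos/core#>

SELECT DISTINCT ?dataSet ?dataSetTitle
WHERE {
  # Hypothetical IRI standing in for a published classification such as ISCO.
  VALUES ?classification { <http://example.org/classification/isco> }
  ?dataSet a disco:LogicalDataSet ;
           disco:variable ?variable .
  # Assumption: the representation of the variable is the classification itself.
  ?variable disco:representation ?classification .
  ?classification a skos:ConceptScheme .
  OPTIONAL { ?dataSet dcterms:title ?dataSetTitle . }
}
```

DISTINCT is used because a data set usually has several variables that share the same classification.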
The following figure gives an example inspired by the ANZSIC (Australian and New Zealand Industry Classification), which is a classification covering the field of economic activity. A small excerpt is shown here, limited to the classification object itself and its levels, as well as one item of the most detailed level (Class 6720, Real Estate Services) and its parent items. Note that the URIs employed in this example are entirely fictitious, since the ANZSIC has not yet been published as RDF.
For clarity, the properties of the classification items (code, labels, notes) have not been included in the figure.
On the left of the figure is the skos:ConceptScheme instance that corresponds to the ANZSIC 2006 classification scheme, with its various SKOS and Dublin Core properties. Additional XKOS properties indicate that the classification has four levels and covers the field of economic activity, represented here as a concept from the EuroVoc thesaurus. In this case, the coverage is intended to be exhaustive and without overlap, so xkos:coversExhaustively and xkos:coversMutuallyExclusively could have been used together instead of xkos:covers.
The four levels are instances of xkos:ClassificationLevel; they are organized as an rdf:List which is attached to the classification by the xkos:levels property. Some level information has been represented on the top level, for example its depth in the classification (xkos:depth) and the concept that characterizes the items it is composed of (xkos:organizedBy). In the same fashion, concepts of subdivision, group, and class could be created to describe the items of the lower levels.
The usual SKOS properties are used to connect the classification items to their respective level (skos:member) and to the classification (skos:inScheme, or its specialization skos:topConceptOf for the items of the first level). Similarly, skos:narrower is used to express the hierarchical relations between the items, but the subproperties defined in this specification could also be used. For example, xkos:hasPart could express the partitive relation between subdivision 67 ("Property Operators and Real Estate Services") and group 672 ("Real Estate Services").
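As a sketch of how such partitive relations could be retrieved, the following query lists every pair of items in one classification that is connected by xkos:hasPart. The classification IRI is hypothetical, in line with the fictitious URIs of the example, and the query assumes that xkos:hasPart statements have been asserted explicitly rather than inferred from skos:narrower.

```sparql
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
PREFIX xkos: <http://rdf-vocabulary.ddialliance.org/xkos#>

SELECT ?whole ?wholeCode ?part ?partCode
WHERE {
  # Hypothetical IRI for the ANZSIC 2006 concept scheme.
  VALUES ?scheme { <http://example.org/classification/anzsic2006> }
  ?whole skos:inScheme ?scheme ;
         xkos:hasPart ?part .
  ?part skos:inScheme ?scheme .
  # Codes such as "67" or "672" are read from skos:notation when present.
  OPTIONAL { ?whole skos:notation ?wholeCode . }
  OPTIONAL { ?part skos:notation ?partCode . }
}
ORDER BY ?wholeCode ?partCode
```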
While Disco and Data Cube provide terms for the description of datasets, each at a different level of aggregation, DCAT enables the representation of these datasets inside data collections such as repositories, catalogs or archives. The relationship between data collections and the datasets they contain is useful, since such collections are a typical entry point when searching for data.
A search for data may consist of two phases. In the first phase, the user searches a data catalog for records described by dcat:CatalogRecord. This search can differ according to the user's information need. While it is possible to search the metadata provided in such a record, such as dcterms:title and dcterms:description, the user can also formulate a query for more detailed information about the dataset (represented as dcat:Dataset) or its distribution (dcat:Distribution), which are part of the record. For example, a user may want to search for datasets covering a particular topic (dcat:keyword), particular temporal and spatial coverages (dcterms:temporal and dcterms:spatial), or particular formats in which a distribution of the data is available (dcterms:format). Instances of dcat:Dataset are also described by the specific themes they cover (dcat:theme). Since these themes are organized in a theme taxonomy (implemented as a skos:ConceptScheme with instances of skos:Concept), they can also be used for an overall search across all datasets of the data catalog.
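The metadata that such a first-phase search can draw on might look like the following Turtle sketch. All resources in the ex: namespace, the titles and the keyword are hypothetical, and the blank node used for dcterms:format is just one possible way to state a format.

```turtle
@prefix dcat:    <http://www.w3.org/ns/dcat#> .
@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix foaf:    <http://xmlns.com/foaf/0.1/> .
@prefix rdfs:    <http://www.w3.org/2000/01/rdf-schema#> .
@prefix skos:    <http://www.w3.org/2004/02/skos/core#> .
@prefix ex:      <http://example.org/catalog/> .   # hypothetical namespace

ex:catalog a dcat:Catalog ;
    dcat:themeTaxonomy ex:themes ;
    dcat:record ex:record1 ;
    dcat:dataset ex:dataset1 .

# Record-level metadata: title and description
ex:record1 a dcat:CatalogRecord ;
    dcterms:title "Example survey record"@en ;
    dcterms:description "Catalog entry for an example survey."@en ;
    foaf:primaryTopic ex:dataset1 .

# Dataset-level metadata: keyword, theme, temporal and spatial coverage
ex:dataset1 a dcat:Dataset ;
    dcat:keyword "politics"@en ;
    dcat:theme ex:politicalAttitudes ;
    dcterms:temporal ex:period2010 ;
    dcterms:spatial ex:europe ;
    dcat:distribution ex:dataset1-csv .

# Distribution-level metadata: available format
ex:dataset1-csv a dcat:Distribution ;
    dcterms:format [ a dcterms:MediaTypeOrExtent ; rdfs:label "CSV" ] .

# Theme taxonomy shared by all datasets of the catalog
ex:themes a skos:ConceptScheme .
ex:politicalAttitudes a skos:Concept ;
    skos:inScheme ex:themes .
```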
The first-phase search will return one or, more likely, several datasets. A second phase is therefore needed to find out which of them are relevant for the user, for example those with particular universes or samples. Searches against particular criteria across multiple Disco datasets take the form of those described in the previous two use case sections and those presented in [9]. The user may also find data sets that are published in Data Cube. So that the original microdata source of a qb:DataSet can be discovered, the property prov:wasDerivedFrom can hold the link to the corresponding disco:Study.
A user searching for data regarding dissatisfaction with politics in Europe may find the records :EuropeanStudy and :AggregatedEuropeanData in a :DataCatalog. By analyzing the information given in the themes and keywords of the associated data sets, the user can decide which data set best suits their information need. The user also notices that :AggregatedEuropeanDataset has been derived from :EuropeanStudy and seems to cover only a subset of its microdata set, :EuropeanDataset. A user who is interested in the microdata rather than the aggregated data is thus able to find the underlying microdata set.
ddi:DataCatalog_1 a dcat:Catalog;
    dcat:record ddi:EuropeanStudy;
    dcat:record ddi:AggregatedEuropeanData;
    dcat:dataset ddi:EuropeanDataset;
    dcat:dataset ddi:AggregatedEuropeanDataset.

ddi:EuropeanStudy a dcat:CatalogRecord;
    a disco:Study;
    foaf:primaryTopic ddi:EuropeanDataset;
    disco:product ddi:EuropeanDataset.

ddi:AggregatedEuropeanData a dcat:CatalogRecord;
    foaf:primaryTopic ddi:AggregatedEuropeanDataset.

ddi:EuropeanDataset a dcat:Dataset;
    a disco:LogicalDataSet;
    dcat:theme ddi:topics/WellBeing;
    dcat:theme ddi:topics/PoliticalAttitudes;
    dcat:keyword "Europe"@en;
    dcat:keyword "Politics"@en.

ddi:AggregatedEuropeanDataset a dcat:Dataset;
    a qb:DataSet;
    dcat:theme ddi:topics/PoliticalDissatisfaction;
    dcat:keyword "Europe"@en;
    dcat:keyword "Politics"@en;
    prov:wasDerivedFrom ddi:EuropeanStudy.
This work was started at the first workshop on “Semantic Statistics for Social, Behavioural, and Economic Sciences: Leveraging the DDI Model for the Linked Data Web” at Schloss Dagstuhl - Leibniz Center for Informatics, Germany, in September 2011, organized by Richard Cyganiak, Arofan Gregory, Wendy Thomas, and Joachim Wackerow. It was continued at these three meetings:
This work has been supported by contributions of the participants of the events mentioned above:
We would like to thank the following organizations which have supported this work: