What is SDMX?

SDMX provides a metamodel for describing data in any statistical domain. The origins of the SDMX Information Model can be traced directly back to the model for the Generic Statistical Message (GESMES), which is a UN/EDIFACT standard, but in reality the core of the model has evolved over time since the 1970s. Following the launch of SDMX in 2001 this model has been enhanced significantly. It is implemented fully in XML (SDMX-ML), and specific parts of it are implemented in UN/EDIFACT (a subset of the GESMES message supporting time series) and in JSON (to support data dissemination over the web).

The SDMX initiative is sponsored by seven organisations:

Bank for International Settlements (BIS); European Central Bank (ECB); Eurostat; International Monetary Fund (IMF); Organisation for Economic Co-operation and Development (OECD); United Nations Statistics Division (UNSD); World Bank.

SDMX supports many statistical activities and the processes supporting these activities:

  • data collection – data registration and data retrieval, data validation
  • data reporting and data mapping
  • data dissemination – data discovery, data query, data portal
  • structural metadata repository for metadata management, persistence, query, and retrieval
  • reference metadata reporting and dissemination, and linking metadata to data points

Why Use SDMX?

Support for Multiple Use Cases

First, it is important to know that SDMX can deliver more than just a common format for data collectors and data reporters to use for data and metadata exchange, even though the acronym would suggest otherwise.

Second, don’t worry about the syntax representation of the data and metadata: there are plenty of tools and open source code that hide this complexity and thus enable you to use SDMX to solve your problems, such as how to consume SDMX in your favourite statistical tool (SAS, R, or Excel), how to create a dissemination web site, or how to build a robust data collection system.

And finally, look at the SDMX Information Model and see how this can support your use cases. When you follow this model (we explain this model later) we are convinced that you will see that it meets many of your needs. Then, with the tools and open source code available, you will find you can implement your systems far quicker than you would have imagined.

Established Model

SDMX is, at its heart, an Information Model implemented in specific syntaxes (mainly XML). SDMX is responsive to new technologies as new syntax representations can be constructed easily as the base of these is the Information Model. For example, a new JSON format has been developed and this is gaining popularity with web developers.

We have implemented data collection, data validation, and data dissemination systems for a number of clients. We have been building SDMX tools since 2005, and we have developed and released SDMX Source (open source). Like many other organisations we have discovered that it is not the syntax, but the Information Model that is the power behind SDMX. We are pleased to share this knowledge in the explanations that follow to show how systems with the “SDMX Inside” can enable you to build solutions at lower cost, with fewer resources both initially and on an ongoing basis, and in less time than would otherwise be the case.

The Information Model is syntax and format agnostic and consequently the majority of processes and functions can be developed around the model, and not around the syntaxes. This is the essence of SDMX Source (www.sdmxsource.org) which comprises an open source toolbox for developing SDMX-based applications. Many useful tools have been developed to process SDMX (reading, writing, validation, transformation, mapping) and these can be mixed and matched in systems regardless of the author of the component in the knowledge that they all obey the same API.

SDMX Source is developed by Metadata Technology and is used by many large institutions, including Eurostat, which has adopted SDMX Source as the underlying framework for its tools and applications. Building an application on this framework ensures full SDMX compliance for both import and export of information.

ISO Standard

SDMX is here to stay. The sponsors are international organisations (BIS, ECB, Eurostat, IMF, OECD, UN, World Bank) and the SDMX standard has ISO status (International Standard 17369). SDMX is maintained actively, responding to requests for new functions through an open process.

Return on Investment

It is often said that it is difficult to justify the investment in a specific standard if it is to be used to do only one thing. SDMX can do far more than just act as a common format for exchanging statistical data and metadata. If you use SDMX as the model on which your data collection, data reporting, and data dissemination systems are built, then the benefits will roll in. The more you use the power of SDMX in your systems, the more benefit you will gain.

Brief Overview of the SDMX Information Model

The Problem Domain

Brief Problem Statement

A statistical system comprises many subsystems and components. A major issue with many computer systems, and statistical systems are no exception, is that the systems, and therefore the software, are built in silos or, at best, with tightly coupled, non-reusable software components.

Consider the following very simple process flows, one for data import to a database, and one for data dissemination.

Showing the Data Import process flow

Showing the Data Dissemination process flow

Here are some possible drawbacks in the systems currently supporting these processes.

Database Design

A database schema is designed internally which can hold all the data that the department works with. The department decides to adopt an Oracle database, and the database tables are tailored to store the specific data relevant to the domain the department works with. There is no need to support more than two languages, so each table contains specific labels for each of the two languages wherever labels appear.

There is one large database table containing observation values for each dataset. There are many linked database tables which contain further information about the observations. The linked tables are used to facilitate filtering when querying the database for data, and the information in these tables is required when constructing query results for the user.

Data Import

The department receives data from many providing organisations, and as such the data format of each dataset is dependent on the sending organisation. A data importer is written to support each data offering. In some situations mapping tables need to be defined which map the client’s classifications to the ones used internally. There is a set of validation rules defined to ensure the data conforms to what is expected, but each importer implements its own validation logic. There is also additional validation once the data is in the database.

Disseminating Data

To disseminate data, the business analysis team define which types of query are required and the development team write database queries specific to the defined use cases which extract the relevant dataset. There is no formal model for data, so the output syntax is dependent on the recipient of the data. If a new client requires a different output format, the development team either write new database queries, or they write a transformation from an existing format.

The website is built upon APIs defined by the business analysis team, with web designers and backend developers working closely to build the web pages. As more use cases evolve the development team write new APIs to support the website.

Internally, other departments within the organisation are given direct access to the database tables. These departments write their own query logic based on the table structure, in order to get the data into their own systems.

System Maintenance

This system combines the data with the metadata required to understand the data. Internal applications are written to give users access to lookup information such as classifications, and other users are allowed to modify and add information.

In some situations there is no user interface for certain types of information, so users query the database directly. In some instances the only way to modify certain types of metadata is by modifying the database directly.

What is Wrong with this Approach?

This approach provides a good example of a system which is highly coupled at almost every possible point. A highly coupled system means that changing one aspect of the system has repercussions across the system as a whole. Highly coupled systems have low resilience to change and high maintenance costs in terms of both time and money.

The first example of high coupling is the database schema itself, whose design is coupled to the datasets and use cases known at the time of design. Any change to the schema will impact the website and all outputs for each client and, as this database is queried by other departments, it will also affect all internal clients. The internal clients are also coupled to the Oracle platform.

The backend APIs are coupled to the current requirements of the website and maintenance tool. As the database queries have been built to service these APIs, they too are coupled to the current requirements. As the website is constructed by directly calling backend APIs, the website is coupled to the programming language in which the system is written. Any change to requirements impacts these APIs, and any change to the APIs impacts the website.

Each importer is coupled to each client’s data format. If the client changes their data format, this impacts the processor specific to that client. Having specific import logic per format leads to bugs in some importers which are not present in others. This in turn leads to a high testing overhead as there is little shared logic between each importer. When new import formats are required, the maintenance burden is increased.

There is no internal model, which can lead to ad hoc queries being written as required to support internal clients, external clients, the website, and the maintenance application. This leads to many APIs with duplicated logic which, over time, as the code becomes less structured, results in spaghetti code. The application becomes more complex to maintain and much more prone to the introduction of bugs, as changes to one part of the application have an undesirable impact on a seemingly unrelated part.

There is no provision to change the database platform. If the organisation comes to favour SQL Server over Oracle, there is a huge maintenance burden: all the query logic must be ported, migration scripts must be written to migrate the data, and all of the software code must be modified.

The application written to view and maintain the supporting metadata is a maintenance burden, and coupled to both the database platform and design. Introducing new types of metadata to store, or modifying current metadata requires not only database changes, but modification to the maintenance application.

There is no provision to support data for new data domains. If the department is required to store a new type of data, both code changes and database modifications are required. Due to the high coupling of the system, these changes require a lot of additional testing.

Queries to the database require a lot of table joins, which leads to poor performance. As the website, maintenance application, external and internal clients are coupled to the table structure, it is not possible to improve performance easily.

Is this Typical?

The above paints an almost apocalyptic picture of what can happen, and we are not suggesting that all of these situations are present in any one system. We have, however, observed all of these aspects in systems on which we have given consulting advice about an (SDMX) model-driven approach.

There is a different way of designing a system for collection, reporting, and dissemination of both data and metadata, and integration with data analysis tools used by the organisation. This is to use a model-driven approach and a component architecture that supports the model.

A Model-Driven Approach

By harmonising the language used to describe data, and associated metadata, it is possible to integrate disparate data sources enabling software applications to be able to access diverse datasets using a common language regardless of the software products used to store the data.

An SDMX solution introduces a generic, yet powerful, internal model of data and associated metadata. The SDMX Information Model was built by analysing the internal processes of many statistical agencies and central banks, and realising that even though each of their applications was different, they all did the same thing. Being able to describe the data and metadata supporting any statistical application in a generic way makes it possible to build generic software modules which can process data in any statistical domain in a common way.

The SDMX Information Model is a data model. It does not in itself specify behaviour (e.g. what behaviour should a system have when processing a Code), though the various specifications may include specific high level behaviour such as submitting structural metadata to an SDMX Registry.

Fundamentally, a data model specifies the scope of the system or standard in terms of:

  • Information to be shared between processes or organisations in terms of the information objects (e.g. Code) and the content of the object (e.g. code id, code label)
  • Relationships between the information objects

In order for the model to be useful it must have an implementation. For instance, there must be a way of representing a specific code list and its codes in a specific syntax such as XML. There can be, and in SDMX there is, more than one way of representing specific instances of the information objects. This is an important point and it is a major benefit of having an information model. Different syntax representations can be supported, and if the system architecture is designed well there is no need for most of the system components to be concerned with the syntax implementations: the components are built to understand the model objects, not the syntax in which these objects are imported or exported.
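As a minimal sketch of this separation, a code list can be held as plain model objects while separate writers deal with syntax. The class and function names here are our own illustrative assumptions, not part of SDMX or of any SDMX library, and the XML and JSON layouts are simplified stand-ins rather than real SDMX-ML or SDMX-JSON.

```python
import json
import xml.etree.ElementTree as ET
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Code:
    id: str
    name: str


@dataclass
class Codelist:
    id: str
    codes: list[Code] = field(default_factory=list)


def to_simple_xml(codelist: Codelist) -> str:
    """Write the model object in a simplified XML layout (not real SDMX-ML)."""
    root = ET.Element("Codelist", id=codelist.id)
    for code in codelist.codes:
        ET.SubElement(root, "Code", id=code.id).text = code.name
    return ET.tostring(root, encoding="unicode")


def to_simple_json(codelist: Codelist) -> str:
    """Write the same model object in a simplified JSON layout (not real SDMX-JSON)."""
    return json.dumps(
        {"id": codelist.id, "codes": [{"id": c.id, "name": c.name} for c in codelist.codes]}
    )


def code_names(codelist: Codelist) -> dict[str, str]:
    """A syntax-independent component: it only ever sees model objects."""
    return {c.id: c.name for c in codelist.codes}


# Illustrative content only.
freq = Codelist("CL_FREQ", [Code("A", "Annual"), Code("M", "Monthly")])
```

Only the two writer functions know about a syntax. A component such as `code_names` would be unaffected if a third syntax were added.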

This is, in essence, the model driven approach to system engineering. Clearly for such a system to work the objects in the model must be realised as objects that do have behaviour. Software components can then be built that implement this behaviour (e.g. return the Id and Name of a Code). Importantly, this behaviour is, for the most part, context free i.e. the component returning the Code Id and Code Name does not know why these pieces of information are required and has no need to know. This component is just doing its job to service Codes.

Therefore a model driven approach to system engineering results in re-usable components that are de-coupled and cohesive: the system is not brittle and it is easily maintained and enhanced.

SDMX has a Common Component Architecture based on the SDMX Information Model and an open source implementation of this architecture. This is available at www.sdmxsource.org.

The Dataflow is a pivotal construct in the SDMX Information Model: it is the construct against which data is both reported and disseminated. It makes use of the structural information defined by the Data Structure Definition, but allows further restrictions to be specified for the allowable content (via a Valid Content Constraint).

The Data Structure Definition (DSD) is a fundamental structure: it defines the valid content of a data set in terms of its dimensionality, variables, concepts, and valid content for the variables (e.g. code list or other data type).

The Provision Agreement contains information about the supply of data by one Data Provider for one Dataflow. In a data collection environment it can contain a link to a Valid Content Constraint that further constrains the allowed values that can be reported by a Data Provider. In a data dissemination environment it can link to a Registered Data Source that identifies the location of the data, how it can be retrieved (e.g. SDMX query), and the content of the data source (via an Actual Content Constraint).

Each of the Dataflows can be connected to one or more Categories and any one Category can be connected to zero or more Dataflows. This connection supports data discovery by organised topics such as Demography, Census, Health, Finance.
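The links between these constructs can be sketched as simple object references. This is an illustration under our own assumptions: the class names mirror the model's terms, but the fields are heavily reduced and the identifiers are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataStructureDefinition:
    id: str
    dimensions: tuple[str, ...]


@dataclass(frozen=True)
class Dataflow:
    id: str
    dsd: DataStructureDefinition  # a Dataflow references a DSD


@dataclass(frozen=True)
class ProvisionAgreement:
    data_provider: str  # one Data Provider ...
    dataflow: Dataflow  # ... supplying data for one Dataflow


def reporting_dimensions(agreement: ProvisionAgreement) -> tuple[str, ...]:
    """Follow Provision Agreement -> Dataflow -> DSD to find the expected dimensions."""
    return agreement.dataflow.dsd.dimensions


# Hypothetical identifiers, for illustration only.
dsd = DataStructureDefinition("DSD_EXAMPLE", ("FREQ", "SERIES", "REF_AREA"))
dataflow = Dataflow("DF_EXAMPLE", dsd)
agreement = ProvisionAgreement("PROVIDER_1", dataflow)
```

Given only the Provision Agreement against which data is reported, a component can navigate to the DSD and learn which dimensions a data set must carry.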

Some Use Cases Supported by the Model

Data Collection

Database Load

Data Dissemination

SDMX Support for Processes

| Process | Constructs | Role |
|---|---|---|
| Register data available | Provision Agreement; Data Registration | Data collection can be automated by data reporters registering the location of new data to be reported. |
| Receive/Retrieve data | Provision Agreement; Data Registration | An application can be informed of new registration and the data can be retrieved. Alternatively, the data are sent directly to the data collector (e.g. e-mail etc.). In both cases associating the data to a Provision Agreement will aid identification of the data provider. |
| Read dataset | Data Structure Definition and related Concepts and Code lists | The data can be in a variety of different formats and a reader specific to the format is required. The reader may require access to the Data Structure Definition in order to perform this function. |
| Data validation | Data Structure Definition and related Concepts and Code lists; Dataflow and related (valid) Content Constraint; Provision Agreement and (valid) Content Constraint | Validate data using the constrained DSD specification of the Dataflow. Validate data using the additional constraints of the Provision Agreement. |
| Create database tables | Data Structure Definition | Database tables can be created automatically from the metadata in the Data Structure Definition. |
| Transform/Map | Data Structure Definition and related Concepts and Code lists; Dataflow | Input data may need to be mapped in terms of its dimensionality and/or coding schemes used. |
| Discover data | Category Scheme; Concepts; Data Providers; Dataflow | To enable the building of high level data discovery allowing the user to drill down to the broad data topic of data of interest (Dataflow). |
| Query data sources | Data Structure Definition and related Concepts and Code lists; Actual Content Constraint; Hierarchical Code List | To enable the building of search criteria that will bring back the data required. |
| Query metadata sources | Metadata Structure Definition; Metadata Set | Query for, retrieve, and unite metadata in the metadata set to the data points or structural metadata points to which the metadata relate. |
| Visualise data | Data Structure Definition and related Concepts and Code lists; Metadata Structure Definition | Data and related metadata can be visualised as tables, graphs, maps. This is made possible with the structural metadata. For instance, the data is highly coded whereas the visualisation will use the code labels using the chosen language variant. Pivoting tables is simple as the logical dimensional structure is known: this logical structure is not tied to a specific representation of the data. |
| Export | Data Structure Definition and related Concepts and Code lists; Metadata Structure Definition; Metadata Set | Make the data and related metadata available in the format requested by the user. The data and metadata writer will require access to the data and metadata structures to achieve this. |
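To illustrate the "Create database tables" process listed above, the following sketch derives a table definition from a much simplified DSD. The dictionary layout, the identifiers, the column types, and the choice of SQLite are all our own assumptions; a real implementation would read the DSD from SDMX structural metadata and would also handle attributes and data types.

```python
import sqlite3

# Hypothetical, heavily simplified DSD: dimension ids, a time dimension, one measure.
dsd = {
    "id": "DSD_EXAMPLE",
    "dimensions": ["FREQ", "SERIES", "REF_AREA"],
    "time_dimension": "TIME_PERIOD",
    "measure": "OBS_VALUE",
}


def create_table_sql(dsd: dict) -> str:
    """Build a CREATE TABLE statement with one column per dimension plus the measure."""
    key_columns = dsd["dimensions"] + [dsd["time_dimension"]]
    columns = [f'"{name}" TEXT NOT NULL' for name in key_columns]
    columns.append(f'"{dsd["measure"]}" REAL')
    primary_key = ", ".join(f'"{name}"' for name in key_columns)
    return (
        f'CREATE TABLE "{dsd["id"]}" ('
        + ", ".join(columns)
        + f", PRIMARY KEY ({primary_key}))"
    )


connection = sqlite3.connect(":memory:")
connection.execute(create_table_sql(dsd))

# Illustrative observation; the value is made up.
connection.execute(
    'INSERT INTO "DSD_EXAMPLE" VALUES (?, ?, ?, ?, ?)',
    ("A", "BN_KLT_DINV_CD", "AUS", "2010", 1.5),
)
```

Because the table is generated from the DSD, supporting a new data domain in this sketch means supplying a new DSD rather than writing new schema code.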

SDMX Structures

Terminology

Maintainable

All structures submitted to or queried from a registry are maintainable structures. A maintainable may contain sub-structures, but a maintainable is the highest level of containership. A maintainable cannot live inside another structure type, and structures cannot be submitted to the registry unless they are a maintainable, or defined in a maintainable. This is a rather complex statement, which is more easily explained using a real world example:

  • A Codelist is a maintainable structure, and as such it can be submitted to the Fusion Registry.
  • A Codelist can contain codes, which are not maintainable.
  • A codelist has no parent structure type, and cannot contain other Codelists.
  • Codes cannot be submitted to the registry on their own, they must be defined inside a Codelist.

So a maintainable structure can be thought of as a container of information. The information cannot be maintained outside of its maintainable parent, hence the term Maintainable.

A maintainable structure has a reference to the maintenance agency that 'owns' the structure. It has a mandatory identifier and a mandatory version, which defaults to 1.0. The combination of structure type, agency Id, Id, and version can be used to uniquely identify any maintainable structure in SDMX.

All the maintainable structure types in SDMX are listed below:

How they link together in the model, including some of the Identifiable components of which they are composed, is shown in the diagram below. Note that this diagram does not include the Metadata Structure Definition which is shown later.

Showing a UML model of the SDMX Structure Types

Identifiable

An identifiable structure is one that can be uniquely identified in SDMX by using a URN.

A URN (Uniform Resource Name) is a specialised form of URI (Uniform Resource Identifier). A URI is a string of characters used to identify the name of a web resource. URNs are used to uniquely identify a specific SDMX artefact.

SDMX defines the URN syntax, and each URN is constructed by concatenating the structure type, maintenance agency, id, and version of the structure.

An example URN for a Codelist is given below:

Showing the URN syntax for a maintainable structure

All maintainable structures are identifiable, as they all have a structure type, agency id, id, and version. This means each maintainable structure can be uniquely identified in SDMX using the URN syntax.

There are also identifiable structures in SDMX which are not maintainable. Because they are identifiable they have a mandatory id, and because they are not maintainable they must live inside a maintainable. The maintainable parent is therefore used to derive the URN. An example of an identifiable which is not maintainable is a Code, which lives inside a Codelist. The URN syntax for a Code is shown below; it simply appends the code id to the end of the Codelist URN.

Showing the URN syntax for an identifiable structure
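As a sketch of how such URNs can be assembled, the helpers below concatenate the parts described above. The function names and the agency, id, and version values are illustrative assumptions. The `urn:sdmx:org.sdmx.infomodel.` prefix and the `agency:id(version)` layout follow the convention used in the SDMX standard, where, to our understanding, the structure-type portion of a Code URN names the Code class rather than the Codelist class.

```python
def maintainable_urn(
    package: str, structure_class: str, agency_id: str, structure_id: str, version: str = "1.0"
) -> str:
    """Concatenate structure type, agency, id, and version into a URN."""
    return (
        f"urn:sdmx:org.sdmx.infomodel.{package}.{structure_class}"
        f"={agency_id}:{structure_id}({version})"
    )


def code_urn(agency_id: str, codelist_id: str, version: str, code_id: str) -> str:
    """Derive a Code URN from its maintainable parent by appending the code id."""
    parent = maintainable_urn("codelist", "Code", agency_id, codelist_id, version)
    return f"{parent}.{code_id}"


# Illustrative values only.
codelist = maintainable_urn("codelist", "Codelist", "SDMX", "CL_FREQ", "1.0")
annual = code_urn("SDMX", "CL_FREQ", "1.0", "A")
```

The same four parts (structure type, agency id, id, version) that identify a maintainable are all that is needed to build either URN.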

Structure References

In SDMX, re-use of information is achieved through the URN mechanism, which allows structures to reference other structures. This re-use allows structures such as Codelists to be defined once and then referenced by other structures that need to use them.

The decoupling provided by the referencing mechanism means that one structure can be maintained independently of another. For example, it is possible to maintain a Data Provider (which contains a name, description, and contact details) independently of a Constraint which references that Data Provider. A Constraint is used to place restrictions on the allowable content of a Data Set, and when attached to a Data Provider the restrictions apply to any Data Set provided by that Data Provider. When the Constraint is removed, the restrictions cease to exist.

A good example of re-use can be seen in the SDMX Global Registry, which contains re-useable Codelists and Concept Schemes provided by the SDMX Agency. It is possible to define new structures which make use of these pre-defined structures by referencing them.

Category Scheme

A category scheme is a container for categories. Categories are used to classify any other structure in SDMX.

Categories are typically used to classify Dataflows. An example category hierarchy is given below.

Showing a category hierarchy used in a dissemination system

Categorisation

A categorisation links a category in a Category Scheme to any other Identifiable structure in SDMX. Typically in a dissemination environment a Categorisation would link a Category to a Dataflow. It is possible to use a single Category to classify multiple structures by using many Categorisations.

For example, the following selected Category:

Showing a category hierarchy used in a dissemination system

Could be used to link to the following 4 Dataflows, with a separate categorisation required for each link:

  1. Population by Household Status and Legal Marital Status
  2. Population by Household Status and Educational Attainment
  3. Population by Household Status and Status in Employment
  4. Population by Household Status and Size of the Locality
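The one-link-per-Categorisation idea can be sketched as follows. The Dataflow names are taken from the list above, while the class, the category id, and the Dataflow ids are hypothetical and used only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Categorisation:
    category: str  # id of the Category (hypothetical)
    target: str    # id of the categorised structure, here a Dataflow


# Hypothetical Dataflow ids mapped to the names listed above.
DATAFLOWS = {
    "DF_HH_MARITAL": "Population by Household Status and Legal Marital Status",
    "DF_HH_EDUCATION": "Population by Household Status and Educational Attainment",
    "DF_HH_EMPLOYMENT": "Population by Household Status and Status in Employment",
    "DF_HH_LOCALITY": "Population by Household Status and Size of the Locality",
}

# One Categorisation per link between the Category and a Dataflow.
categorisations = [Categorisation("HOUSEHOLD_STATUS", df_id) for df_id in DATAFLOWS]


def dataflows_for(category: str, links: list[Categorisation]) -> list[str]:
    """Return the names of the Dataflows classified under the given Category."""
    return [DATAFLOWS[link.target] for link in links if link.category == category]
```

Because each link is a separate object, a Dataflow can be added to or removed from a Category without touching either the Category Scheme or the Dataflow itself.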

Codelist

A codelist is a container for codes (classifications). Codes associate an identifier with a name and optional description.

An example codelist is shown below.

Showing an example codelist

It should be noted that code ids are unique within a Codelist, but the same code id may be used in a separate Codelist. For example, the code 'A', used to represent 'Annual' in the above example, is also used to represent 'Agriculture, forestry and fishing' in the Codelist shown below.

Showing an example codelist

Codelists are referenced by structures in order to give the structure an allowable enumeration of content. For example a Dimension can optionally reference a Codelist in order to restrict allowable content for the Dimension.

Code Hierarchies

It is possible to define a parent code for any given code in a codelist. The parent code must exist in the same codelist, and a code can only have one parent. By defining parent codes simple code hierarchies can be defined in a single Codelist.

SDMX provides a mechanism to create more complex hierarchies, which allow the same code to be used more than once and in multiple hierarchies. In addition, a complex hierarchy can include codes from multiple codelists. To create more complex hierarchies, see Hierarchical Codelist.
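A simple parent-code hierarchy inside one Codelist can be sketched like this. The class, the helper functions, and the code ids are illustrative assumptions rather than SDMX artefacts.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Code:
    id: str
    name: str
    parent: Optional[str] = None  # id of at most one parent in the same Codelist


def validate_parents(codes: list[Code]) -> None:
    """Check that every parent code exists in the same Codelist."""
    ids = {code.id for code in codes}
    missing = [code.id for code in codes if code.parent is not None and code.parent not in ids]
    if missing:
        raise ValueError(f"codes with unknown parent: {missing}")


def children_of(codes: list[Code], parent_id: str) -> list[str]:
    return [code.id for code in codes if code.parent == parent_id]


def ancestors(codes: list[Code], code_id: str) -> list[str]:
    """Walk up the single-parent chain from a code to the top of the hierarchy."""
    by_id = {code.id: code for code in codes}
    chain: list[str] = []
    current = by_id[code_id].parent
    while current is not None:
        if current in chain:
            raise ValueError("cyclic parent reference")
        chain.append(current)
        current = by_id[current].parent
    return chain


# Hypothetical geography Codelist.
codes = [
    Code("WORLD", "World"),
    Code("EUROPE", "Europe", parent="WORLD"),
    Code("FR", "France", parent="EUROPE"),
    Code("DE", "Germany", parent="EUROPE"),
]
validate_parents(codes)
```

Since each code stores a single parent id, a code cannot appear in two places in this kind of hierarchy.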

Concept Scheme

A concept scheme is a container for concepts. A concept associates an identifier with a name and optional description. The purpose of concepts is to give other structures semantic meaning.

A Data Structure Definition (DSD) provides a good example of how Concepts are used. A DSD contains one or more Dimensions. Dimensions do not have names or descriptions, instead they provide a reference to a Concept which gives this information. In order to get a human readable name for a Dimension, the referenced Concept must first be obtained.

Concepts may be referenced by multiple structures in SDMX, and therefore provide a re-useable mechanism for giving meaning to structures.

An example of a Data Structure’s decoded Dimension labels is shown in the image below.

Showing the dimensions in a Data Structure Definition (DSD). Each dimension takes its name from the Concept it references

Concepts can optionally provide a link to a Codelist. The Codelist referenced by the Concept will be used as the default allowable enumeration of values if no alternative is supplied. For example, the Concept for Frequency may reference a Codelist containing Frequency codes; a Dimension which references the Frequency Concept will then use the Frequency Codelist unless it specifies a different Codelist itself.

SDMX provides a standard list of common concepts, termed the Cross Domain Concepts. This concept scheme contains over 100 concepts, including Frequency, Reference Area, Time Period, Observation Status, and Observation Confidentiality.
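The fallback from a Dimension to the Codelist of its Concept can be sketched in a few lines. The classes and the codelist ids are hypothetical, chosen only to illustrate the rule.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Concept:
    id: str
    name: str
    default_codelist: Optional[str] = None  # optional link to a Codelist


@dataclass(frozen=True)
class Dimension:
    concept: Concept                # a Dimension takes its name from its Concept
    codelist: Optional[str] = None  # optional override of the Concept's Codelist


def dimension_name(dimension: Dimension) -> str:
    return dimension.concept.name


def effective_codelist(dimension: Dimension) -> Optional[str]:
    """Use the Dimension's own Codelist if given, otherwise the Concept's default."""
    if dimension.codelist is not None:
        return dimension.codelist
    return dimension.concept.default_codelist


# Hypothetical identifiers.
frequency = Concept("FREQ", "Frequency", default_codelist="CL_FREQ")
plain_dimension = Dimension(frequency)
overriding_dimension = Dimension(frequency, codelist="CL_FREQ_SHORT")
```

Both Dimensions are named "Frequency", but only the first inherits the Concept's default Codelist.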

Constraint

A constraint is used to further restrict allowable content for data reporting and can attach to one of the following structures:

  1. Data Structure Definition
  2. Dataflow
  3. Data Provider
  4. Provision Agreement

Data is reported against a Provision Agreement and the restrictions on the reported data are derived by merging the constraints for the Dataflow, Data Structure Definition, Data Provider and Provision Agreement. An example is given below:

Showing constraints applied at various levels of the information model

Content Constraints can restrict content in one of two ways:

  1. It can further restrict a codelist by defining a subset of permitted or restricted codes in the list.
  2. It can define full or partial series keys which are allowed or restricted.
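A rough sketch of merging constraints from several levels is given below. It covers only the first kind of restriction (permitted codes per dimension) and assumes that merging means intersecting the permitted sets; the function names and the example values are our own assumptions.

```python
def merge_constraints(*constraints: dict[str, set[str]]) -> dict[str, set[str]]:
    """Intersect the permitted codes per dimension across all supplied constraints."""
    merged: dict[str, set[str]] = {}
    for constraint in constraints:
        for dimension, permitted in constraint.items():
            if dimension in merged:
                merged[dimension] = merged[dimension] & permitted
            else:
                merged[dimension] = set(permitted)
    return merged


def is_allowed(series_key: dict[str, str], merged: dict[str, set[str]]) -> bool:
    """A key is allowed if every constrained dimension holds a permitted code."""
    return all(
        code in merged[dimension]
        for dimension, code in series_key.items()
        if dimension in merged
    )


# Hypothetical constraints attached at two levels of the model.
dataflow_constraint = {"FREQ": {"A", "Q"}}
provision_agreement_constraint = {"FREQ": {"A"}, "REF_AREA": {"AUS"}}

merged = merge_constraints(dataflow_constraint, provision_agreement_constraint)
```

With these values, annual data for AUS passes the check, while quarterly data is rejected because the Provision Agreement narrows what the Dataflow permits.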

Data Structure Definition

A Data Structure Definition (DSD) provides a template which describes the structure of related datasets in terms of their dimensionality and coding schemes.

An SDMX dataset must conform to a DSD and can only be interpreted by using the DSD and related Concepts and Codelists to decode the dataset information.

For example a series in a dataset may be defined by the following series key:

A:BN_KLT_DINV_CD:AUS

Using the DSD an application could decode this key to:

| Dimension Name (en) | Code Id | Code Name (en) |
|---|---|---|
| Frequency | A | Annual |
| Series | BN_KLT_DINV_CD | Foreign direct investment, net (BoP, current US$) |
| Reference Area | AUS | Australia |
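The decoding step can be sketched as follows. The list-of-tuples layout is a simplified stand-in for a DSD with its Concepts and Codelists already resolved, and the function name is our own; the labels are the ones from the example above.

```python
# Simplified stand-in for a DSD: ordered dimensions, each with its code labels.
DSD_DIMENSIONS = [
    ("Frequency", {"A": "Annual"}),
    ("Series", {"BN_KLT_DINV_CD": "Foreign direct investment, net (BoP, current US$)"}),
    ("Reference Area", {"AUS": "Australia"}),
]


def decode_series_key(key: str, dimensions: list) -> list[tuple[str, str, str]]:
    """Split a colon-separated series key and look up each code in dimension order."""
    code_ids = key.split(":")
    if len(code_ids) != len(dimensions):
        raise ValueError("series key does not match the number of dimensions")
    return [
        (name, code_id, codes[code_id])
        for (name, codes), code_id in zip(dimensions, code_ids)
    ]


decoded = decode_series_key("A:BN_KLT_DINV_CD:AUS", DSD_DIMENSIONS)
```

The key is meaningless on its own: its positions only acquire names and labels through the order of the dimensions in the DSD.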

An example DSD (with Ids decoded to their English labels) is shown below.

Dataflow

A Dataflow simply has an id, provides a name, and references a Data Structure Definition (DSD).

Data is disseminated against a Dataflow and must conform to the template defined by the DSD that the Dataflow references.

Having this additional layer allows multiple Dataflows to reference the same DSD (template) but each may define a different type of data. For example, the European Central Bank (ECB) has a Balance Sheet Items DSD which is referenced by three Dataflows: Balance Sheet Items; Balance Sheet Items Statistics (tables 2 to 5 of the Blue Book); and Key euro area indicators (BSI).

Data is also reported (indirectly) against a Dataflow, so it is possible to impose different restrictions on different Dataflows using Constraints.

Hierarchical Codelist

A Hierarchical Codelist (HCL) supports the construction of complex classification hierarchies. An HCL does not define any new classifications (Codes); instead, it references existing Codes inside one or more Codelists.

A simple hierarchy of codes can be created in a Code List and this hierarchy can be visualised to assist in creating a data selection.

The principal restrictions on the hierarchy in a Code List are:

  • there can be only one hierarchy, though there is no limit to the number of levels in the hierarchy
  • a code can have only one parent
  • the levels cannot be named

The SDMX Information Model also has a construct called the Hierarchical Code List (HCL). This removes these restrictions. The principal features of the HCL are:

  • multiple hierarchies can be built
  • a code can be used in more than one hierarchy
  • perhaps most important from a data dissemination perspective, the hierarchies can be built from multiple Code Lists, thus allowing codes to be introduced such as grouping codes (e.g. continents or economic communities that group countries taken from a geography code list)

The Information Model for the HCL is shown schematically below.

The HCL can have one or more Hierarchies each comprising Levels (optional) and Hierarchical Codes. The Hierarchical Code can be linked to a Level in the Hierarchy. Note that the Level is used to give semantic information about the Level and does not control the physical hierarchy which is specified by the hierarchy of Hierarchical Codes.

The Hierarchical Code references a Code in a Code List. This Code List can contain a flat list of Codes or a simple hierarchy of Codes. The Hierarchy in the HCL need not reflect the hierarchy in the Code List, as the Hierarchical Code references the Code and places it in the context of its position in the hierarchy of Hierarchical Codes.
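The way Hierarchical Codes reference Codes from more than one Code List can be sketched as follows. The class, the helper functions, and the Code List and code ids are hypothetical; Levels are left out for brevity.

```python
from dataclasses import dataclass, field


@dataclass
class HierarchicalCode:
    codelist_id: str  # the Code List that owns the referenced Code
    code_id: str      # the referenced Code itself
    children: list["HierarchicalCode"] = field(default_factory=list)


def walk(node: HierarchicalCode, depth: int = 0):
    """Yield (depth, codelist id, code id) for a node and all its descendants."""
    yield depth, node.codelist_id, node.code_id
    for child in node.children:
        yield from walk(child, depth + 1)


def codes_from(node: HierarchicalCode, codelist_id: str) -> list[str]:
    """List the codes in a hierarchy that come from one particular Code List."""
    return [code for _, owner, code in walk(node) if owner == codelist_id]


# Grouping codes come from one Code List, countries from another (ids are hypothetical).
europe = HierarchicalCode(
    "CL_GROUPS",
    "EUROPE",
    [HierarchicalCode("CL_AREA", "FR"), HierarchicalCode("CL_AREA", "DE")],
)
# The same country Code is re-used in a second hierarchy.
euro_area = HierarchicalCode("CL_GROUPS", "EURO_AREA", [HierarchicalCode("CL_AREA", "FR")])
```

Neither hierarchy alters the two Code Lists: the Hierarchical Codes only point at Codes that already exist.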

The HCL can be an extremely useful construct in a data dissemination system as it can introduce grouping codes and hierarchies that aid data discovery and data query. Importantly, the HCL can be built by organisations that do not maintain the Code Lists used by the DSDs which specify the structure of the datasets. In other words, an HCL can be added to the dissemination system without interfering with or changing existing structural metadata that control the content or structure of a data set.

However, whilst the HCL is a part of the SDMX Information Model there is no standard way of relating the HCL to a construct such as a Dimension or Dataflow and it is left to individual systems to make and use these links.

The example below shows the use of an Annotation to create the link to a dataflow.

Here the Annotation Title is used to specify the SDMX URN of the HCL to be used for the Dataflow and Annotation Type identifies the Dimension for which the HCL is to be used.
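
One way a system might resolve such a link is sketched below. The dictionary layout, the `DF_EXAMPLE` dataflow, the `SERIES` dimension id and the URN value are all hypothetical; only the convention itself (Annotation Title holds the HCL URN, Annotation Type holds the Dimension id) comes from the description above.

```python
from typing import Optional

# Hypothetical dataflow carrying annotations; all identifiers are illustrative.
dataflow = {
    "id": "DF_EXAMPLE",
    "annotations": [
        {
            "type": "SERIES",  # Dimension the HCL applies to
            "title": "urn:sdmx:org.sdmx.infomodel.codelist."
                     "HierarchicalCodelist=EXAMPLE:HCL_TOPICS(1.0)",
        },
    ],
}


def hcl_urn_for_dimension(flow: dict, dimension_id: str) -> Optional[str]:
    """Return the HCL URN linked to a Dimension, or None if there is no link."""
    for annotation in flow.get("annotations", []):
        if annotation.get("type") == dimension_id:
            return annotation.get("title")
    return None


print(hcl_urn_for_dimension(dataflow, "SERIES"))
print(hcl_urn_for_dimension(dataflow, "REF_AREA"))
```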

The use of the SDMX constructs is shown below.

An example of displaying this HCL in a dissemination system is shown below.

Note that the codes taken from the Topic codelist are shown in the GUI as being not selectable for query, as they do not exist in the DSD and consequently there are no data series. The Hierarchical Code used in this example is used solely for grouping the codes in the Series Code List.

Metadata Structure Definition

A Metadata Structure Definition (MSD) is used to define a template for reporting and disseminating Reference Metadata.

Reference Metadata takes many forms and is found everywhere. It is authored in many ways using different tools and is consequently stored in many forms. It is often not held in a centralised, accessible resource, so linking the metadata at a granular level to the construct or data slice to which it relates can be challenging.

Two major requirements that stem from these challenges are:

  1. The semantics of the metadata must be capable of being identified.
  2. The “object” of the metadata (e.g. data point, a structural construct such as a specific Code, a particular context of data such as education statistics for Canada) must be capable of being identified unambiguously.

In order to support these two requirements SDMX has a generic and powerful model for supporting this type of metadata. Whilst this supports the metadata requirements, the model can be difficult to understand as it is necessarily quite abstract. So, the way it works is best explained by means of an example.

Example

Consider the following metadata, which is taken from a live dissemination system which is re-publishing population data originally published by the OECD.

This is metadata that is collected as part of a data quality framework. It is not an intrinsic part of the data but relates to the data. It is held in a metadata repository. How is this defined and made available as SDMX?

The following diagram shows the metadata set and how it is related to the metadata structure definition.

The MSD specifies the types of construct to which the reported metadata (in a metadata set) can relate, and the structure (Report Structure) of the metadata (Metadata Report) when reported in a metadata set, in terms of the Metadata Attributes for which metadata can be reported.

In the example the MSD supports reporting metadata for either a partial key (i.e. one or more dimension values) or a specific code. These are the Metadata Targets in the MSD and can take a coded representation if the allowed values are specific to one Code List or Concept Scheme or other type of SDMX scheme.

A separate Metadata Report is required for each Target. So in the example there will be two reports – one for the Code and one for the partial key (population for Germany). Note that both reports can contain metadata for the same set of Metadata Attributes: this is achieved by linking the Report Structure in the MSD to both of the Metadata Targets.
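
The relationship between Report Structure, Metadata Targets and Metadata Reports can be sketched as follows. This is a simplified illustration, not the SDMX class model: the class names, the `QUALITY_REPORT` identifier, the attribute ids and the target ids are assumptions, and the metadata values are placeholders.

```python
from dataclasses import dataclass, field


@dataclass
class ReportStructure:
    id: str
    metadata_attributes: list[str]
    target_ids: list[str]  # Metadata Targets this report can be attached to


@dataclass
class MetadataReport:
    target_id: str
    target_value: dict[str, str]  # what the metadata is about
    values: dict[str, str] = field(default_factory=dict)


def make_report(structure: ReportStructure, target_id: str,
                target_value: dict[str, str],
                values: dict[str, str]) -> MetadataReport:
    """Build a report, rejecting targets or attributes the MSD does not allow."""
    if target_id not in structure.target_ids:
        raise ValueError(f"target {target_id!r} not linked to {structure.id}")
    unknown = set(values) - set(structure.metadata_attributes)
    if unknown:
        raise ValueError(f"unknown metadata attributes: {sorted(unknown)}")
    return MetadataReport(target_id, target_value, values)


# One Report Structure linked to both Metadata Targets.
quality = ReportStructure(
    id="QUALITY_REPORT",
    metadata_attributes=["SOURCE", "COMMENT"],
    target_ids=["PARTIAL_KEY", "CODE"],
)

# One report per target, both drawing on the same Metadata Attributes.
reports = [
    make_report(quality, "PARTIAL_KEY",
                {"SUBJECT": "POPULATION", "REF_AREA": "DE"},
                {"SOURCE": "placeholder text"}),
    make_report(quality, "CODE",
                {"codelist": "CL_AREA", "code": "DE"},
                {"COMMENT": "placeholder text"}),
]
print([r.target_id for r in reports])
```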

In a dissemination system it is common to first indicate the presence of metadata by placing an 'i' next to the text to which the metadata relates; when the user clicks on the 'i' the metadata is revealed.

Metadataflow

This is like a Dataflow but for metadata. The utility of the Metadataflow is identical to that of the Dataflow but relates to reference metadata instead of data. It links to a Metadata Structure Definition and the (meta)Data Provider via the Provision Agreement. For instance, the Provision Agreement can contain a (metadata) registration containing details of the location of the metadata.

Organisation Scheme

There are four types of scheme: maintenance agency, data provider, data consumer, and organisation unit. All except the organisation unit scheme are used by SDMX systems to control access to data reporting, data query, and maintenance of structural metadata.

Process

Describes and defines a process in terms of the Process Steps, the inputs and outputs of each Process Step, and the computations that are performed.

Provision Agreement

Links a single Data Provider with a single Dataflow.

In a data collection scenario information can be held for the Provision Agreement such as the allowable (Dimension or Attribute) values that can be reported by the Data Provider (Constraint) and the location from where the data set can be retrieved (Registration).

In a data dissemination scenario information can be held for the Provision Agreement such as the location from where data published by the Data Reporter can be retrieved or queried (Registration), and the actual series key values in the data source (Constraint attached to the Registration). The latter can be used to determine the content of a data source and this can be used to prevent a user querying for data that does not exist in the data source, and to broker queries to the data sources that actually have the data thus supporting a data portal.
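
The brokering idea in the dissemination scenario can be illustrated with a small sketch. The registration layout, provider names, URLs and series keys below are all hypothetical, and a query is reduced to a tuple in which `None` stands for "any value".

```python
# Each registration records where data can be retrieved and, via its
# Constraint, which series keys the source actually holds.
# Providers, URLs and keys are placeholders.
registrations = [
    {"provider": "PROVIDER_A", "url": "https://example.org/a",
     "series_keys": {("A", "DE", "POP"), ("A", "FR", "POP")}},
    {"provider": "PROVIDER_B", "url": "https://example.org/b",
     "series_keys": {("A", "CA", "POP")}},
]


def matches(key: tuple, query: tuple) -> bool:
    """A query position of None is a wildcard for that Dimension."""
    return all(q is None or q == k for k, q in zip(key, query))


def sources_for(query: tuple) -> list[str]:
    """Return the URLs of sources holding at least one matching series."""
    return [r["url"] for r in registrations
            if any(matches(k, query) for k in r["series_keys"])]


print(sources_for((None, "DE", None)))  # only the first source has DE
print(sources_for((None, "JP", None)))  # no source: the query can be refused
```

An empty result lets the portal refuse the query up front, while a non-empty result tells it which sources to forward the query to.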

Reporting Taxonomy

This is a special type of category scheme whose purpose is to link to a set of data structures that describe the data tables in a statistical publication or data reporting scenario.

Structure Set

A set of structure maps and item scheme maps (code list, category scheme, concept scheme, organisation scheme). The content of these maps is a definition of the correspondence between a source and target scheme or source and target data structure. The purpose of the mapping is to support the collection or dissemination of data or metadata when the dimensionality of the data in the source (e.g. internal) and target (e.g. agreed/harmonised data reporting or query) systems is different, or the coding schemes are different.
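
As a rough illustration of what such maps do, the sketch below translates a series key from an internal structure to a harmonised one. The dimension ids, codes and dictionary layout are assumptions for the example and do not reproduce the SDMX Structure Set syntax.

```python
# Hypothetical code list map: internal area codes -> harmonised codes.
area_map = {"GER": "DE", "FRA": "FR"}

# Hypothetical structure map: source Dimension id -> target Dimension id.
dimension_map = {"COUNTRY": "REF_AREA", "PERIODICITY": "FREQ"}

# Code list maps are looked up by source Dimension id.
codelist_maps = {"COUNTRY": area_map}


def map_key(source_key: dict[str, str]) -> dict[str, str]:
    """Translate Dimension ids and coded values; unmapped items pass through."""
    target = {}
    for dim, value in source_key.items():
        target_dim = dimension_map.get(dim, dim)
        target_value = codelist_maps.get(dim, {}).get(value, value)
        target[target_dim] = target_value
    return target


print(map_key({"COUNTRY": "GER", "PERIODICITY": "A"}))
```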

Registration

A Data Provider can register the location of data and metadata for a specific Provision Agreement. This enables systems to discover that new data are available either in support of a data collection process or a data dissemination process.

Data

Schematic of the Information Model

The dataset must reference a Data Structure Definition (DSD) either directly or via a Dataflow or Provision Agreement.

Logically the Dataset comprises Series and the Series comprises Observations. Observations are associated with a Time Period for time series or with any other Dimension (e.g. geography) for non-time series.

Attributes can relate to the Dataset, the Series, an Observation or to a Group. A Group is a partial series key i.e. the value of one or more Dimensions comprising a sub set of the full key.

Depending on the physical format of the datasets the associations depicted by the green lines can be either containers (e.g. Dataset contains Series) or references (e.g. Series references the Dataset).
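
The logical model can be sketched as a handful of Python classes. This is an illustration under stated assumptions: the class layout and the component ids are invented for the example, observation-level attributes are omitted for brevity, and the merge order in `attributes_for` is a convenience (in SDMX each Attribute is attached at one level, so no conflicts are expected).

```python
from dataclasses import dataclass, field


@dataclass
class Series:
    key: dict[str, str]  # full series key
    attributes: dict[str, str] = field(default_factory=dict)
    observations: dict[str, float] = field(default_factory=dict)  # period -> value


@dataclass
class Group:
    partial_key: dict[str, str]  # values for a subset of the Dimensions
    attributes: dict[str, str] = field(default_factory=dict)


@dataclass
class Dataset:
    attributes: dict[str, str] = field(default_factory=dict)
    groups: list[Group] = field(default_factory=list)
    series: list[Series] = field(default_factory=list)


def attributes_for(dataset: Dataset, series: Series) -> dict[str, str]:
    """Collect Dataset, matching Group and Series attributes for one series."""
    merged = dict(dataset.attributes)
    for group in dataset.groups:
        # A Group applies when its partial key is contained in the series key.
        if group.partial_key.items() <= series.key.items():
            merged.update(group.attributes)
    merged.update(series.attributes)
    return merged


ds = Dataset(
    attributes={"TITLE": "example"},
    groups=[Group({"REF_AREA": "DE"}, {"UNIT": "PERSONS"})],
)
de = Series({"FREQ": "A", "REF_AREA": "DE"}, {"DECIMALS": "0"})
fr = Series({"FREQ": "A", "REF_AREA": "FR"})
ds.series.extend([de, fr])

print(attributes_for(ds, de))
print(attributes_for(ds, fr))
```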

Implementing SDMX Formats

SDMX has many dataset formats in order to support the various use cases. This could seem to be quite confusing and difficult to implement, but choosing the correct architecture will not only simplify this but it will make it extensible to possible future formats for data and also for non-SDMX formats such as CSV and Excel.

Sdmx Source is an implementation of the SDMX Common Component Architecture developed by Metadata Technology and made available for all to use. The architecture is based on, and is a faithful representation of, the SDMX Information Model. This model is syntax and format independent, and adherence to this architecture makes it easy to support a wide variety of formats.

Use Cases

The various use cases and the names of the formats in the two major versions of SDMX (2.0 and 2.1) are shown in the table below.

| Use Case | Version 2.0 | Version 2.1 |
|---|---|---|
| Single data set XML schema supports all Data Structure Definitions (DSD) | Generic (restricted to time series) | SDMXDataGenericTimeSeries (restricted to time series); SDMXDataGeneric (supports both time series and non-time series) |
| Time series data set specific to a single DSD; format can be expressed as XML schema | Compact (restricted to time series) | SDMXDataStructureSpecificTimeSeries (restricted to time series) |
| Non-time series data set specific to a single DSD; format can be expressed as XML schema | Cross Sectional | SDMXDataStructureSpecific (supports both time series and non-time series) |
| Validation of dataset using schema validation | Utility | Discontinued |
| Users of GESMES/TS | SDMX-EDI (restricted to time series) | SDMX-EDI (restricted to time series) |
| Web data dissemination; combines data and related structural metadata and links to reference metadata | Not supported | SDMX-JSON (supports any type of SDMX data) |

The major difference between the “generic” formats and the “structure specific” formats is shown below.

Data Formats

Introduction

This section describes the various formats for a Dataset in the SDMX specification. The example for each format described in this section uses data taken from part of the table shown in the screenshot below.

The data shown in the above image is based on the Data Structure Definition and codelists shown in the image below. Note that each dataset (with the exception of SDMX-JSON) only carries with it the coded Identifiers, not the code names. Therefore in order to construct a human readable output the application must have access to the code and concept names which are maintained in the Structural Metadata.

Generic Format

Characteristics

  1. There is a single XML schema that supports all datasets regardless of the Data Structure Definition (DSD) that defines the allowable content.
  2. The original use case for this format was to enable systems to process the data set in a generic way using a single schema. This was useful for processing with XSLT.
  3. The schema validation for this format is very weak as the actual data content (series keys, attributes, measures) cannot be validated by the schema processor. The validation requires access to the DSD.
  4. This format is the most verbose of all of the SDMX data formats.
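
To give a feel for generic processing, the sketch below reads a simplified fragment laid out in the style of the version 2.1 Generic format. The namespaces and message envelope are deliberately omitted, and the component ids and values are made up, so this is an illustration and not a complete SDMX-ML reader.

```python
import xml.etree.ElementTree as ET

# Simplified Generic-style fragment: namespaces and envelope omitted,
# component ids and values are illustrative.
fragment = """
<Series>
  <SeriesKey>
    <Value id="FREQ" value="A"/>
    <Value id="REF_AREA" value="DE"/>
  </SeriesKey>
  <Attributes>
    <Value id="UNIT_MULT" value="0"/>
  </Attributes>
  <Obs>
    <ObsDimension value="2010"/>
    <ObsValue value="100.0"/>
  </Obs>
</Series>
"""

series = ET.fromstring(fragment)

# The element names say what each item is, so no DSD lookup is needed here.
series_key = {v.get("id"): v.get("value") for v in series.find("SeriesKey")}
attributes = {v.get("id"): v.get("value") for v in series.find("Attributes")}
observations = {
    obs.find("ObsDimension").get("value"): float(obs.find("ObsValue").get("value"))
    for obs in series.findall("Obs")
}

print(series_key, attributes, observations)
```

The same code works for any DSD, which is the point of the format, but every component costs a full XML element, which is where the verbosity comes from.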

Example

Our View

In order to carry out full validation, access to the DSD is required even when the format is “structure specific”. Therefore, from the perspective of validation, it does not matter that the Generic format has little inherent support for validation.

The benefit of using this format over the other (less verbose) formats is largely outdated as systems, components, and tools use a more programmatic approach to processing datasets. However, if XSLT is a part of the processing in a system (which may be the case if the datasets to be processed are small) then the Generic format may be the format of choice.

Structure Specific

Characteristics

This format combines the characteristics of the Compact and Cross Sectional structures in V 2.0 of the SDMX standard.

  1. Except for the TimeSeries variant of this format, any one of or ALL of the Dimensions can be placed at the level of the observation (thus supporting cross-sectional data). The TimeSeries variant of this format mandates that this is TIME_PERIOD.
  2. Multiple measures (specified in the Measure Dimension) can be contained in the Observation.
  3. This is the most terse XML format for SDMX data.
  4. An XML schema can be generated from the DSD and Content Constraints.
  5. The XML schema can be generated to represent the allowed content (Constraint) at the level of the DSD, or Dataflow, or Provision Agreement.
  6. The generated schema contains the allowed content for each of the Dimensions, Attributes, and Measures as specified in the DSD and any Content Constraint.
  7. However,
    • The schema generation rules place all of the Dimension and Attribute components in XML attributes.
    • There is no distinction in the dataset between the (XML) attributes that are Dimensions and those that are Attributes.
    • Because these XML attributes can be placed in various (XML) elements in the dataset the schema generation rules mandate that all of (XML) attributes generated from DSD Attributes are conditional.
  8. The generated schema can be used to validate the dataset using a generic schema validation tool. However, it cannot validate that:
    • All mandatory attributes are present (the presence of these depends on whether the dataset is creating new data, updating existing data (and attributes) or deleting data).
    • There are no duplicate observations.
  9. The consequence of (6) and (7) is that often access to the DSD is required when processing the dataset.
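
The need for the DSD when processing can be seen in a small sketch. The fragment below is a simplified Structure Specific-style series (namespaces and envelope omitted, ids and values made up), and `dsd_dimensions` stands in for what a real DSD would supply.

```python
import xml.etree.ElementTree as ET

# Simplified fragment: namespaces and the message envelope are omitted,
# and the component ids and values are illustrative.
fragment = """
<Series FREQ="A" REF_AREA="DE" UNIT_MULT="0">
  <Obs TIME_PERIOD="2010" OBS_VALUE="100.0" OBS_STATUS="A"/>
  <Obs TIME_PERIOD="2011" OBS_VALUE="101.5" OBS_STATUS="A"/>
</Series>
"""

# What the DSD would tell us (hypothetical DSD content).
dsd_dimensions = ["FREQ", "REF_AREA"]

series = ET.fromstring(fragment)

# Nothing in the XML says which attributes are Dimensions, so the DSD decides.
series_key = {d: series.attrib[d] for d in dsd_dimensions}
series_attributes = {k: v for k, v in series.attrib.items()
                     if k not in dsd_dimensions}
observations = {o.get("TIME_PERIOD"): float(o.get("OBS_VALUE"))
                for o in series.findall("Obs")}

print(series_key, series_attributes, observations)
```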

Example

This example shows TIME_PERIOD at the level of the observation.

Our View

This is the most versatile and most popular data format for those organisations using version 2.1 of the SDMX standard.

As schema validation is relatively slow and non-streaming and as the validation cannot trap all errors, the trend is for a validation engine to implement its own validation logic (including the ability to accept and output streamed data) by accessing the DSD, Dataflow, Provision Agreement and associated Constraints.
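
A minimal sketch of such validation logic is shown below. It covers only the two checks that schema validation cannot perform (mandatory attributes and duplicate observations), reads the observations in a single pass so that it can accept a stream, and assumes the dataset is reporting new data. The record layout and component ids are hypothetical.

```python
def validate(observations, mandatory_attributes):
    """Return a list of error strings for an iterable of observations.

    Each observation is a dict with 'key' (tuple of dimension values
    including the time period), 'value' and 'attributes'.
    """
    errors = []
    seen = set()
    for obs in observations:  # single pass, so a generator works too
        key = obs["key"]
        if key in seen:
            errors.append(f"duplicate observation {key}")
        seen.add(key)
        missing = [a for a in mandatory_attributes
                   if a not in obs["attributes"]]
        if missing:
            errors.append(f"{key} is missing mandatory attributes {missing}")
    return errors


# Illustrative records: the second repeats a key, the third lacks OBS_STATUS.
sample = [
    {"key": ("A", "DE", "2010"), "value": 100.0, "attributes": {"OBS_STATUS": "A"}},
    {"key": ("A", "DE", "2010"), "value": 100.2, "attributes": {"OBS_STATUS": "A"}},
    {"key": ("A", "DE", "2011"), "value": 101.5, "attributes": {}},
]

for problem in validate(sample, mandatory_attributes=["OBS_STATUS"]):
    print(problem)
```

A real engine would take the list of mandatory attributes and the allowed values from the DSD and Constraints instead of from a hard-coded argument.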

The SdmxSource contains the components to build a validation engine.

Compact

Characteristics

  1. Supports Time Series only.
  2. The format is similar to the version 2.1 Structure Specific format where TIME_PERIOD is the Dimension associated at the level of the observation – i.e. the Series contains all of the Dimensions except TIME_PERIOD, and the TIME_PERIOD is iterated with the observation value at the observation level.
  3. An XML schema can be generated but, if this is generated from a V 2.0 DSD then this is not as powerful as the schema that can be generated for a version 2.1 DSD and associated Constraints – as the Constraints are not well supported at version 2.0.
  4. The same validation restrictions apply as for version 2.1 Structure Specific plus the fact that all XML attributes are conditional and so the presence of mandatory SDMX Attributes and Dimensions cannot be validated by the schema processor.

Example

Our View

This is the most versatile and most popular data format for those organisations using version 2.0 of the SDMX standard.

Cross Sectional

Characteristics

Supports a non-time series representation of the data (though time can be present in the dataset).

The placement of the Dimensions and Attributes (at the group, series, or observation level) is specified in the DSD. This can even be a “flat” representation where all Dimensions and Attributes are presented at the same level as the observation value.

Multiple measures can be specified in the DSD though this is rather complex, as the DSD must declare explicitly the measures and map each to the relevant code list.

Example

Our View

The flexibility offered by the version 2.0 cross sectional format was offset by the complexity of the specification in the DSD. Except for some extreme use cases the format has been superseded by version 2.1 Structure Specific.

This format is not well supported in software tools except for very simple use (e.g. like that shown in the example). For this reason it is recommended not to use this format and to use the version 2.1 Structure Specific format.

Utility

Characteristics

  1. Similar to Compact but restricted to new data sets
  2. Mandatory SDMX Attributes and Dimensions are made mandatory XML attributes
  3. The most strict type of data format of all the schemas that support validation

Example

None. This format has been deprecated and little or no support for it exists in current tools.

Our View

Do not use this. The schema validation is not 100% and it is restricted to new data only. It is not supported at SDMX version 2.1. Better validation tools exist.

SDMX-EDI

Characteristics

  1. Supports Time Series only.
  2. Uses the UN/EDIFACT syntax – a format from the 1990s that pre-dated XML.
  3. Is extremely terse
  4. No generic software for validation, transformation, authoring, editing, navigating, such as that available for XML

Example

Our View

Whilst the syntax is a 1990s syntax, the format is still popular for reporting financial data in the Central Banking community.

However, unless you have a specific requirement to use this format, it is not recommended, as the syntax is largely superseded by 21st century syntaxes such as XML and JSON.

If you do need to use the format, there are validation and transformation tools that support SDMX-EDI, and SdmxSource has SDMX-EDI readers and writers.

SDMX-JSON

Characteristics

  1. JSON stands for JavaScript Object Notation. SDMX-JSON is a representation of the dataset in this notation that can be processed directly by JavaScript applications.
  2. It was developed by technical specialists under the auspices of the SDMX Technical Working Group.
  3. It is designed to be both terse and simple to process by JavaScript programmers, and therefore for use in web-based applications such as data dissemination.
  4. The textual representation of the dataset components (Concepts and Codes) is presented in addition to its coded form. Therefore there is no need to access independently the DSD in order to process the datasets for visualisation.
  5. The current status is that the format has been implemented by a number of organisations (including Metadata Technology) and is going through acceptance in the SDMX standards process.
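
The way the coded and textual forms travel together can be illustrated with a cut-down message. The sketch below loosely follows the flat SDMX-JSON layout in which observations are keyed by positions into the dimension value lists; the message is heavily simplified (no header, no attributes) and its ids, names and values are made up.

```python
import json

# Cut-down, illustrative message: header and attributes are omitted.
message = json.loads("""
{
  "dataSets": [
    {"observations": {"0:0": [100.0], "0:1": [101.5], "1:0": [64.0]}}
  ],
  "structure": {
    "dimensions": {
      "observation": [
        {"id": "REF_AREA", "name": "Reference area",
         "values": [{"id": "DE", "name": "Germany"},
                    {"id": "FR", "name": "France"}]},
        {"id": "TIME_PERIOD", "name": "Time period",
         "values": [{"id": "2010", "name": "2010"},
                    {"id": "2011", "name": "2011"}]}
      ]
    }
  }
}
""")

dimensions = message["structure"]["dimensions"]["observation"]


def decode(index_key: str) -> dict[str, str]:
    """Turn an index key such as '0:1' into {dimension name: code name}."""
    positions = [int(p) for p in index_key.split(":")]
    return {dim["name"]: dim["values"][pos]["name"]
            for dim, pos in zip(dimensions, positions)}


for key, obs in message["dataSets"][0]["observations"].items():
    print(decode(key), obs[0])
```

Everything needed to label the values is inside the message itself, so no separate call for the DSD or code lists is required for display.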

Example

Snippet of observation values
Snippet of structural metadata values

SDMX Reference Metadata

Schematic of the Information Model

The Metadata Set must reference a Metadata Structure Definition (MSD) either directly or via a Metadataflow or Provision Agreement.

Logically the Metadata Set comprises Metadata Reports and the Metadata Reports comprise Metadata Attributes. The Metadata Report also comprises the components required to identify the object to which the metadata relate (Metadata Target). This is typically a partial key of a series, a full key of a series, or the agency, version, and id of a structural component such as a Code or Concept.

The Metadata Attributes relate to the Metadata Target.

As these metadata (called reference metadata in SDMX) are typically authored or reported on a different timescale from the data, and by metadata experts rather than data experts, they are often stored in a different database from the data, usually known as a “metadata repository”.

SDMX Formats

SDMX has two Metadata Set formats in order to support two use cases.

Use Cases

The various use cases and the names of the formats in the two major versions of SDMX (2.0 and 2.1) are shown in the table below.

| Use Case | Version 2.0 | Version 2.1 |
|---|---|---|
| Single metadata set XML schema supports all Metadata Structure Definitions (MSD) | Generic Metadata | Generic Metadata |
| Metadata set specific to a single MSD | Metadata Report | Metadata Report |

The major difference between the “generic” formats and the “structure specific” formats is shown below.

Format Choices

Example Metadata Set

In all of the examples below the data in the following table is used.

Generic Format

Characteristics

  1. There is a single XML schema that supports all Metadata Sets regardless of the MSD that defines the allowable content.
  2. The schema validation for this format is very weak as the actual metadata content (Metadata Attributes) cannot be validated by the schema processor. The validation requires access to the MSD.
  3. However, as the metadata are, in the main, textual this is not a serious weakness.
  4. This format is the most verbose of all of the SDMX metadata formats.

Our View

In general, reference metadata is textual and the metadata sets are small in size compared to the dataset. Therefore, there is little value in using an MSD-specific format. The generic format is easy to create and to use in combination with the MSD that describes its structure.

Structure Specific

Characteristics

  1. An XML schema can be generated from the MSD and Content Constraints.
  2. The XML schema can be generated to represent the allowed content (Constraint) at the level of the MSD, or Metadataflow, or Provision Agreement.
  3. The generated schema contains the allowed content for the Metadata Attributes, as specified in the MSD and any Content Constraint.
  4. The generated schema can be used to validate the Metadata Set using a generic schema validation tool.

Example

Unlike data, the message size is not a factor for reference metadata which tends to be measured in kilobytes and not megabytes.

Whilst the MSD-specific schema will validate the content of coded Metadata Attributes or length restrictions for text, the trend for a modern validation engine is to implement its own validation logic by accessing the MSD, Dataflow, Provision Agreement and associated Constraints.

Therefore there is no real differentiating factor that would point to a preference between the generic and the MSD-specific format.