Fusion Registry 2020-2022 Development Roadmap

The Fusion Registry development roadmap focuses on the three core themes of
  • VTL
  • Big Data, and
  • Microdata

1. VTL Execution Engine (Proof of Concept)
An initial implementation of a stand-alone VTL execution engine based on the VTL 2.0 specification, to prove the concept and demonstrate that it can satisfy the most common use cases. The engine will be designed to interface with Fusion Registry as a repository for VTL code, for access to SDMX structural metadata, and as a service for performing data validations and transformations. A subset of the most commonly used VTL functions will be supported.
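To illustrate the kind of work the engine performs, the sketch below evaluates a VTL-style validation rule (in the spirit of `check(OBS_VALUE >= 0)`) over a small dataset. The rule, dataset layout and component names are illustrative assumptions for this sketch, not the actual engine design.

```python
# Illustrative stand-in for evaluating a VTL-like validation rule over
# a dataset of observations. Component names (REF_AREA, TIME_PERIOD,
# OBS_VALUE) are conventional SDMX-style identifiers used as assumptions.

def check_non_negative(dataset, measure="OBS_VALUE"):
    """Return the observations that fail a 'check(OBS_VALUE >= 0)'-style rule."""
    return [obs for obs in dataset if obs[measure] < 0]

observations = [
    {"REF_AREA": "UK", "TIME_PERIOD": "2020-Q1", "OBS_VALUE": 12.5},
    {"REF_AREA": "FR", "TIME_PERIOD": "2020-Q1", "OBS_VALUE": -3.0},
]

failures = check_non_negative(observations)  # the FR observation fails
```

A real engine would parse VTL source into an expression tree and evaluate it against SDMX Dataflows; this sketch only shows the validation outcome such a rule produces.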

2. Apache Avro
The first task in the Big Data strategy is to add support for SDMX data in Apache Avro - an open and well-established framework for efficiently serialising, exchanging and storing structured data (microdata and aggregated data). It works natively with Hadoop and is ideally suited to large datasets where other formats such as XML would prove too verbose.
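Avro datasets are described by a JSON schema that drives compact binary serialisation. The record schema below is an illustrative sketch of how an SDMX-like observation might be described; the field names are assumptions, not a published SDMX-in-Avro mapping. A library such as `fastavro` would use a schema like this to serialise records; here it is simply parsed with the standard library.

```python
import json

# An illustrative Avro record schema for an SDMX-style observation.
# The union type ["null", "double"] permits missing observation values.
schema_json = """
{
  "type": "record",
  "name": "Observation",
  "fields": [
    {"name": "REF_AREA", "type": "string"},
    {"name": "TIME_PERIOD", "type": "string"},
    {"name": "OBS_VALUE", "type": ["null", "double"], "default": null}
  ]
}
"""

schema = json.loads(schema_json)
field_names = [f["name"] for f in schema["fields"]]
```

Because the schema travels with the data, Avro files remain self-describing while staying far more compact than SDMX-ML for large datasets.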

3. Hadoop HDFS
Add support for Hadoop distributed file system (HDFS) as a Fusion Registry data source. HDFS is a distributed and scalable file system designed to reliably store data files typically in the range of gigabytes to terabytes. The functionality will allow Fusion Registry to load pre-prepared data from HDFS, store SDMX aggregated and microdata, and delegate processing tasks to distributed parallel processing engines capable of working directly with HDFS.
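HDFS exposes a REST interface, WebHDFS, through which files can be read and written over HTTP. The sketch below builds the URL for a WebHDFS read (OPEN) operation; the namenode host, port and path are placeholders, and in practice a client library (or Fusion Registry itself) would also handle redirects to datanodes and authentication.

```python
from urllib.parse import urlencode

# Sketch of addressing a file via WebHDFS, Hadoop's REST interface to HDFS.
# Host, port and path are placeholder values for illustration only.

def webhdfs_open_url(namenode, path, port=9870):
    """Build the WebHDFS URL for an OPEN (read) operation on an HDFS path."""
    return f"http://{namenode}:{port}/webhdfs/v1{path}?" + urlencode({"op": "OPEN"})

url = webhdfs_open_url("namenode.example.org", "/data/sdmx/observations.avro")
```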

4. Microdata Modelling using SDMX
Allow microdata to be modelled using SDMX structures. Currently only aggregated data is supported in SDMX 2.1, but this work anticipates SDMX 3.0 by adding a new type of Codelist tuned for microdata and allowing alternative measure cardinality when creating Microdata Structure Definitions.
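The sketch below shows, in outline, what a Microdata Structure Definition might carry: dimensions, measures, and a measure cardinality allowing more than one measure per record. The class and attribute names are assumptions anticipating SDMX 3.0, not the final standard.

```python
from dataclasses import dataclass

# An illustrative sketch of a Microdata Structure Definition.
# 'measure_cardinality' of "1..*" expresses that a microdata record may
# carry multiple measures, unlike a single-measure aggregated series.

@dataclass
class MicrodataStructure:
    id: str
    dimensions: list
    measures: list
    measure_cardinality: str = "1..*"

msd = MicrodataStructure(
    id="MSD_HOUSEHOLD_SURVEY",
    dimensions=["HOUSEHOLD_ID", "REF_AREA"],
    measures=["INCOME", "HOUSEHOLD_SIZE"],
)
```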

5. Microdata Schema Validation
Validation of microdata to check that it is structured as expected, i.e. as defined by the structural model (4). Note that this is distinct from business rules validation described using VTL.
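A minimal sketch of what structural validation involves: each record is checked for the presence of required components and for membership of coded values in their codelists. The component names and codelists here are illustrative assumptions.

```python
# Illustrative structural (schema) validation of a microdata record
# against required components and codelists. Names are assumptions.

STRUCTURE = {
    "required": {"HOUSEHOLD_ID", "REF_AREA", "INCOME"},
    "codelists": {"REF_AREA": {"UK", "FR", "DE"}},
}

def validate(record, structure=STRUCTURE):
    errors = []
    for comp in structure["required"] - record.keys():
        errors.append(f"missing component: {comp}")
    for comp, codes in structure["codelists"].items():
        if comp in record and record[comp] not in codes:
            errors.append(f"{comp}: '{record[comp]}' not in codelist")
    return errors

errors = validate({"HOUSEHOLD_ID": "H001", "REF_AREA": "ES"})
```

A record missing a measure, or using an unlisted code, fails structural validation even though a VTL business rule was never applied, which is the distinction drawn above.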

6. VTL Execution Engine (Production Release 1)
The first production release of the VTL Execution Engine, building on the lessons learned from the proof of concept, particularly in terms of usability for practical applications, performance on large datasets and interfaces to SDMX Dataflows. The majority of the VTL 2.0 functions will be supported.

7. VTL Execution on Apache Spark
Implementation of the VTL Execution Engine on Apache Spark - an open-source distributed parallel processing engine designed for very fast processing of large datasets. Spark works natively with HDFS.
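Spark evaluates expressions as transformations over partitioned data: an operation is applied to each partition independently, and the partial results are combined. The pure-Python sketch below mimics that model for a VTL-like `sum`; Spark would distribute the partitions across a cluster, whereas here they are processed sequentially for illustration.

```python
# A pure-Python stand-in for Spark's partitioned evaluation model.
# Each partition is reduced independently, then the partial sums merged.

def partial_sum(partition):
    return sum(obs["OBS_VALUE"] for obs in partition)

partitions = [
    [{"OBS_VALUE": 1.0}, {"OBS_VALUE": 2.0}],
    [{"OBS_VALUE": 3.0}],
]

total = sum(partial_sum(p) for p in partitions)  # 6.0
```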

8. Apache Spark Integration
Integration of Apache Spark with Fusion Registry allowing SDMX metadata-driven batch processing of SDMX microdata and aggregated datasets using complex algorithms or expressions in VTL, R, Python, SQL, Scala and Java. For instance, high-speed on-demand computation of aggregated SDMX datasets from source microdata.
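The core of such a computation is a group-by over the dimensions of the target dataflow with the measure summed per group; this is the operation Spark would distribute across a cluster. The sketch below shows it in plain Python with illustrative component names.

```python
from collections import defaultdict

# Sketch of metadata-driven aggregation: microdata records are grouped by
# the dimensions of a target dataflow and a measure is summed per group.
# Component names are illustrative assumptions.

def aggregate(microdata, group_dims, measure):
    totals = defaultdict(float)
    for rec in microdata:
        key = tuple(rec[d] for d in group_dims)
        totals[key] += rec[measure]
    return dict(totals)

micro = [
    {"REF_AREA": "UK", "SEX": "F", "INCOME": 100.0},
    {"REF_AREA": "UK", "SEX": "M", "INCOME": 150.0},
    {"REF_AREA": "UK", "SEX": "F", "INCOME": 50.0},
]

agg = aggregate(micro, ["REF_AREA", "SEX"], "INCOME")
```

In the integrated system the grouping dimensions and measure would come from SDMX structural metadata rather than being passed by hand.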

9. SDMX Microdata Processors
Addition of rules-based functions, similar in concept to those used for SDMX Structure Mapping, for aggregating microdata and converting it to time series of a specified frequency.
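As an example of frequency conversion, the sketch below rolls event-level microdata carrying daily dates up into a monthly time series. The rule applied (sum per month) and the field names stand in for the configurable rules-based processors described above.

```python
from collections import defaultdict

# Sketch of frequency conversion: daily-dated microdata is rolled up
# into a monthly series by summing values per month. Field names and
# the summation rule are illustrative assumptions.

def to_monthly(events, date_key="DATE", value_key="AMOUNT"):
    series = defaultdict(float)
    for ev in events:
        period = ev[date_key][:7]          # "YYYY-MM-DD" -> "YYYY-MM"
        series[period] += ev[value_key]
    return dict(sorted(series.items()))

events = [
    {"DATE": "2020-01-03", "AMOUNT": 5.0},
    {"DATE": "2020-01-28", "AMOUNT": 7.0},
    {"DATE": "2020-02-02", "AMOUNT": 1.0},
]

monthly = to_monthly(events)  # {"2020-01": 12.0, "2020-02": 1.0}
```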

10. VTL Execution Engine (Production Release 2)
Second iteration of the VTL Execution Engine addressing gaps identified in performance or functionality, and aligning with subsequent releases of the VTL standard.

11. Apache Spark Processing of Streaming Data from Apache Kafka
Continuous processing of streaming data from Apache Kafka allowing, for instance, SDMX aggregated datasets to be automatically updated as soon as new microdata arrives. Apache Kafka is an open-source distributed streaming platform implementing a 'publish and subscribe' model, similar to a message queue, enabling multiple streams to be processed simultaneously.
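The in-memory sketch below illustrates the publish/subscribe model only: producers append records to a topic's log, and each subscriber consumes the stream independently from its own offset. The real Kafka client API and broker semantics (partitions, consumer groups, retention) are far richer than this stand-in.

```python
# Minimal in-memory illustration of Kafka's publish/subscribe model:
# an append-only log per topic, with each subscriber tracking its own
# read offset so streams can be consumed independently.

class Topic:
    def __init__(self):
        self.log = []        # append-only record log
        self.offsets = {}    # subscriber -> next read position

    def publish(self, record):
        self.log.append(record)

    def poll(self, subscriber):
        pos = self.offsets.get(subscriber, 0)
        records = self.log[pos:]
        self.offsets[subscriber] = len(self.log)
        return records

topic = Topic()
topic.publish({"OBS_VALUE": 1.0})
topic.publish({"OBS_VALUE": 2.0})

batch_a = topic.poll("aggregator")   # both records so far
topic.publish({"OBS_VALUE": 3.0})
batch_b = topic.poll("aggregator")   # only the newly published record
```

In the roadmap scenario, a Spark streaming job would play the role of the subscriber, recomputing the affected SDMX aggregates as each new batch of microdata arrives.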