PDMX, Poll & Direct Metadata for SDMX (with Excel)
Introduction
The Excel PDMX tool supports the user in retrieving all major SDMX artifacts from the SDMX endpoint of his or her choice (dataflow, datastructure, conceptscheme, codelist and metadatastructure).
See below for more background info on SDMX.
Only JSON formatted content is supported.
Please note that some organizations require authentication for fair use of their API (which is not supported by PDMX).
PDMX can easily be used within an Office environment as a regular Excel worksheet (with macro enabled).
SDMX artifacts can be collected for two different systems, and optionally compared for any differences.
The results can be filtered by agency only (or all but agency, prefix with !).
The results are saved in separate sheets per artifact and system.
PDMX can be used for different uses cases, viewing available cross-domain codelists to reuse, deciding related artifacts to update in case of new versions etc.
In the main sheet (named PDMX) the results are summarized (see figure 1 below for an impression) and the requests can be initiated by selecting the proper button in the Form (to launch by the 'Form'-button or F1).
Retrieving the artifacts may take some time depending on the number within the selected system and your network connection.
In the statusbar the status of the retrieval can be followed.
Some supporting functions can be selected as well, to unhide or hide the configuration (of the SDMX endpoint details), to initialise the worksheet, to resize the column widths of all data sheets, to save summarized results to a history sheet.
The results for the different artifacts are checked for their usage as well (to find any unused artifacts or 'orphans', see figure 2).
The standard interrelationships of the especially the datastructure artifact are therefore of great importance.
The tool can be used for all standard SDMX-API endpoints (to configure by their url, accept header and REST parameters, see below the options).
You can try it out with the already configured examples, they all work.
The configuration of the SDMX endpoints can be hidden of unhidden by the button in the general form.
It is only needed to enter data in the configuration cells (yellow cells) and to select the endpoint of your choice in the main sheet (in the green or blue cell).
And remember, to prevent any malfunctioning of PDMX, do not update or delete cells in the existing sheets and do not change sheet names without a good reason!
Usage
After downloading PDMX from the download-page or by this link, you must first enable the macro by trusting this file (at your own risk).
You can rename the file for your own convenience.
The next thing is to add and select your endpoint-parameters in the configuration-part of the main sheet (named PDMX), which can be hidden afterwards.
With the Form-button (or F1) all available functions can be disclosed:
- Get DFs: Retrieve all dataflows from system (data is collected in sheet DF1 or DF2)
- Get DSs: Retrieve all datastructures from system (data is collected in sheet DS1 or DS2)
- Get CSs: Retrieve all conceptschemes from system (data is collected in sheet CS1 or CS2)
- Get CLs: Retrieve all codelists from system (data is collected in sheet CL1 or CL2)
- Agency rep.: Produces summary of all retrieved artifacts per agency (in sheet AG1 or AG2)
- Count data: Retrieve datasize in sheet DT1 or DT2 for all dataflows (collected in sheet DF1 or DF2)
- Count categ.: Retrieve categories and counts only the number present
- Count constr.: Retrieve constraints and counts only the number present
- Compare DFs: Compare all retrieved dataflows of system1 with system2 (in sheet DFc)
- Compare DSs: Compare all retrieved datastructures of system1 with system2 (in sheet DSc)
- Compare CSs: Compare all retrieved conceptschemes of system1 with system2 (in sheet CSc)
- Compare CLs: Compare all retrieved codelists of system1 with system2 (in sheet CLc)
- Add history: Add current totals of summary in main sheet as a new row in the History sheet
- Unhide config: Unhide (and hide) the configuration data in the main sheet
- Resize sheets: Resize all data sheets to its initial column sizes
- Initialise: Remove all retrieved data from the worksheet (except the configuration)
- Cancel: Cancel function selection with the form menu
Note 1: The first 8 functions can be applied for the 'green' and 'blue' system (or system1 and system2).
Note 2: The right order in retrieving a complete and consistent collection of a system is: dataflows, datastructures, conceptschemes and codelists.
Note 3: By hitting the little button next to label System 1 or System 2, the DFs, DSs, CSs and CLs can be retrieved in one go.
Note 4: The Count data function may take a very long time depending on your connection or system endpoint. You can abort this function at any time (the data size is marked with an asterisk as incomplete). And if you start another Excel workbook, it will automatically abort (only one Excel workbook opened is allowed during data counting).
The results are displayed in the main PDMX-sheet, as well the last url used and a snippet of the reponse retrieved.
The previous results are moved to the next column to the right.
After retrieving the artifacts, you can examine the results in the appropriate sheets.
Artifacts used are counted by their references, so artifacts not used can easily be recognized.
Older versions of artifacts can be such a reason.
Unused artifacts
One of the main goals of PDMX is to detect artifacts that are isolated and not (anymore) used.
The SDMX standard provides a highly interoperable and reusable way to manage statistical data with its metadata, but it has a drawback:
managing all the artifacts present and their versions can become cumbersome.
PDMX can help you with this, for that PDMX needs to count all related artifacts for datastructures and conceptschemas.
Related artifacts are part of the reference information in each artifact, available with the request option 'details=full', which is the default option for the SDMX API.
There are two challenges though:
First, it can be time consuming. Especially if the number of dataflows and datastructures is high.
Second, it may require a lot of data to be retrieved from the API at once, which can result in gateway timeouts (504) or other errors.
So, to enable the detection of unused artifacts depends on the size of the artifact sets in your system.
If this size is too high and gives an http error, unfortunately the detail option must be set to 'detail=allstubs'.
The maximum number of artifacts to retrieve by PDMX is currently set to 8000, the maximum number of concepts derived from the datastructures and conceptschemes is currently set to 80000.
Besides reducing your metadata by reusing it whenever possible, it makes managing your metadata, even with the help of PDMX, more pleasant.
To control the counting of related artifacts, PDMX offers a configuration parameter Count related artifacts, which is enabled by default. Disable it by entering 'N' as the value.
A second processing option could be the Show processing details in status bar, it gives detailed information in the status bar when artifacts are retrieved and processed.
Particularly useful for systems with large artifact sets.
Data counting
Statistical systems can contain a lot of data, which makes their maintenance difficult.
The Data count function allows the data available for each dataflow to be counted.
Most SDMX systems do support some way to get an indication of the size of a dataflow (like OECD), the AvailableConstraint artifact provides this functionality.
It generally contains the number of observations used by PDMX.
In the event that the SDMX system does not provide this artifact (like BIS or Unicef), the 'brute force' method is selectable via the 'Data' checkbox (by which PDMX retrieves the data itself).
Especially in the latter case, it will be a costly operation, at least in duration.
That's why PDMX offers an incremental approach (to enable via the 'Incremental' checkbox).
The data can be requested intermittently and repeated regularly.
After all, the size of the dataflows once retrieved will not change much or very often.
To facilitate incremental data retrieval, PDMX also provides two configurable parameters:
- Speed calibration: adjust this counter so that the speed of the counter value (dependant of your system) is about one second equal
- Maximum wait: set this counter to maximum 'seconds' (defined by speed calibration), set to 0 to disable
PS As PDMX retrieves the data, a counter (starting from zero) is displayed in the row of the counted dataflow.
Once the data retrieval is complete, the result (number of bytes) replaces this counter value.
Both parameters allow you to set a maximum wait in duration before skipping the dataflow and moving on to the next one..
However, you can always stop it and continue on another occasion (via alt-b).
The result is displayed as http status code 200 (if ok) or any 4** or 5** value.
If the retrieval fails due to a maximum wait, the fictitious result value will be -100.
Some systems do not like continuous retrieval (or 'harvesting'), so a dynamic response may be possible.
An appropriate way may be to return status code 429.
PDMX can automatically respond to the 429 status by waiting a minute before retrying (if enabled by the Retry after 429 parameter in the configuration).
In summary, an incremental approach does not repeat previous successful requests and retries all other data count requests.
And if you want to start with a clean slate, simply disable the 'Incremental' checkbox.
Disclaimer
Please note that you use the PDMX tool at your own risk. Excelmeer is in no way liable for any damage resulting from its use.
Version | Date | Information |
v0.9a | 11-01-1025 | First release made public, updated interface (to be independent of any locale), improved messages, added and checked examples. |
v0.9b | 27-01-2025 | Add an incremental way of counting all available data, add maximum wait on data retrieval. |
v0.9c | 11-02-2025 | Add an alternative and much quicker way of counting all available data, and improved minor things. |
v0.9d | 13-02-2025 | Fix error when counting raw data and interrupting. |
v1.0a | 06-03-2025 | Release 1.0, fix counting metadata structures, optimize getting datastructures & conceptschemes, add run indicator and more. |
v1.1a | 11-03-2025 | Update 1.1, increased processing capacity, added option to explicitly count related artifacts, added option for status bar update detail level. |
Contact
In case you are looking for some customization in PDMX or any other Excel development, please contact
Excelmeer.
Figure 1: animated example of main screen (v0.9a)
Figure 2: example of retrieved codelists
Standard SDMX-API parameters
DETAIL-parameter: This attribute specifies the desired amount of information to be returned.
- ALLSTUBS : all artifacts should be returned as stubs, containing only identification information, as well as the artifacts' name.
- REFERENCESTUBS: referenced artifacts should be returned as stubs, containing only identification information, as well as the artifacts' name.
- REFERENCEPARTIAL: referenced item schemes should only include items used by the artifact to be returned. For example, a concept scheme would only contain the concepts used in a DSD, and its ISPARTIAL flag would be set to true. Likewise, if a dataflow has been constrained, then the codelists referenced by the DSD referenced by the dataflow should only contain the codes allowed by the content constraint
- ALLCOMPLETESTUBS: all artifacts should be returned as complete stubs, containing identification information, the artifacts' name, description,annotations and isFinal information.
- REFERENCECOMPLETESTUBS: referenced artifacts should be returned as complete stubs, containing identification information, the artifacts' name, description, annotations and isFinal information.
- FULL (default): all available information for all artifacts should be returned.
REFERENCES-parameter: This attribute instructs the web service to return (or not) the artifacts referenced by the artifact to be returned (for example, the code lists and concepts used by the data structure definition matching the query as well as the artifacts that use the matching artifact (for example, the dataflows that use the data structure definition matching the query).
- NONE (default): no references will be returned.
- PARENTS: the artifacts that use the artifact matching the query
- PARENTSANDSIBLINGS: the artifacts that use the artifact matching the query, as well as the artifacts referenced by these artifacts.
- CHILDREN: artifacts referenced by the artifact to be returned.
- DESCENDANTS: references of references, up to any level, will also be returned.
- ALL: the combination of parentsandsiblings and descendants, in addition, a concrete type of resource may also be used (for example, references=codelist).
Note: For using PDMX, only the DETAIL attribute can be useful, the REFERENCES attribute is of no use.
More info on the SDMX-API can be found here.
SDMX background
SDMX, which stands for Statistical Data and Metadata eXchange, is an ISO standard designed to describe statistical data and metadata, normalise their exchange, and improve their efficient sharing across statistical and similar organisations.
SDMX is sponsored by eight international organisations including the Bank for International Settlements (BIS), European Central Bank (ECB), Eurostat (Statistical Office of the European Union), International Labour Organization (ILO), International Monetary Fund (IMF), Organisation for Economic Cooperation and Development (OECD), United Nations Statistical Division (UNSD), and World Bank.
The SDMX standard provides an integrated approach to facilitating statistical data and metadata exchange, enabling interoperable implementations within and between systems concerned with the exchange, reporting and dissemination of statistical data and their related meta-information.
More info on SDMX can be found at the SDMX website.
|