1. Creating the tag_config.json Configuration File: A Step-by-Step Guide

The configuration file, tag_config.json, holds the settings for extracting the text content and attributes of tag elements from the XML-based TDS web services. It lives outside the source code, so harvesting can be adapted to the user’s preferences without changing any code. Extensible Markup Language (XML) is a markup language that defines a set of rules for describing data; it transmits data together with its description, which helps preserve data integrity. The current version of TDS2STAC supports all XML-based web services provided by the THREDDS Data Server (TDS), including WCS, WMS, WFS, ISO, DAP4, ncML, and Catalogs, which makes extracting content from their XML files straightforward. The rest of this guide describes the structure of this file.

The top-level keys of the JSON file are the names of the extensions to be harvested. The file below contains two, item_datacube_extension and item_scientific_extension. The objective is to extract specific content for these extensions from the XML files served by the TDS.

{
    "item_datacube_extension": {},
    "item_scientific_extension": {}
}
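
As a quick sanity check, the top-level structure can be inspected with Python's standard json module. This is only a sketch: the configuration text is inlined rather than read from a tag_config.json file on disk, and the variable names are illustrative, not part of TDS2STAC.

```python
import json

# Inlined copy of the skeleton above; in practice this text would be
# read from the tag_config.json file.
config_text = """
{
    "item_datacube_extension": {},
    "item_scientific_extension": {}
}
"""

config = json.loads(config_text)

# The top-level keys are the extensions targeted for harvesting.
extension_names = list(config)
print(extension_names)
```

Strict JSON parsers such as json.loads reject a trailing comma after the last entry of an object, so the last key in each object must not be followed by one.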

As seen below, we aim to extract the variables of the extension item_datacube_extension, such as horizontal_extent_lon_min, horizontal_extent_lon_max, horizontal_extent_lat_min, and horizontal_extent_lat_max, for each item. Before defining the variables to be harvested, we first need to understand the basic framework for defining a variable.

{
    "item_datacube_extension": {},
    "item_scientific_extension": {
        "horizontal_extent_lon_min": {},
        "horizontal_extent_lon_max": {},
        "horizontal_extent_lat_min": {},
        "horizontal_extent_lat_max": {}
    }
}

1.1. Main and constant attributes:

Each variable in the tag_config.json file is analysed in one of four modes: str, list, get, and check. Each mode has both constant attributes and variable attributes. The constant attributes are:

  1. tds2stac_mode_analyser: It determines the method of analysis. As stated above, there are four modes: str, list, get, and check. Each of them is explained in detail in this tutorial.

  2. tds2stac_manual_variable: This attribute is used when the variable is held constant, that is, for variables in the str and list modes. It is required whenever a fixed string or list is defined, such as 49.56 or [EPSG:4326, EPSG:3857]. The check mode also uses it to hold the value that is assigned when the check succeeds.

  3. tds2stac_webservice_analyser: This attribute names the web service to harvest from. It is used only for variables in the get and check modes. It can refer to any web service activated in the TDS, including but not limited to WCS, WMS, WFS, ISO, DAP4, ncML, and the catalog XML file. The corresponding values are ncml, wms, catalog, dap4, and iso.

  4. tds2stac_reference_key: This attribute is used only by the get mode analyzer. In the context of the datacube extension, suppose a list of five variable IDs has been harvested and the objective is now to obtain the unit of each of them, but three of the variables have no unit in the TDS web service. This attribute sets the reference key of the variable, here variable_ids, against which variable_unit is resolved; wherever a unit is absent, the entry is set to null. The two harvested lists therefore have equal lengths, which makes them easier to use and evaluate in subsequent stages.
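
The alignment described for tds2stac_reference_key can be pictured with a short Python sketch. It does not reproduce TDS2STAC's internals: the variable names and the sample IDs and units are hypothetical, and the harvested units are assumed to arrive as a mapping from variable ID to unit.

```python
# Hypothetical reference list: five variable IDs harvested from the TDS.
variable_ids = ["temperature", "pressure", "flag_a", "flag_b", "flag_c"]

# Hypothetical harvest result: only two of the five variables carry a unit.
harvested_units = {"temperature": "K", "pressure": "hPa"}

# Align the units to the reference list. A missing unit becomes None,
# which serialises to null in JSON.
variable_units = [harvested_units.get(var_id) for var_id in variable_ids]

print(variable_units)
```

Because every entry of the reference list yields exactly one entry in the result, the two lists can be paired index by index in later processing.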

We shall provide separate descriptions for each of them and strive to understand them thoroughly.

1.2. Mode Analyzers

  1. str TDS2STAC Mode Analyser

The str mode analyzer declares a fixed string value for a variable within an extension. For example, to assign a constant value such as 46.852 to the variable horizontal_extent_lon_min for all STAC-Items, use this mode, as in the sample below:

{
    "item_datacube_extension": {},
    "item_scientific_extension": {
        "horizontal_extent_lon_min": {
            "tds2stac_mode_analyser": "str",
            "tds2stac_manual_variable": 46.852
        },
        "horizontal_extent_lon_max": {},
        "horizontal_extent_lat_min": {},
        "horizontal_extent_lat_max": {}
    }
}

In the example above, to set the same minimum longitude for all STAC-Items within the datacube, tds2stac_mode_analyser is set to str. The constant value of the variable is then given in the tds2stac_manual_variable field; here it is 46.852.

  2. list TDS2STAC Mode Analyser

The list mode analyzer declares a fixed list of string values for a variable in an extension. For example, to declare a variable named variable_ids that holds a constant list of string values such as [variable_id_1, variable_id_2, variable_id_3] for all STAC-Items, use this mode, as in the example below.

{
    "item_datacube_extension": {},
    "item_scientific_extension": {
        "variable_ids": {
            "tds2stac_mode_analyser": "list",
            "tds2stac_manual_variable": "[variable_id_1, variable_id_2, variable_id_3]"
        },
        "horizontal_extent_lon_max": {},
        "horizontal_extent_lat_min": {},
        "horizontal_extent_lat_max": {}
    }
}

In the example above, to obtain the same list of variable IDs for all STAC-Items in the datacube, tds2stac_mode_analyser is set to list. The constant list of string values is then given in the tds2stac_manual_variable field; here it consists of variable_id_1, variable_id_2, and variable_id_3.
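
The following Python sketch shows one way constant values could be read from str and list definitions. The function name resolve_constant is hypothetical, and splitting the bracketed string on commas is an assumption about how a list written as a string might be interpreted, not TDS2STAC's documented behaviour.

```python
def resolve_constant(definition):
    """Return the constant value declared by a str or list definition."""
    mode = definition["tds2stac_mode_analyser"]
    value = definition["tds2stac_manual_variable"]
    if mode == "str":
        return str(value)
    if mode == "list":
        if isinstance(value, list):
            return value
        # Assumed convention: "[a, b, c]" written as a single string.
        inner = value.strip().strip("[]")
        return [item.strip() for item in inner.split(",") if item.strip()]
    raise ValueError(f"not a constant mode: {mode}")


lon_min = resolve_constant(
    {"tds2stac_mode_analyser": "str", "tds2stac_manual_variable": 46.852}
)
ids = resolve_constant(
    {
        "tds2stac_mode_analyser": "list",
        "tds2stac_manual_variable": "[variable_id_1, variable_id_2, variable_id_3]",
    }
)
print(lon_min, ids)
```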

  3. get TDS2STAC Mode Analyser

The get mode analyzer automatically harvests the attributes and variables of each extension from the XML-based web services of the TDS. As an illustration, our objective is to obtain four variables, namely constellation, description, instruments, and horizontal_extent_lon_min, from four XML-based web services provided by the TDS, in order to fill in the variables of the datacube extension and of common_metadata. These web services are identified as ncml, dap4, iso, and catalog. See the example below.

{
    "item_datacube_extension": {},
    "item_scientific_extension": {
        "horizontal_extent_lon_min": {
            "tds2stac_mode_analyser": "get",
            "tds2stac_webservice_analyser": "ncml",
            "netcdf": null,
            "group": {
                "name": "CFMetadata"
            },
            "attribute": {
                "name": "geospatial_lon_min",
                "value": null
            }
        },
        "horizontal_extent_lon_max": {},
        "horizontal_extent_lat_min": {},
        "horizontal_extent_lat_max": {}
    },
    "common_metadata": {
        "constellation": {
            "tds2stac_mode_analyser": "get",
            "tds2stac_webservice_analyser": "iso",
            "gmi:MI_Metadata": null,
            "gmd:contact": null,
            "gmd:CI_ResponsibleParty": null,
            "gmd:organisationName": null,
            "gco:CharacterString": null
        },
        "description": {
            "tds2stac_mode_analyser": "get",
            "tds2stac_webservice_analyser": "catalog",
            "catalog": null,
            "dataset": {
                "name": null
            }
        },
        "instruments": {
            "tds2stac_mode_analyser": "get",
            "tds2stac_webservice_analyser": "dap4",
            "Dataset": null,
            "Attribute": {
                "name": null,
                "type": null
            }
        }
    }
}

Consider the first variable in the example above, horizontal_extent_lon_min, whose content is to be retrieved from a web service. Setting tds2stac_mode_analyser to get harvests the data automatically from the web service named in tds2stac_webservice_analyser, here ncml. These first two keys (tds2stac_mode_analyser, tds2stac_webservice_analyser) are the constant keys discussed previously.

The remaining keys, however, are the names of tag elements within the XML file of the chosen web service (ncml). In the ncml XML file, the objective is to harvest the value of the geospatial_lon_min attribute from the CFMetadata group tag element.

The first tag element, <netcdf xmlns="url" location="Not provided because of security concerns.">, has the tag name netcdf within the XML namespace url.

The netcdf key is placed directly after the tds2stac_webservice_analyser attribute. Since we are only interested in the minimum longitude, we do not search for any attributes of the netcdf tag, so we assign null to the netcdf key simply to pass through this tag. Moreover, because the whole XML file contains only one netcdf tag, there is no need to list the attributes of the netcdf tag, such as {"xmlns": "url", "location": "Not provided because of security concerns."}.

It is important to acknowledge that when encountering multiple occurrences of the netcdf or any other tag name within an XML file, it is advisable to include the attributes of the tag instead of leaving them as null. This approach facilitates more precise and desirable filtering outcomes.

The next key, group, illustrates this: because the file can contain many groups, the selection is narrowed to the <group name="CFMetadata"> tag. The final step is the most important.

In a hypothetical scenario where the minimum longitude is given not as the value of an attribute in the XML file but as text enclosed between opening and closing tags, we still want to harvest it. To do so, we simply assign null as the value of the attribute key in order to obtain the text. For instance, if the <attribute name="geospatial_lon_min" value="113,361" type="float"/> tag were instead structured as <attribute name="geospatial_lon_min" type="float">113,361</attribute>, the text could be retrieved by setting null as the value of the attribute key.

In the present case, however, we want the value of an XML attribute. The attribute key therefore takes a dictionary: geospatial_lon_min is given as the value of the name key, and null is assigned to the value key. With this approach the minimum longitude of each STAC-Item in the dataset is obtained automatically from the ncml web service of the TDS. Note that if every tag attribute in the nested dictionary is given a value, TDS2STAC searches for the text within the tag rather than for an attribute’s value. The corresponding ncml XML is shown below.

<netcdf xmlns="url" location="Not provided because of security concerns.">
<attribute name="title" value="IAGOS-CARIBIC netCDF4 data file"/>
<attribute name="creation_date" value="2023-09-26T11:01:56.344149+00:00"/>
<attribute name="mission" value="IAGOS-CARIBIC (CARIBIC-2), http://www.caribic-atmospheric.com/"/>
<attribute name="data_description" value="All continuous measurements (10s averages) for IAGOS-CARIBIC (onboard Airbus A340-600 HE of Lufthansa)"/>
<attribute name="data_institute" value="Institute of Meteorology and Climate Research (IMK), Karlsruhe Institute of Technology (KIT), 76021 Karlsruhe, P.O. Box 3640, Germany"/>
<attribute name="data_owners" value="A. Zahn; H. Boenisch; T. Gehrlein; F. Obersteiner; contact: andreas.zahn@kit.edu"/>
<attribute name="data_contributors" value="https://gitlab.kit.edu/kit/imk-asf-top/IAGOS-CARIBIC/-/blob/cb7705507465023e28937cb4f896a13058f6ebd0/doc/Caribic2_MS-dataset_Contributors/CARIBIC_MS_contributors.md"/>
<attribute name="license" value="https://creativecommons.org/licenses/by/4.0/deed.en"/>
<attribute name="doi" value="10.5281/zenodo.8188548"/>
<attribute name="format_date" value="2023-09-26"/>
<attribute name="format_version" value="0.3"/>
<attribute name="history" value="Converted from NASA Ames format with na_to_nc4 from caribic2dp.convert_caribic_na_nc4 module. Might contain only a subset of the parameters from the original NASA Ames file. caribic2dp 0.2.16, https://gitlab.kit.edu/FObersteiner/Caribic2dp"/>
<attribute name="conventions" value="CF-1.10"/>
<attribute name="ivar_C_format" value="%d"/>
<attribute name="_NCProperties" value="version=2,netcdf=4.9.3-development,hdf5=1.12.2"/>
<attribute name="_CoordSysBuilder" value="ucar.nc2.internal.dataset.conv.CF1Convention"/>
<dimension name="header_lines" length="197"/>
<dimension name="time" length="659"/>
<group name="CFMetadata">
    <attribute name="geospatial_lon_min" value="113,361" type="float"/>
    <attribute name="geospatial_lat_min" value="14,539" type="float"/>
    <attribute name="geospatial_lon_max" value="121,069" type="float"/>
    <attribute name="geospatial_lat_max" value="23,519" type="float"/>
    <attribute name="geospatial_lon_units" value=""/>
    <attribute name="geospatial_lat_units" value=""/>
    <attribute name="geospatial_lon_resolution" value="0.011714285992561503"/>
    <attribute name="geospatial_lat_resolution" value="0.013647415717684389"/>
    <attribute name="time_coverage_start" value="2005-05-20T12:31:15Z"/>
    <attribute name="time_coverage_end" value="2005-05-20T14:20:55Z"/>
    <attribute name="time_coverage_units" value="seconds"/>
    <attribute name="time_coverage_resolution" value="10.0"/>
    <attribute name="time_coverage_duration" value="P0Y0M0DT1H49M40.000S"/>
</group>
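
The traversal described above can be approximated in Python with the standard xml.etree.ElementTree module. This is a sketch under stated assumptions rather than TDS2STAC's implementation: the helper names local_name and harvest are hypothetical, the XML is a shortened stand-in for the ncml response with a placeholder location, and each key is assumed to match direct children of the previously matched tag.

```python
import xml.etree.ElementTree as ET

# Shortened stand-in for the ncml response shown above.
NCML = """<netcdf xmlns="url" location="example">
  <attribute name="title" value="IAGOS-CARIBIC netCDF4 data file"/>
  <group name="CFMetadata">
    <attribute name="geospatial_lon_min" value="113,361" type="float"/>
    <attribute name="geospatial_lat_min" value="14,539" type="float"/>
  </group>
</netcdf>"""

definition = {
    "tds2stac_mode_analyser": "get",
    "tds2stac_webservice_analyser": "ncml",
    "netcdf": None,
    "group": {"name": "CFMetadata"},
    "attribute": {"name": "geospatial_lon_min", "value": None},
}


def local_name(tag):
    """Drop the '{namespace}' prefix ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def harvest(xml_text, definition):
    """Walk the tag keys in order and return the harvested values."""
    # Every key that is not a tds2stac_* constant is treated as a tag name.
    steps = [
        (key, spec or {})
        for key, spec in definition.items()
        if not key.startswith("tds2stac_")
    ]
    matches = [ET.fromstring(xml_text)]
    for depth, (tag, spec) in enumerate(steps):
        if depth > 0:
            # Descend one level: look at children of the current matches.
            matches = [child for parent in matches for child in parent]
        # Attributes with a non-null value act as filters.
        fixed = {k: v for k, v in spec.items() if v is not None}
        matches = [
            el
            for el in matches
            if local_name(el.tag) == tag
            and all(el.get(k) == v for k, v in fixed.items())
        ]
    # Attributes set to null on the last tag are the ones to harvest.
    wanted = [k for k, v in steps[-1][1].items() if v is None]
    if wanted:
        return [el.get(k) for el in matches for k in wanted if el.get(k) is not None]
    # No null attribute: fall back to the text inside the tag.
    return [(el.text or "").strip() for el in matches]


lon_min = harvest(NCML, definition)
print(lon_min)
```

Filtering on the attributes that carry a value and harvesting the ones set to null mirrors the two roles the nested dictionary plays in the configuration; when no attribute is null, the sketch returns the tag text instead.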

When several attributes are set to null, TDS2STAC looks up the value of each of those attributes and compiles the values into a list. This can be illustrated using the following example:

"instruments": {
                "tds2stac_mode_analyser": "get",
                "tds2stac_webservice_analyser": "dap4",
                "Dataset": null,
                "Attribute":{
                        "name":null,
                        "type":null
                        }
                }
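
A small Python sketch of this list-building behaviour follows. The XML is a hypothetical, namespace-free stand-in for a dap4 response, and the order in which the values are collected (tag by tag, then attribute by attribute) is an assumption, not something the TDS2STAC documentation specifies.

```python
import xml.etree.ElementTree as ET

# Hypothetical, namespace-free stand-in for a dap4 response.
DMR = """<Dataset name="example.nc">
  <Attribute name="title" type="String"/>
  <Attribute name="doi" type="String"/>
</Dataset>"""

# The attributes set to null in the "instruments" definition.
wanted = ["name", "type"]

root = ET.fromstring(DMR)
values = [
    element.get(key)
    for element in root.findall("Attribute")
    for key in wanted
    if element.get(key) is not None
]
print(values)
```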

  4. check TDS2STAC Mode Analyser

The check mode analyzer verifies the existence of a specific tag element within the XML file of a web service. For example, to verify that an attribute tag named geospatial_vertical_min exists within the XML file of the ncml web service, use this mode, as in the example below.

{
    "item_datacube_extension": {},
    "item_scientific_extension": {
        "vertical_axis": {
            "tds2stac_mode_analyser": "check",
            "tds2stac_manual_variable": "z",
            "tds2stac_webservice_analyser": "ncml",
            "netcdf": null,
            "group": {
                "name": "CFMetadata"
            },
            "attribute": {
                "name": "geospatial_vertical_min"
            }
        }
    }
}

In the example above, the objective is to verify that the required tag and attribute values are present in the XML file. If they are found, the value of tds2stac_manual_variable is assigned to the variable, here vertical_axis.
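
The check can be sketched in Python with xml.etree.ElementTree and its limited XPath support. The function name check_vertical_axis, the namespace-free sample XML, and the sample value 0.0 are hypothetical; returning None when the tag is missing is an assumption about what an unsuccessful check yields.

```python
import xml.etree.ElementTree as ET

# Hypothetical, namespace-free stand-in for an ncml response.
NCML = """<netcdf location="example">
  <group name="CFMetadata">
    <attribute name="geospatial_vertical_min" value="0.0"/>
  </group>
</netcdf>"""


def check_vertical_axis(xml_text, manual_variable="z"):
    """Return manual_variable if the required tag exists, else None."""
    root = ET.fromstring(xml_text)
    path = "./group[@name='CFMetadata']/attribute[@name='geospatial_vertical_min']"
    return manual_variable if root.find(path) is not None else None


vertical_axis = check_vertical_axis(NCML)
print(vertical_axis)
```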