The EDDTableFromEML and EDDTableFromEMLBatch Options in GenerateDatasetsXml
[This web page will only be of interest to ERDDAP™ administrators who work
with EML files.
This document was originally created in 2016. It was last edited on 2020-11-30. ]
ERDDAP™
is a data server that gives users a simple, consistent way to download
subsets of gridded and tabular scientific datasets in common file formats
and make graphs and maps. ERDDAP™ works with a given dataset as either
a group of multidimensional gridded variables (e.g., satellite or model data)
or as a database-like table (with a column for each type of information
and a row for each observation). ERDDAP™ is Free and Open Source Software,
so anyone can
download and install ERDDAP™
to serve their data.
To add a dataset to an ERDDAP™ installation, the ERDDAP™ administrator must add a
chunk of XML describing the dataset to a file called datasets.xml.
(There is
thorough documentation for datasets.xml.)
Although it is possible to generate the chunk of XML for datasets.xml entirely
by hand, ERDDAP™ comes with a tool called
GenerateDatasetsXml which can generate the rough draft of the chunk
of XML needed for a given dataset based on some source of information about the dataset.
The first thing GenerateDatasetsXml asks is what type of dataset you want to create.
GenerateDatasetsXml has a special option, EDDTableFromEML,
which uses the information in an
Ecological Metadata Language (EML)
XML file to generate the chunk of XML for datasets.xml
to create an
EDDTableFromAsciiFiles
dataset from each data table in an EML file.
This works very well for most EML files, mostly because EML files do an excellent
job of storing all of the needed metadata for a dataset in an easy-to-work-with format.
The information that GenerateDatasetsXml needs to create the datasets is in the EML file,
including the URL for the data file, which GenerateDatasetsXml downloads, parses,
and compares to the description in the EML file.
(Many groups would do well to switch to EML, which is a great system for
documenting any tabular scientific dataset, not just ecological data.
And many groups that create XML schemas would do well to use
EML as a case study for XML schemas that are clear, to the point,
not excessively deep (i.e., too many levels),
and easy for humans and computers to work with.)
Here are all the questions GenerateDatasetsXml will ask,
with comments about how you should answer if you want to process just one EML file
or a batch of EML files:
- Which EDDType?
If you want to process just one file, answer: EDDTableFromEML
If you want to process a group of files, answer: EDDTableFromEMLBatch
- Directory to store files?
Enter the name of the directory that will be used to store downloaded
EML and/or data files.
If the directory doesn't exist, it will be created.
- (For EDDTableFromEML only) EML URL or local fileName?
Enter the URL or local filename of an EML file.
- (For EDDTableFromEMLBatch only) EML dir (URL or local)?
Enter the name of the directory with the EML files (a URL or a local dir).
For example: http://sbc.lternet.edu/data/eml/files/
- (For EDDTableFromEMLBatch only) Filename regex?
Enter the regular expression which will be used to identify the
desired EML files in the EML directory.
For example: knb-lter-sbc\.\d+
- Use local files if present (true|false)?
Enter true to use the existing local EML files and data files, if they exist.
Enter false to always re-download the EML files and/or data files.
- accessibleTo?
If you want the new datasets to be private datasets in ERDDAP,
specify the name of the group(s) that will be allowed access.
Recommended for LTER groups: combine "lter" plus the group, e.g., lterSbc.
If you enter "null", there will be no <accessibleTo> tag in the output.
See accessibleTo.
- localTimeZone (e.g., US/Pacific)?
If a time variable indicates that it has local time values, this time
zone will be assigned.
This must be a value from the
TZ column list of time zone names.
Note all of the easy-to-use "US/..." names at the end of the list.
If you later find that to be incorrect, you can change the time_zone
in the chunk of datasets.xml.
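As a quick sanity check for the "Filename regex?" answer above, here is a minimal Python sketch that shows which names the example pattern selects. The file names are hypothetical, and the sketch assumes the regex must match the whole name. Python's re module is only a stand-in here for the Java regex engine that ERDDAP™ runs on; for a pattern this simple the two behave the same.

```python
import re

# The example pattern from the "Filename regex?" question.
EML_NAME_REGEX = r"knb-lter-sbc\.\d+"


def select_eml_files(names, pattern=EML_NAME_REGEX):
    """Return the names that the pattern matches in full (an assumption)."""
    compiled = re.compile(pattern)
    return [name for name in names if compiled.fullmatch(name)]


# Hypothetical directory listing, for illustration only.
candidates = ["knb-lter-sbc.28", "knb-lter-sbc.3017", "knb-lter-sbcX28", "readme.txt"]
selected = select_eml_files(candidates)
print(selected)
```

Note that the backslash before the dot matters: without it, the dot would match any character.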
EML plus ERDDAP™ is a great combination, since ERDDAP™ can
give users more direct access to the wealth of
Knowledge Network for Biocomplexity (KNB)
and
Long Term Ecological Research (LTER)
data and help those projects meet the US government's
Public Access to Research Results (PARR) requirements
by making the data available via a web service.
Also, EML plus ERDDAP™ seems like a great bridge between scientists in
the academic / NSF-funded realm and scientists in the federal agency (NOAA, NASA, USGS) realm.
If you have questions, comments, suggestions, or need help, please send an email to
erd dot data at noaa dot gov .
Here are the design details of the EDDTableFromEML option in GenerateDatasetsXml.
Some are related to differences in how EML and ERDDAP™ do things and how
GenerateDatasetsXml deals with these problems.
- One dataTable Becomes One ERDDAP™ Dataset
One EML file may have multiple <dataTable>s. ERDDAP™ makes
one ERDDAP™ dataset per EML dataTable. The datasetID for the dataset is
EMLName_ttableNumber (when EMLName is text) or
system_EMLName_ttableNumber (when EMLName is a number).
For example, table #1 in the file knb-lter-sbc.28
becomes ERDDAP™ datasetID=knb_lter_sbc_28_t1.
- EML versus CF+ACDD
Almost all of the metadata in the EML files gets into ERDDAP,
but in a different format. ERDDAP™ uses the
CF
and
ACDD
metadata standards. They are complementary metadata systems that
use key=value pairs for global metadata and for
each variable's metadata.
Yes, the EML representation of the metadata is nicer than the
CF+ACDD representation.
I'm not suggesting using the CF+ACDD representation as a replacement for the EML.
Please think of CF+ACDD as part of the bridge from the EML world to the
OPeNDAP/CF/ACDD world.
- Small Changes
ERDDAP™ makes a lot of small changes. For example, ERDDAP™ uses the
EML non-DOI alternateIdentifier plus a dataTable number as the
ERDDAP™ datasetID, but slightly changes alternateIdentifier to make
it a valid variable name in most computer languages,
e.g., knb-lter-sbc.33 dataTable #1 becomes knb_lter_sbc_33_t1.
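The datasetID naming rule can be sketched in a few lines of Python. This is only an illustration, not the code in GenerateDatasetsXml: the function name, the "system" parameter, and the rule "replace every character that is not a letter, digit, or underscore with an underscore" are assumptions chosen to reproduce the examples in this document.

```python
import re


def make_dataset_id(eml_name, table_number, system=None):
    """Build a datasetID such as knb_lter_sbc_33_t1 (hypothetical helper)."""
    # Assumed sanitizing rule: anything other than A-Z, a-z, 0-9, _ becomes "_".
    safe_name = re.sub(r"[^A-Za-z0-9_]", "_", eml_name)
    if eml_name.isdigit():
        # A purely numeric name gets the system name as a prefix.
        if not system:
            raise ValueError("a numeric EML name needs a system name")
        safe_system = re.sub(r"[^A-Za-z0-9_]", "_", system)
        safe_name = f"{safe_system}_{safe_name}"
    return f"{safe_name}_t{table_number}"


print(make_dataset_id("knb-lter-sbc.33", 1))
```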
- DocBook
EML uses DocBook's markup system to provide structure to blocks of text in EML files.
CF and ACDD require that metadata be plain text. So GenerateDatasetsXml converts the marked up
text into plain text that looks like the formatted version of the text.
The inline tags are sanitized with square brackets, e.g., [emphasized],
and left in the plain text.
- Data Files
Since the EML dataTable includes the URL of the actual data file,
GenerateDatasetsXml will:
- Download the data file.
- Store it in the same directory as the EML file.
- Read the data.
- Compare the description of the data in the EML with the actual
data in the file.
- If GenerateDatasetsXml finds differences, it deals with them,
or asks the operator if the differences are okay, or returns
an error message. The details are in various items below.
- .zip'd Data Files
If the referenced data file is a .zip file, it must contain just
one file. That file will be used for the ERDDAP™ dataset.
If there is more than one file, ERDDAP™ will reject that dataset.
If needed, this could be modified.
(In practice, all SBC LTER zip files have just one data file.)
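If you want to check a .zip file against the one-file rule before running GenerateDatasetsXml, a minimal Python sketch along these lines may help. The helper name and the sample file name are hypothetical, and ignoring directory entries is an assumption.

```python
import io
import zipfile


def single_member_name(zip_source):
    """Return the name of the only file in the archive, or raise ValueError."""
    with zipfile.ZipFile(zip_source) as archive:
        # Directory entries are ignored (an assumption); only files are counted.
        files = [info.filename for info in archive.infolist() if not info.is_dir()]
    if len(files) != 1:
        raise ValueError(f"expected exactly 1 file in the .zip, found {len(files)}")
    return files[0]


# Build a small in-memory archive for illustration (hypothetical file name).
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("example_data.csv", "time,value\n2008-10-01T07:00:00Z,2.27\n")
buffer.seek(0)
print(single_member_name(buffer))
```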
- StorageType
If a column's storageType isn't specified, ERDDAP™ uses its best guess
based on the data in the data file. This works pretty well.
- Units
ERDDAP™ uses
UDUNITS formatting for units.
GenerateDatasetsXml is able to convert EML units to UDUNITS cleanly about
95% of the time. The remaining 5% results in a readable description
of the units, e.g.,
"biomassDensityUnitPerAbundanceUnit" in EML
becomes "biomass density unit per abundance unit" in ERDDAP.
Technically this isn't allowed. I don't think it's so bad
under the circumstances.
[If necessary, units that can't be made UDUNITS compatible could be
moved to the variable's comment attribute.]
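The fallback for units that can't be converted (turning the EML name into a readable description) can be imitated with a short Python sketch. It covers only the camelCase-to-words step, not the conversion to UDUNITS, and the function name is hypothetical.

```python
import re


def readable_units(eml_units):
    """Split a camelCase EML unit name into lowercase words."""
    # Insert a space wherever a lowercase letter or digit is followed by a capital.
    spaced = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", " ", eml_units)
    return spaced.lower()


print(readable_units("biomassDensityUnitPerAbundanceUnit"))
# prints: biomass density unit per abundance unit
```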
- EML version 2.1.1
This support for EML v2.1.1 files was added to GenerateDatasetsXml in 2016 with the
hope that there would be some uptake in the EML community. As of 2020, that has not happened.
The ERDDAP™ developers would be happy to add support for more recent versions of EML, but only
if the new features will actually be used. Please email
erd.data at noaa.gov if you want support for more recent versions of EML
and will actually use this feature.
Some issues with EML files cause problems
when a software client (such as the EDDTableFromEML option in GenerateDatasetsXml)
tries to interpret and process the files.
- Although there are several issues listed here, they are mostly small,
solvable problems. In general, EML is a great system and it has been my
pleasure to work with it.
- These are roughly sorted from worst / most common to least bad / least common.
- Most are related to small problems in specific EML files
(which are not EML's fault).
- Most can be fixed by simple changes to the EML
file or data file.
- Given that LTER people are building an EML checker to test the validity of EML files,
I have added some suggestions below regarding features that could be added to the checker.
Here are the issues:
- Separate Date and Time Columns
Some data files have separate columns for date and for time, but no
unified date+time column.
Currently, GenerateDatasetsXml creates a dataset with these separate columns,
but it isn't ideal because:
- It is best if datasets in ERDDAP™ have a combined date+time column called "time".
- Often the dataset won't load in ERDDAP™ because the "time" column doesn't have date+time data.
There are two possible solutions:
- Edit the source data file to add a new column in the datafile (and describe it in the EML)
where the date and time columns are merged into one column. Then rerun GenerateDatasetsXml
so it finds the new column.
- Use the Derived Variables feature in ERDDAP™ to define a new variable in datasets.xml
which is created by concatenating the date and the time columns. One of the examples
deals specifically with this situation.
- Inconsistent Column Names
The EML files list the data file's columns and their names.
Unfortunately, they are often different from the column names
in the actual data file.
Normally, the column order in the EML file is the same as the
column order in the data file, even if the names vary slightly,
but not always.
GenerateDatasetsXml tries to match the column names.
When it can't (which is common), it will stop, show you the pairs of
EML and data file column names, and ask if they are correctly aligned.
If you enter 's' to skip a table, GenerateDatasetsXml will print an error message
and go on to the next table.
The solution is to change the erroneous column names in the EML file
to match the column names in the data file.
- Different Column Order
There are several cases where the EML specified the columns in
a different order than they exist in the data file.
GenerateDatasetsXml will stop and ask the operator if the
matchups are okay or if the dataset should be skipped.
If it is skipped, there will be an error message in the results file, e.g.:
<-- SKIPPED (USUALLY BECAUSE THE COLUMN NAMES IN THE DATAFILE ARE IN
A DIFFERENT ORDER OR HAVE DIFFERENT UNITS THAN IN THE EML file):
datasetID=knb_lter_sbc_17_t1
dataFile=all_fish_all_years_20140903.csv
The data file and EML file have different column names.
ERDDAP™ would like to equate these pairs of names:
SURVEY_TIMING = notes
NOTES = survey_timing
-->
The solution is to fix the column order in these EML files so
that they match the order in the data files.
It would be nice if the EML checker checked that
the columns and column order in the source file match
the columns and column order in the EML file.
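A check of this kind is easy to sketch. The Python below pairs the data file's column names with the EML column names by position and reports the pairs that differ, ignoring case. It is an illustration of the idea, not the matching logic in GenerateDatasetsXml; the function name and the DATE and SITE columns are hypothetical, while the last two pairs come from the skipped-table message above.

```python
def mismatched_columns(data_names, eml_names):
    """Pair names by position and return the pairs that differ (ignoring case)."""
    if len(data_names) != len(eml_names):
        raise ValueError(
            f"data file has {len(data_names)} columns, EML describes {len(eml_names)}"
        )
    return [
        (data_name, eml_name)
        for data_name, eml_name in zip(data_names, eml_names)
        if data_name.strip().lower() != eml_name.strip().lower()
    ]


data_header = ["DATE", "SITE", "SURVEY_TIMING", "NOTES"]
eml_attributes = ["date", "site", "notes", "survey_timing"]
for data_name, eml_name in mismatched_columns(data_header, eml_attributes):
    print(f"{data_name} = {eml_name}")
```

An empty result means the names agree in order; it says nothing about units, which is a separate problem discussed below.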
- Incorrect numHeaderLines
Several dataTables incorrectly state numHeaderLines=1, e.g., ...sbc.4011.
This causes ERDDAP™ to read the first line of data as the column names.
I tried to manually SKIP all of these dataTables.
They are obvious because the unmatched source col names are all data values.
And if there are files that incorrectly have numHeaderLines=0,
my system doesn't make it obvious.
Here's an example from the SBC LTER failures file:
<-- SKIPPED (USUALLY BECAUSE THE COLUMN NAMES IN THE DATAFILE ARE IN
A DIFFERENT ORDER OR HAVE DIFFERENT UNITS THAN IN THE EML file):
datasetID=knb_lter_sbc_3017_t1
dataFile=MC06_allyears_2012-03-03.txt
The data file and EML file have different column names.
ERDDAP™ would like to equate these pairs of names:
2008-10-01T00:00 = timestamp_local
2008-10-01T07:00 = timestamp_UTC
2.27 = discharge_lps
-999.0 = water_temperature_celsius
-->
So the error may appear as if GenerateDatasetsXml thinks that the first
line with data in the file (e.g., with 2008-10-01T00:00 etc.)
is the line with column names (as if 2008-10-01T00:00 were
a column name).
It would be nice if the EML checker checked the numHeaderLines value.
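One way a checker could test numHeaderLines is to ask whether the supposed column names look like data values. The Python sketch below uses an assumed rule: a name "looks like data" if it parses as a number or as an ISO 8601 date/time. The function names are hypothetical.

```python
from datetime import datetime


def looks_like_data(value):
    """Return True if the text parses as a number or an ISO 8601 date/time."""
    text = value.strip()
    try:
        float(text)
        return True
    except ValueError:
        pass
    try:
        datetime.fromisoformat(text)
        return True
    except ValueError:
        return False


def header_is_probably_data(names):
    """Return True if every supposed column name looks like a data value."""
    return bool(names) and all(looks_like_data(name) for name in names)


# The unmatched "column names" from the skipped-table message above.
suspect = ["2008-10-01T00:00", "2008-10-01T07:00", "2.27", "-999.0"]
print(header_is_probably_data(suspect))
```

This only catches files where the stated header line is really data. As noted above, the opposite mistake (numHeaderLines=0 when a header exists) needs a different test.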
- No Column Names
Some source files don't have column names. ERDDAP™ accepts that
if the EML describes the same number of columns.
In my opinion: this seems very dangerous. There could be columns
in a different order or with different units (see below)
and there is no way to catch those problems. It is much better if
all ASCII data files have a row with column names.
- DateTime Format Strings
EML has a standard way to describe date time formats,
but there is considerable variation in its use in EML files.
(I was previously wrong about this. I see the EML documentation
for formatString which appears to match the
Java DateTimeFormatter specification,
but which lacks the important guidelines about its use,
with the result that formatString is often/usually improperly used.)
There are several instances with incorrect case,
and/or incorrect duplication of a letter,
and/or non-standard formatting.
That puts an unreasonable burden on clients, especially software clients like GenerateDatasetsXml.
GenerateDatasetsXml tries to convert the incorrectly defined
formats in the EML files into
the date/time format that ERDDAP™ requires, which is almost
identical to the Java/Joda time format specification, but is slightly more forgiving.
It would be nice if the EML checker required strict adherence to the
Java/Joda/ERDDAP time units specification and verified that date time values in the data table
could be parsed correctly with the specified format.
- DateTime But No Time Zone
GenerateDatasetsXml looks for a column with dateTime and a specified
time zone (either
Zulu: from time units ending in 'Z' or a column name
or attribute definition that includes "gmt" or "utc",
or local: from "local" in the column name or attribute definition).
Also acceptable is a file with a date column but no time column.
Also acceptable is a file with no date or time information.
GenerateDatasetsXml treats all "local" times as being from the time zone
which you can specify for a given batch of files,
e.g., for SBC LTER, use US/Pacific.
The information is sometimes in the comments, but not in a form
that is easy for a computer program to figure out.
Files that don't meet these criteria are rejected with the message
"NO GOOD DATE(TIME) VARIABLE".
Common problems are:
- There is a column with dates and a column with times, but no dateTime column.
- There are time units, but the time zone isn't specified.
Other comments:
If there is a good date+time with time zone column,
that column will be named "time" in ERDDAP.
ERDDAP™ requires that time column data be understandable/convertible to
Zulu/UTC/GMT time zone dateTimes.
My belief is: using local times and different date/time formats
(2-digit years! mm/dd/yy vs dd/mm/yy vs ... ) in data files forces
the end user to do complicated conversions to Zulu time in order to
compare data from one dataset with data from another.
So ERDDAP™ standardizes all time data:
For string times, ERDDAP™ always uses the ISO 8601:2004(E) standard format,
for example, 1985-01-02T00:00:00Z.
For numeric times, ERDDAP™ always uses "seconds since 1970-01-01T00:00:00Z".
ERDDAP™ always uses the Zulu (UTC, GMT) time zone to remove the
difficulties of working with different time zones and standard time
versus daylight saving time.
So GenerateDatasetsXml seeks an EML dataTable column with date+time Zulu.
This is hard because EML doesn't use a formal vocabulary/system
(like
Java/Joda time format)
for specifying the dateTime format:
If there is a col with numeric time values (e.g., Matlab times) and
Zulu timezone (or just dates, with no time columns),
it is used as "time".
If there is a col with date and time data, using the Zulu time zone,
it is used as "time"
and any other date or time column is removed.
Else if a col with just date information is found, it is used as
the "time" variable (with no time zone).
If there is a date column and a time column and no combined dateTime column,
the dataset is REJECTED, but the dataset could be made usable
by adding a combined dateTime column (preferably, Zulu time zone) to the datafile
and adding its description in the EML file.
EXAMPLE from SBC LTER:
https://sbclter.msi.ucsb.edu/external/InformationManagement/eml_2018_erddap/
dataTable #2.
It would be nice if EML/LTER required the inclusion of a column with
Zulu (UTC, GMT) time zone times in all relevant source data files.
Next best is to add a system to EML to specify a time_zone attribute using standard
names (from the
TZ column).
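To make the local-to-Zulu conversion concrete, here is a minimal Python sketch (Python 3.9 or later, with time zone data installed). It is not ERDDAP™ code; the function name is hypothetical. It uses America/Los_Angeles, which is the time zone database's canonical name for the zone that US/Pacific refers to, and the sample value is the timestamp_local value from the MC06 example earlier in this document.

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo


def local_to_zulu(local_text, tz_name):
    """Convert an ISO 8601 local time to (Zulu string, seconds since 1970-01-01T00:00:00Z)."""
    naive = datetime.fromisoformat(local_text)
    aware = naive.replace(tzinfo=ZoneInfo(tz_name))  # attach the local time zone
    utc = aware.astimezone(timezone.utc)
    return utc.strftime("%Y-%m-%dT%H:%M:%SZ"), int(utc.timestamp())


zulu_string, epoch_seconds = local_to_zulu("2008-10-01T00:00", "America/Los_Angeles")
print(zulu_string, epoch_seconds)
```

The resulting string, 2008-10-01T07:00:00Z, agrees with the timestamp_UTC value paired with that local time in the same example. Local times that fall in a daylight saving time transition are ambiguous or nonexistent, which is one more reason to store Zulu times in the data file.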
- Missing missing_value
Some columns use a missing_value
but don't list it in the EML metadata, e.g., precipitation_mm in
knb-lter-sbc.5011 uses -999.
If no missing value is specified in the EML, GenerateDatasetsXml
automatically searches for common missing values (e.g.,
99, -99, 999, -999, 9999, -9999, etc.) and creates that
metadata. But other missing missing_values are not caught.
It would be nice if the EML checker looked for missing missing_values.
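The search for common missing values could look something like the Python sketch below. The actual logic in GenerateDatasetsXml isn't described here, so the heuristic is an assumption: a candidate counts as a missing value only if it occurs in the column and is larger in magnitude than every other value. The names are hypothetical.

```python
# Candidate sentinel values taken from the list above.
COMMON_MISSING_VALUES = (99, -99, 999, -999, 9999, -9999)


def guess_missing_value(values, candidates=COMMON_MISSING_VALUES):
    """Return the first candidate that looks like a missing-value flag, else None."""
    for candidate in candidates:
        if candidate not in values:
            continue
        others = [v for v in values if v != candidate]
        # Assumed heuristic: the flag must lie outside the range of the real data.
        if all(abs(candidate) > abs(v) for v in others):
            return candidate
    return None


precipitation_mm = [0.0, 1.2, -999.0, 3.4, -999.0]  # made-up sample values
print(guess_missing_value(precipitation_mm))
```

A heuristic like this will still miss unusual flags (e.g., -32768) and can be fooled by real data, so the result should be reviewed by a person.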
- Small Problems
There are a lot of small problems (spelling, punctuation)
which will probably only be found by a human inspecting each dataset.
It would be nice if the EML checker looked for spelling and grammatical errors.
This is a difficult problem because words in science are often
flagged by spell checkers. Human editing is probably needed.
- Invalid Unicode Characters
Some of the EML content contains invalid Unicode characters.
These are probably characters from the Windows charset that
were incorrectly copied and pasted into the UTF-8 EML files.
GenerateDatasetsXml sanitizes these characters to e.g., [#128],
so they are easy to search for in the ERDDAP™ datasets.xml file.
It would be nice if the EML checker checked for this.
It is easy to find and easy to fix.
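A check for these characters can be sketched in Python as follows. Which characters GenerateDatasetsXml treats as invalid isn't spelled out here, so the sketch assumes the code points 128 through 159, which is where Windows-1252 characters end up when they are misread. The function name is hypothetical.

```python
import re


def sanitize_invalid_chars(text):
    """Replace code points 128-159 with a visible marker such as [#128]."""
    # Assumption: only the C1 control range (0x80-0x9F) is treated as invalid.
    return re.sub(r"[\x80-\x9f]", lambda m: f"[#{ord(m.group())}]", text)


print(sanitize_invalid_chars("price in \x80 and caf\u00e9"))
```

Legitimate non-ASCII characters, such as the é in the sample, are left alone.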
- Different Column Units
Some EML dataTables define columns that are inconsistent
with the columns in the data file, notably because they
have different units.
GenerateDatasetsXml flags these. It is up to the operator to
decide if the differences are okay or not.
These appear in the failures file as "SKIPPED" dataTables.
EXAMPLE in SBC LTER failures file:
<-- SKIPPED (USUALLY BECAUSE THE COLUMN NAMES IN THE DATAFILE ARE IN
A DIFFERENT ORDER OR HAVE DIFFERENT UNITS THAN IN THE EML file):
datasetID=knb_lter_sbc_3_t1
dataFile=SBCFC_Precip_Daily_active_logger.csv
The data file and EML file have different column names.
ERDDAP™ would like to equate these pairs of names:
Daily_Precipitation_Total_mm = Daily_Precipitation_Total_inch
Flag_Daily_Precipitation_Total_mm = Flag_Daily_Precipitation_Total_inch
-->
It would be nice if the EML checker checked that
the units match. Unfortunately, this is probably impossible
to catch and then impossible to resolve without contacting the
dataset creator, given that the source file doesn't include units.
The discrepancy for the example above was only noticeable
because the units were included in the source column name
and the EML column name. How many other dataTables
have this problem but are undetectable?
- Different Versions of EML
GenerateDatasetsXml is designed to work with EML 2.1.1.
Other versions of EML will work to the extent that they
match 2.1.1 or that GenerateDatasetsXml has special code to deal with it.
This is a rare problem.
When it occurs, the solution is to convert your files to EML 2.1.1, or
send the EML file to erd.data at noaa.gov, so I can make
changes to GenerateDatasetsXml to deal with the differences.
Bob added support for EML files to GenerateDatasetsXml in 2016 with the
hope that there would be some uptake in the EML community. As of 2020, that has not happened.
Bob is happy to add support for more recent versions of EML, but only
if the new features will actually be used. Please email
erd.data at noaa.gov if you want support for more recent versions of EML
and will actually use this feature.
- Trouble Parsing the Data File
Rarely, a dataTable may be rejected with the error
"unexpected number of items on line #120 (observed=52, expected=50)"
An error message like this means that a line in the datafile
had a different number of values than the other lines.
It may be a problem in ERDDAP™ (e.g., not parsing the file correctly)
or in the file.
EXAMPLE from SBC LTER:
https://sbclter.msi.ucsb.edu/external/InformationManagement/eml_2018_erddap/
dataTable #3, see
datafile=LTER_monthly_bottledata_registered_stations_20140429.txt
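To find the offending lines in a data file yourself, a minimal Python sketch like this one may help. It assumes the first row defines the expected number of items and formats its messages like the error quoted above; the function name and the sample data are hypothetical, and this is not the parser that ERDDAP™ uses.

```python
import csv
import io


def find_irregular_lines(text, delimiter=","):
    """Report rows whose item count differs from that of the first row."""
    rows = list(csv.reader(io.StringIO(text), delimiter=delimiter))
    if not rows:
        return []
    expected = len(rows[0])  # assumption: the first row has the right count
    return [
        f"unexpected number of items on line #{number} "
        f"(observed={len(row)}, expected={expected})"
        for number, row in enumerate(rows, start=1)
        if len(row) != expected
    ]


sample = "station,depth,temp\nA,1,12.5\nB,2,12.1,extra\n"
for message in find_irregular_lines(sample):
    print(message)
```

Row numbers equal line numbers only when no quoted value contains a line break.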
Questions, comments, suggestions? Please send an email to
erd dot data at noaa dot gov .
ERDDAP, Version 2.25