DAPD1Servlet

DAP-DataONE Servlet. This implements a DataONE Version 1, Tier 1 Member Node for data accessible using DAP2.

View the Project on GitHub jgallagher59701/DAPD1Servlet

Welcome to the DAP-DataONE server documentation

The DAP-DataONE server is an implementation of a DataONE version 1, tier 1 Member Node. It is designed to act as a kind of broker for DataONE clients and data served using OPeNDAP. The initial version of the server only supports DAP2 but will work with any DAP server that can return data packaged in netCDF3 files and can return ISO-19115 metadata documents for those datasets. The server is implemented as a Java Servlet that contains the implementation of the DataONE Member Node protocol and a simple database paired with a command-line tool to add or update datasets in the database. The database uses SQLite, although all of the software interacts with it using JDBC, so a different RDB could be used with only very trivial modifications to the software.

What's here in the documentation?

The documentation contains information about how to:

In addition, there is documentation that discusses ways the server could be made more powerful and limitations imposed by the designs of DAP2 and DataONE:

Beta software; What we assume

This is beta software without a polished build/install process. We assume you are computer savvy and know how to configure systems, install software, start and stop web servers, ... all that stuff. If you want to use this software but are unfamiliar with these kinds of things, contact us at support@opendap.org or DataONE:Contact.

Build the software

This project and its companion, DAPD1DatasetsDatabase, use Maven.

Get Maven if you do not already have it

Go to http://maven.apache.org/download.cgi, download the binary distribution and install it. On Mac OS X, put the apache-maven-<ver> directory in /Applications, then add its bin directory to your PATH.
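For example (the version number in the path below is an assumption; use whichever apache-maven-<ver> directory you actually unpacked):

```shell
# Put Maven's bin directory on the PATH; adjust the version to match
# the directory you installed in /Applications.
export PATH="/Applications/apache-maven-3.9.6/bin:$PATH"

# 'mvn -v' should now print the Maven version.
```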

Build the code

Use git to clone the DAPD1DatasetsDatabase and DAPD1Servlet projects (hosted here on GitHub). It's best to store them under the same root directory (e.g., one called 'dataone') even though they are completely separate projects. Build DAPD1DatasetsDatabase first using mvn clean install; this builds the executable jar file used by the edit-db.sh bash shell script and installs it in the local Maven repository so the DAPD1Servlet build can find it. Then build DAPD1Servlet using mvn clean package; this produces a war file that can be installed in Tomcat's webapps subdirectory. If there's significant demand, we can make binaries available.
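The whole sequence looks something like this (the clone URL for DAPD1DatasetsDatabase is an assumption based on it living under the same GitHub account as DAPD1Servlet):

```shell
# Clone both projects into a common root directory.
mkdir -p dataone && cd dataone
git clone https://github.com/jgallagher59701/DAPD1DatasetsDatabase.git
git clone https://github.com/jgallagher59701/DAPD1Servlet.git

# Build the database tool first; 'install' puts its jar in the local
# Maven repository where the servlet build can find it.
(cd DAPD1DatasetsDatabase && mvn clean install)

# Build the servlet; the deployable war ends up in DAPD1Servlet/target/.
(cd DAPD1Servlet && mvn clean package)
```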

Note:

I use Eclipse with the Maven plugin (m2e). Once the code is checked out using git clone on the command line, use File > Import..., choose Git > Existing local repository, and select the 'import as general project' option. In the project, right-click the pom.xml file and choose Run As... to build. It would also work to check out the code directly into Eclipse.

Configure and Test the Server

To install the DAP-DataONE server, you'll need Tomcat 7 (other servlet engines have not been tested; they might work, since nothing I know of is specific to Tomcat or Tomcat 7). Copy target/DAPD1Servlet.war to Tomcat's webapps directory. Start Tomcat and then stop it. This little gyration creates a default 'opendap.properties' file that you can then edit. (You can simply start Tomcat and edit the file, and the servlet will pick up the changes, but that doesn't always work as expected, so I document the start-stop process.) Look in $CATALINA_HOME/webapps/DAPD1Servlet/WEB-INF/classes/ for that file and edit it as follows:
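The deploy-and-configure sequence sketched as shell commands (assuming $CATALINA_HOME is set and the standard Tomcat control scripts are used):

```shell
# Copy the war into Tomcat's webapps directory.
cp target/DAPD1Servlet.war "$CATALINA_HOME/webapps/"

# Start Tomcat so it unpacks the war and writes the default properties
# file, then stop it again before editing.
"$CATALINA_HOME/bin/startup.sh"
"$CATALINA_HOME/bin/shutdown.sh"

# Edit the generated configuration.
vi "$CATALINA_HOME/webapps/DAPD1Servlet/WEB-INF/classes/opendap.properties"
```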

Note:

You may have to hack your Tomcat servlet engine a bit to get it to properly decode escaped characters in the path part of the URL. To do that, add the following to $CATALINA_HOME/conf/catalina.properties:

```
org.apache.tomcat.util.buf.UDecoder.ALLOW_ENCODED_SLASH=true
org.apache.catalina.connector.CoyoteAdapter.ALLOW_BACKSLASH=true
```

Once the config file(s) are edited, restart Tomcat. If the test.db database is correctly referenced by the org.opendap.d1.DatabaseName option, the servlet should start without errors. Look in Tomcat's catalina.out log file to verify that it has started without error. If there are errors, edit the logback.xml file (webapps/DAPD1Servlet/WEB-INF/classes/logback.xml) and set the log level to DEBUG in the logger named "org.opendap.d1". Also note that the servlet writes log messages to a log file named 'opendap.log'.
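In logback's configuration, that typically amounts to a logger element like the one below (the surrounding appender configuration in the shipped logback.xml stays as-is):

```xml
<!-- Turn on DEBUG logging for the DAP-DataONE servlet classes. -->
<logger name="org.opendap.d1" level="DEBUG" />
```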

Assuming Tomcat and the Servlet start without errors, try these URLs (I'll use localhost:8080 for the host and port):
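For example, using curl (the DAPD1Servlet context path and the /v1 service root are assumptions based on the war file name and the DataONE version 1 Member Node REST API; adjust them to match your deployment):

```shell
# MNCore.ping() - a minimal liveness check.
curl -i http://localhost:8080/DAPD1Servlet/v1/monitor/ping

# The node capabilities document.
curl -i http://localhost:8080/DAPD1Servlet/v1/node

# List the objects (PIDs) the node knows about.
curl -i http://localhost:8080/DAPD1Servlet/v1/object
```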

Serving your own data

See the information about adding and updating datasets in the database over in the DAPD1DatasetsDatabase project. Once this database has been built, copy it to a safe place and edit the org.opendap.d1.DatabaseName parameter so that it references the new database. Restart the servlet, check for errors, and test using ping and some of your own datasets.

About the design, potential optimizations and its current limitations

This is the first attempt at a broker (using the term loosely) for DAP and DataONE and was written over about two months of calendar time. The DataONE Java libraries, along with an example servlet, made the process go quite fast (about five weeks of total development time). However, there are significant differences in some of the goals of the DAP and DataONE projects. Fundamentally, DAP is a protocol for data access and subsetting, while DataONE is a system that provides a complete solution to a range of activities important to users of online data. This difference means that DataONE provides one or more solutions to a range of problems (data persistence, location, etc.) in addition to data access. In many ways that is what makes combining the two so interesting. With that in mind...

How it works

I'm going to assume you know how DataONE works, at least at a basic level.

The DAP-DataONE server uses two features of many DAP implementations to provide two of the three most important responses of the DataONE Tier 1 Member Node API: the Science Data Object (SDO) and the Science Metadata Object (SMO). DAP servers often provide a way to access data packaged in a netCDF3 file, regardless of how the data are originally stored. Thus the DAP-DataONE server is designed to always return the SDO as a netCDF3 file. The original data might have been stored in an RDB, an HDF4 file, etc., but the return format from the DAP-DataONE server will always be a netCDF3 file. The SMO, a metadata object that matches the SDO, is built using many DAP servers' ability to build ISO-19115 documents that describe any given dataset.

The DAP URL to the dataset is used to build the URLs that access the SDO and SMO on the DAP server. For each dataset the DAP server holds, there is one base URL; from that base URL are derived one URL that returns a netCDF3 file holding all of the dataset's data (the SDO) and one URL that returns the ISO-19115 document (the SMO).
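As a sketch of what the two derived URLs might look like (the dataset URL is hypothetical, and the exact response suffixes depend on the DAP server; Hyrax, for example, uses .nc for the netCDF file-out response and .iso for the ISO-19115 response):

```shell
# Hypothetical base DAP URL for a dataset:
BASE=http://dap.example.org/opendap/data/sst_monthly.hdf

# SDO: the whole dataset packaged as a netCDF3 file ('.nc' suffix).
curl -o sdo.nc "$BASE.nc"

# SMO: the ISO-19115 metadata document for the same dataset ('.iso' suffix).
curl -o smo.xml "$BASE.iso"
```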

Note that DataONE clients refer to the SDO and SMO using Persistent Identifiers (PIDs) and not URLs per se. These PIDs are passed to a server that implements the DataONE API as parameters to a URL. The DAP-DataONE server stores the PIDs in a database and looks up the PID passed to it to determine the URL to use in accessing the requested SDO or SMO. The third object that a DataONE Member Node may return is an Object Reuse and Exchange (ORE) document. This document is built using the SDO and SMO PIDs (not the DAP URLs, but the DataONE PIDs that refer to those URLs). The ORE document is stored in the DAP-DataONE server's database.

In addition, DataONE requires other information about any given dataset. For each of the SDO, SMO and ORE responses, DataONE specifies a set of 'system metadata' that must also be made available by a DataONE server. This information includes the PID that is used to access the response, the size and checksum of the response as well as other metadata. The DAP-DataONE server uses a relational database to store this information.
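An abbreviated, illustrative fragment of DataONE version 1 system metadata for an SDO might look like this (the identifier, size, and checksum values here are made up, and the full DataONE Types schema includes more elements than shown):

```xml
<!-- Abbreviated, illustrative system metadata; values are made up. -->
<d1:systemMetadata xmlns:d1="http://ns.dataone.org/service/types/v1">
  <serialVersion>1</serialVersion>
  <identifier>urn:example:sdo_pid_1</identifier>
  <formatId>netCDF-3</formatId>
  <size>1048576</size>
  <checksum algorithm="SHA-1">da39a3ee5e6b4b0d3255bfef95601890afd80709</checksum>
  <dateUploaded>2014-06-01T00:00:00Z</dateUploaded>
</d1:systemMetadata>
```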

The DAP-DataONE server uses SQLite (although it would be trivial to substitute a different database engine like MySQL) to store the metadata, PIDs, and other information, including the ORE documents. This information is accessed and used to build every response except the SDO and SMO responses, which are read from the DAP server, not the relational database. The DAPD1DatasetsDatabase project web page contains information about the tables in the current implementation of the datasets database.

The DAP-DataONE server also uses the RDB to store access information. Information about every SDO, SMO and ORE request is stored in a table and can be queried using the DataONE 'log' function.

Optimizations and Improvements

From the preceding description, it should be clear that one optimization of the server would be to cache the ISO-19115 metadata documents, either in the database or using a separate cache tool like the Java Caching System (JCS). This would speed up the response for these objects. Similarly, if sufficient memory is available for a server, caching the SDO would also boost performance, although this would likely be a real cache in the sense that old items would be pushed out to make room for newer or more frequently requested things. The current 'caching' of the ORE documents stores all of them permanently in the database.

The server could benefit from more generic optimizations such as pooling database connections and object reuse by the servlet. These kinds of optimizations will benefit installations with high usage (where 'high' is, of course, a relative term) but do not affect the overall functionality of the server in a conceptual sense.

There are some quirks to the server's design that could be addressed, and while not really optimizations, doing these things would improve its ease of use. First, the servlet and the database are loosely coupled, and while normally that's a good thing, in this case it can lead to some odd behavior. In particular, the server has a single DAP server named in the configuration file, with the implication that all data are served from that one site. However, that does not have to be the case because there's no limit on the number of different sites that can be in the database (i.e., there's no way to ensure the database and the configuration file agree). In fact, the database has no notion of sites at all; it simply holds URLs without considering where the data will be read from. The 'ping' function uses the 'DAP server' named in the configuration parameter, so success of 'ping' means only that the one DAP server named in the configuration file is working, not that all of the servers referenced in the URLs in the database are up.

DAP servers that return ISO-19115 documents do not always return useful ISO-19115 documents, because sometimes the underlying datasets lack information needed by the ISO standard. Where required information is missing, it is simply marked as 'null.' A very useful improvement to this server would be an editing tool that would enable a data provider to easily supplement or replace the metadata in the automatically generated document. This can be done now using NCML or an Ancillary DAS on the DAP server, but it requires that the person configuring the DAP-DataONE server also have access to the DAP server. If this metadata editing and augmentation ability were combined with local metadata document caching, two issues - performance and usefulness - could be addressed at once.

The process of adding datasets is somewhat tedious. If the DAP-DataONE server could automate this, possibly in combination with a web interface, it would be much easier to manage a large number of datasets. As it stands, the database is edited using a command-line tool that is somewhat limited in its abilities (see DAPD1DatasetsDatabase for more information).

Limitations

DataONE, as was said earlier, is a complete data system that addresses a number of issues in the data management life-cycle. DAP is a protocol used to access data. The strengths of DAP are its ability to hide storage format and to provide server processing functions and subsetting operations. However, the latter two of those three features are accessed using a web services API, which is currently out of scope for DataONE. There is a real likelihood that the version 2 DataONE Member Node protocol will include support for web services at the same level as the version 1 protocol includes support for file formats, but that is currently not the case. This means that SDO access by the DAP-DataONE server is limited to either entire DAP datasets or preconfigured subsets of datasets. For datasets served by DAP that are aggregations of thousands of large files (e.g., satellite-derived time series data), it is simply not practical to access them using DataONE version 1. It is possible to configure 'useful subsets' of those and make them available - because the DAP URLs for subsets are distinct from those that reference the entire dataset - but those subsets must be hardwired by the person who configures the DAP-DataONE server. This limitation is fundamental to the designs of DAP and DataONE, not the DAP-DataONE server.