bes - Updated for version 3.21.1

The Backend Server (BES) is the lower two tiers of the Hyrax data server.
This directory (data) contains hdf5/netcdf-4 data files used to create portable dmr++ files whose binary data objects are held in a web object store like AWS S3.

We have developed an initial set of tools that enable a data provider to easily serve data stored in Amazon's S3 Web Object Store. In the current implementation, the data must be stored in HDF5 or NetCDF4 files. The data do not, however, have to be reformatted to be used with the Hyrax server. Furthermore, the data objects are subset 'in-place' from S3 instead of first transferring the object and then serving it, resulting in lower response latency than other solutions for S3 data access, such as those based on FUSE filesystems. For data users, access is seamless: there is no difference between access to data stored in S3 and access to data stored on spinning disk.
We have also tested this software against the Google Cloud Store and found that it works with that Web Object Store as well. In fact, reconfiguration for GCS is trivial.
The dmr++ files are the control data used by the server to enable 'in-place' access and subsetting of data in S3.
- Self-contained and portable: The dmr++ files are self-contained. They can be served by any Hyrax server (version 1.16.0 or higher) simply by placing them in the server's data file system.
- Size: The dmr++ files are typically very much smaller than their source hdf5/netcdf-4 files, by as much as 2 or even 3 orders of magnitude (YMMV).
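As an illustration of the portability claim, here is a minimal sketch; the server location (localhost:8080) and the use of /usr/share/hyrax as the server's data file system are assumptions for this example, not part of the tools described here:

```sh
# Drop a previously built dmr++ file into the Hyrax server's data file system.
cp foo.dmrpp /usr/share/hyrax/

# The file is now servable; for example, request its DAP4 metadata (DMR).
curl "http://localhost:8080/opendap/foo.dmrpp.dmr.xml"
```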
There are three programs for building dmr++ files:
- get_dmrpp builds a single dmr++ file from a single netcdf-4/hdf5 file.
- ingest_filesystem builds a collection of dmr++ files from data held in the locally mounted filesystem.
- ingest_s3bucket builds a collection of dmr++ files from data held in Amazon's S3 storage.

NOTE: Organizationally, this directory (data) and its child directory dmrpp are arranged in this hierarchy in order to mimic the deployment structure that results from running "make install". Most modules do not need to do this, but because dmr++ files reference other files using paths relative to the BES Catalog Root, the mimicry is required.

NOTE: Examples can be run as shown from the bes/modules/dmrpp_module/data directory.
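For reference, the hierarchy looks roughly like this; only the two sample files named in the examples below are shown:

```
data/                        # mimics the BES Catalog Root layout after "make install"
└── dmrpp/
    ├── chunked_fourD.h5
    └── chunked_shuffled_fourD.h5
```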
In order for these programs (shell scripts) to function correctly, a localization step must take place. This happens when the parent software (the dmrpp_module) is built and installed as part of the BES. Once this is done, the scripts will have been installed in $prefix/bin and are ready to use.
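The BES builds with the GNU autotools, so the build-and-install step that performs this localization typically looks something like the following sketch (the installation prefix is illustrative):

```sh
# From the top of a bes source tree; choose your own installation prefix.
autoreconf --force --install --verbose
./configure --prefix=/usr/local
make
make install

# The localized scripts should now be present:
ls /usr/local/bin/get_dmrpp /usr/local/bin/ingest_filesystem /usr/local/bin/ingest_s3bucket
```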
get_dmrpp - build a dmr++ file from an hdf5/netcdf-4 file

The get_dmrpp shell script generates a single dmr++ file from a single netcdf-4/hdf5 file. It is used by both ingest_filesystem and ingest_s3bucket.
This example creates a dmr++ file (foo.dmrpp) whose binary object URL is a file URL containing the fully qualified path to the source data file as its value:

```sh
get_dmrpp -v -d `pwd` -o foo.dmrpp -u file://`pwd`/dmrpp/chunked_shuffled_fourD.h5 dmrpp/chunked_shuffled_fourD.h5
```
This example creates a dmr++ file (foo.dmrpp) whose binary object URL references an object in Amazon's S3:

```sh
get_dmrpp -v -d `pwd` -o foo.dmrpp -u https://s3.amazonaws.com/opendap.scratch/data/dmrpp/chunked_fourD.h5 dmrpp/chunked_shuffled_fourD.h5
```
ingest_filesystem - building dmr++ files from local files

The shell script ingest_filesystem is used to crawl through a branch of the local filesystem, identifying files that match a regular expression (default or supplied) and then attempting to build a dmr++ file for each matching file using the get_dmrpp program.
In its simplest invocation, ingest_filesystem's defaults will cause it to check for the file ./data_files.txt. If found, ingest_filesystem will treat every line in ./data_files.txt as a fully qualified path to an hdf5/netcdf-4 file for which a dmr++ file is to be computed. By default, the output tree will be placed in the current working directory, and the base endpoint for the dmr++ binary objects will be set to the current working directory.
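A minimal sketch of that simplest invocation; the file paths written into data_files.txt here are illustrative stand-ins:

```sh
# One fully qualified hdf5/netcdf-4 file path per line.
cat > data_files.txt <<EOF
/usr/share/hyrax/dmrpp/chunked_fourD.h5
/usr/share/hyrax/dmrpp/chunked_shuffled_fourD.h5
EOF

# With no options, ingest_filesystem reads ./data_files.txt and writes
# the dmr++ output tree into the current working directory.
ingest_filesystem
```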
In this invocation, ingest_filesystem crawls the local filesystem beginning with the CWD. Every file that matches the default regular expression (^.*\.(h5|he5|nc4)(\.bz2|\.gz|\.Z)?$) will be treated as an hdf5/netcdf-4 file for which a dmr++ file is to be computed. The output tree will be placed in a directory called scratch in the current working directory. The base URL for the dmr++ binary objects will be set to the current working directory.
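Reconstructing the command line from the options described below, the invocation is presumably:

```sh
ingest_filesystem -f -t scratch
```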
- -f : Use the find command along with the regular expression to traverse the filesystem and locate all of the matching files. These file names are placed, as fully qualified path names, in the file ./data_files.txt to be reused or hand edited if needed.
- -t scratch : Place the output tree in the directory scratch.
In this invocation, ingest_filesystem crawls the local filesystem beginning at /usr/share/hyrax. Every file that matches the default regular expression (^.*\.(h5|he5|nc4)(\.bz2|\.gz|\.Z)?$) will be treated as an hdf5/netcdf-4 file for which a dmr++ file is to be computed. The output tree will be placed in /tmp/dmrpp. The base URL for the dmr++ binary objects will be set to the AWS S3 bucket URL https://s3.amazonaws.com/cloudydap.
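Again reconstructing the command line from the options described below, the invocation is presumably:

```sh
ingest_filesystem -f -u https://s3.amazonaws.com/cloudydap -d /usr/share/hyrax -t /tmp/dmrpp
```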
- -f : Use the find command along with the regular expression to traverse the filesystem and locate all of the matching files. These file names are placed, as fully qualified path names, in the file ./data_files.txt to be reused or hand edited if needed.
- -u https://s3.amazonaws.com/cloudydap : The base URL for the dmr++ binary objects. File paths relative to the BES DataRoot will be appended to this URL to form the binary access URL for each dmr++ file.
- -d /usr/share/hyrax : The BES DataRoot for this invocation. Since the -f option is also present, the crawl of the filesystem will begin here.
- -t /tmp/dmrpp : The target directory for the dmr++ output tree, /tmp/dmrpp.
ingest_s3bucket - building dmr++ files from files held in S3

The shell script ingest_s3bucket utilizes the AWS CLI to list the contents of an S3 bucket. The name of each object in the bucket is checked against either the default or a user-supplied regular expression. Each matching file is retrieved from S3, and a dmr++ file is built from the retrieved data object. Once the dmr++ file is built, the downloaded object is deleted unless otherwise instructed. The code relies on the AWS CLI being installed and configured using the aws configure command (or its equivalent).
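For reference, the AWS CLI pieces the script depends on look like this; the bucket name cloudydap is the script's default, and the credentials and region are whatever your account uses:

```sh
# One-time setup: supply an access key id, secret access key, and default region.
aws configure

# Sanity check: list a few of the objects the script will examine.
aws s3 ls s3://cloudydap/ | head
```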
In its simplest invocation, ingest_s3bucket's defaults will cause it to check for the file ./s3_cloudydap_data_files.txt. (It looks for this file because the default bucket name is cloudydap, and the software caches bucket information in files named following the patterns s3_BUCKETNAME_all_files.txt and s3_BUCKETNAME_data_files.txt; changing the bucket name changes the names of these information files accordingly.) If the file is found, ingest_s3bucket will treat the 4th column of every line in ./s3_cloudydap_data_files.txt as a relative path to an hdf5/netcdf-4 file in the default bucket (cloudydap) for which a dmr++ file is to be computed. By default, the output tree will be placed in the current working directory, and the base endpoint for each dmr++ binary object will be set to the https://s3.amazonaws.com URL of the S3 object that was used to create the dmr++ file.
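A minimal sketch of that simplest invocation:

```sh
# With no options, ingest_s3bucket operates on the default bucket (cloudydap),
# using ./s3_cloudydap_data_files.txt if it exists and otherwise building it
# from an AWS CLI listing of the bucket.
ingest_s3bucket
```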
In this example we have ingest_s3bucket locate all the matching data files in the S3 bucket opendap.scratch, store the downloaded data files in /tmp/s3_scratch, and place the resulting dmr++ files in /usr/share/hyrax.
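Reconstructing the command line from the options described below, the invocation is presumably:

```sh
ingest_s3bucket -v -f -b opendap.scratch -d /tmp/s3_scratch -t /usr/share/hyrax
```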
- -v : Verbose output.
- -f : Use the find command along with the regular expression to traverse the object names retrieved from S3 and locate all of the matching files. These file names are saved in the file ./s3_BUCKETNAME_data_files.txt to be reused or hand edited if needed.
- -b opendap.scratch : The S3 bucket to crawl, opendap.scratch.
- -d /tmp/s3_scratch : The directory in which to store the data files downloaded from S3, /tmp/s3_scratch.
- -t /usr/share/hyrax : The target directory for the dmr++ output tree, /usr/share/hyrax, the default data directory for Hyrax.

ChangeLog
5/25/18
4/3/19