The NetCDF NCZarr Implementation
============================
<!-- double header is needed to workaround doxygen bug -->

# The NetCDF NCZarr Implementation {#nczarr_head}

\tableofcontents

# NCZarr Introduction {#nczarr_introduction}

Beginning with netCDF version 4.8.0, the Unidata NetCDF group has extended the netcdf-c library to provide access to cloud storage (e.g. Amazon S3 <a href="#ref_aws">[1]</a> ).
This extension provides a mapping from a subset of the full netCDF Enhanced (aka netCDF-4) data model to a variant of the Zarr <a href="#ref_zarrv2">[6]</a> data model.
The NetCDF version of this storage format is called NCZarr <a href="#ref_nczarr">[4]</a>.

A note on terminology in this document.
1. The term "dataset" is used to refer to all of the Zarr objects constituting
   the meta-data and data.

There are some important "caveats" of which to be aware when using this software.
1. NCZarr currently is not thread-safe. So any attempt to use it with parallelism, including MPIO, is likely to fail.

# The NCZarr Data Model {#nczarr_data_model}

NCZarr uses a data model <a href="#ref_nczarr">[4]</a> that, by design, extends the Zarr Version 2 Specification <a href="#ref_zarrv2">[6]</a> to add support for the NetCDF-4 data model.

__Note Carefully__: a legal _NCZarr_ dataset is also a legal _Zarr_ dataset under a specific assumption. This assumption is that within Zarr meta-data objects, like ''.zarray'', unrecognized dictionary keys are ignored.
If this assumption is true of an implementation, then the _NCZarr_ dataset is a legal _Zarr_ dataset and should be readable by that _Zarr_ implementation.
The converse is also true: a legal _Zarr_ dataset is also a legal _NCZarr_
dataset, where "legal" means it conforms to the Zarr version 2 specification.
In addition, certain non-Zarr features are allowed and used.
Specifically the XArray ''\_ARRAY\_DIMENSIONS'' attribute is one such.

There are two other, secondary assumptions:

1. The actual storage format in which the dataset is stored -- a zip file, for example -- can be read by the _Zarr_ implementation.
2. The compressors (aka filters) used by the dataset can be encoded/decoded by the implementation. NCZarr uses HDF5-style filters, so ensuring access to such filters is somewhat complicated. See [the companion document on filters](./md_filters.html "filters") for details.

Briefly, the data model supported by NCZarr is netcdf-4 minus
the user-defined types. However, a restricted form of String type
is supported (see Appendix H).
As with netcdf-4, chunking is supported. Filters and compression
are also [supported](./md_filters.html "filters").

Specifically, the model supports the following.
- "Atomic" types: char, byte, ubyte, short, ushort, int, uint, int64, uint64, string.
- Shared (named) dimensions
- Attributes with specified types -- both global and per-variable
- Chunking
- Fill values
- Groups
- N-Dimensional variables
- Scalar variables
- Per-variable endianness (big or little)
- Filters (including compression)

With respect to full netCDF-4, the following concepts are
currently unsupported.
- User-defined types (enum, opaque, VLEN, and Compound)
- Unlimited dimensions
- Contiguous or compact storage

Note that contiguous and compact are not actually supported
because they are HDF5 specific.
When specified, they are treated as chunked where the file consists of only one chunk.
This means that testing for contiguous or compact is not possible; the _nc_inq_var_chunking_ function will always return NC_CHUNKED and the chunksizes will be the same as the dimension sizes of the variable's dimensions.
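
For example, the following minimal sketch shows how this behavior appears through the C API; the file URL and variable name "v" are hypothetical.

```c
#include <stdio.h>
#include <netcdf.h>

/* Minimal sketch: inspect the chunking of a variable in an NCZarr dataset.
   The URL and variable name ("v") are hypothetical. */
int main(void) {
    int ncid, varid, storage;
    size_t chunksizes[NC_MAX_VAR_DIMS];

    if (nc_open("file:///tmp/example.file#mode=nczarr,file", NC_NOWRITE, &ncid)) return 1;
    if (nc_inq_varid(ncid, "v", &varid)) return 1;

    /* For NCZarr variables this always reports NC_CHUNKED, even if the
       variable was declared contiguous or compact when it was written. */
    if (nc_inq_var_chunking(ncid, varid, &storage, chunksizes)) return 1;
    printf("storage is %s\n", storage == NC_CHUNKED ? "NC_CHUNKED" : "other");

    nc_close(ncid);
    return 0;
}
```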

Additionally, it should be noted that NCZarr supports scalar variables,
but Zarr does not; Zarr only supports dimensioned variables.
In order to support interoperability, NCZarr does the following.
1. A scalar variable is recorded in the Zarr metadata as if it has a shape of **[1]**.
2. A note is stored in the NCZarr metadata that this is actually a netCDF scalar variable.

These actions allow NCZarr to properly show scalars in its API while still
maintaining compatibility with Zarr.
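
A minimal sketch of defining such a scalar variable through the C API, assuming a hypothetical local _file_ dataset:

```c
#include <netcdf.h>

/* Minimal sketch: define a netCDF scalar variable in an NCZarr dataset.
   The URL is hypothetical; in the Zarr metadata the variable will appear
   with shape [1], while the netCDF API continues to see it as a scalar. */
int main(void) {
    int ncid, varid;
    double value = 3.14;

    if (nc_create("file:///tmp/scalar.file#mode=nczarr,file", NC_NETCDF4|NC_CLOBBER, &ncid)) return 1;
    if (nc_def_var(ncid, "pi", NC_DOUBLE, 0, NULL, &varid)) return 1; /* 0 dimensions: scalar */
    if (nc_enddef(ncid)) return 1;
    if (nc_put_var_double(ncid, varid, &value)) return 1;
    return nc_close(ncid);
}
```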

# Enabling NCZarr Support {#nczarr_enable}

NCZarr support is enabled by default.
If the _--disable-nczarr_ option is used with './configure', then NCZarr (and Zarr) support is disabled.
If NCZarr support is enabled, then support for datasets stored as files in a directory tree is provided as the only guaranteed mechanism for storing datasets.
However, several additional storage mechanisms are available if additional libraries are installed.

1. Zip format -- if _libzip_ is installed, then it is possible to directly read and write datasets stored in zip files.
2. If the AWS C++ SDK is installed, and _libcurl_ is installed, then it is possible to directly read and write datasets stored in the Amazon S3 cloud storage.

# Accessing Data Using the NCZarr Protocol {#nczarr_accessing_data}

In order to access a NCZarr data source through the netCDF API, the file name normally used is replaced with a URL with a specific format.
Note specifically that there is no NC_NCZARR flag for the mode argument of _nc_create_ or _nc_open_.
Instead, use of NCZarr is indicated by the URL itself, as shown in the sketch below.
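
For example, here is a minimal sketch of opening an NCZarr dataset through the C API; the bucket and path are hypothetical.

```c
#include <netcdf.h>

/* Minimal sketch: open an NCZarr dataset by URL instead of a file name.
   The bucket and path shown here are hypothetical. */
int main(void) {
    int ncid;
    const char *url = "https://s3.us-east-1.amazonaws.com/examplebucket/dataset#mode=nczarr,s3";

    if (nc_open(url, NC_NOWRITE, &ncid)) return 1;
    /* ... use the normal netCDF-4 read API here ... */
    return nc_close(ncid);
}
```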

## URL Format
The URL has the usual format.
````
scheme://host:port/path?query#fragment
````
There are some details that are important.
- Scheme: this should be _https_, _s3_, or _file_.
  The _s3_ scheme is equivalent
  to "https" plus setting "mode=nczarr,s3" (see below).
  Specifying "file" is mostly used for testing, but is used to support
  directory tree or zipfile format storage.
- Host: Amazon S3 defines three forms: _Virtual_, _Path_, and _S3_
  + _Virtual_: the host includes the bucket name as in
    __bucket.s3.&lt;region&gt;.amazonaws.com__
  + _Path_: the host does not include the bucket name, but
    rather the bucket name is the first segment of the path.
    For example __s3.&lt;region&gt;.amazonaws.com/bucket__
  + _S3_: the protocol is "s3:" and if the host is a single name,
    then it is interpreted as the bucket. The region is determined
    using the algorithm in Appendix E.
  + _Other_: It is possible to use other non-Amazon cloud storage, but
    that is cloud library dependent.
- Query: currently not used.
- Fragment: the fragment is of the form _key=value&key=value&..._.
  Depending on the key, the _value_ part may be left out and some
  default value will be used.

## Client Parameters

The fragment part of a URL is used to specify information that is interpreted to specify what data format is to be used, as well as additional controls for that data format.
For NCZarr support, the following _key=value_ pairs are allowed.

- mode=nczarr|zarr|noxarray|file|zip|s3

Typically one will specify two mode flags: one to indicate what format
to use and one to specify the way the dataset is to be stored.
For example, a common one is "mode=zarr,file".

Using _mode=nczarr_ causes the URL to be interpreted as a
reference to a dataset that is stored in NCZarr format.
The _zarr_ mode tells the library to
use NCZarr, but to restrict its operation to operate on pure
Zarr Version 2 datasets.

The modes _s3_, _file_, and _zip_ tell the library what storage
driver to use.
* The _s3_ driver is the default and indicates using Amazon S3 or some equivalent.
* The _file_ driver stores data in a directory tree.
* The _zip_ driver stores data in a local zip file.

Note that it should be the case that zipping a _file_
format directory tree will produce a file readable by the
_zip_ storage format, and vice-versa.

By default, the XArray convention is supported and used for
both NCZarr files and pure Zarr files. This
means that every variable in the root group whose named dimensions
are also in the root group will have an attribute called
*\_ARRAY\_DIMENSIONS* that stores those dimension names.
The _noxarray_ mode tells the library to disable the XArray support.

The netcdf-c library is capable of inferring additional mode flags based on the flags it finds. Currently we have the following inferences.
- _zarr_ => _nczarr_

So for example ````...#mode=zarr,zip```` is equivalent to ````...#mode=nczarr,zarr,zip````.
<!--
- log=&lt;output-stream&gt;: this control turns on logging output,
  which is useful for debugging and testing.
  If just _log_ is used
  then it is equivalent to _log=stderr_.
-->

# NCZarr Map Implementation {#nczarr_mapimpl}

Internally, the nczarr implementation has a map abstraction that allows different storage formats to be used.
This is closely patterned on the same approach used in the Python Zarr implementation, which relies on the Python _MutableMap_ <a href="#ref_python">[5]</a> class.

In NCZarr, the corresponding type is called _zmap_.
The __zmap__ API essentially implements a simplified variant
of the Amazon S3 API.

As with Amazon S3, __keys__ are utf8 strings with a specific structure:
that of a path similar to those of a Unix path with '/' as the
separator for the segments of the path.

As with Unix, all keys have this BNF syntax:
````
key: '/' | keypath ;
keypath: '/' segment | keypath '/' segment ;
segment: <sequence of UTF-8 characters except control characters and '/'>
````
Obviously, one can infer a tree structure from this key structure.
A containment relationship is defined by key prefixes.
Thus one key is "contained" (possibly transitively)
by another if one key is a prefix (in the string sense) of the other.
So in this sense the key "/x/y/z" is contained by the key "/x/y".

In this model all keys "exist" but only some keys refer to
objects containing content -- aka _content bearing_.
An important restriction is placed on the structure of the tree,
namely that keys are only defined for content-bearing objects.
Further, all the leaves of the tree are these content-bearing objects.
This means that the key for one content-bearing object should not
be a prefix of any other key.

There are several other concepts of note.
1. __Dataset__ - a dataset is the complete tree contained by the key defining
the root of the dataset.
Technically, the root of the tree is the key <dataset>/.zgroup, where .zgroup can be considered the _superblock_ of the dataset.
2. __Object__ - equivalent of the S3 object; each object has a unique key
and "contains" data in the form of an arbitrary sequence of 8-bit bytes.

The zmap API defined here isolates the key-value pair mapping
code from the Zarr-based implementation of NetCDF-4.
It wraps an internal C dispatch table that implements the zmap key/object model
as an abstract data structure.
Of special note is the "search" function of the API.

__Search__: The search function has two purposes:
1. Support reading of pure zarr datasets (because they do not explicitly track their contents).
2. Debugging to allow raw examination of the storage. See zdump for example.

The search function takes a prefix path which has a key syntax (see above).
The set of legal keys is the set of keys such that the key references a content-bearing object -- e.g. /x/y/.zarray or /.zgroup.
Essentially this is the set of keys pointing to the leaf objects of the tree of keys constituting a dataset.
This set potentially limits the set of keys that need to be examined during search.

The search function returns a limited set of names, where the set of names are immediate suffixes of a given prefix path.
That is, if _<prefix>_ is the prefix path, then search returns all _<name>_ such that _<prefix>/<name>_ is itself a prefix of a "legal" key.
This can be used to implement glob style searches such as "/x/y/*" or "/x/y/**".

This semantics was chosen because it appears to be the minimum required to implement all other kinds of search using recursion.
It was also chosen to limit the number of names returned from the search.
Specifically
1. Avoid returning keys that are not a prefix of some legal key.
2. Avoid returning all the legal keys in the dataset because that set may be very large; although the implementation may still have to examine all legal keys to get the desired subset.
3. Allow for use of partial read mechanisms such as iterators, if available.
This can support processing a limited set of keys for each iteration.
This is a straightforward tradeoff of space over time.
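
As an illustration of the recursion point above, here is a sketch of a deep ("/x/y/**"-style) traversal built from a one-level search primitive. The names used here (search_one_level, walk) are hypothetical and do not correspond to the actual internal netcdf-c zmap API.

```c
#include <stdio.h>
#include <string.h>

/* Illustrative sketch only: a deep traversal implemented by recursion over
   the one-level search primitive described above. search_one_level is a
   hypothetical stand-in for that primitive. */
typedef int (*search_one_level)(void *map, const char *prefix,
                                char names[][256], int maxnames);

static void walk(void *map, const char *prefix, search_one_level search) {
    char names[64][256];
    int i, n = search(map, prefix, names, 64); /* immediate suffixes of prefix */
    for (i = 0; i < n; i++) {
        char child[1024];
        snprintf(child, sizeof(child), "%s/%s",
                 strcmp(prefix, "/") ? prefix : "", names[i]);
        printf("%s\n", child);    /* process this key prefix */
        walk(map, child, search); /* recurse one level deeper */
    }
}
```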

As a side note, S3 supports this kind of search using common prefixes with a delimiter of '/', although its use is a bit tricky.
For the file system zmap implementation, the legal search keys can be obtained one level at a time, which directly implements the search semantics.
For the zip file implementation, this semantics is not possible, so the whole
tree must be obtained and searched.

__Issues:__

1. S3 limits key lengths to 1024 bytes.
Some deeply nested netcdf files will almost certainly exceed this limit.
2. Besides content, S3 objects can have an associated small set
of what may be called tags, which are themselves of the form of
key-value pairs, but where the key and value are always text.
As far as it is possible to determine, Zarr never uses these tags,
so they are not included in the zmap data structure.

__A Note on Error Codes:__

The zmap API returns some distinguished error codes:
1. NC_NOERR if an operation succeeded
2. NC_EEMPTY is returned when accessing a key that has no content.
3. NC_EOBJECT is returned when an object is found which should not exist
4. NC_ENOOBJECT is returned when an object is not found which should exist

This does not preclude other errors being returned such as NC_EACCESS or NC_EPERM or NC_EINVAL if there are permission errors or illegal function arguments, for example.
It also does not preclude the use of other error codes internal to the zmap implementation.
So zmap_file, for example, uses NC_ENOTFOUND internally because it is possible to detect the existence of directories and files.
But this does not propagate outside the zmap_file implementation.

## Zmap Implementations

The primary zmap implementation is _s3_ (i.e. _mode=nczarr,s3_) and indicates that the Amazon S3 cloud storage -- or some related appliance -- is to be used.
Another storage format uses a file system tree of directories and files (_mode=nczarr,file_).
A third storage format uses a zip file (_mode=nczarr,zip_).
The latter two are used mostly for debugging and testing.
However, the _file_ and _zip_ formats are important because they are intended to match corresponding storage formats used by the Python Zarr implementation.
Hence they should serve to provide interoperability between NCZarr and the Python Zarr, although this interoperability has not been tested.

Examples of the typical URL form for _file_ and _zip_ are as follows.
````
file:///xxx/yyy/testdata.file#mode=nczarr,file
file:///xxx/yyy/testdata.zip#mode=nczarr,zip
````

Note that the extension (e.g. ".file" in "testdata.file")
is arbitrary, so this would be equally acceptable.
````
file:///xxx/yyy/testdata.anyext#mode=nczarr,file
````
As with other URLs (e.g. DAP), these kinds of URLs can be passed as the path argument to, for example, __ncdump__.

# NCZarr versus Pure Zarr. {#nczarr_purezarr}

The NCZARR format extends the pure Zarr format by adding extra keys such as ''\_NCZARR\_ARRAY'' inside the ''.zarray'' object.
It is possible to suppress the use of these extensions so that the netcdf library can read and write a pure zarr formatted file.
This is controlled by using ''mode=zarr'', which is an alias for the
''mode=nczarr,zarr'' combination.
The primary effects of using pure zarr are described in the [Translation Section](@ref nczarr_translation).

There are some constraints on the reading of Zarr datasets using the NCZarr implementation.

1. Zarr allows some primitive types not recognized by NCZarr.
Over time, the set of unrecognized types is expected to diminish.
Examples of currently unsupported types are as follows:
    * "c" -- complex floating point
    * "m" -- timedelta
    * "M" -- datetime
2. The Zarr dataset may reference filters and compressors unrecognized by NCZarr.
3. The Zarr dataset may store data in column-major order instead of row-major order. The effect of encountering such a dataset is to output the data in the wrong order.

Again, this list should diminish over time.

# Notes on Debugging NCZarr Access {#nczarr_debug}

The NCZarr support has a trace facility.
Enabling this can sometimes give important, but voluminous information.
Tracing can be enabled by setting the environment variable NCTRACING=n,
where _n_ indicates the level of tracing.
A good value of _n_ is 9.

# Zip File Support {#nczarr_zip}

In order to use the _zip_ storage format, the libzip [3] library must be installed.
Note that this is different from zlib.

# Amazon S3 Storage {#nczarr_s3}

The Amazon AWS S3 storage driver currently uses the Amazon AWS S3 Software Development Kit for C++ (aws-s3-sdk-cpp).
In order to use it, the client must provide some configuration information.
Specifically, the ''~/.aws/config'' file should contain something like this.

```
[default]
output = json
aws_access_key_id=XXXX...
aws_secret_access_key=YYYY...
```
See Appendix E for additional information.

## Addressing Style

The notion of "addressing style" may need some expansion.
Amazon S3 accepts two forms for specifying the endpoint for accessing the data.

1. Virtual -- the virtual addressing style places the bucket in the host part of a URL.
For example:
```
https://<bucketname>.s3.&lt;region&gt;.amazonaws.com/
```
2. Path -- the path addressing style places the bucket at the front of the path part of a URL.
For example:
```
https://s3.&lt;region&gt;.amazonaws.com/<bucketname>/
```

The NCZarr code will accept either form, although internally, it is standardized on path style.
The reason for this is that the bucket name forms the initial segment in the keys.

# Zarr vs NCZarr {#nczarr_vs_zarr}

## Data Model

The NCZarr storage format is almost identical to that of the standard Zarr version 2 format.
The data model differs as follows.

1. Zarr only supports anonymous dimensions -- NCZarr supports only shared (named) dimensions.
2. Zarr attributes are untyped -- or perhaps more correctly characterized as of type string.

## Storage Format

Consider both NCZarr and Zarr, and assume S3 notions of bucket and object.
In both systems, Groups and Variables (Array in Zarr) map to S3 objects.
Containment is modeled using the fact that the dataset's key is a prefix of the variable's key.
So for example, if variable _v1_ is contained in top level group g1 -- _/g1_ -- then the key for _v1_ is _/g1/v1_.
Additional meta-data information is stored in special objects whose names start with ".z".

In Zarr, the following special objects exist.

1. Information about a group is kept in a special object named _.zgroup_;
so for example the object _/g1/.zgroup_.
2. Information about an array is kept as a special object named _.zarray_;
so for example the object _/g1/v1/.zarray_.
3. Group-level attributes and variable-level attributes are stored in a special object named _.zattr_;
so for example the objects _/g1/.zattr_ and _/g1/v1/.zattr_.
4. Chunk data is stored in objects named "<n1>.<n2>. ... .<nr>" where the ni are positive integers representing the chunk index for the ith dimension.

The first three contain meta-data objects in the form of a string representing a JSON-formatted dictionary.
The NCZarr format uses the same objects as Zarr, but inserts NCZarr
specific key-value pairs in them to hold NCZarr specific information.
The value of each of these keys is a JSON dictionary containing a variety
of NCZarr specific information.

These keys are as follows:

_\_nczarr_superblock\__ -- this is in the top level group -- key _/.zgroup_.
It is in effect the "superblock" for the dataset and contains
any netcdf specific dataset level information.
It is also used to verify that a given key is the root of a dataset.
Currently it contains the following key(s):
* "version" -- the NCZarr version defining the format of the dataset.

_\_nczarr_group\__ -- this key appears in every _.zgroup_ object.
It contains any netcdf specific group information.
Specifically it contains the following keys:
* "dims" -- the name and size of shared dimensions defined in this group.
* "vars" -- the name of variables defined in this group.
* "groups" -- the name of sub-groups defined in this group.
These lists allow walking the NCZarr dataset without having to use the potentially costly search operation.

_\_nczarr_array\__ -- this key appears in every _.zarray_ object.
It contains netcdf specific array information.
Specifically it contains the following keys:
* dimrefs -- the names of the shared dimensions referenced by the variable.
* storage -- indicates if the variable is chunked vs contiguous in the netcdf sense.

_\_nczarr_attr\__ -- this key appears in every _.zattr_ object.
This means that technically, it is an attribute, but one for which access
is normally suppressed.
Specifically it contains the following keys:
* types -- the types of all of the other attributes in the _.zattr_ object.

## Translation {#nczarr_translation}

With some constraints, it is possible for an nczarr library to read Zarr and for a zarr library to read the nczarr format.
The latter case, zarr reading nczarr, is possible if the zarr library is willing to ignore keys whose name it does not recognize; specifically anything beginning with _\_NCZARR\__.

The former case, nczarr reading zarr, is also possible if the nczarr can simulate or infer the contents of the missing _\_NCZARR\_XXX_ objects.
As a rule this can be done as follows.
1. _\_nczarr_group\__ -- The list of contained variables and sub-groups can be computed using the search API to list the keys "contained" in the key for a group.
The search looks for occurrences of _.zgroup_, _.zattr_, _.zarray_ to infer the keys for the contained groups, attribute sets, and arrays (variables).
Constructing the set of "shared dimensions" is carried out
by walking all the variables in the whole dataset and collecting
the set of unique integer shapes for the variables.
For each such dimension length, a top level dimension is created
named ".zdim_<len>" where len is the integer length.
2. _\_nczarr_array\__ -- The dimrefs are inferred by using the shape
in _.zarray_ and creating references to the simulated shared dimensions.
3. _\_nczarr_attr\__ -- The type of each attribute is inferred by trying to parse the first attribute value string.

# Compatibility {#nczarr_compatibility}

In order to accommodate existing implementations, certain mode tags are provided to tell the NCZarr code to look for information used by specific implementations.

## XArray

The XArray <a href="#ref_xarray">[7]</a> Zarr implementation uses its own mechanism for specifying shared dimensions.
It uses a special attribute named ''_ARRAY_DIMENSIONS''.
The value of this attribute is a list of dimension names (strings).
An example might be ````["time", "lon", "lat"]````.
It is essentially equivalent to the ````_nczarr_array "dimrefs" list````, except that the latter uses fully qualified names so the referenced dimensions can be anywhere in the dataset.

As of _netcdf-c_ version 4.8.2, the XArray ''_ARRAY_DIMENSIONS'' attribute is supported for both NCZarr and pure Zarr.
If possible, this attribute will be read/written by default,
but can be suppressed if the mode value "noxarray" is specified.
If detected, then these dimension names are used to define shared dimensions.
The following conditions will cause ''_ARRAY_DIMENSIONS'' to not be written.
* The variable is not in the root group.
* Any dimension referenced by the variable is not in the root group.
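
A minimal sketch of a case where the attribute would be written, assuming a hypothetical local _file_ dataset written in pure zarr mode:

```c
#include <netcdf.h>

/* Minimal sketch: define a root-group variable whose dimensions are also in
   the root group, so that (unless "noxarray" is specified) the XArray
   _ARRAY_DIMENSIONS attribute would be written. The URL is hypothetical. */
int main(void) {
    int ncid, timeid, lonid, dimids[2], varid;

    if (nc_create("file:///tmp/xarray.file#mode=zarr,file", NC_NETCDF4|NC_CLOBBER, &ncid)) return 1;
    if (nc_def_dim(ncid, "time", 4, &timeid)) return 1;
    if (nc_def_dim(ncid, "lon", 8, &lonid)) return 1;
    dimids[0] = timeid;
    dimids[1] = lonid;
    if (nc_def_var(ncid, "t2m", NC_FLOAT, 2, dimids, &varid)) return 1;
    return nc_close(ncid);
}
```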

# Examples {#nczarr_examples}

Here are a couple of examples using the _ncgen_ and _ncdump_ utilities.

1. Create an nczarr file using a local directory tree as storage.
 ```
 ncgen -4 -lb -o "file:///home/user/dataset.file#mode=nczarr,file" dataset.cdl
 ```
2. Display the content of an nczarr file using a zip file as storage.
 ```
 ncdump "file:///home/user/dataset.zip#mode=nczarr,zip"
 ```
3. Create an nczarr file using S3 as storage.
 ```
 ncgen -4 -lb -o "s3://s3.us-west-1.amazonaws.com/datasetbucket" dataset.cdl
 ```
4. Create an nczarr file using S3 as storage and keeping to the pure zarr format.
 ```
 ncgen -4 -lb -o "s3://s3.us-west-1.amazonaws.com/datasetbucket#mode=zarr" dataset.cdl
 ```
5. Create an nczarr file using the s3 protocol with a specific profile.
 ```
 ncgen -4 -lb -o "s3://datasetbucket/rootkey#mode=nczarr,awsprofile=unidata" dataset.cdl
 ```
 Note that the URL is internally translated to this.
 ````
 https://s3.&lt;region&gt;.amazonaws.com/datasetbucket/rootkey#mode=nczarr,awsprofile=unidata
 ````
 The region is chosen using the algorithm described in Appendix E.

# References {#nczarr_bib}

<a name="ref_aws">[1]</a> [Amazon Simple Storage Service Documentation](https://docs.aws.amazon.com/s3/index.html)<br>
<a name="ref_awssdk">[2]</a> [Amazon Simple Storage Service Library](https://github.com/aws/aws-sdk-cpp)<br>
<a name="ref_libzip">[3]</a> [The LibZip Library](https://libzip.org/)<br>
<a name="ref_nczarr">[4]</a> [NetCDF ZARR Data Model Specification](https://www.unidata.ucar.edu/blogs/developer/en/entry/netcdf-zarr-data-model-specification)<br>
<a name="ref_python">[5]</a> [Python Documentation: 8.3. collections — High-performance container datatypes](https://docs.python.org/2/library/collections.html)<br>
<a name="ref_zarrv2">[6]</a> [Zarr Version 2 Specification](https://zarr.readthedocs.io/en/stable/spec/v2.html)<br>
<a name="ref_xarray">[7]</a> [XArray Zarr Encoding Specification](http://xarray.pydata.org/en/latest/internals.html#zarr-encoding-specification)<br>
<a name="dynamic_filter_loading">[8]</a> [Dynamic Filter Loading](https://support.hdfgroup.org/HDF5/doc/Advanced/DynamicallyLoadedFilters/HDF5DynamicallyLoadedFilters.pdf)<br>
<a name="official_hdf5_filters">[9]</a> [Officially Registered Custom HDF5 Filters](https://portal.hdfgroup.org/display/support/Registered+Filter+Plugins)<br>
<a name="blosc-c-impl">[10]</a> [C-Blosc Compressor Implementation](https://github.com/Blosc/c-blosc)<br>
<a name="ref_awssdk_conda">[11]</a> [Conda-forge / packages / aws-sdk-cpp](https://anaconda.org/conda-forge/aws-sdk-cpp)<br>
<a name="ref_gdal">[12]</a> [GDAL Zarr](https://gdal.org/drivers/raster/zarr.html)<br>

# Appendix A. Building NCZarr Support {#nczarr_build}

Currently the following build cases are known to work.

<table>
<tr><td><u>Operating System</u><td><u>Build System</u><td><u>NCZarr</u><td><u>S3 Support</u>
<tr><td>Linux <td> Automake <td> yes <td> yes
<tr><td>Linux <td> CMake <td> yes <td> yes
<tr><td>Cygwin <td> Automake <td> yes <td> no
<tr><td>OSX <td> Automake <td> unknown <td> unknown
<tr><td>OSX <td> CMake <td> unknown <td> unknown
<tr><td>Visual Studio <td> CMake <td> yes <td> tests fail
</table>

Note: S3 support includes both compiling the S3 support code as well as running the S3 tests.

## Automake

There are several options relevant to NCZarr support and to Amazon S3 support.
These are as follows.

1. _--disable-nczarr_ -- disable the NCZarr support.
If disabled, then all of the following options are disabled or irrelevant.
2. _--enable-nczarr-s3_ -- Enable NCZarr S3 support.
3. _--enable-nczarr-s3-tests_ -- the NCZarr S3 tests are currently only usable by Unidata personnel, so they are disabled by default.

__A note about using S3 with Automake.__
If S3 support is desired, and using Automake, then LDFLAGS must be properly set, namely to this.
````
LDFLAGS="$LDFLAGS -L/usr/local/lib -laws-cpp-sdk-s3"
````
The above assumes that these libraries were installed in '/usr/local/lib', so the above requires modification if they were installed elsewhere.

Note also that if S3 support is enabled, then you need to have a C++ compiler installed because part of the S3 support code is written in C++.

## CMake {#nczarr_cmake}

The necessary CMake flags are as follows (with defaults).

1. -DENABLE_NCZARR=off -- equivalent to the Automake _--disable-nczarr_ option.
2. -DENABLE_NCZARR_S3=off -- equivalent to the Automake _--enable-nczarr-s3_ option.
3. -DENABLE_NCZARR_S3_TESTS=off -- equivalent to the Automake _--enable-nczarr-s3-tests_ option.

Note that unlike Automake, CMake can properly locate C++ libraries, so it should not be necessary to specify _-laws-cpp-sdk-s3_ assuming that the aws s3 libraries are installed in the default location.
For CMake with Visual Studio, the default location is here:

````
C:/Program Files (x86)/aws-cpp-sdk-all
````

It is possible to install the sdk library in another location.
In this case, one must add the following flag to the cmake command.
````
cmake ... -DAWSSDK_DIR=<awssdkdir>
````
where "awssdkdir" is the path to the sdk installation.
For example, this might be as follows.
````
cmake ... -DAWSSDK_DIR="c:\tools\aws-cpp-sdk-all"
````
This can be useful if blanks in path names cause problems in your build environment.

## Testing S3 Support {#nczarr_testing_S3_support}

The relevant tests for S3 support are in the _nczarr_test_ directory.
Currently, by default, testing of S3 with NCZarr is supported only for Unidata members of the NetCDF Development Group.
This is because it uses a Unidata-specific bucket that is inaccessible to the general user.

# Appendix B. Building aws-sdk-cpp {#nczarr_s3sdk}

In order to use the S3 storage driver, it is necessary to install the Amazon [aws-sdk-cpp library](https://github.com/aws/aws-sdk-cpp.git).

Building this package from scratch has proven to be a formidable task.
This appears to be due to dependencies on very specific versions of,
for example, openssl.

## *\*nix\** Build

For Linux, the following context works. Of course your mileage may vary.
* OS: Ubuntu 21
* aws-sdk-cpp version 1.9.96 (or later?)
* Required installed libraries: openssl, libcurl, cmake, ninja (ninja-build in apt)

### AWS-SDK-CPP Build Recipe

````
git clone --recurse-submodules https://www.github.com/aws/aws-sdk-cpp
pushd aws-sdk-cpp
mkdir build
cd build
PREFIX=/usr/local
CFG=Release   # build type referenced by $CFG below
FLAGS="-DCMAKE_INSTALL_PREFIX=${PREFIX} \
 -DCMAKE_INSTALL_LIBDIR=lib \
 -DCMAKE_MODULE_PATH=${PREFIX}/lib/cmake \
 -DCMAKE_POLICY_DEFAULT_CMP0075=NEW \
 -DBUILD_ONLY=s3 \
 -DENABLE_UNITY_BUILD=ON \
 -DENABLE_TESTING=OFF \
 -DCMAKE_BUILD_TYPE=$CFG \
 -DSIMPLE_INSTALL=ON"
cmake -GNinja $FLAGS ..
ninja all
ninja install
cd ..
popd
````

### NetCDF Build

In order to build netcdf-c with S3 sdk support,
the following options must be specified for ./configure.
````
--enable-nczarr-s3
````
If you have access to the Unidata bucket on Amazon, then you can
also test S3 support with this option.
````
--enable-nczarr-s3-tests
````

## Windows build
It is possible to build and install aws-sdk-cpp. It is also possible
to build netcdf-c using cmake. Unfortunately, testing currently fails.

For Windows, the following context works. Of course your mileage may vary.
* OS: Windows 10 64-bit with Visual Studio community edition 2019.
* aws-sdk-cpp version 1.9.96 (or later?)
* Required installed libraries: openssl, libcurl, cmake

### AWS-SDK-CPP Build Recipe

This command-line build assumes one is using Cygwin or Mingw to provide
tools such as bash.

````
git clone --recurse-submodules https://www.github.com/aws/aws-sdk-cpp
pushd aws-sdk-cpp
mkdir build
cd build
CFG="Release"
PREFIX="c:/tools/aws-sdk-cpp"

FLAGS="-DCMAKE_INSTALL_PREFIX=${PREFIX} \
 -DCMAKE_INSTALL_LIBDIR=lib \
 -DCMAKE_MODULE_PATH=${PREFIX}/cmake \
 -DCMAKE_POLICY_DEFAULT_CMP0075=NEW \
 -DBUILD_ONLY=s3 \
 -DENABLE_UNITY_BUILD=ON \
 -DCMAKE_BUILD_TYPE=$CFG \
 -DSIMPLE_INSTALL=ON"

cmake -DCMAKE_BUILD_TYPE=${CFG} $FLAGS ..
cmake --build . --config ${CFG}
cmake --install . --config ${CFG}
cd ..
popd
````
Notice that the sdk is being installed in the directory "c:\tools\aws-sdk-cpp"
rather than the default location "c:\Program Files (x86)/aws-cpp-sdk-all".
This is because when using a command line, an install path that contains
blanks may not work.

### NetCDF CMake Build

Enabling S3 support is controlled by these two cmake options:
````
-DENABLE_NCZARR_S3=ON
-DENABLE_NCZARR_S3_TESTS=OFF
````

However, to find the aws sdk libraries,
the following environment variables must be set:
````
AWSSDK_ROOT_DIR="c:/tools/aws-sdk-cpp"
AWSSDKBIN="/cygdrive/c/tools/aws-sdk-cpp/bin"
PATH="$PATH:${AWSSDKBIN}"
````
Then the following options must be specified for cmake.
````
-DAWSSDK_ROOT_DIR=${AWSSDK_ROOT_DIR}
-DAWSSDK_DIR=${AWSSDK_ROOT_DIR}/lib/cmake/AWSSDK
````

# Appendix C. Amazon S3 Imposed Limits {#nczarr_s3limits}

The Amazon S3 cloud storage imposes some significant limits that are inherited by NCZarr (and Zarr also, for that matter).

Some of the relevant limits are as follows:
1. The maximum object size is 5 Gigabytes with a total for all objects limited to 5 Terabytes.
2. S3 key names can be any UNICODE name with a maximum length of 1024 bytes.
Note that the limit is defined in terms of bytes and not (Unicode) characters.
This affects the depth to which groups can be nested because the key encodes the full path name of a group.

# Appendix D. Alternative Mechanisms for Accessing Remote Datasets {#nczarr_altremote}

The NetCDF-C library contains an alternate mechanism for accessing traditional netcdf-4 files stored in Amazon S3: the byte-range mechanism.
The idea is to treat the remote data as if it were a big file.
This remote "file" can be randomly accessed using the HTTP Byte-Range header.

In the Amazon S3 context, a copy of a dataset, a netcdf-3 or netcdf-4 file, is uploaded into a single object in some bucket.
Then using the key to this object, it is possible to tell the netcdf-c library to treat the object as a remote file and to use the HTTP Byte-Range protocol to access the contents of the object.
The dataset object is referenced using a URL with the trailing fragment containing the string ````#mode=bytes````.
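
A minimal sketch of byte-range access through the C API, using the same public object referenced by the test mentioned below:

```c
#include <netcdf.h>

/* Minimal sketch: open a traditional netCDF file stored as a single S3 object
   using byte-range access. The URL is the public object used by the
   nc_test/test_byterange.sh test described below. */
int main(void) {
    int ncid;
    const char *url =
        "https://s3.us-east-1.amazonaws.com/noaa-goes16/ABI-L1b-RadC/2017/059/03/"
        "OR_ABI-L1b-RadC-M3C13_G16_s20170590337505_e20170590340289_c20170590340316.nc#mode=bytes";

    if (nc_open(url, NC_NOWRITE, &ncid)) return 1;
    /* ... read metadata and data with the usual netCDF API ... */
    return nc_close(ncid);
}
```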

An examination of the test program _nc_test/test_byterange.sh_ shows simple examples using the _ncdump_ program.
One such test is specified as follows:
````
https://s3.us-east-1.amazonaws.com/noaa-goes16/ABI-L1b-RadC/2017/059/03/OR_ABI-L1b-RadC-M3C13_G16_s20170590337505_e20170590340289_c20170590340316.nc#mode=bytes
````
Note that for S3 access, it is expected that the URL is in what is called "path" format where the bucket, _noaa-goes16_ in this case, is part of the URL path instead of the host.

The _#mode=bytes_ mechanism generalizes to work with most servers that support byte-range access.

Specifically, THREDDS servers support such access using the HttpServer access method as can be seen from this URL taken from the above test program.
````
https://thredds-test.unidata.ucar.edu/thredds/fileServer/irma/metar/files/METAR_20170910_0000.nc#bytes
````

# Appendix E. AWS Selection Algorithms. {#nczarr_awsselect}

If byterange support is enabled, the netcdf-c library will parse the files
````
${HOME}/.aws/config
and
${HOME}/.aws/credentials
````
to extract profile names plus a list
of key=value pairs. This example is typical.
````
[default]
 aws_access_key_id=XXXXXXXXXXXXXXXXXXXX
 aws_secret_access_key=YYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYY
 aws_region=ZZZZZZZZZ
````
The keys in the profile will be used to set various parameters in the library.

## Profile Selection

The algorithm for choosing the active profile to use is as follows:

1. If the "aws.profile" fragment flag is defined in the URL, then it is used. For example, see this URL.
````
https://...#mode=nczarr,s3&aws.profile=xxx
````
2. If the "AWS.PROFILE" entry in the .rc file (i.e. .netrc or .dodsrc) is set, then it is used.
3. Otherwise the profile "default" is used.

The profile named "none" is a special profile that the netcdf-c library automatically defines.
It should not be defined anywhere else. It signals to the library that no credentials are to be used.
It is equivalent to the "--no-sign-request" option in the AWS CLI.
Also, it must be explicitly specified by name. Otherwise "default" will be used.

## Region Selection

If the specified URL is of the form
````
s3://<bucket>/key
````
then this is rebuilt to this form:
````
https://s3.<region>.amazonaws.com/<bucket>/key
````
However this requires figuring out the region to use.
The algorithm for picking a region is as follows.

1. If the "aws.region" fragment flag is defined in the URL, then it is used.
2. The active profile is searched for the "aws_region" key.
3. If the "AWS.REGION" entry in the .rc file (i.e. .netrc or .dodsrc) is set, then it is used.
4. Otherwise use "us-east-1" region.

## Authorization Selection

Picking an access-key/secret-key pair is always determined
by the current active profile. To choose to not use keys
requires that the active profile must be "none".

# Appendix F. NCZarr Version 1 Meta-Data Representation. {#nczarr_version1}

In NCZarr Version 1, the NCZarr specific metadata was represented using new objects rather than as keys in existing Zarr objects.
Due to conflicts with the Zarr specification, that format is deprecated in favor of the one described above.
However the netcdf-c NCZarr support can still read the version 1 format.

The version 1 format defines three specific objects: _.nczgroup_, _.nczarray_, _.nczattr_.
These are stored in parallel with the corresponding Zarr objects. So if there is a key of the form "/x/y/.zarray", then there is also a key "/x/y/.nczarray".
The content of these objects is the same as the contents of the corresponding keys. So the value of the ''_NCZARR_ARRAY'' key is the same as the content of the ''.nczarray'' object. The list of connections is as follows:

* ''.nczarr'' <=> ''_nczarr_superblock_''
* ''.nczgroup'' <=> ''_nczarr_group_''
* ''.nczarray'' <=> ''_nczarr_array_''
* ''.nczattr'' <=> ''_nczarr_attr_''

# Appendix G. JSON Attribute Convention. {#nczarr_json}

The Zarr V2 specification is somewhat vague on what is a legal
value for an attribute. The examples all show one of two cases:
1. A simple JSON scalar atomic value (e.g. int, float, char, etc), or
2. A JSON array of such values.

However, the Zarr specification can be read to infer that the value
can in fact be any legal JSON expression.
This "convention" is currently used routinely to help support various
attributes created by other packages where the attribute is a
complex JSON expression. An example is the GDAL Driver
convention <a href="#ref_gdal">[12]</a>, where the value is a complex
JSON dictionary.

In order for NCZarr to be as consistent as possible with Zarr Version 2,
it is desirable to support this convention for attribute values.
This means that there must be some way to handle an attribute
whose value is not either of the two cases above. That is, its value
is some more complex JSON expression. Ideally both reading and writing
of such attributes should be supported.

One more point. NCZarr attempts to record the associated netcdf
attribute type (encoded in the form of a NumPy "dtype") for each
attribute. This information is stored as NCZarr-specific
metadata. Note that pure Zarr makes no attempt to record such
type information.

The current algorithm to support JSON valued attributes
operates as follows.

## Writing an attribute:
There are multiple cases to consider (a sketch of case 3 appears after this list).

1. The netcdf attribute **is not** of type NC_CHAR and its value is a single atomic value.
    * Convert to an equivalent JSON atomic value and write that JSON expression.
    * Compute the Zarr equivalent dtype and store in the NCZarr metadata.

2. The netcdf attribute **is not** of type NC_CHAR and its value is a vector of atomic values.
    * Convert to an equivalent JSON array of atomic values and write that JSON expression.
    * Compute the Zarr equivalent dtype and store in the NCZarr metadata.

3. The netcdf attribute **is** of type NC_CHAR and its value &ndash; taken as a single sequence of characters &ndash;
**is** parseable as a legal JSON expression.
    * Parse to produce a JSON expression and write that expression.
    * Use "|U1" as the dtype and store in the NCZarr metadata.

4. The netcdf attribute **is** of type NC_CHAR and its value &ndash; taken as a single sequence of characters &ndash;
**is not** parseable as a legal JSON expression.
    * Convert to a JSON string and write that expression.
    * Use "|U1" as the dtype and store in the NCZarr metadata.
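
A minimal sketch of case 3, assuming a hypothetical dataset URL and attribute name:

```c
#include <string.h>
#include <netcdf.h>

/* Minimal sketch of case 3 above: store a complex JSON expression as an
   NC_CHAR attribute so NCZarr can write it out as JSON.
   The file URL and attribute name are hypothetical. */
int main(void) {
    int ncid;
    const char *json = "{\"resolution\": 0.25, \"units\": \"degrees\"}";

    if (nc_create("file:///tmp/jsonattr.file#mode=nczarr,file", NC_NETCDF4|NC_CLOBBER, &ncid)) return 1;
    /* The character string parses as JSON, so it would be written as a JSON
       dictionary rather than as a quoted string. */
    if (nc_put_att_text(ncid, NC_GLOBAL, "driver_metadata", strlen(json), json)) return 1;
    return nc_close(ncid);
}
```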

## Reading an attribute:

The process of reading and interpreting an attribute value requires two
pieces of information.
* The value of the attribute as a JSON expression, and
* The optional associated dtype of the attribute; note that this may not exist
if, for example, the file is pure zarr.

Given these two pieces of information, the read process is as follows.

1. The JSON expression is a simple JSON atomic value.
    * If the dtype is defined, then convert the JSON to that type of data,
and then store it as the equivalent netcdf vector of size one.
    * If the dtype is not defined, then infer the dtype based on the JSON value,
and then store it as the equivalent netcdf vector of size one.

2. The JSON expression is an array of simple JSON atomic values.
    * If the dtype is defined, then convert each JSON value in the array to that type of data,
and then store it as the equivalent netcdf vector.
    * If the dtype is not defined, then infer the dtype based on the first JSON value in the array,
and then store it as the equivalent netcdf vector.

3. The JSON expression is an array some of whose values are dictionaries or (sub-)arrays.
    * Un-parse the expression to an equivalent sequence of characters, and then store it as of type NC_CHAR.

4. The JSON expression is a dictionary.
    * Un-parse the expression to an equivalent sequence of characters, and then store it as of type NC_CHAR.

## Notes

1. If a character valued attribute's value can be parsed as a legal JSON expression, then it will be stored as such.
2. Reading and writing are *almost* idempotent in that the sequence of
actions "read-write-read" is equivalent to a single "read" and "write-read-write" is equivalent to a single "write".
The "almost" caveat is necessary because (1) whitespace may be added or lost during the sequence of operations,
and (2) numeric precision may change.

# Appendix H. Support for string types

Zarr supports a string type, but it is restricted to
fixed size strings. NCZarr also supports such strings,
but there are some differences in order to interoperate
with the netcdf-4/HDF5 variable length strings.

The primary issue to be addressed is to provide a way for the user
to specify the maximum size of the fixed length strings. This is
handled by providing the following new attributes (see the sketch below):
1. **_nczarr_default_maxstrlen** &mdash;
This is an attribute of the root group. It specifies the default
maximum string length for string types. If not specified, then
it has the value of 128 characters.
2. **_nczarr_maxstrlen** &mdash;
This is a per-variable attribute. It specifies the maximum
string length for the string type associated with the variable.
If not specified, then it is assigned the value of
**_nczarr_default_maxstrlen**.
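
A minimal sketch, assuming these two attributes can be written through the ordinary attribute API as integer values; the dataset URL and variable name are hypothetical.

```c
#include <netcdf.h>

/* Minimal sketch, assuming the two attributes above are set through the
   ordinary attribute API as integers; the URL and variable name are
   hypothetical. */
int main(void) {
    int ncid, varid, deflt = 64, vmax = 16;

    if (nc_create("file:///tmp/strings.file#mode=nczarr,file", NC_NETCDF4|NC_CLOBBER, &ncid)) return 1;
    /* Default maximum string length for the whole dataset (root group). */
    if (nc_put_att_int(ncid, NC_GLOBAL, "_nczarr_default_maxstrlen", NC_INT, 1, &deflt)) return 1;

    if (nc_def_var(ncid, "names", NC_STRING, 0, NULL, &varid)) return 1;
    /* Per-variable maximum string length. */
    if (nc_put_att_int(ncid, varid, "_nczarr_maxstrlen", NC_INT, 1, &vmax)) return 1;
    return nc_close(ncid);
}
```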

Note that when accessing a string through the netCDF API, the
fixed length strings appear as variable length strings. This
means that they are stored as pointers to the string
(i.e. **char\***) and with a trailing nul character.
One consequence is that if the user writes a variable length
string through the netCDF API, and the length of that string
is greater than the maximum string length for a variable,
then the string is silently truncated.
Another consequence is that the user must reclaim the string storage.
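
A minimal sketch of reading such a string and reclaiming its storage, reusing the hypothetical dataset and variable from the previous sketch:

```c
#include <netcdf.h>

/* Minimal sketch: read a scalar NC_STRING variable and then reclaim the
   storage that the library allocated for it. The URL and variable name
   are hypothetical. */
int main(void) {
    int ncid, varid;
    char *value = NULL;

    if (nc_open("file:///tmp/strings.file#mode=nczarr,file", NC_NOWRITE, &ncid)) return 1;
    if (nc_inq_varid(ncid, "names", &varid)) return 1;
    if (nc_get_var_string(ncid, varid, &value)) return 1;
    /* ... use value ... */
    nc_free_string(1, &value);   /* reclaim the string storage */
    return nc_close(ncid);
}
```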

Adding strings also requires some hacking to handle the existing
netcdf-c NC_CHAR type, which does not exist in Zarr. The goal
was to choose NumPy types for both the netcdf-c NC_STRING type
and the netcdf-c NC_CHAR type such that if a pure zarr
implementation reads them, it will still work.

For writing variables and NCZarr attributes, the type mapping is as follows:
* ">S1" for NC_CHAR.
* "|S1" for NC_STRING && MAXSTRLEN==1
* "|Sn" for NC_STRING && MAXSTRLEN==n

Admittedly, this encoding is a bit of a hack.

So when reading data with a pure zarr implementation
the above types should always appear as strings,
and the type that signals NC_CHAR (in NCZarr)
would be handled by Zarr as a string of length 1.

# Change Log {#nczarr_changelog}

Note, this log was only started as of 8/11/2022 and is not
intended to be a detailed chronology. Rather, it provides highlights
that will be of interest to NCZarr users. In order to see exact changes,
it is necessary to use the 'git diff' command.

## 8/29/2022
1. Zarr fixed-size string types are now supported.

## 8/11/2022
1. The NCZarr specific keys have been converted to lower-case
(e.g. "_nczarr_attr" instead of "_NCZARR_ATTR"). Upper case is
accepted for backward compatibility.

2. The legal values of an attribute have been extended to
include arbitrary JSON expressions; see Appendix G for more details.

# Point of Contact {#nczarr_poc}

__Author__: Dennis Heimbigner<br>
__Email__: dmh at ucar dot edu<br>
__Initial Version__: 4/10/2020<br>
__Last Revised__: 8/27/2022