NetCDF  4.6.3
filters.md
1 NetCDF-4 Filter Support
2 ============================
3 <!-- double header is needed to workaround doxygen bug -->
4 
5 NetCDF-4 Filter Support {#filters}
6 ============================
7 
8 [TOC]
9 
10 # Introduction {#filters_intro}
11 
12 The HDF5 library (1.8.11 and later)
13 supports a general filter mechanism to apply various
14 kinds of filters to datasets before reading or writing.
15 The netCDF enhanced (aka netCDF-4) library inherits this
16 capability since it depends on the HDF5 library.
17 
18 Filters assume that a variable has chunking
19 defined and each chunk is filtered before
20 writing and "unfiltered" after reading and
21 before passing the data to the user.
22 
23 The most common kind of filter is a compression-decompression
24 filter, and that is the focus of this document.
25 
26 HDF5 supports dynamic loading of compression filters using the following
27 process for reading of compressed data.
28 
29 1. Assume that we have a dataset with one or more variables that
30 were compressed using some algorithm. How the dataset was compressed
31 will be discussed subsequently.
32 
33 2. Shared libraries or DLLs exist that implement the compress/decompress
34 algorithm. These libraries have a specific API so that the HDF5 library
35 can locate, load, and utilize the compressor.
36 These libraries are expected to be installed in a specific
37 directory.
38 
39 # Enabling A Compression Filter {#filters_enable}
40 
41 In order to compress a variable, the netcdf-c library
42 must be given three pieces of information:
43 (1) some unique identifier for the filter to be used,
44 (2) a vector of parameters for
45 controlling the action of the compression filter, and
46 (3) a shared library implementation of the filter.
47 
48 The meaning of the parameters is, of course,
49 completely filter dependent and the filter
50 description [3] needs to be consulted. For
51 bzip2, for example, a single parameter is provided
52 representing the compression level.
53 It is legal to provide a zero-length set of parameters.
54 Defaults are not provided, so this assumes that
55 the filter can operate with zero parameters.
56 
57 Filter ids are assigned by the HDF group. See [4]
58 for a current list of assigned filter ids.
59 Note that ids above 32767 can be used for testing without
60 registration.
61 
62 The first two pieces of information can be provided in one of three ways:
63 using __ncgen__, via an API call, or via command line parameters to __nccopy__.
64 In any case, remember that filtering also requires setting chunking, so the
65 variable must also be marked with chunking information.
66 
67 ## Using The API {#filters_API}
68 The necessary API methods are included in __netcdf.h__ by default.
69 One API method is for setting the filter to be used
70 when writing a variable. The relevant signature is
71 as follows.
72 ````
73 int nc_def_var_filter(int ncid, int varid, unsigned int id, size_t nparams, const unsigned int* parms);
74 ````
75 This must be invoked after the variable has been created and before
76 __nc_enddef__ is invoked.
77 
78 A second API method makes it possible to query a variable to
79 obtain information about any associated filter using this signature.
80 ````
81 int nc_inq_var_filter(int ncid, int varid, unsigned int* idp, size_t* nparams, unsigned int* params);
82 
83 ````
84 The filter id will be returned in the __idp__ argument (if non-NULL),
85 the number of parameters in __nparams__ and the actual parameters in
86 __params__. As is usual with the netcdf API, one is expected to call
87 this function twice: the first time to get __nparams__ and the
88 second to get the parameters in client-allocated memory.
89 
90 ## Using ncgen {#filters_NCGEN}
91 
92 In a CDL file, compression of a variable can be specified
93 by annotating it with the following attribute:
94 
95 * ''_Filter'' -- a string containing a comma separated list of
96 constants specifying (1) the filter id to apply, and (2)
97 a vector of constants representing the
98 parameters for controlling the operation of the specified filter.
99 See the section on the <a href="#filters_syntax">parameter encoding syntax</a>
100 for the details on the allowable kinds of constants.
101 
102 This is a "special" attribute, which means that
103 it will normally be invisible when using
104 __ncdump__ unless the -s flag is specified.
105 
106 ### Example CDL File (Data elided)
107 
108 ````
109 netcdf bzip2 {
110 dimensions:
111  dim0 = 4 ; dim1 = 4 ; dim2 = 4 ; dim3 = 4 ;
112 variables:
113  float var(dim0, dim1, dim2, dim3) ;
114  var:_Filter = "307,9" ;
115  var:_Storage = "chunked" ;
116  var:_ChunkSizes = 4, 4, 4, 4 ;
117 data:
118 ...
119 }
120 ````
121 
122 ## Using nccopy {#filters_NCCOPY}
123 
124 When copying a netcdf file using __nccopy__ it is possible
125 to specify filter information for any output variable by
126 using the "-F" option on the command line; for example:
127 ````
128 nccopy -F "var,307,9" unfiltered.nc filtered.nc
129 ````
130 Assume that __unfiltered.nc__ has a chunked but not bzip2 compressed
131 variable named "var". This command will create that variable in
132 the __filtered.nc__ output file but using the filter with id 307
133 (i.e. bzip2) and with parameter(s) 9 indicating the compression level.
134 See the section on the <a href="#filters_syntax">parameter encoding syntax</a>
135 for the details on the allowable kinds of constants.
136 
137 The "-F" option can be used repeatedly as long as the variable name
138 part is different. A different filter id and parameters can be
139 specified for each occurrence.
140 
141 It can be convenient to specify that the same compression is to be
142 applied to more than one variable. To support this, two additional
143 *-F* cases are defined.
144 
145 1. ````-F *,...```` means apply the filter to all variables in the dataset.
146 2. ````-F v1|v2|..,...```` means apply the filter to multiple variables.
147 
148 Note that the characters '*' and '|' are bash reserved characters,
149 so you will probably need to escape or quote the filter spec in
150 that environment.
151 
152 As a rule, any input filter on an input variable will be applied
153 to the equivalent output variable -- assuming the output file type
154 is netcdf-4. It is, however, sometimes convenient to suppress
155 output compression either totally or on a per-variable basis.
156 Total suppression of output filters can be accomplished by specifying
157 a special case of "-F", namely this.
158 ````
159 nccopy -F none input.nc output.nc
160 ````
161 The expression ````-F *,none```` is equivalent to ````-F none````.
162 
163 Suppression of output filtering for a specific set of variables
164 can be accomplished using these formats.
165 ````
166 nccopy -F "var,none" input.nc output.nc
167 nccopy -F "v1|v2|...,none" input.nc output.nc
168 ````
169 where "var" and the "vi" are fully qualified variable names.
170 
171 The rules for all possible cases of the "-F none" flag are defined
172 by this table.
173 
174 <table>
175 <tr><th>-F none<th>-Fvar,...<th>Input Filter<th>Applied Output Filter
176 <tr><td>true<td>unspecified<td>NA<td>unfiltered
177 <tr><td>true<td>-Fvar,none<td>NA<td>unfiltered
178 <tr><td>true<td>-Fvar,...<td>NA<td>use output filter
179 <tr><td>false<td>unspecified<td>defined<td>use input filter
180 <tr><td>false<td>-Fvar,none<td>NA<td>unfiltered
181 <tr><td>false<td>-Fvar,...<td>NA<td>use output filter
182 <tr><td>false<td>unspecified<td>none<td>unfiltered
183 </table>
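
The table can also be read as a small decision procedure. The following C sketch is one way to express it. The enum and function names are hypothetical and are not part of nccopy; the logic is taken only from the rows of the table above.

```c
#include <stdbool.h>

/* What the command line says about one variable (hypothetical names). */
typedef enum {
    SPEC_UNSPECIFIED, /* no -F option names this variable */
    SPEC_NONE,        /* -F "var,none" */
    SPEC_FILTER       /* -F "var,<id>,<params>" */
} var_spec_t;

/* The filter that ends up applied to the output variable. */
typedef enum {
    OUT_UNFILTERED,
    OUT_USE_INPUT_FILTER,
    OUT_USE_OUTPUT_FILTER
} out_filter_t;

/* global_none is true when "-F none" was given on the command line. */
out_filter_t choose_output_filter(bool global_none, var_spec_t spec,
                                  bool input_has_filter)
{
    /* A per-variable spec always wins over "-F none". */
    if (spec == SPEC_FILTER) return OUT_USE_OUTPUT_FILTER;
    if (spec == SPEC_NONE) return OUT_UNFILTERED;
    /* No per-variable spec: "-F none" suppresses any input filter. */
    if (global_none) return OUT_UNFILTERED;
    /* Otherwise the input filter, if any, is carried over. */
    return input_has_filter ? OUT_USE_INPUT_FILTER : OUT_UNFILTERED;
}
```

Note that a per-variable "-F var,..." takes precedence over "-F none" in this reading of the table.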
184 
185 # Parameter Encode/Decode {#filters_paramcoding}
186 
187 The parameters passed to a filter are encoded internally as a vector
188 of 32-bit unsigned integers. It may be that the parameters
189 required by a filter can naturally be encoded as unsigned integers.
190 The bzip2 compression filter, for example, expects a single
191 integer value from one through nine. This encodes naturally as a
192 single unsigned integer.
193 
194 Note that signed integers and single-precision (32-bit) float values
195 also can easily be represented as 32 bit unsigned integers by
196 proper casting to an unsigned integer so that the bit pattern
197 is preserved. Simple integer values of type short or char
198 (or the unsigned versions) can also be mapped to an unsigned
199 integer by truncating to 16 or 8 bits respectively and then
200 zero extending.
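
As an illustration of these casts, here is a minimal C sketch. The helper names are hypothetical (they are not netcdf-c functions), and the code assumes that float is 32 bits wide and that uint32_t matches the unsigned int used for filter parameters.

```c
#include <stdint.h>
#include <string.h>

/* Reinterpret a 32-bit float as an unsigned parameter, preserving the bit pattern. */
uint32_t float_to_param(float f)
{
    uint32_t u;
    memcpy(&u, &f, sizeof u); /* memcpy avoids aliasing problems of pointer casts */
    return u;
}

/* Inverse of float_to_param, as a filter would do when it receives the parameter. */
float param_to_float(uint32_t u)
{
    float f;
    memcpy(&f, &u, sizeof f);
    return f;
}

/* A signed 32-bit integer keeps its two's complement bit pattern. */
uint32_t int_to_param(int32_t i) { return (uint32_t)i; }

/* Truncate to 16 bits, then zero extend to 32 bits. */
uint32_t short_to_param(int16_t s) { return (uint32_t)(uint16_t)s; }

/* Truncate to 8 bits, then zero extend to 32 bits. */
uint32_t byte_to_param(int8_t b) { return (uint32_t)(uint8_t)b; }
```

The filter has to apply the matching inverse conversion, since HDF5 itself only sees unsigned integers.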
201 
202 Machine byte order (aka endian-ness) is an issue for passing
203 some kinds of parameters. You might define the parameters when
204 compressing on a little endian machine, but later do the
205 decompression on a big endian machine. Byte order is not an
206 issue for 32-bit values because HDF5 takes care of converting
207 them between the local machine byte order and network byte
208 order.
209 
210 Parameters whose size is larger than 32-bits present a byte order problem.
211 This specifically includes double precision floats and (signed or unsigned)
212 64-bit integers. For these cases, the machine byte order issue must be
213 handled, in part, by the compression code. This is because HDF5 will treat,
214 for example, an unsigned long long as two 32-bit unsigned integers
215 and will convert each to network order separately. This means that
216 on a machine whose byte order is different than the machine in which
217 the parameters were initially created, the two integers will be separately
218 endian converted. But this will be incorrect for 64-bit values.
219 
220 So, we have this situation:
221 
222 1. the 8 bytes come in as native machine order for the machine
223  doing the call to *nc_def_var_filter*.
224 2. HDF5 divides the 8 bytes into 2 four byte pieces and ensures that each piece
225  is in network (big) endian order.
226 3. When the filter is called, the two pieces are returned in the same order
227  but with the bytes in each piece consistent with the native machine order
228  for the machine executing the filter.
229 
230 ## Encoding Algorithms
231 
232 In order to properly extract the correct 8-byte value, we need to ensure
233 that the values stored in the HDF5 file have a known format independent of
234 the native format of the creating machine.
235 
236 The idea is to do sufficient manipulation so that HDF5
237 will store the 8-byte value as a little endian value
238 divided into two 4-byte integers.
239 Note that little-endian is used as the standard
240 because it is the most common machine format.
241 When read, the filter code needs to be aware of this convention
242 and do the appropriate conversions.
243 
244 This leads to the following set of rules.
245 
246 ### Encoding
247 
248 1. Encode on little endian (LE) machine: no special action is required.
249  The 8-byte value is passed to HDF5 as two 4-byte integers. HDF5 byte
250  swaps each integer and stores it in the file.
251 2. Encode on a big endian (BE) machine: several steps are required:
252 
253  1. Do an 8-byte byte swap to convert the original value to little-endian
254  format.
255  2. Since the encoding machine is BE, HDF5 will just store the value.
256  So it is necessary to simulate little endian encoding by byte-swapping
257  each 4-byte integer separately.
258  3. This doubly swapped pair of integers is then passed to HDF5 and is stored
259  unchanged.
260 
261 ### Decoding
262 
263 1. Decode on LE machine: no special action is required.
264  HDF5 will get the two 4-bytes values from the file and byte-swap each
265  separately. The concatenation of those two integers will be the expected
266  LE value.
267 2. Decode on a big endian (BE) machine: the inverse of the encode case must
268  be implemented.
269 
270  1. HDF5 sends the two 4-byte values to the filter.
271  2. The filter must then byte-swap each 4-byte value independently.
272  3. The filter then must concatenate the two 4-byte values into a single
273  8-byte value. Because of the encoding rules, this 8-byte value will
274  be in LE format.
275  4. The filter must finally do an 8-byte byte-swap on that 8-byte value
276  to convert it to desired BE format.
277 
278 To support these rules, some utility functions exist and are discussed in
279 <a href="#filters_AppendixA">Appendix A</a>.
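
The encoding and decoding rules above can be sketched in C as follows. This is an illustration only, not the library's implementation: the function name is hypothetical, and the host byte order is passed in as an argument (an assumption made here so that both cases can be exercised on any machine).

```c
#include <stddef.h>

/* Reverse n bytes in place. */
static void swap_bytes(unsigned char* p, size_t n)
{
    size_t i;
    for (i = 0; i < n / 2; i++) {
        unsigned char tmp = p[i];
        p[i] = p[n - 1 - i];
        p[n - 1 - i] = tmp;
    }
}

/* Apply the rules to one 8-byte parameter held in mem8.
 * decode is 0 for the encoding steps and 1 for the decoding steps. */
void fix8_sketch(unsigned char mem8[8], int decode, int host_is_big_endian)
{
    if (!host_is_big_endian)
        return; /* LE machine: no special action is required */
    if (decode) {
        /* BE decode: swap each 4-byte piece, then swap all 8 bytes. */
        swap_bytes(mem8, 4);
        swap_bytes(mem8 + 4, 4);
        swap_bytes(mem8, 8);
    } else {
        /* BE encode: swap all 8 bytes, then swap each 4-byte piece. */
        swap_bytes(mem8, 8);
        swap_bytes(mem8, 4);
        swap_bytes(mem8 + 4, 4);
    }
}
```

On a big endian host the net effect of either direction is to exchange the two 4-byte halves, which is why decoding undoes encoding.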
280 
281 # Filter Specification Syntax {#filters_syntax}
282 
283 Both of the utilities
284 <a href="#filters_NCGEN">__ncgen__</a>
285 and
286 <a href="#filters_NCCOPY">__nccopy__</a>
287 allow the specification of filter parameters in text format.
288 These specifications consist of a sequence of comma
289 separated constants. The constants are converted
290 within the utility to a proper set of unsigned int
291 constants (see the <a href="#filters_paramcoding">parameter encoding section</a>).
292 
293 To simplify things, various kinds of constants can be specified
294 rather than just simple unsigned integers. The utilities will encode
295 them properly using the rules specified in
296 the section on <a href="#filters_paramcoding">parameter encode/decode</a>.
297 
298 The currently supported constants are as follows.
299 <table>
300 <tr halign="center"><th>Example<th>Type<th>Format Tag<th>Notes
301 <tr><td>-17b<td>signed 8-bit byte<td>b|B<td>Truncated to 8 bits and zero extended to 32 bits
302 <tr><td>23ub<td>unsigned 8-bit byte<td>u|U b|B<td>Truncated to 8 bits and zero extended to 32 bits
303 <tr><td>-25S<td>signed 16-bit short<td>s|S<td>Truncated to 16 bits and zero extended to 32 bits
304 <tr><td>27US<td>unsigned 16-bit short<td>u|U s|S<td>Truncated to 16 bits and zero extended to 32 bits
305 <tr><td>-77<td>implicit signed 32-bit integer<td>Leading minus sign and no tag<td>
306 <tr><td>77<td>implicit unsigned 32-bit integer<td>No tag<td>
307 <tr><td>93U<td>explicit unsigned 32-bit integer<td>u|U<td>
308 <tr><td>789f<td>32-bit float<td>f|F<td>
309 <tr><td>12345678.12345678d<td>64-bit double<td>d|D<td>LE encoding
310 <tr><td>-9223372036854775807L<td>64-bit signed long long<td>l|L<td>LE encoding
311 <tr><td>18446744073709551615UL<td>64-bit unsigned long long<td>u|U l|L<td>LE encoding
312 </table>
313 Some things to note.
314 
315 1. In all cases, except for an untagged positive integer,
316  the format tag is required and determines how the constant
317  is converted to one or two unsigned int values.
318  The positive integer case is for backward compatibility.
319 2. For signed byte and short, the value is truncated to 8 or 16 bits,
320  zero extended to 32 bits, and then treated as an unsigned int value.
321 3. For double, and signed|unsigned long long, they are converted
322  as specified in the section on
323  <a href="#filters_paramcoding">parameter encode/decode</a>.
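
To make the simplest form of a spec concrete, the following C sketch parses strings such as "307,9" that contain only untagged integers. It is a deliberately reduced illustration with a hypothetical name and signature; it does not handle the format tags in the table and is not the parser used by ncgen or nccopy.

```c
#include <errno.h>
#include <stdlib.h>

/* Parse "<id>[,<int>...]" into *idp and params[0..*nparamsp-1].
 * Returns 0 on success, -1 on malformed input or if more than
 * maxparams parameters are present. */
int parse_simple_spec(const char* spec, unsigned int* idp,
                      size_t* nparamsp, unsigned int* params, size_t maxparams)
{
    const char* p = spec;
    char* end;
    unsigned long id;
    size_t n = 0;

    errno = 0;
    id = strtoul(p, &end, 10);
    if (end == p || errno != 0) return -1;
    *idp = (unsigned int)id;
    p = end;

    while (*p == ',') {
        long long v;
        p++;
        errno = 0;
        v = strtoll(p, &end, 10);
        if (end == p || errno != 0 || n >= maxparams) return -1;
        /* Accept anything that fits in 32 bits, signed or unsigned. */
        if (v < -2147483648LL || v > 4294967295LL) return -1;
        /* Negative values keep their two's complement bit pattern. */
        params[n++] = (unsigned int)v;
        p = end;
    }
    if (*p != '\0') return -1; /* trailing characters, e.g. a format tag */
    *nparamsp = n;
    return 0;
}
```

A spec with no parameters, such as "307", is accepted and yields a zero-length parameter list.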
324 
325 Dynamic Loading Process {#filters_Process}
326 ==========
327 
328 The documentation[1,2] for the HDF5 dynamic loading was (at the time
329 this was written) out-of-date with respect to the actual HDF5 code
330 (see HDF5PL.c). So, the following discussion is largely derived
331 from looking at the actual code. This means that it is subject to change.
332 
333 Plugin directory {#filters_Plugindir}
334 ----------------
335 
336 The HDF5 loader expects plugins to be in a specified plugin directory.
337 The default directory is:
338  * "/usr/local/hdf5/lib/plugin" for linux/unix operating systems (including Cygwin)
339  * "%ALLUSERSPROFILE%\\hdf5\\lib\\plugin" for Windows systems, although the code
340  does not appear to explicitly use this path.
341 
342 The default may be overridden using the environment variable
343 __HDF5_PLUGIN_PATH__.
344 
345 Plugin Library Naming {#filters_Pluginlib}
346 ---------------------
347 
348 Given a plugin directory, HDF5 examines every file in that
349 directory that conforms to a specified name pattern
350 as determined by the platform on which the library is being executed.
351 <table>
352 <tr halign="center"><th>Platform<th>Basename<th>Extension
353 <tr halign="left"><td>Linux<td>lib*<td>.so*
354 <tr halign="left"><td>OSX<td>lib*<td>.so*
355 <tr halign="left"><td>Cygwin<td>cyg*<td>.dll*
356 <tr halign="left"><td>Windows<td>*<td>.dll
357 </table>
358 
359 Plugin Verification {#filters_Pluginverify}
360 -------------------
361 For each dynamic library located using the previous patterns,
362 HDF5 attempts to load the library and attempts to obtain information
363 from it. Specifically, it looks for two functions with the following
364 signatures.
365 
366 1. __H5PL_type_t H5PLget_plugin_type(void)__ --
367 This function is expected to return the constant value
368 __H5PL_TYPE_FILTER__ to indicate that this is a filter library.
369 2. __const void* H5PLget_plugin_info(void)__ --
370 This function returns a pointer to a table of type __H5Z_class2_t__.
371 This table contains the necessary information needed to utilize the
372 filter both for reading and for writing. In particular, it specifies
373 the filter id implemented by the library, and it must match the id
374 specified for the variable in __nc_def_var_filter__ in order to be used.
375 
376 If plugin verification fails, then that plugin is ignored and
377 the search continues for another, matching plugin.
378 
379 Debugging {#filters_Debug}
380 -------
381 Debugging plugins can be very difficult. You will probably
382 need to use the old printf approach for debugging the filter itself.
383 
384 One case worth mentioning is when you have a dataset that is
385 using an unknown filter. For this situation, you need to
386 identify what filter(s) are used in the dataset. This can
387 be accomplished using this command.
388 ````
389 ncdump -s -h <dataset filename>
390 ````
391 Since ncdump is not being asked to access the data (the -h flag), it
392 can obtain the filter information without failures. Then it can print
393 out the filter id and the parameters (the -s flag).
394 
395 Test Case {#filters_TestCase}
396 -------
397 Within the netcdf-c source tree, the directory
398 __netcdf-c/nc_test4__ contains a test case (__test_filter.c__) for
399 testing dynamic filter writing and reading using
400 bzip2. Another test (__test_filter_misc.c__) validates
401 parameter passing. These tests are disabled if __--enable-shared__
402 is not set or if __--enable-netcdf-4__ is not set.
403 
404 Example {#filters_Example}
405 -------
406 A slightly simplified version of the filter test case is also
407 available as an example within the netcdf-c source tree
408 directory __netcdf-c/examples/C__. The test is called __filter_example.c__
409 and it is executed as part of the __run_examples4.sh__ shell script.
410 The test case demonstrates dynamic filter writing and reading.
411 
412 The files __examples/C/hdf5plugins/Makefile.am__
413 and __examples/C/hdf5plugins/CMakeLists.txt__
414 demonstrate how to build the hdf5 plugin for bzip2.
415 
416 Notes
417 ==========
418 
419 Memory Allocation Issues
420 -----------
421 
422 Starting with HDF5 version 1.10.x, the plugin code MUST be
423 careful when using the standard *malloc()*, *realloc()*, and
424 *free()* functions.
425 
426 In the event that the code is allocating, reallocating, or
427 free'ing memory that either came from or will be exported to the
428 calling HDF5 library, then one MUST use the corresponding HDF5
429 functions *H5allocate_memory()*, *H5resize_memory()*,
430 *H5free_memory()* [5] to avoid memory failures.
431 
432 Additionally, if your filter code leaks memory, then the HDF5 library
433 generates a failure something like this.
434 ````
435 H5MM.c:232: H5MM_final_sanity_check: Assertion `0 == H5MM_curr_alloc_bytes_s' failed.
436 ````
437 
438 One can look at the code in plugins/H5Zbzip2.c and H5Zmisc.c to see this.
439 
440 SZIP Issues
441 -----------
442 The current szip plugin code in the HDF5 library
443 has some behaviors that can catch the unwary.
444 Specifically, this filter may do two things.
445 
446 1. Add extra parameters to the filter parameters: going from
447  the two parameters provided by the user to four parameters
448  for internal use. It turns out that the two parameters provided
449  when calling nc_def_var_filter correspond to the first two
450  parameters of the four parameters returned by nc_inq_var_filter.
451 2. Change the values of some parameters: the value of the
452  __options_mask__ argument is known to add additional flag bits,
453  and the __pixels_per_block__ parameter may be modified.
454 
455 The reason for these changes has to do with the fact that
456 the szip API provided by the underlying H5Pset_szip function
457 is actually a subset of the capabilities of the real szip implementation.
458 Presumably this is for historical reasons.
459 
460 In any case, if the caller uses __nc_inq_var_szip__, then
461 the values returned may differ from those originally specified.
462 If one uses the __nc_inq_var_filter__ API call, it may be the case that
463 both the number of parameters and the values will differ from the original
464 call to __nc_def_var_filter__.
465 
466 Supported Systems
467 -----------------
468 The current matrix of operating systems and build systems known to work is as follows.
469 <table>
470 <tr><th>Build System<th>Supported OS
471 <tr><td>Automake<td>Linux, Cygwin
472 <tr><td>CMake<td>Linux, Cygwin, Visual Studio
473 </table>
474 
475 Generic Plugin Build
476 --------------------
477 If you do not want to use Automake or CMake, the following
478 has been known to work.
479 ````
480 gcc -g -O0 -shared -o libbzip2.so <plugin source files> -L${HDF5LIBDIR} -lhdf5_hl -lhdf5 -L${ZLIBDIR} -lz
481 ````
482 
483 Appendix A. Support Utilities {#filters_AppendixA}
484 ==========
485 
486 Two functions are exported from the netcdf-c library
487 for use by client programs and by filter implementations.
488 
489 1. ````int NC_parsefilterspec(const char* spec, unsigned int* idp, size_t* nparamsp, unsigned int** paramsp);````
490  * idp will contain the filter id value from the spec.
491  * nparamsp will contain the number of 4-byte parameters
492  * paramsp will contain a pointer to the parsed parameters -- the caller
493  must free.
494  This function can parse filter spec strings as defined in
495  the section on <a href="#filters_syntax">Filter Specification Syntax</a>.
496  This function parses the first argument and returns several values.
497 
498 2. ````int NC_filterfix8(unsigned char* mem8, int decode);````
499  * mem8 is a pointer to the 8-byte value to fix.
500  * decode is 1 if the function should apply the 8-byte decoding algorithm
501  else apply the encoding algorithm.
502  This function implements the 8-byte conversion algorithms.
503  Before calling *nc_def_var_filter* (unless *NC_parsefilterspec* was used),
504  the client must call this function with the decode argument set to 0.
505  Inside the filter code, this function should be called with the decode
506  argument set to 1.
507 
508 Examples of the use of these functions can be seen in the test program
509 *nc_test4/tst_filterparser.c*.
510 
511 # References {#filters_References}
512 
513 1. https://support.hdfgroup.org/HDF5/doc/Advanced/DynamicallyLoadedFilters/HDF5DynamicallyLoadedFilters.pdf
514 2. https://support.hdfgroup.org/HDF5/doc/TechNotes/TechNote-HDF5-CompressionTroubleshooting.pdf
515 3. https://portal.hdfgroup.org/display/support/Contributions#Contributions-filters
516 4. https://support.hdfgroup.org/services/contributions.html#filters
517 5. https://support.hdfgroup.org/HDF5/doc/RM/RM_H5.html
518 
519 # Point of Contact
520 
521 __Author__: Dennis Heimbigner<br>
522 __Email__: dmh at ucar dot edu<br>
523 __Initial Version__: 1/10/2018<br>
524 __Last Revised__: 2/5/2018
525 

Generated on Sat Apr 6 2019 08:19:00 for NetCDF. NetCDF is a Unidata library.