Python module#

Overview#

Convenience functions#

GooseHDF5.dump(file, data[, root])

Dump (nested) dictionary to file.

GooseHDF5.copy(source, dest, source_paths[, ...])

Copy groups/datasets from one HDF5-archive source to another HDF5-archive dest.

GooseHDF5.copy_dataset(source, dest, paths)

Copy a dataset from one file to another.

GooseHDF5.compare(a, b[, paths_a, paths_b, ...])

Compare two files. Return dictionary with differences.

GooseHDF5.compare_rename(a, b[, rename, ...])

Compare two files. Return three dictionaries with differences.

Manipulate path#

GooseHDF5.abspath(path)

Return absolute path.

GooseHDF5.join(*args[, root])

Join path components.

Iterators#

GooseHDF5.getdatapaths(file[, root, ...])

Get paths to all datasets and groups that contain attributes.

GooseHDF5.getdatasets(file[, root, ...])

Iterator to traverse all datasets in a HDF5-archive.

GooseHDF5.getgroups(file[, root, has_attrs, ...])

Paths of all groups in a HDF5-archive.

GooseHDF5.filter_datasets(file, paths)

From a list of paths, remove those paths that do not point to datasets.

Verify#

GooseHDF5.verify(file, datasets[, error])

Try reading each dataset.

GooseHDF5.exists(file, path)

Check if a path exists in the HDF5-archive.

GooseHDF5.exists_any(file, paths)

Check if any of the input paths exists in the HDF5-archive.

GooseHDF5.exists_all(file, paths)

Check if all of the input paths exist in the HDF5-archive.

GooseHDF5.equal(source, dest, source_dataset)

Check that a dataset is equal in both files.

GooseHDF5.allequal(source, dest, source_datasets)

Check that all listed datasets are equal in both files.

Documentation#

class GooseHDF5.ExtendableList(file: File, key: str, dtype=None, chunk: int = 1000, **kwargs)#

Write extendable list to HDF5 file.

For example:

import h5py
import numpy as np
import GooseHDF5 as g5

data = np.random.random([100])

with h5py.File("foo.h5", "w") as file:
    with g5.ExtendableList(file, "foo", np.float64) as dset:
        for d in data:
            dset.append(d)
Parameters
  • file – Opened HDF5 file (in write mode).

  • key – Path to the dataset.

  • dtype – Data-type to use (needed for new datasets).

  • chunk – Chunk size: flush after this many entries.

  • kwargs – An optional dictionary with attributes.

flush()#

Flush the buffer.

class GooseHDF5.ExtendableSlice(file: File, name: str, shape: Optional[tuple[int, ...]] = None, dtype=None, chunk: int = 1, maxshape: Optional[tuple[int, ...]] = None, **kwargs)#

Write slices of an extendable dataset to HDF5 file.

For example:

import h5py
import numpy as np
import GooseHDF5 as g5

dataset = np.random.random([100, 10, 10])

with h5py.File("foo.h5", "w") as file:
    with g5.ExtendableSlice(file, "foo", (10, 10), np.float64) as dset:
        for i in range(dataset.shape[0]):
            dset += dataset[i, ...]
Parameters
  • file – Opened HDF5 file (in write mode).

  • name – Path to the dataset.

  • shape – Shape of all dimensions >= 1. The shape of dimension 0 is dynamic.

  • dtype – Data-type to use (needed for new datasets).

  • chunk – Chunk size: flush after this many slices.

  • maxshape – Maximum shape of all dimensions >= 1. Default: same as shape.

  • kwargs – An optional dictionary with attributes.

flush()#

Flush the buffer.

GooseHDF5.G5compare(args: list[str])#

Command-line tool to compare two HDF5-archives, see --help.

Parameters

args – Command-line arguments (should be all strings).

GooseHDF5.G5list(args: list[str])#

Command-line tool to list datasets in a file, see --help.

Parameters

args – Command-line arguments (should be all strings).

GooseHDF5.G5modify(args: list[str])#

Command-line tool to modify datasets in a file, see --help.

Parameters

args – Command-line arguments (should be all strings).

GooseHDF5.G5print(args: list[str])#

Command-line tool to print datasets from a file, see --help.

Parameters

args – Command-line arguments (should be all strings).

GooseHDF5.abspath(path: str) str#

Return absolute path.

Parameters

path (str) – A HDF5-path.

Returns

The absolute path.

GooseHDF5.allequal(source: File, dest: File, source_datasets: list[str], dest_datasets: Optional[list[str]] = None, root: Optional[str] = None, attrs: bool = True, matching_dtype: bool = False, shallow: bool = False)#

Check that all listed datasets are equal in both files.

Parameters
  • source (h5py.File) – The source HDF5-archive.

  • dest (h5py.File) – The destination HDF5-archive.

  • source_datasets (list) – List of dataset-paths in source.

  • dest_datasets (list) – List of dataset-paths in dest, defaults to source_datasets.

  • root – Path prefix for all dest_datasets.

  • attrs – Compare attributes (the same way as datasets).

  • matching_dtype – Check that not only the data but also the type matches.

  • shallow – Check only the presence of the dataset, not its value.

GooseHDF5.compare(a: str | h5py.File, b: str | h5py.File, paths_a: list[str] = None, paths_b: list[str] = None, attrs: bool = True, matching_dtype: bool = False, shallow: bool = False, only_datasets: bool = False, fold: str | list[str] = None, max_depth: int = None, close: bool = False) dict[list]#
GooseHDF5.compare(a: h5py.File, b: h5py.File, paths_a: list[str] = None, paths_b: list[str] = None, attrs: bool = True, matching_dtype: bool = False, shallow: bool = False, only_datasets: bool = False, max_depth: int = None, fold: str | list[str] = None, list_folded: bool = False, close: bool = False) dict[list]
GooseHDF5.compare(a: str, b: str, paths_a: list[str] = None, paths_b: list[str] = None, attrs: bool = True, matching_dtype: bool = False, shallow: bool = False, only_datasets: bool = False, max_depth: int = None, fold: str | list[str] = None, list_folded: bool = False, close: bool = False) dict[list]

Compare two files. Return dictionary with differences:

{
    "->" : ["/path/in/b/but/not/in/a", ...],
    "<-" : ["/path/in/a/but/not/in/b", ...],
    "!=" : ["/path/in/both/but/different/data", ...],
    "==" : ["/data/matching", ...]
}

Warning

Folded groups are not compared in any way! Use list_folded to include this in the output.

Parameters
  • a – HDF5-archive (as opened h5py.File or with the filepath).

  • b – HDF5-archive (as opened h5py.File or with the filepath).

  • paths_a – Paths from a to consider. Default: read from getdatapaths().

  • paths_b – Paths from b to consider. Default: read from getdatapaths().

  • attrs – Compare attributes (the same way as datasets).

  • matching_dtype – Check that not only the data but also the type matches.

  • shallow – Check only the presence of datasets, not their values, size, or attributes.

  • only_datasets – Compare datasets only (not groups, regardless of whether they have attributes).

  • max_depth – Set a maximum depth beyond which groups are folded.

  • fold – Specify groups that are folded.

  • list_folded – Return folded groups under “??”.

  • close – Use np.isclose also on float-int matches.

Returns

Dictionary with differences.

GooseHDF5.compare_allow(comparison: dict[list], paths: list[str], keys: list[str] = ['->', '<-', '!='], root: Optional[str] = None) dict[list]#

Modify the output of compare() to allow specific differences. In practice this removes certain fields from the lists under specific keys in the dictionary.

Parameters
  • comparison – The output of compare().

  • paths – List of paths to allow.

  • keys – List of comparison keys ("->", "<-", "!=").

  • root – Path prefix for paths.

Returns

The modified comparison dictionary.

GooseHDF5.compare_rename(a: h5py.File, b: h5py.File, rename: list[str] = None, paths_a: list[str] = None, paths_b: list[str] = None, attrs: bool = True, matching_dtype: bool = False, shallow: bool = False, regex: bool = False, only_datasets: bool = True, max_depth: int = None, fold: str | list[str] = None, list_folded: bool = False, close: bool = False) dict[list]#

Compare two files. Return three dictionaries with differences:

# plain comparison between a and b

{
    "->" : ["/path/in/b/but/not/in/a", ...],
    "<-" : ["/path/in/a/but/not/in/b", ...],
    "!=" : ["/path/in/both/but/different/data", ...],
    "==" : ["/data/matching", ...]
}

# comparison of renamed paths: list of paths in a

{
    "!=" : ["/path/in/a/with/rename/path/not_equal", ...],
    "==" : ["/path/in/a/with/rename/path/matching", ...]
}

# comparison of renamed paths: list of paths in b

{
    "!=" : ["/path/in/b/with/rename/path/not_equal", ...],
    "==" : ["/path/in/b/with/rename/path/matching", ...]
}

Warning

Folded groups are not compared in any way! Use list_folded to include this in the output.

Parameters
  • a – HDF5-archive (as opened h5py.File or with the filepath).

  • b – HDF5-archive (as opened h5py.File or with the filepath).

  • rename – List with renamed pairs: [["/a/0", "/b/1"], ...].

  • paths_a – Paths from a to consider. Default: read from getdatapaths().

  • paths_b – Paths from b to consider. Default: read from getdatapaths().

  • attrs – Compare attributes (the same way as datasets).

  • matching_dtype – Check that not only the data but also the type matches.

  • shallow – Check only the presence of datasets, not their values, size, or attributes.

  • regex – Use regular expressions to match rename.

  • only_datasets – Compare datasets only (not groups, regardless of whether they have attributes).

  • max_depth – Set a maximum depth beyond which groups are folded.

  • fold – Specify groups that are folded.

  • list_folded – Return folded groups under “??”.

  • close – Use np.isclose also on float-int matches.

Returns

Dictionary with differences.

GooseHDF5.copy(source: File, dest: File, source_paths: list[str], dest_paths: Optional[list[str]] = None, root: Optional[str] = None, source_root: Optional[str] = None, skip: bool = False, preserve_soft: bool = False, shallow: bool = False, expand_soft: bool = False, expand_external: bool = False, expand_refs: bool = False, without_attrs: bool = False)#

Copy groups/datasets from one HDF5-archive source to another HDF5-archive dest. The datasets can be renamed by specifying a list of dest_paths (whose entries should correspond to the source_paths). In addition, a root path prefix can be specified for the destination datasets. Likewise, a source_root path prefix can be specified for the source datasets.

For the options shallow, expand_soft, expand_external, expand_refs, without_attrs see: h5py.Group.copy.

Parameters
  • source – The source HDF5-archive.

  • dest – The destination HDF5-archive.

  • source_paths – List of dataset-paths in source.

  • dest_paths – List of dataset-paths in dest, defaults to source_paths.

  • root – Path prefix for all dest_paths.

  • source_root – Path prefix for all source_paths.

  • skip – Skip datasets that are not present in source.

  • preserve_soft – Preserve soft links.

  • shallow – Only copy immediate members of a group.

  • expand_soft – Expand soft links into new objects.

  • expand_external – Expand external links into new objects.

  • expand_refs – Copy objects which are pointed to by references.

  • without_attrs – Copy object(s) without copying HDF5 attributes.

GooseHDF5.copy_dataset(source, dest, paths, compress=False, double_to_float=False)#

Copy a dataset from one file to another. This function also copies possible attributes.

Parameters
  • source (h5py.File) – The source HDF5-archive.

  • dest (h5py.File) – The destination HDF5-archive.

  • paths (str, list) – (List of) HDF5-path(s) to copy.

  • compress (bool) – Compress the destination dataset(s).

  • double_to_float (bool) – Convert doubles to floats before copying.

GooseHDF5.create_extendible(file: File, key: str, dtype, ndim: int = 1, **kwargs) Dataset#

Create extendible dataset.

Parameters
  • file – Opened HDF5 file.

  • key – Path to the dataset.

  • dtype – Data-type to use.

  • ndim – Number of dimensions.

  • kwargs – An optional dictionary with attributes.

GooseHDF5.dump(file: File, data: dict, root: str = '/')#

Dump (nested) dictionary to file.

GooseHDF5.equal(source: File, dest: File, source_dataset: str, dest_dataset: Optional[str] = None, root: Optional[str] = None, attrs: bool = True, matching_dtype: bool = False, shallow: bool = False, close: bool = False)#

Check that a dataset is equal in both files.

Parameters
  • source (h5py.File) – The source HDF5-archive.

  • dest (h5py.File) – The destination HDF5-archive.

  • source_dataset (str) – Dataset-path in source.

  • dest_dataset (str) – Dataset-path in dest, defaults to source_dataset.

  • root – Path prefix for dest_dataset.

  • attrs – Compare attributes (the same way as datasets).

  • matching_dtype – Check that not only the data but also the type matches.

  • shallow – Check only the presence of the dataset, not its value.

  • close – Use np.isclose also on float-int matches.

GooseHDF5.exists(file, path)#

Check if a path exists in the HDF5-archive.

Parameters
  • file (h5py.File) – A HDF5-archive.

  • path (str) – HDF5-path.

GooseHDF5.exists_all(file, paths)#

Check if all of the input paths exist in the HDF5-archive.

Parameters
  • file (h5py.File) – A HDF5-archive.

  • paths (list) – List of HDF5-paths.

GooseHDF5.exists_any(file, paths)#

Check if any of the input paths exists in the HDF5-archive.

Parameters
  • file (h5py.File) – A HDF5-archive.

  • paths (list) – List of HDF5-paths.

GooseHDF5.filter_datasets(file, paths)#

From a list of paths, remove those paths that do not point to datasets.

Parameters
  • file (h5py.File) – A HDF5-archive.

  • paths (list) – List of HDF5-paths.

Returns

Filtered paths.

GooseHDF5.getdatapaths(file: h5py.File, root: str = '/', max_depth: int = None, fold: str | list[str] = None, fold_symbol: str = '/...') list[str]#

Get paths to all datasets and groups that contain attributes.

Warning

getgroups() visits all groups in the file, regardless of whether they are folded (by fold or max_depth). Depending on the file, this can be quite costly. If runtime is an issue, consider searching for datasets only, using getdatasets(), if your use-case allows it.

Parameters
  • file – A HDF5-archive.

  • root – Start at a certain point along the path-tree.

  • max_depth – Set a maximum depth beyond which groups are folded.

  • fold – Specify groups that are folded.

  • fold_symbol – Use symbol to indicate that a group is folded.

Returns

List of paths (always absolute, so includes the root if used).

GooseHDF5.getdatasets(file: h5py.File, root: str = '/', max_depth: int = None, fold: str | list[str] = None, fold_symbol: str = '/...') Iterator#

Iterator to traverse all datasets in a HDF5-archive. One can choose to fold (not traverse deeper than):

  • Groups deeper than a certain max_depth.

  • A (list of) specific group(s).

Parameters
  • file – A HDF5-archive.

  • root – Start at a certain point along the path-tree.

  • max_depth – Set a maximum depth beyond which groups are folded.

  • fold – Specify groups that are folded.

  • fold_symbol – Use symbol to indicate that a group is folded.

Returns

Iterator to paths (always absolute, so includes the root if used).

Example

Consider this file:

/path/to/first/a
/path/to/first/b
/data/c
/data/d
/e

Calling:

with h5py.File("...", "r") as file:
    for path in GooseHDF5.getdatasets(file, max_depth=2, fold="/data"):
        print(path)

Will print:

/path/to/...
/data/...
/e

The ... indicates that it concerns a folded group, not a dataset. Here, the first group was folded because of the maximum depth, the second because it was specifically requested to be folded.

GooseHDF5.getgroups(file: h5py.File, root: str = '/', has_attrs: bool = False, max_depth: int = None, fold: str | list[str] = None, fold_symbol: str = '/...') list[str]#

Paths of all groups in a HDF5-archive.

Warning

The function visits all groups in the file, regardless of whether they are folded (by fold or max_depth). Depending on the file, this can be quite costly.

Parameters
  • file – A HDF5-archive.

  • root – Start at a certain point along the path-tree.

  • has_attrs – Return only groups that have attributes.

  • max_depth (int) – Set a maximum depth beyond which groups are folded.

  • fold – Specify groups that are folded.

  • fold_symbol – Use symbol to indicate that a group is folded.

Returns

List of paths (always absolute, so includes the root if used).

GooseHDF5.info_table(source, paths: list[str], link_type: bool = False) PrettyTable#

Get a table with basic information per path:

  • path

  • size

  • shape

  • dtype

  • attrs: Number of attributes

  • link: Link type

Parameters
  • source – A HDF5-archive.

  • paths – List of paths.

  • link_type – Include the link-type in the output.

GooseHDF5.isnumeric(a)#

Returns True if an array contains numeric values.

Parameters

a (array) – An array.

Returns

bool

GooseHDF5.join(*args, root: bool = False) str#

Join path components.

Parameters
  • args (list) – Piece of a path.

  • root – Prepend the output with the root "/".

Returns

The concatenated path.

GooseHDF5.print_attribute(source, paths: list[str])#

Print paths to datasets and to all underlying attributes.

Parameters

paths – List of paths.

GooseHDF5.print_plain(source, paths: list[str], show_links: bool = False)#

Print the paths to all datasets (one per line).

Parameters
  • paths – List of paths.

  • show_links – Show the path the link points to.

GooseHDF5.verify(file, datasets, error=False)#

Try reading each dataset.

Parameters
  • file (h5py.File) – A HDF5-archive.

  • datasets (list) – List of HDF5-paths to datasets.

  • error (bool) –

    • If True, the function raises an error if reading failed.

    • If False, the function just continues.

Returns

List with only those datasets that can be successfully opened.