HDF5: The File Format for Serious Data
What Is HDF5?
HDF5 (Hierarchical Data Format 5) is a binary file format and library designed for storing large, complex scientific datasets. It is the standard format in computational physics, climate science, bioinformatics, and any domain where you need to store multi-dimensional arrays efficiently alongside metadata.
The selling point: a single HDF5 file can contain petabytes of data organized into a filesystem-like hierarchy of groups and datasets, with attributes attached to any node.
The Data Model
HDF5 organizes data like a UNIX filesystem:
- Group — a container, like a directory. The root is always /.
- Dataset — an N-dimensional array of a fixed type (float64, int32, compound struct, etc.)
- Attribute — small metadata attached to a group or dataset (simulation parameters, timestamps, units)
/
├── simulation/
│   ├── parameters   [attribute: dt=0.01, solver="RK4"]
│   ├── time         [float64, shape: (10000,)]
│   └── positions    [float64, shape: (10000, 1024, 3)]
└── analysis/
    └── energy       [float64, shape: (10000,)]
Reading and Writing in C++
The C++ HDF5 API is verbose but expressive:
#include <H5Cpp.h>
#include <vector>
void write_dataset() {
    H5::H5File file("simulation.h5", H5F_ACC_TRUNC);
    H5::Group sim = file.createGroup("/simulation");

    std::vector<double> data(1024 * 3);
    // fill data...

    hsize_t dims[2] = {1024, 3};
    H5::DataSpace space(2, dims);
    H5::DataSet ds = sim.createDataSet(
        "positions", H5::PredType::NATIVE_DOUBLE, space
    );
    ds.write(data.data(), H5::PredType::NATIVE_DOUBLE);
}

void read_dataset() {
    H5::H5File file("simulation.h5", H5F_ACC_RDONLY);
    H5::DataSet ds = file.openDataSet("/simulation/positions");

    H5::DataSpace space = ds.getSpace();
    hsize_t dims[2];
    space.getSimpleExtentDims(dims);

    std::vector<double> data(dims[0] * dims[1]);
    ds.read(data.data(), H5::PredType::NATIVE_DOUBLE);
}
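Attributes from the data model follow the same pattern. Here is a minimal sketch that attaches the dt and solver metadata from the tree above directly to the /simulation group (rather than a separate parameters node, to keep it short); the helper name write_attributes is just for illustration:

#include <H5Cpp.h>
#include <string>

void write_attributes(H5::Group& sim) {
    // Scalar double attribute: the time step from the tree above.
    double dt = 0.01;
    H5::DataSpace scalar(H5S_SCALAR);
    H5::Attribute a = sim.createAttribute(
        "dt", H5::PredType::NATIVE_DOUBLE, scalar
    );
    a.write(H5::PredType::NATIVE_DOUBLE, &dt);

    // Variable-length string attribute for the solver name.
    H5::StrType str_type(H5::PredType::C_S1, H5T_VARIABLE);
    H5::Attribute s = sim.createAttribute("solver", str_type, scalar);
    s.write(str_type, std::string("RK4"));
}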
Chunking and Compression
The most important performance decision in HDF5 is chunking. By default, a dataset is stored contiguously on disk, which works well when you read whole slabs along the leading dimension. Read along any other axis, though, say one particle's trajectory across all 10000 time steps of a (10000, 1024, 3) array, and you end up seeking through nearly the entire file to retrieve a tiny fraction of the data. Contiguous datasets also cannot be compressed; compression filters require chunked storage.
Chunking stores data in fixed-size blocks that can be independently read and compressed:
H5::DSetCreatPropList props;
hsize_t chunk_dims[3] = {1, 1024, 3}; // one time step per chunk
props.setChunk(3, chunk_dims);
props.setDeflate(6); // gzip level 6
H5::DataSet ds = sim.createDataSet(
    "positions", H5::PredType::NATIVE_DOUBLE, space, props
);
Chunk shape should match your read pattern. Slicing across chunk boundaries forces HDF5 to read and decompress every chunk the slice touches, much like traversing a row-major C array in column order.
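Chunking pairs naturally with HDF5's partial I/O: you select a hyperslab of the file and read only that region. A minimal sketch, assuming positions was written with the full (10000, 1024, 3) shape from the data model above; read_time_step is a hypothetical helper name:

#include <H5Cpp.h>
#include <vector>

// Read time step t of /simulation/positions without loading the rest.
std::vector<double> read_time_step(H5::H5File& file, hsize_t t) {
    H5::DataSet ds = file.openDataSet("/simulation/positions");
    H5::DataSpace file_space = ds.getSpace();

    // Select one (1, 1024, 3) slab starting at {t, 0, 0}.
    // Note: the C++ API takes count before start.
    hsize_t start[3] = {t, 0, 0};
    hsize_t count[3] = {1, 1024, 3};
    file_space.selectHyperslab(H5S_SELECT_SET, count, start);

    // Matching in-memory dataspace for the slab.
    H5::DataSpace mem_space(3, count);

    std::vector<double> buf(1024 * 3);
    ds.read(buf.data(), H5::PredType::NATIVE_DOUBLE, mem_space, file_space);
    return buf;
}

With the one-time-step-per-chunk layout above, such a read touches exactly one chunk on disk.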
When to Use HDF5
HDF5 is not a general-purpose database. It excels when:
- You have large arrays of numerical data (simulations, sensor readings, image stacks)
- You need portable, self-describing files — the schema lives inside the file
- You need efficient partial I/O — reading slices of huge arrays without loading everything
- Multiple tools need to read the same data (Python's h5py, MATLAB, Julia's HDF5.jl all speak the same format)
Avoid it for: transactional data, frequently updated records, anything that fits in memory and doesn't need partial reads.
Conclusion
HDF5 is the right tool when raw binary is too opaque and CSV is too slow. The overhead of learning its API pays off quickly when you're working with datasets measured in gigabytes. The hierarchical structure and embedded metadata also serve as self-documentation — six months later, you can open a simulation file and know exactly what it contains.