Caching: add compression + filename + closing/loading#3005
Conversation
There was a problem hiding this comment.
Pull request overview
This PR enhances the HDF5 caching functionality for SpeechBrain by adding compression support, configurable cache filenames, and pickle/deepcopy compatibility. The changes make the CachedHDF5DynamicItem class more flexible and compatible with operations like sorting that require object serialization.
- Adds HDF5 compression support with configurable compression algorithm (default: "gzip")
- Enables custom cache filenames instead of hardcoded "cache.hdf5"
- Implements
__getstate__and__setstate__methods to enable pickling/deepcopy by properly handling HDF5 file handles
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 3 out of 3 changed files in this pull request and generated 7 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
pplantinga
left a comment
There was a problem hiding this comment.
Overall, great improvement to the caching. Just a few minor points that could be addressed before merging.
| """Set the state of the object for unpickling.""" | ||
| self.__dict__ = state | ||
| # Reopen the file lazily in the same mode using the directory and filename. | ||
| self.hdf5_path = self.cache_location / self.cache_filename |
There was a problem hiding this comment.
Shouldn't the hdf5_path variable be already loaded from the state? Perhaps we should just make hdf5_path a @property to ensure it stays in sync with cache_location and cache_filename
| assert (cache_dir / "utt1.pt").exists() | ||
|
|
||
| # Second call should use cache | ||
| result2 = tokenize("utt1", " Hello World ") |
There was a problem hiding this comment.
Should we use a different value here, to ensure its loading from the cache and not re-computing? I suppose the call_count already does this but we could make extra sure
pplantinga
left a comment
There was a problem hiding this comment.
Looks great, thanks!
What does this PR do?
This PR adds the support of a compression backend for HD5, as well as the possibility of modifying the filename, and two utility function when using pickle/deep copy of the class which in the past would raise an error as it's not possible to deepcopy an item that is reading a file (you first need to close it). This now makes the Caching feature compatible with sorting and such.
Before submitting
PR review
Reviewer checklist