Skip to content

Caching: add compression + filename + closing/loading#3005

Merged
pplantinga merged 18 commits into
speechbrain:developfrom
Adel-Moumen:fix_caching
Nov 28, 2025
Merged

Caching: add compression + filename + closing/loading#3005
pplantinga merged 18 commits into
speechbrain:developfrom
Adel-Moumen:fix_caching

Conversation

@Adel-Moumen

Copy link
Copy Markdown
Collaborator

What does this PR do?

This PR adds the support of a compression backend for HD5, as well as the possibility of modifying the filename, and two utility function when using pickle/deep copy of the class which in the past would raise an error as it's not possible to deepcopy an item that is reading a file (you first need to close it). This now makes the Caching feature compatible with sorting and such.

Before submitting
  • Did you read the contributor guideline?
  • Did you make sure your PR does only one thing, instead of bundling different changes together?
  • Did you make sure to update the documentation with your changes? (if necessary)
  • Did you write any new necessary tests? (not for typos and docs)
  • Did you verify new and existing tests pass locally with your changes?
  • Did you list all the breaking changes introduced by this pull request?
  • Does your code adhere to project-specific code style and conventions?

PR review

Reviewer checklist
  • Is this pull request ready for review? (if not, please submit in draft mode)
  • Check that all items from Before submitting are resolved
  • Make sure the title is self-explanatory and the description concisely explains the PR
  • Add labels and milestones (and optionally projects) to the PR so it can be classified
  • Confirm that the changes adhere to compatibility requirements (e.g., Python version, platform)
  • Review the self-review checklist to ensure the code is ready for review

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR enhances the HDF5 caching functionality for SpeechBrain by adding compression support, configurable cache filenames, and pickle/deepcopy compatibility. The changes make the CachedHDF5DynamicItem class more flexible and compatible with operations like sorting that require object serialization.

  • Adds HDF5 compression support with configurable compression algorithm (default: "gzip")
  • Enables custom cache filenames instead of hardcoded "cache.hdf5"
  • Implements __getstate__ and __setstate__ methods to enable pickling/deepcopy by properly handling HDF5 file handles

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread speechbrain/integrations/hdf5/cached_item.py Outdated
Comment thread speechbrain/integrations/hdf5/cached_item.py Outdated
Comment thread speechbrain/integrations/hdf5/cached_item.py Outdated
Adel-Moumen and others added 5 commits November 25, 2025 20:17

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 7 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread speechbrain/integrations/hdf5/cached_item.py Outdated
Comment thread speechbrain/integrations/hdf5/cached_item.py Outdated
Comment thread speechbrain/integrations/hdf5/cached_item.py
Comment thread speechbrain/integrations/tests/test_cached_item.py
Comment thread speechbrain/integrations/tests/test_cached_item.py
Comment thread tests/unittests/test_data_pipeline.py
Comment thread tests/unittests/test_data_pipeline.py

@pplantinga pplantinga left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall, great improvement to the caching. Just a few minor points that could be addressed before merging.

Comment thread speechbrain/integrations/hdf5/cached_item.py Outdated
"""Set the state of the object for unpickling."""
self.__dict__ = state
# Reopen the file lazily in the same mode using the directory and filename.
self.hdf5_path = self.cache_location / self.cache_filename

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't the hdf5_path variable be already loaded from the state? Perhaps we should just make hdf5_path a @property to ensure it stays in sync with cache_location and cache_filename

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done!

Comment thread tests/unittests/test_data_pipeline.py Outdated
assert (cache_dir / "utt1.pt").exists()

# Second call should use cache
result2 = tokenize("utt1", " Hello World ")

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we use a different value here, to ensure its loading from the cache and not re-computing? I suppose the call_count already does this but we could make extra sure

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done!

@pplantinga pplantinga left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great, thanks!

@pplantinga pplantinga merged commit 6899673 into speechbrain:develop Nov 28, 2025
5 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants