Caching: add compression + filename + closing/loading by Adel-Moumen · Pull Request #3005 · speechbrain/speechbrain

Adel-Moumen · 2025-11-25T15:06:14Z

What does this PR do?

This PR adds support for a compression backend for HDF5, the ability to change the cache filename, and two utility functions for pickling/deep-copying the class. Previously this raised an error, because an object that holds an open file cannot be deep-copied (the file must be closed first). This makes the Caching feature compatible with sorting and similar operations.

Before submitting

Did you read the contributor guideline?
Did you make sure your PR does only one thing, instead of bundling different changes together?
Did you make sure to update the documentation with your changes? (if necessary)
Did you write any new necessary tests? (not for typos and docs)
Did you verify new and existing tests pass locally with your changes?
Did you list all the breaking changes introduced by this pull request?
Does your code adhere to project-specific code style and conventions?

PR review

Reviewer checklist

Is this pull request ready for review? (if not, please submit in draft mode)
Check that all items from Before submitting are resolved
Make sure the title is self-explanatory and the description concisely explains the PR
Add labels and milestones (and optionally projects) to the PR so it can be classified
Confirm that the changes adhere to compatibility requirements (e.g., Python version, platform)
Review the self-review checklist to ensure the code is ready for review

Copilot

Pull request overview

This PR enhances the HDF5 caching functionality for SpeechBrain by adding compression support, configurable cache filenames, and pickle/deepcopy compatibility. The changes make the CachedHDF5DynamicItem class more flexible and compatible with operations like sorting that require object serialization.

Adds HDF5 compression support with configurable compression algorithm (default: "gzip")
Enables custom cache filenames instead of hardcoded "cache.hdf5"
Implements __getstate__ and __setstate__ methods to enable pickling/deepcopy by properly handling HDF5 file handles
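The last point can be illustrated with a minimal, self-contained sketch. It uses a plain text file rather than h5py so that it runs with the standard library only; `LazyFileReader` and its attributes are hypothetical names and not the PR's actual `CachedHDF5DynamicItem` code. `pickle` goes through the same `__getstate__`/`__setstate__` hooks as `copy.deepcopy`.

```python
import copy
import tempfile
from pathlib import Path


class LazyFileReader:
    """Hypothetical stand-in for an object that keeps a file open."""

    def __init__(self, path):
        self.path = Path(path)
        self._handle = None  # opened lazily on first read

    def read(self):
        if self._handle is None:
            self._handle = open(self.path, "r")
        self._handle.seek(0)
        return self._handle.read()

    def close(self):
        if self._handle is not None:
            self._handle.close()
            self._handle = None

    def __getstate__(self):
        # Copy the attributes and drop the open handle, which cannot be copied.
        state = self.__dict__.copy()
        state["_handle"] = None
        return state

    def __setstate__(self, state):
        # Restore the attributes; read() reopens the file when needed.
        self.__dict__.update(state)


with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "cache.txt"
    path.write_text("cached value")

    reader = LazyFileReader(path)
    first = reader.read()  # the handle is now open

    clone = copy.deepcopy(reader)  # succeeds because the handle is dropped
    clone_text = clone.read()  # the clone opens its own handle

    for obj in (reader, clone):
        obj.close()
```

Without the two hooks, `copy.deepcopy(reader)` would fail once the handle is open, since file objects cannot be pickled or deep-copied.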

Changed default compression from 'gzip' to None in CachedHDF5DynamicItem.

Copilot

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 7 comments.

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

pplantinga

Overall, great improvement to the caching. Just a few minor points that could be addressed before merging.

pplantinga · 2025-11-26T16:38:00Z

+        """Set the state of the object for unpickling."""
+        self.__dict__ = state
+        # Reopen the file lazily in the same mode using the directory and filename.
+        self.hdf5_path = self.cache_location / self.cache_filename


Shouldn't the hdf5_path variable already be loaded from the state? Perhaps we should just make hdf5_path a @property to ensure it stays in sync with cache_location and cache_filename.
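A rough sketch of the `@property` suggestion, assuming a simplified class: `CacheConfig` is a hypothetical stand-in, and only the attribute names `cache_location`, `cache_filename`, and `hdf5_path` come from the diff above.

```python
import copy
from pathlib import Path


class CacheConfig:
    """Hypothetical, simplified stand-in; not the class from the PR."""

    def __init__(self, cache_location, cache_filename="cache.hdf5"):
        self.cache_location = Path(cache_location)
        self.cache_filename = cache_filename

    @property
    def hdf5_path(self):
        # Derived on every access, so it cannot go stale and does not
        # need to be stored in (or rebuilt from) the serialized state.
        return self.cache_location / self.cache_filename


config = CacheConfig("cache_dir")
restored = copy.deepcopy(config)
restored.cache_filename = "train.hdf5"  # hdf5_path follows automatically
```

With this layout, `__setstate__` would no longer need to recompute the path after restoring `__dict__`.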

pplantinga · 2025-11-26T16:41:09Z

+    assert (cache_dir / "utt1.pt").exists()
+
+    # Second call should use cache
+    result2 = tokenize("utt1", "  Hello World  ")


Should we use a different value here, to ensure it's loading from the cache and not re-computing? I suppose the call_count already does this, but we could make extra sure.
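One way this check could look, as a sketch under stated assumptions: `make_cached_tokenizer` is a hypothetical file-backed cache keyed by utterance id (using `pickle` files rather than the `.pt` files in the diff), and the tokenization itself is made up for illustration.

```python
import pickle
import tempfile
from pathlib import Path


def make_cached_tokenizer(cache_dir):
    """Hypothetical file-backed cache keyed by utterance id."""
    calls = {"count": 0}

    def tokenize(uid, text):
        cache_file = Path(cache_dir) / f"{uid}.pkl"
        if cache_file.exists():
            # Cache hit: the text argument is ignored entirely.
            with open(cache_file, "rb") as f:
                return pickle.load(f)
        calls["count"] += 1
        result = text.strip().lower().split()
        with open(cache_file, "wb") as f:
            pickle.dump(result, f)
        return result

    return tokenize, calls


with tempfile.TemporaryDirectory() as tmp:
    tokenize, calls = make_cached_tokenizer(tmp)
    result1 = tokenize("utt1", "  Hello World  ")
    # Same id, different text: a cache hit must return the first result.
    result2 = tokenize("utt1", "completely different text")
```

If the second call recomputed instead of reading the cache, `result2` would differ from `result1`, so the comparison catches the failure independently of the call counter.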

pplantinga

Looks great, thanks!

Adel-Moumen added 2 commits November 25, 2025 07:03

add compression + filename + closing/loading

42c797d

fix pre-commit

df25c86

Adel-Moumen requested a review from Copilot November 25, 2025 15:23

Copilot started reviewing on behalf of Adel-Moumen November 25, 2025 15:23

Copilot finished reviewing on behalf of Adel-Moumen November 25, 2025 15:27

Copilot AI reviewed Nov 25, 2025

Comment thread speechbrain/integrations/hdf5/cached_item.py Outdated

Comment thread speechbrain/integrations/hdf5/cached_item.py Outdated

Comment thread speechbrain/integrations/hdf5/cached_item.py Outdated

Adel-Moumen and others added 5 commits November 25, 2025 20:17

Change default compression to None in cached_item.py

2dde68d

Changed default compression from 'gzip' to None in CachedHDF5DynamicItem.

add tests

c43fa7e

docstring

bbf829e

docstring

7b6fd77

docstring

75aa502

Adel-Moumen requested a review from Copilot November 26, 2025 15:44

Copilot started reviewing on behalf of Adel-Moumen November 26, 2025 15:44

Copilot finished reviewing on behalf of Adel-Moumen November 26, 2025 15:47

Copilot AI reviewed Nov 26, 2025

Adel-Moumen and others added 4 commits November 26, 2025 15:52

Update speechbrain/integrations/hdf5/cached_item.py

0f069ca

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

fix potential bug with unpickled

d25a75f

Merge branch 'fix_caching' of https://github.com/Adel-Moumen/speechbrain into fix_caching

94f7d51

test serialisation

0d261b4

Adel-Moumen requested a review from pplantinga November 26, 2025 16:06

pplantinga reviewed Nov 26, 2025

Adel-Moumen added 7 commits November 28, 2025 15:21

update repo

debae7f

Merge branch 'fix_caching' of https://github.com/Adel-Moumen/speechbrain into fix_caching

dfb49d0

wtf

39f8209

wtf x2

370d8f4

:(

edebf4d

fix Path

0d0bd9e

property state

aa2938e

pplantinga approved these changes Nov 28, 2025

pplantinga merged commit 6899673 into speechbrain:develop Nov 28, 2025
5 checks passed

Adel-Moumen mentioned this pull request Nov 28, 2025

HDF5 integration not loading/caching etc. #3007

Merged

13 tasks

3 participants