fix: serve still-valid cached AWS credentials when refresh fails#7506
Open
lancedb-robot wants to merge 1 commit into
Open
fix: serve still-valid cached AWS credentials when refresh fails#7506lancedb-robot wants to merge 1 commit into
lancedb-robot wants to merge 1 commit into
Conversation
AwsCredentialAdapter proactively refreshes credentials credentials_refresh_offset (default 60s) before they expire. When that proactive refresh hit a transient error from the underlying provider (e.g. an IMDS/STS HTTP connect timeout), get_credential discarded the still-valid cached credentials and returned a hard error, surfacing as a 500 for S3 and DynamoDB operations. Fall back to the cached credentials when a refresh fails but the cached credentials have not actually expired yet; the next call retries the refresh. Truly-expired credentials still surface the error rather than being used.
Katomoto
reviewed
Jun 27, 2026
Katomoto
left a comment
There was a problem hiding this comment.
Logic looks correct. One suggestion: the early return on line prevents the cleanup function from running — worth adding a finally block.
Codecov Report✅ All modified and coverable lines are covered by tests. 📢 Thoughts on this report? Let us know! |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Problem
Production query nodes intermittently fail S3/DynamoDB operations with:
which surfaces to callers as a
500.AwsCredentialAdapter(lance-io/src/object_store/providers/aws.rs) proactively refreshes credentialscredentials_refresh_offset(default 60s) before they expire. During that window the cached credentials are still valid, but the adapter treated the cache as a miss and performed a blocking refresh. When that refresh hit a transient error from the underlying provider — e.g. an IMDS/STS HTTP connect timeout —get_credentialdiscarded the still-valid cached credentials and returned a hard error.The same adapter backs both the S3 object store and the DynamoDB external manifest store (via
OSObjectStoreToAwsCredAdaptor), so a single transient credential-provider blip during the refresh window turns into a failed request.This is the gap left by the earlier credential-caching/refresh-offset work: hard expiry is handled, but a failed proactive refresh was not.
Fix
When a refresh fails but the cached credentials have not actually expired yet, fall back to the cached credentials and log a warning; the next call retries the refresh. Truly-expired credentials still surface the error rather than being served.
Added a unit test (
test_aws_credential_adapter_falls_back_to_cached_on_refresh_failure) using a provider that succeeds once and then fails, asserting that the still-valid cached credentials are served while valid, and that an error is returned once they expire.Notes
cargo fmt/cargo clippycould not be run in this environment (the pinned toolchain's rustfmt/clippy components are unavailable offline). The change was kept formatting-clean by hand; please confirm in CI.