fix: serve still-valid cached AWS credentials when refresh fails #7506

Open
lancedb-robot wants to merge 1 commit into lance-format:main from lancedb-robot:backburner/aws-cred-refresh-fallback


Conversation

@lancedb-robot

Collaborator

Problem

Production query nodes intermittently fail S3/DynamoDB operations with:

Failed to get AWS credentials: ... DispatchFailure { source: ConnectorError { kind: Timeout, source: HttpTimeoutError { kind: "HTTP connect", duration: 3.1s } } }

which surfaces to callers as a 500.

AwsCredentialAdapter (lance-io/src/object_store/providers/aws.rs) proactively refreshes credentials before they expire, starting credentials_refresh_offset (default 60s) ahead of the expiry time. During that window the cached credentials are still valid, but the adapter treats the cache as a miss and performs a blocking refresh. When that refresh hits a transient error from the underlying provider (e.g. an IMDS/STS HTTP connect timeout), get_credential discards the still-valid cached credentials and returns a hard error.
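To make the refresh window concrete, here is a minimal sketch of the three states a cached credential can be in. It is not the code in aws.rs: `CacheState` and `classify` are hypothetical names, and the exact boundary comparisons are an assumption.

```rust
use std::time::{Duration, SystemTime};

/// Where a cached credential sits relative to its expiry (hypothetical type).
#[derive(Debug, PartialEq)]
enum CacheState {
    /// More than `refresh_offset` away from expiry: serve from cache.
    Fresh,
    /// Inside the proactive refresh window: still valid, but due for refresh.
    RefreshWindow,
    /// At or past the expiry time: must not be served.
    Expired,
}

/// Classify a cached credential. Assumption: a credential counts as expired
/// exactly at `expires_at`, and the window opens `refresh_offset` before that.
fn classify(now: SystemTime, expires_at: SystemTime, refresh_offset: Duration) -> CacheState {
    if now >= expires_at {
        CacheState::Expired
    } else if now + refresh_offset >= expires_at {
        CacheState::RefreshWindow
    } else {
        CacheState::Fresh
    }
}
```

The failures described here occur only in the `RefreshWindow` state, where a usable credential exists but a refresh is attempted anyway.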

The same adapter backs both the S3 object store and the DynamoDB external manifest store (via OSObjectStoreToAwsCredAdaptor), so a single transient credential-provider blip during the refresh window turns into a failed request.

This is the gap left by the earlier credential-caching/refresh-offset work: it handles hard expiry, but not a failed proactive refresh.

Fix

When a refresh fails but the cached credentials have not actually expired yet, fall back to the cached credentials and log a warning; the next call retries the refresh. Truly-expired credentials still surface the error rather than being served.

Added a unit test (test_aws_credential_adapter_falls_back_to_cached_on_refresh_failure) using a provider that succeeds once and then fails, asserting that the still-valid cached credentials are served while valid, and that an error is returned once they expire.
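The fallback and the test scenario can be sketched in a simplified, synchronous form. This is an illustration under stated assumptions, not the adapter's actual code: `Credential`, `Provider`, `SucceedsOnce`, and `Adapter` are hypothetical stand-ins, time is passed in as plain seconds, and the real adapter's async and locking details are omitted.

```rust
/// Hypothetical credential with an absolute expiry time in seconds.
#[derive(Clone, Debug, PartialEq)]
struct Credential {
    token: String,
    expires_at: u64,
}

/// Hypothetical stand-in for the underlying AWS credential provider.
trait Provider {
    fn fetch(&mut self, now: u64) -> Result<Credential, String>;
}

/// Test provider: succeeds on the first call, fails on every later call.
struct SucceedsOnce {
    called: bool,
}

impl Provider for SucceedsOnce {
    fn fetch(&mut self, now: u64) -> Result<Credential, String> {
        if self.called {
            return Err("HTTP connect timeout".to_string());
        }
        self.called = true;
        Ok(Credential { token: "first".to_string(), expires_at: now + 300 })
    }
}

struct Adapter<P: Provider> {
    provider: P,
    cache: Option<Credential>,
    refresh_offset: u64,
}

impl<P: Provider> Adapter<P> {
    fn new(provider: P, refresh_offset: u64) -> Self {
        Adapter { provider, cache: None, refresh_offset }
    }

    fn get_credential(&mut self, now: u64) -> Result<Credential, String> {
        // Outside the refresh window: serve the cache without calling the provider.
        if let Some(cached) = &self.cache {
            if now + self.refresh_offset < cached.expires_at {
                return Ok(cached.clone());
            }
        }
        match self.provider.fetch(now) {
            Ok(fresh) => {
                self.cache = Some(fresh.clone());
                Ok(fresh)
            }
            // Refresh failed: fall back only if the cached credential is still valid.
            Err(err) => match &self.cache {
                Some(cached) if now < cached.expires_at => {
                    eprintln!("warning: credential refresh failed ({err}); serving cached credentials");
                    Ok(cached.clone())
                }
                // Empty cache (cold start) or truly expired: surface the error.
                _ => Err(err),
            },
        }
    }
}
```

With a 60-second offset and a credential fetched at time 0 that expires at 300, a call at 250 falls inside the window, the refresh fails, and the cached credential is served; a call at 300 returns the provider error. Because the cache is left untouched on failure, each later call retries the refresh.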

Notes

  • This addresses transient failures during the refresh window. A cold-start credential fetch (empty cache) against an unreachable IMDS/STS endpoint will still fail, as there is nothing valid to fall back to.
  • cargo fmt/cargo clippy could not be run in this environment (the pinned toolchain's rustfmt/clippy components are unavailable offline). The change was kept formatting-clean by hand; please confirm in CI.

AwsCredentialAdapter proactively refreshes credentials
credentials_refresh_offset (default 60s) before they expire. When that
proactive refresh hit a transient error from the underlying provider
(e.g. an IMDS/STS HTTP connect timeout), get_credential discarded the
still-valid cached credentials and returned a hard error, surfacing as a
500 for S3 and DynamoDB operations.

Fall back to the cached credentials when a refresh fails but the cached
credentials have not actually expired yet; the next call retries the
refresh. Truly-expired credentials still surface the error rather than
being used.
@github-actions github-actions Bot added the A-encoding (Encoding, IO, file reader/writer) and bug (Something isn't working) labels on Jun 27, 2026

@Katomoto Katomoto left a comment


Logic looks correct. One suggestion: the early return on line prevents the cleanup function from running — worth adding a finally block.

@codecov

codecov Bot commented Jun 27, 2026


Codecov Report

✅ All modified and coverable lines are covered by tests.



Labels

A-encoding (Encoding, IO, file reader/writer), bug (Something isn't working)


2 participants