Skip to content

GCP: Add scheduled refresh for storage credentials held by GCSFileIO#15696

Merged
danielcweeks merged 2 commits into
apache:mainfrom
nastra:gcsfile-refresh-credentials
Mar 23, 2026
Merged

GCP: Add scheduled refresh for storage credentials held by GCSFileIO#15696
danielcweeks merged 2 commits into
apache:mainfrom
nastra:gcsfile-refresh-credentials

Conversation

@nastra

@nastra nastra commented Mar 20, 2026

Copy link
Copy Markdown
Contributor

The GCSFileIO implementation never refreshes the credentials that are held directly from the table load. The GCS storage clients use a OAuth2RefreshCredentialsHandler that internally refreshes the storage client, but those updates are not reflected back to the FileIO.

If an GCSFileIO instance is serialized to remote workers, the credentials may be expired triggering a thundering herd of requests to refresh immediately.

This change addresses this problem by proactively updating the credentials in the FileIO so that only valid credentials are propagated to remote clients.

This applies the same changes as was already done for S3FileIO in #15678

@nastra nastra force-pushed the gcsfile-refresh-credentials branch from 9574bba to 09f1ad0 Compare March 20, 2026 08:59
@nastra nastra requested a review from danielcweeks March 20, 2026 09:01

@singhpk234 singhpk234 left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The changes LGTM !

Have some optional suggestions, when seeing things from fresh set of eyes !

}

try (OAuth2RefreshCredentialsHandler handler =
OAuth2RefreshCredentialsHandler.create(properties)) {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

creating a OAuth2RefreshCredentialsHandler per refresh call would be expensive, it creates a http client and an AuthManager additionally, i wonder if we can just have this done one (invalidated / created again if props change) and just reuse that across refreshes ?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't feel like this is too expensive. Typically, we're talking about possibly one invocation per hour (most queries will likely have none because the run within the timeout), so holding all of the http client thread pools and other resources is the more wasteful approach.

@singhpk234 singhpk234 Mar 20, 2026

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Typically, we're talking about possibly one invocation per hour (most queries will likely have none because the run within the timeout)

If we create the handler lazily, we can achieve the same.

holding all of the http client thread pools and other resources is the more wasteful approach

should we trim down the http client pool to be way less than ~100 in the first place ? we technically don't need the whole HTTP Client pool (~100) for OAuth2RefreshCredsHandler (though i think pools connections might be lazily made) ? its gonna be mostly call to rest server by picking a connection from the pool .
My main suggestion was we are creating burst of connections / threads which we are discarding anyways after the credentials call completes, hence i though it was an expensive

Comment thread gcp/src/main/java/org/apache/iceberg/gcp/gcs/GCSFileIO.java
Comment thread gcp/src/main/java/org/apache/iceberg/gcp/gcs/GCSFileIO.java
@nastra nastra force-pushed the gcsfile-refresh-credentials branch from 09f1ad0 to 5b5ae17 Compare March 23, 2026 07:41
@nastra nastra requested a review from danielcweeks March 23, 2026 07:49
@danielcweeks danielcweeks merged commit 19b4189 into apache:main Mar 23, 2026
35 checks passed
@nastra nastra deleted the gcsfile-refresh-credentials branch March 23, 2026 18:43
manuzhang pushed a commit to manuzhang/iceberg that referenced this pull request Mar 30, 2026
…pache#15696)

GCP: Add scheduled refresh for storage credentials held by GCSFileIO
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants