
Add service account impersonation support for BigQueryMetastoreCatalog #14447

Merged
danielcweeks merged 12 commits into apache:main from joyhaldar:gcp-bigquery-impersonation
Dec 19, 2025

Conversation

@joyhaldar

joyhaldar commented Oct 30, 2025

Contributor

Description:
This PR adds service account impersonation support to BigQueryMetastoreCatalog, enabling identity separation between cluster operations and data access.

Problem
BigQueryMetastoreCatalog only supports Application Default Credentials with no mechanism for service account impersonation. This prevents:

  • Implementing least-privilege security (cluster operations vs data access)
  • Running multi-tenant workloads on shared clusters
  • Creating proper audit trails per service account

Solution
Introduces a properties-based approach for BigQuery client configuration, with impersonation support built on Google's ImpersonatedCredentials API.

Key changes:

  • Created BigQueryProperties class with metastoreOptions() for BigQuery client configuration
  • Added impersonation properties to GCPProperties: service account, delegates, lifetime, scopes
  • Updated PrefixedStorage to support impersonated credentials for GCS operations
  • Impersonation properties are scoped per service (gcp.bigquery.impersonate.* for BigQuery, gcs.impersonate.* for GCS)
  • Added deprecation annotations for constants moved to BigQueryProperties

Configuration

Minimal:

gcp.bigquery.impersonate.service-account=data-sa@project.iam.gserviceaccount.com
gcs.impersonate.service-account=data-sa@project.iam.gserviceaccount.com

Full:

# BigQuery
gcp.bigquery.impersonate.service-account=data-sa@project.iam.gserviceaccount.com
gcp.bigquery.impersonate.delegates=admin-sa@project.iam.gserviceaccount.com
gcp.bigquery.impersonate.lifetime-seconds=3600
gcp.bigquery.impersonate.scopes=bigquery

# GCS
gcs.impersonate.service-account=data-sa@project.iam.gserviceaccount.com
gcs.impersonate.delegates=admin-sa@project.iam.gserviceaccount.com
gcs.impersonate.lifetime-seconds=3600
gcs.impersonate.scopes=devstorage.read_write
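
As a rough illustration of how properties like the ones above might be read, here is a minimal sketch using only the JDK. It is not the code in this PR: the class name `ImpersonationConfig`, the comma-separated format for delegates and scopes, and both default values (3600 seconds, `cloud-platform`) are assumptions made for the example.

```java
import java.time.Duration;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.stream.Collectors;

/** Hypothetical holder for impersonation settings; not the PR's actual class. */
final class ImpersonationConfig {
  // Assumed defaults, chosen for illustration only.
  static final int DEFAULT_LIFETIME_SECONDS = 3600;
  static final String DEFAULT_SCOPE = "cloud-platform";

  final Optional<String> serviceAccount;
  final List<String> delegates;
  final Duration lifetime;
  final List<String> scopes;

  /** Reads the four keys under a prefix such as "gcp.bigquery.impersonate." */
  ImpersonationConfig(Map<String, String> props, String prefix) {
    this.serviceAccount = Optional.ofNullable(props.get(prefix + "service-account"));
    this.delegates = splitCsv(props.get(prefix + "delegates"));
    String seconds =
        props.getOrDefault(prefix + "lifetime-seconds", String.valueOf(DEFAULT_LIFETIME_SECONDS));
    this.lifetime = Duration.ofSeconds(Integer.parseInt(seconds.trim()));
    this.scopes = splitCsv(props.getOrDefault(prefix + "scopes", DEFAULT_SCOPE));
  }

  /** Impersonation is considered enabled only when a target service account is set. */
  boolean enabled() {
    return serviceAccount.isPresent();
  }

  /** Splits a comma-separated value, dropping blanks (assumed list format). */
  static List<String> splitCsv(String value) {
    if (value == null || value.trim().isEmpty()) {
      return List.of();
    }
    return Arrays.stream(value.split(","))
        .map(String::trim)
        .filter(s -> !s.isEmpty())
        .collect(Collectors.toList());
  }
}
```

With only the minimal property set, `enabled()` returns true, the delegate list is empty, and the assumed defaults fill in lifetime and scopes. Using the same class with the `gcs.impersonate.` prefix would cover the GCS side.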

Testing

Unit tests:

  • TestBigQueryProperties
  • TestPrefixedStorage

Backward Compatibility
Fully backward compatible. Catalogs without impersonation properties continue using ADC exactly as before.
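
The fallback behaviour can be pictured as a simple precedence check. The sketch below is illustrative only and uses plain JDK types rather than the GCP client libraries. The class and enum names are hypothetical, and the assumption that an explicitly supplied OAuth2 token (here keyed as `gcs.oauth2.token`) is checked before impersonation is a guess at the ordering, not a statement of what the PR does.

```java
import java.util.Map;

/** Hypothetical selector showing the assumed order in which credential sources are tried. */
final class CredentialModeSelector {
  enum Mode { OAUTH2_TOKEN, IMPERSONATED, APPLICATION_DEFAULT }

  // Assumed key for an explicitly supplied token.
  static final String OAUTH2_TOKEN = "gcs.oauth2.token";
  // Key taken from the configuration section of this PR.
  static final String IMPERSONATE_SERVICE_ACCOUNT = "gcs.impersonate.service-account";

  static Mode select(Map<String, String> props) {
    if (props.containsKey(OAUTH2_TOKEN)) {
      // An explicit token is assumed to take precedence over everything else.
      return Mode.OAUTH2_TOKEN;
    } else if (props.containsKey(IMPERSONATE_SERVICE_ACCOUNT)) {
      // Impersonation applies only when a target service account is configured.
      return Mode.IMPERSONATED;
    }
    // No impersonation properties: fall back to Application Default Credentials.
    return Mode.APPLICATION_DEFAULT;
  }
}
```

An empty property map selects `APPLICATION_DEFAULT`, which corresponds to the unchanged behaviour for existing catalogs.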

Closes #14446

@github-actions github-actions Bot added the GCP label Oct 30, 2025
@nastra

nastra commented Oct 30, 2025

Contributor

@talatuyarer can you review this one please?

@joyhaldar

Contributor Author

Hello @talatuyarer, @ebyhr, @nastra, I would really appreciate it if you could take a look when you have some time.

@yogevyuval

Contributor

@joyhaldar Thanks for working on this! I think service account impersonation is also relevant outside of the BigQueryMetastoreCatalog context, basically for every workload running on GCP, so I would just make sure it's generic enough to be used regardless of the catalog type.

@joyhaldar

joyhaldar commented Nov 3, 2025

Contributor Author

@joyhaldar Thanks for working at this! I think service account impersonation is relevant also outside of the BigQueryMetastoreCatalog context - basically for every workload running on GCP, so I would just make sure it's generic enough to be used regardless of the catalog type

Hi @yogevyuval,

Thanks for the feedback! I really appreciate it.

I wrote this to follow current patterns, for example AssumeRoleAwsClientFactory also only works with AWS catalogs if I am not wrong (please correct me if I am). I also think that users can handle impersonation at the application level for other catalog types if needed.

I personally think this would be best addressed in a follow-up PR to keep the scope focused, but I'm happy to try and expand this PR to support any catalog now if you and the other reviewers think that's a good idea.

Please let me know what you think.

Thanks,
Joy

this.closeableGroup = new CloseableGroup();
builder.setCredentials(
GCPAuthUtils.oauth2CredentialsFromGcpProperties(gcpProperties, closeableGroup));
} else if (gcpProperties.impersonateServiceAccount().isPresent()) {

Member

Is this change tested?

Contributor Author

Thanks for pointing this out! Added tests for the impersonation path:

  • impersonationPropertiesAreRead() - Verifies all impersonation properties
  • impersonationPropertiesWithDefaults() - Verifies defaults work

Also tested end to end with actual GCP projects to confirm credential creation works.

@yogevyuval

Contributor

@joyhaldar Thanks for working at this! I think service account impersonation is relevant also outside of the BigQueryMetastoreCatalog context - basically for every workload running on GCP, so I would just make sure it's generic enough to be used regardless of the catalog type

Hi @yogevyuval,

Thanks for the feedback! I really appreciate it.

I wrote this to follow current patterns, for example AssumeRoleAwsClientFactory also only works with AWS catalogs if I am not wrong (please correct me if I am). I also think that users can handle impersonation at the application level for other catalog types if needed.

I personally think this would be best addressed in a follow-up PR to keep the scope focused, but I'm happy to try and expand this PR to support any catalog now if you and the other reviewers think that's a good idea.

Please let me know what you think.

Thanks, Joy

What I meant is a situation where a lakehouse is hosted on GCP with a self-managed catalog, such as Polaris or Hive Metastore, while the files are still stored in GCS. That's where impersonation can be really useful even when not using BigQuery.

@kevinjqliu

Contributor

Is the service account impersonation support for the catalog, fileio, or both?

I see there's already a GoogleAuthManager class for handling auth and google credential. It uses GoogleCredentials.fromStream which already supports ImpersonatedCredentials

Could we reuse the GoogleAuthManager to abstract away the auth details?

@joyhaldar

joyhaldar commented Nov 8, 2025

Contributor Author

Is the service account impersonation support for the catalog, fileio, or both?

I see there's already a GoogleAuthManager class for handling auth and google credential. It uses GoogleCredentials.fromStream which already supports ImpersonatedCredentials

Could we reuse the GoogleAuthManager to abstract away the auth details?

Thank you for the comment, Kevin.

The impersonation supports both BigQuery and GCS FileIO.

Regarding GoogleAuthManager, I was under the impression that it's designed for
REST Catalog authentication, while BigQueryMetastoreCatalog uses GoogleCredentials
directly with GCP client libraries.

Please let me know if I'm misunderstanding your suggestion.


public static final String CLIENT_FACTORY = "gcp.bigquery.client.factory";
private static final String GCS_IMPERSONATE_SERVICE_ACCOUNT = "gcs.impersonate.service-account";
private static final String GCS_PROJECT_ID = "gcs.project-id";

Contributor

These 4 should have the gcp prefix instead of gcs. That matches your PR example + keeps everything under the same namespace.

@joyhaldar Nov 13, 2025

Contributor Author

Thank you so much for reviewing @rambleraptor.

I wanted to explain the flow a bit:

Do you think this should be changed?

I can try removing the private constants from BigQueryMetastoreCatalog and use GCPProperties.GCS_*
directly.

Please let me know what you think would be more appropriate.

Contributor

Alright, this makes sense. It's fine by me.

GoogleCredentials scopedCredentials =
(credentials instanceof ImpersonatedCredentials)
? credentials
: credentials.createScoped(BigqueryScopes.all());

Contributor

@talatuyarer Would love your opinion on this:

I'm a little worried about defaulting this to use scopes.all() (even though that's the current functionality). Scoping is a great way to force read-only behavior at a lower-level.

@joyhaldar Nov 13, 2025

Contributor Author

I would also love Talat's opinion on this.

private String projectId;
private String location;

private static final String DEFAULT_LOCATION = "us";

Contributor

I'm personally not a huge fan of having a default location, but I'm happy to be overridden.

Contributor Author

Thank you for your review comment, @rambleraptor.

I preserved the default "us" from BigQueryMetastoreCatalog (DEFAULT_GCP_LOCATION) to
avoid breaking existing users. Do you think it should be removed?

Contributor

Nope, that sounds great

@nastra nastra requested a review from danielcweeks November 19, 2025 13:29
Comment thread bigquery/src/main/java/org/apache/iceberg/gcp/bigquery/BigQueryClientFactory.java Outdated
@danielcweeks danielcweeks self-requested a review December 9, 2025 22:58
joyhaldar and others added 2 commits December 11, 2025 18:55
…ttern

- Removed BigQueryClientFactory, DefaultBigQueryClientFactory, ImpersonatedBigQueryClientFactory
- Created BigQueryProperties with metastoreOptions() for client configuration
- Added impersonation support to GCPProperties and PrefixedStorage
- Single property (gcp.impersonate.service-account) enables impersonation for both BigQuery and GCS

@danielcweeks left a comment

Contributor

A few more comments, but this is looking pretty close.

Comment thread bigquery/src/main/java/org/apache/iceberg/gcp/bigquery/BigQueryProperties.java Outdated
Comment thread gcp/src/main/java/org/apache/iceberg/gcp/GCPProperties.java
Comment thread bigquery/src/main/java/org/apache/iceberg/gcp/bigquery/BigQueryProperties.java Outdated
- Add deprecation annotations for moved constants
- Move credential scoping to BigQueryProperties.metastoreOptions()
- Make BigQueryProperties package-private
- Add configurable GCS impersonation scopes
Comment thread bigquery/src/main/java/org/apache/iceberg/gcp/bigquery/BigQueryProperties.java Outdated

@danielcweeks left a comment

Contributor

Thanks @joyhaldar!

@danielcweeks danielcweeks merged commit 554a3c1 into apache:main Dec 19, 2025
44 checks passed
talatuyarer pushed a commit to talatuyarer/iceberg that referenced this pull request Apr 1, 2026
…atalog (apache#14447)

* Add supporting impersonation in BigQueryMetastoreCatalog


Co-authored-by: Joy Haldar <Joy.Haldar@target.com>