Add service account impersonation support for BigQueryMetastoreCatalog#14447
Conversation
adding tests and formatting
Add scope expansion to support short names and http URLs
|
@talatuyarer can you review this one please? |
|
Hello @talatuyarer, @ebyhr, @nastra, i would really appreciate it if you could please take a look if you have some time. |
|
@joyhaldar Thanks for working at this! I think service account impersonation is relevant also outside of the BigQueryMetastoreCatalog context - basically for every workload running on GCP, so I would just make sure it's generic enough to be used regardless of the catalog type |
Hi @yogevyuval, Thanks for the feedback! I really appreciate it. I wrote this to follow current patterns, for example AssumeRoleAwsClientFactory also only works with AWS catalogs if I am not wrong (please correct me if I am). I also think that users can handle impersonation at the application level for other catalog types if needed. I personally think this would be best addressed in a follow-up PR to keep the scope focused, but I'm happy to try and expand this PR to support any catalog now if you and the other reviewers think that's a good idea. Please let me know what you think. Thanks, |
| this.closeableGroup = new CloseableGroup(); | ||
| builder.setCredentials( | ||
| GCPAuthUtils.oauth2CredentialsFromGcpProperties(gcpProperties, closeableGroup)); | ||
| } else if (gcpProperties.impersonateServiceAccount().isPresent()) { |
There was a problem hiding this comment.
Thanks for pointing this out! Added tests for the impersonation path:
- impersonationPropertiesAreRead() - Verifies all impersonation properties
- impersonationPropertiesWithDefaults() - Verifies defaults work
Also tested end to end with actual GCP projects to confirm credential creation works.
So what I meant is a situation where a lakehouse is hosted on GCP but with a self managed catalog, such as polaris/hive metastore, but the files would still be hosted in GCS, that's where the impersonation can really be useful even when not using BigQuery |
|
Is the service account impersonation support for the catalog, fileio, or both? I see there's already a GoogleAuthManager class for handling auth and google credential. It uses GoogleCredentials.fromStream which already supports ImpersonatedCredentials Could we reuse the GoogleAuthManager to abstract away the auth details? |
Thank you for the comment, Kevin. The impersonation supports both BigQuery and GCS FileIO. Regarding GoogleAuthManager, I was under the impression that it's designed for Please let me know if I'm misunderstanding your suggestion. |
|
|
||
| public static final String CLIENT_FACTORY = "gcp.bigquery.client.factory"; | ||
| private static final String GCS_IMPERSONATE_SERVICE_ACCOUNT = "gcs.impersonate.service-account"; | ||
| private static final String GCS_PROJECT_ID = "gcs.project-id"; |
There was a problem hiding this comment.
These 4 should have the gcp prefix instead of gcs. That matches your PR example + keeps everything under the same namespace.
There was a problem hiding this comment.
Thank you so much for reviewing @rambleraptor.
I wanted to explain the flow a bit:
- ImpersonatedBigQueryClientFactory defines:
gcp.bigquery.impersonate.service-account(used for BigQuery client) - prepareFileIOProperties() transforms:
gcp.*→gcs.*(for FileIO, based on constants defined in GCPProperties) - GCPProperties reads:
gcs.impersonate.service-account
Do you think this should be changed?
I can try removing the private constants from BigQueryMetastoreCatalog and use GCPProperties.GCS_*
directly.
Please let me know what you think would be more appropriate.
There was a problem hiding this comment.
Alright, this makes sense. It's fine by me.
| GoogleCredentials scopedCredentials = | ||
| (credentials instanceof ImpersonatedCredentials) | ||
| ? credentials | ||
| : credentials.createScoped(BigqueryScopes.all()); |
There was a problem hiding this comment.
@talatuyarer Love your opinion on this:
I'm a little worried about defaulting this to use scopes.all() (even though that's the current functionality). Scoping is a great way to force read-only behavior at a lower-level.
There was a problem hiding this comment.
I would also love Talat's opinion on this.
| private String projectId; | ||
| private String location; | ||
|
|
||
| private static final String DEFAULT_LOCATION = "us"; |
There was a problem hiding this comment.
I'm personally not a huge fan of having a default location, but I'm happy to be overridden.
There was a problem hiding this comment.
Thank you for your review comment @rambleraptor .
I preserved the default "us" from BigQueryMetastoreCatalog (DEFAULT_GCP_LOCATION) to
avoid breaking existing users. Do you think it should be removed?
There was a problem hiding this comment.
Nope, that sounds great
…ttern - Removed BigQueryClientFactory, DefaultBigQueryClientFactory, ImpersonatedBigQueryClientFactory - Created BigQueryProperties with metastoreOptions() for client configuration - Added impersonation support to GCPProperties and PrefixedStorage - Single property (gcp.impersonate.service-account) enables impersonation for both BigQuery and GCS
danielcweeks
left a comment
There was a problem hiding this comment.
A few more comments, but this is looking pretty close.
- Add deprecation annotations for moved constants - Move credential scoping to BigQueryProperties.metastoreOptions() - Make BigQueryProperties package-private - Add configurable GCS impersonation scopes
…ables parsing to BigQueryProperties.
…atalog (apache#14447) * Add supporting impersonation in BigQueryMetastoreCatalog Co-authored-by: Joy Haldar <Joy.Haldar@target.com>
Description:
This PR adds service account impersonation support to BigQueryMetastoreCatalog, enabling identity separation between cluster operations and data access
Problem
BigQueryMetastoreCatalog only supports Application Default Credentials with no mechanism for service account impersonation. This prevents:
Solution
Introduces a properties based approach for BigQuery client configuration with impersonation support using Google's ImpersonatedCredentials API.
Key changes:
metastoreOptions()for BigQuery client configurationgcp.bigquery.impersonate.*for BigQuery,gcs.impersonate.*for GCS)Configuration
Minimal:
Full:
Testing
Unit tests:
Backward Compatibility
Fully backward compatible. Catalogs without impersonation properties continue using ADC exactly as before.
Closes #14446