Azure: Add FileIO that supports ADLSv2 storage#8303
Conversation
|
I am wondering how this differs from #4465, which has been lying dormant (and as far as I can see is just waiting for someone to hit the "merge" button) for over a year now? |
The main difference is this PR uses the data lake client API instead of the blob client API. File operations work the same way but bulk and prefix operations will differ. |
| dependencies { | ||
| implementation platform(libs.azuresdk.bom) | ||
| implementation "com.azure:azure-storage-file-datalake" | ||
| implementation "com.azure:azure-identity" |
There was a problem hiding this comment.
For other integrations, we don't bundle the dependencies and only ship the Iceberg side. That keeps our bundle small and doesn't force any particular version on downstream consumers. It also avoids needing to do a lot of license and notice documentation work. Is that possible here? Is there a dependency bundle that we can use at runtime?
There was a problem hiding this comment.
The azure-bundle project build was set up for bundling all of the necessary Azure dependencies in one shadow jar as an (optional) convenience to users, similar to the aws-bundle and gcp-bundle projects, which include only the necessary runtime libraries at the same version used for the Iceberg build and shades conflicting libraries. A user can opt to include their own Azure dependencies if desired and not use this at all. For example, all you would need to run with Spark is the Spark runtime and Azure/AWS/GCP bundle. Neither Microsoft nor Google provide such a bundle. Amazon has one for AWS but it is very large, which causes issues with some systems.
This is separate from the azure project build which declares the Azure dependencies as compileOnly so they are not included with any runtime.
There was a problem hiding this comment.
Okay, I see. I didn't know about the other bundle projects. Looks like the LICENSE file is updated for those, but not the NOTICE. Did you check whether each bundled project has a NOTICE that we need to include?
There was a problem hiding this comment.
I did not, I'll do that now for all three.
There was a problem hiding this comment.
I added this. I'll open a separate PR for the AWS and GCP bundles.
There was a problem hiding this comment.
The PR for the AWS and GCP bundles is here: #8323
There was a problem hiding this comment.
Thanks! It's amazing that we can automate this now. It was such a giant pain to do this in the past!
| Preconditions.checkState(!closed, "Cannot seek: already closed"); | ||
| Preconditions.checkArgument(newPos >= 0, "Cannot seek: position %s is negative", newPos); | ||
|
|
||
| // this allows a seek beyond the end of the stream but the next read will fail |
There was a problem hiding this comment.
Why allow seek beyond the end of the stream?
There was a problem hiding this comment.
This was done to keep the behavior consistent with S3InputStream
| * Support</a> | ||
| */ | ||
| class ADLSLocation { | ||
| private static final Pattern URI_PATTERN = Pattern.compile("^abfss?://(.+?)([/?#].*)?$"); |
There was a problem hiding this comment.
Wouldn't it be safer to use [^/?#]+ for the first group instead of using non-greedy matching?
There was a problem hiding this comment.
Yes, thanks, I made this change
|
|
||
| String uriPath = matcher.group(2); | ||
| uriPath = uriPath == null ? "" : uriPath.startsWith("/") ? uriPath.substring(1) : uriPath; | ||
| this.path = uriPath.split("\\?", -1)[0].split("#", -1)[0]; |
There was a problem hiding this comment.
If uriPath is null, then path is going to be an empty string? I generally try to avoid empty string as a default.
There was a problem hiding this comment.
This was done primarily for the Azure API, which expects an empty string instead of null for the root.
|
Thanks @bryanck for working on this, and @danielcweeks, @rdblue & @nastra for the review 🙌🏻 |
This PR adds a FileIO implementation for Azure Data Lake Storage Gen2. The URI format was kept consistent with Hadoop's Azure URI format to make any transition easier, however TLS is always used with either the
abfsorabfssscheme. Range reads were also implemented. The new FileIO was added as a delegate type in ResolvingFileIO. Both the prefix and bulk operation mixin interfaces were implemented as well.To limit the scope of this PR, authorization was limited to using the default Azure credential chain, SAS token, or connection string. Enhancements can be addressed in a follow-up PR.
The project was added as a dependency to the Spark and Flink runtimes, similar to AWS and GCP. Also added was an Azure bundle project for building a jar with the necessary Azure dependencies when running with Spark or Flink. This is similar to the bundles for AWS and GCP.
Currently Azurite doesn't yet support ADLSv2 directory operations, so mocks were used for prefix-related tests. Manual testing was performed against a real Azure account.