Skip to content

Add interface for FileIO prefix operations and implementations #5096

Merged
rdblue merged 8 commits into
apache:masterfrom
danielcweeks:fileio-prefix-ops
Jun 22, 2022
Merged

Add interface for FileIO prefix operations and implementations #5096
rdblue merged 8 commits into
apache:masterfrom
danielcweeks:fileio-prefix-ops

Conversation

@danielcweeks

Copy link
Copy Markdown
Contributor

This adds an interface for FileIO implementations to support prefix based operations for listing and deleting.

The primary motivation is to enable supporting maintenance activities (like cleaning path directories or listing table locations) without the need to fall back to Hadoop FileSystem.

There are some notable behavioral differences between directory based and object based storage systems (e.g. for directory based storage, a the prefix must denote a directory).

Comment thread api/src/main/java/org/apache/iceberg/io/FileInfo.java Outdated
Comment thread api/src/main/java/org/apache/iceberg/io/SupportsPrefixOperations.java Outdated
Comment thread api/src/main/java/org/apache/iceberg/io/SupportsPrefixOperations.java Outdated
Comment thread api/src/main/java/org/apache/iceberg/io/FileInfo.java Outdated
Comment thread aws/src/main/java/org/apache/iceberg/aws/s3/S3FileIO.java Outdated
Comment thread aws/src/main/java/org/apache/iceberg/aws/s3/S3FileIO.java Outdated
Comment thread core/src/main/java/org/apache/iceberg/hadoop/HadoopFileIO.java Outdated
Comment thread core/src/main/java/org/apache/iceberg/hadoop/HadoopFileIO.java Outdated
Comment thread api/src/main/java/org/apache/iceberg/io/FileInfo.java Outdated
Comment thread aws/src/main/java/org/apache/iceberg/aws/s3/S3FileIO.java

@nastra nastra left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

mostly LGTM once we switch to Tasks.foreach, everything else were just a few nits

Comment thread api/src/main/java/org/apache/iceberg/io/SupportsPrefixOperations.java Outdated
Comment thread api/src/main/java/org/apache/iceberg/io/SupportsPrefixOperations.java Outdated
Comment thread api/src/main/java/org/apache/iceberg/io/SupportsPrefixOperations.java Outdated
Comment thread aws/src/main/java/org/apache/iceberg/aws/s3/S3FileIO.java Outdated
/**
* This method provides a "best-effort" to delete all objects under the
* given prefix.
*

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missing <p>?

return () -> client().listObjectsV2Paginator(request).stream()
.flatMap(r -> r.contents().stream())
.map(o -> new FileInfo(
String.format("%s://%s/%s", s3uri.scheme(), s3uri.bucket(), o.key()),

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor: The bucket returned by S3URI may actually be an access point reference, which could break URI parsing if this were to embed it in a location.

I think the solution is to move the S3 access point mapping out of S3URI, since that class should remain simple and report what was parsed and avoid this kind of issue. Let's not fix it here, but we should revisit this.

FYI @jackye1995.

@rdblue rdblue merged commit ac8733d into apache:master Jun 22, 2022
namrathamyske pushed a commit to namrathamyske/iceberg that referenced this pull request Jul 10, 2022
namrathamyske pushed a commit to namrathamyske/iceberg that referenced this pull request Jul 10, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants