Spark [3.2] : Improve stats estimation for spark scan#4446

Merged
rdblue merged 3 commits into apache:master from singhpk234:enhancement/improve-stats-estimation-spark-scan
Apr 3, 2022

Conversation

singhpk234 commented Mar 31, 2022

Contributor

As I understand it, we currently don't account for SplitScanTask (which also implements FileScanTask), where the task reads only part of a file rather than the complete file.

In that case it is better to estimate the number of records actually scanned by the task, rather than going to the data file the SplitScanTask is based on and using its complete record count.

At present we appear to over-estimate the size by taking the complete record count of the file associated with the SplitScanTask, even though we scan only a subset of that file's data.
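To illustrate the over-estimation, here is a minimal, self-contained sketch. It does not use the Iceberg API: `SplitTask` and both methods are hypothetical stand-ins, and scaling the file's record count by the fraction of bytes the split covers is only one plausible way to approximate the rows scanned. That approach assumes rows are spread roughly evenly across the file's bytes.

```java
import java.util.List;

public class SplitStatsSketch {

    /** Hypothetical stand-in for a split task: a byte range taken from one data file. */
    public static final class SplitTask {
        final long splitLengthBytes; // bytes this task actually scans
        final long fileSizeBytes;    // total size of the underlying data file
        final long fileRecordCount;  // record count of the whole data file

        public SplitTask(long splitLengthBytes, long fileSizeBytes, long fileRecordCount) {
            this.splitLengthBytes = splitLengthBytes;
            this.fileSizeBytes = fileSizeBytes;
            this.fileRecordCount = fileRecordCount;
        }
    }

    /** Over-estimate: counts the full file's records once per split task. */
    public static long naiveRows(List<SplitTask> tasks) {
        long rows = 0L;
        for (SplitTask task : tasks) {
            rows += task.fileRecordCount;
        }
        return rows;
    }

    /** Scales each file's record count by the fraction of its bytes the split covers. */
    public static long estimateRows(List<SplitTask> tasks) {
        double rows = 0.0;
        for (SplitTask task : tasks) {
            double fractionScanned = (double) task.splitLengthBytes / task.fileSizeBytes;
            rows += fractionScanned * task.fileRecordCount;
        }
        return Math.round(rows);
    }

    public static void main(String[] args) {
        // Two splits of 256-byte files holding 1000 records each.
        List<SplitTask> tasks = List.of(
            new SplitTask(64, 256, 1000),   // a quarter of the file
            new SplitTask(128, 256, 1000)); // half of the file

        System.out.println("naive estimate:  " + naiveRows(tasks));    // 2000
        System.out.println("scaled estimate: " + estimateRows(tasks)); // 750
    }
}
```

In this toy input the naive sum reports 2000 rows while the scaled estimate reports 750, which is the kind of gap described above. Delete files and skewed row sizes are ignored here.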


cc @rdblue @aokolnychyi @RussellSpitzer @jackye1995 @wypoon

@github-actions github-actions Bot added the spark label Mar 31, 2022
@singhpk234 singhpk234 force-pushed the enhancement/improve-stats-estimation-spark-scan branch from 2e0a149 to c501dc7 Compare March 31, 2022 06:13
Comment thread spark/v3.2/spark/src/main/java/org/apache/iceberg/spark/source/SparkScan.java Outdated
Comment thread spark/v3.2/spark/src/main/java/org/apache/iceberg/spark/source/SparkScan.java Outdated
@rdblue rdblue merged commit 4ab4b91 into apache:master Apr 3, 2022
rdblue commented Apr 3, 2022

Contributor

Thanks, @singhpk234!

@singhpk234 singhpk234 deleted the enhancement/improve-stats-estimation-spark-scan branch April 4, 2022 03:08
singhpk234 pushed a commit to singhpk234/iceberg that referenced this pull request Apr 4, 2022
singhpk234 pushed a commit to singhpk234/iceberg that referenced this pull request Apr 4, 2022
RussellSpitzer pushed a commit that referenced this pull request Apr 4, 2022
Co-authored-by: Prashant Singh <psinghvk@amazon.com>
RussellSpitzer pushed a commit that referenced this pull request Apr 4, 2022
…4488)

Co-authored-by: Prashant Singh <psinghvk@amazon.com>
felixYyu added a commit to felixYyu/iceberg that referenced this pull request Apr 10, 2022
felixYyu added a commit to felixYyu/iceberg that referenced this pull request Apr 10, 2022
sunchao pushed a commit to sunchao/iceberg that referenced this pull request May 9, 2023
Co-authored-by: Prashant Singh <psinghvk@amazon.com>
(cherry picked from commit 4ab4b91)