
Spec: Clarify behavior of special geo objects for lower/upper bounds #12956

Merged: szehon-ho merged 4 commits into apache:main from szehon-ho:geo_spec_clarify_special on May 9, 2025

Conversation

@szehon-ho

Member

This is to match the clarification in: https://github.com/apache/parquet-format/pull/494/files

@github-actions (bot) added the "Specification" label ("Issues that may introduce spec changes.") on May 3, 2025
@szehon-ho

Member Author

@jiayuasu @paleolimbot FYI, let me know if this captures the meaning.

@szehon-ho szehon-ho force-pushed the geo_spec_clarify_special branch from dc2b842 to 1cc8d28 Compare May 3, 2025 05:51
Comment thread format/spec.md Outdated

@paleolimbot paleolimbot left a comment

Member

With Jia's edit this makes sense to me...thanks!

Co-authored-by: Jia Yu <jiayu@wherobots.com>
@szehon-ho

Member Author

Had an offline sync with @jiayuasu and @rdblue; we simplified it:

  • Empty geometries do not need to be mentioned explicitly, as it's obvious they won't have any coordinates and can't contribute to the bbox.
  • Coordinate values outside the range should be an error or otherwise handled, as they may indicate something even worse. So this context omits mentioning them as well.

@szehon-ho szehon-ho force-pushed the geo_spec_clarify_special branch from a3d1d21 to b4ef6a4 Compare May 6, 2025 05:21
Comment thread format/spec.md Outdated

For `geometry` and `geography` types, `lower_bounds` and `upper_bounds` are both points of the following coordinates X, Y, Z, and M (see [Appendix G](#appendix-g-geospatial-notes)) which are the lower / upper bound of all objects in the file. For the X values only, xmin may be greater than xmax, in which case an object in this bounding box may match if it contains an X such that `x >= xmin` OR `x <= xmax`. In geographic terminology, the concepts of `xmin`, `xmax`, `ymin`, and `ymax` are also known as `westernmost`, `easternmost`, `southernmost` and `northernmost`, respectively. For `geography` types, these points are further restricted to the canonical ranges of [-180 180] for X and [-90 90] for Y.
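As a rough illustration of the X wraparound rule quoted above, here is a minimal Python sketch. The function name `x_in_bounds` is hypothetical and is not part of Iceberg or Parquet; it only encodes the `x >= xmin` OR `x <= xmax` condition for the case where xmin is greater than xmax.

```python
def x_in_bounds(x: float, xmin: float, xmax: float) -> bool:
    """Check whether x lies inside an X range that may wrap around.

    Hypothetical helper for illustration only; not an Iceberg API.
    """
    if xmin <= xmax:
        # Ordinary case: a plain closed interval.
        return xmin <= x <= xmax
    # Wraparound case (xmin > xmax), e.g. a box crossing the antimeridian:
    # x matches if it is east of xmin OR west of xmax.
    return x >= xmin or x <= xmax
```

For example, with `xmin = 170` and `xmax = -170`, both `175` and `-175` match, while `0` does not.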

Like for other types, null or invalid `geometry` and `geography` objects are skipped when calculating the upper and lower bounds. In contrast, null or invalid (NaN) coordinate values within a `geometry` or `geography` do not lead to the entire object being skipped, instead only that coordinate value itself is omitted for calculation. Note, no bounding box is produced if all x values or all y values in the file are invalid.

Contributor

For other types, only null and NaN values are omitted from the calculation, so I would rephrase this. It doesn't quite work to replace "invalid" with "NaN" though since I think you're talking about objects without coordinates. I think I'd just call out the two cases directly:

When calculating upper and lower bounds for geometry and geography, null and NaN values in a coordinate dimension are skipped; for example, POINT (1 NaN) contributes no value to the Y, Z, or M dimension bounds. If a dimension has no non-null or non-NaN values, that dimension is omitted from the bounding box. If either the X or Y dimension is missing then the bounding box itself is not produced.

Member Author

Changed:

When calculating upper and lower bounds for `geometry` and `geography`, null or NaN values in a coordinate dimension are skipped; for example, POINT (1 NaN) contributes a value to X but no values to Y, Z, or M dimension bounds. If a dimension has only null or NaN values, that dimension is omitted from the bounding box. If either the X or Y dimension is missing then the bounding box itself is not produced.
  1. Clarified your example a little more (maybe redundant, but I thought an example should be explicit)
  2. Removed the double negatives
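
The revised wording can be sketched in Python as follows. This is only an illustration under stated assumptions: `coordinate_bounds` and `DIMS` are hypothetical names, coordinates are assumed to arrive as `(x, y, z, m)` tuples whose entries may be `None` or NaN, and the plain min/max here does not model the X wraparound allowed for bounds.

```python
import math

# Assumed dimension order for each coordinate tuple (illustrative only).
DIMS = ("x", "y", "z", "m")


def coordinate_bounds(coords):
    """Compute per-dimension lower/upper bounds, skipping null and NaN values.

    Returns (lower, upper) dicts keyed by dimension name, or None when
    either X or Y has no usable value (no bounding box is produced).
    """
    lower, upper = {}, {}
    for coord in coords:
        for dim, value in zip(DIMS, coord):
            if value is None or math.isnan(value):
                continue  # null or NaN contributes nothing to this dimension
            lower[dim] = min(value, lower.get(dim, value))
            upper[dim] = max(value, upper.get(dim, value))
    # A dimension with only null/NaN values never enters the dicts,
    # so it is omitted; without both X and Y there is no box at all.
    if "x" not in lower or "y" not in lower:
        return None
    return lower, upper
```

With this sketch, a file containing only `POINT (1 NaN)` yields no bounding box, because X has a value but Y does not.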

@szehon-ho szehon-ho force-pushed the geo_spec_clarify_special branch from 25a5859 to 1fc6ba6 Compare May 6, 2025 17:37

@jiayuasu jiayuasu left a comment

Member

LGTM!

@hsiang-c

hsiang-c commented May 6, 2025

Contributor

LGTM

@paleolimbot paleolimbot left a comment

Member

Thank you for iterating on this!

@szehon-ho szehon-ho merged commit 97c0e13 into apache:main May 9, 2025
@szehon-ho

Member Author

Merged to master, thanks for all the reviews! Reference vote thread: https://lists.apache.org/thread/g7rz2kt12ytd5j2xnbdlk696cxm0d3s2

6 participants