Posted to reviews@spark.apache.org by "ryan-johnson-databricks (via GitHub)" <gi...@apache.org> on 2023/03/09 23:52:10 UTC

[GitHub] [spark] ryan-johnson-databricks commented on pull request #40300: [SPARK-42683] Automatically rename conflicting metadata columns

ryan-johnson-databricks commented on PR #40300:
URL: https://github.com/apache/spark/pull/40300#issuecomment-1463005688

> It's a good idea to provide an API that allows people to unambiguously reference metadata columns, and I like the new `Dataset.metadataColumn` function. However, I think the prepending underscore approach is a bit hacky. It's too implicit and I'd prefer a more explicit syntax like `SELECT metadata(_metadata) FROM t`. We can discuss this more and invite more SQL experts. Shall we exclude it from this PR for now?

@cloud-fan The prepended underscore is _NOT_ primarily intended as a user surface. Rather, it's a reliable way to get a unique column name that's still at least somewhat readable if you look at the query plan (unlike e.g. a UUID). The new `Dataset.metadataColumn` method does not even _look_ at a renamed attribute's name, for example.
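
To make the idea concrete, here is a minimal sketch of the renaming scheme in plain Python. It is an illustration only, not the code in this PR: the helper `unique_metadata_name` and the sample column set are hypothetical, and it assumes exact, case-sensitive name matching, which may differ from how the analyzer actually resolves names.

```python
from typing import Set


def unique_metadata_name(base: str, existing: Set[str]) -> str:
    """Prepend underscores to `base` until it no longer collides with `existing`.

    Hypothetical helper for illustration; assumes exact string comparison.
    """
    name = base
    while name in existing:
        # Each extra underscore keeps the name readable in a query plan
        # while moving it away from the conflicting data column.
        name = "_" + name
    return name


# Hypothetical schema whose user data already contains a `_metadata` column.
data_columns = {"id", "_metadata"}
print(unique_metadata_name("_metadata", data_columns))  # prints "__metadata"
```

The result stays recognizable as "the metadata column", which is the point of preferring this over an opaque generated identifier.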

At this point, the only references in the code to prepended underscores are the two unit tests ("metadata name conflict resolved with leading underscores") that validate that the renaming works as intended. If you don't think that test coverage is important, we could remove even those?

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

For additional commands, e-mail: reviews-help@spark.apache.org