You are viewing a plain text version of this content. The canonical link for it is here.

Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2020/10/06 22:23:15 UTC

[GitHub] [spark] rdblue opened a new pull request #28027: [SPARK-31255][SQL] Add SupportsMetadataColumns to DSv2 (WIP)

rdblue opened a new pull request #28027:
URL: https://github.com/apache/spark/pull/28027

### What changes were proposed in this pull request?

This adds support for metadata columns to DataSourceV2. If a source implements `SupportsMetadataColumns` it must also implement `SupportsPushDownRequiredColumns` to support projecting those columns.

The analyzer is updated to resolve metadata columns from `LogicalPlan.metadataOutput`, and this adds a rule that will add metadata columns to the output of `DataSourceV2Relation` if one is used.

### Why are the changes needed?

This is the solution discussed for exposing additional data in the Kafka source. It is also needed for a generic `MERGE INTO` plan.

### Does this PR introduce any user-facing change?

Yes. Users can project additional columns from sources that implement the new API. This also updates `DescribeTableExec` to show metadata columns.

### How was this patch tested?

Will include new unit tests.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] rdblue commented on pull request #28027: [SPARK-31255][SQL] Add SupportsMetadataColumns to DSv2 (WIP)

Posted by GitBox <gi...@apache.org>.

rdblue commented on pull request #28027:
URL: https://github.com/apache/spark/pull/28027#issuecomment-704613777


   Thanks, @brkyvz! It would be great to get your opinion as well.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] SparkQA commented on pull request #28027: [SPARK-31255][SQL] Add SupportsMetadataColumns to DSv2 (WIP)

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #28027:
URL: https://github.com/apache/spark/pull/28027#issuecomment-704606229


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34082/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] rdblue commented on a change in pull request #28027: [SPARK-31255][SQL] Add SupportsMetadataColumns to DSv2

Posted by GitBox <gi...@apache.org>.

rdblue commented on a change in pull request #28027:
URL: https://github.com/apache/spark/pull/28027#discussion_r512967572



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2Relation.scala
##########
@@ -48,6 +48,15 @@ case class DataSourceV2Relation(
 
   import DataSourceV2Implicits._
 
+  override lazy val metadataOutput: Seq[AttributeReference] = table match {
+    case hasMeta: SupportsMetadataColumns =>
+      val attrs = hasMeta.metadataColumns
+      val outputNames = outputSet.map(_.name).toSet
+      attrs.filterNot(col => outputNames.contains(col.name)).toAttributes

Review comment:
       You're right that this should respect the case sensitivity setting. I'll update it.
   
   I'd rather not use a struct for metadata columns. I think that would make them harder to produce and would also introduce an unnecessary row copy in a lot of cases. For example, if I run `select id, metadata._file`, then the source will produce something like `struct(id=1, metadata=struct(_file='path.parquet'))`, which would then be copied into the expected `struct(id=1, _file='path.parquet')`.
   
   I also thought that we would like to get away from using a function that behaves differently based on its context, like `input_file_name`. It is better to project `_file`, right?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] rdblue commented on pull request #28027: [SPARK-31255][SQL] Add SupportsMetadataColumns to DSv2

Posted by GitBox <gi...@apache.org>.

rdblue commented on pull request #28027:
URL: https://github.com/apache/spark/pull/28027#issuecomment-729980801


   @holdenk, @brkyvz, @bart-samwel, is everyone okay with merging this? I think we have agreement, but I just want to make sure. Thanks!


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] SparkQA commented on pull request #28027: [SPARK-31255][SQL] Add SupportsMetadataColumns to DSv2

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #28027:
URL: https://github.com/apache/spark/pull/28027#issuecomment-721859903


   **[Test build #130611 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/130611/testReport)** for PR 28027 at commit [`9a4dfb1`](https://github.com/apache/spark/commit/9a4dfb1d1f62ae6e47af3b01ccf2d54e8f72c6cd).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] brkyvz commented on pull request #28027: [SPARK-31255][SQL] Add SupportsMetadataColumns to DSv2

Posted by GitBox <gi...@apache.org>.

brkyvz commented on pull request #28027:
URL: https://github.com/apache/spark/pull/28027#issuecomment-717431688


   cc @jose-torres 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] rdblue commented on a change in pull request #28027: [SPARK-31255][SQL] Add SupportsMetadataColumns to DSv2

Posted by GitBox <gi...@apache.org>.

rdblue commented on a change in pull request #28027:
URL: https://github.com/apache/spark/pull/28027#discussion_r526549174



##########
File path: sql/catalyst/src/main/java/org/apache/spark/sql/connector/catalog/MetadataColumn.java
##########
@@ -0,0 +1,58 @@
+package org.apache.spark.sql.connector.catalog;

Review comment:
       Fixed in https://github.com/apache/spark/pull/30415




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] SparkQA removed a comment on pull request #28027: [SPARK-31255][SQL] Add SupportsMetadataColumns to DSv2

Posted by GitBox <gi...@apache.org>.

SparkQA removed a comment on pull request #28027:
URL: https://github.com/apache/spark/pull/28027#issuecomment-721859903


   **[Test build #130611 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/130611/testReport)** for PR 28027 at commit [`9a4dfb1`](https://github.com/apache/spark/commit/9a4dfb1d1f62ae6e47af3b01ccf2d54e8f72c6cd).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] rdblue commented on a change in pull request #28027: [SPARK-31255][SQL] Add SupportsMetadataColumns to DSv2

Posted by GitBox <gi...@apache.org>.

rdblue commented on a change in pull request #28027:
URL: https://github.com/apache/spark/pull/28027#discussion_r513812453



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2Relation.scala
##########
@@ -48,6 +48,15 @@ case class DataSourceV2Relation(
 
   import DataSourceV2Implicits._
 
+  override lazy val metadataOutput: Seq[AttributeReference] = table match {
+    case hasMeta: SupportsMetadataColumns =>
+      val attrs = hasMeta.metadataColumns
+      val outputNames = outputSet.map(_.name).toSet
+      attrs.filterNot(col => outputNames.contains(col.name)).toAttributes

Review comment:
       I've been thinking about this suggestion. I think there are two ways we might implement the function-style approach that I'll respond to separately.
   
   First, we could push the function to the data source so that metadata columns are no longer hidden columns. When requesting a projection, Spark would also pass an Expression (the public one) and that would be something like `metadata_column("_file")` or `input_file()`. This would require a different interface, but would also be more flexible (metadata tables could return things like `row_count()`). The trade-off is that the rules are a lot more complicated. The optimizer is needed to push expressions down to the leaves so that these functions can be pushed to the source. But, non-deterministic expressions can prevent that pushdown, which could lead to some cases where the function works and some where it doesn't because there is no way to run the function without pushing it to the data source. And, the function would need to _appear_ like it is resolved when it actually isn't to make it to the optimizer. Also, I think we discussed this option in the v2 syncs and came to the conclu
 sion that it would be simpler to project columns.
   
   Second, we could use a function to signal that a column is a metadata column only, and keep the interface to sources the same. So the way to project `_file` is via `metadata_column("_file")`, and Spark will add `_file` to the output of the relation (and request it through the DSv2 API) if it finds a metadata column with that name. That is a nice way to avoid the current way that columns are exposed through `metadataColumns` in the Spark plan. But it also has the drawbacks of the first option: converting the function call into an attribute reference is much more complicated.
   
   The current implementation works well. We've been running this code in production for months now, so I would prefer to move forward with this approach.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] brkyvz commented on pull request #28027: [SPARK-31255][SQL] Add SupportsMetadataColumns to DSv2

Posted by GitBox <gi...@apache.org>.

brkyvz commented on pull request #28027:
URL: https://github.com/apache/spark/pull/28027#issuecomment-729985442


   I looked at this PR a couple hours ago and it had failing tests. Now that it has passed all builders, I'm going to merge this. Thank you @rdblue !


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins commented on pull request #28027: [SPARK-31255][SQL] Add SupportsMetadataColumns to DSv2 (WIP)

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on pull request #28027:
URL: https://github.com/apache/spark/pull/28027#issuecomment-704675667






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] brkyvz commented on a change in pull request #28027: [SPARK-31255][SQL] Add SupportsMetadataColumns to DSv2

Posted by GitBox <gi...@apache.org>.

brkyvz commented on a change in pull request #28027:
URL: https://github.com/apache/spark/pull/28027#discussion_r512364187



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala
##########
@@ -880,6 +880,12 @@ case class SubqueryAlias(
     val qualifierList = identifier.qualifier :+ alias
     child.output.map(_.withQualifier(qualifierList))
   }
+
+  override def metadataOutput: Seq[Attribute] = {
+    val qualifierList = identifier.qualifier :+ alias
+    child.metadataOutput.map(_.withQualifier(qualifierList))
+  }

Review comment:
       why is this differentiation needed? Won't the metadata columns be part of `output`?

##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2Relation.scala
##########
@@ -48,6 +48,15 @@ case class DataSourceV2Relation(
 
   import DataSourceV2Implicits._
 
+  override lazy val metadataOutput: Seq[AttributeReference] = table match {
+    case hasMeta: SupportsMetadataColumns =>
+      val attrs = hasMeta.metadataColumns
+      val outputNames = outputSet.map(_.name).toSet
+      attrs.filterNot(col => outputNames.contains(col.name)).toAttributes

Review comment:
       don't you need to resolve column names case insensitively (according to column name)?
   What if there's a match between a data column and the metadata column?
   
   I believe we'll have to come up with UDFs to access these metadata columns. `input_file_name()` could be one such UDF

##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2Relation.scala
##########
@@ -48,6 +48,15 @@ case class DataSourceV2Relation(
 
   import DataSourceV2Implicits._
 
+  override lazy val metadataOutput: Seq[AttributeReference] = table match {
+    case hasMeta: SupportsMetadataColumns =>
+      val attrs = hasMeta.metadataColumns
+      val outputNames = outputSet.map(_.name).toSet
+      attrs.filterNot(col => outputNames.contains(col.name)).toAttributes

Review comment:
       Another option is that metadata columns can be part of a struct, and you can select fields out of those metadata columns. A `get_source_metadata()` UDF can be a catch all, and you can use it like:
   `get_source_metadata("file_size")` or `get_source_metadata("row_id")` and things like that. What do you think?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] github-actions[bot] commented on pull request #28027: [SPARK-31255][SQL] Add SupportsMetadataColumns to DSv2 (WIP)

Posted by GitBox <gi...@apache.org>.

github-actions[bot] commented on pull request #28027:
URL: https://github.com/apache/spark/pull/28027#issuecomment-653960650


   We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
   If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins removed a comment on pull request #28027: [SPARK-31255][SQL] Add SupportsMetadataColumns to DSv2

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #28027:
URL: https://github.com/apache/spark/pull/28027#issuecomment-721892205






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins removed a comment on pull request #28027: [SPARK-31255][SQL] Add SupportsMetadataColumns to DSv2

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #28027:
URL: https://github.com/apache/spark/pull/28027#issuecomment-721445646






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] rdblue commented on a change in pull request #28027: [SPARK-31255][SQL] Add SupportsMetadataColumns to DSv2

Posted by GitBox <gi...@apache.org>.

rdblue commented on a change in pull request #28027:
URL: https://github.com/apache/spark/pull/28027#discussion_r525619188



##########
File path: sql/catalyst/src/main/java/org/apache/spark/sql/connector/catalog/SupportsMetadataColumns.java
##########
@@ -0,0 +1,29 @@
+package org.apache.spark.sql.connector.catalog;
+
+import org.apache.spark.annotation.Evolving;
+import org.apache.spark.sql.connector.read.SupportsPushDownRequiredColumns;
+import org.apache.spark.sql.types.StructField;
+import org.apache.spark.sql.types.StructType;
+
+/**
+ * An interface for exposing data columns for a table that are not in the table schema. For example,
+ * a file source could expose a "file" column that contains the path of the file that contained each
+ * row.
+ * <p>

Review comment:
       I added this:
   
   > If a table column and a metadata column have the same name, the metadata column will never be requested and is ignored. It is recommended that Table implementations reject data column names that conflict with metadata column names.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] SparkQA commented on pull request #28027: [SPARK-31255][SQL] Add SupportsMetadataColumns to DSv2 (WIP)

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #28027:
URL: https://github.com/apache/spark/pull/28027#issuecomment-704612225


   Kubernetes integration test status failure
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34082/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] bart-samwel commented on a change in pull request #28027: [SPARK-31255][SQL] Add SupportsMetadataColumns to DSv2

Posted by GitBox <gi...@apache.org>.

bart-samwel commented on a change in pull request #28027:
URL: https://github.com/apache/spark/pull/28027#discussion_r517176346



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2Relation.scala
##########
@@ -48,6 +48,15 @@ case class DataSourceV2Relation(
 
   import DataSourceV2Implicits._
 
+  override lazy val metadataOutput: Seq[AttributeReference] = table match {
+    case hasMeta: SupportsMetadataColumns =>
+      val attrs = hasMeta.metadataColumns
+      val outputNames = outputSet.map(_.name).toSet
+      attrs.filterNot(col => outputNames.contains(col.name)).toAttributes

Review comment:
       I think the column approach has the following benefits over the function approach:
   
   - You can scope the location of the calls (you can't reference the metadata unless you're directly SELECTing from the table, so no calls in places where it doesn't make sense),
   
   - You can call them immediately from a join, eg. `SELECT t1._input_file_name, t2._input_file_name FROM Table1 t1 JOIN Table2 t2`. The function syntax doesn't allow this, unless you make it something "special" like `input_file_name(t1)`.
   
   - You can easily push down the projection using existing mechanisms to turn off the overhead of generating the columns. The semantics of such pushdown with functions is iffy at best.
   
   - If you convert this to the physical level directly, a column is more compatible with physical transformations e.g. things that add readahead and buffering. The current approach like `input_file_name()` fundamentally makes buffering and readahead hard. You'd have to convert it to columns under the covers anyway.
   
   However, the column approach has name conflicts. This gets problematic if people actually select these columns without assigning new aliases and then write the result to files, and then try to read those back. That means that either DataFrames don't round trip with 100% fidelity via files (which may break assumptions of libraries that people have built), or you can't access metadata columns on files that have columns with these names, which breaks libraries that assume that these features work. If I'd write `table._some_metadata_column` in an auditing-related library, I'd like to be sure that I get what I asked for, and that this cannot be spoofed.
   
   How about using _special syntax_ like `input_file_name(t1)`, or `metadata(t1, file_name)`, where t1 must be a table (_data source_, not subquery/view) in the FROM clause, optional if there's only one table. So it's not a real function, it's just function-call-like syntax. That can be converted into a special type of column reference under the covers, even at parsing time, giving it all the advantages of columns from an implementation perspective. The column reference could use a reserved name that is not possible to write otherwise. When used in the SELECT list, we wouldn't assign that column name as the SELECT list alias, so it wouldn't show up in files! You still get an error at analysis time if you call it from a place where there's no table, so there's no incorrect use like with normal nondeterministic functions. This has the advantage that you can't have name conflicts. If we would add `input_file_name(t1)` and `input_file_name()` as such special functions with special syntax
 , then we'd have backward compatibility with existing queries, and we could migrate that mechanism to the new mechanism. And to avoid having to add extra functions for everything, we'd add`metadata(...)` for all other metadata columns.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] rdblue commented on pull request #28027: [SPARK-31255][SQL] Add SupportsMetadataColumns to DSv2

Posted by GitBox <gi...@apache.org>.

rdblue commented on pull request #28027:
URL: https://github.com/apache/spark/pull/28027#issuecomment-730004956


   Thanks, @brkyvz! Great to have this in.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] asfgit closed pull request #28027: [SPARK-31255][SQL] Add SupportsMetadataColumns to DSv2

Posted by GitBox <gi...@apache.org>.

asfgit closed pull request #28027:
URL: https://github.com/apache/spark/pull/28027


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] dongjoon-hyun commented on pull request #28027: [SPARK-31255][SQL] Add SupportsMetadataColumns to DSv2

Posted by GitBox <gi...@apache.org>.

dongjoon-hyun commented on pull request #28027:
URL: https://github.com/apache/spark/pull/28027#issuecomment-732502053


   The root cause was https://github.com/apache/spark/pull/28993 and I made a PR to fix this.
   - https://github.com/apache/spark/pull/30477


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] rdblue commented on pull request #28027: [SPARK-31255][SQL] Add SupportsMetadataColumns to DSv2 (WIP)

Posted by GitBox <gi...@apache.org>.

rdblue commented on pull request #28027:
URL: https://github.com/apache/spark/pull/28027#issuecomment-704609525


   @holdenk, I can add some tests, but I appreciate it. I have mainly not been prioritizing this, but if you can help review I can write up the tests.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] holdenk commented on a change in pull request #28027: [SPARK-31255][SQL] Add SupportsMetadataColumns to DSv2

Posted by GitBox <gi...@apache.org>.

holdenk commented on a change in pull request #28027:
URL: https://github.com/apache/spark/pull/28027#discussion_r511493573



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2Relation.scala
##########
@@ -48,6 +48,15 @@ case class DataSourceV2Relation(
 
   import DataSourceV2Implicits._
 
+  override lazy val metadataOutput: Seq[AttributeReference] = table match {
+    case hasMeta: SupportsMetadataColumns =>
+      val attrs = hasMeta.metadataColumns
+      val outputNames = outputSet.map(_.name).toSet
+      attrs.filterNot(col => outputNames.contains(col.name)).toAttributes

Review comment:
       This is the logic where we prefer the data name over the metadata name? Maybe add a comment?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] SparkQA commented on pull request #28027: [SPARK-31255][SQL] Add SupportsMetadataColumns to DSv2

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #28027:
URL: https://github.com/apache/spark/pull/28027#issuecomment-721439184


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35178/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins removed a comment on pull request #28027: [SPARK-31255][SQL] Add SupportsMetadataColumns to DSv2

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #28027:
URL: https://github.com/apache/spark/pull/28027#issuecomment-721495040






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins commented on pull request #28027: [SPARK-31255][SQL] Add SupportsMetadataColumns to DSv2

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on pull request #28027:
URL: https://github.com/apache/spark/pull/28027#issuecomment-721892205






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] dongjoon-hyun edited a comment on pull request #28027: [SPARK-31255][SQL] Add SupportsMetadataColumns to DSv2

Posted by GitBox <gi...@apache.org>.

dongjoon-hyun edited a comment on pull request #28027:
URL: https://github.com/apache/spark/pull/28027#issuecomment-732484047


   The newly added test case seems dependent on Scala versions.
   
   I filed https://issues.apache.org/jira/browse/SPARK-33524 .


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] SparkQA commented on pull request #28027: [SPARK-31255][SQL] Add SupportsMetadataColumns to DSv2

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #28027:
URL: https://github.com/apache/spark/pull/28027#issuecomment-729315786


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35842/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins removed a comment on pull request #28027: [SPARK-31255][SQL] Add SupportsMetadataColumns to DSv2

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #28027:
URL: https://github.com/apache/spark/pull/28027#issuecomment-708876085






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] SparkQA commented on pull request #28027: [SPARK-31255][SQL] Add SupportsMetadataColumns to DSv2

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #28027:
URL: https://github.com/apache/spark/pull/28027#issuecomment-721424926


   **[Test build #130577 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/130577/testReport)** for PR 28027 at commit [`d661246`](https://github.com/apache/spark/commit/d6612469d738a6d0494cc093cf9801359b5432c7).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] SparkQA commented on pull request #28027: [SPARK-31255][SQL] Add SupportsMetadataColumns to DSv2

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #28027:
URL: https://github.com/apache/spark/pull/28027#issuecomment-721892183


   Kubernetes integration test status success
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35212/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins commented on pull request #28027: [SPARK-31255][SQL] Add SupportsMetadataColumns to DSv2

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on pull request #28027:
URL: https://github.com/apache/spark/pull/28027#issuecomment-708839665






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] SparkQA removed a comment on pull request #28027: [SPARK-31255][SQL] Add SupportsMetadataColumns to DSv2

Posted by GitBox <gi...@apache.org>.

SparkQA removed a comment on pull request #28027:
URL: https://github.com/apache/spark/pull/28027#issuecomment-721424926


   **[Test build #130577 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/130577/testReport)** for PR 28027 at commit [`d661246`](https://github.com/apache/spark/commit/d6612469d738a6d0494cc093cf9801359b5432c7).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] SparkQA commented on pull request #28027: [SPARK-31255][SQL] Add SupportsMetadataColumns to DSv2

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #28027:
URL: https://github.com/apache/spark/pull/28027#issuecomment-708875271


   **[Test build #129771 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/129771/testReport)** for PR 28027 at commit [`14f8c3e`](https://github.com/apache/spark/commit/14f8c3e9b3a32e4a16b1fc1ce4be3e5a9f6be860).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins commented on pull request #28027: [SPARK-31255][SQL] Add SupportsMetadataColumns to DSv2

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on pull request #28027:
URL: https://github.com/apache/spark/pull/28027#issuecomment-708876085






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] rdblue commented on pull request #28027: [SPARK-31255][SQL] Add SupportsMetadataColumns to DSv2

Posted by GitBox <gi...@apache.org>.

rdblue commented on pull request #28027:
URL: https://github.com/apache/spark/pull/28027#issuecomment-708695774


   @holdenk, @brkyvz, I updated this and added tests. Please have a look, I think it is ready to review now.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] SparkQA commented on pull request #28027: [SPARK-31255][SQL] Add SupportsMetadataColumns to DSv2

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #28027:
URL: https://github.com/apache/spark/pull/28027#issuecomment-721881500


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35212/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] SparkQA commented on pull request #28027: [SPARK-31255][SQL] Add SupportsMetadataColumns to DSv2 (WIP)

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #28027:
URL: https://github.com/apache/spark/pull/28027#issuecomment-704675090


   **[Test build #129475 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/129475/testReport)** for PR 28027 at commit [`0a5c7ca`](https://github.com/apache/spark/commit/0a5c7caebb807f2d256ce696ba8112acc24292b7).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds the following public classes _(experimental)_:
     * `  implicit class MetadataColumnsHelper(metadata: Array[MetadataColumn]) `


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] dongjoon-hyun commented on pull request #28027: [SPARK-31255][SQL] Add SupportsMetadataColumns to DSv2

Posted by GitBox <gi...@apache.org>.

dongjoon-hyun commented on pull request #28027:
URL: https://github.com/apache/spark/pull/28027#issuecomment-732483416


   Hi, All.
   This seems to break `Scala 2.13` Unit Test.
   ```scala
   $ dev/change-scala-version.sh 2.13
   $ build/sbt "sql/testOnly *.DataSourceV2SQLSuite" -Pscala-2.13
   ```


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] bart-samwel commented on a change in pull request #28027: [SPARK-31255][SQL] Add SupportsMetadataColumns to DSv2

Posted by GitBox <gi...@apache.org>.

bart-samwel commented on a change in pull request #28027:
URL: https://github.com/apache/spark/pull/28027#discussion_r517582852



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2Relation.scala
##########
@@ -48,6 +48,15 @@ case class DataSourceV2Relation(
 
   import DataSourceV2Implicits._
 
+  override lazy val metadataOutput: Seq[AttributeReference] = table match {
+    case hasMeta: SupportsMetadataColumns =>
+      val attrs = hasMeta.metadataColumns
+      val outputNames = outputSet.map(_.name).toSet
+      attrs.filterNot(col => outputNames.contains(col.name)).toAttributes

Review comment:
       An alternative would be to make the column names use a recognizable pattern that we can forbid as column names. E.g. Snowflake uses `metadata$foo` for these columns, which is a lot more distinguishable than just a leading underscore. Hence, we could forbid using these as normal identifiers (basically check for a `metadata$` prefix) to prevent the round-tripping-via-file issues, but other than that the accesses would still look like normal column references.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] bart-samwel commented on a change in pull request #28027: [SPARK-31255][SQL] Add SupportsMetadataColumns to DSv2

Posted by GitBox <gi...@apache.org>.

bart-samwel commented on a change in pull request #28027:
URL: https://github.com/apache/spark/pull/28027#discussion_r517651552



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2Relation.scala
##########
@@ -48,6 +48,15 @@ case class DataSourceV2Relation(
 
   import DataSourceV2Implicits._
 
+  override lazy val metadataOutput: Seq[AttributeReference] = table match {
+    case hasMeta: SupportsMetadataColumns =>
+      val attrs = hasMeta.metadataColumns
+      val outputNames = outputSet.map(_.name).toSet
+      attrs.filterNot(col => outputNames.contains(col.name)).toAttributes

Review comment:
       > @bart-samwel, I like your last suggestion, but with a slight change: I think sources should reject column names that conflict with their metadata columns. That way, a source can choose a format that makes sense (e.g., __offset) and allow everything else. All Spark has to do is to define what the behavior will be when there is a conflict, and recommend that sources reject column names that are also exposed as metadata columns.
   
   They probably should reject such column names, yes. Especially at write time -- you don't want to be in a situation where you wrote a file and can then never read it back. But if you did that, how would you deal with it? Could you never read it with Spark? One alternative is to just hide the column if it's in the file. You wouldn't be able to access it, but at least that would satisfy the "auditing library requirement" that the metadata fields always do the expected thing. Another alternative would be to expose the column under a new name. But what if that new name is also there? It gets very annoying very quickly. So maybe just hiding is the best policy. 
   
   > I think using a special syntax requires significantly more changes, so I don't think that's a good option. Specific replies are below.
   > 
   > > If I'd write table._some_metadata_column in an auditing-related library, I'd like to be sure that I get what I asked for, and that this cannot be spoofed.
   > 
   > I think this is a fair argument, but I expect it would very rarely be a problem that you wanted a metadata column and accidentally projected a data column.
   
   It'll be rare, but if I have an auditing library that e.g. inserts a special execution node on top of a scan to track exactly what was read, then it would be great if people can't spoof that by adding a column with a particular special name to their underlying files. If the internal metadata column names take precedence over the columns in the file, then that problem is not there. (FWIW, I usually try to apply the principle of "name resolution must be environment independent", i.e., if a query defines something locally, e.g. an alias, then something in the environment (such as a table's schema) shouldn't be able to make that aspect of the query change meaning, because then things will break randomly as a result of schema changes. This fits in the same category.)
   
   > > How about using special syntax
   > 
   > This is what I considered above, "we could use a function to signal that a column is a metadata column only, and keep the interface to sources the same . . ." I think the challenge here is carrying the information through analysis and optimization. This approach either has many of the same complications as using a function (if kept as an expression), or we would need to introduce a new attribute type for metadata attrs that would hopefully work in rules just like normal attributes in existing rules. We can't lose the fact that the attribute references metadata, so we can't replace a metadata attribute reference with an attribute reference.
   
   Yeah, I was thinking about the special AttributeReference type. But more in the way of "use a very special naming convention that you can't type" -- maybe because we'd reject it in the parser if you'd try to explicitly write an identifier like that. We'd have to do something to prevent that column name from showing up in output files etc. though, so it's not entirely foolproof.
   
   > I think the option most likely to not introduce more problems is to use a regular attribute for these columns, but that eliminates the benefit of a special syntax to track the fact that they are metadata column references. This doesn't seem like a good option to me, if all we want to do is guarantee that metadata columns don't conflict with data columns.
   > 
   > The special syntax is also more awkward to use. One of the use cases for this feature is to expose non-schema fields like Kafka topic, offset, and partition. I think it is much more natural to select `_offset` than `metadata_column("offset")`. And if these are exposed as columns, we can add them to `DESCRIBE EXTENDED`, like I've done in this PR.
   
   That aspect of columns is certainly nice.
   
   > > An alternative would be to make the column names use a recognizable pattern that we can forbid as column names.
   > 
   > I like this suggestion, but I'd leave rejecting column names to sources as I said above. Another option is to make this available, but not enforce it. We could add a check that data columns are not of the form `metadata$.*`, but not require metadata column names using the pattern. That way there is a way to guarantee names don't conflict, but if a source chooses to use a name that may conflict it still works fine.
   
   So we could set the example by using this pattern in data sources that we control, but other data sources could choose otherwise. I guess that works. I'm personally inclined to be a bit more prescriptive about these things, basically because if something is a best practice but there's no reviewer for the next data source that knows about this best practice, then the best practice won't be followed. And before you know it, there's a data source out in the wild that you can't migrate to use the best practice because there's already usage out there. Maybe it's already enough if all the existing data sources consistently use the pattern -- then at least it's "cargo cult proof".




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] SparkQA commented on pull request #28027: [SPARK-31255][SQL] Add SupportsMetadataColumns to DSv2

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #28027:
URL: https://github.com/apache/spark/pull/28027#issuecomment-721494538


   **[Test build #130577 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/130577/testReport)** for PR 28027 at commit [`d661246`](https://github.com/apache/spark/commit/d6612469d738a6d0494cc093cf9801359b5432c7).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] rdblue commented on pull request #28027: [SPARK-31255][SQL] Add SupportsMetadataColumns to DSv2

Posted by GitBox <gi...@apache.org>.

rdblue commented on pull request #28027:
URL: https://github.com/apache/spark/pull/28027#issuecomment-721857592


   Rebased to fix the conflict in DataSourceV2SQLSuite.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] brkyvz commented on a change in pull request #28027: [SPARK-31255][SQL] Add SupportsMetadataColumns to DSv2

Posted by GitBox <gi...@apache.org>.

brkyvz commented on a change in pull request #28027:
URL: https://github.com/apache/spark/pull/28027#discussion_r512996706



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2Relation.scala
##########
@@ -48,6 +48,15 @@ case class DataSourceV2Relation(
 
   import DataSourceV2Implicits._
 
+  override lazy val metadataOutput: Seq[AttributeReference] = table match {
+    case hasMeta: SupportsMetadataColumns =>
+      val attrs = hasMeta.metadataColumns
+      val outputNames = outputSet.map(_.name).toSet
+      attrs.filterNot(col => outputNames.contains(col.name)).toAttributes

Review comment:
       what if we don't have the structs, but just have to use this UDF to access a metadata column?
   So you can do:
   ```
   select id, source_metadata(file)
   ```
   this would avoid name collisions and explicitly tells you which functions need to be resolved. You can also return better error messages and things knowing that the column you tried to resolve was a metadata column or a column.
   
   I believe other systems like Oracle didn't have a function for pseudocolumns, therefore I'm also fine with this.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] rdblue commented on a change in pull request #28027: [SPARK-31255][SQL] Add SupportsMetadataColumns to DSv2

Posted by GitBox <gi...@apache.org>.

rdblue commented on a change in pull request #28027:
URL: https://github.com/apache/spark/pull/28027#discussion_r517655219



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2Relation.scala
##########
@@ -48,6 +48,15 @@ case class DataSourceV2Relation(
 
   import DataSourceV2Implicits._
 
+  override lazy val metadataOutput: Seq[AttributeReference] = table match {
+    case hasMeta: SupportsMetadataColumns =>
+      val attrs = hasMeta.metadataColumns
+      val outputNames = outputSet.map(_.name).toSet
+      attrs.filterNot(col => outputNames.contains(col.name)).toAttributes

Review comment:
       > If the internal metadata column names take precedence over the columns in the file, then that problem is not there.
   
   To be clear, I think that data columns should always take precedence over metadata, so that you can always read what was written. You just may not be able to load the optional metadata columns if they conflict. And that conflict _should_ handled by the source by rejecting columns at write time.
   
   For your source, you'd have the guarantee that `metadata$.*` columns do not conflict to satisfy your requirement, but another source may use a different convention.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins commented on pull request #28027: [SPARK-31255][SQL] Add SupportsMetadataColumns to DSv2

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on pull request #28027:
URL: https://github.com/apache/spark/pull/28027#issuecomment-722005267






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] SparkQA commented on pull request #28027: [SPARK-31255][SQL] Add SupportsMetadataColumns to DSv2

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #28027:
URL: https://github.com/apache/spark/pull/28027#issuecomment-708839650


   Kubernetes integration test status success
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34377/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins commented on pull request #28027: [SPARK-31255][SQL] Add SupportsMetadataColumns to DSv2

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on pull request #28027:
URL: https://github.com/apache/spark/pull/28027#issuecomment-721445646






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] dongjoon-hyun commented on pull request #28027: [SPARK-31255][SQL] Add SupportsMetadataColumns to DSv2

Posted by GitBox <gi...@apache.org>.

dongjoon-hyun commented on pull request #28027:
URL: https://github.com/apache/spark/pull/28027#issuecomment-732493448


   cc @srowen 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] SparkQA commented on pull request #28027: [SPARK-31255][SQL] Add SupportsMetadataColumns to DSv2

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #28027:
URL: https://github.com/apache/spark/pull/28027#issuecomment-729302417


   **[Test build #131238 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/131238/testReport)** for PR 28027 at commit [`6e72f4d`](https://github.com/apache/spark/commit/6e72f4d68e6935efba32ba7a5f8b48adb66fe145).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins removed a comment on pull request #28027: [SPARK-31255][SQL] Add SupportsMetadataColumns to DSv2 (WIP)

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #28027:
URL: https://github.com/apache/spark/pull/28027#issuecomment-704612248


   Test FAILed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/34082/
   Test FAILed.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins removed a comment on pull request #28027: [SPARK-31255][SQL] Add SupportsMetadataColumns to DSv2

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #28027:
URL: https://github.com/apache/spark/pull/28027#issuecomment-722005267






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] SparkQA removed a comment on pull request #28027: [SPARK-31255][SQL] Add SupportsMetadataColumns to DSv2 (WIP)

Posted by GitBox <gi...@apache.org>.

SparkQA removed a comment on pull request #28027:
URL: https://github.com/apache/spark/pull/28027#issuecomment-704587017


   **[Test build #129475 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/129475/testReport)** for PR 28027 at commit [`0a5c7ca`](https://github.com/apache/spark/commit/0a5c7caebb807f2d256ce696ba8112acc24292b7).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] SparkQA removed a comment on pull request #28027: [SPARK-31255][SQL] Add SupportsMetadataColumns to DSv2

Posted by GitBox <gi...@apache.org>.

SparkQA removed a comment on pull request #28027:
URL: https://github.com/apache/spark/pull/28027#issuecomment-708699868


   **[Test build #129771 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/129771/testReport)** for PR 28027 at commit [`14f8c3e`](https://github.com/apache/spark/commit/14f8c3e9b3a32e4a16b1fc1ce4be3e5a9f6be860).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] HeartSaVioR commented on a change in pull request #28027: [SPARK-31255][SQL] Add SupportsMetadataColumns to DSv2

Posted by GitBox <gi...@apache.org>.

HeartSaVioR commented on a change in pull request #28027:
URL: https://github.com/apache/spark/pull/28027#discussion_r526538423



##########
File path: sql/catalyst/src/main/java/org/apache/spark/sql/connector/catalog/MetadataColumn.java
##########
@@ -0,0 +1,58 @@
+package org.apache.spark.sql.connector.catalog;

Review comment:
       Hmm that's weird RAT misses this.

##########
File path: sql/catalyst/src/main/java/org/apache/spark/sql/connector/catalog/MetadataColumn.java
##########
@@ -0,0 +1,58 @@
+package org.apache.spark.sql.connector.catalog;

Review comment:
       Hmm that's weird RAT missed this.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] SparkQA commented on pull request #28027: [SPARK-31255][SQL] Add SupportsMetadataColumns to DSv2

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #28027:
URL: https://github.com/apache/spark/pull/28027#issuecomment-721445631


   Kubernetes integration test status success
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35178/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] sunchao commented on a change in pull request #28027: [SPARK-31255][SQL] Add SupportsMetadataColumns to DSv2

Posted by GitBox <gi...@apache.org>.

sunchao commented on a change in pull request #28027:
URL: https://github.com/apache/spark/pull/28027#discussion_r526536639



##########
File path: sql/catalyst/src/main/java/org/apache/spark/sql/connector/catalog/MetadataColumn.java
##########
@@ -0,0 +1,58 @@
+package org.apache.spark.sql.connector.catalog;

Review comment:
       hmm @brkyvz @rdblue don't we require license header for new files?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] brkyvz commented on pull request #28027: [SPARK-31255][SQL] Add SupportsMetadataColumns to DSv2 (WIP)

Posted by GitBox <gi...@apache.org>.

brkyvz commented on pull request #28027:
URL: https://github.com/apache/spark/pull/28027#issuecomment-704610048


   @rdblue I'd be happy to help review as well. I think a lot of nice things can be built correctly on top of this interface, e.g. `input_file_name()` and also object level metadata, etc


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins commented on pull request #28027: [SPARK-31255][SQL] Add SupportsMetadataColumns to DSv2

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on pull request #28027:
URL: https://github.com/apache/spark/pull/28027#issuecomment-729326073






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] brkyvz commented on a change in pull request #28027: [SPARK-31255][SQL] Add SupportsMetadataColumns to DSv2

Posted by GitBox <gi...@apache.org>.

brkyvz commented on a change in pull request #28027:
URL: https://github.com/apache/spark/pull/28027#discussion_r525389190



##########
File path: sql/catalyst/src/main/java/org/apache/spark/sql/connector/catalog/SupportsMetadataColumns.java
##########
@@ -0,0 +1,29 @@
+package org.apache.spark.sql.connector.catalog;
+
+import org.apache.spark.annotation.Evolving;
+import org.apache.spark.sql.connector.read.SupportsPushDownRequiredColumns;
+import org.apache.spark.sql.types.StructField;
+import org.apache.spark.sql.types.StructType;
+
+/**
+ * An interface for exposing data columns for a table that are not in the table schema. For example,
+ * a file source could expose a "file" column that contains the path of the file that contained each
+ * row.
+ * <p>

Review comment:
       Should we talk about the behavior of reserving names for metadata columns or the behavior that will happen during name collisions here (data columns will be selected over metadata)?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] SparkQA commented on pull request #28027: [SPARK-31255][SQL] Add SupportsMetadataColumns to DSv2

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #28027:
URL: https://github.com/apache/spark/pull/28027#issuecomment-708827678


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34377/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] dongjoon-hyun edited a comment on pull request #28027: [SPARK-31255][SQL] Add SupportsMetadataColumns to DSv2

Posted by GitBox <gi...@apache.org>.

dongjoon-hyun edited a comment on pull request #28027:
URL: https://github.com/apache/spark/pull/28027#issuecomment-732483416


   Hi, All.
   This seems to break `Scala 2.13` Unit Test.
   ```scala
   $ dev/change-scala-version.sh 2.13
   $ build/sbt "sql/testOnly *.DataSourceV2SQLSuite" -Pscala-2.13
   ...
   [info] - SPARK-31255: Project a metadata column *** FAILED *** (96 milliseconds)
   [info] - SPARK-31255: Projects data column when metadata column has the same name *** FAILED *** (77 milliseconds)
   ```


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] dongjoon-hyun commented on pull request #28027: [SPARK-31255][SQL] Add SupportsMetadataColumns to DSv2

Posted by GitBox <gi...@apache.org>.

dongjoon-hyun commented on pull request #28027:
URL: https://github.com/apache/spark/pull/28027#issuecomment-732484047


   The newly added test case seems dependent on Scala versions.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins removed a comment on pull request #28027: [SPARK-31255][SQL] Add SupportsMetadataColumns to DSv2

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #28027:
URL: https://github.com/apache/spark/pull/28027#issuecomment-729326073


   Merged build finished. Test FAILed.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins commented on pull request #28027: [SPARK-31255][SQL] Add SupportsMetadataColumns to DSv2

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on pull request #28027:
URL: https://github.com/apache/spark/pull/28027#issuecomment-721495040






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins commented on pull request #28027: [SPARK-31255][SQL] Add SupportsMetadataColumns to DSv2 (WIP)

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on pull request #28027:
URL: https://github.com/apache/spark/pull/28027#issuecomment-704612244






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] rdblue commented on a change in pull request #28027: [SPARK-31255][SQL] Add SupportsMetadataColumns to DSv2

Posted by GitBox <gi...@apache.org>.

rdblue commented on a change in pull request #28027:
URL: https://github.com/apache/spark/pull/28027#discussion_r517015556



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2Relation.scala
##########
@@ -48,6 +48,15 @@ case class DataSourceV2Relation(
 
   import DataSourceV2Implicits._
 
+  override lazy val metadataOutput: Seq[AttributeReference] = table match {
+    case hasMeta: SupportsMetadataColumns =>
+      val attrs = hasMeta.metadataColumns
+      val outputNames = outputSet.map(_.name).toSet
+      attrs.filterNot(col => outputNames.contains(col.name)).toAttributes

Review comment:
       Done. I added a reason and an example.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] rdblue commented on a change in pull request #28027: [SPARK-31255][SQL] Add SupportsMetadataColumns to DSv2

Posted by GitBox <gi...@apache.org>.

rdblue commented on a change in pull request #28027:
URL: https://github.com/apache/spark/pull/28027#discussion_r517016147



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2Relation.scala
##########
@@ -48,6 +48,15 @@ case class DataSourceV2Relation(
 
   import DataSourceV2Implicits._
 
+  override lazy val metadataOutput: Seq[AttributeReference] = table match {
+    case hasMeta: SupportsMetadataColumns =>
+      val attrs = hasMeta.metadataColumns
+      val outputNames = outputSet.map(_.name).toSet
+      attrs.filterNot(col => outputNames.contains(col.name)).toAttributes

Review comment:
       @brkyvz, I updated this to use the resolver when removing metadata columns that conflict with data columns. Please have another look.
   
   Also, are you okay moving forward with this implementation instead of a UDF-style interface to project the columns?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins removed a comment on pull request #28027: [SPARK-31255][SQL] Add SupportsMetadataColumns to DSv2

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #28027:
URL: https://github.com/apache/spark/pull/28027#issuecomment-729434797






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] dongjoon-hyun commented on a change in pull request #28027: [SPARK-31255][SQL] Add SupportsMetadataColumns to DSv2

Posted by GitBox <gi...@apache.org>.

dongjoon-hyun commented on a change in pull request #28027:
URL: https://github.com/apache/spark/pull/28027#discussion_r526568906



##########
File path: sql/catalyst/src/main/java/org/apache/spark/sql/connector/catalog/MetadataColumn.java
##########
@@ -0,0 +1,58 @@
+package org.apache.spark.sql.connector.catalog;

Review comment:
       It was due to incorrect exclusion rule. I made a PR.
   - https://github.com/apache/spark/pull/30418




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] rdblue commented on a change in pull request #28027: [SPARK-31255][SQL] Add SupportsMetadataColumns to DSv2

Posted by GitBox <gi...@apache.org>.

rdblue commented on a change in pull request #28027:
URL: https://github.com/apache/spark/pull/28027#discussion_r512961848



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala
##########
@@ -880,6 +880,12 @@ case class SubqueryAlias(
     val qualifierList = identifier.qualifier :+ alias
     child.output.map(_.withQualifier(qualifierList))
   }
+
+  override def metadataOutput: Seq[Attribute] = {
+    val qualifierList = identifier.qualifier :+ alias
+    child.metadataOutput.map(_.withQualifier(qualifierList))
+  }

Review comment:
       They are _eventually_ part of the output, but they can't be at first because `*` expansion uses all of `output`. If we added them immediately, we would get metadata columns in a `select *`.
   
   Instead, we add the metadata columns to this and then update column resolution to look up columns here. The result is that we can resolve everything just like normal, including `*`, but the columns are missing from output. Then the new analyzer rule adds the columns to the output if they are resolved, but missing. Since the parent node is already resolved, we know that this is safe and happens after `*` expansion.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] SparkQA removed a comment on pull request #28027: [SPARK-31255][SQL] Add SupportsMetadataColumns to DSv2

Posted by GitBox <gi...@apache.org>.

SparkQA removed a comment on pull request #28027:
URL: https://github.com/apache/spark/pull/28027#issuecomment-729302417


   **[Test build #131238 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/131238/testReport)** for PR 28027 at commit [`6e72f4d`](https://github.com/apache/spark/commit/6e72f4d68e6935efba32ba7a5f8b48adb66fe145).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] rdblue commented on a change in pull request #28027: [SPARK-31255][SQL] Add SupportsMetadataColumns to DSv2

Posted by GitBox <gi...@apache.org>.

rdblue commented on a change in pull request #28027:
URL: https://github.com/apache/spark/pull/28027#discussion_r517639689



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2Relation.scala
##########
@@ -48,6 +48,15 @@ case class DataSourceV2Relation(
 
   import DataSourceV2Implicits._
 
+  override lazy val metadataOutput: Seq[AttributeReference] = table match {
+    case hasMeta: SupportsMetadataColumns =>
+      val attrs = hasMeta.metadataColumns
+      val outputNames = outputSet.map(_.name).toSet
+      attrs.filterNot(col => outputNames.contains(col.name)).toAttributes

Review comment:
       @bart-samwel, I like your last suggestion, but with a slight change: I think sources should reject column names that conflict with their metadata columns. That way, a source can choose a format that makes sense (e.g., __offset) and allow everything else. All Spark has to do is to define what the behavior will be when there is a conflict, and recommend that sources reject column names that are also exposed as metadata columns.
   
   I think using a special syntax requires significantly more changes, so I don't think that's a good option. Specific replies are below.
   
   > If I'd write table._some_metadata_column in an auditing-related library, I'd like to be sure that I get what I asked for, and that this cannot be spoofed.
   
   I think this is a fair argument, but I expect it would very rarely be a problem that you wanted a metadata column and accidentally projected a data column.
   
   > How about using special syntax
   
   This is what I considered above, "we could use a function to signal that a column is a metadata column only, and keep the interface to sources the same . . ." I think the challenge here is carrying the information through analysis and optimization. This approach either has many of the same complications as using a function (if kept as an expression), or we would need to introduce a new attribute type for metadata attrs that would hopefully work in rules just like normal attributes in existing rules. We can't lose the fact that the attribute references metadata, so we can't replace a metadata attribute reference with an attribute reference.
   
   I think the option most likely to not introduce more problems is to use a regular attribute for these columns, but that eliminates the benefit of a special syntax to track the fact that they are metadata column references. This doesn't seem like a good option to me, if all we want to do is guarantee that metadata columns don't conflict with data columns.
   
   The special syntax is also more awkward to use. One of the use cases for this feature is to expose non-schema fields like Kafka topic, offset, and partition. I think it is much more natural to select `_offset` than `metadata_column("offset")`. And if these are exposed as columns, we can add them to `DESCRIBE EXTENDED`, like I've done in this PR.
   
   > An alternative would be to make the column names use a recognizable pattern that we can forbid as column names.
   
   I like this suggestion, but I'd leave rejecting column names to sources as I said above. Another option is to make this available, but not enforce it. We could add a check that data columns are not of the form `metadata$.*`, but not require metadata column names using the pattern. That way there is a way to guarantee names don't conflict, but if a source chooses to use a name that may conflict it still works fine.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins removed a comment on pull request #28027: [SPARK-31255][SQL] Add SupportsMetadataColumns to DSv2 (WIP)

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #28027:
URL: https://github.com/apache/spark/pull/28027#issuecomment-704612244


   Merged build finished. Test FAILed.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins removed a comment on pull request #28027: [SPARK-31255][SQL] Add SupportsMetadataColumns to DSv2

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #28027:
URL: https://github.com/apache/spark/pull/28027#issuecomment-708839665






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] holdenk commented on pull request #28027: [SPARK-31255][SQL] Add SupportsMetadataColumns to DSv2 (WIP)

Posted by GitBox <gi...@apache.org>.

holdenk commented on pull request #28027:
URL: https://github.com/apache/spark/pull/28027#issuecomment-704586022


   Would it help @rdblue if someone submited a PR to your PR adding some tests?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] SparkQA commented on pull request #28027: [SPARK-31255][SQL] Add SupportsMetadataColumns to DSv2

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #28027:
URL: https://github.com/apache/spark/pull/28027#issuecomment-729326062


   Kubernetes integration test status failure
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35842/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] SparkQA commented on pull request #28027: [SPARK-31255][SQL] Add SupportsMetadataColumns to DSv2 (WIP)

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #28027:
URL: https://github.com/apache/spark/pull/28027#issuecomment-704587017


   **[Test build #129475 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/129475/testReport)** for PR 28027 at commit [`0a5c7ca`](https://github.com/apache/spark/commit/0a5c7caebb807f2d256ce696ba8112acc24292b7).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] github-actions[bot] closed pull request #28027: [SPARK-31255][SQL] Add SupportsMetadataColumns to DSv2 (WIP)

Posted by GitBox <gi...@apache.org>.

github-actions[bot] closed pull request #28027:
URL: https://github.com/apache/spark/pull/28027


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] SparkQA commented on pull request #28027: [SPARK-31255][SQL] Add SupportsMetadataColumns to DSv2

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #28027:
URL: https://github.com/apache/spark/pull/28027#issuecomment-708699868


   **[Test build #129771 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/129771/testReport)** for PR 28027 at commit [`14f8c3e`](https://github.com/apache/spark/commit/14f8c3e9b3a32e4a16b1fc1ce4be3e5a9f6be860).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins removed a comment on pull request #28027: [SPARK-31255][SQL] Add SupportsMetadataColumns to DSv2 (WIP)

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #28027:
URL: https://github.com/apache/spark/pull/28027#issuecomment-704675667






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] SparkQA commented on pull request #28027: [SPARK-31255][SQL] Add SupportsMetadataColumns to DSv2

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #28027:
URL: https://github.com/apache/spark/pull/28027#issuecomment-722004291


   **[Test build #130611 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/130611/testReport)** for PR 28027 at commit [`9a4dfb1`](https://github.com/apache/spark/commit/9a4dfb1d1f62ae6e47af3b01ccf2d54e8f72c6cd).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] SparkQA commented on pull request #28027: [SPARK-31255][SQL] Add SupportsMetadataColumns to DSv2

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #28027:
URL: https://github.com/apache/spark/pull/28027#issuecomment-729433791


   **[Test build #131238 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/131238/testReport)** for PR 28027 at commit [`6e72f4d`](https://github.com/apache/spark/commit/6e72f4d68e6935efba32ba7a5f8b48adb66fe145).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] rdblue commented on a change in pull request #28027: [SPARK-31255][SQL] Add SupportsMetadataColumns to DSv2

Posted by GitBox <gi...@apache.org>.

rdblue commented on a change in pull request #28027:
URL: https://github.com/apache/spark/pull/28027#discussion_r526546803



##########
File path: sql/catalyst/src/main/java/org/apache/spark/sql/connector/catalog/MetadataColumn.java
##########
@@ -0,0 +1,58 @@
+package org.apache.spark.sql.connector.catalog;

Review comment:
       I'll open a new PR to add the license headers.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins removed a comment on pull request #28027: [SPARK-31255][SQL] Add SupportsMetadataColumns to DSv2

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #28027:
URL: https://github.com/apache/spark/pull/28027#issuecomment-729326077


   Test FAILed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/35842/
   Test FAILed.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] rdblue commented on a change in pull request #28027: [SPARK-31255][SQL] Add SupportsMetadataColumns to DSv2

Posted by GitBox <gi...@apache.org>.

rdblue commented on a change in pull request #28027:
URL: https://github.com/apache/spark/pull/28027#discussion_r512963331



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2Relation.scala
##########
@@ -48,6 +48,15 @@ case class DataSourceV2Relation(
 
   import DataSourceV2Implicits._
 
+  override lazy val metadataOutput: Seq[AttributeReference] = table match {
+    case hasMeta: SupportsMetadataColumns =>
+      val attrs = hasMeta.metadataColumns
+      val outputNames = outputSet.map(_.name).toSet
+      attrs.filterNot(col => outputNames.contains(col.name)).toAttributes

Review comment:
       Yes, it is. I'll add a comment to make it clear.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] rdblue commented on a change in pull request #28027: [SPARK-31255][SQL] Add SupportsMetadataColumns to DSv2

Posted by GitBox <gi...@apache.org>.

rdblue commented on a change in pull request #28027:
URL: https://github.com/apache/spark/pull/28027#discussion_r525554592



##########
File path: sql/catalyst/src/main/java/org/apache/spark/sql/connector/catalog/SupportsMetadataColumns.java
##########
@@ -0,0 +1,29 @@
+package org.apache.spark.sql.connector.catalog;
+
+import org.apache.spark.annotation.Evolving;
+import org.apache.spark.sql.connector.read.SupportsPushDownRequiredColumns;
+import org.apache.spark.sql.types.StructField;
+import org.apache.spark.sql.types.StructType;
+
+/**
+ * An interface for exposing data columns for a table that are not in the table schema. For example,
+ * a file source could expose a "file" column that contains the path of the file that contained each
+ * row.
+ * <p>

Review comment:
       Good idea. I'll add that and rebase to fix the conflicts. Thanks!




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins commented on pull request #28027: [SPARK-31255][SQL] Add SupportsMetadataColumns to DSv2

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on pull request #28027:
URL: https://github.com/apache/spark/pull/28027#issuecomment-729434797






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org