Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2022/12/30 13:39:39 UTC

[GitHub] [spark] olaky opened a new pull request, #39314: Add new metadata types

olaky opened a new pull request, #39314:
URL: https://github.com/apache/spark/pull/39314

   



[GitHub] [spark] olaky commented on a diff in pull request #39314: [SPARK-41791] Add new metadata types

Posted by GitBox <gi...@apache.org>.
olaky commented on code in PR #39314:
URL: https://github.com/apache/spark/pull/39314#discussion_r1060326307


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/namedExpressions.scala:
##########
@@ -492,21 +492,31 @@ object MetadataAttribute {
 
 /**
  * The internal representation of the FileSourceMetadataAttribute, it sets `__metadata_col`
- * and `__file_source_metadata_col` to `true` in AttributeReference's metadata
+ * and `__file_source_metadata_col` to `true` in AttributeReference's metadata.
+ * This is a subtype of [[FileSourceConstantMetadataAttribute]] and
+ * [[FileSourceGeneratedMetadataAttribute]].
+ *
  * - apply() will create a file source metadata attribute reference
- * - unapply() will check if an attribute reference is the file source metadata attribute reference
+ * - unapply() will check if an attribute reference is any file source metadata attribute reference
  */
 object FileSourceMetadataAttribute {
 
   val FILE_SOURCE_METADATA_COL_ATTR_KEY = "__file_source_metadata_col"
 
-  def apply(name: String, dataType: DataType): AttributeReference =
-    // Metadata column for file sources is always not nullable.
-    AttributeReference(name, dataType, nullable = false,
+  /**
+   * Cleanup the internal metadata information of an attribute if it is
+   * a [[FileSourceConstantMetadataAttribute]] or [[FileSourceGeneratedMetadataAttribute]].
+   */
+  def cleanupFileSourceMetadataInformation(attr: Attribute): Attribute =
+    removeInternalMetadata(attr)
+
+  def apply(name: String, dataType: DataType, nullable: Boolean = false): AttributeReference =

Review Comment:
   Yes, it is used [here](https://github.com/apache/spark/blob/df1747ad83d78d0282815c7ba63000575dd279ce/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileFormat.scala#L226).
   The generic type (neither constant nor generated) is used to wrap columns that hold a struct of metadata fields; it was already used for this purpose before this change.
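
For readers following the diff, here is a sketch of what such an `apply` constructs, inferred from the metadata keys shown above; the object name is hypothetical and this is not the actual Spark source:

```scala
import org.apache.spark.sql.catalyst.expressions.AttributeReference
import org.apache.spark.sql.types.{DataType, MetadataBuilder}

// Hypothetical illustration: builds an attribute flagged as a file source
// metadata column via the two keys shown in the diff above.
object FileSourceMetadataAttributeSketch {
  val METADATA_COL_ATTR_KEY = "__metadata_col"
  val FILE_SOURCE_METADATA_COL_ATTR_KEY = "__file_source_metadata_col"

  def apply(name: String, dataType: DataType, nullable: Boolean = false): AttributeReference =
    AttributeReference(name, dataType, nullable,
      new MetadataBuilder()
        .putBoolean(METADATA_COL_ATTR_KEY, value = true)
        .putBoolean(FILE_SOURCE_METADATA_COL_ATTR_KEY, value = true)
        .build())()
}
```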





[GitHub] [spark] olaky commented on a diff in pull request #39314: [SPARK-41791] Add new metadata types

Posted by GitBox <gi...@apache.org>.
olaky commented on code in PR #39314:
URL: https://github.com/apache/spark/pull/39314#discussion_r1060332259


##########
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileSourceStrategy.scala:
##########
@@ -269,7 +277,7 @@ object FileSourceStrategy extends Strategy with PredicateHelper with Logging {
             case FileFormat.FILE_PATH | FileFormat.FILE_NAME | FileFormat.FILE_SIZE |
                  FileFormat.FILE_MODIFICATION_TIME =>
               col
-            case FileFormat.ROW_INDEX =>
+            case FileFormat.ROW_INDEX_TEMPORARY_COLUMN_NAME =>

Review Comment:
   No, this is an effect of the different way the metadata column lists are iterated and built above.
   Before, metadataColumns was simply a flattened list of all columns.
   Now the flattening also applies all transformations; I introduced this so that a single place in the code handles the different types.
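
A minimal sketch of the single flattening-and-categorization pass described above, with assumed names (`categorize`, `isGenerated`); the actual PR tags fields via the new attribute types:

```scala
import org.apache.spark.sql.types.{StructField, StructType}

// Assumed-name sketch: a single pass over the flattened metadata struct
// splits every field into the generated or constant bucket, so all
// type-specific handling happens in one place.
def categorize(metadataStruct: StructType,
               isGenerated: String => Boolean): (Seq[StructField], Seq[StructField]) = {
  // partition returns (generated, constant) given the predicate order
  metadataStruct.fields.toSeq.partition(f => isGenerated(f.name))
}
```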





[GitHub] [spark] cloud-fan commented on a diff in pull request #39314: [SPARK-41791] Add new metadata types

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on code in PR #39314:
URL: https://github.com/apache/spark/pull/39314#discussion_r1060279827


##########
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileSourceStrategy.scala:
##########
@@ -269,7 +277,7 @@ object FileSourceStrategy extends Strategy with PredicateHelper with Logging {
             case FileFormat.FILE_PATH | FileFormat.FILE_NAME | FileFormat.FILE_SIZE |
                  FileFormat.FILE_MODIFICATION_TIME =>
               col
-            case FileFormat.ROW_INDEX =>
+            case FileFormat.ROW_INDEX_TEMPORARY_COLUMN_NAME =>

Review Comment:
   is it a bug fix?





[GitHub] [spark] olaky commented on pull request #39314: [SPARK-41791] Add new metadata types

Posted by GitBox <gi...@apache.org>.
olaky commented on PR #39314:
URL: https://github.com/apache/spark/pull/39314#issuecomment-1367960163

   cc @cloud-fan @somani 




[GitHub] [spark] olaky commented on a diff in pull request #39314: [SPARK-41791] Add new metadata types

Posted by GitBox <gi...@apache.org>.
olaky commented on code in PR #39314:
URL: https://github.com/apache/spark/pull/39314#discussion_r1060338220


##########
sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala:
##########
@@ -379,11 +381,12 @@ trait FileSourceScanLike extends DataSourceScanExec {
   @transient
   protected lazy val pushedDownFilters = {
     val supportNestedPredicatePushdown = DataSourceUtils.supportNestedPredicatePushdown(relation)
-    // `dataFilters` should not include any metadata col filters
+    // `dataFilters` should not include any constant metadata col filters
     // because the metadata struct has been flatted in FileSourceStrategy
-    // and thus metadata col filters are invalid to be pushed down
+    // and thus metadata col filters are invalid to be pushed down. Metadata that is generated
+    // during the scan can be used for filters.
     dataFilters.filterNot(_.references.exists {
-      case FileSourceMetadataAttribute(_) => true
+      case FileSourceConstantMetadataAttribute(_) => true

Review Comment:
   row_index uses FileSourceGeneratedMetadataAttribute. Let me double-check that a test hits this code path.
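
For context, a query of roughly this shape would exercise the path in question, assuming Parquet's `_metadata.row_index` field as available on master at the time; this is an illustrative sketch, not one of the PR's tests:

```scala
// Hypothetical repro in a spark-shell session: a data filter that references
// generated metadata, which this code path must not push down to the source.
val df = spark.read.parquet("/tmp/example")      // assumed example path
  .filter($"_metadata.row_index" > 10)           // filter on generated metadata
  .select($"value", $"_metadata.row_index")      // assumes a `value` data column
df.explain()  // the row_index predicate should not appear in PushedFilters
```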





[GitHub] [spark] olaky commented on a diff in pull request #39314: [SPARK-41791] Add new metadata types

Posted by GitBox <gi...@apache.org>.
olaky commented on code in PR #39314:
URL: https://github.com/apache/spark/pull/39314#discussion_r1060338611


##########
sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala:
##########
@@ -217,7 +219,7 @@ trait FileSourceScanLike extends DataSourceScanExec {
         vectorTypes ++
           // for column-based file format, append metadata column's vector type classes if any
           metadataColumns.map { metadataCol =>
-            if (FileFormat.isConstantMetadataAttr(metadataCol.name)) {
+            if (FileFormat.isConstantMetadataAttr(metadataCol)) {

Review Comment:
   That is an excellent point. I updated this and also renamed the variable to make it clear that we only handle constant metadata here.





[GitHub] [spark] cloud-fan commented on a diff in pull request #39314: [SPARK-41791] Add new metadata types

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on code in PR #39314:
URL: https://github.com/apache/spark/pull/39314#discussion_r1060278135


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/types/StructField.scala:
##########
@@ -161,4 +162,7 @@ case class StructField(
     val nullString = if (nullable) "" else " NOT NULL"
     s"${quoteIfNeeded(name)} ${dataType.sql}${nullString}$getDDLComment"
   }
+
+  def toAttribute: AttributeReference =

Review Comment:
   can we add `private[sql]`? Otherwise this becomes a public API.
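
The change being suggested would look roughly like this; the method body is inferred from context since the diff truncates it (a sketch, not the merged code):

```scala
// Restrict the new helper to Spark-internal code so it does not become
// part of the public StructField API.
private[sql] def toAttribute: AttributeReference =
  AttributeReference(name, dataType, nullable, metadata)()
```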





[GitHub] [spark] cloud-fan commented on a diff in pull request #39314: [SPARK-41791] Add new metadata types

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on code in PR #39314:
URL: https://github.com/apache/spark/pull/39314#discussion_r1060277676


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/namedExpressions.scala:
##########
@@ -492,21 +492,31 @@ object MetadataAttribute {
 
 /**
  * The internal representation of the FileSourceMetadataAttribute, it sets `__metadata_col`
- * and `__file_source_metadata_col` to `true` in AttributeReference's metadata
+ * and `__file_source_metadata_col` to `true` in AttributeReference's metadata.
+ * This is a subtype of [[FileSourceConstantMetadataAttribute]] and
+ * [[FileSourceGeneratedMetadataAttribute]].
+ *
  * - apply() will create a file source metadata attribute reference
- * - unapply() will check if an attribute reference is the file source metadata attribute reference
+ * - unapply() will check if an attribute reference is any file source metadata attribute reference
  */
 object FileSourceMetadataAttribute {
 
   val FILE_SOURCE_METADATA_COL_ATTR_KEY = "__file_source_metadata_col"
 
-  def apply(name: String, dataType: DataType): AttributeReference =
-    // Metadata column for file sources is always not nullable.
-    AttributeReference(name, dataType, nullable = false,
+  /**
+   * Cleanup the internal metadata information of an attribute if it is
+   * a [[FileSourceConstantMetadataAttribute]] or [[FileSourceGeneratedMetadataAttribute]].
+   */
+  def cleanupFileSourceMetadataInformation(attr: Attribute): Attribute =
+    removeInternalMetadata(attr)
+
+  def apply(name: String, dataType: DataType, nullable: Boolean = false): AttributeReference =

Review Comment:
   when do we need this apply method?





[GitHub] [spark] olaky commented on a diff in pull request #39314: [SPARK-41791] Add new metadata types

Posted by GitBox <gi...@apache.org>.
olaky commented on code in PR #39314:
URL: https://github.com/apache/spark/pull/39314#discussion_r1061637682


##########
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileSourceStrategy.scala:
##########
@@ -249,8 +255,26 @@ object FileSourceStrategy extends Strategy with PredicateHelper with Logging {
       // By placing `fileFormatReaderGeneratedMetadataColumns` before `partitionColumns` and
       // `fileConstantMetadataColumns` in the `outputAttributes` we make these row operations
       // simpler and more efficient.
-      val outputAttributes = readDataColumns ++ fileFormatReaderGeneratedMetadataColumns ++
-        partitionColumns ++ fileConstantMetadataColumns
+      val outputAttributes = readDataColumns ++ generatedMetadataColumns ++
+        partitionColumns ++ constantMetadataColumns
+
+      // The attribute references in the filters also have to be categorized as either constant
+      // or generated metadata attributes. Only data filters can contain metadata filters.
+      def categorizeFileSourceMetadataAttributesInFilters(

Review Comment:
   The key problem is actually in PartitioningAwareFileIndex: there we build file-level filters, and when row_index is not removed from the filter list, we end up with an empty file list because the row-index filter rejects every file.
   Let me check whether this was already broken before, but I think it is an important problem to fix.
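
A minimal sketch of the kind of fix described, reusing the `FileSourceGeneratedMetadataAttribute` extractor this PR introduces; the surrounding variable names are assumed:

```scala
import org.apache.spark.sql.catalyst.expressions.FileSourceGeneratedMetadataAttribute

// Generated metadata values (e.g. row_index) only exist once rows are read,
// so no constant per-file value can satisfy such a predicate; keep these
// filters out of file-level pruning to avoid pruning away every file.
val fileLevelFilters = dataFilters.filterNot(_.references.exists {
  case FileSourceGeneratedMetadataAttribute(_) => true
  case _ => false
})
```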





[GitHub] [spark] cloud-fan commented on a diff in pull request #39314: [SPARK-41791] Add new metadata types

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on code in PR #39314:
URL: https://github.com/apache/spark/pull/39314#discussion_r1060278839


##########
sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala:
##########
@@ -217,7 +219,7 @@ trait FileSourceScanLike extends DataSourceScanExec {
         vectorTypes ++
           // for column-based file format, append metadata column's vector type classes if any
           metadataColumns.map { metadataCol =>
-            if (FileFormat.isConstantMetadataAttr(metadataCol.name)) {
+            if (FileFormat.isConstantMetadataAttr(metadataCol)) {

Review Comment:
   do we still need it? `lazy val metadataColumns` is guaranteed to be only constant metadata columns.





[GitHub] [spark] cloud-fan closed pull request #39314: [SPARK-41791] Add new file source metadata column types

Posted by GitBox <gi...@apache.org>.
cloud-fan closed pull request #39314: [SPARK-41791] Add new file source metadata column types
URL: https://github.com/apache/spark/pull/39314




[GitHub] [spark] cloud-fan commented on a diff in pull request #39314: [SPARK-41791] Add new metadata types

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on code in PR #39314:
URL: https://github.com/apache/spark/pull/39314#discussion_r1060276627


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/namedExpressions.scala:
##########
@@ -492,21 +492,31 @@ object MetadataAttribute {
 
 /**
  * The internal representation of the FileSourceMetadataAttribute, it sets `__metadata_col`
- * and `__file_source_metadata_col` to `true` in AttributeReference's metadata
+ * and `__file_source_metadata_col` to `true` in AttributeReference's metadata.
+ * This is a subtype of [[FileSourceConstantMetadataAttribute]] and

Review Comment:
   ```suggestion
    * This is a super type of [[FileSourceConstantMetadataAttribute]] and
   ```





[GitHub] [spark] olaky commented on a diff in pull request #39314: [SPARK-41791] Add new metadata types

Posted by GitBox <gi...@apache.org>.
olaky commented on code in PR #39314:
URL: https://github.com/apache/spark/pull/39314#discussion_r1060880237


##########
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileSourceStrategy.scala:
##########
@@ -249,8 +255,26 @@ object FileSourceStrategy extends Strategy with PredicateHelper with Logging {
       // By placing `fileFormatReaderGeneratedMetadataColumns` before `partitionColumns` and
       // `fileConstantMetadataColumns` in the `outputAttributes` we make these row operations
       // simpler and more efficient.
-      val outputAttributes = readDataColumns ++ fileFormatReaderGeneratedMetadataColumns ++
-        partitionColumns ++ fileConstantMetadataColumns
+      val outputAttributes = readDataColumns ++ generatedMetadataColumns ++
+        partitionColumns ++ constantMetadataColumns
+
+      // The attribute references in the filters also have to be categorized as either constant
+      // or generated metadata attributes. Only data filters can contain metadata filters.
+      def categorizeFileSourceMetadataAttributesInFilters(

Review Comment:
   This was the least intrusive and most generic way I could come up with.
   The problem is that by the time the filters arrive, their attributes are already bound. Whatever changes end up in outputAttributes must also be applied to the filters; otherwise all filter attributes stay as the generic FileSourceMetadataAttribute.
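
A sketch of the mechanism being described, with assumed names: once the scan's output attributes have been re-categorized, the already-bound references inside each data filter are rewritten to match:

```scala
import org.apache.spark.sql.catalyst.expressions.{Attribute, AttributeReference, Expression}

// Replace each bound metadata reference in a filter with its re-categorized
// counterpart, matched by name, so filters see the new attribute types.
def rewriteFilter(filter: Expression, replacements: Map[String, Attribute]): Expression =
  filter.transform {
    case a: AttributeReference if replacements.contains(a.name) => replacements(a.name)
  }
```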





[GitHub] [spark] cloud-fan commented on a diff in pull request #39314: [SPARK-41791] Add new metadata types

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on code in PR #39314:
URL: https://github.com/apache/spark/pull/39314#discussion_r1061563125


##########
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileSourceStrategy.scala:
##########
@@ -249,8 +255,26 @@ object FileSourceStrategy extends Strategy with PredicateHelper with Logging {
       // By placing `fileFormatReaderGeneratedMetadataColumns` before `partitionColumns` and
       // `fileConstantMetadataColumns` in the `outputAttributes` we make these row operations
       // simpler and more efficient.
-      val outputAttributes = readDataColumns ++ fileFormatReaderGeneratedMetadataColumns ++
-        partitionColumns ++ fileConstantMetadataColumns
+      val outputAttributes = readDataColumns ++ generatedMetadataColumns ++
+        partitionColumns ++ constantMetadataColumns
+
+      // The attribute references in the filters also have to be categorized as either constant
+      // or generated metadata attributes. Only data filters can contain metadata filters.
+      def categorizeFileSourceMetadataAttributesInFilters(

Review Comment:
   This looks a bit complicated. Can we keep the previous behavior and not support metadata column filters at all? https://github.com/apache/spark/pull/39314/files#diff-089285f1484c1598cb2839b86b6a9e65b98ab5b30462aedc210fe4bbf44cae78L386
   
   We can add support later in a separate PR.







[GitHub] [spark] olaky commented on a diff in pull request #39314: [SPARK-41791] Add new metadata types

Posted by GitBox <gi...@apache.org>.
olaky commented on code in PR #39314:
URL: https://github.com/apache/spark/pull/39314#discussion_r1060555717


##########
sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala:
##########
@@ -379,11 +381,12 @@ trait FileSourceScanLike extends DataSourceScanExec {
   @transient
   protected lazy val pushedDownFilters = {
     val supportNestedPredicatePushdown = DataSourceUtils.supportNestedPredicatePushdown(relation)
-    // `dataFilters` should not include any metadata col filters
+    // `dataFilters` should not include any constant metadata col filters
     // because the metadata struct has been flatted in FileSourceStrategy
-    // and thus metadata col filters are invalid to be pushed down
+    // and thus metadata col filters are invalid to be pushed down. Metadata that is generated
+    // during the scan can be used for filters.
     dataFilters.filterNot(_.references.exists {
-      case FileSourceMetadataAttribute(_) => true
+      case FileSourceConstantMetadataAttribute(_) => true

Review Comment:
   I actually did end up finding a problem here: the filters reference the attributes before they are transformed and flattened, so they are all of type FileSourceMetadataAttribute. I will have to see how I can fix that.





[GitHub] [spark] cloud-fan commented on a diff in pull request #39314: [SPARK-41791] Add new metadata types

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on code in PR #39314:
URL: https://github.com/apache/spark/pull/39314#discussion_r1060278984


##########
sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala:
##########
@@ -379,11 +381,12 @@ trait FileSourceScanLike extends DataSourceScanExec {
   @transient
   protected lazy val pushedDownFilters = {
     val supportNestedPredicatePushdown = DataSourceUtils.supportNestedPredicatePushdown(relation)
-    // `dataFilters` should not include any metadata col filters
+    // `dataFilters` should not include any constant metadata col filters
     // because the metadata struct has been flatted in FileSourceStrategy
-    // and thus metadata col filters are invalid to be pushed down
+    // and thus metadata col filters are invalid to be pushed down. Metadata that is generated
+    // during the scan can be used for filters.
     dataFilters.filterNot(_.references.exists {
-      case FileSourceMetadataAttribute(_) => true
+      case FileSourceConstantMetadataAttribute(_) => true

Review Comment:
   do we have tests for filters with generated metadata cols?





[GitHub] [spark] AmplabJenkins commented on pull request #39314: [SPARK-41791] Add new metadata types

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on PR #39314:
URL: https://github.com/apache/spark/pull/39314#issuecomment-1367987084

   Can one of the admins verify this patch?




[GitHub] [spark] cloud-fan commented on a diff in pull request #39314: [SPARK-41791] Add new metadata types

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on code in PR #39314:
URL: https://github.com/apache/spark/pull/39314#discussion_r1060279340


##########
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileSourceStrategy.scala:
##########
@@ -208,35 +210,41 @@ object FileSourceStrategy extends Strategy with PredicateHelper with Logging {
       val requiredExpressions: Seq[NamedExpression] = filterAttributes.toSeq ++ projects
       val requiredAttributes = AttributeSet(requiredExpressions)
 
+      val readDataColumns = dataColumns
+        .filter(requiredAttributes.contains)
+        .filterNot(partitionColumns.contains)
+
+      // Metadata attributes are part of a column of type struct up to this point. Here we extract
+      // this column from the schema.
       val metadataStructOpt = l.output.collectFirst {
         case FileSourceMetadataAttribute(attr) => attr
       }
 
-      val metadataColumns = metadataStructOpt.map { metadataStruct =>
-        metadataStruct.dataType.asInstanceOf[StructType].fields.map { field =>
-          FileSourceMetadataAttribute(field.name, field.dataType)
-        }.toSeq
-      }.getOrElse(Seq.empty)
-
-      val fileConstantMetadataColumns: Seq[Attribute] =
-        metadataColumns.filter(_.name != FileFormat.ROW_INDEX)
+      val fileConstantMetadataColumns: mutable.Buffer[Attribute] = mutable.Buffer.empty
+      val fileFormatReaderGeneratedMetadataColumns: mutable.Buffer[Attribute] = mutable.Buffer.empty

Review Comment:
   maybe just name them `constantMetadataColumns` and `generatedMetadataColumns`





[GitHub] [spark] cloud-fan commented on a diff in pull request #39314: [SPARK-41791] Add new metadata types

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on code in PR #39314:
URL: https://github.com/apache/spark/pull/39314#discussion_r1060280243


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/types/StructField.scala:
##########
@@ -161,4 +162,7 @@ case class StructField(
     val nullString = if (nullable) "" else " NOT NULL"
     s"${quoteIfNeeded(name)} ${dataType.sql}${nullString}$getDDLComment"
   }
+
+  def toAttribute: AttributeReference =

Review Comment:
   We can also simplify `StructType.toAttributes` with this.
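
Presumably the simplification in mind is along these lines (a sketch; visibility and exact form assumed):

```scala
// In StructType: delegate per-field attribute construction to the new helper.
private[sql] def toAttributes: Seq[AttributeReference] = fields.toSeq.map(_.toAttribute)
```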





[GitHub] [spark] cloud-fan commented on pull request #39314: [SPARK-41791] Add new metadata types

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on PR #39314:
URL: https://github.com/apache/spark/pull/39314#issuecomment-1372108566

   thanks, merging to master!

