You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2021/12/02 18:39:48 UTC

[GitHub] [arrow-datafusion] andrei-ionescu opened a new pull request #1392: Fix index out of bounds for stats on nested fields

andrei-ionescu opened a new pull request #1392:
URL: https://github.com/apache/arrow-datafusion/pull/1392


   # Which issue does this PR close?
   
   Closes #1383
   
    # Rationale for this change
   
   This is a step in supporting nested fields in data fusion being read from parquet in 
   
   # What changes are included in this PR?
   
   This adds a helper method to get the flattened schema. This is similar to how the columns are stored in the parquet metadata section.
   
   # Are there any user-facing changes?
   
   No.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] andrei-ionescu edited a comment on pull request #1392: Fix index out of bounds for stats on nested fields

Posted by GitBox <gi...@apache.org>.
andrei-ionescu edited a comment on pull request #1392:
URL: https://github.com/apache/arrow-datafusion/pull/1392#issuecomment-985775205


   @houqp After more debugging and fixing different things I found that the physical plan lacks the nested fields support. 
   
   I got into this error: 
   ```
   Error: ArrowError(SchemaError("Unexpected batch schema from file, expected 36 cols but got 6"))
   ```
   
   And this error is happening in these lines of code: 
   https://github.com/apache/arrow-datafusion/blob/069449f754c92765a2d0dbbf62cd7bac76257892/datafusion/src/physical_plan/file_format/mod.rs#L223-L229
   
   The chunk of data that has been read has only 6 columns (`file_batch.columns().len()`) while the expected number of columns (`expected_cols `) is 36.
   
   The root cause seems to be the way parquet files are read vs how it gets projected. It reads one top nested column at a time, while it tries to project that chunk of data over the full schema. For example, in the case of the `nested_struct.rust.parquet` it reads the first column with 6 leaves and then tries to project that over all 36 top columns of that parquet file. This is root cause of the error above.
   
   It seems that DataFusion lacks the support for nested fields, at least when using the parquet data source.
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] alamb commented on pull request #1392: Fix index out of bounds for stats on nested fields

Posted by GitBox <gi...@apache.org>.
alamb commented on pull request #1392:
URL: https://github.com/apache/arrow-datafusion/pull/1392#issuecomment-985463113


   Re clippy failures, @xudong963  is working on them in this PR https://github.com/apache/arrow-datafusion/pull/1395 -- once that gets merged we should be able to merge this PR with master and clippy would pass


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] andrei-ionescu commented on pull request #1392: Fix index out of bounds for stats on nested fields

Posted by GitBox <gi...@apache.org>.
andrei-ionescu commented on pull request #1392:
URL: https://github.com/apache/arrow-datafusion/pull/1392#issuecomment-1046706994


   @alamb Please re-open this PR. It is not stale just waiting for the https://github.com/apache/arrow/pull/11704 to get merged and then rebase the work on that.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] andrei-ionescu edited a comment on pull request #1392: Fix index out of bounds for stats on nested fields

Posted by GitBox <gi...@apache.org>.
andrei-ionescu edited a comment on pull request #1392:
URL: https://github.com/apache/arrow-datafusion/pull/1392#issuecomment-985775205


   @houqp After more debugging and fixing different things I found that the physical plan lacks the nested fields support. 
   
   I got into this error: 
   ```
   Error: ArrowError(SchemaError("Unexpected batch schema from file, expected 36 cols but got 6"))
   ```
   
   And this error is happening in these lines of code: 
   https://github.com/apache/arrow-datafusion/blob/069449f754c92765a2d0dbbf62cd7bac76257892/datafusion/src/physical_plan/file_format/mod.rs#L223-L229
   
   The chunk of data that has been read has only 6 columns (`file_batch.columns().len()`) while the expected number of columns (`expected_cols`) is 36.
   
   The root cause seems to be the way parquet files are read vs how it gets projected. It reads one top nested column at a time, while it tries to project that chunk of data over the full schema. For example, in the case of the `nested_struct.rust.parquet` it reads the first column with 6 leaves and then tries to project that over all 36 top columns of that parquet file. This is root cause of the error above.
   
   It seems that DataFusion lacks the support for nested fields, at least when using the parquet data source.
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] andrei-ionescu edited a comment on pull request #1392: Fix index out of bounds for stats on nested fields

Posted by GitBox <gi...@apache.org>.
andrei-ionescu edited a comment on pull request #1392:
URL: https://github.com/apache/arrow-datafusion/pull/1392#issuecomment-985713034


   @houqp This what I'm seeing - and please correct me if I'm wrong - there is no support in reading nested parquet files in DataFusion.
   
   I've tried to use the [`nested_struct.rust.parquet`](https://github.com/apache/parquet-testing/blob/master/data/nested_structs.rust.parquet) file that is present in the `parquet-testing` git submodule for some DataFusion tests and is not working. And I couldn't find any other tests with nested fields in parquet in DataFusion.
   
   With all these being said, what's the strategy on supporting nested source structures in DataFusion?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] andrei-ionescu commented on pull request #1392: Fix index out of bounds for stats on nested fields

Posted by GitBox <gi...@apache.org>.
andrei-ionescu commented on pull request #1392:
URL: https://github.com/apache/arrow-datafusion/pull/1392#issuecomment-984925577


   There is only the "Clippy" task that did fail. I think that is not related to my changes. Can someone check it out, please?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] houqp commented on pull request #1392: Fix index out of bounds for stats on nested fields

Posted by GitBox <gi...@apache.org>.
houqp commented on pull request #1392:
URL: https://github.com/apache/arrow-datafusion/pull/1392#issuecomment-985333246


   Thanks @andrei-ionescu for taking a stab at this. I don't think we can just flatting all the columns like how it's handled in parquet because the rest of the datafusion code base assumes arrow's hierarchical column structure. As a result, column indexes only map to top level columns. I suspect this approach will not cause runtime crashes because we are only expanding the column vector, but it could lead to incorrect read of column stats in places like https://github.com/apache/arrow-datafusion/blob/069449f754c92765a2d0dbbf62cd7bac76257892/datafusion/src/physical_optimizer/aggregate_statistics.rs#L181


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] andrei-ionescu commented on pull request #1392: Fix index out of bounds for stats on nested fields

Posted by GitBox <gi...@apache.org>.
andrei-ionescu commented on pull request #1392:
URL: https://github.com/apache/arrow-datafusion/pull/1392#issuecomment-985775205


   @houqp After more debugging and fixing different things I found that the physical plan lacks the nested fields support. 
   
   I got into this error: 
   ```
   Error: ArrowError(SchemaError("Unexpected batch schema from file, expected 36 cols but got 6"))
   ```
   
   And this error is happening in these lines of code: [physical_plan/file_format/mod.rs#L223-L229](https://github.com/apache/arrow-datafusion/blob/master/datafusion/src/physical_plan/file_format/mod.rs#L223-L229). The chunk of data that has been read has only 6 columns while the expected number of columns is 36.
   
   The root cause seems to be the way parquet files are read vs how it gets projected. It reads one top nested column at a time, while it tries to project that chunk of data over the full schema. For example, in the case of the `nested_struct.rust.parquet` it reads the first column with 6 leaves and then tries to project that over all 36 top columns of that parquet file. This is root cause of the error above.
   
   It seems that DataFusion lacks the support for nested fields, at least when using the parquet data source.
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] andrei-ionescu commented on pull request #1392: Fix index out of bounds for stats on nested fields

Posted by GitBox <gi...@apache.org>.
andrei-ionescu commented on pull request #1392:
URL: https://github.com/apache/arrow-datafusion/pull/1392#issuecomment-1059444264


   Next week I'll take another stab at it. First I need to check the issue that I've seen before and refactor the PR changes if that still applies.
   
   I'll let you know.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] andrei-ionescu edited a comment on pull request #1392: Fix index out of bounds for stats on nested fields

Posted by GitBox <gi...@apache.org>.
andrei-ionescu edited a comment on pull request #1392:
URL: https://github.com/apache/arrow-datafusion/pull/1392#issuecomment-985713034






-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] andrei-ionescu commented on a change in pull request #1392: Fix index out of bounds for stats on nested fields

Posted by GitBox <gi...@apache.org>.
andrei-ionescu commented on a change in pull request #1392:
URL: https://github.com/apache/arrow-datafusion/pull/1392#discussion_r762176794



##########
File path: datafusion/src/datasource/mod.rs
##########
@@ -198,3 +200,27 @@ fn get_col_stats(
         })
         .collect()
 }
+
+fn flatten_schema(schema: &Schema) -> Vec<&Field> {

Review comment:
       @Dandandan Simplifying it is not a solution as we need to read/calculate the statistics for the nested fields too. We will be using the flattened fields when calling `summarize_min_max` function.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] alamb closed pull request #1392: Fix index out of bounds for stats on nested fields

Posted by GitBox <gi...@apache.org>.
alamb closed pull request #1392:
URL: https://github.com/apache/arrow-datafusion/pull/1392


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] alamb commented on pull request #1392: Fix index out of bounds for stats on nested fields

Posted by GitBox <gi...@apache.org>.
alamb commented on pull request #1392:
URL: https://github.com/apache/arrow-datafusion/pull/1392#issuecomment-1040676345


   Closing stale PRs. Please reopen (or open a new one) if you plan to keep working on this feature. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] alamb commented on pull request #1392: Fix index out of bounds for stats on nested fields

Posted by GitBox <gi...@apache.org>.
alamb commented on pull request #1392:
URL: https://github.com/apache/arrow-datafusion/pull/1392#issuecomment-1026192015


   Marking PRs without activity in the last month as stale. I'll plan to close it in another month or so without activity, though feel free to reopen it when you have time to work on it)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] matthewmturner commented on pull request #1392: Fix index out of bounds for stats on nested fields

Posted by GitBox <gi...@apache.org>.
matthewmturner commented on pull request #1392:
URL: https://github.com/apache/arrow-datafusion/pull/1392#issuecomment-1059435821


   @andrei-ionescu it looks like the issue you referenced waiting on was closed.  with that in mind, do you have an idea how youd like to manage this PR now?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] houqp commented on pull request #1392: Fix index out of bounds for stats on nested fields

Posted by GitBox <gi...@apache.org>.
houqp commented on pull request #1392:
URL: https://github.com/apache/arrow-datafusion/pull/1392#issuecomment-989633614


   Sorry for the late reply, @andrei-ionescu the problem you are getting is basically caused by the problem I mentioned in https://github.com/apache/arrow-datafusion/pull/1392#issuecomment-985333246. Fundamentally, it's due to differences between how nested struct fields are handled in Arrow and Parquet.
   
   @lst-codes managing stats in a nested data structure could fix the problem. However, being inspired by https://github.com/apache/arrow/pull/11704, I think it would be more efficient to resolve the nested column key path during planning by traversing the `Expr::GetIndexedField` expression , then only load corresponding parquet column stats into memory. This way, we can skip columns that are not accessed.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] lst-codes commented on pull request #1392: Fix index out of bounds for stats on nested fields

Posted by GitBox <gi...@apache.org>.
lst-codes commented on pull request #1392:
URL: https://github.com/apache/arrow-datafusion/pull/1392#issuecomment-989029092


   > Thanks @andrei-ionescu for taking a stab at this. I don't think we can just flatting all the columns like how it's handled in parquet because the rest of the datafusion code base assumes arrow's hierarchical column structure. As a result, column indexes only map to top level columns. I suspect this approach will not cause runtime crashes because we are only expanding the column vector, but it could lead to incorrect read of column stats in places like
   > 
   > https://github.com/apache/arrow-datafusion/blob/069449f754c92765a2d0dbbf62cd7bac76257892/datafusion/src/physical_optimizer/aggregate_statistics.rs#L181
   
   @houqp 
   
   Do you think column statistics should include something similar to `Field#fields()`, which returns a list of subfields for nested structures? 
   
   https://github.com/apache/arrow-rs/blob/master/arrow/src/datatypes/field.rs#L110-L125


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] andrei-ionescu edited a comment on pull request #1392: Fix index out of bounds for stats on nested fields

Posted by GitBox <gi...@apache.org>.
andrei-ionescu edited a comment on pull request #1392:
URL: https://github.com/apache/arrow-datafusion/pull/1392#issuecomment-985713034


   @houqp This what I'm seeing - and please correct me if I'm wrong - **there is no support in reading nested parquet files in DataFusion.**
   
   I've tried to use the [`nested_struct.rust.parquet`](https://github.com/apache/parquet-testing/blob/master/data/nested_structs.rust.parquet) file that is present in the `parquet-testing` git submodule for some DataFusion tests and is not working. And I couldn't find any other tests with nested fields in parquet in DataFusion.
   
   With all these being said, what's the strategy on supporting nested source structures in DataFusion?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] andrei-ionescu commented on pull request #1392: Fix index out of bounds for stats on nested fields

Posted by GitBox <gi...@apache.org>.
andrei-ionescu commented on pull request #1392:
URL: https://github.com/apache/arrow-datafusion/pull/1392#issuecomment-985713034


   @houqp This what I'm seeing - and please correct me if I'm wrong - there is no support in reading nested parquet files in DataFusion.
   
   I've tried to use the `nested_struct.parquet` file that is present in the `parquet-testing` git submodule for some DataFusion tests and is not working. And I couldn't find any other tests with nested fields in parquet in DataFusion.
   
   With all these being said, what's the strategy on supporting nested source structures in DataFusion?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] alamb commented on pull request #1392: Fix index out of bounds for stats on nested fields

Posted by GitBox <gi...@apache.org>.
alamb commented on pull request #1392:
URL: https://github.com/apache/arrow-datafusion/pull/1392#issuecomment-985068068


   > There is only the "Clippy" task that did fail. I think that is not related to my changes. Can someone check it out, please?
   
   
   I suspect that is due to the new Rust version that is released and clippy got a little more finicky -- we just need to fix it on master. I'll handle it tomorrow, unless someone beats me to it


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] houqp edited a comment on pull request #1392: Fix index out of bounds for stats on nested fields

Posted by GitBox <gi...@apache.org>.
houqp edited a comment on pull request #1392:
URL: https://github.com/apache/arrow-datafusion/pull/1392#issuecomment-989633614


   Sorry for the late reply, @andrei-ionescu the problem you are getting is basically caused by the problem I mentioned in https://github.com/apache/arrow-datafusion/pull/1392#issuecomment-985333246. Fundamentally, it's due to differences between how nested struct fields are handled in Arrow and Parquet.
   
   @lst-codes managing stats in a nested data structure could fix the problem. However, being inspired by https://github.com/apache/arrow/pull/11704, I think it would be more efficient to resolve the nested column key path during planning by traversing the `Expr::GetIndexedField` expression , then only load corresponding parquet column stats into memory. This way, we can skip columns that are not accessed by the query.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] andrei-ionescu commented on a change in pull request #1392: Fix index out of bounds for stats on nested fields

Posted by GitBox <gi...@apache.org>.
andrei-ionescu commented on a change in pull request #1392:
URL: https://github.com/apache/arrow-datafusion/pull/1392#discussion_r762041024



##########
File path: datafusion/src/datasource/mod.rs
##########
@@ -198,3 +200,27 @@ fn get_col_stats(
         })
         .collect()
 }
+
+fn flatten_schema(schema: &Schema) -> Vec<&Field> {

Review comment:
       I'll try simplifying it...




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] andrei-ionescu commented on a change in pull request #1392: Fix index out of bounds for stats on nested fields

Posted by GitBox <gi...@apache.org>.
andrei-ionescu commented on a change in pull request #1392:
URL: https://github.com/apache/arrow-datafusion/pull/1392#discussion_r762202525



##########
File path: datafusion/src/datasource/mod.rs
##########
@@ -198,3 +200,27 @@ fn get_col_stats(
         })
         .collect()
 }
+
+fn flatten_schema(schema: &Schema) -> Vec<&Field> {

Review comment:
       @Dandandan I added the `total_number_of_fields` function besides the flatten one.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] Dandandan commented on a change in pull request #1392: Fix index out of bounds for stats on nested fields

Posted by GitBox <gi...@apache.org>.
Dandandan commented on a change in pull request #1392:
URL: https://github.com/apache/arrow-datafusion/pull/1392#discussion_r761880357



##########
File path: datafusion/src/datasource/mod.rs
##########
@@ -198,3 +200,27 @@ fn get_col_stats(
         })
         .collect()
 }
+
+fn flatten_schema(schema: &Schema) -> Vec<&Field> {

Review comment:
       In the current uses, it seems we only need the total number of fields (i.e. return a `usize`) instead of the field information.
   
   It seems to me that would be a bit more simple / faster than first collecting all the fields if we don't need the results of that.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org