You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2021/01/26 20:19:48 UTC

[GitHub] [arrow] maxburke opened a new pull request #9331: [Rust] Propagate nulls if all results are null.

maxburke opened a new pull request #9331:
URL: https://github.com/apache/arrow/pull/9331


   Noticing that if one of the expressions was entirely null, the result would be a NullArray and that the downcast_ref would result in a panic because it wouldn't match the column type.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jorgecarleitao commented on pull request #9331: ARROW-11407: [Rust] Propagate nulls if all results are null.

Posted by GitBox <gi...@apache.org>.
jorgecarleitao commented on pull request #9331:
URL: https://github.com/apache/arrow/pull/9331#issuecomment-775368752


   My current hypothesis is that this due to the field `ul_observation_date`, that is an `int96`, which Rust does not natively support. We may be pushing a `NullArray` to `ul_observation_date` instead of raising unsupported or casting it to something else.
   
   @nevi-me , any idea of what we are doing with `i96`?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] nevi-me closed pull request #9331: ARROW-11407: [Rust] Propagate nulls if all results are null.

Posted by GitBox <gi...@apache.org>.
nevi-me closed pull request #9331:
URL: https://github.com/apache/arrow/pull/9331


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] maxburke commented on pull request #9331: ARROW-11407: [Rust] Propagate nulls if all results are null.

Posted by GitBox <gi...@apache.org>.
maxburke commented on pull request #9331:
URL: https://github.com/apache/arrow/pull/9331#issuecomment-768799244


   Hi Jorge,
   
   I'll see if I can find something. Unfortunately the one example I was testing with has data we're not allowed to generally release.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] nevi-me commented on pull request #9331: ARROW-11407: [Rust] Propagate nulls if all results are null.

Posted by GitBox <gi...@apache.org>.
nevi-me commented on pull request #9331:
URL: https://github.com/apache/arrow/pull/9331#issuecomment-775402100


   > @nevi-me , any idea of what we are doing with `i96`?
   
   @jorgecarleitao int96 gets read into a ts[nano], so I wouldn't expect different behaviour to occur on it. 
   I'm not very familiar with the Datafusion crate as I haven't been following changes, but I'll try reproduce this on master with a parquet file that uses a filter that returns nothing back.
   
   @maxburke may you please kindly rebase the PR to resolve the merge conflict. I'll start looking at this in the mean time.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] maxburke edited a comment on pull request #9331: ARROW-11407: [Rust] Propagate nulls if all results are null.

Posted by GitBox <gi...@apache.org>.
maxburke edited a comment on pull request #9331:
URL: https://github.com/apache/arrow/pull/9331#issuecomment-770552255


   @jorgecarleitao Here's the schema:
   
   ```
   >>> t = pq.ParquetFile('problem.parquet')
   >>> t
   <pyarrow.parquet.ParquetFile object at 0x10f345f40>
   >>> t.schema
   <pyarrow._parquet.ParquetSchema object at 0x10f350a40>
   required group field_id=0 schema {
     optional int96 field_id=1 ul_observation_date;
     optional binary field_id=2 document;
     optional binary field_id=3 report_id (String);
     optional binary field_id=4 type (String);
     optional binary field_id=5 name (String);
     optional binary field_id=6 year (String);
     optional binary field_id=7 vendor (String);
     optional binary field_id=8 direction (String);
     optional binary field_id=9 geometry (String);
     optional binary field_id=10 ul_node_id (String);
     optional binary field_id=11 classification (String);
     optional double field_id=12 posted_speed;
     optional int64 field_id=13 __index_level_0__;
   }
   ```
   
   The query: 
   
   ```
   SELECT "ul_node_id", "ul_observation_date", "vendor" FROM parquet_table WHERE (("vendor" = 'foo') AND "ul_observation_date" >= to_timestamp('2020-09-01T14:30:00.000Z') AND "ul_observation_date" <= to_timestamp('2020-09-04T00:00:00.000Z')) ORDER BY ul_node_id ASC NULLS LAST;
   ```
   
   (in this file there are no rows where vendor = 'foo')
   
   The error message received is:
   
   ```
   thread 'tokio-runtime-worker' panicked at 'compute_op failed to downcast array', datafusion/src/physical_plan/expressions.rs:1522:31
   ```


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] maxburke edited a comment on pull request #9331: ARROW-11407: [Rust] Propagate nulls if all results are null.

Posted by GitBox <gi...@apache.org>.
maxburke edited a comment on pull request #9331:
URL: https://github.com/apache/arrow/pull/9331#issuecomment-770552255


   @jorgecarleitao Here's the schema:
   
   ```
   >>> t = pq.ParquetFile('problem.parquet')
   >>> t
   <pyarrow.parquet.ParquetFile object at 0x10f345f40>
   >>> t.schema
   <pyarrow._parquet.ParquetSchema object at 0x10f350a40>
   required group field_id=0 schema {
     optional int96 field_id=1 ul_observation_date;
     optional binary field_id=2 document;
     optional binary field_id=3 report_id (String);
     optional binary field_id=4 type (String);
     optional binary field_id=5 name (String);
     optional binary field_id=6 year (String);
     optional binary field_id=7 vendor (String);
     optional binary field_id=8 direction (String);
     optional binary field_id=9 geometry (String);
     optional binary field_id=10 ul_node_id (String);
     optional binary field_id=11 classification (String);
     optional double field_id=12 posted_speed;
     optional int64 field_id=13 __index_level_0__;
   }
   ```
   
   The query: 
   
   `SELECT "ul_node_id", "ul_observation_date", "vendor" FROM parquet_table WHERE (("vendor" = 'foo') AND "ul_observation_date" >= to_timestamp('2020-09-01T14:30:00.000Z') AND "ul_observation_date" <= to_timestamp('2020-09-04T00:00:00.000Z')) ORDER BY ul_node_id ASC NULLS LAST;`
   
   (in this file there are no rows where vendor = 'foo')
   
   The error message received is:
   
   ```
   thread 'tokio-runtime-worker' panicked at 'compute_op failed to downcast array', datafusion/src/physical_plan/expressions.rs:1522:31
   ```


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jorgecarleitao commented on pull request #9331: ARROW-11407: [Rust] Propagate nulls if all results are null.

Posted by GitBox <gi...@apache.org>.
jorgecarleitao commented on pull request #9331:
URL: https://github.com/apache/arrow/pull/9331#issuecomment-768764568


   Interesting. I would expect that when an expression is all nulls, it would still be an array of the type declared on the `DataType` and not a `NullArray`. Do you have an example where we get a `NullArray` but its `DataType` is not `DataType::Null`? 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] github-actions[bot] commented on pull request #9331: ARROW-11407: [Rust] Propagate nulls if all results are null.

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on pull request #9331:
URL: https://github.com/apache/arrow/pull/9331#issuecomment-768640533


   https://issues.apache.org/jira/browse/ARROW-11407


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] nevi-me commented on pull request #9331: ARROW-11407: [Rust] Propagate nulls if all results are null.

Posted by GitBox <gi...@apache.org>.
nevi-me commented on pull request #9331:
URL: https://github.com/apache/arrow/pull/9331#issuecomment-775467510


   I think I've found the issue, it's with the way we handle statistics. We create a null array whenever 0 results are returned from predicate filtering. https://github.com/apache/arrow/blob/master/rust/datafusion/src/physical_plan/parquet.rs#L681
   
   Not all data types are supported, which would explain the int96 being converted to a `NullArray`.
   
   The solution is to extend the check for `ParquetStatistics::ByteArray` beyond just the `Utf8` Arrow type (https://github.com/apache/arrow/blob/master/rust/datafusion/src/physical_plan/parquet.rs#L676). It's probably safe to create an empty array of whatever Arrow type, because casting to a null array might be unsound.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] nevi-me commented on pull request #9331: ARROW-11407: [Rust] Propagate nulls if all results are null.

Posted by GitBox <gi...@apache.org>.
nevi-me commented on pull request #9331:
URL: https://github.com/apache/arrow/pull/9331#issuecomment-777682331


   Hi @maxburke, I'm closing this as it's going to be addressed by https://github.com/apache/arrow/pull/9469. I'll check with you after it's merged, to confirm if the behaviour no longer persists.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] alamb commented on pull request #9331: ARROW-11407: [Rust] Propagate nulls if all results are null.

Posted by GitBox <gi...@apache.org>.
alamb commented on pull request #9331:
URL: https://github.com/apache/arrow/pull/9331#issuecomment-769773389


   I was just going to ask for a test on this PR and I see that @jorgecarleitao  already did so!


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] maxburke commented on pull request #9331: ARROW-11407: [Rust] Propagate nulls if all results are null.

Posted by GitBox <gi...@apache.org>.
maxburke commented on pull request #9331:
URL: https://github.com/apache/arrow/pull/9331#issuecomment-770552255


   @jorgecarleitao Here's the schema:
   
   ```
   >>> t = pq.ParquetFile('problem.parquet')
   >>> t
   <pyarrow.parquet.ParquetFile object at 0x10f345f40>
   >>> t.schema
   <pyarrow._parquet.ParquetSchema object at 0x10f350a40>
   required group field_id=0 schema {
     optional int96 field_id=1 ul_observation_date;
     optional binary field_id=2 document;
     optional binary field_id=3 report_id (String);
     optional binary field_id=4 type (String);
     optional binary field_id=5 name (String);
     optional binary field_id=6 year (String);
     optional binary field_id=7 vendor (String);
     optional binary field_id=8 direction (String);
     optional binary field_id=9 geometry (String);
     optional binary field_id=10 ul_node_id (String);
     optional binary field_id=11 classification (String);
     optional double field_id=12 posted_speed;
     optional int64 field_id=13 __index_level_0__;
   }```
   
   The query: 
   
   ```
   SELECT "ul_node_id", "ul_observation_date", "vendor" FROM parquet_table WHERE (("vendor" = 'foo') AND "ul_observation_date" >= to_timestamp('2020-09-01T14:30:00.000Z') AND "ul_observation_date" <= to_timestamp('2020-09-04T00:00:00.000Z')) ORDER BY ul_node_id ASC NULLS LAST;
   ```
   
   (in this file there are no rows where vendor = 'foo')
   
   The error message received is:
   
   ```
   thread 'tokio-runtime-worker' panicked at 'compute_op failed to downcast array', datafusion/src/physical_plan/expressions.rs:1522:31
   ```


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] github-actions[bot] commented on pull request #9331: [Rust] Propagate nulls if all results are null.

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on pull request #9331:
URL: https://github.com/apache/arrow/pull/9331#issuecomment-767803111


   <!--
     Licensed to the Apache Software Foundation (ASF) under one
     or more contributor license agreements.  See the NOTICE file
     distributed with this work for additional information
     regarding copyright ownership.  The ASF licenses this file
     to you under the Apache License, Version 2.0 (the
     "License"); you may not use this file except in compliance
     with the License.  You may obtain a copy of the License at
   
       http://www.apache.org/licenses/LICENSE-2.0
   
     Unless required by applicable law or agreed to in writing,
     software distributed under the License is distributed on an
     "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
     KIND, either express or implied.  See the License for the
     specific language governing permissions and limitations
     under the License.
   -->
   
   Thanks for opening a pull request!
   
   Could you open an issue for this pull request on JIRA?
   https://issues.apache.org/jira/browse/ARROW
   
   Then could you also rename pull request title in the following format?
   
       ARROW-${JIRA_ID}: [${COMPONENT}] ${SUMMARY}
   
   See also:
   
     * [Other pull requests](https://github.com/apache/arrow/pulls/)
     * [Contribution Guidelines - How to contribute patches](https://arrow.apache.org/docs/developers/contributing.html#how-to-contribute-patches)
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jorgecarleitao commented on pull request #9331: ARROW-11407: [Rust] Propagate nulls if all results are null.

Posted by GitBox <gi...@apache.org>.
jorgecarleitao commented on pull request #9331:
URL: https://github.com/apache/arrow/pull/9331#issuecomment-768803567


   Thanks a lot for digging into it, @maxburke . fyi, there is no need to share the data. For this exercise we only need:
   
   * the input schema
   * the SQL statement
   * the column name where that appeared
   
   The invariant is:
   
   > the downcast of `&dyn Array` to a concrete type `T: Array` is infalible iff `T` is the corresponding `struct` for the `DataType` of the associated `Field`.
   
   For example,
   
   ```rust 
   let field = record_batch.schema().field(0);
   let array = record_batch.column(0);
   
   let meta_is_string = field.data_type() == DataType::Utf8;
   let data_is_string = array.as_any().downcast_ref::<StringArray>().is_some();
   
   assert_eq!(meta_is_string, data_is_string)
   ```
   
   and the above holds for every pair `(DataType, T: Array)`.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] maxburke commented on pull request #9331: ARROW-11407: [Rust] Propagate nulls if all results are null.

Posted by GitBox <gi...@apache.org>.
maxburke commented on pull request #9331:
URL: https://github.com/apache/arrow/pull/9331#issuecomment-775355582


   Any updates here?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org