You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2021/08/30 13:37:12 UTC

[GitHub] [arrow-datafusion] andygrove opened a new issue #957: Inconsistent cast behavior

andygrove opened a new issue #957:
URL: https://github.com/apache/arrow-datafusion/issues/957


   **Describe the bug**
   
   I have a string column containing floating-point values but I want to cast this to an integer using `CAST(col AS int)` but this fails with `Cast error: Cannot cast string '1.20' to value of arrow::datatypes::types::Int32Type type`. However, we support casting floats to ints and I can do `CAST(CAST(col AS float) AS int)` to work around this but if I can do that, why can I not just cast to int directly? 
   
   Is this the intended behavior?
   
   **To Reproduce**
   
   ```bash
   echo "1.2" > foo.csv
   ```
   
   CLI:
   
   ```sql
   > CREATE EXTERNAL TABLE foo (bar VARCHAR) STORED AS CSV LOCATION 'foo.csv';
   0 rows in set. Query took 0.001 seconds.
   
   > SELECT bar FROM foo;
   +-----+
   | bar |
   +-----+
   | 1.2 |
   +-----+
   1 row in set. Query took 0.018 seconds.
   
   > SELECT CAST(bar AS int) FROM foo;
   ArrowError(ExternalError(ArrowError(CastError("Cannot cast string '1.2' to value of arrow::datatypes::types::Int32Type type"))))
   
   > SELECT CAST(CAST(bar as float) as int) FROM foo;
   +-------------------------------------+
   | CAST(CAST(bar AS Float64) AS Int32) |
   +-------------------------------------+
   | 1                                   |
   +-------------------------------------+
   1 row in set. Query took 0.017 seconds.
   ```
   
   **Expected behavior**
   I would expect the cast to work. If there is working as intended then it would be good to document the behavior in the user guide.
   
   **Additional context**
   None
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] j-a-m-l edited a comment on issue #957: Inconsistent cast behavior

Posted by GitBox <gi...@apache.org>.
j-a-m-l edited a comment on issue #957:
URL: https://github.com/apache/arrow-datafusion/issues/957#issuecomment-945422583


   @alamb That code declares which type coercions are available, but, maybe I'm missing something, I don't see any rule that states which numeric type should be chosen to coerce strings or the actual code to convert them from utf8 to numeric.
   
   I'm also interested in casting strings to decimal, but I don't see anything relevant.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] alamb commented on issue #957: Inconsistent cast behavior

Posted by GitBox <gi...@apache.org>.
alamb commented on issue #957:
URL: https://github.com/apache/arrow-datafusion/issues/957#issuecomment-945092798


   @j-a-m-l  -- I think https://github.com/apache/arrow-datafusion/issues/957#issuecomment-932727063 has a pointer to the code that is used to convert utf8 to numeric


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] j-a-m-l commented on issue #957: Inconsistent cast behavior

Posted by GitBox <gi...@apache.org>.
j-a-m-l commented on issue #957:
URL: https://github.com/apache/arrow-datafusion/issues/957#issuecomment-946430458


   @alamb Yes, I also think that the real issue is that one.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] j-a-m-l commented on issue #957: Inconsistent cast behavior

Posted by GitBox <gi...@apache.org>.
j-a-m-l commented on issue #957:
URL: https://github.com/apache/arrow-datafusion/issues/957#issuecomment-945422583


   @alamb That code delares which type coercions are available, but, maybe I'm missing something, I don't see any rule that states which numeric type should be chosen to coerce strings or the actual code to convert them from utf8 to numeric.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] alamb commented on issue #957: Inconsistent cast behavior

Posted by GitBox <gi...@apache.org>.
alamb commented on issue #957:
URL: https://github.com/apache/arrow-datafusion/issues/957#issuecomment-946173074


   > That code declares which type coercions are available, but, maybe I'm missing something, I don't see any rule that states which numeric type should be chosen to coerce strings or the actual code to convert them from utf8 to numeric.
   
   The arrow cast kernel https://docs.rs/arrow/6.0.0/arrow/compute/kernels/cast/index.html contains the code to covert `utf8` to many differnet types -- see https://github.com/apache/arrow-rs/blob/master/arrow/src/compute/kernels/cast.rs 
   
   I think the issue that is described on this bug is that by desigh DF won't auto cast (coerce) in a way that loses data (so it won't coerce  `1.2` to an int because it loses precision (aka loses the .2)).
   
   However, if the user explicitly says `CAST(col as INT)` the idea is that they are specifying the behavior and thus DataFusion should truncate them correctly


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] j-a-m-l commented on issue #957: Inconsistent cast behavior

Posted by GitBox <gi...@apache.org>.
j-a-m-l commented on issue #957:
URL: https://github.com/apache/arrow-datafusion/issues/957#issuecomment-944962026


   I think that there are 2 different problems here:
   
   1. The value is cast as float by default.
   ```echo "1.2" > foo.csv``` writes a file with `1.2`, not `"1.2"`.
   
   I've tried this script:
   ```rust
   use datafusion::{error::Result, prelude::*};
   
   #[tokio::main]
   async fn main() -> Result<()> {
       let filename = "foo.csv";
   
       let options = CsvReadOptions::new().has_header(false);
   
       let mut ctx = ExecutionContext::new();
       let csv_df = ctx.read_csv(filename, options).await?;
   
       println!("{:?}", csv_df.schema());
       Ok(())
   }
   ```
   
   The result is:
   ```DFSchema { fields: [DFField { qualifier: Some("foo.csv"), field: Field { name: "column_1", data_type: Float64, nullable: false, dict_id: 0, dict_is_ordered: false, metadata: None } }] }```
   
   2.  I'm not completely familiar with this codebase, but I haven't seen yet anything to convert from Utf8 to numeric types.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] alamb commented on issue #957: Inconsistent cast behavior

Posted by GitBox <gi...@apache.org>.
alamb commented on issue #957:
URL: https://github.com/apache/arrow-datafusion/issues/957#issuecomment-932727063


   I think this would involve writing some tests and updating the coercion rules in https://github.com/apache/arrow-datafusion/blob/master/datafusion/src/physical_plan/expressions/coercion.rs


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] j-a-m-l edited a comment on issue #957: Inconsistent cast behavior

Posted by GitBox <gi...@apache.org>.
j-a-m-l edited a comment on issue #957:
URL: https://github.com/apache/arrow-datafusion/issues/957#issuecomment-945422583


   @alamb That code delares which type coercions are available, but, maybe I'm missing something, I don't see any rule that states which numeric type should be chosen to coerce strings or the actual code to convert them from utf8 to numeric.
   
   I'm also interested in casting strings to decimal, but I don't see anything relevant.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] j-a-m-l edited a comment on issue #957: Inconsistent cast behavior

Posted by GitBox <gi...@apache.org>.
j-a-m-l edited a comment on issue #957:
URL: https://github.com/apache/arrow-datafusion/issues/957#issuecomment-944962026






-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org