Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2021/08/30 13:51:46 UTC

[GitHub] [arrow-datafusion] andygrove opened a new issue #958: Add support for parsing timestamps from CSV files

andygrove opened a new issue #958:
URL: https://github.com/apache/arrow-datafusion/issues/958


   **Is your feature request related to a problem or challenge? Please describe what you are trying to do.**
   
   I updated the NYC taxi benchmark schema to use timestamps:
   
   ```
   fn nyctaxi_schema() -> Schema {
       Schema::new(vec![
           Field::new("VendorID", DataType::Utf8, true),
           Field::new("pickup_datetime", DataType::Timestamp(TimeUnit::Microsecond, None), true),
           Field::new("dropoff_datetime", DataType::Timestamp(TimeUnit::Microsecond, None), true),
           ...
   ```
   
   I tried running a query and got this error.
   
   ```
   Error: ArrowError(ExternalError(ArrowError(ParseError("Error while parsing value 2020-01-01 00:35:39 for column 1 at line 2"))))
   ```
   
   **Describe the solution you'd like**
   I would like to be able to query CSV files containing timestamps.
   
   **Describe alternatives you've considered**
   None.
   
   **Additional context**
   None.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] alamb commented on issue #958: Add support for parsing timestamps from CSV files

Posted by GitBox <gi...@apache.org>.
alamb commented on issue #958:
URL: https://github.com/apache/arrow-datafusion/issues/958#issuecomment-939458090


   https://github.com/novemberkilo/arrow-datafusion/commit/d9f096a5ececcb8fdef6cc74761b782d33e02799 <-- looks very cool 👍 





[GitHub] [arrow-datafusion] novemberkilo commented on issue #958: Add support for parsing timestamps from CSV files

Posted by GitBox <gi...@apache.org>.
novemberkilo commented on issue #958:
URL: https://github.com/apache/arrow-datafusion/issues/958#issuecomment-932829960


   I would like to pick this up. Please assign it to me as appropriate. cc @alamb





[GitHub] [arrow-datafusion] alamb commented on issue #958: Add support for parsing timestamps from CSV files

Posted by GitBox <gi...@apache.org>.
alamb commented on issue #958:
URL: https://github.com/apache/arrow-datafusion/issues/958#issuecomment-932726750


   Arrow contains the code to parse a string --> timestamp correctly here: https://github.com/apache/arrow-rs/blob/master/arrow/src/compute/kernels/cast_utils.rs#L69
   
   This ticket would likely be a matter of hooking that code up into the CSV parser: https://github.com/apache/arrow-rs/blob/master/arrow/src/csv/reader.rs
   
   So most of the code for this change would probably belong in arrow-rs rather than datafusion.
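
To illustrate what a string-to-timestamp step has to do for a CSV cell such as `2020-01-01 00:35:39`, here is a minimal Rust sketch that uses only the standard library. It is not the arrow-rs implementation linked above. The function names `days_from_civil` and `parse_timestamp_micros` are hypothetical. The sketch assumes a `YYYY-MM-DD HH:MM:SS` value (space or `T` separator) with no fractional seconds and no time zone, interprets it as UTC, and returns microseconds since the Unix epoch to match `Timestamp(TimeUnit::Microsecond, None)`.

```rust
/// Days since 1970-01-01 for a proleptic Gregorian date
/// (the well-known "days from civil" algorithm).
fn days_from_civil(y: i64, m: i64, d: i64) -> i64 {
    // Treat January and February as months 11 and 12 of the previous year.
    let y = if m <= 2 { y - 1 } else { y };
    let era = y.div_euclid(400);
    let yoe = y.rem_euclid(400); // year of era, 0..=399
    let mp = (m + 9) % 12; // March = 0, ..., February = 11
    let doy = (153 * mp + 2) / 5 + d - 1; // day of (shifted) year
    let doe = yoe * 365 + yoe / 4 - yoe / 100 + doy; // day of era
    era * 146_097 + doe - 719_468
}

/// Hypothetical helper: parse "YYYY-MM-DD HH:MM:SS" into microseconds
/// since the Unix epoch. Returns None for anything it cannot parse,
/// which a CSV reader could turn into a parse error for that cell.
fn parse_timestamp_micros(s: &str) -> Option<i64> {
    let (date, time) = s.trim().split_once(|c: char| c == ' ' || c == 'T')?;

    let mut d = date.splitn(3, '-').map(|p| p.parse::<i64>());
    let (year, month, day) = (d.next()?.ok()?, d.next()?.ok()?, d.next()?.ok()?);

    let mut t = time.splitn(3, ':').map(|p| p.parse::<i64>());
    let (hour, minute, second) = (t.next()?.ok()?, t.next()?.ok()?, t.next()?.ok()?);

    // Coarse range checks only; this sketch does not validate the
    // number of days in each month.
    if !(1..=12).contains(&month)
        || !(1..=31).contains(&day)
        || !(0..24).contains(&hour)
        || !(0..60).contains(&minute)
        || !(0..60).contains(&second)
    {
        return None;
    }

    let secs = days_from_civil(year, month, day) * 86_400 + hour * 3_600 + minute * 60 + second;
    Some(secs * 1_000_000)
}
```

Under these assumptions, `parse_timestamp_micros("2020-01-01 00:35:39")` returns `Some(1_577_838_939_000_000)`, and an unparseable cell yields `None`, which the caller could report as an error like the one in the original issue.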





[GitHub] [arrow-datafusion] novemberkilo commented on issue #958: Add support for parsing timestamps from CSV files

Posted by GitBox <gi...@apache.org>.
novemberkilo commented on issue #958:
URL: https://github.com/apache/arrow-datafusion/issues/958#issuecomment-939236947


   I have an example of this now on https://github.com/novemberkilo/arrow-datafusion/commit/d9f096a5ececcb8fdef6cc74761b782d33e02799
   
   To reproduce, follow the directions in `benchmarks/README.md` to get a `ballista-scheduler` and a `ballista-executor` running locally, then run:
   
   ```
   cargo run --release --bin nyctaxi -- --iterations 3 --path benchmarks/data/nyctaxi_100.csv --format csv --batch-size 4096
   ```





[GitHub] [arrow-datafusion] alamb commented on issue #958: Add support for parsing timestamps from CSV files

Posted by GitBox <gi...@apache.org>.
alamb commented on issue #958:
URL: https://github.com/apache/arrow-datafusion/issues/958#issuecomment-946584836


   arrow 6.0.0 has been released. Once https://github.com/apache/arrow-rs/pull/832 is merged, I'll backport it (it will be included in 6.1.0, due to be released around Nov 1, 2021).





[GitHub] [arrow-datafusion] alamb closed issue #958: Add support for parsing timestamps from CSV files

Posted by GitBox <gi...@apache.org>.
alamb closed issue #958:
URL: https://github.com/apache/arrow-datafusion/issues/958


   





[GitHub] [arrow-datafusion] novemberkilo commented on issue #958: Add support for parsing timestamps from CSV files

Posted by GitBox <gi...@apache.org>.
novemberkilo commented on issue #958:
URL: https://github.com/apache/arrow-datafusion/issues/958#issuecomment-945558699


   @alamb it looks like datafusion is [pinned to version](https://github.com/apache/arrow-datafusion/blob/master/datafusion/Cargo.toml#L53) `5.3` of `arrow-rs`. Once https://github.com/apache/arrow-rs/pull/832 is merged, getting it into datafusion will require upgrading to around `7.0.0`, which seems like a non-trivial change. What would the process for this be? Thanks.





[GitHub] [arrow-datafusion] alamb commented on issue #958: Add support for parsing timestamps from CSV files

Posted by GitBox <gi...@apache.org>.
alamb commented on issue #958:
URL: https://github.com/apache/arrow-datafusion/issues/958#issuecomment-1077899228


   🤔  I wonder if this issue is now done? Or does it need more work?





[GitHub] [arrow-datafusion] houqp commented on issue #958: Add support for parsing timestamps from CSV files

Posted by GitBox <gi...@apache.org>.
houqp commented on issue #958:
URL: https://github.com/apache/arrow-datafusion/issues/958#issuecomment-946388821


   @novemberkilo since apache/arrow-rs#832 doesn't break any public API, it will be released as part of arrow 6.x. @alamb already has a PR ready to merge for the arrow-rs 6.x integration: https://github.com/apache/arrow-datafusion/pull/984





[GitHub] [arrow-datafusion] houqp edited a comment on issue #958: Add support for parsing timestamps from CSV files

Posted by GitBox <gi...@apache.org>.
houqp edited a comment on issue #958:
URL: https://github.com/apache/arrow-datafusion/issues/958#issuecomment-946388821


   @novemberkilo since apache/arrow-rs#832 doesn't break any public API, it will be released as part of arrow 6.x. @alamb already has a PR ready to merge for the arrow-rs 6.x integration: https://github.com/apache/arrow-datafusion/pull/984. Process-wise, we need to get arrow-rs 6.0.0 released first. I will let @alamb decide whether your arrow-rs PR should be merged and released as part of the 6.0.0 release or the release after that.





[GitHub] [arrow-datafusion] novemberkilo commented on issue #958: Add support for parsing timestamps from CSV files

Posted by GitBox <gi...@apache.org>.
novemberkilo commented on issue #958:
URL: https://github.com/apache/arrow-datafusion/issues/958#issuecomment-1078459214


   IIRC, we just wanted to wait until we could confirm that the version of `arrow-rs` that contains the [fix](https://github.com/apache/arrow-rs/pull/832) is being used in datafusion. I don't think it needs more work.

