Posted to jira@arrow.apache.org by "Eduardo Ponce (Jira)" <ji...@apache.org> on 2021/09/21 05:41:00 UTC

[jira] [Commented] (ARROW-13028) [C++] CSV add convert option to attempt 32bit number inferences

    [ https://issues.apache.org/jira/browse/ARROW-13028?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17417922#comment-17417922 ] 

Eduardo Ponce commented on ARROW-13028:
---------------------------------------

I think it is more robust and safer for CSV inference to default to the largest type, and to rely on explicit column types for any other conversion.
If inference instead goes from smallest to largest, then where do these decisions end?
Do we first try signed or unsigned integers? Int8 vs. int32, etc.? Half-float vs. float vs. double?
We could certainly decide to simply try signed int32 and float as the smallest numeric types, but it still feels a bit opinionated.
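
To make the question concrete, here is a rough sketch of what a smallest-to-largest integer ladder could look like. It is plain Python and not Arrow's CSV reader; the names INT_LADDER, infer_narrowest_int and infer_widest_int are hypothetical and exist only for this illustration. It assumes a signed-only ladder starting at int8, which is exactly the kind of choice that would have to be made.

```python
# Hypothetical illustration only: this is NOT how Arrow's CSV reader works.
# Assumption: a signed-only ladder that starts at int8.
INT_LADDER = [
    ("int8", -2**7, 2**7 - 1),
    ("int16", -2**15, 2**15 - 1),
    ("int32", -2**31, 2**31 - 1),
    ("int64", -2**63, 2**63 - 1),
]


def infer_narrowest_int(values):
    """Return the first ladder type that holds every value, else None."""
    try:
        parsed = [int(v) for v in values]
    except ValueError:
        # At least one field is not an integer literal.
        return None
    for name, lo, hi in INT_LADDER:
        if all(lo <= x <= hi for x in parsed):
            return name
    # Every field parsed, but some value does not fit in int64.
    return None


def infer_widest_int(values):
    """Return "int64" if every value is an integer that fits, else None."""
    try:
        parsed = [int(v) for v in values]
    except ValueError:
        return None
    _, lo, hi = INT_LADDER[-1]
    return "int64" if all(lo <= x <= hi for x in parsed) else None


# The same logical column can infer differently depending on the data seen.
chunk_a = ["1", "2", "3"]
chunk_b = ["1", "2", "300"]
print(infer_narrowest_int(chunk_a), infer_narrowest_int(chunk_b))
print(infer_widest_int(chunk_a), infer_widest_int(chunk_b))
```

With the narrowest-first ladder the inferred type depends on which values happen to appear, while the widest-first rule gives the same type for both inputs. That is one possible reading of the robustness concern, and every rung of the ladder (signed vs. unsigned, where to start) is a judgment call.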

> [C++] CSV add convert option to attempt 32bit number inferences
> ---------------------------------------------------------------
>
>                 Key: ARROW-13028
>                 URL: https://issues.apache.org/jira/browse/ARROW-13028
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++
>            Reporter: Nate Clark
>            Assignee: Nate Clark
>            Priority: Major
>              Labels: pull-request-available
>          Time Spent: 20m
>  Remaining Estimate: 0h
>
> When the CSV reader infers types, numbers are always 64-bit. For large data sets it could be better to use 32-bit types to reduce overall memory use. To do this, it would be useful to add an option to ConvertOptions to try 32-bit numbers before 64-bit. By default this option would be disabled.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)