Posted to issues@arrow.apache.org by "Frédérique Vanneste (JIRA)" <ji...@apache.org> on 2018/08/29 07:36:00 UTC

[jira] [Updated] (ARROW-3139) [Python] ArrowIOError: Arrow error: Capacity error during read

     [ https://issues.apache.org/jira/browse/ARROW-3139?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Frédérique Vanneste updated ARROW-3139:
---------------------------------------
    Description: 
My assumption: the problem is caused by a large object (string) column whose values are up to 27 characters long. In total that column holds far more than 2 GB of string data, so this looks like a chunking issue.

Looks similar to https://issues.apache.org/jira/browse/ARROW-2227?focusedCommentId=16379574&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16379574

Code (either call triggers the error; a self-contained version follows the list)
 * basket_plateau = pq.read_table("basket_plateau.parquet")
 * basket_plateau = pd.read_parquet("basket_plateau.parquet")
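
The same two calls as a self-contained snippet (a sketch; only the imports are added, the file name is the one from this report):

    import pandas as pd
    import pyarrow.parquet as pq

    # either line raises the ArrowIOError shown below
    basket_plateau = pq.read_table("basket_plateau.parquet")    # pyarrow API
    basket_plateau = pd.read_parquet("basket_plateau.parquet")  # pandas wrapper over pyarrow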

Error produced (background on the limit follows the list)
 * ArrowIOError: Arrow error: Capacity error: BinaryArray cannot contain more than 2147483646 bytes, have 2147483655
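
Background, as I understand it (not verified against the Arrow source): BinaryArray uses 32-bit offsets, so a single array can hold at most 2147483646 bytes (2^31 - 2) of string data, and the failing read built one array just 9 bytes over that cap. Rough arithmetic for this dataset, assuming an average string length of about 15 bytes:

    # assumption: ~15 bytes per category string on average
    total_bytes = 2_700_000_000 * 15           # ~40.5e9 bytes of string data
    min_chunks = total_bytes / 2_147_483_646   # ~18.9, i.e. at least 19 chunks

In other words, a correct reader has to split this column into roughly 20 chunks instead of one contiguous BinaryArray.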

Dataset (a synthetic sketch follows the list)
 * Pandas dataframe (pandas=0.23.1=py36h637b7d7_0)
 * 2.7 billion records, 4 columns (int64 / object / datetime64 / float64)
 * approx. 90 GB in memory
 * examples of the object column: "Fresh Vegetables", "Alcohol Beers", ... (think food retail categories)
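
For anyone trying to reproduce without the original data, a synthetic frame of the same shape (column names and values are made up; n is scaled down here, the real frame has ~2.7 billion rows):

    import numpy as np
    import pandas as pd

    n = 50_000_000  # push toward 2_700_000_000 to cross the 2 GiB string limit
    categories = ["Fresh Vegetables", "Alcohol Beers", "Frozen Foods", "Dairy"]
    df = pd.DataFrame({
        "basket_id": np.random.randint(0, 10_000_000, n),    # int64
        "category": np.random.choice(categories, n),         # object
        "ts": pd.Timestamp("2018-01-01")
              + pd.to_timedelta(np.random.randint(0, 86_400, n), unit="s"),  # datetime64[ns]
        "amount": np.random.rand(n) * 100,                   # float64
    })
    df.to_parquet("basket_plateau.parquet")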

History of the bug (a possible workaround sketch follows the list):
 * was using an older version of pyarrow
 * tried writing the dataset to disk (Parquet) and failed
 * stumbled on https://issues.apache.org/jira/browse/ARROW-2227
 * upgraded to 0.10
 * tried writing the dataset to disk (Parquet) and succeeded
 * tried reading the dataset back and failed
 * looks like a similar case to: https://issues.apache.org/jira/browse/ARROW-2227?focusedCommentId=16379574&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16379574
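
Until the reader chunks large string columns on its own, a possible workaround (a sketch, assuming the file was written with several row groups; I have not verified it on the full 90 GB file):

    import pyarrow as pa
    import pyarrow.parquet as pq

    pf = pq.ParquetFile("basket_plateau.parquet")
    # read one row group at a time so no single BinaryArray has to hold
    # the whole column, then stitch the pieces back together;
    # concat_tables keeps the columns chunked rather than copying them
    # into one contiguous array
    pieces = [pf.read_row_group(i) for i in range(pf.num_row_groups)]
    table = pa.concat_tables(pieces)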

> [Python] ArrowIOError: Arrow error: Capacity error during read
> -------------------------------------------------------------
>
>                 Key: ARROW-3139
>                 URL: https://issues.apache.org/jira/browse/ARROW-3139
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 0.10.0
>         Environment: pandas=0.23.1=py36h637b7d7_0
> pyarrow==0.10.0
>            Reporter: Frédérique Vanneste
>            Priority: Major
>



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)