Posted to jira@arrow.apache.org by "Lucas da Silva Abreu (Jira)" <ji...@apache.org> on 2020/12/15 19:57:00 UTC
[jira] [Updated] (ARROW-10928) [Python] Unknown error: data type leaf_count mismatch
[ https://issues.apache.org/jira/browse/ARROW-10928?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Lucas da Silva Abreu updated ARROW-10928:
-----------------------------------------
Description:
I was trying to write some dataframes to parquet using {{snappy}} with the command {{df2.to_parquet('my-parquet', compression='snappy')}}
But I got the following error:
Unknown error: data type leaf_count != builder_leaf_count9 8
By manually sampling columns, I found that a column containing a list of dicts was causing the issue.
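The column-by-column check described above can be sketched roughly as follows. This is only an illustration, not the exact procedure used in the report: `find_failing_columns` and `fake_write` are hypothetical names, and `fake_write` is a stand-in that simulates a failure so the snippet runs without pandas or pyarrow installed.

```python
from typing import Callable, Dict, Iterable


def find_failing_columns(
    columns: Iterable[str], try_write: Callable[[str], None]
) -> Dict[str, str]:
    """Call try_write once per column and collect the columns that raise."""
    failures: Dict[str, str] = {}
    for name in columns:
        try:
            try_write(name)
        except Exception as exc:  # broad on purpose: we only want to locate the column
            failures[name] = f"{type(exc).__name__}: {exc}"
    return failures


# Stand-in writer for illustration only (hypothetical). With a real dataframe,
# try_write would instead write a single-column frame, for example:
#   lambda col: df2[[col]].to_parquet('probe.parquet', compression='snappy')
def fake_write(name: str) -> None:
    if name == "my_column":
        raise RuntimeError("simulated failure")


failures = find_failing_columns(
    ["my_column", "toy_column_1", "toy_column_2"], fake_write
)
print(failures)
```

Writing one column at a time keeps each attempt independent, so a single problematic column does not hide the others.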
The toy example below reproduces the error:
import pandas as pd

# One row whose single cell holds a list of two identical nested dicts
df2 = pd.DataFrame(
    [[
        [{'my_field_1': {},
          'my_field_2': {'my_field_21': 'value_21',
                         'my_field_22': 1,
                         'my_field_23': 1,
                         'my_field_24': 1.0},
          'my_field_3': {'my_field_31': 'value_31',
                         'my_field_32': 1,
                         'my_field_33': 1,
                         'my_field_34': 1}},
         {'my_field_1': {},
          'my_field_2': {'my_field_21': 'value_21',
                         'my_field_22': 1,
                         'my_field_23': 1,
                         'my_field_24': 1.0},
          'my_field_3': {'my_field_31': 'value_31',
                         'my_field_32': 1,
                         'my_field_33': 1,
                         'my_field_34': 1}}]
    ]], columns=['my_column'])
df2['toy_column_1'] = 1
df2['toy_column_2'] = 'ab'
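As a side observation on the shape of the data, and not a confirmed diagnosis: each dict in the list has eight scalar leaf values plus one empty dict ({{my_field_1}}). The sketch below counts them; `count_leaves` is a hypothetical helper written for this illustration. Whether the 9 and 8 in the error message correspond to these counts is an assumption that the report does not verify.

```python
def count_leaves(value):
    """Return (scalar_leaves, empty_dicts) for a nested dict of scalars.

    Lists are not handled; this is only meant for one record of the toy example.
    """
    if isinstance(value, dict):
        if not value:
            return (0, 1)  # an empty dict has no scalar leaves of its own
        scalars = empties = 0
        for child in value.values():
            child_scalars, child_empties = count_leaves(child)
            scalars += child_scalars
            empties += child_empties
        return (scalars, empties)
    return (1, 0)  # any non-dict value counts as one scalar leaf


# One element of the list stored in 'my_column' in the toy example
record = {
    'my_field_1': {},
    'my_field_2': {'my_field_21': 'value_21', 'my_field_22': 1,
                   'my_field_23': 1, 'my_field_24': 1.0},
    'my_field_3': {'my_field_31': 'value_31', 'my_field_32': 1,
                   'my_field_33': 1, 'my_field_34': 1},
}

scalars, empties = count_leaves(record)
print(scalars, empties)  # 8 1
```

If the empty dict is involved, removing {{my_field_1}} from the toy example and retrying the write would be a quick way to check; that experiment is not part of the original report.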
My current pandas configuration is:
INSTALLED VERSIONS
------------------
commit : 67a3d4241ab84419856b84fc3ebc9abcbe66c6b3
python : 3.6.9.final.0
python-bits : 64
OS : Linux
OS-release : 4.15.0-126-generic
Version : #129-Ubuntu SMP Mon Nov 23 18:53:38 UTC 2020
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : pt_BR.UTF-8
pandas : 1.1.4
numpy : 1.19.1
pytz : 2020.1
dateutil : 2.8.1
pip : 20.3
setuptools : 41.2.0
Cython : None
pytest : 5.1.1
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : 0.10.1
psycopg2 : 2.8.2 (dt dec pq3 ext lo64)
jinja2 : 2.11.2
IPython : 7.16.1
pandas_datareader: None
bs4 : None
bottleneck : None
fsspec : None
fastparquet : 0.4.1
gcsfs : None
matplotlib : 3.3.2
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : 0.10.0
pyarrow : 2.0.0
pytables : None
pyxlsb : None
s3fs : None
scipy : 1.5.2
sqlalchemy : 1.3.18
tables : None
tabulate : 0.8.7
xarray : None
xlrd : None
xlwt : None
numba : 0.52.0
I found this pandas issue ([https://github.com/pandas-dev/pandas/issues/34643]) that at first seemed to have the same root cause, but I noticed that I was already using the same version as in that issue, and the example from the original issue worked fine for me.
Could someone please help me?
was:
I was trying to write some dataframes to parquet using {{snappy}} with the command {{df.to_parquet('my-parquet', compression= 'snapppy')}}
But I got the following error:
Unknown error: data type leaf_count != builder_leaf_count9 8
By manually sampling columns, I found that a column containing a list of dicts was causing the issue.
The toy example below reproduces the error:
import pandas as pd

# One row whose single cell holds a list of two identical nested dicts
df2 = pd.DataFrame(
    [[
        [{'my_field_1': {},
          'my_field_2': {'my_field_21': 'value_21',
                         'my_field_22': 1,
                         'my_field_23': 1,
                         'my_field_24': 1.0},
          'my_field_3': {'my_field_31': 'value_31',
                         'my_field_32': 1,
                         'my_field_33': 1,
                         'my_field_34': 1}},
         {'my_field_1': {},
          'my_field_2': {'my_field_21': 'value_21',
                         'my_field_22': 1,
                         'my_field_23': 1,
                         'my_field_24': 1.0},
          'my_field_3': {'my_field_31': 'value_31',
                         'my_field_32': 1,
                         'my_field_33': 1,
                         'my_field_34': 1}}]
    ]], columns=['my_column'])
df2['toy_column_1'] = 1
df2['toy_column_2'] = 'ab'
My current pandas configuration is:
INSTALLED VERSIONS
------------------
commit : 67a3d4241ab84419856b84fc3ebc9abcbe66c6b3
python : 3.6.9.final.0
python-bits : 64
OS : Linux
OS-release : 4.15.0-126-generic
Version : #129-Ubuntu SMP Mon Nov 23 18:53:38 UTC 2020
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : pt_BR.UTF-8
pandas : 1.1.4
numpy : 1.19.1
pytz : 2020.1
dateutil : 2.8.1
pip : 20.3
setuptools : 41.2.0
Cython : None
pytest : 5.1.1
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : 0.10.1
psycopg2 : 2.8.2 (dt dec pq3 ext lo64)
jinja2 : 2.11.2
IPython : 7.16.1
pandas_datareader: None
bs4 : None
bottleneck : None
fsspec : None
fastparquet : 0.4.1
gcsfs : None
matplotlib : 3.3.2
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : 0.10.0
pyarrow : 2.0.0
pytables : None
pyxlsb : None
s3fs : None
scipy : 1.5.2
sqlalchemy : 1.3.18
tables : None
tabulate : 0.8.7
xarray : None
xlrd : None
xlwt : None
numba : 0.52.0
I found this pandas issue ([https://github.com/pandas-dev/pandas/issues/34643]) that at first seemed to have the same root cause, but I noticed that I was already using the same version as in that issue, and the example from the original issue worked fine for me.
Could someone please help me?
> [Python] Unknown error: data type leaf_count mismatch
> -----------------------------------------------------
>
> Key: ARROW-10928
> URL: https://issues.apache.org/jira/browse/ARROW-10928
> Project: Apache Arrow
> Issue Type: Bug
> Components: Python
> Affects Versions: 2.0.0
> Environment: ubuntu 18.04
> Reporter: Lucas da Silva Abreu
> Priority: Blocker