Posted to jira@arrow.apache.org by "Joris Van den Bossche (Jira)" <ji...@apache.org> on 2020/12/16 08:35:00 UTC
[jira] [Commented] (ARROW-10928) [C++][Parquet] Unknown error: data type leaf_count mismatch

    [ https://issues.apache.org/jira/browse/ARROW-10928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17250167#comment-17250167 ] 

Joris Van den Bossche commented on ARROW-10928:
-----------------------------------------------

[~lsabreu96] thanks for the report!

I made a slightly simplified example (with 2 values in the column):

{code:python}
import pandas as pd

df = pd.DataFrame({"column": 
    [
        [{'my_field_1': {},
          'my_field_2': {'my_field_21': 'value_211', 'my_field_22': 1}
         },
         {'my_field_1': {},
          'my_field_2': {'my_field_21': 'value_212', 'my_field_22': 2}
         }
        ],
        [{'my_field_1': {},
          'my_field_2': {'my_field_21': 'value_211', 'my_field_22': 1}
         },
         {'my_field_1': {},
          'my_field_2': {'my_field_21': 'value_212', 'my_field_22': 2}
         }
        ]
    ]
})

df.to_parquet("test.parquet")
{code}

As a possible pointer to find the bug, it's the empty struct that causes the issue:

{code:python}
In [19]: pa.table(df)
Out[19]: 
pyarrow.Table
column: list<item: struct<my_field_1: struct<>, my_field_2: struct<my_field_21: string, my_field_22: int64>>>
  child 0, item: struct<my_field_1: struct<>, my_field_2: struct<my_field_21: string, my_field_22: int64>>
      child 0, my_field_1: struct<>     # <--------------------- empty struct field
      child 1, my_field_2: struct<my_field_21: string, my_field_22: int64>
          child 0, my_field_21: string
          child 1, my_field_22: int64
{code}
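
To check larger inputs for the same pattern before converting them, something like the following could be used. This is only a rough sketch in plain Python: the helper name {{find_empty_dicts}} and the path format are made up for illustration and are not part of pandas or pyarrow. It walks the nested lists and dicts and reports where empty dicts occur.

```python
def find_empty_dicts(value, path="column"):
    """Yield a path string for every empty dict nested inside lists/dicts.

    Hypothetical helper for illustration; not a pandas or pyarrow API.
    """
    if isinstance(value, dict):
        if not value:
            yield path
        for key, child in value.items():
            yield from find_empty_dicts(child, f"{path}.{key}")
    elif isinstance(value, (list, tuple)):
        for index, child in enumerate(value):
            yield from find_empty_dicts(child, f"{path}[{index}]")


# One row of the column, holding a single record with an empty dict.
rows = [
    [{'my_field_1': {},
      'my_field_2': {'my_field_21': 'value_211', 'my_field_22': 1}}],
]
empty_paths = list(find_empty_dicts(rows))
```

The reported paths are only candidates to inspect: as far as I can tell, a field ends up as {{struct<>}} only when none of its occurrences has any keys.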

If I add a value to this empty dict for the first occurrence, then it works fine:

{code:python}
df = pd.DataFrame({"column": 
    [
        [{'my_field_1': {'my_field_11': 'value_111'},  # <----- no longer an empty dict
          'my_field_2': {'my_field_21': 'value_211', 'my_field_22': 1}
         },
         {'my_field_1': {},
          'my_field_2': {'my_field_21': 'value_212', 'my_field_22': 2}
         }
        ]
    ]
})

df.to_parquet("test.parquet")
{code}

cc [~emkornfield] [~apitrou]


> [C++][Parquet] Unknown error: data type leaf_count mismatch
> -----------------------------------------------------------
>
>                 Key: ARROW-10928
>                 URL: https://issues.apache.org/jira/browse/ARROW-10928
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 2.0.0
>         Environment: ubuntu 18.04
>            Reporter: Lucas da Silva Abreu
>            Priority: Blocker
>
> I was trying to write some dataframes to parquet using {{snappy}} with the command
>  
> {code:python}
> df2.to_parquet('my-parquet', compression='snappy')
> {code}
>  
> But I got the following error
>  Unknown error: data type leaf_count != builder_leaf_count9 8
>  By manually sampling columns, I found that a column holding a list of dicts was causing the issue.
> A toy example that reproduces the error is shown below:
>  
> {code:python}
> import pandas as pd
>
> df2 = pd.DataFrame(
>     [[
>         [{'my_field_1': {},
>           'my_field_2': {'my_field_21': 'value_21', 'my_field_22': 1,
>                          'my_field_23': 1, 'my_field_24': 1.0},
>           'my_field_3': {'my_field_31': 'value_31', 'my_field_32': 1,
>                          'my_field_33': 1, 'my_field_34': 1}},
>          {'my_field_1': {},
>           'my_field_2': {'my_field_21': 'value_21', 'my_field_22': 1,
>                          'my_field_23': 1, 'my_field_24': 1.0},
>           'my_field_3': {'my_field_31': 'value_31', 'my_field_32': 1,
>                          'my_field_33': 1, 'my_field_34': 1}}]
>     ]], columns=['my_column'])
> df2['toy_column_1'] = 1
> df2['toy_column_2'] = 'ab'
> {code}
> Current configuration of my pandas is
> {code:java}
> INSTALLED VERSIONS
>  ------------------
>  commit : 67a3d4241ab84419856b84fc3ebc9abcbe66c6b3
>  python : 3.6.9.final.0
>  python-bits : 64
>  OS : Linux
>  OS-release : 4.15.0-126-generic
>  Version : #129-Ubuntu SMP Mon Nov 23 18:53:38 UTC 2020
>  machine : x86_64
>  processor : x86_64
>  byteorder : little
>  LC_ALL : None
>  LANG : en_US.UTF-8
>  LOCALE : pt_BR.UTF-8
>  pandas : 1.1.4
>  numpy : 1.19.1
>  pytz : 2020.1
>  dateutil : 2.8.1
>  pip : 20.3
>  setuptools : 41.2.0
>  Cython : None
>  pytest : 5.1.1
>  hypothesis : None
>  sphinx : None
>  blosc : None
>  feather : None
>  xlsxwriter : None
>  lxml.etree : None
>  html5lib : None
>  pymysql : 0.10.1
>  psycopg2 : 2.8.2 (dt dec pq3 ext lo64)
>  jinja2 : 2.11.2
>  IPython : 7.16.1
>  pandas_datareader: None
>  bs4 : None
>  bottleneck : None
>  fsspec : None
>  fastparquet : 0.4.1
>  gcsfs : None
>  matplotlib : 3.3.2
>  numexpr : None
>  odfpy : None
>  openpyxl : None
>  pandas_gbq : 0.10.0
>  pyarrow : 2.0.0
>  pytables : None
>  pyxlsb : None
>  s3fs : None
>  scipy : 1.5.2
>  sqlalchemy : 1.3.18
>  tables : None
>  tabulate : 0.8.7
>  xarray : None
>  xlrd : None
>  xlwt : None
>  numba : 0.52.0
> {code}
>  
>  
>  I found this pandas issue ([https://github.com/pandas-dev/pandas/issues/34643]) that at first seemed to me to have the same root cause, but I noticed that I was already using the version mentioned in that issue and that the example from the original issue worked fine for me.
>  Could someone please help me?


