Posted to dev@parquet.apache.org by "colin fang (JIRA)" <ji...@apache.org> on 2019/03/18 17:47:00 UTC

[jira] [Updated] (PARQUET-1546) page level min / max written by parquet-cpp is not recognized by parquet-tools

     [ https://issues.apache.org/jira/browse/PARQUET-1546?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

colin fang updated PARQUET-1546:
--------------------------------
    Description: 
The test Parquet file is created from this DataFrame:

{code:python}
import numpy as np
import pandas as pd

n = 1000000
# x and y both have 7 * n entries.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 5.0, None] * n
y = [u'é', u'é', u'é', u'é'] * n + [u'a', None, u'a'] * n

z = np.random.rand(len(x)).tolist()
df = pd.DataFrame({'x': x, 'y': y, 'z': z})
{code}

 

Output from parquet-tools:

 
{code}
    y TV=1900100 RL=0 DL=1
    ----------------------------------------------------------------------------
    page 0:   DLE:RLE RLE:RLE VLE:PLAIN ST:[min: é, max: é, num_nulls: 0] SZ:1050632 VC:175104
    page 1:   DLE:RLE RLE:RLE VLE:PLAIN ST:[num_nulls: 90072, min/max not defined] SZ:1083218 VC:294912
    page 2:   DLE:RLE RLE:RLE VLE:PLAIN ST:[min: a, max: a, num_nulls: 105131] SZ:1091359 VC:315392
    page 3:   DLE:RLE RLE:RLE VLE:PLAIN ST:[min: a, max: a, num_nulls: 105130] SZ:1091364 VC:315392
{code}


 
In the output above, page 1 reports "min/max not defined".

A Parquet file generated by `parquet-mr` has the correct page-level min/max.
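
To find the affected pages in a longer dump, the output can be scanned for the "min/max not defined" marker. This is a minimal sketch, assuming every page line has the same shape as in the output above. The helper name `pages_without_min_max` and the regular expression are illustrative and not part of parquet-tools.

```python
import re

# Assumed line shape, based on the dump above:
#   page N:  ... ST:[<statistics>] SZ:... VC:...
PAGE_LINE = re.compile(r"page (\d+):.*?ST:\[(.*?)\]")


def pages_without_min_max(dump_text):
    """Return the page numbers whose statistics lack min/max."""
    missing = []
    for line in dump_text.splitlines():
        match = PAGE_LINE.search(line)
        if match and "min/max not defined" in match.group(2):
            missing.append(int(match.group(1)))
    return missing


sample = """\
    page 0:   DLE:RLE RLE:RLE VLE:PLAIN ST:[min: é, max: é, num_nulls: 0] SZ:1050632 VC:175104
    page 1:   DLE:RLE RLE:RLE VLE:PLAIN ST:[num_nulls: 90072, min/max not defined] SZ:1083218 VC:294912
    page 2:   DLE:RLE RLE:RLE VLE:PLAIN ST:[min: a, max: a, num_nulls: 105131] SZ:1091359 VC:315392
"""
print(pages_without_min_max(sample))
```

For the three sample lines this reports only page 1, the page that parquet-tools shows without min/max.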

 

  was:
The test Parquet file is created from this DataFrame:

{code: python}
import numpy as np
import pandas as pd

n = 1000000
# x and y both have 7 * n entries.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 5.0, None] * n
y = [u'é', u'é', u'é', u'é'] * n + [u'a', None, u'a'] * n

z = np.random.rand(len(x)).tolist()
df = pd.DataFrame({'x': x, 'y': y, 'z': z})
{code}

 

Output from parquet-tools:

 
{code}
    y TV=1900100 RL=0 DL=1
    ----------------------------------------------------------------------------
    page 0:   DLE:RLE RLE:RLE VLE:PLAIN ST:[min: é, max: é, num_nulls: 0] SZ:1050632 VC:175104
    page 1:   DLE:RLE RLE:RLE VLE:PLAIN ST:[num_nulls: 90072, min/max not defined] SZ:1083218 VC:294912
    page 2:   DLE:RLE RLE:RLE VLE:PLAIN ST:[min: a, max: a, num_nulls: 105131] SZ:1091359 VC:315392
    page 3:   DLE:RLE RLE:RLE VLE:PLAIN ST:[min: a, max: a, num_nulls: 105130] SZ:1091364 VC:315392
{code}


 
In the output above, page 1 reports "min/max not defined".

A Parquet file generated by `parquet-mr` has the correct page-level min/max.

 


> page level min / max written by parquet-cpp is not recognized by parquet-tools
> ------------------------------------------------------------------------------
>
>                 Key: PARQUET-1546
>                 URL: https://issues.apache.org/jira/browse/PARQUET-1546
>             Project: Parquet
>          Issue Type: Bug
>          Components: parquet-cpp
>            Reporter: colin fang
>            Priority: Minor
>
> The test Parquet file is created from this DataFrame:
> {code:python}
> import numpy as np
> import pandas as pd
>
> n = 1000000
> # x and y both have 7 * n entries.
> x = [1.0, 2.0, 3.0, 4.0, 5.0, 5.0, None] * n
> y = [u'é', u'é', u'é', u'é'] * n + [u'a', None, u'a'] * n
> z = np.random.rand(len(x)).tolist()
> df = pd.DataFrame({'x': x, 'y': y, 'z': z})
> {code}
>  
> Output from parquet-tools:
>  
> {code}
>     y TV=1900100 RL=0 DL=1
>     ----------------------------------------------------------------------------
>     page 0:   DLE:RLE RLE:RLE VLE:PLAIN ST:[min: é, max: é, num_nulls: 0] SZ:1050632 VC:175104
>     page 1:   DLE:RLE RLE:RLE VLE:PLAIN ST:[num_nulls: 90072, min/max not defined] SZ:1083218 VC:294912
>     page 2:   DLE:RLE RLE:RLE VLE:PLAIN ST:[min: a, max: a, num_nulls: 105131] SZ:1091359 VC:315392
>     page 3:   DLE:RLE RLE:RLE VLE:PLAIN ST:[min: a, max: a, num_nulls: 105130] SZ:1091364 VC:315392
> {code}
>  
> In the output above, page 1 reports "min/max not defined".
> A Parquet file generated by `parquet-mr` has the correct page-level min/max.
>  


