Posted to dev@parquet.apache.org by "colin fang (JIRA)" <ji...@apache.org> on 2019/03/18 17:47:00 UTC
[jira] [Updated] (PARQUET-1546) page level min / max written by parquet-cpp is not recognized by parquet-tools
[ https://issues.apache.org/jira/browse/PARQUET-1546?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
colin fang updated PARQUET-1546:
--------------------------------
Description:
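The test Parquet file is built from the following DataFrame:
{code:python}
import numpy as np
import pandas as pd

n = 1000000
# x: doubles with one null in every group of seven values
x = [1.0, 2.0, 3.0, 4.0, 5.0, 5.0, None] * n
# y: a run of non-ASCII strings, then ASCII strings interleaved with nulls
y = [u'é', u'é', u'é', u'é'] * n + [u'a', None, u'a'] * n
# z: random doubles, same length as x
z = np.random.rand(len(x)).tolist()
df = pd.DataFrame({'x': x, 'y': y, 'z': z})
{code}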
output from parquet-tools
{code}
y TV=1900100 RL=0 DL=1
----------------------------------------------------------------------------
page 0: DLE:RLE RLE:RLE VLE:PLAIN ST:[min: é, max: é, num_nulls: 0] SZ:1050632 VC:175104
page 1: DLE:RLE RLE:RLE VLE:PLAIN ST:[num_nulls: 90072, min/max not defined] SZ:1083218 VC:294912
page 2: DLE:RLE RLE:RLE VLE:PLAIN ST:[min: a, max: a, num_nulls: 105131] SZ:1091359 VC:315392
page 3: DLE:RLE RLE:RLE VLE:PLAIN ST:[min: a, max: a, num_nulls: 105130] SZ:1091364 VC:315392
{code}
In the output above, page 1 reports "min/max not defined", although the other pages carry min and max values.
A Parquet file generated by `parquet-mr` has the correct page-level min/max.
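To spot affected pages in a longer dump, the ST:[...] field can be scanned mechanically. This is only a minimal sketch, assuming the page-line layout shown in the output above; `pages_missing_min_max` is a hypothetical helper, not part of parquet-tools, and other parquet-tools versions may print the lines differently.

```python
import re

# Matches page lines such as
#   page 1: DLE:RLE RLE:RLE VLE:PLAIN ST:[num_nulls: 90072, min/max not defined] SZ:1083218 VC:294912
# The layout is assumed from the dump in this report.
PAGE_RE = re.compile(r"page\s+(\d+):.*?ST:\[(.*?)\]\s+SZ:(\d+)\s+VC:(\d+)")


def pages_missing_min_max(dump):
    """Return the page numbers whose statistics report no min/max."""
    missing = []
    for line in dump.splitlines():
        match = PAGE_RE.search(line)
        if match and "min/max not defined" in match.group(2):
            missing.append(int(match.group(1)))
    return missing


# The four page lines quoted in this report.
dump = """\
page 0: DLE:RLE RLE:RLE VLE:PLAIN ST:[min: é, max: é, num_nulls: 0] SZ:1050632 VC:175104
page 1: DLE:RLE RLE:RLE VLE:PLAIN ST:[num_nulls: 90072, min/max not defined] SZ:1083218 VC:294912
page 2: DLE:RLE RLE:RLE VLE:PLAIN ST:[min: a, max: a, num_nulls: 105131] SZ:1091359 VC:315392
page 3: DLE:RLE RLE:RLE VLE:PLAIN ST:[min: a, max: a, num_nulls: 105130] SZ:1091364 VC:315392
"""

print(pages_missing_min_max(dump))  # prints [1]
```

The pattern stops at the first `]` inside ST:[...], so it would misparse a min or max value that itself contains a closing bracket.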
was:
test parquet is created by
{code: python}
n = 1000000
x = [1.0, 2.0, 3.0, 4.0, 5.0, 5.0, None] * n
y = [u'é', u'é', u'é', u'é'] * n + [u'a', None, u'a'] * n
z = np.random.rand(len(x)).tolist()
df = pd.DataFrame({'x': x, 'y': y, 'z': z})
{code}
output from parquet-tools
{code}
y TV=1900100 RL=0 DL=1
----------------------------------------------------------------------------
page 0: DLE:RLE RLE:RLE VLE:PLAIN ST:[min: é, max: é, num_nulls: 0] SZ:1050632 VC:175104
page 1: DLE:RLE RLE:RLE VLE:PLAIN ST:[num_nulls: 90072, min/max not defined] SZ:1083218 VC:294912
page 2: DLE:RLE RLE:RLE VLE:PLAIN ST:[min: a, max: a, num_nulls: 105131] SZ:1091359 VC:315392
page 3: DLE:RLE RLE:RLE VLE:PLAIN ST:[min: a, max: a, num_nulls: 105130] SZ:1091364 VC:315392
{code}
In the above "min/max not defined"
The parquet generated by `parquet-mr` has the correct page min max.
> page level min / max written by parquet-cpp is not recognized by parquet-tools
> -------------------------------------------------------------------------------
>
> Key: PARQUET-1546
> URL: https://issues.apache.org/jira/browse/PARQUET-1546
> Project: Parquet
> Issue Type: Bug
> Components: parquet-cpp
> Reporter: colin fang
> Priority: Minor
>
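> The test Parquet file is built from the following DataFrame:
> {code:python}
> import numpy as np
> import pandas as pd
>
> n = 1000000
> # x: doubles with one null in every group of seven values
> x = [1.0, 2.0, 3.0, 4.0, 5.0, 5.0, None] * n
> # y: a run of non-ASCII strings, then ASCII strings interleaved with nulls
> y = [u'é', u'é', u'é', u'é'] * n + [u'a', None, u'a'] * n
> # z: random doubles, same length as x
> z = np.random.rand(len(x)).tolist()
> df = pd.DataFrame({'x': x, 'y': y, 'z': z})
> {code}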
>
> output from parquet-tools
>
> {code}
> y TV=1900100 RL=0 DL=1
> ----------------------------------------------------------------------------
> page 0: DLE:RLE RLE:RLE VLE:PLAIN ST:[min: é, max: é, num_nulls: 0] SZ:1050632 VC:175104
> page 1: DLE:RLE RLE:RLE VLE:PLAIN ST:[num_nulls: 90072, min/max not defined] SZ:1083218 VC:294912
> page 2: DLE:RLE RLE:RLE VLE:PLAIN ST:[min: a, max: a, num_nulls: 105131] SZ:1091359 VC:315392
> page 3: DLE:RLE RLE:RLE VLE:PLAIN ST:[min: a, max: a, num_nulls: 105130] SZ:1091364 VC:315392
> {code}
>
> In the output above, page 1 reports "min/max not defined", although the other pages carry min and max values.
> A Parquet file generated by `parquet-mr` has the correct page-level min/max.
>
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)