Posted to issues@arrow.apache.org by "Jarno Seppanen (JIRA)" <ji...@apache.org> on 2017/08/18 08:27:00 UTC

[jira] [Comment Edited] (ARROW-1357) Data corruption in reading multi-file parquet dataset

    [ https://issues.apache.org/jira/browse/ARROW-1357?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16131896#comment-16131896 ] 

Jarno Seppanen edited comment on ARROW-1357 at 8/18/17 8:26 AM:
----------------------------------------------------------------

Tested on 0.6.0 from PyPI; the problem still exists.

FWIW, with the multi-file load the first file is read correctly but the second file is mostly corrupt. New example:

{code}
import sys
import pyarrow
import pyarrow.parquet as pq
import pandas as pd

sys.version
# '3.5.3 |Continuum Analytics, Inc.| (default, Mar  6 2017, 11:58:13) \n[GCC 4.4.7 20120313 (Red Hat 4.4.7-1)]'
pyarrow.__version__
# '0.6.0'
pd.__version__
# '0.19.0'

part0 = pq.read_table('data/part-00000-fd2ff92f-201b-4ffc-b0b4-275e06a7fa02.snappy.parquet').to_pandas()
part1 = pq.read_table('data/part-00001-fd2ff92f-201b-4ffc-b0b4-275e06a7fa02.snappy.parquet').to_pandas()
both_bad = pq.read_table('data').to_pandas()
both_good = pd.concat([part0, part1])

# first part of multi-file load gets loaded correctly:
all(all(x==y)
    for x,y in zip(both_bad[:len(part0)].legal_labels,
                   part0.legal_labels))
# True

# second part is loaded as corrupt in memory:
all(all(x==y)
    for x,y in zip(both_bad[len(part0):].legal_labels,
                   part1.legal_labels))
# False

# most of the rows in the second part are corrupt
sum(all(x==y)
    for x,y in zip(both_bad[len(part0):].legal_labels,
                   part1.legal_labels))
# 151

len(part1)
# 21447

# for comparison, the same queries for the manually concatenated data frame:
all(all(x==y)
    for x,y in zip(both_good[:len(part0)].legal_labels,
                   part0.legal_labels))
# True

all(all(x==y)
    for x,y in zip(both_good[len(part0):].legal_labels,
                   part1.legal_labels))
# True
{code}
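For what it's worth, the element-wise checks above can be written more compactly with numpy. A sketch, not part of the original repro; the `columns_equal` helper name is an assumption:

```python
import numpy as np

def columns_equal(a, b):
    """Return True if two iterables of numpy arrays match element-wise,
    pair by pair (the same comparison as the all(all(x==y) ...) idiom)."""
    return all(np.array_equal(x, y) for x, y in zip(a, b))

# e.g. columns_equal(both_bad[len(part0):].legal_labels, part1.legal_labels)
```

Note that `np.array_equal` also handles arrays of unequal length (returning False), whereas `all(x==y)` raises on a shape mismatch.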



> Data corruption in reading multi-file parquet dataset
> -----------------------------------------------------
>
>                 Key: ARROW-1357
>                 URL: https://issues.apache.org/jira/browse/ARROW-1357
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 0.5.0, 0.6.0
>         Environment: python 3.5.3
>            Reporter: Jarno Seppanen
>             Fix For: 0.7.0
>
>
> I generated a parquet dataset in Spark that has two files. PyArrow corrupts the data of the second file when I read both files using pyarrow's parquet directory loading mode.
> $ ls -l data
> total 28608
> -rw-rw-r-- 1 jarno jarno 14651449 Aug 15 09:30 part-00000-fd2ff92f-201b-4ffc-b0b4-275e06a7fa02.snappy.parquet
> -rw-rw-r-- 1 jarno jarno 14636502 Aug 15 09:30 part-00001-fd2ff92f-201b-4ffc-b0b4-275e06a7fa02.snappy.parquet
> import pyarrow.parquet as pq
> tab1 = pq.read_table('data')
> df1 = tab1.to_pandas()
> df1[df1.account_id == 38658373328].legal_labels.tolist()
> # [array([ 2,  3,  5,  8, 10, 11, 13, 14, 17, 18, 19, 21, 22, 31, 60, 61, 63,
> #        64, 65, 66, 69, 70, 74, 75, 77, 82,  0,  1,  2,  3,  5,  8, 10, 11,
> #        13, 14, 17, 18, 19, 21, 22])]
> tab2 = pq.read_table('data/part-00001-fd2ff92f-201b-4ffc-b0b4-275e06a7fa02.snappy.parquet')
> df2 = tab2.to_pandas()
> df2[df2.account_id == 38658373328].legal_labels.tolist()
> # [array([ 0,  1,  2,  3,  5,  8, 10, 11, 13, 14, 17, 18, 19, 21, 22, 24, 28,
> #        30, 31, 36, 38, 39, 40, 41, 43, 49, 60, 61, 62, 63, 64, 65, 66, 67,
> #        69, 70, 74, 75, 77, 82, 90])]
> Unfortunately I cannot share the data files, and I was not able to create a dummy data file pair that would have triggered the bug. I'm sending this bug report in the hope that it is still useful without a minimal repro example.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)