Posted to jira@arrow.apache.org by "Jonathan mercier (Jira)" <ji...@apache.org> on 2021/03/08 13:47:00 UTC

[jira] [Updated] (ARROW-11903) Stored data to parquet do not fit values before the storing

     [ https://issues.apache.org/jira/browse/ARROW-11903?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jonathan mercier updated ARROW-11903:
-------------------------------------
    Description: 
Dear,

 

I am seeing strange behavior: the data do not keep their values once they are stored to Parquet.

 

The schema is:

 
{code:python}
from pyarrow import field, int8, int64, list_, schema, string, struct

variations = struct((field('start', int64(), nullable=False),
                     field('stop', int64(), nullable=False),
                     field('reference', string(), nullable=False),
                     field('alternative', string(), nullable=False),
                     field('category', int8(), nullable=False)))
variations_field = field('variations', list_(variations))
metadata = {b'pandas': b'{"index_columns": ["sample"], '
 b'"column_indexes": [{"name": null, "field_name": "sample", "pandas_type": "unicode", "numpy_type": "object", "metadata": {"encoding": "UTF-8"}}], '
 b'"columns": ['
 b'{"name": "variations", "field_name": "variations", "pandas_type": "list[object]", "numpy_type": "object", "metadata": null}, '
 b'{"name": "sample", "field_name": "sample", "pandas_type": "int64", "numpy_type": "int64", "metadata": null}], '
 b'"pandas_version": "1.2.0"}'}
sample_to_variations_schema = schema((sample_field, variations_field), metadata=metadata)
{code}
 

To store the data I do:
{code:python}
from os import makedirs, path
from pyarrow import Table
from pyarrow.parquet import ParquetWriter

table = Table.from_arrays([samples, variations_by_sample],
                          schema=sample_to_variations_schema)
dataset_dir = path.join(outdir, f'contig={contig}')
makedirs(dataset_dir, exist_ok=True)
with ParquetWriter(where=path.join(dataset_dir, 'variant_to_samples'),
                   version='2.0', schema=table.schema,
                   compression='SNAPPY') as pw:
    pw.write_table(table)
{code}


I put a breakpoint just after {{table}} is assigned, in order to check the values in memory.

Example for row n°210027:


{code:python}
>>> samples[210027]
831028
>>> variations_by_sample[210027]
[(241, 241, 'C', 'T', 0), (445, 445, 'T', 'C', 0), (3037, 3037, 'C', 'T', 0), (6286, 6286, 'C', 'T', 0), (11024, 11024, 'A', 'G', 0), (14408, 14408, 'C', 'T', 0), (21255, 21255, 'G', 'C', 0), (22227, 22227, 'C', 'T', 0), (23403, 23403, 'A', 'G', 0), (24140, 24140, 'G', 'A', 0), (25496, 25496, 'T', 'C', 0), (26801, 26801, 'C', 'G', 0), (27840, 27840, 'T', 'C', 0), (27944, 27944, 'C', 'T', 0), (27948, 27948, 'G', 'T', 0), (28932, 28932, 'C', 'T', 0), (29645, 29645, 'G', 'T', 0)]
{code}


Now the application ends successfully and the data are stored into a Parquet dataset.
So I load those data back and check their consistency.


{code:python}
$ ipython
In [1]: from pyarrow.parquet import read_table
   ...: sample_to_variants = read_table('sample_to_variants_db')

In [2]: row_num = 0
   ...: an_id = 0
   ...: while an_id != 831028:
   ...:     an_id = sample_to_variants.column(0)[row_num].as_py()
   ...:     row_num += 1
   ...: 
In [3]: sample_to_variants.column(0)[row_num-1].as_py()
Out[3]: 831028
In [4]: sample_to_variants.column(1)[row_num-1].as_py()
Out[4]: 
[{'start': 241,
  'stop': 241,
  'reference': 'C',
  'alternative': 'T',
  'category': 0},
 {'start': 445,
  'stop': 445,
  'reference': 'G',
  'alternative': 'T',
  'category': 0},
 {'start': 3037,
  'stop': 3037,
  'reference': 'G',
  'alternative': 'T',
  'category': 0},
 {'start': 6286,
  'stop': 6286,
  'reference': 'C',
  'alternative': 'T',
  'category': 0},
 {'start': 11024,
  'stop': 11024,
  'reference': 'C',
  'alternative': 'T',
  'category': 0},
 {'start': 14408,
  'stop': 14408,
  'reference': 'C',
  'alternative': 'T',
  'category': 0},
 {'start': 21255,
  'stop': 21255,
  'reference': 'G',
  'alternative': 'T',
  'category': 0},
 {'start': 22227,
  'stop': 22227,
  'reference': 'G',
  'alternative': 'A',
  'category': 0},
 {'start': 23403,
  'stop': 23403,
  'reference': 'C',
  'alternative': 'T',
  'category': 0},
 {'start': 24140,
  'stop': 24140,
  'reference': 'C',
  'alternative': 'T',
  'category': 0},
 {'start': 25496,
  'stop': 25496,
  'reference': 'A',
  'alternative': 'G',
  'category': 0},
 {'start': 26801,
  'stop': 26801,
  'reference': 'G',
  'alternative': 'T',
  'category': 0},
 {'start': 27840,
  'stop': 27840,
  'reference': 'C',
  'alternative': 'T',
  'category': 0},
 {'start': 27944,
  'stop': 27944,
  'reference': 'T',
  'alternative': 'C',
  'category': 0},
 {'start': 27948,
  'stop': 27948,
  'reference': 'G',
  'alternative': 'A',
  'category': 0},
 {'start': 28932,
  'stop': 28932,
  'reference': 'C',
  'alternative': 'T',
  'category': 0},
 {'start': 29645,
  'stop': 29645,
  'reference': 'G',
  'alternative': 'A',
  'category': 0}]
{code}

We can see that column 1 (0-based) does not have the same values as before being written to Parquet.
For example, in the Parquet dataset I have this value:

{code:python}
 {'start': 24140,
  'stop': 24140,
  'reference': 'C',
  'alternative': 'T',
  'category': 0},
{code}

while in memory, before being stored, it was:


{code:python}
(24140, 24140, 'G', 'A', 0)
{code}

I do not understand the mechanism that leads to this inconsistency,
so I am not able to produce a minimal reproducible example (sorry).

Thanks



> Stored data to parquet do not fit values before the storing
> -----------------------------------------------------------
>
>                 Key: ARROW-11903
>                 URL: https://issues.apache.org/jira/browse/ARROW-11903
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Archery
>    Affects Versions: 2.0.0
>            Reporter: Jonathan mercier
>            Priority: Major
>



--
This message was sent by Atlassian Jira
(v8.3.4#803005)