Posted to issues@arrow.apache.org by "Jonathan mercier (Jira)" <ji...@apache.org> on 2021/03/08 13:40:00 UTC

[jira] [Created] (ARROW-11903) Data stored to Parquet does not match the in-memory values before writing

Jonathan mercier created ARROW-11903:
----------------------------------------

             Summary: Data stored to Parquet does not match the in-memory values before writing
                 Key: ARROW-11903
                 URL: https://issues.apache.org/jira/browse/ARROW-11903
             Project: Apache Arrow
          Issue Type: Bug
          Components: Archery
    Affects Versions: 2.0.0
            Reporter: Jonathan mercier


Dear all,

I am seeing strange behavior: data do not keep their values once they are stored to Parquet.

 

The schema is:

{code:python}
from pyarrow import field, int64, int8, string, struct, list_, schema

variations = struct((field('start', int64(), nullable=False),
                     field('stop', int64(), nullable=False),
                     field('reference', string(), nullable=False),
                     field('alternative', string(), nullable=False),
                     field('category', int8(), nullable=False)))
variations_field = field('variations', list_(variations))
# The original snippet does not show how sample_field is built; judging from the
# pandas metadata below, it is assumed to be an int64 column named 'sample'.
sample_field = field('sample', int64())
metadata = {b'pandas': b'{"index_columns": ["sample"], '
                       b'"column_indexes": [{"name": null, "field_name": "sample", "pandas_type": "unicode", "numpy_type": "object", "metadata": {"encoding": "UTF-8"}}], '
                       b'"columns": ['
                       b'{"name": "variations", "field_name": "variations", "pandas_type": "list[object]", "numpy_type": "object", "metadata": null}, '
                       b'{"name": "sample", "field_name": "sample", "pandas_type": "int64", "numpy_type": "int64", "metadata": null}], '
                       b'"pandas_version": "1.2.0"}'}
sample_to_variations_schema = schema((sample_field, variations_field), metadata=metadata)
{code}
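
The arrays fed to the writer below, samples and variations_by_sample, are built elsewhere in the application and are not shown here. Based on the breakpoint output further down, one entry of each is expected to look roughly like this (illustration only, values copied from that output; the tuple positions map onto the struct fields declared above):

{code:python}
# Illustration only, not taken from the application code.
samples = [831028]                      # one 'sample' id (int64) per row
variations_by_sample = [                # one list of variations per row
    [(241, 241, 'C', 'T', 0)],          # (start, stop, reference, alternative, category)
]
{code}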
 

To store the data I do:
{code:python}
from os import makedirs, path
from pyarrow import Table
from pyarrow.parquet import ParquetWriter

# Build the table from the two arrays and write one file per contig partition.
table = Table.from_arrays([samples, variations_by_sample], schema=sample_to_variations_schema)
dataset_dir = path.join(outdir, f'contig={contig}')
makedirs(dataset_dir, exist_ok=True)
with ParquetWriter(where=path.join(dataset_dir, 'variant_to_samples'),
                   version='2.0', schema=table.schema, compression='SNAPPY') as pw:
    pw.write_table(table)
{code}
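
As a sanity check, one could re-read the file right after the with block closes and compare it with the table still held in memory; the sketch below is not part of the original application, it only reuses the path variables from the snippet above and the standard pyarrow.parquet.read_table / Table.equals API:

{code:python}
# Hedged sanity check, not in the original code: re-read the file that was just
# written and compare it with the in-memory table.
from pyarrow.parquet import read_table

round_tripped = read_table(path.join(dataset_dir, 'variant_to_samples'))
print(table.equals(round_tripped))      # True would mean write + read preserve the data
print(round_tripped.column(1)[210027])  # spot-check the row inspected below
{code}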


I put a breakpoint just after table is assigned, in order to check the values in memory:

Example for row 210027:


{code:python}
>>> samples[210027]
831028
>>> variations_by_sample[210027]
[(241, 241, 'C', 'T', 0), (445, 445, 'T', 'C', 0), (3037, 3037, 'C', 'T', 0), (6286, 6286, 'C', 'T', 0), (11024, 11024, 'A', 'G', 0), (14408, 14408, 'C', 'T', 0), (21255, 21255, 'G', 'C', 0), (22227, 22227, 'C', 'T', 0), (23403, 23403, 'A', 'G', 0), (24140, 24140, 'G', 'A', 0), (25496, 25496, 'T', 'C', 0), (26801, 26801, 'C', 'G', 0), (27840, 27840, 'T', 'C', 0), (27944, 27944, 'C', 'T', 0), (27948, 27948, 'G', 'T', 0), (28932, 28932, 'C', 'T', 0), (29645, 29645, 'G', 'T', 0)]
{code}


The application then ends successfully and the data are stored in a Parquet dataset.
So I load the data back and check their consistency.


{code:python}
$ ipython
In [1]: from pyarrow.parquet import read_table
   ...: sample_to_variants = read_table('sample_to_variants_db')

In [2]: row_num = 0
   ...: an_id = 0
   ...: while an_id != 831028:
   ...:     an_id = sample_to_variants.column(0)[row_num].as_py()
   ...:     row_num += 1
   ...: 
In [3]: sample_to_variants.column(0)[row_num-1].as_py()
Out[3]: 831028
In [4]: sample_to_variants.column(1)[row_num-1].as_py()
Out[4]: 
[{'start': 241,
  'stop': 241,
  'reference': 'C',
  'alternative': 'T',
  'category': 0},
 {'start': 445,
  'stop': 445,
  'reference': 'G',
  'alternative': 'T',
  'category': 0},
 {'start': 3037,
  'stop': 3037,
  'reference': 'G',
  'alternative': 'T',
  'category': 0},
 {'start': 6286,
  'stop': 6286,
  'reference': 'C',
  'alternative': 'T',
  'category': 0},
 {'start': 11024,
  'stop': 11024,
  'reference': 'C',
  'alternative': 'T',
  'category': 0},
 {'start': 14408,
  'stop': 14408,
  'reference': 'C',
  'alternative': 'T',
  'category': 0},
 {'start': 21255,
  'stop': 21255,
  'reference': 'G',
  'alternative': 'T',
  'category': 0},
 {'start': 22227,
  'stop': 22227,
  'reference': 'G',
  'alternative': 'A',
  'category': 0},
 {'start': 23403,
  'stop': 23403,
  'reference': 'C',
  'alternative': 'T',
  'category': 0},
 {'start': 24140,
  'stop': 24140,
  'reference': 'C',
  'alternative': 'T',
  'category': 0},
 {'start': 25496,
  'stop': 25496,
  'reference': 'A',
  'alternative': 'G',
  'category': 0},
 {'start': 26801,
  'stop': 26801,
  'reference': 'G',
  'alternative': 'T',
  'category': 0},
 {'start': 27840,
  'stop': 27840,
  'reference': 'C',
  'alternative': 'T',
  'category': 0},
 {'start': 27944,
  'stop': 27944,
  'reference': 'T',
  'alternative': 'C',
  'category': 0},
 {'start': 27948,
  'stop': 27948,
  'reference': 'G',
  'alternative': 'A',
  'category': 0},
 {'start': 28932,
  'stop': 28932,
  'reference': 'C',
  'alternative': 'T',
  'category': 0},
 {'start': 29645,
  'stop': 29645,
  'reference': 'G',
  'alternative': 'A',
  'category': 0}]
{code}

We can see that column 1 (0-based) no longer has the same values as it had before being written to Parquet.
For example, in the Parquet dataset I get this value:

{code:python}
 {'start': 24140,
  'stop': 24140,
  'reference': 'C',
  'alternative': 'T',
  'category': 0},
{code}

while in memory, before the data were stored, the same entry was:


{code:python}
(24140, 24140, 'G', 'A', 0)
{code}

I do not understand what mechanism leads to this inconsistency,
so I have not been able to build a minimal example case (sorry).
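
For reference, a fully self-contained round trip with the same list<struct> layout would look roughly like the sketch below (synthetic values, standard pyarrow API). This is only the shape a minimal case might take, not a confirmed reproduction of the problem:

{code:python}
# Hedged, self-contained sketch of a minimal round trip with a list<struct> column.
# Values are synthetic; this is not a confirmed reproduction of the bug.
import pyarrow as pa
import pyarrow.parquet as pq

variations_type = pa.struct([('start', pa.int64()), ('stop', pa.int64()),
                             ('reference', pa.string()), ('alternative', pa.string()),
                             ('category', pa.int8())])
table = pa.table({
    'sample': pa.array([831028], type=pa.int64()),
    'variations': pa.array([[{'start': 241, 'stop': 241, 'reference': 'C',
                              'alternative': 'T', 'category': 0}]],
                           type=pa.list_(variations_type)),
})
pq.write_table(table, 'roundtrip_check.parquet', version='2.0', compression='snappy')
print(pq.read_table('roundtrip_check.parquet').equals(table))  # expected: True
{code}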

Thanks




--
This message was sent by Atlassian Jira
(v8.3.4#803005)