Posted to dev@parquet.apache.org by "Lars Volker (JIRA)" <ji...@apache.org> on 2017/09/28 23:44:00 UTC

[jira] [Created] (PARQUET-1118) Build a corpus of Parquet files that client implementations can use for validation

Lars Volker created PARQUET-1118:
------------------------------------

             Summary: Build a corpus of Parquet files that client implementations can use for validation
                 Key: PARQUET-1118
                 URL: https://issues.apache.org/jira/browse/PARQUET-1118
             Project: Parquet
          Issue Type: Task
          Components: parquet-format
            Reporter: Lars Volker


We should build a corpus of Parquet files that client implementations can use for validation. In addition to the input files, it should contain a description or a verbatim copy of the data in each file, so that readers can validate their results.
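To illustrate how a client could use such a corpus, here is a minimal sketch in Python. It assumes each Parquet file is paired with a JSON file holding the expected rows as an array of objects; that layout, and the names validate_reader and read_rows, are hypothetical and not part of any existing repo. The reader under test is passed in as a callable, so the check does not depend on a particular implementation.

```python
import json
from typing import Any, Callable, Dict, List

Row = Dict[str, Any]


def validate_reader(read_rows: Callable[[str], List[Row]],
                    parquet_path: str,
                    expected_json_path: str) -> List[str]:
    """Compare a reader's output with the expected rows for one corpus file.

    Returns a list of mismatch descriptions; an empty list means the
    reader produced exactly the expected data.
    """
    # Assumed layout: the expected data is a JSON array of row objects.
    with open(expected_json_path, encoding="utf-8") as f:
        expected = json.load(f)

    # read_rows is the client implementation under test.
    actual = read_rows(parquet_path)

    problems = []
    if len(actual) != len(expected):
        problems.append(
            f"row count: expected {len(expected)}, got {len(actual)}")
    # Compare row by row over the common prefix.
    for i, (want, got) in enumerate(zip(expected, actual)):
        if want != got:
            problems.append(f"row {i}: expected {want!r}, got {got!r}")
    return problems
```

A client's test suite could call this once per file in the corpus and fail if any returned list is non-empty. Files that are intentionally corrupt would need a different kind of expectation (for example, an expected error), which this sketch does not cover.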

As a starting point we can look at [the old parquet-compatibility repo|https://github.com/Parquet/parquet-compatibility] and [Impala's test data, in particular the Parquet files it contains|https://github.com/apache/incubator-impala/tree/master/testdata].

{noformat}
$ find testdata | grep -i parq
testdata/workloads/tpch/queries/insert_parquet.test
testdata/workloads/functional-planner/queries/PlannerTest/parquet-filtering.test
testdata/workloads/functional-planner/queries/PlannerTest/parquet-stats-agg.test
testdata/workloads/functional-query/queries/QueryTest/parquet-filtering.test
testdata/workloads/functional-query/queries/QueryTest/parquet-zero-rows.test
testdata/workloads/functional-query/queries/QueryTest/insert_parquet_invalid_codec.test
testdata/workloads/functional-query/queries/QueryTest/parquet-corrupt-rle-counts-abort.test
testdata/workloads/functional-query/queries/QueryTest/parquet-ambiguous-list-legacy.test
testdata/workloads/functional-query/queries/QueryTest/parquet-stats-agg.test
testdata/workloads/functional-query/queries/QueryTest/parquet-deprecated-stats.test
testdata/workloads/functional-query/queries/QueryTest/nested-types-parquet-stats.test
testdata/workloads/functional-query/queries/QueryTest/parquet-resolution-by-name.test
testdata/workloads/functional-query/queries/QueryTest/parquet-abort-on-error.test
testdata/workloads/functional-query/queries/QueryTest/mt-dop-parquet.test
testdata/workloads/functional-query/queries/QueryTest/parquet.test
testdata/workloads/functional-query/queries/QueryTest/parquet-corrupt-rle-counts.test
testdata/workloads/functional-query/queries/QueryTest/parquet-continue-on-error.test
testdata/workloads/functional-query/queries/QueryTest/mt-dop-parquet-nested.test
testdata/workloads/functional-query/queries/QueryTest/parquet-ambiguous-list-modern.test
testdata/workloads/functional-query/queries/QueryTest/parquet-stats.test
testdata/max_nesting_depth/int_map/file.parq
testdata/max_nesting_depth/struct/file.parq
testdata/max_nesting_depth/struct_map/file.parq
testdata/max_nesting_depth/int_array/file.parq
testdata/max_nesting_depth/struct_array/file.parq
testdata/parquet_nested_types_encodings
testdata/parquet_nested_types_encodings/README
testdata/parquet_nested_types_encodings/UnannotatedListOfGroups.parquet
testdata/parquet_nested_types_encodings/AmbiguousList_Modern.parquet
testdata/parquet_nested_types_encodings/UnannotatedListOfPrimitives.parquet
testdata/parquet_nested_types_encodings/AmbiguousList.json
testdata/parquet_nested_types_encodings/AvroPrimitiveInList.parquet
testdata/parquet_nested_types_encodings/ThriftPrimitiveInList.parquet
testdata/parquet_nested_types_encodings/bad-avro.parquet
testdata/parquet_nested_types_encodings/AmbiguousList.avsc
testdata/parquet_nested_types_encodings/SingleFieldGroupInList.parquet
testdata/parquet_nested_types_encodings/ThriftSingleFieldGroupInList.parquet
testdata/parquet_nested_types_encodings/AvroSingleFieldGroupInList.parquet
testdata/parquet_nested_types_encodings/AmbiguousList_Legacy.parquet
testdata/parquet_nested_types_encodings/bad-thrift.parquet
testdata/ComplexTypesTbl/nonnullable.parq
testdata/ComplexTypesTbl/nullable.parq
testdata/bad_parquet_data
testdata/bad_parquet_data/README
testdata/bad_parquet_data/dict-encoded-out-of-bounds.parq
testdata/bad_parquet_data/plain-encoded-negative-len.parq
testdata/bad_parquet_data/plain-encoded-out-of-bounds.parq
testdata/bad_parquet_data/dict-encoded-negative-len.parq
testdata/parquet_schema_resolution
testdata/parquet_schema_resolution/README
testdata/parquet_schema_resolution/switched_map.json
testdata/parquet_schema_resolution/switched_map.avsc
testdata/parquet_schema_resolution/switched_map.parq
testdata/src/main/java/org/apache/impala/datagenerator/JsonToParquetConverter.java
testdata/LineItemMultiBlock/lineitem_one_row_group.parquet
testdata/LineItemMultiBlock/lineitem_sixblocks.parquet
testdata/data/zero_rows_zero_row_groups.parquet
testdata/data/chars-formats.parquet
testdata/data/multiple_rowgroups.parquet
testdata/data/bad_parquet_data.parquet
testdata/data/bad_metadata_len.parquet
testdata/data/huge_num_rows.parquet
testdata/data/bad_compressed_size.parquet
testdata/data/zero_rows_one_row_group.parquet
testdata/data/bad_rle_repeat_count.parquet
testdata/data/bad_column_metadata.parquet
testdata/data/alltypesagg_hive_13_1.parquet
testdata/data/bad_dict_page_offset.parquet
testdata/data/bad_rle_literal_count.parquet
testdata/data/bad_magic_number.parquet
testdata/data/repeated_values.parquet
testdata/data/schemas/malformed_decimal_tiny.parquet
testdata/data/schemas/alltypestiny.parquet
testdata/data/schemas/nested/modern_nested.parquet
testdata/data/schemas/nested/legacy_nested.parquet
testdata/data/schemas/enum/enum.parquet
testdata/data/schemas/decimal.parquet
testdata/data/schemas/zipcode_incomes.parquet
testdata/data/repeated_root_schema.parquet
testdata/data/long_page_header.parquet
testdata/data/deprecated_statistics.parquet
testdata/data/kite_required_fields.parquet
testdata/data/out_of_range_timestamp.parquet
{noformat}
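When sorting files like the ones listed above into valid and intentionally broken inputs, a cheap first check is the Parquet magic number: a well-formed file starts and ends with the 4 bytes "PAR1". The following is only a sketch; the function names are made up for illustration, and passing this check does not mean the rest of the file is valid.

```python
import os

PARQUET_MAGIC = b"PAR1"


def has_parquet_magic(path: str) -> bool:
    """Return True if the file starts and ends with the Parquet magic."""
    # Leading magic (4) + footer length (4) + trailing magic (4):
    # nothing shorter than 12 bytes can be a valid Parquet file.
    if os.path.getsize(path) < 12:
        return False
    with open(path, "rb") as f:
        head = f.read(4)
        f.seek(-4, os.SEEK_END)
        tail = f.read(4)
    return head == PARQUET_MAGIC and tail == PARQUET_MAGIC


def find_parquet_files(root: str):
    """Yield paths under root whose name ends in .parq or .parquet."""
    for dirpath, _dirs, files in os.walk(root):
        for name in sorted(files):
            if name.lower().endswith((".parq", ".parquet")):
                yield os.path.join(dirpath, name)
```

Unlike the grep above, find_parquet_files only matches data files by extension, so the .test, README, .json and .avsc entries are skipped. A file written with a deliberately wrong magic number would presumably fail has_parquet_magic, which is the expected outcome for a negative test case.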

Impala also has a tool to generate Parquet files from JSON files: https://github.com/apache/incubator-impala/blob/master/testdata/src/main/java/org/apache/impala/datagenerator/JsonToParquetConverter.java

Arrow has a similar tool: https://github.com/apache/arrow/blob/master/integration/integration_test.py


