Posted to dev@parquet.apache.org by "Wes McKinney (Jira)" <ji...@apache.org> on 2020/01/21 22:51:00 UTC

[jira] [Resolved] (PARQUET-1766) [C++] parquet NaN/null double statistics can result in endless loop

     [ https://issues.apache.org/jira/browse/PARQUET-1766?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wes McKinney resolved PARQUET-1766.
-----------------------------------
    Resolution: Fixed

Issue resolved by pull request 6167
[https://github.com/apache/arrow/pull/6167]

> [C++] parquet NaN/null double statistics can result in endless loop
> -------------------------------------------------------------------
>
>                 Key: PARQUET-1766
>                 URL: https://issues.apache.org/jira/browse/PARQUET-1766
>             Project: Parquet
>          Issue Type: Bug
>          Components: parquet-cpp
>            Reporter: Pierre Belzile
>            Assignee: Francois Saint-Jacques
>            Priority: Critical
>              Labels: parquet, pull-request-available
>             Fix For: cpp-1.6.0
>
>          Time Spent: 4h 20m
>  Remaining Estimate: 0h
>
> There is a bug in the statistics computation for double columns when writing an array that contains only NaNs and nulls to Parquet. The computation loops endlessly if the last cell of a write group is null. The line in error is [https://github.com/apache/arrow/blob/master/cpp/src/parquet/statistics.cc#L633], which checks for NaN but not for null. The code then falls through and loops endlessly, so the program appears frozen.
> The following snippet reproduces the problem:
> {noformat}
> TEST(parquet, nans) {
>   /* Create a small parquet structure */
>   std::vector<std::shared_ptr<::arrow::Field>> fields;
>   fields.push_back(::arrow::field("doubles", ::arrow::float64()));
>   std::shared_ptr<::arrow::Schema> schema = ::arrow::schema(std::move(fields));
>   std::unique_ptr<::arrow::RecordBatchBuilder> builder;
>   ::arrow::RecordBatchBuilder::Make(schema, ::arrow::default_memory_pool(), &builder);
>   builder->GetFieldAs<::arrow::DoubleBuilder>(0)->Append(std::numeric_limits<double>::quiet_NaN());
>   builder->GetFieldAs<::arrow::DoubleBuilder>(0)->AppendNull();
>   std::shared_ptr<::arrow::RecordBatch> batch;
>   builder->Flush(&batch);
>   arrow::PrettyPrint(*batch, 0, &std::cout);
>   std::shared_ptr<arrow::Table> table;
>   arrow::Table::FromRecordBatches({batch}, &table);
>   /* Attempt to write */
>   std::shared_ptr<::arrow::io::FileOutputStream> os;
>   arrow::io::FileOutputStream::Open("/tmp/test.parquet", &os);
>   parquet::WriterProperties::Builder writer_props_bld;
>   // writer_props_bld.disable_statistics("doubles");
>   std::shared_ptr<parquet::WriterProperties> writer_props = writer_props_bld.build();
>   std::shared_ptr<parquet::ArrowWriterProperties> arrow_props =
>       parquet::ArrowWriterProperties::Builder().store_schema()->build();
>   std::unique_ptr<parquet::arrow::FileWriter> writer;
>   parquet::arrow::FileWriter::Open(
>       *table->schema(), arrow::default_memory_pool(), os,
>       writer_props, arrow_props, &writer);
>   writer->WriteTable(*table, 1024);
> }
> {noformat}
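
For illustration only, the kind of scan the report describes can be sketched as a min/max pass that skips both nulls and NaNs and always advances its index. This is not the code from pull request 6167 or from statistics.cc: the function name MinMaxSkippingNaNAndNull, the std::vector<bool> validity mask, and the std::optional return type are all assumptions made for the example, and it requires C++17.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <optional>
#include <utility>
#include <vector>

// Hypothetical helper, not the Arrow/Parquet implementation.
// Returns {min, max} over the entries that are valid (non-null) and not NaN,
// or std::nullopt when no such entry exists.
// Assumes valid.size() == values.size().
std::optional<std::pair<double, double>> MinMaxSkippingNaNAndNull(
    const std::vector<double>& values, const std::vector<bool>& valid) {
  std::optional<std::pair<double, double>> result;
  // The index advances on every iteration, so a trailing null or NaN
  // cannot stall the scan.
  for (std::size_t i = 0; i < values.size(); ++i) {
    if (!valid[i] || std::isnan(values[i])) {
      continue;  // Skip nulls and NaNs alike.
    }
    if (!result) {
      result = std::make_pair(values[i], values[i]);
    } else {
      result->first = std::min(result->first, values[i]);
      result->second = std::max(result->second, values[i]);
    }
  }
  return result;
}
```

With the input from the reproduction above (one NaN followed by one null), this sketch returns an empty optional, which a caller could treat as "no min/max available" rather than looping.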


