Posted to issues@arrow.apache.org by "Ildar (JIRA)" <ji...@apache.org> on 2019/01/18 16:55:00 UTC
[jira] [Updated] (ARROW-4293) [C++] Can't access parquet statistics on binary columns
[ https://issues.apache.org/jira/browse/ARROW-4293?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Ildar updated ARROW-4293:
-------------------------
Description:
Hi,
I'm trying to use per-column statistics (min/max values) to filter out row groups while reading a Parquet file, but I don't see statistics built for binary columns. I noticed that {{ApplicationVersion::HasCorrectStatistics()}} discards statistics that have sort order {{UNSIGNED}} and haven't been created by {{parquet-cpp}}. As I understand it, there used to be some issues in {{parquet-mr}}. Do they still persist?
For example, I have a Parquet file created with {{parquet-mr}} version 1.10, and it seems to have correct min/max values for binary columns. {{parquet-cpp}} works fine for me if I remove this code from {{HasCorrectStatistics()}}:
{code:cpp}
if (SortOrder::SIGNED != sort_order && !max_equals_min) {
  return false;
}
{code}
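To make the concern concrete, here is a minimal sketch of why the sort order matters when min/max statistics are used to skip row groups. It is not parquet-cpp code: {{LessUnsigned}}, {{LessSigned}}, {{MinMax}}, {{ComputeMinMax}}, {{CanSkipRowGroup}} and {{SignedStatsWronglySkip}} are hypothetical names made up for illustration. The assumption (based on my understanding of the old {{parquet-mr}} issue) is that a writer computed min/max for binary values by comparing bytes as signed, while the reader interprets them with unsigned byte order.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Hypothetical helpers for illustration only; none of these are parquet-cpp APIs.

// Lexicographic comparison that treats every byte as unsigned (0..255).
bool LessUnsigned(const std::string& a, const std::string& b) {
  return std::lexicographical_compare(
      a.begin(), a.end(), b.begin(), b.end(), [](char x, char y) {
        return static_cast<unsigned char>(x) < static_cast<unsigned char>(y);
      });
}

// Same comparison, but treating every byte as signed (-128..127).
bool LessSigned(const std::string& a, const std::string& b) {
  return std::lexicographical_compare(
      a.begin(), a.end(), b.begin(), b.end(), [](char x, char y) {
        return static_cast<signed char>(x) < static_cast<signed char>(y);
      });
}

struct MinMax {
  std::string min;
  std::string max;
};

using Comparator = bool (*)(const std::string&, const std::string&);

// Computes min/max under the given ordering. Assumes `values` is non-empty.
MinMax ComputeMinMax(const std::vector<std::string>& values, Comparator less) {
  auto mm = std::minmax_element(values.begin(), values.end(), less);
  return MinMax{*mm.first, *mm.second};
}

// A reader that assumes unsigned order may skip a row group for the predicate
// `column == target` when target falls outside [min, max].
bool CanSkipRowGroup(const MinMax& stats, const std::string& target) {
  return LessUnsigned(target, stats.min) || LessUnsigned(stats.max, target);
}

// Returns true if statistics computed with signed order would make the reader
// skip a row group that actually contains the target value.
bool SignedStatsWronglySkip() {
  // "\xC3\xA9" is the UTF-8 encoding of 'é'; both bytes are >= 0x80.
  const std::vector<std::string> values = {"a", "\xC3\xA9", "z"};
  const std::string target = "\xC3\xA9";
  const MinMax signed_stats = ComputeMinMax(values, LessSigned);
  return CanSkipRowGroup(signed_stats, target);
}
```

With values containing only 7-bit ASCII bytes the two orderings agree, which may be why statistics from some files look correct even if the writer used signed comparison. Whether {{parquet-mr}} 1.10 still behaves this way is exactly the open question in this issue.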
was:
Hi,
I'm trying to use per-column statistics (min/max values) to filter out row groups while reading a Parquet file, but I don't see statistics built for binary columns. I noticed that {{ApplicationVersion::HasCorrectStatistics()}} discards statistics that have sort order {{UNSIGNED}} and haven't been created by {{parquet-cpp}}. As I understand it, there used to be some issues in {{parquet-mr}}. Do they still persist?
For example, I have a Parquet file created with {{parquet-mr}} version 1.10, and it seems to have correct min/max values for binary columns. {{parquet-cpp}} works fine for me if I remove this code from {{HasCorrectStatistics()}}:
{{ if (SortOrder::SIGNED != sort_order && !max_equals_min) {}}
{{ return false; }}}
> [C++] Can't access parquet statistics on binary columns
> -------------------------------------------------------
>
> Key: ARROW-4293
> URL: https://issues.apache.org/jira/browse/ARROW-4293
> Project: Apache Arrow
> Issue Type: Bug
> Reporter: Ildar
> Priority: Major
>
> Hi,
> I'm trying to use per-column statistics (min/max values) to filter out row groups while reading a Parquet file, but I don't see statistics built for binary columns. I noticed that {{ApplicationVersion::HasCorrectStatistics()}} discards statistics that have sort order {{UNSIGNED}} and haven't been created by {{parquet-cpp}}. As I understand it, there used to be some issues in {{parquet-mr}}. Do they still persist?
> For example, I have a Parquet file created with {{parquet-mr}} version 1.10, and it seems to have correct min/max values for binary columns. {{parquet-cpp}} works fine for me if I remove this code from {{HasCorrectStatistics()}}:
>
> {code:cpp}
> if (SortOrder::SIGNED != sort_order && !max_equals_min) {
>   return false;
> }
> {code}
>