Posted to dev@parquet.apache.org by "Jim Pivarski (JIRA)" <ji...@apache.org> on 2017/09/28 16:01:00 UTC
[jira] [Issue Comment Deleted] (PARQUET-1084) Parquet-C++ doesn't selectively read columns

     [ https://issues.apache.org/jira/browse/PARQUET-1084?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jim Pivarski updated PARQUET-1084:
----------------------------------
    Comment: was deleted

(was: If the file is opened as a memory map (I don't know that I initiated this, but perhaps it's the default), then it would be useful to know an affected operating system. Here's mine:

{{% uname -a
Linux localhost 3.18.0-14875-g438cb8ab27c6 #1 SMP PREEMPT Tue Sep 12 13:55:56 PDT 2017 x86_64 x86_64 x86_64 GNU/Linux

% lsb_release -a
LSB Version:    core-2.0-amd64:core-2.0-noarch:core-3.0-amd64:core-3.0-noarch:core-3.1-amd64:core-3.1-noarch:core-3.2-amd64:core-3.2-noarch:core-4.0-amd64:core-4.0-noarch:core-4.1-amd64:core-4.1-noarch:cxx-3.0-amd64:cxx-3.0-noarch:cxx-3.1-amd64:cxx-3.1-noarch:cxx-3.2-amd64:cxx-3.2-noarch:cxx-4.0-amd64:cxx-4.0-noarch:cxx-4.1-amd64:cxx-4.1-noarch:desktop-3.1-amd64:desktop-3.1-noarch:desktop-3.2-amd64:desktop-3.2-noarch:desktop-4.0-amd64:desktop-4.0-noarch:desktop-4.1-amd64:desktop-4.1-noarch:graphics-2.0-amd64:graphics-2.0-noarch:graphics-3.0-amd64:graphics-3.0-noarch:graphics-3.1-amd64:graphics-3.1-noarch:graphics-3.2-amd64:graphics-3.2-noarch:graphics-4.0-amd64:graphics-4.0-noarch:graphics-4.1-amd64:graphics-4.1-noarch:languages-3.2-amd64:languages-3.2-noarch:languages-4.0-amd64:languages-4.0-noarch:languages-4.1-amd64:languages-4.1-noarch:multimedia-3.2-amd64:multimedia-3.2-noarch:multimedia-4.0-amd64:multimedia-4.0-noarch:multimedia-4.1-amd64:multimedia-4.1-noarch:printing-3.2-amd64:printing-3.2-noarch:printing-4.0-amd64:printing-4.0-noarch:printing-4.1-amd64:printing-4.1-noarch:qt4-3.1-amd64:qt4-3.1-noarch:security-4.0-amd64:security-4.0-noarch:security-4.1-amd64:security-4.1-noarch
Distributor ID: Ubuntu
Description:    Ubuntu 14.04.5 LTS
Release:        14.04
Codename:       trusty}}
)

> Parquet-C++ doesn't selectively read columns
> --------------------------------------------
>
>                 Key: PARQUET-1084
>                 URL: https://issues.apache.org/jira/browse/PARQUET-1084
>             Project: Parquet
>          Issue Type: Bug
>          Components: parquet-cpp
>    Affects Versions: cpp-1.0.0, cpp-1.2.0
>            Reporter: Jim Pivarski
>              Labels: performance
>             Fix For: cpp-1.3.0
>
>
> I first saw this reported in a [review of file formats for C++](https://indico.cern.ch/event/567550/contributions/2628878/attachments/1511966/2358123/hep-file-formats.pdf), which showed that an attempt to read two columns from a Parquet file in C++ resulted in the whole file (all 26 columns) being read (18th page of the PDF, "15 / 25" in the bottom-right corner). That test used Parquet-C++ version 1.2.0.
> To check this, I pip-installed pyarrow (version 0.6.0), which comes with Parquet-C++ version 1.0.0. I used [vmtouch](https://hoytech.com/vmtouch/) to identify the fraction of pages touched, and double-checked by measuring the time-to-load. The fact that it's a slow disk makes it obvious whether it's reading one column or all columns.
> I'm using the same files as the presenter of that talk: [B2HHH.parquet-inflated](https://cernbox.cern.ch/index.php/s/ub43DwvQIFwxfxs/download?path=%2F&files=B2HHH.parquet-inflated) and [B2HHH.parquet-deflated](https://cernbox.cern.ch/index.php/s/ub43DwvQIFwxfxs/download?path=%2F&files=B2HHH.parquet-deflated). They have 20 double-precision columns and 6 int32 columns with no nesting, with 500 rows per row group across 17113 row groups (8556118 rows in total), which comes to 1.5 GB for the inflated (uncompressed) file. Each column within a row group should be 4000 or 2000 bytes, so reading one column should be one or two 4k disk pages per row group out of 769 disk pages per row group, depending on alignment. Granularity should not be a problem, as it would be if the row groups were too small.
> *Procedure:*
> # I evicted the uncompressed file from VM cache to force reads to come from disk.
> # I imported {{pyarrow.parquet}} in Python and called {{read_table("data/B2HHH-inflated.parquet", ["h1_px"])}} (one column).
> # I checked how much of the file had been loaded into VM cache.
> # I also checked the time-to-load of one column from cold cache versus all columns from cold cache.
> The result is that the entire file gets loaded into VM cache and the file takes 14.6 seconds to read regardless of whether I read one column or the whole file. (From warm cache it takes 4.7 seconds, so we're clearly seeing the effect of disk speed.) Both methods agree that the file is _not_ being selectively read, as I think it should be.
> Is there a setting that the presenter of the talk (using Parquet-C++ version 1.2.0 in C++) and I (using pyarrow with Parquet-C++ 1.0.0 in Python) are both missing? Is this a future feature? I would consider it to be a performance bug, since a major reason for having a columnar data format is to read columns selectively.
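
The procedure in the quoted description can be sketched roughly as follows. This is only an illustration, not the reporter's actual script: the helper names (expected_fraction, time_call, compare_reads) are made up for this sketch, it assumes vmtouch is installed and on the PATH, and the size estimate ignores Parquet metadata and encoding overhead. The column counts and rows per row group are the ones given in the description.

```python
import subprocess
import time

# Schema facts taken from the issue text: 20 double-precision and 6 int32
# columns, 500 rows per row group.
N_DOUBLE, N_INT32, ROWS_PER_GROUP = 20, 6, 500


def expected_fraction(n_double_selected, n_int32_selected=0):
    """Fraction of the uncompressed column data a selective read should need.

    Rough estimate only: ignores file metadata and encoding overhead.
    """
    selected = ROWS_PER_GROUP * (8 * n_double_selected + 4 * n_int32_selected)
    total = ROWS_PER_GROUP * (8 * N_DOUBLE + 4 * N_INT32)
    return selected / total


def time_call(fn, *args, **kwargs):
    """Call fn once and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start


def compare_reads(path, columns):
    """Time a cold-cache read of `columns` against a cold-cache full read.

    Assumption: `vmtouch -e` is available to evict the file from the page
    cache before each measurement.
    """
    # Imported lazily so the helpers above can be used without pyarrow.
    import pyarrow.parquet as pq

    subprocess.run(["vmtouch", "-e", path], check=True)
    _, t_selected = time_call(pq.read_table, path, columns=columns)

    subprocess.run(["vmtouch", "-e", path], check=True)
    _, t_full = time_call(pq.read_table, path)

    return t_selected, t_full
```

With these numbers, expected_fraction(1) is 4000 / 92000, so a read of the single double-precision column h1_px should need only a few percent of the column data. Calling compare_reads("data/B2HHH-inflated.parquet", ["h1_px"]) would then show whether the selective read is actually cheaper than the full read; running vmtouch on the file after the selective read reports how much of it ended up resident in the page cache.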


