Posted to user@drill.apache.org by basil arockia edwin <ba...@gmail.com> on 2017/03/29 20:02:10 UTC

Apache Drill Clarification on Reading Parquet files

Dear team,
       We are planning to use Apache Drill in our project to query
Parquet files that reside in the file system/OpenStack Swift, and we
would use it in our web application for analytics purposes.
    We need the questions below clarified before we can take a
further decision.

1. If we have 1000 Parquet files in a directory and our required
results are in only 5 of them, does Drill scan the metadata of all
1000 Parquet files, or only the 5 associated files?

2. Is it possible to install Apache Drill in cluster mode without
using HDFS for scaling?

Thanks,
Basil

Re: Apache Drill Clarification on Reading Parquet files

Posted by rahul challapalli <ch...@gmail.com>.
Welcome to the community; we are glad you are considering Drill for
your use case.

1. There are a few ways to make Drill avoid reading all the files.
Take a look at the items below (a sketch of each follows the list).
      a) Partition your data with the PARTITION BY clause, which
stores the partition information in the Parquet footer. Documentation
can be found at https://drill.apache.org/docs/partition-by-clause/
      b) Partition your data based on directory structure.
Documentation can be found at
https://drill.apache.org/docs/how-to-partition-data/
      c) You can also leverage Parquet filter pushdown, which works
at the row-group level; even within a single Parquet file, Drill can
skip reading the row groups the filter rules out. Documentation can
be found at https://drill.apache.org/docs/parquet-filter-pushdown/
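
A minimal sketch of all three approaches in Drill SQL. The dfs.tmp
workspace, the orders data sets, and the column names are
illustrative assumptions, not something from this thread:

  -- (a) CTAS with PARTITION BY: the partition column must appear in
  --     the select list; Drill records the partition values in the
  --     Parquet metadata so the planner can prune files.
  CREATE TABLE dfs.tmp.`orders_by_region`
  PARTITION BY (region)
  AS SELECT region, order_id, amount
  FROM dfs.tmp.`orders`;

  -- (b) Directory-based partitioning: with data laid out as
  --     /orders_by_day/2017/03/..., the implicit columns dir0 and
  --     dir1 expose the directory names, and filtering on them
  --     prunes whole directories at planning time.
  SELECT order_id, amount
  FROM dfs.tmp.`orders_by_day`
  WHERE dir0 = '2017' AND dir1 = '03';

  -- (c) Parquet filter pushdown: nothing extra to set up; a plain
  --     filter lets Drill skip row groups whose min/max statistics
  --     rule the predicate out.
  SELECT order_id, amount
  FROM dfs.tmp.`orders_by_region`
  WHERE amount > 1000;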

2. So you do not have a distributed file system like MapR-FS or HDFS
but still want to run Drill on multiple nodes. One obvious
requirement is to make sure your data is replicated exactly on all
the nodes where Drill is running. Drill also uses ZooKeeper for
coordination, so you would still need to install that. Since this is
not a widely used or tested configuration, I wouldn't be surprised if
you ran into issues.
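
For reference, a sketch of the ZooKeeper side of such a setup in
drill-override.conf on each node (the cluster id and host names are
illustrative, not from this thread):

  drill.exec: {
    # All drillbits that share the same cluster-id and the same
    # ZooKeeper quorum form one Drill cluster.
    cluster-id: "drillbits1",
    zk.connect: "zk1:2181,zk2:2181,zk3:2181"
  }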

Also, if you have a lot of Parquet files, you may want to take a look
at the Parquet metadata caching feature (
https://drill.apache.org/docs/optimizing-parquet-metadata-reading/).
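
Building the cache is a single command; a sketch reusing the
illustrative table name from above. REFRESH TABLE METADATA writes a
metadata cache file alongside the data, which the planner can read
instead of opening every Parquet footer:

  REFRESH TABLE METADATA dfs.tmp.`orders_by_region`;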

- Rahul
