Posted to user@drill.apache.org by "Geercken, Uwe" <Uw...@swissport.com> on 2015/07/29 13:37:23 UTC

Querying partitioned Parquet files

Hello,

If I have a set of partitioned Parquet files on the filesystem and two drillbits with access to that filesystem, and I query the data using the column I partitioned on in the WHERE clause of the query, will both drillbits share the work?

Or do I need a distributed filesystem such as Hadoop's HDFS underneath to make the drillbits work in parallel (or work together)?

Tks.

Uwe
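
As background for the answers below, the directory-per-value layout that partition pruning relies on can be sketched like this. This is a minimal illustration in Python with hypothetical paths and file names, not Drill's actual pruning code:

```python
import os
import tempfile

# Build a hypothetical partitioned layout: one subdirectory per value of
# the partition column. Drill exposes directory levels like these as
# dir0, dir1, ... and can skip directories that cannot match the filter.
root = tempfile.mkdtemp()
for year in ("2013", "2014", "2015"):
    part_dir = os.path.join(root, year)
    os.makedirs(part_dir)
    open(os.path.join(part_dir, "data.parquet"), "w").close()

def prune(root, wanted):
    """Return only the partition directories that can match the filter."""
    return sorted(d for d in os.listdir(root) if d == wanted)

print(prune(root, "2015"))  # only the matching partition is read
```

With a filter on the partition column, only the surviving directories need to be scanned; the question in this thread is which drillbit ends up scanning them.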

Re: Querying partitioned Parquet files

Posted by Jason Altekruse <al...@gmail.com>.
I almost wrote that each node needs access to a common namespace, but I
decided to answer the question more in line with how it was originally
asked. As Parth confirmed, your point is valid. If you are okay with
reading all data over the network, NFS is definitely an option (it just
looks like part of the local disk, but is guaranteed to be available on all
of the machines as long as they have it mounted at the same path).

I will, however, say that I would lean towards doing as this JIRA suggests
and disabling the behavior where the non-NFS case happens to work if you
have a series of machines with the same files (or filenames) on all of
their local disks. It's just too fragile and will likely produce false
assumptions. I do think the web log querying use case I described, which
would require some core enhancements, should be considered a strong
potential use case for Drill.


Re: Querying partitioned Parquet files

Posted by Parth Chandra <pa...@apache.org>.
Yes, that would work too, though if there are inconsistencies among the
copies of the files, the results would be unreliable.

Parth
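
The consistency concern above can be illustrated with a small sketch (hypothetical node names and file contents; this is not anything Drill does itself): if every node keeps its own local copy of the files, the copies must be byte-identical, or two nodes can scan different data under the same filename.

```python
import hashlib

def copies_consistent(node_files):
    """node_files maps a node name to {filename: file bytes}.
    Returns True only if every node holds identical copies."""
    digests = set()
    for files in node_files.values():
        h = hashlib.sha256()
        for name in sorted(files):
            h.update(name.encode())
            h.update(files[name])
        digests.add(h.hexdigest())
    return len(digests) == 1

same = {"node1": {"a.parquet": b"rows"}, "node2": {"a.parquet": b"rows"}}
drifted = {"node1": {"a.parquet": b"rows"}, "node2": {"a.parquet": b"old"}}
print(copies_consistent(same), copies_consistent(drifted))  # True False
```

A shared filesystem sidesteps this entirely: there is only one copy, so there is nothing to drift.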


Re: Querying partitioned Parquet files

Posted by Adam Gilmore <dr...@gmail.com>.
Just to clarify this, Jason: you don't necessarily need HDFS or the like
for this. If you had, say, an NFS volume (for example, Amazon Elastic File
System), you could still accomplish it, right?  Or even if you simply had
all the files duplicated locally on every node.


Re: Querying partitioned Parquet files

Posted by Jason Altekruse <al...@gmail.com>.
Put a little more simply, the node that we end up planning the query on is
going to enumerate the files we will be reading in the query so that we
can assign work to specific nodes. This currently assumes we will know, at
planning time (on the single node), all of the files to be read. This
happens to work in a single-node setup because all of the work will be
done on one machine against one filesystem (the local fs). In the
distributed case we currently require a connection from each node to a
DFS.
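
A minimal sketch of that planning step (illustrative Python, not Drill's actual implementation; file and node names are hypothetical):

```python
# The planning node lists every file up front, then spreads the scan work
# across the worker nodes. This only works if the planner can see all of
# the files, which is why a shared (distributed) filesystem is assumed.
def plan_scan(files, nodes):
    assignments = {node: [] for node in nodes}
    for i, f in enumerate(sorted(files)):
        assignments[nodes[i % len(nodes)]].append(f)
    return assignments

files = ["part-0.parquet", "part-1.parquet", "part-2.parquet"]
print(plan_scan(files, ["drillbit-1", "drillbit-2"]))
# drillbit-1 gets part-0 and part-2; drillbit-2 gets part-1
```

If some files live only on one machine's local disk, the single planning node cannot enumerate them, which is the limitation the feature request below is about.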

There is an outstanding feature request to support a use case like
querying a series of server logs, where each machine may have a different
number of log files. We will need to modify the planning process to allow
for a more flexible scan description that lets us enumerate the files on
each machine separately when we actually go to read them.

This JIRA discusses the issue you are facing in more detail; I believe we
should have one outstanding for the feature request as well. I will take a
look around for it and open one if I can't find it soon.

https://issues.apache.org/jira/browse/DRILL-3230


Re: Querying partitioned Parquet files

Posted by Kristine Hahn <kh...@maprtech.com>.
Yes, you need a distributed file system to take advantage of Drill's query
planning. If you use multiple Drillbits and do not use a distributed file
system, the consistency of the fragment information cannot be maintained.



Kristine Hahn
Sr. Technical Writer
415-497-8107 @krishahn skype:krishahn

