Posted to user@drill.apache.org by Yousef Lasi <yo...@gmail.com> on 2015/05/20 22:25:57 UTC

Auto-splitting delimited files

It appears that we will be implementing Drill before our Hadoop infrastructure is ready for production. A question that's come up related to deploying Drill on clustered Linux hosts (i.e. hosts with a shared file system but no HDFS) is whether Drill parallelization can take advantage of multiple Drillbits in this scenario.

Should we expect Drill to auto-split large CSV files and read/sort them in parallel? That does not appear to happen in our testing. We've had to manually partition large files into sets of files stored in a shared folder.
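(For reference, a minimal sketch of the kind of manual split we've been doing, assuming a single header row that each chunk should repeat; paths and chunk size are illustrative:)

import itertools

# Break a large CSV into fixed-size chunks, repeating the header row in each
# chunk so every piece is independently readable. Paths and sizes are made up.
SRC = "/mnt/shared/big_input.csv"
DEST = "/mnt/shared/big_input_split/part_{:03d}.csv"
LINES_PER_CHUNK = 5_000_000

with open(SRC, "r") as src:
    header = src.readline()
    for i in itertools.count():
        chunk = list(itertools.islice(src, LINES_PER_CHUNK))
        if not chunk:
            break
        with open(DEST.format(i), "w") as out:
            out.write(header)
            out.writelines(chunk)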

Is there any value to having multiple Drillbits with access to the same shared files in CFS/GFS?

Thanks

Re: Auto-splitting delimited files

Posted by Yousef Lasi <yo...@gmail.com>.
I've sent the full JSON profile of the query in a separate mail message.
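(In case it's useful, the same profile can also be pulled straight from any Drillbit's web server; a rough sketch, assuming the default web port and that /profiles.json lists recent queries in the shape shown:)

import requests

# Fetch the JSON profile of a recent query from the Drill web server.
# Port 8047 is Drill's default; the response shape below is an assumption.
DRILL = "http://localhost:8047"

recent = requests.get(f"{DRILL}/profiles.json").json()
query_id = recent["finishedQueries"][0]["queryId"]  # assumed response shape

profile = requests.get(f"{DRILL}/profiles/{query_id}.json").json()
print(profile["query"])  # the SQL text; the rest holds per-fragment timings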

May 21 2015 12:16 PM, "Ted Dunning" <te...@gmail.com> wrote: 
> Can you publish the test queries and associated logical and physical plans?
> 
> On Thu, May 21, 2015 at 7:06 AM, Yousef Lasi <yo...@gmail.com> wrote:
> 
>> We do expect to use MapRFS at some point so data locality will be
>> available to Drill once that happens. In the interim, we're trying to
>> leverage Drill to pre-process large data sets. As an example, we're
>> creating a view into a join across 4 large files (the largest of which is
>> 20 GB). This join currently takes about 40 minutes on a single server using a
>> local file system. By manually splitting the files, we gain some
>> performance as the elapsed time drops down to ~ 30 minutes.
>> 
>> The part where we get a little lost is in understanding the optimization
>> process. Based on the query plan, it appears that the majority of the time
>> is spent on the hash joins. Logically, it would make sense that if we split
>> the files into smaller chunks we would gain increasing efficiency. However,
>> this doesn't appear to be the case as we're not really getting much
>> improvement beyond the 30 minute range despite increasing parallelization
>> by adding additional Drillbits and file partitions.
>> 
>> May 21 2015 12:55 AM, "Ted Dunning" <te...@gmail.com> wrote:
>>> Drill loses locality information on anything but an HDFS-oriented file
>>> system.  That might be part of what you are observing.  Having pre-split
>>> files should allow parallelism.
>>> 
>>> Can you describe your experiments in more detail?
>>> 
>>> Also, what specifically do you mean by CFS and GFS?  Ceph and Gluster?
>>> 
>>> It might help you if you check out the MapR community edition.  That
>> would
>>> give you a more standard view of a shared file system since it allows
>>> distributed NFS service.  You also don't have to worry about the
>>> implications of having an object store under your file system as with
>>> Ceph.  Instead, the cluster (made up of any machines you have) would
>>> present as a *very* standard file system with the exception of locking.
>>> This would have the side effect of letting you experiment on the same
>> data
>>> from both kinds of API (NFS and HDFS) to check for differences.
>>> 
>>> On Wed, May 20, 2015 at 1:25 PM, Yousef Lasi <yo...@gmail.com>
>> wrote:
>>> 
>>>> It appears that we will be implementing Drill before our Hadoop
>>>> infrastructure is ready for production. A question that's come up
>> related
>>>> to deploying Drill on clustered
>>>> Linux hosts (i.e. hosts with a shared file system but no HDFS) is
>> whether
>>>> Drill parallelization can take advantage of multiple Drillbits in this
>>>> scenario.
>>>> 
>>>> Should we expect Drill to auto-split large CSV files and read/sort them
>>>> in parallel? That does not appear to happen in our testing. We've had to
>>>> manually partition large files into sets of files stored in a shared
>> folder.
>>>> 
>>>> Is there any value to having multiple Drillbits with access to the same
>>>> shared files in CFS/GFS?
>>>> 
>>>> Thanks

Re: Auto-splitting delimited files

Posted by Ted Dunning <te...@gmail.com>.
Can you publish the test queries and associated logical and physical plans?
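(For anyone reproducing this: both plans can be generated with EXPLAIN; a sketch over the REST API, with an illustrative query:)

import requests

# Pull the logical and physical plans for a query over Drill's REST API.
# EXPLAIN PLAN WITHOUT IMPLEMENTATION yields the logical plan; plain
# EXPLAIN PLAN yields the physical plan. The query itself is illustrative.
DRILL = "http://localhost:8047"

def run_sql(sql):
    resp = requests.post(f"{DRILL}/query.json",
                         json={"queryType": "SQL", "query": sql})
    resp.raise_for_status()
    return resp.json()

q = "SELECT * FROM dfs.`/mnt/shared/big_input_split` LIMIT 10"
logical = run_sql("EXPLAIN PLAN WITHOUT IMPLEMENTATION FOR " + q)
physical = run_sql("EXPLAIN PLAN FOR " + q)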

On Thu, May 21, 2015 at 7:06 AM, Yousef Lasi <yo...@gmail.com> wrote:

> We do expect to use MapRFS at some point so data locality will be
> available to Drill once that happens. In the interim, we're trying to
> leverage Drill to pre-process large data sets. As an example, we're
> creating a view into a join across 4 large files (the largest of which is
> 20 GB). This join currently takes about 40 minutes on a single server using a
> local file system. By manually splitting the files, we gain some
> performance as the elapsed time drops down to ~ 30 minutes.
>
> The part where we get a little lost is in understanding the optimization
> process. Based on the query plan, it appears that the majority of the time
> is spent on the hash joins. Logically, it would make sense that if we split
> the files into smaller chunks we would gain increasing efficiency. However,
> this doesn't appear to be the case as we're not really getting much
> improvement beyond the 30 minute range despite increasing parallelization
> by adding additional Drillbits and file partitions.
>
>
> May 21 2015 12:55 AM, "Ted Dunning" <te...@gmail.com> wrote:
> > Drill loses locality information on anything but an HDFS-oriented file
> > system.  That might be part of what you are observing.  Having pre-split
> > files should allow parallelism.
> >
> > Can you describe your experiments in more detail?
> >
> > Also, what specifically do you mean by CFS and GFS?  Ceph and Gluster?
> >
> > It might help you if you check out the MapR community edition.  That
> would
> > give you a more standard view of a shared file system since it allows
> > distributed NFS service.  You also don't have to worry about the
> > implications of having an object store under your file system as with
> > Ceph.  Instead, the cluster (made up of any machines you have) would
> > present as a *very* standard file system with the exception of locking.
> > This would have the side effect of letting you experiment on the same
> data
> > from both kinds of API (NFS and HDFS) to check for differences.
> >
> > On Wed, May 20, 2015 at 1:25 PM, Yousef Lasi <yo...@gmail.com>
> wrote:
> >
> >> It appears that we will be implementing Drill before our Hadoop
> >> infrastructure is ready for production. A question that's come up
> related
> >> to deploying Drill on clustered
> >> Linux hosts (i.e. hosts with a shared file system but no HDFS) is
> whether
> >> Drill parallelization can take advantage of multiple Drillbits in this
> >> scenario.
> >>
> >> Should we expect Drill to auto-split large CSV files and read/sort them
> >> in parallel? That does not appear to happen in our testing. We've had to
> >> manually partition large files into sets of files stored in a shared
> folder.
> >>
> >> Is there any value to having multiple Drillbits with access to the same
> >> shared files in CFS/GFS?
> >>
> >> Thanks
>

Re: Auto-splitting delimited files

Posted by Yousef Lasi <yo...@gmail.com>.
We do expect to use MapRFS at some point, so data locality will be available to Drill once that happens. In the interim, we're trying to leverage Drill to pre-process large data sets. As an example, we're creating a view into a join across 4 large files (the largest of which is 20 GB). This join currently takes about 40 minutes on a single server using a local file system. By manually splitting the files, we gain some performance as the elapsed time drops down to ~30 minutes.
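(Concretely, the view looks something like the following; file, view, and column names are made up for illustration, and dfs.tmp is assumed writable:)

import requests

# Create a Drill view joining four large CSV directories. With plain text
# files Drill exposes each row as a `columns` array, hence columns[0] etc.
DRILL = "http://localhost:8047"

view_sql = """
CREATE OR REPLACE VIEW dfs.tmp.joined AS
SELECT a.columns[0] AS id, b.columns[1] AS v1,
       c.columns[1] AS v2, d.columns[1] AS v3
FROM dfs.`/mnt/shared/a` a
JOIN dfs.`/mnt/shared/b` b ON a.columns[0] = b.columns[0]
JOIN dfs.`/mnt/shared/c` c ON a.columns[0] = c.columns[0]
JOIN dfs.`/mnt/shared/d` d ON a.columns[0] = d.columns[0]
"""

requests.post(f"{DRILL}/query.json",
              json={"queryType": "SQL", "query": view_sql}).raise_for_status()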

The part where we get a little lost is in understanding the optimization process. Based on the query plan, it appears that the majority of the time is spent on the hash joins. Logically, it would make sense that if we split the files into smaller chunks we would gain increasing efficiency. However, this doesn't appear to be the case, as we're not really getting much improvement beyond the 30 minute range despite increasing parallelization by adding additional Drillbits and file partitions.
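(For reference, Drill's per-node fan-out is governed by planner.width.max_per_node; a sketch of checking and raising it over the REST API, host/port illustrative:)

import requests

# Raise Drill's per-node parallelism ceiling. planner.width.max_per_node
# caps minor fragments per Drillbit; raising it only helps if there are
# enough file splits and cores to keep the extra fragments busy.
DRILL = "http://localhost:8047"

def run_sql(sql):
    resp = requests.post(f"{DRILL}/query.json",
                         json={"queryType": "SQL", "query": sql})
    resp.raise_for_status()
    return resp.json()

run_sql("ALTER SYSTEM SET `planner.width.max_per_node` = 8")
print(run_sql("SELECT * FROM sys.options WHERE name LIKE 'planner.width%'"))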


May 21 2015 12:55 AM, "Ted Dunning" <te...@gmail.com> wrote: 
> Drill loses locality information on anything but an HDFS-oriented file
> system.  That might be part of what you are observing.  Having pre-split
> files should allow parallelism.
> 
> Can you describe your experiments in more detail?
> 
> Also, what specifically do you mean by CFS and GFS?  Ceph and Gluster?
> 
> It might help you if you check out the MapR community edition.  That would
> give you a more standard view of a shared file system since it allows
> distributed NFS service.  You also don't have to worry about the
> implications of having an object store under your file system as with
> Ceph.  Instead, the cluster (made up of any machines you have) would
> present as a *very* standard file system with the exception of locking.
> This would have the side effect of letting you experiment on the same data
> from both kinds of API (NFS and HDFS) to check for differences.
> 
> On Wed, May 20, 2015 at 1:25 PM, Yousef Lasi <yo...@gmail.com> wrote:
> 
>> It appears that we will be implementing Drill before our Hadoop
>> infrastructure is ready for production. A question that's come up related
>> to deploying Drill on clustered
>> Linux hosts (i.e. hosts with a shared file system but no HDFS) is whether
>> Drill parallelization can take advantage of multiple Drillbits in this
>> scenario.
>> 
>> Should we expect Drill to auto-split large CSV files and read/sort them
>> in parallel? That does not appear to happen in our testing. We've had to
>> manually partition large files into sets of files stored in a shared folder.
>> 
>> Is there any value to having multiple drill bits with access to the same
>> shared files in CFS/GFS?
>> 
>> Thanks

Re: Auto-splitting delimited files

Posted by Ted Dunning <te...@gmail.com>.
Drill loses locality information on anything but an HDFS-oriented file
system.  That might be part of what you are observing.  Having pre-split
files should allow parallelism.

Can you describe your experiments in more detail?

Also, what specifically do you mean by CFS and GFS?  Ceph and Gluster?

It might help you if you check out the MapR community edition.  That would
give you a more standard view of a shared file system since it allows
distributed NFS service.  You also don't have to worry about the
implications of having an object store under your file system as with
Ceph.  Instead, the cluster (made up of any machines you have) would
present as a *very* standard file system with the exception of locking.
This would have the side effect of letting you experiment on the same data
from both kinds of API (NFS and HDFS) to check for differences.
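(If you do point Drill at a shared mount, the file-system storage plugin can be registered over the same REST interface; a sketch with an illustrative mount point, workspace, and format config:)

import requests

# Register a file-system storage plugin for a shared NFS mount so its CSVs
# are queryable as e.g. shared.csvs.`part_000.csv`. All names and paths
# here are illustrative.
plugin = {
    "name": "shared",
    "config": {
        "type": "file",
        "enabled": True,
        "connection": "file:///mnt/shared",
        "workspaces": {
            "csvs": {"location": "/", "writable": False,
                     "defaultInputFormat": "csv"},
        },
        "formats": {
            "csv": {"type": "text", "extensions": ["csv"], "delimiter": ","},
        },
    },
}
resp = requests.post("http://localhost:8047/storage/shared.json", json=plugin)
resp.raise_for_status()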

On Wed, May 20, 2015 at 1:25 PM, Yousef Lasi <yo...@gmail.com> wrote:

> It appears that we will be implementing Drill before our Hadoop
> infrastructure is ready for production. A question that's come up related
> to deploying Drill on clustered
>  Linux hosts (i.e. hosts with a shared file system but no HDFS) is whether
> Drill parallelization can take advantage of multiple Drillbits in this
> scenario.
>
>  Should we expect Drill to auto-split large CSV files and read/sort them
> in parallel? That does not appear to happen in our testing. We've had to
> manually partition large files into sets of files stored in a shared folder.
>
>  Is there any value to having multiple Drillbits with access to the same
> shared files in CFS/GFS?
>
>  Thanks
>