Posted to user@drill.apache.org by Tridib Samanta <tr...@live.com> on 2014/10/29 22:24:40 UTC

Enable caching in Drill

Hello,
I am running a count query like the one below. I understand that it will take a long time on the first attempt, but I am not sure why it takes the same time on subsequent executions. Do I have to enable caching or something like that?
 
Thanks
Tridib

Re: Enable caching in Drill

Posted by Ted Dunning <te...@gmail.com>.

On Fri, Oct 31, 2014 at 11:48 PM, Tridib Samanta <tr...@live.com>
wrote:

> that json files can't be split between worker while reading and simple
> calculation can't be faster because of that.
>

Good point.  Only if the downstream is complicated will parallelism help.

RE: Enable caching in Drill

Posted by Tridib Samanta <tr...@live.com>.

I tested with files of 1 million and 12 million records. But what I understood from the earlier reply is that JSON files can't be split between workers while reading, and a simple calculation can't be made faster because of that.
 
> From: ted.dunning@gmail.com
> Date: Fri, 31 Oct 2014 15:53:12 -0700
> Subject: Re: Enable caching in Drill
> To: drill-user@incubator.apache.org
> 
> By default, Drill only splits input on fairly large chunks (100,000
> records, I think).
> 
> How many records in your input?
> 
> 
> 

Re: Enable caching in Drill

Posted by Ted Dunning <te...@gmail.com>.

By default, Drill only splits input on fairly large chunks (100,000
records, I think).

How many records in your input?



On Wed, Oct 29, 2014 at 9:46 PM, Tridib Samanta <tr...@live.com>
wrote:

> select count(*) from myhdfs.json.`x00.json`;
>
> Surprising thing is, I get same performance when I use 1 drillbit compare
> to 4 drillbits.
>

RE: Enable caching in Drill

Posted by Jacques Nadeau <ja...@apache.org>.

Line breaks are allowed in the JSON standard even if your file doesn't have
them.
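
To illustrate the point, here is a minimal Python sketch (not Drill code) showing why a reader cannot assume that every newline is a record boundary. The sample document and the helper name parse_line_by_line are made up for illustration.

```python
import json

# A single JSON document pretty-printed over several lines (hypothetical sample data).
pretty = '{\n  "id": 1,\n  "tags": ["a", "b"]\n}'

# Parsing the whole document works, because whitespace (including line
# breaks) is allowed between JSON tokens.
doc = json.loads(pretty)


def parse_line_by_line(text):
    """Naively treat every line as a complete document.

    Returns the documents that parsed and the number of lines that failed.
    """
    parsed, failed = [], 0
    for line in text.splitlines():
        try:
            parsed.append(json.loads(line))
        except json.JSONDecodeError:
            failed += 1
    return parsed, failed


parsed, failed = parse_line_by_line(pretty)
print(doc, parsed, failed)
```

Here no individual line is a valid document on its own, so a reader that cuts the input at an arbitrary newline has no guarantee of landing on a record boundary.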
On Oct 29, 2014 10:07 PM, "Tridib Samanta" <tr...@live.com> wrote:

> Hmm...
> Each line of my file has one json document. The file is in HDFS. Though I
> am not familiar with Drill's code, but looks like it reads each line as
> separate record and process them. Not sure why it can't split the files
> with 1 million json records. Anyway it treats each line as independent
> record.
>

RE: Enable caching in Drill

Posted by Tridib Samanta <tr...@live.com>.

Hmm... 
Each line of my file holds one JSON document. The file is in HDFS. Although I am not familiar with Drill's code, it looks like it reads each line as a separate record and processes it. I am not sure why it can't split a file with 1 million JSON records, since it treats each line as an independent record anyway.
 
> Date: Wed, 29 Oct 2014 21:54:24 -0700
> Subject: RE: Enable caching in Drill
> From: jacques@apache.org
> To: drill-user@incubator.apache.org
> 
> Drill doesn't currently cache data and relies on the underlying file system
> cache.
> 
> Also,  json is not splittable so adding nodes with a single json file will
> generally have little impact.

RE: Enable caching in Drill

Posted by Jacques Nadeau <ja...@apache.org>.

Drill doesn't currently cache data and relies on the underlying file system
cache.

Also, JSON is not splittable, so adding nodes with a single JSON file will
generally have little impact.
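
One possible workaround, sketched below in Python, is to split the file into several smaller files before querying. This assumes the input really has one JSON document per line, and it assumes that several files can be read by several readers; that second assumption is not confirmed in this thread. The function name split_ndjson and the part-NNNNN.json naming are hypothetical.

```python
import os
import tempfile


def split_ndjson(src_path, out_dir, parts):
    """Distribute the lines of a one-document-per-line file round-robin
    across `parts` output files. Returns the list of output paths."""
    os.makedirs(out_dir, exist_ok=True)
    paths = [os.path.join(out_dir, f"part-{i:05d}.json") for i in range(parts)]
    outs = [open(p, "w", encoding="utf-8") for p in paths]
    try:
        with open(src_path, encoding="utf-8") as src:
            for n, line in enumerate(src):
                if line.strip():  # skip blank lines
                    if not line.endswith("\n"):
                        line += "\n"
                    outs[n % parts].write(line)
    finally:
        for f in outs:
            f.close()
    return paths


# Small demonstration on a temporary 10-record file (made-up data).
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "x00.json")
    with open(src, "w", encoding="utf-8") as f:
        for i in range(10):
            f.write('{"id": %d}\n' % i)

    part_paths = split_ndjson(src, os.path.join(tmp, "x00_parts"), 4)

    counts = []
    for p in part_paths:
        with open(p, encoding="utf-8") as f:
            counts.append(sum(1 for _ in f))

print(counts)
```

The split only works on line boundaries, so it is not safe for JSON documents that span multiple lines.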
On Oct 29, 2014 9:48 PM, "Tridib Samanta" <tr...@live.com> wrote:

> select count(*) from myhdfs.json.`x00.json`;
>
> Surprising thing is, I get same performance when I use 1 drillbit compare
> to 4 drillbits.
>

RE: Enable caching in Drill

Posted by Tridib Samanta <tr...@live.com>.

select count(*) from myhdfs.json.`x00.json`;
 
The surprising thing is that I get the same performance with 1 drillbit as with 4 drillbits.
 
> Date: Thu, 30 Oct 2014 10:08:04 +0530
> Subject: Re: Enable caching in Drill
> From: mufeed.usman@gmail.com
> To: drill-user@incubator.apache.org
> 
> The query didn't get through :-).
> 
> 
> ---
> Mufeed Usman
> My LinkedIn <http://www.linkedin.com/pub/mufeed-usman/28/254/400> | My
> Social Cause <http://www.vision2016.org.in/> | My Blogs : LiveJournal
> <http://mufeed.livejournal.com>
> 
> 
> 
> 

Re: Enable caching in Drill

Posted by mufy <mu...@gmail.com>.

The query didn't get through :-).


---
Mufeed Usman
My LinkedIn <http://www.linkedin.com/pub/mufeed-usman/28/254/400> | My
Social Cause <http://www.vision2016.org.in/> | My Blogs : LiveJournal
<http://mufeed.livejournal.com>




On Thu, Oct 30, 2014 at 2:54 AM, Tridib Samanta <tr...@live.com>
wrote:

> Hello,
> I am doing a count query like bellow. I understand that it will take long
> time at first attempt. But not sure why it takes same time in subsequent
> execution. Will I have to enable caching or something like that?
>
> Thanks
> Tridib
>