Posted to user@drill.apache.org by Leandro Ordonez <le...@intec.ugent.be> on 2016/05/17 13:36:27 UTC

Performance tuning

Hello,

I've deployed an HDFS cluster and installed Apache Drill on top of it, 
but I've found that it takes quite a long time for Drill to run some 
queries on large JSON files, such as the full Reddit submission corpus 
(260GB). For instance, this query:

SELECT COUNT(*) FROM dfs.reddit.`RS_full_corpus.json`
WHERE selftext <> '' AND selftext <> '[deleted]';

took about one hour to run. The other thing I've noticed is that none of 
my queries get processed in a "fragmented" way: query execution is always 
handled entirely by the Drillbit acting as the Foreman.

In the attachment you can find the topology that I'm using. Any feedback 
on this would be greatly appreciated.

Thank you very much for your kind attention.

Best regards,

-- 
Leandro Ordonez-Ante
Department of Information Technology
Internet Based Communication Networks and Services (IBCN)
Ghent University - iMinds
Technologiepark Zwijnaarde 15, B-9052 Gent, Belgium
E: leandro.ordonez@intec.UGent.be, leandro.ordonezante@UGent.be
W: www.ibcn.intec.UGent.be

Re: Performance tuning

Posted by Leandro Ordonez <le...@intec.ugent.be>.

Hehe, so I hope Parquet will do the trick. Thank you, Tom! Great work on 
Saiku, by the way :-)

Re: Performance tuning

Posted by Tom Barber <to...@meteorite.bi>.

Boo, Christopher beat me to it. Leandro, I didn't mention to Merlijn that I was
using Parquet files :)

Re: Performance tuning

Posted by Leandro Ordonez <le...@intec.ugent.be>.

Thank you Jim,

The attachment was this image: https://i.imgsafe.org/7e98f92.png

So, is it expected for the query I mentioned earlier to take that long?

Re: Performance tuning

Posted by Leandro Ordonez <le...@intec.ugent.be>.

That's great, Chris! I'll try Parquet then. Thank you very much for 
your help!

Best,

Leandro


Re: Performance tuning

Posted by Christopher Matta <ch...@mapr.com>.

Leandro,
I ran into a similar situation while building this demo:
https://github.com/cjmatta/DrillPandasReddit/blob/master/Reddit%20Drill%20Pandas.ipynb

I don't think Drill splits a single JSON file the way it does delimited,
one-record-per-line files, so that would explain why you're seeing
single-threaded processing.

If you look at the demo, I ended up extracting the data I was interested in
by creating Parquet files with a CTAS statement. This could be helpful for
you because Parquet is significantly smaller than JSON (I've observed a 10x
storage savings) and can also be split by Drill.
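
A minimal sketch of that CTAS conversion, not the exact statements from the
demo. It assumes a writable dfs.tmp workspace, and the table name
reddit_selftext is hypothetical; only the selftext field is taken from this
thread.

```sql
-- Write CTAS output as Parquet (Drill's default store.format).
ALTER SESSION SET `store.format` = 'parquet';

-- Hypothetical target table; dfs.tmp is assumed to be a writable workspace.
CREATE TABLE dfs.tmp.`reddit_selftext` AS
SELECT t.selftext
FROM dfs.reddit.`RS_full_corpus.json` t;

-- Re-run the original count against the Parquet copy.
SELECT COUNT(*)
FROM dfs.tmp.`reddit_selftext`
WHERE selftext <> '' AND selftext <> '[deleted]';
```

The conversion still has to read the JSON file once, so any gain would show
up in the queries that follow rather than in the CTAS itself.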

--
Chris Matta
215-701-3146
chris@mapr.com

Re: Performance tuning

Posted by Jim Scott <js...@maprtech.com>.

The mailing lists do not support attachments. You can provide a link to a
git repo or something like that though.

You might want to alter your query to be something like select
count(FIELDX) from....
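
One possible reading of that suggestion, as a sketch only: it assumes FIELDX
stands for the selftext column from the original query, which the message
does not state.

```sql
-- COUNT(column) counts only the rows where that column is non-NULL.
-- Substituting selftext for FIELDX is an assumption.
SELECT COUNT(t.selftext)
FROM dfs.reddit.`RS_full_corpus.json` t
WHERE t.selftext <> '' AND t.selftext <> '[deleted]';
```

Because the WHERE clause already excludes rows whose selftext is NULL, this
returns the same number as COUNT(*) here; whether it runs faster would need
to be measured.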


-- 
*Jim Scott*
Director, Enterprise Strategy & Architecture
+1 (347) 746-9281
@kingmesal <https://twitter.com/kingmesal>
