Posted to user@drill.apache.org by yousuf <yo...@css.org.sa> on 2016/12/15 11:27:51 UTC

Aggregation performance

Hello experts

As a POC project, I've built a Drill cluster on 5 VMs, each with the 
following specs:

32 GB ram

1 TB storage

16 Cores

A ZooKeeper quorum & Apache Drill are installed on all 5 nodes. My storage 
engine is Mongo, which has 5 million docs. (Our daily collection is close 
to 2.5 million tweets.)

*Problem*

My aggregation queries are too slow for my realtime dashboard. The 
sample query below took 12+ secs:


SELECT count(*) as cnt, actor_preferred_username from tweets where 
posted_time >= '2016-08-01T00.00.00.000Z' and posted_time 
<='2016-08-10T00.00.00.000Z' group by actor_preferred_username   order 
by cnt desc limit 10;
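For reference, the aggregation this query expresses can be sketched in plain Python over a few in-memory documents (illustrative only; the document list and usernames below are made up, not taken from the real collection):

```python
from collections import Counter

# Illustrative sketch of what the Drill query computes, run over a few
# made-up in-memory documents (field names taken from the sample data).
docs = [
    {"posted_time": "2016-08-01T23:47:43.000Z", "actor_preferred_username": "user_a"},
    {"posted_time": "2016-08-02T10:15:00.000Z", "actor_preferred_username": "user_a"},
    {"posted_time": "2016-08-03T08:00:00.000Z", "actor_preferred_username": "user_b"},
    {"posted_time": "2016-09-01T12:00:00.000Z", "actor_preferred_username": "user_c"},  # outside range
]

# Bounds in the same ISO format as the stored data; note that the query
# literals above use '.' instead of ':' in the time-of-day part.
lo = "2016-08-01T00:00:00.000Z"
hi = "2016-08-10T00:00:00.000Z"

# WHERE posted_time >= lo AND posted_time <= hi, GROUP BY username, COUNT(*)
counts = Counter(
    d["actor_preferred_username"]
    for d in docs
    if lo <= d["posted_time"] <= hi
)

# ORDER BY cnt DESC LIMIT 10
top10 = counts.most_common(10)
print(top10)  # [('user_a', 2), ('user_b', 1)]
```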

Please note:

*In Drill*

1) I've changed conf/drill-env.sh on each node with the following:

export DRILL_HEAP=${DRILL_HEAP:-"8G"}
export DRILL_MAX_DIRECT_MEMORY=${DRILL_MAX_DIRECT_MEMORY:-"20G"}

2) changed a few settings based on the docs 
<https://drill.apache.org/docs/sort-based-and-hash-based-memory-constrained-operators/>

*In Mongo*

Data is sharded across 3 servers on shard key tweet_id, which distributes 
data evenly across all shards, and a compound index was created on the 
fields used by the above Drill query.
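For illustration, the compound index described above could be expressed as a pymongo-style key specification. The field order below is an assumption based on the query's filter and group-by columns; the actual index definition is not shown in this thread:

```python
# Hypothetical key spec for the compound index described above; the real
# field order is not given in the thread, so this is an assumption:
# the range-filtered field first, then the grouping field.
index_keys = [("posted_time", 1), ("actor_preferred_username", 1)]

# With a live pymongo connection this would be applied roughly as:
#   from pymongo import MongoClient
#   db = MongoClient()["tweets_db"]   # database name is hypothetical
#   db.tweets.create_index(index_keys)
print(index_keys)
```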


sample data

{
     "_id" : ObjectId("58524d507e08dae4c0377b9e"),
     "rule_list" : [
         "A",
         "B",
         "C",
         "D13"
     ],
     "actor_friends_count" : 40,
     "klout_score" : 28,
     "actor_favorites_count" : 1697,
     "actor_preferred_username" : "_20_xxxx",
     "sentiment" : "neu",
     "tweet_id" : "tag:search.twitter.com,2005:000xxxxx",
     "object_actor_followers_count" : 573,
     "actor_posted_time" : "2016-06-24T00:37:54.000Z",
     "actor_id" : "id:twitter.com:xxxxxxxxxx",
     "actor_display_name" : "xxxxxx",
     "retweet_count" : 57,
     "hashtag_list" : [
         "c",
         "d",
         "E",
         "f",
         "VCVC",
         "XXCXCXC",
         "RRRR"
     ],
     "body" : "some tweet blah blah",
     "actor_followers_count" : 21,
     "actor_status_count" : 1756,
     "verb" : "share",
     "posted_time" : "2016-08-01T23:47:43.000Z",
     "object_actor_status_count" : 2796,
     "lang" : "ar",
     "object_actor_preferred_username" : "xxxxxxx",
     "original_tweet_id" : "tag:search.twitter.com,2005:xxxxxxxxxxxx",
     "gender" : "male",
     "object_actor_id" : "id:twitter.com:xxxxxxxxxxxxx",
     "favorites_count" : 0,
     "object_posted_time" : "2016-08-01T22:54:22.000Z",
     "object_actor_friends_count" : 69,
     "generator_display_name" : "Twitter Web Client",
     "object_actor_display_name" : "xxxxxxxxxxxx",
     "actor_listed_count" : 0
}



*Questions*

1) How can I improve aggregation query performance?

2) Do I also need to do something in MongoDB to enhance performance (i.e., 
optimize the source)?

3) Is Apache Drill capable of handling/aggregating billions of 
documents in real time?

Your early response is highly appreciated!

Thank you & Kind Regards

Yousuf




Re: Aggregation performance

Posted by Dechang Gu <dg...@maprtech.com>.
Hi Yousuf,
Thanks for the update and the profile. From the profile, it looks like most of
the time was spent in the following operator:

05-xx-03 UNKNOWN_OPERATOR 0.000s 0.000s 0.000s *1.350s* *4.903s* *7.817s* 0.000s 0.000s 0.000s 280KB 280KB

which is mainly the mongoScan. Also, the min (1.350s) and max (7.817s)
processing times here are quite different, which may indicate some
skew in the data distribution among the nodes. Take a look at the minor
fragment stats to find out how the scanned data is distributed.

Another thing to look at is filter pushdown in the mongoScan, shown in the
physical plan. As an example, the physical plan from Kathleen's run on your
sample data is attached. Please compare it with yours (and share yours with
me if you want me to take a look).
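If the filter is pushed down, the mongoScan's find condition should correspond to a MongoDB filter document roughly like the following. This is only a sketch of the expected shape, built from the query's WHERE clause; confirm the actual condition in the physical plan:

```python
# Sketch of the MongoDB filter that a pushed-down WHERE clause would
# correspond to; the bounds are copied verbatim from the query literals.
pushed_filter = {
    "posted_time": {
        "$gte": "2016-08-01T00.00.00.000Z",
        "$lte": "2016-08-10T00.00.00.000Z",
    }
}
print(pushed_filter)
```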


Thanks,
Dechang

On Sat, Dec 17, 2016 at 11:53 PM, yousuf <yo...@css.org.sa> wrote:

> Hi Dechang,
>
> Thanks for your reply,
>
> Drill version: 1.8.0
>
> Mongo: 3.2.11
>
>
> *Here is the query profile. Kindly also check my system options for
> possibly incorrect settings in the attachment. *
>
> *Operator ID* *Type* *Min Setup Time* *Avg Setup Time* *Max Setup Time* *Min
> Process Time* *Avg Process Time* *Max Process Time* *Min Wait Time* *Avg
> Wait Time* *Max Wait Time* *Avg Peak Memory* *Max Peak Memory*
> 00-xx-00 SCREEN 0.000s 0.000s 0.000s 0.000s 0.000s 0.000s 0.001s 0.001s
> 0.001s 52KB 52KB
> 00-xx-01 PROJECT 0.000s 0.000s 0.000s 0.000s 0.000s 0.000s 0.000s 0.000s
> 0.000s 52KB 52KB
> 00-xx-02 SELECTION_VECTOR_REMOVER 0.001s 0.001s 0.001s 0.000s 0.000s
> 0.000s 0.000s 0.000s 0.000s 52KB 52KB
> 00-xx-03 LIMIT 0.000s 0.000s 0.000s 0.000s 0.000s 0.000s 0.000s 0.000s
> 0.000s 308KB 308KB
> 00-xx-04 MERGING_RECEIVER 0.000s 0.000s 0.000s 0.003s 0.003s 0.003s 7.829s
> 7.829s 7.829s 308KB 308KB
> 01-xx-00 SINGLE_SENDER 0.000s 0.000s 0.000s 0.000s 0.000s 0.000s 0.001s
> 0.001s 0.002s 52KB 52KB
> 01-xx-01 SELECTION_VECTOR_REMOVER 0.001s 0.001s 0.001s 0.000s 0.000s
> 0.000s 0.000s 0.000s 0.000s 84KB 84KB
> 01-xx-02 TOP_N_SORT 0.000s 0.000s 0.000s 0.004s 0.016s 0.050s 0.000s
> 0.000s 0.000s 5MB 18MB
> 01-xx-03 PROJECT 0.000s 0.000s 0.001s 0.000s 0.001s 0.005s 0.000s 0.000s
> 0.000s 32KB 32KB
> 01-xx-04 UNORDERED_RECEIVER 0.000s 0.000s 0.000s 0.001s 0.007s 0.026s
> 7.759s 7.819s 7.841s - -
> 02-xx-00 HASH_PARTITION_SENDER 0.000s 0.000s 0.000s 0.073s 0.081s 0.090s
> 0.039s 0.044s 0.051s 373KB 379KB
> 02-xx-01 UNORDERED_RECEIVER 0.000s 0.000s 0.000s 0.001s 0.001s 0.001s
> 7.748s 7.755s 7.761s - -
> 03-xx-00 SINGLE_SENDER 0.000s 0.000s 0.000s 0.000s 0.000s 0.000s 0.001s
> 0.004s 0.008s 5MB 5MB
> 03-xx-01 PROJECT 0.001s 0.002s 0.005s 0.000s 0.000s 0.001s 0.000s 0.000s
> 0.000s 5MB 5MB
> 03-xx-02 PROJECT 0.000s 0.000s 0.002s 0.000s 0.000s 0.000s 0.000s 0.000s
> 0.000s 4MB 4MB
> 03-xx-03 HASH_AGGREGATE 0.004s 0.009s 0.024s 0.054s 0.065s 0.077s 0.000s
> 0.000s 0.000s 6MB 6MB
> 03-xx-04 PROJECT 0.000s 0.001s 0.006s 0.001s 0.002s 0.002s 0.000s 0.000s
> 0.000s 32KB 32KB
> 03-xx-05 UNORDERED_RECEIVER 0.000s 0.000s 0.000s 0.007s 0.009s 0.018s
> 7.646s 7.670s 7.695s - -
> 04-xx-00 HASH_PARTITION_SENDER 0.000s 0.000s 0.000s 0.435s 0.488s 0.659s
> 0.088s 0.147s 0.216s 2MB 2MB
> 04-xx-01 UNORDERED_RECEIVER 0.000s 0.000s 0.000s 0.021s 0.025s 0.028s
> 6.014s 6.371s 6.854s - -
> 05-xx-00 SINGLE_SENDER 0.000s 0.000s 0.000s 0.000s 0.002s 0.004s 0.001s
> 0.004s 0.050s 116KB 116KB
> 05-xx-01 PROJECT 0.001s 0.002s 0.010s 0.001s 0.007s 0.012s 0.000s 0.000s
> 0.000s 116KB 116KB
> 05-xx-02 PROJECT 0.000s 0.001s 0.004s 0.000s 0.001s 0.004s 0.000s 0.000s
> 0.000s 100KB 100KB
> 05-xx-03 UNKNOWN_OPERATOR 0.000s 0.000s 0.000s 1.350s 4.903s 7.817s 0.000s
> 0.000s 0.000s 280KB 280KB
>
> Thanks & regards
>
> On 12/16/2016 01:27 AM, Dechang Gu wrote:
>
> Yousuf,
> Which version of drill are you running?
> Can you share the profile of the query?
>
> Thanks,
> Dechang

Re: Aggregation performance

Posted by yousuf <yo...@css.org.sa>.
Hi Dechang,

Thanks for your reply,

Drill version: 1.8.0

Mongo: 3.2.11

*Here is the query profile. Kindly also check my system options for 
possibly incorrect settings in the attachment.*

*Operator ID* *Type* *Min Setup Time* *Avg Setup Time* *Max Setup Time* *Min Process Time* *Avg Process Time* *Max Process Time* *Min Wait Time* *Avg Wait Time* *Max Wait Time* *Avg Peak Memory* *Max Peak Memory*
00-xx-00 SCREEN 0.000s 0.000s 0.000s 0.000s 0.000s 0.000s 0.001s 0.001s 0.001s 52KB 52KB
00-xx-01 PROJECT 0.000s 0.000s 0.000s 0.000s 0.000s 0.000s 0.000s 0.000s 0.000s 52KB 52KB
00-xx-02 SELECTION_VECTOR_REMOVER 0.001s 0.001s 0.001s 0.000s 0.000s 0.000s 0.000s 0.000s 0.000s 52KB 52KB
00-xx-03 LIMIT 0.000s 0.000s 0.000s 0.000s 0.000s 0.000s 0.000s 0.000s 0.000s 308KB 308KB
00-xx-04 MERGING_RECEIVER 0.000s 0.000s 0.000s 0.003s 0.003s 0.003s 7.829s 7.829s 7.829s 308KB 308KB
01-xx-00 SINGLE_SENDER 0.000s 0.000s 0.000s 0.000s 0.000s 0.000s 0.001s 0.001s 0.002s 52KB 52KB
01-xx-01 SELECTION_VECTOR_REMOVER 0.001s 0.001s 0.001s 0.000s 0.000s 0.000s 0.000s 0.000s 0.000s 84KB 84KB
01-xx-02 TOP_N_SORT 0.000s 0.000s 0.000s 0.004s 0.016s 0.050s 0.000s 0.000s 0.000s 5MB 18MB
01-xx-03 PROJECT 0.000s 0.000s 0.001s 0.000s 0.001s 0.005s 0.000s 0.000s 0.000s 32KB 32KB
01-xx-04 UNORDERED_RECEIVER 0.000s 0.000s 0.000s 0.001s 0.007s 0.026s 7.759s 7.819s 7.841s - -
02-xx-00 HASH_PARTITION_SENDER 0.000s 0.000s 0.000s 0.073s 0.081s 0.090s 0.039s 0.044s 0.051s 373KB 379KB
02-xx-01 UNORDERED_RECEIVER 0.000s 0.000s 0.000s 0.001s 0.001s 0.001s 7.748s 7.755s 7.761s - -
03-xx-00 SINGLE_SENDER 0.000s 0.000s 0.000s 0.000s 0.000s 0.000s 0.001s 0.004s 0.008s 5MB 5MB
03-xx-01 PROJECT 0.001s 0.002s 0.005s 0.000s 0.000s 0.001s 0.000s 0.000s 0.000s 5MB 5MB
03-xx-02 PROJECT 0.000s 0.000s 0.002s 0.000s 0.000s 0.000s 0.000s 0.000s 0.000s 4MB 4MB
03-xx-03 HASH_AGGREGATE 0.004s 0.009s 0.024s 0.054s 0.065s 0.077s 0.000s 0.000s 0.000s 6MB 6MB
03-xx-04 PROJECT 0.000s 0.001s 0.006s 0.001s 0.002s 0.002s 0.000s 0.000s 0.000s 32KB 32KB
03-xx-05 UNORDERED_RECEIVER 0.000s 0.000s 0.000s 0.007s 0.009s 0.018s 7.646s 7.670s 7.695s - -
04-xx-00 HASH_PARTITION_SENDER 0.000s 0.000s 0.000s 0.435s 0.488s 0.659s 0.088s 0.147s 0.216s 2MB 2MB
04-xx-01 UNORDERED_RECEIVER 0.000s 0.000s 0.000s 0.021s 0.025s 0.028s 6.014s 6.371s 6.854s - -
05-xx-00 SINGLE_SENDER 0.000s 0.000s 0.000s 0.000s 0.002s 0.004s 0.001s 0.004s 0.050s 116KB 116KB
05-xx-01 PROJECT 0.001s 0.002s 0.010s 0.001s 0.007s 0.012s 0.000s 0.000s 0.000s 116KB 116KB
05-xx-02 PROJECT 0.000s 0.001s 0.004s 0.000s 0.001s 0.004s 0.000s 0.000s 0.000s 100KB 100KB
05-xx-03 UNKNOWN_OPERATOR 0.000s 0.000s 0.000s 1.350s 4.903s 7.817s 0.000s 0.000s 0.000s 280KB 280KB

Thanks & regards
On 12/16/2016 01:27 AM, Dechang Gu wrote:
> Yousuf,
> Which version of drill are you running?
> Can you share the profile of the query?
>
> Thanks,
> Dechang


Re: Aggregation performance

Posted by Kathleen Li <kl...@maprtech.com>.
Hi Yousuf,


Yes, in my env I had set store.mongo.bson.record.reader = true.
With the one record you provided, the same query works fine for me; the error you got is a schema-change-related error:
0: jdbc:drill:zk=drill1:5181,drill2:5181,dril> SELECT hashtag, count(*) as cnt from (select
. . . . . . . . . . . . . . . . . . . . . . .> flatten(hashtag_list) as hashtag from test) group by hashtag order
. . . . . . . . . . . . . . . . . . . . . . .> by cnt desc  limit 10;
+----------+------+
| hashtag  | cnt  |
+----------+------+
| RRRR     | 1    |
| VCVC     | 1    |
| XXCXCXC  | 1    |
| c        | 1    |
| d        | 1    |
| f        | 1    |
| E        | 1    |


With that, you might try: alter session set store.mongo.all_text_mode = true, then rerun the query to see if you can get past the error.


Kathleen







On 12/18/16, 1:21 AM, "yousuf" <yo...@css.org.sa> wrote:

>Hi Kathleen,
>
>Thanks for responding...
>
>I've noticed that with alter session set store.mongo.bson.record.reader = 
>true; the performance improves. However, the other queries are 
>failing :(.
>
>
>0: jdbc:drill:> alter session set store.mongo.bson.record.reader = true;
>+-------+------------------------------------------+
>|  ok   |                 summary                  |
>+-------+------------------------------------------+
>| true  | store.mongo.bson.record.reader updated.  |
>+-------+------------------------------------------+
>1 row selected (0.082 seconds)
>0: jdbc:drill:> SELECT count(*) as cnt, actor_preferred_username from 
>tweets group by actor_preferred_username order by cnt desc limit 10;
>+--------+---------------------------+
>|  cnt   | actor_preferred_username  |
>+--------+---------------------------+
>| 10770  | mrnota53                  |
>| 6239   | cyberahsokatano           |
>| 4609   | abeerlilak                |
>| 4562   | DaeshAjel3                |
>| 4523   | Aster__Q                  |
>| 4275   | DaeshCrimes               |
>| 4182   | AfwfwefOfwefho            |
>| 3980   | HewarMaftuh2              |
>| 3698   | DaeshAjel                 |
>| 3661   | sarosh_iq                 |
>+--------+---------------------------+
>10 rows selected (4.64 seconds)
>
>
>*Failed queries when store.mongo.bson.record.reader = true:*
>
>0: jdbc:drill:> SELECT hashtag, count(*) as cnt from (select 
>flatten(hashtag_list) as hashtag from hashtags) group by hashtag order 
>by cnt desc  limit 10;
>Error: SYSTEM ERROR: IllegalArgumentException: You tried to write a 
>VarChar type when you are using a ValueWriter of type UnionListWriter.
>
>Fragment 4:1
>
>[Error Id: 278752e2-a959-482c-b4b0-b79ba923f148 on test01.css.org:31010]
>
>   (java.lang.IllegalArgumentException) You tried to write a VarChar 
>type when you are using a ValueWriter of type UnionListWriter.
>org.apache.drill.exec.vector.complex.impl.AbstractFieldWriter.fail():762
>org.apache.drill.exec.vector.complex.impl.AbstractFieldWriter.write():325
>org.apache.drill.exec.vector.complex.impl.UnionListWriter.write():91
>org.apache.drill.exec.store.bson.BsonRecordReader.writeString():275
>org.apache.drill.exec.store.bson.BsonRecordReader.writeToListOrMap():167
>org.apache.drill.exec.store.bson.BsonRecordReader.writeToListOrMap():112
>     org.apache.drill.exec.store.bson.BsonRecordReader.write():75
>org.apache.drill.exec.store.mongo.MongoRecordReader.next():186
>     org.apache.drill.exec.physical.impl.ScanBatch.next():178
>     org.apache.drill.exec.record.AbstractRecordBatch.next():119
>     org.apache.drill.exec.record.AbstractRecordBatch.next():109
>org.apache.drill.exec.record.AbstractSingleRecordBatch.innerNext():51
>org.apache.drill.exec.physical.impl.project.ProjectRecordBatch.innerNext():135
>     org.apache.drill.exec.record.AbstractRecordBatch.next():162
>     org.apache.drill.exec.record.AbstractRecordBatch.next():119
>     org.apache.drill.exec.record.AbstractRecordBatch.next():109
>org.apache.drill.exec.record.AbstractSingleRecordBatch.innerNext():51
>org.apache.drill.exec.physical.impl.flatten.FlattenRecordBatch.innerNext():120
>     org.apache.drill.exec.record.AbstractRecordBatch.next():162
>     org.apache.drill.exec.record.AbstractRecordBatch.next():119
>     org.apache.drill.exec.record.AbstractRecordBatch.next():109
>org.apache.drill.exec.physical.impl.aggregate.HashAggBatch.buildSchema():97
>     org.apache.drill.exec.record.AbstractRecordBatch.next():142
>     org.apache.drill.exec.record.AbstractRecordBatch.next():119
>     org.apache.drill.exec.record.AbstractRecordBatch.next():109
>org.apache.drill.exec.record.AbstractSingleRecordBatch.innerNext():51
>org.apache.drill.exec.physical.impl.project.ProjectRecordBatch.innerNext():135
>     org.apache.drill.exec.record.AbstractRecordBatch.next():162
>     org.apache.drill.exec.physical.impl.BaseRootExec.next():104
>org.apache.drill.exec.physical.impl.SingleSenderCreator$SingleSenderRootExec.innerNext():92
>     org.apache.drill.exec.physical.impl.BaseRootExec.next():94
>org.apache.drill.exec.work.fragment.FragmentExecutor$1.run():232
>org.apache.drill.exec.work.fragment.FragmentExecutor$1.run():226
>     java.security.AccessController.doPrivileged():-2
>     javax.security.auth.Subject.doAs():422
>     org.apache.hadoop.security.UserGroupInformation.doAs():1657
>org.apache.drill.exec.work.fragment.FragmentExecutor.run():226
>     org.apache.drill.common.SelfCleaningRunnable.run():38
>     java.util.concurrent.ThreadPoolExecutor.runWorker():1142
>     java.util.concurrent.ThreadPoolExecutor$Worker.run():617
>     java.lang.Thread.run():745 (state=,code=0)
>
>
>
>
>On 12/16/2016 01:55 AM, Kathleen Li wrote:
>> In my env, first time took about 1.6s, second time only took 0.5s
>> 0: jdbc:drill:zk=drill1:5181,drill2:5181,dril> SELECT count(*) as cnt, actor_preferred_username from test where
>> . . . . . . . . . . . . . . . . . . . . . . .> posted_time >= '2016-08-01T00.00.00.000Z' and posted_time
>> . . . . . . . . . . . . . . . . . . . . . . .> <='2016-08-10T00.00.00.000Z' group by actor_preferred_username   order by
>> . . . . . . . . . . . . . . . . . . . . . . .> cnt desc limit 10;
>> +------+---------------------------+
>> | cnt  | actor_preferred_username  |
>> +------+---------------------------+
>> | 1    | _20_xxxx                  |
>> +------+---------------------------+
>> 1 row selected (1.585 seconds)
>> 0: jdbc:drill:zk=drill1:5181,drill2:5181,dril> SELECT count(*) as cnt, actor_preferred_username from test where
>> . . . . . . . . . . . . . . . . . . . . . . .> posted_time >= '2016-08-01T00.00.00.000Z' and posted_time
>> . . . . . . . . . . . . . . . . . . . . . . .> <='2016-08-10T00.00.00.000Z' group by actor_preferred_username   order by
>> . . . . . . . . . . . . . . . . . . . . . . .> cnt desc limit 10;
>> +------+---------------------------+
>> | cnt  | actor_preferred_username  |
>> +------+---------------------------+
>> | 1    | _20_xxxx                  |
>> +------+---------------------------+
>> 1 row selected (0.505 seconds)
>>
>>
>>
>>
>>
>>
>> I am running 4 VM servers, heap 4GB and direct 8GB. But this query is only using one fragment; see the attached Drill profile.
>>
>> As Dechang suggested, you can check the profile and see which step the time is mainly spent on.
>>
>> Kathleen
>>
>>
>> On 12/15/16, 2:27 PM, "Dechang Gu" <dg...@maprtech.com> wrote:
>>
>>> Yousuf,
>>> Which version of drill are you running?
>>> Can you share the profile of the query?
>>>
>>> Thanks,
>>> Dechang


Re: Aggregation performance

Posted by yousuf <yo...@css.org.sa>.
Hi Kathleen

A bug has been filed on JIRA <https://issues.apache.org/jira/browse/DRILL-5139>

Thanks & Regards
Yousuf


On 12/19/2016 10:03 PM, Kathleen Li wrote:
> It seems that in Drill 1.8 the parameter store.mongo.all_text_mode is true by default already.
>
>   Try ALTER SESSION SET `exec.enable_union_type` = true; if you still get the errors, you might open a public JIRA with the detailed information.
>
>
> You might also try using CTAS to create Drill tables in Parquet format, then run your queries against the Drill Parquet tables to see if the performance gets better.
>
>
> Thanks,
>
> Kathleen
>>>      java.security.AccessController.doPrivileged():-2
>>>      javax.security.auth.Subject.doAs():422
>>>      org.apache.hadoop.security.UserGroupInformation.doAs():1657
>>> org.apache.drill.exec.work.fragment.FragmentExecutor.run():226
>>>      org.apache.drill.common.SelfCleaningRunnable.run():38
>>>      java.util.concurrent.ThreadPoolExecutor.runWorker():1142
>>>      java.util.concurrent.ThreadPoolExecutor$Worker.run():617
>>>      java.lang.Thread.run():745 (state=,code=0)
>>>
>>> *
>>>
>>>
>>> On 12/16/2016 01:55 AM, Kathleen Li wrote:
>>>> In my env, first time took about 1.6s, second time only took 0.5s
>>>> 0: jdbc:drill:zk=drill1:5181,drill2:5181,dril> SELECT count(*) as cnt, actor_preferred_username from test where
>>>> . . . . . . . . . . . . . . . . . . . . . . .> posted_time >= '2016-08-01T00.00.00.000Z' and posted_time
>>>> . . . . . . . . . . . . . . . . . . . . . . .> <='2016-08-10T00.00.00.000Z' group by actor_preferred_username   order by
>>>> . . . . . . . . . . . . . . . . . . . . . . .> cnt desc limit 10;
>>>> +------+---------------------------+
>>>> | cnt  | actor_preferred_username  |
>>>> +------+---------------------------+
>>>> | 1    | _20_xxxx                  |
>>>> +------+---------------------------+
>>>> 1 row selected (1.585 seconds)
>>>> 0: jdbc:drill:zk=drill1:5181,drill2:5181,dril> SELECT count(*) as cnt, actor_preferred_username from test where
>>>> . . . . . . . . . . . . . . . . . . . . . . .> posted_time >= '2016-08-01T00.00.00.000Z' and posted_time
>>>> . . . . . . . . . . . . . . . . . . . . . . .> <='2016-08-10T00.00.00.000Z' group by actor_preferred_username   order by
>>>> . . . . . . . . . . . . . . . . . . . . . . .> cnt desc limit 10;
>>>> +------+---------------------------+
>>>> | cnt  | actor_preferred_username  |
>>>> +------+---------------------------+
>>>> | 1    | _20_xxxx                  |
>>>> +------+---------------------------+
>>>> 1 row selected (0.505 seconds)
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> I am running 4 VM servers, with 4 GB heap and 8 GB direct memory. But this query is only using one fragment; see the attached drill profile.
>>>>
>>>> As Dechang suggested, you can check the profile and see which step most of the time is spent on.
>>>>
>>>> Kathleen
>>>>
>>>>
>>>> On 12/15/16, 2:27 PM, "Dechang Gu" <dg...@maprtech.com> wrote:
>>>>
>>>>> Yousuf,
>>>>> Which version of drill are you running?
>>>>> Can you share the profile of the query?
>>>>>
>>>>> Thanks,
>>>>> Dechang
>>>>>
>>>>>
>>>>>
>>>>> On Thu, Dec 15, 2016 at 3:27 AM, yousuf <yo...@css.org.sa> wrote:
>>>>>
>>>>>> Hello experts
>>>>>>
>>>>>> As a POC project, I've built a drill cluster on 5 VMs , each with the
>>>>>> following specs
>>>>>>
>>>>>> 32 GB ram
>>>>>>
>>>>>> 1 TB storage
>>>>>>
>>>>>> 16 Cores
>>>>>>
>>>>>> A ZooKeeper quorum and Apache Drill are installed on all 5 nodes. My storage
>>>>>> engine is Mongo, which has 5 million docs. (Our daily collection is close to
>>>>>> 2.5 million tweets.)
>>>>>>
>>>>>> *Problem*
>>>>>>
>>>>>> My aggregation queries are too slow for my real-time dashboard. A sample
>>>>>> query such as the following took 12+ seconds:
>>>>>>
>>>>>>
>>>>>> SELECT count(*) as cnt, actor_preferred_username from tweets where
>>>>>> posted_time >= '2016-08-01T00.00.00.000Z' and posted_time
>>>>>> <='2016-08-10T00.00.00.000Z' group by actor_preferred_username   order by
>>>>>> cnt desc limit 10;
>>>>>>
>>>>>> Pls. note that,
>>>>>>
>>>>>> *In Drill*
>>>>>>
>>>>>> 1) I've changed the conf/drill-env.sh on each node with the following:
>>>>>>
>>>>>> |export DRILL_HEAP=${DRILL_HEAP:-"8G"} export
>>>>>> DRILL_MAX_DIRECT_MEMORY=${DRILL_MAX_DIRECT_MEMORY:-"20G"}|
>>>>>>
>>>>>> 2) changed a few settings based on the docs <https://drill.apache.org/docs
>>>>>> /sort-based-and-hash-based-memory-constrained-operators/>
>>>>>>
>>>>>> *In Mongo*
>>>>>>
>>>>>> Data is sharded across 3 servers on the shard key tweet_id, which distributes
>>>>>> data evenly across all shards, and I created a compound index on the fields
>>>>>> used by the above Drill query.
>>>>>>
>>>>>>
>>>>>> sample data
>>>>>>
>>>>>> {
>>>>>>       "_id" : ObjectId("58524d507e08dae4c0377b9e"),
>>>>>>       "rule_list" : [
>>>>>>           "A",
>>>>>>           "B",
>>>>>>           "C",
>>>>>>           "D13"
>>>>>>       ],
>>>>>>       "actor_friends_count" : 40,
>>>>>>       "klout_score" : 28,
>>>>>>       "actor_favorites_count" : 1697,
>>>>>>       "actor_preferred_username" : "_20_xxxx",
>>>>>>       "sentiment" : "neu",
>>>>>>       "tweet_id" : "tag:search.twitter.com,2005:000xxxxx",
>>>>>>       "object_actor_followers_count" : 573,
>>>>>>       "actor_posted_time" : "2016-06-24T00:37:54.000Z",
>>>>>>       "actor_id" : "id:twitter.com:xxxxxxxxxx",
>>>>>>       "actor_display_name" : "xxxxxx",
>>>>>>       "retweet_count" : 57,
>>>>>>       "hashtag_list" : [
>>>>>>           "c",
>>>>>>           "d",
>>>>>>           "E",
>>>>>>           "f",
>>>>>>           "VCVC",
>>>>>>           "XXCXCXC",
>>>>>>           "RRRR"
>>>>>>       ],
>>>>>>       "body" : "some tweet blah blah",
>>>>>>       "actor_followers_count" : 21,
>>>>>>       "actor_status_count" : 1756,
>>>>>>       "verb" : "share",
>>>>>>       "posted_time" : "2016-08-01T23:47:43.000Z",
>>>>>>       "object_actor_status_count" : 2796,
>>>>>>       "lang" : "ar",
>>>>>>       "object_actor_preferred_username" : "xxxxxxx",
>>>>>>       "original_tweet_id" : "tag:search.twitter.com,2005:xxxxxxxxxxxx",
>>>>>>       "gender" : "male",
>>>>>>       "object_actor_id" : "id:twitter.com:xxxxxxxxxxxxx",
>>>>>>       "favorites_count" : 0,
>>>>>>       "object_posted_time" : "2016-08-01T22:54:22.000Z",
>>>>>>       "object_actor_friends_count" : 69,
>>>>>>       "generator_display_name" : "Twitter Web Client",
>>>>>>       "object_actor_display_name" : "xxxxxxxxxxxx",
>>>>>>       "actor_listed_count" : 0
>>>>>> }
>>>>>>
>>>>>>
>>>>>>
>>>>>> *Questions*
>>>>>>
>>>>>> 1) How can I improve aggregation query performance?
>>>>>>
>>>>>> 2) Do I also need to do something in MongoDB to enhance performance? (I
>>>>>> mean optimize the source)
>>>>>>
>>>>>> 3) Is Apache Drill capable of handling/aggregating billions of documents
>>>>>> in real time?
>>>>>>
>>>>>> Your early response is highly appreciated!
>>>>>>
>>>>>> Thank you & Kind Regards
>>>>>>
>>>>>> Yousuf
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>


Re: Aggregation performance

Posted by Kathleen Li <kl...@maprtech.com>.
It seems that in Drill 1.8 the parameter store.mongo.all_text_mode is already set to true by default.

 Try ALTER SESSION SET `exec.enable_union_type` = true. If you still get the errors, you might open a public JIRA with the detailed information.


You might also try using CTAS to create Drill tables in Parquet format, then run your queries against the Drill Parquet tables to see if the performance gets better.
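A minimal sketch of that CTAS approach (the `mongo.twitter` schema and table names below are placeholders; adjust them to your own storage plugin and workspace):

```sql
-- Sketch only: plugin/workspace names are assumptions, adjust to your setup.
-- Parquet is Drill's default CTAS output format; setting it explicitly anyway.
ALTER SESSION SET `store.format` = 'parquet';

-- Materialize the Mongo collection as a Parquet-backed Drill table.
CREATE TABLE dfs.tmp.`tweets_parquet` AS
SELECT * FROM mongo.twitter.`tweets`;

-- Then run the aggregation against the Parquet copy instead of Mongo.
SELECT count(*) AS cnt, actor_preferred_username
FROM dfs.tmp.`tweets_parquet`
GROUP BY actor_preferred_username
ORDER BY cnt DESC
LIMIT 10;
```

Scanning columnar Parquet avoids re-reading BSON from Mongo on every dashboard query, at the cost of the data being a snapshot that you have to refresh.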


Thanks,

Kathleen




On 12/19/16, 10:31 AM, "Kathleen Li" <kl...@maprtech.com> wrote:

>Hi Yousuf,
>
>
>Yes in my env, I was set store.mongo.bson.record.reader = true.
>With the one record you provided, the same query works fine for me; the error you got is a schema-change-related error:
>0: jdbc:drill:zk=drill1:5181,drill2:5181,dril> SELECT hashtag, count(*) as cnt from (select
>. . . . . . . . . . . . . . . . . . . . . . .> flatten(hashtag_list) as hashtag from test) group by hashtag order
>. . . . . . . . . . . . . . . . . . . . . . .> by cnt desc  limit 10;
>+----------+------+
>| hashtag  | cnt  |
>+----------+------+
>| RRRR     | 1    |
>| VCVC     | 1    |
>| XXCXCXC  | 1    |
>| c        | 1    |
>| d        | 1    |
>| f        | 1    |
>| E        | 1    |
>
>
>With that, you might try alter session set store.mongo.all_text_mode=true and rerun the query to see if you can get past the error.
>
>
>Kathleen
>
>
>
>
>
>
>
>On 12/18/16, 1:21 AM, "yousuf" <yo...@css.org.sa> wrote:
>
>>Hi Kathleen,
>>
>>Thanks for responding...
>>
>>I've noticed that when I alter session set store.mongo.bson.record.reader =
>>true; the performance is improved. However, other queries are
>>failing :(.
>>
>>
>>0: jdbc:drill:> alter session set store.mongo.bson.record.reader = true;
>>+-------+------------------------------------------+
>>|  ok   |                 summary                  |
>>+-------+------------------------------------------+
>>| true  | store.mongo.bson.record.reader updated.  |
>>+-------+------------------------------------------+
>>1 row selected (0.082 seconds)
>>0: jdbc:drill:> SELECT count(*) as cnt, actor_preferred_username from 
>>tweets group by actor_preferred_username order by cnt desc limit 10;
>>+--------+---------------------------+
>>|  cnt   | actor_preferred_username  |
>>+--------+---------------------------+
>>| 10770  | mrnota53                  |
>>| 6239   | cyberahsokatano           |
>>| 4609   | abeerlilak                |
>>| 4562   | DaeshAjel3                |
>>| 4523   | Aster__Q                  |
>>| 4275   | DaeshCrimes               |
>>| 4182   | AfwfwefOfwefho            |
>>| 3980   | HewarMaftuh2              |
>>| 3698   | DaeshAjel                 |
>>| 3661   | sarosh_iq                 |
>>+--------+---------------------------+
>>10 rows selected (4.64 seconds)
>>
>>
>>*Failed queries when **store.mongo.bson.record.reader = true;*
>>
>>*0: jdbc:drill:> SELECT hashtag, count(*) as cnt from (select 
>>flatten(hashtag_list) as hashtag from hashtags) group by hashtag order 
>>by cnt desc  limit 10;
>>Error: SYSTEM ERROR: IllegalArgumentException: You tried to write a 
>>VarChar type when you are using a ValueWriter of type UnionListWriter.
>>
>>Fragment 4:1
>>
>>[Error Id: 278752e2-a959-482c-b4b0-b79ba923f148 on test01.css.org:31010]
>>
>>   (java.lang.IllegalArgumentException) You tried to write a VarChar 
>>type when you are using a ValueWriter of type UnionListWriter.
>>org.apache.drill.exec.vector.complex.impl.AbstractFieldWriter.fail():762
>>org.apache.drill.exec.vector.complex.impl.AbstractFieldWriter.write():325
>>org.apache.drill.exec.vector.complex.impl.UnionListWriter.write():91
>>org.apache.drill.exec.store.bson.BsonRecordReader.writeString():275
>>org.apache.drill.exec.store.bson.BsonRecordReader.writeToListOrMap():167
>>org.apache.drill.exec.store.bson.BsonRecordReader.writeToListOrMap():112
>>     org.apache.drill.exec.store.bson.BsonRecordReader.write():75
>>org.apache.drill.exec.store.mongo.MongoRecordReader.next():186
>>     org.apache.drill.exec.physical.impl.ScanBatch.next():178
>>     org.apache.drill.exec.record.AbstractRecordBatch.next():119
>>     org.apache.drill.exec.record.AbstractRecordBatch.next():109
>>org.apache.drill.exec.record.AbstractSingleRecordBatch.innerNext():51
>>org.apache.drill.exec.physical.impl.project.ProjectRecordBatch.innerNext():135
>>     org.apache.drill.exec.record.AbstractRecordBatch.next():162
>>     org.apache.drill.exec.record.AbstractRecordBatch.next():119
>>     org.apache.drill.exec.record.AbstractRecordBatch.next():109
>>org.apache.drill.exec.record.AbstractSingleRecordBatch.innerNext():51
>>org.apache.drill.exec.physical.impl.flatten.FlattenRecordBatch.innerNext():120
>>     org.apache.drill.exec.record.AbstractRecordBatch.next():162
>>     org.apache.drill.exec.record.AbstractRecordBatch.next():119
>>     org.apache.drill.exec.record.AbstractRecordBatch.next():109
>>org.apache.drill.exec.physical.impl.aggregate.HashAggBatch.buildSchema():97
>>     org.apache.drill.exec.record.AbstractRecordBatch.next():142
>>     org.apache.drill.exec.record.AbstractRecordBatch.next():119
>>     org.apache.drill.exec.record.AbstractRecordBatch.next():109
>>org.apache.drill.exec.record.AbstractSingleRecordBatch.innerNext():51
>>org.apache.drill.exec.physical.impl.project.ProjectRecordBatch.innerNext():135
>>     org.apache.drill.exec.record.AbstractRecordBatch.next():162
>>     org.apache.drill.exec.physical.impl.BaseRootExec.next():104
>>org.apache.drill.exec.physical.impl.SingleSenderCreator$SingleSenderRootExec.innerNext():92
>>     org.apache.drill.exec.physical.impl.BaseRootExec.next():94
>>org.apache.drill.exec.work.fragment.FragmentExecutor$1.run():232
>>org.apache.drill.exec.work.fragment.FragmentExecutor$1.run():226
>>     java.security.AccessController.doPrivileged():-2
>>     javax.security.auth.Subject.doAs():422
>>     org.apache.hadoop.security.UserGroupInformation.doAs():1657
>>org.apache.drill.exec.work.fragment.FragmentExecutor.run():226
>>     org.apache.drill.common.SelfCleaningRunnable.run():38
>>     java.util.concurrent.ThreadPoolExecutor.runWorker():1142
>>     java.util.concurrent.ThreadPoolExecutor$Worker.run():617
>>     java.lang.Thread.run():745 (state=,code=0)
>>
>>*
>>
>>
>>On 12/16/2016 01:55 AM, Kathleen Li wrote:
>>> In my env, first time took about 1.6s, second time only took 0.5s
>>> 0: jdbc:drill:zk=drill1:5181,drill2:5181,dril> SELECT count(*) as cnt, actor_preferred_username from test where
>>> . . . . . . . . . . . . . . . . . . . . . . .> posted_time >= '2016-08-01T00.00.00.000Z' and posted_time
>>> . . . . . . . . . . . . . . . . . . . . . . .> <='2016-08-10T00.00.00.000Z' group by actor_preferred_username   order by
>>> . . . . . . . . . . . . . . . . . . . . . . .> cnt desc limit 10;
>>> +------+---------------------------+
>>> | cnt  | actor_preferred_username  |
>>> +------+---------------------------+
>>> | 1    | _20_xxxx                  |
>>> +------+---------------------------+
>>> 1 row selected (1.585 seconds)
>>> 0: jdbc:drill:zk=drill1:5181,drill2:5181,dril> SELECT count(*) as cnt, actor_preferred_username from test where
>>> . . . . . . . . . . . . . . . . . . . . . . .> posted_time >= '2016-08-01T00.00.00.000Z' and posted_time
>>> . . . . . . . . . . . . . . . . . . . . . . .> <='2016-08-10T00.00.00.000Z' group by actor_preferred_username   order by
>>> . . . . . . . . . . . . . . . . . . . . . . .> cnt desc limit 10;
>>> +------+---------------------------+
>>> | cnt  | actor_preferred_username  |
>>> +------+---------------------------+
>>> | 1    | _20_xxxx                  |
>>> +------+---------------------------+
>>> 1 row selected (0.505 seconds)
>>>
>>>
>>>
>>>
>>>
>>>
>>> I am running 4 VM servers, with 4 GB heap and 8 GB direct memory. But this query is only using one fragment; see the attached drill profile.
>>>
>>> As Dechang suggested, you can check the profile and see which step most of the time is spent on.
>>>
>>> Kathleen
>>>
>>>
>>> On 12/15/16, 2:27 PM, "Dechang Gu" <dg...@maprtech.com> wrote:
>>>
>>>> Yousuf,
>>>> Which version of drill are you running?
>>>> Can you share the profile of the query?
>>>>
>>>> Thanks,
>>>> Dechang
>>>>
>>>>
>>>>
>>>> On Thu, Dec 15, 2016 at 3:27 AM, yousuf <yo...@css.org.sa> wrote:
>>>>
>>>>> Hello experts
>>>>>
>>>>> As a POC project, I've built a drill cluster on 5 VMs , each with the
>>>>> following specs
>>>>>
>>>>> 32 GB ram
>>>>>
>>>>> 1 TB storage
>>>>>
>>>>> 16 Cores
>>>>>
>>>>> A ZooKeeper quorum and Apache Drill are installed on all 5 nodes. My storage
>>>>> engine is Mongo, which has 5 million docs. (Our daily collection is close to
>>>>> 2.5 million tweets.)
>>>>>
>>>>> *Problem*
>>>>>
>>>>> My aggregation queries are too slow for my real-time dashboard. A sample
>>>>> query such as the following took 12+ seconds:
>>>>>
>>>>>
>>>>> SELECT count(*) as cnt, actor_preferred_username from tweets where
>>>>> posted_time >= '2016-08-01T00.00.00.000Z' and posted_time
>>>>> <='2016-08-10T00.00.00.000Z' group by actor_preferred_username   order by
>>>>> cnt desc limit 10;
>>>>>
>>>>> Pls. note that,
>>>>>
>>>>> *In Drill*
>>>>>
>>>>> 1) I've changed the conf/drill-env.sh on each node with the following:
>>>>>
>>>>> |export DRILL_HEAP=${DRILL_HEAP:-"8G"} export
>>>>> DRILL_MAX_DIRECT_MEMORY=${DRILL_MAX_DIRECT_MEMORY:-"20G"}|
>>>>>
>>>>> 2) changed a few settings based on the docs <https://drill.apache.org/docs
>>>>> /sort-based-and-hash-based-memory-constrained-operators/>
>>>>>
>>>>> *In Mongo*
>>>>>
>>>>> Data is sharded across 3 servers on the shard key tweet_id, which distributes
>>>>> data evenly across all shards, and I created a compound index on the fields
>>>>> used by the above Drill query.
>>>>>
>>>>>
>>>>> sample data
>>>>>
>>>>> {
>>>>>      "_id" : ObjectId("58524d507e08dae4c0377b9e"),
>>>>>      "rule_list" : [
>>>>>          "A",
>>>>>          "B",
>>>>>          "C",
>>>>>          "D13"
>>>>>      ],
>>>>>      "actor_friends_count" : 40,
>>>>>      "klout_score" : 28,
>>>>>      "actor_favorites_count" : 1697,
>>>>>      "actor_preferred_username" : "_20_xxxx",
>>>>>      "sentiment" : "neu",
>>>>>      "tweet_id" : "tag:search.twitter.com,2005:000xxxxx",
>>>>>      "object_actor_followers_count" : 573,
>>>>>      "actor_posted_time" : "2016-06-24T00:37:54.000Z",
>>>>>      "actor_id" : "id:twitter.com:xxxxxxxxxx",
>>>>>      "actor_display_name" : "xxxxxx",
>>>>>      "retweet_count" : 57,
>>>>>      "hashtag_list" : [
>>>>>          "c",
>>>>>          "d",
>>>>>          "E",
>>>>>          "f",
>>>>>          "VCVC",
>>>>>          "XXCXCXC",
>>>>>          "RRRR"
>>>>>      ],
>>>>>      "body" : "some tweet blah blah",
>>>>>      "actor_followers_count" : 21,
>>>>>      "actor_status_count" : 1756,
>>>>>      "verb" : "share",
>>>>>      "posted_time" : "2016-08-01T23:47:43.000Z",
>>>>>      "object_actor_status_count" : 2796,
>>>>>      "lang" : "ar",
>>>>>      "object_actor_preferred_username" : "xxxxxxx",
>>>>>      "original_tweet_id" : "tag:search.twitter.com,2005:xxxxxxxxxxxx",
>>>>>      "gender" : "male",
>>>>>      "object_actor_id" : "id:twitter.com:xxxxxxxxxxxxx",
>>>>>      "favorites_count" : 0,
>>>>>      "object_posted_time" : "2016-08-01T22:54:22.000Z",
>>>>>      "object_actor_friends_count" : 69,
>>>>>      "generator_display_name" : "Twitter Web Client",
>>>>>      "object_actor_display_name" : "xxxxxxxxxxxx",
>>>>>      "actor_listed_count" : 0
>>>>> }
>>>>>
>>>>>
>>>>>
>>>>> *Questions*
>>>>>
>>>>> 1) How can I improve aggregation query performance?
>>>>>
>>>>> 2) Do I also need to do something in MongoDB to enhance performance? (I
>>>>> mean optimize the source)
>>>>>
>>>>> 3) Is Apache Drill capable of handling/aggregating billions of documents
>>>>> in real time?
>>>>>
>>>>> Your early response is highly appreciated!
>>>>>
>>>>> Thank you & Kind Regards
>>>>>
>>>>> Yousuf
>>>>>
>>>>>
>>>>>
>>>>>


Re: Aggregation performance

Posted by yousuf <yo...@css.org.sa>.
Hi Kathleen,

Thanks for responding...

I've noticed that when I alter session set store.mongo.bson.record.reader =
true; the performance is improved. However, other queries are
failing :(.


0: jdbc:drill:> alter session set store.mongo.bson.record.reader = true;
+-------+------------------------------------------+
|  ok   |                 summary                  |
+-------+------------------------------------------+
| true  | store.mongo.bson.record.reader updated.  |
+-------+------------------------------------------+
1 row selected (0.082 seconds)
0: jdbc:drill:> SELECT count(*) as cnt, actor_preferred_username from 
tweets group by actor_preferred_username order by cnt desc limit 10;
+--------+---------------------------+
|  cnt   | actor_preferred_username  |
+--------+---------------------------+
| 10770  | mrnota53                  |
| 6239   | cyberahsokatano           |
| 4609   | abeerlilak                |
| 4562   | DaeshAjel3                |
| 4523   | Aster__Q                  |
| 4275   | DaeshCrimes               |
| 4182   | AfwfwefOfwefho            |
| 3980   | HewarMaftuh2              |
| 3698   | DaeshAjel                 |
| 3661   | sarosh_iq                 |
+--------+---------------------------+
10 rows selected (4.64 seconds)


*Failed queries when **store.mongo.bson.record.reader = true;*

*0: jdbc:drill:> SELECT hashtag, count(*) as cnt from (select 
flatten(hashtag_list) as hashtag from hashtags) group by hashtag order 
by cnt desc  limit 10;
Error: SYSTEM ERROR: IllegalArgumentException: You tried to write a 
VarChar type when you are using a ValueWriter of type UnionListWriter.

Fragment 4:1

[Error Id: 278752e2-a959-482c-b4b0-b79ba923f148 on test01.css.org:31010]

   (java.lang.IllegalArgumentException) You tried to write a VarChar 
type when you are using a ValueWriter of type UnionListWriter.
org.apache.drill.exec.vector.complex.impl.AbstractFieldWriter.fail():762
org.apache.drill.exec.vector.complex.impl.AbstractFieldWriter.write():325
org.apache.drill.exec.vector.complex.impl.UnionListWriter.write():91
org.apache.drill.exec.store.bson.BsonRecordReader.writeString():275
org.apache.drill.exec.store.bson.BsonRecordReader.writeToListOrMap():167
org.apache.drill.exec.store.bson.BsonRecordReader.writeToListOrMap():112
     org.apache.drill.exec.store.bson.BsonRecordReader.write():75
org.apache.drill.exec.store.mongo.MongoRecordReader.next():186
     org.apache.drill.exec.physical.impl.ScanBatch.next():178
     org.apache.drill.exec.record.AbstractRecordBatch.next():119
     org.apache.drill.exec.record.AbstractRecordBatch.next():109
org.apache.drill.exec.record.AbstractSingleRecordBatch.innerNext():51
org.apache.drill.exec.physical.impl.project.ProjectRecordBatch.innerNext():135
     org.apache.drill.exec.record.AbstractRecordBatch.next():162
     org.apache.drill.exec.record.AbstractRecordBatch.next():119
     org.apache.drill.exec.record.AbstractRecordBatch.next():109
org.apache.drill.exec.record.AbstractSingleRecordBatch.innerNext():51
org.apache.drill.exec.physical.impl.flatten.FlattenRecordBatch.innerNext():120
     org.apache.drill.exec.record.AbstractRecordBatch.next():162
     org.apache.drill.exec.record.AbstractRecordBatch.next():119
     org.apache.drill.exec.record.AbstractRecordBatch.next():109
org.apache.drill.exec.physical.impl.aggregate.HashAggBatch.buildSchema():97
     org.apache.drill.exec.record.AbstractRecordBatch.next():142
     org.apache.drill.exec.record.AbstractRecordBatch.next():119
     org.apache.drill.exec.record.AbstractRecordBatch.next():109
org.apache.drill.exec.record.AbstractSingleRecordBatch.innerNext():51
org.apache.drill.exec.physical.impl.project.ProjectRecordBatch.innerNext():135
     org.apache.drill.exec.record.AbstractRecordBatch.next():162
     org.apache.drill.exec.physical.impl.BaseRootExec.next():104
org.apache.drill.exec.physical.impl.SingleSenderCreator$SingleSenderRootExec.innerNext():92
     org.apache.drill.exec.physical.impl.BaseRootExec.next():94
org.apache.drill.exec.work.fragment.FragmentExecutor$1.run():232
org.apache.drill.exec.work.fragment.FragmentExecutor$1.run():226
     java.security.AccessController.doPrivileged():-2
     javax.security.auth.Subject.doAs():422
     org.apache.hadoop.security.UserGroupInformation.doAs():1657
org.apache.drill.exec.work.fragment.FragmentExecutor.run():226
     org.apache.drill.common.SelfCleaningRunnable.run():38
     java.util.concurrent.ThreadPoolExecutor.runWorker():1142
     java.util.concurrent.ThreadPoolExecutor$Worker.run():617
     java.lang.Thread.run():745 (state=,code=0)

*


On 12/16/2016 01:55 AM, Kathleen Li wrote:
> In my env, first time took about 1.6s, second time only took 0.5s
> 0: jdbc:drill:zk=drill1:5181,drill2:5181,dril> SELECT count(*) as cnt, actor_preferred_username from test where
> . . . . . . . . . . . . . . . . . . . . . . .> posted_time >= '2016-08-01T00.00.00.000Z' and posted_time
> . . . . . . . . . . . . . . . . . . . . . . .> <='2016-08-10T00.00.00.000Z' group by actor_preferred_username   order by
> . . . . . . . . . . . . . . . . . . . . . . .> cnt desc limit 10;
> +------+---------------------------+
> | cnt  | actor_preferred_username  |
> +------+---------------------------+
> | 1    | _20_xxxx                  |
> +------+---------------------------+
> 1 row selected (1.585 seconds)
> 0: jdbc:drill:zk=drill1:5181,drill2:5181,dril> SELECT count(*) as cnt, actor_preferred_username from test where
> . . . . . . . . . . . . . . . . . . . . . . .> posted_time >= '2016-08-01T00.00.00.000Z' and posted_time
> . . . . . . . . . . . . . . . . . . . . . . .> <='2016-08-10T00.00.00.000Z' group by actor_preferred_username   order by
> . . . . . . . . . . . . . . . . . . . . . . .> cnt desc limit 10;
> +------+---------------------------+
> | cnt  | actor_preferred_username  |
> +------+---------------------------+
> | 1    | _20_xxxx                  |
> +------+---------------------------+
> 1 row selected (0.505 seconds)
>
>
>
>
>
>
> I am running 4 VM servers, with 4 GB heap and 8 GB direct memory. But this query is only using one fragment; see the attached drill profile.
>
> As Dechang suggested, you can check the profile and see which step most of the time is spent on.
>
> Kathleen
>
>
> On 12/15/16, 2:27 PM, "Dechang Gu" <dg...@maprtech.com> wrote:
>
>> Yousuf,
>> Which version of drill are you running?
>> Can you share the profile of the query?
>>
>> Thanks,
>> Dechang
>>
>>
>>
>> On Thu, Dec 15, 2016 at 3:27 AM, yousuf <yo...@css.org.sa> wrote:
>>
>>> Hello experts
>>>
>>> As a POC project, I've built a drill cluster on 5 VMs , each with the
>>> following specs
>>>
>>> 32 GB ram
>>>
>>> 1 TB storage
>>>
>>> 16 Cores
>>>
>>> A ZooKeeper quorum and Apache Drill are installed on all 5 nodes. My storage
>>> engine is Mongo, which has 5 million docs. (Our daily collection is close to
>>> 2.5 million tweets.)
>>>
>>> *Problem*
>>>
>>> My aggregation queries are too slow for my real-time dashboard. A sample
>>> query such as the following took 12+ seconds:
>>>
>>>
>>> SELECT count(*) as cnt, actor_preferred_username from tweets where
>>> posted_time >= '2016-08-01T00.00.00.000Z' and posted_time
>>> <='2016-08-10T00.00.00.000Z' group by actor_preferred_username   order by
>>> cnt desc limit 10;
>>>
>>> Pls. note that,
>>>
>>> *In Drill*
>>>
>>> 1) I've changed the conf/drill-env.sh on each node with the following:
>>>
>>> |export DRILL_HEAP=${DRILL_HEAP:-"8G"} export
>>> DRILL_MAX_DIRECT_MEMORY=${DRILL_MAX_DIRECT_MEMORY:-"20G"}|
>>>
>>> 2) changed a few settings based on the docs <https://drill.apache.org/docs
>>> /sort-based-and-hash-based-memory-constrained-operators/>
>>>
>>> *In Mongo*
>>>
>>> Data is sharded across 3 servers on the shard key tweet_id, which distributes
>>> data evenly across all shards, and I created a compound index on the fields
>>> used by the above Drill query.
>>>
>>>
>>> sample data
>>>
>>> {
>>>      "_id" : ObjectId("58524d507e08dae4c0377b9e"),
>>>      "rule_list" : [
>>>          "A",
>>>          "B",
>>>          "C",
>>>          "D13"
>>>      ],
>>>      "actor_friends_count" : 40,
>>>      "klout_score" : 28,
>>>      "actor_favorites_count" : 1697,
>>>      "actor_preferred_username" : "_20_xxxx",
>>>      "sentiment" : "neu",
>>>      "tweet_id" : "tag:search.twitter.com,2005:000xxxxx",
>>>      "object_actor_followers_count" : 573,
>>>      "actor_posted_time" : "2016-06-24T00:37:54.000Z",
>>>      "actor_id" : "id:twitter.com:xxxxxxxxxx",
>>>      "actor_display_name" : "xxxxxx",
>>>      "retweet_count" : 57,
>>>      "hashtag_list" : [
>>>          "c",
>>>          "d",
>>>          "E",
>>>          "f",
>>>          "VCVC",
>>>          "XXCXCXC",
>>>          "RRRR"
>>>      ],
>>>      "body" : "some tweet blah blah",
>>>      "actor_followers_count" : 21,
>>>      "actor_status_count" : 1756,
>>>      "verb" : "share",
>>>      "posted_time" : "2016-08-01T23:47:43.000Z",
>>>      "object_actor_status_count" : 2796,
>>>      "lang" : "ar",
>>>      "object_actor_preferred_username" : "xxxxxxx",
>>>      "original_tweet_id" : "tag:search.twitter.com,2005:xxxxxxxxxxxx",
>>>      "gender" : "male",
>>>      "object_actor_id" : "id:twitter.com:xxxxxxxxxxxxx",
>>>      "favorites_count" : 0,
>>>      "object_posted_time" : "2016-08-01T22:54:22.000Z",
>>>      "object_actor_friends_count" : 69,
>>>      "generator_display_name" : "Twitter Web Client",
>>>      "object_actor_display_name" : "xxxxxxxxxxxx",
>>>      "actor_listed_count" : 0
>>> }
>>>
>>>
>>>
>>> *Questions*
>>>
>>> 1) How can I improve aggregation query performance?
>>>
>>> 2) Do I also need to do something in MongoDB to enhance performance? (I
>>> mean optimize the source)
>>>
>>> 3) Is Apache Drill capable of handling/aggregating billions of documents
>>> in real time?
>>>
>>> Your early response is highly appreciated!
>>>
>>> Thank you & Kind Regards
>>>
>>> Yousuf
>>>
>>>
>>>
>>>


Re: Aggregation performance

Posted by Kathleen Li <kl...@maprtech.com>.
In my env, first time took about 1.6s, second time only took 0.5s
0: jdbc:drill:zk=drill1:5181,drill2:5181,dril> SELECT count(*) as cnt, actor_preferred_username from test where
. . . . . . . . . . . . . . . . . . . . . . .> posted_time >= '2016-08-01T00.00.00.000Z' and posted_time
. . . . . . . . . . . . . . . . . . . . . . .> <='2016-08-10T00.00.00.000Z' group by actor_preferred_username   order by
. . . . . . . . . . . . . . . . . . . . . . .> cnt desc limit 10;
+------+---------------------------+
| cnt  | actor_preferred_username  |
+------+---------------------------+
| 1    | _20_xxxx                  |
+------+---------------------------+
1 row selected (1.585 seconds)
0: jdbc:drill:zk=drill1:5181,drill2:5181,dril> SELECT count(*) as cnt, actor_preferred_username from test where
. . . . . . . . . . . . . . . . . . . . . . .> posted_time >= '2016-08-01T00.00.00.000Z' and posted_time
. . . . . . . . . . . . . . . . . . . . . . .> <='2016-08-10T00.00.00.000Z' group by actor_preferred_username   order by
. . . . . . . . . . . . . . . . . . . . . . .> cnt desc limit 10;
+------+---------------------------+
| cnt  | actor_preferred_username  |
+------+---------------------------+
| 1    | _20_xxxx                  |
+------+---------------------------+
1 row selected (0.505 seconds)






I am running 4 VM servers, with 4 GB heap and 8 GB direct memory. But this query is only using one fragment; see the attached drill profile.

As Dechang suggested, you can check the profile and see which step most of the time is spent on.
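If it helps, query profiles can also be pulled over Drill's REST interface rather than through the web UI (a rough sketch; the hostname and the query ID below are placeholders, and the embedded web server listens on port 8047 by default):

```shell
# List recent query profiles as JSON (replace drill-node with any drillbit host).
curl -s http://drill-node:8047/profiles.json

# Fetch the full profile for one query by its query ID (placeholder ID shown).
curl -s http://drill-node:8047/profiles/2a1e1a94-xxxx-xxxx-xxxx-xxxxxxxxxxxx.json
```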

Kathleen


On 12/15/16, 2:27 PM, "Dechang Gu" <dg...@maprtech.com> wrote:

>Yousuf,
>Which version of drill are you running?
>Can you share the profile of the query?
>
>Thanks,
>Dechang
>
>
>
>On Thu, Dec 15, 2016 at 3:27 AM, yousuf <yo...@css.org.sa> wrote:
>
>> Hello experts
>>
>> As a POC project, I've built a Drill cluster on 5 VMs, each with the
>> following specs
>>
>> 32 GB ram
>>
>> 1 TB storage
>>
>> 16 Cores
>>
>> A ZooKeeper quorum & Apache Drill are installed on all 5 nodes. My storage
>> engine is Mongo, which has 5 million docs. (Our daily collection is close to
>> 2.5 million tweets.)
>>
>> *Problem*
>>
>> My aggregation queries are too slow for my real-time dashboard. The sample
>> query below took 12+ seconds:
>>
>>
>> SELECT count(*) as cnt, actor_preferred_username from tweets where
>> posted_time >= '2016-08-01T00.00.00.000Z' and posted_time
>> <='2016-08-10T00.00.00.000Z' group by actor_preferred_username   order by
>> cnt desc limit 10;
>>
>> Pls. note that,
>>
>> *In Drill*
>>
>> 1) I've changed conf/drill-env.sh on each node with the following:
>>
>> export DRILL_HEAP=${DRILL_HEAP:-"8G"}
>> export DRILL_MAX_DIRECT_MEMORY=${DRILL_MAX_DIRECT_MEMORY:-"20G"}
>>
>> 2) changed a few settings based on the docs
>> <https://drill.apache.org/docs/sort-based-and-hash-based-memory-constrained-operators/>
>>
>> *In Mongo*
>>
>> Data is sharded across 3 servers on the shard key tweet_id, which distributes
>> data evenly across all shards, and I created a compound index on the fields
>> used by the above Drill query.
>>
>>
>> sample data
>>
>> {
>>     "_id" : ObjectId("58524d507e08dae4c0377b9e"),
>>     "rule_list" : [
>>         "A",
>>         "B",
>>         "C",
>>         "D13"
>>     ],
>>     "actor_friends_count" : 40,
>>     "klout_score" : 28,
>>     "actor_favorites_count" : 1697,
>>     "actor_preferred_username" : "_20_xxxx",
>>     "sentiment" : "neu",
>>     "tweet_id" : "tag:search.twitter.com,2005:000xxxxx",
>>     "object_actor_followers_count" : 573,
>>     "actor_posted_time" : "2016-06-24T00:37:54.000Z",
>>     "actor_id" : "id:twitter.com:xxxxxxxxxx",
>>     "actor_display_name" : "xxxxxx",
>>     "retweet_count" : 57,
>>     "hashtag_list" : [
>>         "c",
>>         "d",
>>         "E",
>>         "f",
>>         "VCVC",
>>         "XXCXCXC",
>>         "RRRR"
>>     ],
>>     "body" : "some tweet blah blah",
>>     "actor_followers_count" : 21,
>>     "actor_status_count" : 1756,
>>     "verb" : "share",
>>     "posted_time" : "2016-08-01T23:47:43.000Z",
>>     "object_actor_status_count" : 2796,
>>     "lang" : "ar",
>>     "object_actor_preferred_username" : "xxxxxxx",
>>     "original_tweet_id" : "tag:search.twitter.com,2005:xxxxxxxxxxxx",
>>     "gender" : "male",
>>     "object_actor_id" : "id:twitter.com:xxxxxxxxxxxxx",
>>     "favorites_count" : 0,
>>     "object_posted_time" : "2016-08-01T22:54:22.000Z",
>>     "object_actor_friends_count" : 69,
>>     "generator_display_name" : "Twitter Web Client",
>>     "object_actor_display_name" : "xxxxxxxxxxxx",
>>     "actor_listed_count" : 0
>> }
>>
>>
>>
>> *Questions*
>>
>> 1) How to improve aggregation query performance?
>>
>> 2) Do I also need to do something in MongoDB to enhance performance? (I
>> mean, optimize the source.)
>>
>> 3) Is Apache Drill capable of handling/aggregating billions of documents
>> in real time?
>>
>> Your early response is highly appreciated!
>>
>> Thank you & Kind Regards
>>
>> Yousuf
>>
>>
>>
>>

Re: Aggregation performance

Posted by Dechang Gu <dg...@maprtech.com>.
Yousuf,
Which version of drill are you running?
Can you share the profile of the query?
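If it's quicker, the version can also be read from the built-in sys.version system table. A sketch over Drill's REST API, assuming a drillbit web UI at drill1:8047 (the default HTTP port; adjust to your cluster):

```shell
# Run a SQL query over Drill's REST API against the sys.version system table.
curl -s -X POST -H "Content-Type: application/json" \
  -d '{"queryType": "SQL", "query": "SELECT version, commit_id FROM sys.version"}' \
  http://drill1:8047/query.json
```

The same query can of course be run from sqlline: `SELECT version, commit_id FROM sys.version;`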

Thanks,
Dechang



On Thu, Dec 15, 2016 at 3:27 AM, yousuf <yo...@css.org.sa> wrote:

> [quoted text of the original message trimmed; see the full message above]