Posted to user@drill.apache.org by yousuf <yo...@css.org.sa> on 2016/12/15 11:27:51 UTC
Aggregation performance
Hello experts
As a POC project, I've built a Drill cluster on 5 VMs, each with the
following specs:
32 GB RAM
1 TB storage
16 cores
A ZooKeeper quorum & Apache Drill are installed on all 5 nodes. My storage
engine is MongoDB, which holds 5 million docs. (Our daily collection is close
to 2.5 million tweets.)
*Problem*
My aggregation queries are too slow for my real-time dashboard.
The following sample query took 12+ seconds:
SELECT count(*) as cnt, actor_preferred_username from tweets where
posted_time >= '2016-08-01T00.00.00.000Z' and posted_time
<='2016-08-10T00.00.00.000Z' group by actor_preferred_username order
by cnt desc limit 10;
Please note that:
*In Drill*
1) I've changed conf/drill-env.sh on each node with the following:
export DRILL_HEAP=${DRILL_HEAP:-"8G"}
export DRILL_MAX_DIRECT_MEMORY=${DRILL_MAX_DIRECT_MEMORY:-"20G"}
2) I changed a few settings based on the docs
<https://drill.apache.org/docs/sort-based-and-hash-based-memory-constrained-operators/>
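For reference, the memory-related options the linked docs discuss can be inspected and adjusted per session. A minimal sketch (the option name exists in Drill 1.8; the value shown is illustrative, not a recommendation for this workload):

```sql
-- Inspect the current value of a memory-related option
SELECT * FROM sys.options
WHERE name = 'planner.memory.max_query_memory_per_node';

-- Raise the per-node memory budget for buffering operators (4 GB here),
-- for this session only
ALTER SESSION SET `planner.memory.max_query_memory_per_node` = 4294967296;
```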
*In Mongo*
Data is sharded across 3 servers on the shard key tweet_id, which distributes
data evenly over all shards, and I created a compound index on the fields
used by the above Drill query.
sample data
{
"_id" : ObjectId("58524d507e08dae4c0377b9e"),
"rule_list" : [
"A",
"B",
"C",
"D13"
],
"actor_friends_count" : 40,
"klout_score" : 28,
"actor_favorites_count" : 1697,
"actor_preferred_username" : "_20_xxxx",
"sentiment" : "neu",
"tweet_id" : "tag:search.twitter.com,2005:000xxxxx",
"object_actor_followers_count" : 573,
"actor_posted_time" : "2016-06-24T00:37:54.000Z",
"actor_id" : "id:twitter.com:xxxxxxxxxx",
"actor_display_name" : "xxxxxx",
"retweet_count" : 57,
"hashtag_list" : [
"c",
"d",
"E",
"f",
"VCVC",
"XXCXCXC",
"RRRR"
],
"body" : "some tweet blah blah",
"actor_followers_count" : 21,
"actor_status_count" : 1756,
"verb" : "share",
"posted_time" : "2016-08-01T23:47:43.000Z",
"object_actor_status_count" : 2796,
"lang" : "ar",
"object_actor_preferred_username" : "xxxxxxx",
"original_tweet_id" : "tag:search.twitter.com,2005:xxxxxxxxxxxx",
"gender" : "male",
"object_actor_id" : "id:twitter.com:xxxxxxxxxxxxx",
"favorites_count" : 0,
"object_posted_time" : "2016-08-01T22:54:22.000Z",
"object_actor_friends_count" : 69,
"generator_display_name" : "Twitter Web Client",
"object_actor_display_name" : "xxxxxxxxxxxx",
"actor_listed_count" : 0
}
*Questions*
1) How can I improve aggregation query performance?
2) Do I also need to do something in MongoDB to enhance performance? (I
mean optimizing the source.)
3) Is Apache Drill capable of handling/aggregating billions of
documents in real time?
Your early response is highly appreciated!
Thank you & Kind Regards
Yousuf
Re: Aggregation performance
Posted by Dechang Gu <dg...@maprtech.com>.
Hi Yousuf,
Thanks for the update and the profile. From the profile, it looks like most of
the time was spent in the following operator:
05-xx-03 UNKNOWN_OPERATOR, setup 0.000s/0.000s/0.000s, process
*1.350s*/*4.903s*/*7.817s*, wait 0.000s/0.000s/0.000s, peak memory 280KB/280KB
which is mainly the mongoScan. Also, the min (1.350s) and max (7.817s) process
times here are quite different, which may indicate some skew in the data
distribution across the nodes. Take a look at the minor fragment stats to find
out how the scanned data is distributed.
Another thing to look at is filter pushdown in the mongoScan, shown in the
physical plan. As an example, the physical plan from Kathleen's run on your
sample data is attached. Please compare it with yours (and please also share
yours with me if you want me to take a look).
Thanks,
Dechang
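The filter-pushdown check above can be done with EXPLAIN. A sketch, using the table name from the original query:

```sql
-- Show the physical plan; if the posted_time filter is pushed down, it
-- appears inside the mongo scan's filter spec rather than in a separate
-- Drill FILTER operator above the scan.
EXPLAIN PLAN FOR
SELECT count(*) AS cnt, actor_preferred_username FROM tweets
WHERE posted_time >= '2016-08-01T00.00.00.000Z'
  AND posted_time <= '2016-08-10T00.00.00.000Z'
GROUP BY actor_preferred_username
ORDER BY cnt DESC LIMIT 10;
```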
Re: Aggregation performance
Posted by yousuf <yo...@css.org.sa>.
Hi Dechang,
Thanks for your reply,
Drill version: 1.8.0
Mongo: 3.2.11
*Here is the query profile. Kindly also check my system options, attached,
for possibly incorrect settings.*
*Operator ID | Type | Setup (min/avg/max) | Process (min/avg/max) | Wait (min/avg/max) | Peak Memory (avg/max)*
00-xx-00 SCREEN | 0.000s/0.000s/0.000s | 0.000s/0.000s/0.000s | 0.001s/0.001s/0.001s | 52KB/52KB
00-xx-01 PROJECT | 0.000s/0.000s/0.000s | 0.000s/0.000s/0.000s | 0.000s/0.000s/0.000s | 52KB/52KB
00-xx-02 SELECTION_VECTOR_REMOVER | 0.001s/0.001s/0.001s | 0.000s/0.000s/0.000s | 0.000s/0.000s/0.000s | 52KB/52KB
00-xx-03 LIMIT | 0.000s/0.000s/0.000s | 0.000s/0.000s/0.000s | 0.000s/0.000s/0.000s | 308KB/308KB
00-xx-04 MERGING_RECEIVER | 0.000s/0.000s/0.000s | 0.003s/0.003s/0.003s | 7.829s/7.829s/7.829s | 308KB/308KB
01-xx-00 SINGLE_SENDER | 0.000s/0.000s/0.000s | 0.000s/0.000s/0.000s | 0.001s/0.001s/0.002s | 52KB/52KB
01-xx-01 SELECTION_VECTOR_REMOVER | 0.001s/0.001s/0.001s | 0.000s/0.000s/0.000s | 0.000s/0.000s/0.000s | 84KB/84KB
01-xx-02 TOP_N_SORT | 0.000s/0.000s/0.000s | 0.004s/0.016s/0.050s | 0.000s/0.000s/0.000s | 5MB/18MB
01-xx-03 PROJECT | 0.000s/0.000s/0.001s | 0.000s/0.001s/0.005s | 0.000s/0.000s/0.000s | 32KB/32KB
01-xx-04 UNORDERED_RECEIVER | 0.000s/0.000s/0.000s | 0.001s/0.007s/0.026s | 7.759s/7.819s/7.841s | -/-
02-xx-00 HASH_PARTITION_SENDER | 0.000s/0.000s/0.000s | 0.073s/0.081s/0.090s | 0.039s/0.044s/0.051s | 373KB/379KB
02-xx-01 UNORDERED_RECEIVER | 0.000s/0.000s/0.000s | 0.001s/0.001s/0.001s | 7.748s/7.755s/7.761s | -/-
03-xx-00 SINGLE_SENDER | 0.000s/0.000s/0.000s | 0.000s/0.000s/0.000s | 0.001s/0.004s/0.008s | 5MB/5MB
03-xx-01 PROJECT | 0.001s/0.002s/0.005s | 0.000s/0.000s/0.001s | 0.000s/0.000s/0.000s | 5MB/5MB
03-xx-02 PROJECT | 0.000s/0.000s/0.002s | 0.000s/0.000s/0.000s | 0.000s/0.000s/0.000s | 4MB/4MB
03-xx-03 HASH_AGGREGATE | 0.004s/0.009s/0.024s | 0.054s/0.065s/0.077s | 0.000s/0.000s/0.000s | 6MB/6MB
03-xx-04 PROJECT | 0.000s/0.001s/0.006s | 0.001s/0.002s/0.002s | 0.000s/0.000s/0.000s | 32KB/32KB
03-xx-05 UNORDERED_RECEIVER | 0.000s/0.000s/0.000s | 0.007s/0.009s/0.018s | 7.646s/7.670s/7.695s | -/-
04-xx-00 HASH_PARTITION_SENDER | 0.000s/0.000s/0.000s | 0.435s/0.488s/0.659s | 0.088s/0.147s/0.216s | 2MB/2MB
04-xx-01 UNORDERED_RECEIVER | 0.000s/0.000s/0.000s | 0.021s/0.025s/0.028s | 6.014s/6.371s/6.854s | -/-
05-xx-00 SINGLE_SENDER | 0.000s/0.000s/0.000s | 0.000s/0.002s/0.004s | 0.001s/0.004s/0.050s | 116KB/116KB
05-xx-01 PROJECT | 0.001s/0.002s/0.010s | 0.001s/0.007s/0.012s | 0.000s/0.000s/0.000s | 116KB/116KB
05-xx-02 PROJECT | 0.000s/0.001s/0.004s | 0.000s/0.001s/0.004s | 0.000s/0.000s/0.000s | 100KB/100KB
05-xx-03 UNKNOWN_OPERATOR | 0.000s/0.000s/0.000s | 1.350s/4.903s/7.817s | 0.000s/0.000s/0.000s | 280KB/280KB
Thanks & regards
On 12/16/2016 01:27 AM, Dechang Gu wrote:
> Yousuf,
> Which version of drill are you running?
> Can you share the profile of the query?
>
> Thanks,
> Dechang
>
Re: Aggregation performance
Posted by Kathleen Li <kl...@maprtech.com>.
Hi Yousuf,
Yes, in my env I had set store.mongo.bson.record.reader = true.
With the one record you provided, the same query works fine for me; the error you got is a schema-change-related error:
0: jdbc:drill:zk=drill1:5181,drill2:5181,dril> SELECT hashtag, count(*) as cnt from (select
. . . . . . . . . . . . . . . . . . . . . . .> flatten(hashtag_list) as hashtag from test) group by hashtag order
. . . . . . . . . . . . . . . . . . . . . . .> by cnt desc limit 10;
+----------+------+
| hashtag | cnt |
+----------+------+
| RRRR | 1 |
| VCVC | 1 |
| XXCXCXC | 1 |
| c | 1 |
| d | 1 |
| f | 1 |
| E | 1 |
With that, you might try: alter session set store.mongo.all_text_mode = true, then rerun the query to see if you can get past the error.
Kathleen
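Kathleen's suggestion can be sketched as a short session (for illustration only; whether it avoids the error depends on the data):

```sql
-- Read all Mongo fields as text to sidestep schema-change errors,
-- then rerun the failing flatten query
ALTER SESSION SET `store.mongo.all_text_mode` = true;

-- Confirm the current mongo-related option values
SELECT * FROM sys.options WHERE name LIKE 'store.mongo%';
```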
On 12/18/16, 1:21 AM, "yousuf" <yo...@css.org.sa> wrote:
>Hi Kathleen,
>
>Thanks for responding...
>
>I've noticed that when alter session set store.mongo.bson.record.reader =
>true; is applied, the performance improves. However, other queries are
>failing :(.
>
>
>0: jdbc:drill:> alter session set store.mongo.bson.record.reader = true;
>+-------+------------------------------------------+
>| ok | summary |
>+-------+------------------------------------------+
>| true | store.mongo.bson.record.reader updated. |
>+-------+------------------------------------------+
>1 row selected (0.082 seconds)
>0: jdbc:drill:> SELECT count(*) as cnt, actor_preferred_username from
>tweets group by actor_preferred_username order by cnt desc limit 10;
>+--------+---------------------------+
>| cnt | actor_preferred_username |
>+--------+---------------------------+
>| 10770 | mrnota53 |
>| 6239 | cyberahsokatano |
>| 4609 | abeerlilak |
>| 4562 | DaeshAjel3 |
>| 4523 | Aster__Q |
>| 4275 | DaeshCrimes |
>| 4182 | AfwfwefOfwefho |
>| 3980 | HewarMaftuh2 |
>| 3698 | DaeshAjel |
>| 3661 | sarosh_iq |
>+--------+---------------------------+
>10 rows selected (4.64 seconds)
>
>
>Failed queries when store.mongo.bson.record.reader = true:
>
>0: jdbc:drill:> SELECT hashtag, count(*) as cnt from (select
>flatten(hashtag_list) as hashtag from hashtags) group by hashtag order
>by cnt desc limit 10;
>Error: SYSTEM ERROR: IllegalArgumentException: You tried to write a
>VarChar type when you are using a ValueWriter of type UnionListWriter.
>
>Fragment 4:1
>
>[Error Id: 278752e2-a959-482c-b4b0-b79ba923f148 on test01.css.org:31010]
>
> (java.lang.IllegalArgumentException) You tried to write a VarChar
>type when you are using a ValueWriter of type UnionListWriter.
>org.apache.drill.exec.vector.complex.impl.AbstractFieldWriter.fail():762
>org.apache.drill.exec.vector.complex.impl.AbstractFieldWriter.write():325
>org.apache.drill.exec.vector.complex.impl.UnionListWriter.write():91
>org.apache.drill.exec.store.bson.BsonRecordReader.writeString():275
>org.apache.drill.exec.store.bson.BsonRecordReader.writeToListOrMap():167
>org.apache.drill.exec.store.bson.BsonRecordReader.writeToListOrMap():112
> org.apache.drill.exec.store.bson.BsonRecordReader.write():75
>org.apache.drill.exec.store.mongo.MongoRecordReader.next():186
> org.apache.drill.exec.physical.impl.ScanBatch.next():178
> org.apache.drill.exec.record.AbstractRecordBatch.next():119
> org.apache.drill.exec.record.AbstractRecordBatch.next():109
>org.apache.drill.exec.record.AbstractSingleRecordBatch.innerNext():51
>org.apache.drill.exec.physical.impl.project.ProjectRecordBatch.innerNext():135
> org.apache.drill.exec.record.AbstractRecordBatch.next():162
> org.apache.drill.exec.record.AbstractRecordBatch.next():119
> org.apache.drill.exec.record.AbstractRecordBatch.next():109
>org.apache.drill.exec.record.AbstractSingleRecordBatch.innerNext():51
>org.apache.drill.exec.physical.impl.flatten.FlattenRecordBatch.innerNext():120
> org.apache.drill.exec.record.AbstractRecordBatch.next():162
> org.apache.drill.exec.record.AbstractRecordBatch.next():119
> org.apache.drill.exec.record.AbstractRecordBatch.next():109
>org.apache.drill.exec.physical.impl.aggregate.HashAggBatch.buildSchema():97
> org.apache.drill.exec.record.AbstractRecordBatch.next():142
> org.apache.drill.exec.record.AbstractRecordBatch.next():119
> org.apache.drill.exec.record.AbstractRecordBatch.next():109
>org.apache.drill.exec.record.AbstractSingleRecordBatch.innerNext():51
>org.apache.drill.exec.physical.impl.project.ProjectRecordBatch.innerNext():135
> org.apache.drill.exec.record.AbstractRecordBatch.next():162
> org.apache.drill.exec.physical.impl.BaseRootExec.next():104
>org.apache.drill.exec.physical.impl.SingleSenderCreator$SingleSenderRootExec.innerNext():92
> org.apache.drill.exec.physical.impl.BaseRootExec.next():94
>org.apache.drill.exec.work.fragment.FragmentExecutor$1.run():232
>org.apache.drill.exec.work.fragment.FragmentExecutor$1.run():226
> java.security.AccessController.doPrivileged():-2
> javax.security.auth.Subject.doAs():422
> org.apache.hadoop.security.UserGroupInformation.doAs():1657
>org.apache.drill.exec.work.fragment.FragmentExecutor.run():226
> org.apache.drill.common.SelfCleaningRunnable.run():38
> java.util.concurrent.ThreadPoolExecutor.runWorker():1142
> java.util.concurrent.ThreadPoolExecutor$Worker.run():617
> java.lang.Thread.run():745 (state=,code=0)
>
>
>
>
>On 12/16/2016 01:55 AM, Kathleen Li wrote:
>> In my env, first time took about 1.6s, second time only took 0.5s
>> 0: jdbc:drill:zk=drill1:5181,drill2:5181,dril> SELECT count(*) as cnt, actor_preferred_username from test where
>> . . . . . . . . . . . . . . . . . . . . . . .> posted_time >= '2016-08-01T00.00.00.000Z' and posted_time
>> . . . . . . . . . . . . . . . . . . . . . . .> <='2016-08-10T00.00.00.000Z' group by actor_preferred_username order by
>> . . . . . . . . . . . . . . . . . . . . . . .> cnt desc limit 10;
>> +------+---------------------------+
>> | cnt | actor_preferred_username |
>> +------+---------------------------+
>> | 1 | _20_xxxx |
>> +------+---------------------------+
>> 1 row selected (1.585 seconds)
>> 0: jdbc:drill:zk=drill1:5181,drill2:5181,dril> SELECT count(*) as cnt, actor_preferred_username from test where
>> . . . . . . . . . . . . . . . . . . . . . . .> posted_time >= '2016-08-01T00.00.00.000Z' and posted_time
>> . . . . . . . . . . . . . . . . . . . . . . .> <='2016-08-10T00.00.00.000Z' group by actor_preferred_username order by
>> . . . . . . . . . . . . . . . . . . . . . . .> cnt desc limit 10;
>> +------+---------------------------+
>> | cnt | actor_preferred_username |
>> +------+---------------------------+
>> | 1 | _20_xxxx |
>> +------+---------------------------+
>> 1 row selected (0.505 seconds)
>>
>>
>>
>>
>>
>>
>> I am running 4 VM servers, heap 4GB and direct 8GB. But this query is only using one fragment; see the attached drill profile.
>>
>> As Dechang suggested, you can check the profile to see which step most of the time is spent on.
>>
>> Kathleen
Re: Aggregation performance
Posted by yousuf <yo...@css.org.sa>.
Hi Kathleen
I've reported a bug on JIRA <https://issues.apache.org/jira/browse/DRILL-5139>
Thanks & Regards
Yousuf
On 12/19/2016 10:03 PM, Kathleen Li wrote:
> It seems that in Drill 1.8 the parameter store.mongo.all_text_mode is already true by default.
>
> Try ALTER SESSION SET `exec.enable_union_type` = true; if you still get the errors, you might open a public JIRA with the detailed information.
>
>
> You might also try using CTAS to create Drill tables in Parquet format, then run your queries against the Drill Parquet tables to see if the performance gets better.
>
>
> Thanks,
>
> Kathleen
>
>
>
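Kathleen's CTAS suggestion above can be sketched as follows; the dfs.tmp workspace and the mongo.twitterdb.tweets plugin/database/collection names are assumptions for illustration and need to match your own storage-plugin configuration:

```sql
-- Materialize the Mongo collection as Parquet in a writable workspace,
-- keeping only the columns the dashboard queries need
CREATE TABLE dfs.tmp.`tweets_parquet` AS
SELECT posted_time, actor_preferred_username, hashtag_list
FROM mongo.twitterdb.tweets;

-- Run the aggregation against the Parquet copy instead of Mongo
SELECT count(*) AS cnt, actor_preferred_username
FROM dfs.tmp.`tweets_parquet`
GROUP BY actor_preferred_username
ORDER BY cnt DESC LIMIT 10;
```

Parquet's columnar layout lets Drill read only the referenced columns, which typically helps scan-dominated aggregations like this one, at the cost of keeping the copy refreshed.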
>
> On 12/19/16, 10:31 AM, "Kathleen Li" <kl...@maprtech.com> wrote:
>
>> Hi Yousuf,
>>
>>
>> Yes in my env, I was set store.mongo.bson.record.reader = true.
>> With one record you provided, the same query works fine for me, the error you got is the schema changes related errors:
>> 0: jdbc:drill:zk=drill1:5181,drill2:5181,dril> SELECT hashtag, count(*) as cnt from (select
>> . . . . . . . . . . . . . . . . . . . . . . .> flatten(hashtag_list) as hashtag from test) group by hashtag order
>> . . . . . . . . . . . . . . . . . . . . . . .> by cnt desc limit 10;
>> +----------+------+
>> | hashtag | cnt |
>> +----------+------+
>> | RRRR | 1 |
>> | VCVC | 1 |
>> | XXCXCXC | 1 |
>> | c | 1 |
>> | d | 1 |
>> | f | 1 |
>> | E | 1 |
>>
>>
>> With that, you might try: alter session set store.mongo.all_text_mode=true and rerun the query to see if you can pass the error
>>
>>
>> Kathleen
>>
>>
>>
>>
>>
>>
>>
>> On 12/18/16, 1:21 AM, "yousuf" <yo...@css.org.sa> wrote:
>>
>>> Hi Kathleen,
>>>
>>> Thanks for responding...
>>>
>>> I've noticed when alter session set store.mongo.bson.record.reader =
>>> true; the performance is improved. However, the other queries are
>>> failing :(.
>>>
>>>
>>> 0: jdbc:drill:> alter session set store.mongo.bson.record.reader = true;
>>> +-------+------------------------------------------+
>>> | ok | summary |
>>> +-------+------------------------------------------+
>>> | true | store.mongo.bson.record.reader updated. |
>>> +-------+------------------------------------------+
>>> 1 row selected (0.082 seconds)
>>> 0: jdbc:drill:> SELECT count(*) as cnt, actor_preferred_username from
>>> tweets group by actor_preferred_username order by cnt desc limit 10;
>>> +--------+---------------------------+
>>> | cnt | actor_preferred_username |
>>> +--------+---------------------------+
>>> | 10770 | mrnota53 |
>>> | 6239 | cyberahsokatano |
>>> | 4609 | abeerlilak |
>>> | 4562 | DaeshAjel3 |
>>> | 4523 | Aster__Q |
>>> | 4275 | DaeshCrimes |
>>> | 4182 | AfwfwefOfwefho |
>>> | 3980 | HewarMaftuh2 |
>>> | 3698 | DaeshAjel |
>>> | 3661 | sarosh_iq |
>>> +--------+---------------------------+
>>> 10 rows selected (4.64 seconds)
>>>
>>>
>>> *Failed queries when **store.mongo.bson.record.reader = true;*
>>>
>>> *0: jdbc:drill:> SELECT hashtag, count(*) as cnt from (select
>>> flatten(hashtag_list) as hashtag from hashtags) group by hashtag order
>>> by cnt desc limit 10;
>>> Error: SYSTEM ERROR: IllegalArgumentException: You tried to write a
>>> VarChar type when you are using a ValueWriter of type UnionListWriter.
>>>
>>> Fragment 4:1
>>>
>>> [Error Id: 278752e2-a959-482c-b4b0-b79ba923f148 on test01.css.org:31010]
>>>
>>> (java.lang.IllegalArgumentException) You tried to write a VarChar
>>> type when you are using a ValueWriter of type UnionListWriter.
>>> org.apache.drill.exec.vector.complex.impl.AbstractFieldWriter.fail():762
>>> org.apache.drill.exec.vector.complex.impl.AbstractFieldWriter.write():325
>>> org.apache.drill.exec.vector.complex.impl.UnionListWriter.write():91
>>> org.apache.drill.exec.store.bson.BsonRecordReader.writeString():275
>>> org.apache.drill.exec.store.bson.BsonRecordReader.writeToListOrMap():167
>>> org.apache.drill.exec.store.bson.BsonRecordReader.writeToListOrMap():112
>>> org.apache.drill.exec.store.bson.BsonRecordReader.write():75
>>> org.apache.drill.exec.store.mongo.MongoRecordReader.next():186
>>> org.apache.drill.exec.physical.impl.ScanBatch.next():178
>>> org.apache.drill.exec.record.AbstractRecordBatch.next():119
>>> org.apache.drill.exec.record.AbstractRecordBatch.next():109
>>> org.apache.drill.exec.record.AbstractSingleRecordBatch.innerNext():51
>>> org.apache.drill.exec.physical.impl.project.ProjectRecordBatch.innerNext():135
>>> org.apache.drill.exec.record.AbstractRecordBatch.next():162
>>> org.apache.drill.exec.record.AbstractRecordBatch.next():119
>>> org.apache.drill.exec.record.AbstractRecordBatch.next():109
>>> org.apache.drill.exec.record.AbstractSingleRecordBatch.innerNext():51
>>> org.apache.drill.exec.physical.impl.flatten.FlattenRecordBatch.innerNext():120
>>> org.apache.drill.exec.record.AbstractRecordBatch.next():162
>>> org.apache.drill.exec.record.AbstractRecordBatch.next():119
>>> org.apache.drill.exec.record.AbstractRecordBatch.next():109
>>> org.apache.drill.exec.physical.impl.aggregate.HashAggBatch.buildSchema():97
>>> org.apache.drill.exec.record.AbstractRecordBatch.next():142
>>> org.apache.drill.exec.record.AbstractRecordBatch.next():119
>>> org.apache.drill.exec.record.AbstractRecordBatch.next():109
>>> org.apache.drill.exec.record.AbstractSingleRecordBatch.innerNext():51
>>> org.apache.drill.exec.physical.impl.project.ProjectRecordBatch.innerNext():135
>>> org.apache.drill.exec.record.AbstractRecordBatch.next():162
>>> org.apache.drill.exec.physical.impl.BaseRootExec.next():104
>>> org.apache.drill.exec.physical.impl.SingleSenderCreator$SingleSenderRootExec.innerNext():92
>>> org.apache.drill.exec.physical.impl.BaseRootExec.next():94
>>> org.apache.drill.exec.work.fragment.FragmentExecutor$1.run():232
>>> org.apache.drill.exec.work.fragment.FragmentExecutor$1.run():226
>>> java.security.AccessController.doPrivileged():-2
>>> javax.security.auth.Subject.doAs():422
>>> org.apache.hadoop.security.UserGroupInformation.doAs():1657
>>> org.apache.drill.exec.work.fragment.FragmentExecutor.run():226
>>> org.apache.drill.common.SelfCleaningRunnable.run():38
>>> java.util.concurrent.ThreadPoolExecutor.runWorker():1142
>>> java.util.concurrent.ThreadPoolExecutor$Worker.run():617
>>> java.lang.Thread.run():745 (state=,code=0)
>>>
>>> *
>>>
>>>
Re: Aggregation performance
Posted by Kathleen Li <kl...@maprtech.com>.
It seems that in Drill 1.8 the parameter store.mongo.all_text_mode=true is already set by default.
Try ALTER SESSION SET `exec.enable_union_type` = true; if you still get the errors, you might open a public JIRA with the detailed information.
You might also try using CTAS to create Drill tables in Parquet format, then run your queries against the Drill Parquet tables to see if the performance gets better.
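For illustration, the CTAS approach could look like the following sketch; the dfs.tmp workspace and the mongo.twitterdb.tweets table path are assumed names, not taken from this thread:

```sql
-- Hypothetical sketch of the CTAS approach (dfs.tmp workspace and
-- mongo.twitterdb.`tweets` are assumed names, not from this thread):
ALTER SESSION SET `store.format` = 'parquet';

CREATE TABLE dfs.tmp.`tweets_parquet` AS
SELECT actor_preferred_username, posted_time, hashtag_list
FROM mongo.twitterdb.`tweets`;

-- The aggregation can then run against the Parquet copy instead of Mongo:
SELECT count(*) AS cnt, actor_preferred_username
FROM dfs.tmp.`tweets_parquet`
WHERE posted_time >= '2016-08-01T00.00.00.000Z'
  AND posted_time <= '2016-08-10T00.00.00.000Z'
GROUP BY actor_preferred_username
ORDER BY cnt DESC
LIMIT 10;
```

Scanning Parquet avoids re-reading BSON from Mongo on every query, at the cost of the copy being a snapshot that has to be refreshed.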
Thanks,
Kathleen
On 12/19/16, 10:31 AM, "Kathleen Li" <kl...@maprtech.com> wrote:
>Hi Yousuf,
>
>
>Yes, in my env I had set store.mongo.bson.record.reader = true.
>With the one record you provided, the same query works fine for me; the error you got is a schema-change-related error:
>0: jdbc:drill:zk=drill1:5181,drill2:5181,dril> SELECT hashtag, count(*) as cnt from (select
>. . . . . . . . . . . . . . . . . . . . . . .> flatten(hashtag_list) as hashtag from test) group by hashtag order
>. . . . . . . . . . . . . . . . . . . . . . .> by cnt desc limit 10;
>+----------+------+
>| hashtag | cnt |
>+----------+------+
>| RRRR | 1 |
>| VCVC | 1 |
>| XXCXCXC | 1 |
>| c | 1 |
>| d | 1 |
>| f | 1 |
>| E | 1 |
>
>
>With that, you might try: alter session set store.mongo.all_text_mode=true and rerun the query to see if you can get past the error
>
>
>Kathleen
>
>
>
>
>
>
>
Re: Aggregation performance
Posted by yousuf <yo...@css.org.sa>.
Hi Kathleen,
Thanks for responding...
I've noticed that when I set alter session set store.mongo.bson.record.reader =
true; the performance improves. However, other queries are
failing :(.
0: jdbc:drill:> alter session set store.mongo.bson.record.reader = true;
+-------+------------------------------------------+
| ok | summary |
+-------+------------------------------------------+
| true | store.mongo.bson.record.reader updated. |
+-------+------------------------------------------+
1 row selected (0.082 seconds)
0: jdbc:drill:> SELECT count(*) as cnt, actor_preferred_username from
tweets group by actor_preferred_username order by cnt desc limit 10;
+--------+---------------------------+
| cnt | actor_preferred_username |
+--------+---------------------------+
| 10770 | mrnota53 |
| 6239 | cyberahsokatano |
| 4609 | abeerlilak |
| 4562 | DaeshAjel3 |
| 4523 | Aster__Q |
| 4275 | DaeshCrimes |
| 4182 | AfwfwefOfwefho |
| 3980 | HewarMaftuh2 |
| 3698 | DaeshAjel |
| 3661 | sarosh_iq |
+--------+---------------------------+
10 rows selected (4.64 seconds)
Failed query when store.mongo.bson.record.reader = true:
0: jdbc:drill:> SELECT hashtag, count(*) as cnt from (select
flatten(hashtag_list) as hashtag from hashtags) group by hashtag order
by cnt desc limit 10;
Error: SYSTEM ERROR: IllegalArgumentException: You tried to write a
VarChar type when you are using a ValueWriter of type UnionListWriter.
Fragment 4:1
[Error Id: 278752e2-a959-482c-b4b0-b79ba923f148 on test01.css.org:31010]
(java.lang.IllegalArgumentException) You tried to write a VarChar
type when you are using a ValueWriter of type UnionListWriter.
org.apache.drill.exec.vector.complex.impl.AbstractFieldWriter.fail():762
org.apache.drill.exec.vector.complex.impl.AbstractFieldWriter.write():325
org.apache.drill.exec.vector.complex.impl.UnionListWriter.write():91
org.apache.drill.exec.store.bson.BsonRecordReader.writeString():275
org.apache.drill.exec.store.bson.BsonRecordReader.writeToListOrMap():167
org.apache.drill.exec.store.bson.BsonRecordReader.writeToListOrMap():112
org.apache.drill.exec.store.bson.BsonRecordReader.write():75
org.apache.drill.exec.store.mongo.MongoRecordReader.next():186
org.apache.drill.exec.physical.impl.ScanBatch.next():178
org.apache.drill.exec.record.AbstractRecordBatch.next():119
org.apache.drill.exec.record.AbstractRecordBatch.next():109
org.apache.drill.exec.record.AbstractSingleRecordBatch.innerNext():51
org.apache.drill.exec.physical.impl.project.ProjectRecordBatch.innerNext():135
org.apache.drill.exec.record.AbstractRecordBatch.next():162
org.apache.drill.exec.record.AbstractRecordBatch.next():119
org.apache.drill.exec.record.AbstractRecordBatch.next():109
org.apache.drill.exec.record.AbstractSingleRecordBatch.innerNext():51
org.apache.drill.exec.physical.impl.flatten.FlattenRecordBatch.innerNext():120
org.apache.drill.exec.record.AbstractRecordBatch.next():162
org.apache.drill.exec.record.AbstractRecordBatch.next():119
org.apache.drill.exec.record.AbstractRecordBatch.next():109
org.apache.drill.exec.physical.impl.aggregate.HashAggBatch.buildSchema():97
org.apache.drill.exec.record.AbstractRecordBatch.next():142
org.apache.drill.exec.record.AbstractRecordBatch.next():119
org.apache.drill.exec.record.AbstractRecordBatch.next():109
org.apache.drill.exec.record.AbstractSingleRecordBatch.innerNext():51
org.apache.drill.exec.physical.impl.project.ProjectRecordBatch.innerNext():135
org.apache.drill.exec.record.AbstractRecordBatch.next():162
org.apache.drill.exec.physical.impl.BaseRootExec.next():104
org.apache.drill.exec.physical.impl.SingleSenderCreator$SingleSenderRootExec.innerNext():92
org.apache.drill.exec.physical.impl.BaseRootExec.next():94
org.apache.drill.exec.work.fragment.FragmentExecutor$1.run():232
org.apache.drill.exec.work.fragment.FragmentExecutor$1.run():226
java.security.AccessController.doPrivileged():-2
javax.security.auth.Subject.doAs():422
org.apache.hadoop.security.UserGroupInformation.doAs():1657
org.apache.drill.exec.work.fragment.FragmentExecutor.run():226
org.apache.drill.common.SelfCleaningRunnable.run():38
java.util.concurrent.ThreadPoolExecutor.runWorker():1142
java.util.concurrent.ThreadPoolExecutor$Worker.run():617
java.lang.Thread.run():745 (state=,code=0)
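As an experiment, the session options suggested elsewhere in this thread can be combined before retrying the flatten query; this is a hedged sketch, not a confirmed fix, and whether it avoids the UnionListWriter error may depend on the Drill version:

```sql
-- Experimental workaround sketch (not a confirmed fix): read all Mongo
-- fields as text and allow union types, then retry the flatten query.
ALTER SESSION SET `store.mongo.all_text_mode` = true;
ALTER SESSION SET `exec.enable_union_type` = true;

SELECT hashtag, count(*) AS cnt
FROM (SELECT flatten(hashtag_list) AS hashtag FROM hashtags)
GROUP BY hashtag
ORDER BY cnt DESC
LIMIT 10;
```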
Re: Aggregation performance
Posted by Kathleen Li <kl...@maprtech.com>.
In my env, the first time took about 1.6s; the second time only took 0.5s:
0: jdbc:drill:zk=drill1:5181,drill2:5181,dril> SELECT count(*) as cnt, actor_preferred_username from test where
. . . . . . . . . . . . . . . . . . . . . . .> posted_time >= '2016-08-01T00.00.00.000Z' and posted_time
. . . . . . . . . . . . . . . . . . . . . . .> <='2016-08-10T00.00.00.000Z' group by actor_preferred_username order by
. . . . . . . . . . . . . . . . . . . . . . .> cnt desc limit 10;
+------+---------------------------+
| cnt | actor_preferred_username |
+------+---------------------------+
| 1 | _20_xxxx |
+------+---------------------------+
1 row selected (1.585 seconds)
0: jdbc:drill:zk=drill1:5181,drill2:5181,dril> SELECT count(*) as cnt, actor_preferred_username from test where
. . . . . . . . . . . . . . . . . . . . . . .> posted_time >= '2016-08-01T00.00.00.000Z' and posted_time
. . . . . . . . . . . . . . . . . . . . . . .> <='2016-08-10T00.00.00.000Z' group by actor_preferred_username order by
. . . . . . . . . . . . . . . . . . . . . . .> cnt desc limit 10;
+------+---------------------------+
| cnt | actor_preferred_username |
+------+---------------------------+
| 1 | _20_xxxx |
+------+---------------------------+
1 row selected (0.505 seconds)
I am running 4 VM servers with 4 GB heap and 8 GB direct memory. But this query only uses one fragment; see the attached Drill profile.
As Dechang suggested, you can check the profile to see which step the time is mainly spent on.
Kathleen
Re: Aggregation performance
Posted by Dechang Gu <dg...@maprtech.com>.
Yousuf,
Which version of Drill are you running?
Can you share the profile of the query?
Thanks,
Dechang
On Thu, Dec 15, 2016 at 3:27 AM, yousuf <yo...@css.org.sa> wrote:
> Hello experts
>
> As a POC project, I've built a drill cluster on 5 VMs , each with the
> following specs
>
> 32 GB ram
>
> 1 TB storage
>
> 16 Cores
>
> Zookeeper quorum & apache drill installed on all 5 nodes. My storage
> engine is mongo which has 5 million docs. (Our daily collection is close to
> 2.5 million tweets)
>
> *Problem*
>
> My aggregation queries are slow, not fit for my realtime dashboard. sample
> query as follows took 12.+ secs,
>
>
> SELECT count(*) as cnt, actor_preferred_username from tweets where
> posted_time >= '2016-08-01T00.00.00.000Z' and posted_time
> <='2016-08-10T00.00.00.000Z' group by actor_preferred_username order by
> cnt desc limit 10;
>
> Pls. note that,
>
> *In Drill*
>
> 1) I've changed conf/drill-env.sh on each node with the following:
>
> export DRILL_HEAP=${DRILL_HEAP:-"8G"}
> export DRILL_MAX_DIRECT_MEMORY=${DRILL_MAX_DIRECT_MEMORY:-"20G"}
>
> 2) changed a few settings based on the docs
> <https://drill.apache.org/docs/sort-based-and-hash-based-memory-constrained-operators/>
>
> *In Mongo*
>
> Data is sharded across 3 servers on the shard key tweet_id, which distributes
> data evenly across all shards, and I created a compound index on the fields
> used by the above Drill query.
>
>
> sample data
>
> {
> "_id" : ObjectId("58524d507e08dae4c0377b9e"),
> "rule_list" : [
> "A",
> "B",
> "C",
> "D13"
> ],
> "actor_friends_count" : 40,
> "klout_score" : 28,
> "actor_favorites_count" : 1697,
> "actor_preferred_username" : "_20_xxxx",
> "sentiment" : "neu",
> "tweet_id" : "tag:search.twitter.com,2005:000xxxxx",
> "object_actor_followers_count" : 573,
> "actor_posted_time" : "2016-06-24T00:37:54.000Z",
> "actor_id" : "id:twitter.com:xxxxxxxxxx",
> "actor_display_name" : "xxxxxx",
> "retweet_count" : 57,
> "hashtag_list" : [
> "c",
> "d",
> "E",
> "f",
> "VCVC",
> "XXCXCXC",
> "RRRR"
> ],
> "body" : "some tweet blah blah",
> "actor_followers_count" : 21,
> "actor_status_count" : 1756,
> "verb" : "share",
> "posted_time" : "2016-08-01T23:47:43.000Z",
> "object_actor_status_count" : 2796,
> "lang" : "ar",
> "object_actor_preferred_username" : "xxxxxxx",
> "original_tweet_id" : "tag:search.twitter.com,2005:xxxxxxxxxxxx",
> "gender" : "male",
> "object_actor_id" : "id:twitter.com:xxxxxxxxxxxxx",
> "favorites_count" : 0,
> "object_posted_time" : "2016-08-01T22:54:22.000Z",
> "object_actor_friends_count" : 69,
> "generator_display_name" : "Twitter Web Client",
> "object_actor_display_name" : "xxxxxxxxxxxx",
> "actor_listed_count" : 0
> }
>
>
>
> *Questions*
>
> 1) How can I improve aggregation query performance?
>
> 2) Do I also need to do something in MongoDB to enhance performance? (I
> mean, optimize the source)
>
> 3) Is Apache Drill capable of handling/aggregating billions of documents
> in real time?
>
> Your early response is highly appreciated!
>
> Thank you & Kind Regards
>
> Yousuf
>
>
>
>
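On question 2, one thing worth checking once we have the profile is whether the date-range filter is actually pushed down to Mongo, or whether Drill is pulling all 5M docs and filtering them itself. As a point of comparison, the whole query can also be expressed as a native Mongo aggregation pipeline and timed directly on the shards. A sketch, expressed as plain Python data structures; the collection name `tweets` comes from the query, and the index field order (filter field first, then the group-by field) is an assumption:

```python
# Pipeline equivalent to the Drill query; runnable as
# db.tweets.aggregate(pipeline) in the mongo shell or via pymongo.
# ISO-8601 date strings compare correctly as strings, so $gte/$lte work here.
pipeline = [
    {"$match": {"posted_time": {"$gte": "2016-08-01T00:00:00.000Z",
                                "$lte": "2016-08-10T00:00:00.000Z"}}},
    {"$group": {"_id": "$actor_preferred_username", "cnt": {"$sum": 1}}},
    {"$sort": {"cnt": -1}},
    {"$limit": 10},
]

# Compound index that supports the $match stage and covers the group-by field:
index_spec = [("posted_time", 1), ("actor_preferred_username", 1)]
```

If this pipeline is fast on Mongo but the Drill query is not, the time is going into moving/filtering documents on the Drill side rather than into the aggregation itself.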