You are viewing a plain text version of this content. The canonical link for it is here.

Posted to users@asterixdb.apache.org by Rana Alotaibi <ra...@eng.ucsd.edu> on 2018/01/26 04:46:05 UTC

AsterixDB Performance Tuning

Hi there,

I have a query that takes ~12.7mins on average (I have excluded the warm-up
time which was 30mins)!, and I would like to make sure that I didn't miss
any performance tuning parameters ( I have run the same query on MongoDB,
and it took ~2mins).

The query asks to find all patients that have 'abnormal' haptoglobin blood
test result. (The query result can have duplicate values).

*Query:*
USE mimiciii;
SET `compiler.parallelism` "5";
SET `compiler.sortmemory` "128MB";
SET `compiler.joinmemory` "265MB";
SELECT P.SUBJECT_ID
FROM   LABITEMS I, PATIENTS P, P.ADMISSIONS A, A.LABEVENTS E
WHERE I.ITEMID=E.ITEMID AND
             E.FLAG = 'abnormal' AND
             I.FLUID='Blood' AND
             I.LABEL='Haptoglobin'

*Datasets Schema:*
- PATIENTS  and LABITEMS datasets have an open schema.
- LABITEMS's  primary key is ITEMID
- PATIENTS 's primary key is SUBJECT_ID
- The JSON schema for both datasets is attached.
- The DDL for both datasets is attached

*Performance Tuning Parameters:*
- 4 partitions (iodevices)
- The total memory size is : 125GB, and I have assigned ~ 57GB to the
buffercache (storage.buffercache.size).
- As you can see from the query, I set the parallelism to 5,  sort-memory
to 128MB, join-memory to 265MB.
- The data size is 7GB

Your feedback is highly appreciated!

--Rana

Re: AsterixDB Performance Tuning

Posted by Mike Carey <dt...@gmail.com>.

Yes, for this one I think the problem is our "storage-wide parallelism" 
(i.e., only using as many cores as storage partitions for a query like 
this - I suspect the plan is all 1:1 connectors until the very top 
aggregation step, the combine step for the counts).


On 1/29/18 9:06 AM, Chen Luo wrote:
> Just make sure this email has been heard properly... I don't have too much
> expertise on nested field access, but I think this is a fair comparison
> excluding the effects of storage layer (both datasets are cached).
> Moreover, it's also highly unlikely that MongoDB cheats by building some
> key index, as Rana is not accessing the primary key here...
>
> Best regards,
> Chen Luo
>
> On Sat, Jan 27, 2018 at 9:15 PM, Rana Alotaibi <ra...@eng.ucsd.edu>
> wrote:
>
>> Hi all,
>>
>> I have a follow-up issue using the same query. Let's forget about the join
>> and selection predicates. I have the following query (same as previous
>> one), but without the join and selection predicates:
>>
>> *AsterixDB Query: *
>> USE mimiciii;
>> SET `compiler.parallelism` "5";
>> SET `compiler.sortmemory` "128MB";
>> SET `compiler.joinmemory` "265MB";
>> SELECT COUNT(*) AS cnt
>> FROM   PATIENTS P, P.ADMISSIONS A, A.LABEVENTS E
>>
>> Result : {cnt:22237108}
>>
>> Please note I have changed the page-size default configuration from 128 KB
>> to 1MB, and I have the buffercache stays that same ( 57GB)
>>
>> *MongoDB Equivalent Query:*
>> db.patients.aggregate(
>>     [
>>         {
>>             "$unwind":"$ADMISSIONS",
>>
>>         },
>>         {
>>             "$unwind":"$ADMISSIONS.LABEVENTS",
>>
>>         },
>>
>>         { $count: "cnt }
>>
>>       ]
>>    )
>>
>> Result : {cnt:22237108}
>>
>> The query takes ~7min in AsterixDB, and 30sec in MongoDB given that
>> MongoDB is running on a single core. I don't think that MongoDB storage
>> compression techniques play some factor here (All the data is cached in
>> memory) unless MongoDB does some in-memory compression (Which I need to
>> investigate more about that).
>>
>> Does this explain that navigating deeply into nested fields is an
>> expensive operation in AsterixDB? or I do still have some issues with
>> AsterixDB configuration parameters?.
>>
>> Thanks
>> Rana
>>
>>
>> On Sat, Jan 27, 2018 at 3:45 PM, Rana Alotaibi <ra...@eng.ucsd.edu>
>> wrote:
>>
>>> Hi Mike,
>>>
>>> Here is some results:
>>> 1) Non-reordered FROM clause+ no bcast added ~12mins
>>> 2) Reordered FROM clause + no bcast added ~12mins (same as (1) )
>>> 3) Non-reordered FROM clause+ bcast added ~6mins
>>> 4) Reordered FROM clause+bacst added ~6mins
>>>
>>> It seems the FROM clause datasets order has no impact. But in both cases,
>>> the bcast reduced the execution time.
>>>
>>> As for querying MongoDB, I'm almost writing a "logical" plan for that
>>> query (It took me days to understand MongoDB query operators). I totally
>>> prefer SQL++ using hints. However, think about data scientists who are
>>> mostly familiar with SQL queries, I don't expect them to spend time and
>>> determine for example the predicates selectivity and accordingly decide
>>> what's the appropriate join algorithms to use and specify this in their
>>> query (i.e /*indexnl*/) (Basically they end-up doing the cost-based
>>> optimizer job :) ).
>>>
>>> Thanks,
>>> --Rana
>>>
>>> On Fri, Jan 26, 2018 at 2:15 PM, Mike Carey <dt...@gmail.com> wrote:
>>>
>>>> Rana,
>>>>
>>>> We need the physical hints because we have a conservative cost-minded
>>>> rule set rather than an actual cost-based optimizer - so it always picks
>>>> partitioned hash joins when doing joins.  (I am curious as to how much the
>>>> bcast hint helps vs. the reordered from clause - what fraction does each
>>>> contribute to the win? - it would be cool to have the numbers without and
>>>> with that hint if you felt like trying that - but don't feel obligated).
>>>>
>>>> Question: In MongoDB, didn't you end up essentially writing a
>>>> query-plan-like program to solve this query - and isn't the SQL++ with
>>>> hints a lot smaller/simpler?  (Just asking - I'm curious as to your
>>>> feedback on that.)  We'd argue that you can write a mostly declarative
>>>> familiar query and then mess with it and annotate it a little to tune it -
>>>> which isn't as good as a great cost-based optimizer, but is better than
>>>> writing/maintaining a program.  Thoughts?
>>>>
>>>> In terms of how good we can get - the size answer is telling.  In past
>>>> days, when we were normally either on par with (smaller things) or beating
>>>> (larger things) MongoDB, they hadn't yet acquired their new storage engine
>>>> company (WiredTiger) with its compression.  Now they're running 3x+
>>>> smaller, I would not be surprised if that's now the explanation for the
>>>> remaining difference, which is indeed about 3x.  (We are going to
>>>> experiment with compression as well.)
>>>>
>>>> Cheers,
>>>>
>>>> Mike
>>>>
>>>> On 1/26/18 1:58 PM, Rana Alotaibi wrote:
>>>>
>>>> Hi all,
>>>> Thanks for your immediate and quick reply.
>>>>
>>>> @Chen Luo : Yes. I have crated an index on FLAG in MongoDB. However, in
>>>> AsterixDB setting, it seems that navigating deeply in the nested fields and
>>>> create an index is not supported.
>>>> @Taewoo Adding  /*+indexnl */ didn't change the plan.
>>>> @Wail : Adding /*+bcast/ did the magic. Now, it takes roughly ~6.34 mins
>>>> on average without including the warm-up time. The plan has changed and it
>>>> is attached.
>>>>
>>>> Can you please points me where I can find this in the documentation? And
>>>> here is my question: Why do I need to add some physical details in the
>>>> query like adding  /*+indexnl */  index-nested-loop-join or /*+bcast/
>>>> broadcast_exchange?
>>>> @Michael: The data size is ~7GB for PATEINTS dataset and 1MB for
>>>> LABITEMS dataset (Both datasets have an open schema). After data ingestion,
>>>> in my setting the data in MongoDB is ~3GB(It seems MongoDB does some
>>>> compression) and in AsterixDB is ~10GB. (I have 4 partitions,and I checked
>>>> the size of files in each partition and the total is ~10GB!)
>>>>
>>>> Thanks!
>>>> --Rana
>>>>
>>>>
>>>>
>>>>
>>>> On Fri, Jan 26, 2018 at 12:52 PM, Wail Alkowaileet <wa...@gmail.com>
>>>> wrote:
>>>>
>>>>> One thing I noticed is that the "large" unnested arrays are hash
>>>>> partitioned to the probably "small" index-filtered dataset.
>>>>> Since the data can fit in memory (7 GB in total), I think
>>>>> broadcast_exchange may do better in this particular case.
>>>>>
>>>>> USE mimiciii;
>>>>> SET `compiler.parallelism` "5";
>>>>> SET `compiler.sortmemory` "128MB";
>>>>> SET `compiler.joinmemory` "265MB";
>>>>> SELECT P.SUBJECT_ID
>>>>> FROM PATIENTS P, P.ADMISSIONS A, A.LABEVENTS E, LABITEMS I
>>>>> WHERE  E.ITEMID/*+bcast*/ = I.ITEMID AND
>>>>>               E.FLAG = 'abnormal' AND
>>>>>               I.FLUID='Blood' AND
>>>>>               I.LABEL='Haptoglobin'
>>>>>
>>>>> Note: I reordered the FROM clause...
>>>>>
>>>>> Another thing is that I think it's a CPU bound query ... and I'm not
>>>>> sure how MongoDB utilizes CPU resources compared with AsterixDB.
>>>>>
>>>>>
>>>>>
>>>>> On Fri, Jan 26, 2018 at 10:36 AM, Taewoo Kim <wa...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> PS: UNNEST doc
>>>>>> https://ci.apache.org/projects/asterixdb/sqlpp/manual.html#U
>>>>>> nnest_clauses
>>>>>>
>>>>>> Best,
>>>>>> Taewoo
>>>>>>
>>>>>> On Fri, Jan 26, 2018 at 10:00 AM, Taewoo Kim <wa...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi Rana,
>>>>>>>
>>>>>>> Thank you for attaching your plan. It seems that the selections are
>>>>>>> correctly made before each join. If your query predicate is selective
>>>>>>> enough (e.g., I.LABEL = 'Haptoglobin' generates less than 1% of records as
>>>>>>> the result), I suggest you could try an index-nested-loop-join. Changes are
>>>>>>> highlighted. And one more question: if LABEVENTS.FLAG is an array, you
>>>>>>> can't just use "E.FLAG="abnormal". I think you need to use UNNEST.
>>>>>>>
>>>>>>> USE mimiciii;
>>>>>>> SET `compiler.parallelism` "5";
>>>>>>> SET `compiler.sortmemory` "128MB";
>>>>>>> SET `compiler.joinmemory` "265MB";
>>>>>>> SELECT P.SUBJECT_ID
>>>>>>> FROM   LABITEMS I, PATIENTS P, P.ADMISSIONS A, A.LABEVENTS E
>>>>>>> WHERE I.ITEMID */* +indexnl */ *=E.ITEMID AND
>>>>>>>               E.FLAG = 'abnormal' AND
>>>>>>>               I.FLUID='Blood' AND
>>>>>>>               I.LABEL='Haptoglobin'
>>>>>>>
>>>>>>> Best,
>>>>>>> Taewoo
>>>>>>>
>>>>>>> On Fri, Jan 26, 2018 at 9:16 AM, Chen Luo <cl...@uci.edu> wrote:
>>>>>>>
>>>>>>>> Hi Rana,
>>>>>>>>
>>>>>>>> I think the performance issue might related to the access of nested
>>>>>>>> fields, since the rest performance hot spots (index search, hash join etc
>>>>>>>> looks normal to me), and I assume " I.FLUID='Blood' AND
>>>>>>>> I.LABEL='Haptoglobin'" should be very selective. Since MongoDB
>>>>>>>> supports array index, did you build an index on L.FLAG using MongoDB?
>>>>>>>>
>>>>>>>> @Wail, do you have any clue on nested fields access?
>>>>>>>>
>>>>>>>> Best regards,
>>>>>>>> Chen Luo
>>>>>>>>
>>>>>>>> On Fri, Jan 26, 2018 at 1:47 AM, Rana Alotaibi <
>>>>>>>> ralotaib@eng.ucsd.edu> wrote:
>>>>>>>>
>>>>>>>>> Hi Taewoo,
>>>>>>>>>
>>>>>>>>>     -
>>>>>>>>>
>>>>>>>>>     Can you paste the optimized plan? -- Attached the plan (Plan_01.txt)
>>>>>>>>>
>>>>>>>>>     -
>>>>>>>>>
>>>>>>>>>     Can you create an index on LABEVENTS.FLAG? -- I couldn't create an index on LABEVENTS.FLAG since LABEVENTS is of type array. I got this message when I tried to create the index : "msg": "ASX0001: Field type array can't be promoted to type object"
>>>>>>>>>
>>>>>>>>>     - Can you switch the predicate order? -- It seems for me that
>>>>>>>>>     the plan remains the same even if I changed the order of the predicates.
>>>>>>>>>     (Attached the plan after changing the order of the predicates Plan_02.txt)
>>>>>>>>>
>>>>>>>>> Thanks
>>>>>>>>>
>>>>>>>>> Rana
>>>>>>>>> On Thu, Jan 25, 2018 at 11:24 PM, Rana Alotaibi <
>>>>>>>>> ralotaib@eng.ucsd.edu> wrote:
>>>>>>>>>
>>>>>>>>>> Hi Chen,
>>>>>>>>>>
>>>>>>>>>> *How did you import data into the dataset? using "load" or "feed"?*
>>>>>>>>>> I used "LOAD" (i.e USE mimiciii; LOAD DATASET PATIENTS USING
>>>>>>>>>> localfs ((\"path\"=\"127.0.0.1:///data/ralotaib/patients.json\"),
>>>>>>>>>> (\"format\"=\"json\"))).
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> *Which version of AsterixDB are you using? *
>>>>>>>>>> AsterixDB Master (0.9.3-SNAPSHOT)
>>>>>>>>>>
>>>>>>>>>> Thanks!
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Thu, Jan 25, 2018 at 10:39 PM, Chen Luo <cl...@uci.edu> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi Rana,
>>>>>>>>>>>
>>>>>>>>>>> Nice to see you again! You may post to dev@asterixdb.apache.org
>>>>>>>>>>> as well to get more feedbacks from our developers.
>>>>>>>>>>>
>>>>>>>>>>> Just clarify two things: how did you import data into the
>>>>>>>>>>> dataset? using "load" or "feed"? And which version of AsterixDB are you
>>>>>>>>>>> using? But any way in your case it seems the join takes a lot of time, and
>>>>>>>>>>> your data is pretty much cached into the memory...
>>>>>>>>>>>
>>>>>>>>>>> Best regards,
>>>>>>>>>>> Chen Luo
>>>>>>>>>>>
>>>>>>>>>>> On Thu, Jan 25, 2018 at 8:46 PM, Rana Alotaibi <
>>>>>>>>>>> ralotaib@eng.ucsd.edu> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi there,
>>>>>>>>>>>>
>>>>>>>>>>>> I have a query that takes ~12.7mins on average (I have excluded
>>>>>>>>>>>> the warm-up time which was 30mins)!, and I would like to make sure that I
>>>>>>>>>>>> didn't miss any performance tuning parameters ( I have run the same query
>>>>>>>>>>>> on MongoDB, and it took ~2mins).
>>>>>>>>>>>>
>>>>>>>>>>>> The query asks to find all patients that have 'abnormal'
>>>>>>>>>>>> haptoglobin blood test result. (The query result can have duplicate values).
>>>>>>>>>>>>
>>>>>>>>>>>> *Query:*
>>>>>>>>>>>> USE mimiciii;
>>>>>>>>>>>> SET `compiler.parallelism` "5";
>>>>>>>>>>>> SET `compiler.sortmemory` "128MB";
>>>>>>>>>>>> SET `compiler.joinmemory` "265MB";
>>>>>>>>>>>> SELECT P.SUBJECT_ID
>>>>>>>>>>>> FROM   LABITEMS I, PATIENTS P, P.ADMISSIONS A, A.LABEVENTS E
>>>>>>>>>>>> WHERE I.ITEMID=E.ITEMID AND
>>>>>>>>>>>>               E.FLAG = 'abnormal' AND
>>>>>>>>>>>>               I.FLUID='Blood' AND
>>>>>>>>>>>>               I.LABEL='Haptoglobin'
>>>>>>>>>>>>
>>>>>>>>>>>> *Datasets Schema:*
>>>>>>>>>>>> - PATIENTS  and LABITEMS datasets have an open schema.
>>>>>>>>>>>> - LABITEMS's  primary key is ITEMID
>>>>>>>>>>>> - PATIENTS 's primary key is SUBJECT_ID
>>>>>>>>>>>> - The JSON schema for both datasets is attached.
>>>>>>>>>>>> - The DDL for both datasets is attached
>>>>>>>>>>>>
>>>>>>>>>>>> *Performance Tuning Parameters:*
>>>>>>>>>>>> - 4 partitions (iodevices)
>>>>>>>>>>>> - The total memory size is : 125GB, and I have assigned ~ 57GB
>>>>>>>>>>>> to the buffercache (storage.buffercache.size).
>>>>>>>>>>>> - As you can see from the query, I set the parallelism to 5,
>>>>>>>>>>>> sort-memory to 128MB, join-memory to 265MB.
>>>>>>>>>>>> - The data size is 7GB
>>>>>>>>>>>>
>>>>>>>>>>>> Your feedback is highly appreciated!
>>>>>>>>>>>>
>>>>>>>>>>>> --Rana
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>
>>>>> --
>>>>>
>>>>> *Regards,*
>>>>> Wail Alkowaileet
>>>>>
>>>>
>>>>

Re: AsterixDB Performance Tuning

Posted by Chen Luo <cl...@uci.edu>.

Just make sure this email has been heard properly... I don't have too much
expertise on nested field access, but I think this is a fair comparison
excluding the effects of storage layer (both datasets are cached).
Moreover, it's also highly unlikely that MongoDB cheats by building some
key index, as Rana is not accessing the primary key here...

Best regards,
Chen Luo

On Sat, Jan 27, 2018 at 9:15 PM, Rana Alotaibi <ra...@eng.ucsd.edu>
wrote:

> Hi all,
>
> I have a follow-up issue using the same query. Let's forget about the join
> and selection predicates. I have the following query (same as previous
> one), but without the join and selection predicates:
>
> *AsterixDB Query: *
> USE mimiciii;
> SET `compiler.parallelism` "5";
> SET `compiler.sortmemory` "128MB";
> SET `compiler.joinmemory` "265MB";
> SELECT COUNT(*) AS cnt
> FROM   PATIENTS P, P.ADMISSIONS A, A.LABEVENTS E
>
> Result : {cnt:22237108}
>
> Please note I have changed the page-size default configuration from 128 KB
> to 1MB, and I have the buffercache stays that same ( 57GB)
>
> *MongoDB Equivalent Query:*
> db.patients.aggregate(
>    [
>        {
>            "$unwind":"$ADMISSIONS",
>
>        },
>        {
>            "$unwind":"$ADMISSIONS.LABEVENTS",
>
>        },
>
>        { $count: "cnt }
>
>      ]
>   )
>
> Result : {cnt:22237108}
>
> The query takes ~7min in AsterixDB, and 30sec in MongoDB given that
> MongoDB is running on a single core. I don't think that MongoDB storage
> compression techniques play some factor here (All the data is cached in
> memory) unless MongoDB does some in-memory compression (Which I need to
> investigate more about that).
>
> Does this explain that navigating deeply into nested fields is an
> expensive operation in AsterixDB? or I do still have some issues with
> AsterixDB configuration parameters?.
>
> Thanks
> Rana
>
>
> On Sat, Jan 27, 2018 at 3:45 PM, Rana Alotaibi <ra...@eng.ucsd.edu>
> wrote:
>
>> Hi Mike,
>>
>> Here is some results:
>> 1) Non-reordered FROM clause+ no bcast added ~12mins
>> 2) Reordered FROM clause + no bcast added ~12mins (same as (1) )
>> 3) Non-reordered FROM clause+ bcast added ~6mins
>> 4) Reordered FROM clause+bacst added ~6mins
>>
>> It seems the FROM clause datasets order has no impact. But in both cases,
>> the bcast reduced the execution time.
>>
>> As for querying MongoDB, I'm almost writing a "logical" plan for that
>> query (It took me days to understand MongoDB query operators). I totally
>> prefer SQL++ using hints. However, think about data scientists who are
>> mostly familiar with SQL queries, I don't expect them to spend time and
>> determine for example the predicates selectivity and accordingly decide
>> what's the appropriate join algorithms to use and specify this in their
>> query (i.e /*indexnl*/) (Basically they end-up doing the cost-based
>> optimizer job :) ).
>>
>> Thanks,
>> --Rana
>>
>> On Fri, Jan 26, 2018 at 2:15 PM, Mike Carey <dt...@gmail.com> wrote:
>>
>>> Rana,
>>>
>>> We need the physical hints because we have a conservative cost-minded
>>> rule set rather than an actual cost-based optimizer - so it always picks
>>> partitioned hash joins when doing joins.  (I am curious as to how much the
>>> bcast hint helps vs. the reordered from clause - what fraction does each
>>> contribute to the win? - it would be cool to have the numbers without and
>>> with that hint if you felt like trying that - but don't feel obligated).
>>>
>>> Question: In MongoDB, didn't you end up essentially writing a
>>> query-plan-like program to solve this query - and isn't the SQL++ with
>>> hints a lot smaller/simpler?  (Just asking - I'm curious as to your
>>> feedback on that.)  We'd argue that you can write a mostly declarative
>>> familiar query and then mess with it and annotate it a little to tune it -
>>> which isn't as good as a great cost-based optimizer, but is better than
>>> writing/maintaining a program.  Thoughts?
>>>
>>> In terms of how good we can get - the size answer is telling.  In past
>>> days, when we were normally either on par with (smaller things) or beating
>>> (larger things) MongoDB, they hadn't yet acquired their new storage engine
>>> company (WiredTiger) with its compression.  Now they're running 3x+
>>> smaller, I would not be surprised if that's now the explanation for the
>>> remaining difference, which is indeed about 3x.  (We are going to
>>> experiment with compression as well.)
>>>
>>> Cheers,
>>>
>>> Mike
>>>
>>> On 1/26/18 1:58 PM, Rana Alotaibi wrote:
>>>
>>> Hi all,
>>> Thanks for your immediate and quick reply.
>>>
>>> @Chen Luo : Yes. I have crated an index on FLAG in MongoDB. However, in
>>> AsterixDB setting, it seems that navigating deeply in the nested fields and
>>> create an index is not supported.
>>> @Taewoo Adding  /*+indexnl */ didn't change the plan.
>>> @Wail : Adding /*+bcast/ did the magic. Now, it takes roughly ~6.34 mins
>>> on average without including the warm-up time. The plan has changed and it
>>> is attached.
>>>
>>> Can you please points me where I can find this in the documentation? And
>>> here is my question: Why do I need to add some physical details in the
>>> query like adding  /*+indexnl */  index-nested-loop-join or /*+bcast/
>>> broadcast_exchange?
>>> @Michael: The data size is ~7GB for PATEINTS dataset and 1MB for
>>> LABITEMS dataset (Both datasets have an open schema). After data ingestion,
>>> in my setting the data in MongoDB is ~3GB(It seems MongoDB does some
>>> compression) and in AsterixDB is ~10GB. (I have 4 partitions,and I checked
>>> the size of files in each partition and the total is ~10GB!)
>>>
>>> Thanks!
>>> --Rana
>>>
>>>
>>>
>>>
>>> On Fri, Jan 26, 2018 at 12:52 PM, Wail Alkowaileet <wa...@gmail.com>
>>> wrote:
>>>
>>>> One thing I noticed is that the "large" unnested arrays are hash
>>>> partitioned to the probably "small" index-filtered dataset.
>>>> Since the data can fit in memory (7 GB in total), I think
>>>> broadcast_exchange may do better in this particular case.
>>>>
>>>> USE mimiciii;
>>>> SET `compiler.parallelism` "5";
>>>> SET `compiler.sortmemory` "128MB";
>>>> SET `compiler.joinmemory` "265MB";
>>>> SELECT P.SUBJECT_ID
>>>> FROM PATIENTS P, P.ADMISSIONS A, A.LABEVENTS E, LABITEMS I
>>>> WHERE  E.ITEMID/*+bcast*/ = I.ITEMID AND
>>>>              E.FLAG = 'abnormal' AND
>>>>              I.FLUID='Blood' AND
>>>>              I.LABEL='Haptoglobin'
>>>>
>>>> Note: I reordered the FROM clause...
>>>>
>>>> Another thing is that I think it's a CPU bound query ... and I'm not
>>>> sure how MongoDB utilizes CPU resources compared with AsterixDB.
>>>>
>>>>
>>>>
>>>> On Fri, Jan 26, 2018 at 10:36 AM, Taewoo Kim <wa...@gmail.com>
>>>> wrote:
>>>>
>>>>> PS: UNNEST doc
>>>>> https://ci.apache.org/projects/asterixdb/sqlpp/manual.html#U
>>>>> nnest_clauses
>>>>>
>>>>> Best,
>>>>> Taewoo
>>>>>
>>>>> On Fri, Jan 26, 2018 at 10:00 AM, Taewoo Kim <wa...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Hi Rana,
>>>>>>
>>>>>> Thank you for attaching your plan. It seems that the selections are
>>>>>> correctly made before each join. If your query predicate is selective
>>>>>> enough (e.g., I.LABEL = 'Haptoglobin' generates less than 1% of records as
>>>>>> the result), I suggest you could try an index-nested-loop-join. Changes are
>>>>>> highlighted. And one more question: if LABEVENTS.FLAG is an array, you
>>>>>> can't just use "E.FLAG="abnormal". I think you need to use UNNEST.
>>>>>>
>>>>>> USE mimiciii;
>>>>>> SET `compiler.parallelism` "5";
>>>>>> SET `compiler.sortmemory` "128MB";
>>>>>> SET `compiler.joinmemory` "265MB";
>>>>>> SELECT P.SUBJECT_ID
>>>>>> FROM   LABITEMS I, PATIENTS P, P.ADMISSIONS A, A.LABEVENTS E
>>>>>> WHERE I.ITEMID */* +indexnl */ *=E.ITEMID AND
>>>>>>              E.FLAG = 'abnormal' AND
>>>>>>              I.FLUID='Blood' AND
>>>>>>              I.LABEL='Haptoglobin'
>>>>>>
>>>>>> Best,
>>>>>> Taewoo
>>>>>>
>>>>>> On Fri, Jan 26, 2018 at 9:16 AM, Chen Luo <cl...@uci.edu> wrote:
>>>>>>
>>>>>>> Hi Rana,
>>>>>>>
>>>>>>> I think the performance issue might related to the access of nested
>>>>>>> fields, since the rest performance hot spots (index search, hash join etc
>>>>>>> looks normal to me), and I assume " I.FLUID='Blood' AND
>>>>>>> I.LABEL='Haptoglobin'" should be very selective. Since MongoDB
>>>>>>> supports array index, did you build an index on L.FLAG using MongoDB?
>>>>>>>
>>>>>>> @Wail, do you have any clue on nested fields access?
>>>>>>>
>>>>>>> Best regards,
>>>>>>> Chen Luo
>>>>>>>
>>>>>>> On Fri, Jan 26, 2018 at 1:47 AM, Rana Alotaibi <
>>>>>>> ralotaib@eng.ucsd.edu> wrote:
>>>>>>>
>>>>>>>> Hi Taewoo,
>>>>>>>>
>>>>>>>>    -
>>>>>>>>
>>>>>>>>    Can you paste the optimized plan? -- Attached the plan (Plan_01.txt)
>>>>>>>>
>>>>>>>>    -
>>>>>>>>
>>>>>>>>    Can you create an index on LABEVENTS.FLAG? -- I couldn't create an index on LABEVENTS.FLAG since LABEVENTS is of type array. I got this message when I tried to create the index : "msg": "ASX0001: Field type array can't be promoted to type object"
>>>>>>>>
>>>>>>>>    - Can you switch the predicate order? -- It seems for me that
>>>>>>>>    the plan remains the same even if I changed the order of the predicates.
>>>>>>>>    (Attached the plan after changing the order of the predicates Plan_02.txt)
>>>>>>>>
>>>>>>>> Thanks
>>>>>>>>
>>>>>>>> Rana
>>>>>>>> On Thu, Jan 25, 2018 at 11:24 PM, Rana Alotaibi <
>>>>>>>> ralotaib@eng.ucsd.edu> wrote:
>>>>>>>>
>>>>>>>>> Hi Chen,
>>>>>>>>>
>>>>>>>>> *How did you import data into the dataset? using "load" or "feed"?*
>>>>>>>>> I used "LOAD" (i.e USE mimiciii; LOAD DATASET PATIENTS USING
>>>>>>>>> localfs ((\"path\"=\"127.0.0.1:///data/ralotaib/patients.json\"),
>>>>>>>>> (\"format\"=\"json\"))).
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> *Which version of AsterixDB are you using? *
>>>>>>>>> AsterixDB Master (0.9.3-SNAPSHOT)
>>>>>>>>>
>>>>>>>>> Thanks!
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Thu, Jan 25, 2018 at 10:39 PM, Chen Luo <cl...@uci.edu> wrote:
>>>>>>>>>
>>>>>>>>>> Hi Rana,
>>>>>>>>>>
>>>>>>>>>> Nice to see you again! You may post to dev@asterixdb.apache.org
>>>>>>>>>> as well to get more feedbacks from our developers.
>>>>>>>>>>
>>>>>>>>>> Just clarify two things: how did you import data into the
>>>>>>>>>> dataset? using "load" or "feed"? And which version of AsterixDB are you
>>>>>>>>>> using? But any way in your case it seems the join takes a lot of time, and
>>>>>>>>>> your data is pretty much cached into the memory...
>>>>>>>>>>
>>>>>>>>>> Best regards,
>>>>>>>>>> Chen Luo
>>>>>>>>>>
>>>>>>>>>> On Thu, Jan 25, 2018 at 8:46 PM, Rana Alotaibi <
>>>>>>>>>> ralotaib@eng.ucsd.edu> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi there,
>>>>>>>>>>>
>>>>>>>>>>> I have a query that takes ~12.7mins on average (I have excluded
>>>>>>>>>>> the warm-up time which was 30mins)!, and I would like to make sure that I
>>>>>>>>>>> didn't miss any performance tuning parameters ( I have run the same query
>>>>>>>>>>> on MongoDB, and it took ~2mins).
>>>>>>>>>>>
>>>>>>>>>>> The query asks to find all patients that have 'abnormal'
>>>>>>>>>>> haptoglobin blood test result. (The query result can have duplicate values).
>>>>>>>>>>>
>>>>>>>>>>> *Query:*
>>>>>>>>>>> USE mimiciii;
>>>>>>>>>>> SET `compiler.parallelism` "5";
>>>>>>>>>>> SET `compiler.sortmemory` "128MB";
>>>>>>>>>>> SET `compiler.joinmemory` "265MB";
>>>>>>>>>>> SELECT P.SUBJECT_ID
>>>>>>>>>>> FROM   LABITEMS I, PATIENTS P, P.ADMISSIONS A, A.LABEVENTS E
>>>>>>>>>>> WHERE I.ITEMID=E.ITEMID AND
>>>>>>>>>>>              E.FLAG = 'abnormal' AND
>>>>>>>>>>>              I.FLUID='Blood' AND
>>>>>>>>>>>              I.LABEL='Haptoglobin'
>>>>>>>>>>>
>>>>>>>>>>> *Datasets Schema:*
>>>>>>>>>>> - PATIENTS  and LABITEMS datasets have an open schema.
>>>>>>>>>>> - LABITEMS's  primary key is ITEMID
>>>>>>>>>>> - PATIENTS 's primary key is SUBJECT_ID
>>>>>>>>>>> - The JSON schema for both datasets is attached.
>>>>>>>>>>> - The DDL for both datasets is attached
>>>>>>>>>>>
>>>>>>>>>>> *Performance Tuning Parameters:*
>>>>>>>>>>> - 4 partitions (iodevices)
>>>>>>>>>>> - The total memory size is : 125GB, and I have assigned ~ 57GB
>>>>>>>>>>> to the buffercache (storage.buffercache.size).
>>>>>>>>>>> - As you can see from the query, I set the parallelism to 5,
>>>>>>>>>>> sort-memory to 128MB, join-memory to 265MB.
>>>>>>>>>>> - The data size is 7GB
>>>>>>>>>>>
>>>>>>>>>>> Your feedback is highly appreciated!
>>>>>>>>>>>
>>>>>>>>>>> --Rana
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>>
>>>> --
>>>>
>>>> *Regards,*
>>>> Wail Alkowaileet
>>>>
>>>
>>>
>>>
>>
>

Re: AsterixDB Performance Tuning

Posted by Chen Luo <cl...@uci.edu>.

Just make sure this email has been heard properly... I don't have too much
expertise on nested field access, but I think this is a fair comparison
excluding the effects of storage layer (both datasets are cached).
Moreover, it's also highly unlikely that MongoDB cheats by building some
key index, as Rana is not accessing the primary key here...

Best regards,
Chen Luo

On Sat, Jan 27, 2018 at 9:15 PM, Rana Alotaibi <ra...@eng.ucsd.edu>
wrote:

> Hi all,
>
> I have a follow-up issue using the same query. Let's forget about the join
> and selection predicates. I have the following query (same as previous
> one), but without the join and selection predicates:
>
> *AsterixDB Query: *
> USE mimiciii;
> SET `compiler.parallelism` "5";
> SET `compiler.sortmemory` "128MB";
> SET `compiler.joinmemory` "265MB";
> SELECT COUNT(*) AS cnt
> FROM   PATIENTS P, P.ADMISSIONS A, A.LABEVENTS E
>
> Result : {cnt:22237108}
>
> Please note I have changed the page-size default configuration from 128 KB
> to 1MB, and I have the buffercache stays that same ( 57GB)
>
> *MongoDB Equivalent Query:*
> db.patients.aggregate(
>    [
>        {
>            "$unwind":"$ADMISSIONS",
>
>        },
>        {
>            "$unwind":"$ADMISSIONS.LABEVENTS",
>
>        },
>
>        { $count: "cnt }
>
>      ]
>   )
>
> Result : {cnt:22237108}
>
> The query takes ~7min in AsterixDB, and 30sec in MongoDB given that
> MongoDB is running on a single core. I don't think that MongoDB storage
> compression techniques play some factor here (All the data is cached in
> memory) unless MongoDB does some in-memory compression (Which I need to
> investigate more about that).
>
> Does this explain that navigating deeply into nested fields is an
> expensive operation in AsterixDB? or I do still have some issues with
> AsterixDB configuration parameters?.
>
> Thanks
> Rana
>
>
> On Sat, Jan 27, 2018 at 3:45 PM, Rana Alotaibi <ra...@eng.ucsd.edu>
> wrote:
>
>> Hi Mike,
>>
>> Here is some results:
>> 1) Non-reordered FROM clause+ no bcast added ~12mins
>> 2) Reordered FROM clause + no bcast added ~12mins (same as (1) )
>> 3) Non-reordered FROM clause+ bcast added ~6mins
>> 4) Reordered FROM clause+bacst added ~6mins
>>
>> It seems the FROM clause datasets order has no impact. But in both cases,
>> the bcast reduced the execution time.
>>
>> As for querying MongoDB, I'm almost writing a "logical" plan for that
>> query (It took me days to understand MongoDB query operators). I totally
>> prefer SQL++ using hints. However, think about data scientists who are
>> mostly familiar with SQL queries, I don't expect them to spend time and
>> determine for example the predicates selectivity and accordingly decide
>> what's the appropriate join algorithms to use and specify this in their
>> query (i.e /*indexnl*/) (Basically they end-up doing the cost-based
>> optimizer job :) ).
>>
>> Thanks,
>> --Rana
>>
>> On Fri, Jan 26, 2018 at 2:15 PM, Mike Carey <dt...@gmail.com> wrote:
>>
>>> Rana,
>>>
>>> We need the physical hints because we have a conservative cost-minded
>>> rule set rather than an actual cost-based optimizer - so it always picks
>>> partitioned hash joins when doing joins.  (I am curious as to how much the
>>> bcast hint helps vs. the reordered from clause - what fraction does each
>>> contribute to the win? - it would be cool to have the numbers without and
>>> with that hint if you felt like trying that - but don't feel obligated).
>>>
>>> Question: In MongoDB, didn't you end up essentially writing a
>>> query-plan-like program to solve this query - and isn't the SQL++ with
>>> hints a lot smaller/simpler?  (Just asking - I'm curious as to your
>>> feedback on that.)  We'd argue that you can write a mostly declarative
>>> familiar query and then mess with it and annotate it a little to tune it -
>>> which isn't as good as a great cost-based optimizer, but is better than
>>> writing/maintaining a program.  Thoughts?
>>>
>>> In terms of how good we can get - the size answer is telling.  In past
>>> days, when we were normally either on par with (smaller things) or beating
>>> (larger things) MongoDB, they hadn't yet acquired their new storage engine
>>> company (WiredTiger) with its compression.  Now they're running 3x+
>>> smaller, I would not be surprised if that's now the explanation for the
>>> remaining difference, which is indeed about 3x.  (We are going to
>>> experiment with compression as well.)
>>>
>>> Cheers,
>>>
>>> Mike
>>>
>>> On 1/26/18 1:58 PM, Rana Alotaibi wrote:
>>>
>>> Hi all,
>>> Thanks for your immediate and quick reply.
>>>
>>> @Chen Luo : Yes. I have crated an index on FLAG in MongoDB. However, in
>>> AsterixDB setting, it seems that navigating deeply in the nested fields and
>>> create an index is not supported.
>>> @Taewoo Adding  /*+indexnl */ didn't change the plan.
>>> @Wail : Adding /*+bcast/ did the magic. Now, it takes roughly ~6.34 mins
>>> on average without including the warm-up time. The plan has changed and it
>>> is attached.
>>>
>>> Can you please points me where I can find this in the documentation? And
>>> here is my question: Why do I need to add some physical details in the
>>> query like adding  /*+indexnl */  index-nested-loop-join or /*+bcast/
>>> broadcast_exchange?
>>> @Michael: The data size is ~7GB for PATEINTS dataset and 1MB for
>>> LABITEMS dataset (Both datasets have an open schema). After data ingestion,
>>> in my setting the data in MongoDB is ~3GB(It seems MongoDB does some
>>> compression) and in AsterixDB is ~10GB. (I have 4 partitions,and I checked
>>> the size of files in each partition and the total is ~10GB!)
>>>
>>> Thanks!
>>> --Rana
>>>
>>>
>>>
>>>
>>> On Fri, Jan 26, 2018 at 12:52 PM, Wail Alkowaileet <wa...@gmail.com>
>>> wrote:
>>>
>>>> One thing I noticed is that the "large" unnested arrays are hash
>>>> partitioned to the probably "small" index-filtered dataset.
>>>> Since the data can fit in memory (7 GB in total), I think
>>>> broadcast_exchange may do better in this particular case.
>>>>
>>>> USE mimiciii;
>>>> SET `compiler.parallelism` "5";
>>>> SET `compiler.sortmemory` "128MB";
>>>> SET `compiler.joinmemory` "265MB";
>>>> SELECT P.SUBJECT_ID
>>>> FROM PATIENTS P, P.ADMISSIONS A, A.LABEVENTS E, LABITEMS I
>>>> WHERE  E.ITEMID/*+bcast*/ = I.ITEMID AND
>>>>              E.FLAG = 'abnormal' AND
>>>>              I.FLUID='Blood' AND
>>>>              I.LABEL='Haptoglobin'
>>>>
>>>> Note: I reordered the FROM clause...
>>>>
>>>> Another thing is that I think it's a CPU bound query ... and I'm not
>>>> sure how MongoDB utilizes CPU resources compared with AsterixDB.
>>>>
>>>>
>>>>
>>>> On Fri, Jan 26, 2018 at 10:36 AM, Taewoo Kim <wa...@gmail.com>
>>>> wrote:
>>>>
>>>>> PS: UNNEST doc
>>>>> https://ci.apache.org/projects/asterixdb/sqlpp/manual.html#U
>>>>> nnest_clauses
>>>>>
>>>>> Best,
>>>>> Taewoo
>>>>>
>>>>> On Fri, Jan 26, 2018 at 10:00 AM, Taewoo Kim <wa...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Hi Rana,
>>>>>>
>>>>>> Thank you for attaching your plan. It seems that the selections are
>>>>>> correctly made before each join. If your query predicate is selective
>>>>>> enough (e.g., I.LABEL = 'Haptoglobin' generates less than 1% of records as
>>>>>> the result), I suggest you could try an index-nested-loop-join. Changes are
>>>>>> highlighted. And one more question: if LABEVENTS.FLAG is an array, you
>>>>>> can't just use "E.FLAG="abnormal". I think you need to use UNNEST.
>>>>>>
>>>>>> USE mimiciii;
>>>>>> SET `compiler.parallelism` "5";
>>>>>> SET `compiler.sortmemory` "128MB";
>>>>>> SET `compiler.joinmemory` "265MB";
>>>>>> SELECT P.SUBJECT_ID
>>>>>> FROM   LABITEMS I, PATIENTS P, P.ADMISSIONS A, A.LABEVENTS E
>>>>>> WHERE I.ITEMID */* +indexnl */ *=E.ITEMID AND
>>>>>>              E.FLAG = 'abnormal' AND
>>>>>>              I.FLUID='Blood' AND
>>>>>>              I.LABEL='Haptoglobin'
>>>>>>
>>>>>> Best,
>>>>>> Taewoo
>>>>>>
>>>>>> On Fri, Jan 26, 2018 at 9:16 AM, Chen Luo <cl...@uci.edu> wrote:
>>>>>>
>>>>>>> Hi Rana,
>>>>>>>
>>>>>>> I think the performance issue might related to the access of nested
>>>>>>> fields, since the rest performance hot spots (index search, hash join etc
>>>>>>> looks normal to me), and I assume " I.FLUID='Blood' AND
>>>>>>> I.LABEL='Haptoglobin'" should be very selective. Since MongoDB
>>>>>>> supports array index, did you build an index on L.FLAG using MongoDB?
>>>>>>>
>>>>>>> @Wail, do you have any clue on nested fields access?
>>>>>>>
>>>>>>> Best regards,
>>>>>>> Chen Luo
>>>>>>>
>>>>>>> On Fri, Jan 26, 2018 at 1:47 AM, Rana Alotaibi <
>>>>>>> ralotaib@eng.ucsd.edu> wrote:
>>>>>>>
>>>>>>>> Hi Taewoo,
>>>>>>>>
>>>>>>>>    -
>>>>>>>>
>>>>>>>>    Can you paste the optimized plan? -- Attached the plan (Plan_01.txt)
>>>>>>>>
>>>>>>>>    -
>>>>>>>>
>>>>>>>>    Can you create an index on LABEVENTS.FLAG? -- I couldn't create an index on LABEVENTS.FLAG since LABEVENTS is of type array. I got this message when I tried to create the index : "msg": "ASX0001: Field type array can't be promoted to type object"
>>>>>>>>
>>>>>>>>    - Can you switch the predicate order? -- It seems for me that
>>>>>>>>    the plan remains the same even if I changed the order of the predicates.
>>>>>>>>    (Attached the plan after changing the order of the predicates Plan_02.txt)
>>>>>>>>
>>>>>>>> Thanks
>>>>>>>>
>>>>>>>> Rana
>>>>>>>> On Thu, Jan 25, 2018 at 11:24 PM, Rana Alotaibi <
>>>>>>>> ralotaib@eng.ucsd.edu> wrote:
>>>>>>>>
>>>>>>>>> Hi Chen,
>>>>>>>>>
>>>>>>>>> *How did you import data into the dataset? using "load" or "feed"?*
>>>>>>>>> I used "LOAD" (i.e USE mimiciii; LOAD DATASET PATIENTS USING
>>>>>>>>> localfs ((\"path\"=\"127.0.0.1:///data/ralotaib/patients.json\"),
>>>>>>>>> (\"format\"=\"json\"))).
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> *Which version of AsterixDB are you using? *
>>>>>>>>> AsterixDB Master (0.9.3-SNAPSHOT)
>>>>>>>>>
>>>>>>>>> Thanks!
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Thu, Jan 25, 2018 at 10:39 PM, Chen Luo <cl...@uci.edu> wrote:
>>>>>>>>>
>>>>>>>>>> Hi Rana,
>>>>>>>>>>
>>>>>>>>>> Nice to see you again! You may post to dev@asterixdb.apache.org
>>>>>>>>>> as well to get more feedbacks from our developers.
>>>>>>>>>>
>>>>>>>>>> Just clarify two things: how did you import data into the
>>>>>>>>>> dataset? using "load" or "feed"? And which version of AsterixDB are you
>>>>>>>>>> using? But any way in your case it seems the join takes a lot of time, and
>>>>>>>>>> your data is pretty much cached into the memory...
>>>>>>>>>>
>>>>>>>>>> Best regards,
>>>>>>>>>> Chen Luo
>>>>>>>>>>
>>>>>>>>>> On Thu, Jan 25, 2018 at 8:46 PM, Rana Alotaibi <
>>>>>>>>>> ralotaib@eng.ucsd.edu> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi there,
>>>>>>>>>>>
>>>>>>>>>>> I have a query that takes ~12.7mins on average (I have excluded
>>>>>>>>>>> the warm-up time which was 30mins)!, and I would like to make sure that I
>>>>>>>>>>> didn't miss any performance tuning parameters ( I have run the same query
>>>>>>>>>>> on MongoDB, and it took ~2mins).
>>>>>>>>>>>
>>>>>>>>>>> The query asks to find all patients that have 'abnormal'
>>>>>>>>>>> haptoglobin blood test result. (The query result can have duplicate values).
>>>>>>>>>>>
>>>>>>>>>>> *Query:*
>>>>>>>>>>> USE mimiciii;
>>>>>>>>>>> SET `compiler.parallelism` "5";
>>>>>>>>>>> SET `compiler.sortmemory` "128MB";
>>>>>>>>>>> SET `compiler.joinmemory` "265MB";
>>>>>>>>>>> SELECT P.SUBJECT_ID
>>>>>>>>>>> FROM   LABITEMS I, PATIENTS P, P.ADMISSIONS A, A.LABEVENTS E
>>>>>>>>>>> WHERE I.ITEMID=E.ITEMID AND
>>>>>>>>>>>              E.FLAG = 'abnormal' AND
>>>>>>>>>>>              I.FLUID='Blood' AND
>>>>>>>>>>>              I.LABEL='Haptoglobin'
>>>>>>>>>>>
>>>>>>>>>>> *Datasets Schema:*
>>>>>>>>>>> - PATIENTS  and LABITEMS datasets have an open schema.
>>>>>>>>>>> - LABITEMS's  primary key is ITEMID
>>>>>>>>>>> - PATIENTS 's primary key is SUBJECT_ID
>>>>>>>>>>> - The JSON schema for both datasets is attached.
>>>>>>>>>>> - The DDL for both datasets is attached
>>>>>>>>>>>
>>>>>>>>>>> *Performance Tuning Parameters:*
>>>>>>>>>>> - 4 partitions (iodevices)
>>>>>>>>>>> - The total memory size is : 125GB, and I have assigned ~ 57GB
>>>>>>>>>>> to the buffercache (storage.buffercache.size).
>>>>>>>>>>> - As you can see from the query, I set the parallelism to 5,
>>>>>>>>>>> sort-memory to 128MB, join-memory to 265MB.
>>>>>>>>>>> - The data size is 7GB
>>>>>>>>>>>
>>>>>>>>>>> Your feedback is highly appreciated!
>>>>>>>>>>>
>>>>>>>>>>> --Rana
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>>
>>>> --
>>>>
>>>> *Regards,*
>>>> Wail Alkowaileet
>>>>
>>>
>>>
>>>
>>
>

Re: AsterixDB Performance Tuning

Posted by Rana Alotaibi <ra...@eng.ucsd.edu>.

Hi all,

I have a follow-up issue using the same query. Let's forget about the join
and selection predicates. I have the following query (same as previous
one), but without the join and selection predicates:

*AsterixDB Query: *
USE mimiciii;
SET `compiler.parallelism` "5";
SET `compiler.sortmemory` "128MB";
SET `compiler.joinmemory` "265MB";
SELECT COUNT(*) AS cnt
FROM   PATIENTS P, P.ADMISSIONS A, A.LABEVENTS E

Result : {cnt:22237108}

Please note I have changed the page-size default configuration from 128 KB
to 1MB, and I have the buffercache stays that same ( 57GB)

*MongoDB Equivalent Query:*
db.patients.aggregate(
   [
       {
           "$unwind":"$ADMISSIONS",

       },
       {
           "$unwind":"$ADMISSIONS.LABEVENTS",

       },

       { $count: "cnt }

     ]
  )

Result : {cnt:22237108}

The query takes ~7min in AsterixDB, and 30sec in MongoDB given that MongoDB
is running on a single core. I don't think that MongoDB storage compression
techniques play some factor here (All the data is cached in memory) unless
MongoDB does some in-memory compression (Which I need to investigate more
about that).

Does this explain that navigating deeply into nested fields is an expensive
operation in AsterixDB? or I do still have some issues with AsterixDB
configuration parameters?.

Thanks
Rana


On Sat, Jan 27, 2018 at 3:45 PM, Rana Alotaibi <ra...@eng.ucsd.edu>
wrote:

> Hi Mike,
>
> Here is some results:
> 1) Non-reordered FROM clause+ no bcast added ~12mins
> 2) Reordered FROM clause + no bcast added ~12mins (same as (1) )
> 3) Non-reordered FROM clause+ bcast added ~6mins
> 4) Reordered FROM clause+bacst added ~6mins
>
> It seems the FROM clause datasets order has no impact. But in both cases,
> the bcast reduced the execution time.
>
> As for querying MongoDB, I'm almost writing a "logical" plan for that
> query (It took me days to understand MongoDB query operators). I totally
> prefer SQL++ using hints. However, think about data scientists who are
> mostly familiar with SQL queries, I don't expect them to spend time and
> determine for example the predicates selectivity and accordingly decide
> what's the appropriate join algorithms to use and specify this in their
> query (i.e /*indexnl*/) (Basically they end-up doing the cost-based
> optimizer job :) ).
>
> Thanks,
> --Rana
>
> On Fri, Jan 26, 2018 at 2:15 PM, Mike Carey <dt...@gmail.com> wrote:
>
>> Rana,
>>
>> We need the physical hints because we have a conservative cost-minded
>> rule set rather than an actual cost-based optimizer - so it always picks
>> partitioned hash joins when doing joins.  (I am curious as to how much the
>> bcast hint helps vs. the reordered from clause - what fraction does each
>> contribute to the win? - it would be cool to have the numbers without and
>> with that hint if you felt like trying that - but don't feel obligated).
>>
>> Question: In MongoDB, didn't you end up essentially writing a
>> query-plan-like program to solve this query - and isn't the SQL++ with
>> hints a lot smaller/simpler?  (Just asking - I'm curious as to your
>> feedback on that.)  We'd argue that you can write a mostly declarative
>> familiar query and then mess with it and annotate it a little to tune it -
>> which isn't as good as a great cost-based optimizer, but is better than
>> writing/maintaining a program.  Thoughts?
>>
>> In terms of how good we can get - the size answer is telling.  In past
>> days, when we were normally either on par with (smaller things) or beating
>> (larger things) MongoDB, they hadn't yet acquired their new storage engine
>> company (WiredTiger) with its compression.  Now they're running 3x+
>> smaller, I would not be surprised if that's now the explanation for the
>> remaining difference, which is indeed about 3x.  (We are going to
>> experiment with compression as well.)
>>
>> Cheers,
>>
>> Mike
>>
>> On 1/26/18 1:58 PM, Rana Alotaibi wrote:
>>
>> Hi all,
>> Thanks for your immediate and quick reply.
>>
>> @Chen Luo : Yes. I have crated an index on FLAG in MongoDB. However, in
>> AsterixDB setting, it seems that navigating deeply in the nested fields and
>> create an index is not supported.
>> @Taewoo Adding  /*+indexnl */ didn't change the plan.
>> @Wail : Adding /*+bcast/ did the magic. Now, it takes roughly ~6.34 mins
>> on average without including the warm-up time. The plan has changed and it
>> is attached.
>>
>> Can you please points me where I can find this in the documentation? And
>> here is my question: Why do I need to add some physical details in the
>> query like adding  /*+indexnl */  index-nested-loop-join or /*+bcast/
>> broadcast_exchange?
>> @Michael: The data size is ~7GB for PATEINTS dataset and 1MB for LABITEMS
>> dataset (Both datasets have an open schema). After data ingestion, in my
>> setting the data in MongoDB is ~3GB(It seems MongoDB does some compression)
>> and in AsterixDB is ~10GB. (I have 4 partitions,and I checked the size of
>> files in each partition and the total is ~10GB!)
>>
>> Thanks!
>> --Rana
>>
>>
>>
>>
>> On Fri, Jan 26, 2018 at 12:52 PM, Wail Alkowaileet <wa...@gmail.com>
>> wrote:
>>
>>> One thing I noticed is that the "large" unnested arrays are hash
>>> partitioned to the probably "small" index-filtered dataset.
>>> Since the data can fit in memory (7 GB in total), I think
>>> broadcast_exchange may do better in this particular case.
>>>
>>> USE mimiciii;
>>> SET `compiler.parallelism` "5";
>>> SET `compiler.sortmemory` "128MB";
>>> SET `compiler.joinmemory` "265MB";
>>> SELECT P.SUBJECT_ID
>>> FROM PATIENTS P, P.ADMISSIONS A, A.LABEVENTS E, LABITEMS I
>>> WHERE  E.ITEMID/*+bcast*/ = I.ITEMID AND
>>>              E.FLAG = 'abnormal' AND
>>>              I.FLUID='Blood' AND
>>>              I.LABEL='Haptoglobin'
>>>
>>> Note: I reordered the FROM clause...
>>>
>>> Another thing is that I think it's a CPU bound query ... and I'm not
>>> sure how MongoDB utilizes CPU resources compared with AsterixDB.
>>>
>>>
>>>
>>> On Fri, Jan 26, 2018 at 10:36 AM, Taewoo Kim <wa...@gmail.com> wrote:
>>>
>>>> PS: UNNEST doc
>>>> https://ci.apache.org/projects/asterixdb/sqlpp/manual.html#U
>>>> nnest_clauses
>>>>
>>>> Best,
>>>> Taewoo
>>>>
>>>> On Fri, Jan 26, 2018 at 10:00 AM, Taewoo Kim <wa...@gmail.com>
>>>> wrote:
>>>>
>>>>> Hi Rana,
>>>>>
>>>>> Thank you for attaching your plan. It seems that the selections are
>>>>> correctly made before each join. If your query predicate is selective
>>>>> enough (e.g., I.LABEL = 'Haptoglobin' generates less than 1% of records as
>>>>> the result), I suggest you could try an index-nested-loop-join. Changes are
>>>>> highlighted. And one more question: if LABEVENTS.FLAG is an array, you
>>>>> can't just use "E.FLAG="abnormal". I think you need to use UNNEST.
>>>>>
>>>>> USE mimiciii;
>>>>> SET `compiler.parallelism` "5";
>>>>> SET `compiler.sortmemory` "128MB";
>>>>> SET `compiler.joinmemory` "265MB";
>>>>> SELECT P.SUBJECT_ID
>>>>> FROM   LABITEMS I, PATIENTS P, P.ADMISSIONS A, A.LABEVENTS E
>>>>> WHERE I.ITEMID */* +indexnl */ *=E.ITEMID AND
>>>>>              E.FLAG = 'abnormal' AND
>>>>>              I.FLUID='Blood' AND
>>>>>              I.LABEL='Haptoglobin'
>>>>>
>>>>> Best,
>>>>> Taewoo
>>>>>
>>>>> On Fri, Jan 26, 2018 at 9:16 AM, Chen Luo <cl...@uci.edu> wrote:
>>>>>
>>>>>> Hi Rana,
>>>>>>
>>>>>> I think the performance issue might related to the access of nested
>>>>>> fields, since the rest performance hot spots (index search, hash join etc
>>>>>> looks normal to me), and I assume " I.FLUID='Blood' AND
>>>>>> I.LABEL='Haptoglobin'" should be very selective. Since MongoDB
>>>>>> supports array index, did you build an index on L.FLAG using MongoDB?
>>>>>>
>>>>>> @Wail, do you have any clue on nested fields access?
>>>>>>
>>>>>> Best regards,
>>>>>> Chen Luo
>>>>>>
>>>>>> On Fri, Jan 26, 2018 at 1:47 AM, Rana Alotaibi <ralotaib@eng.ucsd.edu
>>>>>> > wrote:
>>>>>>
>>>>>>> Hi Taewoo,
>>>>>>>
>>>>>>>    -
>>>>>>>
>>>>>>>    Can you paste the optimized plan? -- Attached the plan (Plan_01.txt)
>>>>>>>
>>>>>>>    -
>>>>>>>
>>>>>>>    Can you create an index on LABEVENTS.FLAG? -- I couldn't create an index on LABEVENTS.FLAG since LABEVENTS is of type array. I got this message when I tried to create the index : "msg": "ASX0001: Field type array can't be promoted to type object"
>>>>>>>
>>>>>>>    - Can you switch the predicate order? -- It seems for me that
>>>>>>>    the plan remains the same even if I changed the order of the predicates.
>>>>>>>    (Attached the plan after changing the order of the predicates Plan_02.txt)
>>>>>>>
>>>>>>> Thanks
>>>>>>>
>>>>>>> Rana
>>>>>>> On Thu, Jan 25, 2018 at 11:24 PM, Rana Alotaibi <
>>>>>>> ralotaib@eng.ucsd.edu> wrote:
>>>>>>>
>>>>>>>> Hi Chen,
>>>>>>>>
>>>>>>>> *How did you import data into the dataset? using "load" or "feed"?*
>>>>>>>> I used "LOAD" (i.e USE mimiciii; LOAD DATASET PATIENTS USING
>>>>>>>> localfs ((\"path\"=\"127.0.0.1:///data/ralotaib/patients.json\"),
>>>>>>>> (\"format\"=\"json\"))).
>>>>>>>>
>>>>>>>>
>>>>>>>> *Which version of AsterixDB are you using? *
>>>>>>>> AsterixDB Master (0.9.3-SNAPSHOT)
>>>>>>>>
>>>>>>>> Thanks!
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Thu, Jan 25, 2018 at 10:39 PM, Chen Luo <cl...@uci.edu> wrote:
>>>>>>>>
>>>>>>>>> Hi Rana,
>>>>>>>>>
>>>>>>>>> Nice to see you again! You may post to dev@asterixdb.apache.org
>>>>>>>>> as well to get more feedbacks from our developers.
>>>>>>>>>
>>>>>>>>> Just clarify two things: how did you import data into the dataset?
>>>>>>>>> using "load" or "feed"? And which version of AsterixDB are you using? But
>>>>>>>>> any way in your case it seems the join takes a lot of time, and your data
>>>>>>>>> is pretty much cached into the memory...
>>>>>>>>>
>>>>>>>>> Best regards,
>>>>>>>>> Chen Luo
>>>>>>>>>
>>>>>>>>> On Thu, Jan 25, 2018 at 8:46 PM, Rana Alotaibi <
>>>>>>>>> ralotaib@eng.ucsd.edu> wrote:
>>>>>>>>>
>>>>>>>>>> Hi there,
>>>>>>>>>>
>>>>>>>>>> I have a query that takes ~12.7mins on average (I have excluded
>>>>>>>>>> the warm-up time which was 30mins)!, and I would like to make sure that I
>>>>>>>>>> didn't miss any performance tuning parameters ( I have run the same query
>>>>>>>>>> on MongoDB, and it took ~2mins).
>>>>>>>>>>
>>>>>>>>>> The query asks to find all patients that have 'abnormal'
>>>>>>>>>> haptoglobin blood test result. (The query result can have duplicate values).
>>>>>>>>>>
>>>>>>>>>> *Query:*
>>>>>>>>>> USE mimiciii;
>>>>>>>>>> SET `compiler.parallelism` "5";
>>>>>>>>>> SET `compiler.sortmemory` "128MB";
>>>>>>>>>> SET `compiler.joinmemory` "265MB";
>>>>>>>>>> SELECT P.SUBJECT_ID
>>>>>>>>>> FROM   LABITEMS I, PATIENTS P, P.ADMISSIONS A, A.LABEVENTS E
>>>>>>>>>> WHERE I.ITEMID=E.ITEMID AND
>>>>>>>>>>              E.FLAG = 'abnormal' AND
>>>>>>>>>>              I.FLUID='Blood' AND
>>>>>>>>>>              I.LABEL='Haptoglobin'
>>>>>>>>>>
>>>>>>>>>> *Datasets Schema:*
>>>>>>>>>> - PATIENTS  and LABITEMS datasets have an open schema.
>>>>>>>>>> - LABITEMS's  primary key is ITEMID
>>>>>>>>>> - PATIENTS 's primary key is SUBJECT_ID
>>>>>>>>>> - The JSON schema for both datasets is attached.
>>>>>>>>>> - The DDL for both datasets is attached
>>>>>>>>>>
>>>>>>>>>> *Performance Tuning Parameters:*
>>>>>>>>>> - 4 partitions (iodevices)
>>>>>>>>>> - The total memory size is : 125GB, and I have assigned ~ 57GB to
>>>>>>>>>> the buffercache (storage.buffercache.size).
>>>>>>>>>> - As you can see from the query, I set the parallelism to 5,
>>>>>>>>>> sort-memory to 128MB, join-memory to 265MB.
>>>>>>>>>> - The data size is 7GB
>>>>>>>>>>
>>>>>>>>>> Your feedback is highly appreciated!
>>>>>>>>>>
>>>>>>>>>> --Rana
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>>
>>> --
>>>
>>> *Regards,*
>>> Wail Alkowaileet
>>>
>>
>>
>>
>

Re: AsterixDB Performance Tuning

Posted by Rana Alotaibi <ra...@eng.ucsd.edu>.

Hi Mike,

Here is some results:
1) Non-reordered FROM clause+ no bcast added ~12mins
2) Reordered FROM clause + no bcast added ~12mins (same as (1) )
3) Non-reordered FROM clause+ bcast added ~6mins
4) Reordered FROM clause+bacst added ~6mins

It seems the FROM clause datasets order has no impact. But in both cases,
the bcast reduced the execution time.

As for querying MongoDB, I'm almost writing a "logical" plan for that query
(It took me days to understand MongoDB query operators). I totally prefer
SQL++ using hints. However, think about data scientists who are mostly
familiar with SQL queries, I don't expect them to spend time and determine
for example the predicates selectivity and accordingly decide what's the
appropriate join algorithms to use and specify this in their query (i.e
/*indexnl*/) (Basically they end-up doing the cost-based optimizer job :) ).

Thanks,
--Rana

On Fri, Jan 26, 2018 at 2:15 PM, Mike Carey <dt...@gmail.com> wrote:

> Rana,
>
> We need the physical hints because we have a conservative cost-minded rule
> set rather than an actual cost-based optimizer - so it always picks
> partitioned hash joins when doing joins.  (I am curious as to how much the
> bcast hint helps vs. the reordered from clause - what fraction does each
> contribute to the win? - it would be cool to have the numbers without and
> with that hint if you felt like trying that - but don't feel obligated).
>
> Question: In MongoDB, didn't you end up essentially writing a
> query-plan-like program to solve this query - and isn't the SQL++ with
> hints a lot smaller/simpler?  (Just asking - I'm curious as to your
> feedback on that.)  We'd argue that you can write a mostly declarative
> familiar query and then mess with it and annotate it a little to tune it -
> which isn't as good as a great cost-based optimizer, but is better than
> writing/maintaining a program.  Thoughts?
>
> In terms of how good we can get - the size answer is telling.  In past
> days, when we were normally either on par with (smaller things) or beating
> (larger things) MongoDB, they hadn't yet acquired their new storage engine
> company (WiredTiger) with its compression.  Now they're running 3x+
> smaller, I would not be surprised if that's now the explanation for the
> remaining difference, which is indeed about 3x.  (We are going to
> experiment with compression as well.)
>
> Cheers,
>
> Mike
>
> On 1/26/18 1:58 PM, Rana Alotaibi wrote:
>
> Hi all,
> Thanks for your immediate and quick reply.
>
> @Chen Luo : Yes. I have crated an index on FLAG in MongoDB. However, in
> AsterixDB setting, it seems that navigating deeply in the nested fields and
> create an index is not supported.
> @Taewoo Adding  /*+indexnl */ didn't change the plan.
> @Wail : Adding /*+bcast/ did the magic. Now, it takes roughly ~6.34 mins
> on average without including the warm-up time. The plan has changed and it
> is attached.
>
> Can you please points me where I can find this in the documentation? And
> here is my question: Why do I need to add some physical details in the
> query like adding  /*+indexnl */  index-nested-loop-join or /*+bcast/
> broadcast_exchange?
> @Michael: The data size is ~7GB for PATEINTS dataset and 1MB for LABITEMS
> dataset (Both datasets have an open schema). After data ingestion, in my
> setting the data in MongoDB is ~3GB(It seems MongoDB does some compression)
> and in AsterixDB is ~10GB. (I have 4 partitions,and I checked the size of
> files in each partition and the total is ~10GB!)
>
> Thanks!
> --Rana
>
>
>
>
> On Fri, Jan 26, 2018 at 12:52 PM, Wail Alkowaileet <wa...@gmail.com>
> wrote:
>
>> One thing I noticed is that the "large" unnested arrays are hash
>> partitioned to the probably "small" index-filtered dataset.
>> Since the data can fit in memory (7 GB in total), I think
>> broadcast_exchange may do better in this particular case.
>>
>> USE mimiciii;
>> SET `compiler.parallelism` "5";
>> SET `compiler.sortmemory` "128MB";
>> SET `compiler.joinmemory` "265MB";
>> SELECT P.SUBJECT_ID
>> FROM PATIENTS P, P.ADMISSIONS A, A.LABEVENTS E, LABITEMS I
>> WHERE  E.ITEMID/*+bcast*/ = I.ITEMID AND
>>              E.FLAG = 'abnormal' AND
>>              I.FLUID='Blood' AND
>>              I.LABEL='Haptoglobin'
>>
>> Note: I reordered the FROM clause...
>>
>> Another thing is that I think it's a CPU bound query ... and I'm not sure
>> how MongoDB utilizes CPU resources compared with AsterixDB.
>>
>>
>>
>> On Fri, Jan 26, 2018 at 10:36 AM, Taewoo Kim <wa...@gmail.com> wrote:
>>
>>> PS: UNNEST doc
>>> https://ci.apache.org/projects/asterixdb/sqlpp/manual.html#U
>>> nnest_clauses
>>>
>>> Best,
>>> Taewoo
>>>
>>> On Fri, Jan 26, 2018 at 10:00 AM, Taewoo Kim <wa...@gmail.com> wrote:
>>>
>>>> Hi Rana,
>>>>
>>>> Thank you for attaching your plan. It seems that the selections are
>>>> correctly made before each join. If your query predicate is selective
>>>> enough (e.g., I.LABEL = 'Haptoglobin' generates less than 1% of records as
>>>> the result), I suggest you could try an index-nested-loop-join. Changes are
>>>> highlighted. And one more question: if LABEVENTS.FLAG is an array, you
>>>> can't just use "E.FLAG="abnormal". I think you need to use UNNEST.
>>>>
>>>> USE mimiciii;
>>>> SET `compiler.parallelism` "5";
>>>> SET `compiler.sortmemory` "128MB";
>>>> SET `compiler.joinmemory` "265MB";
>>>> SELECT P.SUBJECT_ID
>>>> FROM   LABITEMS I, PATIENTS P, P.ADMISSIONS A, A.LABEVENTS E
>>>> WHERE I.ITEMID */* +indexnl */ *=E.ITEMID AND
>>>>              E.FLAG = 'abnormal' AND
>>>>              I.FLUID='Blood' AND
>>>>              I.LABEL='Haptoglobin'
>>>>
>>>> Best,
>>>> Taewoo
>>>>
>>>> On Fri, Jan 26, 2018 at 9:16 AM, Chen Luo <cl...@uci.edu> wrote:
>>>>
>>>>> Hi Rana,
>>>>>
>>>>> I think the performance issue might related to the access of nested
>>>>> fields, since the rest performance hot spots (index search, hash join etc
>>>>> looks normal to me), and I assume " I.FLUID='Blood' AND
>>>>> I.LABEL='Haptoglobin'" should be very selective. Since MongoDB
>>>>> supports array index, did you build an index on L.FLAG using MongoDB?
>>>>>
>>>>> @Wail, do you have any clue on nested fields access?
>>>>>
>>>>> Best regards,
>>>>> Chen Luo
>>>>>
>>>>> On Fri, Jan 26, 2018 at 1:47 AM, Rana Alotaibi <ra...@eng.ucsd.edu>
>>>>> wrote:
>>>>>
>>>>>> Hi Taewoo,
>>>>>>
>>>>>>    -
>>>>>>
>>>>>>    Can you paste the optimized plan? -- Attached the plan (Plan_01.txt)
>>>>>>
>>>>>>    -
>>>>>>
>>>>>>    Can you create an index on LABEVENTS.FLAG? -- I couldn't create an index on LABEVENTS.FLAG since LABEVENTS is of type array. I got this message when I tried to create the index : "msg": "ASX0001: Field type array can't be promoted to type object"
>>>>>>
>>>>>>    - Can you switch the predicate order? -- It seems for me that the
>>>>>>    plan remains the same even if I changed the order of the predicates.
>>>>>>    (Attached the plan after changing the order of the predicates Plan_02.txt)
>>>>>>
>>>>>> Thanks
>>>>>>
>>>>>> Rana
>>>>>> On Thu, Jan 25, 2018 at 11:24 PM, Rana Alotaibi <
>>>>>> ralotaib@eng.ucsd.edu> wrote:
>>>>>>
>>>>>>> Hi Chen,
>>>>>>>
>>>>>>> *How did you import data into the dataset? using "load" or "feed"?*
>>>>>>> I used "LOAD" (i.e USE mimiciii; LOAD DATASET PATIENTS USING localfs
>>>>>>> ((\"path\"=\"127.0.0.1:///data/ralotaib/patients.json\"),
>>>>>>> (\"format\"=\"json\"))).
>>>>>>>
>>>>>>>
>>>>>>> *Which version of AsterixDB are you using? *
>>>>>>> AsterixDB Master (0.9.3-SNAPSHOT)
>>>>>>>
>>>>>>> Thanks!
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Thu, Jan 25, 2018 at 10:39 PM, Chen Luo <cl...@uci.edu> wrote:
>>>>>>>
>>>>>>>> Hi Rana,
>>>>>>>>
>>>>>>>> Nice to see you again! You may post to dev@asterixdb.apache.org as
>>>>>>>> well to get more feedbacks from our developers.
>>>>>>>>
>>>>>>>> Just clarify two things: how did you import data into the dataset?
>>>>>>>> using "load" or "feed"? And which version of AsterixDB are you using? But
>>>>>>>> any way in your case it seems the join takes a lot of time, and your data
>>>>>>>> is pretty much cached into the memory...
>>>>>>>>
>>>>>>>> Best regards,
>>>>>>>> Chen Luo
>>>>>>>>
>>>>>>>> On Thu, Jan 25, 2018 at 8:46 PM, Rana Alotaibi <
>>>>>>>> ralotaib@eng.ucsd.edu> wrote:
>>>>>>>>
>>>>>>>>> Hi there,
>>>>>>>>>
>>>>>>>>> I have a query that takes ~12.7mins on average (I have excluded
>>>>>>>>> the warm-up time which was 30mins)!, and I would like to make sure that I
>>>>>>>>> didn't miss any performance tuning parameters ( I have run the same query
>>>>>>>>> on MongoDB, and it took ~2mins).
>>>>>>>>>
>>>>>>>>> The query asks to find all patients that have 'abnormal'
>>>>>>>>> haptoglobin blood test result. (The query result can have duplicate values).
>>>>>>>>>
>>>>>>>>> *Query:*
>>>>>>>>> USE mimiciii;
>>>>>>>>> SET `compiler.parallelism` "5";
>>>>>>>>> SET `compiler.sortmemory` "128MB";
>>>>>>>>> SET `compiler.joinmemory` "265MB";
>>>>>>>>> SELECT P.SUBJECT_ID
>>>>>>>>> FROM   LABITEMS I, PATIENTS P, P.ADMISSIONS A, A.LABEVENTS E
>>>>>>>>> WHERE I.ITEMID=E.ITEMID AND
>>>>>>>>>              E.FLAG = 'abnormal' AND
>>>>>>>>>              I.FLUID='Blood' AND
>>>>>>>>>              I.LABEL='Haptoglobin'
>>>>>>>>>
>>>>>>>>> *Datasets Schema:*
>>>>>>>>> - PATIENTS  and LABITEMS datasets have an open schema.
>>>>>>>>> - LABITEMS's  primary key is ITEMID
>>>>>>>>> - PATIENTS 's primary key is SUBJECT_ID
>>>>>>>>> - The JSON schema for both datasets is attached.
>>>>>>>>> - The DDL for both datasets is attached
>>>>>>>>>
>>>>>>>>> *Performance Tuning Parameters:*
>>>>>>>>> - 4 partitions (iodevices)
>>>>>>>>> - The total memory size is : 125GB, and I have assigned ~ 57GB to
>>>>>>>>> the buffercache (storage.buffercache.size).
>>>>>>>>> - As you can see from the query, I set the parallelism to 5,
>>>>>>>>> sort-memory to 128MB, join-memory to 265MB.
>>>>>>>>> - The data size is 7GB
>>>>>>>>>
>>>>>>>>> Your feedback is highly appreciated!
>>>>>>>>>
>>>>>>>>> --Rana
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>>
>> --
>>
>> *Regards,*
>> Wail Alkowaileet
>>
>
>
>

Re: AsterixDB Performance Tuning

Posted by Mike Carey <dt...@gmail.com>.

Rana,

We need the physical hints because we have a conservative cost-minded 
rule set rather than an actual cost-based optimizer - so it always picks 
partitioned hash joins when doing joins.  (I am curious as to how much 
the bcast hint helps vs. the reordered from clause - what fraction does 
each contribute to the win? - it would be cool to have the numbers 
without and with that hint if you felt like trying that - but don't feel 
obligated).

Question: In MongoDB, didn't you end up essentially writing a 
query-plan-like program to solve this query - and isn't the SQL++ with 
hints a lot smaller/simpler? (Just asking - I'm curious as to your 
feedback on that.)  We'd argue that you can write a mostly declarative 
familiar query and then mess with it and annotate it a little to tune it 
- which isn't as good as a great cost-based optimizer, but is better 
than writing/maintaining a program.  Thoughts?

In terms of how good we can get - the size answer is telling.  In past 
days, when we were normally either on par with (smaller things) or 
beating (larger things) MongoDB, they hadn't yet acquired their new 
storage engine company (WiredTiger) with its compression.  Now they're 
running 3x+ smaller, I would not be surprised if that's now the 
explanation for the remaining difference, which is indeed about 3x.  (We 
are going to experiment with compression as well.)

Cheers,

Mike


On 1/26/18 1:58 PM, Rana Alotaibi wrote:
> Hi all,
> Thanks for your immediate and quick reply.
>
> @Chen Luo : Yes. I have crated an index on FLAG in MongoDB. However, 
> in AsterixDB setting, it seems that navigating deeply in the nested 
> fields and create an index is not supported.
> @Taewoo Adding  /*+indexnl */ didn't change the plan.
> @Wail : Adding /*+bcast/ did the magic. Now, it takes roughly ~6.34 
> mins on average without including the warm-up time. The plan has 
> changed and it is attached.
>
> Can you please points me where I can find this in the documentation? 
> And here is my question: Why do I need to add some physical details in 
> the query like adding  /*+indexnl */ index-nested-loop-join or 
> /*+bcast/ broadcast_exchange?
> @Michael: The data size is ~7GB for PATEINTS dataset and 1MB for 
> LABITEMS dataset (Both datasets have an open schema). After data 
> ingestion, in my setting the data in MongoDB is ~3GB(It seems MongoDB 
> does some compression) and in AsterixDB is ~10GB. (I have 4 
> partitions,and I checked the size of files in each partition and the 
> total is ~10GB!)
>
> Thanks!
> --Rana
>
>
>
>
> On Fri, Jan 26, 2018 at 12:52 PM, Wail Alkowaileet <wael.y.k@gmail.com 
> <ma...@gmail.com>> wrote:
>
>     One thing I noticed is that the "large" unnested arrays are hash
>     partitioned to the probably "small" index-filtered dataset.
>     Since the data can fit in memory (7 GB in total), I think
>     broadcast_exchange may do better in this particular case.
>
>     USE mimiciii;
>     SET `compiler.parallelism` "5";
>     SET `compiler.sortmemory` "128MB";
>     SET `compiler.joinmemory` "265MB";
>     SELECT P.SUBJECT_ID
>     FROM PATIENTS P, P.ADMISSIONS A, A.LABEVENTS E, LABITEMS I
>     WHERE  E.ITEMID/*+bcast*/ = I.ITEMID AND
>                  E.FLAG = 'abnormal' AND
>                  I.FLUID='Blood' AND
>                  I.LABEL='Haptoglobin'
>
>     Note: I reordered the FROM clause...
>
>     Another thing is that I think it's a CPU bound query ... and I'm
>     not sure how MongoDB utilizes CPU resources compared with AsterixDB.
>
>
>
>     On Fri, Jan 26, 2018 at 10:36 AM, Taewoo Kim <wangsaeu@gmail.com
>     <ma...@gmail.com>> wrote:
>
>         PS: UNNEST doc
>         https://ci.apache.org/projects/asterixdb/sqlpp/manual.html#Unnest_clauses
>         <https://ci.apache.org/projects/asterixdb/sqlpp/manual.html#Unnest_clauses>
>
>         Best,
>         Taewoo
>
>         On Fri, Jan 26, 2018 at 10:00 AM, Taewoo Kim
>         <wangsaeu@gmail.com <ma...@gmail.com>> wrote:
>
>             Hi Rana,
>
>             Thank you for attaching your plan. It seems that the
>             selections are correctly made before each join. If your
>             query predicate is selective enough (e.g., I.LABEL =
>             'Haptoglobin' generates less than 1% of records as the
>             result), I suggest you could try an
>             index-nested-loop-join. Changes are highlighted. And one
>             more question: if LABEVENTS.FLAG is an array, you can't
>             just use "E.FLAG="abnormal". I think you need to use UNNEST.
>
>             USE mimiciii;
>             SET `compiler.parallelism` "5";
>             SET `compiler.sortmemory` "128MB";
>             SET `compiler.joinmemory` "265MB";
>             SELECT P.SUBJECT_ID
>             FROM LABITEMS I, PATIENTS P, P.ADMISSIONS A, A.LABEVENTS E
>             WHERE I.ITEMID */* +indexnl */ *=E.ITEMID AND
>             E.FLAG = 'abnormal' AND
>             I.FLUID='Blood' AND
>             I.LABEL='Haptoglobin'
>
>             Best,
>             Taewoo
>
>             On Fri, Jan 26, 2018 at 9:16 AM, Chen Luo <cluo8@uci.edu
>             <ma...@uci.edu>> wrote:
>
>                 Hi Rana,
>
>                 I think the performance issue might related to the
>                 access of nested fields, since the rest performance
>                 hot spots (index search, hash join etc looks normal to
>                 me), and I assume " I.FLUID='Blood' AND
>                 I.LABEL='Haptoglobin'" should be very selective. Since
>                 MongoDB supports array index, did you build an index
>                 on L.FLAG using MongoDB?
>
>                 @Wail, do you have any clue on nested fields access?
>
>                 Best regards,
>                 Chen Luo
>
>                 On Fri, Jan 26, 2018 at 1:47 AM, Rana Alotaibi
>                 <ralotaib@eng.ucsd.edu <ma...@eng.ucsd.edu>>
>                 wrote:
>
>                     Hi Taewoo,
>
>                      *
>
>                         Can you paste the optimized plan? -- Attached
>                         the plan (Plan_01.txt)
>
>                      *
>
>                         Can you create an index on LABEVENTS.FLAG? --
>                         I couldn't create an index on LABEVENTS.FLAG
>                         since LABEVENTS is of type array. I got this
>                         message when I tried to create the index :
>                         "msg": "ASX0001: Field type array can't be
>                         promoted to type object"
>
>                       * Can you switch the predicate order? -- It
>                         seems for me that the plan remains the same
>                         even if I changed the order of the predicates.
>                         (Attached the plan after changing the order of
>                         the predicates Plan_02.txt)
>
>                     Thanks
>
>                     Rana
>
>                     On Thu, Jan 25, 2018 at 11:24 PM, Rana Alotaibi
>                     <ralotaib@eng.ucsd.edu
>                     <ma...@eng.ucsd.edu>> wrote:
>
>                         Hi Chen,
>
>                         *How did you import data into the dataset?
>                         using "load" or "feed"?*
>                         I used "LOAD" (i.e USE mimiciii; LOAD DATASET
>                         PATIENTS USING localfs ((\"path\"=\"127.0.0.1
>                         <http://127.0.0.1>:///data/ralotaib/patients.json\"),
>                         (\"format\"=\"json\"))).
>
>                         *Which version of AsterixDB are you using?
>                         *
>                         AsterixDB Master (0.9.3-SNAPSHOT)
>
>                         Thanks!
>
>
>
>
>
>                         On Thu, Jan 25, 2018 at 10:39 PM, Chen Luo
>                         <cluo8@uci.edu <ma...@uci.edu>> wrote:
>
>                             Hi Rana,
>
>                             Nice to see you again! You may post to
>                             dev@asterixdb.apache.org
>                             <ma...@asterixdb.apache.org> as well
>                             to get more feedbacks from our developers.
>
>                             Just clarify two things: how did you
>                             import data into the dataset? using "load"
>                             or "feed"? And which version of AsterixDB
>                             are you using? But any way in your case it
>                             seems the join takes a lot of time, and
>                             your data is pretty much cached into the
>                             memory...
>
>                             Best regards,
>                             Chen Luo
>
>                             On Thu, Jan 25, 2018 at 8:46 PM, Rana
>                             Alotaibi <ralotaib@eng.ucsd.edu
>                             <ma...@eng.ucsd.edu>> wrote:
>
>                                 Hi there,
>
>                                 I have a query that takes ~12.7mins on
>                                 average (I have excluded the warm-up
>                                 time which was 30mins)!, and I would
>                                 like to make sure that I didn't miss
>                                 any performance tuning parameters ( I
>                                 have run the same query on MongoDB,
>                                 and it took ~2mins).
>
>                                 The query asks to find all patients
>                                 that have 'abnormal' haptoglobin blood
>                                 test result. (The query result can
>                                 have duplicate values).
>
>                                 *Query:*
>                                 USE mimiciii;
>                                 SET `compiler.parallelism` "5";
>                                 SET `compiler.sortmemory` "128MB";
>                                 SET `compiler.joinmemory` "265MB";
>                                 SELECT P.SUBJECT_ID
>                                 FROM LABITEMS I, PATIENTS P,
>                                 P.ADMISSIONS A, A.LABEVENTS E
>                                 WHERE I.ITEMID=E.ITEMID AND
>                                 E.FLAG = 'abnormal' AND
>                                 I.FLUID='Blood' AND
>                                 I.LABEL='Haptoglobin'
>
>                                 *Datasets Schema:*
>                                 - PATIENTS and LABITEMS datasets have
>                                 an open schema.
>                                 - LABITEMS's primary key is ITEMID
>                                 - PATIENTS 's primary key is SUBJECT_ID
>                                 - The JSON schema for both datasets is
>                                 attached.
>                                 - The DDL for both datasets is attached
>
>                                 *Performance Tuning Parameters:*
>                                 - 4 partitions (iodevices)
>                                 - The total memory size is : 125GB,
>                                 and I have assigned ~ 57GB to the
>                                 buffercache (storage.buffercache.size).
>                                 - As you can see from the query, I set
>                                 the parallelism to 5, sort-memory to
>                                 128MB, join-memory to 265MB.
>                                 - The data size is 7GB
>
>                                 Your feedback is highly appreciated!
>
>                                 --Rana
>
>
>
>
>
>
>
>
>
>
>     -- 
>
>     *Regards,*
>     Wail Alkowaileet
>
>

Re: AsterixDB Performance Tuning

Posted by Rana Alotaibi <ra...@eng.ucsd.edu>.

Hi all,
Thanks for your immediate and quick reply.

@Chen Luo : Yes. I have crated an index on FLAG in MongoDB. However, in
AsterixDB setting, it seems that navigating deeply in the nested fields and
create an index is not supported.
@Taewoo Adding  /*+indexnl */ didn't change the plan.
@Wail : Adding /*+bcast/ did the magic. Now, it takes roughly ~6.34 mins on
average without including the warm-up time. The plan has changed and it is
attached.

Can you please points me where I can find this in the documentation? And
here is my question: Why do I need to add some physical details in the
query like adding  /*+indexnl */  index-nested-loop-join or /*+bcast/
broadcast_exchange?
@Michael: The data size is ~7GB for PATEINTS dataset and 1MB for LABITEMS
dataset (Both datasets have an open schema). After data ingestion, in my
setting the data in MongoDB is ~3GB(It seems MongoDB does some compression)
and in AsterixDB is ~10GB. (I have 4 partitions,and I checked the size of
files in each partition and the total is ~10GB!)

Thanks!
--Rana




On Fri, Jan 26, 2018 at 12:52 PM, Wail Alkowaileet <wa...@gmail.com>
wrote:

> One thing I noticed is that the "large" unnested arrays are hash
> partitioned to the probably "small" index-filtered dataset.
> Since the data can fit in memory (7 GB in total), I think
> broadcast_exchange may do better in this particular case.
>
> USE mimiciii;
> SET `compiler.parallelism` "5";
> SET `compiler.sortmemory` "128MB";
> SET `compiler.joinmemory` "265MB";
> SELECT P.SUBJECT_ID
> FROM PATIENTS P, P.ADMISSIONS A, A.LABEVENTS E, LABITEMS I
> WHERE  E.ITEMID/*+bcast*/ = I.ITEMID AND
>              E.FLAG = 'abnormal' AND
>              I.FLUID='Blood' AND
>              I.LABEL='Haptoglobin'
>
> Note: I reordered the FROM clause...
>
> Another thing is that I think it's a CPU bound query ... and I'm not sure
> how MongoDB utilizes CPU resources compared with AsterixDB.
>
>
>
> On Fri, Jan 26, 2018 at 10:36 AM, Taewoo Kim <wa...@gmail.com> wrote:
>
>> PS: UNNEST doc
>> https://ci.apache.org/projects/asterixdb/sqlpp/manual.html#Unnest_clauses
>>
>> Best,
>> Taewoo
>>
>> On Fri, Jan 26, 2018 at 10:00 AM, Taewoo Kim <wa...@gmail.com> wrote:
>>
>>> Hi Rana,
>>>
>>> Thank you for attaching your plan. It seems that the selections are
>>> correctly made before each join. If your query predicate is selective
>>> enough (e.g., I.LABEL = 'Haptoglobin' generates less than 1% of records as
>>> the result), I suggest you could try an index-nested-loop-join. Changes are
>>> highlighted. And one more question: if LABEVENTS.FLAG is an array, you
>>> can't just use "E.FLAG="abnormal". I think you need to use UNNEST.
>>>
>>> USE mimiciii;
>>> SET `compiler.parallelism` "5";
>>> SET `compiler.sortmemory` "128MB";
>>> SET `compiler.joinmemory` "265MB";
>>> SELECT P.SUBJECT_ID
>>> FROM   LABITEMS I, PATIENTS P, P.ADMISSIONS A, A.LABEVENTS E
>>> WHERE I.ITEMID */* +indexnl */ *=E.ITEMID AND
>>>              E.FLAG = 'abnormal' AND
>>>              I.FLUID='Blood' AND
>>>              I.LABEL='Haptoglobin'
>>>
>>> Best,
>>> Taewoo
>>>
>>> On Fri, Jan 26, 2018 at 9:16 AM, Chen Luo <cl...@uci.edu> wrote:
>>>
>>>> Hi Rana,
>>>>
>>>> I think the performance issue might related to the access of nested
>>>> fields, since the rest performance hot spots (index search, hash join etc
>>>> looks normal to me), and I assume " I.FLUID='Blood' AND
>>>> I.LABEL='Haptoglobin'" should be very selective. Since MongoDB
>>>> supports array index, did you build an index on L.FLAG using MongoDB?
>>>>
>>>> @Wail, do you have any clue on nested fields access?
>>>>
>>>> Best regards,
>>>> Chen Luo
>>>>
>>>> On Fri, Jan 26, 2018 at 1:47 AM, Rana Alotaibi <ra...@eng.ucsd.edu>
>>>> wrote:
>>>>
>>>>> Hi Taewoo,
>>>>>
>>>>>    -
>>>>>
>>>>>    Can you paste the optimized plan? -- Attached the plan (Plan_01.txt)
>>>>>
>>>>>    -
>>>>>
>>>>>    Can you create an index on LABEVENTS.FLAG? -- I couldn't create an index on LABEVENTS.FLAG since LABEVENTS is of type array. I got this message when I tried to create the index : "msg": "ASX0001: Field type array can't be promoted to type object"
>>>>>
>>>>>    - Can you switch the predicate order? -- It seems for me that the
>>>>>    plan remains the same even if I changed the order of the predicates.
>>>>>    (Attached the plan after changing the order of the predicates Plan_02.txt)
>>>>>
>>>>> Thanks
>>>>>
>>>>> Rana
>>>>> On Thu, Jan 25, 2018 at 11:24 PM, Rana Alotaibi <ralotaib@eng.ucsd.edu
>>>>> > wrote:
>>>>>
>>>>>> Hi Chen,
>>>>>>
>>>>>> *How did you import data into the dataset? using "load" or "feed"?*
>>>>>> I used "LOAD" (i.e USE mimiciii; LOAD DATASET PATIENTS USING localfs
>>>>>> ((\"path\"=\"127.0.0.1:///data/ralotaib/patients.json\"),
>>>>>> (\"format\"=\"json\"))).
>>>>>>
>>>>>>
>>>>>> *Which version of AsterixDB are you using?*
>>>>>> AsterixDB Master (0.9.3-SNAPSHOT)
>>>>>>
>>>>>> Thanks!
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Thu, Jan 25, 2018 at 10:39 PM, Chen Luo <cl...@uci.edu> wrote:
>>>>>>
>>>>>>> Hi Rana,
>>>>>>>
>>>>>>> Nice to see you again! You may post to dev@asterixdb.apache.org as
>>>>>>> well to get more feedbacks from our developers.
>>>>>>>
>>>>>>> Just clarify two things: how did you import data into the dataset?
>>>>>>> using "load" or "feed"? And which version of AsterixDB are you using? But
>>>>>>> any way in your case it seems the join takes a lot of time, and your data
>>>>>>> is pretty much cached into the memory...
>>>>>>>
>>>>>>> Best regards,
>>>>>>> Chen Luo
>>>>>>>
>>>>>>> On Thu, Jan 25, 2018 at 8:46 PM, Rana Alotaibi <
>>>>>>> ralotaib@eng.ucsd.edu> wrote:
>>>>>>>
>>>>>>>> Hi there,
>>>>>>>>
>>>>>>>> I have a query that takes ~12.7mins on average (I have excluded the
>>>>>>>> warm-up time which was 30mins)!, and I would like to make sure that I
>>>>>>>> didn't miss any performance tuning parameters ( I have run the same query
>>>>>>>> on MongoDB, and it took ~2mins).
>>>>>>>>
>>>>>>>> The query asks to find all patients that have 'abnormal'
>>>>>>>> haptoglobin blood test result. (The query result can have duplicate values).
>>>>>>>>
>>>>>>>> *Query:*
>>>>>>>> USE mimiciii;
>>>>>>>> SET `compiler.parallelism` "5";
>>>>>>>> SET `compiler.sortmemory` "128MB";
>>>>>>>> SET `compiler.joinmemory` "265MB";
>>>>>>>> SELECT P.SUBJECT_ID
>>>>>>>> FROM   LABITEMS I, PATIENTS P, P.ADMISSIONS A, A.LABEVENTS E
>>>>>>>> WHERE I.ITEMID=E.ITEMID AND
>>>>>>>>              E.FLAG = 'abnormal' AND
>>>>>>>>              I.FLUID='Blood' AND
>>>>>>>>              I.LABEL='Haptoglobin'
>>>>>>>>
>>>>>>>> *Datasets Schema:*
>>>>>>>> - PATIENTS  and LABITEMS datasets have an open schema.
>>>>>>>> - LABITEMS's  primary key is ITEMID
>>>>>>>> - PATIENTS 's primary key is SUBJECT_ID
>>>>>>>> - The JSON schema for both datasets is attached.
>>>>>>>> - The DDL for both datasets is attached
>>>>>>>>
>>>>>>>> *Performance Tuning Parameters:*
>>>>>>>> - 4 partitions (iodevices)
>>>>>>>> - The total memory size is : 125GB, and I have assigned ~ 57GB to
>>>>>>>> the buffercache (storage.buffercache.size).
>>>>>>>> - As you can see from the query, I set the parallelism to 5,
>>>>>>>> sort-memory to 128MB, join-memory to 265MB.
>>>>>>>> - The data size is 7GB
>>>>>>>>
>>>>>>>> Your feedback is highly appreciated!
>>>>>>>>
>>>>>>>> --Rana
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>
>
> --
>
> *Regards,*
> Wail Alkowaileet
>

Re: AsterixDB Performance Tuning

Posted by Mike Carey <dt...@gmail.com>.

It would be interesting to see what the "MongoDB query plan" (program 
:-)) is in this case, and how the AsterixDB plan compares.


On 1/26/18 12:52 PM, Wail Alkowaileet wrote:
> One thing I noticed is that the "large" unnested arrays are hash 
> partitioned to the probably "small" index-filtered dataset.
> Since the data can fit in memory (7 GB in total), I think 
> broadcast_exchange may do better in this particular case.
>
> USE mimiciii;
> SET `compiler.parallelism` "5";
> SET `compiler.sortmemory` "128MB";
> SET `compiler.joinmemory` "265MB";
> SELECT P.SUBJECT_ID
> FROM PATIENTS P, P.ADMISSIONS A, A.LABEVENTS E, LABITEMS I
> WHERE  E.ITEMID/*+bcast*/ = I.ITEMID AND
>              E.FLAG = 'abnormal' AND
>              I.FLUID='Blood' AND
>              I.LABEL='Haptoglobin'
>
> Note: I reordered the FROM clause...
>
> Another thing is that I think it's a CPU bound query ... and I'm not 
> sure how MongoDB utilizes CPU resources compared with AsterixDB.
>
>
>
> On Fri, Jan 26, 2018 at 10:36 AM, Taewoo Kim <wangsaeu@gmail.com 
> <ma...@gmail.com>> wrote:
>
>     PS: UNNEST doc
>     https://ci.apache.org/projects/asterixdb/sqlpp/manual.html#Unnest_clauses
>     <https://ci.apache.org/projects/asterixdb/sqlpp/manual.html#Unnest_clauses>
>
>     Best,
>     Taewoo
>
>     On Fri, Jan 26, 2018 at 10:00 AM, Taewoo Kim <wangsaeu@gmail.com
>     <ma...@gmail.com>> wrote:
>
>         Hi Rana,
>
>         Thank you for attaching your plan. It seems that the
>         selections are correctly made before each join. If your query
>         predicate is selective enough (e.g., I.LABEL = 'Haptoglobin'
>         generates less than 1% of records as the result), I suggest
>         you could try an index-nested-loop-join. Changes are
>         highlighted. And one more question: if LABEVENTS.FLAG is an
>         array, you can't just use "E.FLAG="abnormal". I think you need
>         to use UNNEST.
>
>         USE mimiciii;
>         SET `compiler.parallelism` "5";
>         SET `compiler.sortmemory` "128MB";
>         SET `compiler.joinmemory` "265MB";
>         SELECT P.SUBJECT_ID
>         FROM LABITEMS I, PATIENTS P, P.ADMISSIONS A, A.LABEVENTS E
>         WHERE I.ITEMID */* +indexnl */ *=E.ITEMID AND
>         E.FLAG = 'abnormal' AND
>         I.FLUID='Blood' AND
>         I.LABEL='Haptoglobin'
>
>         Best,
>         Taewoo
>
>         On Fri, Jan 26, 2018 at 9:16 AM, Chen Luo <cluo8@uci.edu
>         <ma...@uci.edu>> wrote:
>
>             Hi Rana,
>
>             I think the performance issue might related to the access
>             of nested fields, since the rest performance hot spots
>             (index search, hash join etc looks normal to me), and I
>             assume " I.FLUID='Blood' AND I.LABEL='Haptoglobin'" should
>             be very selective. Since MongoDB supports array index, did
>             you build an index on L.FLAG using MongoDB?
>
>             @Wail, do you have any clue on nested fields access?
>
>             Best regards,
>             Chen Luo
>
>             On Fri, Jan 26, 2018 at 1:47 AM, Rana Alotaibi
>             <ralotaib@eng.ucsd.edu <ma...@eng.ucsd.edu>> wrote:
>
>                 Hi Taewoo,
>
>                  *
>
>                     Can you paste the optimized plan? -- Attached the
>                     plan (Plan_01.txt)
>
>                  *
>
>                     Can you create an index on LABEVENTS.FLAG? -- I
>                     couldn't create an index on LABEVENTS.FLAG since
>                     LABEVENTS is of type array. I got this message
>                     when I tried to create the index : "msg":
>                     "ASX0001: Field type array can't be promoted to
>                     type object"
>
>                   * Can you switch the predicate order? -- It seems
>                     for me that the plan remains the same even if I
>                     changed the order of the predicates. (Attached the
>                     plan after changing the order of the predicates
>                     Plan_02.txt)
>
>                 Thanks
>
>                 Rana
>
>                 On Thu, Jan 25, 2018 at 11:24 PM, Rana Alotaibi
>                 <ralotaib@eng.ucsd.edu <ma...@eng.ucsd.edu>>
>                 wrote:
>
>                     Hi Chen,
>
>                     *How did you import data into the dataset? using
>                     "load" or "feed"?*
>                     I used "LOAD" (i.e USE mimiciii; LOAD DATASET
>                     PATIENTS USING localfs ((\"path\"=\"127.0.0.1
>                     <http://127.0.0.1>:///data/ralotaib/patients.json\"),
>                     (\"format\"=\"json\"))).
>
>                     *Which version of AsterixDB are you using?
>                     *
>                     AsterixDB Master (0.9.3-SNAPSHOT)
>
>                     Thanks!
>
>
>
>
>
>                     On Thu, Jan 25, 2018 at 10:39 PM, Chen Luo
>                     <cluo8@uci.edu <ma...@uci.edu>> wrote:
>
>                         Hi Rana,
>
>                         Nice to see you again! You may post to
>                         dev@asterixdb.apache.org
>                         <ma...@asterixdb.apache.org> as well to
>                         get more feedbacks from our developers.
>
>                         Just clarify two things: how did you import
>                         data into the dataset? using "load" or "feed"?
>                         And which version of AsterixDB are you using?
>                         But any way in your case it seems the join
>                         takes a lot of time, and your data is pretty
>                         much cached into the memory...
>
>                         Best regards,
>                         Chen Luo
>
>                         On Thu, Jan 25, 2018 at 8:46 PM, Rana Alotaibi
>                         <ralotaib@eng.ucsd.edu
>                         <ma...@eng.ucsd.edu>> wrote:
>
>                             Hi there,
>
>                             I have a query that takes ~12.7mins on
>                             average (I have excluded the warm-up time
>                             which was 30mins)!, and I would like to
>                             make sure that I didn't miss any
>                             performance tuning parameters ( I have run
>                             the same query on MongoDB, and it took
>                             ~2mins).
>
>                             The query asks to find all patients that
>                             have 'abnormal' haptoglobin blood test
>                             result. (The query result can have
>                             duplicate values).
>
>                             *Query:*
>                             USE mimiciii;
>                             SET `compiler.parallelism` "5";
>                             SET `compiler.sortmemory` "128MB";
>                             SET `compiler.joinmemory` "265MB";
>                             SELECT P.SUBJECT_ID
>                             FROM LABITEMS I, PATIENTS P, P.ADMISSIONS
>                             A, A.LABEVENTS E
>                             WHERE I.ITEMID=E.ITEMID AND
>                             E.FLAG = 'abnormal' AND
>                             I.FLUID='Blood' AND
>                             I.LABEL='Haptoglobin'
>
>                             *Datasets Schema:*
>                             - PATIENTS and LABITEMS datasets have an
>                             open schema.
>                             - LABITEMS's primary key is ITEMID
>                             - PATIENTS 's primary key is SUBJECT_ID
>                             - The JSON schema for both datasets is
>                             attached.
>                             - The DDL for both datasets is attached
>
>                             *Performance Tuning Parameters:*
>                             - 4 partitions (iodevices)
>                             - The total memory size is : 125GB, and I
>                             have assigned ~ 57GB to the buffercache
>                             (storage.buffercache.size).
>                             - As you can see from the query, I set the
>                             parallelism to 5, sort-memory to 128MB,
>                             join-memory to 265MB.
>                             - The data size is 7GB
>
>                             Your feedback is highly appreciated!
>
>                             --Rana
>
>
>
>
>
>
>
>
>
>
> -- 
>
> *Regards,*
> Wail Alkowaileet

Re: AsterixDB Performance Tuning

Posted by Wail Alkowaileet <wa...@gmail.com>.

One thing I noticed is that the "large" unnested arrays are hash
partitioned to the probably "small" index-filtered dataset.
Since the data can fit in memory (7 GB in total), I think
broadcast_exchange may do better in this particular case.

USE mimiciii;
SET `compiler.parallelism` "5";
SET `compiler.sortmemory` "128MB";
SET `compiler.joinmemory` "265MB";
SELECT P.SUBJECT_ID
FROM PATIENTS P, P.ADMISSIONS A, A.LABEVENTS E, LABITEMS I
WHERE  E.ITEMID/*+bcast*/ = I.ITEMID AND
             E.FLAG = 'abnormal' AND
             I.FLUID='Blood' AND
             I.LABEL='Haptoglobin'

Note: I reordered the FROM clause...

Another thing is that I think it's a CPU bound query ... and I'm not sure
how MongoDB utilizes CPU resources compared with AsterixDB.



On Fri, Jan 26, 2018 at 10:36 AM, Taewoo Kim <wa...@gmail.com> wrote:

> PS: UNNEST doc
> https://ci.apache.org/projects/asterixdb/sqlpp/manual.html#Unnest_clauses
>
> Best,
> Taewoo
>
> On Fri, Jan 26, 2018 at 10:00 AM, Taewoo Kim <wa...@gmail.com> wrote:
>
>> Hi Rana,
>>
>> Thank you for attaching your plan. It seems that the selections are
>> correctly made before each join. If your query predicate is selective
>> enough (e.g., I.LABEL = 'Haptoglobin' generates less than 1% of records as
>> the result), I suggest you could try an index-nested-loop-join. Changes are
>> highlighted. And one more question: if LABEVENTS.FLAG is an array, you
>> can't just use "E.FLAG="abnormal". I think you need to use UNNEST.
>>
>> USE mimiciii;
>> SET `compiler.parallelism` "5";
>> SET `compiler.sortmemory` "128MB";
>> SET `compiler.joinmemory` "265MB";
>> SELECT P.SUBJECT_ID
>> FROM   LABITEMS I, PATIENTS P, P.ADMISSIONS A, A.LABEVENTS E
>> WHERE I.ITEMID */* +indexnl */ *=E.ITEMID AND
>>              E.FLAG = 'abnormal' AND
>>              I.FLUID='Blood' AND
>>              I.LABEL='Haptoglobin'
>>
>> Best,
>> Taewoo
>>
>> On Fri, Jan 26, 2018 at 9:16 AM, Chen Luo <cl...@uci.edu> wrote:
>>
>>> Hi Rana,
>>>
>>> I think the performance issue might related to the access of nested
>>> fields, since the rest performance hot spots (index search, hash join etc
>>> looks normal to me), and I assume " I.FLUID='Blood' AND
>>> I.LABEL='Haptoglobin'" should be very selective. Since MongoDB supports
>>> array index, did you build an index on L.FLAG using MongoDB?
>>>
>>> @Wail, do you have any clue on nested fields access?
>>>
>>> Best regards,
>>> Chen Luo
>>>
>>> On Fri, Jan 26, 2018 at 1:47 AM, Rana Alotaibi <ra...@eng.ucsd.edu>
>>> wrote:
>>>
>>>> Hi Taewoo,
>>>>
>>>>    -
>>>>
>>>>    Can you paste the optimized plan? -- Attached the plan (Plan_01.txt)
>>>>
>>>>    -
>>>>
>>>>    Can you create an index on LABEVENTS.FLAG? -- I couldn't create an index on LABEVENTS.FLAG since LABEVENTS is of type array. I got this message when I tried to create the index : "msg": "ASX0001: Field type array can't be promoted to type object"
>>>>
>>>>    - Can you switch the predicate order? -- It seems for me that the
>>>>    plan remains the same even if I changed the order of the predicates.
>>>>    (Attached the plan after changing the order of the predicates Plan_02.txt)
>>>>
>>>> Thanks
>>>>
>>>> Rana
>>>> On Thu, Jan 25, 2018 at 11:24 PM, Rana Alotaibi <ra...@eng.ucsd.edu>
>>>> wrote:
>>>>
>>>>> Hi Chen,
>>>>>
>>>>> *How did you import data into the dataset? using "load" or "feed"?*
>>>>> I used "LOAD" (i.e USE mimiciii; LOAD DATASET PATIENTS USING localfs
>>>>> ((\"path\"=\"127.0.0.1:///data/ralotaib/patients.json\"),
>>>>> (\"format\"=\"json\"))).
>>>>>
>>>>>
>>>>> *Which version of AsterixDB are you using?*
>>>>> AsterixDB Master (0.9.3-SNAPSHOT)
>>>>>
>>>>> Thanks!
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On Thu, Jan 25, 2018 at 10:39 PM, Chen Luo <cl...@uci.edu> wrote:
>>>>>
>>>>>> Hi Rana,
>>>>>>
>>>>>> Nice to see you again! You may post to dev@asterixdb.apache.org as
>>>>>> well to get more feedbacks from our developers.
>>>>>>
>>>>>> Just clarify two things: how did you import data into the dataset?
>>>>>> using "load" or "feed"? And which version of AsterixDB are you using? But
>>>>>> any way in your case it seems the join takes a lot of time, and your data
>>>>>> is pretty much cached into the memory...
>>>>>>
>>>>>> Best regards,
>>>>>> Chen Luo
>>>>>>
>>>>>> On Thu, Jan 25, 2018 at 8:46 PM, Rana Alotaibi <ralotaib@eng.ucsd.edu
>>>>>> > wrote:
>>>>>>
>>>>>>> Hi there,
>>>>>>>
>>>>>>> I have a query that takes ~12.7mins on average (I have excluded the
>>>>>>> warm-up time which was 30mins)!, and I would like to make sure that I
>>>>>>> didn't miss any performance tuning parameters ( I have run the same query
>>>>>>> on MongoDB, and it took ~2mins).
>>>>>>>
>>>>>>> The query asks to find all patients that have 'abnormal' haptoglobin
>>>>>>> blood test result. (The query result can have duplicate values).
>>>>>>>
>>>>>>> *Query:*
>>>>>>> USE mimiciii;
>>>>>>> SET `compiler.parallelism` "5";
>>>>>>> SET `compiler.sortmemory` "128MB";
>>>>>>> SET `compiler.joinmemory` "265MB";
>>>>>>> SELECT P.SUBJECT_ID
>>>>>>> FROM   LABITEMS I, PATIENTS P, P.ADMISSIONS A, A.LABEVENTS E
>>>>>>> WHERE I.ITEMID=E.ITEMID AND
>>>>>>>              E.FLAG = 'abnormal' AND
>>>>>>>              I.FLUID='Blood' AND
>>>>>>>              I.LABEL='Haptoglobin'
>>>>>>>
>>>>>>> *Datasets Schema:*
>>>>>>> - PATIENTS  and LABITEMS datasets have an open schema.
>>>>>>> - LABITEMS's  primary key is ITEMID
>>>>>>> - PATIENTS 's primary key is SUBJECT_ID
>>>>>>> - The JSON schema for both datasets is attached.
>>>>>>> - The DDL for both datasets is attached
>>>>>>>
>>>>>>> *Performance Tuning Parameters:*
>>>>>>> - 4 partitions (iodevices)
>>>>>>> - The total memory size is : 125GB, and I have assigned ~ 57GB to
>>>>>>> the buffercache (storage.buffercache.size).
>>>>>>> - As you can see from the query, I set the parallelism to 5,
>>>>>>> sort-memory to 128MB, join-memory to 265MB.
>>>>>>> - The data size is 7GB
>>>>>>>
>>>>>>> Your feedback is highly appreciated!
>>>>>>>
>>>>>>> --Rana
>>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>


-- 

*Regards,*
Wail Alkowaileet

Re: AsterixDB Performance Tuning

Posted by Taewoo Kim <wa...@gmail.com>.

PS: UNNEST doc
https://ci.apache.org/projects/asterixdb/sqlpp/manual.html#Unnest_clauses

Best,
Taewoo

On Fri, Jan 26, 2018 at 10:00 AM, Taewoo Kim <wa...@gmail.com> wrote:

> Hi Rana,
>
> Thank you for attaching your plan. It seems that the selections are
> correctly made before each join. If your query predicate is selective
> enough (e.g., I.LABEL = 'Haptoglobin' generates less than 1% of records as
> the result), I suggest you could try an index-nested-loop-join. Changes are
> highlighted. And one more question: if LABEVENTS.FLAG is an array, you
> can't just use "E.FLAG="abnormal". I think you need to use UNNEST.
>
> USE mimiciii;
> SET `compiler.parallelism` "5";
> SET `compiler.sortmemory` "128MB";
> SET `compiler.joinmemory` "265MB";
> SELECT P.SUBJECT_ID
> FROM   LABITEMS I, PATIENTS P, P.ADMISSIONS A, A.LABEVENTS E
> WHERE I.ITEMID */* +indexnl */ *=E.ITEMID AND
>              E.FLAG = 'abnormal' AND
>              I.FLUID='Blood' AND
>              I.LABEL='Haptoglobin'
>
> Best,
> Taewoo
>
> On Fri, Jan 26, 2018 at 9:16 AM, Chen Luo <cl...@uci.edu> wrote:
>
>> Hi Rana,
>>
>> I think the performance issue might related to the access of nested
>> fields, since the rest performance hot spots (index search, hash join etc
>> looks normal to me), and I assume " I.FLUID='Blood' AND
>> I.LABEL='Haptoglobin'" should be very selective. Since MongoDB supports
>> array index, did you build an index on L.FLAG using MongoDB?
>>
>> @Wail, do you have any clue on nested fields access?
>>
>> Best regards,
>> Chen Luo
>>
>> On Fri, Jan 26, 2018 at 1:47 AM, Rana Alotaibi <ra...@eng.ucsd.edu>
>> wrote:
>>
>>> Hi Taewoo,
>>>
>>>    -
>>>
>>>    Can you paste the optimized plan? -- Attached the plan (Plan_01.txt)
>>>
>>>    -
>>>
>>>    Can you create an index on LABEVENTS.FLAG? -- I couldn't create an index on LABEVENTS.FLAG since LABEVENTS is of type array. I got this message when I tried to create the index : "msg": "ASX0001: Field type array can't be promoted to type object"
>>>
>>>    - Can you switch the predicate order? -- It seems for me that the
>>>    plan remains the same even if I changed the order of the predicates.
>>>    (Attached the plan after changing the order of the predicates Plan_02.txt)
>>>
>>> Thanks
>>>
>>> Rana
>>> On Thu, Jan 25, 2018 at 11:24 PM, Rana Alotaibi <ra...@eng.ucsd.edu>
>>> wrote:
>>>
>>>> Hi Chen,
>>>>
>>>> *How did you import data into the dataset? using "load" or "feed"?*
>>>> I used "LOAD" (i.e USE mimiciii; LOAD DATASET PATIENTS USING localfs
>>>> ((\"path\"=\"127.0.0.1:///data/ralotaib/patients.json\"),
>>>> (\"format\"=\"json\"))).
>>>>
>>>>
>>>> *Which version of AsterixDB are you using?*
>>>> AsterixDB Master (0.9.3-SNAPSHOT)
>>>>
>>>> Thanks!
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> On Thu, Jan 25, 2018 at 10:39 PM, Chen Luo <cl...@uci.edu> wrote:
>>>>
>>>>> Hi Rana,
>>>>>
>>>>> Nice to see you again! You may post to dev@asterixdb.apache.org as
>>>>> well to get more feedbacks from our developers.
>>>>>
>>>>> Just clarify two things: how did you import data into the dataset?
>>>>> using "load" or "feed"? And which version of AsterixDB are you using? But
>>>>> any way in your case it seems the join takes a lot of time, and your data
>>>>> is pretty much cached into the memory...
>>>>>
>>>>> Best regards,
>>>>> Chen Luo
>>>>>
>>>>> On Thu, Jan 25, 2018 at 8:46 PM, Rana Alotaibi <ra...@eng.ucsd.edu>
>>>>> wrote:
>>>>>
>>>>>> Hi there,
>>>>>>
>>>>>> I have a query that takes ~12.7mins on average (I have excluded the
>>>>>> warm-up time which was 30mins)!, and I would like to make sure that I
>>>>>> didn't miss any performance tuning parameters ( I have run the same query
>>>>>> on MongoDB, and it took ~2mins).
>>>>>>
>>>>>> The query asks to find all patients that have 'abnormal' haptoglobin
>>>>>> blood test result. (The query result can have duplicate values).
>>>>>>
>>>>>> *Query:*
>>>>>> USE mimiciii;
>>>>>> SET `compiler.parallelism` "5";
>>>>>> SET `compiler.sortmemory` "128MB";
>>>>>> SET `compiler.joinmemory` "265MB";
>>>>>> SELECT P.SUBJECT_ID
>>>>>> FROM   LABITEMS I, PATIENTS P, P.ADMISSIONS A, A.LABEVENTS E
>>>>>> WHERE I.ITEMID=E.ITEMID AND
>>>>>>              E.FLAG = 'abnormal' AND
>>>>>>              I.FLUID='Blood' AND
>>>>>>              I.LABEL='Haptoglobin'
>>>>>>
>>>>>> *Datasets Schema:*
>>>>>> - PATIENTS  and LABITEMS datasets have an open schema.
>>>>>> - LABITEMS's  primary key is ITEMID
>>>>>> - PATIENTS 's primary key is SUBJECT_ID
>>>>>> - The JSON schema for both datasets is attached.
>>>>>> - The DDL for both datasets is attached
>>>>>>
>>>>>> *Performance Tuning Parameters:*
>>>>>> - 4 partitions (iodevices)
>>>>>> - The total memory size is : 125GB, and I have assigned ~ 57GB to the
>>>>>> buffercache (storage.buffercache.size).
>>>>>> - As you can see from the query, I set the parallelism to 5,
>>>>>> sort-memory to 128MB, join-memory to 265MB.
>>>>>> - The data size is 7GB
>>>>>>
>>>>>> Your feedback is highly appreciated!
>>>>>>
>>>>>> --Rana
>>>>>>
>>>>>
>>>>>
>>>>
>>>
>>
>

Re: AsterixDB Performance Tuning

Posted by Michael Carey <mj...@ics.uci.edu>.

Also:  What are the data sizes in the two systems?


On 1/26/18 10:00 AM, Taewoo Kim wrote:
> Hi Rana,
>
> Thank you for attaching your plan. It seems that the selections are 
> correctly made before each join. If your query predicate is selective 
> enough (e.g., I.LABEL = 'Haptoglobin' generates less than 1% of 
> records as the result), I suggest you could try an 
> index-nested-loop-join. Changes are highlighted. And one more 
> question: if LABEVENTS.FLAG is an array, you can't just use 
> "E.FLAG="abnormal". I think you need to use UNNEST.
>
> USE mimiciii;
> SET `compiler.parallelism` "5";
> SET `compiler.sortmemory` "128MB";
> SET `compiler.joinmemory` "265MB";
> SELECT P.SUBJECT_ID
> FROM   LABITEMS I, PATIENTS P, P.ADMISSIONS A, A.LABEVENTS E
> WHERE I.ITEMID */* +indexnl */ *=E.ITEMID AND
>              E.FLAG = 'abnormal' AND
> I.FLUID='Blood' AND
> I.LABEL='Haptoglobin'
>
> Best,
> Taewoo
>
> On Fri, Jan 26, 2018 at 9:16 AM, Chen Luo <cluo8@uci.edu 
> <ma...@uci.edu>> wrote:
>
>     Hi Rana,
>
>     I think the performance issue might related to the access of
>     nested fields, since the rest performance hot spots (index search,
>     hash join etc looks normal to me), and I assume " I.FLUID='Blood'
>     AND I.LABEL='Haptoglobin'" should be very selective. Since MongoDB
>     supports array index, did you build an index on L.FLAG using MongoDB?
>
>     @Wail, do you have any clue on nested fields access?
>
>     Best regards,
>     Chen Luo
>
>     On Fri, Jan 26, 2018 at 1:47 AM, Rana Alotaibi
>     <ralotaib@eng.ucsd.edu <ma...@eng.ucsd.edu>> wrote:
>
>         Hi Taewoo,
>
>          *
>
>             Can you paste the optimized plan? -- Attached the plan
>             (Plan_01.txt)
>
>          *
>
>             Can you create an index on LABEVENTS.FLAG? -- I couldn't
>             create an index on LABEVENTS.FLAG since LABEVENTS is of
>             type array. I got this message when I tried to create the
>             index : "msg": "ASX0001: Field type array can't be
>             promoted to type object"
>
>           * Can you switch the predicate order? -- It seems for me
>             that the plan remains the same even if I changed the order
>             of the predicates. (Attached the plan after changing the
>             order of the predicates Plan_02.txt)
>
>         Thanks
>
>         Rana
>
>         On Thu, Jan 25, 2018 at 11:24 PM, Rana Alotaibi
>         <ralotaib@eng.ucsd.edu <ma...@eng.ucsd.edu>> wrote:
>
>             Hi Chen,
>
>             *How did you import data into the dataset? using "load" or
>             "feed"?*
>             I used "LOAD" (i.e USE mimiciii; LOAD DATASET PATIENTS
>             USING localfs ((\"path\"=\"*MailScanner warning: numerical
>             links are often malicious:* 127.0.0.1
>             <http://127.0.0.1>:///data/ralotaib/patients.json\"),
>             (\"format\"=\"json\"))).
>
>             *Which version of AsterixDB are you using?
>             *
>             AsterixDB Master (0.9.3-SNAPSHOT)
>
>             Thanks!
>
>
>
>
>
>             On Thu, Jan 25, 2018 at 10:39 PM, Chen Luo <cluo8@uci.edu
>             <ma...@uci.edu>> wrote:
>
>                 Hi Rana,
>
>                 Nice to see you again! You may post to
>                 dev@asterixdb.apache.org
>                 <ma...@asterixdb.apache.org> as well to get more
>                 feedbacks from our developers.
>
>                 Just clarify two things: how did you import data into
>                 the dataset? using "load" or "feed"? And which version
>                 of AsterixDB are you using? But any way in your case
>                 it seems the join takes a lot of time, and your data
>                 is pretty much cached into the memory...
>
>                 Best regards,
>                 Chen Luo
>
>                 On Thu, Jan 25, 2018 at 8:46 PM, Rana Alotaibi
>                 <ralotaib@eng.ucsd.edu <ma...@eng.ucsd.edu>>
>                 wrote:
>
>                     Hi there,
>
>                     I have a query that takes ~12.7mins on average (I
>                     have excluded the warm-up time which was 30mins)!,
>                     and I would like to make sure that I didn't miss
>                     any performance tuning parameters ( I have run the
>                     same query on MongoDB, and it took ~2mins).
>
>                     The query asks to find all patients that have
>                     'abnormal' haptoglobin blood test result. (The
>                     query result can have duplicate values).
>
>                     *Query:*
>                     USE mimiciii;
>                     SET `compiler.parallelism` "5";
>                     SET `compiler.sortmemory` "128MB";
>                     SET `compiler.joinmemory` "265MB";
>                     SELECT P.SUBJECT_ID
>                     FROM LABITEMS I, PATIENTS P, P.ADMISSIONS A,
>                     A.LABEVENTS E
>                     WHERE I.ITEMID=E.ITEMID AND
>                     E.FLAG = 'abnormal' AND
>                     I.FLUID='Blood' AND
>                     I.LABEL='Haptoglobin'
>
>                     *Datasets Schema:*
>                     - PATIENTS and LABITEMS datasets have an open schema.
>                     - LABITEMS's primary key is ITEMID
>                     - PATIENTS 's primary key is SUBJECT_ID
>                     - The JSON schema for both datasets is attached.
>                     - The DDL for both datasets is attached
>
>                     *Performance Tuning Parameters:*
>                     - 4 partitions (iodevices)
>                     - The total memory size is : 125GB, and I have
>                     assigned ~ 57GB to the buffercache
>                     (storage.buffercache.size).
>                     - As you can see from the query, I set the
>                     parallelism to 5, sort-memory to 128MB,
>                     join-memory to 265MB.
>                     - The data size is 7GB
>
>                     Your feedback is highly appreciated!
>
>                     --Rana
>
>
>
>
>
>

Re: AsterixDB Performance Tuning

Posted by Taewoo Kim <wa...@gmail.com>.

Hi Rana,

Thank you for attaching your plan. It seems that the selections are
correctly made before each join. If your query predicate is selective
enough (e.g., I.LABEL = 'Haptoglobin' generates less than 1% of records as
the result), I suggest you could try an index-nested-loop-join. Changes are
highlighted. And one more question: if LABEVENTS.FLAG is an array, you
can't just use "E.FLAG="abnormal". I think you need to use UNNEST.

USE mimiciii;
SET `compiler.parallelism` "5";
SET `compiler.sortmemory` "128MB";
SET `compiler.joinmemory` "265MB";
SELECT P.SUBJECT_ID
FROM   LABITEMS I, PATIENTS P, P.ADMISSIONS A, A.LABEVENTS E
WHERE I.ITEMID */* +indexnl */ *=E.ITEMID AND
             E.FLAG = 'abnormal' AND
             I.FLUID='Blood' AND
             I.LABEL='Haptoglobin'

Best,
Taewoo

On Fri, Jan 26, 2018 at 9:16 AM, Chen Luo <cl...@uci.edu> wrote:

> Hi Rana,
>
> I think the performance issue might related to the access of nested
> fields, since the rest performance hot spots (index search, hash join etc
> looks normal to me), and I assume " I.FLUID='Blood' AND
> I.LABEL='Haptoglobin'" should be very selective. Since MongoDB supports
> array index, did you build an index on L.FLAG using MongoDB?
>
> @Wail, do you have any clue on nested fields access?
>
> Best regards,
> Chen Luo
>
> On Fri, Jan 26, 2018 at 1:47 AM, Rana Alotaibi <ra...@eng.ucsd.edu>
> wrote:
>
>> Hi Taewoo,
>>
>>    -
>>
>>    Can you paste the optimized plan? -- Attached the plan (Plan_01.txt)
>>
>>    -
>>
>>    Can you create an index on LABEVENTS.FLAG? -- I couldn't create an index on LABEVENTS.FLAG since LABEVENTS is of type array. I got this message when I tried to create the index : "msg": "ASX0001: Field type array can't be promoted to type object"
>>
>>    - Can you switch the predicate order? -- It seems for me that the
>>    plan remains the same even if I changed the order of the predicates.
>>    (Attached the plan after changing the order of the predicates Plan_02.txt)
>>
>> Thanks
>>
>> Rana
>> On Thu, Jan 25, 2018 at 11:24 PM, Rana Alotaibi <ra...@eng.ucsd.edu>
>> wrote:
>>
>>> Hi Chen,
>>>
>>> *How did you import data into the dataset? using "load" or "feed"?*
>>> I used "LOAD" (i.e USE mimiciii; LOAD DATASET PATIENTS USING localfs
>>> ((\"path\"=\"127.0.0.1:///data/ralotaib/patients.json\"),
>>> (\"format\"=\"json\"))).
>>>
>>>
>>> *Which version of AsterixDB are you using?*
>>> AsterixDB Master (0.9.3-SNAPSHOT)
>>>
>>> Thanks!
>>>
>>>
>>>
>>>
>>>
>>> On Thu, Jan 25, 2018 at 10:39 PM, Chen Luo <cl...@uci.edu> wrote:
>>>
>>>> Hi Rana,
>>>>
>>>> Nice to see you again! You may post to dev@asterixdb.apache.org as
>>>> well to get more feedbacks from our developers.
>>>>
>>>> Just clarify two things: how did you import data into the dataset?
>>>> using "load" or "feed"? And which version of AsterixDB are you using? But
>>>> any way in your case it seems the join takes a lot of time, and your data
>>>> is pretty much cached into the memory...
>>>>
>>>> Best regards,
>>>> Chen Luo
>>>>
>>>> On Thu, Jan 25, 2018 at 8:46 PM, Rana Alotaibi <ra...@eng.ucsd.edu>
>>>> wrote:
>>>>
>>>>> Hi there,
>>>>>
>>>>> I have a query that takes ~12.7mins on average (I have excluded the
>>>>> warm-up time which was 30mins)!, and I would like to make sure that I
>>>>> didn't miss any performance tuning parameters ( I have run the same query
>>>>> on MongoDB, and it took ~2mins).
>>>>>
>>>>> The query asks to find all patients that have 'abnormal' haptoglobin
>>>>> blood test result. (The query result can have duplicate values).
>>>>>
>>>>> *Query:*
>>>>> USE mimiciii;
>>>>> SET `compiler.parallelism` "5";
>>>>> SET `compiler.sortmemory` "128MB";
>>>>> SET `compiler.joinmemory` "265MB";
>>>>> SELECT P.SUBJECT_ID
>>>>> FROM   LABITEMS I, PATIENTS P, P.ADMISSIONS A, A.LABEVENTS E
>>>>> WHERE I.ITEMID=E.ITEMID AND
>>>>>              E.FLAG = 'abnormal' AND
>>>>>              I.FLUID='Blood' AND
>>>>>              I.LABEL='Haptoglobin'
>>>>>
>>>>> *Datasets Schema:*
>>>>> - PATIENTS  and LABITEMS datasets have an open schema.
>>>>> - LABITEMS's  primary key is ITEMID
>>>>> - PATIENTS 's primary key is SUBJECT_ID
>>>>> - The JSON schema for both datasets is attached.
>>>>> - The DDL for both datasets is attached
>>>>>
>>>>> *Performance Tuning Parameters:*
>>>>> - 4 partitions (iodevices)
>>>>> - The total memory size is : 125GB, and I have assigned ~ 57GB to the
>>>>> buffercache (storage.buffercache.size).
>>>>> - As you can see from the query, I set the parallelism to 5,
>>>>> sort-memory to 128MB, join-memory to 265MB.
>>>>> - The data size is 7GB
>>>>>
>>>>> Your feedback is highly appreciated!
>>>>>
>>>>> --Rana
>>>>>
>>>>
>>>>
>>>
>>
>

Re: AsterixDB Performance Tuning

Posted by Chen Luo <cl...@uci.edu>.

Hi Rana,

I think the performance issue might related to the access of nested fields,
since the rest performance hot spots (index search, hash join etc looks
normal to me), and I assume " I.FLUID='Blood' AND I.LABEL='Haptoglobin'"
should be very selective. Since MongoDB supports array index, did you build
an index on L.FLAG using MongoDB?

@Wail, do you have any clue on nested fields access?

Best regards,
Chen Luo

On Fri, Jan 26, 2018 at 1:47 AM, Rana Alotaibi <ra...@eng.ucsd.edu>
wrote:

> Hi Taewoo,
>
>    -
>
>    Can you paste the optimized plan? -- Attached the plan (Plan_01.txt)
>
>    -
>
>    Can you create an index on LABEVENTS.FLAG? -- I couldn't create an index on LABEVENTS.FLAG since LABEVENTS is of type array. I got this message when I tried to create the index : "msg": "ASX0001: Field type array can't be promoted to type object"
>
>    - Can you switch the predicate order? -- It seems for me that the plan
>    remains the same even if I changed the order of the predicates. (Attached
>    the plan after changing the order of the predicates Plan_02.txt)
>
> Thanks
>
> Rana
> On Thu, Jan 25, 2018 at 11:24 PM, Rana Alotaibi <ra...@eng.ucsd.edu>
> wrote:
>
>> Hi Chen,
>>
>> *How did you import data into the dataset? using "load" or "feed"?*
>> I used "LOAD" (i.e USE mimiciii; LOAD DATASET PATIENTS USING localfs
>> ((\"path\"=\"127.0.0.1:///data/ralotaib/patients.json\"),
>> (\"format\"=\"json\"))).
>>
>>
>> *Which version of AsterixDB are you using?*
>> AsterixDB Master (0.9.3-SNAPSHOT)
>>
>> Thanks!
>>
>>
>>
>>
>>
>> On Thu, Jan 25, 2018 at 10:39 PM, Chen Luo <cl...@uci.edu> wrote:
>>
>>> Hi Rana,
>>>
>>> Nice to see you again! You may post to dev@asterixdb.apache.org as well
>>> to get more feedbacks from our developers.
>>>
>>> Just clarify two things: how did you import data into the dataset? using
>>> "load" or "feed"? And which version of AsterixDB are you using? But any way
>>> in your case it seems the join takes a lot of time, and your data is pretty
>>> much cached into the memory...
>>>
>>> Best regards,
>>> Chen Luo
>>>
>>> On Thu, Jan 25, 2018 at 8:46 PM, Rana Alotaibi <ra...@eng.ucsd.edu>
>>> wrote:
>>>
>>>> Hi there,
>>>>
>>>> I have a query that takes ~12.7mins on average (I have excluded the
>>>> warm-up time which was 30mins)!, and I would like to make sure that I
>>>> didn't miss any performance tuning parameters ( I have run the same query
>>>> on MongoDB, and it took ~2mins).
>>>>
>>>> The query asks to find all patients that have 'abnormal' haptoglobin
>>>> blood test result. (The query result can have duplicate values).
>>>>
>>>> *Query:*
>>>> USE mimiciii;
>>>> SET `compiler.parallelism` "5";
>>>> SET `compiler.sortmemory` "128MB";
>>>> SET `compiler.joinmemory` "265MB";
>>>> SELECT P.SUBJECT_ID
>>>> FROM   LABITEMS I, PATIENTS P, P.ADMISSIONS A, A.LABEVENTS E
>>>> WHERE I.ITEMID=E.ITEMID AND
>>>>              E.FLAG = 'abnormal' AND
>>>>              I.FLUID='Blood' AND
>>>>              I.LABEL='Haptoglobin'
>>>>
>>>> *Datasets Schema:*
>>>> - PATIENTS  and LABITEMS datasets have an open schema.
>>>> - LABITEMS's  primary key is ITEMID
>>>> - PATIENTS 's primary key is SUBJECT_ID
>>>> - The JSON schema for both datasets is attached.
>>>> - The DDL for both datasets is attached
>>>>
>>>> *Performance Tuning Parameters:*
>>>> - 4 partitions (iodevices)
>>>> - The total memory size is : 125GB, and I have assigned ~ 57GB to the
>>>> buffercache (storage.buffercache.size).
>>>> - As you can see from the query, I set the parallelism to 5,
>>>> sort-memory to 128MB, join-memory to 265MB.
>>>> - The data size is 7GB
>>>>
>>>> Your feedback is highly appreciated!
>>>>
>>>> --Rana
>>>>
>>>
>>>
>>
>

Re: AsterixDB Performance Tuning

Posted by Rana Alotaibi <ra...@eng.ucsd.edu>.

Hi Taewoo,

   -

   Can you paste the optimized plan? -- Attached the plan (Plan_01.txt)

   -

   Can you create an index on LABEVENTS.FLAG? -- I couldn't create an
index on LABEVENTS.FLAG since LABEVENTS is of type array. I got this
message when I tried to create the index : "msg": "ASX0001: Field type
array can't be promoted to type object"

   - Can you switch the predicate order? -- It seems for me that the plan
   remains the same even if I changed the order of the predicates. (Attached
   the plan after changing the order of the predicates Plan_02.txt)

Thanks

Rana
On Thu, Jan 25, 2018 at 11:24 PM, Rana Alotaibi <ra...@eng.ucsd.edu>
wrote:

> Hi Chen,
>
> *How did you import data into the dataset? using "load" or "feed"?*
> I used "LOAD" (i.e USE mimiciii; LOAD DATASET PATIENTS USING localfs
> ((\"path\"=\"127.0.0.1:///data/ralotaib/patients.json\"),
> (\"format\"=\"json\"))).
>
>
> *Which version of AsterixDB are you using?*
> AsterixDB Master (0.9.3-SNAPSHOT)
>
> Thanks!
>
>
>
>
>
> On Thu, Jan 25, 2018 at 10:39 PM, Chen Luo <cl...@uci.edu> wrote:
>
>> Hi Rana,
>>
>> Nice to see you again! You may post to dev@asterixdb.apache.org as well
>> to get more feedbacks from our developers.
>>
>> Just clarify two things: how did you import data into the dataset? using
>> "load" or "feed"? And which version of AsterixDB are you using? But any way
>> in your case it seems the join takes a lot of time, and your data is pretty
>> much cached into the memory...
>>
>> Best regards,
>> Chen Luo
>>
>> On Thu, Jan 25, 2018 at 8:46 PM, Rana Alotaibi <ra...@eng.ucsd.edu>
>> wrote:
>>
>>> Hi there,
>>>
>>> I have a query that takes ~12.7mins on average (I have excluded the
>>> warm-up time which was 30mins)!, and I would like to make sure that I
>>> didn't miss any performance tuning parameters ( I have run the same query
>>> on MongoDB, and it took ~2mins).
>>>
>>> The query asks to find all patients that have 'abnormal' haptoglobin
>>> blood test result. (The query result can have duplicate values).
>>>
>>> *Query:*
>>> USE mimiciii;
>>> SET `compiler.parallelism` "5";
>>> SET `compiler.sortmemory` "128MB";
>>> SET `compiler.joinmemory` "265MB";
>>> SELECT P.SUBJECT_ID
>>> FROM   LABITEMS I, PATIENTS P, P.ADMISSIONS A, A.LABEVENTS E
>>> WHERE I.ITEMID=E.ITEMID AND
>>>              E.FLAG = 'abnormal' AND
>>>              I.FLUID='Blood' AND
>>>              I.LABEL='Haptoglobin'
>>>
>>> *Datasets Schema:*
>>> - PATIENTS  and LABITEMS datasets have an open schema.
>>> - LABITEMS's  primary key is ITEMID
>>> - PATIENTS 's primary key is SUBJECT_ID
>>> - The JSON schema for both datasets is attached.
>>> - The DDL for both datasets is attached
>>>
>>> *Performance Tuning Parameters:*
>>> - 4 partitions (iodevices)
>>> - The total memory size is : 125GB, and I have assigned ~ 57GB to the
>>> buffercache (storage.buffercache.size).
>>> - As you can see from the query, I set the parallelism to 5,
>>> sort-memory to 128MB, join-memory to 265MB.
>>> - The data size is 7GB
>>>
>>> Your feedback is highly appreciated!
>>>
>>> --Rana
>>>
>>
>>
>

Re: AsterixDB Performance Tuning

Posted by Rana Alotaibi <ra...@eng.ucsd.edu>.

Hi Chen,

*How did you import data into the dataset? using "load" or "feed"?*
I used "LOAD" (i.e USE mimiciii; LOAD DATASET PATIENTS USING localfs
((\"path\"=\"127.0.0.1:///data/ralotaib/patients.json\"),
(\"format\"=\"json\"))).


*Which version of AsterixDB are you using?*
AsterixDB Master (0.9.3-SNAPSHOT)

Thanks!





On Thu, Jan 25, 2018 at 10:39 PM, Chen Luo <cl...@uci.edu> wrote:

> Hi Rana,
>
> Nice to see you again! You may post to dev@asterixdb.apache.org as well
> to get more feedbacks from our developers.
>
> Just clarify two things: how did you import data into the dataset? using
> "load" or "feed"? And which version of AsterixDB are you using? But any way
> in your case it seems the join takes a lot of time, and your data is pretty
> much cached into the memory...
>
> Best regards,
> Chen Luo
>
> On Thu, Jan 25, 2018 at 8:46 PM, Rana Alotaibi <ra...@eng.ucsd.edu>
> wrote:
>
>> Hi there,
>>
>> I have a query that takes ~12.7mins on average (I have excluded the
>> warm-up time which was 30mins)!, and I would like to make sure that I
>> didn't miss any performance tuning parameters ( I have run the same query
>> on MongoDB, and it took ~2mins).
>>
>> The query asks to find all patients that have 'abnormal' haptoglobin
>> blood test result. (The query result can have duplicate values).
>>
>> *Query:*
>> USE mimiciii;
>> SET `compiler.parallelism` "5";
>> SET `compiler.sortmemory` "128MB";
>> SET `compiler.joinmemory` "265MB";
>> SELECT P.SUBJECT_ID
>> FROM   LABITEMS I, PATIENTS P, P.ADMISSIONS A, A.LABEVENTS E
>> WHERE I.ITEMID=E.ITEMID AND
>>              E.FLAG = 'abnormal' AND
>>              I.FLUID='Blood' AND
>>              I.LABEL='Haptoglobin'
>>
>> *Datasets Schema:*
>> - PATIENTS  and LABITEMS datasets have an open schema.
>> - LABITEMS's  primary key is ITEMID
>> - PATIENTS 's primary key is SUBJECT_ID
>> - The JSON schema for both datasets is attached.
>> - The DDL for both datasets is attached
>>
>> *Performance Tuning Parameters:*
>> - 4 partitions (iodevices)
>> - The total memory size is : 125GB, and I have assigned ~ 57GB to the
>> buffercache (storage.buffercache.size).
>> - As you can see from the query, I set the parallelism to 5,  sort-memory
>> to 128MB, join-memory to 265MB.
>> - The data size is 7GB
>>
>> Your feedback is highly appreciated!
>>
>> --Rana
>>
>
>

Re: AsterixDB Performance Tuning

Posted by Chen Luo <cl...@uci.edu>.

Hi Rana,

Nice to see you again! You may post to dev@asterixdb.apache.org as well to
get more feedbacks from our developers.

Just clarify two things: how did you import data into the dataset? using
"load" or "feed"? And which version of AsterixDB are you using? But any way
in your case it seems the join takes a lot of time, and your data is pretty
much cached into the memory...

Best regards,
Chen Luo

On Thu, Jan 25, 2018 at 8:46 PM, Rana Alotaibi <ra...@eng.ucsd.edu>
wrote:

> Hi there,
>
> I have a query that takes ~12.7mins on average (I have excluded the
> warm-up time which was 30mins)!, and I would like to make sure that I
> didn't miss any performance tuning parameters ( I have run the same query
> on MongoDB, and it took ~2mins).
>
> The query asks to find all patients that have 'abnormal' haptoglobin blood
> test result. (The query result can have duplicate values).
>
> *Query:*
> USE mimiciii;
> SET `compiler.parallelism` "5";
> SET `compiler.sortmemory` "128MB";
> SET `compiler.joinmemory` "265MB";
> SELECT P.SUBJECT_ID
> FROM   LABITEMS I, PATIENTS P, P.ADMISSIONS A, A.LABEVENTS E
> WHERE I.ITEMID=E.ITEMID AND
>              E.FLAG = 'abnormal' AND
>              I.FLUID='Blood' AND
>              I.LABEL='Haptoglobin'
>
> *Datasets Schema:*
> - PATIENTS  and LABITEMS datasets have an open schema.
> - LABITEMS's  primary key is ITEMID
> - PATIENTS 's primary key is SUBJECT_ID
> - The JSON schema for both datasets is attached.
> - The DDL for both datasets is attached
>
> *Performance Tuning Parameters:*
> - 4 partitions (iodevices)
> - The total memory size is : 125GB, and I have assigned ~ 57GB to the
> buffercache (storage.buffercache.size).
> - As you can see from the query, I set the parallelism to 5,  sort-memory
> to 128MB, join-memory to 265MB.
> - The data size is 7GB
>
> Your feedback is highly appreciated!
>
> --Rana
>

Re: AsterixDB Performance Tuning

Posted by Taewoo Kim <wa...@gmail.com>.

Hi Rana,

The field order in the search predicate might be related to the factor.
Here are my two questions.

1) Can you paste the optimized plan?

2) Can you create an index on LABEVENTS.FLAG?

3) Can you switch the predicate order? (e.g., moving E.FLAG='abnormal' and
I.FLUID='Blood' and I.LABEL='Haptoglobin' to the front and place the join
conditions after that)




Best,
Taewoo

On Thu, Jan 25, 2018 at 8:46 PM, Rana Alotaibi <ra...@eng.ucsd.edu>
wrote:

> Hi there,
>
> I have a query that takes ~12.7mins on average (I have excluded the
> warm-up time which was 30mins)!, and I would like to make sure that I
> didn't miss any performance tuning parameters ( I have run the same query
> on MongoDB, and it took ~2mins).
>
> The query asks to find all patients that have 'abnormal' haptoglobin blood
> test result. (The query result can have duplicate values).
>
> *Query:*
> USE mimiciii;
> SET `compiler.parallelism` "5";
> SET `compiler.sortmemory` "128MB";
> SET `compiler.joinmemory` "265MB";
> SELECT P.SUBJECT_ID
> FROM   LABITEMS I, PATIENTS P, P.ADMISSIONS A, A.LABEVENTS E
> WHERE I.ITEMID=E.ITEMID AND
>              E.FLAG = 'abnormal' AND
>              I.FLUID='Blood' AND
>              I.LABEL='Haptoglobin'
>
> *Datasets Schema:*
> - PATIENTS  and LABITEMS datasets have an open schema.
> - LABITEMS's  primary key is ITEMID
> - PATIENTS 's primary key is SUBJECT_ID
> - The JSON schema for both datasets is attached.
> - The DDL for both datasets is attached
>
> *Performance Tuning Parameters:*
> - 4 partitions (iodevices)
> - The total memory size is : 125GB, and I have assigned ~ 57GB to the
> buffercache (storage.buffercache.size).
> - As you can see from the query, I set the parallelism to 5,  sort-memory
> to 128MB, join-memory to 265MB.
> - The data size is 7GB
>
> Your feedback is highly appreciated!
>
> --Rana
>