You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@asterixdb.apache.org by Yiran Wang <wy...@gmail.com> on 2016/02/20 00:37:48 UTC

Two query questions on time bin function

Hi Asterix team,

I have two queries I'm struggling with. I'm hoping you could provide a
direction for me. Thanks in advance!

Here is what the data structure looks like:

create type HRMType as closed {

  row_id: int32,

  sid: int32,

  date: date,

  day: int32,

  time: time,

  bpm: int32,

  RR: float

};

create dataset HRM (HRMType)

primary key row_id;


Previously I have used the time bin function to calculate the standard
deviation of bpm for each time bin:

for $i in dataset HRM

group by $sid := $i.sid, $gdate := $i.date, $gday := $i.day, $timebin :=
interval-bin($i.time, time("00:00:00"), day-time-duration("PT1M")) with $i

return {

"sid": $sid,

"gdate": $gdate,

"gday": $gday,

"timebin": $timebin,

"stdv": (avg(for $ii in $i return $ii.RR * $ii.RR) - avg(for $ii in $i
return $ii.RR) * avg(for $ii in $i return $ii.RR))^(0.5)};

Now I have two things I am hoping to do but need help with:

1. For each 1-min time bin, remove the bpm values that are above the top 5%
or below the bottom 5%. I thought about using the min/max function for a
few times to achieve this, but realized that it was not a good idea because
in each time bin, the number of instances are not always the same. So for
each 1-min time bin, we do need to calculate the 5% and 95% threshold, and
remove instances accordingly, which I don't know how to do.

2. After removing the outliers of bpm for each 1-min time bin, calculate a
median absolute deviation (MAD) for each 1-min time bin (as another measure
of variation besides the standard deviation). MAD =
median(abs(x-median(x)). I'm not sure how to write a query to do the median
function in Asterix.

Thank you so much in advance. Let me know if my questions are clear.

Yiran

-- 
Best,
Yiran

Re: Two query questions on time bin function

Posted by Mike Carey <dt...@gmail.com>.

How about trying...

..... avg(for $irrc in $i.RR_clipped return $irrc) .....

That could make the compiler happier, potentially.
(Right now it isn't "feeling the love" in terms of the type of avg's 
argument - not sure why count worked.)

Cheers,
Mike


On 2/22/16 2:03 PM, Ian Maxon wrote:
> Regarding the latest query about computing some summary statistics from the
> min-maxed data, Yiran and I just finished meeting and we were able to come
> up with a work around.
>
> The query was as follows:
> use dataverse mt16
>
> for $i in dataset HRM_binned_clipped return {
> "row_id": $i.row_id,
> "sid": $i.sid,
> "gdate": $i.gdate,
> "gday": $i.gday,
> "timebin": $i.timebin,
> "stdv_RR_clipped": avg($i.RR_clipped)
> }
>
> However, it would fail like this:
>
> Type of argument in function-call: asterix:avg, Args:[function-call:
> asterix:field-access-by-name, Args:[%0->$$0, AString: {RR_clipped}]] should
> be a collection type instead of ANY [AlgebricksException]
>
>
> Because HRM_binned_clipped was open and there is some sort of bug in the
> avg() function for this. count() works fine.
> The work around, is to just copy everything into a closed dataset, where
> RR_clipped is of type [double]. The query then works.
>
>
> Again though this is kind of a work-around, to a work around. The original
> query:
>
>
> declare function minmax($x){
> let $stdv := (avg(for $z in $x return $z*$z) - avg($x) * avg($x))^(0.5)
> for $y in $x
> where $y < (2*$stdv) + avg($x)
> and $y > avg($x) - (2*$stdv)
> return $y
> }
>
> for $i in dataset HRM
> group by $sid := $i.sid, $gdate := $i.date, $gday := $i.day, $timebin :=
> interval-bin($i.time, time("00:00:00"), day-time-duration("PT1M")) with $i
> return {
> "sid": $sid,
> "gdate": $gdate,
> "gday": $gday,
> "timebin": $timebin,
> "stdv": (avg(for $ii in minmax(for $jj in $i return $jj.RR) return $ii.RR *
> $ii.RR) - avg(for $ii in minmax(for $jj in $i return $jj.RR) return $ii.RR)
> * avg(for $ii in minmax(for $jj in $i return $jj.RR) return $ii.RR))^(0.5)
> }
>
> Expresses the same thing without any intermediate datasets. This query
> fails in compilation (only with the avgs added). I need to get the stack
> from the version that Yiran is running on though, I can't reproduce it on
> master; it fails in a different way.
>
>
> -Ian
>
>
> On Sun, Feb 21, 2016 at 2:55 PM, Mike Carey <dt...@gmail.com> wrote:
>
>> ARGH!!!!  This is what we would like you to *not* have to do.  Sorry...
>> Our aim is to be the Big Data antidote....
>>
>> @Yingyi:  Maybe you could take a quick peek at the query issue and
>> see if there is any low-hanging hope there?
>>
>> @Yiran:  How big are your windows, typically?  (Number of data points.)
>>
>> Cheers,
>> Mike
>>
>>
>> On 2/21/16 2:42 PM, Yiran Wang wrote:
>>
>>> Thank you Mike for your update and suggestions! And thank you Ian again
>>> for working with me.
>>>
>>> A little update from my end:
>>>
>>> I have been working on query (1) over the weekend for a work-around. I
>>> tried to simultaneously calculate the stdev on the new copy of the list of
>>> values with outliers removed, which in nature is the same as the query (2).
>>> So I ran into the same problem that the query did not compile.
>>>
>>> What I did was to export the entire dataset with the outliers removed
>>> into Excel and calculate the stdev in Excel. However, the entire dataset is
>>> now 363,466 x 200+ in dimension. Though they do not exceed the row x col
>>> limit in Excel, the memory of my computer is not big enough to do anything
>>> useful without crashing. So I've been breaking the dataset into smaller
>>> parts and working on each separately.
>>>
>>> Yiran
>>>
>>>
>>>
>>>
>>> On Sun, Feb 21, 2016 at 2:10 PM, Mike Carey <dtabass@gmail.com <mailto:
>>> dtabass@gmail.com>> wrote:
>>>
>>>      Ian,
>>>
>>>      Thanks working with Yiran on this!  I think there is "good" and
>>>      bad news w.r.t these queries:
>>>
>>>       - The bad news is that they go beyond what we are likely to
>>>      optimize at all well at present,
>>>          as they go beyond what typical DB aggregate functions like
>>>      min/max/avg/count/sum do.
>>>          (I would try forming the groups and then doing these things on
>>>      the groups, but saying
>>>          them in AQL will be tricky, and may lead to queries that hit
>>>      edge cases in the optimizer.
>>>          For some of these my thought was to try using a positional
>>>      variable within a group...?)
>>>
>>>       - The "good" news (only for AsterixDB) is that this is exactly
>>>      the sort of inspiration that we
>>>          are looking for in terms of understanding how to better for
>>>      query-based analytics in real
>>>          use cases (and this is a very real one!).
>>>
>>>      To quote a short paper I reviewed just this AM on SQL queries kind
>>>      of like these:  "Percentage
>>>      queries are more complex than their conventional counterparts and
>>>      introduce new challenges
>>>      for optimization."  (The paper didn't have an applicable solution
>>>      for us, sadly.)
>>>
>>>      A more general facility that I wish we could offer was to do
>>>      grouping in AsterixDB but then
>>>      have the ability to pass a group to (e.g.) R and then get results
>>>      back for the group.  When
>>>      groups are small-ish (like Yiran's windows) that would be pretty
>>>      cool - then one could do
>>>      all sorts of advanced things per group.
>>>
>>>      Cheers,
>>>      Mike
>>>
>>>      On 2/21/16 12:35 AM, Ian Maxon wrote:
>>>
>>>>      Yiran and I came up with possible answers for these...
>>>>      For 1) , a function could be used that looks something like this:
>>>>
>>>>      declare function minmax($x){
>>>>      let $stdv := (avg(for $z in $x return $z*$z) - avg($x) *
>>>> avg($x))^(0.5)
>>>>      for $y in $x
>>>>      where $y < (2*$stdv) + avg($x)
>>>>      and $y > avg($x) - (2*$stdv)
>>>>      return $y
>>>>      }
>>>>
>>>>
>>>>
>>>>      And then applied to return a new copy of the list of values,
>>>> removing ones
>>>>      that are outside of 2 stdev.
>>>>
>>>>      For 2), we also did come up with a potential solution ,but the query
>>>> fails
>>>>      to compile (Filed ashttps://
>>>> issues.apache.org/jira/browse/ASTERIXDB-1308  )
>>>>
>>>>
>>>>      Any thoughts on these queries would be welcome :) 1) especially seems
>>>>      inefficient to do as a function.
>>>>
>>>>      - Ian
>>>>
>>>>      On Fri, Feb 19, 2016 at 3:37 PM, Yiran Wang<wy...@gmail.com>
>>>> <ma...@gmail.com>  wrote:
>>>>
>>>>      Hi Asterix team,
>>>>>      I have two queries I'm struggling with. I'm hoping you could
>>>>> provide a
>>>>>      direction for me. Thanks in advance!
>>>>>
>>>>>      Here is what the data structure looks like:
>>>>>
>>>>>      create type HRMType as closed {
>>>>>
>>>>>         row_id: int32,
>>>>>
>>>>>         sid: int32,
>>>>>
>>>>>         date: date,
>>>>>
>>>>>         day: int32,
>>>>>
>>>>>         time: time,
>>>>>
>>>>>         bpm: int32,
>>>>>
>>>>>         RR: float
>>>>>
>>>>>      };
>>>>>
>>>>>      create dataset HRM (HRMType)
>>>>>
>>>>>      primary key row_id;
>>>>>
>>>>>
>>>>>      Previously I have used the time bin function to calculate the
>>>>> standard
>>>>>      deviation of bpm for each time bin:
>>>>>
>>>>>      for $i in dataset HRM
>>>>>
>>>>>      group by $sid := $i.sid, $gdate := $i.date, $gday := $i.day,
>>>>> $timebin :=
>>>>>      interval-bin($i.time, time("00:00:00"), day-time-duration("PT1M"))
>>>>> with $i
>>>>>
>>>>>      return {
>>>>>
>>>>>      "sid": $sid,
>>>>>
>>>>>      "gdate": $gdate,
>>>>>
>>>>>      "gday": $gday,
>>>>>
>>>>>      "timebin": $timebin,
>>>>>
>>>>>      "stdv": (avg(for $ii in $i return $ii.RR * $ii.RR) - avg(for $ii in
>>>>> $i
>>>>>      return $ii.RR) * avg(for $ii in $i return $ii.RR))^(0.5)};
>>>>>
>>>>>      Now I have two things I am hoping to do but need help with:
>>>>>
>>>>>      1. For each 1-min time bin, remove the bpm values that are above
>>>>> the top
>>>>>      5% or below the bottom 5%. I thought about using the min/max
>>>>> function for a
>>>>>      few times to achieve this, but realized that it was not a good idea
>>>>> because
>>>>>      in each time bin, the number of instances are not always the same.
>>>>> So for
>>>>>      each 1-min time bin, we do need to calculate the 5% and 95%
>>>>> threshold, and
>>>>>      remove instances accordingly, which I don't know how to do.
>>>>>
>>>>>      2. After removing the outliers of bpm for each 1-min time bin,
>>>>> calculate a
>>>>>      median absolute deviation (MAD) for each 1-min time bin (as another
>>>>> measure
>>>>>      of variation besides the standard deviation). MAD =
>>>>>      median(abs(x-median(x)). I'm not sure how to write a query to do
>>>>> the median
>>>>>      function in Asterix.
>>>>>
>>>>>      Thank you so much in advance. Let me know if my questions are clear.
>>>>>
>>>>>      Yiran
>>>>>
>>>>>      --
>>>>>      Best,
>>>>>      Yiran
>>>>>
>>>>>      --
>>>>>      You received this message because you are subscribed to the Google
>>>>> Groups
>>>>>      "asterixdb-dev" group.
>>>>>      To unsubscribe from this group and stop receiving emails from it,
>>>>> send an
>>>>>      email toasterixdb-dev+unsubscribe@googlegroups.com
>>>>>      <ma...@googlegroups.com>.
>>>>>      For more options, visithttps://groups.google.com/d/optout.
>>>>>
>>>>>
>>>
>>>
>>> --
>>> Best,
>>> Yiran
>>>
>>

Re: Two query questions on time bin function

Posted by Ian Maxon <im...@uci.edu>.

Regarding the latest query about computing some summary statistics from the
min-maxed data, Yiran and I just finished meeting and we were able to come
up with a work around.

The query was as follows:
use dataverse mt16

for $i in dataset HRM_binned_clipped return {
"row_id": $i.row_id,
"sid": $i.sid,
"gdate": $i.gdate,
"gday": $i.gday,
"timebin": $i.timebin,
"stdv_RR_clipped": avg($i.RR_clipped)
}

However, it would fail like this:

Type of argument in function-call: asterix:avg, Args:[function-call:
asterix:field-access-by-name, Args:[%0->$$0, AString: {RR_clipped}]] should
be a collection type instead of ANY [AlgebricksException]


Because HRM_binned_clipped was open and there is some sort of bug in the
avg() function for this. count() works fine.
The work around, is to just copy everything into a closed dataset, where
RR_clipped is of type [double]. The query then works.


Again though this is kind of a work-around, to a work around. The original
query:


declare function minmax($x){
let $stdv := (avg(for $z in $x return $z*$z) - avg($x) * avg($x))^(0.5)
for $y in $x
where $y < (2*$stdv) + avg($x)
and $y > avg($x) - (2*$stdv)
return $y
}

for $i in dataset HRM
group by $sid := $i.sid, $gdate := $i.date, $gday := $i.day, $timebin :=
interval-bin($i.time, time("00:00:00"), day-time-duration("PT1M")) with $i
return {
"sid": $sid,
"gdate": $gdate,
"gday": $gday,
"timebin": $timebin,
"stdv": (avg(for $ii in minmax(for $jj in $i return $jj.RR) return $ii.RR *
$ii.RR) - avg(for $ii in minmax(for $jj in $i return $jj.RR) return $ii.RR)
* avg(for $ii in minmax(for $jj in $i return $jj.RR) return $ii.RR))^(0.5)
}

Expresses the same thing without any intermediate datasets. This query
fails in compilation (only with the avgs added). I need to get the stack
from the version that Yiran is running on though, I can't reproduce it on
master; it fails in a different way.


-Ian


On Sun, Feb 21, 2016 at 2:55 PM, Mike Carey <dt...@gmail.com> wrote:

> ARGH!!!!  This is what we would like you to *not* have to do.  Sorry...
> Our aim is to be the Big Data antidote....
>
> @Yingyi:  Maybe you could take a quick peek at the query issue and
> see if there is any low-hanging hope there?
>
> @Yiran:  How big are your windows, typically?  (Number of data points.)
>
> Cheers,
> Mike
>
>
> On 2/21/16 2:42 PM, Yiran Wang wrote:
>
>> Thank you Mike for your update and suggestions! And thank you Ian again
>> for working with me.
>>
>> A little update from my end:
>>
>> I have been working on query (1) over the weekend for a work-around. I
>> tried to simultaneously calculate the stdev on the new copy of the list of
>> values with outliers removed, which in nature is the same as the query (2).
>> So I ran into the same problem that the query did not compile.
>>
>> What I did was to export the entire dataset with the outliers removed
>> into Excel and calculate the stdev in Excel. However, the entire dataset is
>> now 363,466 x 200+ in dimension. Though they do not exceed the row x col
>> limit in Excel, the memory of my computer is not big enough to do anything
>> useful without crashing. So I've been breaking the dataset into smaller
>> parts and working on each separately.
>>
>> Yiran
>>
>>
>>
>>
>> On Sun, Feb 21, 2016 at 2:10 PM, Mike Carey <dtabass@gmail.com <mailto:
>> dtabass@gmail.com>> wrote:
>>
>>     Ian,
>>
>>     Thanks working with Yiran on this!  I think there is "good" and
>>     bad news w.r.t these queries:
>>
>>      - The bad news is that they go beyond what we are likely to
>>     optimize at all well at present,
>>         as they go beyond what typical DB aggregate functions like
>>     min/max/avg/count/sum do.
>>         (I would try forming the groups and then doing these things on
>>     the groups, but saying
>>         them in AQL will be tricky, and may lead to queries that hit
>>     edge cases in the optimizer.
>>         For some of these my thought was to try using a positional
>>     variable within a group...?)
>>
>>      - The "good" news (only for AsterixDB) is that this is exactly
>>     the sort of inspiration that we
>>         are looking for in terms of understanding how to better for
>>     query-based analytics in real
>>         use cases (and this is a very real one!).
>>
>>     To quote a short paper I reviewed just this AM on SQL queries kind
>>     of like these:  "Percentage
>>     queries are more complex than their conventional counterparts and
>>     introduce new challenges
>>     for optimization."  (The paper didn't have an applicable solution
>>     for us, sadly.)
>>
>>     A more general facility that I wish we could offer was to do
>>     grouping in AsterixDB but then
>>     have the ability to pass a group to (e.g.) R and then get results
>>     back for the group.  When
>>     groups are small-ish (like Yiran's windows) that would be pretty
>>     cool - then one could do
>>     all sorts of advanced things per group.
>>
>>     Cheers,
>>     Mike
>>
>>     On 2/21/16 12:35 AM, Ian Maxon wrote:
>>
>>>     Yiran and I came up with possible answers for these...
>>>     For 1) , a function could be used that looks something like this:
>>>
>>>     declare function minmax($x){
>>>     let $stdv := (avg(for $z in $x return $z*$z) - avg($x) *
>>> avg($x))^(0.5)
>>>     for $y in $x
>>>     where $y < (2*$stdv) + avg($x)
>>>     and $y > avg($x) - (2*$stdv)
>>>     return $y
>>>     }
>>>
>>>
>>>
>>>     And then applied to return a new copy of the list of values,
>>> removing ones
>>>     that are outside of 2 stdev.
>>>
>>>     For 2), we also did come up with a potential solution ,but the query
>>> fails
>>>     to compile (Filed ashttps://
>>> issues.apache.org/jira/browse/ASTERIXDB-1308  )
>>>
>>>
>>>     Any thoughts on these queries would be welcome :) 1) especially seems
>>>     inefficient to do as a function.
>>>
>>>     - Ian
>>>
>>>     On Fri, Feb 19, 2016 at 3:37 PM, Yiran Wang<wy...@gmail.com>
>>> <ma...@gmail.com>  wrote:
>>>
>>>     Hi Asterix team,
>>>>
>>>>     I have two queries I'm struggling with. I'm hoping you could
>>>> provide a
>>>>     direction for me. Thanks in advance!
>>>>
>>>>     Here is what the data structure looks like:
>>>>
>>>>     create type HRMType as closed {
>>>>
>>>>        row_id: int32,
>>>>
>>>>        sid: int32,
>>>>
>>>>        date: date,
>>>>
>>>>        day: int32,
>>>>
>>>>        time: time,
>>>>
>>>>        bpm: int32,
>>>>
>>>>        RR: float
>>>>
>>>>     };
>>>>
>>>>     create dataset HRM (HRMType)
>>>>
>>>>     primary key row_id;
>>>>
>>>>
>>>>     Previously I have used the time bin function to calculate the
>>>> standard
>>>>     deviation of bpm for each time bin:
>>>>
>>>>     for $i in dataset HRM
>>>>
>>>>     group by $sid := $i.sid, $gdate := $i.date, $gday := $i.day,
>>>> $timebin :=
>>>>     interval-bin($i.time, time("00:00:00"), day-time-duration("PT1M"))
>>>> with $i
>>>>
>>>>     return {
>>>>
>>>>     "sid": $sid,
>>>>
>>>>     "gdate": $gdate,
>>>>
>>>>     "gday": $gday,
>>>>
>>>>     "timebin": $timebin,
>>>>
>>>>     "stdv": (avg(for $ii in $i return $ii.RR * $ii.RR) - avg(for $ii in
>>>> $i
>>>>     return $ii.RR) * avg(for $ii in $i return $ii.RR))^(0.5)};
>>>>
>>>>     Now I have two things I am hoping to do but need help with:
>>>>
>>>>     1. For each 1-min time bin, remove the bpm values that are above
>>>> the top
>>>>     5% or below the bottom 5%. I thought about using the min/max
>>>> function for a
>>>>     few times to achieve this, but realized that it was not a good idea
>>>> because
>>>>     in each time bin, the number of instances are not always the same.
>>>> So for
>>>>     each 1-min time bin, we do need to calculate the 5% and 95%
>>>> threshold, and
>>>>     remove instances accordingly, which I don't know how to do.
>>>>
>>>>     2. After removing the outliers of bpm for each 1-min time bin,
>>>> calculate a
>>>>     median absolute deviation (MAD) for each 1-min time bin (as another
>>>> measure
>>>>     of variation besides the standard deviation). MAD =
>>>>     median(abs(x-median(x)). I'm not sure how to write a query to do
>>>> the median
>>>>     function in Asterix.
>>>>
>>>>     Thank you so much in advance. Let me know if my questions are clear.
>>>>
>>>>     Yiran
>>>>
>>>>     --
>>>>     Best,
>>>>     Yiran
>>>>
>>>>     --
>>>>     You received this message because you are subscribed to the Google
>>>> Groups
>>>>     "asterixdb-dev" group.
>>>>     To unsubscribe from this group and stop receiving emails from it,
>>>> send an
>>>>     email toasterixdb-dev+unsubscribe@googlegroups.com
>>>>     <ma...@googlegroups.com>.
>>>>     For more options, visithttps://groups.google.com/d/optout.
>>>>
>>>>
>>
>>
>>
>> --
>> Best,
>> Yiran
>>
>
>

Re: Two query questions on time bin function

Posted by Mike Carey <dt...@gmail.com>.

Approaching Big Data!  :-)  Interesting....!
Q: Is each post-binning row a bin, meaning there are only ~86 rows per bin?
(Am I interpreting that correctly?)
If so, that's good news; materializing any one given bin shouldn't be a 
problem for our runtime, so maybe we can indeed get this to work in the 
short term.

Sorry for the hassles w/this.....!
Cheers,
Mike


On 2/22/16 7:09 AM, Yiran Wang wrote:
> Mike,
>
> The original dataset has 31132597 rows of records. After binning it 
> into 1-min time bin dataset, it has 363466 rows of records.
>
> Thanks,
> Yiran
>
> On Sun, Feb 21, 2016 at 2:55 PM, Mike Carey <dtabass@gmail.com 
> <ma...@gmail.com>> wrote:
>
>     ARGH!!!! This is what we would like you to *not* have to do. Sorry...
>     Our aim is to be the Big Data antidote....
>
>     @Yingyi:  Maybe you could take a quick peek at the query issue and
>     see if there is any low-hanging hope there?
>
>     @Yiran:  How big are your windows, typically?  (Number of data
>     points.)
>
>     Cheers,
>     Mike
>
>
>     On 2/21/16 2:42 PM, Yiran Wang wrote:
>>     Thank you Mike for your update and suggestions! And thank you Ian
>>     again for working with me.
>>
>>     A little update from my end:
>>
>>     I have been working on query (1) over the weekend for a
>>     work-around. I tried to simultaneously calculate the stdev on the
>>     new copy of the list of values with outliers removed, which in
>>     nature is the same as the query (2). So I ran into the same
>>     problem that the query did not compile.
>>
>>     What I did was to export the entire dataset with the outliers
>>     removed into Excel and calculate the stdev in Excel. However, the
>>     entire dataset is now 363,466 x 200+ in dimension. Though they do
>>     not exceed the row x col limit in Excel, the memory of my
>>     computer is not big enough to do anything useful without
>>     crashing. So I've been breaking the dataset into smaller parts
>>     and working on each separately.
>>
>>     Yiran
>>
>>
>>
>>     On Sun, Feb 21, 2016 at 2:10 PM, Mike Carey <dtabass@gmail.com
>>     <ma...@gmail.com>> wrote:
>>
>>         Ian,
>>
>>         Thanks working with Yiran on this!  I think there is "good"
>>         and bad news w.r.t these queries:
>>
>>          - The bad news is that they go beyond what we are likely to
>>         optimize at all well at present,
>>             as they go beyond what typical DB aggregate functions
>>         like min/max/avg/count/sum do.
>>             (I would try forming the groups and then doing these
>>         things on the groups, but saying
>>             them in AQL will be tricky, and may lead to queries that
>>         hit edge cases in the optimizer.
>>             For some of these my thought was to try using a
>>         positional variable within a group...?)
>>
>>          - The "good" news (only for AsterixDB) is that this is
>>         exactly the sort of inspiration that we
>>             are looking for in terms of understanding how to better
>>         for query-based analytics in real
>>             use cases (and this is a very real one!).
>>
>>         To quote a short paper I reviewed just this AM on SQL queries
>>         kind of like these:  "Percentage
>>         queries are more complex than their conventional counterparts
>>         and introduce new challenges
>>         for optimization."  (The paper didn't have an applicable
>>         solution for us, sadly.)
>>
>>         A more general facility that I wish we could offer was to do
>>         grouping in AsterixDB but then
>>         have the ability to pass a group to (e.g.) R and then get
>>         results back for the group.  When
>>         groups are small-ish (like Yiran's windows) that would be
>>         pretty cool - then one could do
>>         all sorts of advanced things per group.
>>
>>         Cheers,
>>         Mike
>>
>>         On 2/21/16 12:35 AM, Ian Maxon wrote:
>>>         Yiran and I came up with possible answers for these...
>>>         For 1) , a function could be used that looks something like this:
>>>
>>>         declare function minmax($x){
>>>         let $stdv := (avg(for $z in $x return $z*$z) - avg($x) * avg($x))^(0.5)
>>>         for $y in $x
>>>         where $y < (2*$stdv) + avg($x)
>>>         and $y > avg($x) - (2*$stdv)
>>>         return $y
>>>         }
>>>
>>>
>>>
>>>         And then applied to return a new copy of the list of values, removing ones
>>>         that are outside of 2 stdev.
>>>
>>>         For 2), we also did come up with a potential solution ,but the query fails
>>>         to compile (Filed ashttps://issues.apache.org/jira/browse/ASTERIXDB-1308  )
>>>
>>>
>>>         Any thoughts on these queries would be welcome :) 1) especially seems
>>>         inefficient to do as a function.
>>>
>>>         - Ian
>>>
>>>         On Fri, Feb 19, 2016 at 3:37 PM, Yiran Wang<wy...@gmail.com> <ma...@gmail.com>  wrote:
>>>
>>>>         Hi Asterix team,
>>>>
>>>>         I have two queries I'm struggling with. I'm hoping you could provide a
>>>>         direction for me. Thanks in advance!
>>>>
>>>>         Here is what the data structure looks like:
>>>>
>>>>         create type HRMType as closed {
>>>>
>>>>            row_id: int32,
>>>>
>>>>            sid: int32,
>>>>
>>>>            date: date,
>>>>
>>>>            day: int32,
>>>>
>>>>            time: time,
>>>>
>>>>            bpm: int32,
>>>>
>>>>            RR: float
>>>>
>>>>         };
>>>>
>>>>         create dataset HRM (HRMType)
>>>>
>>>>         primary key row_id;
>>>>
>>>>
>>>>         Previously I have used the time bin function to calculate the standard
>>>>         deviation of bpm for each time bin:
>>>>
>>>>         for $i in dataset HRM
>>>>
>>>>         group by $sid := $i.sid, $gdate := $i.date, $gday := $i.day, $timebin :=
>>>>         interval-bin($i.time, time("00:00:00"), day-time-duration("PT1M")) with $i
>>>>
>>>>         return {
>>>>
>>>>         "sid": $sid,
>>>>
>>>>         "gdate": $gdate,
>>>>
>>>>         "gday": $gday,
>>>>
>>>>         "timebin": $timebin,
>>>>
>>>>         "stdv": (avg(for $ii in $i return $ii.RR * $ii.RR) - avg(for $ii in $i
>>>>         return $ii.RR) * avg(for $ii in $i return $ii.RR))^(0.5)};
>>>>
>>>>         Now I have two things I am hoping to do but need help with:
>>>>
>>>>         1. For each 1-min time bin, remove the bpm values that are above the top
>>>>         5% or below the bottom 5%. I thought about using the min/max function for a
>>>>         few times to achieve this, but realized that it was not a good idea because
>>>>         in each time bin, the number of instances are not always the same. So for
>>>>         each 1-min time bin, we do need to calculate the 5% and 95% threshold, and
>>>>         remove instances accordingly, which I don't know how to do.
>>>>
>>>>         2. After removing the outliers of bpm for each 1-min time bin, calculate a
>>>>         median absolute deviation (MAD) for each 1-min time bin (as another measure
>>>>         of variation besides the standard deviation). MAD =
>>>>         median(abs(x-median(x)). I'm not sure how to write a query to do the median
>>>>         function in Asterix.
>>>>
>>>>         Thank you so much in advance. Let me know if my questions are clear.
>>>>
>>>>         Yiran
>>>>
>>>>         --
>>>>         Best,
>>>>         Yiran
>>>>
>>>>         --
>>>>         You received this message because you are subscribed to the Google Groups
>>>>         "asterixdb-dev" group.
>>>>         To unsubscribe from this group and stop receiving emails from it, send an
>>>>         email toasterixdb-dev+unsubscribe@googlegroups.com
>>>>         <ma...@googlegroups.com>.
>>>>         For more options, visithttps://groups.google.com/d/optout.
>>>>
>>
>>
>>
>>
>>     -- 
>>     Best,
>>     Yiran
>
>
>
>
> -- 
> Best,
> Yiran

Re: Two query questions on time bin function

Posted by Mike Carey <dt...@gmail.com>.

ARGH!!!!  This is what we would like you to *not* have to do.  Sorry...
Our aim is to be the Big Data antidote....

@Yingyi:  Maybe you could take a quick peek at the query issue and
see if there is any low-hanging hope there?

@Yiran:  How big are your windows, typically?  (Number of data points.)

Cheers,
Mike


On 2/21/16 2:42 PM, Yiran Wang wrote:
> Thank you Mike for your update and suggestions! And thank you Ian 
> again for working with me.
>
> A little update from my end:
>
> I have been working on query (1) over the weekend for a work-around. I 
> tried to simultaneously calculate the stdev on the new copy of the 
> list of values with outliers removed, which in nature is the same as 
> the query (2). So I ran into the same problem that the query did not 
> compile.
>
> What I did was to export the entire dataset with the outliers removed 
> into Excel and calculate the stdev in Excel. However, the entire 
> dataset is now 363,466 x 200+ in dimension. Though they do not exceed 
> the row x col limit in Excel, the memory of my computer is not big 
> enough to do anything useful without crashing. So I've been breaking 
> the dataset into smaller parts and working on each separately.
>
> Yiran
>
>
>
> On Sun, Feb 21, 2016 at 2:10 PM, Mike Carey <dtabass@gmail.com 
> <ma...@gmail.com>> wrote:
>
>     Ian,
>
>     Thanks working with Yiran on this!  I think there is "good" and
>     bad news w.r.t these queries:
>
>      - The bad news is that they go beyond what we are likely to
>     optimize at all well at present,
>         as they go beyond what typical DB aggregate functions like
>     min/max/avg/count/sum do.
>         (I would try forming the groups and then doing these things on
>     the groups, but saying
>         them in AQL will be tricky, and may lead to queries that hit
>     edge cases in the optimizer.
>         For some of these my thought was to try using a positional
>     variable within a group...?)
>
>      - The "good" news (only for AsterixDB) is that this is exactly
>     the sort of inspiration that we
>         are looking for in terms of understanding how to better for
>     query-based analytics in real
>         use cases (and this is a very real one!).
>
>     To quote a short paper I reviewed just this AM on SQL queries kind
>     of like these:  "Percentage
>     queries are more complex than their conventional counterparts and
>     introduce new challenges
>     for optimization."  (The paper didn't have an applicable solution
>     for us, sadly.)
>
>     A more general facility that I wish we could offer was to do
>     grouping in AsterixDB but then
>     have the ability to pass a group to (e.g.) R and then get results
>     back for the group.  When
>     groups are small-ish (like Yiran's windows) that would be pretty
>     cool - then one could do
>     all sorts of advanced things per group.
>
>     Cheers,
>     Mike
>
>     On 2/21/16 12:35 AM, Ian Maxon wrote:
>>     Yiran and I came up with possible answers for these...
>>     For 1) , a function could be used that looks something like this:
>>
>>     declare function minmax($x){
>>     let $stdv := (avg(for $z in $x return $z*$z) - avg($x) * avg($x))^(0.5)
>>     for $y in $x
>>     where $y < (2*$stdv) + avg($x)
>>     and $y > avg($x) - (2*$stdv)
>>     return $y
>>     }
>>
>>
>>
>>     And then applied to return a new copy of the list of values, removing ones
>>     that are outside of 2 stdev.
>>
>>     For 2), we also did come up with a potential solution ,but the query fails
>>     to compile (Filed ashttps://issues.apache.org/jira/browse/ASTERIXDB-1308  )
>>
>>
>>     Any thoughts on these queries would be welcome :) 1) especially seems
>>     inefficient to do as a function.
>>
>>     - Ian
>>
>>     On Fri, Feb 19, 2016 at 3:37 PM, Yiran Wang<wy...@gmail.com> <ma...@gmail.com>  wrote:
>>
>>>     Hi Asterix team,
>>>
>>>     I have two queries I'm struggling with. I'm hoping you could provide a
>>>     direction for me. Thanks in advance!
>>>
>>>     Here is what the data structure looks like:
>>>
>>>     create type HRMType as closed {
>>>
>>>        row_id: int32,
>>>
>>>        sid: int32,
>>>
>>>        date: date,
>>>
>>>        day: int32,
>>>
>>>        time: time,
>>>
>>>        bpm: int32,
>>>
>>>        RR: float
>>>
>>>     };
>>>
>>>     create dataset HRM (HRMType)
>>>
>>>     primary key row_id;
>>>
>>>
>>>     Previously I have used the time bin function to calculate the standard
>>>     deviation of bpm for each time bin:
>>>
>>>     for $i in dataset HRM
>>>
>>>     group by $sid := $i.sid, $gdate := $i.date, $gday := $i.day, $timebin :=
>>>     interval-bin($i.time, time("00:00:00"), day-time-duration("PT1M")) with $i
>>>
>>>     return {
>>>
>>>     "sid": $sid,
>>>
>>>     "gdate": $gdate,
>>>
>>>     "gday": $gday,
>>>
>>>     "timebin": $timebin,
>>>
>>>     "stdv": (avg(for $ii in $i return $ii.RR * $ii.RR) - avg(for $ii in $i
>>>     return $ii.RR) * avg(for $ii in $i return $ii.RR))^(0.5)};
>>>
>>>     Now I have two things I am hoping to do but need help with:
>>>
>>>     1. For each 1-min time bin, remove the bpm values that are above the top
>>>     5% or below the bottom 5%. I thought about using the min/max function for a
>>>     few times to achieve this, but realized that it was not a good idea because
>>>     in each time bin, the number of instances are not always the same. So for
>>>     each 1-min time bin, we do need to calculate the 5% and 95% threshold, and
>>>     remove instances accordingly, which I don't know how to do.
>>>
>>>     2. After removing the outliers of bpm for each 1-min time bin, calculate a
>>>     median absolute deviation (MAD) for each 1-min time bin (as another measure
>>>     of variation besides the standard deviation). MAD =
>>>     median(abs(x-median(x)). I'm not sure how to write a query to do the median
>>>     function in Asterix.
>>>
>>>     Thank you so much in advance. Let me know if my questions are clear.
>>>
>>>     Yiran
>>>
>>>     --
>>>     Best,
>>>     Yiran
>>>
>>>     --
>>>     You received this message because you are subscribed to the Google Groups
>>>     "asterixdb-dev" group.
>>>     To unsubscribe from this group and stop receiving emails from it, send an
>>>     email toasterixdb-dev+unsubscribe@googlegroups.com
>>>     <ma...@googlegroups.com>.
>>>     For more options, visithttps://groups.google.com/d/optout.
>>>
>
>
>
>
> -- 
> Best,
> Yiran

Re: Two query questions on time bin function

Posted by Mike Carey <dt...@gmail.com>.

Ian,

Thanks working with Yiran on this!  I think there is "good" and bad news 
w.r.t these queries:

  - The bad news is that they go beyond what we are likely to optimize 
at all well at present,
     as they go beyond what typical DB aggregate functions like 
min/max/avg/count/sum do.
     (I would try forming the groups and then doing these things on the 
groups, but saying
     them in AQL will be tricky, and may lead to queries that hit edge 
cases in the optimizer.
     For some of these my thought was to try using a positional variable 
within a group...?)

  - The "good" news (only for AsterixDB) is that this is exactly the 
sort of inspiration that we
     are looking for in terms of understanding how to better for 
query-based analytics in real
     use cases (and this is a very real one!).

To quote a short paper I reviewed just this AM on SQL queries kind of 
like these:  "Percentage
queries are more complex than their conventional counterparts and 
introduce new challenges
for optimization."  (The paper didn't have an applicable solution for 
us, sadly.)

A more general facility that I wish we could offer was to do grouping in 
AsterixDB but then
have the ability to pass a group to (e.g.) R and then get results back 
for the group.  When
groups are small-ish (like Yiran's windows) that would be pretty cool - 
then one could do
all sorts of advanced things per group.

Cheers,
Mike

On 2/21/16 12:35 AM, Ian Maxon wrote:
> Yiran and I came up with possible answers for these...
> For 1) , a function could be used that looks something like this:
>
> declare function minmax($x){
> let $stdv := (avg(for $z in $x return $z*$z) - avg($x) * avg($x))^(0.5)
> for $y in $x
> where $y < (2*$stdv) + avg($x)
> and $y > avg($x) - (2*$stdv)
> return $y
> }
>
>
>
> And then applied to return a new copy of the list of values, removing ones
> that are outside of 2 stdev.
>
> For 2), we also did come up with a potential solution ,but the query fails
> to compile (Filed as https://issues.apache.org/jira/browse/ASTERIXDB-1308 )
>
>
> Any thoughts on these queries would be welcome :) 1) especially seems
> inefficient to do as a function.
>
> - Ian
>
> On Fri, Feb 19, 2016 at 3:37 PM, Yiran Wang <wy...@gmail.com> wrote:
>
>> Hi Asterix team,
>>
>> I have two queries I'm struggling with. I'm hoping you could provide a
>> direction for me. Thanks in advance!
>>
>> Here is what the data structure looks like:
>>
>> create type HRMType as closed {
>>
>>    row_id: int32,
>>
>>    sid: int32,
>>
>>    date: date,
>>
>>    day: int32,
>>
>>    time: time,
>>
>>    bpm: int32,
>>
>>    RR: float
>>
>> };
>>
>> create dataset HRM (HRMType)
>>
>> primary key row_id;
>>
>>
>> Previously I have used the time bin function to calculate the standard
>> deviation of bpm for each time bin:
>>
>> for $i in dataset HRM
>>
>> group by $sid := $i.sid, $gdate := $i.date, $gday := $i.day, $timebin :=
>> interval-bin($i.time, time("00:00:00"), day-time-duration("PT1M")) with $i
>>
>> return {
>>
>> "sid": $sid,
>>
>> "gdate": $gdate,
>>
>> "gday": $gday,
>>
>> "timebin": $timebin,
>>
>> "stdv": (avg(for $ii in $i return $ii.RR * $ii.RR) - avg(for $ii in $i
>> return $ii.RR) * avg(for $ii in $i return $ii.RR))^(0.5)};
>>
>> Now I have two things I am hoping to do but need help with:
>>
>> 1. For each 1-min time bin, remove the bpm values that are above the top
>> 5% or below the bottom 5%. I thought about using the min/max function for a
>> few times to achieve this, but realized that it was not a good idea because
>> in each time bin, the number of instances are not always the same. So for
>> each 1-min time bin, we do need to calculate the 5% and 95% threshold, and
>> remove instances accordingly, which I don't know how to do.
>>
>> 2. After removing the outliers of bpm for each 1-min time bin, calculate a
>> median absolute deviation (MAD) for each 1-min time bin (as another measure
>> of variation besides the standard deviation). MAD =
>> median(abs(x-median(x)). I'm not sure how to write a query to do the median
>> function in Asterix.
>>
>> Thank you so much in advance. Let me know if my questions are clear.
>>
>> Yiran
>>
>> --
>> Best,
>> Yiran
>>
>> --
>> You received this message because you are subscribed to the Google Groups
>> "asterixdb-dev" group.
>> To unsubscribe from this group and stop receiving emails from it, send an
>> email to asterixdb-dev+unsubscribe@googlegroups.com.
>> For more options, visit https://groups.google.com/d/optout.
>>

Re: Two query questions on time bin function

Posted by Ian Maxon <im...@uci.edu>.

Yiran and I came up with possible answers for these...
For 1) , a function could be used that looks something like this:

declare function minmax($x){
let $stdv := (avg(for $z in $x return $z*$z) - avg($x) * avg($x))^(0.5)
for $y in $x
where $y < (2*$stdv) + avg($x)
and $y > avg($x) - (2*$stdv)
return $y
}



And then applied to return a new copy of the list of values, removing ones
that are outside of 2 stdev.

For 2), we also did come up with a potential solution ,but the query fails
to compile (Filed as https://issues.apache.org/jira/browse/ASTERIXDB-1308 )


Any thoughts on these queries would be welcome :) 1) especially seems
inefficient to do as a function.

- Ian

On Fri, Feb 19, 2016 at 3:37 PM, Yiran Wang <wy...@gmail.com> wrote:

> Hi Asterix team,
>
> I have two queries I'm struggling with. I'm hoping you could provide a
> direction for me. Thanks in advance!
>
> Here is what the data structure looks like:
>
> create type HRMType as closed {
>
>   row_id: int32,
>
>   sid: int32,
>
>   date: date,
>
>   day: int32,
>
>   time: time,
>
>   bpm: int32,
>
>   RR: float
>
> };
>
> create dataset HRM (HRMType)
>
> primary key row_id;
>
>
> Previously I have used the time bin function to calculate the standard
> deviation of bpm for each time bin:
>
> for $i in dataset HRM
>
> group by $sid := $i.sid, $gdate := $i.date, $gday := $i.day, $timebin :=
> interval-bin($i.time, time("00:00:00"), day-time-duration("PT1M")) with $i
>
> return {
>
> "sid": $sid,
>
> "gdate": $gdate,
>
> "gday": $gday,
>
> "timebin": $timebin,
>
> "stdv": (avg(for $ii in $i return $ii.RR * $ii.RR) - avg(for $ii in $i
> return $ii.RR) * avg(for $ii in $i return $ii.RR))^(0.5)};
>
> Now I have two things I am hoping to do but need help with:
>
> 1. For each 1-min time bin, remove the bpm values that are above the top
> 5% or below the bottom 5%. I thought about using the min/max function for a
> few times to achieve this, but realized that it was not a good idea because
> in each time bin, the number of instances are not always the same. So for
> each 1-min time bin, we do need to calculate the 5% and 95% threshold, and
> remove instances accordingly, which I don't know how to do.
>
> 2. After removing the outliers of bpm for each 1-min time bin, calculate a
> median absolute deviation (MAD) for each 1-min time bin (as another measure
> of variation besides the standard deviation). MAD =
> median(abs(x-median(x)). I'm not sure how to write a query to do the median
> function in Asterix.
>
> Thank you so much in advance. Let me know if my questions are clear.
>
> Yiran
>
> --
> Best,
> Yiran
>
> --
> You received this message because you are subscribed to the Google Groups
> "asterixdb-dev" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to asterixdb-dev+unsubscribe@googlegroups.com.
> For more options, visit https://groups.google.com/d/optout.
>