You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@asterixdb.apache.org by Mike Carey <dt...@gmail.com> on 2019/09/02 16:34:51 UTC

Re: Hyracks apriori algorithm

Igor,

Not 100% sure what you mean by "filter" in this case - but - it sounds 
like you need to think about how to express Apriori as a sequence of 
operations that explicitly includes counted aggregation followed by 
normal filtering on the count field?  At least to get something 
working?  Limit doesn't feel like the right approach here; it's more of 
an important operator when you are trying to stop the flow of unwanted 
"lower down" data.  Think about how you might express Apriori in the 
extended relational algebra...?

Cheers,

Mike

On 8/30/19 8:28 AM, Ígor Chagas wrote:
> Hi, devs
>
> I started to work with Hyracks a few months ago. I tested several
> variations on WordCount code to learn about its paradigm. Now, I'm trying
> to implement Apriori Algorithm on it to do some studies.
>
> In some step of this algorithm, I need to filter tuples that have a number
> of occurrences higher than N. However, I did not find a proper filter
> operator. I tried to use the LimitOperator¹ as a template since it is
> simple and deals directly with tuples. However, I didn't figure out how to
> compare some tuple field values to a specific Apriori parameter, as support.
>
> Do you guys have some advice or a starting point to implement a filter
> operator in this situation? Is there something I'd missed?
>
> Thanks and Regards.
>
> Ígor Chagas Marques
>
> [1] -> dataflow/std/misc/LimitOperatorDescriptor.java
>

Re: Hyracks apriori algorithm

Posted by Ted Dunning <te...@gmail.com>.
Count followed by normal filtering is exactly what is needed for this kind
of operation. In fact, though, most uses of the Apriori algorithm also
require repeated passes of cooccurrence counting which is usually best
described in SQL terms as a cartesian self-join, possibly pre-limited by a
minimum frequency pass.

Thus, something like this is usually needed, assuming x is a table of
(event, tag) pairs:

with
    dedupe as select event, tag from x group by event, tag,
    counts as select count(1) cnt, tag from dedupe group by tag,
    common as select tag, event from counts join x on tag where cnt > 3,
    cooc as select a.tag tagA, b.tag tagB, count(1) cnt
                 from common as a join common as b on event
                 group by a.tag, b.tag
       ...

From here, Apriori has what it needs for a single pass.

Diverting from anything related much to hyracks for a moment to ground on
which I am much more solid, this approach is, in fact,  disastrously slow
because it has cost quadratic in the number of unique tags on each event.
It also isn't even quite what you need since Apriori doesn't have a very
sophisticated notion of which items are seen interestingly often together
and thus has very bad problems when very common items co-occur. A better
approach is to use something like a log-likelihood ratio test such as the
G^2 statistics [1,2] to determine which co-occurrences are interesting.
There are downsampling techniques as well that can contain the problem of
quadratic explosion [3, 4] while retaining useful statistical properties
and it is possible to implement these techniques in a streaming context. It
is also possible to do this in SQL, with some pretty large queries.

In real world applications, it is often not necessary to do anything more
than link single items, although scored co-occurrence can be used to do so
[5, 6].

[1] Accurate Methods for the Statistics of Surprise and Coincidence
https://aclweb.org/anthology/J93-1003
[2] Surprise and Coincidence
http://tdunning.blogspot.com/2008/03/surprise-and-coincidence.html
[3] Efficient Incremental Cooccurrence Analysis for Item-Based
Collaborative Filtering
https://www.researchgate.net/publication/334903217_Efficient_Incremental_Cooccurrence_Analysis_for_Item-Based_Collaborative_Filtering
[4] Fully real-time recommendation
https://mapr.com/resources/videos/fully-real-time-recommendation-ted-dunning-sf-data-mining/
[5] Finding Structure in Text, Genome and Other Symbolic Sequences
https://arxiv.org/abs/1207.1847
[6] Practical Machine Learning: Innovations in Recommendation
https://mapr.com/practical-machine-learning/

On Mon, Sep 2, 2019 at 9:34 AM Mike Carey <dt...@gmail.com> wrote:

> Igor,
>
> Not 100% sure what you mean by "filter" in this case - but - it sounds
> like you need to think about how to express Apriori as a sequence of
> operations that explicitly includes counted aggregation followed by
> normal filtering on the count field?  At least to get something
> working?  Limit doesn't feel like the right approach here; it's more of
> an important operator when you are trying to stop the flow of unwanted
> "lower down" data.  Think about how you might express Apriori in the
> extended relational algebra...?
>
> Cheers,
>
> Mike
>
> On 8/30/19 8:28 AM, Ígor Chagas wrote:
> > Hi, devs
> >
> > I started to work with Hyracks a few months ago. I tested several
> > variations on WordCount code to learn about its paradigm. Now, I'm trying
> > to implement Apriori Algorithm on it to do some studies.
> >
> > In some step of this algorithm, I need to filter tuples that have a
> number
> > of occurrences higher than N. However, I did not find a proper filter
> > operator. I tried to use the LimitOperator¹ as a template since it is
> > simple and deals directly with tuples. However, I didn't figure out how
> to
> > compare some tuple field values to a specific Apriori parameter, as
> support.
> >
> > Do you guys have some advice or a starting point to implement a filter
> > operator in this situation? Is there something I'd missed?
> >
> > Thanks and Regards.
> >
> > Ígor Chagas Marques
> >
> > [1] -> dataflow/std/misc/LimitOperatorDescriptor.java
> >
>