Posted to issues@madlib.apache.org by "Frank McQuillan (JIRA)" <ji...@apache.org> on 2017/01/17 19:34:26 UTC

[jira] [Commented] (MADLIB-1056) Add filtering options to Apriori to improve performance

    [ https://issues.apache.org/jira/browse/MADLIB-1056?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15826667#comment-15826667 ] 

Frank McQuillan commented on MADLIB-1056:
-----------------------------------------

[~njayaram] made the following comments to me on this topic:

Item filtering for the LHS/RHS can be made quite complex. The simplest approach is a comma-separated list of exact item strings to include in the LHS/RHS of a rule.
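
As a rough illustration of the simplest approach, here is a Python sketch (not MADlib code). The function names are hypothetical, and the semantics are an assumption: a rule is kept if each non-empty filter shares at least one exact item with the corresponding side of the rule.

```python
def parse_item_list(spec):
    """Split a comma-separated string into a set of exact item names."""
    return {item.strip() for item in spec.split(",") if item.strip()}


def rule_passes(lhs, rhs, lhs_filter="", rhs_filter=""):
    """Return True if the rule lhs => rhs satisfies both filters.

    Assumed semantics: an empty filter accepts everything; a non-empty
    filter requires at least one listed item on that side of the rule.
    """
    lhs_wanted = parse_item_list(lhs_filter)
    rhs_wanted = parse_item_list(rhs_filter)
    lhs_ok = not lhs_wanted or bool(lhs_wanted & set(lhs))
    rhs_ok = not rhs_wanted or bool(rhs_wanted & set(rhs))
    return lhs_ok and rhs_ok
```

Other semantics (for example, requiring every listed item to appear) would only change the two set comparisons.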

Complexity can be increased by supporting regex, and possibly a "~" operator to say we only want rules that do NOT contain the specified items. The complexity of the story will depend on these requirements, although the regex requirement shouldn't change it much (I am not sure about the NOT operator; I didn't explore it much since it was not part of the original requirement).

I couldn't identify obvious improvements to the existing SQL code. Contrary to my initial thoughts, the existing SQL code does perform apriori-based frequent itemset generation. An obvious suggestion would be to rewrite the frequent itemset generation in C++ (the rule generation part is already in C++). But I really cannot say whether that would truly outperform the existing implementation, and if it does, I am not sure it is worth the effort (mainly because I don't know how much improvement we can actually gain by using less SQL code).

There seems to be some room for improvement in the C++ code for rule generation. I think the current code blindly emits all possible rules (every way of splitting a frequent itemset into LHS and RHS), and the rules are then pruned by their confidence. We could certainly prune more carefully there. A faster way is to construct the rules from a frequent itemset by applying the apriori idea again: if a rule fails the confidence threshold, any rule from the same itemset with a larger RHS containing that RHS will fail as well, so it need not be generated. This would require more effort.
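
To make the apriori-style rule generation concrete, here is a Python sketch (not the MADlib C++ code). It grows the RHS level by level and only extends RHS sets that met the confidence threshold. The function name, the `support` dictionary layout, and the example counts are all assumptions for illustration.

```python
from itertools import combinations


def generate_rules(itemset, support, min_confidence):
    """Generate rules (lhs, rhs, confidence) from one frequent itemset.

    `support` maps frozensets to support counts and is assumed to contain
    the itemset and all of its non-empty subsets.
    """
    itemset = frozenset(itemset)
    rules = []
    consequents = [frozenset([item]) for item in itemset]  # RHS of size 1
    k = 1
    while consequents and k < len(itemset):
        survivors = []
        for rhs in consequents:
            lhs = itemset - rhs
            conf = support[itemset] / support[lhs]
            if conf >= min_confidence:
                rules.append((lhs, rhs, conf))
                survivors.append(rhs)
        # Build size k+1 candidates only from surviving size k consequents;
        # a candidate is kept only if every size k subset survived.
        survivor_set = set(survivors)
        next_level = set()
        for a, b in combinations(survivors, 2):
            cand = a | b
            if len(cand) == k + 1 and all(
                frozenset(s) in survivor_set for s in combinations(cand, k)
            ):
                next_level.add(cand)
        consequents = list(next_level)
        k += 1
    return rules


# Hypothetical support counts over 5 transactions.
support = {
    frozenset("a"): 4, frozenset("b"): 4, frozenset("c"): 2,
    frozenset("ab"): 3, frozenset("ac"): 2, frozenset("bc"): 2,
    frozenset("abc"): 2,
}
rules = generate_rules("abc", support, min_confidence=0.7)
```

In this example the rule {a, b} => {c} fails the threshold (confidence 2/3), so no rule with a larger RHS containing c is ever evaluated.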

We must change the interface to support this feature.

> Add filtering options to Apriori to improve performance
> -------------------------------------------------------
>
>                 Key: MADLIB-1056
>                 URL: https://issues.apache.org/jira/browse/MADLIB-1056
>             Project: Apache MADlib
>          Issue Type: Improvement
>          Components: Module: Association Rules
>            Reporter: Frank McQuillan
>             Fix For: v2.0
>
>
> Consider adding something like a WHERE clause for the LHS and RHS in order to reduce execution time. The filtered-out transactions must still be present for the support and confidence computation. (That is, you can't filter them out ahead of time because that would skew the support and confidence values.)
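
A small Python sketch of why the transactions cannot be filtered ahead of time. The data and function names are made up for illustration; support is computed as the fraction of transactions containing the items.

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    return sum(1 for t in transactions if items <= set(t)) / len(transactions)


def confidence(transactions, lhs, rhs):
    """support(lhs U rhs) / support(lhs)."""
    return support(transactions, set(lhs) | set(rhs)) / support(transactions, lhs)


transactions = [
    {"beer", "chips"},
    {"beer", "chips"},
    {"beer"},
    {"chips"},
    {"milk"},
]

# Suppose we only want rules with "chips" on the RHS.
# Computed over all transactions: confidence(beer => chips) is 2/3.
full_conf = confidence(transactions, {"beer"}, {"chips"})

# Dropping transactions without "chips" first inflates the result to 1.0,
# because the transaction with beer but no chips has disappeared.
filtered = [t for t in transactions if "chips" in t]
skewed_conf = confidence(filtered, {"beer"}, {"chips"})
```

The filter therefore has to restrict which rules are generated or reported, not which transactions are counted.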



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)