Posted to dev@mahout.apache.org by "Jaroslaw Odzga (JIRA)" <ji...@apache.org> on 2011/03/11 16:24:59 UTC

[jira] Created: (MAHOUT-625) Some of generated patterns have support higher than in reality

Some of generated patterns have support higher than in reality
--------------------------------------------------------------

                 Key: MAHOUT-625
                 URL: https://issues.apache.org/jira/browse/MAHOUT-625
             Project: Mahout
          Issue Type: Bug
          Components: Frequent Itemset/Association Rule Mining
    Affects Versions: 0.4
            Reporter: Jaroslaw Odzga
            Priority: Critical


It turns out that some of the generated patterns have incorrect support: the returned support is slightly higher than the true one.
I attached a test that demonstrates the bug in FPGrowth. The test uses the retail data set found here: http://fimi.ua.ac.be/data/
The pattern (36, 39, 41) occurs in the transactions 572 times (the test also computes this count), but FPGrowth returns the pattern (36, 39, 41) with support 573.

Please note that this pattern is not the only one with incorrect support; the test points out just one example to have something to focus on. There are plenty more patterns with support higher than the real one. The biggest difference I noticed was a support 8 higher than the real one for one of the patterns.

Please find the failing unit test attached. It is actually a Maven project that contains the test data and is ready to run.
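
A minimal sketch of how a "true" support count like the 572 above can be checked independently of FPGrowth: scan every transaction and count those that contain all items of the pattern. The class and method names (SupportCheck, support, setOf) are hypothetical and are not taken from the attached test, and the data is a toy example rather than the retail data set.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Brute-force support count: one pass over all transactions (hypothetical helper). */
final class SupportCheck {

  /** Convenience factory for a set of item ids. */
  static Set<Integer> setOf(Integer... items) {
    return new HashSet<Integer>(Arrays.asList(items));
  }

  /** Returns the number of transactions that contain every item of the pattern. */
  static long support(List<Set<Integer>> transactions, Set<Integer> pattern) {
    long count = 0;
    for (Set<Integer> transaction : transactions) {
      if (transaction.containsAll(pattern)) {
        count++;
      }
    }
    return count;
  }

  public static void main(String[] args) {
    // Toy transactions, NOT the retail data set.
    List<Set<Integer>> transactions = new ArrayList<Set<Integer>>();
    transactions.add(setOf(36, 39, 41, 48));
    transactions.add(setOf(36, 39));
    transactions.add(setOf(39, 41, 36));
    transactions.add(setOf(41, 48));

    // Two of the four toy transactions contain all of 36, 39 and 41.
    System.out.println(support(transactions, setOf(36, 39, 41)));
  }
}
```

A scan like this is slow compared to FPGrowth, but it is easy to trust, which makes it a reasonable oracle for a single pattern.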

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] Updated: (MAHOUT-625) Some of generated patterns have support higher than in reality

Posted by "Jaroslaw Odzga (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAHOUT-625?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jaroslaw Odzga updated MAHOUT-625:
----------------------------------

    Attachment: dataset_ok.txt

Attached the email conversation with the author of the data set used in the test, in which he gave his OK for using the data in Mahout.
I believe this closes the question of whether the data set can be used.

> Some of generated patterns have support higher than in reality
> --------------------------------------------------------------
>
>                 Key: MAHOUT-625
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-625
>             Project: Mahout
>          Issue Type: Bug
>          Components: Frequent Itemset/Association Rule Mining
>    Affects Versions: 0.4
>            Reporter: Jaroslaw Odzga
>            Priority: Critical
>         Attachments: MAHOUT-625-patch.txt, bugfix-patch.txt, dataset_ok.txt, mahout-test.zip

[jira] Commented: (MAHOUT-625) Some of generated patterns have support higher than in reality

Posted by "Robin Anil (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-625?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13006242#comment-13006242 ] 

Robin Anil commented on MAHOUT-625:
-----------------------------------

Small nit: you can use the Mahout Eclipse formatter, available on the How To Contribute page (on the Mahout wiki), for newly submitted code.

> Some of generated patterns have support higher than in reality
> --------------------------------------------------------------
>
>                 Key: MAHOUT-625
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-625
>             Project: Mahout
>          Issue Type: Bug
>          Components: Frequent Itemset/Association Rule Mining
>    Affects Versions: 0.4
>            Reporter: Jaroslaw Odzga
>            Priority: Critical
>         Attachments: MAHOUT-625-patch.txt, mahout-test.zip

[jira] [Updated] (MAHOUT-625) Some of generated patterns have support higher than in reality

Posted by "niu (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAHOUT-625?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

niu updated MAHOUT-625:
-----------------------

    Attachment: FPGrowth.java

I modified the source following the MAHOUT-625-patch.txt patch, but I find that it becomes slower than the original implementation in version 0.4 on my test dataset, in which every transaction is very long (about 1000 columns per transaction).

> Some of generated patterns have support higher than in reality
> --------------------------------------------------------------
>
>                 Key: MAHOUT-625
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-625
>             Project: Mahout
>          Issue Type: Bug
>          Components: Frequent Itemset/Association Rule Mining
>    Affects Versions: 0.4
>            Reporter: Jaroslaw Odzga
>            Assignee: Robin Anil
>             Fix For: 0.5
>
>         Attachments: FPGrowth.java, MAHOUT-625-patch.txt, bugfix-patch.txt, dataset_ok.txt, final_patch_with_bug_fix_test_and_the_dataset.txt, mahout-test.zip

[jira] [Commented] (MAHOUT-625) Some of generated patterns have support higher than in reality

Posted by "Jaroslaw Odzga (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-625?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13009093#comment-13009093 ] 

Jaroslaw Odzga commented on MAHOUT-625:
---------------------------------------

Created a separate issue for the performance improvement:
https://issues.apache.org/jira/browse/MAHOUT-629


> Some of generated patterns have support higher than in reality
> --------------------------------------------------------------
>
>                 Key: MAHOUT-625
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-625
>             Project: Mahout
>          Issue Type: Bug
>          Components: Frequent Itemset/Association Rule Mining
>    Affects Versions: 0.4
>            Reporter: Jaroslaw Odzga
>            Priority: Critical
>         Attachments: MAHOUT-625-patch.txt, bugfix-patch.txt, dataset_ok.txt, final_patch_with_bug_fix_test_and_the_dataset.txt, mahout-test.zip

[jira] Updated: (MAHOUT-625) Some of generated patterns have support higher than in reality

Posted by "Jaroslaw Odzga (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAHOUT-625?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jaroslaw Odzga updated MAHOUT-625:
----------------------------------

    Attachment: mahout-test.zip

Attached test in zip archive

> Some of generated patterns have support higher than in reality
> --------------------------------------------------------------
>
>                 Key: MAHOUT-625
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-625
>             Project: Mahout
>          Issue Type: Bug
>          Components: Frequent Itemset/Association Rule Mining
>    Affects Versions: 0.4
>            Reporter: Jaroslaw Odzga
>            Priority: Critical
>         Attachments: mahout-test.7z, mahout-test.zip

[jira] [Commented] (MAHOUT-625) Some of generated patterns have support higher than in reality

Posted by "niu (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-625?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13049660#comment-13049660 ] 

niu commented on MAHOUT-625:
----------------------------

It shows as resolved in version 0.5, but I don't see the related code from the patch in version 0.5.

For example, in the growth function of FPGrowth.java:

    while (i < headerTableCount) {
      ...
      if (attribute == currentAttribute) {
        // branch for currentAttribute
      } else {
        // branch for every other attribute
      }
    }

I think we only need to compute the currentAttribute case; the other attributes should be handled by other reduce tasks.

What do you think?

> Some of generated patterns have support higher than in reality
> --------------------------------------------------------------
>
>                 Key: MAHOUT-625
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-625
>             Project: Mahout
>          Issue Type: Bug
>          Components: Frequent Itemset/Association Rule Mining
>    Affects Versions: 0.4
>            Reporter: Jaroslaw Odzga
>            Assignee: Robin Anil
>             Fix For: 0.5
>
>         Attachments: MAHOUT-625-patch.txt, bugfix-patch.txt, dataset_ok.txt, final_patch_with_bug_fix_test_and_the_dataset.txt, mahout-test.zip

[jira] Commented: (MAHOUT-625) Some of generated patterns have support higher than in reality

Posted by "Robin Anil (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-625?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13006892#comment-13006892 ] 

Robin Anil commented on MAHOUT-625:
-----------------------------------

Jaroslaw, can you submit a final patch with the bug fix, the test, and the dataset? Also, please move the optimization over to a new issue; I want to test it more and maybe add a command-line flag to enable it. The rest looks fine to commit.

Thanks again for taking the initiative on getting the dataset in. And kudos for the fix. It was not easy to figure out.

> Some of generated patterns have support higher than in reality
> --------------------------------------------------------------
>
>                 Key: MAHOUT-625
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-625
>             Project: Mahout
>          Issue Type: Bug
>          Components: Frequent Itemset/Association Rule Mining
>    Affects Versions: 0.4
>            Reporter: Jaroslaw Odzga
>            Priority: Critical
>         Attachments: MAHOUT-625-patch.txt, bugfix-patch.txt, dataset_ok.txt, mahout-test.zip

[jira] Commented: (MAHOUT-625) Some of generated patterns have support higher than in reality

Posted by "Ted Dunning (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-625?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13005770#comment-13005770 ] 

Ted Dunning commented on MAHOUT-625:
------------------------------------

Can you use a portable/standard compressor for this attachment?  7-zip is not widely used.

Try zip or tar.gz.  Or bzip.

> Some of generated patterns have support higher than in reality
> --------------------------------------------------------------
>
>                 Key: MAHOUT-625
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-625
>             Project: Mahout
>          Issue Type: Bug
>          Components: Frequent Itemset/Association Rule Mining
>    Affects Versions: 0.4
>            Reporter: Jaroslaw Odzga
>            Priority: Critical
>         Attachments: mahout-test.7z

[jira] Commented: (MAHOUT-625) Some of generated patterns have support higher than in reality

Posted by "Jaroslaw Odzga (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-625?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13006196#comment-13006196 ] 

Jaroslaw Odzga commented on MAHOUT-625:
---------------------------------------

Hi,

After a bit of analysis I found the problem. It turned out that the alpha-pruning method had a bug: it left the pruned node attached to its parent. This caused problems when the tree was pruned multiple times and the pruned node landed in the header table (this is rare for small data, but fairly common for a large number of transactions). In that case, the support of the pruned node was counted even though it should not have been. I fixed the problem by simply setting the support of such a node to 0. Since it cannot have any children, it does not add to the computation time, and this is a simpler solution than trying to detach it from its parent (the parent has an arbitrarily ordered list of children).
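
The effect of the fix can be illustrated with a toy model. This is only a sketch under stated assumptions: ToyTree, addNode, prune and support are hypothetical names, not Mahout's FPTree API, and the numbers are made up. It assumes that the support of an attribute is the sum of the counts of all nodes listed for it in the header table, so a node that should have been pruned keeps inflating the sum until its count is set to 0.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy stand-in for a preallocated FP-tree; NOT Mahout's FPTree API. */
final class ToyTree {
  private final long[] count;  // support count stored at each node id
  private final List<List<Integer>> header = new ArrayList<List<Integer>>(); // attribute -> node ids
  private int size;

  ToyTree(int capacity, int attributeCount) {
    count = new long[capacity];
    for (int i = 0; i < attributeCount; i++) {
      header.add(new ArrayList<Integer>());
    }
  }

  /** Adds a node for the attribute, registers it in the header table, returns its id. */
  int addNode(int attribute, long support) {
    int id = size++;
    count[id] = support;
    header.get(attribute).add(id);
    return id;
  }

  /** Marks a node as pruned by zeroing its count; the node stays linked where it was. */
  void prune(int nodeId) {
    count[nodeId] = 0;
  }

  /** Support of an attribute: sum of counts over its header-table entries. */
  long support(int attribute) {
    long total = 0;
    for (int id : header.get(attribute)) {
      total += count[id];
    }
    return total;
  }

  public static void main(String[] args) {
    ToyTree tree = new ToyTree(8, 2);
    tree.addNode(0, 5);
    int stale = tree.addNode(0, 1);       // pretend pruning should have removed this node
    System.out.println(tree.support(0));  // the stale node is still counted
    tree.prune(stale);
    System.out.println(tree.support(0));  // the stale node now contributes 0
  }
}
```

Zeroing the count leaves the node in place, which matches the trade-off described above: no search through the parent's child list is needed.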

I added a comprehensive test that uses the transactions from the retail data at http://fimi.ua.ac.be/data/ (88162 transactions) and compares the results with those generated by the implementation from http://www.borgelt.net/fpgrowth.html. I added tests for both the serial and the parallel (map-reduce) FPGrowth.
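
A minimal sketch of what such a comparison could look like, assuming both implementations' results have already been loaded into maps from a pattern (written here as a space-separated string of item ids) to its support. SupportDiff and mismatches are hypothetical names; the actual test in the patch may be structured differently.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Compares mined supports against a reference result (hypothetical helper). */
final class SupportDiff {

  /** Returns one line per pattern whose support is missing, different, or unexpected. */
  static List<String> mismatches(Map<String, Long> expected, Map<String, Long> actual) {
    List<String> report = new ArrayList<String>();
    for (Map.Entry<String, Long> entry : expected.entrySet()) {
      Long got = actual.get(entry.getKey());
      if (got == null) {
        report.add(entry.getKey() + ": missing");
      } else if (!got.equals(entry.getValue())) {
        report.add(entry.getKey() + ": expected " + entry.getValue() + ", got " + got);
      }
    }
    for (String pattern : actual.keySet()) {
      if (!expected.containsKey(pattern)) {
        report.add(pattern + ": unexpected");
      }
    }
    return report;
  }

  public static void main(String[] args) {
    // The one mismatch quoted in the issue description.
    Map<String, Long> reference = new LinkedHashMap<String, Long>();
    reference.put("36 39 41", 572L);
    Map<String, Long> mined = new LinkedHashMap<String, Long>();
    mined.put("36 39 41", 573L);

    for (String line : mismatches(reference, mined)) {
      System.out.println(line); // 36 39 41: expected 572, got 573
    }
  }
}
```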

I also noticed that the FPGrowth implementation can be optimized by not calculating the patterns ending with a given attribute multiple times. Depending on how many features patterns are generated for, the speedup can be huge: the more features are included, the greater the speedup. For the test data mentioned above, with all features selected (i.e. we want to generate patterns for all items in the transactions), pattern generation time dropped from 1 h 15 min to 8 s. For parallel FPGrowth, where the number of requested features is limited, the speedup is not as dramatic, but still very high.
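
One generic way to guarantee "at most once per attribute" is to cache results by attribute. The sketch below shows plain memoization only; it is not the code from the patch, which may achieve the same effect differently. PatternCache, patternsEndingWith and mine are hypothetical names, and mine is a placeholder for the expensive mining step.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Generic memoization sketch: each attribute is mined at most once. */
final class PatternCache {
  private final Map<Integer, List<String>> cache = new HashMap<Integer, List<String>>();
  private int computations; // how many times the expensive step actually ran

  /** Returns the patterns for the attribute, mining them only on the first request. */
  List<String> patternsEndingWith(int attribute) {
    List<String> patterns = cache.get(attribute);
    if (patterns == null) {
      computations++;
      patterns = mine(attribute);
      cache.put(attribute, patterns);
    }
    return patterns;
  }

  /** Placeholder for the expensive mining step (hypothetical). */
  private List<String> mine(int attribute) {
    return Collections.singletonList("patterns ending with " + attribute);
  }

  int computations() {
    return computations;
  }

  public static void main(String[] args) {
    PatternCache cache = new PatternCache();
    cache.patternsEndingWith(41);
    cache.patternsEndingWith(41); // served from the cache
    cache.patternsEndingWith(39);
    System.out.println(cache.computations()); // the mining step ran twice, not three times
  }
}
```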

I attached a patch (for version 0.5-SNAPSHOT) with all changes: the bug fix, comprehensive tests (serial and parallel), and the optimization. I hope it'll find its way to the trunk :)


> Some of generated patterns have support higher than in reality
> --------------------------------------------------------------
>
>                 Key: MAHOUT-625
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-625
>             Project: Mahout
>          Issue Type: Bug
>          Components: Frequent Itemset/Association Rule Mining
>    Affects Versions: 0.4
>            Reporter: Jaroslaw Odzga
>            Priority: Critical
>         Attachments: MAHOUT-625-patch.txt, mahout-test.zip

[jira] Updated: (MAHOUT-625) Some of generated patterns have support higher than in reality

Posted by "Robin Anil (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAHOUT-625?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robin Anil updated MAHOUT-625:
------------------------------

    Comment: was deleted

(was: Small Nit. You can use the Mahout eclipse formatter available on the how to contribute page(on the Mahout wiki) for newly submitted code.)

> Some of generated patterns have support higher than in reality
> --------------------------------------------------------------
>
>                 Key: MAHOUT-625
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-625
>             Project: Mahout
>          Issue Type: Bug
>          Components: Frequent Itemset/Association Rule Mining
>    Affects Versions: 0.4
>            Reporter: Jaroslaw Odzga
>            Priority: Critical
>         Attachments: MAHOUT-625-patch.txt, mahout-test.zip

[jira] [Updated] (MAHOUT-625) Some of generated patterns have support higher than in reality

Posted by "Sean Owen (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAHOUT-625?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean Owen updated MAHOUT-625:
-----------------------------

    Priority: Major  (was: Critical)

> Some of generated patterns have support higher than in reality
> --------------------------------------------------------------
>
>                 Key: MAHOUT-625
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-625
>             Project: Mahout
>          Issue Type: Bug
>          Components: Frequent Itemset/Association Rule Mining
>    Affects Versions: 0.4
>            Reporter: Jaroslaw Odzga
>            Assignee: Robin Anil
>             Fix For: 0.5
>
>         Attachments: MAHOUT-625-patch.txt, bugfix-patch.txt, dataset_ok.txt, final_patch_with_bug_fix_test_and_the_dataset.txt, mahout-test.zip

[jira] [Commented] (MAHOUT-625) Some of generated patterns have support higher than in reality

Posted by "Hudson (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-625?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13009759#comment-13009759 ] 

Hudson commented on MAHOUT-625:
-------------------------------

Integrated in Mahout-Quality #684 (See [https://hudson.apache.org/hudson/job/Mahout-Quality/684/])
    MAHOUT-625 Fixing support bug due to dangling item in the header table, adding tests based on retail data from http://fimi.ua.ac.be/data/. Contributed by Jaroslaw Odzga


> Some of generated patterns have support higher than in reality
> --------------------------------------------------------------
>
>                 Key: MAHOUT-625
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-625
>             Project: Mahout
>          Issue Type: Bug
>          Components: Frequent Itemset/Association Rule Mining
>    Affects Versions: 0.4
>            Reporter: Jaroslaw Odzga
>            Priority: Critical
>         Attachments: MAHOUT-625-patch.txt, bugfix-patch.txt, dataset_ok.txt, final_patch_with_bug_fix_test_and_the_dataset.txt, mahout-test.zip

[jira] Commented: (MAHOUT-625) Some of generated patterns have support higher than in reality

Posted by "Jaroslaw Odzga (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-625?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13006258#comment-13006258 ] 

Jaroslaw Odzga commented on MAHOUT-625:
---------------------------------------

I attached an isolated patch for the bug fix; it's a one-liner. The rest is for the optimization.
Answering your questions:
1) The node that was removed was ending up in the header table for certain input data. This is the reason for the increased support of some of the generated patterns.
2) The author of the dataset writes:
The data are provided ’as is’. Basically, any use of the data is allowed as long as the proper
acknowledgment is provided and a copy of the work is provided to Tom Brijs (see details below).
I think the author mainly has scientific papers in mind when he mentions a "copy of the work". I'm not sure whether it is enough to drop the author an email and just ask if the dataset can be used in Mahout.
3) I don't see how we could achieve memory savings, since the data is in a preallocated array. Removing a node merely means detaching it from its parent, which could be done, but I think the benefit is not worth the additional effort (currently the parent has an unordered list of children).
4) As to the performance improvement: as I said, it is dramatic when the number of requested features is high (as in the single-node scenario, or with very big groups in the parallel scenario), and it is still noticeable even with a small number of features. Basically, the work done is always smaller than before the patch (as patterns for each item are calculated at most once). Obviously, in the parallel scenario, when groups are small, the performance boost will not be that large. If you notice any issues with it, please let me know.

> Some of generated patterns have support higher than in reality
> --------------------------------------------------------------
>
>                 Key: MAHOUT-625
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-625
>             Project: Mahout
>          Issue Type: Bug
>          Components: Frequent Itemset/Association Rule Mining
>    Affects Versions: 0.4
>            Reporter: Jaroslaw Odzga
>            Priority: Critical
>         Attachments: MAHOUT-625-patch.txt, bugfix-patch.txt, mahout-test.zip

[jira] Updated: (MAHOUT-625) Some of generated patterns have support higher than in reality

Posted by "Jaroslaw Odzga (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAHOUT-625?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jaroslaw Odzga updated MAHOUT-625:
----------------------------------

    Attachment:     (was: mahout-test.7z)


[jira] [Updated] (MAHOUT-625) Some of generated patterns have support higher than in reality

Posted by "Jaroslaw Odzga (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAHOUT-625?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jaroslaw Odzga updated MAHOUT-625:
----------------------------------

    Attachment: final_patch_with_bug_fix_test_and_the_dataset.txt

Hi,

Sorry for the long delay, I was on holiday :)
I attached the patch containing the fix, with tests and the test dataset.
I'll create a separate issue for the performance improvement.


[jira] Commented: (MAHOUT-625) Some of generated patterns have support higher than in reality

Posted by "Robin Anil (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-625?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13006243#comment-13006243 ] 

Robin Anil commented on MAHOUT-625:
-----------------------------------

Small nit: you can use the Mahout Eclipse formatter, available on the How To Contribute page (on the Mahout wiki), for newly submitted code.


[jira] Updated: (MAHOUT-625) Some of generated patterns have support higher than in reality

Posted by "Jaroslaw Odzga (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAHOUT-625?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jaroslaw Odzga updated MAHOUT-625:
----------------------------------

    Attachment: mahout-test.7z

A test which shows that FPGrowth has a bug. The data for the test is the retail dataset from http://fimi.ua.ac.be/data/.
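
The ground-truth support that such a test compares against can be obtained by brute force: count the transactions that contain every item of the pattern. The sketch below is not taken from the attached test; the class name SupportCounter and the tiny in-memory transaction list are assumptions made for illustration, and the real test reads the retail dataset instead.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class SupportCounter {
    /** Counts the transactions that contain every item of the pattern. */
    public static int support(List<Set<Integer>> transactions, Set<Integer> pattern) {
        int count = 0;
        for (Set<Integer> transaction : transactions) {
            if (transaction.containsAll(pattern)) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        // Made-up transactions, only to exercise the counting logic.
        List<Set<Integer>> transactions = Arrays.<Set<Integer>>asList(
            new HashSet<Integer>(Arrays.asList(36, 39, 41, 48)),
            new HashSet<Integer>(Arrays.asList(36, 39)),
            new HashSet<Integer>(Arrays.asList(39, 41, 36)),
            new HashSet<Integer>(Arrays.asList(41, 48)));
        Set<Integer> pattern = new HashSet<Integer>(Arrays.asList(36, 39, 41));
        System.out.println(support(transactions, pattern)); // prints 2
    }
}
```

A support reported by a frequent-pattern miner can never legitimately exceed this count, which is why a returned value of 573 against a counted 572 indicates a bug.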


[jira] [Commented] (MAHOUT-625) Some of generated patterns have support higher than in reality

Posted by "Sean Owen (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-625?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13049672#comment-13049672 ] 

Sean Owen commented on MAHOUT-625:
----------------------------------

Please open a new issue with the patch that would bring HEAD to what you think is right. I/we may have misunderstood and not committed all of what you wanted to see.


[jira] Commented: (MAHOUT-625) Some of generated patterns have support higher than in reality

Posted by "Robin Anil (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-625?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13006240#comment-13006240 ] 

Robin Anil commented on MAHOUT-625:
-----------------------------------

Again, great job on finding and fixing the bug!


[jira] Updated: (MAHOUT-625) Some of generated patterns have support higher than in reality

Posted by "Jaroslaw Odzga (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAHOUT-625?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jaroslaw Odzga updated MAHOUT-625:
----------------------------------

    Attachment: MAHOUT-625-patch.txt


[jira] [Resolved] (MAHOUT-625) Some of generated patterns have support higher than in reality

Posted by "Sean Owen (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAHOUT-625?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean Owen resolved MAHOUT-625.
------------------------------

       Resolution: Fixed
    Fix Version/s: 0.5
         Assignee: Robin Anil

(It looks like this was actually resolved in the past, and is in 0.5?)


[jira] [Commented] (MAHOUT-625) Some of generated patterns have support higher than in reality

Posted by "Robin Anil (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-625?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13009689#comment-13009689 ] 

Robin Anil commented on MAHOUT-625:
-----------------------------------

FPGrowthRetailTest100 takes 1 hour :) So I am commenting it out before checking it in. We can uncomment it after the perf patch is in. The PFPGrowthRetail100 test benefits from the splitting and takes only 93 sec.


[jira] Updated: (MAHOUT-625) Some of generated patterns have support higher than in reality

Posted by "Jaroslaw Odzga (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAHOUT-625?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jaroslaw Odzga updated MAHOUT-625:
----------------------------------

    Attachment: bugfix-patch.txt


[jira] Commented: (MAHOUT-625) Some of generated patterns have support higher than in reality

Posted by "Robin Anil (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-625?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13006239#comment-13006239 ] 

Robin Anil commented on MAHOUT-625:
-----------------------------------

Right, I was testing Vipul's dataset (MAHOUT-617) and was seeing the same issue. Did the header table still have the node even after alpha pruning?

bq. I also noticed that the FPGrowth implementation can be optimized by not calculating patterns ending with given attributes multiple times. Depending on how many features patterns are generated for, the speedup can be huge: the more features included, the greater the speedup. For the mentioned test data, if all features were selected (i.e. we want to generate patterns for all items in the transactions), pattern generation time dropped from 1h 15min to 8sec.

This might be useful for a single node. For PFPGrowth, this used to create issues with the exact counts of patterns. There is a lot of code here (:thumbs up:) for me to verify. Some issues:

1) The dataset needs a signed agreement before it can be included in the Mahout codebase (see the website). Can you add another test to reproduce the test case? See MAHOUT-617.
2) Again, for the comparison code, use a different dataset.
3) Can you split the optimization out of this into another patch? I want to test more before checking it in.
4) The bug fix of setting support = 0 may not save the extra memory such nodes take. It's good for now, until a permanent solution is found.


