Posted to user@mahout.apache.org by "Ungerer, Jens" <je...@student.kit.edu> on 2012/05/29 15:49:51 UTC

mahout FPGrowth problem

I am using mahout-distribution 0.6. My first test program for Mahout FPGrowth, with a small data set,
worked well (example1.txt).

In my second test program I get this exception:

"Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 5
	at org.apache.mahout.fpm.pfpgrowth.convertors.TransactionIterator$1.apply(TransactionIterator.java:48)
	at org.apache.mahout.fpm.pfpgrowth.convertors.TransactionIterator$1.apply(TransactionIterator.java:42)
	at com.google.common.collect.Iterators$8.next(Iterators.java:765)
	at com.google.common.collect.ForwardingIterator.next(ForwardingIterator.java:48)
	at org.apache.mahout.fpm.pfpgrowth.fpgrowth.FPGrowth.generateTopKFrequentPatterns(FPGrowth.java:290)
	at org.apache.mahout.fpm.pfpgrowth.fpgrowth.FPGrowth.generateTopKFrequentPatterns(FPGrowth.java:174)
	at fpgrowth.Fpgrowth.frequentPatternMining(Fpgrowth.java:82)
	at Main.main(Main.java:119)"


Then I added items to the itemsets so that each transaction has the same length.
This also worked well (example2.txt).

I have two questions.

Is it necessary to use itemsets of equal length?
(In my first data set I didn't use itemsets of equal length...)

Is it possible to use itemsets with duplicates in Mahout FPGrowth?


https://cwiki.apache.org/confluence/display/MAHOUT/Mailing+Lists,+IRC+and+Archives
"Also, please send questions to this list to verify your problem before filing issues in JIRA."
-> I don't think so, but is my problem perhaps a bug in mahout FPGrowth?


best regards
Jens


AW: mahout FPGrowth problem

Posted by "Ungerer, Jens" <je...@student.kit.edu>.
Hi,

thank you for your response.

I removed the duplicate items and now I don't get an exception.

>> Is it necessary to use itemsets of equal length?
>No - fixed size itemsets are not required.

>> Is it possible to use itemsets with duplicates in Mahout FPGrowth?
>Not reliably.  This crash looks like it is caused by having more items in
>one particular itemset than in the set of items with at least
>min-support.  Perhaps more simply, if you have three items that meet
>your support cutoff, and you encounter an itemset like: "item1 item1
>item2 item3", this will happen.

>It won't crash here in every case where you have multiples of items; for
>example "item1 item1 item2" will not crash.  I'm not sure precisely how
>that will be treated down the line, but my assumption is that your
>results would be subtly wrong somehow.

>It would be pretty straightforward to fix this, either by tolerating
>multiples gracefully (collapsing "baskets" into itemsets) or by bailing
>out with an error.  I think we'd just need to fix TransactionIterator
>and also check the initial counting pass (ParallelCountingDriver and its
>non-MR counterpart).

>-tom

regards
Jens

Re: mahout FPGrowth problem

Posted by tom pierce <tc...@apache.org>.
Hi Jens,

> Is it necessary to use itemsets of equal length?
No - fixed size itemsets are not required.

> Is it possible to use itemsets with duplicates in Mahout FPGrowth?
Not reliably.  This crash looks like it is caused by having more items in
one particular itemset than in the set of items with at least 
min-support.  Perhaps more simply, if you have three items that meet 
your support cutoff, and you encounter an itemset like: "item1 item1 
item2 item3", this will happen.

It won't crash here in every case where you have multiples of items; for 
example "item1 item1 item2" will not crash.  I'm not sure precisely how 
that will be treated down the line, but my assumption is that your 
results would be subtly wrong somehow.
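
To make the two cases above concrete, here is a minimal sketch of the failure mode. It is a simplified model, not Mahout's actual TransactionIterator code: it assumes a buffer with exactly one slot per item that meets min-support, filled with one entry per item occurrence in the transaction. The class name, the encode helper and the item ids are all made up for illustration.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Simplified model of the crash; NOT Mahout's code. All names are hypothetical.
class TransactionBufferSketch {

    // Translates the items of one transaction into integer ids, writing them
    // into a buffer that has exactly one slot per frequent item (assumption).
    static int[] encode(List<String> transaction, Map<String, Integer> frequentItemIds) {
        int[] buffer = new int[frequentItemIds.size()];
        int index = 0;
        for (String item : transaction) {
            Integer id = frequentItemIds.get(item);
            if (id != null) {
                // Throws ArrayIndexOutOfBoundsException once duplicates push the
                // number of occurrences past the number of frequent items.
                buffer[index++] = id;
            }
        }
        return Arrays.copyOf(buffer, index);
    }

    public static void main(String[] args) {
        // Three items meet the support cutoff, so the buffer has three slots.
        Map<String, Integer> ids = new HashMap<String, Integer>();
        ids.put("item1", 0);
        ids.put("item2", 1);
        ids.put("item3", 2);

        // Three occurrences fit, even though item1 appears twice.
        System.out.println(Arrays.toString(encode(Arrays.asList("item1", "item1", "item2"), ids)));

        // Four occurrences do not fit into three slots.
        try {
            encode(Arrays.asList("item1", "item1", "item2", "item3"), ids);
        } catch (ArrayIndexOutOfBoundsException e) {
            System.out.println("overflow: " + e.getMessage());
        }
    }
}
```

In this model the first transaction is accepted with the duplicate still in it, which is the "will not crash, but may be subtly wrong" case.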

It would be pretty straightforward to fix this, either by tolerating
multiples gracefully (collapsing "baskets" into itemsets) or by bailing
out with an error.  I think we'd just need to fix TransactionIterator
and also check the initial counting pass (ParallelCountingDriver and its
non-MR counterpart).
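
Independently of any change inside Mahout, the "collapsing baskets into itemsets" idea can also be applied on the caller's side before the data reaches FPGrowth. The following is only a sketch of that preprocessing step, assuming transactions are held as lists of strings; BasketCollapser and its methods are hypothetical names and not part of Mahout.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;

// Caller-side preprocessing sketch; hypothetical names, not a Mahout API.
class BasketCollapser {

    // Collapses one basket into an itemset: every item is kept once,
    // in the order in which it first appeared.
    static List<String> toItemset(List<String> basket) {
        return new ArrayList<String>(new LinkedHashSet<String>(basket));
    }

    // Applies toItemset to every transaction of a data set.
    static List<List<String>> toItemsets(List<List<String>> baskets) {
        List<List<String>> itemsets = new ArrayList<List<String>>(baskets.size());
        for (List<String> basket : baskets) {
            itemsets.add(toItemset(basket));
        }
        return itemsets;
    }

    public static void main(String[] args) {
        List<List<String>> baskets = new ArrayList<List<String>>();
        baskets.add(Arrays.asList("item1", "item1", "item2", "item3"));
        baskets.add(Arrays.asList("item1", "item1", "item2"));
        System.out.println(toItemsets(baskets));
    }
}
```

The deduplicated lists would then be handed to FPGrowth in whatever input form the calling program already uses. Note that this drops multiplicity information, so it only fits cases where an item's presence matters rather than its quantity.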

-tom
