Posted to user@mahout.apache.org by Vipul Pandey <vi...@gmail.com> on 2011/03/05 07:39:13 UTC

Re: PFPGrowth - weird output?

Hi All, 


I'm running into a different issue with PFP growth now. I see an output like : 

$ cat part-r-00000 | grep 1678807047
12	1678807047  
38	1678807047 3159925415  

which says that the support (12) for the item (1678807047) is less than the support (38) of a pair containing that item. That cannot be right: every basket that contains the pair also contains the item, so the item's support has to be at least 38.
I get this even with the sequential version of FPGrowth.

$ cat part-r-00000  | grep 1441690161 
12		1441690161 3910019844  
18		1604285941 1441690161 3910019844  
75		1441690161  
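
The inconsistency above can be checked mechanically. The following is a minimal sketch, not part of Mahout: the class and method names are hypothetical, and it assumes the text output has one "support, whitespace, items" record per line. It flags every itemset whose reported support is lower than that of one of its supersets.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper, not part of Mahout: flags itemsets whose reported
// support is lower than the support of one of their supersets.
class SupportSanityCheck {

    // One parsed "support item item ..." output line.
    static final class Pattern {
        final long support;
        final Set<String> items;

        Pattern(long support, Set<String> items) {
            this.support = support;
            this.items = items;
        }
    }

    static Pattern parse(String line) {
        String[] fields = line.trim().split("\\s+");
        Set<String> items = new HashSet<>(Arrays.asList(fields).subList(1, fields.length));
        return new Pattern(Long.parseLong(fields[0]), items);
    }

    // Returns one message per (subset, superset) pair that breaks the rule
    // support(subset) >= support(superset).
    static List<String> violations(List<String> lines) {
        List<Pattern> patterns = new ArrayList<>();
        for (String line : lines) {
            if (!line.trim().isEmpty()) {
                patterns.add(parse(line));
            }
        }
        List<String> found = new ArrayList<>();
        for (Pattern small : patterns) {
            for (Pattern big : patterns) {
                boolean properSubset = big.items.size() > small.items.size()
                        && big.items.containsAll(small.items);
                if (properSubset && small.support < big.support) {
                    found.add(small.items + "=" + small.support
                            + " < " + big.items + "=" + big.support);
                }
            }
        }
        return found;
    }

    // The five output lines quoted in the message above.
    static List<String> sampleLines() {
        return Arrays.asList(
                "12\t1678807047",
                "38\t1678807047 3159925415",
                "12\t1441690161 3910019844",
                "18\t1604285941 1441690161 3910019844",
                "75\t1441690161");
    }

    public static void main(String[] args) {
        for (String violation : violations(sampleLines())) {
            System.out.println(violation);
        }
    }
}
```

On the five lines quoted above this reports two violations: the single item with support 12 against its pair with support 38, and the pair with support 12 against its triple with support 18. The comparison is quadratic in the number of patterns, which is fine for spot checks on grep output but not for a full result file.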


I'm sure I'm doing something "crafty" somewhere.

For sequential, I supply the file containing baskets and get the output as a sequence file.

I run the following code to read the sequence file and write out the support and itemsets in plain text:

(I wrote it as a MapReduce job for the PFPGrowth output, which is bigger. My reducer is just an identity reducer.)
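	// Note: the key (presumably the feature these patterns belong to) is
	// dropped here, so the text output no longer says which key a line came from.
	@Override
	protected void map(Text key, TopKStringPatterns input, Context context)
			throws IOException, InterruptedException {
		// Each pattern is an (itemset, support) pair.
		for (Pair<List<String>, Long> pair : input.getPatterns()) {
			StringBuilder sb = new StringBuilder();
			for (String item : pair.getFirst()) {
				sb.append(item).append(' ');
			}
			// Emit the support as the key and the space-separated itemset as the value.
			context.write(new LongWritable(pair.getSecond()), new Text(sb.toString()));
		}
	}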

This gives me the output above. 
Is this the right way? Am I doing something wrong while parsing the output?

My command line arguments are : 
-i ./baskets/part-r-00000 -o ./patterns -k 50 -method sequential -g 10 -regex '[\t]' -s 10

Any help would be highly appreciated.

Regards,
Vipul




On Feb 3, 2011, at 6:44 PM, <pr...@nokia.com> <pr...@nokia.com> wrote:

> Hi Vipul,
> Frequent patterns are reported per feature, which is why you are seeing the same pattern twice. The first one is for feature 1518311 and the second one is for feature 1476937.
> 
> However, both should have exactly the same support. I am not sure why you have different supports for the same itemset. Maybe if you send the full output from Mahout as is, we could take a look.
> 
> Are you running on a multi-node Hadoop cluster? If so, did you read all the output files?
> 
> Praveen
> ________________________________________
> From: ext Vipul Pandey [vipandey@gmail.com]
> Sent: Thursday, February 03, 2011 8:21 PM
> To: user@mahout.apache.org
> Subject: PFPGrowth - weird output?
> 
> Hi all!
> 
> I'm trying to run PFPGrowth on my data and this is the output I get. (Please
> note that I parse the output in the frequentpatterns folder and generate this
> output with the support followed by the itemset.)
> 
> support : Itemset
> *234     1518311    1476937  *
> 235     55843184
> 238     1238079
> 244     34541
> 247     4516454
> 252     106478
> 252     670864
> *254     1476937   1518311  *
> 
> You can see that the same pair of items (*1518311    1476937*) is reported
> twice, with different supports.
> 
> And below are all the occurrences of these two items together. If you
> look closely, it has all the permutations of the three items (*1476937* *720020*
> *1518311*)
> 
> 22 *1476937* 720020 *1518311*
> 30 *1518311* *1476937* 720020
> 30 720020 *1518311* *1476937*
> 34 720020 *1476937* *1518311*
> 38 *1518311* 720020 *1476937*
> 42 *1476937* *1518311* 720020
> 234 *1518311* *1476937*
> 254 *1476937* *1518311*
> 
> Does this mean that to get the support of just the pair (*1476937*
> *1518311*) I will have to add all of them up!?
> 
> Even in that case, the total comes out to *684*, whereas if I count the
> number of co-occurrences of these two items in the original baskets the
> support is *766*. Why is there a difference? Any idea?
> 
> 
> Thanks!
> Vipul
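
To make the permutation issue in the quoted message easier to inspect, the output lines can be grouped by itemset regardless of item order. This is only a sketch with hypothetical names. It assumes the same "support followed by items" line layout, uses the quoted lines with the emphasis asterisks stripped, and deliberately does not decide how the supports of one group should be combined.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical helper, not part of Mahout: collects the supports reported
// for each itemset, treating permutations of the same items as one itemset.
class PermutationGrouping {

    static Map<String, List<Long>> groupByItemset(List<String> lines) {
        Map<String, List<Long>> groups = new TreeMap<>();
        for (String line : lines) {
            String[] fields = line.trim().split("\\s+");
            // Sorting the items gives every permutation the same key.
            TreeSet<String> items =
                    new TreeSet<>(Arrays.asList(fields).subList(1, fields.length));
            String key = String.join(" ", items);
            groups.computeIfAbsent(key, k -> new ArrayList<>())
                    .add(Long.parseLong(fields[0]));
        }
        return groups;
    }

    // The eight lines from the quoted message, asterisks removed.
    static List<String> sampleLines() {
        return Arrays.asList(
                "22 1476937 720020 1518311",
                "30 1518311 1476937 720020",
                "30 720020 1518311 1476937",
                "34 720020 1476937 1518311",
                "38 1518311 720020 1476937",
                "42 1476937 1518311 720020",
                "234 1518311 1476937",
                "254 1476937 1518311");
    }

    public static void main(String[] args) {
        long total = 0;
        for (Map.Entry<String, List<Long>> group : groupByItemset(sampleLines()).entrySet()) {
            System.out.println(group.getKey() + " -> " + group.getValue());
            for (long support : group.getValue()) {
                total += support;
            }
        }
        System.out.println("sum of all reported supports: " + total);
    }
}
```

The eight lines collapse into two itemsets, and the sum over all of them is the 684 mentioned in the message, still short of the 766 counted directly from the baskets.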


Re: PFPGrowth - weird output?

Posted by Vipul Pandey <vi...@gmail.com>.
Robin, 

So here's how (P)FPGrowth looks from where I stand:

FPGrowth seems to report, for each itemset, only the number of baskets that consist of exactly that itemset rather than the number of baskets that contain it. Say item X appears on its own 12 times and together with item Y 10 times (a total of 22 times), and item Y appears on its own 4 times (a total of 14 times). Then this is what the output will be (say for min-support 2):

12 X
10 X Y
4   Y

If the minimum support is 5 then the output will look like : 
12 X
10 X Y

If the minimum support is 11 then the output will look like:
12 X

If the minimum support is 13 then there will be NO output,
even though X's support was 22 and Y's was 14 all along.

Even if we only wanted to show the maximal itemsets (although I would like to see ALL the frequent itemsets, maximal or not), this output is wrong: with a minimum support of 13 we should still have seen X (22) and Y (14).


Now say you add the basket X Y Z 11 times.


For minimum support 1 you'd see
12 X
10 X Y
11 X Y Z
4   Y

And for support 11 you'd see
12 X
11 X Y Z

Although I'd expect the output (for s=11) to be 
33 X
25 Y 
21 XY
11 Z
11 XZ
11 YZ
11 XYZ
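
The expected numbers above can be reproduced by brute force. The following is a small sketch with hypothetical names. It takes the support of an itemset to be the number of baskets containing all of its items, and it builds the toy data set described in this message (12 baskets of X, 10 of X Y, 4 of Y, 11 of X Y Z).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical brute-force reference, not Mahout code.
class ExpectedSupport {

    // Support of an itemset = number of baskets that contain every item in it.
    static int support(List<Set<String>> baskets, Set<String> itemset) {
        int count = 0;
        for (Set<String> basket : baskets) {
            if (basket.containsAll(itemset)) {
                count++;
            }
        }
        return count;
    }

    static Set<String> set(String... items) {
        return new HashSet<>(Arrays.asList(items));
    }

    // The toy data set from the message.
    static List<Set<String>> exampleBaskets() {
        List<Set<String>> baskets = new ArrayList<>();
        baskets.addAll(Collections.nCopies(12, set("X")));
        baskets.addAll(Collections.nCopies(10, set("X", "Y")));
        baskets.addAll(Collections.nCopies(4, set("Y")));
        baskets.addAll(Collections.nCopies(11, set("X", "Y", "Z")));
        return baskets;
    }

    public static void main(String[] args) {
        List<Set<String>> baskets = exampleBaskets();
        List<Set<String>> candidates = Arrays.asList(
                set("X"), set("Y"), set("X", "Y"), set("Z"),
                set("X", "Z"), set("Y", "Z"), set("X", "Y", "Z"));
        for (Set<String> candidate : candidates) {
            System.out.println(support(baskets, candidate) + " " + candidate);
        }
    }
}
```

Counted this way the supports come out as 33 for X, 25 for Y, 21 for X Y, and 11 for each itemset containing Z, matching the list above. Scanning every basket for every candidate is only practical for toy data like this.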

Hope this helps. 


Vipul

On Mar 5, 2011, at 2:13 AM, Robin Anil wrote:

> Hi Vipul, is it possible for you to attach some test data to a JIRA issue for me to investigate?
> 
> Robin
> 


Re: PFPGrowth - weird output?

Posted by Robin Anil <ro...@gmail.com>.
Hi Vipul, is it possible for you to attach some test data to a JIRA issue for me
to investigate?

Robin
