Posted to user@mahout.apache.org by Yarco Hayduk <ya...@gmail.com> on 2011/03/08 06:15:07 UTC

FP-Growth + Uncertain data

Hey Guys,

As part of my master's thesis I am developing a variation of your Parallel
FP-Growth FPM algorithm that works with uncertain data
(http://wwwis.win.tue.nl/~tcalders/pubs/CALDERSPAKDD10.pdf). It's a very
interesting approach, as it uses sampling and can exploit the power of
existing FPM algorithms by adding pre- and post-processing steps.

Currently, I am stepping through your implementation of the FP-Growth
algorithm to better understand how it works, so that I can add my
modifications. The code is good. I found some small issues over the past
couple of days and thought that reporting them might help you in some way ;-)

1. Add more comments to
org.apache.mahout.fpm.pfpgrowth.fpgrowth.FPTree
And I mean *more* comments. It takes hours to understand what these
variables mean.

in FPTree.java
private static final float GROWTH_RATE = 1.5f; // OK, what exactly is growing
here? Is it the header table growth rate?

also properties like
-----
 private static final int HT_LAST = 1;
 private static final int HT_NEXT = 0;
 private int nodes;
etc
-----

are difficult to understand.

2. I've seen code like this
lines 175-177: if (value >= minSupport) { ... }
way too many times. These checks have even caused odd errors:
in FPGrowth.java
if (frequency < minSupport) {break;}
I know that this check will never fire, but shouldn't this be continue;
instead of break;?

Can you somehow centralize this code so that checks like these are not
repeated in multiple places?
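
A minimal sketch of what such centralization could look like. This is not
Mahout code: the class SupportFilter and its methods are hypothetical names,
and a plain Map of item to support stands in for whatever structure
FPGrowth.java actually iterates over. It also shows why break and continue
differ: break is only safe if the items are visited in descending order of
support, which is an assumption the sketch does not make.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper (not part of Mahout): the single place that defines
// what "frequent" means, so callers do not repeat the comparison inline.
final class SupportFilter {
  private SupportFilter() {}

  static boolean isFrequent(long support, long minSupport) {
    return support >= minSupport;
  }

  // Returns the items whose support reaches minSupport, in iteration order.
  // Uses continue, so it is correct for any ordering of the input; a break
  // here would drop every item after the first infrequent one.
  static List<String> frequentItems(Map<String, Long> supportByItem, long minSupport) {
    List<String> result = new ArrayList<>();
    for (Map.Entry<String, Long> entry : supportByItem.entrySet()) {
      if (!isFrequent(entry.getValue(), minSupport)) {
        continue;
      }
      result.add(entry.getKey());
    }
    return result;
  }

  public static void main(String[] args) {
    // Deliberately not sorted by support: "b" is infrequent, "c" is not.
    Map<String, Long> counts = new LinkedHashMap<>();
    counts.put("a", 5L);
    counts.put("b", 1L);
    counts.put("c", 3L);
    System.out.println(frequentItems(counts, 2)); // prints [a, c]
  }
}
```

With break in place of continue, the same input would yield only [a], because
the loop would stop at "b" and never reach "c".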

3.
org.apache.mahout.fpm.pfpgrowth.fpgrowth.FPGrowth.generateFList(Iterator<Pair<List<A>,
Long>>, int).fList = new ArrayList<Pair<A, Long>>()

We can create that ArrayList with an initial capacity.
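
A minimal sketch of the idea, not the actual generateFList code: the class
FListSketch, the method buildFList, and the use of Map.Entry in place of
Mahout's Pair are all assumptions made to keep the example self-contained.
It assumes the per-item supports are already counted in a map, so the map
size is an upper bound on the list size, and that the list is ordered by
descending support.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for building a frequency list (F-list).
final class FListSketch {
  private FListSketch() {}

  static List<Map.Entry<String, Long>> buildFList(Map<String, Long> supportByItem,
                                                  long minSupport) {
    // The list can never hold more entries than the map, so sizing it up
    // front avoids repeated internal array growth while adding.
    List<Map.Entry<String, Long>> fList = new ArrayList<>(supportByItem.size());
    for (Map.Entry<String, Long> entry : supportByItem.entrySet()) {
      if (entry.getValue() >= minSupport) {
        fList.add(new SimpleEntry<>(entry.getKey(), entry.getValue()));
      }
    }
    // Descending support; ties are broken by item name for a stable result.
    fList.sort((x, y) -> {
      int bySupport = Long.compare(y.getValue(), x.getValue());
      return bySupport != 0 ? bySupport : x.getKey().compareTo(y.getKey());
    });
    return fList;
  }

  public static void main(String[] args) {
    Map<String, Long> counts = new HashMap<>();
    counts.put("milk", 4L);
    counts.put("bread", 7L);
    counts.put("caviar", 1L);
    System.out.println(buildFList(counts, 2)); // prints [bread=7, milk=4]
  }
}
```

The capacity is only a hint: if many items fall below minSupport the list is
over-allocated, which is usually a cheaper mistake than growing it repeatedly.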

-------------

I will be sharing some more comments as I get more familiar with the code
base.

Today, I read about the Google Summer of Code 2011 event. As I am working on
my sampling FP-Growth algorithm, I thought that I might as well implement
other FPM algorithms (I don't think that I would be able to share the
sampling code with you due to possible copyright issues ... ).
As for Eclat or H-Mine, I don't know whether these algorithms can be
tailored to work with MapReduce. I will have a C implementation of
H-Mine ready in a week or two; then I will be able to judge whether I can
implement it using MapReduce or not.

If these algorithms don't work out, I can work on constraint-based mining. I
can try to push various constraints, such as sum, min, max, etc., into the
mining process. The FP-Bonsai paper talked about monotone constraints only ...

Further ideas include working on uncertain data FPM algorithms.

Overall, I am interested in implementing frequent pattern mining algorithms
only.

yarco;)



