Posted to user@mahout.apache.org by Yarco Hayduk <ya...@gmail.com> on 2011/03/22 22:51:47 UTC

H-Mine

Hi,

For Google Summer of Code 2011, would you be interested if I
implemented the H-Mine algorithm?

http://www.cs.sfu.ca/~jpei/publications/Hmine-jn.pdf

Thank you,
yarco;)

Re: H-Mine

Posted by Ted Dunning <te...@gmail.com>.
I read that quote when you first posted the link to the articles.

The problem is that, in my opinion, you can't lose information that wasn't
there in the first place.

I am totally on board with algorithms that preserve data where it matters.
 Keeping a patient's temperature is certainly better than just reducing it to a
binary indicator of fever.

But observations are what they are.  If a doctor says that a patient has an
apparently swollen spleen, that is what the doctor says.  We may doubt the
accuracy of the doctor's word, but we lose no information by simply
recording that is what they said.  Trying to add a probabilistic value to
this statement is actually counterproductive because we don't necessarily
have any basis for this conversion.  On the other hand, introducing a hidden
variable that might be interpreted as "really does have a swollen spleen"
might well help us out.
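Ted's hidden-variable suggestion can be made concrete with a one-line Bayes update. A minimal sketch in Python, where the prior and the doctor's accuracy figures are invented purely for illustration:

```python
# Hidden variable S: the patient really does have a swollen spleen.
# Observation D: the doctor says the spleen is swollen (recorded as certain).
# All numbers below are made-up illustration values, not clinical data.

prior_s = 0.05          # assumed prevalence of a swollen spleen
p_d_given_s = 0.90      # assumed chance the doctor says so when it's true
p_d_given_not_s = 0.10  # assumed chance the doctor says so when it's false

# Bayes' rule: P(S | D) = P(D | S) * P(S) / P(D)
p_d = p_d_given_s * prior_s + p_d_given_not_s * (1 - prior_s)
posterior_s = p_d_given_s * prior_s / p_d
print(round(posterior_s, 3))  # -> 0.321
```

The observation itself stays certain; only the posterior over the hidden state is probabilistic, which is exactly the distinction Ted is drawing.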

Now, my doubts aside, if you have an urge to implement something, go for it.
 Your itch is what will drive Mahout to have code that solves your problem.


On Tue, Mar 29, 2011 at 8:56 PM, Yarco Hayduk <ya...@gmail.com> wrote:

> Setting aside the definitions of uncertain data, I don't think that you
> guys are too excited about the whole uncertain data mining algorithm idea ;-)
>
>

Re: H-Mine

Posted by Yarco Hayduk <ya...@gmail.com>.
I guess the best way to reply to your message is with the help of the
following quote from
http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.102.6035

"Since the existence of an item in a transaction is indicated by a
probability, an advantage of the existential uncertain data model is that it
allows more information to be captured by the dataset. Consider again the
example patient dataset. If we adopt a binary data model, then each
symptom/illness can either be present (1) or absent (0) in a patient record.
Under the binary model, data analysts will be forced to set a threshold
value for each symptom/illness to quantize the probabilities into either 1
or 0. In other words, information about those (marginally) low values is
discarded. The uncertain data model, however, allows such information be
retained and be available for analysis. The disadvantage of retaining such
information is that the size of the dataset would be much larger than that
under the quantized binary model. This is particularly true if most
of the existential probabilities are very small. Consequently, mining
algorithms will run a lot slower on such large datasets."
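The trade-off described in that quote is easy to see in code. A minimal sketch (symptom names and probabilities are invented for illustration):

```python
# One patient record under the existential uncertain data model:
# each symptom carries a probability of actually being present.
record = {"fever": 0.95, "fatigue": 0.40, "swollen_spleen": 0.15}

# Binary model: quantize with a threshold; the marginal values are discarded.
threshold = 0.5
binary = {item for item, p in record.items() if p >= threshold}
print(binary)  # only fever survives the threshold

# Uncertain model: keep every nonzero probability for later mining,
# at the cost of a much larger dataset.
uncertain = {item: p for item, p in record.items() if p > 0.0}
print(len(uncertain))  # all three symptoms are retained
```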

Setting aside the definitions of uncertain data, I don't think that you
guys are too excited about the whole uncertain data mining algorithm idea ;-)


On Thu, Mar 24, 2011 at 4:07 PM, Ted Dunning <te...@gmail.com> wrote:

> That still doesn't sound tenable.
>
> The problem is that you have an observable or you don't.  In your example,
> you have an observable "doctor says that patient has a disease".  You don't
> have access to the internal state of mind of the doctor, only what they say
> or do.  The thing that is uncertain is itself unobservable.
>
> I say this as a pretty committed Bayesian who thinks that probability is
> the best way for us to describe our uncertain state of knowledge about the
> world.  I am thus pretty sympathetic to uncertain inferences.  But
> observations are not uncertain, they are what they are.  They may be the
> noisy transduction of some unobserved real thing where the noise is due to
> an uncertain measurement process.  But that is just another case of
> observations being certain and the unobservable thing being uncertain.  As
> such, the data are not uncertain, just our interpretation.
>
> Given that, the right thing to do is actually model a posterior
> distribution of the unobserved uncertain cause.  Frequent itemsets are all
> about counting of observations, not inference.  We may do inference from
> those counts, but that doesn't impinge on the counting process.
>
>
> On Thu, Mar 24, 2011 at 1:11 PM, Yarco Hayduk <ya...@gmail.com> wrote:
>
>> In uncertain FPM we suspect but can not guarantee the presence or absence
>> of an item. Say when a doctor is 90% sure that a patient suffers from a
>> particular kind of a disease. This uncertainty can be expressed in terms of
>> existential probability. More precisely, uncertain data in this case means
>> existentially uncertain data.
>>
>> This paper's introduction has a great explanation of uncertain data.
>>
>> http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.85.4656&rep=rep1&type=pdf
>>
>> In the previous email when I was talking about the sampling approach, I
>> meant to post this link
>> http://wwwis.win.tue.nl/~tcalders/pubs/CALDERSPAKDD10.pdf
>>
>> sorry for the confusion
>>
>> yarco;)
>>
>>
>>> On Thu, Mar 24, 2011 at 1:45 PM, Ted Dunning <te...@gmail.com> wrote:
>>
>>> I don't see the value here.  This paper doesn't make their definitions
>>> clear and I find it particularly
>>> troubling that they seem to assume that the observations being analyzed
>>> involve the observation of
>>>  probabilities.
>>>
>>> In every case that I know of (and I mean every, not most) the
>>> observations are not uncertain.  They
>>> may be the result of an underlying stochastic process and I think it very
>>> important to incorporate that
>>> assumption into any analysis, but the data themselves do not involve any
>>> observed probabilities.  In
>>> some sense, the data that we observe are already the result of sampling
>>> from the probabilities that
>>> this paper refers to.
>>>
>>> As such, I really don't see what value this work has.
>>>
>>> I might agree with the authors that a Bayesian approach is required in
>>> which we consider our task to
>>> be the estimation of these underlying probabilities rather than the
>>> rather pedestrian way that most
>>> frequent item set algorithms simply assume that counts are the end of the
>>> story.
>>>
>>> But that doesn't sound like what the paper is saying.  Maybe I am
>>> misreading it.
>>>
>>>
>>> On Thu, Mar 24, 2011 at 10:53 AM, Yarco Hayduk <ya...@gmail.com> wrote:
>>>
>>>> What Mahout does not have at this point is the ability to mine frequent
>>>> patterns from uncertain data.
>>>>
>>>> You could implement this feature in two ways:
>>>>
>>>> 1. Using sampling (http://www.usukita.org/papers/4931/freq.pdf )
>>>> 2. Rewriting the FPM algorithms so that they can work with uncertain
>>>> data (http://www.usukita.org/papers/4931/freq.pdf)
>>>>
>>>> I could implement the UH-Mine algorithm that would work with uncertain
>>>> data. It's described in (2).
>>>> yarco;)
>>>>
>>>>
>>>> > On Thu, Mar 24, 2011 at 12:27 PM, Ted Dunning <te...@gmail.com>
>>>> wrote:
>>>> >>
>>>> >> Moderately good as a project.
>>>> >> I think that Mahout would benefit more from new capabilities than
>>>> from a reimplementation of old capabilities with a new algorithm.  Item set
>>>> mining is not particularly subject to radical improvement in terms of the
>>>> results.  Speed improvements would be nice, but are not a major point of
>>>> pain with the current implementation.
>>>> >> What do you see about this project that would be good for Mahout?
>>>>  Are there related things that you need that Mahout doesn't do?
>>>> >>
>>>> >> On Thu, Mar 24, 2011 at 10:22 AM, Yarco Hayduk <ya...@gmail.com>
>>>> wrote:
>>>> >>>
>>>> >>> So does this look like a valid project idea?
>>>> >>>
>>>> >>> Thanks
>>>> >>>
>>>> >>> On Tue, Mar 22, 2011 at 11:12 PM, Yarco Hayduk <ya...@gmail.com>
>>>> wrote:
>>>> >>>
>>>> >>> > I believe that Section 3.1 of the aforementioned paper talks about
>>>> the
>>>> >>> > parallel version of H-Mine.
>>>> >>> >
>>>> >>> > You are right - H-Mine has a backtracking step, which adds nodes
>>>> to the
>>>> >>> > next node and introduces a dependency. i.e. you can not start
>>>> working on the
>>>> >>> > i+1 header element before the i-th element is mined.
>>>> >>> >
>>>> >>> > The H-Mine algorithm works as follows:
>>>> >>> >
>>>> >>> > 1. prunes the initial DB such that all singleton infrequent items
>>>> are
>>>> >>> > removed
>>>> >>> > 2. divides the pruned db into equal chunks.
>>>> >>> > 3. mines these chunks separately using the H-Mine(mem) algorithm
>>>> >>> > 4. joins the results and
>>>> >>> > 5. scans the pruned db once again to remove false positives and
>>>> obtain the
>>>> >>> > actual counts.
>>>> >>> >
>>>> >>> > I'm pretty sure that I can map these steps to MapReduce.
>>>> >>> >
>>>> >>> > Having said that, I am not sure that this approach would work
>>>> better than
>>>> >>> > Parallel FP-Growth.
>>>> >>> >
>>>> >>> > On Tue, Mar 22, 2011 at 4:59 PM, Robin Anil <ro...@gmail.com>
>>>> wrote:
>>>> >>> >
>>>> >>> >> We have a Parallel FPGrowth implementation. I have read that by
>>>> itself
>>>> >>> >> HMine is faster in sequential version, but it may not be of use
>>>> to Mahout in
>>>> >>> >> that format. If you are able to propose a parallel implementation
>>>> of H-Mine,
>>>> >>> >> using MapReduce, it will be of great interest to Mahout.
>>>> >>> >>
>>>> >>> >>
>>>> >>> >> On Wed, Mar 23, 2011 at 3:21 AM, Yarco Hayduk <ya...@gmail.com>
>>>> wrote:
>>>> >>> >>
>>>> >>> >>> Hi,
>>>> >>> >>>
>>>> >>> >>> For Google Summer of Code 2011, would you be interested if I
>>>> >>> >>> implemented the H-Mine algorithm?
>>>> >>> >>>
>>>> >>> >>> http://www.cs.sfu.ca/~jpei/publications/Hmine-jn.pdf
>>>> >>> >>>
>>>> >>> >>> Thank you,
>>>> >>> >>> yarco;)
>>>> >>> >>>
>>>> >>> >>
>>>> >>> >>
>>>> >>> >
>>>> >>
>>>> >
>>>>
>>>>
>>>
>>
>

Re: H-Mine

Posted by Ted Dunning <te...@gmail.com>.
That still doesn't sound tenable.

The problem is that you have an observable or you don't.  In your example,
you have an observable "doctor says that patient has a disease".  You don't
have access to the internal state of mind of the doctor, only what they say
or do.  The thing that is uncertain is itself unobservable.

I say this as a pretty committed Bayesian who thinks that probability is the
best way for us to describe our uncertain state of knowledge about the
world.  I am thus pretty sympathetic to uncertain inferences.  But
observations are not uncertain, they are what they are.  They may be the
noisy transduction of some unobserved real thing where the noise is due to
an uncertain measurement process.  But that is just another case of
observations being certain and the unobservable thing being uncertain.  As
such, the data are not uncertain, just our interpretation.

Given that, the right thing to do is actually model a posterior distribution
of the unobserved uncertain cause.  Frequent itemsets are all about counting
of observations, not inference.  We may do inference from those counts, but
that doesn't impinge on the counting process.

On Thu, Mar 24, 2011 at 1:11 PM, Yarco Hayduk <ya...@gmail.com> wrote:

> In uncertain FPM we suspect but can not guarantee the presence or absence
> of an item. Say when a doctor is 90% sure that a patient suffers from a
> particular kind of a disease. This uncertainty can be expressed in terms of
> existential probability. More precisely, uncertain data in this case means
> existentially uncertain data.
>
> This paper's introduction has a great explanation of uncertain data.
>
> http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.85.4656&rep=rep1&type=pdf
>
> In the previous email when I was talking about the sampling approach, I
> meant to post this link
> http://wwwis.win.tue.nl/~tcalders/pubs/CALDERSPAKDD10.pdf
>
> sorry for the confusion
>
> yarco;)
>
>
> On Thu, Mar 24, 2011 at 1:45 PM, Ted Dunning <te...@gmail.com> wrote:
>
>> I don't see the value here.  This paper doesn't make their definitions
>> clear and I find it particularly
>> troubling that they seem to assume that the observations being analyzed
>> involve the observation of
>>  probabilities.
>>
>> In every case that I know of (and I mean every, not most) the observations
>> are not uncertain.  They
>> may be the result of an underlying stochastic process and I think it very
>> important to incorporate that
>> assumption into any analysis, but the data themselves do not involve any
>> observed probabilities.  In
>> some sense, the data that we observe are already the result of sampling
>> from the probabilities that
>> this paper refers to.
>>
>> As such, I really don't see what value this work has.
>>
>> I might agree with the authors that a Bayesian approach is required in
>> which we consider our task to
>> be the estimation of these underlying probabilities rather than the rather
>> pedestrian way that most
>> frequent item set algorithms simply assume that counts are the end of the
>> story.
>>
>> But that doesn't sound like what the paper is saying.  Maybe I am
>> misreading it.
>>
>>
>> On Thu, Mar 24, 2011 at 10:53 AM, Yarco Hayduk <ya...@gmail.com> wrote:
>>
>>> What Mahout does not have at this point is the ability to mine frequent
>>> patterns from uncertain data.
>>>
>>> You could implement this feature in two ways:
>>>
>>> 1. Using sampling (http://www.usukita.org/papers/4931/freq.pdf )
>>> 2. Rewriting the FPM algorithms so that they can work with uncertain data
>>> (http://www.usukita.org/papers/4931/freq.pdf)
>>>
>>> I could implement the UH-Mine algorithm that would work with uncertain
>>> data. It's described in (2).
>>> yarco;)
>>>
>>>
>>> > On Thu, Mar 24, 2011 at 12:27 PM, Ted Dunning <te...@gmail.com>
>>> wrote:
>>> >>
>>> >> Moderately good as a project.
>>> >> I think that Mahout would benefit more from new capabilities than from
>>> a reimplementation of old capabilities with a new algorithm.  Item set
>>> mining is not particularly subject to radical improvement in terms of the
>>> results.  Speed improvements would be nice, but are not a major point of
>>> pain with the current implementation.
>>> >> What do you see about this project that would be good for Mahout?  Are
>>> there related things that you need that Mahout doesn't do?
>>> >>
>>> >> On Thu, Mar 24, 2011 at 10:22 AM, Yarco Hayduk <ya...@gmail.com>
>>> wrote:
>>> >>>
>>> >>> So does this look like a valid project idea?
>>> >>>
>>> >>> Thanks
>>> >>>
>>> >>> On Tue, Mar 22, 2011 at 11:12 PM, Yarco Hayduk <ya...@gmail.com>
>>> wrote:
>>> >>>
>>> >>> > I believe that Section 3.1 of the aforementioned paper talks about
>>> the
>>> >>> > parallel version of H-Mine.
>>> >>> >
>>> >>> > You are right - H-Mine has a backtracking step, which adds nodes to
>>> the
>>> >>> > next node and introduces a dependency. i.e. you can not start
>>> working on the
>>> >>> > i+1 header element before the i-th element is mined.
>>> >>> >
>>> >>> > The H-Mine algorithm works as follows:
>>> >>> >
>>> >>> > 1. prunes the initial DB such that all singleton infrequent items
>>> are
>>> >>> > removed
>>> >>> > 2. divides the pruned db into equal chunks.
>>> >>> > 3. mines these chunks separately using the H-Mine(mem) algorithm
>>> >>> > 4. joins the results and
>>> >>> > 5. scans the pruned db once again to remove false positives and
>>> obtain the
>>> >>> > actual counts.
>>> >>> >
>>> >>> > I'm pretty sure that I can map these steps to MapReduce.
>>> >>> >
>>> >>> > Having said that, I am not sure that this approach would work
>>> better than
>>> >>> > Parallel FP-Growth.
>>> >>> >
>>> >>> > On Tue, Mar 22, 2011 at 4:59 PM, Robin Anil <ro...@gmail.com>
>>> wrote:
>>> >>> >
>>> >>> >> We have a Parallel FPGrowth implementation. I have read that by
>>> itself
>>> >>> >> HMine is faster in sequential version, but it may not be of use to
>>> Mahout in
>>> >>> >> that format. If you are able to propose a parallel implementation
>>> of H-Mine,
>>> >>> >> using MapReduce, it will be of great interest to Mahout.
>>> >>> >>
>>> >>> >>
>>> >>> >> On Wed, Mar 23, 2011 at 3:21 AM, Yarco Hayduk <ya...@gmail.com>
>>> wrote:
>>> >>> >>
>>> >>> >>> Hi,
>>> >>> >>>
>>> >>> >>> For Google Summer of Code 2011, would you be interested if I
>>> >>> >>> implemented the H-Mine algorithm?
>>> >>> >>>
>>> >>> >>> http://www.cs.sfu.ca/~jpei/publications/Hmine-jn.pdf
>>> >>> >>>
>>> >>> >>> Thank you,
>>> >>> >>> yarco;)
>>> >>> >>>
>>> >>> >>
>>> >>> >>
>>> >>> >
>>> >>
>>> >
>>>
>>>
>>
>

Re: H-Mine

Posted by Yarco Hayduk <ya...@gmail.com>.
In uncertain FPM we suspect but cannot guarantee the presence or absence of
an item - say, when a doctor is 90% sure that a patient suffers from a
particular kind of disease. This uncertainty can be expressed in terms of
existential probability. More precisely, uncertain data in this case means
existentially uncertain data.

This paper's introduction has a great explanation of uncertain data.
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.85.4656&rep=rep1&type=pdf

In the previous email when I was talking about the sampling approach, I
meant to post this link
http://wwwis.win.tue.nl/~tcalders/pubs/CALDERSPAKDD10.pdf

sorry for the confusion

yarco;)
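Under this existential model, frequent-pattern mining typically replaces the plain support count with an expected support: the sum over transactions of the probability that the whole itemset is present. A minimal sketch, assuming item probabilities within a transaction are independent (the transaction data here is invented):

```python
from math import prod

# Transactions map items to existential probabilities.
transactions = [
    {"fever": 0.9, "cough": 0.8},
    {"fever": 0.4, "cough": 0.7, "rash": 0.2},
    {"cough": 1.0},
]

def expected_support(itemset, transactions):
    """Expected number of transactions containing the whole itemset,
    assuming items occur independently within a transaction."""
    total = 0.0
    for txn in transactions:
        if all(item in txn for item in itemset):
            total += prod(txn[item] for item in itemset)
    return total

# 0.9*0.8 from the first transaction plus 0.4*0.7 from the second
print(expected_support({"fever", "cough"}, transactions))
```

This is also where the speed concern from the quoted paper comes from: every transaction that mentions an item contributes, however small its probability, so the data never shrinks the way a thresholded binary dataset does.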

On Thu, Mar 24, 2011 at 1:45 PM, Ted Dunning <te...@gmail.com> wrote:

> I don't see the value here.  This paper doesn't make their definitions
> clear and I find it particularly
> troubling that they seem to assume that the observations being analyzed
> involve the observation of
>  probabilities.
>
> In every case that I know of (and I mean every, not most) the observations
> are not uncertain.  They
> may be the result of an underlying stochastic process and I think it very
> important to incorporate that
> assumption into any analysis, but the data themselves do not involve any
> observed probabilities.  In
> some sense, the data that we observe are already the result of sampling
> from the probabilities that
> this paper refers to.
>
> As such, I really don't see what value this work has.
>
> I might agree with the authors that a Bayesian approach is required in
> which we consider our task to
> be the estimation of these underlying probabilities rather than the rather
> pedestrian way that most
> frequent item set algorithms simply assume that counts are the end of the
> story.
>
> But that doesn't sound like what the paper is saying.  Maybe I am
> misreading it.
>
>
> On Thu, Mar 24, 2011 at 10:53 AM, Yarco Hayduk <ya...@gmail.com> wrote:
>
>> What Mahout does not have at this point is the ability to mine frequent
>> patterns from uncertain data.
>>
>> You could implement this feature in two ways:
>>
>> 1. Using sampling (http://www.usukita.org/papers/4931/freq.pdf )
>> 2. Rewriting the FPM algorithms so that they can work with uncertain data
>> (http://www.usukita.org/papers/4931/freq.pdf)
>>
>> I could implement the UH-Mine algorithm that would work with uncertain
>> data. It's described in (2).
>> yarco;)
>>
>>
>> > On Thu, Mar 24, 2011 at 12:27 PM, Ted Dunning <te...@gmail.com>
>> wrote:
>> >>
>> >> Moderately good as a project.
>> >> I think that Mahout would benefit more from new capabilities than from
>> a reimplementation of old capabilities with a new algorithm.  Item set
>> mining is not particularly subject to radical improvement in terms of the
>> results.  Speed improvements would be nice, but are not a major point of
>> pain with the current implementation.
>> >> What do you see about this project that would be good for Mahout?  Are
>> there related things that you need that Mahout doesn't do?
>> >>
>> >> On Thu, Mar 24, 2011 at 10:22 AM, Yarco Hayduk <ya...@gmail.com>
>> wrote:
>> >>>
>> >>> So does this look like a valid project idea?
>> >>>
>> >>> Thanks
>> >>>
>> >>> On Tue, Mar 22, 2011 at 11:12 PM, Yarco Hayduk <ya...@gmail.com>
>> wrote:
>> >>>
>> >>> > I believe that Section 3.1 of the aforementioned paper talks about
>> the
>> >>> > parallel version of H-Mine.
>> >>> >
>> >>> > You are right - H-Mine has a backtracking step, which adds nodes to
>> the
>> >>> > next node and introduces a dependency. i.e. you can not start
>> working on the
>> >>> > i+1 header element before the i-th element is mined.
>> >>> >
>> >>> > The H-Mine algorithm works as follows:
>> >>> >
>> >>> > 1. prunes the initial DB such that all singleton infrequent items are
>> >>> > removed
>> >>> > 2. divides the pruned db into equal chunks.
>> >>> > 3. mines these chunks separately using the H-Mine(mem) algorithm
>> >>> > 4. joins the results and
>> >>> > 5. scans the pruned db once again to remove false positives and
>> obtain the
>> >>> > actual counts.
>> >>> >
>> >>> > I'm pretty sure that I can map these steps to MapReduce.
>> >>> >
>> >>> > Having said that, I am not sure that this approach would work better
>> than
>> >>> > Parallel FP-Growth.
>> >>> >
>> >>> > On Tue, Mar 22, 2011 at 4:59 PM, Robin Anil <ro...@gmail.com>
>> wrote:
>> >>> >
>> >>> >> We have a Parallel FPGrowth implementation. I have read that by
>> itself
>> >>> >> HMine is faster in sequential version, but it may not be of use to
>> Mahout in
>> >>> >> that format. If you are able to propose a parallel implementation
>> of H-Mine,
>> >>> >> using MapReduce, it will be of great interest to Mahout.
>> >>> >>
>> >>> >>
>> >>> >> On Wed, Mar 23, 2011 at 3:21 AM, Yarco Hayduk <ya...@gmail.com>
>> wrote:
>> >>> >>
>> >>> >>> Hi,
>> >>> >>>
>> >>> >>> For Google Summer of Code 2011, would you be interested if I
>> >>> >>> implemented the H-Mine algorithm?
>> >>> >>>
>> >>> >>> http://www.cs.sfu.ca/~jpei/publications/Hmine-jn.pdf
>> >>> >>>
>> >>> >>> Thank you,
>> >>> >>> yarco;)
>> >>> >>>
>> >>> >>
>> >>> >>
>> >>> >
>> >>
>> >
>>
>>
>

Re: H-Mine

Posted by Ted Dunning <te...@gmail.com>.
I don't see the value here.  The paper doesn't make its definitions clear
and I find it particularly
troubling that they seem to assume that the observations being analyzed
involve the observation of
probabilities.

In every case that I know of (and I mean every, not most) the observations
are not uncertain.  They
may be the result of an underlying stochastic process and I think it very
important to incorporate that
assumption into any analysis, but the data themselves do not involve any
observed probabilities.  In
some sense, the data that we observe are already the result of sampling from
the probabilities that
this paper refers to.

As such, I really don't see what value this work has.

I might agree with the authors that a Bayesian approach is required in which
we consider our task to
be the estimation of these underlying probabilities rather than the rather
pedestrian way that most
frequent item set algorithms simply assume that counts are the end of the
story.

But that doesn't sound like what the paper is saying.  Maybe I am misreading
it.

On Thu, Mar 24, 2011 at 10:53 AM, Yarco Hayduk <ya...@gmail.com> wrote:

> What Mahout does not have at this point is the ability to mine frequent
> patterns from uncertain data.
>
> You could implement this feature in two ways:
>
> 1. Using sampling (http://www.usukita.org/papers/4931/freq.pdf )
> 2. Rewriting the FPM algorithms so that they can work with uncertain data (
> http://www.usukita.org/papers/4931/freq.pdf)
>
> I could implement the UH-Mine algorithm that would work with uncertain
> data. It's described in (2).
> yarco;)
>
>
> > On Thu, Mar 24, 2011 at 12:27 PM, Ted Dunning <te...@gmail.com>
> wrote:
> >>
> >> Moderately good as a project.
> >> I think that Mahout would benefit more from new capabilities than from a
> reimplementation of old capabilities with a new algorithm.  Item set mining
> is not particularly subject to radical improvement in terms of the results.
>  Speed improvements would be nice, but are not a major point of pain with
> the current implementation.
> >> What do you see about this project that would be good for Mahout?  Are
> there related things that you need that Mahout doesn't do?
> >>
> >> On Thu, Mar 24, 2011 at 10:22 AM, Yarco Hayduk <ya...@gmail.com>
> wrote:
> >>>
> >>> So does this look like a valid project idea?
> >>>
> >>> Thanks
> >>>
> >>> On Tue, Mar 22, 2011 at 11:12 PM, Yarco Hayduk <ya...@gmail.com>
> wrote:
> >>>
> >>> > I believe that Section 3.1 of the aforementioned paper talks about
> the
> >>> > parallel version of H-Mine.
> >>> >
> >>> > You are right - H-Mine has a backtracking step, which adds nodes to
> the
> >>> > next node and introduces a dependency. i.e. you can not start working
> on the
> >>> > i+1 header element before the i-th element is mined.
> >>> >
> >>> > The H-Mine algorithm works as follows:
> >>> >
> >>> > 1. prunes the initial DB such that all singleton infrequent items are
> >>> > removed
> >>> > 2. divides the pruned db into equal chunks.
> >>> > 3. mines these chunks separately using the H-Mine(mem) algorithm
> >>> > 4. joins the results and
> >>> > 5. scans the pruned db once again to remove false positives and
> obtain the
> >>> > actual counts.
> >>> >
> >>> > I'm pretty sure that I can map these steps to MapReduce.
> >>> >
> >>> > Having said that, I am not sure that this approach would work better
> than
> >>> > Parallel FP-Growth.
> >>> >
> >>> > On Tue, Mar 22, 2011 at 4:59 PM, Robin Anil <ro...@gmail.com>
> wrote:
> >>> >
> >>> >> We have a Parallel FPGrowth implementation. I have read that by
> itself
> >>> >> HMine is faster in sequential version, but it may not be of use to
> Mahout in
> >>> >> that format. If you are able to propose a parallel implementation of
> H-Mine,
> >>> >> using MapReduce, it will be of great interest to Mahout.
> >>> >>
> >>> >>
> >>> >> On Wed, Mar 23, 2011 at 3:21 AM, Yarco Hayduk <ya...@gmail.com>
> wrote:
> >>> >>
> >>> >>> Hi,
> >>> >>>
> >>> >>> For Google Summer of Code 2011, would you be interested if I
> >>> >>> implemented the H-Mine algorithm?
> >>> >>>
> >>> >>> http://www.cs.sfu.ca/~jpei/publications/Hmine-jn.pdf
> >>> >>>
> >>> >>> Thank you,
> >>> >>> yarco;)
> >>> >>>
> >>> >>
> >>> >>
> >>> >
> >>
> >
>
>

Re: H-Mine

Posted by Yarco Hayduk <ya...@gmail.com>.
What Mahout does not have at this point is the ability to mine frequent
patterns from uncertain data.

You could implement this feature in two ways:

1. Using sampling (http://www.usukita.org/papers/4931/freq.pdf )
2. Rewriting the FPM algorithms so that they can work with uncertain data (
http://www.usukita.org/papers/4931/freq.pdf)

I could implement the UH-Mine algorithm that would work with uncertain data.
It's described in (2).
yarco;)


> On Thu, Mar 24, 2011 at 12:27 PM, Ted Dunning <te...@gmail.com>
wrote:
>>
>> Moderately good as a project.
>> I think that Mahout would benefit more from new capabilities than from a
reimplementation of old capabilities with a new algorithm.  Item set mining
is not particularly subject to radical improvement in terms of the results.
 Speed improvements would be nice, but are not a major point of pain with
the current implementation.
>> What do you see about this project that would be good for Mahout?  Are
there related things that you need that Mahout doesn't do?
>>
>> On Thu, Mar 24, 2011 at 10:22 AM, Yarco Hayduk <ya...@gmail.com> wrote:
>>>
>>> So does this look like a valid project idea?
>>>
>>> Thanks
>>>
>>> On Tue, Mar 22, 2011 at 11:12 PM, Yarco Hayduk <ya...@gmail.com> wrote:
>>>
>>> > I believe that Section 3.1 of the aforementioned paper talks about the
>>> > parallel version of H-Mine.
>>> >
>>> > You are right - H-Mine has a backtracking step, which adds nodes to
the
>>> > next node and introduces a dependency. i.e. you can not start working
on the
>>> > i+1 header element before the i-th element is mined.
>>> >
>>> > The H-Mine algorithm works as follows:
>>> >
>>> > 1. prunes the initial DB such that all singleton infrequent items are
>>> > removed
>>> > 2. divides the pruned db into equal chunks.
>>> > 3. mines these chunks separately using the H-Mine(mem) algorithm
>>> > 4. joins the results and
>>> > 5. scans the pruned db once again to remove false positives and obtain
the
>>> > actual counts.
>>> >
>>> > I'm pretty sure that I can map these steps to MapReduce.
>>> >
>>> > Having said that, I am not sure that this approach would work better
than
>>> > Parallel FP-Growth.
>>> >
>>> > On Tue, Mar 22, 2011 at 4:59 PM, Robin Anil <ro...@gmail.com>
wrote:
>>> >
>>> >> We have a Parallel FPGrowth implementation. I have read that by
itself
>>> >> HMine is faster in sequential version, but it may not be of use to
Mahout in
>>> >> that format. If you are able to propose a parallel implementation of
H-Mine,
>>> >> using MapReduce, it will be of great interest to Mahout.
>>> >>
>>> >>
>>> >> On Wed, Mar 23, 2011 at 3:21 AM, Yarco Hayduk <ya...@gmail.com>
wrote:
>>> >>
>>> >>> Hi,
>>> >>>
>>> >>> For Google Summer of Code 2011, would you be interested if I
>>> >>> implemented the H-Mine algorithm?
>>> >>>
>>> >>> http://www.cs.sfu.ca/~jpei/publications/Hmine-jn.pdf
>>> >>>
>>> >>> Thank you,
>>> >>> yarco;)
>>> >>>
>>> >>
>>> >>
>>> >
>>
>

Re: H-Mine

Posted by Ted Dunning <te...@gmail.com>.
Moderately good as a project.

I think that Mahout would benefit more from new capabilities than from a
reimplementation of old capabilities with a new algorithm.  Item set mining
is not particularly subject to radical improvement in terms of the results.
 Speed improvements would be nice, but are not a major point of pain with
the current implementation.

What do you see about this project that would be good for Mahout?  Are there
related things that you need that Mahout doesn't do?

On Thu, Mar 24, 2011 at 10:22 AM, Yarco Hayduk <ya...@gmail.com> wrote:

> So does this look like a valid project idea?
>
> Thanks
>
> On Tue, Mar 22, 2011 at 11:12 PM, Yarco Hayduk <ya...@gmail.com> wrote:
>
> > I believe that Section 3.1 of the aforementioned paper talks about the
> > parallel version of H-Mine.
> >
> > You are right - H-Mine has a backtracking step, which adds nodes to the
> > next node and introduces a dependency. i.e. you can not start working on
> the
> > i+1 header element before the i-th element is mined.
> >
> > The H-Mine algorithm works as follows:
> >
> > 1. prunes the initial DB such that all singleton infrequent items are
> > removed
> > 2. divides the pruned db into equal chunks.
> > 3. mines these chunks separately using the H-Mine(mem) algorithm
> > 4. joins the results and
> > 5. scans the pruned db once again to remove false positives and obtain
> the
> > actual counts.
> >
> > I'm pretty sure that I can map these steps to MapReduce.
> >
> > Having said that, I am not sure that this approach would work better than
> > Parallel FP-Growth.
> >
> > On Tue, Mar 22, 2011 at 4:59 PM, Robin Anil <ro...@gmail.com>
> wrote:
> >
> >> We have a Parallel FPGrowth implementation. I have read that by itself
> >> HMine is faster in sequential version, but it may not be of use to
> Mahout in
> >> that format. If you are able to propose a parallel implementation of
> H-Mine,
> >> using MapReduce, it will be of great interest to Mahout.
> >>
> >>
> >> On Wed, Mar 23, 2011 at 3:21 AM, Yarco Hayduk <ya...@gmail.com> wrote:
> >>
> >>> Hi,
> >>>
> >>> For Google Summer of Code 2011, would you be interested if I
> >>> implemented the H-Mine algorithm?
> >>>
> >>> http://www.cs.sfu.ca/~jpei/publications/Hmine-jn.pdf
> >>>
> >>> Thank you,
> >>> yarco;)
> >>>
> >>
> >>
> >
>

Re: H-Mine

Posted by Yarco Hayduk <ya...@gmail.com>.
So does this look like a valid project idea?

Thanks

On Tue, Mar 22, 2011 at 11:12 PM, Yarco Hayduk <ya...@gmail.com> wrote:

> I believe that Section 3.1 of the aforementioned paper talks about the
> parallel version of H-Mine.
>
> You are right - H-Mine has a backtracking step, which adds nodes to the
> next node and introduces a dependency. i.e. you can not start working on the
> i+1 header element before the i-th element is mined.
>
> The H-Mine algorithm works as follows:
>
> 1. prunes the initial DB such that all singleton infrequent items are
> removed
> 2. divides the pruned db into equal chunks.
> 3. mines these chunks separately using the H-Mine(mem) algorithm
> 4. joins the results and
> 5. scans the pruned db once again to remove false positives and obtain the
> actual counts.
>
> I'm pretty sure that I can map these steps to MapReduce.
>
> Having said that, I am not sure that this approach would work better than
> Parallel FP-Growth.
>
> On Tue, Mar 22, 2011 at 4:59 PM, Robin Anil <ro...@gmail.com> wrote:
>
>> We have a Parallel FPGrowth implementation. I have read that by itself
>> HMine is faster in sequential version, but it may not be of use to Mahout in
>> that format. If you are able to propose a parallel implementation of H-Mine,
>> using MapReduce, it will be of great interest to Mahout.
>>
>>
>> On Wed, Mar 23, 2011 at 3:21 AM, Yarco Hayduk <ya...@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> For Google Summer of Code 2011, would you be interested if I
>>> implemented the H-Mine algorithm?
>>>
>>> http://www.cs.sfu.ca/~jpei/publications/Hmine-jn.pdf
>>>
>>> Thank you,
>>> yarco;)
>>>
>>
>>
>

Re: H-Mine

Posted by Yarco Hayduk <ya...@gmail.com>.
I believe that Section 3.1 of the aforementioned paper talks about the
parallel version of H-Mine.

You are right - H-Mine has a backtracking step, which adds nodes to the next
node and introduces a dependency, i.e. you cannot start working on the (i+1)-th
header element before the i-th element is mined.

The H-Mine algorithm works as follows:

1. prunes the initial DB so that all singleton infrequent items are removed,
2. divides the pruned DB into equal chunks,
3. mines these chunks separately using the H-Mine(Mem) algorithm,
4. joins the results, and
5. scans the pruned DB once again to remove false positives and obtain the
actual counts.

I'm pretty sure that I can map these steps to MapReduce.

Having said that, I am not sure that this approach would work better than
Parallel FP-Growth.
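The five steps above can be sketched as a single-process toy. Note the real H-Mine(Mem) local miner is replaced here by naive enumeration, the chunk loop mimics what a map phase would do, and the function names are mine, not from the paper:

```python
from collections import Counter
from itertools import combinations

def prune_infrequent(db, min_support):
    """Step 1: drop singleton items that are infrequent globally."""
    counts = Counter(item for txn in db for item in txn)
    frequent = {i for i, c in counts.items() if c >= min_support}
    return [tuple(i for i in txn if i in frequent) for txn in db]

def mine_chunk(chunk, local_support):
    """Step 3, stand-in for H-Mine(Mem): itemsets that are locally
    frequent in this chunk (naive subset enumeration)."""
    counts = Counter()
    for txn in chunk:
        for k in range(1, len(txn) + 1):
            counts.update(combinations(sorted(txn), k))
    return {s for s, c in counts.items() if c >= local_support}

def hmine_style(db, min_support, n_chunks=2):
    db = prune_infrequent(db, min_support)                      # step 1
    size = max(1, len(db) // n_chunks)
    chunks = [db[i:i + size] for i in range(0, len(db), size)]  # step 2
    candidates = set()
    for chunk in chunks:                                        # step 3
        # each chunk only needs to meet its proportional share of support
        local = max(1, min_support * len(chunk) // len(db))
        candidates |= mine_chunk(chunk, local)                  # step 4: union
    # step 5: rescan the pruned DB to drop false positives
    result = {}
    for itemset in candidates:
        count = sum(all(i in txn for i in itemset) for txn in db)
        if count >= min_support:
            result[itemset] = count
    return result

db = [("a", "b"), ("a", "b", "c"), ("a",), ("b", "c")]
print(hmine_style(db, min_support=2))
```

A real MapReduce version would emit chunk-local candidates from the mappers and run the verification scan as a second job, but the data flow is the same: independent local mining, a union, and one global rescan.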

On Tue, Mar 22, 2011 at 4:59 PM, Robin Anil <ro...@gmail.com> wrote:

> We have a Parallel FPGrowth implementation. I have read that by itself
> HMine is faster in sequential version, but it may not be of use to Mahout in
> that format. If you are able to propose a parallel implementation of H-Mine,
> using MapReduce, it will be of great interest to Mahout.
>
>
> On Wed, Mar 23, 2011 at 3:21 AM, Yarco Hayduk <ya...@gmail.com> wrote:
>
>> Hi,
>>
>> For Google Summer of Code 2011, would you be interested if I
>> implemented the H-Mine algorithm?
>>
>> http://www.cs.sfu.ca/~jpei/publications/Hmine-jn.pdf
>>
>> Thank you,
>> yarco;)
>>
>
>

Re: H-Mine

Posted by Robin Anil <ro...@gmail.com>.
We have a Parallel FPGrowth implementation. I have read that HMine is faster
in its sequential version, but it may not be of use to Mahout in that form. If
you are able to propose a parallel implementation of H-Mine using MapReduce,
it will be of great interest to Mahout.


On Wed, Mar 23, 2011 at 3:21 AM, Yarco Hayduk <ya...@gmail.com> wrote:

> Hi,
>
> For Google Summer of Code 2011, would you be interested if I
> implemented the H-Mine algorithm?
>
> http://www.cs.sfu.ca/~jpei/publications/Hmine-jn.pdf
>
> Thank you,
> yarco;)
>