You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@mahout.apache.org by Zoltán Tóth-Czifra <tc...@tcz.hu> on 2012/03/03 18:25:13 UTC

Washing machines - Mahout algorithm advice

Hi,

I posted this on stack overflow, but I've been informed that the mailing
list is the appropriate forum to ask questions like this.

What I need is actually just a hint where I can start, as I'm not sure
which direction to go. Although this is not a classic machine learning
problem, AFAIK Mahout is not only for those. I was thinking it's an
appropriate tool, because I work with a large set of data and the output
has similar characteristics than a classic Mahout job. I think a problem
below can be implemented with MapReduce, but I don't know if there is any
existing, similar algorithm. I need this to be scalable, that's why I think
the Mahout on top of Hadoop's MapReduce power could be an appropriate
solution.

Quickly what I want to do is:

The final goal is to define one scalar (a "score") of each one of a set of
entities based on some "known" entities. The entities interact with each
other, known scores influence and define the unknown ones. You can imagine
with the following example.

I have a lot if white clothes and a few pieces of colorful ones; red, blue,
green... I put them into the washing machine. I want to know what colors
the white ones will get after the wash.

Things to take into account:

   - we make a series of washing with different "actors"... some clothes
   are washed in the 1st and 3rd washing, some of them only in the 2nd, some
   of them are washed in all
   - in consecutive washes the clothes that were white before but now
   colored also influence the rest, but not as strong (as they are not as
   colored)
   - some colors don't "color" as much as others. for example red has a
   strong effect on most of the clothes, but green not so much
   - coloring effect also depends on how many clothes are in one washing.
   If you wash a red shirt with a white t-shirt, it gets much more colored,
   than if there were 100 other white t-shirt
   - clothes don't "lose" their color when influencing others

You can see that while calculating, entities actually have 2 assigned
scalars:

   - the color hue (this also defines "coloring power" as mentioned above).
   The hue can be represented as a number, from 0 to 1, let's say. The
   coherence between the coloring power and the color number is not linear. It
   is more like the ends of the scale have more coloring power (0 and 1) while
   the middle (0.5) has less
   - the color "lightness" (how much an entity is colored, for originally
   colored clothes it's 1, for white ones it's 0), which in the same time also
   defines coloring power regardless of the hue

So, again, what I know:

   - which clothes where washed in which consecutive washing
   - I know the original color of some of them, the rest is white in the
   beginning

What I want to know:


   - the hue of all clothes in the end of the washing

The problem is that I don't know what (type) of algorithm should I start
with. If you were so kind to read so far, please suggest me something (or
further reading).

Obviously I don't ask for any detailed thing, again, only hints on the
algorithm.

Thank you!

Re: Washing machines - Mahout algorithm advice

Posted by Zoltán Tóth-Czifra <tc...@tcz.hu>.
Sean, Jack, the PR is an excellent idea, I was not even thinking about it,
but a tweaked version of the PageRank algorithm can absolutely work for
this, and is relatively easy to implement.

Ted, straight question, thank you. You are right, I'm not working for a
washing powder company nor a laundry. I was thinking that it's easier to
understand the idea with this example.

The real problem is the following:

Given a large set of professional profiles with their job history, I need
to find a general, automated way that can help selection process of job
applicants. People are "clothes", companies are "washing machines".

The idea is that people working in the same company and same field interact
with each other, and effect each others' professional level. Also,
companies for one field have "standard" hiring requirements. Having said
that, if we can set levels of some known professionals initially (or maybe
companies as well), we can estimate the professional level of the
individuals in their network.

For example if I worked together with Frank Gehry, I'm most likely a good
architect.

In my example hue is the level of a professional (red is excellent, blue is
poor, if you like it), while the "coloring power" means how certain we are
about the level. For the initial "cornerstone" professionals we are very
sure, so they are deep red/blue. Things like the size of the company and
the duration while they worked together influence the effect of the people
on each other. That's roughly it.

Of course it is very approximate and has a lot of guessing and rounding. I
don't believe it can substitute classic selection but it might be a good
aid. I need to test this.

On Sat, Mar 3, 2012 at 8:09 PM, Ted Dunning <te...@gmail.com> wrote:

> And further linear Markov chains can be expressed as matrix products which
> can be computed efficiently using SVD's.
>
> Zoltan, is this literally the problem that you are working on?  Or is this
> a shadow of the problem that you are interested in?
>
> On Sat, Mar 3, 2012 at 9:55 AM, Jack Tanner <ih...@hotmail.com> wrote:
>
> > It may further help to note that PageRank is a special case of a Markov
> > chain, and this washing system may well be described by a Markov chain
> > model.
> >
> >
> > On 3/3/2012 12:35 PM, Sean Owen wrote:
> >
> >> I answered on SO:
> >>
> >> The only thing I can think of that sounds like this problem is
> >> PageRank. It's computed by a sort of iterative simluation. Each page
> >> has some influence (color) which flows via its links (socks its washed
> >> with) and at some point the page influence reaches a steady state
> >> (final color). You can look up PageRank algorithms but it is
> >> essentially a matter of calculating eigenvectors of a big, erm, sock
> >> color matrix.
> >>
> >> On Sat, Mar 3, 2012 at 5:25 PM, Zoltán Tóth-Czifra<tc...@tcz.hu>  wrote:
> >>
> >>> Hi,
> >>>
> >>> I posted this on stack overflow, but I've been informed that the
> mailing
> >>> list is the appropriate forum to ask questions like this.
> >>>
> >>> What I need is actually just a hint where I can start, as I'm not sure
> >>> which direction to go. Although this is not a classic machine learning
> >>> problem, AFAIK Mahout is not only for those. I was thinking it's an
> >>> appropriate tool, because I work with a large set of data and the
> output
> >>> has similar characteristics than a classic Mahout job. I think a
> problem
> >>> below can be implemented with MapReduce, but I don't know if there is
> any
> >>> existing, similar algorithm. I need this to be scalable, that's why I
> >>> think
> >>> the Mahout on top of Hadoop's MapReduce power could be an appropriate
> >>> solution.
> >>>
> >>> Quickly what I want to do is:
> >>>
> >>> The final goal is to define one scalar (a "score") of each one of a set
> >>> of
> >>> entities based on some "known" entities. The entities interact with
> each
> >>> other, known scores influence and define the unknown ones. You can
> >>> imagine
> >>> with the following example.
> >>>
> >>> I have a lot if white clothes and a few pieces of colorful ones; red,
> >>> blue,
> >>> green... I put them into the washing machine. I want to know what
> colors
> >>> the white ones will get after the wash.
> >>>
> >>> Things to take into account:
> >>>
> >>>   - we make a series of washing with different "actors"... some clothes
> >>>   are washed in the 1st and 3rd washing, some of them only in the 2nd,
> >>> some
> >>>   of them are washed in all
> >>>   - in consecutive washes the clothes that were white before but now
> >>>   colored also influence the rest, but not as strong (as they are not
> as
> >>>   colored)
> >>>   - some colors don't "color" as much as others. for example red has a
> >>>   strong effect on most of the clothes, but green not so much
> >>>   - coloring effect also depends on how many clothes are in one
> washing.
> >>>   If you wash a red shirt with a white t-shirt, it gets much more
> >>> colored,
> >>>   than if there were 100 other white t-shirt
> >>>   - clothes don't "lose" their color when influencing others
> >>>
> >>> You can see that while calculating, entities actually have 2 assigned
> >>> scalars:
> >>>
> >>>   - the color hue (this also defines "coloring power" as mentioned
> >>> above).
> >>>   The hue can be represented as a number, from 0 to 1, let's say. The
> >>>   coherence between the coloring power and the color number is not
> >>> linear. It
> >>>   is more like the ends of the scale have more coloring power (0 and 1)
> >>> while
> >>>   the middle (0.5) has less
> >>>   - the color "lightness" (how much an entity is colored, for
> originally
> >>>   colored clothes it's 1, for white ones it's 0), which in the same
> time
> >>> also
> >>>   defines coloring power regardless of the hue
> >>>
> >>> So, again, what I know:
> >>>
> >>>   - which clothes where washed in which consecutive washing
> >>>   - I know the original color of some of them, the rest is white in the
> >>>   beginning
> >>>
> >>> What I want to know:
> >>>
> >>>
> >>>   - the hue of all clothes in the end of the washing
> >>>
> >>> The problem is that I don't know what (type) of algorithm should I
> start
> >>> with. If you were so kind to read so far, please suggest me something
> (or
> >>> further reading).
> >>>
> >>> Obviously I don't ask for any detailed thing, again, only hints on the
> >>> algorithm.
> >>>
> >>> Thank you!
> >>>
> >>
> >>
> >>
> >
>

Re: Washing machines - Mahout algorithm advice

Posted by Ted Dunning <te...@gmail.com>.
And further linear Markov chains can be expressed as matrix products which
can be computed efficiently using SVD's.

Zoltan, is this literally the problem that you are working on?  Or is this
a shadow of the problem that you are interested in?

On Sat, Mar 3, 2012 at 9:55 AM, Jack Tanner <ih...@hotmail.com> wrote:

> It may further help to note that PageRank is a special case of a Markov
> chain, and this washing system may well be described by a Markov chain
> model.
>
>
> On 3/3/2012 12:35 PM, Sean Owen wrote:
>
>> I answered on SO:
>>
>> The only thing I can think of that sounds like this problem is
>> PageRank. It's computed by a sort of iterative simluation. Each page
>> has some influence (color) which flows via its links (socks its washed
>> with) and at some point the page influence reaches a steady state
>> (final color). You can look up PageRank algorithms but it is
>> essentially a matter of calculating eigenvectors of a big, erm, sock
>> color matrix.
>>
>> On Sat, Mar 3, 2012 at 5:25 PM, Zoltán Tóth-Czifra<tc...@tcz.hu>  wrote:
>>
>>> Hi,
>>>
>>> I posted this on stack overflow, but I've been informed that the mailing
>>> list is the appropriate forum to ask questions like this.
>>>
>>> What I need is actually just a hint where I can start, as I'm not sure
>>> which direction to go. Although this is not a classic machine learning
>>> problem, AFAIK Mahout is not only for those. I was thinking it's an
>>> appropriate tool, because I work with a large set of data and the output
>>> has similar characteristics than a classic Mahout job. I think a problem
>>> below can be implemented with MapReduce, but I don't know if there is any
>>> existing, similar algorithm. I need this to be scalable, that's why I
>>> think
>>> the Mahout on top of Hadoop's MapReduce power could be an appropriate
>>> solution.
>>>
>>> Quickly what I want to do is:
>>>
>>> The final goal is to define one scalar (a "score") of each one of a set
>>> of
>>> entities based on some "known" entities. The entities interact with each
>>> other, known scores influence and define the unknown ones. You can
>>> imagine
>>> with the following example.
>>>
>>> I have a lot if white clothes and a few pieces of colorful ones; red,
>>> blue,
>>> green... I put them into the washing machine. I want to know what colors
>>> the white ones will get after the wash.
>>>
>>> Things to take into account:
>>>
>>>   - we make a series of washing with different "actors"... some clothes
>>>   are washed in the 1st and 3rd washing, some of them only in the 2nd,
>>> some
>>>   of them are washed in all
>>>   - in consecutive washes the clothes that were white before but now
>>>   colored also influence the rest, but not as strong (as they are not as
>>>   colored)
>>>   - some colors don't "color" as much as others. for example red has a
>>>   strong effect on most of the clothes, but green not so much
>>>   - coloring effect also depends on how many clothes are in one washing.
>>>   If you wash a red shirt with a white t-shirt, it gets much more
>>> colored,
>>>   than if there were 100 other white t-shirt
>>>   - clothes don't "lose" their color when influencing others
>>>
>>> You can see that while calculating, entities actually have 2 assigned
>>> scalars:
>>>
>>>   - the color hue (this also defines "coloring power" as mentioned
>>> above).
>>>   The hue can be represented as a number, from 0 to 1, let's say. The
>>>   coherence between the coloring power and the color number is not
>>> linear. It
>>>   is more like the ends of the scale have more coloring power (0 and 1)
>>> while
>>>   the middle (0.5) has less
>>>   - the color "lightness" (how much an entity is colored, for originally
>>>   colored clothes it's 1, for white ones it's 0), which in the same time
>>> also
>>>   defines coloring power regardless of the hue
>>>
>>> So, again, what I know:
>>>
>>>   - which clothes where washed in which consecutive washing
>>>   - I know the original color of some of them, the rest is white in the
>>>   beginning
>>>
>>> What I want to know:
>>>
>>>
>>>   - the hue of all clothes in the end of the washing
>>>
>>> The problem is that I don't know what (type) of algorithm should I start
>>> with. If you were so kind to read so far, please suggest me something (or
>>> further reading).
>>>
>>> Obviously I don't ask for any detailed thing, again, only hints on the
>>> algorithm.
>>>
>>> Thank you!
>>>
>>
>>
>>
>

Re: Washing machines - Mahout algorithm advice

Posted by Jack Tanner <ih...@hotmail.com>.
It may further help to note that PageRank is a special case of a Markov 
chain, and this washing system may well be described by a Markov chain 
model.

On 3/3/2012 12:35 PM, Sean Owen wrote:
> I answered on SO:
>
> The only thing I can think of that sounds like this problem is
> PageRank. It's computed by a sort of iterative simluation. Each page
> has some influence (color) which flows via its links (socks its washed
> with) and at some point the page influence reaches a steady state
> (final color). You can look up PageRank algorithms but it is
> essentially a matter of calculating eigenvectors of a big, erm, sock
> color matrix.
>
> On Sat, Mar 3, 2012 at 5:25 PM, Zoltán Tóth-Czifra<tc...@tcz.hu>  wrote:
>> Hi,
>>
>> I posted this on stack overflow, but I've been informed that the mailing
>> list is the appropriate forum to ask questions like this.
>>
>> What I need is actually just a hint where I can start, as I'm not sure
>> which direction to go. Although this is not a classic machine learning
>> problem, AFAIK Mahout is not only for those. I was thinking it's an
>> appropriate tool, because I work with a large set of data and the output
>> has similar characteristics than a classic Mahout job. I think a problem
>> below can be implemented with MapReduce, but I don't know if there is any
>> existing, similar algorithm. I need this to be scalable, that's why I think
>> the Mahout on top of Hadoop's MapReduce power could be an appropriate
>> solution.
>>
>> Quickly what I want to do is:
>>
>> The final goal is to define one scalar (a "score") of each one of a set of
>> entities based on some "known" entities. The entities interact with each
>> other, known scores influence and define the unknown ones. You can imagine
>> with the following example.
>>
>> I have a lot if white clothes and a few pieces of colorful ones; red, blue,
>> green... I put them into the washing machine. I want to know what colors
>> the white ones will get after the wash.
>>
>> Things to take into account:
>>
>>    - we make a series of washing with different "actors"... some clothes
>>    are washed in the 1st and 3rd washing, some of them only in the 2nd, some
>>    of them are washed in all
>>    - in consecutive washes the clothes that were white before but now
>>    colored also influence the rest, but not as strong (as they are not as
>>    colored)
>>    - some colors don't "color" as much as others. for example red has a
>>    strong effect on most of the clothes, but green not so much
>>    - coloring effect also depends on how many clothes are in one washing.
>>    If you wash a red shirt with a white t-shirt, it gets much more colored,
>>    than if there were 100 other white t-shirt
>>    - clothes don't "lose" their color when influencing others
>>
>> You can see that while calculating, entities actually have 2 assigned
>> scalars:
>>
>>    - the color hue (this also defines "coloring power" as mentioned above).
>>    The hue can be represented as a number, from 0 to 1, let's say. The
>>    coherence between the coloring power and the color number is not linear. It
>>    is more like the ends of the scale have more coloring power (0 and 1) while
>>    the middle (0.5) has less
>>    - the color "lightness" (how much an entity is colored, for originally
>>    colored clothes it's 1, for white ones it's 0), which in the same time also
>>    defines coloring power regardless of the hue
>>
>> So, again, what I know:
>>
>>    - which clothes where washed in which consecutive washing
>>    - I know the original color of some of them, the rest is white in the
>>    beginning
>>
>> What I want to know:
>>
>>
>>    - the hue of all clothes in the end of the washing
>>
>> The problem is that I don't know what (type) of algorithm should I start
>> with. If you were so kind to read so far, please suggest me something (or
>> further reading).
>>
>> Obviously I don't ask for any detailed thing, again, only hints on the
>> algorithm.
>>
>> Thank you!
>
>


Re: Washing machines - Mahout algorithm advice

Posted by Sean Owen <sr...@gmail.com>.
I answered on SO:

The only thing I can think of that sounds like this problem is
PageRank. It's computed by a sort of iterative simluation. Each page
has some influence (color) which flows via its links (socks its washed
with) and at some point the page influence reaches a steady state
(final color). You can look up PageRank algorithms but it is
essentially a matter of calculating eigenvectors of a big, erm, sock
color matrix.

On Sat, Mar 3, 2012 at 5:25 PM, Zoltán Tóth-Czifra <tc...@tcz.hu> wrote:
> Hi,
>
> I posted this on stack overflow, but I've been informed that the mailing
> list is the appropriate forum to ask questions like this.
>
> What I need is actually just a hint where I can start, as I'm not sure
> which direction to go. Although this is not a classic machine learning
> problem, AFAIK Mahout is not only for those. I was thinking it's an
> appropriate tool, because I work with a large set of data and the output
> has similar characteristics than a classic Mahout job. I think a problem
> below can be implemented with MapReduce, but I don't know if there is any
> existing, similar algorithm. I need this to be scalable, that's why I think
> the Mahout on top of Hadoop's MapReduce power could be an appropriate
> solution.
>
> Quickly what I want to do is:
>
> The final goal is to define one scalar (a "score") of each one of a set of
> entities based on some "known" entities. The entities interact with each
> other, known scores influence and define the unknown ones. You can imagine
> with the following example.
>
> I have a lot if white clothes and a few pieces of colorful ones; red, blue,
> green... I put them into the washing machine. I want to know what colors
> the white ones will get after the wash.
>
> Things to take into account:
>
>   - we make a series of washing with different "actors"... some clothes
>   are washed in the 1st and 3rd washing, some of them only in the 2nd, some
>   of them are washed in all
>   - in consecutive washes the clothes that were white before but now
>   colored also influence the rest, but not as strong (as they are not as
>   colored)
>   - some colors don't "color" as much as others. for example red has a
>   strong effect on most of the clothes, but green not so much
>   - coloring effect also depends on how many clothes are in one washing.
>   If you wash a red shirt with a white t-shirt, it gets much more colored,
>   than if there were 100 other white t-shirt
>   - clothes don't "lose" their color when influencing others
>
> You can see that while calculating, entities actually have 2 assigned
> scalars:
>
>   - the color hue (this also defines "coloring power" as mentioned above).
>   The hue can be represented as a number, from 0 to 1, let's say. The
>   coherence between the coloring power and the color number is not linear. It
>   is more like the ends of the scale have more coloring power (0 and 1) while
>   the middle (0.5) has less
>   - the color "lightness" (how much an entity is colored, for originally
>   colored clothes it's 1, for white ones it's 0), which in the same time also
>   defines coloring power regardless of the hue
>
> So, again, what I know:
>
>   - which clothes where washed in which consecutive washing
>   - I know the original color of some of them, the rest is white in the
>   beginning
>
> What I want to know:
>
>
>   - the hue of all clothes in the end of the washing
>
> The problem is that I don't know what (type) of algorithm should I start
> with. If you were so kind to read so far, please suggest me something (or
> further reading).
>
> Obviously I don't ask for any detailed thing, again, only hints on the
> algorithm.
>
> Thank you!