Posted to mapreduce-user@hadoop.apache.org by Em <ma...@yahoo.de> on 2011/07/18 19:03:45 UTC

How would you translate this into MapReduce?

Hello list,

as a newbie I have a tricky use-case in mind which I want to implement
with Hadoop to train my skillz. There is no real scenario behind it,
so I can extend or shrink the problem as much as I like.

I create random lists of person-IDs and places plus a time-value.

The result of my map-reduce operations should be something like this:
the key is a place and the value is a list of places that were visited
by persons after they visited the key place.
Additionally, the value should be sorted by some time/count-biased
metric, so that the value list reflects which place was, for example,
the most popular second station on a tour.

I think this is a complex, almost real-world scenario.

In pseudo-code it will be something like this:
for every place p
    for every person m that visited p
        select the list l of all places that m visited after p
        write a key-value pair p => l to disk, with l in visit order

for every key k in the list of key-value pairs
    get k's value list of places v
    for every place in v, create another key-value pair pv whose key
    is the place and whose value is its index in v

for every key k
    get all pv pairs for k
    aggregate the pv pairs by key and sum up the index i for every
    place p, so that it becomes the key-value pair opv
    sort opv in ascending order by its value

The result would be what I wanted, no?
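Before touching Hadoop, the logic of the pseudo-code can be checked in plain local Python. This is only a sketch of the intended computation on made-up toy data; the variable names and the position-sum metric are taken from the pseudo-code above, not from any real system:

```python
from collections import defaultdict

# Toy visit log: (person, place, time) - purely illustrative data.
visits = [
    ("joe", "Washington", 1), ("joe", "London", 2),
    ("joe", "Paris", 3), ("joe", "Moscow", 4),
    ("bill", "London", 1), ("bill", "Moscow", 2),
]

# Phase 1: rebuild each person's itinerary in time order.
itinerary = defaultdict(list)
for person, place, t in sorted(visits, key=lambda v: (v[0], v[2])):
    itinerary[person].append(place)

# Phase 2: for every place p, sum up the index of each place that was
# visited after p (index 0 = visited immediately after p).
score = defaultdict(lambda: defaultdict(int))
for places in itinerary.values():
    for i, p in enumerate(places):
        for j, follower in enumerate(places[i + 1:]):
            score[p][follower] += j

# Phase 3: per key place, sort followers ascending by summed index.
result = {p: sorted(f, key=f.get) for p, f in score.items()}
print(result["Washington"])  # → ['London', 'Paris', 'Moscow']
```

Whether a raw index sum is the right time/count-biased metric is exactly the kind of thing such a local prototype makes cheap to experiment with.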

It looks like I need multiple MR-phases, however I do not even know how
to start.

My first guess is: create an MR job where I invert my list so that I get
a place as the key and, as the value, all persons that visited it.
The next phase needs to iterate over the value's persons and join with
the original data to find out when each person visited this place
and what places came next.
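The inversion step in this first guess is a classic word-count-shaped job. As a hedged local sketch (plain Python standing in for the map and reduce phases, with invented toy data):

```python
from collections import defaultdict

# Toy input rows: (person, place, time), as in the generated lists above.
rows = [
    ("em", "Berlin", 10), ("em", "Riga", 20),
    ("joe", "Berlin", 5),
]

# Map: emit (place, person); Reduce: collect all persons per place.
place_to_persons = defaultdict(set)
for person, place, _time in rows:
    place_to_persons[place].add(person)

print(sorted(place_to_persons["Berlin"]))  # → ['em', 'joe']
```

In a real job the mapper would emit `place \t person` lines and the reducer would receive them already grouped by place.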
And now the problems arise:
- First: what happens with places that are so popular that the number of
persons that visited them is too large to pass the whole KV pair to a
single node to iterate over?
- Second: I need to re-join the original data. Without a database this
would be extremely slow, wouldn't it?

I hope that you guys can give me some ideas and input to make my first
serious steps in Hadoop-land.

Regards,
Em

Re: How would you translate this into MapReduce?

Posted by Em <ma...@yahoo.de>.
Interesting to see the upper bound for Hadoop.
However, I guess this is a rare problem.

I'll try to implement what we discussed so far and train myself.

Regards,
Em

Am 19.07.2011 21:40, schrieb Steve Lewis:
> If the size of a record is too big to be processed by a node you
> probably need to re-architect using a different 
> record which scales better and combines cleanly 
> You also need to ask at the start what data you need to retrieve and how
> you intend to retrieve it- 
> at some point a database may start to look like a good solution although
> in this case I might think about saying I can track the order of trips
> to - say 16 and using a comma delimited list for the counts 
> 
> On Tue, Jul 19, 2011 at 11:14 AM, Em <mailformailinglists@yahoo.de> wrote:
> 
>     Of course it won't scale or at least not as good as your suggested
>     model. Chances are good that my idea is not an option for a
>     production-system and not as useful as the less-complex variant. So you
>     are right!
> 
>     The reason why I asked was to get an idea of what should be done, if a
>     record is too big to be processable by a node.
> 
>     Regards,
>     Em
> 
>     Am 19.07.2011 19:54, schrieb Steve Lewis:
>     > I assumed the problem was count the number of people visiting Moscow
>     > after London without considering any intermediate stops. This
>     leads to
>     > a data structure which is easy to combine. The structure you propose
>     > adds more information and is difficult to combine. I doubt it could
>     > handle a billion people and  recommend trying with a hundred people
>     > visiting 5 out of 20 destinations in random order to see how bad it is
>     > getting.
>     >
>     > My schema can handle billions of combinations assuming only that the
>     > total destinations in any node can be handled - i.e. a billion people
>     > can visit any of a thousand cities in random order and worst case
>     I need
>     > a thousand cities and a thousand counts - now I doubt that the schema
>     > you propose with added order information will scale to those levels
>     >
>     > On Tue, Jul 19, 2011 at 10:39 AM, Em <mailformailinglists@yahoo.de> wrote:
>     >
>     >     Thanks!
>     >
>     >     So you invert the data and then walk through each inverted result.
>     >     Good point!
>     >     What do you think about prefixing each city-name with the index in
>     >     the list?
>     >
>     >     This way you can say:
>     >     London: 1_Moscow:2, 1_Paris:2, 2_Moscow:1, 2_Riga:4, 2_Paris:1,
>     >     3_Berlin:1...
>     >
>     >     From this list you can see that people are likely to visit
>     Moscow right
>     >     after London on their first or second journey. This would
>     maintain a
>     >     strong order (whether that's good or bad depends on a
>     >     real-world-scenario).
>     >
>     >     Since your ideas gave me a good starting-point for realizing
>     this job
>     >     (I'll practice it), we can make the problem more heavy-weight, if
>     >     you like?
>     >
>     >     What happens to records that are too big to be processable by
>     one node?
>     >     Let's say from my above example of a strongly-ordered list one
>     gets a
>     >     billion combinations - way too much for one node (we assume that).
>     >     What possibilities does Hadoop offer to deal with such things?
>     >
>     >     Regards and many thanks for the insights,
>     >     Em
>     >
>     >
>     >     Am 19.07.2011 19:15, schrieb Steve Lewis:
>     >     > Assume Joe visits Washington, London, Paris and Moscow
>     >     >
>     >     > You start with records like
>     >     > Joe:Washington:20-Jan-2011
>     >     > Joe:London:14-Feb-2011
>     >     > Joe:Paris :9-Mar-2011
>     >     >
>     >     > You want
>     >     > Joe: Washington, London, Paris and Moscow
>     >     >
>     >     > For the next step the person is irrelevant
>     >     > you want
>     >     >
>     >     >
>     >     > Washington:  London:1, Paris:1 ,Moscow:1
>     >     >  London: , Paris:1  Moscow:1
>     >     >  Paris:   Moscow:1
>     >     > The first line says that after a visit to Washington there was one visit to
>     >     London,
>     >     > one to Paris and one to Moscow
>     >     >
>     >     >
>     >     > This can be combined with the one from Joe
>     >     >
>     >     >
>     >     > Now suppose Bill visits London and Moscow
>     >     > So he generates
>     >     > London:    Moscow:1
>     >     >
>     >     > This can be combined with the one from Joe saying  London: ,
>     >     Paris:1 and
>     >     > Moscow:1
>     >     >  to give
>     >     >
>     >     >  London: , Paris:1 and Moscow:2
>     >     >
>     >     > Now suppose Sue visits London and  Riga and Paris
>     >     > So she generates
>     >     > London: , Paris:1,Riga 1
>     >     >
>     >     > This can be combined with  London: , Paris:1 and Moscow:2 to
>     give
>     >     >
>     >     > London: , Paris:2 and Moscow:2,Riga 1
>     >     >
>     >     > Note I can keep places in alphabetical order in the result
>     >     >
>     >     >
>     >     >
>     >     > On Tue, Jul 19, 2011 at 9:53 AM, Em <mailformailinglists@yahoo.de> wrote:
>     >     >
>     >     >     Hi Steven,
>     >     >
>     >     >     thanks for your response! For the ease of use we can
>     make those
>     >     >     assumptions you made - maybe this makes it much easier to
>     >     help. Those
>     >     >     little extras are something for after solving the "easy"
>     >     version of the
>     >     >     task. :)
>     >     >
>     >     >     What do you mean with the following?
>     >     >
>     >     >     > The second job takes Person : list of places and
>     return for
>     >     each place
>     >     >     > in the list constructs
>     >     >     > place : 1 | place after P : 1 | next place : 1 ...
>     >     >
>     >     >     You mean something like that?
>     >     >
>     >     >     Washington DC:1
>     >     >     New York after Washington DC:1
>     >     >     Miami after New York:1
>     >     >
>     >     >     I do not see the benefit for the result I like to get?
>     >     >
>     >     >     The end-result should be something like that:
>     >     >     Washington DC => New York, Miami, Los Angeles
>     >     >     New York => Chicago, Seattle, San Francisco
>     >     >
>     >     >     The point is, that one can see that persons that visited
>     >     Washington DC
>     >     >     are likely to visit New York as the next place, Miami as the
>     >     second and
>     >     >     L.A. as the third.
>     >     >     However, if I choose New York as my starting point, I
>     can see that
>     >     >     persons that start their journey in New York (and maybe
>     >     weren't in DC
>     >     >     before) are likely to visit Chicago, Seattle and San
>     >     Francisco. Maybe
>     >     >     Los Angeles comes at the 10th position.
>     >     >
>     >     >     Regards,
>     >     >     Em
>     >     >
>     >     >
>     >     >
>     >     >
>     >     > --
>     >     > Steven M. Lewis PhD
>     >     > 4221 105th Ave NE
>     >     > Kirkland, WA 98033
>     >     > 206-384-1340 (cell)
>     >     > Skype lordjoe_com
>     >     >
>     >     >
>     >
>     >
>     >
>     >
>     > --
>     > Steven M. Lewis PhD
>     > 4221 105th Ave NE
>     > Kirkland, WA 98033
>     > 206-384-1340 (cell)
>     > Skype lordjoe_com
>     >
>     >
> 
> 
> 
> 
> -- 
> Steven M. Lewis PhD
> 4221 105th Ave NE
> Kirkland, WA 98033
> 206-384-1340 (cell)
> Skype lordjoe_com
> 
> 

Re: How would you translate this into MapReduce?

Posted by Steve Lewis <lo...@gmail.com>.
If the size of a record is too big to be processed by a node, you probably
need to re-architect using a different
record which scales better and combines cleanly.
You also need to ask at the start what data you need to retrieve and how you
intend to retrieve it;
at some point a database may start to look like a good solution, although in
this case I might think about capping the order of trips I track at, say,
16 and using a comma-delimited list for the counts.

On Tue, Jul 19, 2011 at 11:14 AM, Em <ma...@yahoo.de> wrote:

> Of course it won't scale or at least not as good as your suggested
> model. Chances are good that my idea is not an option for a
> production-system and not as useful as the less-complex variant. So you
> are right!
>
> The reason why I asked was to get an idea of what should be done, if a
> record is too big to be processable by a node.
>
> Regards,
> Em
>
> Am 19.07.2011 19:54, schrieb Steve Lewis:
> > I assumed the problem was count the number of people visiting Moscow
> > after London without considering any intermediate stops. This leads to
> > a data structure which is easy to combine. The structure you propose
> > adds more information and is difficult to combine. I doubt it could
> > handle a billion people and  recommend trying with a hundred people
> > visiting 5 out of 20 destinations in random order to see how bad it is
> > getting.
> >
> > My schema can handle billions of combinations assuming only that the
> > total destinations in any node can be handled - i.e. a billion people
> > can visit any of a thousand cities in random order and worst case I need
> > a thousand cities and a thousand counts - now I doubt that the schema
> > you propose with added order information will scale to those levels
> >
> > On Tue, Jul 19, 2011 at 10:39 AM, Em <mailformailinglists@yahoo.de> wrote:
> >
> >     Thanks!
> >
> >     So you invert the data and then walk through each inverted result.
> >     Good point!
> >     What do you think about prefixing each city-name with the index in
> >     the list?
> >
> >     This way you can say:
> >     London: 1_Moscow:2, 1_Paris:2, 2_Moscow:1, 2_Riga:4, 2_Paris:1,
> >     3_Berlin:1...
> >
> >     From this list you can see that people are likely to visit Moscow
> right
> >     after London on their first or second journey. This would maintain a
> >     strong order (whether that's good or bad depends on a
> >     real-world-scenario).
> >
> >     Since your ideas gave me a good starting-point for realizing this job
> >     (I'll practice it), we can make the problem more heavy-weight, if
> >     you like?
> >
> >     What happens to records that are too big to be processable by one
> node?
> >     Let's say from my above example of a strongly-ordered list one gets a
> >     billion combinations - way too much for one node (we assume that).
> >     What possibilities does Hadoop offer to deal with such things?
> >
> >     Regards and many thanks for the insights,
> >     Em
> >
> >
> >     Am 19.07.2011 19:15, schrieb Steve Lewis:
> >     > Assume Joe visits Washington, London, Paris and Moscow
> >     >
> >     > You start with records like
> >     > Joe:Washington:20-Jan-2011
> >     > Joe:London:14-Feb-2011
> >     > Joe:Paris :9-Mar-2011
> >     >
> >     > You want
> >     > Joe: Washington, London, Paris and Moscow
> >     >
> >     > For the next step the person is irrelevant
> >     > you want
> >     >
> >     >
> >     > Washington:  London:1, Paris:1 ,Moscow:1
> >     >  London: , Paris:1  Moscow:1
> >     >  Paris:   Moscow:1
> >     > The first line says that after a visit to Washington there was one visit to
> >     London,
> >     > one to Paris and one to Moscow
> >     >
> >     >
> >     > This can be combined with the one from Joe
> >     >
> >     >
> >     > Now suppose Bill visits London and Moscow
> >     > So he generates
> >     > London:    Moscow:1
> >     >
> >     > This can be combined with the one from Joe saying  London: ,
> >     Paris:1 and
> >     > Moscow:1
> >     >  to give
> >     >
> >     >  London: , Paris:1 and Moscow:2
> >     >
> >     > Now suppose Sue visits London and  Riga and Paris
> >     > So she generates
> >     > London: , Paris:1,Riga 1
> >     >
> >     > This can be combined with  London: , Paris:1 and Moscow:2 to give
> >     >
> >     > London: , Paris:2 and Moscow:2,Riga 1
> >     >
> >     > Note I can keep places in alphabetical order in the result
> >     >
> >     >
> >     >
> >     > On Tue, Jul 19, 2011 at 9:53 AM, Em <mailformailinglists@yahoo.de> wrote:
> >     >
> >     >     Hi Steven,
> >     >
> >     >     thanks for your response! For the ease of use we can make those
> >     >     assumptions you made - maybe this makes it much easier to
> >     help. Those
> >     >     little extras are something for after solving the "easy"
> >     version of the
> >     >     task. :)
> >     >
> >     >     What do you mean with the following?
> >     >
> >     >     > The second job takes Person : list of places and return for
> >     each place
> >     >     > in the list constructs
> >     >     > place : 1 | place after P : 1 | next place : 1 ...
> >     >
> >     >     You mean something like that?
> >     >
> >     >     Washington DC:1
> >     >     New York after Washington DC:1
> >     >     Miami after New York:1
> >     >
> >     >     I do not see the benefit for the result I like to get?
> >     >
> >     >     The end-result should be something like that:
> >     >     Washington DC => New York, Miami, Los Angeles
> >     >     New York => Chicago, Seattle, San Francisco
> >     >
> >     >     The point is, that one can see that persons that visited
> >     Washington DC
> >     >     are likely to visit New York as the next place, Miami as the
> >     second and
> >     >     L.A. as the third.
> >     >     However, if I choose New York as my starting point, I can see
> that
> >     >     persons that start their journey in New York (and maybe
> >     weren't in DC
> >     >     before) are likely to visit Chicago, Seattle and San
> >     Francisco. Maybe
> >     >     Los Angeles comes at the 10th position.
> >     >
> >     >     Regards,
> >     >     Em
> >     >
> >     >
> >     >
> >     >
> >     > --
> >     > Steven M. Lewis PhD
> >     > 4221 105th Ave NE
> >     > Kirkland, WA 98033
> >     > 206-384-1340 (cell)
> >     > Skype lordjoe_com
> >     >
> >     >
> >
> >
> >
> >
> > --
> > Steven M. Lewis PhD
> > 4221 105th Ave NE
> > Kirkland, WA 98033
> > 206-384-1340 (cell)
> > Skype lordjoe_com
> >
> >
>



-- 
Steven M. Lewis PhD
4221 105th Ave NE
Kirkland, WA 98033
206-384-1340 (cell)
Skype lordjoe_com

Re: How would you translate this into MapReduce?

Posted by Em <ma...@yahoo.de>.
Of course it won't scale, or at least not as well as your suggested
model. Chances are good that my idea is not an option for a
production system and not as useful as the less complex variant. So you
are right!

The reason why I asked was to get an idea of what should be done if a
record is too big to be processed by a node.

Regards,
Em

Am 19.07.2011 19:54, schrieb Steve Lewis:
> I assumed the problem was count the number of people visiting Moscow
> after London without considering any intermediate stops. This leads to
> a data structure which is easy to combine. The structure you propose
> adds more information and is difficult to combine. I doubt it could
> handle a billion people and  recommend trying with a hundred people
> visiting 5 out of 20 destinations in random order to see how bad it is
> getting. 
> 
> My schema can handle billions of combinations assuming only that the
> total destinations in any node can be handled - i.e. a billion people
> can visit any of a thousand cities in random order and worst case I need
> a thousand cities and a thousand counts - now I doubt that the schema
> you propose with added order information will scale to those levels
> 
> On Tue, Jul 19, 2011 at 10:39 AM, Em <mailformailinglists@yahoo.de> wrote:
> 
>     Thanks!
> 
>     So you invert the data and then walk through each inverted result.
>     Good point!
>     What do you think about prefixing each city-name with the index in
>     the list?
> 
>     This way you can say:
>     London: 1_Moscow:2, 1_Paris:2, 2_Moscow:1, 2_Riga:4, 2_Paris:1,
>     3_Berlin:1...
> 
>     From this list you can see that people are likely to visit Moscow right
>     after London on their first or second journey. This would maintain a
>     strong order (whether that's good or bad depends on a
>     real-world-scenario).
> 
>     Since your ideas gave me a good starting-point for realizing this job
>     (I'll practice it), we can make the problem more heavy-weight, if
>     you like?
> 
>     What happens to records that are too big to be processable by one node?
>     Let's say from my above example of a strongly-ordered list one gets a
>     billion combinations - way too much for one node (we assume that).
>     What possibilities does Hadoop offer to deal with such things?
> 
>     Regards and many thanks for the insights,
>     Em
> 
> 
>     Am 19.07.2011 19:15, schrieb Steve Lewis:
>     > Assume Joe visits Washington, London, Paris and Moscow
>     >
>     > You start with records like
>     > Joe:Washington:20-Jan-2011
>     > Joe:London:14-Feb-2011
>     > Joe:Paris :9-Mar-2011
>     >
>     > You want
>     > Joe: Washington, London, Paris and Moscow
>     >
>     > For the next step the person is irrelevant
>     > you want
>     >
>     >
>     > Washington:  London:1, Paris:1 ,Moscow:1
>     >  London: , Paris:1  Moscow:1
>     >  Paris:   Moscow:1
>     > The first line says that after a visit to Washington there was one visit to
>     London,
>     > one to Paris and one to Moscow
>     >
>     >
>     > This can be combined with the one from Joe
>     >
>     >
>     > Now suppose Bill visits London and Moscow
>     > So he generates
>     > London:    Moscow:1
>     >
>     > This can be combined with the one from Joe saying  London: ,
>     Paris:1 and
>     > Moscow:1
>     >  to give
>     >
>     >  London: , Paris:1 and Moscow:2
>     >
>     > Now suppose Sue visits London and  Riga and Paris
>     > So she generates
>     > London: , Paris:1,Riga 1
>     >
>     > This can be combined with  London: , Paris:1 and Moscow:2 to give
>     >
>     > London: , Paris:2 and Moscow:2,Riga 1
>     >
>     > Note I can keep places in alphabetical order in the result
>     >
>     >
>     >
>     > On Tue, Jul 19, 2011 at 9:53 AM, Em <mailformailinglists@yahoo.de> wrote:
>     >
>     >     Hi Steven,
>     >
>     >     thanks for your response! For the ease of use we can make those
>     >     assumptions you made - maybe this makes it much easier to
>     help. Those
>     >     little extras are something for after solving the "easy"
>     version of the
>     >     task. :)
>     >
>     >     What do you mean with the following?
>     >
>     >     > The second job takes Person : list of places and return for
>     each place
>     >     > in the list constructs
>     >     > place : 1 | place after P : 1 | next place : 1 ...
>     >
>     >     You mean something like that?
>     >
>     >     Washington DC:1
>     >     New York after Washington DC:1
>     >     Miami after New York:1
>     >
>     >     I do not see the benefit for the result I like to get?
>     >
>     >     The end-result should be something like that:
>     >     Washington DC => New York, Miami, Los Angeles
>     >     New York => Chicago, Seattle, San Francisco
>     >
>     >     The point is, that one can see that persons that visited
>     Washington DC
>     >     are likely to visit New York as the next place, Miami as the
>     second and
>     >     L.A. as the third.
>     >     However, if I choose New York as my starting point, I can see that
>     >     persons that start their journey in New York (and maybe
>     weren't in DC
>     >     before) are likely to visit Chicago, Seattle and San
>     Francisco. Maybe
>     >     Los Angeles comes at the 10th position.
>     >
>     >     Regards,
>     >     Em
>     >
>     >
>     >
>     >
>     > --
>     > Steven M. Lewis PhD
>     > 4221 105th Ave NE
>     > Kirkland, WA 98033
>     > 206-384-1340 (cell)
>     > Skype lordjoe_com
>     >
>     >
> 
> 
> 
> 
> -- 
> Steven M. Lewis PhD
> 4221 105th Ave NE
> Kirkland, WA 98033
> 206-384-1340 (cell)
> Skype lordjoe_com
> 
> 

Re: How would you translate this into MapReduce?

Posted by Steve Lewis <lo...@gmail.com>.
I assumed the problem was to count the number of people visiting Moscow after
London without considering any intermediate stops. This leads to a data
structure which is easy to combine. The structure you propose adds more
information and is difficult to combine. I doubt it could handle a billion
people, and I recommend trying with a hundred people visiting 5 out of 20
destinations in random order to see how bad it gets.

My schema can handle billions of combinations, assuming only that the total
destinations in any node can be handled - i.e. a billion people can visit
any of a thousand cities in random order, and in the worst case I need a
thousand cities and a thousand counts. I doubt that the schema you propose
with added order information will scale to those levels.

On Tue, Jul 19, 2011 at 10:39 AM, Em <ma...@yahoo.de> wrote:

> Thanks!
>
> So you invert the data and then walk through each inverted result.
> Good point!
> What do you think about prefixing each city-name with the index in the
> list?
>
> This way you can say:
> London: 1_Moscow:2, 1_Paris:2, 2_Moscow:1, 2_Riga:4, 2_Paris:1,
> 3_Berlin:1...
>
> From this list you can see that people are likely to visit Moscow right
> after London on their first or second journey. This would maintain a
> strong order (whether that's good or bad depends on a real-world-scenario).
>
> Since your ideas gave me a good starting-point for realizing this job
> (I'll practice it), we can make the problem more heavy-weight, if you like?
>
> What happens to records that are too big to be processable by one node?
> Let's say from my above example of a strongly-ordered list one gets a
> billion combinations - way too much for one node (we assume that).
> What possibilities does Hadoop offer to deal with such things?
>
> Regards and many thanks for the insights,
> Em
>
>
> Am 19.07.2011 19:15, schrieb Steve Lewis:
> > Assume Joe visits Washington, London, Paris and Moscow
> >
> > You start with records like
> > Joe:Washington:20-Jan-2011
> > Joe:London:14-Feb-2011
> > Joe:Paris :9-Mar-2011
> >
> > You want
> > Joe: Washington, London, Paris and Moscow
> >
> > For the next step the person is irrelevant
> > you want
> >
> >
> > Washington:  London:1, Paris:1 ,Moscow:1
> >  London: , Paris:1  Moscow:1
> >  Paris:   Moscow:1
> > The first line says that after a visit to Washington there was one visit to London,
> > one to Paris and one to Moscow
> >
> >
> > This can be combined with the one from Joe
> >
> >
> > Now suppose Bill visits London and Moscow
> > So he generates
> > London:    Moscow:1
> >
> > This can be combined with the one from Joe saying  London: , Paris:1 and
> > Moscow:1
> >  to give
> >
> >  London: , Paris:1 and Moscow:2
> >
> > Now suppose Sue visits London and  Riga and Paris
> > So she generates
> > London: , Paris:1,Riga 1
> >
> > This can be combined with  London: , Paris:1 and Moscow:2 to give
> >
> > London: , Paris:2 and Moscow:2,Riga 1
> >
> > Note I can keep places in alphabetical order in the result
> >
> >
> >
> > On Tue, Jul 19, 2011 at 9:53 AM, Em <mailformailinglists@yahoo.de> wrote:
> >
> >     Hi Steven,
> >
> >     thanks for your response! For the ease of use we can make those
> >     assumptions you made - maybe this makes it much easier to help. Those
> >     little extras are something for after solving the "easy" version of
> the
> >     task. :)
> >
> >     What do you mean with the following?
> >
> >     > The second job takes Person : list of places and return for each
> place
> >     > in the list constructs
> >     > place : 1 | place after P : 1 | next place : 1 ...
> >
> >     You mean something like that?
> >
> >     Washington DC:1
> >     New York after Washington DC:1
> >     Miami after New York:1
> >
> >     I do not see the benefit for the result I like to get?
> >
> >     The end-result should be something like that:
> >     Washington DC => New York, Miami, Los Angeles
> >     New York => Chicago, Seattle, San Francisco
> >
> >     The point is, that one can see that persons that visited Washington
> DC
> >     are likely to visit New York as the next place, Miami as the second
> and
> >     L.A. as the third.
> >     However, if I choose New York as my starting point, I can see that
> >     persons that start their journey in New York (and maybe weren't in DC
> >     before) are likely to visit Chicago, Seattle and San Francisco. Maybe
> >     Los Angeles comes at the 10th position.
> >
> >     Regards,
> >     Em
> >
> >
> >
> >
> > --
> > Steven M. Lewis PhD
> > 4221 105th Ave NE
> > Kirkland, WA 98033
> > 206-384-1340 (cell)
> > Skype lordjoe_com
> >
> >
>



-- 
Steven M. Lewis PhD
4221 105th Ave NE
Kirkland, WA 98033
206-384-1340 (cell)
Skype lordjoe_com

Re: How would you translate this into MapReduce?

Posted by Em <ma...@yahoo.de>.
Thanks!

So you invert the data and then walk through each inverted result.
Good point!
What do you think about prefixing each city name with its index in the list?

This way you can say:
London: 1_Moscow:2, 1_Paris:2, 2_Moscow:1, 2_Riga:4, 2_Paris:1,
3_Berlin:1...

From this list you can see that people are likely to visit Moscow right
after London on their first or second journey. This would maintain a
strong order (whether that's good or bad depends on a real-world scenario).
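One way to read the prefixed scheme in code (a local Python sketch, not Hadoop; the assumption here is that the prefix means how many stops after the key place the city was visited, and the itineraries are invented for illustration):

```python
from collections import Counter, defaultdict

# Toy itineraries: each person's places in visit order.
itineraries = {
    "Joe":  ["Washington", "London", "Paris", "Moscow"],
    "Bill": ["London", "Moscow"],
    "Sue":  ["London", "Riga", "Paris"],
}

ordered = defaultdict(Counter)
for places in itineraries.values():
    for i, p in enumerate(places):
        for step, follower in enumerate(places[i + 1:], start=1):
            # Key like "1_Moscow" means: Moscow visited 1 stop after p.
            ordered[p][f"{step}_{follower}"] += 1

print(dict(ordered["London"]))
```

Note how the key space now multiplies by position; that is what makes this variant combine less cleanly than position-free counts.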

Since your ideas gave me a good starting point for realizing this job
(I'll practice it), can we make the problem more heavy-weight, if you like?

What happens to records that are too big to be processable by one node?
Let's say from my above example of a strongly-ordered list one gets a
billion combinations - way too much for one node (we assume that).
What possibilities does Hadoop offer to deal with such things?

Regards and many thanks for the insights,
Em


Am 19.07.2011 19:15, schrieb Steve Lewis:
> Assume Joe visits Washington, London, Paris and Moscow
> 
> You start with records like
> Joe:Washington:20-Jan-2011
> Joe:London:14-Feb-2011
> Joe:Paris :9-Mar-2011
> 
> You want
> Joe: Washington, London, Paris and Moscow
> 
> For the next step the person is irrelevant
> you want 
> 
> 
> Washington:  London:1, Paris:1 ,Moscow:1
>  London: , Paris:1  Moscow:1
>  Paris:   Moscow:1
> The first line says that after a visit to Washington there was one visit to London,
> one to Paris and one to Moscow
> 
> 
> This can be combined with the one from Joe
> 
> 
> Now suppose Bill visits London and Moscow
> So he generates 
> London:    Moscow:1
> 
> This can be combined with the one from Joe saying  London: , Paris:1 and
> Moscow:1
>  to give
> 
>  London: , Paris:1 and Moscow:2
> 
> Now suppose Sue visits London and  Riga and Paris
> So she generates 
> London: , Paris:1,Riga 1
> 
> This can be combined with  London: , Paris:1 and Moscow:2 to give
> 
> London: , Paris:2 and Moscow:2,Riga 1
> 
> Note I can keep places in alphabetical order in the result
> 
> 
> 
> On Tue, Jul 19, 2011 at 9:53 AM, Em <mailformailinglists@yahoo.de> wrote:
> 
>     Hi Steven,
> 
>     thanks for your response! For the ease of use we can make those
>     assumptions you made - maybe this makes it much easier to help. Those
>     little extras are something for after solving the "easy" version of the
>     task. :)
> 
>     What do you mean with the following?
> 
>     > The second job takes Person : list of places and return for each place
>     > in the list constructs
>     > place : 1 | place after P : 1 | next place : 1 ...
> 
>     You mean something like that?
> 
>     Washington DC:1
>     New York after Washington DC:1
>     Miami after New York:1
> 
>     I do not see the benefit for the result I like to get?
> 
>     The end-result should be something like that:
>     Washington DC => New York, Miami, Los Angeles
>     New York => Chicago, Seattle, San Francisco
> 
>     The point is, that one can see that persons that visited Washington DC
>     are likely to visit New York as the next place, Miami as the second and
>     L.A. as the third.
>     However, if I choose New York as my starting point, I can see that
>     persons that start their journey in New York (and maybe weren't in DC
>     before) are likely to visit Chicago, Seattle and San Francisco. Maybe
>     Los Angeles comes at the 10th position.
> 
>     Regards,
>     Em
> 
> 
> 
> 
> -- 
> Steven M. Lewis PhD
> 4221 105th Ave NE
> Kirkland, WA 98033
> 206-384-1340 (cell)
> Skype lordjoe_com
> 
> 

Re: How would you translate this into MapReduce?

Posted by Steve Lewis <lo...@gmail.com>.
Assume Joe visits Washington, London, Paris and Moscow

You start with records like
Joe:Washington:20-Jan-2011
Joe:London:14-Feb-2011
Joe:Paris:9-Mar-2011

You want
Joe: Washington, London, Paris and Moscow

For the next step the person is irrelevant
you want


Washington: London:1, Paris:1, Moscow:1
London: Paris:1, Moscow:1
Paris: Moscow:1
The first line says that after a visit to Washington there was one visit to
London, one to Paris and one to Moscow.
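As a sanity check, that per-person counting step can be sketched in plain
Python (this only illustrates the logic, it is not Hadoop code, and the
function name is mine):

```python
from collections import Counter

def successor_counts(visits):
    # For one person's places in visit order, build, for each place,
    # a count of every place visited after it.
    return {place: Counter(visits[i + 1:]) for i, place in enumerate(visits)}

joe = successor_counts(["Washington", "London", "Paris", "Moscow"])
# Washington -> London:1, Paris:1, Moscow:1 and Paris -> Moscow:1, as above
```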


Structures like these from different people can then be combined.


Now suppose Bill visits London and Moscow.
So he generates
London: Moscow:1

This can be combined with the one from Joe saying London: Paris:1, Moscow:1
to give

London: Paris:1, Moscow:2

Now suppose Sue visits London, Riga and Paris.
So she generates
London: Paris:1, Riga:1

This can be combined with London: Paris:1, Moscow:2 to give

London: Paris:2, Moscow:2, Riga:1

Note that I can keep the places in alphabetical order in the result.
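The merging of per-person structures described above can likewise be sketched
in plain Python (an illustration only; the function name is mine):

```python
from collections import Counter

def combine(totals, person_counts):
    # Merge one person's place -> counts-of-later-places structure
    # into the running totals, word-count style.
    for place, counts in person_counts.items():
        totals.setdefault(place, Counter()).update(counts)
    return totals

totals = {}
combine(totals, {"London": Counter({"Paris": 1, "Moscow": 1})})  # from Joe
combine(totals, {"London": Counter({"Moscow": 1})})              # from Bill
combine(totals, {"London": Counter({"Paris": 1, "Riga": 1})})    # from Sue
# totals["London"] is now Paris:2, Moscow:2, Riga:1, as in the example
```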



On Tue, Jul 19, 2011 at 9:53 AM, Em <ma...@yahoo.de> wrote:

> Hi Steven,
>
> thanks for your response! For the ease of use we can make those
> assumptions you made - maybe this makes it much easier to help. Those
> little extras are something for after solving the "easy" version of the
> task. :)
>
> What do you mean with the following?
>
> > The second job takes Person : list of places and, for each place P in
> > the list, constructs
> > place : 1 | place after P : 1 | next place : 1 ...
>
> You mean something like this?
>
> Washington DC:1
> New York after Washington DC:1
> Miami after New York:1
>
> I do not see the benefit for the result I would like to get.
>
> The end result should be something like this:
> Washington DC => New York, Miami, Los Angeles
> New York => Chicago, Seattle, San Francisco
>
> The point is, that one can see that persons that visited Washington DC
> are likely to visit New York as the next place, Miami as the second and
> L.A. as the third.
> However, if I choose New York as my starting point, I can see that
> persons that start their journey in New York (and maybe weren't in DC
> before) are likely to visit Chicago, Seattle and San Francisco. Maybe
> Los Angeles comes at the 10th position.
>
> Regards,
> Em
>



-- 
Steven M. Lewis PhD
4221 105th Ave NE
Kirkland, WA 98033
206-384-1340 (cell)
Skype lordjoe_com

Re: How would you translate this into MapReduce?

Posted by Em <ma...@yahoo.de>.
Hi Steven,

thanks for your response! For ease of use we can make those
assumptions you made - maybe this makes it much easier to help. Those
little extras are something for after solving the "easy" version of the
task. :)

What do you mean with the following?

> The second job takes Person : list of places and, for each place P in
> the list, constructs
> place : 1 | place after P : 1 | next place : 1 ...

You mean something like this?

Washington DC:1
New York after Washington DC:1
Miami after New York:1

I do not see the benefit for the result I would like to get.

The end result should be something like this:
Washington DC => New York, Miami, Los Angeles
New York => Chicago, Seattle, San Francisco

The point is, that one can see that persons that visited Washington DC
are likely to visit New York as the next place, Miami as the second and
L.A. as the third.
However, if I choose New York as my starting point, I can see that
persons that start their journey in New York (and maybe weren't in DC
before) are likely to visit Chicago, Seattle and San Francisco. Maybe
Los Angeles comes at the 10th position.
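That end result is just the combined counts sorted per key; as a plain-Python
sketch (illustrative names and made-up counts, not part of the thread):

```python
from collections import Counter

def ranked_next_places(counts, top_n=3):
    # Turn place -> counts-of-next-places into place -> ranked list,
    # most frequently visited next place first.
    return {place: [p for p, _ in c.most_common(top_n)]
            for place, c in counts.items()}

counts = {"Washington DC": Counter({"New York": 5, "Miami": 3, "Los Angeles": 2})}
ranked = ranked_next_places(counts)
# {"Washington DC": ["New York", "Miami", "Los Angeles"]}
```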

Regards,
Em

Re: How would you translate this into MapReduce?

Posted by Steve Lewis <lo...@gmail.com>.
It is a little unclear what you start with and where you want to end up.

Let us assume that you have a collection of triplets of
person : place : time
we might imagine this information stored on a line of text.
It somewhat simplifies the problem to assume that the number of places
visited by one person is small enough to keep in memory.

Where you want to end up is something like this: for each place you have
Place: number of visitors | place1 : number | place2 : number ...
where place1, place2, ... are places someone visited after visiting Place, and
each number is the number of people who visited that place after visiting Place.

Once again it simplifies the problem if we assume that the number of places
(and the structure described above) can be kept in memory. It is not
necessary to assume that the number of visits or the number of people
cannot grow very large.

I would use two Hadoop jobs. The first passes the person : place : time
triple with the person as the key and lets the reducer construct a structure
with person : list of places in the order visited.
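Outside Hadoop, the effect of that first job can be sketched in plain Python
(illustrative only; in the real job the shuffle groups records by person and
the reducer sorts each person's records by time):

```python
from itertools import groupby
from operator import itemgetter

def places_in_visit_order(records):
    # records: (person, place, time) triples in arbitrary order.
    # Group by person (the "shuffle"), then order each group by time
    # (the in-reducer sort) and keep just the places.
    records = sorted(records, key=itemgetter(0, 2))
    return {person: [place for _, place, _ in group]
            for person, group in groupby(records, key=itemgetter(0))}

trips = [("Joe", "Paris", 3), ("Joe", "Washington", 1), ("Joe", "London", 2)]
ordered = places_in_visit_order(trips)
# {"Joe": ["Washington", "London", "Paris"]}
```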

The second job takes Person : list of places and, for each place P in the
list, constructs
place : 1 | place after P : 1 | next place : 1 ...
For every place in the list it makes and passes one of these.
Then you can write a simple routine to combine these structures - probably
write a combiner and follow the word-count model.
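The second job plus the word-count-style combine can be sketched the same way
(plain Python, names are mine; the per-place visitor count from "place : 1"
is omitted for brevity):

```python
from collections import Counter

def count_successors(person_to_places):
    # For each person's ordered place list, emit a (place, later_place)
    # pair for every place visited after it, and sum the pairs up,
    # exactly as a word-count combiner/reducer would.
    counts = Counter()
    for places in person_to_places.values():
        for i, place in enumerate(places):
            for later in places[i + 1:]:
                counts[(place, later)] += 1
    return counts

c = count_successors({"Joe": ["Washington", "London", "Paris"]})
# ("Washington", "London"), ("Washington", "Paris"), ("London", "Paris") each 1
```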


On Mon, Jul 18, 2011 at 10:03 AM, Em <ma...@yahoo.de> wrote:

> Hello list,
>
> as a newbie I got a tricky use-case in mind which I want to implement
> with Hadoop to train my skillz. There is no real scenario behind that,
> so I can extend or shrink the problem to the extent I like.
>
> I create random lists of person-IDs and places plus a time-value.
>
> The result of my map-reduce-operations should be something like that:
> The key is a place and the value is a list of places that were visited
> by persons after they visited the key-place.
> Additionally the value should be sorted in a way where I use some
> time/count-biased metric. This way the value list should reflect, e.g.,
> the place which was the most popular second station on a tour.
>
> I think this is a complex almost real-world-scenario.
>
> In pseudo-code it will be something like this:
> for every place p
>  for every person m that visited p
>    select list l of all the places that m visited after p
>    write a key-value-pair p=>l to disc and l is in order of the visits
>
> for every key k in the list of key-value-pairs
>   get the value list of places v for k -
>       create another key-value-pair pv where the key is the place and
> the value is its index in v (for a place p in v)
>
>       for every k
>          get all pv
>          for every pv aggregate the key-value-pairs by key and sum up
> the index i for every place p so that it becomes the kv-pair opv
>          sort opv in ascending order by its value
>
> The result would be what I wanted, no?
>
> It looks like I need multiple MR-phases, however I do not even know how
> to start.
>
> My first guess is: Create a MR-Job where I invert my list so that I got
> a place as the key and as value all persons that visited it.
> The next phase needs to iterate over the value's persons and join with
> the original data to get an idea of when this person visited this place
> and what places came next.
> And now the problems arise:
> - First: What happens to places that are so popular that the number of
> persons that visited them is so large that I cannot pass the whole
> KV-pair to a single node to iterate over it?
> - Second: I need to re-join the original data. Without a database this
> would be extremely slow, wouldn't it?
>
> I hope that you guys can give me some ideas and input to make my first
> serious steps in Hadoop-land.
>
> Regards,
> Em
>



-- 
Steven M. Lewis PhD
4221 105th Ave NE
Kirkland, WA 98033
206-384-1340 (cell)
Skype lordjoe_com