Posted to user@hadoop.apache.org by Michael Parker <mi...@gmail.com> on 2012/08/23 06:42:46 UTC

Side-loading output from one MR into another?

Hi all,

Is it possible to take a collection of sorted key-value pairs,
generated from one MapReduce, and side-load them into another
MapReduce, i.e. as it runs, the second MapReduce can look up the value
for a given key computed by the first MapReduce?

I need this for a cohort study -- one MR puts users into cohorts, and
the second MR needs that user-to-cohort mapping to see how cohorts
behave over time.

Any help would be greatly appreciated. Thanks!

- Mike

Re: Side-loading output from one MR into another?

Posted by Serge Blazhiyevskyy <Se...@nice.com>.
I have a map-side join example here:

http://askhadoop.blogspot.com/2011/12/map-side-join_27.html

It is a great way to load data into memory on multiple machines.


Regards,
Serge
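
For readers who cannot open the link, here is a minimal sketch of one
common form of map-side join: a single-pass merge of two inputs that are
already sorted by the same key. It is plain Java, not Hadoop API code, and
it is not taken from the linked post. The class name, method name, and
record layout are made up for illustration. As far as I know, Hadoop 1.x
offers this kind of join through CompositeInputFormat in
org.apache.hadoop.mapred.join, which also requires the inputs to be sorted
and partitioned identically.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration of a merge join over two key-sorted inputs.
final class MergeJoinSketch {

    // cohorts: {user, cohort} pairs sorted by user, one entry per user.
    // events:  {user, event} pairs sorted by user, possibly many per user.
    // Returns "user<TAB>cohort<TAB>event" for every event whose user has a cohort.
    static List<String> join(List<String[]> cohorts, List<String[]> events) {
        List<String> out = new ArrayList<>();
        int c = 0; // cursor into the cohort side; it only ever moves forward
        for (String[] e : events) {
            // Skip cohort entries whose key sorts before the current event key.
            while (c < cohorts.size() && cohorts.get(c)[0].compareTo(e[0]) < 0) {
                c++;
            }
            // Emit a joined record only when the keys match (inner join).
            if (c < cohorts.size() && cohorts.get(c)[0].equals(e[0])) {
                out.add(e[0] + "\t" + cohorts.get(c)[1] + "\t" + e[1]);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String[]> cohorts = Arrays.asList(
                new String[] {"alice", "2012-07"},
                new String[] {"carol", "2012-08"});
        List<String[]> events = Arrays.asList(
                new String[] {"alice", "login"},
                new String[] {"bob", "login"},
                new String[] {"carol", "purchase"});
        for (String row : join(cohorts, events)) {
            System.out.println(row);
        }
    }
}
```

Because both cursors only move forward, neither input has to fit in
memory. That is the main reason to prefer this over an in-memory lookup
table when the first job's output is large.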



On 8/23/12 3:57 PM, "Michael Parker" <mi...@gmail.com> wrote:

>Actually, I was able to do some tricks and reduce the size to
>something that can be held in memory.
>
>Nonetheless, if anyone has an example of or more information about a
>map-side join, I would love to see it.
>
>Thanks!
>
>- Mike
>
>
>On Wed, Aug 22, 2012 at 11:57 PM, Michael Parker
><mi...@gmail.com> wrote:
>> Thanks for the prompt reply!
>>
>> Unfortunately, it's not that small.
>>
>> I'm using the new API; are map side joins accomplished using
>> http://hadoop.apache.org/common/docs/r1.0.3/api/org/apache/hadoop/contrib/utils/join/package-summary.html?
>> Are there any examples which use this package or map side joins?
>>
>> The way I was thinking of doing it was to output the user-to-cohort
>> mapping from the first MR as a SequenceFile, and then each mapper in
>> the second MR could use a SequenceFile.Reader to find the cohort for a
>> user. It seems reasonable, but is this actually doable? It's like a
>> manual map-side join, I suppose, although likely not as elegant as
>> what you were proposing.
>>
>> Thanks,
>> Mike
>>
>> On Wed, Aug 22, 2012 at 10:27 PM, Harsh J <ha...@cloudera.com> wrote:
>>> If it is a small set, you can load it onto distributed cache and then
>>> onto the task's memory, or if it's pretty big, perhaps you can do a
>>> map-side join?
>>>
>>> On Thu, Aug 23, 2012 at 10:12 AM, Michael Parker
>>> <mi...@gmail.com> wrote:
>>>> Hi all,
>>>>
>>>> Is it possible to take a collection of sorted key-value pairs,
>>>> generated from one MapReduce, and side-load them into another
>>>> MapReduce, i.e. as it runs, the second MapReduce can look up the value
>>>> for a given key computed by the first MapReduce?
>>>>
>>>> I need this for a cohort study -- one MR puts users into cohorts, and
>>>> the second MR needs that user-to-cohort mapping to see how cohorts
>>>> behave over time.
>>>>
>>>> Any help would be greatly appreciated. Thanks!
>>>>
>>>> - Mike
>>>
>>>
>>>
>>> --
>>> Harsh J


Re: Side-loading output from one MR into another?

Posted by Michael Parker <mi...@gmail.com>.
Actually, I was able to do some tricks and reduce the size to
something that can be held in memory.

Nonetheless, if anyone has an example of or more information about a
map-side join, I would love to see it.

Thanks!

- Mike


On Wed, Aug 22, 2012 at 11:57 PM, Michael Parker
<mi...@gmail.com> wrote:
> Thanks for the prompt reply!
>
> Unfortunately, it's not that small.
>
> I'm using the new API; are map side joins accomplished using
> http://hadoop.apache.org/common/docs/r1.0.3/api/org/apache/hadoop/contrib/utils/join/package-summary.html?
> Are there any examples which use this package or map side joins?
>
> The way I was thinking of doing it was to output the user-to-cohort
> mapping from the first MR as a SequenceFile, and then each mapper in
> the second MR could use a SequenceFile.Reader to find the cohort for a
> user. It seems reasonable, but is this actually doable? It's like a
> manual map-side join, I suppose, although likely not as elegant as
> what you were proposing.
>
> Thanks,
> Mike
>
> On Wed, Aug 22, 2012 at 10:27 PM, Harsh J <ha...@cloudera.com> wrote:
>> If it is a small set, you can load it onto distributed cache and then
>> onto the task's memory, or if it's pretty big, perhaps you can do a
>> map-side join?
>>
>> On Thu, Aug 23, 2012 at 10:12 AM, Michael Parker
>> <mi...@gmail.com> wrote:
>>> Hi all,
>>>
>>> Is it possible to take a collection of sorted key-value pairs,
>>> generated from one MapReduce, and side-load them into another
>>> MapReduce, i.e. as it runs, the second MapReduce can look up the value
>>> for a given key computed by the first MapReduce?
>>>
>>> I need this for a cohort study -- one MR puts users into cohorts, and
>>> the second MR needs that user-to-cohort mapping to see how cohorts
>>> behave over time.
>>>
>>> Any help would be greatly appreciated. Thanks!
>>>
>>> - Mike
>>
>>
>>
>> --
>> Harsh J

Re: Side-loading output from one MR into another?

Posted by Michael Parker <mi...@gmail.com>.
Thanks for the prompt reply!

Unfortunately, it's not that small.

I'm using the new API; are map side joins accomplished using
http://hadoop.apache.org/common/docs/r1.0.3/api/org/apache/hadoop/contrib/utils/join/package-summary.html?
Are there any examples which use this package or map side joins?

The way I was thinking of doing it was to output the user-to-cohort
mapping from the first MR as a SequenceFile, and then each mapper in
the second MR could use a SequenceFile.Reader to find the cohort for a
user. It seems reasonable, but is this actually doable? It's like a
manual map-side join, I suppose, although likely not as elegant as
what you were proposing.
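
A rough sketch of the lookup idea described above, assuming the first
job's output is sorted by user. It is plain Java rather than Hadoop API
code, and the class and method names are made up for illustration. A plain
SequenceFile.Reader reads records sequentially, so keyed lookups would
probably need something like Hadoop's MapFile (a sorted SequenceFile plus
an index). The binary search below stands in for that kind of indexed
lookup.

```java
import java.util.Arrays;

// Hypothetical stand-in for a sorted, indexed key-value file.
final class SortedCohortFile {
    private final String[] users;   // keys, sorted ascending
    private final String[] cohorts; // cohorts[i] belongs to users[i]

    SortedCohortFile(String[] sortedUsers, String[] cohorts) {
        if (sortedUsers.length != cohorts.length) {
            throw new IllegalArgumentException("keys and values must have the same length");
        }
        this.users = sortedUsers;
        this.cohorts = cohorts;
    }

    // Returns the cohort for the user, or null if the user is not present.
    // Relies on the keys being sorted, which is what makes O(log n) lookup possible.
    String find(String user) {
        int i = Arrays.binarySearch(users, user);
        return i >= 0 ? cohorts[i] : null;
    }

    public static void main(String[] args) {
        SortedCohortFile file = new SortedCohortFile(
                new String[] {"alice", "bob", "carol"},
                new String[] {"2012-07", "2012-07", "2012-08"});
        System.out.println(file.find("bob"));
        System.out.println(file.find("dave"));
    }
}
```

Each mapper would open the file once and call the lookup per record. The
cost is one lookup per input record, so this tends to suit cases where the
table is too big for memory but lookups are still cheap relative to the
rest of the map work.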

Thanks,
Mike

On Wed, Aug 22, 2012 at 10:27 PM, Harsh J <ha...@cloudera.com> wrote:
> If it is a small set, you can load it onto distributed cache and then
> onto the task's memory, or if it's pretty big, perhaps you can do a
> map-side join?
>
> On Thu, Aug 23, 2012 at 10:12 AM, Michael Parker
> <mi...@gmail.com> wrote:
>> Hi all,
>>
>> Is it possible to take a collection of sorted key-value pairs,
>> generated from one MapReduce, and side-load them into another
>> MapReduce, i.e. as it runs, the second MapReduce can look up the value
>> for a given key computed by the first MapReduce?
>>
>> I need this for a cohort study -- one MR puts users into cohorts, and
>> the second MR needs that user-to-cohort mapping to see how cohorts
>> behave over time.
>>
>> Any help would be greatly appreciated. Thanks!
>>
>> - Mike
>
>
>
> --
> Harsh J

Re: Side-loading output from one MR into another?

Posted by Harsh J <ha...@cloudera.com>.
If it is a small set, you can load it onto distributed cache and then
onto the task's memory, or if it's pretty big, perhaps you can do a
map-side join?
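
A minimal sketch of the small-set option, assuming the first job wrote
"user<TAB>cohort" text lines. It is plain Java, not Hadoop API code, and
the class and method names are hypothetical. In a real job the load step
would run once per task (for example in the mapper's setup method) against
the file distributed through the cache, and the tagging step would sit
inside map().

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration of an in-memory side table built once per task.
final class CohortSideTable {

    // Parses "user<TAB>cohort" lines into a map; lines without a tab are skipped.
    static Map<String, String> load(List<String> lines) {
        Map<String, String> table = new HashMap<>();
        for (String line : lines) {
            int tab = line.indexOf('\t');
            if (tab < 0) {
                continue; // malformed line, ignore it
            }
            table.put(line.substring(0, tab), line.substring(tab + 1));
        }
        return table;
    }

    // Stand-in for the per-record map() call: prefix the event with the user's cohort.
    static String tagEvent(Map<String, String> table, String user, String event) {
        String cohort = table.get(user);
        return (cohort == null ? "unknown" : cohort) + "\t" + event;
    }

    public static void main(String[] args) {
        Map<String, String> table = load(Arrays.asList("alice\t2012-07", "bob\t2012-08"));
        System.out.println(tagEvent(table, "alice", "login"));
        System.out.println(tagEvent(table, "carol", "login"));
    }
}
```

Every task holds its own full copy of the table, so this only works while
the whole mapping fits comfortably in each task's heap.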

On Thu, Aug 23, 2012 at 10:12 AM, Michael Parker
<mi...@gmail.com> wrote:
> Hi all,
>
> Is it possible to take a collection of sorted key-value pairs,
> generated from one MapReduce, and side-load them into another
> MapReduce, i.e. as it runs, the second MapReduce can look up the value
> for a given key computed by the first MapReduce?
>
> I need this for a cohort study -- one MR puts users into cohorts, and
> the second MR needs that user-to-cohort mapping to see how cohorts
> behave over time.
>
> Any help would be greatly appreciated. Thanks!
>
> - Mike



-- 
Harsh J
