Posted to mapreduce-user@hadoop.apache.org by jamal sasha <ja...@gmail.com> on 2013/01/08 00:21:23 UTC

Binary Search in map reduce

Hi,
 I have data in json format like:

{key:[values.....]}
The keys and values are long ints.
Now, I want to do a fast lookup of a key.
How would I implement a binary search in the map reduce abstraction?

Or am I not thinking about this correctly?
Any suggestions/advice?
Thanks

Re: Binary Search in map reduce

Posted by jamal sasha <ja...@gmail.com>.
:) Great. Thanks for the input. :)


On Mon, Jan 7, 2013 at 3:31 PM, Pamecha, Abhishek <ap...@ebay.com> wrote:

>  You will incur the cost of map reduce across all nodes in your cluster
> anyway. I am not sure you will get enough of a speed advantage.
>
> HBase may help you get close to what you are looking for, but that won't be
> map reduce.
>
> Thanks,
> Abhishek
>
> From: jamal sasha [mailto:jamalshasha@gmail.com]
> Sent: Monday, January 07, 2013 3:21 PM
> To: user@hadoop.apache.org
> Subject: Binary Search in map reduce
>
> Hi,
>  I have data in json format like:
>
> {key:[values.....]}
> The keys and values are long ints.
> Now, I want to do a fast lookup of a key.
> How would I implement a binary search in the map reduce abstraction?
>
> Or am I not thinking about this correctly?
> Any suggestions/advice?
> Thanks

RE: Binary Search in map reduce

Posted by "Pamecha, Abhishek" <ap...@ebay.com>.
You will incur the cost of map reduce across all nodes in your cluster anyway. I am not sure you will get enough of a speed advantage.
HBase may help you get close to what you are looking for, but that won't be map reduce.
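
For example, a single-key read against an HBase table would look roughly like this (only a sketch; the "graph" table, the "v" column family and the "values" qualifier are made-up names, and how you lay out the value list is up to you):

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class KeyLookup {
  public static void main(String[] args) throws Exception {
    // Hypothetical table "graph", keyed by the long key, with the value list
    // stored under column family "v", qualifier "values".
    HTable table = new HTable(HBaseConfiguration.create(), "graph");
    long key = Long.parseLong(args[0]);
    // A Get is an indexed point read served by the region server;
    // no MapReduce job and no explicit binary search is needed.
    Result result = table.get(new Get(Bytes.toBytes(key)));
    byte[] values = result.getValue(Bytes.toBytes("v"), Bytes.toBytes("values"));
    System.out.println(values == null ? "key not found" : Bytes.toString(values));
    table.close();
  }
}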

Thanks,
Abhishek

From: jamal sasha [mailto:jamalshasha@gmail.com]
Sent: Monday, January 07, 2013 3:21 PM
To: user@hadoop.apache.org
Subject: Binary Search in map reduce

Hi,
 I have data in json format like:

{key:[values.....]}
The keys and values are long ints.
Now, I want to do a fast lookup of a key.
How would I implement a binary search in the map reduce abstraction?

Or am I not thinking about this correctly?
Any suggestions/advice?
Thanks

Re: Binary Search in map reduce

Posted by John Lilley <jo...@redpoint.net>.
Oh, that is very clever; you just need to make pseudo keys so that the graph record orders before the corresponding change records.
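
Concretely, the pseudo key could be a composite of (graph key, tag), with tag 0 for the graph record and tag 1 for the change records; the class below is only a sketch of that idea, not code from this thread:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.WritableComparable;

// Pseudo key (graph key, tag): sorts by key first, then by tag, so within each
// key the graph record (tag 0) reaches the reducer before its change records (tag 1).
public class TaggedKey implements WritableComparable<TaggedKey> {
  public long key;
  public byte tag; // 0 = graph record, 1 = change record

  public TaggedKey() {}
  public TaggedKey(long key, byte tag) { this.key = key; this.tag = tag; }

  public void write(DataOutput out) throws IOException { out.writeLong(key); out.writeByte(tag); }
  public void readFields(DataInput in) throws IOException { key = in.readLong(); tag = in.readByte(); }

  public int compareTo(TaggedKey o) {
    if (key != o.key) return key < o.key ? -1 : 1;
    return tag - o.tag;
  }

  // Hash on the key alone so both tags of one key land on the same reducer.
  @Override public int hashCode() { return (int) (key ^ (key >>> 32)); }
  @Override public boolean equals(Object o) {
    return o instanceof TaggedKey && ((TaggedKey) o).key == key && ((TaggedKey) o).tag == tag;
  }
}

Group on the key field alone (a grouping comparator that ignores the tag) and each reduce call then sees the graph record first, followed by its change records.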

--John Lilley

Mahesh Balija <ba...@gmail.com> wrote:



Hi Jamal,

       Another simple approach, if your data is too huge to fit into memory, would be just to use the MultipleInputs mechanism.
       Your MR job will have two mappers, one emitting the records from "the graph" file and the other from the changes file.

       Anyhow, your reducer will aggregate the records based on the same key (the graph key and the changes key).
       In order to know which record was emitted from which file, you can use the graph key as the key for both the mappers, but use a MapWritable as your mapper value, where the key in the MapWritable is some constant, say 1 -> the graph and 2 -> changes, and the value is the actual value.

       Now the only thing left for you is to append your changes to the actual key and emit the final result.

Best,
Mahesh Balija,
Calsoft Labs.

On Tue, Jan 8, 2013 at 5:47 AM, jamal sasha <ja...@gmail.com> wrote:
awesome.
thanks


On Mon, Jan 7, 2013 at 4:11 PM, John Lilley <jo...@redpoint.net> wrote:
Let’s call these “the graph” and “the changes”.

Will both the graph and the changes fit into memory?
Yes -> You do not have a Hadoop-scale problem.  Just write some code using HashTable or Dictionary.

Will the graph fit into memory once it is partitioned amongst all of the nodes?
Yes -> You can get away without a join.  Partition the graph and the changes like below, but instead of doing a join on each partition, stream the changes against the graph partition in memory, using a HashTable for the graph partition.

Otherwise, you can do this in a few steps.  Realize that you are doing a parallel join.  A parallel join can be done in hadoop by a simple modulo of the keys of the graph and the changes.  So first, create a couple of MR jobs just to partition “the graph” and “the changes” into N buckets using (key%N).  I *think* this is pretty straightforward because if your mapper adds new_key=(key%N) to the tuple and you use N reducers you get this behavior automatically (is it really that simple? someone with more MR expertise please correct me…).   Once the graph and the changes are partitioned, run another MR job to (1) join each graph partition file to the corresponding changes partition file (2) process the changes into the graph (3) write out the resulting graph.  This part is not a parallel join; it is a bunch of independent simple joins.  Finally, merge the resulting graphs together.

You may find that it isn’t even this easy.  If nothing fits into memory and you must perform a non-trivial graph traversal for each change record, you have something much harder to do.

FYI top google results for joins in Hadoop here: https://www.google.com/search?q=joins+in+hadoop&aq=f&oq=joins+in+hadoop&aqs=chrome.0.57j60l2j0l2j62.670&sugexp=chrome,mod=14&sourceid=chrome&ie=UTF-8

john

From: jamal sasha [mailto:jamalshasha@gmail.com]
Sent: Monday, January 07, 2013 4:43 PM
To: user@hadoop.apache.org
Subject: Re: Binary Search in map reduce

Hi
 Thanks for the reply. So here is the intent.
I process some data and output of that processing is this set of json documents outputting {key:[values]}  (This is essentially a form of graph where each entry is an edge)
Now.. I process a different set of data and the idea is to modify the existing document based on this new data.
If the key is present then add/modify values.
Else... create new key:[values] json object and save.

So, the first step is checking whether the key is present or not.
So that's why I thought of doing the binary search.
Any suggestions?

Re: Binary Search in map reduce

Posted by Mahesh Balija <ba...@gmail.com>.
Hi Jamal,

       Another simple approach, if your data is too huge to fit into memory,
would be just to use the MultipleInputs mechanism.
       Your MR job will have two mappers, one emitting the records from "the
graph" file and the other from the changes file.

       Anyhow, your reducer will aggregate the records based on the same
key (the graph key and the changes key).
       In order to know which record was emitted from which file, you can use
the graph key as the key for both the mappers, but use a MapWritable as your
mapper value, where the key in the MapWritable is some constant, say
1 -> the graph and 2 -> changes, and the value is the actual value.

       Now the only thing left for you is to append your changes to the
actual key and emit the final result.
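
A rough sketch of that wiring (only an illustration; GraphMapper, ChangesMapper and MergeReducer are placeholder names, the JSON parsing is naive, and the actual merging of values is left as a stub):

import java.io.IOException;
import java.util.Map;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.MapWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.MultipleInputs;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MergeChanges {

  private static final IntWritable GRAPH = new IntWritable(1);
  private static final IntWritable CHANGES = new IntWritable(2);

  // Naive key extraction for a line like {123:[4,5,6]}; real code would use a JSON library.
  static long extractKey(String json) {
    return Long.parseLong(json.substring(json.indexOf('{') + 1, json.indexOf(':')).trim());
  }

  // Mapper for "the graph" file: tag each record with 1 so the reducer knows its source.
  public static class GraphMapper extends Mapper<LongWritable, Text, LongWritable, MapWritable> {
    @Override
    protected void map(LongWritable offset, Text line, Context ctx)
        throws IOException, InterruptedException {
      MapWritable tagged = new MapWritable();
      tagged.put(GRAPH, new Text(line));
      ctx.write(new LongWritable(extractKey(line.toString())), tagged);
    }
  }

  // Mapper for the changes file: same key, but tagged with 2.
  public static class ChangesMapper extends Mapper<LongWritable, Text, LongWritable, MapWritable> {
    @Override
    protected void map(LongWritable offset, Text line, Context ctx)
        throws IOException, InterruptedException {
      MapWritable tagged = new MapWritable();
      tagged.put(CHANGES, new Text(line));
      ctx.write(new LongWritable(extractKey(line.toString())), tagged);
    }
  }

  // All graph and change records for one key arrive in the same reduce call;
  // the real merge of values would happen here (this sketch just concatenates them).
  public static class MergeReducer extends Reducer<LongWritable, MapWritable, LongWritable, Text> {
    @Override
    protected void reduce(LongWritable key, Iterable<MapWritable> values, Context ctx)
        throws IOException, InterruptedException {
      StringBuilder merged = new StringBuilder();
      for (MapWritable tagged : values) {
        for (Map.Entry<Writable, Writable> e : tagged.entrySet()) {
          merged.append(e.getValue()).append(' ');
        }
      }
      ctx.write(key, new Text(merged.toString().trim()));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = new Job(new Configuration(), "merge changes into graph");
    job.setJarByClass(MergeChanges.class);
    // On some Hadoop releases the old-API org.apache.hadoop.mapred.lib.MultipleInputs is needed instead.
    MultipleInputs.addInputPath(job, new Path(args[0]), TextInputFormat.class, GraphMapper.class);
    MultipleInputs.addInputPath(job, new Path(args[1]), TextInputFormat.class, ChangesMapper.class);
    job.setReducerClass(MergeReducer.class);
    job.setMapOutputKeyClass(LongWritable.class);
    job.setMapOutputValueClass(MapWritable.class);
    job.setOutputKeyClass(LongWritable.class);
    job.setOutputValueClass(Text.class);
    FileOutputFormat.setOutputPath(job, new Path(args[2]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}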

Best,
Mahesh Balija,
Calsoft Labs.

On Tue, Jan 8, 2013 at 5:47 AM, jamal sasha <ja...@gmail.com> wrote:

> awesome.
> thanks
>
>
> On Mon, Jan 7, 2013 at 4:11 PM, John Lilley <jo...@redpoint.net> wrote:
>
>>  Let’s call these “the graph” and “the changes”.
>>
>> Will both the graph and the changes fit into memory?
>> Yes -> You do not have a Hadoop-scale problem.  Just write some code
>> using HashTable or Dictionary.
>>
>> Will the graph fit into memory once it is partitioned amongst all of the
>> nodes?
>> Yes -> You can get away without a join.  Partition the graph and the
>> changes like below, but instead of doing a join on each partition, stream
>> the changes against the graph partition in memory, using a HashTable for
>> the graph partition.
>>
>> Otherwise, you can do this in a few steps.  Realize that you are doing a
>> parallel join.  A parallel join can be done in hadoop by a simple modulo of
>> the keys of the graph and the changes.  So first, create a couple of MR
>> jobs just to partition “the graph” and “the changes” into N buckets using
>> (key%N).  I *think* this is pretty straightforward because if your
>> mapper adds new_key=(key%N) to the tuple and you use N reducers you get
>> this behavior automatically (is it really that simple? someone with more MR
>> expertise please correct me…).   Once the graph and the changes are
>> partitioned, run another MR job to (1) join each graph partition file to
>> the corresponding changes partition file (2) process the changes into the
>> graph (3) write out the resulting graph.  This part is not a parallel join;
>> it is a bunch of independent simple joins.  Finally, merge the resulting
>> graphs together.
>>
>> You may find that it isn’t even this easy.  If nothing fits into memory
>> and you must perform a non-trivial graph traversal for each change record,
>> you have something much harder to do.
>>
>> FYI top google results for joins in Hadoop here:
>> https://www.google.com/search?q=joins+in+hadoop&aq=f&oq=joins+in+hadoop&aqs=chrome.0.57j60l2j0l2j62.670&sugexp=chrome,mod=14&sourceid=chrome&ie=UTF-8
>>
>> john
>>
>> From: jamal sasha [mailto:jamalshasha@gmail.com]
>> Sent: Monday, January 07, 2013 4:43 PM
>> To: user@hadoop.apache.org
>> Subject: Re: Binary Search in map reduce
>>
>> Hi
>>  Thanks for the reply. So here is the intent.
>> I process some data and output of that processing is this set of json
>> documents outputting {key:[values]}  (This is essentially a form of graph
>> where each entry is an edge)
>> Now.. I process a different set of data and the idea is to modify the
>> existing document based on this new data.
>> If the key is present then add/modify values.
>> Else... create new key:[values] json object and save.
>>
>> So, the first step is checking whether the key is present or not.
>> So that's why I thought of doing the binary search.
>> Any suggestions?

Re: Binary Search in map reduce

Posted by jamal sasha <ja...@gmail.com>.
awesome.
thanks


On Mon, Jan 7, 2013 at 4:11 PM, John Lilley <jo...@redpoint.net> wrote:

>  Let’s call these “the graph” and “the changes”.
>
> Will both the graph and the changes fit into memory?
> Yes -> You do not have a Hadoop-scale problem.  Just write some code using
> HashTable or Dictionary.
>
> Will the graph fit into memory once it is partitioned amongst all of the
> nodes?
> Yes -> You can get away without a join.  Partition the graph and the
> changes like below, but instead of doing a join on each partition, stream
> the changes against the graph partition in memory, using a HashTable for
> the graph partition.
>
> Otherwise, you can do this in a few steps.  Realize that you are doing a
> parallel join.  A parallel join can be done in hadoop by a simple modulo of
> the keys of the graph and the changes.  So first, create a couple of MR
> jobs just to partition “the graph” and “the changes” into N buckets using
> (key%N).  I *think* this is pretty straightforward because if your
> mapper adds new_key=(key%N) to the tuple and you use N reducers you get
> this behavior automatically (is it really that simple? someone with more MR
> expertise please correct me…).   Once the graph and the changes are
> partitioned, run another MR job to (1) join each graph partition file to
> the corresponding changes partition file (2) process the changes into the
> graph (3) write out the resulting graph.  This part is not a parallel join;
> it is a bunch of independent simple joins.  Finally, merge the resulting
> graphs together.
>
> You may find that it isn’t even this easy.  If nothing fits into memory
> and you must perform a non-trivial graph traversal for each change record,
> you have something much harder to do.
>
> FYI top google results for joins in Hadoop here:
> https://www.google.com/search?q=joins+in+hadoop&aq=f&oq=joins+in+hadoop&aqs=chrome.0.57j60l2j0l2j62.670&sugexp=chrome,mod=14&sourceid=chrome&ie=UTF-8
>
> john
>
> From: jamal sasha [mailto:jamalshasha@gmail.com]
> Sent: Monday, January 07, 2013 4:43 PM
> To: user@hadoop.apache.org
> Subject: Re: Binary Search in map reduce
>
> Hi
>  Thanks for the reply. So here is the intent.
> I process some data and output of that processing is this set of json
> documents outputting {key:[values]}  (This is essentially a form of graph
> where each entry is an edge)
> Now.. I process a different set of data and the idea is to modify the
> existing document based on this new data.
> If the key is present then add/modify values.
> Else... create new key:[values] json object and save.
>
> So, the first step is checking whether the key is present or not.
> So that's why I thought of doing the binary search.
> Any suggestions?

RE: Binary Search in map reduce

Posted by John Lilley <jo...@redpoint.net>.
Let's call these "the graph" and "the changes".

Will both the graph and the changes fit into memory?
Yes -> You do not have a Hadoop-scale problem.  Just write some code using HashTable or Dictionary.
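
If it all fits, the whole thing is a few lines of plain Java (a sketch only, with the JSON parsing and file I/O left out):

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class InMemoryMerge {
  public static void main(String[] args) {
    // key -> list of values, loaded from the json documents (parsing omitted).
    Map<Long, List<Long>> graph = new HashMap<Long, List<Long>>();

    // One change record, e.g. add value 42 under key 7 (in practice read from the changes file).
    long key = 7L;
    long newValue = 42L;

    List<Long> values = graph.get(key);
    if (values == null) {              // key absent: create a new entry
      values = new ArrayList<Long>();
      graph.put(key, values);
    }
    values.add(newValue);              // key present (or just created): modify its values

    System.out.println(key + " -> " + graph.get(key));
  }
}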

Will the graph fit into memory once it is partitioned amongst all of the nodes?
Yes -> You can get away without a join.  Partition the graph and the changes like below, but instead of doing a join on each partition, stream the changes against the graph partition in memory, using a HashTable for the graph partition.

Otherwise, you can do this in a few steps.  Realize that you are doing a parallel join.  A parallel join can be done in hadoop by a simple modulo of the keys of the graph and the changes.  So first, create a couple of MR jobs just to partition "the graph" and "the changes" into N buckets using (key%N).  I *think* this is pretty straightforward because if your mapper adds new_key=(key%N) to the tuple and you use N reducers you get this behavior automatically (is it really that simple? someone with more MR expertise please correct me...).   Once the graph and the changes are partitioned, run another MR job to (1) join each graph partition file to the corresponding changes partition file (2) process the changes into the graph (3) write out the resulting graph.  This part is not a parallel join; it is a bunch of independent simple joins.  Finally, merge the resulting graphs together.
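
A rough sketch of the bucketing mapper (untested; extractKey() and the "buckets" setting are placeholders for whatever parsing and configuration you already have):

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Bucketing mapper: emit (key % N, original record).  With N reducers and the
// default hash partitioner, bucket i ends up in partition file i, so after
// running this once over the graph and once over the changes, bucket i of the
// graph lines up with bucket i of the changes for the per-bucket join.
public class BucketMapper extends Mapper<LongWritable, Text, LongWritable, Text> {

  private int n;

  @Override
  protected void setup(Context ctx) {
    // "buckets" is a made-up configuration name; the driver would set it and
    // also call job.setNumReduceTasks(n).
    n = ctx.getConfiguration().getInt("buckets", 16);
  }

  @Override
  protected void map(LongWritable offset, Text record, Context ctx)
      throws IOException, InterruptedException {
    long key = extractKey(record.toString());
    ctx.write(new LongWritable(key % n), record);
  }

  // Placeholder parser for a record like {123:[4,5,6]}.
  private static long extractKey(String json) {
    return Long.parseLong(json.substring(json.indexOf('{') + 1, json.indexOf(':')).trim());
  }
}

The default identity Reducer then writes each bucket to its own part file, so no custom reduce code is needed for the partitioning step.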

You may find that it isn't even this easy.  If nothing fits into memory and you must perform a non-trivial graph traversal for each change record, you have something much harder to do.

FYI top google results for joins in Hadoop here: https://www.google.com/search?q=joins+in+hadoop&aq=f&oq=joins+in+hadoop&aqs=chrome.0.57j60l2j0l2j62.670&sugexp=chrome,mod=14&sourceid=chrome&ie=UTF-8

john

From: jamal sasha [mailto:jamalshasha@gmail.com]
Sent: Monday, January 07, 2013 4:43 PM
To: user@hadoop.apache.org
Subject: Re: Binary Search in map reduce

Hi
 Thanks for the reply. So here is the intent.
I process some data and output of that processing is this set of json documents outputting {key:[values]}  (This is essentially a form of graph where each entry is an edge)
Now.. I process a different set of data and the idea is to modify the existing document based on this new data.
If the key is present then add/modify values.
Else... create new key:[values] json object and save.

So, the first step is checking whether the key is present or not.
So that's why I thought of doing the binary search.
Any suggestions?



RE: Binary Search in map reduce

Posted by John Lilley <jo...@redpoint.net>.
Let's call these "the graph" and "the changes".

Will both the graph and the changes fit into memory?
Yes -> You do not have a Hadoop-scale problem.  Just write some code using HashTable or Dictionary.

Will the graph fit into memory once it is partitioned amongst all of the nodes?
Yes -> You can get away without a join.  Partition the graph and the changes like below, but instead of doing a join on each partition, stream the changes against the graph partition in memory, using a HashTable for the graph partition.

Otherwise, you can do this in a few steps.  Realize that you are doing a parallel join.  A parallel join can be done in hadoop by a simple modulo of the keys of the graph and the changes.  So first, create a couple of MR jobs just to partition "the graph" and "the changes" into N buckets using (key%N).  I *think* this is pretty straightforward because if your mapper adds new_key=(key%N) to the tuple and you use N reducers you get this behavior automatically (is it really that simple? someone with more MR expertise please correct me...).   Once the graph and the changes are partitioned, run another MR job to (1) join each graph partition file to the corresponding changes partition file (2) process the changes into the graph (3) write out the resulting graph.  This part is not a parallel join; it is a bunch of independent simple joins.  Finally, merge the resulting graphs together.

You may find that it isn't even this easy.  If nothing fits into memory and you must perform a non-trivial graph traversal for each change record, you have something must harder to do.

FYI top google results for joins in Hadoop here: https://www.google.com/search?q=joins+in+hadoop&aq=f&oq=joins+in+hadoop&aqs=chrome.0.57j60l2j0l2j62.670&sugexp=chrome,mod=14&sourceid=chrome&ie=UTF-8

john

From: jamal sasha [mailto:jamalshasha@gmail.com]
Sent: Monday, January 07, 2013 4:43 PM
To: user@hadoop.apache.org
Subject: Re: Binary Search in map reduce

Hi
 Thanks for the reply. So here is the intent.
I process some data and output of that processing is this set of json documents outputting {key:[values]}  (This is essentially a form of graph where each entry is an edge)
Now.. I process a different set of data and the idea is to modify the existing document based on this new data.
If the key is present then add/modify values.
Else... create new key:[values] json object and save.

So, the first step is checking whether the key is present or not..
So thats why I thought of doing the binary search.
Any suggestions?



Re: Binary Search in map reduce

Posted by jamal sasha <ja...@gmail.com>.
Hi
 Thanks for the reply. So here is the intent.
I process some data and output of that processing is this set of json
documents outputting {key:[values]}  (This is essentially a form of graph
where each entry is an edge)
Now, I process a different set of data and the idea is to modify the
existing document based on this new data.
If the key is present then add/modify values.
Else... create new key:[values] json object and save.
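
(In code terms the per-key rule is just an upsert, roughly like this -- the names are only illustrative:)

import java.util.*;

static void applyChange(Map<Long, Set<Long>> graph, long key, Collection<Long> newValues) {
  Set<Long> existing = graph.get(key);
  if (existing != null) {
    existing.addAll(newValues);                    // key present -> add/modify values
  } else {
    graph.put(key, new HashSet<Long>(newValues));  // key absent -> new key:[values]
  }
}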

So, the first step is checking whether the key is present or not.
So that's why I thought of doing the binary search.
Any suggestions?


On Mon, Jan 7, 2013 at 3:35 PM, John Lilley <jo...@redpoint.net> wrote:

>  It depends.  What data is going into the table, and what keys will drive
> the lookup?
>
> Let's suppose that you have a single JSON file that has some reasonable
> number of key/value tuples.  You could easily load a Hashtable to associate
> the integer keys with the values (which appear to be lists of integers).
> Each task in your MapReduce could process each input tuple, doing a lookup
> by key and appending values to the output records, and that is a perfectly
> fine thing to do in MapReduce.  In this model, the JSON file is effectively
> a constant singleton table for the entire MapReduce job.  You can just load
> it from HDFS or any file system.  Specifying it as a cached file may
> improve performance somewhat.
>
> If you explain your intent we might be able to help better.
>
> john
>
> From: jamal sasha [mailto:jamalshasha@gmail.com]
> Sent: Monday, January 07, 2013 4:21 PM
> To: user@hadoop.apache.org
> Subject: Binary Search in map reduce
>
> Hi,
>  I have data in json format like:
>
> {key:[values.....]}
> key, values are longints.
> Now, I want to do a fast lookup of a key.
> How would I implement a binary search in map reduce abstraction.
>
> Or am i not thinking about this correctly?
> Any suggestions/advices?
> Thanks

RE: Binary Search in map reduce

Posted by John Lilley <jo...@redpoint.net>.
It depends.  What data is going into the table, and what keys will drive the lookup?

Let's suppose that you have a single JSON file that has some reasonable number of key/value tuples.  You could easily load a Hashtable to associate the integer keys with the values (which appear to be lists of integers).  Each task in your MapReduce could process each input tuple, doing a lookup by key and appending values to the output records, and that is a perfectly fine thing to do in MapReduce.  In this model, the JSON file is effectively a constant singleton table for the entire MapReduce job.  You can just load it from HDFS or any file system.  Specifying it as a cached file may improve performance somewhat.
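
(A minimal sketch of that pattern, assuming one {key:[values]} object per line in the side file and tab-separated input records with the key in the first field; the HDFS path, class name and string-slicing "parser" are placeholders -- a real JSON library would replace the latter.)

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Loads the whole key->values table once per task in setup(), so every record
// is a constant-time HashMap lookup rather than any kind of search.
public class LookupMapper extends Mapper<LongWritable, Text, LongWritable, Text> {

  private final Map<Long, String> table = new HashMap<Long, String>();

  @Override
  protected void setup(Context context) throws IOException {
    FileSystem fs = FileSystem.get(context.getConfiguration());
    BufferedReader in = new BufferedReader(new InputStreamReader(
        fs.open(new Path("/lookup/graph.json"))));           // placeholder path
    String line;
    while ((line = in.readLine()) != null) {
      int colon = line.indexOf(':');                          // crude parse of {key:[...]}
      long key = Long.parseLong(line.substring(1, colon).replace("\"", "").trim());
      table.put(key, line.substring(colon + 1, line.lastIndexOf('}')).trim());
    }
    in.close();
  }

  @Override
  protected void map(LongWritable offset, Text record, Context context)
      throws IOException, InterruptedException {
    long key = Long.parseLong(record.toString().split("\t")[0]);
    String values = table.get(key);                           // the fast lookup
    if (values != null) {
      context.write(new LongWritable(key), new Text(values)); // append/emit as needed
    }
  }
}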

If you explain your intent we might be able to help better.

john

From: jamal sasha [mailto:jamalshasha@gmail.com]
Sent: Monday, January 07, 2013 4:21 PM
To: user@hadoop.apache.org
Subject: Binary Search in map reduce

Hi,
 I have data in json format like:

{key:[values.....]}
key, values are longints.
Now, I want to do a fast lookup of a key.
How would I implement a binary search in map reduce abstraction.

Or am i not thinking about this correctly?
Any suggestions/advices?
Thanks

RE: Binary Search in map reduce

Posted by "Pamecha, Abhishek" <ap...@ebay.com>.
You will incur the cost of map reduce across all nodes in your cluster anyway. Not sure you will get enough speed advantage.
HBase may help you get close to what you are looking for, but that won't be map reduce.
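
For what it's worth, once the graph is loaded into a table the lookup itself is only a few lines with the HTable client API; the table, column family and qualifier names below are invented:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class GraphLookup {
  // Random read of one key; no job is launched over the rest of the data.
  static byte[] lookup(long key) throws IOException {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "graph");                 // assumed table name
    try {
      Result r = table.get(new Get(Bytes.toBytes(key)));
      return r.getValue(Bytes.toBytes("cf"), Bytes.toBytes("values"));
    } finally {
      table.close();
    }
  }
}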

Thanks,
Abhishek

From: jamal sasha [mailto:jamalshasha@gmail.com]
Sent: Monday, January 07, 2013 3:21 PM
To: user@hadoop.apache.org
Subject: Binary Search in map reduce

Hi,
 I have data in json format like:

{key:[values.....]}
key, values are longints.
Now, I want to do a fast lookup of a key.
How would I implement a binary search in map reduce abstraction.

Or am i not thinking about this correctly?
Any suggestions/advices?
Thanks