Posted to user@hbase.apache.org by Rohit Kelkar <ro...@gmail.com> on 2011/11/07 12:02:17 UTC

mapreduce on two tables

I need some feedback on the best way of implementing the following.
In my document table I have documentid as the row key and content:author
and content:text stored in each row. I want to process all documents
pertaining to each author in a MapReduce job, i.e. my map will take
key=author and values="all documentids written by that author". But for
this I would first have to find all distinct authors and store them in
another table, then run a MapReduce job on that second table. Am I
thinking in the right direction, or is there a better way to achieve
this?
- Rohit Kelkar
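
A minimal sketch of the map stage described above, against the 0.90-era
HBase MapReduce API: a TableMapper scans the document table and emits
<author, documentid>, and a trivial reducer can then join the ids per
author into one line. Only content:author comes from the post; the class
name, table name and everything else here are assumptions for illustration.

import java.io.IOException;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.Text;

// Sketch: emits <author, documentid> for every row of the document table.
public class AuthorMapper extends TableMapper<Text, Text> {

  @Override
  protected void map(ImmutableBytesWritable rowKey, Result row, Context context)
      throws IOException, InterruptedException {
    byte[] author = row.getValue(Bytes.toBytes("content"), Bytes.toBytes("author"));
    if (author == null) {
      return; // skip rows with no author
    }
    String documentId = Bytes.toString(rowKey.get(), rowKey.getOffset(), rowKey.getLength());
    context.write(new Text(author), new Text(documentId));
  }
}

Wired up with TableMapReduceUtil.initTableMapperJob(<table name>, scan,
AuthorMapper.class, Text.class, Text.class, job), with the Scan restricted
to content:author and FileOutputFormat pointed at an HDFS path, the reduce
stage only has to concatenate the document ids per author into one line.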

Re: mapreduce on two tables

Posted by Rohit Kelkar <ro...@gmail.com>.
Thanks for the inputs (Amandeep and Jean-Daniel). Writing the author ->
[list of documents] mapping to an HDFS file works best for me, because that
file (read with NLineInputFormat) will act as the input to another
MapReduce job in which the map part is processor-intensive. Also, I won't
be using the file for random access. I am not inclined towards emitting
<author, document> in map and consuming <author, list of docs> in reduce,
because then I would have to do the processor-intensive work in the
reduce part, and that limits the number of parallel heavy tasks
that I can spawn.
Thanks again.
- Rohit Kelkar
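
A sketch of that second job, assuming the first job's output has one
"author<TAB>docid,docid,..." line per author and that the new-API
NLineInputFormat (org.apache.hadoop.mapreduce.lib.input) is available in
the Hadoop release in use; the class name and the paths are invented.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.NLineInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Each map() call gets one "author <TAB> docid,docid,..." line, so every
// heavy per-author task gets its own mapper and they run in parallel
// across the cluster's map slots.
public class PerAuthorMapper extends Mapper<LongWritable, Text, Text, Text> {

  @Override
  protected void map(LongWritable offset, Text line, Context context)
      throws IOException, InterruptedException {
    String[] parts = line.toString().split("\t", 2);
    if (parts.length < 2) {
      return; // malformed line
    }
    String author = parts[0];
    String[] docIds = parts[1].split(",");
    // ... fetch content:text for each document id and do the heavy processing ...
    context.write(new Text(author), new Text("processed " + docIds.length + " documents"));
  }

  public static void main(String[] args) throws Exception {
    Job job = new Job(new Configuration(), "per-author-processing");
    job.setJarByClass(PerAuthorMapper.class);
    job.setMapperClass(PerAuthorMapper.class);
    job.setInputFormatClass(NLineInputFormat.class);
    NLineInputFormat.setNumLinesPerSplit(job, 1);            // one author per map task
    FileInputFormat.addInputPath(job, new Path("/user/rohit/authors"));        // invented path
    FileOutputFormat.setOutputPath(job, new Path("/user/rohit/authors-out"));  // invented path
    job.setNumReduceTasks(0);                                // map-only: no reduce bottleneck
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}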

On Mon, Nov 7, 2011 at 8:36 PM, Amandeep Khurana <am...@gmail.com> wrote:
> Rohit,
>
> It'll depend on what processing you want to do on all documents for a
> given author. You could either write author -> {list of documents} to
> an HDFS file and scan through that file with a second MR job to do the
> processing, or simply emit <author, document> from the map stage and
> do the processing on <author, {list of documents}> in the reduce stage
> of the same job.
>
> -ak
>
> On Mon, Nov 7, 2011 at 3:02 AM, Rohit Kelkar <ro...@gmail.com> wrote:
>> I need some feedback on the best way of implementing the following.
>> In my document table I have documentid as the row key and content:author
>> and content:text stored in each row. I want to process all documents
>> pertaining to each author in a MapReduce job, i.e. my map will take
>> key=author and values="all documentids written by that author". But for
>> this I would first have to find all distinct authors and store them in
>> another table, then run a MapReduce job on that second table. Am I
>> thinking in the right direction, or is there a better way to achieve
>> this?
>> - Rohit Kelkar
>>
>

Re: mapreduce on two tables

Posted by Amandeep Khurana <am...@gmail.com>.
Rohit,

It'll depend on what processing you want to do on all documents for a
given author. You could either write author -> {list of documents} to
an HDFS file and scan through that file with a second MR job to do the
processing, or simply emit <author, document> from the map stage and
do the processing on <author, {list of documents}> in the reduce stage
of the same job.

-ak
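
A sketch of the reduce side of that second option, with an invented class
name: the shuffle groups the map output, so reduce() sees one author
together with all of that author's document ids, and the per-author work
happens there (which is why, as noted above, the parallelism is capped by
the number of reduce tasks).

import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Receives <author, {documentid, documentid, ...}> after the shuffle;
// the per-author processing happens here.
public class AuthorDocsReducer extends Reducer<Text, Text, Text, Text> {

  @Override
  protected void reduce(Text author, Iterable<Text> docIds, Context context)
      throws IOException, InterruptedException {
    int count = 0;
    for (Text docId : docIds) {
      // ... process one document for this author ...
      count++;
    }
    context.write(author, new Text("processed " + count + " documents"));
  }
}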

On Mon, Nov 7, 2011 at 3:02 AM, Rohit Kelkar <ro...@gmail.com> wrote:
> I need some feedback on the best way of implementing the following.
> In my document table I have documentid as the row key and content:author
> and content:text stored in each row. I want to process all documents
> pertaining to each author in a MapReduce job, i.e. my map will take
> key=author and values="all documentids written by that author". But for
> this I would first have to find all distinct authors and store them in
> another table, then run a MapReduce job on that second table. Am I
> thinking in the right direction, or is there a better way to achieve
> this?
> - Rohit Kelkar
>

Re: mapreduce on two tables

Posted by Jean-Daniel Cryans <jd...@apache.org>.
You don't really need to store that in another HBase table; just
dump it to HDFS (unless you want to do random access on that second
table, which would act as a secondary index of documents by author).

It's a workable solution; it's just brute force.

J-D
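
For comparison, the second-table route would look roughly like the sketch
below (0.90-era API; the table name, the info:docids column and the class
name are all invented). It is only worth the extra moving parts if random
access by author is actually needed; otherwise the reducer just writes the
same lines through FileOutputFormat.

import java.io.IOException;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableReducer;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.Text;

// Writes one row per author into a secondary-index table, with the
// comma-joined document ids stored under info:docids.
public class AuthorIndexReducer extends TableReducer<Text, Text, ImmutableBytesWritable> {

  @Override
  protected void reduce(Text author, Iterable<Text> docIds, Context context)
      throws IOException, InterruptedException {
    StringBuilder joined = new StringBuilder();
    for (Text docId : docIds) {
      if (joined.length() > 0) {
        joined.append(',');
      }
      joined.append(docId.toString());
    }
    byte[] rowKey = Bytes.toBytes(author.toString());
    Put put = new Put(rowKey);
    put.add(Bytes.toBytes("info"), Bytes.toBytes("docids"), Bytes.toBytes(joined.toString()));
    context.write(new ImmutableBytesWritable(rowKey), put);
  }
}

The driver would then call TableMapReduceUtil.initTableReducerJob with the
index table name and AuthorIndexReducer.class instead of setting an HDFS
output path.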

On Mon, Nov 7, 2011 at 11:02 AM, Rohit Kelkar <ro...@gmail.com> wrote:
> I need some feedback on the best way of implementing the following.
> In my document table I have documentid as the row key and content:author
> and content:text stored in each row. I want to process all documents
> pertaining to each author in a MapReduce job, i.e. my map will take
> key=author and values="all documentids written by that author". But for
> this I would first have to find all distinct authors and store them in
> another table, then run a MapReduce job on that second table. Am I
> thinking in the right direction, or is there a better way to achieve
> this?
> - Rohit Kelkar
>