Posted to user@hadoop.apache.org by Pablo Musa <pa...@psafe.com> on 2012/10/02 00:27:26 UTC

RE: Two map inputs (file & HBase). "Join" file data and Hbase data into a map reduce.

Hi Bejoy,
thank you for the answer; I will try it. But I still have a question.
How should I manage connections to HBase inside the job?

Should I open a new connection in each job? How can I set up a
connection pool inside a job?

Thank you,
Pablo

From: Bejoy Ks [mailto:bejoy.hadoop@gmail.com]
Sent: Thursday, September 27, 2012 14:55
To: user@hadoop.apache.org
Subject: Re: Two map inputs (file & HBase). "Join" file data and Hbase data into a map reduce.

Hi Pablo

>I could read the file and do a get followed by a put, but this would not be a MR job
>and would be very slow if there are a lot of entries in the file.

If you have a large file, you can use MapReduce to parallelize the HBase gets and puts. Configure the split size so that there are enough mappers to ensure sufficient parallelism and good performance.
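
To make the split-size advice concrete, here is a minimal sketch of the arithmetic, not a definitive implementation. It assumes the file is read through a FileInputFormat-based input format from the newer (mapreduce) API, whose split size is the block size clamped between the configured minimum and maximum split sizes. The class name SplitEstimate, the method estimateMappers, and the 1 GB / 128 MB / 16 MB figures are hypothetical and only illustrate the calculation.

```java
/** Rough estimate of how many map tasks a file yields for given split settings. */
final class SplitEstimate {

    // Same rule FileInputFormat applies in the mapreduce API:
    // clamp the block size between the minimum and maximum split size.
    static long computeSplitSize(long minSize, long maxSize, long blockSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    // Approximate mapper count: one mapper per split, rounding up.
    // Hadoop can fold a small trailing remainder into the last split,
    // so the real count may be one lower than this estimate.
    static long estimateMappers(long fileSize, long splitSize) {
        if (splitSize <= 0) {
            throw new IllegalArgumentException("splitSize must be positive");
        }
        return (fileSize + splitSize - 1) / splitSize;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        long fileSize = 1024 * mb;  // hypothetical 1 GB file of row keys
        long blockSize = 128 * mb;  // hypothetical HDFS block size

        // No maximum configured: the split size falls back to the block size.
        long defaultSplit = computeSplitSize(1, Long.MAX_VALUE, blockSize);
        // Maximum split size lowered to 16 MB to get more mappers.
        long smallerSplit = computeSplitSize(1, 16 * mb, blockSize);

        System.out.println("default split: " + estimateMappers(fileSize, defaultSplit) + " mappers");
        System.out.println("16 MB split:   " + estimateMappers(fileSize, smallerSplit) + " mappers");
    }
}
```

Lowering the maximum split size is the lever here: each mapper then handles fewer row keys, so more gets and puts run in parallel. The right value depends on how many concurrent requests the HBase cluster can absorb.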
On Thu, Sep 27, 2012 at 11:14 PM, Pablo Musa <pa...@psafe.com> wrote:
Hey guys,
I am not sure if this is the correct list (it could also be HBase), but I think my question
is more related to MR than to HBase itself.

I am trying to update some columns of a family in my HBase db using an MR job.
In each column I have a byte array with different information concatenated.

So far, so easy: an MR job with the table as input and the scan.setfamily.
My problem is that the rows that I want to update are inside a file. In
other words, I have a big file containing all the rows that should be updated.

So, I have to read the row:column content so I can update it and then write it again.
But I also have to read the file in order to know which rows I should update.

I could read the file and do a get followed by a put, but this would not be a MR job
and would be very slow if there are a lot of entries in the file.

Another possibility I thought of, but don't know how to implement, is to use the table
as input and have a Map with the rows that should be updated. The problem is that
I don't know how to distribute this Map or how to distribute the file so every mapper
can read it.
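
One way to picture this second option, as a minimal sketch under stated assumptions: each mapper loads the list of row keys into an in-memory set once, then checks every scanned row against it. The class RowKeyFilter and its methods are hypothetical, the one-key-per-line file format is an assumption, and the set has to fit in each mapper's memory. In a real job the load would typically happen in the mapper's setup() method, with the file shipped to every task through something like Hadoop's DistributedCache; here it is plain Java so the idea can be run in isolation.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

/** Holds the row keys that should be updated (hypothetical helper). */
final class RowKeyFilter {
    private final Set<String> keys = new HashSet<>();

    // Reads one row key per line (assumed format), skipping blank lines.
    static RowKeyFilter load(Path file) throws IOException {
        RowKeyFilter filter = new RowKeyFilter();
        try (BufferedReader reader = Files.newBufferedReader(file, StandardCharsets.UTF_8)) {
            String line;
            while ((line = reader.readLine()) != null) {
                String key = line.trim();
                if (!key.isEmpty()) {
                    filter.keys.add(key);
                }
            }
        }
        return filter;
    }

    // Called once per scanned row; only matching rows would be rewritten.
    boolean shouldUpdate(String rowKey) {
        return keys.contains(rowKey);
    }

    int size() {
        return keys.size();
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for the side file that would be shipped to every mapper.
        Path file = Files.createTempFile("rows-to-update", ".txt");
        Files.write(file, Arrays.asList("row-001", "row-042"), StandardCharsets.UTF_8);

        RowKeyFilter filter = RowKeyFilter.load(file);
        for (String scanned : Arrays.asList("row-001", "row-002", "row-042")) {
            System.out.println(scanned + " -> update? " + filter.shouldUpdate(scanned));
        }
        Files.delete(file);
    }
}
```

The trade-off against the get/put approach is that this variant scans the whole table, so it pays off mainly when the file covers a large share of the rows.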

Any thoughts?

Thanks,
Pablo