Posted to user@hbase.apache.org by Pavel Hančar <pa...@gmail.com> on 2013/04/06 18:45:12 UTC

HBase tasks

  Hello,
maybe I don't understand one basic thing. MapReduce jobs are meant for long
jobs that process big data. But what to do if I have an HBase in-memory
table, where I would like to process all (or selected) records with minimal
response time? MapReduce as well?
   If so, are there any features to speed up the processing? Is it possible
to avoid some of the disk writes/reads?
  I am trying to compare vectors extracted from pictures and sort the output
with a single empty reducer. A web application then reads the output.
In particular, the final write of the single reducer's output, followed
immediately by the web application reading it, seems strange to me. Is it
possible to get an iterator from the reducer instead of the output file?
  Thanks,
  Pavel Hančar

Re: HBase tasks

Posted by Pavel Hančar <pa...@gmail.com>.
 Hi,
thanks for the answer. Yes, I meant an in-memory column family. But please,
does it matter whether I have the two column families in one table or in
separate tables? Or is it somehow stupid to have a table with only one CF?
I have one column family with the pictures and another with two columns. The
first column holds vectors (small text files) extracted from those pictures,
and the second holds diminished versions of the pictures.
 I have a program that measures similarity distances between the vectors. I
want a real-time web application that calculates the distances and displays
the diminished pictures sorted by them. My question is whether I should use
MapReduce or whether there is an alternative. MapReduce seems quite
cumbersome to me.
 I use CDH3 (HBase 0.90.6). For now I'm developing everything on my laptop
with a small amount of data, but we expect to have a cluster of about 30
nodes with hundreds of GB. On the laptop I have 3430 pictures and the
response time with MapReduce is 26 sec. I thought I could speed up the
processing by making the second CF in-memory, but the response time is the
same. I think MapReduce does so many disk writes/reads that it can hardly be
quicker. Or is there any way to do all the processing in memory? I feel
especially stupid when my only reducer writes its output to disk and then I
read it immediately with a Java web application. Can I somehow get an
Iterator from the reducer instead of the output file?
  Thanks,
  Pavel Hančar
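
For illustration, here is a minimal sketch of the non-MapReduce alternative asked about above. It assumes the vectors have already been loaded into the web application's memory (how they are read from HBase is not shown) and that similarity is plain Euclidean distance, which the thread does not actually specify. The names SimilarityRanker, Ranked, distance and rank are hypothetical; no HBase or Hadoop API is used. The point is only that the ranking can be handed back as an Iterator rather than written to an output file.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

// Hypothetical helper: ranks stored vectors by distance to a query vector,
// entirely in memory. Not part of HBase or Hadoop.
class SimilarityRanker {

    // One result: the row key of a picture and its distance to the query.
    static final class Ranked {
        final String rowKey;
        final double distance;

        Ranked(String rowKey, double distance) {
            this.rowKey = rowKey;
            this.distance = distance;
        }
    }

    // Euclidean distance. This metric is an assumption; the thread does not name one.
    static double distance(double[] a, double[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("vectors differ in length");
        }
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    // Computes the distance of every stored vector to the query and
    // returns an iterator over the results, closest first.
    static Iterator<Ranked> rank(double[] query, Map<String, double[]> vectors) {
        List<Ranked> results = new ArrayList<Ranked>(vectors.size());
        for (Map.Entry<String, double[]> e : vectors.entrySet()) {
            results.add(new Ranked(e.getKey(), distance(query, e.getValue())));
        }
        Collections.sort(results, new Comparator<Ranked>() {
            @Override
            public int compare(Ranked x, Ranked y) {
                return Double.compare(x.distance, y.distance);
            }
        });
        return results.iterator();
    }
}
```

A data set the size of the laptop test (3430 pictures) would probably fit in a single JVM this way. At the planned cluster scale the work would presumably have to be partitioned, which this sketch does not attempt.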


2013/4/8 Anoop Sam John <an...@huawei.com>

> Hi
> >But what to do, if I have an HBase in-memory table,
> Why do you say in-memory table? Is all the data in memory? Can you explain
> a bit about this?
>
> Yes, there is an MR job to scan the HBase table data (in full or in part).
>
> When you say you want to retrieve data fast, what is the amount of data?
> How many regions? Have you done any testing with the scan APIs?
>
> Which version of HBase?
>
> -Anoop-

RE: HBase tasks

Posted by Azuryy Yu <az...@gmail.com>.
I guess he has only one CF, which is in memory, so he called it an in-memory
table.

On Apr 8, 2013 11:46 AM, "Anoop Sam John" <an...@huawei.com> wrote:

> Hi
> >But what to do, if I have an HBase in-memory table,
> Why do you say in-memory table? Is all the data in memory? Can you explain
> a bit about this?
>
> Yes, there is an MR job to scan the HBase table data (in full or in part).
>
> When you say you want to retrieve data fast, what is the amount of data?
> How many regions? Have you done any testing with the scan APIs?
>
> Which version of HBase?
>
> -Anoop-

RE: HBase tasks

Posted by Anoop Sam John <an...@huawei.com>.
Hi
>But what to do, if I have an HBase in-memory table,
Why do you say in-memory table? Is all the data in memory? Can you explain a bit about this?

Yes, there is an MR job to scan the HBase table data (in full or in part).

When you say you want to retrieve data fast, what is the amount of data? How many regions? Have you done any testing with the scan APIs?

Which version of HBase?

-Anoop-