Posted to user@hbase.apache.org by Li Li <fa...@gmail.com> on 2014/04/14 08:08:36 UTC

Scan vs map-reduce

I have a full table scan that takes about 10 minutes. It seems to be a
bottleneck for our application. If I rewrite it using map-reduce, will
it be faster?

Re: Scan vs map-reduce

Posted by Li Li <fa...@gmail.com>.
Thanks, I will try List<Get> later.

On Tue, Apr 15, 2014 at 3:39 AM, Doug Meil
<do...@explorysmedical.com> wrote:
>
> re:  "my first version is using 20,000 Get"
>
> Just throwing this out there, but have you looked at multi-get?  Multi-get
> will group the gets by RegionServer internally.
>
> You are doing a lot of IO for a web-app so this is going to be tough to
> make "fast", but there are ways to make it "faster."
>
> But since you only have 1,000,000 rows you might not have many regions, so
> this might wind up all going on the same RegionServer.

Re: Scan vs map-reduce

Posted by Doug Meil <do...@explorysmedical.com>.
re:  "my first version is using 20,000 Get"

Just throwing this out there, but have you looked at multi-get?  Multi-get
will group the gets by RegionServer internally.

You are doing a lot of IO for a web-app so this is going to be tough to
make "fast", but there are ways to make it "faster."

But since you only have 1,000,000 rows you might not have many regions, so
this might wind up all going on the same RegionServer.
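
For reference, multi-get in the HBase Java client is the get(List<Get>)
overload on the table handle, which returns a Result[]. The sketch below
shows only the batching pattern around such a call, so it runs without a
cluster. The multiGet function parameter is a hypothetical stand-in for the
real HBase call, and the class name, the batch size of 1,000 and the string
row values are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Sketch only: no HBase classes are used here. The multiGet function is a
// hypothetical stand-in for a call such as table.get(List<Get>).
final class MultiGetSketch {

    // Split the wanted row keys into batches of at most batchSize keys.
    static List<List<String>> partition(List<String> rowKeys, int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<List<String>> batches = new ArrayList<>();
        for (int i = 0; i < rowKeys.size(); i += batchSize) {
            int end = Math.min(i + batchSize, rowKeys.size());
            batches.add(new ArrayList<>(rowKeys.subList(i, end)));
        }
        return batches;
    }

    // Fetch all keys with one multiGet call per batch instead of one call per key.
    static Map<String, String> fetchAll(List<String> rowKeys, int batchSize,
            Function<List<String>, Map<String, String>> multiGet) {
        Map<String, String> rows = new LinkedHashMap<>();
        for (List<String> batch : partition(rowKeys, batchSize)) {
            rows.putAll(multiGet.apply(batch));
        }
        return rows;
    }

    public static void main(String[] args) {
        List<String> keys = new ArrayList<>();
        for (int i = 0; i < 20000; i++) {
            keys.add("row-" + i);
        }
        int[] calls = {0};
        // Fake lookup that counts how many round trips were made.
        Map<String, String> rows = fetchAll(keys, 1000, batch -> {
            calls[0]++;
            Map<String, String> found = new LinkedHashMap<>();
            for (String key : batch) {
                found.put(key, "value-of-" + key);
            }
            return found;
        });
        System.out.println(rows.size() + " rows in " + calls[0] + " calls");
    }
}
```

With a real table the batch size would need tuning, and as noted above the
gain may be limited if all rows live on the same RegionServer.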




On 4/14/14, 7:52 AM, "Li Li" <fa...@gmail.com> wrote:

>I need to get about 20,000 rows from the table. The table has about
>1,000,000 rows.
>My first version is using 20,000 Get and I found it's very slow. So I
>modified it to a scan and filter out unrelated rows in the client.
>Maybe I should write a coprocessor. By the way, is there any filter available
>for me? Something like the SQL clause where rowkey in ('abc', 'abd'
>....), with a very long IN list.


Re: Scan vs map-reduce

Posted by Jean-Marc Spaggiari <je...@spaggiari.org>.
This might help you: http://phoenix.incubator.apache.org/

JM
On 2014-04-14 07:53, "Li Li" <fa...@gmail.com> wrote:

> I need to get about 20,000 rows from the table. The table has about
> 1,000,000 rows.
> My first version is using 20,000 Get and I found it's very slow. So I
> modified it to a scan and filter out unrelated rows in the client.
> Maybe I should write a coprocessor. By the way, is there any filter available
> for me? Something like the SQL clause where rowkey in ('abc', 'abd'
> ....), with a very long IN list.

Re: Scan vs map-reduce

Posted by Li Li <fa...@gmail.com>.
I need to get about 20,000 rows from the table. The table has about
1,000,000 rows.
My first version is using 20,000 Get and I found it's very slow. So I
modified it to a scan and filter out unrelated rows in the client.
Maybe I should write a coprocessor. By the way, is there any filter available
for me? Something like the SQL clause where rowkey in ('abc', 'abd'
....), with a very long IN list.

On Mon, Apr 14, 2014 at 7:46 PM, Jean-Marc Spaggiari
<je...@spaggiari.org> wrote:
> Hi Li Li,
>
> If you have more than one region, it might be useful. MR will scan all the
> regions in parallel. If you do a full scan from a client API with no
> parallelism, then the MR job might be faster. But it will take more
> resources on the cluster and might impact the SLA of the other clients, if
> any.
>
> JM

Re: Scan vs map-reduce

Posted by Jean-Marc Spaggiari <je...@spaggiari.org>.
Hi Li Li,

If you have more than one region, it might be useful. MR will scan all the
regions in parallel. If you do a full scan from a client API with no
parallelism, then the MR job might be faster. But it will take more
resources on the cluster and might impact the SLA of the other clients, if
any.

JM
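
To illustrate why scanning key ranges in parallel can beat a single
sequential scan, here is a minimal sketch that uses a plain Java thread pool
rather than MapReduce. It is not HBase code: an in-memory TreeMap stands in
for the table, and the hypothetical boundaries list stands in for region
start keys. Each key range is counted by its own task and the partial counts
are summed.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.NavigableMap;
import java.util.TreeMap;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Sketch only: a sorted in-memory map plays the role of the table, and the
// boundaries list plays the role of region start keys (both are assumptions).
final class ParallelRangeScanSketch {

    // Count the rows whose key falls in [start, end).
    static long scanRange(NavigableMap<String, String> table, String start, String end) {
        return table.subMap(start, true, end, false).size();
    }

    // Scan every range [boundaries[i], boundaries[i+1]) in its own task and sum the counts.
    static long parallelCount(NavigableMap<String, String> table,
            List<String> boundaries, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<Long>> parts = new ArrayList<>();
            for (int i = 0; i + 1 < boundaries.size(); i++) {
                final String start = boundaries.get(i);
                final String end = boundaries.get(i + 1);
                Callable<Long> task = () -> scanRange(table, start, end);
                parts.add(pool.submit(task));
            }
            long total = 0;
            for (Future<Long> part : parts) {
                total += part.get();
            }
            return total;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException("parallel scan failed", e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        NavigableMap<String, String> table = new TreeMap<>();
        for (int i = 0; i < 1000; i++) {
            table.put(String.format("row-%04d", i), "value");
        }
        List<String> boundaries = Arrays.asList(
                "row-0000", "row-0250", "row-0500", "row-0750", "row-1000");
        System.out.println(parallelCount(table, boundaries, 4) + " rows scanned");
    }
}
```

As with the MR job, the parallel version does the same total work in less
wall-clock time, so the extra load on the cluster still applies.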


2014-04-14 2:42 GMT-04:00 Mohammad Tariq <do...@gmail.com>:

> Well, it depends. Could you please provide some more details? It will help
> us in giving a proper answer.
>
> Warm Regards,
> Tariq
> cloudfront.blogspot.com

Re: Scan vs map-reduce

Posted by Mohammad Tariq <do...@gmail.com>.
Well, it depends. Could you please provide some more details? It will help
us in giving a proper answer.

Warm Regards,
Tariq
cloudfront.blogspot.com


On Mon, Apr 14, 2014 at 11:38 AM, Li Li <fa...@gmail.com> wrote:

> I have a full table scan that takes about 10 minutes. It seems to be a
> bottleneck for our application. If I rewrite it using map-reduce, will
> it be faster?
>