Posted to user@hbase.apache.org by Sean McNamara <Se...@Webtrends.com> on 2012/11/28 07:28:00 UTC

Parallel reading advice

I have a table whose keys are prefixed with a byte to help distribute the keys so scans don't hotspot.

I also have a bunch of slave processes that work to scan the prefix partitions in parallel.  Currently each slave sets up its own HBase connection, scanner, etc.  Most of the slave processes finish their scan and return within 2-3 seconds.  It tends to take the same amount of time regardless of whether there's lots of data or very little.  So I think that 2-second overhead is there because each slave sets up a new connection on each request (I am unable to reuse connections in the slaves).

I'm wondering if I could remove some of that overhead by using the master (which can reuse its HBase connection) to determine the splits, and then delegating that information out to each slave. I think I could possibly use TableInputFormat/TableRecordReader to accomplish this.  Would this route make sense?
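For anyone following along, the salted-prefix scan scheme above can be sketched roughly like this. This is a conceptual sketch only: a TreeMap stands in for the sorted HBase table, and names like salt/parallelCount are illustrative, not HBase API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableMap;
import java.util.TreeMap;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class SaltedScanSketch {
    static final int NUM_PREFIXES = 4; // one scan partition per salt value

    // Salt a logical key with a small prefix derived from its hash so that
    // sequential logical keys spread evenly over the sorted key space.
    static String salt(String key) {
        return Math.floorMod(key.hashCode(), NUM_PREFIXES) + "|" + key;
    }

    // Scan each salt partition in parallel and return the total row count.
    static int parallelCount() throws Exception {
        NavigableMap<String, String> table = new TreeMap<>(); // stand-in for the sorted table
        for (int i = 0; i < 100; i++) table.put(salt("row" + i), "value" + i);

        ExecutorService pool = Executors.newFixedThreadPool(NUM_PREFIXES);
        List<Future<Integer>> parts = new ArrayList<>();
        for (int p = 0; p < NUM_PREFIXES; p++) {
            final String start = p + "|", stop = (p + 1) + "|"; // [start, stop) covers one prefix
            parts.add(pool.submit((Callable<Integer>) () -> table.subMap(start, stop).size()));
        }
        int total = 0;
        for (Future<Integer> f : parts) total += f.get();
        pool.shutdown();
        return total;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(parallelCount()); // 100: every row is covered by exactly one partition
    }
}
```

The point is just that the salt byte turns one big sequential scan into NUM_PREFIXES disjoint range scans that can run concurrently.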

Re: Parallel reading advice

Posted by Sean McNamara <Se...@Webtrends.com>.
> Is 2-3 seconds including the startup time for the JVM?

Not in this case.  I put a timer wrapper around my call to
HTablePool.getTable and the scan.
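The timing wrapper is nothing fancy; a sketch of what I mean is below. The sleep is a stand-in for the actual getTable + scan work, since the real HBase calls need a live cluster.

```java
public class ScanTimer {
    // Measure only the wrapped work, excluding JVM startup time.
    static long timeMillis(Runnable work) {
        long start = System.nanoTime();
        work.run();
        return (System.nanoTime() - start) / 1_000_000;
    }

    public static void main(String[] args) {
        long elapsed = timeMillis(() -> {
            // stand-in for: pool.getTable(name); scanner = table.getScanner(scan); drain scanner
            try { Thread.sleep(50); } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
        });
        System.out.println("scan took ~" + elapsed + " ms");
    }
}
```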

Thanks J-D

Sean


On 11/29/12 3:07 PM, "Jean-Daniel Cryans" <jd...@apache.org> wrote:

>Thanks for posting your findings back to the list.
>
>Is 2-3 seconds including the startup time for the JVM?
>
>J-D


Re: Parallel reading advice

Posted by Jean-Daniel Cryans <jd...@apache.org>.
Thanks for posting your findings back to the list.

Is 2-3 seconds including the startup time for the JVM?

J-D

On Wed, Nov 28, 2012 at 2:25 PM, Sean McNamara
<Se...@webtrends.com>wrote:

> Turns out there is a way to reuse the connection in Spark.  I was also
> forgetting to call setCaching (that was the primary reason). So it's very
> fast now and I have the data where I need it.
>
> The first request still takes 2-3 seconds to setup and see data
> (regardless of how much), but after that it's super fast.
>
> Sean

Re: Parallel reading advice

Posted by Sean McNamara <Se...@Webtrends.com>.
Turns out there is a way to reuse the connection in Spark.  I was also
forgetting to call setCaching (that was the primary reason). So it's very
fast now and I have the data where I need it.
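For the archives, the reason setCaching matters so much is round trips: with scanner caching of 1, every row fetched is its own RPC to the region server, while caching N returns up to N rows per trip. A toy illustration (just counting simulated trips, not real HBase calls):

```java
public class CachingSketch {
    // Count how many "RPCs" it takes to drain a scan of totalRows rows
    // when each round trip returns up to `caching` rows.
    static int roundTrips(int totalRows, int caching) {
        int trips = 0;
        for (int fetched = 0; fetched < totalRows; fetched += caching) {
            trips++; // one round trip returns up to `caching` rows
        }
        return trips;
    }

    public static void main(String[] args) {
        System.out.println(roundTrips(10_000, 1));   // 10000 trips
        System.out.println(roundTrips(10_000, 500)); // 20 trips
    }
}
```

Three to four orders of magnitude fewer round trips is easily the difference between seconds and sub-second.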

The first request still takes 2-3 seconds to setup and see data
(regardless of how much), but after that it's super fast.

Sean




Re: Parallel reading advice

Posted by Sean McNamara <Se...@Webtrends.com>.
Hi J-D

Really good questions.  I will check for a misconfiguration.


> I'm not sure what you're talking about here. Which master

I am using http://spark-project.org/ , so the master I am referring to is
really the Spark driver.  Spark can read from a Hadoop InputFormat and
populate itself that way, but you don't have control over which
slave/worker the data will land on.  My goal is to use Spark to reach in
for slices of data that are in HBase, and to be able to perform set
operations on the data in parallel using Spark.  Being able to load a
partition onto the right node is important, so that I don't have to
reshuffle the data just to get it onto the node that handles a particular
data partition.
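One way to get partitions onto the right node, assuming you control (or can hint) scheduling, is to pin each salt prefix to a fixed worker deterministically, e.g. by modulo. A hypothetical sketch (workerFor is illustrative, not a Spark or HBase API):

```java
public class PartitionAssignment {
    // Deterministically pin a salt prefix to a worker so the same worker
    // always scans the same partition across runs, with no shuffle needed.
    static int workerFor(int saltPrefix, int numWorkers) {
        return Math.floorMod(saltPrefix, numWorkers);
    }

    public static void main(String[] args) {
        int numWorkers = 40;
        // e.g. prefix 41 always lands on worker 1, prefix 0 on worker 0
        System.out.println(workerFor(41, numWorkers));
        System.out.println(workerFor(0, numWorkers));
    }
}
```

Because the mapping is a pure function of the prefix, every run routes a given partition's scan to the same node.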


> BTW why can't you keep the connections around?

The Spark API is purely functional; AFAIK it's not possible to set up a
connection and keep it around (I am asking on that mailing list to be
sure).


> Since this is something done within the HBase client, doing it
>externally sounds terribly hacky

Yup.  The reason I am entertaining this route is that using an InputFormat
with Spark I was able to load in way more data, and it was all sub-second.
Since moving to having the Spark slaves pull in their own data (not using
the InputFormat), it seems slower for some reason.  I figured it might be
because with an InputFormat the slaves were told what to load, vs. each of
the 40 slaves having to do more work to find what to load.
Perhaps my assumption is wrong?  Thoughts?


I really appreciate your insights.  Thanks!







Re: Parallel reading advice

Posted by Jean-Daniel Cryans <jd...@apache.org>.
Inline.

J-D

On Wed, Nov 28, 2012 at 7:28 AM, Sean McNamara
<Se...@webtrends.com>wrote:

> I have a table whose keys are prefixed with a byte to help distribute the
> keys so scans don't hotspot.
>
> I also have a bunch of slave processes that work to scan the prefix
> partitions in parallel.  Currently each slave sets up its own HBase
> connection, scanner, etc.  Most of the slave processes finish their scan
> and return within 2-3 seconds.  It tends to take the same amount of time
> regardless of whether there's lots of data or very little.  So I think that
> 2-second overhead is there because each slave sets up a new connection on
> each request (I am unable to reuse connections in the slaves).
>

2 secs sounds way too high. I recommend you check into this and see where
the time is spent, as you may find underlying issues like misconfiguration.


>
> I'm wondering if I could remove some of that overhead by using the master
> (which can reuse its HBase connection) to determine the splits, and then
> delegating that information out to each slave. I think I could possibly use
> TableInputFormat/TableRecordReader to accomplish this.  Would this route
> make sense?
>

I'm not sure what you're talking about here. Which master? HBase's, or is
there something in your infrastructure that's also called "master"? I'm
also not sure what you are trying to achieve by "determine the splits":
do you mean finding the regions you need to contact from your slaves?
Since this is something done within the HBase client, doing it externally
sounds terribly hacky. BTW, why can't you keep the connections around? Is
it a problem of JVMs being re-spawned? If so, there are techniques you can
use to keep them around for reuse, and then you would also benefit from
reusing connections.
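As a sketch of one such technique: a process-wide, lazily initialized holder, so every task running in the same JVM reuses one handle instead of reconnecting per request. ExpensiveConnection here is a stand-in, not the HBase client class; the holder idiom just gives thread-safe one-time initialization.

```java
public class ConnectionHolder {
    // Stand-in for an expensive-to-open client connection.
    static class ExpensiveConnection {
        static int timesOpened = 0;
        ExpensiveConnection() { timesOpened++; }
    }

    // Initialization-on-demand holder: INSTANCE is created exactly once,
    // on the first call to get(), with no explicit locking needed.
    private static class Holder {
        static final ExpensiveConnection INSTANCE = new ExpensiveConnection();
    }

    static ExpensiveConnection get() { return Holder.INSTANCE; }

    public static void main(String[] args) {
        for (int i = 0; i < 1000; i++) get(); // reused, never reopened
        System.out.println(ExpensiveConnection.timesOpened); // 1
    }
}
```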

Hope this helps,

J-D