Posted to user@crunch.apache.org by Nithin Asokan <an...@gmail.com> on 2015/03/17 13:51:09 UTC

Question about HBaseSourceTarget

Hello,
I came across a unique behavior while using HBaseSourceTarget. Suppose I
have a job (from MRPipeline) that reads from HBase using HBaseSourceTarget
and passes all the data to a reduce phase; the number of reducers set by
the planner will be 1. The reason is [1]. So it looks like the planner
assumes only about 1GB of data is read from the source and sets the number
of reducers accordingly. However, whether my HBase scan returns very little
data or a huge amount of data, the planner still assigns 1 reducer
(crunch.bytes.per.reduce.task=1GB). What's more interesting is that if
there are dependent jobs, the planner sets their number of reducers based
on the size initially determined for the HBase source.
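
For illustration, here is my rough reading of the arithmetic (a simplified
sketch, not the actual planner code; both values are the defaults as I
understand them):

    // Simplified sketch of how I understand the reducer count is derived.
    long assumedSourceBytes = 1000L * 1000L * 1000L;  // fixed estimate reported by HBaseSourceTarget [1]
    long bytesPerReduceTask = 1000L * 1000L * 1000L;  // crunch.bytes.per.reduce.task default
    long numReducers = Math.max(1, assumedSourceBytes / bytesPerReduceTask);  // always 1 here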

As a fix for the above problem, I can set the number of reducers on the
groupByKey(), but that does not offer much flexibility when dealing with
data of varying sizes. The other option is to have a map-only job that
reads from HBase and writes to HDFS, followed by a run(); the next job will
then determine the size correctly, since FileSourceImpl calculates the size
on disk.
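
To make the first workaround concrete, this is the kind of thing I mean
(a minimal sketch; readFromHBase() is just a placeholder for the actual
HBase read, and 20 is an arbitrary hand-picked reducer count):

    import org.apache.crunch.GroupingOptions;
    import org.apache.crunch.PTable;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable;

    // Placeholder for the PTable read from the HBaseSourceTarget.
    PTable<ImmutableBytesWritable, Result> rows = readFromHBase();

    // Hard-code the reducer count so the planner's 1GB guess doesn't matter:
    rows.groupByKey(20).ungroup();

    // ...or the same thing via GroupingOptions:
    rows.groupByKey(GroupingOptions.builder().numReducers(20).build()).ungroup();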

I noticed the comment on HBaseSourceTarget, and was wondering whether there
are any plans to implement it.

[1]
https://github.com/apache/crunch/blob/apache-crunch-0.8.4/crunch-hbase/src/main/java/org/apache/crunch/io/hbase/HBaseSourceTarget.java#L173

Thanks
Nithin

Re: Question about HBaseSourceTarget

Posted by Gabriel Reid <ga...@gmail.com>.
Hi Nithin,

Unfortunately, the HBase classes aren't included in the published API docs.
I just took a look at adding them, but it appears to be more complex than I
would have hoped -- I'll create a JIRA ticket to look into this further,
but I won't be able to get to it right away.

In any case, these HBase classes (HBaseFrom, HBaseTo) are in the
org.apache.crunch.io.hbase package in the crunch-hbase module.
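
For example, something along these lines (an untested sketch; the table
name, column family, and row keys are made up, and 'pipeline' is an
existing MRPipeline):

    import org.apache.crunch.PTable;
    import org.apache.crunch.io.hbase.HBaseFrom;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
    import org.apache.hadoop.hbase.util.Bytes;

    // Read a whole table...
    PTable<ImmutableBytesWritable, Result> all =
        pipeline.read(HBaseFrom.table("my_table"));      // "my_table" is just an example

    // ...or only a slice of it via a custom Scan.
    Scan scan = new Scan();
    scan.addFamily(Bytes.toBytes("cf"));                 // example column family
    scan.setStartRow(Bytes.toBytes("row-000"));
    scan.setStopRow(Bytes.toBytes("row-999"));
    PTable<ImmutableBytesWritable, Result> slice =
        pipeline.read(HBaseFrom.table("my_table", scan));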

- Gabriel

Re: Question about HBaseSourceTarget

Posted by Nithin Asokan <an...@gmail.com>.
Thanks for looking at this, everyone.

I can try the suggestion Gabriel posted here. I'm not familiar with the
HBaseFrom.table(String) API and tried searching for it online; it would be
really helpful if someone could point me to the API docs.

Thanks everyone!

Re: Question about HBaseSourceTarget

Posted by Gabriel Reid <ga...@gmail.com>.
Yep, that looks like it could be pretty handy -- according to that ticket
it's in 0.98.1 as well.

Re: Question about HBaseSourceTarget

Posted by Josh Wills <jw...@cloudera.com>.
Would this help for 0.99+?

https://issues.apache.org/jira/browse/HBASE-10413
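
If I'm reading the ticket right, it adds a RegionSizeCalculator that we
could lean on for the estimate -- something like this rough, untested
sketch (the exact constructor and method names may differ by HBase
version, and "my_table" is just an example):

    import java.util.Map;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.mapreduce.RegionSizeCalculator;

    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "my_table");
    RegionSizeCalculator sizes = new RegionSizeCalculator(table);
    long totalBytes = 0L;
    for (Map.Entry<byte[], Long> e : sizes.getRegionSizeMap().entrySet()) {
      totalBytes += e.getValue();   // sum the per-region store file sizes
    }
    table.close();
    // totalBytes could then feed the planner's size estimate for the source.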

-- 
Director of Data Science
Cloudera <http://www.cloudera.com>
Twitter: @josh_wills <http://twitter.com/josh_wills>

Re: Question about HBaseSourceTarget

Posted by Gabriel Reid <ga...@gmail.com>.
That sounds like it would work pretty well, although the situation where a
custom Scan is used is still problematic.

I think Hannibal [1] does some clever stuff to figure out data size as well
(I think just via HBase RPC rather than by looking at HDFS); there could be
some useful ideas in there.

- Gabriel

1. https://github.com/sentric/hannibal

Re: Question about HBaseSourceTarget

Posted by Micah Whitacre <mk...@gmail.com>.
Could we make an estimate based on (# of regions) * hbase.hregion.max.filesize?
The case where this would overestimate is when someone pre-split a table
upon creation. Otherwise, as the table fills up over time, in theory each
region grows and splits evenly (and possibly hits the max size and splits
again).
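
Roughly something like this (an untested sketch using the old HTable client
API; "my_table" is just an example, and the 10GB fallback is the default
value as I remember it):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;

    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "my_table");
    // One start key per region, so this counts the regions.
    int numRegions = table.getStartEndKeys().getFirst().length;
    long maxFileSize = conf.getLong("hbase.hregion.max.filesize",
        10L * 1024 * 1024 * 1024);
    long estimatedBytes = (long) numRegions * maxFileSize;
    table.close();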

Re: Question about HBaseSourceTarget

Posted by Josh Wills <jw...@cloudera.com>.
Also open to suggestions here -- this has annoyed me for some time (as
Gabriel pointed out), but I don't have a good fix for it.

-- 
Director of Data Science
Cloudera <http://www.cloudera.com>
Twitter: @josh_wills <http://twitter.com/josh_wills>

Re: Question about HBaseSourceTarget

Posted by Gabriel Reid <ga...@gmail.com>.
Hi Nithin,

This is a long-standing issue in Crunch (I think it's been present since
Crunch was originally open-sourced). I'd love to get this fixed somehow,
although it doesn't seem trivial to do -- it can be difficult to accurately
estimate the size of the data that will come out of an HBase table,
especially since filters and selections of a subset of columns can be
applied to the scan.

One short-term way of working around this is to add a simple identity
function directly after the HBaseSourceTarget that overrides the
scaleFactor method to adjust the calculated size of the HBase data, but
this is just another hack.
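
Something like this, for example (a minimal sketch; the multiplier is made
up and would need tuning per table, and 'rows' stands for the PTable read
from the HBase source):

    import org.apache.crunch.MapFn;
    import org.apache.crunch.PTable;
    import org.apache.crunch.Pair;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable;

    // Identity function whose only purpose is to override scaleFactor(), so
    // the planner's size estimate for everything downstream gets multiplied.
    static class InflateSize extends
        MapFn<Pair<ImmutableBytesWritable, Result>, Pair<ImmutableBytesWritable, Result>> {
      @Override
      public Pair<ImmutableBytesWritable, Result> map(Pair<ImmutableBytesWritable, Result> input) {
        return input;
      }
      @Override
      public float scaleFactor() {
        return 50.0f;   // made-up multiplier applied to the (bogus) 1GB estimate
      }
    }

    // Insert it directly after the HBase read ('rows' is that PTable):
    PTable<ImmutableBytesWritable, Result> resized =
        rows.parallelDo("inflate-size", new InflateSize(), rows.getPTableType());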

Maybe the better solution would be to estimate the size of the HBase table
based on its size on HDFS when using the HBaseFrom.table(String) method,
and then overload the HBaseFrom.table(String, Scan) method to also take a
long value giving the estimated byte size (or perhaps scale factor) of the
table content that is expected to be returned by the given Scan.
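
In other words, something along these lines from the caller's side (just a
sketch of the proposed overload -- the three-argument form does not exist
today, and the table name and estimate are made up):

    // Hypothetical usage of the proposed HBaseFrom.table(String, Scan, long):
    Scan scan = new Scan();
    long expectedBytes = 50L * 1024 * 1024 * 1024;   // caller's own estimate: ~50GB from this Scan
    PTable<ImmutableBytesWritable, Result> rows =
        pipeline.read(HBaseFrom.table("my_table", scan, expectedBytes));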

Any thoughts on either of these?

- Gabriel
