Posted to user@crunch.apache.org by Nithin Asokan <an...@gmail.com> on 2015/03/16 20:32:43 UTC

Question about HBaseSourceTarget#getSize()

Hello,
I came across some interesting behavior while using HBaseSourceTarget. Suppose
I have a job (from MRPipeline) that reads from HBase using HBaseSourceTarget
and passes all the data to a reduce phase; the number of reducers set by the
planner will always be 1. The reason is [1]: the planner assumes there is only
about 1 GB of data read from the source, and sets the number of reducers
accordingly. However, whether my HBase scan returns very little data or a
huge amount, the planner still assigns 1 reducer
(crunch.bytes.per.reduce.task defaults to 1 GB). What is more interesting is
that if there are dependent jobs, the planner will set their number of
reducers based on the size initially determined from the HBase source.
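
For reference, here is roughly the pipeline shape I am describing (a minimal
sketch; the class name, table name, and column family are made up):

import org.apache.crunch.PGroupedTable;
import org.apache.crunch.PTable;
import org.apache.crunch.Pipeline;
import org.apache.crunch.impl.mr.MRPipeline;
import org.apache.crunch.io.hbase.HBaseSourceTarget;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseSizeExample {
  public static void main(String[] args) throws Exception {
    Pipeline pipeline = new MRPipeline(HBaseSizeExample.class, new Configuration());

    Scan scan = new Scan();
    scan.addFamily(Bytes.toBytes("cf")); // made-up column family

    // HBaseSourceTarget#getSize() returns a constant (~1 GB), regardless
    // of how much data this scan will actually produce
    PTable<ImmutableBytesWritable, Result> rows =
        pipeline.read(new HBaseSourceTarget("my_table", scan));

    // the planner derives the reducer count for this groupByKey() from
    // that constant size, so it always ends up with 1 reducer
    PGroupedTable<ImmutableBytesWritable, Result> grouped = rows.groupByKey();

    pipeline.done();
  }
}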

As a fix for the above problem, I can set the number of reducers on the
groupByKey(), but that does not offer much flexibility when dealing with
data of varying sizes. The other option is to have a map-only job that reads
from HBase and writes to HDFS, followed by a run(). The next job will then
determine the size correctly, since FileSourceImpl calculates the size from
the data on disk.
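
Concretely, the two workarounds look something like this (a sketch under
assumptions: the staging path, the PTypes, and the key-extraction MapFn are
all made up):

import org.apache.crunch.MapFn;
import org.apache.crunch.PGroupedTable;
import org.apache.crunch.PTable;
import org.apache.crunch.Pair;
import org.apache.crunch.Pipeline;
import org.apache.crunch.impl.mr.MRPipeline;
import org.apache.crunch.io.From;
import org.apache.crunch.io.To;
import org.apache.crunch.io.hbase.HBaseSourceTarget;
import org.apache.crunch.types.writable.Writables;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseStagingExample {
  public static void main(String[] args) throws Exception {
    Pipeline pipeline = new MRPipeline(HBaseStagingExample.class, new Configuration());

    PTable<ImmutableBytesWritable, Result> rows =
        pipeline.read(new HBaseSourceTarget("my_table", new Scan()));

    // map each row to (rowkey, 1) so the shuffle data is small and typed
    PTable<String, Long> counts = rows.parallelDo(
        new MapFn<Pair<ImmutableBytesWritable, Result>, Pair<String, Long>>() {
          @Override
          public Pair<String, Long> map(Pair<ImmutableBytesWritable, Result> row) {
            return Pair.of(Bytes.toString(row.first().get()), 1L);
          }
        }, Writables.tableOf(Writables.strings(), Writables.longs()));

    // Workaround 1: pin the reducer count on the GBK; predictable but rigid
    PGroupedTable<String, Long> pinned = counts.groupByKey(50);

    // Workaround 2: stage to HDFS and force execution with run(), so the
    // follow-up job can size itself from the real bytes on disk
    pipeline.write(counts, To.sequenceFile("/tmp/staging")); // made-up path
    pipeline.run();

    PTable<String, Long> staged = pipeline.read(
        From.sequenceFile("/tmp/staging", Writables.strings(), Writables.longs()));
    // FileSourceImpl#getSize() reports the actual file size, so this
    // groupByKey() gets a data-driven reducer count
    PGroupedTable<String, Long> grouped = staged.groupByKey();

    pipeline.done();
  }
}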

I noticed the comment on HBaseSourceTarget#getSize() [1], and was wondering
whether there is anything planned to implement a proper size calculation.

[1]
https://github.com/apache/crunch/blob/apache-crunch-0.8.4/crunch-hbase/src/main/java/org/apache/crunch/io/hbase/HBaseSourceTarget.java#L173

Thanks
Nithin

Re: Question about HBaseSourceTarget#getSize()

Posted by Chao Shi <st...@live.com>.
Hi Nithin,

Because HBaseSourceTarget supports custom Scan criteria (i.e., you can apply
filters), it can hardly make a good guess at the resulting data size. Even
HBase itself, because of the nature of the LSM storage it uses, cannot
estimate the number of rows or the data size of a query before it is
actually executed.
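
For example, nothing about a scan like the following (a sketch; the family,
qualifier, and value are made up) tells you up front whether it will match
zero rows or the whole table:

import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.filter.CompareFilter;
import org.apache.hadoop.hbase.filter.SingleColumnValueFilter;
import org.apache.hadoop.hbase.util.Bytes;

public class FilteredScanExample {
  public static Scan filteredScan() {
    Scan scan = new Scan();
    // the selectivity of this filter is unknowable without running the scan
    scan.setFilter(new SingleColumnValueFilter(
        Bytes.toBytes("cf"), Bytes.toBytes("qual"),
        CompareFilter.CompareOp.EQUAL, Bytes.toBytes("some-value")));
    return scan;
  }
}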

Does anyone else have a better idea on this?
