Posted to user@hbase.apache.org by Praveen Bysani <pr...@gmail.com> on 2013/05/13 08:40:21 UTC

Block size of HBase files

Hi,

I have the dfs.block.size value set to 1 GB in my cluster configuration. I
have around 250 GB of data stored in HBase over this cluster. But when I
check the number of blocks, it doesn't correspond to the block size value I
set. From what I understand, I should only have ~250 blocks (250 GB at 1 GB
per block). But instead, when I ran fsck on /hbase/<table-name>, I got the
following:
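(fsck here is the stock HDFS checker, invoked along the lines of:
hadoop fsck /hbase/<table-name>)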

Status: HEALTHY
 Total size:    265727504820 B
 Total dirs:    1682
 Total files:   1459
 Total blocks (validated):      1459 (avg. block size 182129886 B)
 Minimally replicated blocks:   1459 (100.0 %)
 Over-replicated blocks:        0 (0.0 %)
 Under-replicated blocks:       0 (0.0 %)
 Mis-replicated blocks:         0 (0.0 %)
 Default replication factor:    3
 Average block replication:     3.0
 Corrupt blocks:                0
 Missing replicas:              0 (0.0 %)
 Number of data-nodes:          5
 Number of racks:               1

Are there any other configuration parameters that need to be set?

-- 
Regards,
Praveen Bysani
http://www.praveenbysani.com

Re: Block size of HBase files

Posted by Praveen Bysani <pr...@gmail.com>.
Hi Anoop,

No, we didn't specify anything like that while creating or writing into the table.

On 13 May 2013 20:22, Anoop John <an...@gmail.com> wrote:

> I mean, when you created the table (using a client, I guess), did you
> specify anything like splitKeys or [startKey, endKey, numRegions]?
>
> -Anoop-



-- 
Regards,
Praveen Bysani
http://www.praveenbysani.com

Re: Block size of HBase files

Posted by Anoop John <an...@gmail.com>.
I mean, when you created the table (using a client, I guess), did you
specify anything like splitKeys or [startKey, endKey, numRegions]?
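For example, pre-splitting at creation time looks roughly like this with the
Java client (a sketch only; the table name, CF name, and split keys below
are made up):

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.hbase.HBaseConfiguration;
  import org.apache.hadoop.hbase.HColumnDescriptor;
  import org.apache.hadoop.hbase.HTableDescriptor;
  import org.apache.hadoop.hbase.client.HBaseAdmin;
  import org.apache.hadoop.hbase.util.Bytes;

  Configuration conf = HBaseConfiguration.create();
  HBaseAdmin admin = new HBaseAdmin(conf);
  HTableDescriptor desc = new HTableDescriptor("mytable");
  desc.addFamily(new HColumnDescriptor("cf"));

  // Variant 1: explicit split keys -> 4 regions
  byte[][] splitKeys = new byte[][] {
      Bytes.toBytes("g"), Bytes.toBytes("n"), Bytes.toBytes("t") };
  admin.createTable(desc, splitKeys);

  // Variant 2: [startKey, endKey, numRegions] -> 10 regions
  // admin.createTable(desc, Bytes.toBytes("a"), Bytes.toBytes("z"), 10);

  admin.close();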

-Anoop-

On Mon, May 13, 2013 at 5:49 PM, Praveen Bysani <pr...@gmail.com> wrote:

> We insert data using the Java HBase client (org.apache.hadoop.hbase.client.*).
> However, we are not providing any details in the configuration object
> except for the zookeeper quorum and port number. Should we specify this
> explicitly at this stage?

Re: Block size of HBase files

Posted by Praveen Bysani <pr...@gmail.com>.
We insert data using the Java HBase client (org.apache.hadoop.hbase.client.*).
However, we are not providing any details in the configuration object except
for the zookeeper quorum and port number. Should we specify this explicitly
at this stage?
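Roughly like this (the hosts and names below are placeholders, not our real
ones):

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.hbase.HBaseConfiguration;
  import org.apache.hadoop.hbase.client.HTable;
  import org.apache.hadoop.hbase.client.Put;
  import org.apache.hadoop.hbase.util.Bytes;

  Configuration conf = HBaseConfiguration.create();
  conf.set("hbase.zookeeper.quorum", "zk1.example.com,zk2.example.com");
  conf.set("hbase.zookeeper.property.clientPort", "2181");

  HTable table = new HTable(conf, "mytable");
  Put put = new Put(Bytes.toBytes("row1"));
  put.add(Bytes.toBytes("cf"), Bytes.toBytes("qual"), Bytes.toBytes("value"));
  table.put(put);
  table.close();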

On 13 May 2013 19:54, Anoop John <an...@gmail.com> wrote:

> > now have 731 regions (each about ~350 MB!). I checked the configuration
> > in CM, and the value for hbase.hregion.max.filesize is 1 GB too!
>
> Did you specify splits at table-creation time? How did you create the
> table?
>
> -Anoop-



-- 
Regards,
Praveen Bysani
http://www.praveenbysani.com

Re: Block size of HBase files

Posted by Anoop John <an...@gmail.com>.
> now have 731 regions (each about ~350 MB!). I checked the configuration
> in CM, and the value for hbase.hregion.max.filesize is 1 GB too!

Did you specify splits at table-creation time? How did you create the
table?

-Anoop-

On Mon, May 13, 2013 at 5:18 PM, Praveen Bysani <pr...@gmail.com> wrote:

> Hi,
>
> Thanks for the details. No, I haven't run any compaction, and I have no
> idea if one is going on in the background. I executed a major_compact on
> that table and now have 731 regions (each about ~350 MB!). I checked the
> configuration in CM, and the value for hbase.hregion.max.filesize is 1 GB
> too!
>
> I am not trying to access HFiles in my MR job; in fact, I am just using a
> Pig script which handles this. This number (731) is close to my number of
> map tasks, which makes sense. But how can I decrease this? Shouldn't the
> size of each region be 1 GB with that configuration value?

Re: Block size of HBase files

Posted by Praveen Bysani <pr...@gmail.com>.
Hi,

Thanks for the details. No, I haven't run any compaction, and I have no idea
if one is going on in the background. I executed a major_compact on that
table and now have 731 regions (each about ~350 MB!). I checked the
configuration in CM, and the value for hbase.hregion.max.filesize is 1 GB
too!
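(I ran the compaction from the hbase shell, i.e. something like:
major_compact 'mytable', with 'mytable' standing in for the real table name.)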

I am not trying to access HFiles in my MR job; in fact, I am just using a Pig
script which handles this. This number (731) is close to my number of map
tasks, which makes sense. But how can I decrease this? Shouldn't the size of
each region be 1 GB with that configuration value?


On 13 May 2013 18:36, Ted Yu <yu...@gmail.com> wrote:

> You can change the HFile size through the hbase.hregion.max.filesize parameter.



-- 
Regards,
Praveen Bysani
http://www.praveenbysani.com

Re: Block size of HBase files

Posted by Ted Yu <yu...@gmail.com>.
You can change the HFile size through the hbase.hregion.max.filesize parameter.
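You can set it cluster-wide in hbase-site.xml, or override it per table. A
rough per-table sketch with the Java client (the table name is a
placeholder):

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.hbase.HBaseConfiguration;
  import org.apache.hadoop.hbase.HTableDescriptor;
  import org.apache.hadoop.hbase.client.HBaseAdmin;
  import org.apache.hadoop.hbase.util.Bytes;

  Configuration conf = HBaseConfiguration.create();
  HBaseAdmin admin = new HBaseAdmin(conf);
  byte[] table = Bytes.toBytes("mytable");

  admin.disableTable(table);
  HTableDescriptor desc = admin.getTableDescriptor(table);
  desc.setMaxFileSize(1024L * 1024 * 1024);  // 1 GB max size per region
  admin.modifyTable(table, desc);
  admin.enableTable(table);
  admin.close();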

On May 13, 2013, at 2:45 AM, Praveen Bysani <pr...@gmail.com> wrote:

> Hi,
>
> I wanted to minimize the number of map-reduce tasks generated while
> processing a job, hence I configured it to a larger value.
>
> I don't think I have configured the HFile size in the cluster. I use
> Cloudera Manager to manage my cluster, and the only configuration I can
> relate to is hfile.block.cache.size, which is set to 0.25. How do I change
> the HFile size?

Re: Block size of HBase files

Posted by Anoop John <an...@gmail.com>.
Praveen,

How many regions are there in your table, and how many CFs?
Under /hbase/<table-name> you will be able to see many files and
directories. There is a .tableinfo file, every region has a .regioninfo
file, and then under each CF are the data files (HFiles). Your total data is
250 GB. If your block size were 1 GB and you had only one file of 250 GB,
then the ~250 blocks you are looking for would make sense. But that is not
how HBase stores its data.
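(You can see this layout for yourself with something like:
hadoop fs -lsr /hbase/<table-name>)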

HFiles are created per CF per region. Also, as data comes in (writes), HBase
will by default flush it to a file in HDFS after 128 MB (the
hbase.hregion.memstore.flush.size default), so each flush makes an HDFS file
with one block in your case. Later these smaller files get merged into
bigger ones (compaction). At the time you checked, had some major
compactions been run? A major compaction merges all files under a CF within
a region into one HFile. So if you have 100 regions and 2 CFs for the table,
after a major compaction you will have 200 HFiles. (Remember, under
/hbase/<table-name> you will also be able to see some files other than the
HFiles.)

The file count and average block size in your fsck output bear this out
(and explain why you have that many blocks).
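To put numbers on it: 265727504820 B across 1459 files is about 182 MB per
file on average (matching the fsck "avg. block size"), and every file is
smaller than your 1 GB block size, so each file occupies exactly one block.
That is why the block count (1459) equals the file count (1459).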

The HFile size Amandeep was referring to is the maximum size for an HFile
(and thus for a region). If you keep writing data to a region and its size
crosses this maximum, HBase will split that region into two.

Can you try checking the file count and block count after running a major
compaction?

What MR job are you trying to run with HBase? Also, why run MR directly on
the HFiles? When you run an MR job over HBase (e.g. a scan via MR), it is
not the number of files or blocks that decides the number of mappers; it is
based on the number of regions in the table.
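For example, with the standard TableInputFormat wiring the job gets one
input split, and so one map task, per region. A rough sketch (the table name
and the do-nothing mapper are placeholders):

  import java.io.IOException;
  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.hbase.HBaseConfiguration;
  import org.apache.hadoop.hbase.client.Result;
  import org.apache.hadoop.hbase.client.Scan;
  import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
  import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
  import org.apache.hadoop.hbase.mapreduce.TableMapper;
  import org.apache.hadoop.io.NullWritable;
  import org.apache.hadoop.mapreduce.Job;
  import org.apache.hadoop.mapreduce.lib.output.NullOutputFormat;

  public class RegionScanJob {
    // Placeholder mapper: visits every row, emits nothing.
    static class MyMapper extends TableMapper<ImmutableBytesWritable, NullWritable> {
      @Override
      protected void map(ImmutableBytesWritable row, Result value, Context ctx)
          throws IOException, InterruptedException {
        // real per-row logic goes here
      }
    }

    public static void main(String[] args) throws Exception {
      Configuration conf = HBaseConfiguration.create();
      Job job = new Job(conf, "scan-mytable");
      job.setJarByClass(RegionScanJob.class);

      Scan scan = new Scan();
      scan.setCaching(500);        // bigger RPC batches for a full scan
      scan.setCacheBlocks(false);  // don't churn the block cache from MR

      // One input split (and hence one mapper) per region of "mytable".
      TableMapReduceUtil.initTableMapperJob("mytable", scan, MyMapper.class,
          ImmutableBytesWritable.class, NullWritable.class, job);
      job.setNumReduceTasks(0);
      job.setOutputFormatClass(NullOutputFormat.class);
      System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
  }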

-Anoop-

On Mon, May 13, 2013 at 3:15 PM, Praveen Bysani <pr...@gmail.com> wrote:

> Hi,
>
> I wanted to minimize the number of map-reduce tasks generated while
> processing a job, hence I configured it to a larger value.
>
> I don't think I have configured the HFile size in the cluster. I use
> Cloudera Manager to manage my cluster, and the only configuration I can
> relate to is hfile.block.cache.size, which is set to 0.25. How do I change
> the HFile size?

Re: Block size of HBase files

Posted by Praveen Bysani <pr...@gmail.com>.
Hi,

I wanted to minimize the number of map-reduce tasks generated while
processing a job, hence I configured it to a larger value.

I don't think I have configured the HFile size in the cluster. I use
Cloudera Manager to manage my cluster, and the only configuration I can
relate to is hfile.block.cache.size, which is set to 0.25. How do I change
the HFile size?

On 13 May 2013 15:03, Amandeep Khurana <am...@gmail.com> wrote:

> Just out of curiosity - why do you have it set at 1 GB?
> [...]
> What is your HFile size set to? The HFiles that get persisted would be
> bound by that number.



-- 
Regards,
Praveen Bysani
http://www.praveenbysani.com

Re: Block size of HBase files

Posted by Amandeep Khurana <am...@gmail.com>.
On Sun, May 12, 2013 at 11:40 PM, Praveen Bysani <pr...@gmail.com> wrote:

> Hi,
>
> I have the dfs.block.size value set to 1 GB in my cluster configuration.


Just out of curiosity - why do you have it set at 1 GB?


> I
> have around 250 GB of data stored in HBase over this cluster. But when I
> check the number of blocks, it doesn't correspond to the block size value I
> set. From what I understand, I should only have ~250 blocks (250 GB at 1 GB
> per block). But instead, when I ran fsck on /hbase/<table-name>, I got the
> following:
>
> Status: HEALTHY
>  Total size:    265727504820 B
>  Total dirs:    1682
>  Total files:   1459
>  Total blocks (validated):      1459 (avg. block size 182129886 B)
>  Minimally replicated blocks:   1459 (100.0 %)
>  Over-replicated blocks:        0 (0.0 %)
>  Under-replicated blocks:       0 (0.0 %)
>  Mis-replicated blocks:         0 (0.0 %)
>  Default replication factor:    3
>  Average block replication:     3.0
>  Corrupt blocks:                0
>  Missing replicas:              0 (0.0 %)
>  Number of data-nodes:          5
>  Number of racks:               1
>
> Are there any other configuration parameters that need to be set?


What is your HFile size set to? The HFiles that get persisted would be
bound by that number. Thereafter each HFile would be split into blocks, the
size of which you configure using the dfs.block.size configuration
parameter.
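Concretely, the two knobs live in different config files; a sketch with the
1 GB values discussed in this thread (note dfs.block.size only affects newly
written files):

  <!-- hdfs-site.xml: HDFS block size for new files -->
  <property>
    <name>dfs.block.size</name>
    <value>1073741824</value> <!-- 1 GB -->
  </property>

  <!-- hbase-site.xml: max HFile/region size before a region splits -->
  <property>
    <name>hbase.hregion.max.filesize</name>
    <value>1073741824</value> <!-- 1 GB -->
  </property>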


>
> --
> Regards,
> Praveen Bysani
> http://www.praveenbysani.com
>