Posted to user@hbase.apache.org by kranthi reddy <kr...@gmail.com> on 2011/12/04 15:19:33 UTC

Unexpected Data insertion time and Data size explosion

Hi all,

    I am a newbie to HBase and Hadoop. I have set up a cluster of 4 machines
and am trying to insert data. 3 of the machines are tasktrackers, with 4
map tasks each.

    My data consists of about 1.3 billion rows with 4 columns each (a 100GB
txt file). The column structure is "rowID, word1, word2, word3". The DFS
replication factor in both Hadoop and HBase is set to 3. I use a single
column family with 3 qualifiers, one for each field (word*).

    I am using the SampleUploader present in the HBase distribution. It has
taken around 21 hrs to complete 40% of the insertion and it's still running,
with 12 map tasks. Is the insertion time here in the expected range? When I
used Lucene, I was able to index the entire data set in about 8 hours.

    Also, there seems to be a huge explosion of data size here. With a
replication factor of 3, I was expecting the inserted table to be around
350-400GB (300GB for replicating the 100GB txt file 3 times, plus 50+ GB for
additional storage information). But at only 40% of the insertion, the space
occupied is already around 550GB, so it looks like it might take around 1.2TB
for a 100GB file. I have stored the rowID as a String instead of a Long. Would
that account for such a rapid increase in data storage?

Regards,
Kranthi

Re: Unexpected Data insertion time and Data size explosion

Posted by kranthi reddy <kr...@gmail.com>.
Hi all,

     I have now understood clearly why my storage is occupying such huge
space.

     I still have an issue with the insertion time. I currently have 0.1
billion records in HBase format (in future it will run into a few billion) and
am inserting them using 12 map tasks running on a 4-machine Hadoop cluster.

     The time taken is approximately 3 hours, which works out to roughly 770
rows inserted per map task per second. Is this good, or can it be improved?

      0.1 billion -> 100,000,000 / (180 min * 60 sec * 12 map tasks) ~ 770.

 I have tried using the batch() function, but there is no improvement in the
insertion time.

I have attached the code that I am using to insert. Can someone please check
whether what I am doing is the fastest and best way to insert the data? A
simplified sketch of the approach follows below.
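
Since the attachment may not come through on the list, here is a minimal
sketch of the kind of batched insert I mean (not the exact attached code; the
table name, the generated values and the batch size are placeholders):

  import java.io.IOException;
  import java.util.ArrayList;
  import java.util.List;
  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.hbase.HBaseConfiguration;
  import org.apache.hadoop.hbase.client.HTable;
  import org.apache.hadoop.hbase.client.Put;
  import org.apache.hadoop.hbase.util.Bytes;

  public class BatchedInsertSketch {
    private static final byte[] FAMILY = Bytes.toBytes("f");
    private static final int BATCH_SIZE = 1000;  // placeholder batch size

    public static void main(String[] args) throws IOException {
      Configuration conf = HBaseConfiguration.create();
      HTable table = new HTable(conf, "mytable");  // placeholder table name
      List<Put> batch = new ArrayList<Put>(BATCH_SIZE);
      for (long rowId = 0; rowId < 1000000; rowId++) {  // stands in for reading input lines
        Put put = new Put(Bytes.toBytes(Long.toString(rowId)));
        put.add(FAMILY, Bytes.toBytes("c"), Bytes.toBytes("word1"));
        put.add(FAMILY, Bytes.toBytes("e1"), Bytes.toBytes("word2"));
        put.add(FAMILY, Bytes.toBytes("e2"), Bytes.toBytes("word3"));
        batch.add(put);
        if (batch.size() >= BATCH_SIZE) {
          table.put(batch);  // one RPC round per batch instead of one per row
          batch.clear();
        }
      }
      if (!batch.isEmpty()) {
        table.put(batch);    // flush the remaining tail
      }
      table.close();
    }
  }
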
Regards,
Kranthi



On Mon, Dec 5, 2011 at 11:12 PM, Doug Meil <do...@explorysmedical.com>wrote:

>
> Hi there-
>
> Have you looked at this?
>
> http://hbase.apache.org/book.html#keyvalue
>
>
>
>
>
> On 12/5/11 11:33 AM, "kranthi reddy" <kr...@gmail.com> wrote:
>
> >Ok. But can some1 explain why the data size is exploding the way I have
> >mentioned earlier.
> >
> >I have tried to insert sample data of arnd 12GB. The data occupied by
> >Hbase
> >table is arnd 130GB. All my columns i.e. including the ROWID are strings.
> >I
> >have even tried converting by ROWID to long, but that seems to occupy more
> >space i.e. arnd 150GB.
> >
> >Sample rows
> >
> >0-<>-f-<>-c-<>-Anarchism
> >0-<>-f-<>-e1-<>-Routledge Encyclopedia of Philosophy
> >0-<>-f-<>-e2-<>-anarchy
> >1-<>-f-<>-c-<>-Anarchism
> >1-<>-f-<>-e1-<>-anarchy
> >1-<>-f-<>-e2-<>-state (polity)
> >2-<>-f-<>-c-<>-Anarchism
> >2-<>-f-<>-e1-<>-anarchy
> >2-<>-f-<>-e2-<>-political philosophy
> >3-<>-f-<>-c-<>-Anarchism
> >3-<>-f-<>-e1-<>-The Globe and Mail
> >3-<>-f-<>-e2-<>-anarchy
> >4-<>-f-<>-c-<>-Anarchism
> >4-<>-f-<>-e1-<>-anarchy
> >4-<>-f-<>-e2-<>-stateless society
> >
> >Is there a way I can know the number of bytes occupied by each key:value
> >for each cell ???
> >
> >On Mon, Dec 5, 2011 at 8:43 PM, Ulrich Staudinger <
> >ustaudinger@activequant.org> wrote:
> >
> >> the point, I refer to is not so much about when hbase's server side
> >> flushes, but when the client side flushes.
> >> If you put every value immediately, it will result every time in an RPC
> >> call. If you collect the data on the client side and flush (on the
> >>client
> >> side) manually, it will result in one RPC call with hundred or thousand
> >> small puts inside, instead of hundred or thousands individual put RPC
> >> calls.
> >>
> >> Another issue is, I am not so sure what happens if you collect hundreds
> >>of
> >> thousands of small puts, which might possibly be bigger than the
> >>memstore,
> >> and flush then. I guess the hbase client will hang.
> >>
> >>
> >>
> >>
> >> On Mon, Dec 5, 2011 at 10:10 AM, kranthi reddy <kranthili2020@gmail.com
> >> >wrote:
> >>
> >> > Doesn't the configuration setting "hbase.hregion.memstore.flush.size"
> >>do
> >> > the bulk insert ??? I was of the opinion that Hbase would flush all
> >>the
> >> > puts to the disk when it's memstore is filled, whose property is
> >>defined
> >> in
> >> > hbase-default.xml. Is my understanding wrong here ???
> >> >
> >> >
> >> >
> >> > On Mon, Dec 5, 2011 at 1:26 PM, Ulrich Staudinger <
> >> > ustaudinger@activequant.org> wrote:
> >> >
> >> > > Hi there,
> >> > >
> >> > > while I cannot give you any concrete advice on your particular
> >>storage
> >> > > problem, I can share some experiences with you regarding
> >>performance.
> >> > >
> >> > > I also bulk import data regularly, which is around 4GB every day in
> >> about
> >> > > 150 files with something between 10'000 to 30'000 lines in it.
> >> > >
> >> > > My first approach was to read every line and put it separately.
> >>Which
> >> > > resulted in a load time of about an hour. My next approach was to
> >>read
> >> an
> >> > > entire file, put each individual put into a list and then store the
> >> > entire
> >> > > list at once. This works fast in the beginning, but after about 20
> >> files,
> >> > > the server ran into compactions and couldn't cope with the load and
> >> > > finally, the master crashed, leaving regionserver and zookeeper
> >> running.
> >> > To
> >> > > HBase's defense, I have to say that I did this on a standalone
> >> > installation
> >> > > without Hadoop underneath, so the test may not be entirely fair.
> >> > > Next, I switched to a proper Hadoop layer with HBase on top. I now
> >>also
> >> > put
> >> > > around 100 - 1000 lines (or puts) at once, in a bulk commit, and
> >>have
> >> > > insert times of around 0.5ms per row - which is very decent. My
> >>entire
> >> > > import now takes only 7 minutes.
> >> > >
> >> > > I think you must find a balance regarding the performance of your
> >> servers
> >> > > and how quick they are with compactions and the amount of data you
> >>put
> >> at
> >> > > once. I have definitely found single puts to result in low
> >>performance.
> >> > >
> >> > > Best regards,
> >> > > Ulrich
> >> > >
> >> > >
> >> > >
> >> > >
> >> > >
> >> > > On Mon, Dec 5, 2011 at 6:23 AM, kranthi reddy
> >><kranthili2020@gmail.com
> >> > > >wrote:
> >> > >
> >> > > > No, I split the table on the fly. This I have done because
> >>converting
> >> > my
> >> > > > table into Hbase format (rowID, family, qualifier, value) would
> >> result
> >> > in
> >> > > > the input file being arnd 300GB. Hence, I had decided to do the
> >> > splitting
> >> > > > and generating this format on the fly.
> >> > > >
> >> > > > Will this effect the performance so heavily ???
> >> > > >
> >> > > > On Mon, Dec 5, 2011 at 1:21 AM, <yu...@gmail.com> wrote:
> >> > > >
> >> > > > > May I ask whether you pre-split your table before loading ?
> >> > > > >
> >> > > > >
> >> > > > >
> >> > > > > On Dec 4, 2011, at 6:19 AM, kranthi reddy
> >><kranthili2020@gmail.com
> >> >
> >> > > > wrote:
> >> > > > >
> >> > > > > > Hi all,
> >> > > > > >
> >> > > > > >    I am a newbie to Hbase and Hadoop. I have setup a cluster
> >>of 4
> >> > > > > machines
> >> > > > > > and am trying to insert data. 3 of the machines are
> >>tasktrackers,
> >> > > with
> >> > > > 4
> >> > > > > > map tasks each.
> >> > > > > >
> >> > > > > >    My data consists of about 1.3 billion rows with 4 columns
> >>each
> >> > > > (100GB
> >> > > > > > txt file). The column structure is "rowID, word1, word2,
> >>word3".
> >> >  My
> >> > > > DFS
> >> > > > > > replication in hadoop and hbase is set to 3 each. I have put
> >>only
> >> > one
> >> > > > > > column family and 3 qualifiers for each field (word*).
> >> > > > > >
> >> > > > > >    I am using the SampleUploader present in the HBase
> >> distribution.
> >> > > To
> >> > > > > > complete 40% of the insertion, it has taken around 21 hrs and
> >> it's
> >> > > > still
> >> > > > > > running. I have 12 map tasks running.* I would like to know is
> >> the
> >> > > > > > insertion time taken here on expected lines ??? Because when I
> >> used
> >> > > > > lucene,
> >> > > > > > I was able to insert the entire data in about 8 hours.*
> >> > > > > >
> >> > > > > >    Also, there seems to be huge explosion of data size here.
> >> With a
> >> > > > > > replication factor of 3 for HBase, I was expecting the table
> >>size
> >> > > > > inserted
> >> > > > > > to be around 350-400GB. (350-400GB for an 100GB txt file I
> >>have,
> >> > > 300GB
> >> > > > > for
> >> > > > > > replicating the data 3 times and 50+ GB for additional storage
> >> > > > > > information). But even for 40% completion of data insertion,
> >>the
> >> > > space
> >> > > > > > occupied is around 550GB (Looks like it might take around
> >>1.2TB
> >> for
> >> > > an
> >> > > > > > 100GB file).* I have used the rowID to be a String, instead of
> >> > Long.
> >> > > > Will
> >> > > > > > that account for such rapid increase in data storage???
> >> > > > > > *
> >> > > > > >
> >> > > > > > Regards,
> >> > > > > > Kranthi
> >> > > > >
> >> > > >
> >> > > >
> >> > > >
> >> > > > --
> >> > > > Kranthi Reddy. B
> >> > > >
> >> > > > http://www.setusoftware.com/setu/index.htm
> >> > > >
> >> > >
> >> >
> >> >
> >> >
> >> > --
> >> > Kranthi Reddy. B
> >> >
> >> > http://www.setusoftware.com/setu/index.htm
> >> >
> >>
> >
> >
> >
> >--
> >Kranthi Reddy. B
> >
> >http://www.setusoftware.com/setu/index.htm
>
>
>


-- 
Kranthi Reddy. B

http://www.setusoftware.com/setu/index.htm

Re: Unexpected Data insertion time and Data size explosion

Posted by kranthi reddy <kr...@gmail.com>.
1) Does having a dfs.replication factor of 3 in general result in a table
data size of 3x + y (where x is the size of the file on the local file system
and y is some additional space for meta information)?

2) Does HBase pre-allocate space for all the cell versions when a cell is
created for the first time?

Unfortunately, I am just unable to wrap my head around such a large increase
in data size. Unless (2) is what is happening (which I doubt), I just don't
see how such growth of the table data is possible.

3) Or is it a case of my KEY being larger than my VALUE, and hence resulting
in such a large increase in size?

Similar to the sample rows quoted below, I have around 300 million entries,
and the ROWID increases linearly. A rough per-cell estimate follows below.
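
As a rough estimate for (3), assuming the KeyValue layout described in the
HBase book (http://hbase.apache.org/book.html#keyvalue), every cell stores its
full key alongside the value:

  cell size ~ 4 (key length) + 4 (value length)
            + 2 (row length) + row + 1 (family length) + family + qualifier
            + 8 (timestamp) + 1 (key type) + value

For a sample cell with row "3", family "f", qualifier "e1" and value "The
Globe and Mail", that is roughly 20 bytes of fixed overhead plus 1 + 1 + 2 +
18 = 22 bytes of data, i.e. about 42 bytes on disk for 18 bytes of value. For
short values the repeated key can easily double or triple the raw data size
before HDFS replication, and together with 3x replication and any
not-yet-compacted store files that would go a long way towards explaining
12GB of text turning into well over 100GB on disk.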

On Mon, Dec 5, 2011 at 10:03 PM, kranthi reddy <kr...@gmail.com>wrote:

> Ok. But can some1 explain why the data size is exploding the way I have
> mentioned earlier.
>
> I have tried to insert sample data of arnd 12GB. The data occupied by
> Hbase table is arnd 130GB. All my columns i.e. including the ROWID are
> strings. I have even tried converting by ROWID to long, but that seems to
> occupy more space i.e. arnd 150GB.
>
> Sample rows
>
> 0-<>-f-<>-c-<>-Anarchism
> 0-<>-f-<>-e1-<>-Routledge Encyclopedia of Philosophy
> 0-<>-f-<>-e2-<>-anarchy
> 1-<>-f-<>-c-<>-Anarchism
> 1-<>-f-<>-e1-<>-anarchy
> 1-<>-f-<>-e2-<>-state (polity)
> 2-<>-f-<>-c-<>-Anarchism
> 2-<>-f-<>-e1-<>-anarchy
> 2-<>-f-<>-e2-<>-political philosophy
> 3-<>-f-<>-c-<>-Anarchism
> 3-<>-f-<>-e1-<>-The Globe and Mail
> 3-<>-f-<>-e2-<>-anarchy
> 4-<>-f-<>-c-<>-Anarchism
> 4-<>-f-<>-e1-<>-anarchy
> 4-<>-f-<>-e2-<>-stateless society
>
> Is there a way I can know the number of bytes occupied by each key:value
> for each cell ???
>
>
> On Mon, Dec 5, 2011 at 8:43 PM, Ulrich Staudinger <
> ustaudinger@activequant.org> wrote:
>
>> the point, I refer to is not so much about when hbase's server side
>> flushes, but when the client side flushes.
>> If you put every value immediately, it will result every time in an RPC
>> call. If you collect the data on the client side and flush (on the client
>> side) manually, it will result in one RPC call with hundred or thousand
>> small puts inside, instead of hundred or thousands individual put RPC
>> calls.
>>
>> Another issue is, I am not so sure what happens if you collect hundreds of
>> thousands of small puts, which might possibly be bigger than the memstore,
>> and flush then. I guess the hbase client will hang.
>>
>>
>>
>>
>> On Mon, Dec 5, 2011 at 10:10 AM, kranthi reddy <kranthili2020@gmail.com
>> >wrote:
>>
>> > Doesn't the configuration setting "hbase.hregion.memstore.flush.size" do
>> > the bulk insert ??? I was of the opinion that Hbase would flush all the
>> > puts to the disk when it's memstore is filled, whose property is
>> defined in
>> > hbase-default.xml. Is my understanding wrong here ???
>> >
>> >
>> >
>> > On Mon, Dec 5, 2011 at 1:26 PM, Ulrich Staudinger <
>> > ustaudinger@activequant.org> wrote:
>> >
>> > > Hi there,
>> > >
>> > > while I cannot give you any concrete advice on your particular storage
>> > > problem, I can share some experiences with you regarding performance.
>> > >
>> > > I also bulk import data regularly, which is around 4GB every day in
>> about
>> > > 150 files with something between 10'000 to 30'000 lines in it.
>> > >
>> > > My first approach was to read every line and put it separately. Which
>> > > resulted in a load time of about an hour. My next approach was to
>> read an
>> > > entire file, put each individual put into a list and then store the
>> > entire
>> > > list at once. This works fast in the beginning, but after about 20
>> files,
>> > > the server ran into compactions and couldn't cope with the load and
>> > > finally, the master crashed, leaving regionserver and zookeeper
>> running.
>> > To
>> > > HBase's defense, I have to say that I did this on a standalone
>> > installation
>> > > without Hadoop underneath, so the test may not be entirely fair.
>> > > Next, I switched to a proper Hadoop layer with HBase on top. I now
>> also
>> > put
>> > > around 100 - 1000 lines (or puts) at once, in a bulk commit, and have
>> > > insert times of around 0.5ms per row - which is very decent. My entire
>> > > import now takes only 7 minutes.
>> > >
>> > > I think you must find a balance regarding the performance of your
>> servers
>> > > and how quick they are with compactions and the amount of data you
>> put at
>> > > once. I have definitely found single puts to result in low
>> performance.
>> > >
>> > > Best regards,
>> > > Ulrich
>> > >
>> > >
>> > >
>> > >
>> > >
>> > > On Mon, Dec 5, 2011 at 6:23 AM, kranthi reddy <
>> kranthili2020@gmail.com
>> > > >wrote:
>> > >
>> > > > No, I split the table on the fly. This I have done because
>> converting
>> > my
>> > > > table into Hbase format (rowID, family, qualifier, value) would
>> result
>> > in
>> > > > the input file being arnd 300GB. Hence, I had decided to do the
>> > splitting
>> > > > and generating this format on the fly.
>> > > >
>> > > > Will this effect the performance so heavily ???
>> > > >
>> > > > On Mon, Dec 5, 2011 at 1:21 AM, <yu...@gmail.com> wrote:
>> > > >
>> > > > > May I ask whether you pre-split your table before loading ?
>> > > > >
>> > > > >
>> > > > >
>> > > > > On Dec 4, 2011, at 6:19 AM, kranthi reddy <
>> kranthili2020@gmail.com>
>> > > > wrote:
>> > > > >
>> > > > > > Hi all,
>> > > > > >
>> > > > > >    I am a newbie to Hbase and Hadoop. I have setup a cluster of
>> 4
>> > > > > machines
>> > > > > > and am trying to insert data. 3 of the machines are
>> tasktrackers,
>> > > with
>> > > > 4
>> > > > > > map tasks each.
>> > > > > >
>> > > > > >    My data consists of about 1.3 billion rows with 4 columns
>> each
>> > > > (100GB
>> > > > > > txt file). The column structure is "rowID, word1, word2, word3".
>> >  My
>> > > > DFS
>> > > > > > replication in hadoop and hbase is set to 3 each. I have put
>> only
>> > one
>> > > > > > column family and 3 qualifiers for each field (word*).
>> > > > > >
>> > > > > >    I am using the SampleUploader present in the HBase
>> distribution.
>> > > To
>> > > > > > complete 40% of the insertion, it has taken around 21 hrs and
>> it's
>> > > > still
>> > > > > > running. I have 12 map tasks running.* I would like to know is
>> the
>> > > > > > insertion time taken here on expected lines ??? Because when I
>> used
>> > > > > lucene,
>> > > > > > I was able to insert the entire data in about 8 hours.*
>> > > > > >
>> > > > > >    Also, there seems to be huge explosion of data size here.
>> With a
>> > > > > > replication factor of 3 for HBase, I was expecting the table
>> size
>> > > > > inserted
>> > > > > > to be around 350-400GB. (350-400GB for an 100GB txt file I have,
>> > > 300GB
>> > > > > for
>> > > > > > replicating the data 3 times and 50+ GB for additional storage
>> > > > > > information). But even for 40% completion of data insertion, the
>> > > space
>> > > > > > occupied is around 550GB (Looks like it might take around 1.2TB
>> for
>> > > an
>> > > > > > 100GB file).* I have used the rowID to be a String, instead of
>> > Long.
>> > > > Will
>> > > > > > that account for such rapid increase in data storage???
>> > > > > > *
>> > > > > >
>> > > > > > Regards,
>> > > > > > Kranthi
>> > > > >
>> > > >
>> > > >
>> > > >
>> > > > --
>> > > > Kranthi Reddy. B
>> > > >
>> > > > http://www.setusoftware.com/setu/index.htm
>> > > >
>> > >
>> >
>> >
>> >
>> > --
>> > Kranthi Reddy. B
>> >
>> > http://www.setusoftware.com/setu/index.htm
>> >
>>
>
>
>
> --
> Kranthi Reddy. B
>
> http://www.setusoftware.com/setu/index.htm
>



-- 
Kranthi Reddy. B

http://www.setusoftware.com/setu/index.htm

Re: Unexpected Data insertion time and Data size explosion

Posted by Doug Meil <do...@explorysmedical.com>.
Hi there-

Have you looked at this?

http://hbase.apache.org/book.html#keyvalue





On 12/5/11 11:33 AM, "kranthi reddy" <kr...@gmail.com> wrote:

>Ok. But can some1 explain why the data size is exploding the way I have
>mentioned earlier.
>
>I have tried to insert sample data of arnd 12GB. The data occupied by
>Hbase
>table is arnd 130GB. All my columns i.e. including the ROWID are strings.
>I
>have even tried converting by ROWID to long, but that seems to occupy more
>space i.e. arnd 150GB.
>
>Sample rows
>
>0-<>-f-<>-c-<>-Anarchism
>0-<>-f-<>-e1-<>-Routledge Encyclopedia of Philosophy
>0-<>-f-<>-e2-<>-anarchy
>1-<>-f-<>-c-<>-Anarchism
>1-<>-f-<>-e1-<>-anarchy
>1-<>-f-<>-e2-<>-state (polity)
>2-<>-f-<>-c-<>-Anarchism
>2-<>-f-<>-e1-<>-anarchy
>2-<>-f-<>-e2-<>-political philosophy
>3-<>-f-<>-c-<>-Anarchism
>3-<>-f-<>-e1-<>-The Globe and Mail
>3-<>-f-<>-e2-<>-anarchy
>4-<>-f-<>-c-<>-Anarchism
>4-<>-f-<>-e1-<>-anarchy
>4-<>-f-<>-e2-<>-stateless society
>
>Is there a way I can know the number of bytes occupied by each key:value
>for each cell ???
>
>On Mon, Dec 5, 2011 at 8:43 PM, Ulrich Staudinger <
>ustaudinger@activequant.org> wrote:
>
>> the point, I refer to is not so much about when hbase's server side
>> flushes, but when the client side flushes.
>> If you put every value immediately, it will result every time in an RPC
>> call. If you collect the data on the client side and flush (on the
>>client
>> side) manually, it will result in one RPC call with hundred or thousand
>> small puts inside, instead of hundred or thousands individual put RPC
>> calls.
>>
>> Another issue is, I am not so sure what happens if you collect hundreds
>>of
>> thousands of small puts, which might possibly be bigger than the
>>memstore,
>> and flush then. I guess the hbase client will hang.
>>
>>
>>
>>
>> On Mon, Dec 5, 2011 at 10:10 AM, kranthi reddy <kranthili2020@gmail.com
>> >wrote:
>>
>> > Doesn't the configuration setting "hbase.hregion.memstore.flush.size"
>>do
>> > the bulk insert ??? I was of the opinion that Hbase would flush all
>>the
>> > puts to the disk when it's memstore is filled, whose property is
>>defined
>> in
>> > hbase-default.xml. Is my understanding wrong here ???
>> >
>> >
>> >
>> > On Mon, Dec 5, 2011 at 1:26 PM, Ulrich Staudinger <
>> > ustaudinger@activequant.org> wrote:
>> >
>> > > Hi there,
>> > >
>> > > while I cannot give you any concrete advice on your particular
>>storage
>> > > problem, I can share some experiences with you regarding
>>performance.
>> > >
>> > > I also bulk import data regularly, which is around 4GB every day in
>> about
>> > > 150 files with something between 10'000 to 30'000 lines in it.
>> > >
>> > > My first approach was to read every line and put it separately.
>>Which
>> > > resulted in a load time of about an hour. My next approach was to
>>read
>> an
>> > > entire file, put each individual put into a list and then store the
>> > entire
>> > > list at once. This works fast in the beginning, but after about 20
>> files,
>> > > the server ran into compactions and couldn't cope with the load and
>> > > finally, the master crashed, leaving regionserver and zookeeper
>> running.
>> > To
>> > > HBase's defense, I have to say that I did this on a standalone
>> > installation
>> > > without Hadoop underneath, so the test may not be entirely fair.
>> > > Next, I switched to a proper Hadoop layer with HBase on top. I now
>>also
>> > put
>> > > around 100 - 1000 lines (or puts) at once, in a bulk commit, and
>>have
>> > > insert times of around 0.5ms per row - which is very decent. My
>>entire
>> > > import now takes only 7 minutes.
>> > >
>> > > I think you must find a balance regarding the performance of your
>> servers
>> > > and how quick they are with compactions and the amount of data you
>>put
>> at
>> > > once. I have definitely found single puts to result in low
>>performance.
>> > >
>> > > Best regards,
>> > > Ulrich
>> > >
>> > >
>> > >
>> > >
>> > >
>> > > On Mon, Dec 5, 2011 at 6:23 AM, kranthi reddy
>><kranthili2020@gmail.com
>> > > >wrote:
>> > >
>> > > > No, I split the table on the fly. This I have done because
>>converting
>> > my
>> > > > table into Hbase format (rowID, family, qualifier, value) would
>> result
>> > in
>> > > > the input file being arnd 300GB. Hence, I had decided to do the
>> > splitting
>> > > > and generating this format on the fly.
>> > > >
>> > > > Will this effect the performance so heavily ???
>> > > >
>> > > > On Mon, Dec 5, 2011 at 1:21 AM, <yu...@gmail.com> wrote:
>> > > >
>> > > > > May I ask whether you pre-split your table before loading ?
>> > > > >
>> > > > >
>> > > > >
>> > > > > On Dec 4, 2011, at 6:19 AM, kranthi reddy
>><kranthili2020@gmail.com
>> >
>> > > > wrote:
>> > > > >
>> > > > > > Hi all,
>> > > > > >
>> > > > > >    I am a newbie to Hbase and Hadoop. I have setup a cluster
>>of 4
>> > > > > machines
>> > > > > > and am trying to insert data. 3 of the machines are
>>tasktrackers,
>> > > with
>> > > > 4
>> > > > > > map tasks each.
>> > > > > >
>> > > > > >    My data consists of about 1.3 billion rows with 4 columns
>>each
>> > > > (100GB
>> > > > > > txt file). The column structure is "rowID, word1, word2,
>>word3".
>> >  My
>> > > > DFS
>> > > > > > replication in hadoop and hbase is set to 3 each. I have put
>>only
>> > one
>> > > > > > column family and 3 qualifiers for each field (word*).
>> > > > > >
>> > > > > >    I am using the SampleUploader present in the HBase
>> distribution.
>> > > To
>> > > > > > complete 40% of the insertion, it has taken around 21 hrs and
>> it's
>> > > > still
>> > > > > > running. I have 12 map tasks running.* I would like to know is
>> the
>> > > > > > insertion time taken here on expected lines ??? Because when I
>> used
>> > > > > lucene,
>> > > > > > I was able to insert the entire data in about 8 hours.*
>> > > > > >
>> > > > > >    Also, there seems to be huge explosion of data size here.
>> With a
>> > > > > > replication factor of 3 for HBase, I was expecting the table
>>size
>> > > > > inserted
>> > > > > > to be around 350-400GB. (350-400GB for an 100GB txt file I
>>have,
>> > > 300GB
>> > > > > for
>> > > > > > replicating the data 3 times and 50+ GB for additional storage
>> > > > > > information). But even for 40% completion of data insertion,
>>the
>> > > space
>> > > > > > occupied is around 550GB (Looks like it might take around
>>1.2TB
>> for
>> > > an
>> > > > > > 100GB file).* I have used the rowID to be a String, instead of
>> > Long.
>> > > > Will
>> > > > > > that account for such rapid increase in data storage???
>> > > > > > *
>> > > > > >
>> > > > > > Regards,
>> > > > > > Kranthi
>> > > > >
>> > > >
>> > > >
>> > > >
>> > > > --
>> > > > Kranthi Reddy. B
>> > > >
>> > > > http://www.setusoftware.com/setu/index.htm
>> > > >
>> > >
>> >
>> >
>> >
>> > --
>> > Kranthi Reddy. B
>> >
>> > http://www.setusoftware.com/setu/index.htm
>> >
>>
>
>
>
>-- 
>Kranthi Reddy. B
>
>http://www.setusoftware.com/setu/index.htm



Re: Unexpected Data insertion time and Data size explosion

Posted by kranthi reddy <kr...@gmail.com>.
OK. But can someone explain why the data size is exploding the way I
mentioned earlier?

I have tried inserting a sample data set of around 12GB. The space occupied by
the HBase table is around 130GB. All my columns, including the ROWID, are
strings. I have even tried converting the ROWID to long, but that seems to
occupy more space, around 150GB.

Sample rows

0-<>-f-<>-c-<>-Anarchism
0-<>-f-<>-e1-<>-Routledge Encyclopedia of Philosophy
0-<>-f-<>-e2-<>-anarchy
1-<>-f-<>-c-<>-Anarchism
1-<>-f-<>-e1-<>-anarchy
1-<>-f-<>-e2-<>-state (polity)
2-<>-f-<>-c-<>-Anarchism
2-<>-f-<>-e1-<>-anarchy
2-<>-f-<>-e2-<>-political philosophy
3-<>-f-<>-c-<>-Anarchism
3-<>-f-<>-e1-<>-The Globe and Mail
3-<>-f-<>-e2-<>-anarchy
4-<>-f-<>-c-<>-Anarchism
4-<>-f-<>-e1-<>-anarchy
4-<>-f-<>-e2-<>-stateless society

Is there a way I can find out the number of bytes occupied by the key:value
of each cell? Something like the sketch below is what I have in mind.
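
(A rough sketch, assuming the 0.90/0.92-era client API, where a Put exposes
its KeyValues through getFamilyMap(); the sample row is taken from above, and
compression and HFile index overhead are ignored.)

  import java.util.List;
  import java.util.Map;
  import org.apache.hadoop.hbase.KeyValue;
  import org.apache.hadoop.hbase.client.Put;
  import org.apache.hadoop.hbase.util.Bytes;

  public class CellSizeSketch {
    public static void main(String[] args) {
      Put put = new Put(Bytes.toBytes("3"));  // sample rowID from above
      put.add(Bytes.toBytes("f"), Bytes.toBytes("c"), Bytes.toBytes("Anarchism"));
      put.add(Bytes.toBytes("f"), Bytes.toBytes("e1"), Bytes.toBytes("The Globe and Mail"));
      put.add(Bytes.toBytes("f"), Bytes.toBytes("e2"), Bytes.toBytes("anarchy"));

      long total = 0;
      // getFamilyMap() exposes the KeyValues the client will send to the server
      for (Map.Entry<byte[], List<KeyValue>> entry : put.getFamilyMap().entrySet()) {
        for (KeyValue kv : entry.getValue()) {
          System.out.println(Bytes.toString(kv.getQualifier()) + " -> "
              + kv.getLength() + " bytes");
          total += kv.getLength();
        }
      }
      System.out.println("row total: " + total + " bytes (before HDFS replication)");
    }
  }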

On Mon, Dec 5, 2011 at 8:43 PM, Ulrich Staudinger <
ustaudinger@activequant.org> wrote:

> the point, I refer to is not so much about when hbase's server side
> flushes, but when the client side flushes.
> If you put every value immediately, it will result every time in an RPC
> call. If you collect the data on the client side and flush (on the client
> side) manually, it will result in one RPC call with hundred or thousand
> small puts inside, instead of hundred or thousands individual put RPC
> calls.
>
> Another issue is, I am not so sure what happens if you collect hundreds of
> thousands of small puts, which might possibly be bigger than the memstore,
> and flush then. I guess the hbase client will hang.
>
>
>
>
> On Mon, Dec 5, 2011 at 10:10 AM, kranthi reddy <kranthili2020@gmail.com
> >wrote:
>
> > Doesn't the configuration setting "hbase.hregion.memstore.flush.size" do
> > the bulk insert ??? I was of the opinion that Hbase would flush all the
> > puts to the disk when it's memstore is filled, whose property is defined
> in
> > hbase-default.xml. Is my understanding wrong here ???
> >
> >
> >
> > On Mon, Dec 5, 2011 at 1:26 PM, Ulrich Staudinger <
> > ustaudinger@activequant.org> wrote:
> >
> > > Hi there,
> > >
> > > while I cannot give you any concrete advice on your particular storage
> > > problem, I can share some experiences with you regarding performance.
> > >
> > > I also bulk import data regularly, which is around 4GB every day in
> about
> > > 150 files with something between 10'000 to 30'000 lines in it.
> > >
> > > My first approach was to read every line and put it separately. Which
> > > resulted in a load time of about an hour. My next approach was to read
> an
> > > entire file, put each individual put into a list and then store the
> > entire
> > > list at once. This works fast in the beginning, but after about 20
> files,
> > > the server ran into compactions and couldn't cope with the load and
> > > finally, the master crashed, leaving regionserver and zookeeper
> running.
> > To
> > > HBase's defense, I have to say that I did this on a standalone
> > installation
> > > without Hadoop underneath, so the test may not be entirely fair.
> > > Next, I switched to a proper Hadoop layer with HBase on top. I now also
> > put
> > > around 100 - 1000 lines (or puts) at once, in a bulk commit, and have
> > > insert times of around 0.5ms per row - which is very decent. My entire
> > > import now takes only 7 minutes.
> > >
> > > I think you must find a balance regarding the performance of your
> servers
> > > and how quick they are with compactions and the amount of data you put
> at
> > > once. I have definitely found single puts to result in low performance.
> > >
> > > Best regards,
> > > Ulrich
> > >
> > >
> > >
> > >
> > >
> > > On Mon, Dec 5, 2011 at 6:23 AM, kranthi reddy <kranthili2020@gmail.com
> > > >wrote:
> > >
> > > > No, I split the table on the fly. This I have done because converting
> > my
> > > > table into Hbase format (rowID, family, qualifier, value) would
> result
> > in
> > > > the input file being arnd 300GB. Hence, I had decided to do the
> > splitting
> > > > and generating this format on the fly.
> > > >
> > > > Will this effect the performance so heavily ???
> > > >
> > > > On Mon, Dec 5, 2011 at 1:21 AM, <yu...@gmail.com> wrote:
> > > >
> > > > > May I ask whether you pre-split your table before loading ?
> > > > >
> > > > >
> > > > >
> > > > > On Dec 4, 2011, at 6:19 AM, kranthi reddy <kranthili2020@gmail.com
> >
> > > > wrote:
> > > > >
> > > > > > Hi all,
> > > > > >
> > > > > >    I am a newbie to Hbase and Hadoop. I have setup a cluster of 4
> > > > > machines
> > > > > > and am trying to insert data. 3 of the machines are tasktrackers,
> > > with
> > > > 4
> > > > > > map tasks each.
> > > > > >
> > > > > >    My data consists of about 1.3 billion rows with 4 columns each
> > > > (100GB
> > > > > > txt file). The column structure is "rowID, word1, word2, word3".
> >  My
> > > > DFS
> > > > > > replication in hadoop and hbase is set to 3 each. I have put only
> > one
> > > > > > column family and 3 qualifiers for each field (word*).
> > > > > >
> > > > > >    I am using the SampleUploader present in the HBase
> distribution.
> > > To
> > > > > > complete 40% of the insertion, it has taken around 21 hrs and
> it's
> > > > still
> > > > > > running. I have 12 map tasks running.* I would like to know is
> the
> > > > > > insertion time taken here on expected lines ??? Because when I
> used
> > > > > lucene,
> > > > > > I was able to insert the entire data in about 8 hours.*
> > > > > >
> > > > > >    Also, there seems to be huge explosion of data size here.
> With a
> > > > > > replication factor of 3 for HBase, I was expecting the table size
> > > > > inserted
> > > > > > to be around 350-400GB. (350-400GB for an 100GB txt file I have,
> > > 300GB
> > > > > for
> > > > > > replicating the data 3 times and 50+ GB for additional storage
> > > > > > information). But even for 40% completion of data insertion, the
> > > space
> > > > > > occupied is around 550GB (Looks like it might take around 1.2TB
> for
> > > an
> > > > > > 100GB file).* I have used the rowID to be a String, instead of
> > Long.
> > > > Will
> > > > > > that account for such rapid increase in data storage???
> > > > > > *
> > > > > >
> > > > > > Regards,
> > > > > > Kranthi
> > > > >
> > > >
> > > >
> > > >
> > > > --
> > > > Kranthi Reddy. B
> > > >
> > > > http://www.setusoftware.com/setu/index.htm
> > > >
> > >
> >
> >
> >
> > --
> > Kranthi Reddy. B
> >
> > http://www.setusoftware.com/setu/index.htm
> >
>



-- 
Kranthi Reddy. B

http://www.setusoftware.com/setu/index.htm

Re: Unexpected Data insertion time and Data size explosion

Posted by Ulrich Staudinger <us...@activequant.org>.
The point I refer to is not so much about when HBase's server side flushes,
but about when the client side flushes. If you put every value immediately,
each put results in its own RPC call. If you collect the data on the client
side and flush manually (on the client side), it results in one RPC call
carrying hundreds or thousands of small puts, instead of hundreds or thousands
of individual put RPC calls.

Another issue: I am not so sure what happens if you collect hundreds of
thousands of small puts, which might possibly be bigger than the memstore, and
flush then. I guess the HBase client will hang.
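
As a rough illustration of the client-side buffering I mean (0.90-era client
API; the buffer size and table name below are just examples):

  import java.io.IOException;
  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.hbase.HBaseConfiguration;
  import org.apache.hadoop.hbase.client.HTable;
  import org.apache.hadoop.hbase.client.Put;
  import org.apache.hadoop.hbase.util.Bytes;

  public class WriteBufferSketch {
    public static void main(String[] args) throws IOException {
      Configuration conf = HBaseConfiguration.create();
      HTable table = new HTable(conf, "mytable");  // placeholder table name
      table.setAutoFlush(false);                   // don't send one RPC per put
      table.setWriteBufferSize(8 * 1024 * 1024);   // e.g. 8MB client-side buffer
      for (int i = 0; i < 100000; i++) {           // stands in for the real input
        Put put = new Put(Bytes.toBytes(Integer.toString(i)));
        put.add(Bytes.toBytes("f"), Bytes.toBytes("c"), Bytes.toBytes("value-" + i));
        table.put(put);      // buffered on the client, flushed when the buffer fills
      }
      table.flushCommits();  // explicit client-side flush of whatever is left
      table.close();
    }
  }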




On Mon, Dec 5, 2011 at 10:10 AM, kranthi reddy <kr...@gmail.com>wrote:

> Doesn't the configuration setting "hbase.hregion.memstore.flush.size" do
> the bulk insert ??? I was of the opinion that Hbase would flush all the
> puts to the disk when it's memstore is filled, whose property is defined in
> hbase-default.xml. Is my understanding wrong here ???
>
>
>
> On Mon, Dec 5, 2011 at 1:26 PM, Ulrich Staudinger <
> ustaudinger@activequant.org> wrote:
>
> > Hi there,
> >
> > while I cannot give you any concrete advice on your particular storage
> > problem, I can share some experiences with you regarding performance.
> >
> > I also bulk import data regularly, which is around 4GB every day in about
> > 150 files with something between 10'000 to 30'000 lines in it.
> >
> > My first approach was to read every line and put it separately. Which
> > resulted in a load time of about an hour. My next approach was to read an
> > entire file, put each individual put into a list and then store the
> entire
> > list at once. This works fast in the beginning, but after about 20 files,
> > the server ran into compactions and couldn't cope with the load and
> > finally, the master crashed, leaving regionserver and zookeeper running.
> To
> > HBase's defense, I have to say that I did this on a standalone
> installation
> > without Hadoop underneath, so the test may not be entirely fair.
> > Next, I switched to a proper Hadoop layer with HBase on top. I now also
> put
> > around 100 - 1000 lines (or puts) at once, in a bulk commit, and have
> > insert times of around 0.5ms per row - which is very decent. My entire
> > import now takes only 7 minutes.
> >
> > I think you must find a balance regarding the performance of your servers
> > and how quick they are with compactions and the amount of data you put at
> > once. I have definitely found single puts to result in low performance.
> >
> > Best regards,
> > Ulrich
> >
> >
> >
> >
> >
> > On Mon, Dec 5, 2011 at 6:23 AM, kranthi reddy <kranthili2020@gmail.com
> > >wrote:
> >
> > > No, I split the table on the fly. This I have done because converting
> my
> > > table into Hbase format (rowID, family, qualifier, value) would result
> in
> > > the input file being arnd 300GB. Hence, I had decided to do the
> splitting
> > > and generating this format on the fly.
> > >
> > > Will this effect the performance so heavily ???
> > >
> > > On Mon, Dec 5, 2011 at 1:21 AM, <yu...@gmail.com> wrote:
> > >
> > > > May I ask whether you pre-split your table before loading ?
> > > >
> > > >
> > > >
> > > > On Dec 4, 2011, at 6:19 AM, kranthi reddy <kr...@gmail.com>
> > > wrote:
> > > >
> > > > > Hi all,
> > > > >
> > > > >    I am a newbie to Hbase and Hadoop. I have setup a cluster of 4
> > > > machines
> > > > > and am trying to insert data. 3 of the machines are tasktrackers,
> > with
> > > 4
> > > > > map tasks each.
> > > > >
> > > > >    My data consists of about 1.3 billion rows with 4 columns each
> > > (100GB
> > > > > txt file). The column structure is "rowID, word1, word2, word3".
>  My
> > > DFS
> > > > > replication in hadoop and hbase is set to 3 each. I have put only
> one
> > > > > column family and 3 qualifiers for each field (word*).
> > > > >
> > > > >    I am using the SampleUploader present in the HBase distribution.
> > To
> > > > > complete 40% of the insertion, it has taken around 21 hrs and it's
> > > still
> > > > > running. I have 12 map tasks running.* I would like to know is the
> > > > > insertion time taken here on expected lines ??? Because when I used
> > > > lucene,
> > > > > I was able to insert the entire data in about 8 hours.*
> > > > >
> > > > >    Also, there seems to be huge explosion of data size here. With a
> > > > > replication factor of 3 for HBase, I was expecting the table size
> > > > inserted
> > > > > to be around 350-400GB. (350-400GB for an 100GB txt file I have,
> > 300GB
> > > > for
> > > > > replicating the data 3 times and 50+ GB for additional storage
> > > > > information). But even for 40% completion of data insertion, the
> > space
> > > > > occupied is around 550GB (Looks like it might take around 1.2TB for
> > an
> > > > > 100GB file).* I have used the rowID to be a String, instead of
> Long.
> > > Will
> > > > > that account for such rapid increase in data storage???
> > > > > *
> > > > >
> > > > > Regards,
> > > > > Kranthi
> > > >
> > >
> > >
> > >
> > > --
> > > Kranthi Reddy. B
> > >
> > > http://www.setusoftware.com/setu/index.htm
> > >
> >
>
>
>
> --
> Kranthi Reddy. B
>
> http://www.setusoftware.com/setu/index.htm
>

Re: Unexpected Data insertion time and Data size explosion

Posted by kranthi reddy <kr...@gmail.com>.
Doesn't the configuration setting "hbase.hregion.memstore.flush.size" take
care of bulk inserts? I was under the impression that HBase flushes all the
puts to disk when its memstore fills up, with that threshold defined in
hbase-default.xml. Is my understanding wrong here?
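
Or is the knob I actually need the client-side write buffer rather than the
server-side memstore setting? Something like the sketch below is what I am
picturing (property name taken from hbase-default.xml; the buffer size and
table name are just examples):

  import java.io.IOException;
  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.hbase.HBaseConfiguration;
  import org.apache.hadoop.hbase.client.HTable;

  public class ClientBufferSketch {
    public static void main(String[] args) throws IOException {
      Configuration conf = HBaseConfiguration.create();
      // hbase.hregion.memstore.flush.size is a server-side setting: it decides when
      // a region server writes its memstore out to an HFile. It does not batch the
      // client's RPCs. The client-side analogue is the write buffer:
      conf.setLong("hbase.client.write.buffer", 8 * 1024 * 1024);  // example: 8MB
      HTable table = new HTable(conf, "mytable");                  // placeholder name
      table.setAutoFlush(false);  // let puts accumulate in the client buffer
      // ... issue puts here; they are sent in batches as the buffer fills ...
      table.flushCommits();       // push anything still buffered
      table.close();
    }
  }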



On Mon, Dec 5, 2011 at 1:26 PM, Ulrich Staudinger <
ustaudinger@activequant.org> wrote:

> Hi there,
>
> while I cannot give you any concrete advice on your particular storage
> problem, I can share some experiences with you regarding performance.
>
> I also bulk import data regularly, which is around 4GB every day in about
> 150 files with something between 10'000 to 30'000 lines in it.
>
> My first approach was to read every line and put it separately. Which
> resulted in a load time of about an hour. My next approach was to read an
> entire file, put each individual put into a list and then store the entire
> list at once. This works fast in the beginning, but after about 20 files,
> the server ran into compactions and couldn't cope with the load and
> finally, the master crashed, leaving regionserver and zookeeper running. To
> HBase's defense, I have to say that I did this on a standalone installation
> without Hadoop underneath, so the test may not be entirely fair.
> Next, I switched to a proper Hadoop layer with HBase on top. I now also put
> around 100 - 1000 lines (or puts) at once, in a bulk commit, and have
> insert times of around 0.5ms per row - which is very decent. My entire
> import now takes only 7 minutes.
>
> I think you must find a balance regarding the performance of your servers
> and how quick they are with compactions and the amount of data you put at
> once. I have definitely found single puts to result in low performance.
>
> Best regards,
> Ulrich
>
>
>
>
>
> On Mon, Dec 5, 2011 at 6:23 AM, kranthi reddy <kranthili2020@gmail.com
> >wrote:
>
> > No, I split the table on the fly. This I have done because converting my
> > table into Hbase format (rowID, family, qualifier, value) would result in
> > the input file being arnd 300GB. Hence, I had decided to do the splitting
> > and generating this format on the fly.
> >
> > Will this effect the performance so heavily ???
> >
> > On Mon, Dec 5, 2011 at 1:21 AM, <yu...@gmail.com> wrote:
> >
> > > May I ask whether you pre-split your table before loading ?
> > >
> > >
> > >
> > > On Dec 4, 2011, at 6:19 AM, kranthi reddy <kr...@gmail.com>
> > wrote:
> > >
> > > > Hi all,
> > > >
> > > >    I am a newbie to Hbase and Hadoop. I have setup a cluster of 4
> > > machines
> > > > and am trying to insert data. 3 of the machines are tasktrackers,
> with
> > 4
> > > > map tasks each.
> > > >
> > > >    My data consists of about 1.3 billion rows with 4 columns each
> > (100GB
> > > > txt file). The column structure is "rowID, word1, word2, word3".  My
> > DFS
> > > > replication in hadoop and hbase is set to 3 each. I have put only one
> > > > column family and 3 qualifiers for each field (word*).
> > > >
> > > >    I am using the SampleUploader present in the HBase distribution.
> To
> > > > complete 40% of the insertion, it has taken around 21 hrs and it's
> > still
> > > > running. I have 12 map tasks running.* I would like to know is the
> > > > insertion time taken here on expected lines ??? Because when I used
> > > lucene,
> > > > I was able to insert the entire data in about 8 hours.*
> > > >
> > > >    Also, there seems to be huge explosion of data size here. With a
> > > > replication factor of 3 for HBase, I was expecting the table size
> > > inserted
> > > > to be around 350-400GB. (350-400GB for an 100GB txt file I have,
> 300GB
> > > for
> > > > replicating the data 3 times and 50+ GB for additional storage
> > > > information). But even for 40% completion of data insertion, the
> space
> > > > occupied is around 550GB (Looks like it might take around 1.2TB for
> an
> > > > 100GB file).* I have used the rowID to be a String, instead of Long.
> > Will
> > > > that account for such rapid increase in data storage???
> > > > *
> > > >
> > > > Regards,
> > > > Kranthi
> > >
> >
> >
> >
> > --
> > Kranthi Reddy. B
> >
> > http://www.setusoftware.com/setu/index.htm
> >
>



-- 
Kranthi Reddy. B

http://www.setusoftware.com/setu/index.htm

Re: Unexpected Data insertion time and Data size explosion

Posted by Ulrich Staudinger <us...@activequant.org>.
Hi there,

while I cannot give you any concrete advice on your particular storage
problem, I can share some experiences with you regarding performance.

I also bulk import data regularly: around 4GB every day, spread over about
150 files with between 10,000 and 30,000 lines each.

My first approach was to read every line and put it separately, which
resulted in a load time of about an hour. My next approach was to read an
entire file, add each individual put to a list, and then store the entire list
at once. This worked fast in the beginning, but after about 20 files the
server ran into compactions, couldn't cope with the load, and finally the
master crashed, leaving the regionserver and ZooKeeper running. In HBase's
defense, I have to say that I did this on a standalone installation without
Hadoop underneath, so the test may not be entirely fair.
Next, I switched to a proper Hadoop layer with HBase on top. I now also put
around 100-1000 lines (or puts) at once, in a bulk commit, and see insert
times of around 0.5ms per row, which is very decent. My entire import now
takes only 7 minutes.

I think you must find a balance between how quickly your servers keep up with
compactions and the amount of data you put at once. I have definitely found
single puts to result in low performance.

Best regards,
Ulrich





On Mon, Dec 5, 2011 at 6:23 AM, kranthi reddy <kr...@gmail.com>wrote:

> No, I split the table on the fly. This I have done because converting my
> table into Hbase format (rowID, family, qualifier, value) would result in
> the input file being arnd 300GB. Hence, I had decided to do the splitting
> and generating this format on the fly.
>
> Will this effect the performance so heavily ???
>
> On Mon, Dec 5, 2011 at 1:21 AM, <yu...@gmail.com> wrote:
>
> > May I ask whether you pre-split your table before loading ?
> >
> >
> >
> > On Dec 4, 2011, at 6:19 AM, kranthi reddy <kr...@gmail.com>
> wrote:
> >
> > > Hi all,
> > >
> > >    I am a newbie to Hbase and Hadoop. I have setup a cluster of 4
> > machines
> > > and am trying to insert data. 3 of the machines are tasktrackers, with
> 4
> > > map tasks each.
> > >
> > >    My data consists of about 1.3 billion rows with 4 columns each
> (100GB
> > > txt file). The column structure is "rowID, word1, word2, word3".  My
> DFS
> > > replication in hadoop and hbase is set to 3 each. I have put only one
> > > column family and 3 qualifiers for each field (word*).
> > >
> > >    I am using the SampleUploader present in the HBase distribution. To
> > > complete 40% of the insertion, it has taken around 21 hrs and it's
> still
> > > running. I have 12 map tasks running.* I would like to know is the
> > > insertion time taken here on expected lines ??? Because when I used
> > lucene,
> > > I was able to insert the entire data in about 8 hours.*
> > >
> > >    Also, there seems to be huge explosion of data size here. With a
> > > replication factor of 3 for HBase, I was expecting the table size
> > inserted
> > > to be around 350-400GB. (350-400GB for an 100GB txt file I have, 300GB
> > for
> > > replicating the data 3 times and 50+ GB for additional storage
> > > information). But even for 40% completion of data insertion, the space
> > > occupied is around 550GB (Looks like it might take around 1.2TB for an
> > > 100GB file).* I have used the rowID to be a String, instead of Long.
> Will
> > > that account for such rapid increase in data storage???
> > > *
> > >
> > > Regards,
> > > Kranthi
> >
>
>
>
> --
> Kranthi Reddy. B
>
> http://www.setusoftware.com/setu/index.htm
>

Re: Unexpected Data insertion time and Data size explosion

Posted by kranthi reddy <kr...@gmail.com>.
No, I split the data on the fly. I did this because converting my table into
HBase format (rowID, family, qualifier, value) up front would result in the
input file being around 300GB. Hence, I decided to do the splitting and
generate this format on the fly.

Will this affect the performance so heavily?

On Mon, Dec 5, 2011 at 1:21 AM, <yu...@gmail.com> wrote:

> May I ask whether you pre-split your table before loading ?
>
>
>
> On Dec 4, 2011, at 6:19 AM, kranthi reddy <kr...@gmail.com> wrote:
>
> > Hi all,
> >
> >    I am a newbie to Hbase and Hadoop. I have setup a cluster of 4
> machines
> > and am trying to insert data. 3 of the machines are tasktrackers, with 4
> > map tasks each.
> >
> >    My data consists of about 1.3 billion rows with 4 columns each (100GB
> > txt file). The column structure is "rowID, word1, word2, word3".  My DFS
> > replication in hadoop and hbase is set to 3 each. I have put only one
> > column family and 3 qualifiers for each field (word*).
> >
> >    I am using the SampleUploader present in the HBase distribution. To
> > complete 40% of the insertion, it has taken around 21 hrs and it's still
> > running. I have 12 map tasks running.* I would like to know is the
> > insertion time taken here on expected lines ??? Because when I used
> lucene,
> > I was able to insert the entire data in about 8 hours.*
> >
> >    Also, there seems to be huge explosion of data size here. With a
> > replication factor of 3 for HBase, I was expecting the table size
> inserted
> > to be around 350-400GB. (350-400GB for an 100GB txt file I have, 300GB
> for
> > replicating the data 3 times and 50+ GB for additional storage
> > information). But even for 40% completion of data insertion, the space
> > occupied is around 550GB (Looks like it might take around 1.2TB for an
> > 100GB file).* I have used the rowID to be a String, instead of Long. Will
> > that account for such rapid increase in data storage???
> > *
> >
> > Regards,
> > Kranthi
>



-- 
Kranthi Reddy. B

http://www.setusoftware.com/setu/index.htm

Re: Unexpected Data insertion time and Data size explosion

Posted by yu...@gmail.com.
May I ask whether you pre-split your table before loading ?
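
(By pre-splitting I mean creating the table with initial region boundaries, so
the load is spread across region servers from the start instead of hammering a
single region. A rough sketch with the 0.90-era client API; the table name,
column family and split points below are only examples:)

  import java.io.IOException;
  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.hbase.HBaseConfiguration;
  import org.apache.hadoop.hbase.HColumnDescriptor;
  import org.apache.hadoop.hbase.HTableDescriptor;
  import org.apache.hadoop.hbase.client.HBaseAdmin;
  import org.apache.hadoop.hbase.util.Bytes;

  public class PreSplitSketch {
    public static void main(String[] args) throws IOException {
      Configuration conf = HBaseConfiguration.create();
      HBaseAdmin admin = new HBaseAdmin(conf);
      HTableDescriptor desc = new HTableDescriptor("mytable");  // placeholder name
      desc.addFamily(new HColumnDescriptor("f"));
      // Example split points for string row keys; pick boundaries that match
      // the real key distribution.
      byte[][] splits = new byte[][] {
          Bytes.toBytes("2"), Bytes.toBytes("4"),
          Bytes.toBytes("6"), Bytes.toBytes("8")
      };
      admin.createTable(desc, splits);  // table starts with 5 regions instead of 1
    }
  }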



On Dec 4, 2011, at 6:19 AM, kranthi reddy <kr...@gmail.com> wrote:

> Hi all,
> 
>    I am a newbie to Hbase and Hadoop. I have setup a cluster of 4 machines
> and am trying to insert data. 3 of the machines are tasktrackers, with 4
> map tasks each.
> 
>    My data consists of about 1.3 billion rows with 4 columns each (100GB
> txt file). The column structure is "rowID, word1, word2, word3".  My DFS
> replication in hadoop and hbase is set to 3 each. I have put only one
> column family and 3 qualifiers for each field (word*).
> 
>    I am using the SampleUploader present in the HBase distribution. To
> complete 40% of the insertion, it has taken around 21 hrs and it's still
> running. I have 12 map tasks running.* I would like to know is the
> insertion time taken here on expected lines ??? Because when I used lucene,
> I was able to insert the entire data in about 8 hours.*
> 
>    Also, there seems to be huge explosion of data size here. With a
> replication factor of 3 for HBase, I was expecting the table size inserted
> to be around 350-400GB. (350-400GB for an 100GB txt file I have, 300GB for
> replicating the data 3 times and 50+ GB for additional storage
> information). But even for 40% completion of data insertion, the space
> occupied is around 550GB (Looks like it might take around 1.2TB for an
> 100GB file).* I have used the rowID to be a String, instead of Long. Will
> that account for such rapid increase in data storage???
> *
> 
> Regards,
> Kranthi