Posted to dev@hbase.apache.org by Adal Chiriliuc <ad...@adobe.com> on 2008/09/15 18:03:02 UTC

Data serialization doesn't seem to respect MAX_VERSIONS

Hello,

We've been inserting data into HBase and we found that the size of the files on local disk/HDFS is much larger than expected.

So I made a small script which updates the same row many times over Thrift. The table was created with MAX_VERSIONS = 1.

This is what I found:

If I modify the same cell 100,000 times, the final region "data" file on disk contains around 50,000 of those modifications after I shut down HBase.

If I modify the same cell 200,000 times, the final region "data" file on disk contains around 100,000 of those modifications after I shut down HBase.

# thrift_util is our helper around the generated HBase Thrift bindings
# (ColumnDescriptor and Mutation come from the generated bindings).
client = thrift_util.create_client(Hbase.Client, "localhost", 9090, 30.0)

# Create a table with a single column family that keeps only one version.
cd = ColumnDescriptor()
cd.name = "test:"
cd.maxVersions = 1
client.createTable("bug_test", [cd])

# Overwrite the same cell 100,000 times.
for i in range(100000):
    mutation = Mutation()
    mutation.column = "test:column"
    mutation.value = "version_%d" % i
    client.mutateRow("bug_test", "single_row", [mutation])
    if i % 1000 == 0:
        print i

Is this expected behavior? Our use case involves multiple updates of the same cell using big blobs of data (25 KB).
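
(Back of the envelope, if none of the old versions get cleaned up: 100,000 updates x 25 KB is roughly 2.4 GB of store-file data for what is logically a single cell.)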

Note: when getting a cell or scanning the table, everything is OK: only the last inserted version of the cell is returned. The older values of the cell are only present in the storage files.
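
A quick way to check the read side from the same Thrift client (a sketch; it assumes the generated bindings expose getVer(tableName, row, column, numVersions), which returns the stored versions of the cell, newest first):

# Ask for up to 10 versions of the cell the script above keeps overwriting.
# With maxVersions = 1 on the column family, only one version should come back.
client = thrift_util.create_client(Hbase.Client, "localhost", 9090, 30.0)
versions = client.getVer("bug_test", "single_row", "test:column", 10)
print "versions returned:", len(versions)   # expected: 1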

Best regards,
Adal

Re: Data serialization doesn't seem to respect MAX_VERSIONS

Posted by Jean-Daniel Cryans <jd...@apache.org>.
Adal,

For small tables being used with a lot of updates, HBASE-871
<https://issues.apache.org/jira/browse/HBASE-871> was created (but not
really documented outside of the code). I think I will blog on this.

Thx for reporting this issue.

J-D


RE: Data serialization doesn't seem to respect MAX_VERSIONS

Posted by Adal Chiriliuc <ad...@adobe.com>.
Forgot to specify: this happens using the latest trunk version.


Re: Data serialization doesn't seem to respect MAX_VERSIONS

Posted by Billy Pearson <sa...@pearsonwholesale.com>.
The reason the old values are still stored and the files are large is that the
only time we remove versions beyond max_versions is during a major compaction,
which defaults to once a day in hbase-default.xml. We do minor compactions of
small groups of the map files more often, but they cannot enforce max_versions;
only the major compaction, which combines all the map files and can see all
the versions, keeps just the latest X versions.
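
The interval is controlled by the hbase.hregion.majorcompaction property (in milliseconds; 86400000 = one day). A sketch of an hbase-site.xml override, with the 4-hour value chosen purely as an example:

<property>
  <!-- How often a major compaction runs, in ms. The default in
       hbase-default.xml is 86400000 (one day); 14400000 = 4 hours. -->
  <name>hbase.hregion.majorcompaction</name>
  <value>14400000</value>
</property>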

HBASE-871 makes it possible to set the major compaction interval per column
family, which helps tables/column families that receive lots of updates to the
same rows over the course of a day.

Billy



"Adal Chiriliuc" <ad...@adobe.com> wrote in 
message 
news:49CA7AD832CA884E870CC10F59FF655754351087@eurmbx01.eur.adobe.com...
Hello,

We've been inserting data into Hbase and we found out that the size of the 
files on local disk/HDFS is much larger than expected.

So I made a small script which updates over Thrift the same row many times. 
The table was created with MAX_VERSIONS = 1.

This is what I found:

If I modify the same cell 100.000 times, the final region "data" file on 
disk contains around 50.000 of those modifications after I shutdown Hbase.

If I modify the same cell 200.000 times, the final region "data" file on 
disk contains around 100.000 of those modifications after I shutdown Hbase.

client = thrift_util.create_client(Hbase.Client, "localhost", 9090, 30.0)
cd = ColumnDescriptor()
cd.name = "test:"
cd.maxVersions = 1
client.createTable("bug_test", [cd])

for i in range(100000):
                mutation = Mutation()
                mutation.column = "test:column"
                mutation.value = "version_%d" % i
                client.mutateRow("bug_test", "single_row", [mutation])
                if i % 1000 == 0:
                                print i

Is this expected behavior? Our use case involves multiple updates of the 
same cell using big blobs of data (25 KB).

Note: when getting a cell/scanning the table, everything is ok, only the 
last inserted version of the cell is returned. The older values of the cell 
are only present in the storage files.

Best regards,
Adal