Posted to user@accumulo.apache.org by "Slater, David M." <Da...@jhuapl.edu> on 2013/10/29 22:50:08 UTC
sum of mutation.numBytes() significantly different from rfile size
Hello,
I'm seeing about an order of magnitude difference between the number of bytes returned by mutation.numBytes() and the size of the rfiles on disk (Accumulo 1.4.2). Note that all of my mutations are new entries, and there are no combiners running.
While I understand that there is some compression on the rfile, I would be really surprised if it was 10:1.
My entries are composed of a row ID (most of which is equivalent to the previous row ID), an empty column family, a nonempty column qualifier (which likely shares a lot with the previous qualifier), and an empty value. An example of the rowID and column qualifier might be:
(forward table)
0000000000000|9|fa19 IP|127.000.000.001
0000000000000|9|fa19 PORT|00080
...
0000000000000|9|fa22 IP|128.032.144.139
...
<timeblock>|<hash>|<uid> <index>|<textual value>
OR
(reverse table)
0000000000000|IP|127.000.000.001 fa19
0000000000000|IP|127.000.000.001 fd02
0000000000000|IP|127.000.000.002 123
...
0000000000000|PORT|00080 fa19
The numBytes() method appears to return a number of bytes equal to the combined string length of the row IDs and column qualifiers, plus 26 bytes per column qualifier.
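Spelled out, the estimate I'm computing is just the following (plain Python sketch of the arithmetic, not the Accumulo client API; the 26 bytes per entry is an empirical figure from my data, not a documented constant):

```python
# Rough model of the client-side accounting described above: the sum of
# row-ID and qualifier string lengths, plus a fixed ~26 bytes per entry
# (observed from numBytes(), not a documented constant).
PER_ENTRY_OVERHEAD = 26

def estimated_bytes(entries):
    """entries: list of (row_id, column_qualifier) string pairs."""
    return sum(len(row) + len(cq) + PER_ENTRY_OVERHEAD for row, cq in entries)

sample = [
    ("0000000000000|9|fa19", "IP|127.000.000.001"),
    ("0000000000000|9|fa19", "PORT|00080"),
    ("0000000000000|9|fa22", "IP|128.032.144.139"),
]
print(estimated_bytes(sample))
```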
Is there something else that I'm missing, or would this possibly compress by that much?
Thanks,
David
Re: sum of mutation.numBytes() significantly different from rfile size
Posted by Josh Elser <jo...@gmail.com>.
GZ typically compresses text fairly well (assuming that's the
compression codec that you're using).
I don't believe 1.4 has anything extra at the RFile level for size
savings; however, I think that 1.5+ has some additional encoding to
reduce the size on disk.
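As a rough illustration of how far repetitive, sorted text can go under gzip (plain Python on synthetic rows shaped like the ones in this thread, not a measurement of the actual files):

```python
import gzip

# Synthetic rows shaped like the forward-table sample: long shared
# prefixes, small varying suffixes. Illustration only, not real data.
rows = "\n".join(
    "0000000000000|9|fa%02d IP|127.000.000.%03d" % (i % 50, i % 16)
    for i in range(10000)
).encode("ascii")

compressed = gzip.compress(rows)
ratio = len(rows) / len(compressed)
print("raw=%d gz=%d ratio=%.1f:1" % (len(rows), len(compressed), ratio))
```

On data this repetitive the ratio comfortably exceeds 10:1, so a 10x gap between client-side byte counts and on-disk size is plausible from compression alone.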
RE: sum of mutation.numBytes() significantly different from rfile size
Posted by "Slater, David M." <Da...@jhuapl.edu>.
Comparing the rfiles with compressed CSV files, the results do make sense now.
Thanks,
David
Re: sum of mutation.numBytes() significantly different from rfile size
Posted by Eric Newton <er...@gmail.com>.
For comparison, I posted this some time ago:
http://tinyurl.com/k28bkbg
I was surprised that RFile was smaller than a gzip'd CSV file, too.
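That result is easy to reproduce in miniature (plain Python on synthetic keys shaped like the reverse-table rows in this thread; the prefix-stripping below only imitates the idea behind RFile's relative keys, it is not the real format):

```python
import gzip

# Synthetic sorted keys with long shared prefixes, shaped like the
# reverse-table rows in this thread. Illustration only, not real data.
keys = ["0000000000000|IP|127.000.000.%03d fa%02d" % (i % 256, i % 100)
        for i in range(5000)]
raw = "\n".join(keys).encode("ascii")

def prefix_strip(keys):
    """Store each key as (shared-prefix length, differing suffix)."""
    out, prev = [], ""
    for k in keys:
        n = 0
        while n < min(len(k), len(prev)) and k[n] == prev[n]:
            n += 1
        out.append("%d:%s" % (n, k[n:]))
        prev = k
    return "\n".join(out).encode("ascii")

stripped = prefix_strip(keys)
print("csv.gz=%d stripped.gz=%d" %
      (len(gzip.compress(raw)), len(gzip.compress(stripped))))
```

On this kind of data the stripped form is already much smaller before compression, which is why a format that does both can beat gzip on the raw text.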
Re: sum of mutation.numBytes() significantly different from rfile size
Posted by Keith Turner <ke...@deenlo.com>.
On Tue, Oct 29, 2013 at 5:50 PM, Slater, David M. <Da...@jhuapl.edu> wrote:
In 1.4, if a field (row, col fam, etc.) in a key is the same as in the
previous key, then it's not written again. So if the row is the same in 10
consecutive keys, it's only written once. Maybe this explains the difference.
Scan the table to make sure all of the data you expect to be there is there.
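A toy sketch of that suppression (plain Python; it imitates the idea of writing only the fields that changed, and is not Accumulo's actual RFile encoding):

```python
SAME = "\x00"  # 1-byte marker: "same as the previous key's field"

def encode(keys):
    """keys: sorted list of (row, column family, column qualifier) tuples.
    Writes a field only when it differs from the previous key's field."""
    out, prev = [], ("", "", "")
    for key in keys:
        for field, prev_field in zip(key, prev):
            out.append(SAME if field == prev_field else field)
        prev = key
    return "\x01".join(out)  # field separator, illustration only

keys = [
    ("0000000000000|9|fa19", "", "IP|127.000.000.001"),
    ("0000000000000|9|fa19", "", "PORT|00080"),
    ("0000000000000|9|fa22", "", "IP|128.032.144.139"),
]
raw_size = sum(len(f) for key in keys for f in key)
print(raw_size, len(encode(keys)))
```

Here the repeated row costs only one marker byte on the second key; block compression (e.g. gzip) on top of this is where the rest of the on-disk savings come from.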