Posted to dev@accumulo.apache.org by Mastergeek <ma...@gmail.com> on 2013/10/02 22:15:50 UTC

Table entry count confusion

I have an interesting dilemma wherein my Accumulo cluster overview says that
I have over 1.4 billion entries in the table, and yet when I run a scan
in which I keep track of unique row ids, I get back a number that is
drastically less than what the table claims to have (a little over 30
million). I read the legend and it says, "Entries: Key/value pairs over each
instance, table or tablet." I was under the impression that Accumulo tables
did away with duplicate rows, hence my curiosity as to why there are
apparently 45 times more entries than there should be. Do I need to perform
a compaction or some other action to rid my cluster of what I believe to be
duplicate entries?

Thanks,
Jeff



-----



--
View this message in context: http://apache-accumulo.1065345.n5.nabble.com/Table-entry-count-confusion-tp5629.html
Sent from the Developers mailing list archive at Nabble.com.

Re: Table entry count confusion

Posted by Adam Fuchs <af...@apache.org>.
The count that displays in the monitor is the sum of all the key/value
pairs that are in the files that back Accumulo. You can also get this count
by doing a scan of the !METADATA table and looking at the values associated
with keys in the "file" column family. Inserting the same key twice could
result in one key in one file or two keys in two files. At query time,
those keys will get deduplicated by the VersioningIterator, providing a
view that only has one key.
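The gap between the per-file count and the scan-time view can be sketched as a toy model in Python (this is an illustration of the idea, not Accumulo's actual implementation; the file contents and key layout below are invented):

```python
# Toy model: the monitor's entry count sums key/value pairs across the
# backing files, while a scan sees only the newest version of each key,
# as the VersioningIterator keeps one version per key by default.
# Keys here are (row, column_family, column_qualifier, timestamp).

file1 = [("row1", "cf", "cq", 10, "v1"),
         ("row2", "cf", "cq", 10, "v2")]
file2 = [("row1", "cf", "cq", 20, "v1-updated"),  # same key, newer timestamp
         ("row3", "cf", "cq", 10, "v3")]

# Monitor-style count: sum of key/value pairs over the files.
monitor_count = len(file1) + len(file2)

# Scan-style count: merge the files and keep only the newest timestamp
# for each (row, cf, cq).
latest = {}
for row, cf, cq, ts, val in file1 + file2:
    key = (row, cf, cq)
    if key not in latest or ts > latest[key][0]:
        latest[key] = (ts, val)

scan_count = len(latest)
print(monitor_count, scan_count)  # 4 entries on disk, 3 visible to a scan
```

Writing the same key into two different files leaves both copies on disk (and in the monitor's count) until a compaction merges them.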

45x seems really high, since a tablet tends to have an average of maybe 4-8
files associated with it at the billion-entry scale (rough estimate). There
could be other factors, like cell-level security eliminating entries
from the view that the scanner gives you, or major compactions not
running properly for you. Your backing data could also include a large
number of deletes, which would throw off the stats. Deletes are implemented
as tombstone markers, and are only eliminated when a full major compaction
happens. Forcing a major compaction by running the compact command in the
shell should give you better evidence to diagnose the discrepancy.
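The tombstone behavior can also be sketched as a toy model (again an invented illustration, not Accumulo's code): a delete is just another entry on disk until a full major compaction merges every file, at which point both the tombstone and the data it suppresses can be dropped.

```python
# Toy model: delete tombstones inflate the on-disk entry count until a
# full major compaction runs. Entries are (row, cf, cq, timestamp, value),
# where value may be a DELETE marker; the data is invented.
DELETE = object()

files = [
    [("row1", "cf", "cq", 10, "old")],    # original insert
    [("row1", "cf", "cq", 20, DELETE)],   # later delete: a second on-disk entry
]

on_disk = sum(len(f) for f in files)      # the monitor counts both entries

def full_major_compact(files):
    """Merge all files; the delete suppresses everything older, and the
    tombstone itself can be dropped because no unmerged file remains."""
    latest = {}
    for f in files:
        for row, cf, cq, ts, val in f:
            key = (row, cf, cq)
            if key not in latest or ts > latest[key][0]:
                latest[key] = (ts, val)
    return [key + (ts, val)
            for key, (ts, val) in latest.items()
            if val is not DELETE]

compacted = full_major_compact(files)
print(on_disk, len(compacted))  # 2 entries before compaction, 0 after
```

A compaction over only a subset of files could not safely drop the tombstone, since an older copy of the key might still live in a file outside the merge; that is why only a *full* major compaction eliminates deletes.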

Cheers,
Adam



On Wed, Oct 2, 2013 at 4:15 PM, Mastergeek <ma...@gmail.com> wrote:


Re: Table entry count confusion

Posted by Josh Elser <jo...@gmail.com>.
You got it!

On 10/07/2013 05:39 PM, Mastergeek wrote:
> I know it has been a while, but the command I should run to compact an
> entire table would just be the following?
>
> compact -t <tablename>


Re: Table entry count confusion

Posted by Mastergeek <ma...@gmail.com>.
I know it has been a while, but the command I should run to compact an
entire table would just be the following?

compact -t <tablename>




Re: Table entry count confusion

Posted by Josh Elser <jo...@gmail.com>.
Yup, a compaction will flush out the deletes/duplicate keys that may be
lingering in that table and should give you an accurate entry count on
the monitor.

On Wed, Oct 2, 2013 at 4:15 PM, Mastergeek <ma...@gmail.com> wrote:

Re: Table entry count confusion

Posted by Billie Rinaldi <bi...@gmail.com>.
In your original email, you appeared to be using the concept of rows / row
ids and the concept of entries / key-value pairs interchangeably.  A row is
a set of key-value pairs (aka entries) with the same row id.  You said you
counted the unique row ids, and that the number of entries reported by the
monitor was about 45 times the number of row ids.  This would be expected
if you have an average of 45 key-value pairs per row.
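As a sanity check, Billie's explanation squares with the rough numbers from the original post (both figures are approximate, so the ratio only needs to land in the ballpark of the reported 45x):

```python
# Rough arithmetic from the thread: ~1.4 billion entries reported by the
# monitor vs. "a little over 30 million" unique row ids from the scan.
entries = 1_400_000_000
unique_rows = 30_000_000

entries_per_row = entries / unique_rows
print(round(entries_per_row))  # 47 -- about 45 key/value pairs per row
```

No duplication is needed to explain the gap; an average of ~45 columns per row id accounts for it entirely.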

Billie


On Mon, Oct 7, 2013 at 5:42 PM, Mastergeek <ma...@gmail.com> wrote:


Re: Table entry count confusion

Posted by Mastergeek <ma...@gmail.com>.
Yes, each rowid has numerous column qualifiers per column family, but I
assumed that all of that was still wrapped up in a single row.




Re: Table entry count confusion

Posted by Billie Rinaldi <bi...@gmail.com>.
Does your table have more than one key/value per row id?  The monitor
counts key/value pairs, not rows.

Billie


On Wed, Oct 2, 2013 at 1:15 PM, Mastergeek <ma...@gmail.com> wrote:
