Posted to user@accumulo.apache.org by David Medinets <da...@gmail.com> on 2012/07/12 12:42:06 UTC

Accumulo Data Storage Efficiency

Are there any published numbers for the amount of disk space used by
Accumulo versus other products? I'm thinking some dataset like dbpedia
or something from http://books.google.com/ngrams/datasets. If there is
no such comparison, what comparisons would you like to see? What
about WordNet stored in CSV, MySQL, Cassandra, HBase, and Accumulo?
WordNet is just a large set of CSV files so it would be a good
candidate for this concept, I think.

Re: Accumulo Data Storage Efficiency

Posted by Clint Green <cl...@gmail.com>.
Could you use Culvert to control the indexing across platforms?

On Thu, Jul 12, 2012 at 8:57 AM, William Slacum <ws...@gmail.com> wrote:

> It'd be nice to see some numbers, but I also think it's important to
> account for use cases. Doing secondary indexing on records/files,
> metadata extraction and document storage will increase the raw storage
> required by some factor. Then, it's all compressed in various ways
> (e.g., at the RFile level and at the HDFS block level).
>
> Could we try to define some rudimentary structure that we'd put the
> data in? Like just create a term index on it, since I know HBase and
> Cassandra should be able to handle that.

Re: Accumulo Data Storage Efficiency

Posted by William Slacum <ws...@gmail.com>.
It'd be nice to see some numbers, but I also think it's important to
account for use cases. Doing secondary indexing on records/files,
metadata extraction and document storage will increase the raw storage
required by some factor. Then, it's all compressed in various ways
(e.g., at the RFile level and at the HDFS block level).

Could we try to define some rudimentary structure that we'd put the
data in? Like just create a term index on it, since I know HBase and
Cassandra should be able to handle that.
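
To make the "term index" idea concrete, here is a minimal sketch of how one might measure the storage overhead before involving any particular database. It is only an illustration under stated assumptions: the two-column `id,text` record layout, the whitespace tokenizer, the tab-separated index format, and the function names (`build_term_index`, `serialize_index`, `storage_report`) are all hypothetical, and zlib merely stands in for whatever compression a real store applies.

```python
import csv
import io
import zlib
from collections import defaultdict


def build_term_index(rows):
    """Map each lowercased term to the sorted list of record ids containing it.

    `rows` is an iterable of (record_id, text) pairs (assumed layout).
    """
    index = defaultdict(set)
    for record_id, text in rows:
        for term in text.lower().split():
            index[term].add(record_id)
    return {term: sorted(ids) for term, ids in index.items()}


def serialize_index(index):
    """Flatten the index to one 'term<TAB>record_id' line per entry, sorted by term."""
    lines = []
    for term in sorted(index):
        for record_id in index[term]:
            lines.append(f"{term}\t{record_id}")
    return "\n".join(lines).encode("utf-8")


def storage_report(csv_text):
    """Compare raw size, index size, and the zlib-compressed size of both together."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    raw = csv_text.encode("utf-8")
    index_bytes = serialize_index(build_term_index(rows))
    return {
        "raw_bytes": len(raw),
        "index_bytes": len(index_bytes),
        "raw_plus_index_bytes": len(raw) + len(index_bytes),
        # zlib is a stand-in; real systems compress per file or per block.
        "compressed_bytes": len(zlib.compress(raw + index_bytes)),
    }


# Made-up sample records in the assumed "id,text" layout.
SAMPLE = (
    "1,a domesticated carnivorous mammal\n"
    "2,a small domesticated feline mammal\n"
)

if __name__ == "__main__":
    for name, size in storage_report(SAMPLE).items():
        print(f"{name}: {size}")
```

Running the same report over a full dataset would give a baseline figure for "raw plus term index", against which the on-disk footprint of each store could then be compared.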
