Posted to user@accumulo.apache.org by Jamie Johnson <je...@gmail.com> on 2016/02/06 14:52:21 UTC

How to choose BinId for Document partitioned index

Reading the table design examples, I have a question about the
document-partitioned index: what is typically chosen as the BinId? Or,
maybe more appropriately, what factors should influence the choice of
BinId, and what impact do they have?

Re: How to choose BinId for Document partitioned index

Posted by Josh Elser <jo...@gmail.com>.

Yes! Very astute, Jamie :)

For the wikisearch schemas, the general idea is that the inverted index 
tables can prune your row space for some terms. This way, you can know 
the exact rows you have to search in the sharded table to get good 
parallelism without a full-table scan.
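
As a rough illustration of that pruning step, here is a minimal Python
sketch. It is not the wikisearch code: the in-memory dict stands in for an
inverted index table, and the names candidate_rows and term_index are
hypothetical. It only shows how looking up terms first yields the set of
shard rows worth scanning.

```python
from typing import Dict, Iterable, List, Set


def candidate_rows(terms: Iterable[str], term_index: Dict[str, Set[str]]) -> List[str]:
    """Return the shard rows that contain every query term (AND semantics).

    term_index is a hypothetical stand-in for an inverted index table:
    it maps a term to the set of shard rows (bins) holding that term.
    """
    row_sets = [term_index.get(term, set()) for term in terms]
    if not row_sets:
        return []
    # Only rows present for all terms can hold a matching document.
    return sorted(set.intersection(*row_sets))


if __name__ == "__main__":
    index = {"accumulo": {"03", "07"}, "index": {"07", "11"}}
    # Scan only these rows in the sharded table instead of every bin.
    print(candidate_rows(["accumulo", "index"], index))
```

A term missing from the index yields no candidate rows at all, so the
sharded table does not need to be scanned for that query.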

Jamie Johnson wrote:
> Thanks guys.  I was also looking at some of the examples and saw the
> event store, I like the idea of including time as a prefix to the
> binning to limit the number of servers that need to be hit for time
> bound queries.  Without something like this queries end up having to hit
> all tablets right?  It's not always a full table scan since the
> iterators can bail on a row part way through but still needs to hit
> every row to some extent right?
>
> I also was looking at the wiki example but wasn't able to find a good
> description of how all the tables are used, does anything more exist?

Re: How to choose BinId for Document partitioned index

Posted by Jamie Johnson <je...@gmail.com>.

Thanks guys. I was also looking at some of the examples and saw the event
store. I like the idea of including time as a prefix to the bin to limit
the number of servers that need to be hit for time-bound queries. Without
something like this, queries end up having to hit all tablets, right? It's
not always a full table scan, since the iterators can bail on a row partway
through, but a query still needs to touch every row to some extent, right?

I also was looking at the wiki example but wasn't able to find a good
description of how all the tables are used. Does anything more exist?
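
To make the time-prefix idea concrete, here is a small Python sketch. The
row layout (a yyyymmdd date, an underscore, then a zero-padded bin) and the
function names are assumptions for illustration, not the event store's
actual schema. The point is that a time-bound query only has to visit the
rows for the days it covers.

```python
from datetime import date, timedelta
from typing import List


def time_binned_row(day: date, bin_id: int, num_bins: int) -> str:
    """Build a row ID of the assumed form <yyyymmdd>_<bin>."""
    width = len(str(num_bins - 1))  # zero-pad so rows sort lexicographically
    return f"{day:%Y%m%d}_{bin_id:0{width}d}"


def rows_for_range(start: date, end: date, num_bins: int) -> List[str]:
    """List every row a query over [start, end] (inclusive) must visit."""
    rows: List[str] = []
    day = start
    while day <= end:
        rows.extend(time_binned_row(day, b, num_bins) for b in range(num_bins))
        day += timedelta(days=1)
    return rows


if __name__ == "__main__":
    for row in rows_for_range(date(2016, 2, 6), date(2016, 2, 7), 4):
        print(row)
```

With this layout a query touches (days in range) x (bins) rows, rather
than every row in the table.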

On Feb 6, 2016 2:20 PM, "Josh Elser" <jo...@gmail.com> wrote:

> You can get *really* fancy if you have lots of ingesters and lots of
> servers, include some attribute in the data you're hashing to control how
> many servers a given client will need to write to for some batch of
> documents. This is probably overkill for most setups though.
>
> Guava provides a decent murmur3 implementation which will be much faster
> than your run-of-the-mill MD5 for generating the hash (which you'll mod by
> the max number of bins).

Re: How to choose BinId for Document partitioned index

Posted by Josh Elser <jo...@gmail.com>.

You can get *really* fancy if you have lots of ingesters and lots of
servers: include some attribute in the data you're hashing to control
how many servers a given client will need to write to for some batch of
documents. This is probably overkill for most setups, though.

Guava provides a decent murmur3 implementation, which will be much faster
than your run-of-the-mill MD5 for generating the hash (which you'll mod
by the max number of bins).
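
One possible reading of the "fancy" variant, sketched in Python: hash a
batch-level attribute to pick a group of bins, then hash the document to
pick a bin inside that group. This is an interpretation rather than a
confirmed scheme, and grouped_bin_id, num_groups and bins_per_group are
hypothetical names. hashlib.md5 is used only because it ships with Python;
it stands in for the murmur3 hash recommended above.

```python
import hashlib


def _hash64(data: bytes) -> int:
    """64-bit integer from a digest (MD5 as a stand-in for murmur3)."""
    return int.from_bytes(hashlib.md5(data).digest()[:8], "big")


def grouped_bin_id(attr: bytes, doc_key: bytes,
                   num_groups: int, bins_per_group: int) -> int:
    """Pick a bin so that all documents sharing attr land in one group.

    The attribute (for example an ingester or batch ID, an assumption here)
    selects a contiguous group of bins; the document key spreads documents
    across the bins inside that group.
    """
    group = _hash64(attr) % num_groups
    offset = _hash64(doc_key) % bins_per_group
    return group * bins_per_group + offset


if __name__ == "__main__":
    bins = {grouped_bin_id(b"ingester-A", f"doc-{i}".encode(), 4, 4)
            for i in range(100)}
    print(sorted(bins))  # at most 4 distinct bins, all in the same group
```

A client writing a batch with one attribute value then touches at most
bins_per_group bins instead of all num_groups * bins_per_group.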

William Slacum wrote:
> Often it'll be a hash of the document mod the number of bins you're
> using. The hash should be "good" in the sense that it uniquely
> identifies the document. It can be as simple as some unique field in the
> document or just a hash (like murmur) of the whole document.

Re: How to choose BinId for Document partitioned index

Posted by William Slacum <ws...@gmail.com>.

Often it'll be a hash of the document mod the number of bins you're using.
The hash should be "good" in the sense that it uniquely identifies the
document. It can be as simple as some unique field in the document or just
a hash (like murmur) of the whole document.
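
A minimal Python sketch of "hash mod number of bins", under a few
assumptions: NUM_BINS is an arbitrary example value, the zero-padded row
format is illustrative, and hashlib.md5 is a stand-in because Python's
standard library has no murmur implementation.

```python
import hashlib

NUM_BINS = 16  # assumed bin count; choose this to suit your cluster


def bin_id(doc_key: bytes, num_bins: int = NUM_BINS) -> int:
    """Map a document key (a unique field or the whole document) to a bin."""
    # MD5 is only a stand-in here; a murmur hash would play the same role.
    digest = hashlib.md5(doc_key).digest()
    return int.from_bytes(digest[:8], "big") % num_bins


def bin_row(doc_key: bytes, num_bins: int = NUM_BINS) -> str:
    """Zero-pad the bin so that row IDs sort lexicographically."""
    width = len(str(num_bins - 1))
    return f"{bin_id(doc_key, num_bins):0{width}d}"


if __name__ == "__main__":
    for key in (b"doc-1", b"doc-2", b"doc-3"):
        print(key.decode(), "->", bin_row(key))
```

The same key always maps to the same bin, and distinct keys spread roughly
evenly across the bins.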

On Saturday, February 6, 2016, Jamie Johnson <je...@gmail.com> wrote:

> Just found this excellent write up that explains a bit.
>
> https://www.slideshare.net/mobile/acordova00/text-indexing-in-accumulo

Re: How to choose BinId for Document partitioned index

Posted by Jamie Johnson <je...@gmail.com>.

Just found this excellent write-up that explains a bit.

https://www.slideshare.net/mobile/acordova00/text-indexing-in-accumulo