You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@accumulo.apache.org by vaibhav thapliyal <va...@gmail.com> on 2015/07/13 17:19:56 UTC

Questions on intersecting iterator and partition ids

Dear all,

I have the following questions on intersecting iterator and partition ids
used in document sharded indexing:

1. Can we run a boolean and query using the current intersecting iterator
on a given range of ids. These ids are a subset of the total ids stored in
the column qualifier field as per the document sharded indexing format.

If it's not possible with current iterator can I tweak the existing one?

2. Is the partitioning suggested in document sharded indexing logical or
physical. For eg if I have 30 partition ids do I have to physically
presplit the table based on the partition ids for the and query to run in
the most efficient way so that I have 30 tablets in table?

3.  Lastly,  Can anybody suggest me the number of partitions for document
sharded indexing. What should I look for when deciding it?

Thanks
Vaibhav

Re: Questions on intersecting iterator and partition ids

Posted by Josh Elser <jo...@gmail.com>.
Inlined.

vaibhav thapliyal wrote:
> Dear all,
>
> I have the following questions on intersecting iterator and partition
> ids used in document sharded indexing:
>
> 1. Can we run a boolean and query using the current intersecting
> iterator on a given range of ids. These ids are a subset of the total
> ids stored in the column qualifier field as per the document sharded
> indexing format.

The IntersectingIterator is meant to find documents which contain a list 
of terms. If you have a set of candidate documents which means that 
you've already done the work that the IntersectingIterator would.

> If it's not possible with current iterator can I tweak the existing one?

No, I don't think so. The schema that the IntersectingIterator expects 
is "row: shardID, colfam: term, colqual: docID". If you have a document 
which you _might_ match your terms, you can just fetch each key-value 
pair for the document and see if it matches.

Ideally, if you had another index structure which reversed the column 
family and qualifier, you could easily verify whether a document 
contains all of the terms you're looking for via a column qualifier filter.

Remember, space is cheap.

> 2. Is the partitioning suggested in document sharded indexing logical or
> physical. For eg if I have 30 partition ids do I have to physically
> presplit the table based on the partition ids for the and query to run
> in the most efficient way so that I have 30 tablets in table?

This is likely a good starting place, but read the below comment.

> 3.  Lastly,  Can anybody suggest me the number of partitions for
> document sharded indexing. What should I look for when deciding it?

See 
http://mail-archives.apache.org/mod_mbox/accumulo-user/201507.mbox/%3C559994BB.3070607%40gmail.com%3E

> Thanks
> Vaibhav
>

Re: Questions on intersecting iterator and partition ids

Posted by Adam Fuchs <af...@apache.org>.
Vaibhav,

I have included some answers below.

Cheers,
Adam

On Mon, Jul 13, 2015 at 11:19 AM, vaibhav thapliyal <
vaibhav.thapliyal.91@gmail.com> wrote:

> Dear all,
>
> I have the following questions on intersecting iterator and partition ids
> used in document sharded indexing:
>
> 1. Can we run a boolean and query using the current intersecting iterator
> on a given range of ids. These ids are a subset of the total ids stored in
> the column qualifier field as per the document sharded indexing format.
>
The IntersectingIterator is designed to do index intersections, which are
very similar to boolean AND queries. It does require indexes to be built in
a particular fashion. You should play around with the WikiSearch example (
https://accumulo.apache.org/example/wikisearch.html) to get familiar with
its use.

> If it's not possible with current iterator can I tweak the existing one?
>
If you are indexing documents similar to what the IntersectingIterator
expects then you should be able to get it to work for you. More generally,
any row-local logic can be implemented in an iterator. If you're not
building indexes then you might want to look at the RowFilter as a starting
point.

>  2. Is the partitioning suggested in document sharded indexing logical or
> physical. For eg if I have 30 partition ids do I have to physically
> presplit the table based on the partition ids for the and query to run in
> the most efficient way so that I have 30 tablets in table?
>
You don't have to pre-split -- Accumulo will automatically split big rows
into their own tablets. However, there are some performance advantages to
pre-splitting before your tablet gets big enough to split on its own.

>  3.  Lastly,  Can anybody suggest me the number of partitions for
> document sharded indexing. What should I look for when deciding it?
>
You have to consider a few factors for this: (a) ingest parallelization,
for which you want approximately as many partitions as you have cores in
your cluster, (b) size of a partition when full, which you want to be under
about 20GB for compaction performance reasons, and (c) query parallelism,
for which you want no more than a small factor of the number of cores in
your cluster to reduce query latency. If you can't find a solution that
works for all of these factors then you will be forced to make trade-offs
(or do something complicated like time-based partitioning).

>  Thanks
> Vaibhav
>