You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@accumulo.apache.org by vaibhav thapliyal <va...@gmail.com> on 2015/07/13 17:19:56 UTC
Questions on intersecting iterator and partition ids
Dear all,
I have the following questions on intersecting iterator and partition ids
used in document sharded indexing:
1. Can we run a boolean and query using the current intersecting iterator
on a given range of ids. These ids are a subset of the total ids stored in
the column qualifier field as per the document sharded indexing format.
If it's not possible with current iterator can I tweak the existing one?
2. Is the partitioning suggested in document sharded indexing logical or
physical. For eg if I have 30 partition ids do I have to physically
presplit the table based on the partition ids for the and query to run in
the most efficient way so that I have 30 tablets in table?
3. Lastly, Can anybody suggest me the number of partitions for document
sharded indexing. What should I look for when deciding it?
Thanks
Vaibhav
Re: Questions on intersecting iterator and partition ids
Posted by Josh Elser <jo...@gmail.com>.
Inlined.
vaibhav thapliyal wrote:
> Dear all,
>
> I have the following questions on intersecting iterator and partition
> ids used in document sharded indexing:
>
> 1. Can we run a boolean and query using the current intersecting
> iterator on a given range of ids. These ids are a subset of the total
> ids stored in the column qualifier field as per the document sharded
> indexing format.
The IntersectingIterator is meant to find documents which contain a list
of terms. If you have a set of candidate documents which means that
you've already done the work that the IntersectingIterator would.
> If it's not possible with current iterator can I tweak the existing one?
No, I don't think so. The schema that the IntersectingIterator expects
is "row: shardID, colfam: term, colqual: docID". If you have a document
which you _might_ match your terms, you can just fetch each key-value
pair for the document and see if it matches.
Ideally, if you had another index structure which reversed the column
family and qualifier, you could easily verify whether a document
contains all of the terms you're looking for via a column qualifier filter.
Remember, space is cheap.
> 2. Is the partitioning suggested in document sharded indexing logical or
> physical. For eg if I have 30 partition ids do I have to physically
> presplit the table based on the partition ids for the and query to run
> in the most efficient way so that I have 30 tablets in table?
This is likely a good starting place, but read the below comment.
> 3. Lastly, Can anybody suggest me the number of partitions for
> document sharded indexing. What should I look for when deciding it?
See
http://mail-archives.apache.org/mod_mbox/accumulo-user/201507.mbox/%3C559994BB.3070607%40gmail.com%3E
> Thanks
> Vaibhav
>
Re: Questions on intersecting iterator and partition ids
Posted by Adam Fuchs <af...@apache.org>.
Vaibhav,
I have included some answers below.
Cheers,
Adam
On Mon, Jul 13, 2015 at 11:19 AM, vaibhav thapliyal <
vaibhav.thapliyal.91@gmail.com> wrote:
> Dear all,
>
> I have the following questions on intersecting iterator and partition ids
> used in document sharded indexing:
>
> 1. Can we run a boolean and query using the current intersecting iterator
> on a given range of ids. These ids are a subset of the total ids stored in
> the column qualifier field as per the document sharded indexing format.
>
The IntersectingIterator is designed to do index intersections, which are
very similar to boolean AND queries. It does require indexes to be built in
a particular fashion. You should play around with the WikiSearch example (
https://accumulo.apache.org/example/wikisearch.html) to get familiar with
its use.
> If it's not possible with current iterator can I tweak the existing one?
>
If you are indexing documents similar to what the IntersectingIterator
expects then you should be able to get it to work for you. More generally,
any row-local logic can be implemented in an iterator. If you're not
building indexes then you might want to look at the RowFilter as a starting
point.
> 2. Is the partitioning suggested in document sharded indexing logical or
> physical. For eg if I have 30 partition ids do I have to physically
> presplit the table based on the partition ids for the and query to run in
> the most efficient way so that I have 30 tablets in table?
>
You don't have to pre-split -- Accumulo will automatically split big rows
into their own tablets. However, there are some performance advantages to
pre-splitting before your tablet gets big enough to split on its own.
> 3. Lastly, Can anybody suggest me the number of partitions for
> document sharded indexing. What should I look for when deciding it?
>
You have to consider a few factors for this: (a) ingest parallelization,
for which you want approximately as many partitions as you have cores in
your cluster, (b) size of a partition when full, which you want to be under
about 20GB for compaction performance reasons, and (c) query parallelism,
for which you want no more than a small factor of the number of cores in
your cluster to reduce query latency. If you can't find a solution that
works for all of these factors then you will be forced to make trade-offs
(or do something complicated like time-based partitioning).
> Thanks
> Vaibhav
>