
Posted to user@accumulo.apache.org by Rob Tallis <ro...@gmail.com> on 2013/09/15 17:05:45 UTC

IndexedDocIterator, indexing approaches

Hello

The documentation has a couple of sections for indexing - 7.3 talks about
pulling back rowids to the client, doing your logic, then using
BatchScanners to submit a second query. 7.5 talks about Intersecting
Iterators and IndexedDocIterators which do all the work server/cluster side.

Getting the cluster to do all the work seems like a better idea,
particularly on massive data sets, since you might hit limits on the client
- sounds reasonable, right?

So, I've taken a look at IndexedDocIterator and IntersectingIterator and
can get them both going with a few noddy examples of AND and NOT querying
being done server-side - so far so good, but what about other query
operations?
The wiki example uses IndexedDocIterator and talks about doing OR queries,
regex, and a "much more expressive query language" but I'm not sure how you
do this. (I can't find the source it refers to - where do I find it?)

Specifically, how would I do AND, OR and NOT queries (or union, intersect,
except) in the same query using IndexedDocIterators or intersecting
iterators? What about other queries like greater_than, less_than, IN,
etc... are these possible?

As an aside, I guess using IndexedDocIterators restricts me to having my
document in a single row/value (perhaps encoded in JSON or something - is
there a recommended method?). IntersectingIterator would return rowIDs
which could refer to documents split out by ColF ColQ in the usual way -
this would still be a secondary lookup from the client but at least the
server has done all the hard work figuring out the rowIDs. Is this a fair
assumption?

Generally, I'm not "getting" schemas/indexing/querying in Accumulo. Is
there a good tutorial on any of this, that perhaps shows some typical
SQL-like things I might want to do and what is/isn't possible in Accumulo
and how I do it?

Cheers,
Rob Tallis

Re: IndexedDocIterator, indexing approaches

Posted by Josh Elser <jo...@gmail.com>.

Rob,

I can try to provide a little more insight here.

If you think about the intersecting iterator(s) as (set) intersecting two
sorted streams of unique IDs, you can easily work in negations and
disjunctions as well. A union iterator is easy to make, as you just merge
the two sorted streams of unique IDs. A negation is just an existence check
in a sorted stream. Being able to represent each of these operands in terms
of sorted streams of unique IDs, you can create arbitrary trees of them,
e.g. (A and (B or C)).
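
To make the sorted-stream idea concrete, here is a minimal sketch in plain
Python rather than an actual Accumulo iterator. The function names
(intersect, union, and_not) and the sample doc IDs are made up for
illustration; the only assumption is that each term yields a sorted,
duplicate-free stream of IDs.

```python
from typing import Iterable, Iterator


def intersect(a: Iterable[str], b: Iterable[str]) -> Iterator[str]:
    """Yield IDs present in both sorted, duplicate-free streams (AND)."""
    ia, ib = iter(a), iter(b)
    x, y = next(ia, None), next(ib, None)
    while x is not None and y is not None:
        if x == y:
            yield x
            x, y = next(ia, None), next(ib, None)
        elif x < y:
            x = next(ia, None)  # advance whichever stream is behind
        else:
            y = next(ib, None)


def union(a: Iterable[str], b: Iterable[str]) -> Iterator[str]:
    """Merge two sorted, duplicate-free streams, emitting each ID once (OR)."""
    ia, ib = iter(a), iter(b)
    x, y = next(ia, None), next(ib, None)
    while x is not None or y is not None:
        if y is None or (x is not None and x < y):
            yield x
            x = next(ia, None)
        elif x is None or y < x:
            yield y
            y = next(ib, None)
        else:  # x == y: emit once, advance both
            yield x
            x, y = next(ia, None), next(ib, None)


def and_not(a: Iterable[str], b: Iterable[str]) -> Iterator[str]:
    """Yield IDs from a that do not occur in b (NOT as an existence check)."""
    ib = iter(b)
    y = next(ib, None)
    for x in a:
        while y is not None and y < x:
            y = next(ib, None)
        if y != x:
            yield x


# Hypothetical sorted ID streams for three terms.
A = ["d1", "d2", "d4", "d5"]
B = ["d2", "d3"]
C = ["d4", "d6"]

# (A and (B or C)): the operators nest because each one is itself a sorted stream.
print(list(intersect(A, union(B, C))))  # ['d2', 'd4']
print(list(and_not(A, B)))              # ['d1', 'd4', 'd5']
```

Because every operator consumes and produces a sorted stream, the tree can
be as deep as the query needs without materialising intermediate results.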

As for greater than, less than, and regular expressions, the easiest
approach is to use an inverted index to expand these operators into
discrete terms based on what actually occurs in the data. However, this is
not without its own pitfalls as well :)
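
A rough sketch of that expansion, again in plain Python and not tied to any
Accumulo API. The index contents, the "age:NNN" term encoding (zero-padded
so that lexicographic order matches numeric order), and the helper names
are all assumptions made up for illustration.

```python
import bisect
import re
from typing import Dict, List

# Hypothetical inverted index: term -> sorted list of doc IDs.
index: Dict[str, List[str]] = {
    "age:017": ["d3"],
    "age:025": ["d1", "d4"],
    "age:042": ["d2"],
    "age:063": ["d4", "d5"],
}
terms = sorted(index)  # the terms that actually occur in the data


def expand_range(terms: List[str], low: str, high: str) -> List[str]:
    """Return the existing terms t with low <= t < high."""
    start = bisect.bisect_left(terms, low)
    end = bisect.bisect_left(terms, high)
    return terms[start:end]


def expand_regex(terms: List[str], pattern: str) -> List[str]:
    """Return the existing terms that fully match the pattern."""
    rx = re.compile(pattern)
    return [t for t in terms if rx.fullmatch(t)]


def docs_for(index: Dict[str, List[str]], expanded: List[str]) -> List[str]:
    """Union the posting lists of the expanded terms (an OR over them)."""
    return sorted(set().union(*(index[t] for t in expanded)))


# 20 <= age < 50 becomes an OR over the terms age:025 and age:042.
in_range = expand_range(terms, "age:020", "age:050")
print(in_range)                   # ['age:025', 'age:042']
print(docs_for(index, in_range))  # ['d1', 'd2', 'd4']

# A regex is expanded the same way, by testing it against the known terms.
print(docs_for(index, expand_regex(terms, r"age:0[0-2]\d")))  # ['d1', 'd3', 'd4']
```

One likely pitfall shows up even in this toy version: the expansion is only
as small as the number of distinct terms in the range, so a wide range over
a high-cardinality field can turn into a very large OR.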

The point about "more expressive query language" is typically supported
through post-filtering, e.g. (A and B and
my-really-complicated-function()). In this case, you can primarily run your
query over the intersection of A and B, and then post-filter out records
which don't satisfy the query. However, this is only really ideal when your
primary search terms are identifying a "reasonable" subset of your data.
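
As a sketch of that pattern (plain Python, not how the wiki example is
actually implemented): intersect the indexed terms first, then run the
expensive predicate only over the surviving candidates. The documents, the
posting lists, and the body of my_really_complicated_function are invented
for illustration.

```python
from typing import Callable, Dict, Iterable, Iterator, List

# Hypothetical document store and posting lists for two indexed terms.
docs: Dict[str, str] = {
    "d1": "red pear",
    "d2": "red apple pie recipe",
    "d4": "red apple tart",
    "d5": "green apple jam",
}
red: List[str] = ["d1", "d2", "d4"]
apple: List[str] = ["d2", "d4", "d5"]


def my_really_complicated_function(doc: str) -> bool:
    """Stand-in for logic the index cannot answer; here, more than three words."""
    return len(doc.split()) > 3


def post_filter(
    candidate_ids: Iterable[str],
    fetch: Callable[[str], str],
    predicate: Callable[[str], bool],
) -> Iterator[str]:
    """Fetch each candidate document and keep only IDs whose document passes."""
    for doc_id in candidate_ids:
        if predicate(fetch(doc_id)):
            yield doc_id


# Step 1: the index narrows the search to (red AND apple).
candidates = sorted(set(red) & set(apple))
print(candidates)  # ['d2', 'd4']

# Step 2: the predicate runs only on those candidates, not on every document.
matches = list(post_filter(candidates, docs.__getitem__, my_really_complicated_function))
print(matches)  # ['d2']
```

The cost of step 2 is proportional to the number of candidates, which is
why this only works well when the indexed terms already cut the data down
to a manageable subset.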

As Eric points out, implementing "full" SQL semantics, in addition to
generalized secondary indexing, is a rather difficult problem; however,
Accumulo does make some things very easy to work with.


Re: IndexedDocIterator, indexing approaches

Posted by Eric Newton <er...@gmail.com>.

Hi Rob,

You're going to have to dig into the source code of the wiki example to
find out more.  It would be nice if we could update that example and
provide better documentation, but it is not maintained in its current form.

The wiki example uses JEXL to provide a base query language.  AND term
searches use intersecting iterators; other expressions are handled as basic
filters.

I think you are getting Accumulo schemas, indexing and querying. It's just
harder than you expected.

There are many teams who are working on better frameworks for query in
Accumulo; I will let them speak for themselves.

-Eric

