Posted to user@accumulo.apache.org by Sukant Hajra <qn...@snkmail.com> on 2012/07/16 01:05:26 UTC

more questions about IndexedDocIterators

Hi all,

I have a mixed bag of questions to follow up on an earlier post inquiring about
intersecting iterators now that I've done some prototyping:


1. Do FamilyIntersectingIterators work in 1.3.4?
------------------------------------------------

Does anyone know if FamilyIntersectingIterators were useable as far back as
1.3.4?  Or am I wasting my time on them at this old version (and need to
upgrade)?

I got a prototype of IndexedDocIterators working with Accumulo 1.4.1, but
currently have a hung thread in my attempt to use a FamilyIntersectingIterator
with Cloudbase 1.3.4.  Also, I noticed the API changed somewhat to remove some
oddly designed static configuration.

If FamilyIntersectingIterators were buggy, were there sufficient work-arounds
to get some use out of them in 1.3.4?

Unfortunately, I need to jump through some political/social hoops to upgrade,
but if it's got to be done, then I'll do what I have to.


2. Is this approach reasonable?
-------------------------------

We're trying to be clever with our use of indexed docs.  We're less interested
in searching over a large corpus of data in parallel, and more interested in
doing some server-side joins in a data-local way (to reduce client burden and
network traffic).  So we're heavily "sharding" our documents (billions of
shards) and using range constraints on the iterator to hone in on exactly one
shard (new Range(shardId, shardId)).

Let me give you a sense for what we're doing.  In one use case, we're using
document-indexed iterators to accommodate both per-author and by-time accesses
of a per-document commit log.  So we're sharding by document ID (and we have
billions of documents).  Then we use the author ID as terms for each commit
(one term per commit entry).  We use a reverse timestamp for the doc type, so
we get back these entries in reverse time order.  In this way, we can scan the
log for the entire document by time with plain iterators, and for a specific
author with a document-indexed iterator (with a server-side join to the commit
log entry).  Later on, we may index the log by other features with this
approach.
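
To make the layout concrete, here is a minimal Java sketch of the two cells
written for one commit.  Plain strings stand in for Accumulo keys, the class
and method names are hypothetical, and using a commit ID as the per-shard
docId is an assumption rather than something settled in this thread:

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only: plain strings stand in for Accumulo keys.
final class CommitLogLayout {
    static final String NUL = "\0";

    /** One cell: row, column family, column qualifier, value. */
    static final class Cell {
        final String row;
        final String cf;
        final String cq;
        final String value;

        Cell(String row, String cf, String cq, String value) {
            this.row = row;
            this.cf = cf;
            this.cq = cq;
            this.value = value;
        }
    }

    /** Builds the document entry and the author index entry for one commit. */
    static List<Cell> cellsForCommit(String documentId, String authorId,
                                     String reverseTimestamp, String commitId,
                                     String logEntry) {
        // Document entry: row = shard (the document ID), cf = "e\0docType",
        // cq = docId (assumed here to be the commit ID), value = log entry.
        Cell entry = new Cell(documentId, "e" + NUL + reverseTimestamp,
                              commitId, logEntry);
        // Index entry: cf = "i", cq = "term\0docType\0docId\0docInfo",
        // with the author ID as the term and docInfo left empty.
        Cell index = new Cell(documentId, "i",
                              authorId + NUL + reverseTimestamp + NUL + commitId + NUL,
                              "");
        return Arrays.asList(entry, index);
    }
}
```

Because the reverse timestamp is the doc type, a plain scan over the "e"
entries of one row comes back newest-first, and the index entries for one
author are ordered the same way.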

Is this strategy sane?  Is there precedent for doing it?  Is there a better
alternative?


3. Compressed reverse-timestamp using Unicode tricks?
------------------------------------------------------

I see code in Accumulo like

    // We're past the index column family, so return a term that will sort
    // lexicographically last.  The last unicode character should suffice
    return new Text("\uFFFD");

which gets me thinking that I can probably pull off an impressively compressed,
but still lexically ordered, reverse timestamp using Unicode trickery to get a
gigantic radix.  Is there any precedent for this?  I'm a little worried about
running into corner cases with Unicode encoding.  Otherwise, I think it feels
like a simple algorithm that may not eat up much CPU in translation and might
save disk space at scale.

Or is this optimizing into the noise given compression Accumulo already does
under the covers?
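
For comparison, here is a minimal sketch of the same idea with a much smaller
radix.  It is not the Unicode trick itself: it uses base 36 (ASCII digits and
lowercase letters), which sidesteps the encoding corner cases at the cost of
a longer string.  The class name is hypothetical, and it assumes timestamps
are non-negative milliseconds since the epoch:

```java
// Hypothetical helper: fixed-width base-36 reverse timestamps.
final class ReverseTimestampCodec {
    private static final int RADIX = Character.MAX_RADIX; // 36
    // Width of the largest possible value, so every encoding has equal length.
    private static final int WIDTH = Long.toString(Long.MAX_VALUE, RADIX).length();

    /** Encodes so that newer timestamps sort lexicographically first. */
    static String encode(long millis) {
        if (millis < 0) {
            throw new IllegalArgumentException("negative timestamp: " + millis);
        }
        String digits = Long.toString(Long.MAX_VALUE - millis, RADIX);
        StringBuilder sb = new StringBuilder(WIDTH);
        // Left-pad with '0' so string order matches numeric order.
        for (int i = digits.length(); i < WIDTH; i++) {
            sb.append('0');
        }
        return sb.append(digits).toString();
    }

    /** Inverse of encode. */
    static long decode(String encoded) {
        return Long.MAX_VALUE - Long.parseLong(encoded, RADIX);
    }
}
```

This gives 13 characters per timestamp instead of the 19 needed for a
zero-padded decimal long.  A larger radix would shrink it further, which is
where the Unicode question comes in.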


4. Response from IndexedDocIterator not reflecting documentation
----------------------------------------------------------------

I got back results in my prototype that don't line up with the documentation
for an IndexedDocIterator.  For example, here's some data I put into a test
table:

    r:"shardId", cf:"e\0docType", cq:"docId", value:"content"
    r:"shardId", cf:"i", cq:"term\0docType\0docId\0docInfo", value:[]

This is as per the documentation of IndexedDocIterator.java.  What I believe I
should have gotten back from an intersecting iteration was:

    r:"shardId", cf:"i", cq:"docType\0docId\0docInfo", value:"content"

but instead, the column qualifier I actually got was formatted differently:

    r:"shardId", cf:"i", cq:"\0docType\0docId\0", value:"content"

The document info wasn't returned at all, and the column qualifier was
suspiciously prefixed with a null character.

This isn't so horrible, because I didn't have plans to use the document info
anyway.  Actually, I was curious what people were using it for anyway.

Based upon my read of the source code for IndexedDocIterator#parseDocID, I'm
not sure how the document info could possibly be parsed.  I feel the info part
of the index is truly discarded in code.

I can provide sample code if people doubt the integrity of my prototype.  It's
just not compact in its current form.

Mostly, I want to confirm that this behavior is not due to a user error on my
part.


5. Why not do intersecting iteration of a single term?
------------------------------------------------------

The API throws an exception if you search for only a single term.  Especially
given our strategy of using doc-indexing for server-side joining (question 2
above), it seems like supporting a single-term lookup makes sense.  Also, with
the dynamism of user interaction, you don't always know up front how many terms
a user is interested in anyway.

As a workaround, I'm putting in a dummy term with a not-flag.  But this seems
silly to me.  Am I missing the larger picture or abusing the API?


Thanks for the help,
Sukant

Re: more questions about IndexedDocIterators

Posted by Adam Fuchs <af...@apache.org>.

Another point RE #1: You always have the option of adding iterators to an
already-installed instance. If you want to use the Accumulo version of the
iterators, you can backport those relatively easily and then stick them in
a jar in the lib/ext directory. The only trick is that you need to avoid
classname collisions or the built-in iterators will get loaded instead of
the ones in lib/ext. Just change the package names if that is a problem.

I'm also curious as to how what you described in #2 works. It seems like
what you're doing could work, but the trouble with having billions of
"shards" is that you might have to search through a large number of them
linearly if you can't narrow down the set of candidate shards enough from
the start. It also suggests that each of your billions of shards is
probably small enough that you don't need to worry about keeping a complex
index, and you could just evaluate the entire shard in-memory. However, I
could be totally wrong about the expected distribution. Maybe you can fill
in some more details?

Cheers,
Adam


On Mon, Jul 16, 2012 at 9:34 AM, William Slacum <
wilhelm.von.cloud@accumulo.net> wrote:

> 1) The class hierarchy is a little convoluted, but there doesn't seem to
> be anything necessarily broken about the
> FamilyIntersectingIterator/IndexedDocIterator that would prevent it from
> being backported from trunk to a 1.3.x branch. AFAIK the
> SortedKeyValueIterator interface has remained unchanged between the initial
> 1.3 release up through our current trunk.
>
> 2) I'm a little confused as to what you mean by "sharding by document ID."
> Does this mean that for any given key, the row portion is a document ID? As
> far as reversing the timestamp, it seems reasonable if your queries are
> primarily of the form "give me documents within the past X time units."
>
> 3) What's your timestamp? If it's just a milliseconds-since-epoch
> timestamp, it's not unheard of to encode numeric values into an ordering
> that sorts lexicographically that isn't just padding with zeroes. The
> Wikipedia example has a NumberNormalizer that uses commons-lang to do this.
> As for hard numbers on performance with time and space, I don't have them.
> I would imagine you will see a difference in space and possibly time if the
> deserializing of the String is faster than what you're using now.
>
> 4) I'd like to see your source. Have you looked at the
> IndexedDocIteratorTest to verify that it behaves properly? I'm surprised
> that it's returning you an index column family. Was your sample client
> running with the dummy negation you mentioned in #5?
>

Re: more questions about IndexedDocIterators

Posted by William Slacum <wi...@accumulo.net>.

1) The class hierarchy is a little convoluted, but there doesn't seem to be
anything necessarily broken about the
FamilyIntersectingIterator/IndexedDocIterator that would prevent it from
being backported from trunk to a 1.3.x branch. AFAIK the
SortedKeyValueIterator interface has remained unchanged between the initial
1.3 release up through our current trunk.

2) I'm a little confused as to what you mean by "sharding by document ID."
Does this mean that for any given key, the row portion is a document ID? As
far as reversing the timestamp, it seems reasonable if your queries are
primarily of the form "give me documents within the past X time units."

3) What's your timestamp? If it's just a milliseconds-since-epoch
timestamp, it's not unheard of to encode numeric values into an ordering
that sorts lexicographically that isn't just padding with zeroes. The
Wikipedia example has a NumberNormalizer that uses commons-lang to do this.
As for hard numbers on performance with time and space, I don't have them.
I would imagine you will see a difference in space and possibly time if the
deserializing of the String is faster than what you're using now.

4) I'd like to see your source. Have you looked at the
IndexedDocIteratorTest to verify that it behaves properly? I'm surprised
that it's returning you an index column family. Was your sample client
running with the dummy negation you mentioned in #5?


Re: more questions about IndexedDocIterators

Posted by Adam Fuchs <af...@apache.org>.

I'm not sure there's a canonical source for this, but Jeffrey Dean gave a
keynote at WSDM in 2009 that includes a bunch of stuff on encoding
techniques:
research.google.com/people/jeff/WSDM09-keynote.pdf

Cheers,
Adam


On Mon, Jul 16, 2012 at 6:45 PM, David Medinets <da...@gmail.com>wrote:

> On Mon, Jul 16, 2012 at 3:27 PM, Adam Fuchs <af...@apache.org> wrote:
> > Slide 11 of my table design presentation
> > (http://people.apache.org/~afuchs/slides/accumulo_table_design.pdf) also
> > shows a few extra tricks that might help you out. A
>
> It's on my list to ask more details about that slide. Do you have any
> pointers to more information? I have looked around the web but all
> that I have found is a few white papers with a lot of math.
>

Re: more questions about IndexedDocIterators

Posted by David Medinets <da...@gmail.com>.

On Mon, Jul 16, 2012 at 3:27 PM, Adam Fuchs <af...@apache.org> wrote:
> Slide 11 of my table design presentation
> (http://people.apache.org/~afuchs/slides/accumulo_table_design.pdf) also
> shows a few extra tricks that might help you out. A

It's on my list to ask more details about that slide. Do you have any
pointers to more information? I have looked around the web but all
that I have found is a few white papers with a lot of math.

Re: more questions about IndexedDocIterators

Posted by Adam Fuchs <af...@apache.org>.

*SNIP

> > 3. Compressed reverse-timestamp using Unicode tricks?
> > ------------------------------------------------------
> >
> > I see code in Accumulo like
> >
> > // We're past the index column family, so return a term that will sort
> > // lexicographically last. The last unicode character should suffice
> > return new Text("\uFFFD");
> >
> > which gets me thinking that I can probably pull off an impressively
> > compressed,
> > but still lexically ordered, reverse timestamp using Unicode trickery
> > to get a
> > gigantic radix. Is there any precedent for this? I'm a little worried
> > about
> > running into corner cases with Unicode encoding. Otherwise, I think it
> > feels
> > like a simple algorithm that may not eat up much CPU in translation
> > and might
> > save disk space at scale.
> >
> > Or is this optimizing into the noise given compression Accumulo
> > already does
> > under the covers?
>
> I would think the compression would take care of this.  If you try it and
> get an improvement, we'd be interested in seeing the results.
>
>
I think it is generally a good idea to use encoding techniques whenever
they're quick, effective, and easy. If you know something about your data
then you can usually do better than a general-purpose compression
algorithm. Slide 11 of my table design presentation (
http://people.apache.org/~afuchs/slides/accumulo_table_design.pdf) also
shows a few extra tricks that might help you out. Another possibility is to
use a two's complement representation for a fixed precision number (e.g. a
long or an int), but flip the first bit.
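
A minimal sketch of that last trick in Java, assuming keys are compared as
unsigned bytes (the class and method names are hypothetical):

```java
import java.nio.ByteBuffer;

// Hypothetical helper: order-preserving 8-byte encoding of a signed long.
final class SortableLong {
    /** Big-endian two's complement with the sign bit flipped. */
    static byte[] encode(long value) {
        // XOR with Long.MIN_VALUE flips only the top (sign) bit, so negative
        // values get a leading 0 bit and sort before non-negative ones.
        return ByteBuffer.allocate(Long.BYTES)
                         .putLong(value ^ Long.MIN_VALUE)
                         .array();
    }

    /** Inverse of encode. */
    static long decode(byte[] bytes) {
        return ByteBuffer.wrap(bytes).getLong() ^ Long.MIN_VALUE;
    }

    /** Unsigned lexicographic comparison, the order assumed for stored keys. */
    static int compareUnsigned(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int c = Integer.compare(a[i] & 0xff, b[i] & 0xff);
            if (c != 0) {
                return c;
            }
        }
        return Integer.compare(a.length, b.length);
    }
}
```

Every value takes exactly eight bytes, and byte order matches numeric order
across the whole signed range, so no padding or radix conversion is needed.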

Cheers,
Adam

Re: more questions about IndexedDocIterators

Posted by William Slacum <ws...@gmail.com>.

I'm on a phone, so excuse the lack of info/answers, but #5 is because the
IntersectingIterator is essentially a proof of concept piece of code.
There's no reason you shouldn't be able to do one term. The Wikipedia
example is able to handle single term queries. The code is a bit rough to
read, but should be a starting point.

Re: more questions about IndexedDocIterators

Posted by Billie J Rinaldi <bi...@ugov.gov>.

On Sunday, July 15, 2012 7:05:26 PM, "Sukant Hajra" <qn...@snkmail.com> wrote:
> Hi all,
> 
> I have a mixed bag of questions to follow up on an earlier post
> inquiring about
> intersecting iterators now that I've done some prototyping:
> 
> 
> 1. Do FamilyIntersectingIterators work in 1.3.4?
> ------------------------------------------------
> 
> Does anyone know if FamilyIntersectingIterators were useable as far
> back as
> 1.3.4? Or am I wasting my time on them at this old version (and need
> to
> upgrade)?
> 
> I got a prototype of IndexedDocIterators working with Accumulo 1.4.1,
> but
> currently have a hung thread in my attempt to use a
> FamilyIntersectingIterator
> with Cloudbase 1.3.4. Also, I noticed the API changed somewhat to
> remove some
> oddly designed static configuration.
> 
> If FamilyIntersectingIterators were buggy, were there sufficient
> work-arounds
> to get some use out of them in 1.3.4?
> 
> Unfortunately, I need to jump through some political/social hoops to
> upgrade,
> but if it's got to be done, then I'll do what I have to.

I think there have been a couple of bug fixes to the FamilyIntersectingIterator since 1.3.4 (including ACCUMULO-178 and ACCUMULO-665).  Most of the fixes have been made in the 1.3 branch, with the exception of recent ones made for ACCUMULO-665.  I don't know if any of these would have been related to a hung thread.

> 
> 
> 2. Is this approach reasonable?
> -------------------------------
> 
> We're trying to be clever with our use of indexed docs. We're less
> interested
> in searching over a large corpus of data in parallel, and more
> interested in
> doing some server-side joins in a data-local way (to reduce client
> burden and
> network traffic). So we're heavily "sharding" our documents (billions
> of
> shards) and using range constraints on the iterator to hone in on
> exactly one
> shard (new Range(shardId, shardId)).
> 
> Let me give you a sense for what we're doing. In one use case, we're
> using
> document-indexed iterators to accommodate both per-author and by-time
> accesses
> of a per-document commit log. So we're sharding by document ID (and we
> have
> billions of documents). Then we use the author ID as terms for each
> commit
> (one term per commit entry). We use a reverse timestamp for the doc
> type, so
> we get back these entries in reverse time order. In this way, we can
> scan the
> log for the entire document by time with plain iterators, and for a
> specific
> author with a document-indexed iterator (with a server-side join to
> the commit
> log entry). Later on, we may index the log by other features with this
> approach.
> 
> Is this strategy sane? Is there precedent for doing it? Is there a
> better
> alternative?

Let me see if I understand.  You have a single row per document, with potentially a large number of commit logs for each, containing at least author, time, and document modification information.  You can recover the document from any given time by applying the commit logs in the original order, with a single scan over the 'e'-prefixed column family.  To recover the contributions of a single author, you use the 'i' column family and seek the column qualifier to the author to get back a list of commit IDs (or perhaps the timestamps are the IDs), then you join that with the 'e'-prefixed column family to get back the actual commits.

Are you always dealing with exactly one document at a time?  If it is ever the case that you want to find all the commits an author has made to any document, you're going to have to do a number of seeks that is on the order of the number of shards you have.  If instead you group documents into shards, you aren't losing anything as far as the behavior described above.  You could still recover the entire document as long as you knew the documentID, shardID, and docType (though you wouldn't be able to put timestamps in the docType anymore), and you could similarly pull back all of a single author's changes to a single document.  The improvement would be in the case where you are intersecting over all documents.  The main question would be whether you could still use the IndexedDocIterator out of the box, or whether you would need to modify it for your particular use case.  You might be able to use the plain one if you used a composite documentID that included the original documentID, reverse timestamp, and commitID.
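
A rough sketch of what that grouping could look like, in Java.  Everything
here is an assumption for illustration: the hash-based shard assignment, the
shard name format, the separator character, and the class and method names.

```java
import java.util.Locale;

// Hypothetical helper: group many documents into a fixed number of shards.
final class ShardedCommitIds {
    // Assumed field separator, chosen so it cannot collide with the NUL
    // separators used inside the index column qualifier.
    static final char SEP = '\u0001';

    /** Maps a document to one of numShards rows by hashing its ID. */
    static String shardFor(String documentId, int numShards) {
        int shard = Math.floorMod(documentId.hashCode(), numShards);
        return String.format(Locale.ROOT, "shard%04d", shard);
    }

    /** Composite docId: original document ID, reverse timestamp, commit ID. */
    static String compositeDocId(String documentId, String reverseTimestamp,
                                 String commitId) {
        return documentId + SEP + reverseTimestamp + SEP + commitId;
    }
}
```

With this layout the shard can be recomputed from the documentID alone, and
the entries for one document within a shard still sort newest-first because
the reverse timestamp follows the documentID in the composite ID.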

> 
> 
> 3. Compressed reverse-timestamp using Unicode tricks?
> ------------------------------------------------------
> 
> I see code in Accumulo like
> 
> // We're past the index column family, so return a term that will sort
> // lexicographically last. The last unicode character should suffice
> return new Text("\uFFFD");
> 
> which gets me thinking that I can probably pull off an impressively
> compressed,
> but still lexically ordered, reverse timestamp using Unicode trickery
> to get a
> gigantic radix. Is there any precedent for this? I'm a little worried
> about
> running into corner cases with Unicode encoding. Otherwise, I think it
> feels
> like a simple algorithm that may not eat up much CPU in translation
> and might
> save disk space at scale.
> 
> Or is this optimizing into the noise given compression Accumulo
> already does
> under the covers?

I would think the compression would take care of this.  If you try it and get an improvement, we'd be interested in seeing the results.

> 
> 
> 4. Response from IndexedDocIterator not reflecting documentation
> ----------------------------------------------------------------
> 
> I got back results in my prototype that don't line up with the
> documentation
> for an IndexedDocIterator. For example, here's some data I put into a
> test
> table:
> 
> r:"shardId", cf:"e\0docType", cq:"docId", value:"content"
> r:"shardId", cf:"i", cq:"term\0docType\0docId\0docInfo", value:[]
> 
> This is as per the documentation of IndexedDocIterator.java. What I
> believe I
> should have gotten back from an intersecting iteration was:
> 
> r:"shardId", cf:"i", cq:"docType\0docId\0docInfo", value:"content"
> 
> but instead, the column qualifier I actually got was formatted
> differently:
> 
> r:"shardId", cf:"i", cq:"\0docType\0docId\0", value:"content"
> 
> The document info wasn't returned at all, and the column qualifier was
> suspiciously prefixed with a null character.
> 
> This isn't so horrible, because I didn't have plans to use the
> document info
> anyway. Actually, I was curious what people were using it for anyway.
> 
> Based upon my read of the source code for
> IndexedDocIterator#parseDocID, I'm
> not sure how the document info could possibly be parsed. I feel the
> info part
> of the index is truly discarded in code.
> 
> I can provide sample code if people doubt the integrity of my
> prototype. It's
> just not compact in its current form.
> 
> Mostly, I want to confirm that this behavior is not due to a user
> error on my
> part.

Yes, this is a documentation error.  I believe it just takes a cq:"term\0docType\0docId\0docInfo" and removes the term and docInfo.  We envisioned the need to have some additional term-specific docInfo (for example, the offsets where a term can be found in a document) that could be used by a more specialized iterator to fine-tune results, but I don't know of examples where this has been used.  Because you're intersecting two terms, the term-specific information needs to be removed from the cq.  It might make more sense for it to just return the unmodified document entry instead of a modified index entry, e.g. r:"shardId", cf:"e\0docType", cq:"docId", value:"content".
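
As a minimal sketch of that transformation in Java: the helper below is
hypothetical and is not the IndexedDocIterator source.  It keeps the
separators and blanks out the term and docInfo fields, which reproduces the
qualifier reported in the prototype:

```java
// Hypothetical helper mirroring the observed behavior, not Accumulo code.
final class IndexQualifier {
    static final String NUL = "\0";

    /** Splits "term\0docType\0docId\0docInfo" into its four fields. */
    static String[] split(String columnQualifier) {
        // Limit -1 keeps a trailing empty docInfo field.
        String[] parts = columnQualifier.split(NUL, -1);
        if (parts.length != 4) {
            throw new IllegalArgumentException(
                "expected 4 NUL-separated fields, got " + parts.length);
        }
        return parts;
    }

    /** Drops the term and docInfo but keeps the separators around them. */
    static String stripTermAndInfo(String columnQualifier) {
        String[] parts = split(columnQualifier);
        return NUL + parts[1] + NUL + parts[2] + NUL;
    }
}
```

Given "term\0docType\0docId\0docInfo", stripTermAndInfo yields
"\0docType\0docId\0", so a client can still recover docType and docId by
splitting the returned qualifier on the null character.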

> 
> 
> 5. Why not do intersecting iteration of a single term?
> ------------------------------------------------------
> 
> The API throws an exception if you search for only a single term.
> Especially
> given our strategy of using doc-indexing for server-side
> joining
> (question 2 above), it seems like supporting a single-term lookup
> makes sense.
> Also, with the dynamism of user interaction, you don't always know
> up front how
> many terms a user is interested in anyway.
> 
> As a workaround, I'm putting in a dummy term with a not-flag. But
> this seems
> silly to me. Am I missing the larger picture or abusing the API?

Yes, the IndexedDocIterator could be altered to return documents for a single term.  The current behavior is an artifact of its being a subclass of the IntersectingIterator, which only intersects terms and does not return document contents.

Feel free to open a ticket about improvements you'd like to see.

Billie

> 
> 
> Thanks for the help,
> Sukant