Posted to user@accumulo.apache.org by Thomas Jackson <ob...@gmail.com> on 2013/06/06 22:05:21 UTC

Wikisearch Iterators

Hey everyone,

I am taking the Wikisearch application for a test drive and ran into some
issues.  I have successfully ingested a number of wiki dumps for several
languages into Accumulo and have been able to search on terms that I know
exist in the corpus.  However, the issue I run into is that I get an out of
memory exception when the application performs a full table scan searching
for a term that does not exist in the index. Has anyone else encountered
this issue?

Also I was hoping to find out if anyone had any documentation or
information on how the iterators in the wikisearch application work.

Thanks
TJ

Re: Wikisearch Iterators

Posted by Thomas Jackson <ob...@gmail.com>.
Josh,

Appreciate the help.

I definitely have a use case that will involve terms not being found in the
index (like a user typo), and I need it to exit gracefully.

The OOME definitely happens in the web server, and it is within the method
that creates documents.  This is puzzling, because I would expect no
documents to be created for a term that is not in the index.  Can you help me
understand why this is happening and how to elegantly catch this circumstance?
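
What I have in mind is something like the sketch below: look the term up in a
global index table first and return an empty result if it is absent, rather
than ever falling through to an exhaustive scan.  (The table name
"globalIndex" and the term-as-row layout are guesses on my part, not the
actual Wikisearch schema.)

import java.util.Map.Entry;

import org.apache.accumulo.core.client.Connector;
import org.apache.accumulo.core.client.Scanner;
import org.apache.accumulo.core.client.TableNotFoundException;
import org.apache.accumulo.core.data.Key;
import org.apache.accumulo.core.data.Range;
import org.apache.accumulo.core.data.Value;
import org.apache.accumulo.core.security.Authorizations;

public class TermGuard {
  /** Returns true if the term has at least one entry in the global index. */
  static boolean termExists(Connector conn, String term, Authorizations auths)
      throws TableNotFoundException {
    Scanner scanner = conn.createScanner("globalIndex", auths); // assumed table name
    scanner.setRange(Range.exact(term)); // assumes the term is the row id
    for (Entry<Key,Value> entry : scanner) {
      return true; // any entry at all means the term is indexed
    }
    return false; // absent: caller returns an empty result, no full scan
  }
}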

Definitely agree on needing more fidelity on the iterators in this example.
I have written several simple iterators and used them successfully, but this
is clearly a more advanced implementation of an algorithm built on them.
I understand the concept of a document-partitioned index and an intersecting
iterator, but it is hard to get my brain around the whole thing.  I understand
the table structure and how the tables are scanned, and I understand how the
query is parsed and builds an iterator stack, but I am missing how they all
tie together.  And of course, there is this case where finding no documents is
being turned into too many documents returned.
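
For reference, the basic intersecting-iterator pattern I can follow is the
one from Accumulo's generic shard example, roughly like the sketch below.
The table name "shard" and the two terms are placeholders; in that layout
each row is a partition, each column family is a term, and the column
qualifier is the document id.

import java.util.Collections;
import java.util.Map.Entry;

import org.apache.accumulo.core.client.BatchScanner;
import org.apache.accumulo.core.client.Connector;
import org.apache.accumulo.core.client.IteratorSetting;
import org.apache.accumulo.core.data.Key;
import org.apache.accumulo.core.data.Range;
import org.apache.accumulo.core.data.Value;
import org.apache.accumulo.core.iterators.user.IntersectingIterator;
import org.apache.accumulo.core.security.Authorizations;
import org.apache.hadoop.io.Text;

public class IntersectionQuery {
  static void run(Connector conn, Authorizations auths) throws Exception {
    BatchScanner bs = conn.createBatchScanner("shard", auths, 8);
    bs.setRanges(Collections.singleton(new Range())); // all partition rows
    IteratorSetting cfg = new IteratorSetting(20, "ii", IntersectingIterator.class);
    // The iterator intersects, server-side, the doc ids under each term's
    // column family within a single partition row.
    IntersectingIterator.setColumnFamilies(cfg,
        new Text[] {new Text("apache"), new Text("accumulo")});
    bs.addScanIterator(cfg);
    for (Entry<Key,Value> entry : bs) {
      System.out.println(entry.getKey().getColumnQualifier()); // matching doc id
    }
    bs.close();
  }
}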

Thanks,

TJ


On Thu, Jun 6, 2013 at 4:33 PM, Josh Elser <jo...@gmail.com> wrote:

> Hi Thomas,
>
> A couple of things you can glean from this.
>
> "full table scan" - Implies that, for some reason, the iterators or client
> code did not find one of the terms necessary to satisfy your query and
> attempted to find matching records using an exhaustive search. IMO, this
> shouldn't even exist as the Wikisearch indexes everything, and the
> 'feature' masks infinitely more problems than helping satisfies queries
> that the index can't satisfy (which are few).
>
> OOME - Was this the tabletserver or the webserver? If the webserver, it
> could be that your query returned more results than fit into the
> configured Java heap space. You could try upping -Xmx and see if you can
> find the sweet spot.
>
> It should be said, also, that the iterators included in the Wikisearch
> application are *very* rough and are likely not great examples to use as a
> basis for good Accumulo SortedKeyValueIterator development. However, the
> basic algorithm which the iterators perform is sound, scalable, and can
> perform quite well, especially when coupled with certain optimizations.
>
> I would agree with you that a white-paper or similar on the table
> structure and algorithm is long overdue.
>
> If you have more specific problems, I'm sure the community at large (self
> included) would be happy to help and go into more detail.
>
>
> On 06/06/2013 04:05 PM, Thomas Jackson wrote:
>
>> Hey everyone,
>>
>> I am taking the Wikisearch application for a test drive and ran into some
>> issues.  I have successfully ingested a number of wiki dumps for several
>> languages into Accumulo and have been able to search on terms that I know
>> exist in the corpus.  However, the issue I run into is that I get an out of
>> memory exception when the application performs a full table scan searching
>> for a term that does not exist in the index. Has anyone else encountered
>> this issue?
>>
>> Also I was hoping to find out if anyone had any documentation or
>> information on how the iterators in the wikisearch application work.
>>
>> Thanks
>> TJ
>>
>
>

Re: Wikisearch Iterators

Posted by Josh Elser <jo...@gmail.com>.
Hi Thomas,

A couple of things you can glean from this.

"full table scan" - Implies that, for some reason, the iterators or 
client code did not find one of the terms necessary to satisfy your 
query and attempted to find matching records using an exhaustive search. 
IMO, this shouldn't even exist as the Wikisearch indexes everything, and 
the 'feature' masks infinitely more problems than helping satisfies 
queries that the index can't satisfy (which are few).
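
To make that concrete, the exhaustive fallback amounts to something like the
following sketch (not the actual Wikisearch code; the table name and the
client-side predicate are made up for illustration). With no index entry to
seed the query, every key in the shard table gets read and filtered, and the
unbounded buffering is where the heap pressure comes from:

import java.util.ArrayList;
import java.util.List;
import java.util.Map.Entry;

import org.apache.accumulo.core.client.Connector;
import org.apache.accumulo.core.client.Scanner;
import org.apache.accumulo.core.client.TableNotFoundException;
import org.apache.accumulo.core.data.Key;
import org.apache.accumulo.core.data.Range;
import org.apache.accumulo.core.data.Value;
import org.apache.accumulo.core.security.Authorizations;

public class ExhaustiveFallback {
  static List<Entry<Key,Value>> fullScan(Connector conn, Authorizations auths,
      String term) throws TableNotFoundException {
    List<Entry<Key,Value>> results = new ArrayList<Entry<Key,Value>>();
    Scanner scanner = conn.createScanner("shard", auths); // assumed table name
    scanner.setRange(new Range()); // unbounded range: reads the entire table
    for (Entry<Key,Value> entry : scanner) {
      // Hypothetical client-side predicate; every key in the corpus is touched.
      if (entry.getValue().toString().contains(term)) {
        results.add(entry); // unbounded buffering is where the heap fills up
      }
    }
    return results;
  }
}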

OOME - Was this the tabletserver or the webserver? If the webserver, it 
could be that your query returned more results than fit into the 
configured Java heap space. You could try upping -Xmx and see if you can 
find the sweet spot.
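
Beyond raising -Xmx, a more durable angle (just a sketch of the general idea,
not a Wikisearch patch) is to stream results out to the response as they
arrive instead of buffering them all in a list, so webserver memory stays
flat no matter how many results come back:

import java.io.PrintWriter;
import java.util.Map.Entry;

import org.apache.accumulo.core.client.BatchScanner;
import org.apache.accumulo.core.data.Key;
import org.apache.accumulo.core.data.Value;

public class StreamingResults {
  static void stream(BatchScanner results, PrintWriter out) {
    for (Entry<Key,Value> entry : results) {
      out.println(entry.getKey().getRow()); // write through; nothing accumulates
    }
    results.close();
  }
}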

It should be said, also, that the iterators included in the Wikisearch 
application are *very* rough and are likely not great examples to use as 
a basis for good Accumulo SortedKeyValueIterator development. However, 
the basic algorithm which the iterators perform is sound, scalable, and 
can perform quite well, especially when coupled with certain optimizations.

I would agree with you that a white-paper or similar on the table 
structure and algorithm is long overdue.

If you have more specific problems, I'm sure the community at large 
(self included) would be happy to help and go into more detail.

On 06/06/2013 04:05 PM, Thomas Jackson wrote:
> Hey everyone,
>
> I am taking the Wikisearch application for a test drive and ran into 
> some issues.  I have successfully ingested a number of wiki dumps for 
> several languages into Accumulo and have been able to search on terms 
> that I know exist in the corpus.  However, the issue I run into is 
> that I get an out of memory exception when the application performs a 
> full table scan searching for a term that does not exist in the index. 
> Has anyone else encountered this issue?
>
> Also I was hoping to find out if anyone had any documentation or 
> information on how the iterators in the wikisearch application work.
>
> Thanks
> TJ