Posted to user@accumulo.apache.org by Frank Smith <fr...@outlook.com> on 2013/06/10 05:07:32 UTC

Wikisearch

Appreciate everyone's help on the file storage question, but I was also looking at Josh's response to Thomas Jackson, and do I understand him correctly that the scan of the Index (and likely the ReverseIndex) table are really the key part of the search query, and the full table scan isn't really useful for much (because all of the tokens should go in the Index tables)?
So if I understand correctly, the partitioned main table is where documents and tokens get written, and then a combiner feeds the index tables, which are then scanned during a search?
What would I lose if I wanted to avoid Thomas's OOME and just skip the full table scan part of the search?  
Obviously, since I am not searching Wikipedia, I am going to be making some changes, just want to do it smartly.
Thanks,
Frank

Re: Wikisearch

Posted by Josh Elser <jo...@gmail.com>.
No, you're absolutely right, but if the webserver is OOME'ing, then it's 
obviously doing something :). You could try configuring it to write out 
a heapdump when it OOMEs and use jhat, jvisualvm or similar to analyze 
what was actually in the heap.
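
For example, starting the webserver JVM with
-XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp will drop an .hprof
file at the moment of the OOME, which jhat or jvisualvm can then open.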

Let me expand a little more for you. The global index (forward and 
reverse) attempts to determine the search space for the query. For 
queries over very selective data, it will identify records in a row in 
the doc-partitioned table using the serialized protocol buffer in the 
Value. These records can be tested directly instead of having to also 
"open" the index inside of the doc-partitioned table. For very broad 
queries or intersections over very common terms, the global index 
identifies the rows that need to be searched in the doc-partitioned table.
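
Roughly, in code (a hedged sketch, not the example's actual classes; the 
table name "wikiIndex" and the row/column layout here are assumptions you 
would adjust for your schema), looking a term up in the global index is 
just a single-row scan:

    import java.util.Map.Entry;

    import org.apache.accumulo.core.client.Connector;
    import org.apache.accumulo.core.client.Scanner;
    import org.apache.accumulo.core.client.TableNotFoundException;
    import org.apache.accumulo.core.data.Key;
    import org.apache.accumulo.core.data.Range;
    import org.apache.accumulo.core.data.Value;
    import org.apache.accumulo.core.security.Authorizations;

    public class GlobalIndexLookup {
      // Assumed layout: row = normalized term, column family = field name,
      // column qualifier = partition id, Value = serialized Uid.List
      // protocol buffer (the one mentioned above).
      public static void lookup(Connector conn, String term)
          throws TableNotFoundException {
        Scanner scan = conn.createScanner("wikiIndex", new Authorizations());
        scan.setRange(new Range(term)); // single row: the term itself
        for (Entry<Key,Value> e : scan) {
          Key k = e.getKey();
          // each entry names a partition that could hold matching docs;
          // selective terms also give you the doc ids inside the Value
          System.out.println(k.getColumnFamily() + " -> partition "
              + k.getColumnQualifier() + " ("
              + e.getValue().get().length + " bytes of Uid.List)");
        }
      }
    }

The point is that this scan is tiny compared to the doc-partitioned table; 
it only tells the query where to look next.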

The index in the doc-partitioned table is where the magic happens. A 
"tree" (using that term very loosely for the given implementation) is 
constructed for each field and term pair in each candidate row. At this 
point, merged, sorted reads over each field and term pair in that row 
are performed, looking for docids which satisfy the "tree".

If you think of the docids as integers (they're not actually integers in 
practice, but that's irrelevant), each field and term pair creates a 
list of docids. For every AND in the query, you're intersecting the two 
lists of docids into a single sorted list, and for every OR you're 
merging those two lists into a single sorted list.

This is trivial when you are simply intersecting two terms (e.g. "foo" 
AND "bar"), but applies generally for arbitrary subtrees, e.g. ("foo" 
AND ("bar" OR "bat" OR "baz")). Treating each subtree as a sorted list 
of docids is your recursive definition.
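
As a plain-Java sketch of those two operations (docids shown as longs here 
purely for illustration):

    import java.util.ArrayList;
    import java.util.List;
    import java.util.TreeSet;

    public class DocidLists {
      // AND: keep only docids present in both sorted lists
      static List<Long> and(List<Long> a, List<Long> b) {
        List<Long> out = new ArrayList<Long>();
        int i = 0, j = 0;
        while (i < a.size() && j < b.size()) {
          int cmp = a.get(i).compareTo(b.get(j));
          if (cmp == 0) { out.add(a.get(i)); i++; j++; }
          else if (cmp < 0) i++;
          else j++;
        }
        return out;
      }

      // OR: merge both sorted lists into one sorted, de-duplicated list
      static List<Long> or(List<Long> a, List<Long> b) {
        TreeSet<Long> merged = new TreeSet<Long>(a);
        merged.addAll(b);
        return new ArrayList<Long>(merged);
      }
    }

So the query ("foo" AND ("bar" OR "bat" OR "baz")) is just 
and(foo, or(bar, or(bat, baz))), with each subtree producing another 
sorted list.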

On 06/10/2013 09:46 PM, Frank Smith wrote:
> Ok, thanks for these insights. As I have mentioned, I am tweaking and 
> changing things for my own purposes, and I am trying to understand just 
> how much my tweaking might have unintended consequences.
>
> To extend upon your thoughts on why there is a problem, I need to 
> look in the web service to make sure it isn't creating objects from 
> the results of the search scan, because it should return no results. 
> That is where I am still concerned: shouldn't the scan iterator pass 
> nothing through for a query with no results? Again, I need to look 
> harder myself, but I am more trying to understand how the iterators 
> notionally behave with this table structure.
>
> ------------------------------------------------------------------------
> Date: Sun, 9 Jun 2013 23:18:43 -0400
> Subject: Re: Wikisearch
> From: josh.elser@gmail.com
> To: user@accumulo.apache.org
>
> The forward and reverse index are very important, yes, with the 
> in-partition "field index" being even more important.
>
> Yes to full table scans being undesirable and probably useless in the 
> scope of the wikisearch as it should index most everything and thus 
> there is nothing extra to be gleaned.
>
> I forget exactly how it was implemented, but tokens will appear in the 
> global indices and the doc partitioned table.
>
> The most likely reason for the oome is that the trivial web service 
> included attempts to suck all results into memory. There's nothing 
> inherently wrong with scanning all records in Accumulo, but the 
> webserver will easily fall over.
>
> On Jun 9, 2013 11:08 PM, "Frank Smith" <francis.h.smith@outlook.com 
> <ma...@outlook.com>> wrote:
>
>     Appreciate everyone's help on the file storage question, but I was
>     also looking at Josh's response to Thomas Jackson, and do I
>     understand him correctly that the scan of the Index (and likely
>     the ReverseIndex) table are really the key part of the search
>     query, and the full table scan isn't really useful for much
>     (because all of the tokens should go in the Index tables)?
>
>     So if I understand correctly, the partitioned main table is where
>     documents and tokens get written, and then a combiner feeds the
>     index tables, which are then scanned during a search?
>
>     What would I lose if I wanted to avoid Thomas's OOME and just skip
>     the full table scan part of the search?
>
>     Obviously, since I am not searching Wikipedia, I am going to be
>     making some changes, just want to do it smartly.
>
>     Thanks,
>
>     Frank
>


RE: Wikisearch

Posted by Frank Smith <fr...@outlook.com>.
Ok, thanks for these insights. As I have mentioned, I am tweaking and changing things for my own purposes, and I am trying to understand just how much my tweaking might have unintended consequences.
To extend upon your thoughts on why there is a problem, I need to look in the web service to make sure it isn't creating objects from the results of the search scan, because it should return no results. That is where I am still concerned: shouldn't the scan iterator pass nothing through for a query with no results? Again, I need to look harder myself, but I am more trying to understand how the iterators notionally behave with this table structure.

Date: Sun, 9 Jun 2013 23:18:43 -0400
Subject: Re: Wikisearch
From: josh.elser@gmail.com
To: user@accumulo.apache.org

The forward and reverse index are very important, yes, with the in-partition "field index" being even more important. 
Yes to full table scans being undesirable and probably useless in the scope of the wikisearch as it should index most everything and thus there is nothing extra to be gleaned. 
I forget exactly how it was implemented, but tokens will appear in the global indices and the doc partitioned table. 
The most likely reason for the oome is that the trivial web service included attempts to suck all results into memory. There's nothing inherently wrong with scanning all records in Accumulo, but the webserver will easily fall over. 

On Jun 9, 2013 11:08 PM, "Frank Smith" <fr...@outlook.com> wrote:

Appreciate everyone's help on the file storage question, but I was also looking at Josh's response to Thomas Jackson, and do I understand him correctly that the scan of the Index (and likely the ReverseIndex) table are really the key part of the search query, and the full table scan isn't really useful for much (because all of the tokens should go in the Index tables)?

So if I understand correctly, the partitioned main table is where documents and tokens get written, and then a combiner feeds the index tables, which are then scanned during a search?

What would I lose if I wanted to avoid Thomas's OOME and just skip the full table scan part of the search?  
Obviously, since I am not searching Wikipedia, I am going to be making some changes, just want to do it smartly.

Thanks,
Frank

Re: Wikisearch

Posted by Josh Elser <jo...@gmail.com>.
The forward and reverse index are very important, yes, with the
in-partition "field index" being even more important.

Yes to full table scans being undesirable and probably useless in the scope
of the wikisearch as it should index most everything and thus there is
nothing extra to be gleaned.

I forget exactly how it was implemented, but tokens will appear in the
global indices and the doc partitioned table.

The most likely reason for the oome is that the trivial web service
included attempts to suck all results into memory. There's nothing
inherently wrong with scanning all records in Accumulo, but the webserver
will easily fall over.
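
As a sketch of that difference (class and method names here are 
placeholders, not wikisearch classes):

    import java.io.IOException;
    import java.io.Writer;
    import java.util.ArrayList;
    import java.util.List;
    import java.util.Map.Entry;

    import org.apache.accumulo.core.client.Scanner;
    import org.apache.accumulo.core.data.Key;
    import org.apache.accumulo.core.data.Value;

    class ResultStreaming {
      // what falls over: buffering every result in the webserver's heap
      static List<String> buffered(Scanner scan) {
        List<String> all = new ArrayList<String>();
        for (Entry<Key,Value> e : scan)
          all.add(e.getKey().getRow().toString());
        return all; // millions of entries -> OOME
      }

      // what scales: handle each result as it arrives and let it go
      static void streamed(Scanner scan, Writer out) throws IOException {
        for (Entry<Key,Value> e : scan) {
          out.write(e.getKey().getRow().toString());
          out.write('\n');
        }
      }
    }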
On Jun 9, 2013 11:08 PM, "Frank Smith" <fr...@outlook.com> wrote:

> Appreciate everyone's help on the file storage question, but I was also
> looking at Josh's response to Thomas Jackson, and do I understand him
> correctly that the scan of the Index (and likely the ReverseIndex) table
> are really the key part of the search query, and the full table scan isn't
> really useful for much (because all of the tokens should go in the Index
> tables)?
>
> So if I understand correctly, the partitioned main table is where
> documents and tokens get written, and then a combiner feeds the index
> tables, which are then scanned during a search?
>
> What would I lose if I wanted to avoid Thomas's OOME and just skip the
> full table scan part of the search?
>
> Obviously, since I am not searching Wikipedia, I am going to be making
> some changes, just want to do it smartly.
>
> Thanks,
>
> Frank
>