Posted to dev@lucene.apache.org by "David Smiley (JIRA)" <ji...@apache.org> on 2015/03/09 22:13:38 UTC
[jira] [Updated] (LUCENE-4942) Indexed non-point shapes index excessive terms

     [ https://issues.apache.org/jira/browse/LUCENE-4942?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

David Smiley updated LUCENE-4942:
---------------------------------
    Attachment: LUCENE-4942_non-point_excessive_terms.patch

The attached patch _does not_ have the "+" / "*" (approximated leaf vs contained leaf) leaf type differentiation; that can wait.

Summary of patch changes:
* CellTokenStream: removed the dual/redundant indexing it was doing for leaf cells.  This simplified it, and I further simplified it to the point that CTS is now really a generic TokenStream for a BytesRefIterator you give it.  I have a nocommit to rename CellTokenStream to BytesRefIteratorTokenStream.
* Related to the CellTokenStream change, I refactored PrefixTreeStrategy a little to now have a protected createCellIteratorToIndex() and protected newCellToBytesRefIterator(), and added a CellToBytesRefIterator class. This arrangement paves the way for TokenStream re-use (LUCENE-5776), although it leaves the actual re-use to a future patch on that issue.
* TermQueryPrefixTreeStrategy overrides newCellToBytesRefIterator to return a CellToBytesRefIterator (CTBRI) subclass that omits the leaf byte (since this strategy doesn’t query for it).
* Primary search-time changes were in AbstractVisitingPrefixTreeFilter (the base of Intersects, Within, heatmaps), WithinPrefixTreeFilter, and ContainsPrefixTreeFilter.
* ContainsPrefixTreeFilter now does more leap-frogging than it used to; it’s probably a bit faster as a result.
* Enhanced the toString() methods in the Filters to include the query shape.
* (Refactoring) Cell.isLeaf() should always return true if its level == maxLevels, and I clarified that when cell.isLeaf is false, the cell is a “prefix” (effectively the opposite of a leaf), which means there are cells at further resolutions (greater levels). For the Quad & Geohash PrefixTrees, it’s an implementation detail that they don’t append the ‘+’ because doing so is redundant/implied.
* (Refactoring) AbstractVisitingPrefixTreeFilter (the base of Intersects, Within, heatmaps) no longer has a hasIndexedLeaves boolean flag to supposedly make it faster for the all-points case.  The checks where it might be relevant are very cheap so I’d rather keep this class simpler.
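
To illustrate the indexing change described in the CellTokenStream and TermQueryPrefixTreeStrategy bullets, here is a minimal, self-contained sketch. It does not use the Lucene API: LeafTermSketch, Cell, termsBeforePatch and termsAfterPatch are hypothetical names, the geohash-like tokens are made up, and the term order is only illustrative. It assumes the old behavior was as the issue description says (a leaf cell is indexed once with the leaf marker and once without).

```java
import java.util.ArrayList;
import java.util.List;

public class LeafTermSketch {

    /** Hypothetical stand-in for a grid cell: a token plus a leaf flag. */
    public static final class Cell {
        final String token;
        final boolean leaf;

        public Cell(String token, boolean leaf) {
            this.token = token;
            this.leaf = leaf;
        }
    }

    /** Marker appended to leaf cell tokens, as in the issue description. */
    static final char LEAF_MARKER = '+';

    /** Old behavior (assumed from the issue text): leaf cells yield two terms. */
    public static List<String> termsBeforePatch(List<Cell> cells) {
        List<String> terms = new ArrayList<>();
        for (Cell c : cells) {
            terms.add(c.token);
            if (c.leaf) {
                terms.add(c.token + LEAF_MARKER); // the redundant second term
            }
        }
        return terms;
    }

    /**
     * New behavior as sketched here: exactly one term per cell. The leaf
     * marker is emitted only when the strategy actually queries for it
     * (false models the TermQuery-based strategy).
     */
    public static List<String> termsAfterPatch(List<Cell> cells, boolean emitLeafMarker) {
        List<String> terms = new ArrayList<>();
        for (Cell c : cells) {
            boolean mark = c.leaf && emitLeafMarker;
            terms.add(mark ? c.token + LEAF_MARKER : c.token);
        }
        return terms;
    }

    public static void main(String[] args) {
        List<Cell> cells = List.of(
                new Cell("9", false),
                new Cell("9q", false),
                new Cell("9q8", true),
                new Cell("9q9", true));
        System.out.println("before:              " + termsBeforePatch(cells));
        System.out.println("after (recursive):   " + termsAfterPatch(cells, true));
        System.out.println("after (term query):  " + termsAfterPatch(cells, false));
    }
}
```

In this toy input two of the four cells are leaves, so the term count drops from six to four. The real saving depends on the ratio of leaf to prefix cells for a given shape, which is why the index size still needs to be measured.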

Tests pass; I'll try precommit later.  I've yet to try lucene/benchmark and examine the index size change.

> Indexed non-point shapes index excessive terms
> ----------------------------------------------
>
>                 Key: LUCENE-4942
>                 URL: https://issues.apache.org/jira/browse/LUCENE-4942
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: modules/spatial
>            Reporter: David Smiley
>            Assignee: David Smiley
>         Attachments: LUCENE-4942_non-point_excessive_terms.patch
>
>
> Indexed non-point shapes consist of a set of terms that represent grid cells.  Cells completely within the shape, or cells on the intersecting edge that are at the maximum detail depth being indexed for the shape, are denoted as "leaf" cells.  Such cells have a trailing '+' at the end.  _Such tokens are actually indexed twice_, once with the leaf byte and once without.
> The TermQuery based PrefixTree Strategy doesn't consider the notion of 'leaf' cells and so the tokens with '+' are completely redundant.
> The Recursive [algorithm] based PrefixTree Strategy better supports correct search of indexed non-point shapes than TermQuery does and the distinction is relevant.  However, the foundational search algorithms used by this strategy (Intersects & Contains; the other 2 are based on these) could each be upgraded to deal with this correctly.  Not trivial but very doable.
> In the end, spatial non-point indexes can probably be trimmed by ~40% by doing this.


