Posted to dev@lucene.apache.org by "Eks Dev (JIRA)" <ji...@apache.org> on 2010/05/27 22:45:38 UTC

[jira] Commented: (LUCENE-2482) Index sorter

    [ https://issues.apache.org/jira/browse/LUCENE-2482?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12872357#action_12872357 ] 

Eks Dev commented on LUCENE-2482:
---------------------------------

Nice!
There is another interesting use case for sorting an index: performance and index size.

We use a couple of fields with low cardinality (zip code, user group, and the like). Having the index sorted on these makes RLE compression of postings really effective, making it possible to load all values into a couple of megabytes of RAM.
At the moment we just sort the collection before indexing.
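
To illustrate why sorting helps run-length encoding, here is a minimal sketch in plain Java (not Lucene code). It assumes a hypothetical column of per-document "user group" ids laid out in docid order, and counts how many (value, length) pairs an RLE encoder would have to emit. The class name, method name, and data are invented for illustration.

```java
import java.util.Arrays;

final class RleSketch {
    /** Counts runs of equal adjacent values; RLE emits one (value, length) pair per run. */
    static int countRuns(int[] values) {
        if (values.length == 0) {
            return 0;
        }
        int runs = 1;
        for (int i = 1; i < values.length; i++) {
            if (values[i] != values[i - 1]) {
                runs++; // a new run starts whenever the value changes
            }
        }
        return runs;
    }

    public static void main(String[] args) {
        // Hypothetical "user group" id of each document, in docid order.
        int[] unsorted = {3, 1, 3, 2, 1, 2, 3, 1};
        int[] sorted = unsorted.clone();
        Arrays.sort(sorted); // stands in for sorting the index on this field

        System.out.println(countRuns(unsorted)); // prints 8
        System.out.println(countRuns(sorted));   // prints 3
    }
}
```

After sorting, the number of runs equals the number of distinct values, so for a low-cardinality field the encoded size no longer grows with the number of documents.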

Would it be possible to somehow use a combination of stored fields and to specify a comparator? Even comparing them as byte[] would do the trick for this use case, since it is only important to keep equal values together; the order itself is irrelevant. Of course, having a decoder to decode the byte[] before comparing would be useful (e.g. for composite fields), but it would work in many cases without one.
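
A rough sketch of what such a comparator could look like, in plain Java. This is not part of the attached patch or of any Lucene API; the class `StoredFieldOrder` and its members are hypothetical names. It shows a raw byte[] comparison that groups equal values together, plus an optional decoder hook for cases where a decoded order is wanted.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.Comparator;
import java.util.function.Function;

final class StoredFieldOrder {
    /** Unsigned lexicographic comparison of raw bytes; equal values end up adjacent. */
    static final Comparator<byte[]> RAW_BYTES = (a, b) -> {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int cmp = (a[i] & 0xFF) - (b[i] & 0xFF);
            if (cmp != 0) {
                return cmp;
            }
        }
        return a.length - b.length; // a shorter prefix sorts first
    };

    /** Decodes each byte[] first, then compares the decoded values with the given order. */
    static <T> Comparator<byte[]> decoded(Function<byte[], T> decoder,
                                          Comparator<? super T> order) {
        return (a, b) -> order.compare(decoder.apply(a), decoder.apply(b));
    }

    public static void main(String[] args) {
        // Hypothetical stored zip-code values, one per document.
        byte[][] zips = {
            "90210".getBytes(StandardCharsets.UTF_8),
            "10001".getBytes(StandardCharsets.UTF_8),
            "90210".getBytes(StandardCharsets.UTF_8),
            "10001".getBytes(StandardCharsets.UTF_8),
        };
        Arrays.sort(zips, RAW_BYTES);
        for (byte[] z : zips) {
            System.out.println(new String(z, StandardCharsets.UTF_8));
        }
    }
}
```

For composite fields, something like `StoredFieldOrder.decoded(b -> new String(b, StandardCharsets.UTF_8), String.CASE_INSENSITIVE_ORDER)` would compare decoded values instead of raw bytes.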

This works fine even with a moderate update rate, as you can re-sort periodically. The index does not have to be totally sorted; everything still works, just slightly more memory is needed for filters.

With flex, having postings that use RLE compression is quite possible ... this tool could become an "optimizeHard()" tool for some indexes :)

> Index sorter
> ------------
>
>                 Key: LUCENE-2482
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2482
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: contrib/*
>    Affects Versions: 3.1
>            Reporter: Andrzej Bialecki 
>             Fix For: 3.1
>
>         Attachments: indexSorter.patch
>
>
> A tool to sort an index according to a float document weight. Documents with high weight are given low document numbers, which means that they will be evaluated first. When using a strategy of "early termination" of queries (see TimeLimitedCollector), such sorting significantly improves the quality of partial results.
> (Originally this tool was created by Doug Cutting in Nutch, and used norms as document weights, so the ordering was constrained by the limited resolution of norms. This is a pure Lucene version of the tool, and it uses arbitrary floats from a specified stored field.)
