Posted to dev@lucene.apache.org by "Marvin Humphrey (JIRA)" <ji...@apache.org> on 2007/03/26 22:45:32 UTC

[jira] Created: (LUCENE-851) Pruning

Pruning
-------

                 Key: LUCENE-851
                 URL: https://issues.apache.org/jira/browse/LUCENE-851
             Project: Lucene - Java
          Issue Type: New Feature
          Components: Index, Search
            Reporter: Marvin Humphrey
            Priority: Minor


Greets,

A thread on java-dev a couple of months ago drew my attention to a technique used by Nutch for cutting down the number of hits that have to be processed:  if you have an algorithm for ordering documents by importance, and you sort them so that the lowest document numbers have the highest rank, then most of your high-scoring hits are going to occur early on in the hit-collection process.  Say you're looking for the top 100 matches -- the odds are pretty good that after you've found 1000 hits, you've gotten most of the good stuff.  It may not be necessary to score the other, say, 5,000,000 hits.

To pull this off in Nutch, they run the index through a post process whereby documents are re-ordered by page score using the IndexSorter class.  Unfortunately, post-processing does not live happily with incremental indexing.  

However, if we ensure that document numbers are ordered according to our criteria within each segment, that's almost as good.

Say we're looking for 100 hits, as before; what we do is collect a maximum of 1000 hits per segment.  If we are dealing with an index made up of 25 segments, that's 25,000 hits max we'll have to process fully -- the rest we can skip over.  That's not as quick as only processing 1000 hits then stopping in a fully optimized index, but it's a lot better than churning through all 5,000,000 hits.

A lot of those hits from the smallest segments will be garbage; we'll get most of our good hits from a few large segments most of the time.  But that's fine -- the cost to process any one segment is small.

Writing a low-level scoring loop which implements pruning per segment is straightforward.  KinoSearch's version (in C) is below.

To control the amount of pruning, we need a high-level Searcher.setPruneFactor API, which sets a multiplier; the number of hits-per-segment which must be processed is determined by multiplying the number of hits you need by pruneFactor.  Here's code from KS for deriving hits-per-seg:

    # process prune_factor if supplied
    my $seg_starts;
    my $hits_per_seg = 2**31 - 1;
    if ( defined $self->{prune_factor} and defined $args{num_wanted} ) {
        my $prune_count = $self->{prune_factor} * $args{num_wanted};

        if ( $prune_count < $hits_per_seg ) {    # don't exceed I32_MAX
            $hits_per_seg = $prune_count;
            $seg_starts   = $reader->get_seg_starts;
        }
    }
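
A rough Java transliteration for the Lucene side of things (a sketch only -- setPruneFactor and the other names here are invented to mirror the Perl above; no such API exists in Lucene yet):

    // Hypothetical sketch of the proposed Searcher.setPruneFactor API.
    // Every name is invented; it just mirrors the Perl snippet above.
    public class PruningSearcher {
        private Integer pruneFactor;   // null => pruning disabled

        public void setPruneFactor(int pruneFactor) {
            this.pruneFactor = pruneFactor;
        }

        /** Derive the per-segment hit cap: pruneFactor * numWanted. */
        int hitsPerSeg(Integer numWanted) {
            int hitsPerSeg = Integer.MAX_VALUE;        // 2**31 - 1
            if (pruneFactor != null && numWanted != null) {
                long pruneCount = (long) pruneFactor * numWanted;
                if (pruneCount < hitsPerSeg) {         // don't exceed I32_MAX
                    hitsPerSeg = (int) pruneCount;
                }
            }
            return hitsPerSeg;
        }
    }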

What I have not yet written is the index-time mechanism for sorting documents.  

In Nutch, they use the norms from a known indexed, non-tokenized field ("site").  However, in Lucene and KS, we can't count on any existing fields.  Document boost isn't stored directly, either.  The obvious answer is to start storing it, which would suffice for Nutch-like uses.  However, it may make sense to avoid coupling document ordering to boost, so that pruning can be influenced without affecting scores.

The sort ordering information needs a permanent home in the index, since it will be needed whenever segment merging occurs.  The fixed-width per-document storage in Lucene's .fdx file seems like a good place.  If we use one float per document, we can simply put it before or after the 64-bit file pointer and seek into the file after multiplying the doc num by 12 rather than 8.  
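
To make the seek arithmetic concrete, here is a minimal sketch of reading such a 12-byte record, with the ordering float stored ahead of the pointer (this is a proposal, not Lucene's actual .fdx format; the names are illustrative):

    import java.io.IOException;
    import java.io.RandomAccessFile;

    // Sketch only: one 32-bit ordering float plus the existing 64-bit
    // file pointer gives a fixed 12-byte record per document.
    class FdxRecordSketch {
        static final int RECORD_SIZE = 4 + 8;   // float + long = 12 bytes

        static long fieldDataPointer(RandomAccessFile fdx, int docNum)
                throws IOException {
            fdx.seek((long) docNum * RECORD_SIZE);  // stride of 12, not 8
            float ordering = fdx.readFloat();       // per-doc ordering info
            return fdx.readLong();                  // pointer into .fdt
        }
    }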

During indexing, we'd keep the ordering info in an array; after all documents for a segment have been added, we create an array of sorted document numbers.  When flushing the postings, their document numbers get remapped using the sorted array.  Then we rewrite the .fdx file (and also the .tvx file), moving the file pointers (and ordering info) to remapped locations.  The fact that the .fdt file is now "out of order" isn't a problem -- optimizing sequential access to that file isn't important.
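
A minimal sketch of that remapping step, assuming one ordering float per document with higher values meaning greater importance (the names are illustrative):

    import java.util.Arrays;
    import java.util.Comparator;

    // Sort provisional doc numbers by descending ordering value, so the
    // most important docs get the lowest numbers, then build the old->new
    // lookup used when remapping postings and rewriting .fdx/.tvx entries.
    class DocRemapSketch {
        static int[] buildOldToNew(float[] ordering) {
            Integer[] byRank = new Integer[ordering.length];
            for (int i = 0; i < byRank.length; i++) byRank[i] = i;
            Arrays.sort(byRank, Comparator
                .comparingDouble((Integer d) -> (double) ordering[d])
                .reversed());
            int[] oldToNew = new int[ordering.length];
            for (int newDoc = 0; newDoc < byRank.length; newDoc++) {
                oldToNew[byRank[newDoc]] = newDoc;
            }
            return oldToNew;
        }
    }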

This issue is closely tied to LUCENE-843, "Improve how IndexWriter uses RAM to buffer added documents", and LUCENE-847, "Factor merge policy out of IndexWriter".  Michael McCandless, Steven Parks, Ning Li, anybody else... comments?  Suggestions?

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/

--------------------------------------------------------------------

void
Scorer_collect(Scorer *self, HitCollector *hc, u32_t start, u32_t end,
               u32_t hits_per_seg, VArray *seg_starts)
{
    u32_t seg_num          = 0;
    u32_t doc_num_thresh   = 0;
    u32_t hits_this_seg    = 0;
    u32_t hits_thresh      = hits_per_seg;

    /* get to first doc */
    if ( !Scorer_Skip_To(self, start) )
        return;

    /* execute scoring loop */
    do {
        u32_t doc_num = Scorer_Doc(self);
        Tally *tally;

        if (hits_this_seg >= hits_thresh || doc_num >= doc_num_thresh) {
            if (doc_num >= end) {
                /* bail out of loop if we've hit the user-spec'd end */
                return;
            }
            else if (seg_starts == NULL || seg_starts->size == 0) {
                /* must supply seg_starts to enable pruning */
                hits_thresh    = U32_MAX;
                doc_num_thresh = end;
            }
            else if (seg_num == seg_starts->size) {
                /* we've finished the last segment */
                return;
            }
            else {
                /* get start of upcoming segment */
                Int *this_start = (Int*)VA_Fetch(seg_starts, seg_num);
                Int *next_start = (Int*)VA_Fetch(seg_starts, seg_num + 1);
                u32_t this_seg_start = this_start->value;
                seg_num++;

                /* skip docs as appropriate if we're pruning */
                if (doc_num < this_seg_start) {
                    if ( Scorer_Skip_To(self, this_seg_start) )
                        doc_num = Scorer_Doc(self);
                    else
                        return;
                }

                /* set the last doc_num we'll score this upcoming segment */
                doc_num_thresh = next_start == NULL
                    ? end  /* last segment */
                    : next_start->value;
            }

            /* start over counting hits for the new segment */
            hits_this_seg = 0;
        }

        /* this doc is in range, so collect it */
        tally = Scorer_Tally(self);
        hc->collect(hc, doc_num, tally->score);
        hits_this_seg++;
    } while (Scorer_Next(self));
}

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


Re: [jira] Created: (LUCENE-851) Pruning

Posted by Marvin Humphrey <ma...@rectangular.com>.
On Mar 29, 2007, at 7:44 PM, Ning Li wrote:

> If a query requires top-K results, isn't it
> sufficient to find top-K results in each segment and merge them to
> return the overall top-K results?

They are merged by collecting them into a HitQueue.
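
Roughly like so, with java.util.PriorityQueue standing in for Lucene's internal HitQueue (a sketch, not the actual implementation):

    import java.util.PriorityQueue;

    // Per-segment candidates all funnel into one bounded queue, so the
    // overall top-K falls out of the per-segment top-Ks automatically.
    // Entries are {score, docNum} pairs (docNum stored as float for brevity).
    class TopKMergeSketch {
        final int k;
        final PriorityQueue<float[]> queue =
            new PriorityQueue<>((a, b) -> Float.compare(a[0], b[0]));  // min score on top

        TopKMergeSketch(int k) { this.k = k; }

        void collect(int docNum, float score) {
            if (queue.size() < k) {
                queue.add(new float[] { score, docNum });
            } else if (score > queue.peek()[0]) {
                queue.poll();                     // evict the current worst hit
                queue.add(new float[] { score, docNum });
            }
        }
    }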

> Early termination happens in
> finding top-K results in one segment. Assuming each document has a
> static score, document ids are assigned in the same order of their
> static scores within a segment. If a top-K query is scored by the same
> static score, query processing on a segment can stop as soon as the
> first K results are found.

Indeed, that's exactly how the loop in Scorer_collect() works.

> As to the indexing side, applications should be able to pick such a
> static score? If Lucene score function is used, norm is a good
> candidate? (One tricky thing with norm is that it is updatable.)

I would argue that only a single mechanism based on indexed, non-tokenized fields should be used to determine sort order.  Sort order based upon norms is easy for the user to fake using a dedicated field at a small cost, so library-level support is not needed.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/





Re: [jira] Created: (LUCENE-851) Pruning

Posted by Ning Li <ni...@gmail.com>.
It would be great to support early termination for top-K queries within
the DAAT query processing model in Lucene. There is quite a bit of
published work in related areas;
http://portal.acm.org/citation.cfm?id=956944 is one example.

Am I getting it right? If a query requires top-K results, isn't it
sufficient to find top-K results in each segment and merge them to
return the overall top-K results? (This is the most straightforward
way; many optimizations are possible.) Early termination happens in
finding top-K results in one segment. Assuming each document has a
static score, document ids are assigned in the same order as their
static scores within a segment. If a top-K query is scored by the same
static score, query processing on a segment can stop as soon as the
first K results are found. If the score function is not just the
static score but is still highly related to it (and is monotonic...),
query processing can stop as soon as the upper bound of the scores for
the unprocessed documents is lower than the lowest score in the
current top-K results. This also reveals the weakness of this scheme:
the static score must be significant enough in the score function
used by a query to enable significant early termination. This is true
in many specialized indexes.
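
A minimal sketch of that stopping test, assuming the score decomposes into the static score plus a bounded query-dependent part (all names illustrative):

    // Doc ids are assigned in descending static-score order, so the next
    // unprocessed doc's static score bounds every remaining static score.
    class EarlyTerminationSketch {
        static boolean canStop(float nextStaticScore,
                               float maxQueryContribution,
                               float kthBestScoreSoFar,
                               int hitsCollected, int k) {
            if (hitsCollected < k) return false;    // top-K not yet full
            float upperBound = nextStaticScore + maxQueryContribution;
            return upperBound < kthBestScoreSoFar;  // no unseen doc can get in
        }
    }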

As to the indexing side, applications should be able to pick such a
static score? If the Lucene score function is used, norm is a good
candidate? (One tricky thing with norms is that they are updatable.)
Documents have to be sorted and id-ed when first-level segments are
built. The merge process which builds higher-level segments remains
simple, since each segment is already sorted by static score, so the
impact on a factored merge policy is minimal. A sketch follows.
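
In sketch form, the reason the merge stays simple: segments already sorted by descending static score combine with a plain two-way merge, and the result comes out sorted as well (illustrative code, not Lucene's merge):

    // Two-way merge of per-segment static scores, highest first; the same
    // idea extends to a k-way merge across any number of segments.
    class SortedSegmentMergeSketch {
        static float[] merge(float[] a, float[] b) {
            float[] merged = new float[a.length + b.length];
            int i = 0, j = 0, n = 0;
            while (i < a.length && j < b.length) {
                merged[n++] = (a[i] >= b[j]) ? a[i++] : b[j++];
            }
            while (i < a.length) merged[n++] = a[i++];  // drain leftovers
            while (j < b.length) merged[n++] = b[j++];
            return merged;
        }
    }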

Cheers,
Ning



[jira] Commented: (LUCENE-851) Pruning

Posted by "Marvin Humphrey (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-851?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12484258 ] 

Marvin Humphrey commented on LUCENE-851:
----------------------------------------

On Mar 26, 2007, at 3:45 PM, Karl Wettin (JIRA) wrote:

> I did not mean it should be used for primary sorting, but 
> rather as you describe it as priority thingy: in the n first 
> hits collected, the m (say n/10 as you guessed in the 
> description of the issue) most relevant (in this case relevant 
> would mean the most recent dates) would probably have been 
> collected and thus the collection process could be stopped. 

Right.  I was referring to the as-yet unresolved issue of how to implement index-time sorting.

What I'll probably do in KS is create a SortExternal object and feed it strings which have the (provisional) document number tacked on to the end of the value for the sort field. The sortex object's sorting routine will be set to ignore the document number part of the serialized string. Then, when finish() is called, the array of remapped document numbers can be allocated, and doc nums read into it as sorted elements are pulled from the sortex pool.

Using an external sorter is better than accumulating those strings in an array, then sorting in-memory.  The strings can be of arbitrary length, yet memory usage will be capped by the external sorter's memory threshold no matter how many documents get added to the segment during the indexing session.
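
In sketch form, the key encoding looks like this (SortExternal is a KinoSearch class; this Java stand-in sorts in memory purely to show the comparator and the read-out, where the real thing spills to disk past a memory threshold):

    import java.util.Arrays;
    import java.util.Comparator;

    class SortExternalSketch {
        // The sort field's value with the provisional doc number tacked on.
        record Entry(String fieldValue, int provisionalDocNum) {}

        static int[] sortedDocNums(Entry[] entries) {
            // The comparator ignores the doc-number part entirely.
            Arrays.sort(entries, Comparator.comparing(Entry::fieldValue));
            // Read doc nums back out in sorted order to fill the remap array.
            int[] docNums = new int[entries.length];
            for (int i = 0; i < entries.length; i++) {
                docNums[i] = entries[i].provisionalDocNum();
            }
            return docNums;
        }
    }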


[jira] Commented: (LUCENE-851) Pruning

Posted by "Karl Wettin (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-851?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12484234 ] 

Karl Wettin commented on LUCENE-851:
------------------------------------

If this is what I think, then it is really cool and can be very interesting for indices with a single static natural order, for instance the publishing date of news articles. I know people have implemented this by reverting norms to a true float (32 bits rather than 8 bits), incrementing the value, and using that as the natural order -- a somewhat crazy solution if you ask me.



[jira] Commented: (LUCENE-851) Pruning

Posted by "Marvin Humphrey (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-851?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12484248 ] 

Marvin Humphrey commented on LUCENE-851:
----------------------------------------

That suggests an implementation derived from how sorting is currently done in Lucene/KS. You would need to specify a primary sort field.  (In KS, this can be achieved by adding a sort_field property to the Schema, which must name an indexed, non-tokenized field.)  If you wanted to store boost as a field, you could use that.  But you could also use publication date, price, or whatever.

Doing things that way wouldn't require any index format changes.
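
A sketch of that schema-level knob, transliterated to Java (hypothetical names; nothing like this exists in Lucene):

    // The one constraint named above: the sort field must be indexed and
    // non-tokenized, so each document carries exactly one comparable value.
    class SchemaSketch {
        private String sortField;

        void setSortField(String fieldName, boolean indexed, boolean tokenized) {
            if (!indexed || tokenized) {
                throw new IllegalArgumentException(
                    "sort_field must name an indexed, non-tokenized field");
            }
            this.sortField = fieldName;
        }
    }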



[jira] Commented: (LUCENE-851) Pruning

Posted by "Karl Wettin (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-851?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12484253 ] 

Karl Wettin commented on LUCENE-851:
------------------------------------

I did not mean it should be used for primary sorting, but rather, as you describe it, as a priority thingy: in the first n hits collected, the m most relevant (say m = n/10, as you guessed in the description of the issue; "relevant" in this case would mean the most recent dates) would probably have been collected, and thus the collection process could be stopped. Or so. Or? Then the crazy norms ordering I described could be skipped and the standard Lucene field sort would be applied.
