Posted to dev@lucene.apache.org by "Andrzej Bialecki (JIRA)" <ji...@apache.org> on 2009/08/15 16:58:14 UTC

[jira] Created: (LUCENE-1812) Static index pruning by in-document term frequency (Carmel pruning)

Static index pruning by in-document term frequency (Carmel pruning)
-------------------------------------------------------------------

                 Key: LUCENE-1812
                 URL: https://issues.apache.org/jira/browse/LUCENE-1812
             Project: Lucene - Java
          Issue Type: New Feature
          Components: contrib/*
    Affects Versions: 2.9
            Reporter: Andrzej Bialecki 


This module provides tools to produce a subset of input indexes by removing postings data for terms whose in-document frequency is below a specified threshold. The net effect of this processing is a much smaller index that, for common types of queries, returns nearly identical top-N results compared with the original index, but with better search performance.

Optionally, stored values and term vectors can also be removed. This functionality is largely independent, so it can be used without term pruning (when term freq. threshold is set to 1).

As the threshold value increases, the total size of the index decreases, search performance increases, and recall decreases (i.e. search quality deteriorates). NOTE: phrase recall in particular deteriorates significantly at higher threshold values.
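
For illustration only, here is a minimal sketch of the pruning rule on a toy in-memory postings map. It is not the code from the patch: the class TfPruningSketch, the prune method, and the term -> (docId -> tf) layout are assumptions made up for this example. It only shows that postings whose in-document frequency is below the threshold are dropped, and that a threshold of 1 leaves the postings unchanged.

```java
import java.util.Map;
import java.util.TreeMap;

/** Toy model of term-frequency pruning (hypothetical, not the patch's implementation). */
final class TfPruningSketch {

    /**
     * Returns a copy of the postings that keeps only entries with tf >= threshold.
     * postings maps term -> (docId -> in-document term frequency).
     * Terms left without any posting are dropped entirely.
     */
    static Map<String, Map<Integer, Integer>> prune(
            Map<String, Map<Integer, Integer>> postings, int threshold) {
        Map<String, Map<Integer, Integer>> pruned = new TreeMap<>();
        for (Map.Entry<String, Map<Integer, Integer>> termEntry : postings.entrySet()) {
            Map<Integer, Integer> kept = new TreeMap<>();
            for (Map.Entry<Integer, Integer> posting : termEntry.getValue().entrySet()) {
                if (posting.getValue() >= threshold) {
                    kept.put(posting.getKey(), posting.getValue());
                }
            }
            if (!kept.isEmpty()) {
                pruned.put(termEntry.getKey(), kept);
            }
        }
        return pruned;
    }

    public static void main(String[] args) {
        Map<String, Map<Integer, Integer>> postings = new TreeMap<>();
        postings.put("lucene", new TreeMap<>(Map.of(0, 3, 1, 1)));
        postings.put("index", new TreeMap<>(Map.of(0, 1)));

        // Threshold 1: every posting has tf >= 1, so nothing is removed.
        System.out.println(prune(postings, 1));
        // Threshold 2: only the posting for "lucene" in doc 0 (tf = 3) survives.
        System.out.println(prune(postings, 2));
    }
}
```

In this toy model a query for "index" finds nothing once the threshold reaches 2, which is the recall loss described above.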

The primary purpose of this class is to produce small first-tier indexes that fit completely in RAM, and to store these indexes using IndexWriter.addIndexes(IndexReader[]). Usually the performance of this class will not be sufficient to use the resulting index view for on-the-fly pruning and searching.

NOTE: If the input index is optimized (i.e. doesn't contain deletions) then the index produced via IndexWriter.addIndexes(IndexReader[]) will preserve internal document IDs so that they are in sync with the original index. This means that all other auxiliary information not necessary for first-tier processing, such as some stored fields, can also be removed, to be quickly retrieved on demand from the original index using the same internal document ID.

Threshold values can be specified globally (for terms in all fields) using the defaultThreshold parameter, and can be overridden using per-field or per-term values supplied in a thresholds map. Keys in this map are either field names, or terms in field:text format. The precedence of these values is the following: first a per-term threshold is used if present, then a per-field threshold if present, and finally the default threshold.
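
The precedence rule above can be sketched as follows. This is a hedged illustration, not the API from the patch: the class ThresholdLookupSketch and the method thresholdFor are hypothetical names, and only the defaultThreshold parameter, the thresholds map, and the field:text key format come from the description.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical helper that resolves the pruning threshold for a term. */
final class ThresholdLookupSketch {
    private final int defaultThreshold;
    // Keys are either a field name ("body") or a term in field:text form ("body:lucene").
    private final Map<String, Integer> thresholds;

    ThresholdLookupSketch(int defaultThreshold, Map<String, Integer> thresholds) {
        this.defaultThreshold = defaultThreshold;
        this.thresholds = new HashMap<>(thresholds);
    }

    /** Per-term value wins, then per-field value, then the default. */
    int thresholdFor(String field, String text) {
        Integer perTerm = thresholds.get(field + ":" + text);
        if (perTerm != null) {
            return perTerm;
        }
        Integer perField = thresholds.get(field);
        if (perField != null) {
            return perField;
        }
        return defaultThreshold;
    }

    public static void main(String[] args) {
        Map<String, Integer> thresholds = new HashMap<>();
        thresholds.put("body", 2);        // per-field override
        thresholds.put("body:lucene", 5); // per-term override
        ThresholdLookupSketch lookup = new ThresholdLookupSketch(1, thresholds);

        System.out.println(lookup.thresholdFor("body", "lucene"));  // per-term entry applies
        System.out.println(lookup.thresholdFor("body", "index"));   // per-field entry applies
        System.out.println(lookup.thresholdFor("title", "lucene")); // falls back to the default
    }
}
```

One consequence of a single map with string keys is that a field name containing a colon could be mistaken for a term key; whether the patch guards against this is not stated here.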

A command-line tool (PruningTool) is provided for convenience. At the moment it doesn't support all functionality available through the API.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


[jira] Commented: (LUCENE-1812) Static index pruning by in-document term frequency (Carmel pruning)

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12776472#action_12776472 ] 

Robert Muir commented on LUCENE-1812:
-------------------------------------

bq. Default threshold of what?

What was confusing me is that the console output always says "deleted: 0" for -impl carmel.
For -impl tf, the console output is correct.

But looking at the resulting index (which I should have done earlier, sorry), I can see that -impl carmel does work.



[jira] Issue Comment Edited: (LUCENE-1812) Static index pruning by in-document term frequency (Carmel pruning)

Posted by "Steven Rowe (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12774355#action_12774355 ] 

Steven Rowe edited comment on LUCENE-1812 at 11/6/09 6:37 PM:
--------------------------------------------------------------

Andrzej, when I try to look at [the PDF you posted|http://wiki.apache.org/lucene-java/StaticIndexPruning?action=AttachFile&do=get&target=pruning.pdf] on [the StaticIndexPruning wiki page|http://wiki.apache.org/lucene-java/StaticIndexPruning], Adobe Acrobat gives me the following error:

{quote}
Cannot extract the embedded font 'CAAAA+ArialMT'.  Some characters may not display or print correctly.
{quote}

and the text is illegible - everything except the page titles looks like a series of dots.

      was (Author: steve_rowe):
    Andzrej, when I try to look at [the PDF you posted|http://wiki.apache.org/lucene-java/StaticIndexPruning?action=AttachFile&do=get&target=pruning.pdf] on [the StaticIndexPruning wiki page|http://wiki.apache.org/lucene-java/StaticIndexPruning], Adobe Acrobats gives me the following error:

{quote}
Cannot extract the embedded font 'CAAAA+ArialMT'.  Some characters may not display or print correctly.
{quote}

and the text is illegible - everything except the page titles looks like a series of dots.
  


Re: [jira] Commented: (LUCENE-1812) Static index pruning by in-document term frequency (Carmel pruning)

Posted by Andrzej Bialecki <ab...@getopt.org>.
Steven A Rowe wrote:
> Ack, the last four pages are not visible :( - Steve

This is disappointing, indeed - all versions of the PDFs open without 
problems on my machine, either with Foxit or Acrobat, but after they are 
uploaded they are broken, and they actually differ (on a binary level) 
from my local copies ... :(


-- 
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com




RE: [jira] Commented: (LUCENE-1812) Static index pruning by in-document term frequency (Carmel pruning)

Posted by Steven A Rowe <sa...@syr.edu>.
Ack, the last four pages are not visible :( - Steve

> -----Original Message-----
> From: Steven A Rowe [mailto:sarowe@syr.edu]
> Sent: Friday, November 06, 2009 2:53 PM
> To: java-dev@lucene.apache.org
> Subject: RE: [jira] Commented: (LUCENE-1812) Static index pruning by
> in-document term frequency (Carmel pruning)
> 
> I tried again, and it all seems to be visible now.  Thank you very much
> for persisting! - Steve


RE: [jira] Commented: (LUCENE-1812) Static index pruning by in-document term frequency (Carmel pruning)

Posted by Steven A Rowe <sa...@syr.edu>.
I tried again, and it all seems to be visible now.  Thank you very much for persisting! - Steve

> -----Original Message-----
> From: Steven A Rowe [mailto:sarowe@syr.edu]
> Sent: Friday, November 06, 2009 2:50 PM
> To: java-dev@lucene.apache.org
> Subject: RE: [jira] Commented: (LUCENE-1812) Static index pruning by
> in-document term frequency (Carmel pruning)
> 
> Acrobat still gives an error (about a different font).
> 
> The text shows up properly now, but the slide titles are no longer
> legible.
> 
> Steve


RE: [jira] Commented: (LUCENE-1812) Static index pruning by in-document term frequency (Carmel pruning)

Posted by Steven A Rowe <sa...@syr.edu>.
Acrobat still gives an error (about a different font).

The text shows up properly now, but the slide titles are no longer legible.

Steve

> -----Original Message-----
> From: Andrzej Bialecki [mailto:ab@getopt.org]
> Sent: Friday, November 06, 2009 2:41 PM
> To: java-dev@lucene.apache.org
> Subject: Re: [jira] Commented: (LUCENE-1812) Static index pruning by
> in-document term frequency (Carmel pruning)
> 
> Andrzej Bialecki (JIRA) wrote:
> >     [ https://issues.apache.org/jira/browse/LUCENE-
> 1812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-
> tabpanel&focusedCommentId=12774363#action_12774363 ]
> >
> > Andrzej Bialecki  commented on LUCENE-1812:
> > -------------------------------------------
> >
> > There have been problems with PDF uploads since the recent Wiki
> upgrade ... I'll keep trying until it gets through in one piece. Sorry
> ...
> 
> This should be fixed now - please let me know if it still doesn't work
> for you ...


Re: [jira] Commented: (LUCENE-1812) Static index pruning by in-document term frequency (Carmel pruning)

Posted by Andrzej Bialecki <ab...@getopt.org>.
Andrzej Bialecki (JIRA) wrote:
>     [ https://issues.apache.org/jira/browse/LUCENE-1812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12774363#action_12774363 ] 
> 
> Andrzej Bialecki  commented on LUCENE-1812:
> -------------------------------------------
> 
> There have been problems with PDF uploads since the recent Wiki upgrade ... I'll keep trying until it gets through in one piece. Sorry ...

This should be fixed now - please let me know if it still doesn't work 
for you ...


-- 
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com




[jira] Commented: (LUCENE-1812) Static index pruning by in-document term frequency (Carmel pruning)

Posted by "Andrzej Bialecki (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12774363#action_12774363 ] 

Andrzej Bialecki  commented on LUCENE-1812:
-------------------------------------------

There have been problems with PDF uploads since the recent Wiki upgrade ... I'll keep trying until it gets through in one piece. Sorry ...



[jira] Commented: (LUCENE-1812) Static index pruning by in-document term frequency (Carmel pruning)

Posted by "Steven Rowe (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12774355#action_12774355 ] 

Steven Rowe commented on LUCENE-1812:
-------------------------------------

Andrzej, when I try to look at [the PDF you posted|http://wiki.apache.org/lucene-java/StaticIndexPruning?action=AttachFile&do=get&target=pruning.pdf] on [the StaticIndexPruning wiki page|http://wiki.apache.org/lucene-java/StaticIndexPruning], Adobe Acrobat gives me the following error:

{quote}
Cannot extract the embedded font 'CAAAA+ArialMT'.  Some characters may not display or print correctly.
{quote}

and the text is illegible - everything except the page titles looks like a series of dots.



[jira] Commented: (LUCENE-1812) Static index pruning by in-document term frequency (Carmel pruning)

Posted by "Andrzej Bialecki (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12775358#action_12775358 ] 

Andrzej Bialecki  commented on LUCENE-1812:
-------------------------------------------

Nice job, Robert - thanks! BTW, your results show an effect that was reported in the papers on this subject, namely that some metrics may actually improve, like MRR and P@10 above.



[jira] Updated: (LUCENE-1812) Static index pruning by in-document term frequency (Carmel pruning)

Posted by "Andrzej Bialecki (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-1812?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andrzej Bialecki  updated LUCENE-1812:
--------------------------------------

    Attachment: pruning.patch

Patch relative to the current trunk.

> Static index pruning by in-document term frequency (Carmel pruning)
> -------------------------------------------------------------------
>
>                 Key: LUCENE-1812
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1812
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: contrib/*
>    Affects Versions: 2.9
>            Reporter: Andrzej Bialecki 
>         Attachments: pruning.patch
>
>
> This module provides tools to produce a subset of input indexes by removing postings data for those terms where their in-document frequency is below a specified threshold. The net effect of this processing is a much smaller index that for common types of queries returns nearly identical top-N results as compared with the original index, but with increased performance. 
> Optionally, stored values and term vectors can also be removed. This functionality is largely independent, so it can be used without term pruning (when term freq. threshold is set to 1).
> As the threshold value increases, the total size of the index decreases, search performance increases, and recall decreases (i.e. search quality deteriorates). NOTE: especially phrase recall deteriorates significantly at higher threshold values. 
> Primary purpose of this class is to produce small first-tier indexes that fit completely in RAM, and store these indexes using IndexWriter.addIndexes(IndexReader[]). Usually the performance of this class will not be sufficient to use the resulting index view for on-the-fly pruning and searching. 
> NOTE: If the input index is optimized (i.e. doesn't contain deletions) then the index produced via IndexWriter.addIndexes(IndexReader[]) will preserve internal document id-s so that they are in sync with the original index. This means that all other auxiliary information not necessary for first-tier processing, such as some stored fields, can also be removed, to be quickly retrieved on-demand from the original index using the same internal document id. 
> Threshold values can be specified globally (for terms in all fields) using the defaultThreshold parameter, and can be overridden using per-field or per-term values supplied in a thresholds map. Keys in this map are either field names, or terms in field:text format. The precedence of these values is the following: first a per-term threshold is used if present, then a per-field threshold if present, and finally the default threshold.
> A command-line tool (PruningTool) is provided for convenience. At the moment it doesn't support all functionality available through the API.
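
The threshold precedence described in the issue text can be sketched roughly as follows. This is only an illustration, not code from pruning.patch: the class name ThresholdResolver, its methods, and the use of Integer values are assumptions; only the defaultThreshold parameter, the thresholds map, and the field:text key format come from the description.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical helper illustrating the lookup order; not the class from the patch. */
class ThresholdResolver {
    private final int defaultThreshold;
    private final Map<String,Integer> thresholds;

    ThresholdResolver(int defaultThreshold, Map<String,Integer> thresholds) {
        this.defaultThreshold = defaultThreshold;
        this.thresholds = thresholds;
    }

    /** Per-term value first (key "field:text"), then per-field, then the default. */
    int resolve(String field, String text) {
        Integer perTerm = thresholds.get(field + ":" + text);
        if (perTerm != null) {
            return perTerm;
        }
        Integer perField = thresholds.get(field);
        if (perField != null) {
            return perField;
        }
        return defaultThreshold;
    }

    /** Assumption: a posting survives when its in-document frequency is not below the threshold. */
    boolean keep(String field, String text, int termFreq) {
        return termFreq >= resolve(field, text);
    }

    public static void main(String[] args) {
        Map<String,Integer> map = new HashMap<String,Integer>();
        map.put("body", 3);          // per-field threshold
        map.put("body:lucene", 1);   // per-term threshold
        ThresholdResolver r = new ThresholdResolver(2, map);
        System.out.println(r.resolve("body", "lucene"));  // per-term entry wins
        System.out.println(r.resolve("body", "index"));   // falls back to per-field
        System.out.println(r.resolve("title", "index"));  // falls back to default
    }
}
```

With a threshold of 1 every posting is kept, which matches the remark that term pruning is effectively disabled at that value.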



[jira] Commented: (LUCENE-1812) Static index pruning by in-document term frequency (Carmel pruning)

Posted by "Uwe Schindler (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12776479#action_12776479 ] 

Uwe Schindler commented on LUCENE-1812:
---------------------------------------

Code seems to be Java 1.5, which is good, but I am wondering about some @SuppressWarnings, e.g. in getFieldNames(). The original overridden method returns Collection<String>; if you change that to return the correct type, it doesn't need SuppressWarnings. There are more places. Also, if you use Collections.<Type>emptyMap() and so on, it is also type safe.

Also, we use no space after the comma in generic type parameters.
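
A minimal sketch of the two suggestions above, under stated assumptions: BaseReader and FilteringReader are hypothetical stand-ins (not Lucene classes, and not the classes in the patch), and the no-argument getFieldNames() is simplified for illustration. Only Collections.<K,V>emptyMap() is a real JDK call.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.Collections;
import java.util.Map;

class GenericsExample {
    /** Stand-in for a base class whose method already declares a generic return type. */
    static class BaseReader {
        Collection<String> getFieldNames() {
            return new ArrayList<String>();
        }
    }

    static class FilteringReader extends BaseReader {
        // Declaring the same parameterized return type as the overridden method
        // avoids raw types, so no @SuppressWarnings is needed.
        @Override
        Collection<String> getFieldNames() {
            Collection<String> names = new ArrayList<String>(super.getFieldNames());
            names.add("body");
            return names;
        }
    }

    /** Explicit type arguments keep the empty map type safe; note no space after the comma. */
    static Map<String,Integer> noThresholds() {
        return Collections.<String,Integer>emptyMap();
    }
}
```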

But I like the patch, nice work!



[jira] Issue Comment Edited: (LUCENE-1812) Static index pruning by in-document term frequency (Carmel pruning)

Posted by "Steven Rowe (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12774355#action_12774355 ] 

Steven Rowe edited comment on LUCENE-1812 at 11/6/09 6:36 PM:
--------------------------------------------------------------

Andrzej, when I try to look at [the PDF you posted|http://wiki.apache.org/lucene-java/StaticIndexPruning?action=AttachFile&do=get&target=pruning.pdf] on [the StaticIndexPruning wiki page|http://wiki.apache.org/lucene-java/StaticIndexPruning], Adobe Acrobat gives me the following error:

{quote}
Cannot extract the embedded font 'CAAAA+ArialMT'.  Some characters may not display or print correctly.
{quote}

and the text is illegible - everything except the page titles looks like a series of dots.



[jira] Commented: (LUCENE-1812) Static index pruning by in-document term frequency (Carmel pruning)

Posted by "Andrzej Bialecki (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12776450#action_12776450 ] 

Andrzej Bialecki  commented on LUCENE-1812:
-------------------------------------------

Default threshold of what? When using the Carmel method, the threshold value should be between 0.0 and 1.0, where 1.0 means no pruning, i.e. 100% of docs are retained. I'm sorry for the confusion - the documentation should be clearer on this point.



[jira] Commented: (LUCENE-1812) Static index pruning by in-document term frequency (Carmel pruning)

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12775266#action_12775266 ] 

Robert Muir commented on LUCENE-1812:
-------------------------------------

Andrzej, I tested your patch. I found two places where @Override was used on an interface method; that is the only problem so far.
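
For context, and assuming the remark refers to the Java 5 compiler: javac for Java 5 rejects @Override on a method that only implements an interface method, while Java 6 and later accept it. The sketch below uses hypothetical names (PruningPolicy, TfPolicy) that are not taken from the patch.

```java
class OverrideExample {
    /** Hypothetical interface, for illustration only. */
    interface PruningPolicy {
        boolean keep(int termFreq);
    }

    static class TfPolicy implements PruningPolicy {
        private final int threshold;

        TfPolicy(int threshold) {
            this.threshold = threshold;
        }

        // No @Override here: a Java 5 compiler reports an error when the
        // annotation is placed on a method that only implements an interface method.
        public boolean keep(int termFreq) {
            return termFreq >= threshold;
        }

        // @Override is fine here, because toString() overrides a superclass method.
        @Override
        public String toString() {
            return "TfPolicy(" + threshold + ")";
        }
    }
}
```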

Here are some results on the Hamshahri Persian test collection (I used the TF method with -t 2):

||Measure||Unpruned||Pruned||
|index size|98627KB|42339KB|
|map|0.4809|0.4241|
|recip_rank|0.8368|0.8393|
|P5|0.6277|0.6369|
|P10|0.5677|0.5785|
|P15|0.5436|0.5231|
|P20|0.5185|0.4969|
|P30|0.4703|0.4385|
|P100|0.2782|0.2440|

The queries in this corpus are somewhat general, but this seems to be a nice way to reduce the index to less than half its size, still with reasonable quality.
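
To put the table in relative terms, a small sketch like the following computes the percentage change of each measure. The class and method names are made up for illustration; the numbers are copied from the table above. The pruned index comes out roughly 57% smaller, with MAP about 12% lower.

```java
import java.util.Locale;

class PruningReport {
    /** Relative change of 'after' with respect to 'before', in percent (negative means a decrease). */
    static double relativeChangePercent(double before, double after) {
        return (after - before) / before * 100.0;
    }

    public static void main(String[] args) {
        String[] names = {"index size (KB)", "map", "recip_rank", "P5", "P10", "P15", "P20", "P30", "P100"};
        double[] unpruned = {98627, 0.4809, 0.8368, 0.6277, 0.5677, 0.5436, 0.5185, 0.4703, 0.2782};
        double[] pruned = {42339, 0.4241, 0.8393, 0.6369, 0.5785, 0.5231, 0.4969, 0.4385, 0.2440};
        for (int i = 0; i < names.length; i++) {
            // One line per measure: name, then signed relative change with one decimal.
            System.out.printf(Locale.ROOT, "%-16s %+6.1f%%%n",
                    names[i], relativeChangePercent(unpruned[i], pruned[i]));
        }
    }
}
```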




[jira] Updated: (LUCENE-1812) Static index pruning by in-document term frequency (Carmel pruning)

Posted by "Andrzej Bialecki (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-1812?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andrzej Bialecki  updated LUCENE-1812:
--------------------------------------

    Attachment: pruning.patch

Updated patch against trunk/. This patch is a major refactoring that opens the way for other implementations of stored fields and postings pruning. Two policies are included in this patch - the original Carmel method, and a simple TF-based threshold method.



[jira] Commented: (LUCENE-1812) Static index pruning by in-document term frequency (Carmel pruning)

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12775443#action_12775443 ] 

Robert Muir commented on LUCENE-1812:
-------------------------------------

Andrzej, are you still working on the Carmel policy? 
I see -conf isn't yet implemented, and I can't seem to get it to prune anything with just a default threshold... guessing it's still a work in progress?




[jira] Commented: (LUCENE-1812) Static index pruning by in-document term frequency (Carmel pruning)

Posted by "Andrzej Bialecki (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12776589#action_12776589 ] 

Andrzej Bialecki  commented on LUCENE-1812:
-------------------------------------------

I'll prepare a new patch - the reason for these deficiencies is that I worked against trunk just before the generics patches were applied ;)
