You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@lucene.apache.org by "Jason Rutherglen (JIRA)" <ji...@apache.org> on 2008/05/02 13:58:55 UTC

[jira] Created: (LUCENE-1278) Add optional storing of document numbers in term dictionary

Add optional storing of document numbers in term dictionary
-----------------------------------------------------------

                 Key: LUCENE-1278
                 URL: https://issues.apache.org/jira/browse/LUCENE-1278
             Project: Lucene - Java
          Issue Type: New Feature
          Components: Index
    Affects Versions: 2.3.1
            Reporter: Jason Rutherglen
            Priority: Minor


Add optional storing of document numbers in term dictionary.  String index field cache and range filter creation will be faster.  

Example read code:

TermEnum termEnum = indexReader.terms(TermEnum.LOAD_DOCS);
do {
  Term term = termEnum.term();
  if (term == null || term.field() != field) break;
  int[] docs = termEnum.docs();
} while (termEnum.next());

Example write code:

Document document = new Document();
document.add(new Field("tag", "dog", Field.Store.YES, Field.Index.UN_TOKENIZED, Field.Term.STORE_DOCS));
indexWriter.addDocument(document);

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Commented: (LUCENE-1278) Add optional storing of document numbers in term dictionary

Posted by "Jason Rutherglen (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-1278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12598770#action_12598770 ] 

Jason Rutherglen commented on LUCENE-1278:
------------------------------------------

Have a new patch that handles deleted docs but realized that returning DocIdSetIterator is not needed.  This implementation can integrate with TermDocs transparently.  The issue is then whether to keep the Fieldable.isStoreTermDocs or make the implementation a default for untokenized fields.  For untokenized fields, this would mean not having to store the docs in the segment.frq file.  

> Add optional storing of document numbers in term dictionary
> -----------------------------------------------------------
>
>                 Key: LUCENE-1278
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1278
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: Index
>    Affects Versions: 2.3.1
>            Reporter: Jason Rutherglen
>            Priority: Minor
>         Attachments: lucene.1278.5.4.2008.patch, lucene.1278.5.5.2008.2.patch, lucene.1278.5.5.2008.patch, lucene.1278.5.7.2008.patch, lucene.1278.5.7.2008.test.patch, TestTermEnumDocs.java
>
>
> Add optional storing of document numbers in term dictionary.  String index field cache and range filter creation will be faster.  
> Example read code:
> {noformat}
> TermEnum termEnum = indexReader.terms(TermEnum.LOAD_DOCS);
> do {
>   Term term = termEnum.term();
>   if (term == null || term.field() != field) break;
>   int[] docs = termEnum.docs();
> } while (termEnum.next());
> {noformat}
> Example write code:
> {noformat}
> Document document = new Document();
> document.add(new Field("tag", "dog", Field.Store.YES, Field.Index.UN_TOKENIZED, Field.Term.STORE_DOCS));
> indexWriter.addDocument(document);
> {noformat}

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Commented: (LUCENE-1278) Add optional storing of document numbers in term dictionary

Posted by "Jason Rutherglen (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-1278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12594127#action_12594127 ] 

Jason Rutherglen commented on LUCENE-1278:
------------------------------------------

Is there a way to know the number of documents for a term in DocumentsWriter.appendPostings before running through all of them?  Currently a non-optimal linkedlist is used.  Otherwise will implement a growable int array.

> Add optional storing of document numbers in term dictionary
> -----------------------------------------------------------
>
>                 Key: LUCENE-1278
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1278
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: Index
>    Affects Versions: 2.3.1
>            Reporter: Jason Rutherglen
>            Priority: Minor
>         Attachments: lucene.1278.5.4.2008.patch
>
>
> Add optional storing of document numbers in term dictionary.  String index field cache and range filter creation will be faster.  
> Example read code:
> {noformat}
> TermEnum termEnum = indexReader.terms(TermEnum.LOAD_DOCS);
> do {
>   Term term = termEnum.term();
>   if (term == null || term.field() != field) break;
>   int[] docs = termEnum.docs();
> } while (termEnum.next());
> {noformat}
> Example write code:
> {noformat}
> Document document = new Document();
> document.add(new Field("tag", "dog", Field.Store.YES, Field.Index.UN_TOKENIZED, Field.Term.STORE_DOCS));
> indexWriter.addDocument(document);
> {noformat}

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Commented: (LUCENE-1278) Add optional storing of document numbers in term dictionary

Posted by "Paul Elschot (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-1278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12594206#action_12594206 ] 

Paul Elschot commented on LUCENE-1278:
--------------------------------------

Would there be any performance measurements for this? It might be quite good for terms that occur in very many documents, an area in which some improvement is possible I think.
Btw, for this case it might also be good to use a SortedVIntList instead of an IntArrayList.

I had a look at today's patch, but I stopped at DocumentsWriter because it contains a lot of layout changes, so it's hard to focus on the functional differences.

Are there any index format changes involved in this?

> Add optional storing of document numbers in term dictionary
> -----------------------------------------------------------
>
>                 Key: LUCENE-1278
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1278
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: Index
>    Affects Versions: 2.3.1
>            Reporter: Jason Rutherglen
>            Priority: Minor
>         Attachments: lucene.1278.5.4.2008.patch, lucene.1278.5.5.2008.patch
>
>
> Add optional storing of document numbers in term dictionary.  String index field cache and range filter creation will be faster.  
> Example read code:
> {noformat}
> TermEnum termEnum = indexReader.terms(TermEnum.LOAD_DOCS);
> do {
>   Term term = termEnum.term();
>   if (term == null || term.field() != field) break;
>   int[] docs = termEnum.docs();
> } while (termEnum.next());
> {noformat}
> Example write code:
> {noformat}
> Document document = new Document();
> document.add(new Field("tag", "dog", Field.Store.YES, Field.Index.UN_TOKENIZED, Field.Term.STORE_DOCS));
> indexWriter.addDocument(document);
> {noformat}

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Updated: (LUCENE-1278) Add optional storing of document numbers in term dictionary

Posted by "Jason Rutherglen (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/LUCENE-1278?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jason Rutherglen updated LUCENE-1278:
-------------------------------------

    Attachment: lucene.1278.5.7.2008.test.patch
                lucene.1278.5.7.2008.patch

lucene.1278.5.7.2008.patch - RangeFilter, FieldCacheImpl, ExtendedFieldCacheImpl automatically use termenum loaddocs=true.  A byte buffer is used to load the docs, if the bytes read will exceed the buffer, a new IndexInput is created and a DocIdSetIterator will read over that when TermEnum.docs() is called.  This makes the usual case of a small number of docs (usually 1) to a term fast using a byte buffer.  It also removes the issue if a term has too many docs and the whole byte array is loaded into ram.  TermEnum.docs() returns DocIdSetIterator.

lucene.1278.5.7.2008.test.patch - TestSort stores documents with Field.TermDocs.STORE



> Add optional storing of document numbers in term dictionary
> -----------------------------------------------------------
>
>                 Key: LUCENE-1278
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1278
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: Index
>    Affects Versions: 2.3.1
>            Reporter: Jason Rutherglen
>            Priority: Minor
>         Attachments: lucene.1278.5.4.2008.patch, lucene.1278.5.5.2008.2.patch, lucene.1278.5.5.2008.patch, lucene.1278.5.7.2008.patch, lucene.1278.5.7.2008.test.patch, TestTermEnumDocs.java
>
>
> Add optional storing of document numbers in term dictionary.  String index field cache and range filter creation will be faster.  
> Example read code:
> {noformat}
> TermEnum termEnum = indexReader.terms(TermEnum.LOAD_DOCS);
> do {
>   Term term = termEnum.term();
>   if (term == null || term.field() != field) break;
>   int[] docs = termEnum.docs();
> } while (termEnum.next());
> {noformat}
> Example write code:
> {noformat}
> Document document = new Document();
> document.add(new Field("tag", "dog", Field.Store.YES, Field.Index.UN_TOKENIZED, Field.Term.STORE_DOCS));
> indexWriter.addDocument(document);
> {noformat}

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Updated: (LUCENE-1278) Add optional storing of document numbers in term dictionary

Posted by "Jason Rutherglen (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/LUCENE-1278?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jason Rutherglen updated LUCENE-1278:
-------------------------------------

    Description: 
Add optional storing of document numbers in term dictionary.  String index field cache and range filter creation will be faster.  

Example read code:
{noformat}
TermEnum termEnum = indexReader.terms(TermEnum.LOAD_DOCS);
do {
  Term term = termEnum.term();
  if (term == null || term.field() != field) break;
  int[] docs = termEnum.docs();
} while (termEnum.next());
{noformat}
Example write code:

Document document = new Document();
document.add(new Field("tag", "dog", Field.Store.YES, Field.Index.UN_TOKENIZED, Field.Term.STORE_DOCS));
indexWriter.addDocument(document);

  was:
Add optional storing of document numbers in term dictionary.  String index field cache and range filter creation will be faster.  

Example read code:

TermEnum termEnum = indexReader.terms(TermEnum.LOAD_DOCS);\\
do {\\
  Term term = termEnum.term();\\
  if (term == null || term.field() != field) break;\\
  int[] docs = termEnum.docs();\\
} while (termEnum.next());\\

Example write code:

Document document = new Document();
document.add(new Field("tag", "dog", Field.Store.YES, Field.Index.UN_TOKENIZED, Field.Term.STORE_DOCS));
indexWriter.addDocument(document);


> Add optional storing of document numbers in term dictionary
> -----------------------------------------------------------
>
>                 Key: LUCENE-1278
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1278
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: Index
>    Affects Versions: 2.3.1
>            Reporter: Jason Rutherglen
>            Priority: Minor
>
> Add optional storing of document numbers in term dictionary.  String index field cache and range filter creation will be faster.  
> Example read code:
> {noformat}
> TermEnum termEnum = indexReader.terms(TermEnum.LOAD_DOCS);
> do {
>   Term term = termEnum.term();
>   if (term == null || term.field() != field) break;
>   int[] docs = termEnum.docs();
> } while (termEnum.next());
> {noformat}
> Example write code:
> Document document = new Document();
> document.add(new Field("tag", "dog", Field.Store.YES, Field.Index.UN_TOKENIZED, Field.Term.STORE_DOCS));
> indexWriter.addDocument(document);

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Commented: (LUCENE-1278) Add optional storing of document numbers in term dictionary

Posted by "Jason Rutherglen (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-1278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12615611#action_12615611 ] 

Jason Rutherglen commented on LUCENE-1278:
------------------------------------------

In order for the proposal I mentioned to work, DocumentsWriter.appendPostings needs to be changed to store the docs in an IntArrayList or something or the sort, then decide where to store the postings.  

I started working on LUCENE-1292 to address this problem outside of reworking core Lucene.  LUCENE-1278 only addresses half of my problem.  I also want realtime updates to an in memory term index.  The most efficient way to achieve this is what is outlined in LUCENE-1292.

> Add optional storing of document numbers in term dictionary
> -----------------------------------------------------------
>
>                 Key: LUCENE-1278
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1278
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: Index
>    Affects Versions: 2.3.1
>            Reporter: Jason Rutherglen
>            Priority: Minor
>         Attachments: lucene.1278.5.4.2008.patch, lucene.1278.5.5.2008.2.patch, lucene.1278.5.5.2008.patch, lucene.1278.5.7.2008.patch, lucene.1278.5.7.2008.test.patch, TestTermEnumDocs.java
>
>
> Add optional storing of document numbers in term dictionary.  String index field cache and range filter creation will be faster.  
> Example read code:
> {noformat}
> TermEnum termEnum = indexReader.terms(TermEnum.LOAD_DOCS);
> do {
>   Term term = termEnum.term();
>   if (term == null || term.field() != field) break;
>   int[] docs = termEnum.docs();
> } while (termEnum.next());
> {noformat}
> Example write code:
> {noformat}
> Document document = new Document();
> document.add(new Field("tag", "dog", Field.Store.YES, Field.Index.UN_TOKENIZED, Field.Term.STORE_DOCS));
> indexWriter.addDocument(document);
> {noformat}

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Commented: (LUCENE-1278) Add optional storing of document numbers in term dictionary

Posted by "Jason Rutherglen (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-1278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12594744#action_12594744 ] 

Jason Rutherglen commented on LUCENE-1278:
------------------------------------------

Implemented returning DocIdSetIterator however when running org.apache.lucene.search.TestSort remote search fails.  Reading the docs from a DocIdSetIterator directly from the file is troublesome due to the way termenum is designed with the other parts of Lucene.  My own basic unit test works, however TestSort does not and it is probably due to the file pointer not being on the correct position during enumeration.  

Perhaps there is a way for the int array work?  

Or is it best to create a separate file that is very similar to the term dictionary file but only stores terms and docs?

> Add optional storing of document numbers in term dictionary
> -----------------------------------------------------------
>
>                 Key: LUCENE-1278
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1278
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: Index
>    Affects Versions: 2.3.1
>            Reporter: Jason Rutherglen
>            Priority: Minor
>         Attachments: lucene.1278.5.4.2008.patch, lucene.1278.5.5.2008.2.patch, lucene.1278.5.5.2008.patch, TestTermEnumDocs.java
>
>
> Add optional storing of document numbers in term dictionary.  String index field cache and range filter creation will be faster.  
> Example read code:
> {noformat}
> TermEnum termEnum = indexReader.terms(TermEnum.LOAD_DOCS);
> do {
>   Term term = termEnum.term();
>   if (term == null || term.field() != field) break;
>   int[] docs = termEnum.docs();
> } while (termEnum.next());
> {noformat}
> Example write code:
> {noformat}
> Document document = new Document();
> document.add(new Field("tag", "dog", Field.Store.YES, Field.Index.UN_TOKENIZED, Field.Term.STORE_DOCS));
> indexWriter.addDocument(document);
> {noformat}

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Commented: (LUCENE-1278) Add optional storing of document numbers in term dictionary

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-1278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12594225#action_12594225 ] 

Michael McCandless commented on LUCENE-1278:
--------------------------------------------


It looks like the .tii file is also storing the int[] docIDs (as
inlined byte blob)?  I think that shouldn't be necessary?

This change adds a posting list like the frq file, except that it
stores only docIDs (no freq information), is stored inline in the term
dict, and includes a reader that materializes the full doc list as an
int[] instead of offering an iterator like (nextDoc()) interface
alone.

I think these changes would fit cleanly into what's been proposed for
flexible indexing.  EG, case 1a talks about storing only docID in a
posting list, here:

    http://wiki.apache.org/jakarta-lucene/FlexibleIndexing

And recent discussions on the dev list around how to be flexible as to
which index file(s) (one or many) things are stored in, eg:

   http://www.mail-archive.com/java-dev@lucene.apache.org/msg15681.html

should allow you to store this data inlined into the terms dict, or as
a separate file.

Some other initial comments/questions:

  * I think this would bloat the index because the docIDs are being
    double stored (in the terms dict, and, in the frq file).  Would
    you propose changing the frq file to not store the docID when the
    term dict is doing so?

  * Why store the byte blob in the term dict, and not a separate (new)
    index file?  We lose locality for cases where one wants to iterate
    through terms and not loads these docs (eg RangeQuery).

  * Could you, instead, make a reader that reads in the full byte blob
    from the frq file for a term, and then processes that into the
    int[]?  This would require no change to indexing & the index
    format, and wouldn't waste space double-storing the docIDs.

  * I'm worried how well this scales up.  For very common terms
    allocating then decoding & holding entirely in RAM the full list
    of docIDs can become extremely costly.  Also, I don't have a clear
    sense of how apps would use the returned int[].  For example,
    would the int[] for many terms need to remain resident at the same
    time?  (Eg when running a RangeQuery).  If so, that compounds the
    scale challenge.



> Add optional storing of document numbers in term dictionary
> -----------------------------------------------------------
>
>                 Key: LUCENE-1278
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1278
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: Index
>    Affects Versions: 2.3.1
>            Reporter: Jason Rutherglen
>            Priority: Minor
>         Attachments: lucene.1278.5.4.2008.patch, lucene.1278.5.5.2008.patch
>
>
> Add optional storing of document numbers in term dictionary.  String index field cache and range filter creation will be faster.  
> Example read code:
> {noformat}
> TermEnum termEnum = indexReader.terms(TermEnum.LOAD_DOCS);
> do {
>   Term term = termEnum.term();
>   if (term == null || term.field() != field) break;
>   int[] docs = termEnum.docs();
> } while (termEnum.next());
> {noformat}
> Example write code:
> {noformat}
> Document document = new Document();
> document.add(new Field("tag", "dog", Field.Store.YES, Field.Index.UN_TOKENIZED, Field.Term.STORE_DOCS));
> indexWriter.addDocument(document);
> {noformat}

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Updated: (LUCENE-1278) Add optional storing of document numbers in term dictionary

Posted by "Jason Rutherglen (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/LUCENE-1278?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jason Rutherglen updated LUCENE-1278:
-------------------------------------

    Attachment: TestTermEnumDocs.java

Was going to write a Lucene test case but need an example and svn is down.

The example test is extremely poor because the term and field saturation is nil.  Normal documents will have far more terms and the file cache will not have cached as much of the term docs as it will be larger.  However it does illustrate the speed up.  Please suggest other tests.

Laptop Windows XP SP2 Java6 core2duo, about the same on 3 separate runs:
3360 millis termenum loaddocs
25641 millis termdocs
7.6 times speedup

There have been previous discussions regarding the speed issue.  
http://www.gossamer-threads.com/lists/lucene/java-dev/53786
The conclusion was to use payloads which do not speed up stringindex or range queries.  


> Add optional storing of document numbers in term dictionary
> -----------------------------------------------------------
>
>                 Key: LUCENE-1278
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1278
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: Index
>    Affects Versions: 2.3.1
>            Reporter: Jason Rutherglen
>            Priority: Minor
>         Attachments: lucene.1278.5.4.2008.patch, lucene.1278.5.5.2008.2.patch, lucene.1278.5.5.2008.patch, TestTermEnumDocs.java
>
>
> Add optional storing of document numbers in term dictionary.  String index field cache and range filter creation will be faster.  
> Example read code:
> {noformat}
> TermEnum termEnum = indexReader.terms(TermEnum.LOAD_DOCS);
> do {
>   Term term = termEnum.term();
>   if (term == null || term.field() != field) break;
>   int[] docs = termEnum.docs();
> } while (termEnum.next());
> {noformat}
> Example write code:
> {noformat}
> Document document = new Document();
> document.add(new Field("tag", "dog", Field.Store.YES, Field.Index.UN_TOKENIZED, Field.Term.STORE_DOCS));
> indexWriter.addDocument(document);
> {noformat}

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Commented: (LUCENE-1278) Add optional storing of document numbers in term dictionary

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-1278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12594220#action_12594220 ] 

Michael McCandless commented on LUCENE-1278:
--------------------------------------------

{quote}
I had a look at today's patch, but I stopped at DocumentsWriter because it contains a lot of layout changes, so it's hard to focus on the functional differences.
{quote}

I also stopped at DocumentsWriter: it seems like nearly all the
changes are cosmetic.  SegmentTermEnum is also hard to read.

In general it's best to not make cosmetic changes (moving around
import lines, changing whitespace, re-justifying whole paragraphs in
javadocs, etc.) at the same time as a "real" change, when possible.  I
do admit there is a strong temptation ;)

Also, indentation should be two spaces, not tab.  A number of sources
were changed to tab in the patch.



> Add optional storing of document numbers in term dictionary
> -----------------------------------------------------------
>
>                 Key: LUCENE-1278
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1278
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: Index
>    Affects Versions: 2.3.1
>            Reporter: Jason Rutherglen
>            Priority: Minor
>         Attachments: lucene.1278.5.4.2008.patch, lucene.1278.5.5.2008.patch
>
>
> Add optional storing of document numbers in term dictionary.  String index field cache and range filter creation will be faster.  
> Example read code:
> {noformat}
> TermEnum termEnum = indexReader.terms(TermEnum.LOAD_DOCS);
> do {
>   Term term = termEnum.term();
>   if (term == null || term.field() != field) break;
>   int[] docs = termEnum.docs();
> } while (termEnum.next());
> {noformat}
> Example write code:
> {noformat}
> Document document = new Document();
> document.add(new Field("tag", "dog", Field.Store.YES, Field.Index.UN_TOKENIZED, Field.Term.STORE_DOCS));
> indexWriter.addDocument(document);
> {noformat}

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Updated: (LUCENE-1278) Add optional storing of document numbers in term dictionary

Posted by "Jason Rutherglen (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/LUCENE-1278?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jason Rutherglen updated LUCENE-1278:
-------------------------------------

    Attachment: lucene.1278.5.5.2008.patch

lucene.1278.5.5.2008.patch - added and implemented IntArrayList, cleaned up comments, fixed bugs in segmentmerger, and multisegmentreader

> Add optional storing of document numbers in term dictionary
> -----------------------------------------------------------
>
>                 Key: LUCENE-1278
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1278
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: Index
>    Affects Versions: 2.3.1
>            Reporter: Jason Rutherglen
>            Priority: Minor
>         Attachments: lucene.1278.5.4.2008.patch, lucene.1278.5.5.2008.patch
>
>
> Add optional storing of document numbers in term dictionary.  String index field cache and range filter creation will be faster.  
> Example read code:
> {noformat}
> TermEnum termEnum = indexReader.terms(TermEnum.LOAD_DOCS);
> do {
>   Term term = termEnum.term();
>   if (term == null || term.field() != field) break;
>   int[] docs = termEnum.docs();
> } while (termEnum.next());
> {noformat}
> Example write code:
> {noformat}
> Document document = new Document();
> document.add(new Field("tag", "dog", Field.Store.YES, Field.Index.UN_TOKENIZED, Field.Term.STORE_DOCS));
> indexWriter.addDocument(document);
> {noformat}

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Updated: (LUCENE-1278) Add optional storing of document numbers in term dictionary

Posted by "Jason Rutherglen (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/LUCENE-1278?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jason Rutherglen updated LUCENE-1278:
-------------------------------------

    Description: 
Add optional storing of document numbers in term dictionary.  String index field cache and range filter creation will be faster.  

Example read code:
{noformat}
TermEnum termEnum = indexReader.terms(TermEnum.LOAD_DOCS);
do {
  Term term = termEnum.term();
  if (term == null || term.field() != field) break;
  int[] docs = termEnum.docs();
} while (termEnum.next());
{noformat}

Example write code:
{noformat}
Document document = new Document();
document.add(new Field("tag", "dog", Field.Store.YES, Field.Index.UN_TOKENIZED, Field.Term.STORE_DOCS));
indexWriter.addDocument(document);
{noformat}

  was:
Add optional storing of document numbers in term dictionary.  String index field cache and range filter creation will be faster.  

Example read code:
{noformat}
TermEnum termEnum = indexReader.terms(TermEnum.LOAD_DOCS);
do {
  Term term = termEnum.term();
  if (term == null || term.field() != field) break;
  int[] docs = termEnum.docs();
} while (termEnum.next());
{noformat}
Example write code:

Document document = new Document();
document.add(new Field("tag", "dog", Field.Store.YES, Field.Index.UN_TOKENIZED, Field.Term.STORE_DOCS));
indexWriter.addDocument(document);


> Add optional storing of document numbers in term dictionary
> -----------------------------------------------------------
>
>                 Key: LUCENE-1278
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1278
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: Index
>    Affects Versions: 2.3.1
>            Reporter: Jason Rutherglen
>            Priority: Minor
>
> Add optional storing of document numbers in term dictionary.  String index field cache and range filter creation will be faster.  
> Example read code:
> {noformat}
> TermEnum termEnum = indexReader.terms(TermEnum.LOAD_DOCS);
> do {
>   Term term = termEnum.term();
>   if (term == null || term.field() != field) break;
>   int[] docs = termEnum.docs();
> } while (termEnum.next());
> {noformat}
> Example write code:
> {noformat}
> Document document = new Document();
> document.add(new Field("tag", "dog", Field.Store.YES, Field.Index.UN_TOKENIZED, Field.Term.STORE_DOCS));
> indexWriter.addDocument(document);
> {noformat}

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Commented: (LUCENE-1278) Add optional storing of document numbers in term dictionary

Posted by "Jason Rutherglen (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-1278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12594231#action_12594231 ] 

Jason Rutherglen commented on LUCENE-1278:
------------------------------------------

Storing the docs is off by default and will add index size only if the user wishes.  The byte blob allows not reading the docs when loaddocs is false.  Field cache and range query loading is very slow because of the dual seeks per term (for termenum then termdocs).  If in a separate file the terms are redundant.  

An field cache example:

protected Object createValue(IndexReader reader, Object entryKey)
        throws IOException {
      Entry entry = (Entry) entryKey;
      String field = entry.field;
      IntParser parser = (IntParser) entry.custom;
      final int[] retArray = new int[reader.maxDoc()];
      // TermDocs termDocs = reader.termDocs();  
      //TermEnum termEnum = reader.terms (new Term (field, ""));
      TermEnum termEnum = reader.terms (new Term (field, ""), true);
      try {
        do {
          Term term = termEnum.term();
          if (term==null || term.field() != field) break;
          int termval = parser.parseInt(term.text());
          int[] docs = termEnum.docs();
          for (int x=0; x < docs.length; x++) {
            retArray[docs[x]] = termval;
          }
          //termDocs.seek (termEnum);
          //while (termDocs.next()) {
          //  retArray[termDocs.doc()] = termval;
          //}
        } while (termEnum.next());
      } finally {
        //termDocs.close();
        termEnum.close();
      }
      return retArray;
    }

> Add optional storing of document numbers in term dictionary
> -----------------------------------------------------------
>
>                 Key: LUCENE-1278
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1278
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: Index
>    Affects Versions: 2.3.1
>            Reporter: Jason Rutherglen
>            Priority: Minor
>         Attachments: lucene.1278.5.4.2008.patch, lucene.1278.5.5.2008.2.patch, lucene.1278.5.5.2008.patch
>
>
> Add optional storing of document numbers in term dictionary.  String index field cache and range filter creation will be faster.  
> Example read code:
> {noformat}
> TermEnum termEnum = indexReader.terms(TermEnum.LOAD_DOCS);
> do {
>   Term term = termEnum.term();
>   if (term == null || term.field() != field) break;
>   int[] docs = termEnum.docs();
> } while (termEnum.next());
> {noformat}
> Example write code:
> {noformat}
> Document document = new Document();
> document.add(new Field("tag", "dog", Field.Store.YES, Field.Index.UN_TOKENIZED, Field.Term.STORE_DOCS));
> indexWriter.addDocument(document);
> {noformat}

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Commented: (LUCENE-1278) Add optional storing of document numbers in term dictionary

Posted by "Jason Rutherglen (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-1278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12594761#action_12594761 ] 

Jason Rutherglen commented on LUCENE-1278:
------------------------------------------

What if the int array is saved in TermInfo only if the docfreq was below a certain threshold?  Otherwise on int[] docs = TermEnum.docs() the docs are loaded from the file.  This solves the main issue with the int array, the potential for high numbers of docs being stored in ram.

> Add optional storing of document numbers in term dictionary
> -----------------------------------------------------------
>
>                 Key: LUCENE-1278
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1278
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: Index
>    Affects Versions: 2.3.1
>            Reporter: Jason Rutherglen
>            Priority: Minor
>         Attachments: lucene.1278.5.4.2008.patch, lucene.1278.5.5.2008.2.patch, lucene.1278.5.5.2008.patch, TestTermEnumDocs.java
>
>
> Add optional storing of document numbers in term dictionary.  String index field cache and range filter creation will be faster.  
> Example read code:
> {noformat}
> TermEnum termEnum = indexReader.terms(TermEnum.LOAD_DOCS);
> do {
>   Term term = termEnum.term();
>   if (term == null || term.field() != field) break;
>   int[] docs = termEnum.docs();
> } while (termEnum.next());
> {noformat}
> Example write code:
> {noformat}
> Document document = new Document();
> document.add(new Field("tag", "dog", Field.Store.YES, Field.Index.UN_TOKENIZED, Field.Term.STORE_DOCS));
> indexWriter.addDocument(document);
> {noformat}

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Updated: (LUCENE-1278) Add optional storing of document numbers in term dictionary

Posted by "Jason Rutherglen (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/LUCENE-1278?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jason Rutherglen updated LUCENE-1278:
-------------------------------------

    Attachment: lucene.1278.5.5.2008.2.patch

lucene.1278.5.5.2008.2.patch - Source formatting fixed.  SortedVIntList is for unchanging arrays.  There are index format changes.

> Add optional storing of document numbers in term dictionary
> -----------------------------------------------------------
>
>                 Key: LUCENE-1278
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1278
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: Index
>    Affects Versions: 2.3.1
>            Reporter: Jason Rutherglen
>            Priority: Minor
>         Attachments: lucene.1278.5.4.2008.patch, lucene.1278.5.5.2008.2.patch, lucene.1278.5.5.2008.patch
>
>
> Add optional storing of document numbers in term dictionary.  String index field cache and range filter creation will be faster.  
> Example read code:
> {noformat}
> TermEnum termEnum = indexReader.terms(TermEnum.LOAD_DOCS);
> do {
>   Term term = termEnum.term();
>   if (term == null || term.field() != field) break;
>   int[] docs = termEnum.docs();
> } while (termEnum.next());
> {noformat}
> Example write code:
> {noformat}
> Document document = new Document();
> document.add(new Field("tag", "dog", Field.Store.YES, Field.Index.UN_TOKENIZED, Field.Term.STORE_DOCS));
> indexWriter.addDocument(document);
> {noformat}

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Commented: (LUCENE-1278) Add optional storing of document numbers in term dictionary

Posted by "Paul Elschot (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-1278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12594233#action_12594233 ] 

Paul Elschot commented on LUCENE-1278:
--------------------------------------

Had a quick look at today's second patch. Some details:
There are still some layout changes in TermInfosReader (iirc).
There is an addition to Fieldable, which is a public interface so there is a backward compatibility issue there (sigh).
There is an extra constructor for SortedVIntList which is not used.

Instead of returning a full int[] for common terms, perhaps one could return a DocIdSetIterator that does the decoding of the doc ids as it is needed.

I'm getting curious about performance improvements from this, any numbers available?

> Add optional storing of document numbers in term dictionary
> -----------------------------------------------------------
>
>                 Key: LUCENE-1278
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1278
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: Index
>    Affects Versions: 2.3.1
>            Reporter: Jason Rutherglen
>            Priority: Minor
>         Attachments: lucene.1278.5.4.2008.patch, lucene.1278.5.5.2008.2.patch, lucene.1278.5.5.2008.patch
>
>
> Add optional storing of document numbers in term dictionary.  String index field cache and range filter creation will be faster.  
> Example read code:
> {noformat}
> TermEnum termEnum = indexReader.terms(TermEnum.LOAD_DOCS);
> do {
>   Term term = termEnum.term();
>   if (term == null || term.field() != field) break;
>   int[] docs = termEnum.docs();
> } while (termEnum.next());
> {noformat}
> Example write code:
> {noformat}
> Document document = new Document();
> document.add(new Field("tag", "dog", Field.Store.YES, Field.Index.UN_TOKENIZED, Field.Term.STORE_DOCS));
> indexWriter.addDocument(document);
> {noformat}

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Updated: (LUCENE-1278) Add optional storing of document numbers in term dictionary

Posted by "Jason Rutherglen (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/LUCENE-1278?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jason Rutherglen updated LUCENE-1278:
-------------------------------------

    Description: 
Add optional storing of document numbers in term dictionary.  String index field cache and range filter creation will be faster.  

Example read code:

TermEnum termEnum = indexReader.terms(TermEnum.LOAD_DOCS);\\
do {\\
  Term term = termEnum.term();\\
  if (term == null || term.field() != field) break;\\
  int[] docs = termEnum.docs();\\
} while (termEnum.next());\\

Example write code:

Document document = new Document();
document.add(new Field("tag", "dog", Field.Store.YES, Field.Index.UN_TOKENIZED, Field.Term.STORE_DOCS));
indexWriter.addDocument(document);

  was:
Add optional storing of document numbers in term dictionary.  String index field cache and range filter creation will be faster.  

Example read code:
<pre>
TermEnum termEnum = indexReader.terms(TermEnum.LOAD_DOCS);
do {
  Term term = termEnum.term();
  if (term == null || term.field() != field) break;
  int[] docs = termEnum.docs();
} while (termEnum.next());
</pre>
Example write code:

Document document = new Document();
document.add(new Field("tag", "dog", Field.Store.YES, Field.Index.UN_TOKENIZED, Field.Term.STORE_DOCS));
indexWriter.addDocument(document);


> Add optional storing of document numbers in term dictionary
> -----------------------------------------------------------
>
>                 Key: LUCENE-1278
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1278
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: Index
>    Affects Versions: 2.3.1
>            Reporter: Jason Rutherglen
>            Priority: Minor
>
> Add optional storing of document numbers in term dictionary.  String index field cache and range filter creation will be faster.  
> Example read code:
> TermEnum termEnum = indexReader.terms(TermEnum.LOAD_DOCS);\\
> do {\\
>   Term term = termEnum.term();\\
>   if (term == null || term.field() != field) break;\\
>   int[] docs = termEnum.docs();\\
> } while (termEnum.next());\\
> Example write code:
> Document document = new Document();
> document.add(new Field("tag", "dog", Field.Store.YES, Field.Index.UN_TOKENIZED, Field.Term.STORE_DOCS));
> indexWriter.addDocument(document);

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Resolved: (LUCENE-1278) Add optional storing of document numbers in term dictionary

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/LUCENE-1278?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless resolved LUCENE-1278.
----------------------------------------

    Resolution: Won't Fix

I think the pulsing codec (wraps any other codec, but inlines low-freq terms directly into the terms dict) solves this?

> Add optional storing of document numbers in term dictionary
> -----------------------------------------------------------
>
>                 Key: LUCENE-1278
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1278
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: Index
>    Affects Versions: 2.3.1
>            Reporter: Jason Rutherglen
>            Priority: Minor
>         Attachments: lucene.1278.5.4.2008.patch, lucene.1278.5.5.2008.2.patch, lucene.1278.5.5.2008.patch, lucene.1278.5.7.2008.patch, lucene.1278.5.7.2008.test.patch, TestTermEnumDocs.java
>
>
> Add optional storing of document numbers in term dictionary.  String index field cache and range filter creation will be faster.  
> Example read code:
> {noformat}
> TermEnum termEnum = indexReader.terms(TermEnum.LOAD_DOCS);
> do {
>   Term term = termEnum.term();
>   if (term == null || term.field() != field) break;
>   int[] docs = termEnum.docs();
> } while (termEnum.next());
> {noformat}
> Example write code:
> {noformat}
> Document document = new Document();
> document.add(new Field("tag", "dog", Field.Store.YES, Field.Index.UN_TOKENIZED, Field.Term.STORE_DOCS));
> indexWriter.addDocument(document);
> {noformat}

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: https://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Commented: (LUCENE-1278) Add optional storing of document numbers in term dictionary

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-1278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12594219#action_12594219 ] 

Michael McCandless commented on LUCENE-1278:
--------------------------------------------

{quote}
Is there a way to know the number of documents for a term in DocumentsWriter.appendPostings before running through all of them?
{quote}

I don't think so.  You have to run through the list.



> Add optional storing of document numbers in term dictionary
> -----------------------------------------------------------
>
>                 Key: LUCENE-1278
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1278
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: Index
>    Affects Versions: 2.3.1
>            Reporter: Jason Rutherglen
>            Priority: Minor
>         Attachments: lucene.1278.5.4.2008.patch, lucene.1278.5.5.2008.patch
>
>
> Add optional storing of document numbers in term dictionary.  String index field cache and range filter creation will be faster.  
> Example read code:
> {noformat}
> TermEnum termEnum = indexReader.terms(TermEnum.LOAD_DOCS);
> do {
>   Term term = termEnum.term();
>   if (term == null || term.field() != field) break;
>   int[] docs = termEnum.docs();
> } while (termEnum.next());
> {noformat}
> Example write code:
> {noformat}
> Document document = new Document();
> document.add(new Field("tag", "dog", Field.Store.YES, Field.Index.UN_TOKENIZED, Field.Term.STORE_DOCS));
> indexWriter.addDocument(document);
> {noformat}

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Commented: (LUCENE-1278) Add optional storing of document numbers in term dictionary

Posted by "Jason Rutherglen (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-1278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12594317#action_12594317 ] 

Jason Rutherglen commented on LUCENE-1278:
------------------------------------------

Returning DocIdSetIterator from TermEnum is good, will implement decoding bytes directly from file.

Flexible indexing is good, will implement when it's completed.

> Add optional storing of document numbers in term dictionary
> -----------------------------------------------------------
>
>                 Key: LUCENE-1278
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1278
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: Index
>    Affects Versions: 2.3.1
>            Reporter: Jason Rutherglen
>            Priority: Minor
>         Attachments: lucene.1278.5.4.2008.patch, lucene.1278.5.5.2008.2.patch, lucene.1278.5.5.2008.patch, TestTermEnumDocs.java
>
>
> Add optional storing of document numbers in term dictionary.  String index field cache and range filter creation will be faster.  
> Example read code:
> {noformat}
> TermEnum termEnum = indexReader.terms(TermEnum.LOAD_DOCS);
> do {
>   Term term = termEnum.term();
>   if (term == null || term.field() != field) break;
>   int[] docs = termEnum.docs();
> } while (termEnum.next());
> {noformat}
> Example write code:
> {noformat}
> Document document = new Document();
> document.add(new Field("tag", "dog", Field.Store.YES, Field.Index.UN_TOKENIZED, Field.Term.STORE_DOCS));
> indexWriter.addDocument(document);
> {noformat}

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Commented: (LUCENE-1278) Add optional storing of document numbers in term dictionary

Posted by "Jason Rutherglen (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-1278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12594128#action_12594128 ] 

Jason Rutherglen commented on LUCENE-1278:
------------------------------------------

Test cases being worked on

> Add optional storing of document numbers in term dictionary
> -----------------------------------------------------------
>
>                 Key: LUCENE-1278
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1278
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: Index
>    Affects Versions: 2.3.1
>            Reporter: Jason Rutherglen
>            Priority: Minor
>         Attachments: lucene.1278.5.4.2008.patch
>
>
> Add optional storing of document numbers in term dictionary.  String index field cache and range filter creation will be faster.  
> Example read code:
> {noformat}
> TermEnum termEnum = indexReader.terms(TermEnum.LOAD_DOCS);
> do {
>   Term term = termEnum.term();
>   if (term == null || term.field() != field) break;
>   int[] docs = termEnum.docs();
> } while (termEnum.next());
> {noformat}
> Example write code:
> {noformat}
> Document document = new Document();
> document.add(new Field("tag", "dog", Field.Store.YES, Field.Index.UN_TOKENIZED, Field.Term.STORE_DOCS));
> indexWriter.addDocument(document);
> {noformat}

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Commented: (LUCENE-1278) Add optional storing of document numbers in term dictionary

Posted by "Paul Elschot (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-1278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12596762#action_12596762 ] 

Paul Elschot commented on LUCENE-1278:
--------------------------------------

Some comments on the 5.7.2008 patch:

The test with 7.6 times speedup for very few docs per term makes me wonder why this never showed up as a performance problem before. It certainly shows an advantage of flexible indexing for the case in which the within document term frequencies are not needed (for example primary/foreign keys, which normally end up in a keyword field.)

In the patch, DocIdSetIterator is used in the org.apache.lucene.index package, so it would be a good idea to move it from o.a.l.search to o.a.l.index or to o.a.l.util to avoid a circular dependency involving the index and search packages. As DocIdSetIterator is not yet released, this move should be no problem.

The DocIdSetReader class in the patch has so much code in common with SortedVIntList that it might be better to merge the two into a single one, and try and refactor common code into new methods there.
That would also be an easy way to get rid of the unsupported skipTo() operation.



> Add optional storing of document numbers in term dictionary
> -----------------------------------------------------------
>
>                 Key: LUCENE-1278
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1278
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: Index
>    Affects Versions: 2.3.1
>            Reporter: Jason Rutherglen
>            Priority: Minor
>         Attachments: lucene.1278.5.4.2008.patch, lucene.1278.5.5.2008.2.patch, lucene.1278.5.5.2008.patch, lucene.1278.5.7.2008.patch, lucene.1278.5.7.2008.test.patch, TestTermEnumDocs.java
>
>
> Add optional storing of document numbers in term dictionary.  String index field cache and range filter creation will be faster.  
> Example read code:
> {noformat}
> TermEnum termEnum = indexReader.terms(TermEnum.LOAD_DOCS);
> do {
>   Term term = termEnum.term();
>   if (term == null || term.field() != field) break;
>   int[] docs = termEnum.docs();
> } while (termEnum.next());
> {noformat}
> Example write code:
> {noformat}
> Document document = new Document();
> document.add(new Field("tag", "dog", Field.Store.YES, Field.Index.UN_TOKENIZED, Field.Term.STORE_DOCS));
> indexWriter.addDocument(document);
> {noformat}

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Commented: (LUCENE-1278) Add optional storing of document numbers in term dictionary

Posted by "Eks Dev (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-1278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12615077#action_12615077 ] 

Eks Dev commented on LUCENE-1278:
---------------------------------

in light of Mike's comments hier (Michael McCandless - 05/May/08 05:33 AM), I think it is worth mentioning that I am working on LUCENE-1340, that is storing postings without additional frq info. 

correct me if I am wrong, the only difference is that this approach with *.frq needs one seek more... at the same time, this could potentially increase term dict size, so we loose some locality.

Your your last proposal sounds interesting,  "inline short postings" into term dict , so for short postings (about the size of offset pointer into *.frq) with tf==1 (that is the always the case if you use omitTf(true) from LUCENE-1340)  we spare one seek()... this could be a lot. Also, there is no need to store postings into *frq  (this complicates maintenance I guess)  

> Add optional storing of document numbers in term dictionary
> -----------------------------------------------------------
>
>                 Key: LUCENE-1278
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1278
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: Index
>    Affects Versions: 2.3.1
>            Reporter: Jason Rutherglen
>            Priority: Minor
>         Attachments: lucene.1278.5.4.2008.patch, lucene.1278.5.5.2008.2.patch, lucene.1278.5.5.2008.patch, lucene.1278.5.7.2008.patch, lucene.1278.5.7.2008.test.patch, TestTermEnumDocs.java
>
>
> Add optional storing of document numbers in term dictionary.  String index field cache and range filter creation will be faster.  
> Example read code:
> {noformat}
> TermEnum termEnum = indexReader.terms(TermEnum.LOAD_DOCS);
> do {
>   Term term = termEnum.term();
>   if (term == null || term.field() != field) break;
>   int[] docs = termEnum.docs();
> } while (termEnum.next());
> {noformat}
> Example write code:
> {noformat}
> Document document = new Document();
> document.add(new Field("tag", "dog", Field.Store.YES, Field.Index.UN_TOKENIZED, Field.Term.STORE_DOCS));
> indexWriter.addDocument(document);
> {noformat}

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Updated: (LUCENE-1278) Add optional storing of document numbers in term dictionary

Posted by "Jason Rutherglen (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/LUCENE-1278?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jason Rutherglen updated LUCENE-1278:
-------------------------------------

    Attachment: lucene.1278.5.4.2008.patch

Patch generated using TortoiseSVN.  

> Add optional storing of document numbers in term dictionary
> -----------------------------------------------------------
>
>                 Key: LUCENE-1278
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1278
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: Index
>    Affects Versions: 2.3.1
>            Reporter: Jason Rutherglen
>            Priority: Minor
>         Attachments: lucene.1278.5.4.2008.patch
>
>
> Add optional storing of document numbers in term dictionary.  String index field cache and range filter creation will be faster.  
> Example read code:
> {noformat}
> TermEnum termEnum = indexReader.terms(TermEnum.LOAD_DOCS);
> do {
>   Term term = termEnum.term();
>   if (term == null || term.field() != field) break;
>   int[] docs = termEnum.docs();
> } while (termEnum.next());
> {noformat}
> Example write code:
> {noformat}
> Document document = new Document();
> document.add(new Field("tag", "dog", Field.Store.YES, Field.Index.UN_TOKENIZED, Field.Term.STORE_DOCS));
> indexWriter.addDocument(document);
> {noformat}

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Updated: (LUCENE-1278) Add optional storing of document numbers in term dictionary

Posted by "Jason Rutherglen (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/LUCENE-1278?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jason Rutherglen updated LUCENE-1278:
-------------------------------------

    Description: 
Add optional storing of document numbers in term dictionary.  String index field cache and range filter creation will be faster.  

Example read code:
<pre>
TermEnum termEnum = indexReader.terms(TermEnum.LOAD_DOCS);
do {
  Term term = termEnum.term();
  if (term == null || term.field() != field) break;
  int[] docs = termEnum.docs();
} while (termEnum.next());
</pre>
Example write code:

Document document = new Document();
document.add(new Field("tag", "dog", Field.Store.YES, Field.Index.UN_TOKENIZED, Field.Term.STORE_DOCS));
indexWriter.addDocument(document);

  was:
Add optional storing of document numbers in term dictionary.  String index field cache and range filter creation will be faster.  

Example read code:

TermEnum termEnum = indexReader.terms(TermEnum.LOAD_DOCS);
do {
  Term term = termEnum.term();
  if (term == null || term.field() != field) break;
  int[] docs = termEnum.docs();
} while (termEnum.next());

Example write code:

Document document = new Document();
document.add(new Field("tag", "dog", Field.Store.YES, Field.Index.UN_TOKENIZED, Field.Term.STORE_DOCS));
indexWriter.addDocument(document);


> Add optional storing of document numbers in term dictionary
> -----------------------------------------------------------
>
>                 Key: LUCENE-1278
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1278
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: Index
>    Affects Versions: 2.3.1
>            Reporter: Jason Rutherglen
>            Priority: Minor
>
> Add optional storing of document numbers in term dictionary.  String index field cache and range filter creation will be faster.  
> Example read code:
> <pre>
> TermEnum termEnum = indexReader.terms(TermEnum.LOAD_DOCS);
> do {
>   Term term = termEnum.term();
>   if (term == null || term.field() != field) break;
>   int[] docs = termEnum.docs();
> } while (termEnum.next());
> </pre>
> Example write code:
> Document document = new Document();
> document.add(new Field("tag", "dog", Field.Store.YES, Field.Index.UN_TOKENIZED, Field.Term.STORE_DOCS));
> indexWriter.addDocument(document);

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Commented: (LUCENE-1278) Add optional storing of document numbers in term dictionary

Posted by "Jason Rutherglen (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-1278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12598793#action_12598793 ] 

Jason Rutherglen commented on LUCENE-1278:
------------------------------------------

Thought of some simple logic for this that will make it work automatically with no user interaction and no API additions.

If the term is located in less than or equal to the skipinterval of termdocs docs, and the term frequency for each doc is 1, then the docs should be stored in segment.tis.  Otherwise they should be stored as usual in segment.frq.  

The problem is knowing whether the logic is true in the DocumentsWriter.appendPostings method.  

> Add optional storing of document numbers in term dictionary
> -----------------------------------------------------------
>
>                 Key: LUCENE-1278
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1278
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: Index
>    Affects Versions: 2.3.1
>            Reporter: Jason Rutherglen
>            Priority: Minor
>         Attachments: lucene.1278.5.4.2008.patch, lucene.1278.5.5.2008.2.patch, lucene.1278.5.5.2008.patch, lucene.1278.5.7.2008.patch, lucene.1278.5.7.2008.test.patch, TestTermEnumDocs.java
>
>
> Add optional storing of document numbers in term dictionary.  String index field cache and range filter creation will be faster.  
> Example read code:
> {noformat}
> TermEnum termEnum = indexReader.terms(TermEnum.LOAD_DOCS);
> do {
>   Term term = termEnum.term();
>   if (term == null || term.field() != field) break;
>   int[] docs = termEnum.docs();
> } while (termEnum.next());
> {noformat}
> Example write code:
> {noformat}
> Document document = new Document();
> document.add(new Field("tag", "dog", Field.Store.YES, Field.Index.UN_TOKENIZED, Field.Term.STORE_DOCS));
> indexWriter.addDocument(document);
> {noformat}

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org