You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@lucene.apache.org by "Eks Dev (JIRA)" <ji...@apache.org> on 2008/07/19 00:39:31 UTC

[jira] Created: (LUCENE-1340) Make it posible not to include TF information in index

Make it posible not to include TF information in index
------------------------------------------------------

                 Key: LUCENE-1340
                 URL: https://issues.apache.org/jira/browse/LUCENE-1340
             Project: Lucene - Java
          Issue Type: New Feature
          Components: Index
            Reporter: Eks Dev
            Priority: Minor


Term Frequency is typically not needed  for all fields, some CPU (reading one VInt less and one X>>>1...) and IO can be spared by making pure boolen fields possible in Lucene. This topic has already been discussed and accepted as a part of Flexible Indexing... This issue tries to push things a bit faster forward as I have some concrete customer demands.

benefits can be expected for fields that are typical candidates for Filters, enumerations, user rights, IDs or very short "texts", phone  numbers, zip codes, names...

Status: just passed standard test (compatibility), commited for early review, I have not tried new feature, missing some asserts and one two unit tests

Complexity: simpler than expected

can be used via omitTf() (who used omitNorms() will know where to find it :)  

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Commented: (LUCENE-1340) Make it posible not to include TF information in index

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-1340?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12615446#action_12615446 ] 

Michael McCandless commented on LUCENE-1340:
--------------------------------------------

bq. About this one, it would be nice not to store this as well, but I think the pointers are already reduced to one byte, as they are 0 for these cases (are they,?) So we have this benefit without expecting it

Ahh, right.  The delta between the proxPointers are written as vlong's.  Since the delta will be zero it's now only 1 byte; only a bit worse than 0 bytes ;)

bq. that would mean we could easily "inline" very short postings into term dict (here I expect huge performance benefit, as skip() on another large file is going to be saved independent from omitTf(true))

Yes, this looks like it would be a win for cases that need to visit the postings for many small terms.

> Make it posible not to include TF information in index
> ------------------------------------------------------
>
>                 Key: LUCENE-1340
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1340
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: Index
>            Reporter: Eks Dev
>            Priority: Minor
>         Attachments: LUCENE-1340.patch, LUCENE-1340.patch, LUCENE-1340.patch, LUCENE-1340.patch, LUCENE-1340.patch
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> Term Frequency is typically not needed  for all fields, some CPU (reading one VInt less and one X>>>1...) and IO can be spared by making pure boolen fields possible in Lucene. This topic has already been discussed and accepted as a part of Flexible Indexing... This issue tries to push things a bit faster forward as I have some concrete customer demands.
> benefits can be expected for fields that are typical candidates for Filters, enumerations, user rights, IDs or very short "texts", phone  numbers, zip codes, names...
> Status: just passed standard test (compatibility), commited for early review, I have not tried new feature, missing some asserts and one two unit tests
> Complexity: simpler than expected
> can be used via omitTf() (who used omitNorms() will know where to find it :)  

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Assigned: (LUCENE-1340) Make it posible not to include TF information in index

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/LUCENE-1340?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless reassigned LUCENE-1340:
------------------------------------------

    Assignee: Michael McCandless

> Make it posible not to include TF information in index
> ------------------------------------------------------
>
>                 Key: LUCENE-1340
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1340
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: Index
>            Reporter: Eks Dev
>            Assignee: Michael McCandless
>            Priority: Minor
>         Attachments: LUCENE-1340.patch, LUCENE-1340.patch, LUCENE-1340.patch, LUCENE-1340.patch, LUCENE-1340.patch, LUCENE-1340.patch, LUCENE-1340.patch
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> Term Frequency is typically not needed  for all fields, some CPU (reading one VInt less and one X>>>1...) and IO can be spared by making pure boolen fields possible in Lucene. This topic has already been discussed and accepted as a part of Flexible Indexing... This issue tries to push things a bit faster forward as I have some concrete customer demands.
> benefits can be expected for fields that are typical candidates for Filters, enumerations, user rights, IDs or very short "texts", phone  numbers, zip codes, names...
> Status: just passed standard test (compatibility), commited for early review, I have not tried new feature, missing some asserts and one two unit tests
> Complexity: simpler than expected
> can be used via omitTf() (who used omitNorms() will know where to find it :)  

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Commented: (LUCENE-1340) Make it posible not to include TF information in index

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-1340?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12616513#action_12616513 ] 

Michael McCandless commented on LUCENE-1340:
--------------------------------------------

bq. The delta between the proxPointers are written as vlong's. Since the delta will be zero it's now only 1 byte; only a bit worse than 0 bytes

One more thing here: since the tiis are loaded into RAM, that unused proxPointer wastes 8 bytes for each indexed terms.  For indices with alot of terms this can add up to alot of wasted ram.  But still I think we should wait and fix this as part of flexible indexing, when we maybe refactor the TermInfos to be "column stride" instead.

> Make it posible not to include TF information in index
> ------------------------------------------------------
>
>                 Key: LUCENE-1340
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1340
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: Index
>            Reporter: Eks Dev
>            Priority: Minor
>         Attachments: LUCENE-1340.patch, LUCENE-1340.patch, LUCENE-1340.patch, LUCENE-1340.patch, LUCENE-1340.patch, LUCENE-1340.patch
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> Term Frequency is typically not needed  for all fields, some CPU (reading one VInt less and one X>>>1...) and IO can be spared by making pure boolen fields possible in Lucene. This topic has already been discussed and accepted as a part of Flexible Indexing... This issue tries to push things a bit faster forward as I have some concrete customer demands.
> benefits can be expected for fields that are typical candidates for Filters, enumerations, user rights, IDs or very short "texts", phone  numbers, zip codes, names...
> Status: just passed standard test (compatibility), commited for early review, I have not tried new feature, missing some asserts and one two unit tests
> Complexity: simpler than expected
> can be used via omitTf() (who used omitNorms() will know where to find it :)  

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Commented: (LUCENE-1340) Make it posible not to include TF information in index

Posted by "Eks Dev (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-1340?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12617140#action_12617140 ] 

Eks Dev commented on LUCENE-1340:
---------------------------------

we  finished our tests

Index without omitTf() :
- 87Mio Documents, 2 indexed Fields one stored field
- Unique terms in index 2.5Mio
- Average Field lengths in tokens: 3.3 and 5.5 (very short fields)
- On Disk size 3.8 Gb total with stored field
 
Queries under test: 
- BooleanQuery in all shapes and forms (disjunctive, conjunctive, nested, with minNumberShouldMatch()) . with a lot of clauses (5-100).
- Filter used, yes

Test scope, regression with 30k Queries on the same index with omitTf(true/false).

Result:

- The Queries returned 100% identical Hits (full recall tested, all hits checked)!

- Index size reduction(not including stored field!): 7% (short documents => less positions than in Mike's case)

- Performance of Queries: 5.2% faster, but index was loaded as RAMIndex (on disk setup should bring even more due to the reduced IO for reading postings)

-Indexing performance (FSDisk!) 13% faster

Also, we compared omitTf(false) with this patch and lucene.jar without this patch, no changes whatsoever.

>From my perspective, this is good to go into production. At least for our usage of lucene, there are no differences with homitTf(true)... 

>One more thing here: since the tiis are loaded into RAM, that unused proxPointer wastes 8 bytes for each indexed terms. For indices with alot of terms this can add up to alot of wasted ram. But still I think we should wait and fix this as part of flexible indexing, when we maybe refactor the TermInfos to be "column stride" instead.

I am more than happy with the results, no need to squeeze the last bit out of it right now.

Mike, thanks again for the great work! 



> Make it posible not to include TF information in index
> ------------------------------------------------------
>
>                 Key: LUCENE-1340
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1340
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: Index
>            Reporter: Eks Dev
>            Priority: Minor
>         Attachments: LUCENE-1340.patch, LUCENE-1340.patch, LUCENE-1340.patch, LUCENE-1340.patch, LUCENE-1340.patch, LUCENE-1340.patch, LUCENE-1340.patch
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> Term Frequency is typically not needed  for all fields, some CPU (reading one VInt less and one X>>>1...) and IO can be spared by making pure boolen fields possible in Lucene. This topic has already been discussed and accepted as a part of Flexible Indexing... This issue tries to push things a bit faster forward as I have some concrete customer demands.
> benefits can be expected for fields that are typical candidates for Filters, enumerations, user rights, IDs or very short "texts", phone  numbers, zip codes, names...
> Status: just passed standard test (compatibility), commited for early review, I have not tried new feature, missing some asserts and one two unit tests
> Complexity: simpler than expected
> can be used via omitTf() (who used omitNorms() will know where to find it :)  

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Commented: (LUCENE-1340) Make it posible not to include TF information in index

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-1340?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12618272#action_12618272 ] 

Michael McCandless commented on LUCENE-1340:
--------------------------------------------

Sigh, I too missed that we broke back-compatibility.

But I agree: let's mark Fieldable interface as being allowed to change from release to release (consciously make an exception to back compatibility requirements).

Let's also transition away from interface for Field, for 3.0   EG we last had discussions on this, here:

    http://mail-archives.apache.org/mod_mbox/lucene-java-dev/200803.mbox/%3C45933841-6EBF-4208-B10D-0D3B0BB530FF@mikemccandless.com%3E



> Make it posible not to include TF information in index
> ------------------------------------------------------
>
>                 Key: LUCENE-1340
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1340
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: Index
>            Reporter: Eks Dev
>            Priority: Minor
>         Attachments: LUCENE-1340.patch, LUCENE-1340.patch, LUCENE-1340.patch, LUCENE-1340.patch, LUCENE-1340.patch, LUCENE-1340.patch, LUCENE-1340.patch
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> Term Frequency is typically not needed  for all fields, some CPU (reading one VInt less and one X>>>1...) and IO can be spared by making pure boolen fields possible in Lucene. This topic has already been discussed and accepted as a part of Flexible Indexing... This issue tries to push things a bit faster forward as I have some concrete customer demands.
> benefits can be expected for fields that are typical candidates for Filters, enumerations, user rights, IDs or very short "texts", phone  numbers, zip codes, names...
> Status: just passed standard test (compatibility), commited for early review, I have not tried new feature, missing some asserts and one two unit tests
> Complexity: simpler than expected
> can be used via omitTf() (who used omitNorms() will know where to find it :)  

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Updated: (LUCENE-1340) Make it posible not to include TF information in index

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/LUCENE-1340?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless updated LUCENE-1340:
---------------------------------------

    Attachment: LUCENE-1340.patch

I attached a new rev of the patch:

  * Use less RAM if field omits tf's (don't write the tf's into the RAM buffer), so we flush less often

  * Added another test case to TestOmitTf

As a test, I indexed full wikipedia (~3.2 million docs) with this alg:

{code}
analyzer=org.apache.lucene.analysis.standard.StandardAnalyzer

doc.maker=org.apache.lucene.benchmark.byTask.feeds.LineDocMaker

docs.file=/Volumes/External/lucene/wiki.txt
doc.stored = false
doc.term.vector = false
doc.add.log.step=10000
max.field.length=2147483647

directory=FSDirectory
autocommit=false
compound=false
doc.maker.forever = false

work.dir=/lucene/work2
ram.flush.mb=64

- CreateIndex
{ "AddDocs" AddDoc > : *
- CloseIndex

RepSumByPrefRound AddDoc

{code}

With tf's it takes 970 seconds and index size is 2.5 GB.  Without tf's
it takes 834 seconds (14% faster) and index size is 1.1 GB (56%
smaller).


> Make it posible not to include TF information in index
> ------------------------------------------------------
>
>                 Key: LUCENE-1340
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1340
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: Index
>            Reporter: Eks Dev
>            Priority: Minor
>         Attachments: LUCENE-1340.patch, LUCENE-1340.patch, LUCENE-1340.patch, LUCENE-1340.patch, LUCENE-1340.patch, LUCENE-1340.patch
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> Term Frequency is typically not needed  for all fields, some CPU (reading one VInt less and one X>>>1...) and IO can be spared by making pure boolen fields possible in Lucene. This topic has already been discussed and accepted as a part of Flexible Indexing... This issue tries to push things a bit faster forward as I have some concrete customer demands.
> benefits can be expected for fields that are typical candidates for Filters, enumerations, user rights, IDs or very short "texts", phone  numbers, zip codes, names...
> Status: just passed standard test (compatibility), commited for early review, I have not tried new feature, missing some asserts and one two unit tests
> Complexity: simpler than expected
> can be used via omitTf() (who used omitNorms() will know where to find it :)  

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Commented: (LUCENE-1340) Make it posible not to include TF information in index

Posted by "Grant Ingersoll (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-1340?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12617996#action_12617996 ] 

Grant Ingersoll commented on LUCENE-1340:
-----------------------------------------

Yeah, it's one of my biggest regrets in Lucene (yes, I am responsible for it), yet I firmly believe there is a way to do interfaces and abstracts in a proper way in Java.

We could make LazyField extend AbstractField, I think, but it's not clear, as there are some differences between the two, mostly around construction.  I'd have to go back and review again.

That being said, I still think if there is one place where we should allow breaking the back compat. contract, it is Fieldable!  For every rule, there is an exception, right?  I thinnk we could, w/ sufficient warning, tell people that we are changing the interface.  I am willing to bet that the number of people that would be effected by that would be less than 10.

So, please don't give up on this patch.  I am totally 100% for it.  I think it makes total sense to do.  

Another option is to speed up going towards 3.0

> Make it posible not to include TF information in index
> ------------------------------------------------------
>
>                 Key: LUCENE-1340
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1340
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: Index
>            Reporter: Eks Dev
>            Priority: Minor
>         Attachments: LUCENE-1340.patch, LUCENE-1340.patch, LUCENE-1340.patch, LUCENE-1340.patch, LUCENE-1340.patch, LUCENE-1340.patch, LUCENE-1340.patch
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> Term Frequency is typically not needed  for all fields, some CPU (reading one VInt less and one X>>>1...) and IO can be spared by making pure boolen fields possible in Lucene. This topic has already been discussed and accepted as a part of Flexible Indexing... This issue tries to push things a bit faster forward as I have some concrete customer demands.
> benefits can be expected for fields that are typical candidates for Filters, enumerations, user rights, IDs or very short "texts", phone  numbers, zip codes, names...
> Status: just passed standard test (compatibility), commited for early review, I have not tried new feature, missing some asserts and one two unit tests
> Complexity: simpler than expected
> can be used via omitTf() (who used omitNorms() will know where to find it :)  

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Commented: (LUCENE-1340) Make it posible not to include TF information in index

Posted by "Grant Ingersoll (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-1340?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12618290#action_12618290 ] 

Grant Ingersoll commented on LUCENE-1340:
-----------------------------------------

OK, I think we should call a vote on it, as it is significant enough in my mind.  I will write it up.

> Make it posible not to include TF information in index
> ------------------------------------------------------
>
>                 Key: LUCENE-1340
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1340
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: Index
>            Reporter: Eks Dev
>            Priority: Minor
>         Attachments: LUCENE-1340.patch, LUCENE-1340.patch, LUCENE-1340.patch, LUCENE-1340.patch, LUCENE-1340.patch, LUCENE-1340.patch, LUCENE-1340.patch
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> Term Frequency is typically not needed  for all fields, some CPU (reading one VInt less and one X>>>1...) and IO can be spared by making pure boolen fields possible in Lucene. This topic has already been discussed and accepted as a part of Flexible Indexing... This issue tries to push things a bit faster forward as I have some concrete customer demands.
> benefits can be expected for fields that are typical candidates for Filters, enumerations, user rights, IDs or very short "texts", phone  numbers, zip codes, names...
> Status: just passed standard test (compatibility), commited for early review, I have not tried new feature, missing some asserts and one two unit tests
> Complexity: simpler than expected
> can be used via omitTf() (who used omitNorms() will know where to find it :)  

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Commented: (LUCENE-1340) Make it posible not to include TF information in index

Posted by "Paul Elschot (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-1340?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12614909#action_12614909 ] 

Paul Elschot commented on LUCENE-1340:
--------------------------------------

Ok ok. I'll start working on adding a Filter as a clause to BooleanQuery. Will take some time though, there's a holiday coming up.

> Make it posible not to include TF information in index
> ------------------------------------------------------
>
>                 Key: LUCENE-1340
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1340
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: Index
>            Reporter: Eks Dev
>            Priority: Minor
>         Attachments: LUCENE-1340.patch
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> Term Frequency is typically not needed  for all fields, some CPU (reading one VInt less and one X>>>1...) and IO can be spared by making pure boolen fields possible in Lucene. This topic has already been discussed and accepted as a part of Flexible Indexing... This issue tries to push things a bit faster forward as I have some concrete customer demands.
> benefits can be expected for fields that are typical candidates for Filters, enumerations, user rights, IDs or very short "texts", phone  numbers, zip codes, names...
> Status: just passed standard test (compatibility), commited for early review, I have not tried new feature, missing some asserts and one two unit tests
> Complexity: simpler than expected
> can be used via omitTf() (who used omitNorms() will know where to find it :)  

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Commented: (LUCENE-1340) Make it posible not to include TF information in index

Posted by "Eks Dev (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-1340?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12617978#action_12617978 ] 

Eks Dev commented on LUCENE-1340:
---------------------------------

ouch! it is kind of getting personal between me and Fieldable :) Not the first time to get bugged by it!

Due to Fieldable (things really important, at lest to me):  
- We cannot get binary stored Field in and out of lucene without getting gc() go crazy
- We cannot omitTF 
 
it would be possible somehow to make it at AbstractField levele and instanceoff at a few places, but I simply hate to do it (I will patch my local copy, this issue is worth to me... must branch off from the trunk for the first time, sigh)

funny it is, I see no reason to have anything but AbstractField (Field/Fieldable are just redundant)

> Make it posible not to include TF information in index
> ------------------------------------------------------
>
>                 Key: LUCENE-1340
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1340
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: Index
>            Reporter: Eks Dev
>            Priority: Minor
>         Attachments: LUCENE-1340.patch, LUCENE-1340.patch, LUCENE-1340.patch, LUCENE-1340.patch, LUCENE-1340.patch, LUCENE-1340.patch, LUCENE-1340.patch
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> Term Frequency is typically not needed  for all fields, some CPU (reading one VInt less and one X>>>1...) and IO can be spared by making pure boolen fields possible in Lucene. This topic has already been discussed and accepted as a part of Flexible Indexing... This issue tries to push things a bit faster forward as I have some concrete customer demands.
> benefits can be expected for fields that are typical candidates for Filters, enumerations, user rights, IDs or very short "texts", phone  numbers, zip codes, names...
> Status: just passed standard test (compatibility), commited for early review, I have not tried new feature, missing some asserts and one two unit tests
> Complexity: simpler than expected
> can be used via omitTf() (who used omitNorms() will know where to find it :)  

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Updated: (LUCENE-1340) Make it posible not to include TF information in index

Posted by "Eks Dev (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/LUCENE-1340?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Eks Dev updated LUCENE-1340:
----------------------------

    Attachment: LUCENE-1340.patch

first cut

> Make it posible not to include TF information in index
> ------------------------------------------------------
>
>                 Key: LUCENE-1340
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1340
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: Index
>            Reporter: Eks Dev
>            Priority: Minor
>         Attachments: LUCENE-1340.patch
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> Term Frequency is typically not needed  for all fields, some CPU (reading one VInt less and one X>>>1...) and IO can be spared by making pure boolen fields possible in Lucene. This topic has already been discussed and accepted as a part of Flexible Indexing... This issue tries to push things a bit faster forward as I have some concrete customer demands.
> benefits can be expected for fields that are typical candidates for Filters, enumerations, user rights, IDs or very short "texts", phone  numbers, zip codes, names...
> Status: just passed standard test (compatibility), commited for early review, I have not tried new feature, missing some asserts and one two unit tests
> Complexity: simpler than expected
> can be used via omitTf() (who used omitNorms() will know where to find it :)  

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Resolved: (LUCENE-1340) Make it posible not to include TF information in index

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/LUCENE-1340?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless resolved LUCENE-1340.
----------------------------------------

       Resolution: Fixed
    Fix Version/s: 2.4
    Lucene Fields: [New, Patch Available]  (was: [Patch Available, New])

Thanks Eks!

> Make it posible not to include TF information in index
> ------------------------------------------------------
>
>                 Key: LUCENE-1340
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1340
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: Index
>            Reporter: Eks Dev
>            Assignee: Michael McCandless
>            Priority: Minor
>             Fix For: 2.4
>
>         Attachments: LUCENE-1340.patch, LUCENE-1340.patch, LUCENE-1340.patch, LUCENE-1340.patch, LUCENE-1340.patch, LUCENE-1340.patch, LUCENE-1340.patch
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> Term Frequency is typically not needed  for all fields, some CPU (reading one VInt less and one X>>>1...) and IO can be spared by making pure boolen fields possible in Lucene. This topic has already been discussed and accepted as a part of Flexible Indexing... This issue tries to push things a bit faster forward as I have some concrete customer demands.
> benefits can be expected for fields that are typical candidates for Filters, enumerations, user rights, IDs or very short "texts", phone  numbers, zip codes, names...
> Status: just passed standard test (compatibility), commited for early review, I have not tried new feature, missing some asserts and one two unit tests
> Complexity: simpler than expected
> can be used via omitTf() (who used omitNorms() will know where to find it :)  

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Commented: (LUCENE-1340) Make it posible not to include TF information in index

Posted by "Eks Dev (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-1340?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12615357#action_12615357 ] 

Eks Dev commented on LUCENE-1340:
---------------------------------

Great, it is already more than I expected, even indexing is going to be somewhat faster.

I have tried your patch on smallish index with 8Mio documents and it worked on our regression test without problems. 
it worked fine with and without omitTf(true), no performance drop or bad surprises when we do not use it. Tomorrow is scheduled real test with production data, around 80Mio very small documents, with some very extensive tests.... I will report back.

"The one place I know of that will still waste bytes is the term dict
(TermInfo): it stores a long proxPointer on disk (in .tii,.tis) and
also in memory because we load *.tii into RAM.... "

 About this one, it would be nice not to store this as well, but I think the pointers are already reduced to one byte, as they are 0 for these cases (are they,?) So we have this benefit without expecting it :)

And yes, more "column stride" is great, if you followed my comments on LUCENE-1278, that would mean we could easily "inline" very short postings into term dict (here I expect huge performance benefit, as skip()  on another large file is going to be saved independent from omitTf(true)), without increase in size (or minimal) of tii (no locality penalty) If we follow Zipfian distribution, there is *a lot* of terms with postings shorter than e.g. 16 ... 

Thanks again for your support, without you this patch would be just another nice idea :)








> Make it posible not to include TF information in index
> ------------------------------------------------------
>
>                 Key: LUCENE-1340
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1340
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: Index
>            Reporter: Eks Dev
>            Priority: Minor
>         Attachments: LUCENE-1340.patch, LUCENE-1340.patch, LUCENE-1340.patch, LUCENE-1340.patch, LUCENE-1340.patch
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> Term Frequency is typically not needed  for all fields, some CPU (reading one VInt less and one X>>>1...) and IO can be spared by making pure boolen fields possible in Lucene. This topic has already been discussed and accepted as a part of Flexible Indexing... This issue tries to push things a bit faster forward as I have some concrete customer demands.
> benefits can be expected for fields that are typical candidates for Filters, enumerations, user rights, IDs or very short "texts", phone  numbers, zip codes, names...
> Status: just passed standard test (compatibility), commited for early review, I have not tried new feature, missing some asserts and one two unit tests
> Complexity: simpler than expected
> can be used via omitTf() (who used omitNorms() will know where to find it :)  

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Updated: (LUCENE-1340) Make it posible not to include TF information in index

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/LUCENE-1340?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless updated LUCENE-1340:
---------------------------------------

    Attachment: LUCENE-1340.patch

OK good progress eks!

I started from your latest patch and made some further changes:

  * Fixed DW to not consume RAM writing prx if omitTf==true

  * Fixed FreqProxTermsWriter to not create *.prx file if all fields
    omit term freq.  I added hasProx to SegmentInfo, and changed the
    index file format to store this new boolean.

  * Fixed FreqProxTermsWriterPerField to not write prox into the RAM
    buffer if we will omitTf on flushing the segment to disk.  This
    makes the RAM buffer efficient (no bytes wasted on prox when
    omitTf==true for a field).

  * Added more test cases to TestOmitTf

  * Small whitespace, comment changes

The one place I know of that will still waste bytes is the term dict
(TermInfo): it stores a long proxPointer on disk (in *.tii,*.tis) and
also in memory because we load *.tii into RAM.  For fields with
omitTf==true this will always be unused, and we could save alot of
disk/RAM if we didn't waste it.

Unfortunately, I think it's too big a change to try to fix this now; I
think we should wait until flex indexing is online.  I wonder how we
can solve it at that point: maybe should we change TermInfo to be
"column stride", meaning, there are separate arrays storing the values
for all terms (ie long[] proxPointers, long[] freqPointers, etc.).
This would also fit the "pluggable" model better, meaning any plugin
can store new stuff (its own arrays) per-term.

> Make it posible not to include TF information in index
> ------------------------------------------------------
>
>                 Key: LUCENE-1340
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1340
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: Index
>            Reporter: Eks Dev
>            Priority: Minor
>         Attachments: LUCENE-1340.patch, LUCENE-1340.patch, LUCENE-1340.patch, LUCENE-1340.patch, LUCENE-1340.patch
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> Term Frequency is typically not needed  for all fields, some CPU (reading one VInt less and one X>>>1...) and IO can be spared by making pure boolen fields possible in Lucene. This topic has already been discussed and accepted as a part of Flexible Indexing... This issue tries to push things a bit faster forward as I have some concrete customer demands.
> benefits can be expected for fields that are typical candidates for Filters, enumerations, user rights, IDs or very short "texts", phone  numbers, zip codes, names...
> Status: just passed standard test (compatibility), commited for early review, I have not tried new feature, missing some asserts and one two unit tests
> Complexity: simpler than expected
> can be used via omitTf() (who used omitNorms() will know where to find it :)  

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Commented: (LUCENE-1340) Make it posible not to include TF information in index

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-1340?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12619963#action_12619963 ] 

Michael McCandless commented on LUCENE-1340:
--------------------------------------------

LUCENE-1349 is in; I plan to commit this shortly...

> Make it posible not to include TF information in index
> ------------------------------------------------------
>
>                 Key: LUCENE-1340
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1340
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: Index
>            Reporter: Eks Dev
>            Assignee: Michael McCandless
>            Priority: Minor
>         Attachments: LUCENE-1340.patch, LUCENE-1340.patch, LUCENE-1340.patch, LUCENE-1340.patch, LUCENE-1340.patch, LUCENE-1340.patch, LUCENE-1340.patch
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> Term Frequency is typically not needed  for all fields, some CPU (reading one VInt less and one X>>>1...) and IO can be spared by making pure boolen fields possible in Lucene. This topic has already been discussed and accepted as a part of Flexible Indexing... This issue tries to push things a bit faster forward as I have some concrete customer demands.
> benefits can be expected for fields that are typical candidates for Filters, enumerations, user rights, IDs or very short "texts", phone  numbers, zip codes, names...
> Status: just passed standard test (compatibility), commited for early review, I have not tried new feature, missing some asserts and one two unit tests
> Complexity: simpler than expected
> can be used via omitTf() (who used omitNorms() will know where to find it :)  

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Commented: (LUCENE-1340) Make it posible not to include TF information in index

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-1340?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12617143#action_12617143 ] 

Michael McCandless commented on LUCENE-1340:
--------------------------------------------

OK that sounds like a healthy test.

bq. Mike, thanks again for the great work! 

Thank you for the sudden burst of effort to make this happen!

So I think this is ready to commit.  I'll wait a few days and then commit...

> Make it posible not to include TF information in index
> ------------------------------------------------------
>
>                 Key: LUCENE-1340
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1340
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: Index
>            Reporter: Eks Dev
>            Priority: Minor
>         Attachments: LUCENE-1340.patch, LUCENE-1340.patch, LUCENE-1340.patch, LUCENE-1340.patch, LUCENE-1340.patch, LUCENE-1340.patch, LUCENE-1340.patch
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> Term Frequency is typically not needed  for all fields, some CPU (reading one VInt less and one X>>>1...) and IO can be spared by making pure boolen fields possible in Lucene. This topic has already been discussed and accepted as a part of Flexible Indexing... This issue tries to push things a bit faster forward as I have some concrete customer demands.
> benefits can be expected for fields that are typical candidates for Filters, enumerations, user rights, IDs or very short "texts", phone  numbers, zip codes, names...
> Status: just passed standard test (compatibility), commited for early review, I have not tried new feature, missing some asserts and one two unit tests
> Complexity: simpler than expected
> can be used via omitTf() (who used omitNorms() will know where to find it :)  

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Commented: (LUCENE-1340) Make it posible not to include TF information in index

Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-1340?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12618001#action_12618001 ] 

Doug Cutting commented on LUCENE-1340:
--------------------------------------

> I firmly believe there is a way to do interfaces and abstracts in a proper way in Java. 

Personally, I've given up on interfaces for stuff with more than one method with at most one parameter.  Ditch the interface and move on.

> Make it posible not to include TF information in index
> ------------------------------------------------------
>
>                 Key: LUCENE-1340
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1340
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: Index
>            Reporter: Eks Dev
>            Priority: Minor
>         Attachments: LUCENE-1340.patch, LUCENE-1340.patch, LUCENE-1340.patch, LUCENE-1340.patch, LUCENE-1340.patch, LUCENE-1340.patch, LUCENE-1340.patch
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> Term Frequency is typically not needed  for all fields, some CPU (reading one VInt less and one X>>>1...) and IO can be spared by making pure boolen fields possible in Lucene. This topic has already been discussed and accepted as a part of Flexible Indexing... This issue tries to push things a bit faster forward as I have some concrete customer demands.
> benefits can be expected for fields that are typical candidates for Filters, enumerations, user rights, IDs or very short "texts", phone  numbers, zip codes, names...
> Status: just passed standard test (compatibility), commited for early review, I have not tried new feature, missing some asserts and one two unit tests
> Complexity: simpler than expected
> can be used via omitTf() (who used omitNorms() will know where to find it :)  

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Commented: (LUCENE-1340) Make it posible not to include TF information in index

Posted by "Grant Ingersoll (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-1340?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12617954#action_12617954 ] 

Grant Ingersoll commented on LUCENE-1340:
-----------------------------------------

I note a change to Fieldable...  sigh...  Back compatibility fails.  Ugh.

Me thinks we should either rework Fieldable as we've previously discussed, or we mark it as being one of the very few classes in LUcene that is subject to change between releases.



> Make it posible not to include TF information in index
> ------------------------------------------------------
>
>                 Key: LUCENE-1340
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1340
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: Index
>            Reporter: Eks Dev
>            Priority: Minor
>         Attachments: LUCENE-1340.patch, LUCENE-1340.patch, LUCENE-1340.patch, LUCENE-1340.patch, LUCENE-1340.patch, LUCENE-1340.patch, LUCENE-1340.patch
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> Term Frequency is typically not needed  for all fields, some CPU (reading one VInt less and one X>>>1...) and IO can be spared by making pure boolen fields possible in Lucene. This topic has already been discussed and accepted as a part of Flexible Indexing... This issue tries to push things a bit faster forward as I have some concrete customer demands.
> benefits can be expected for fields that are typical candidates for Filters, enumerations, user rights, IDs or very short "texts", phone  numbers, zip codes, names...
> Status: just passed standard test (compatibility), commited for early review, I have not tried new feature, missing some asserts and one two unit tests
> Complexity: simpler than expected
> can be used via omitTf() (who used omitNorms() will know where to find it :)  

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Updated: (LUCENE-1340) Make it posible not to include TF information in index

Posted by "Eks Dev (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/LUCENE-1340?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Eks Dev updated LUCENE-1340:
----------------------------

    Attachment: LUCENE-1340.patch

- fixed stupid bug in SegmentTermDocs (was doc = docCode; instead of doc += docCode;)
- TestOmitTf extended a bit 


> Make it posible not to include TF information in index
> ------------------------------------------------------
>
>                 Key: LUCENE-1340
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1340
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: Index
>            Reporter: Eks Dev
>            Priority: Minor
>         Attachments: LUCENE-1340.patch, LUCENE-1340.patch, LUCENE-1340.patch, LUCENE-1340.patch
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> Term Frequency is typically not needed  for all fields, some CPU (reading one VInt less and one X>>>1...) and IO can be spared by making pure boolen fields possible in Lucene. This topic has already been discussed and accepted as a part of Flexible Indexing... This issue tries to push things a bit faster forward as I have some concrete customer demands.
> benefits can be expected for fields that are typical candidates for Filters, enumerations, user rights, IDs or very short "texts", phone  numbers, zip codes, names...
> Status: just passed standard test (compatibility), commited for early review, I have not tried new feature, missing some asserts and one two unit tests
> Complexity: simpler than expected
> can be used via omitTf() (who used omitNorms() will know where to find it :)  

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Updated: (LUCENE-1340) Make it posible not to include TF information in index

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/LUCENE-1340?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless updated LUCENE-1340:
---------------------------------------

    Attachment: LUCENE-1340.patch

Attached patch that also includes fixes to fileformat.{xml,html,pdf}.

> Make it posible not to include TF information in index
> ------------------------------------------------------
>
>                 Key: LUCENE-1340
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1340
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: Index
>            Reporter: Eks Dev
>            Priority: Minor
>         Attachments: LUCENE-1340.patch, LUCENE-1340.patch, LUCENE-1340.patch, LUCENE-1340.patch, LUCENE-1340.patch, LUCENE-1340.patch, LUCENE-1340.patch
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> Term Frequency is typically not needed  for all fields, some CPU (reading one VInt less and one X>>>1...) and IO can be spared by making pure boolen fields possible in Lucene. This topic has already been discussed and accepted as a part of Flexible Indexing... This issue tries to push things a bit faster forward as I have some concrete customer demands.
> benefits can be expected for fields that are typical candidates for Filters, enumerations, user rights, IDs or very short "texts", phone  numbers, zip codes, names...
> Status: just passed standard test (compatibility), commited for early review, I have not tried new feature, missing some asserts and one two unit tests
> Complexity: simpler than expected
> can be used via omitTf() (who used omitNorms() will know where to find it :)  

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Updated: (LUCENE-1340) Make it posible not to include TF information in index

Posted by "Eks Dev (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/LUCENE-1340?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Eks Dev updated LUCENE-1340:
----------------------------

    Attachment: LUCENE-1340.patch

Thanks Mike, with just a little bit more hand-holding we are going to be there :)
 
I *think* I have *.prx IO excluded in case omitTf==true, please have a look, this part is really not an easy one (*Merger).

Also, now if a single field has mixed true/false for omitTf, I set it to true.

One unit test is already there, basic use case works, but the test has to cover a bit more



> Make it posible not to include TF information in index
> ------------------------------------------------------
>
>                 Key: LUCENE-1340
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1340
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: Index
>            Reporter: Eks Dev
>            Priority: Minor
>         Attachments: LUCENE-1340.patch, LUCENE-1340.patch, LUCENE-1340.patch
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> Term Frequency is typically not needed  for all fields, some CPU (reading one VInt less and one X>>>1...) and IO can be spared by making pure boolen fields possible in Lucene. This topic has already been discussed and accepted as a part of Flexible Indexing... This issue tries to push things a bit faster forward as I have some concrete customer demands.
> benefits can be expected for fields that are typical candidates for Filters, enumerations, user rights, IDs or very short "texts", phone  numbers, zip codes, names...
> Status: just passed standard test (compatibility), commited for early review, I have not tried new feature, missing some asserts and one two unit tests
> Complexity: simpler than expected
> can be used via omitTf() (who used omitNorms() will know where to find it :)  

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Updated: (LUCENE-1340) Make it posible not to include TF information in index

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/LUCENE-1340?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless updated LUCENE-1340:
---------------------------------------

    Attachment: LUCENE-1340.patch

Thanks eks, that was fast -- I think you set a new record!

The patch looks good, though we definitely need some solid unit tests
here.  I made some small (whitespace, spelling, naming) corrections &
attached a new rev of the patch.

One question I have: right now if a single field has mixed true/false
for omitTf, you set it to false, meaning we start storing the term
freq, pos, payloads again.  Can/should we do the reverse instead?  If
we did, we could make some further optimizations, eg right now we
consume RAM storing all positions/payloads on a field that has omitTF=true
on the possibility that we may stll see omitTf=false in the same session.

With this patch we still store the *.prx bytes for a field with
omitTf=true.  Can you fix that?  I think in FreqProxTermsWriter you
can simply not write any bytes to the proxOut; likewise in
SegmentMerger and SegmentTermPositions, don't try to read bytes from
the prx file if omitTf==true.

I'd also be curious about what gains in index size & filter
performance we see with these new boolean fields.


> Make it posible not to include TF information in index
> ------------------------------------------------------
>
>                 Key: LUCENE-1340
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1340
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: Index
>            Reporter: Eks Dev
>            Priority: Minor
>         Attachments: LUCENE-1340.patch, LUCENE-1340.patch
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> Term Frequency is typically not needed  for all fields, some CPU (reading one VInt less and one X>>>1...) and IO can be spared by making pure boolen fields possible in Lucene. This topic has already been discussed and accepted as a part of Flexible Indexing... This issue tries to push things a bit faster forward as I have some concrete customer demands.
> benefits can be expected for fields that are typical candidates for Filters, enumerations, user rights, IDs or very short "texts", phone  numbers, zip codes, names...
> Status: just passed standard test (compatibility), commited for early review, I have not tried new feature, missing some asserts and one two unit tests
> Complexity: simpler than expected
> can be used via omitTf() (who used omitNorms() will know where to find it :)  

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Commented: (LUCENE-1340) Make it posible not to include TF information in index

Posted by "Eks Dev (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-1340?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12618069#action_12618069 ] 

Eks Dev commented on LUCENE-1340:
---------------------------------

that sound like consensus :) Great!

in that case LUCENE-1219 can be reworked slightly to avoid instanceoff (less code). Also it opens a way to pass reference to byte[] for retrieving stored fields out of lucene and communicating length back to caller (now we new byte[] every time we fetch stored field) 

bq. it's one of my biggest regrets in Lucene (yes, I am responsible for it), yet I firmly believe there is a way to do interfaces and abstracts in a proper way in Java. 

no need to regret Grant, if you do nothing you make no mistakes... Interfaces are ok, as long as you can tell what they are going to be doing in next 5 years... this forces you to design "for the future"... something we cannot afford in so popular and complex libraries like lucene at places like Field. Abstract* is equally good design-abstraction...  

Proposal:
We could live with a statement "Fieldable changes are allowed from now, it is deprecated and will be  probably removed in 3.0" , it causes just a tiny bit of work in case someone is really implementing it (adding new methods to Fieldable like omitTf() costs you max 5 minutes work to change your implementing class to implement it!).

from 3.0 on, I could very well live without it, until then, we cause 5 minutes work for people that implement Fieldable on their own and want to stay up to date with the trunk.  It is fair  deal for everyone and lucene moves forward... 




 


  

> Make it posible not to include TF information in index
> ------------------------------------------------------
>
>                 Key: LUCENE-1340
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1340
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: Index
>            Reporter: Eks Dev
>            Priority: Minor
>         Attachments: LUCENE-1340.patch, LUCENE-1340.patch, LUCENE-1340.patch, LUCENE-1340.patch, LUCENE-1340.patch, LUCENE-1340.patch, LUCENE-1340.patch
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> Term Frequency is typically not needed  for all fields, some CPU (reading one VInt less and one X>>>1...) and IO can be spared by making pure boolen fields possible in Lucene. This topic has already been discussed and accepted as a part of Flexible Indexing... This issue tries to push things a bit faster forward as I have some concrete customer demands.
> benefits can be expected for fields that are typical candidates for Filters, enumerations, user rights, IDs or very short "texts", phone  numbers, zip codes, names...
> Status: just passed standard test (compatibility), commited for early review, I have not tried new feature, missing some asserts and one two unit tests
> Complexity: simpler than expected
> can be used via omitTf() (who used omitNorms() will know where to find it :)  

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org