
Posted to dev@lucene.apache.org by "peter chang (Created) (JIRA)" <ji...@apache.org> on 2011/12/11 16:24:40 UTC

[jira] [Created] (LUCENE-3638) IndexReader.document always return a doc with all the stored fields loaded. And this can be slow for the indexed document contain huge fields

IndexReader.document always return a doc with all the stored fields loaded. And this can be slow for the indexed document contain huge fields
---------------------------------------------------------------------------------------------------------------------------------------------

                 Key: LUCENE-3638
                 URL: https://issues.apache.org/jira/browse/LUCENE-3638
             Project: Lucene - Java
          Issue Type: Improvement
          Components: core/index, core/search
    Affects Versions: 4.0
         Environment: 64bit linux java 1.6
            Reporter: peter chang
            Priority: Minor
             Fix For: 4.0


When generating a digest for documents with huge fields, it should be unnecessary to load the whole field; only the interesting part of the field, located via the offset information, is needed. But IndexReader always returns the whole field content. Afterward, the customized StoredFieldsReader ends up loading the field a second time.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

[jira] [Commented] (LUCENE-3638) IndexReader.document always return a doc with all the stored fields loaded. And this can be slow for the indexed document contain huge fields

Posted by "Shai Erera (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-3638?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13167157#comment-13167157 ] 

Shai Erera commented on LUCENE-3638:
------------------------------------

Ok. One last comment (b/c I really don't mind if it's added or not) -- I meant that if we put a Set method on IR, users might needlessly create a new Set on every document() call, b/c it's there and it's convenient. Maybe the javadocs can warn people against doing this ...
                
> IndexReader.document always return a doc with all the stored fields loaded. And this can be slow for the indexed document contain huge fields
> ---------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: LUCENE-3638
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3638
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: core/index, core/search
>    Affects Versions: 4.0
>         Environment: 64bit linux java 1.6
>            Reporter: peter chang
>            Priority: Minor
>              Labels: patch
>             Fix For: 4.0
>
>         Attachments: doc.fields.patch
>
>
> When generating a digest for documents with huge fields, it should be unnecessary to load the whole field; only the interesting part of the field, located via the offset information, is needed. But IndexReader always returns the whole field content. Afterward, the customized StoredFieldsReader ends up loading the field a second time.


[jira] [Issue Comment Edited] (LUCENE-3638) IndexReader.document always return a doc with all the stored fields loaded. And this can be slow for the indexed document contain huge fields

Posted by "Uwe Schindler (Issue Comment Edited) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-3638?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13167131#comment-13167131 ] 

Uwe Schindler edited comment on LUCENE-3638 at 12/11/11 4:36 PM:
-----------------------------------------------------------------

bq. 3. I do not think I can store only the interesting part, because I do not know which part is interesting at index time. For example, the digest shown in the search results is generated according to the user's query.

Digest is the wrong word; it confused a lot of people here. The use case you are talking about is "highlighting". I agree that for very large fields this is expensive.

In fact, your patch does not handle this case, and I agree with the others that it's too heavy to implement and adds back the crazy complexity we had with lazy fields & co.
                
                  

[jira] [Commented] (LUCENE-3638) IndexReader.document always return a doc with all the stored fields loaded. And this can be slow for the indexed document contain huge fields

Posted by "Robert Muir (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-3638?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13167149#comment-13167149 ] 

Robert Muir commented on LUCENE-3638:
-------------------------------------

{quote}
I don't mind much either. It's just that this sugar method suggests that you have to create a Set<String> on every call, while if we point people to DSFV, people will find that they can pass String... too.
{quote}

True, but that's just because DSFV creates the HashSet on the fly :)

{quote}
Perhaps if we omit the sugar method, people will think that way, and indeed create the object just once. Dunno, it's your call.
{quote}

That's true too, because if you reuse the DSFV then the String... method is not harmful since you are only doing it once.
So I think the String... method is ok on DSFV for this reason.

However, on IndexReader I think it's also ok to have a sugar method with Set, because it just creates a DSFV around that HashSet,
so it's hardly wasteful.

In other words: create a Set<String> and reuse your own Set via the proposed sugar method, and I think it's fine,
and a lot friendlier. It's not hashing anything. Sure, it's creating a DSFV each time, but like using a DSFV in any way,
it's also creating a Document object each time. If you are really worried about this stuff, implement your own visitor
and don't use Document at all :) Don't forget we are talking about stored fields too!

And I say keep the String... on DSFV only, but don't add it to IR, so we don't encourage lots of wasteful rehashing.
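
The rehashing concern above can be illustrated with a small, self-contained sketch. It does not use Lucene at all: the class `ReuseFieldSetSketch`, the helper `toSet`, and the counter `setsBuilt` are hypothetical names invented for this illustration, and the `contains` call merely stands in for loading a document with a field filter.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

// Hypothetical illustration only; none of these names come from Lucene.
final class ReuseFieldSetSketch {
  // Counts how many times the field names were hashed into a new set.
  static int setsBuilt = 0;

  // Varargs convenience: hashes the names into a fresh HashSet on every call.
  static Set<String> toSet(String... names) {
    setsBuilt++;
    return new HashSet<>(Arrays.asList(names));
  }

  // Wasteful pattern: the set is rebuilt (rehashed) for every document.
  static int perCall(int numDocs) {
    setsBuilt = 0;
    for (int docID = 0; docID < numDocs; docID++) {
      Set<String> fields = toSet("id", "title");
      fields.contains("title"); // stand-in for loading the document
    }
    return setsBuilt;
  }

  // Preferred pattern: build the set once and pass the same instance each time.
  static int reused(int numDocs) {
    setsBuilt = 0;
    Set<String> fields = Collections.unmodifiableSet(toSet("id", "title"));
    for (int docID = 0; docID < numDocs; docID++) {
      fields.contains("title"); // stand-in for loading the document
    }
    return setsBuilt;
  }

  public static void main(String[] args) {
    System.out.println(perCall(1000)); // 1000
    System.out.println(reused(1000));  // 1
  }
}
```

With a Set-taking method the caller controls when hashing happens, whereas a varargs method on the reader would presumably hash on every call.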

                

[jira] [Commented] (LUCENE-3638) IndexReader.document always return a doc with all the stored fields loaded. And this can be slow for the indexed document contain huge fields

Posted by "Uwe Schindler (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-3638?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13167121#comment-13167121 ] 

Uwe Schindler commented on LUCENE-3638:
---------------------------------------

I think the issue is more about something else:

bq. When generating a digest for documents with huge fields, it should be unnecessary to load the whole field; only the interesting part of the field, located via the offset information, is needed. But IndexReader always returns the whole field content. Afterward, the customized StoredFieldsReader ends up loading the field a second time.

He wants to specify an offset/length into a very large field and only retrieve the subslice.
                

[jira] [Commented] (LUCENE-3638) IndexReader.document always return a doc with all the stored fields loaded. And this can be slow for the indexed document contain huge fields

Posted by "Uwe Schindler (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-3638?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13167131#comment-13167131 ] 

Uwe Schindler commented on LUCENE-3638:
---------------------------------------

bq. 3. I do not think I can store only the interesting part, because I do not know which part is interesting at index time. For example, the digest shown in the search results is generated according to the user's query.

Digest is the wrong word; it confused a lot of people here. The use case you are talking about is "highlighting". I agree that for very large fields this is expensive.

In fact, your patch does not handle this case, and I agree it's too heavy to implement and adds back the crazy complexity we had with lazy fields & co.
                

[jira] [Commented] (LUCENE-3638) IndexReader.document always return a doc with all the stored fields loaded. And this can be slow for the indexed document contain huge fields

Posted by "Robert Muir (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-3638?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13167123#comment-13167123 ] 

Robert Muir commented on LUCENE-3638:
-------------------------------------

{quote}
He wants to specify an offset/length into a very large field and only retrieve the subslice.
{quote}

I really think the solution here is just to put the interesting part in its own field...

Otherwise we have to add lots of complexity to the codec APIs to support this (for instance, what units are the offsets/lengths in? bytes? UTF-16?)
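
To make the units question concrete, here is a minimal sketch using only the Java standard library. It shows that the same position in a string has different numeric offsets depending on whether it is counted in UTF-16 code units (Java `char`s) or in UTF-8 bytes. The class and method names are hypothetical and unrelated to any codec API.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical illustration only; not part of any Lucene API.
final class OffsetUnitsSketch {
  // Offset of the first occurrence of needle, counted in UTF-16 code units.
  static int charOffset(String text, String needle) {
    return text.indexOf(needle);
  }

  // The same position, counted in UTF-8 bytes.
  static int utf8ByteOffset(String text, String needle) {
    int i = text.indexOf(needle);
    if (i < 0) {
      return -1;
    }
    return text.substring(0, i).getBytes(StandardCharsets.UTF_8).length;
  }

  public static void main(String[] args) {
    // "naive cafe search" with i-diaeresis and e-acute, written as escapes
    // so the source file encoding does not matter.
    String text = "na\u00efve caf\u00e9 search";
    System.out.println(charOffset(text, "search"));     // 11
    System.out.println(utf8ByteOffset(text, "search")); // 13
  }
}
```

Each of the two accented characters takes one `char` but two UTF-8 bytes, so the two offsets differ by two. A slice API would have to pick one unit and every codec would have to honor it.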

                

[jira] [Commented] (LUCENE-3638) IndexReader.document always return a doc with all the stored fields loaded. And this can be slow for the indexed document contain huge fields

Posted by "Uwe Schindler (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-3638?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13167122#comment-13167122 ] 

Uwe Schindler commented on LUCENE-3638:
---------------------------------------

bq. So i would recommend we consider adding some sugar to indexreader:

That's unrelated, BUT: yes please! And as you already noted in the signature, one thing is very important: this method must be FINAL in IR!
                

[jira] [Resolved] (LUCENE-3638) IndexReader.document always return a doc with all the stored fields loaded. And this can be slow for the indexed document contain huge fields

Posted by "Michael McCandless (Resolved) (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/LUCENE-3638?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless resolved LUCENE-3638.
----------------------------------------

    Resolution: Fixed

Thanks Peter!
                

[jira] [Commented] (LUCENE-3638) IndexReader.document always return a doc with all the stored fields loaded. And this can be slow for the indexed document contain huge fields

Posted by "Robert Muir (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-3638?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13167120#comment-13167120 ] 

Robert Muir commented on LUCENE-3638:
-------------------------------------

{quote}
However, I do see the convenience of specifying just 1-2 fields that you don't want to load, rather than 20 that you do. So how about you create a new StoredFieldVisitor, which takes the list of fields 'not to load'? It can extend DocumentStoredFieldVisitor by overriding needsField?
{quote}

DocumentStoredFieldVisitor already supports this in its ctors.

So I would recommend we consider adding some sugar to IndexReader:

{noformat}
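public final Document document(int docID, Set<String> fields) throws IOException {
  // document(int, StoredFieldVisitor) returns void, so keep a handle on the visitor
  DocumentStoredFieldVisitor visitor = new DocumentStoredFieldVisitor(fields);
  document(docID, visitor);
  return visitor.getDocument();
}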
{noformat}

Sure, there are cases where you want more complicated logic, like returning STOP after certain fields; for that, write your own logic.
But this is probably pretty common.
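
The visitor idea behind this proposal can be sketched without Lucene. Everything below is a simplified stand-in: the `Status` enum, the `Visitor` interface, and the `FieldFilterSketch` class are hypothetical and are not the Lucene classes. The early `STOP` is included only as one possible example of the "own logic" mentioned above, and it assumes each field name occurs at most once per document.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

// Simplified stand-in for the visitor idea; these types are hypothetical
// and are NOT the Lucene classes.
final class FieldFilterSketch {

  enum Status { YES, NO, STOP }

  interface Visitor {
    Status needsField(String name);
    void stringField(String name, String value);
  }

  // Collects only the fields named in the given set.
  static final class SetVisitor implements Visitor {
    private final Set<String> wanted;
    final Map<String, String> loaded = new LinkedHashMap<>();

    SetVisitor(Set<String> wanted) {
      this.wanted = wanted;
    }

    public Status needsField(String name) {
      // Custom logic: stop once every requested field has been seen
      // (assumes single-valued fields).
      if (loaded.size() == wanted.size()) {
        return Status.STOP;
      }
      return wanted.contains(name) ? Status.YES : Status.NO;
    }

    public void stringField(String name, String value) {
      loaded.put(name, value);
    }
  }

  // Walks the stored fields in order; a value is handed over only on YES.
  static void visit(Map<String, String> stored, Visitor visitor) {
    for (Map.Entry<String, String> e : stored.entrySet()) {
      Status s = visitor.needsField(e.getKey());
      if (s == Status.STOP) {
        return;
      }
      if (s == Status.YES) {
        visitor.stringField(e.getKey(), e.getValue());
      }
    }
  }

  // The "sugar": wrap the set in a visitor and return what it collected.
  static Map<String, String> document(Map<String, String> stored, Set<String> fields) {
    SetVisitor visitor = new SetVisitor(fields);
    visit(stored, visitor);
    return visitor.loaded;
  }

  public static void main(String[] args) {
    Map<String, String> stored = new LinkedHashMap<>();
    stored.put("id", "42");
    stored.put("body", "a very large field ...");
    stored.put("title", "hello");
    Set<String> fields = new HashSet<>(Arrays.asList("id", "title"));
    System.out.println(document(stored, fields)); // {id=42, title=hello}
  }
}
```

In this model the large `body` value is never handed to the visitor, which is the saving the sugar method is meant to make easy to reach.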
                

[jira] [Commented] (LUCENE-3638) IndexReader.document always return a doc with all the stored fields loaded. And this can be slow for the indexed document contain huge fields

Posted by "Michael McCandless (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-3638?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13167211#comment-13167211 ] 

Michael McCandless commented on LUCENE-3638:
--------------------------------------------

+1 to adding a simple sugar method to IR to load only the specified fields (Set<String>) of the document.

It's just sugar to forward to DSFV.
                

[jira] [Commented] (LUCENE-3638) IndexReader.document always return a doc with all the stored fields loaded. And this can be slow for the indexed document contain huge fields

Posted by "Shai Erera (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-3638?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13167147#comment-13167147 ] 

Shai Erera commented on LUCENE-3638:
------------------------------------

bq. The only question is do we need to provide one that does this?

Oh, I did not propose that we do it in Lucene, but rather that Peter do it himself (I wrote "how about *you* ..."). I agree we should not cater for all use cases out there.

bq. Yes, it's definitely redundant. But I think this is probably very common? Doesn't matter to me either way though.

I don't mind much either. It's just that this sugar method suggests that you have to create a Set<String> on every call, while if we point people to DSFV, people will find that they can pass String... too.

Anyway, I think that for most apps this object is probably constructed just once, because usually the list of fields does not change between queries, or at least you will have a handful of those, one per query type. Perhaps if we omit the sugar method, people will think that way, and indeed create the object just once. Dunno, it's your call.

If someone ends up committing that method in the context of this issue, I suggest that its subject be renamed accordingly. Otherwise, just close it.
                

[jira] [Updated] (LUCENE-3638) IndexReader.document always return a doc with all the stored fields loaded. And this can be slow for the indexed document contain huge fields

Posted by "Michael McCandless (Updated) (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/LUCENE-3638?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless updated LUCENE-3638:
---------------------------------------

    Attachment: LUCENE-3638.patch

I was thinking just simple sugar, like the attached patch...
                
> IndexReader.document always return a doc with all the stored fields loaded. And this can be slow for the indexed document contain huge fields
> ---------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: LUCENE-3638
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3638
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: core/index, core/search
>    Affects Versions: 4.0
>         Environment: 64bit linux java 1.6
>            Reporter: peter chang
>            Priority: Minor
>              Labels: patch
>             Fix For: 4.0
>
>         Attachments: LUCENE-3638.patch, doc.fields.patch
>
>
> when generating digest for some documents with huge fields, it should be unnecessary to load the field but just interesting part of the field with the offset information. but indexreader always return the whole field content. afterward, the customized storedfieldsreader will got a repeated loading


[jira] [Commented] (LUCENE-3638) IndexReader.document always return a doc with all the stored fields loaded. And this can be slow for the indexed document contain huge fields

Posted by "peter chang (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-3638?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13167129#comment-13167129 ] 

peter chang commented on LUCENE-3638:
-------------------------------------

3. I do not think I can store only the interesting part, because I do not know which part is interesting at index time. For example, the digest shown in the search results is generated from the user's query.
                


[jira] [Commented] (LUCENE-3638) IndexReader.document always return a doc with all the stored fields loaded. And this can be slow for the indexed document contain huge fields

Posted by "Robert Muir (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-3638?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13167137#comment-13167137 ] 

Robert Muir commented on LUCENE-3638:
-------------------------------------

{quote}
Where? What am I missing? DSFV only takes a list of fieldsToAdd, not fieldsToFilter. If you have 20 fields in your index, and you want to load all but 2 fields, it may be more convenient to specify these two, and I proposed that it can be done in a DSFV extension.
{quote}

Well, you certainly *can* do this in a DSFV extension. The only question is do we need to provide one that does this? I think in general
each app will be different and having this visitor interface is "enough" rather than us supplying tons of concrete implementations for various
use cases.

{quote}
I think this method is redundant, because one can easily call new DSFV(fields) and use the SFV document version.
{quote}

Yes, it's definitely redundant. But I think this is probably very common? It doesn't matter to me either way, though.



                


[jira] [Commented] (LUCENE-3638) IndexReader.document always return a doc with all the stored fields loaded. And this can be slow for the indexed document contain huge fields

Posted by "peter chang (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-3638?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13167132#comment-13167132 ] 

peter chang commented on LUCENE-3638:
-------------------------------------

Yes, I mean highlighting or something else generated dynamically at search time. Thanks to Uwe for the reminder.
                


[jira] [Commented] (LUCENE-3638) IndexReader.document always return a doc with all the stored fields loaded. And this can be slow for the indexed document contain huge fields

Posted by "peter chang (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-3638?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13167126#comment-13167126 ] 

peter chang commented on LUCENE-3638:
-------------------------------------

1. I agree with Robert: something like fieldsToAdd and fieldsToFilter can be added to IR.document and IS.doc.
2. Yes, the offset information is specific to the use case. It can be handled at the application level when processing multi-byte encoded languages such as Zh_CN. In that situation, the offset is only an estimate.
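
One possible reading of point 2 is that a byte offset into a multi-byte encoded field may land in the middle of a character, so the application has to adjust it before decoding. The sketch below illustrates that idea for UTF-8 only. The class name Utf8SliceSketch and the method slice are hypothetical names for this illustration; nothing here is part of Lucene or of the attached patch.

```java
import java.nio.charset.StandardCharsets;

public class Utf8SliceSketch {

  /** True for UTF-8 continuation bytes, which have the bit pattern 10xxxxxx. */
  static boolean isContinuation(byte b) {
    return (b & 0xC0) == 0x80;
  }

  /**
   * Decodes roughly utf8[offset, offset + length), treating both bounds as
   * estimates: the start is moved back and the end is moved forward until
   * each one sits on a character boundary.
   */
  static String slice(byte[] utf8, int offset, int length) {
    int start = Math.max(0, Math.min(offset, utf8.length));
    int end = Math.max(start, Math.min(offset + length, utf8.length));
    while (start > 0 && start < utf8.length && isContinuation(utf8[start])) {
      start--; // step back to the lead byte of the character we landed in
    }
    while (end < utf8.length && isContinuation(utf8[end])) {
      end++; // step forward past the rest of a partially included character
    }
    return new String(utf8, start, end - start, StandardCharsets.UTF_8);
  }

  public static void main(String[] args) {
    // Four CJK characters (3 bytes each in UTF-8) followed by ASCII text.
    byte[] stored = "\u641C\u7D22\u5F15\u64CELucene".getBytes(StandardCharsets.UTF_8);
    // Offset 4 falls inside the second character; the slice is widened to whole characters.
    System.out.println(slice(stored, 4, 4));
  }
}
```

Widening the slice rather than shrinking it is a design choice made for this sketch; an application could equally drop the partial characters at either end.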
                


[jira] [Commented] (LUCENE-3638) IndexReader.document always return a doc with all the stored fields loaded. And this can be slow for the indexed document contain huge fields

Posted by "Shai Erera (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-3638?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13167125#comment-13167125 ] 

Shai Erera commented on LUCENE-3638:
------------------------------------

bq. He wants to specify an offset/length into a very large field and only retrieve the subslice.

That's indeed what the issue's description says, but there's no evidence of it in the patch. And I agree with Robert that in that case, one should store the interesting part in a special field.

bq. DocumentStoredFieldsVisitor already supports this in its ctors.

Where? What am I missing? DSFV only takes a list of fieldsToAdd, not fieldsToFilter. If you have 20 fields in your index, and you want to load all but 2 fields, it may be more convenient to specify these two, and I proposed that it can be done in a DSFV extension.

bq. So i would recommend we consider adding some sugar to indexreader:

I think this method is redundant, because one can easily call new DSFV(fields) and use the SFV document version. And why do we favor Set<String> over String...? :)
                


[jira] [Updated] (LUCENE-3638) IndexReader.document always return a doc with all the stored fields loaded. And this can be slow for the indexed document contain huge fields

Posted by "peter chang (Updated) (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/LUCENE-3638?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

peter chang updated LUCENE-3638:
--------------------------------

    Attachment: doc.fields.patch
    


[jira] [Commented] (LUCENE-3638) IndexReader.document always return a doc with all the stored fields loaded. And this can be slow for the indexed document contain huge fields

Posted by "Shai Erera (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-3638?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13167117#comment-13167117 ] 

Shai Erera commented on LUCENE-3638:
------------------------------------

IndexReader and IndexSearcher already offer a doc/document method which takes a StoredFieldVisitor, so why add another version to them?

Also, I don't think that DocumentStoredFieldVisitor should change. I find it very intuitive that I need to specify the fields that I want to load, rather than the fields that I don't want to load. E.g., in my apps, there are many fields that are stored, but not loaded for results display.

However, I do see the convenience of specifying just 1-2 fields that you don't want to load, rather than 20 that you do. So how about you create a new StoredFieldVisitor, which takes the list of fields 'not to load'? It can extend DocumentStoredFieldVisitor by overriding needsField?
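
A minimal sketch of the "fields not to load" idea above, written against a simplified stand-in for the visitor API rather than the real Lucene classes. The FieldVisitor interface, both visitor classes, and the visit helper are hypothetical and exist only for this illustration; the actual StoredFieldVisitor signatures may differ.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

public class ExcludingVisitorSketch {

  /** Hypothetical stand-in for a stored-fields visitor; NOT the Lucene interface. */
  interface FieldVisitor {
    boolean needsField(String fieldName);
    void stringField(String fieldName, String value);
  }

  /** Collects only the fields named in fieldsToAdd (every field if it is null). */
  static class IncludingVisitor implements FieldVisitor {
    private final Set<String> fieldsToAdd;
    final Map<String, String> doc = new LinkedHashMap<String, String>();

    IncludingVisitor(Set<String> fieldsToAdd) {
      this.fieldsToAdd = fieldsToAdd;
    }

    public boolean needsField(String fieldName) {
      return fieldsToAdd == null || fieldsToAdd.contains(fieldName);
    }

    public void stringField(String fieldName, String value) {
      doc.put(fieldName, value);
    }
  }

  /** Extension that inverts the test: load everything except fieldsToFilter. */
  static class ExcludingVisitor extends IncludingVisitor {
    private final Set<String> fieldsToFilter;

    ExcludingVisitor(Set<String> fieldsToFilter) {
      super(null);
      this.fieldsToFilter = fieldsToFilter;
    }

    @Override
    public boolean needsField(String fieldName) {
      return !fieldsToFilter.contains(fieldName);
    }
  }

  /** Toy reader loop: offers each stored field to the visitor and skips unwanted ones. */
  static void visit(Map<String, String> stored, FieldVisitor visitor) {
    for (Map.Entry<String, String> e : stored.entrySet()) {
      if (visitor.needsField(e.getKey())) {
        visitor.stringField(e.getKey(), e.getValue());
      }
    }
  }

  public static void main(String[] args) {
    Map<String, String> stored = new LinkedHashMap<String, String>();
    stored.put("id", "42");
    stored.put("title", "Lucene");
    stored.put("body", "a very large stored field");

    ExcludingVisitor visitor =
        new ExcludingVisitor(new HashSet<String>(Arrays.asList("body")));
    visit(stored, visitor);
    System.out.println(visitor.doc.keySet()); // prints [id, title]
  }
}
```

Only needsField is overridden, so the collecting logic stays in the base class and the existing "fields to load" behavior is left untouched.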
                


[jira] [Commented] (LUCENE-3638) IndexReader.document always return a doc with all the stored fields loaded. And this can be slow for the indexed document contain huge fields

Posted by "peter chang (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-3638?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13167334#comment-13167334 ] 

peter chang commented on LUCENE-3638:
-------------------------------------

I uploaded this patch just for convenience:
{code:title=IndexSearcher.java|borderStyle=solid}
  /**
   * Sugar for <code>.getIndexReader().document(docID, fieldsToAdd, fieldsToFilter)</code>.
   * See {@link IndexReader#document(int, Set, Set)} for details.
   */
  public Document doc(int docID, Set<String> fieldsToAdd, Set<String> fieldsToFilter) throws CorruptIndexException, IOException {
    return reader.document(docID, fieldsToAdd, fieldsToFilter);
  }
{code}
Here you can see that IS also gives access to document fetching. So IS would look powerless if IR could not expose such a method or interface externally.

                


[jira] [Commented] (LUCENE-3638) IndexReader.document always return a doc with all the stored fields loaded. And this can be slow for the indexed document contain huge fields

Posted by "Uwe Schindler (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-3638?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13167130#comment-13167130 ] 

Uwe Schindler commented on LUCENE-3638:
---------------------------------------

bq. I think this method is redundant, because one can easily call new DSFV(fields) and use the SFV document version. And why do we favor Set<String> over String...? 

Ahm this would favour veeeery slow code. We would need to create a Set<String> on every call :-)

But I agree here with Robert we should add the sugar method (final please and without maxDoc checks, that's up to the abstract impl) for easier use.
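
A rough sketch of the two points above, using simplified stand-in types rather than the real Lucene classes: the sugar method is final and only wraps the visitor call, and the caller builds its Set<String> once and reuses it, instead of a String... overload building a new set on every call. Reader, Visitor, MemoryReader and DISPLAY_FIELDS are hypothetical names for this illustration and are not taken from the attached patch.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class SugarMethodSketch {

  /** Hypothetical visitor callback; a stand-in, not the Lucene StoredFieldVisitor. */
  interface Visitor {
    boolean needsField(String name);
    void stringField(String name, String value);
  }

  /** Hypothetical reader base class, standing in for IndexReader. */
  abstract static class Reader {
    /** The one method implementations must provide; any bounds checks belong here. */
    abstract void document(int docID, Visitor visitor);

    /** Final sugar: delegates to the visitor version and adds no checks of its own. */
    final Map<String, String> document(int docID, final Set<String> fieldsToLoad) {
      final Map<String, String> doc = new LinkedHashMap<String, String>();
      document(docID, new Visitor() {
        public boolean needsField(String name) {
          return fieldsToLoad.contains(name);
        }

        public void stringField(String name, String value) {
          doc.put(name, value);
        }
      });
      return doc;
    }
  }

  /** Toy in-memory implementation holding one map of stored fields per document. */
  static class MemoryReader extends Reader {
    private final List<Map<String, String>> docs;

    MemoryReader(List<Map<String, String>> docs) {
      this.docs = docs;
    }

    @Override
    void document(int docID, Visitor visitor) {
      for (Map.Entry<String, String> e : docs.get(docID).entrySet()) {
        if (visitor.needsField(e.getKey())) {
          visitor.stringField(e.getKey(), e.getValue());
        }
      }
    }
  }

  // Built once and shared by every call, rather than rebuilt per hit.
  static final Set<String> DISPLAY_FIELDS =
      Collections.unmodifiableSet(new HashSet<String>(Arrays.asList("id", "title")));

  public static void main(String[] args) {
    Map<String, String> d = new LinkedHashMap<String, String>();
    d.put("id", "1");
    d.put("title", "hello");
    d.put("body", "a very large stored field");

    Reader reader = new MemoryReader(Collections.singletonList(d));
    System.out.println(reader.document(0, DISPLAY_FIELDS));
  }
}
```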
                
