You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@lucene.apache.org by "Michael McCandless (JIRA)" <ji...@apache.org> on 2011/05/07 13:05:03 UTC

[jira] [Created] (LUCENE-3080) cutover highlighter to BytesRef

cutover highlighter to BytesRef
-------------------------------

                 Key: LUCENE-3080
                 URL: https://issues.apache.org/jira/browse/LUCENE-3080
             Project: Lucene - Java
          Issue Type: Improvement
          Components: contrib/highlighter
            Reporter: Michael McCandless


Highlighter still uses char[] terms (consumes tokens from the analyzer as char[] not as BytesRef), which is causing problems for merging LUCENE-3065 to trunk.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

[jira] [Commented] (LUCENE-3080) cutover highlighter to BytesRef

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-3080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13053254#comment-13053254 ] 

Robert Muir commented on LUCENE-3080:
-------------------------------------

Mike, its an interesting idea, as I think the offsets are intended to be opaque to the app (so you should be able to use byte offsets if you want).

There are some problems though, especially tokenfilters that muck with offsets:
NGramTokenFilter, WordDelimiterFilter, ...

In general there are assumptions here that offsets are utf16.

> cutover highlighter to BytesRef
> -------------------------------
>
>                 Key: LUCENE-3080
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3080
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: modules/highlighter
>            Reporter: Michael McCandless
>
> Highlighter still uses char[] terms (consumes tokens from the analyzer as char[] not as BytesRef), which is causing problems for merging SOLR-2497 to trunk.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

[jira] [Commented] (LUCENE-3080) cutover highlighter to BytesRef

Posted by "Mike Sokolov (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-3080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13053245#comment-13053245 ] 

Mike Sokolov commented on LUCENE-3080:
--------------------------------------

There could be a good reason though for using byte-offsets in highlighting. I have in mind an optimization that would pull in text from an external file or other source, enabling highlighting without stored fields.  For best performance the snippet should be pulled from the external source using random access to storage, but this requires byte offsets.  I think this might be a big win for large field values.

This could only be done if the highlighter doesn't need to perform any text manipulation itself, so it's not really appropriate for Highlighter, as Robert said, but in the case of FVH it might be possible to implement.  I'm looking at this, but wondering before I get too deep in if anyone can comment on the feasibility of using byte offsets - I'm unclear on what they get used for other than highlighting: would it cause problems to have a CharFilter that returns "corrected" offsets such that char positions in the analyzed text are translated into byte positions in the source text? 

> cutover highlighter to BytesRef
> -------------------------------
>
>                 Key: LUCENE-3080
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3080
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: modules/highlighter
>            Reporter: Michael McCandless
>
> Highlighter still uses char[] terms (consumes tokens from the analyzer as char[] not as BytesRef), which is causing problems for merging SOLR-2497 to trunk.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

[jira] [Commented] (LUCENE-3080) cutover highlighter to BytesRef

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-3080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13053319#comment-13053319 ] 

Robert Muir commented on LUCENE-3080:
-------------------------------------

Well, personally i am hesitant to introduce any encodings or bytes into our current analysis chain, because its unnecessary complexity that will introduce bugs (at the moment, its the users responsibility to create the appropriate Reader etc).

Furthermore, not all character sets can be 'corrected' with a linear conversion like this: for example some actually order the text in a different direction, and things like that... there are many quirks to non-unicode character sets.

Maybe as a start, it would be useful to prototype some simple experiments with a "binary analysis chain" and hackup a highlighter to work with them? This way we would have an understanding of what the potential performance gain is.

Here's some example code for a dead simple binary analysis chain that only uses bytes the whole way through, you could take these ideas and prototype one with just all ascii-terms and split on the space byte and such:
http://svn.apache.org/repos/asf/lucene/dev/trunk/lucene/src/test/org/apache/lucene/index/TestBinaryTerms.java
http://svn.apache.org/repos/asf/lucene/dev/trunk/lucene/src/test/org/apache/lucene/index/BinaryTokenStream.java 


> cutover highlighter to BytesRef
> -------------------------------
>
>                 Key: LUCENE-3080
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3080
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: modules/highlighter
>            Reporter: Michael McCandless
>
> Highlighter still uses char[] terms (consumes tokens from the analyzer as char[] not as BytesRef), which is causing problems for merging SOLR-2497 to trunk.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

[jira] [Commented] (LUCENE-3080) cutover highlighter to BytesRef

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-3080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13030340#comment-13030340 ] 

Robert Muir commented on LUCENE-3080:
-------------------------------------

Right, because highlighters are designed for doing a lot of high-level things on text to bring out (hopefully good) snippets, e.g. various segmentation and fragmenting and whatever they do, I think it would be best if they continued to work on char[]. This makes it easier for people to extend the APIs.

I agree with the idea of somehow adding a check or assert to prevent Numerics from being highlighted... I don't think it makes sense to highlight NumericFields.

> cutover highlighter to BytesRef
> -------------------------------
>
>                 Key: LUCENE-3080
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3080
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: contrib/highlighter
>            Reporter: Michael McCandless
>
> Highlighter still uses char[] terms (consumes tokens from the analyzer as char[] not as BytesRef), which is causing problems for merging SOLR-2497 to trunk.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

[jira] [Commented] (LUCENE-3080) cutover highlighter to BytesRef

Posted by "Mike Sokolov (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-3080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13053300#comment-13053300 ] 

Mike Sokolov commented on LUCENE-3080:
--------------------------------------

Yeah I knew that at some point, but stuffed it away as something to think about later :) There really is no way to pass byte streams into the analysis chain.  Maybe providing a character encoding to the filter could enable it to compute the needed byte offsets. 

> cutover highlighter to BytesRef
> -------------------------------
>
>                 Key: LUCENE-3080
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3080
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: modules/highlighter
>            Reporter: Michael McCandless
>
> Highlighter still uses char[] terms (consumes tokens from the analyzer as char[] not as BytesRef), which is causing problems for merging SOLR-2497 to trunk.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

[jira] [Commented] (LUCENE-3080) cutover highlighter to BytesRef

Posted by "Mike Sokolov (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-3080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13053275#comment-13053275 ] 

Mike Sokolov commented on LUCENE-3080:
--------------------------------------

It might be a bit more complicated?  Looks like WordDelimiterFilter, in generatePart and concatenate, eg, performs computation with the offsets.  So it would either need to know the units of the offsets it was passed, or be given more than just a correctOffset() method: rather it seems to require something like addCharsToOffset (offset, charOffsetIncr), where charOffsetIncr is a number of chars, but offset is in some unspecified unit.

> cutover highlighter to BytesRef
> -------------------------------
>
>                 Key: LUCENE-3080
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3080
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: modules/highlighter
>            Reporter: Michael McCandless
>
> Highlighter still uses char[] terms (consumes tokens from the analyzer as char[] not as BytesRef), which is causing problems for merging SOLR-2497 to trunk.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

[jira] [Commented] (LUCENE-3080) cutover highlighter to BytesRef

Posted by "Uwe Schindler (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-3080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13030339#comment-13030339 ] 

Uwe Schindler commented on LUCENE-3080:
---------------------------------------

In the past this was not an issue, because stored numeric fields in Solr were only binary, but now they are consumed by the highlighter component.

If we dont fix highlighter I have no problem. We should add some checks in highlighter to prevent NumericFields from beeing highlighted (e.g. if the TokenStream does not support CTA, it should stop highlighting). The above hack was just faster to implement in Solr to not slow down merging (because index formats from 3.x and trunk are incompatible now, so trunk needs to be committed ASAP).

This is why I opened this issue. Just to record that its a little bit inconsistent. But of course your response is true, it may make more problems.

> cutover highlighter to BytesRef
> -------------------------------
>
>                 Key: LUCENE-3080
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3080
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: contrib/highlighter
>            Reporter: Michael McCandless
>
> Highlighter still uses char[] terms (consumes tokens from the analyzer as char[] not as BytesRef), which is causing problems for merging SOLR-2497 to trunk.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

[jira] [Commented] (LUCENE-3080) cutover highlighter to BytesRef

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-3080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13030333#comment-13030333 ] 

Robert Muir commented on LUCENE-3080:
-------------------------------------

I'm sorry, I don't think we should change the highlighter to use BytesRef.

Not unless offsets are changed to byte offsets to fit, otherwise this will make highlighting even more complicated for no good reason.

> cutover highlighter to BytesRef
> -------------------------------
>
>                 Key: LUCENE-3080
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3080
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: contrib/highlighter
>            Reporter: Michael McCandless
>
> Highlighter still uses char[] terms (consumes tokens from the analyzer as char[] not as BytesRef), which is causing problems for merging SOLR-2497 to trunk.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

[jira] [Commented] (LUCENE-3080) cutover highlighter to BytesRef

Posted by "Mike Sokolov (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-3080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13053432#comment-13053432 ] 

Mike Sokolov commented on LUCENE-3080:
--------------------------------------

I agree it's necessary to prove there is some point to all this - I'm working on getting some numbers.  At the moment I'm just assuming ASCII encoding, but I'll take a look at the binary stuff too - thanks.

> cutover highlighter to BytesRef
> -------------------------------
>
>                 Key: LUCENE-3080
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3080
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: modules/highlighter
>            Reporter: Michael McCandless
>
> Highlighter still uses char[] terms (consumes tokens from the analyzer as char[] not as BytesRef), which is causing problems for merging SOLR-2497 to trunk.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

[jira] [Updated] (LUCENE-3080) cutover highlighter to BytesRef

Posted by "Uwe Schindler (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/LUCENE-3080?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Uwe Schindler updated LUCENE-3080:
----------------------------------

    Description: Highlighter still uses char[] terms (consumes tokens from the analyzer as char[] not as BytesRef), which is causing problems for merging SOLR-2497 to trunk.  (was: Highlighter still uses char[] terms (consumes tokens from the analyzer as char[] not as BytesRef), which is causing problems for merging LUCENE-3065 to trunk.)

> cutover highlighter to BytesRef
> -------------------------------
>
>                 Key: LUCENE-3080
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3080
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: contrib/highlighter
>            Reporter: Michael McCandless
>
> Highlighter still uses char[] terms (consumes tokens from the analyzer as char[] not as BytesRef), which is causing problems for merging SOLR-2497 to trunk.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

[jira] [Commented] (LUCENE-3080) cutover highlighter to BytesRef

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-3080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13053260#comment-13053260 ] 

Robert Muir commented on LUCENE-3080:
-------------------------------------

not trying to scare you off from that byte-offset-converting-charfilter idea with my comment above either, I think this would be interesting to fix! maybe we just need to fix the situation so that tokenfilters can pass down the call to 'correctOffset' and then use it, I think it might only be defined on Tokenizer...


> cutover highlighter to BytesRef
> -------------------------------
>
>                 Key: LUCENE-3080
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3080
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: modules/highlighter
>            Reporter: Michael McCandless
>
> Highlighter still uses char[] terms (consumes tokens from the analyzer as char[] not as BytesRef), which is causing problems for merging SOLR-2497 to trunk.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

[jira] [Commented] (LUCENE-3080) cutover highlighter to BytesRef

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-3080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13053281#comment-13053281 ] 

Robert Muir commented on LUCENE-3080:
-------------------------------------

yes: in general I think it would be problematic, especially since most tests use only all-ascii data.

Another problem on this issue is that if you want to use bytes, but with the Tokenizer-analysis-chain, it only takes Reader, so you cannot assume anything about the original bytes or encoding (e.g. that its UTF-8 for example).





> cutover highlighter to BytesRef
> -------------------------------
>
>                 Key: LUCENE-3080
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3080
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: modules/highlighter
>            Reporter: Michael McCandless
>
> Highlighter still uses char[] terms (consumes tokens from the analyzer as char[] not as BytesRef), which is causing problems for merging SOLR-2497 to trunk.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org