Posted to solr-dev@lucene.apache.org by "Ryan McKinley (JIRA)" <ji...@apache.org> on 2007/05/11 21:15:15 UTC

[jira] Created: (SOLR-234) TrimFilter should update the start and end offsets

TrimFilter should update the start and end offsets
--------------------------------------------------

                 Key: SOLR-234
                 URL: https://issues.apache.org/jira/browse/SOLR-234
             Project: Solr
          Issue Type: Improvement
            Reporter: Ryan McKinley
            Priority: Minor


As implemented, the TrimFilter only trims the text.  It does not update the startOffset and endOffset.

see:
http://www.nabble.com/TrimFilter----t.startOffset%28%29%2C-t.endOffset%28%29-tf3728875.html

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (SOLR-234) TrimFilter should update the start and end offsets

Posted by "Hoss Man (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/SOLR-234?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12495202 ] 

Hoss Man commented on SOLR-234:
-------------------------------

> I'd think that updating the offsets is almost always the right thing to do (and should be the default?), given that spaces will 
> almost always come from the field value itself.

i don't follow your reasoning ... the offsets are supposed to denote where in the original text the Token came from ... a Filter can't make any assumptions about the source of the tokens except the token itself, so i don't see why a Filter would by default assume it can muck with the offsets.

In Ryan's use case he may want his highlighter-esque code to be able to know how many characters were trimmed off of each end -- and i buy that it makes sense for TrimFilter to have an option to relay that info by modifying the offsets -- but joe random user should be able to expect that by default the offsets of the Tokens his tokenizer produces won't be modified ... i would personally think it's a bug to get the behavior Ryan describes out of a highlighter if i knew that my tokenizer was only splitting on punctuation.
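
To make the two behaviors concrete, here is a minimal sketch of what a highlighter would cut out of the source text in each case. SimpleToken and highlighted() are hypothetical stand-ins invented for this illustration; they are not the Lucene Token class or any Solr highlighter API.

```java
// Illustration only: SimpleToken is a made-up stand-in, not org.apache.lucene.analysis.Token.
class OffsetDemo {
    static final class SimpleToken {
        final String text;
        final int start; // inclusive offset into the source text
        final int end;   // exclusive offset into the source text

        SimpleToken(String text, int start, int end) {
            this.text = text;
            this.start = start;
            this.end = end;
        }
    }

    // What highlighter-style code would extract from the source for a token.
    static String highlighted(String source, SimpleToken t) {
        return source.substring(t.start, t.end);
    }

    public static void main(String[] args) {
        String source = "aaa , bbb";
        // A tokenizer that splits only on ',' produces "aaa " with offsets [0,4).
        SimpleToken untouched = new SimpleToken("aaa", 0, 4); // text trimmed, offsets kept
        SimpleToken updated = new SimpleToken("aaa", 0, 3);   // text trimmed, offsets moved
        System.out.println("[" + highlighted(source, untouched) + "]");
        System.out.println("[" + highlighted(source, updated) + "]");
    }
}
```

With the offsets left alone the highlighted span still includes the trailing space; with the offsets moved it covers only the trimmed text.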



[jira] Commented: (SOLR-234) TrimFilter should update the start and end offsets

Posted by "Ryan McKinley (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/SOLR-234?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12495209 ] 

Ryan McKinley commented on SOLR-234:
------------------------------------


> 
> .. a Filter can't make any assumptions about source of the tokens except the token itself ...

I get the basic pattern now:  Tokenizers determine the start/end offsets and Filters just transform the text along the way.  


> In Ryan's use case he may want his highlighter-esque code to be able to know ...
> 

I am fine with either:

1. leave the TrimFilter as is and do the highlighter-esque code on the highlighting side.  

2. Add an optional updateOffsets="true" param, with the default set to "false"




[jira] Updated: (SOLR-234) TrimFilter should update the start and end offsets

Posted by "Ryan McKinley (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/SOLR-234?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ryan McKinley updated SOLR-234:
-------------------------------

    Attachment: SOLR-234-TrimFilterOffsets.patch

This patch moves the start and end offsets by the number of spaces that were eaten on either side.

This patch also extracts many of the generally useful token testing bits from TestSynonymFilter (and a few others) into a BaseTokenTestCase, and uses this base class rather than duplicating the helper functions.





[jira] Commented: (SOLR-234) TrimFilter should update the start and end offsets

Posted by "Yonik Seeley (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/SOLR-234?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12495147 ] 

Yonik Seeley commented on SOLR-234:
-----------------------------------

Updating the offsets does seem like the right thing to do.

I imagine using toCharArray() will be slower than using charAt() given that it will allocate a new array, and the number of charAt() calls will be low in the average case because there will only be a small amount of whitespace.
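
A minimal sketch of the charAt() approach, for illustration: it scans inward from both ends and never allocates a char array. The class and method names are invented here, and the use of Character.isWhitespace() is an assumption (String.trim() instead strips every character up to and including ' ').

```java
// Illustration only: finds trim bounds with charAt(), without copying the string.
class TrimBounds {
    // Returns {start, end} so that txt.substring(start, end) has no
    // leading or trailing whitespace.
    static int[] bounds(String txt) {
        int start = 0;
        int end = txt.length();
        while (start < end && Character.isWhitespace(txt.charAt(start))) {
            start++;
        }
        while (end > start && Character.isWhitespace(txt.charAt(end - 1))) {
            end--;
        }
        return new int[] { start, end };
    }

    public static void main(String[] args) {
        int[] b = bounds("  foo ");
        System.out.println(b[0] + ".." + b[1]);
    }
}
```

When there is little whitespace, each loop exits after a handful of charAt() calls, which is the average case described above.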

Isn't it annoying that Java never seems to let you do things as efficiently as the class lib itself...

Another issue here is that the position increment isn't maintained.
And yet another future issue is that any payloads aren't maintained (that's in a newer version of Lucene).
I'll bring up the latter issue on the lucene list since I think it's a bit of a design flaw.



[jira] Updated: (SOLR-234) TrimFilter should update the start and end offsets

Posted by "Ryan McKinley (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/SOLR-234?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ryan McKinley updated SOLR-234:
-------------------------------

    Attachment: SOLR-234-TrimFilterOffsets.patch

Updated to take an "updateOffsets" argument.

This uses charAt() rather than toCharArray().

What should happen with the position increment?


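      // start/end are the bounds of the trimmed text within txt; endOff is
      // presumably the number of trailing characters removed (txt.length() - end).
      if( start > 0 || end < txt.length() ) {
        int incr = t.getPositionIncrement();
        t = new Token( t.termText().substring( start, end ),
             t.startOffset()+start,
             t.endOffset()-endOff,
             t.type() );
        
        t.setPositionIncrement( incr ); //+ start ); TODO? what should happen with the position increment
      }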
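
For readers without the patch at hand, the fragment above can be approximated by the following self-contained sketch. Tok is a hypothetical stand-in for the Lucene Token class, and the trim() helper and its updateOffsets flag are written for this illustration only; this is not the code in the attached patch.

```java
// Illustration only: Tok is a made-up stand-in for a Lucene token.
class TrimSketch {
    static final class Tok {
        final String text;
        final int start;   // start offset in the original field value
        final int end;     // end offset in the original field value
        final int posIncr; // position increment

        Tok(String text, int start, int end, int posIncr) {
            this.text = text;
            this.start = start;
            this.end = end;
            this.posIncr = posIncr;
        }
    }

    static Tok trim(Tok t, boolean updateOffsets) {
        String txt = t.text;
        int start = 0;
        int end = txt.length();
        while (start < end && Character.isWhitespace(txt.charAt(start))) start++;
        while (end > start && Character.isWhitespace(txt.charAt(end - 1))) end--;
        if (start == 0 && end == txt.length()) {
            return t; // nothing was trimmed
        }
        int endOff = txt.length() - end; // trailing characters removed
        int newStart = updateOffsets ? t.start + start : t.start;
        int newEnd = updateOffsets ? t.end - endOff : t.end;
        // The position increment is carried over unchanged.
        return new Tok(txt.substring(start, end), newStart, newEnd, t.posIncr);
    }

    public static void main(String[] args) {
        Tok in = new Tok("  foo ", 10, 16, 1);
        Tok moved = trim(in, true);
        Tok kept = trim(in, false);
        System.out.println(moved.text + " " + moved.start + "-" + moved.end);
        System.out.println(kept.text + " " + kept.start + "-" + kept.end);
    }
}
```

The sketch keeps the position increment as it was in both modes, since trimming changes neither the number of tokens nor their order.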



[jira] Commented: (SOLR-234) TrimFilter should update the start and end offsets

Posted by "Yonik Seeley (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/SOLR-234?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12495213 ] 

Yonik Seeley commented on SOLR-234:
-----------------------------------

Offsets point back to the original field value for a particular token... and to me, it's a semantic contract (point to what makes sense in the source). It's not limited to the offsets generated by the Tokenizer... Analyzers don't have to use Tokenizers and TokenFilters at all.

As an example, WordDelimiterFilter modifies offsets when it splits words, and that makes sense to me.
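
As a rough illustration of a filter that adjusts offsets when it splits a token, consider the sketch below. It is deliberately simplistic and is not the WordDelimiterFilter implementation: it splits only on '-', and it assumes the token text corresponds character for character to the source span. Tok and split() are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;

// Illustration only: a toy splitter, far simpler than the real WordDelimiterFilter.
class SubwordSplit {
    static final class Tok {
        final String text;
        final int start;
        final int end;

        Tok(String text, int start, int end) {
            this.text = text;
            this.start = start;
            this.end = end;
        }
    }

    // Splits on '-' and gives each part offsets derived from the parent's start offset.
    static List<Tok> split(Tok t) {
        List<Tok> out = new ArrayList<>();
        int partStart = 0;
        for (int i = 0; i <= t.text.length(); i++) {
            if (i == t.text.length() || t.text.charAt(i) == '-') {
                if (i > partStart) {
                    out.add(new Tok(t.text.substring(partStart, i),
                                    t.start + partStart, t.start + i));
                }
                partStart = i + 1;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        for (Tok part : split(new Tok("Wi-Fi", 10, 15))) {
            System.out.println(part.text + " " + part.start + "-" + part.end);
        }
    }
}
```

Each part points at its own slice of the source, which is the "point to what makes sense in the source" reading of the offset contract.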

Another way to think about it is that there is more than one way to solve a problem (construct an analyzer).
What matters is the tokens that come out the end... not if I did
a) a tokenizer that split on something followed by a filter that trimmed
vs
b) a tokenizer that managed to split on something including discarding the whitespace

For this specific case, I think it comes down to the likely use cases for the filter, and an argument could be made either way.  I'm fine with either as this is a very minor issue.



[jira] Reopened: (SOLR-234) TrimFilter should update the start and end offsets

Posted by "Ryan McKinley (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/SOLR-234?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ryan McKinley reopened SOLR-234:
--------------------------------




[jira] Assigned: (SOLR-234) TrimFilter should update the start and end offsets

Posted by "Ryan McKinley (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/SOLR-234?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ryan McKinley reassigned SOLR-234:
----------------------------------

    Assignee: Ryan McKinley

Unless there are objections, I'd like to add "updateOffsets" as an option to the TrimFilter.

By default it will *not* modify the offsets.

Depending on how the Tokenizer+Analyzer stream is configured it may or may not make sense, so the option seems reasonable.



[jira] Resolved: (SOLR-234) TrimFilter should update the start and end offsets

Posted by "Ryan McKinley (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/SOLR-234?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ryan McKinley resolved SOLR-234.
--------------------------------

    Resolution: Fixed



[jira] Resolved: (SOLR-234) TrimFilter should update the start and end offsets

Posted by "Ryan McKinley (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/SOLR-234?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ryan McKinley resolved SOLR-234.
--------------------------------

    Resolution: Invalid

Chris points out:

"In Lucene, Token offset information is supposed to reflect exactly where in
the original stream of data the source of the token was found ... if the
token is modified in some way (ie: stemmed, trimmed, etc..)  the offsets
are supposed to remain the same because regardless of the token text
munging, the original location has not actually changed."


I'll move the test refactoring to another issue as that is still useful.



[jira] Commented: (SOLR-234) TrimFilter should update the start and end offsets

Posted by "Yonik Seeley (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/SOLR-234?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12495161 ] 

Yonik Seeley commented on SOLR-234:
-----------------------------------

> What should happen with the position increment? 
It should remain the same as the original.

> The case i can imagine leading to something like SOLR-42 is if a token is replaced with something
> that has leading or trailing spaces. 

Really whacky, but possible I guess.  I don't know of any token filters that would do that, unless someone explicitly used a synonym with spaces at the end.  It doesn't make any sense.
I'd think that updating the offsets is almost always the right thing to do (and should be the default?), given that spaces will almost always come from the field value itself.

-Yonik



Re: [jira] Commented: (SOLR-234) TrimFilter should update the start and end offsets

Posted by Chris Hostetter <ho...@fucit.org>.
: > Incidentally, PatternTokenizerFactory seems to have the annoying limitation
: > of assuming there is a token prior to each match -- even if the match
: > explicitly matches on the start of the string (so it creates a 0 width
: > token) ... that seems like a bug, right?

: how would you change it?  I don't know regex well enough to see the
: limitation.  My only criteria was that the output is the same as if you
: send it to string.split( pattern );

Ahhh.... yes i see ... if you are trying to mimic String.split (or
Pattern.split) then the current behavior is correct.  my thinking was that
if you were trying to use this to tokenize on whitespace (or something
like that) and your input was "  aaa bbb   ccc  " ... this would create 4
tokens: a zero width token, followed by tokens for aaa, bbb, and ccc ...
but that first token seemed like a mistake to me (or if it's not a
mistake, then it seemed like there should also be a zero width token
at the end after the last space too) ... but that's the way string
splitting works too.
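
The String.split behavior described above can be checked with a few lines of plain Java. This sketch uses only the JDK; SplitDemo and splitOnWhitespace() are names made up for the example, and it says nothing about how PatternTokenizerFactory is implemented beyond the stated goal of matching String.split.

```java
import java.util.Arrays;

// Illustration only: shows which empty strings String.split keeps.
class SplitDemo {
    static String[] splitOnWhitespace(String s) {
        return s.split("\\s+");
    }

    public static void main(String[] args) {
        String[] parts = splitOnWhitespace("  aaa bbb   ccc  ");
        // The leading empty string is kept; trailing empty strings are dropped.
        System.out.println(parts.length + " " + Arrays.toString(parts));
    }
}
```

The asymmetry comes from String.split itself: a non-empty match at the very start yields an empty leading element, while trailing empty elements are removed when no limit argument is given.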


-Hoss


Re: [jira] Commented: (SOLR-234) TrimFilter should update the start and end offsets

Posted by Ryan McKinley <ry...@gmail.com>.
Chris Hostetter wrote:
> : After 1/2 hour of regex hacking... I think I'll stick with a two step
> : process: split then trim ;)
> 
> But regex hacking is FUN!!
> 
> I'm 99% certain this does what you want...
> 
>         <tokenizer class="solr.PatternTokenizerFactory"
>                    pattern="((\A\s*)|\s*?(\s+-\s+|--|,|\(|\))|\s+)\s*\z?" />
> 

yup!  that does it.  thanks


> ...if it doesn't, send me an example string that it fails on and tell me
> what the desired output is.
> 
> Incidentally, PatternTokenizerFactory seems to have the annoying limitation
> of assuming there is a token prior to each match -- even if the match
> explicitly matches on the start of the string (so it creates a 0 width
> token) ... that seems like a bug, right?
> 

how would you change it?  I don't know regex well enough to see the 
limitation.  My only criteria was that the output is the same as if you 
send it to string.split( pattern );


thanks again
ryan

Re: [jira] Commented: (SOLR-234) TrimFilter should update the start and end offsets

Posted by Chris Hostetter <ho...@fucit.org>.
: After 1/2 hour of regex hacking... I think I'll stick with a two step
: process: split then trim ;)

But regex hacking is FUN!!

I'm 99% certain this does what you want...

        <tokenizer class="solr.PatternTokenizerFactory"
                   pattern="((\A\s*)|\s*?(\s+-\s+|--|,|\(|\))|\s+)\s*\z?" />

...if it doesn't, send me an example string that it fails on and tell me
what the desired output is.

Incidentally, PatternTokenizerFactory seems to have the annoying limitation
of assuming there is a token prior to each match -- even if the match
explicitly matches on the start of the string (so it creates a 0 width
token) ... that seems like a bug, right?




-Hoss


Re: [jira] Commented: (SOLR-234) TrimFilter should update the start and end offsets

Posted by Ryan McKinley <ry...@gmail.com>.
> 
> ...oh, hmm ... you only want to split on - if it has a space on both sides
> huh?  does java regex have a "don't be greedy" option? ... javadocs say
> yes (they call it "Reluctant" vs "greedy") so try something like this...
> 
>    pattern="\s*?(\s-\s|--|,|\(|\))\s*?"
> 

it *almost* works, with:

   (\s{1,}-\s{1,}|^\s*|\s*$|\s*--\s*|\s*\(\s*)

BUT if i add a condition for ')' it falls apart...  this one *almost* 
works too:
   (\s*(\s{1,}-\s{1,}|--|,|\(|\))\s*?|^\s*|\s*$)


After 1/2 hour of regex hacking... I think I'll stick with a two step 
process: split then trim ;)
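
The two-step process can be sketched in plain Java as follows, using the original delimiter pattern from this thread. SplitThenTrim and tokens() are hypothetical names, and dropping empty pieces is an assumption made for this example rather than something the thread specifies. Note that this sketch returns only the token text; it does not track offsets.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Illustration only: split on the delimiters, then trim each piece.
class SplitThenTrim {
    static final Pattern DELIMS = Pattern.compile("--|,|\\s-\\s|\\(|\\)");

    static List<String> tokens(String input) {
        List<String> out = new ArrayList<>();
        for (String piece : DELIMS.split(input)) {
            String trimmed = piece.trim();
            if (!trimmed.isEmpty()) { // assumption: empty pieces are not wanted
                out.add(trimmed);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(tokens("aaa -- bbb , ccc (ddd) - eee"));
        System.out.println(tokens("well-known"));
    }
}
```

A hyphen without surrounding spaces is left alone here, which is the behavior the single-regex attempts were struggling to keep.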



Re: [jira] Commented: (SOLR-234) TrimFilter should update the start and end offsets

Posted by Chris Hostetter <ho...@fucit.org>.
: probably....  I'm just not very good at regex ;)
:
:    pattern="--|,|\s-\s|\(|\)"
:
: this will split on "--", ",", " - ", "(", and ")".  I can't figure out how to
: build the pattern so it will trim each thing on the way out.

just make sure you match the whitespace in the pattern, you're already
doing that for the single "-" case, something like this should work but
i haven't tested it...

   pattern="\s*(-|--|,|\(|\))\s*"

...oh, hmm ... you only want to split on - if it has a space on both sides
huh?  does java regex have a "don't be greedy" option? ... javadocs say
yes (they call it "Reluctant" vs "greedy") so try something like this...

   pattern="\s*?(\s-\s|--|,|\(|\))\s*?"
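
One thing to watch with the reluctant version: a reluctant quantifier at the very end of a pattern is satisfied by matching nothing, so whitespace after the delimiter stays on the next token. The sketch below demonstrates this with String.split; ReluctantDemo is a name invented for the example, and it exercises the regex only, not PatternTokenizerFactory.

```java
import java.util.Arrays;

// Illustration only: the trailing \s*? matches zero characters.
class ReluctantDemo {
    static final String RELUCTANT = "\\s*?(\\s-\\s|--|,|\\(|\\))\\s*?";

    static String[] split(String input) {
        return input.split(RELUCTANT);
    }

    public static void main(String[] args) {
        // Whitespace before the comma is consumed, whitespace after it is not.
        System.out.println(Arrays.toString(split("aaa , bbb")));
    }
}
```

The leading \s*? does expand, because the engine has to consume the space before it can reach the delimiter; the trailing one has nothing after it that would force it to grow.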

-Hoss


Re: [jira] Commented: (SOLR-234) TrimFilter should update the start and end offsets

Posted by Mike Klaas <mi...@gmail.com>.
On 11-May-07, at 5:02 PM, Ryan McKinley wrote:

> Chris Hostetter wrote:
>> : My real use case is adding the trim filter to the pattern tokenizer.
>> : the 'correct' answer in my case is to update the offsets.
>> hmmm... wouldn't the "correct" thing to do in that case be to change your
>> pattern so it strips the whitespace when tokenizing?  that way the offsets
>> of your tokens will be accurate from the beginning.
>
> probably....  I'm just not very good at regex ;)
>
>   pattern="--|,|\s-\s|\(|\)"
>
> this will split on "--", ",", " - ", "(", and ")".  I can't figure out how to
> build the pattern so it will trim each thing on the way out.

Try:

\s*[(),-]+\s*

Note that this will also split on multiple (('s and ,,'s.  If more  
precision is required, change the + to {1,2} or special-case the '--'.
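
A quick check of the character-class pattern with String.split, as a sketch: it trims the whitespace around each delimiter, but because '-' is in the class it also splits on a hyphen that has no spaces around it. DelimiterClassDemo is a hypothetical name used only here.

```java
import java.util.Arrays;

// Illustration only: tries the character-class pattern on two inputs.
class DelimiterClassDemo {
    static final String PATTERN = "\\s*[(),-]+\\s*";

    static String[] split(String input) {
        return input.split(PATTERN);
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(split("foo -- bar, baz (qux)")));
        System.out.println(Arrays.toString(split("well-known")));
    }
}
```

Whether splitting "well-known" is acceptable depends on the data; the requirement stated elsewhere in this thread was to split on '-' only when it has a space on both sides.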

-Mike



Re: [jira] Commented: (SOLR-234) TrimFilter should update the start and end offsets

Posted by Ryan McKinley <ry...@gmail.com>.
Chris Hostetter wrote:
> : My real use case is adding the trim filter to the pattern tokenizer.
> : the 'correct' answer in my case is to update the offsets.
> 
> hmmm... wouldn't the "correct" thing to do in that case be to change your
> pattern so it strips the whitespace when tokenizing?  that way the offsets
> of your tokens will be accurate from the beginning.
> 

probably....  I'm just not very good at regex ;)

   pattern="--|,|\s-\s|\(|\)"

this will split on "--", ",", " - ", "(", and ")".  I can't figure out how to 
build the pattern so it will trim each thing on the way out.




Re: [jira] Commented: (SOLR-234) TrimFilter should update the start and end offsets

Posted by Chris Hostetter <ho...@fucit.org>.
: My real use case is adding the trim filter to the pattern tokenizer.
: the 'correct' answer in my case is to update the offsets.

hmmm... wouldn't the "correct" thing to do in that case be to change your
pattern so it strips the whitespace when tokenizing?  that way the offsets
of your tokens will be accurate from the beginning.
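
To show why folding the whitespace into the delimiter gives accurate offsets from the start, here is a minimal pattern-based tokenizer sketch that records each token's offsets in the original string. It is written for this illustration and is not PatternTokenizerFactory; in particular it skips empty tokens, which the real tokenizer (mimicking String.split) may not. Tok and tokenize() are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustration only: tokens are the gaps between delimiter matches.
class PatternOffsets {
    static final class Tok {
        final String text;
        final int start;
        final int end;

        Tok(String text, int start, int end) {
            this.text = text;
            this.start = start;
            this.end = end;
        }
    }

    static List<Tok> tokenize(String input, Pattern delimiter) {
        List<Tok> out = new ArrayList<>();
        Matcher m = delimiter.matcher(input);
        int last = 0;
        while (m.find()) {
            if (m.start() > last) { // assumption: skip empty tokens
                out.add(new Tok(input.substring(last, m.start()), last, m.start()));
            }
            last = m.end();
        }
        if (last < input.length()) {
            out.add(new Tok(input.substring(last), last, input.length()));
        }
        return out;
    }

    public static void main(String[] args) {
        String input = "aaa , bbb";
        for (Tok t : tokenize(input, Pattern.compile(","))) {
            System.out.println("[" + t.text + "] " + t.start + "-" + t.end);
        }
        for (Tok t : tokenize(input, Pattern.compile("\\s*,\\s*"))) {
            System.out.println("[" + t.text + "] " + t.start + "-" + t.end);
        }
    }
}
```

With the whitespace inside the delimiter, the tokens come out already trimmed and their offsets already tight, so no later filter needs to touch them.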



-Hoss


[jira] Commented: (SOLR-234) TrimFilter should update the start and end offsets

Posted by "Ryan McKinley (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/SOLR-234?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12495153 ] 

Ryan McKinley commented on SOLR-234:
------------------------------------

> 
> Updating the offsets does seem like the right thing to do.
> 

My real use case is adding the trim filter to the pattern tokenizer.  The 'correct' answer in my case is to update the offsets.

The case i can imagine leading to something like SOLR-42 is if a token is replaced with something that has leading or trailing spaces.  

Perhaps we could add a parameter to the factory:

 <filter class="solr.TrimFilterFactory" updateOffsets="true" />

To limit SOLR-42 style errors, the default could be false.
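
One way a factory might read such a parameter with a default of false is sketched below. This is an assumption for illustration, not the actual TrimFilterFactory code: the class name, the method name, and the idea that the attributes arrive as a Map<String, String> are all hypothetical here.

```java
import java.util.Collections;
import java.util.Map;

// Illustration only: reads an optional boolean argument, defaulting to false.
class TrimFactoryArgs {
    static boolean updateOffsets(Map<String, String> args) {
        // Boolean.parseBoolean(null) is false, so a missing attribute means "off".
        return Boolean.parseBoolean(args.get("updateOffsets"));
    }

    public static void main(String[] args) {
        System.out.println(updateOffsets(Collections.<String, String>emptyMap()));
        System.out.println(updateOffsets(Collections.singletonMap("updateOffsets", "true")));
    }
}
```

Defaulting to false means existing configurations keep their current offsets unless they opt in.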


> 
> Isn't it annoying that Java never seems to let you do things as efficiently as the class lib itself...
> 

*especially* for strings!



[jira] Commented: (SOLR-234) TrimFilter should update the start and end offsets

Posted by "Hoss Man (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/SOLR-234?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12495224 ] 

Hoss Man commented on SOLR-234:
-------------------------------

yeah ... i'm not saying there aren't good use cases both ways -- it definitely makes sense to have an option -- i'm just saying that as a general rule TokenFilters shouldn't be munging offsets ... i don't see a big difference between TrimFilter and StemmingFilter (where the stem of "foo   " and "foo      " is "foo").  so the option should default to off.
