Posted to solr-user@lucene.apache.org by Dmitry Kan <so...@gmail.com> on 2015/06/18 09:07:34 UTC

MappingCharFilterFactory and start and end offsets

Hi,

It looks like MappingCharFilter sets the start and end offsets to the same
value. Can this be changed by some setting?

For the string "test $ test2" and the mapping "$" => " dollarsign " (we insert
extra spaces to separate $ into its own token)

we get: http://snag.gy/eJT1H.jpg

Ideally, we would like the start and end offsets to respect the remapped
token. Can this be achieved with settings?
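
The setup above can be reproduced with a toy Python model. This is a sketch
for illustration only, not Lucene's implementation: the function names
apply_mapping and tokens_with_offsets are hypothetical, tokenization is plain
whitespace splitting, and the rule that every replacement character is
attributed to an original offset clamped to the end of the matched text is an
assumption. Under that assumption, the leading space of " dollarsign " uses
up the single original character, which is one way identical start and end
offsets could arise.

```python
import re


def apply_mapping(text, mapping):
    """Apply a char mapping and record, for each output position, the
    offset in the ORIGINAL text that the position is attributed to."""
    out = []
    offsets = []  # offsets[i] = original offset for output position i
    i = 0
    while i < len(text):
        for src, dst in mapping.items():
            if text.startswith(src, i):
                for k, ch in enumerate(dst):
                    out.append(ch)
                    # Assumption: clamp to the end of the matched source span.
                    offsets.append(min(i + k, i + len(src)))
                i += len(src)
                break
        else:
            # No mapping matched here: copy the character unchanged.
            out.append(text[i])
            offsets.append(i)
            i += 1
    offsets.append(len(text))  # boundary after the last character
    return "".join(out), offsets


def tokens_with_offsets(text, mapping):
    """Whitespace-tokenize the filtered text and report each token with
    its start and end offsets mapped back to the original text."""
    filtered, offsets = apply_mapping(text, mapping)
    return [
        (m.group(), offsets[m.start()], offsets[m.end()])
        for m in re.finditer(r"\S+", filtered)
    ]


for token in tokens_with_offsets("test $ test2", {"$": " dollarsign "}):
    print(token)
```

In this toy model the dollarsign token comes out with equal start and end
offsets, while a mapping without the leading space ("$" => "dollarsign ")
gives it the span of the original $ character. Whether Solr behaves the same
way would need to be checked on the Analysis screen.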

-- 
Dmitry Kan
Luke Toolbox: http://github.com/DmitryKey/luke
Blog: http://dmitrykan.blogspot.com
Twitter: http://twitter.com/dmitrykan
SemanticAnalyzer: www.semanticanalyzer.info

Re: MappingCharFilterFactory and start and end offsets

Posted by Dmitry Kan <so...@gmail.com>.
Hi Steve,

Sorry for the late reply; I have been quite busy. I had second thoughts
immediately after sending the question, in line with what you said: I meant
the start and end offsets of the source token.

When MCFF (MappingCharFilterFactory) is removed, the $ disappears after ST
and the start and end offsets of all the terms are correct.

Is MCFF's behaviour correct? Should I file a JIRA issue for retaining the
start and end offsets of the original tokens?

On Thu, Jun 18, 2015 at 10:06 PM, Steve Rowe <sa...@gmail.com> wrote:

> Hi Dmitry,
>
> It’s weird that start and end offsets are the same - what do you see for
> the start/end of ‘$’, i.e. if you take out MCFF?  (I think it should be
> start:5, end:6.)
>
> As far as offsets “respecting the remapped token”, are you asking for
> offsets to be set as if ‘dollarsign’ were part of the original text?  If
> so, there is no setting that would do that - the intent is for offsets to
> map to the *original* text.  You can work around this by performing the
> substitution prior to Solr analysis, e.g. in an update processor like
> RegexReplaceProcessorFactory.
>
> Steve
> www.lucidworks.com
>



Re: MappingCharFilterFactory and start and end offsets

Posted by Steve Rowe <sa...@gmail.com>.
Hi Dmitry,

It’s weird that start and end offsets are the same - what do you see for the start/end of ‘$’, i.e. if you take out MCFF?  (I think it should be start:5, end:6.)

As far as offsets “respecting the remapped token”, are you asking for offsets to be set as if ‘dollarsign’ were part of the original text?  If so, there is no setting that would do that - the intent is for offsets to map to the *original* text.  You can work around this by performing the substitution prior to Solr analysis, e.g. in an update processor like RegexReplaceProcessorFactory.
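
To illustrate the workaround, here is a minimal Python sketch of substituting
before analysis. It is not Solr code: in Solr the substitution would be done
by the update processor, and the names presubstitute and whitespace_tokens
are hypothetical stand-ins, with whitespace splitting assumed in place of the
real tokenizer. Because the substituted text is what gets stored and
analyzed, the offsets of dollarsign refer to that text directly.

```python
import re


def presubstitute(text, mapping):
    """Rewrite the raw text before it is stored or analyzed."""
    for src, dst in mapping.items():
        text = text.replace(src, dst)
    return text


def whitespace_tokens(text):
    """Whitespace-tokenize and report (token, start, end) offsets
    relative to the text that was passed in."""
    return [(m.group(), m.start(), m.end()) for m in re.finditer(r"\S+", text)]


stored = presubstitute("test $ test2", {"$": " dollarsign "})
for token in whitespace_tokens(stored):
    print(token)
```

The trade-off is that the stored field value no longer contains the original
$ character, so highlighting and display operate on the substituted text.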

Steve
www.lucidworks.com
