Posted to solr-user@lucene.apache.org by "alessandro.benedetti" <a....@sease.io> on 2017/02/01 15:58:59 UTC

Phrase Queries and Punctuation

Hi all,
I was just thinking about phrase queries and punctuation (and, more generally,
how to manage position increments when a sentence delimiter occurs).

At the moment, for multi-valued fields we have the positionIncrementGap,
which prevents phrase queries from spanning different values of the same
field.
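
To make the multi-valued behaviour concrete, here is a minimal pure-Python sketch (not Lucene code) of how a gap between values keeps an exact phrase from matching across them. The function names, the whitespace tokenization and the gap of 100 are assumptions made only for illustration.

```python
def index_positions(values, gap=100):
    """Map each token to its positions, adding `gap` between field values."""
    positions = {}
    pos = -1
    for i, value in enumerate(values):
        if i > 0:
            pos += gap  # stand-in for the position increment gap between values
        for token in value.lower().split():
            pos += 1
            positions.setdefault(token, []).append(pos)
    return positions


def phrase_match(positions, phrase):
    """Exact phrase match: the terms must sit at consecutive positions."""
    terms = phrase.lower().split()
    if not terms:
        return False
    return any(
        all(start + k in positions.get(term, []) for k, term in enumerate(terms))
        for start in positions.get(terms[0], [])
    )


values = ["red apple", "pie recipe"]
print(phrase_match(index_positions(values, gap=0), "apple pie"))
print(phrase_match(index_positions(values, gap=100), "apple pie"))
```

With no gap, "apple pie" matches across the two values; with the gap in place it no longer does, while "red apple" still matches inside a single value.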

In a single-valued text field, we may have hundreds of different
sentences (separated by punctuation).
Generally we don't want phrase queries to span different sentences, so I
would expect a similar position increment behaviour.

A possible solution could be a tokenizer that is able to split
sentences (plenty of NLP approaches are already available for this) and adds
a position increment gap between sentences as well (smaller than the
multi-valued position increment gap).
A very naive solution would be to add a position increment whenever we
find a punctuation delimiter (much as happens for removed stop words).
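
The naive variant could be sketched roughly as follows, in plain Python rather than as a real Lucene tokenizer. The delimiter set, the gap of 10 and the names tokenize_with_sentence_gap and matches_phrase are all hypothetical choices for this illustration.

```python
import re

SENTENCE_END = {".", "!", "?"}  # assumed set of sentence delimiters


def tokenize_with_sentence_gap(text, sentence_gap=10):
    """Return (token, position) pairs, adding a gap after sentence delimiters."""
    tokens = []
    pos = -1
    pending_gap = 0
    for piece in re.findall(r"\w+|[^\w\s]", text.lower()):
        if piece in SENTENCE_END:
            pending_gap = sentence_gap  # applied to the next word token
            continue
        if not re.match(r"\w", piece):
            continue  # other punctuation is dropped without any gap
        pos += 1 + pending_gap
        pending_gap = 0
        tokens.append((piece, pos))
    return tokens


def matches_phrase(tokens, phrase):
    """Exact phrase match over consecutive positions."""
    terms = phrase.lower().split()
    by_pos = {pos: tok for tok, pos in tokens}
    return bool(terms) and any(
        all(by_pos.get(start + k) == term for k, term in enumerate(terms))
        for start in by_pos
    )


tokens = tokenize_with_sentence_gap("I like search. Engines are fun.")
print(tokens)
print(matches_phrase(tokens, "search engines"))
print(matches_phrase(tokens, "like search"))
```

In this sketch "search engines" stops matching across the full stop while "like search" still matches. The same rule also fires on the dot in something like 3.14 or in abbreviations, which is one reason a proper sentence splitter may be the safer option.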
I have not analysed the implementations in detail yet.
At this stage I was just wondering whether anyone has faced this problem with
Lucene and Solr?
What kind of side effects could occur if we added a position increment gap
at punctuation delimiters, by default, in the Standard Tokenizer?

Cheers

-----
---------------
Alessandro Benedetti
Search Consultant, R&D Software Engineer, Director
Sease Ltd. - www.sease.io
--
View this message in context: http://lucene.472066.n3.nabble.com/Phrase-Queries-and-Punctuation-tp4318290.html