Posted to solr-user@lucene.apache.org by Carlos Gonzalez-Cadenas <cg...@experienceon.com> on 2012/03/19 16:02:14 UTC

processing of merged tokens

Hello,

For our search system we'd like to be able to process merged tokens (sorry,
I don't know what the proper name for this is). That is, when a user enters
a query like "hotelsin barcelona", we'd like to know that the user means
"hotels in barcelona".

At some point in the past we implemented this kind of functionality with
shingles (using ShingleFilter), that is, if we were indexing the sentence
"hotels in barcelona" as a document, we'd be able to match at query time
merged tokens like "hotelsin" and "inbarcelona".
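
To illustrate the idea, here is a rough sketch in plain Python. It is not
the actual Lucene ShingleFilter code; the function name merged_shingles is
made up for this example, and it only mimics joining adjacent tokens with
an empty separator.

```python
def merged_shingles(tokens, size=2):
    """Concatenate each run of `size` adjacent tokens with no separator."""
    return ["".join(tokens[i:i + size]) for i in range(len(tokens) - size + 1)]


# Merged tokens produced for one indexed sentence.
indexed = merged_shingles("hotels in barcelona".split())
print(indexed)  # ['hotelsin', 'inbarcelona']
```

Every document contributes one extra term per adjacent token pair, and only
pairs that are adjacent in the indexed text are ever produced.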

This solution has two problems:
1) The index size increases a lot.
2) We only catch a small percentage of the possibilities. Merged tokens
built from words that are not adjacent, or not in the same order, in the
indexed text, like "hotelsbarcelona" or "barcelonahotels", cannot be
processed.

Our intuition is that there should be a better solution. Maybe it's solved
in Solr or Lucene and we haven't found it yet. If it's not solved, I can
imagine a simple solution that would use TermsEnum to identify whether a
token exists in the index or not, and then if it doesn't exist, use the
TermsEnum again to check whether it's a composition of two known tokens.
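
A minimal sketch of that lookup idea, in plain Python rather than against
the Lucene API. The assumptions here are mine: a Python set (known_terms)
stands in for the term dictionary that TermsEnum would be used to probe,
and the function name split_merged is hypothetical.

```python
def split_merged(token, known_terms):
    """Return candidate readings of `token` as tuples of known terms.

    If the token itself is a known term, it is returned unchanged.
    Otherwise every split point is tried, and a split is kept only
    when both halves are known terms.
    """
    if token in known_terms:
        return [(token,)]
    return [
        (token[:i], token[i:])
        for i in range(1, len(token))
        if token[:i] in known_terms and token[i:] in known_terms
    ]


# Stand-in for the index's term dictionary (assumed contents).
known_terms = {"hotel", "hotels", "in", "barcelona"}

print(split_merged("hotelsin", known_terms))         # [('hotels', 'in')]
print(split_merged("barcelonahotels", known_terms))  # [('barcelona', 'hotels')]
```

Because the check depends only on which terms exist, not on their positions
in any document, it also covers cases like "hotelsbarcelona". It costs up to
two term lookups per split point, and it only handles a merge of exactly two
tokens.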

It's highly likely that there are much better solutions and algorithms for
this. It would be great if you could help us identify the best way to solve
this problem.

Thanks a lot for your help.

Carlos

Carlos Gonzalez-Cadenas
CEO, ExperienceOn - New generation search
http://www.experienceon.com

Mobile: +34 652 911 201
Skype: carlosgonzalezcadenas
LinkedIn: http://www.linkedin.com/in/carlosgonzalezcadenas