Posted to dev@lucenenet.apache.org by Jeroen Lauwers <Je...@CTLO.NET> on 2010/08/06 12:24:07 UTC

Indexing the multiple words at the same position

Has anyone encountered the following problem (and found a solution)?

I need to index a classical text that can have multiple words at the same position. Example: if a publisher isn't sure whether Shakespeare wrote "To be or not to be happy" or "To be or not to be daddy", he will put the 'best' word (e.g. 'happy') in the full text and the second option (e.g. 'daddy') in the "notes" at the bottom of the page.
Now, our customer wants to search for "to be daddy" and find "to be happy". So, if I could index "daddy" at the same position as "happy", I would be very happy too.

Of course, one could index the full text once for each version, but this is not sustainable when the number of "multiple occupations of a single position" increases.

I have been looking at the 'next()' method of the 'Tokenizer' class, but I haven't found the solution (yet).

Thanks in advance to all who reply.
Jeroen

RE: Indexing the multiple words at the same position

Posted by Jeroen Lauwers <Je...@CTLO.NET>.

Hi, Daniele

That's it! You put me on the right track!
I found the answer at "http://www.codeproject.com/KB/cs/lucene_custom_analyzer.aspx", where they describe a "Custom Synonym Analyzer". The final clue was "Token.SetPositionIncrement(0)".
I never realized it was as simple as setting the position increment of the token you're about to return from the "next()" method.

Thanks,
Jeroen
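
Editorial illustration of the position-increment idea described above. This is a minimal sketch in Python, not Lucene.Net code: the Token class and the absolute_positions helper are hypothetical names made up for this example, and the only assumption carried over from the thread is that an increment of 0 puts a token at the same position as the one before it.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass
class Token:
    """Toy token: a term plus its position increment relative to the previous token."""
    term: str
    position_increment: int = 1  # 1 = next position, 0 = same position as previous token


def absolute_positions(tokens: Iterable[Token]) -> List[Tuple[int, str]]:
    """Turn relative position increments into absolute positions."""
    position = -1  # the first token (increment 1) then lands on position 0
    result = []
    for token in tokens:
        position += token.position_increment
        result.append((position, token.term))
    return result


stream = [
    Token("to"), Token("be"), Token("or"), Token("not"), Token("to"), Token("be"),
    Token("happy"),
    Token("daddy", position_increment=0),  # variant reading from the notes
]

for position, term in absolute_positions(stream):
    print(position, term)
```

In this model "happy" and "daddy" both end up at position 6, which is what lets a phrase search treat them as interchangeable at that spot.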

-----Original Message-----
From: Daniele Fusi [mailto:fusi.daniele@tiscali.it] 
Sent: Friday, 6 August 2010 12:30
To: lucene-net-dev@lucene.apache.org
Subject: RE: Indexing the multiple words at the same position

Hi, it also depends on the complexity of your critical apparatus, but you
could just use a custom analyzer which injects "synonyms" (here variants) of
your tokens in THE SAME POSITION as the original word. This way a search
will match both "daddy" and "happy".

RE: Indexing the multiple words at the same position

Posted by Daniele Fusi <fu...@tiscali.it>.

Hi, it also depends on the complexity of your critical apparatus, but you
could just use a custom analyzer which injects "synonyms" (here variants) of
your tokens in THE SAME POSITION as the original word. This way a search
will match both "daddy" and "happy".
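
Editorial illustration of the suggestion above: a rough sketch, again in plain Python and not the Lucene.Net analyzer API. The VARIANTS table and the function names tokenize, inject_variants, build_index and phrase_matches are all hypothetical. It assumes the variant readings are available as a simple lookup from the main-text word to its alternatives, which may be too simple for a real critical apparatus.

```python
from collections import defaultdict
from typing import Dict, Iterator, List, Set, Tuple

# Hypothetical variant table: reading in the main text -> readings from the notes.
VARIANTS: Dict[str, List[str]] = {"happy": ["daddy"]}


def tokenize(text: str) -> Iterator[Tuple[str, int]]:
    """Yield (term, position_increment) pairs for whitespace-separated words."""
    for word in text.lower().split():
        yield word, 1


def inject_variants(tokens: Iterator[Tuple[str, int]]) -> Iterator[Tuple[str, int]]:
    """Pass tokens through, emitting each variant with increment 0 right after its word."""
    for term, increment in tokens:
        yield term, increment
        for variant in VARIANTS.get(term, []):
            yield variant, 0


def build_index(text: str) -> Dict[str, Set[int]]:
    """Map each term to the set of absolute positions where it occurs."""
    index: Dict[str, Set[int]] = defaultdict(set)
    position = -1
    for term, increment in inject_variants(tokenize(text)):
        position += increment
        index[term].add(position)
    return index


def phrase_matches(index: Dict[str, Set[int]], phrase: str) -> bool:
    """True if the words of the phrase occur at consecutive positions."""
    words = phrase.lower().split()
    if not words:
        return False
    return any(
        all(start + offset in index.get(word, set()) for offset, word in enumerate(words))
        for start in index.get(words[0], set())
    )


index = build_index("To be or not to be happy")
print(phrase_matches(index, "to be daddy"))
print(phrase_matches(index, "to be happy"))
```

Because the text is indexed only once and each variant costs one extra token, this avoids indexing a full copy of the text per reading, which was the concern raised in the original question.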
