Posted to dev@lucene.apache.org by Nicolas Maisonneuve <ni...@free.fr> on 2004/10/08 17:08:17 UTC
NGramSpeller for n field
Hi,
I would like to use David Spencer's NGramSpeller for N fields in an index.
With this algorithm, each field gets its own NGramSpeller index.
So if I have N fields, I must create N NGramSpeller indexes. OK, why not... but in fact the structure of a document for a 5-gram speller (for example) is:
"word"
"transposition"
"3gram"
"4gram"
"5gram"
+ the field "freq" for the popularity of the word in the field being processed
+ the document is boosted at indexing time
As we can see, from "word" to "5gram" (5 of the 6 fields) the data depends only on the word and not on the data in the index being processed. So, for N fields, I have the same information N times, from the "word" field to the "5gram" field, across N indexes. That is not really optimized for N fields.
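To make the redundancy concrete, here is a minimal sketch in plain Python (not the Lucene API and not the actual NGramSpeller code). The helper names are hypothetical, and the meaning of the "transposition" field (the word with two adjacent letters swapped) is an assumption. The point is only that these fields are computed from the word alone, so they come out identical whichever index field the word was taken from.

```python
def ngrams(word, n):
    """Return all contiguous substrings of length n, in order."""
    return [word[i:i + n] for i in range(len(word) - n + 1)]


def transpositions(word):
    """Assumed meaning of "transposition": swap each pair of adjacent letters."""
    return [word[:i] + word[i + 1] + word[i] + word[i + 2:]
            for i in range(len(word) - 1)]


def word_fields(word, min_n=3, max_n=5):
    """Build the fields that depend only on the word itself."""
    fields = {"word": word, "transposition": transpositions(word)}
    for n in range(min_n, max_n + 1):
        fields[f"{n}gram"] = ngrams(word, n)
    return fields


# The same word seen in two different source fields yields the same data.
from_field1 = word_fields("lucene")
from_field2 = word_fields("lucene")
print(from_field1 == from_field2)
print(from_field1["3gram"])
```

Only the "freq" value and the boost would differ between the N indexes.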
---First method ----
In fact I would like to change the field "freq" to a field named "freq_nameofField". The document structure for the field "field1" could be:
"word"
..
"5gram"
"freq_field1" ,freq for the field "field1"
so I have:
- N documents for 1 word (each document has a freq field for one specific field)
- but only 1 index. The structure of my index will be:
"word"
..
"5gram"
"freq_field1"
"freq_field2"
...
"freq_fieldn"
---Second method ----
But in the first method, 5/6 of the information in each document is redundant (from the "word" field to the "5gram" field), so I would like to create only 1 document per word, with this structure:
"word"
..
"5gram"
"freq_field1" ,freq for the field "field1"
"freq_field2" ,freq for the field "field2"
"freq_field3" ,freq for the field "field3"
But the problem is the boosting of the document: the boost value depends on the freq, and I have several different freqs to take into account for the same document.
Does anyone have an idea for avoiding redundant information in the NGramSpeller index for N fields?
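The two methods and the boost conflict can be sketched as follows, again in plain Python rather than Lucene. Everything here is illustrative: the function names are hypothetical, the frequencies are made up, and `hypothetical_boost` is an assumed formula, since all that is known is that the boost depends on the freq.

```python
import math


def word_part(word):
    """Fields that depend only on the word (transposition omitted for brevity)."""
    doc = {"word": word}
    for n in (3, 4, 5):
        doc[f"{n}gram"] = [word[i:i + n] for i in range(len(word) - n + 1)]
    return doc


def first_method_docs(word, freqs):
    """First method: one document per (word, field), each with one freq field."""
    docs = []
    for field_name, freq in freqs.items():
        doc = word_part(word)
        doc[f"freq_{field_name}"] = freq
        docs.append(doc)
    return docs


def second_method_doc(word, freqs):
    """Second method: a single document carrying every freq field."""
    doc = word_part(word)
    for field_name, freq in freqs.items():
        doc[f"freq_{field_name}"] = freq
    return doc


def hypothetical_boost(freq):
    """Assumed formula, for illustration only: the boost grows with freq."""
    return 1.0 + math.log(1 + freq)


freqs = {"field1": 12, "field2": 340}  # made-up frequencies
docs = first_method_docs("lucene", freqs)
single = second_method_doc("lucene", freqs)

# First method: each document can take the boost of its own freq.
# Second method: one document, but still one candidate boost per field.
boosts = [hypothetical_boost(f) for f in freqs.values()]
print(len(docs), "documents in the first method, 1 in the second")
print("candidate boosts for the single document:", boosts)
```

With the first method the n-gram lists are repeated in every document; with the second they are stored once, but a single document-level boost cannot reflect both freq values at the same time.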