Posted to solr-user@lucene.apache.org by "scott chu (朱炎詹)" <sc...@udngroup.com> on 2010/08/20 12:19:46 UTC

Doing Shingle but also keep special single word

I am building an index with the Shingle filter. We know it produces a minimum of 2-grams, but I also want to keep some special single words, e.g. IBM, Microsoft, etc. That is, I want a minimum of 2-grams but also want these single words in my index. Is it possible?

Scott

Re: Doing Shingle but also keep special single word

Posted by scott chu <sc...@udngroup.com>.

Hi, Brendan,

    Thanks for the reply. The real problem is that I can't predict when a 
new important special word that users are interested in will appear, because 
I am building an index of daily news articles. Therefore, I don't know when 
and which single words should be included in that new field. I have thought 
about manually maintaining a special-word dictionary, but it costs too much 
effort, so I gave up on that idea.

However, your suggestion still sounds like a good trade-off to me; I'll take 
it into account seriously.

Scott

----- Original Message ----- 
From: "Brendan Grainger" <br...@gmail.com>
To: <so...@lucene.apache.org>
Sent: Friday, August 20, 2010 10:06 PM
Subject: Re: Doing Shingle but also keep special single word


Re: Doing Shingle but also keep special single word

Posted by Brendan Grainger <br...@gmail.com>.

Hi Scott,

Is there a reason why you wouldn't just index these special words into another field and then search over both fields? That would also have the nice property of being able to boost on the special word field if you wanted.

HTH
Brendan


Re: Why it's boosted up?

Posted by "scott chu (朱炎詹)" <sc...@udngroup.com>.

Thanks for your clear explanation! I got it :)
----- Original Message ----- 
From: "MitchK" <mi...@web.de>
To: <so...@lucene.apache.org>
Sent: Tuesday, August 24, 2010 3:37 PM
Subject: Re: Why it's boosted up?



Re: Why it's boosted up?

Posted by MitchK <mi...@web.de>.

Hi Scott,



> (so  shorter fields are automatically boosted up). " 
> 
The theory behind that is the following (in simple words):
Let's say you have two documents, each containing only one field (as in
my example).
Additionally, we have a query that contains two words.
Let's say doc1 contains only 10 words and doc2 contains only 20 words.
The query matches both docs with both words.
The idea of boosting shorter fields more strongly than longer fields is the
following:
In doc1, 2/10 = 0.2 => 20% of the words match your query.
In doc2, 2/20 = 0.1 => 10% of the words match your query.

So doc1 should get a better score, because the ratio of matching words to the
total number of occurring words is greater than in doc2.
This is the idea of using norms as an index-time boosting factor. NOTE: This
does not mean that doc1 gets boosted by 20% and doc2 by 10%! It only
illustrates what the idea behind such norms is.
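
As a rough illustration of the numbers above, here is a minimal Python sketch.
It assumes the classic default length norm of 1/sqrt(numTokens), and it
ignores field and document boosts as well as the lossy single-byte encoding.
The function name length_norm is only illustrative.

```python
import math


def length_norm(num_tokens: int) -> float:
    """Assumed default length norm: 1 / sqrt(number of tokens in the field)."""
    return 1.0 / math.sqrt(num_tokens)


# The two documents from the example: 10 and 20 words, 2 matching words each.
for name, num_tokens in (("doc1", 10), ("doc2", 20)):
    match_ratio = 2 / num_tokens
    print(f"{name}: match ratio {match_ratio:.2f}, "
          f"length norm {length_norm(num_tokens):.3f}")
```

Under this assumption the shorter doc1 gets the larger norm (about 0.316
versus about 0.224 for doc2), which is what "shorter fields are boosted up"
refers to.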

From the Similarity class's documentation of lengthNorm():



> Matches in longer fields are less precise, so implementations of this
> method usually return smaller values when numTokens is large, and larger
> values when numTokens is small.
> 

However, as a search application developer, you have to decide whether this
theory applies to your application or not. In some cases using norms makes
no sense, in others it does.
If you think that norms apply to your project, omitting them is not a
good way to save disk space.
Furthermore: if you think the theory does apply to the business needs of
your application but its impact is currently too heavy, you can have a look
at SweetSpotSimilarity in Lucene.



> The request is from our business team: they wish users of our product could
> type in a partial string of a word that exists in the title or body field.
> 
You mean something like typing "note" and also getting results like
"notebook"?
The correct approach for something like that is not ShingleFilter but
NGrams or edge NGrams.
Shingles do something like this:
"This is my shingle sentence" -> "This is, is my, my shingle, shingle
sentence" -> it breaks up the sentence into smaller pieces. The benefit of
doing so is that, if a query matches one of these shingles, you have found
a short phrase without using the performance-consuming PhraseQuery feature.
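
The shingle example above can be reproduced with a small Python sketch. It is
a toy whitespace-based version, not the Solr filter itself; the function name
shingles and its parameters are assumptions made for the illustration, and
the order in which unigrams and shingles are returned is not meant to match
Solr's token stream.

```python
from __future__ import annotations


def shingles(text: str, size: int = 2, output_unigrams: bool = False) -> list[str]:
    """Return word-level shingles (token n-grams) of a whitespace-split text."""
    tokens = text.split()
    # Slide a window of `size` tokens over the token list.
    grams = [" ".join(tokens[i:i + size]) for i in range(len(tokens) - size + 1)]
    # Optionally keep the single words as well.
    return tokens + grams if output_unigrams else grams


print(shingles("This is my shingle sentence"))
```

Note that every shingle is built from whole tokens, so a query for "note"
would still not match a token "notebook".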

Kind regards,
- Mitch


-- 
View this message in context: http://lucene.472066.n3.nabble.com/Doing-Shingle-but-also-keep-special-single-word-tp1241204p1306419.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: Why it's boosted up?

Posted by "scott chu (朱炎詹)" <sc...@udngroup.com>.

Thanks! That makes sense :)

----- Original Message ----- 
From: "Ahmet Arslan" <io...@yahoo.com>
To: <so...@lucene.apache.org>
Sent: Tuesday, August 24, 2010 4:30 PM
Subject: Re: Why it's boosted up?



Re: Why it's boosted up?

Posted by Ahmet Arslan <io...@yahoo.com>.

> Then why are short fields boosted up? 

In other words, longer documents are penalized, because they are likely to contain many terms/words. If this mechanism did not exist, longer documents would take over and usually pop up on the first page.

Why it's boosted up?

Posted by "scott chu (朱炎詹)" <sc...@udngroup.com>.

On Lucene's web page, there's a paragraph:

"Indexing time boosts are preprocessed for storage efficiency and written to 
the directory (when writing the document) in a single byte (!) as follows: 
For each field of a document, all boosts of that field (i.e. all boosts 
under the same field name in that doc) are multiplied. The result is 
multiplied by the boost of the document, and also multiplied by a "field 
length norm" value that represents the length of that field in that doc (so 
shorter fields are automatically boosted up). "

I thought that the greater the value, the higher the boost. Then why are short 
fields boosted up? Isn't the norm value for short fields smaller?

Re: Doing Shingle but also keep special single word

Posted by "scott chu (朱炎詹)" <sc...@udngroup.com>.

Thanks! I'll put more effort into understanding your suggestion and that 
norm thing.

----- Original Message ----- 
From: "MitchK" <mi...@web.de>
To: <so...@lucene.apache.org>
Sent: Tuesday, August 24, 2010 5:28 AM
Subject: Re: Doing Shingle but also keep special single word




Re: Doing Shingle but also keep special single word

Posted by MitchK <mi...@web.de>.

No, I mean that you use an additional (indexed) field for searching, e.g.
whitespace-tokenized, so that every word separated by whitespace becomes
a token.
So you have two fields (a shingle-token field and a single-token field),
and you can search across both of them.
This provides several benefits: e.g. you can boost the shingle field at
query time, since a match in the shingle field means that an exact
phrase matched.

Additionally: you can search with single-word queries as well as
multi-word queries.
Furthermore, you can apply synonyms to your single-token field.

If you want to keep your index as small as possible but as large as needed,
try to understand Lucene's Similarity implementation in order to decide
whether you can set the field options omitNorms="true" or
omitTermFreqAndPositions="true".
http://lucene.apache.org/java/3_0_1/api/all/org/apache/lucene/search/Similarity.html
Keep in mind what you lose if you enable one of those options.

A small example of the consequences of setting omitNorms="true":
doc1: "this is a short example doc"
doc2: "this is a longer example doc for presenting the effect of omitNorms"

If you are searching for "doc" while omitNorms=false, your response will look
like this:
doc1,
doc2
This is because the norm value for doc1 is larger than the norm value for
doc2, because doc1 is shorter than doc2 (have a look at the provided link).

If omitNorms=true, the scores for both docs will be equal.
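
The effect can be sketched with a toy score in Python. This is not Lucene's
real scoring formula: it assumes score = term frequency * length norm, with
the length norm taken as 1/sqrt(number of tokens) and replaced by 1.0 when
norms are omitted. The names DOCS, score and rank are made up for the
illustration.

```python
import math

DOCS = {
    "doc1": "this is a short example doc",
    "doc2": "this is a longer example doc for presenting the effect of omitNorms",
}


def score(text: str, term: str, omit_norms: bool) -> float:
    """Toy score: raw term frequency times an assumed 1/sqrt(length) norm."""
    tokens = text.split()
    term_freq = tokens.count(term)
    norm = 1.0 if omit_norms else 1.0 / math.sqrt(len(tokens))
    return term_freq * norm


def rank(term: str, omit_norms: bool) -> list:
    """Return document names ordered by descending toy score."""
    return sorted(DOCS, key=lambda name: score(DOCS[name], term, omit_norms),
                  reverse=True)


for omit in (False, True):
    scores = {name: round(score(text, "doc", omit), 3) for name, text in DOCS.items()}
    print(f"omitNorms={omit}: ranking {rank('doc', omit)}, scores {scores}")
```

With norms, doc1 ranks first because its field is shorter; without norms
both docs get the same toy score, so their relative order carries no meaning.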

Kind regards,
- Mitch

-- 
View this message in context: http://lucene.472066.n3.nabble.com/Doing-Shingle-but-also-keep-special-single-word-tp1241204p1300497.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: Doing Shingle but also keep special single word

Posted by "scott chu (朱炎詹)" <sc...@udngroup.com>.

Thanks! It seems that I really went in the wrong direction.

----- Original Message ----- 
From: "Ahmet Arslan" <io...@yahoo.com>
To: <so...@lucene.apache.org>
Sent: Tuesday, August 24, 2010 4:21 PM
Subject: Re: Doing Shingle but also keep special single word



Re: Doing Shingle but also keep special single word

Posted by Ahmet Arslan <io...@yahoo.com>.

> The request is from our business
> team: they wish users of our product could 
> type in a partial string of a word that exists in the title or
> body field. But now 
> I also doubt whether this request is really necessary.

"partial string of a word"? I think there is a misunderstanding here. ShingleFilter operates at the token level. 

please divide this text => "please divide", "divide this", "this text"

If you want partial strings of a single word, then EdgeNGramFilter and NGramFilter are used for that purpose.
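
For contrast with the token-level shingles, here is a minimal Python sketch of character-level grams. It only illustrates the idea and is not the Solr filters themselves; the function names edge_ngrams and ngrams and their parameters are assumptions.

```python
from __future__ import annotations


def edge_ngrams(word: str, min_gram: int = 1, max_gram: int | None = None) -> list[str]:
    """Prefixes of `word` whose length is between min_gram and max_gram."""
    limit = len(word) if max_gram is None else min(max_gram, len(word))
    return [word[:n] for n in range(min_gram, limit + 1)]


def ngrams(word: str, size: int) -> list[str]:
    """All character substrings of `word` with exactly `size` characters."""
    return [word[i:i + size] for i in range(len(word) - size + 1)]


print(edge_ngrams("divide", min_gram=2))
print(ngrams("text", 2))
```

Indexing such grams is what lets a partial input like "div" match the word "divide"; shingles never split a single word.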

Re: Doing Shingle but also keep special single word

Posted by "scott chu (朱炎詹)" <sc...@udngroup.com>.

The request is from our business team: they wish users of our product could 
type in a partial string of a word that exists in the title or body field. 
But now I also doubt whether this request is really necessary.

Scott

----- Original Message ----- 
From: "Ahmet Arslan" <io...@yahoo.com>
To: <so...@lucene.apache.org>
Sent: Monday, August 23, 2010 8:35 PM
Subject: Re: Doing Shingle but also keep special single word



Re: Doing Shingle but also keep special single word

Posted by Ahmet Arslan <io...@yahoo.com>.

> 1. We have over ten million news articles to build into
> a Solr index.
> 2. We copy several fields, such as title, author, body, and
> captions of attached photos, into a new field for default
> search.
> 3. We then want to use the shingle filter on this new field.
> 4. We can't predict which new single-word nouns our
> users may be interested in, because it's "news", you know. For
> example, the word "ECFA" has only recently become very popular
> in the news here, so I wish users could type in 'ECFA' to search
> and Solr would return some relevant news articles.
> 5. I wish to keep the index as small as possible.
> 6. I also wish to do the same thing described in 5 when I
> search by explicitly specifying the field names of those fields,
> too.

Can I ask why you need/use the shingle filter?

Re: Doing Shingle but also keep special single word

Posted by "scott chu (朱炎詹)" <sc...@udngroup.com>.

I think I didn't state my problem very well; allow me to rephrase my case here:

1. We have over ten million news articles to build into a Solr index.
2. We copy several fields, such as title, author, body, and captions of 
attached photos, into a new field for default search.
3. We then want to use the shingle filter on this new field.
4. We can't predict which new single-word nouns our users may be 
interested in, because it's "news", you know. For example, the word "ECFA" 
has only recently become very popular in the news here, so I wish users 
could type in 'ECFA' to search and Solr would return some relevant news 
articles.
5. I wish to keep the index as small as possible.
6. I also wish to do the same thing described in 5 when I search by explicitly 
specifying the field names of those fields, too.

I don't quite understand the additional-field way. Do you mean making another 
field that stores the special words specifically, but without indexing that 
field?

Scott

----- Original Message ----- 
From: "MitchK" <mi...@web.de>
To: <so...@lucene.apache.org>
Sent: Sunday, August 22, 2010 11:48 PM
Subject: Re: Doing Shingle but also keep special single word



Re: Doing Shingle but also keep special single word

Posted by MitchK <mi...@web.de>.

Hi,

the keepword filter is no solution for this problem, since it would mean
that one has to manage a word dictionary. As explained, this
would take too much effort.

You can easily add outputUnigrams=true and check out analysis.jsp for
this field, so you can see how much bigger a single field will become with
this option.
However, I am quite sure that the difference between using
outputUnigrams=true and indexing into a separate field is not noteworthy.
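
A back-of-the-envelope Python sketch of the token counts involved. It only
counts emitted tokens for a field of num_words words with bigram shingles;
real index size also depends on the term dictionary, postings compression
and stored fields, so treat the numbers as a rough guide. The function name
token_counts is hypothetical.

```python
def token_counts(num_words: int, shingle_size: int = 2) -> dict:
    """Count tokens emitted for one field under three indexing setups."""
    num_shingles = max(num_words - shingle_size + 1, 0)
    return {
        # Shingle filter alone: only the word n-grams are indexed.
        "shingles_only": num_shingles,
        # Shingles plus the single words in the same field.
        "with_unigrams": num_shingles + num_words,
        # Shingles in one field, single words in a second field.
        "separate_field": num_shingles + num_words,
    }


print(token_counts(1000))
```

Under these assumptions both ways of keeping the single words emit the same
number of tokens, roughly twice as many as shingles alone.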

I would suggest doing it the additional-field way, since this would
give you more flexibility in boosting the different fields.

Unfortunately, I haven't understood your explanation of the use case. But
it sounds a little bit like tagging?

Kind regards,
- Mitch

-- 
View this message in context: http://lucene.472066.n3.nabble.com/Doing-Shingle-but-also-keep-special-single-word-tp1241204p1276506.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: Doing Shingle but also keep special single word

Posted by Ahmet Arslan <io...@yahoo.com>.

> Won't setting outputUnigrams="true" make the
> index about twice as large as when it's set to false?

Sure, the index will be bigger. I didn't know that this was a problem for you. But if you have a list of special single words that you want to keep, KeepWordFilter can eliminate the other tokens, so the index size will be okay.


Re: Doing Shingle but also keep special single word

Posted by scott chu <sc...@udngroup.com>.

Won't setting outputUnigrams="true" make the index about twice as large as 
when it's set to false?

Scott

----- Original Message ----- 
From: "Ahmet Arslan" <io...@yahoo.com>
To: <so...@lucene.apache.org>
Sent: Saturday, August 21, 2010 1:15 AM
Subject: Re: Doing Shingle but also keep special single word



Re: Doing Shingle but also keep special single word

Posted by Ahmet Arslan <io...@yahoo.com>.

> I am building an index with the Shingle
> filter. We know it produces a minimum of 2-grams, but I also want to keep
> some special single words, e.g. IBM, Microsoft, etc. That is, I
> want a minimum of 2-grams but also want these
> single words in my index. Is it possible?

Doesn't the outputUnigrams="true" parameter work for you?

After that you can add <filter class="solr.KeepWordFilterFactory" words="keepwords.txt" ignoreCase="true"/>, with keepwords.txt containing IBM and Microsoft.