Posted to solr-user@lucene.apache.org by epnRui <ru...@hotmail.com> on 2014/02/27 12:09:59 UTC

Facets, termvectors, relevancy and Multi word tokenizing

Hi everyone!

I'm having a problem; I have searched but haven't found a solution yet,
and I am rather confused at the moment.

I have an application that stores human readable texts in my Solr index.
It finds the most relevant terms in each text (I think using term vectors
and facets) and stores those facet terms.

All works fine, but now I need the most relevant terms to include terms of
two or more words, like "European Union", which is quite a frequent term
in my system. At the moment the facets contain "European" and "Union" as
two separate terms.

So, questions are:
 - Is it possible to have facets of two or more words?
 - Can I tokenize a phrase into words, but when it comes across "European
Union", have it generate one token for "European Union" rather than two
tokens, "European" and "Union"?
 - Can termvectors be used to find relevancy of multi-word terms like
"European Union" ?
 - Can I use SynonymFilterFactory that would transform: "EU, UE, European
Union, Union Europeene" into "European Union" ?

At index time I have the following analyzer for the English language:

<fieldType name="text_en" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" words="blacklist.txt" ignoreCase="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="false"/>
    <filter class="solr.StopFilterFactory" words="en" ignoreCase="true"/>
    <filter class="solr.HunspellStemFilterFactory" dictionary="en_GB.dic" affix="en_GB.aff" ignoreCase="true"/>
  </analyzer>
</fieldType>


Thank you for the help!



--
View this message in context: http://lucene.472066.n3.nabble.com/Facets-termvectors-relevancy-and-Multi-word-tokenizing-tp4120101.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: Facets, termvectors, relevancy and Multi word tokenizing

Posted by epnRui <ru...@hotmail.com>.

Hi Iorixxx!

I have not optimized the index but the day after this post I saw I didn't
have this problem anymore.

I will follow your advice next time!

Now I'm avoiding so much manipulation at index time and doing more of the
work in the Java code on the client side.

If I had time I would implement a new tokenizer...




Re: Facets, termvectors, relevancy and Multi word tokenizing

Posted by Ahmet Arslan <io...@yahoo.com>.

Hi,

Please optimize your index (you can do it from the core admin GUI) and see if the problem goes away.
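
For reference, an optimize can also be requested without the GUI by posting an XML update message to the core's update handler. This is only a sketch: the host, port, and the core name "mycore" in the comment are assumptions, not details from this thread.

```xml
<!-- POST this body to http://localhost:8983/solr/mycore/update
     with Content-Type: text/xml. Host, port, and "mycore" are assumed. -->
<optimize waitSearcher="true"/>
```

Deleted documents are only marked as deleted until their segments are merged, so their terms can keep showing up in the facet list until then. A lighter alternative that may be enough is a commit message with expungeDeletes="true".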

Ahmet




Re: Facets, termvectors, relevancy and Multi word tokenizing

Posted by epnRui <ru...@hotmail.com>.

Hi guys!

I solved my problem on the client side but at least I solved it...

Anyway, now I have another problem, which is related to the following:

 - I had previously used char filters and token filters (replace chars and
replace patterns) at index time to replace "EP" with "European Parliament".
At that point, the facet_field count for "European Parliament" increased.
Now I have a big problem: I have already deleted the document that generated
the "European Parliament" term, and still that facet_field count does not go
down! Is there a way to either remove a facet_field entry or to subtract
from its count manually?

Thanks!




Re: Facets, termvectors, relevancy and Multi word tokenizing

Posted by epnRui <ru...@hotmail.com>.

Hi guys,

So, I keep facing this problem which I can't solve. I thought it was due to
HTML anchors containing the name of the hashtag, and thus repeating it, but
it's not.

So the use case is:
1 - I need to consider hashtags as tokens.
2 - The hashtag has to show up in the facets.

Right now if I index this text:
"Action, sanctions or diplomacy: which way forward for the  #EU
<http://twitter.com/search?q=%23EU>   &amp;  #Ukraine
<http://twitter.com/search?q=%23Ukraine>  ? Tell us  @LinkedIn
<http://twitter.com/LinkedIn>   debate  http://t.co/umf9olxH9f
<http://t.co/umf9olxH9f>  "

I get the tokens as follows (see image for more detail):
action	sanction	diplomacy	forward	#eu	#ukraine	tell	linkedin	debate
umf9olxh9f
ace	bate

<http://lucene.472066.n3.nabble.com/file/n4121389/solr.png> 

Then, if I look at the facets after indexing, I find that (for Ukraine)
the facet count has increased for both "Ukraine" and "#Ukraine",
instead of only for "#Ukraine".

Does anyone have any idea of why this is happening?




Re: Facets, termvectors, relevancy and Multi word tokenizing

Posted by epnRui <ru...@hotmail.com>.

Hi guys,

I'm on my way to solving it properly.

This is what my field type looks like now:


<fieldType name="text_en" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="(#)|(%23)" replacement="79f20724d6985c5b857d2fa06a3ff8c6"/>
    <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="(((?i)((european parliament)|(parlament europeenne)))|(EP)|(PE))" replacement="0ee062d61f44ae0a2aee145076ca6a69european_parliament"/>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory" words="blacklist.txt" ignoreCase="true"/>
    <filter class="solr.StopFilterFactory" words="en" ignoreCase="true"/>
    <filter class="solr.HunspellStemFilterFactory" dictionary="en_GB.dic" affix="en_GB.aff" ignoreCase="true"/>
    <filter class="solr.PatternReplaceFilterFactory" pattern="0ee062d61f44ae0a2aee145076ca6a69european_parliament" replacement="european parliament" replace="all"/>
    <filter class="solr.PatternReplaceFilterFactory" pattern="79f20724d6985c5b857d2fa06a3ff8c6" replacement="#" replace="all"/>
  </analyzer>
</fieldType>

I still have one case where I'm facing issues, because I want to
preserve the #:
 - "#European Parliament" is turned into one token instead of two,
"#European" and "Parliament". Anyway, I have some ideas on how to handle it.
I'll let you know what the final solution is.




Re: Facets, termvectors, relevancy and Multi word tokenizing

Posted by Ahmet Arslan <io...@yahoo.com>.

Hi,

Let's say you have accomplished what you want. You have a .txt with the tokens to merge, like "European" and "Parliament". What is your use case then? What is your high-level goal?

The MappingCharFilter approach is closer (to your .txt approach) than the PatternReplaceCharFilterFactory approach.

By the way, it could also be simulated with ShingleFilterFactory + KeepWordFilterFactory + TypeTokenFilterFactory
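
A rough sketch of how that combination might look for a separate facet field. The names "text_phrases" and "phrases.txt" are hypothetical, and the assumption is that phrases.txt lists one lowercased phrase per line (for example "european parliament").

```xml
<!-- Hypothetical field type for a dedicated facet field.
     "text_phrases" and "phrases.txt" are illustrative names only. -->
<fieldType name="text_phrases" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- Emit each pair of adjacent words as one token, e.g. "european parliament" -->
    <filter class="solr.ShingleFilterFactory" minShingleSize="2" maxShingleSize="2"
            outputUnigrams="false"/>
    <!-- Keep only the pairs that are listed in phrases.txt -->
    <filter class="solr.KeepWordFilterFactory" words="phrases.txt" ignoreCase="true"/>
  </analyzer>
</fieldType>
```

With outputUnigrams="false" only word pairs reach the keep-word filter, so this field would hold just the listed phrases. If single words are kept as well, TypeTokenFilterFactory could be used to treat the two kinds of token differently, since the shingle filter marks its output with the token type "shingle".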

Maybe it can be done by firing phrase queries at query time (without interfering with the index) on the client side, e.g. q="European Parliament"~0

On Friday, February 28, 2014 11:55 AM, epnRui <ru...@hotmail.com> wrote:
Hi Ahmet!!

I went ahead and did something I thought was not a clean solution, and
then when I read your post I found we had thought of the same solution,
including the European_Parliament with the _ :)

So I guess there is no way to do this more cleanly, other than maybe
implementing my own Tokenizer and Filters, but I honestly couldn't find a
tutorial for implementing a custom Solr Tokenizer. If I end up needing to
do it I will write a tutorial.

So for now I'm using PatternReplaceCharFilterFactory to replace "European
Parliament" with <MD5Hash>European_Parliament (initially I didn't use the
MD5 hash, only European_Parliament).

Then, after the StandardTokenizerFactory has run, I replace it back with
"European Parliament". Well, I guess I just found a way to produce a
two-word token :)

I had seen the ShingleFilterFactory, but the problem is that I don't need the
whole phrase split into two-word tokens, which is what I understood it does.
Of course I would need some filter that can handle a .txt with the tokens to
merge, like "European" and "Parliament".

I'm still having another problem now, but maybe I will find a solution after
I read the page you linked, which seems great. Solr is treating #European
as both #European and European, meaning it creates two facets for one token.
I want it to treat it only as #European. I ran the analyzer debugger in my
Solr admin console and I don't see how it can be doing that.
Would you know of a reason for this?

Thanks for your reply; the page you linked seems excellent and I'll read
it through.


Re: Facets, termvectors, relevancy and Multi word tokenizing

Posted by David Santamauro <da...@gmail.com>.

Have you tried just using a copyField? For example, I had a similar use
case where I needed to have a particular field (f1) tokenized but also
needed to facet on the complete contents.

For that, I created a copyField

   <copyField source="f1" dest="f2" />

f1 used tokenizers and filters but f2 was just a plain string. You then 
facet on f2
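
A minimal sketch of how the two fields might be declared in schema.xml. The names f1 and f2 follow the message above, but the field types and attributes are assumptions; "string" is assumed to be declared as solr.StrField, as in the example schema.

```xml
<!-- f1 is analyzed (tokenized and filtered) for searching -->
<field name="f1" type="text_en" indexed="true" stored="true"/>
<!-- f2 holds the same value as a single untokenized term, used only for faceting -->
<field name="f2" type="string" indexed="true" stored="false"/>
<copyField source="f1" dest="f2"/>
```

Because a string field indexes the whole value as one term, this fits best when f1 holds a short value such as a name or tag rather than a long body of text.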

... just an idea




Re: Facets, termvectors, relevancy and Multi word tokenizing

Posted by epnRui <ru...@hotmail.com>.

Hi Ahmet!!

I went ahead and did something I thought was not a clean solution, and
then when I read your post I found we had thought of the same solution,
including the European_Parliament with the _ :)

So for now I'm using PatternReplaceCharFilterFactory to replace "European
Parliament" with <MD5Hash>European_Parliament (initially I didn't use the
MD5 hash, only European_Parliament).

Then, after the StandardTokenizerFactory has run, I replace it back with
"European Parliament". Well, I guess I just found a way to produce a
two-word token :)

Thanks for your reply; the page you linked seems excellent and I'll read
it through.


Re: Facets, termvectors, relevancy and Multi word tokenizing

Posted by Ahmet Arslan <io...@yahoo.com>.


Hi epnRui,

I don't fully follow your e-mail (I think you need to describe your use case), but here are some answers:

- Is it possible to have facets of two or more words?

Yes. For example if you use ShingleFilterFactory at index time you will see two or more words in facets.
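
As a sketch, the filter could be added after the tokenizer in the index-time analyzer chain. The attribute values here are illustrative choices, not settings taken from this thread.

```xml
<!-- Emits single words plus every pair of adjacent words -->
<filter class="solr.ShingleFilterFactory" minShingleSize="2" maxShingleSize="2"
        outputUnigrams="true"/>
```

For the input "European Union summit" this would produce the single words as well as the pairs "European Union" and "Union summit", so every adjacent pair becomes a candidate facet term, not only the phrases of interest.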


- Can I tokenize a phrase into words, but when it comes across "European
Union", have it generate one token for "European Union" rather than two
tokens, "European" and "Union"?


Yes. For example, you can use MappingCharFilter (executed before the tokenizer) with this mapping:
"European Union" => "European_Union"
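
A sketch of how that mapping might be wired into the analyzer. The file name "mapping-phrases.txt" is hypothetical; it is assumed to sit in the core's conf directory and to contain the mapping line shown above.

```xml
<!-- mapping-phrases.txt (hypothetical name) would contain:
       "European Union" => "European_Union"
-->
<analyzer type="index">
  <!-- Char filters run on the raw text before the tokenizer sees it -->
  <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-phrases.txt"/>
  <tokenizer class="solr.StandardTokenizerFactory"/>
  <filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
```

The underscore is used because StandardTokenizer does not split on it, so "European_Union" should come through as a single token. The mapping is matched literally, so other spellings or casings would need their own lines in the file.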


Regarding synonym filter, please see : http://nolanlawson.com/2012/10/31/better-synonym-handling-in-solr/

Ahmet

