Posted to solr-user@lucene.apache.org by Wang Guangchen <gu...@gmail.com> on 2009/09/08 11:11:08 UTC

Re: SOLR-769 clustering

Hi Staszek,

I tried to apply the stoplabels following the instructions you gave in the
Solr clustering wiki, but it didn't work.

I am running the patched Solr on Tomcat, so to enable the stop labels I added
"-cp <dir-with-your-modified-stopwords>" to my system's CATALINA_OPTS. I
also tried changing the file name from stoplabels.txt to stoplabel.en, but
that didn't work either.

Then I also found that the Carrot2 manual page
(http://download.carrot2.org/head/manual/#section.advanced-topics.fine-tuning.stop-words)
suggests editing the stopwords files inside carrot2-core.jar. I tried this,
but it didn't work either.

I am not sure what is wrong with my setup. Could it be caused by some sort of
caching? Please help.
Thanks in advance.

-GC


On Fri, Apr 24, 2009 at 4:31 PM, Stanislaw Osinski <st...@gmail.com> wrote:

> >
> > How would we enable people via SOLR-769 to do this?
>
>
> Good point, Grant! To apply the modified stopwords.* and stoplabels.* files
> to Solr, simply make them available in the classpath. For the example Solr
> runner scripts that would be something like:
>
> java -cp <dir-with-your-modified-stopwords>
> -Dsolr.solr.home=./clustering/solr -jar start.jar
>
> I've documented the whole tuning procedure on the Wiki:
>
> http://wiki.apache.org/solr/ClusteringComponent
>
> Cheers,
>
> S.
>

Re: SOLR-769 clustering

Posted by Wang Guangchen <gu...@gmail.com>.
Hi Staszek,

Thank you very much for your advice. My problem has been solved. It was
caused by the regexp in stoplabels.en: I didn't realize that a regular
expression is required in order to filter out the words. I have added the
regexp to my stoplabels.en and it works like a charm.

-GC

On Wed, Sep 9, 2009 at 3:34 AM, Stanislaw Osinski <st...@gmail.com> wrote:

> Hi,
>
> It seems like the problem can be on two layers: 1) getting the right
> contents of stop* files for Carrot2, 2) making sure Solr picks up the
> changes.
>
> > I tried your quick and dirty hack too. It didn't work also. phase like
> > "Carbon Atoms in the Group" with "in" still appear in my clustering
> > labels.
> >
>
> Here most probably layer 1) applies: if you add "in" to stopwords, the
> Lingo algorithm (Carrot2's default) will still create labels with "in"
> inside, but will not create labels starting / ending in "in". If you'd
> like to eliminate "in" completely, you'd need to put an appropriate regexp
> in stoplabels.*.
>
> For more details, please see Carrot2 manual:
>
>
> http://download.carrot2.org/head/manual/#section.advanced-topics.fine-tuning.stop-words
>
> http://download.carrot2.org/head/manual/#section.advanced-topics.fine-tuning.stop-regexps
>
> The easiest way to tune the stopwords and see their impact on clusters is
> to use Carrot2 Document Clustering Workbench (see
> http://wiki.apache.org/solr/ClusteringComponent).
>
>
> > What i did is,
> >
> > 1. use "java uf carrot2-mini.jar stoplabels.en" command to replace the
> > stoplabel.en file.
> > 2. apply clustering patch. re-complie the solr with the new
> > carrot2-mini.jar.
> > 3. deploy the new apache-solr-1.4-dev.war to tomcat.
> >
>
> Once you make sure the changes to stopwords.* and stoplabels.* have the
> desired effect on clusters, the above procedure should do the trick. You
> can also put the modified files in WEB-INF/classes of the WAR, if that's
> any easier.
>
> For your reference, I've updated
> http://wiki.apache.org/solr/ClusteringComponent to contain a procedure
> working with the Jetty starter distributed in Solr's examples folder.
>
>
> > <searchComponent
> > class="org.apache.solr.handler.clustering.ClusteringComponent"
> > name="clustering">
> >  <lst name="engine">
> >    <str name="name">default</str>
> >    <str name="carrot.algorithm">org.carrot2.clustering.lingo.LingoClusteringAlgorithm</str>
> >    <str name="LingoClusteringAlgorithm.desiredClusterCountBase">20</str>
> >    <float name="carrot.lingo.threshold.clusterAssignment">0.150</float>
> >    <float name="carrot.lingo.threshold.candidateClusterThreshold">0.775</float>
> >
>
> Not really related to your issue, but the above file looks a little
> outdated -- the two parameters "carrot.lingo.threshold.clusterAssignment"
> and "carrot.lingo.threshold.candidateClusterThreshold" are not there anymore
> (but there are many others:
> http://download.carrot2.org/stable/manual/#section.component.lingo). For
> most up to date examples, please see
> http://wiki.apache.org/solr/ClusteringComponent and solrconfig.xml in
> contrib\clustering\example\conf.
>
> Cheers,
>
> Staszek
>

Re: SOLR-769 clustering

Posted by Stanislaw Osinski <st...@gmail.com>.
Hi,

It seems like the problem can be on two layers: 1) getting the right
contents of stop* files for Carrot2, 2) making sure Solr picks up the
changes.

> I tried your quick and dirty hack too. It didn't work also. phase like
> "Carbon Atoms in the Group" with "in" still appear in my clustering labels.
>

Here most probably layer 1) applies: if you add "in" to stopwords, the Lingo
algorithm (Carrot2's default) will still create labels with "in" inside, but
will not create labels starting / ending in "in". If you'd like to eliminate
"in" completely, you'd need to put an appropriate regexp in stoplabels.*.

For more details, please see Carrot2 manual:

http://download.carrot2.org/head/manual/#section.advanced-topics.fine-tuning.stop-words
http://download.carrot2.org/head/manual/#section.advanced-topics.fine-tuning.stop-regexps
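
To illustrate why a plain word in stoplabels.* may not be enough, here is a
minimal sketch in plain Java. It is not Carrot2's actual code. It assumes that
each stoplabels line is a Java regular expression and that a label is dropped
only when the whole label matches; the class name, method name and the pattern
(?i).*\bin\b.* are made up for this illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

public class StopLabelSketch {
    // Hypothetical stoplabels.en-style entry, written as a Java string.
    // In the text file the line would read: (?i).*\bin\b.*
    public static final List<Pattern> STOP_LABELS =
            Arrays.asList(Pattern.compile("(?i).*\\bin\\b.*"));

    /**
     * Returns the labels that do not fully match any stop-label pattern.
     * Assumption: whole-label matching, hence matches() and not find().
     */
    public static List<String> filter(List<String> labels, List<Pattern> stopLabels) {
        List<String> kept = new ArrayList<>();
        for (String label : labels) {
            boolean stopped = false;
            for (Pattern p : stopLabels) {
                if (p.matcher(label).matches()) {
                    stopped = true;
                    break;
                }
            }
            if (!stopped) {
                kept.add(label);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> labels = Arrays.asList(
                "Carbon Atoms in the Group", "Carbon Atoms", "Indium Compounds");

        // A bare "in" entry only matches a label that is exactly "in",
        // so all three labels survive.
        System.out.println(filter(labels, Arrays.asList(Pattern.compile("in"))));

        // The wildcard pattern drops every label containing the word "in".
        System.out.println(filter(labels, STOP_LABELS));
    }
}
```

Under these assumptions the word boundaries (\b) keep labels such as
"Indium Compounds", where "in" is only part of a longer word.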

The easiest way to tune the stopwords and see their impact on clusters is to
use Carrot2 Document Clustering Workbench (see
http://wiki.apache.org/solr/ClusteringComponent).


> What i did is,
>
> 1. use "java uf carrot2-mini.jar stoplabels.en" command to replace the
> stoplabel.en file.
> 2. apply clustering patch. re-complie the solr with the new
> carrot2-mini.jar.
> 3. deploy the new apache-solr-1.4-dev.war to tomcat.
>

Once you make sure the changes to stopwords.* and stoplabels.* have the
desired effect on clusters, the above procedure should do the trick. You can
also put the modified files in WEB-INF/classes of the WAR, if that's any
easier.
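
For layer 2), one way to check which copy of a stop file is actually visible
is to ask the classloader directly. The sketch below is only an illustration:
the class name is hypothetical, and it assumes the files sit at the root of
the classpath under names like stoplabels.en. It uses only standard JDK calls.

```java
import java.net.URL;

public class StopFileLocator {
    /**
     * Returns the URL of the first classpath resource with the given name,
     * or null if no such resource is visible to the current classloader.
     */
    public static URL locate(String resourceName) {
        ClassLoader cl = Thread.currentThread().getContextClassLoader();
        if (cl == null) {
            // Fall back to the system classloader if no context loader is set.
            cl = ClassLoader.getSystemClassLoader();
        }
        return cl.getResource(resourceName);
    }

    public static void main(String[] args) {
        // Resource name is an assumption; pass another name as the first argument.
        String name = args.length > 0 ? args[0] : "stoplabels.en";
        URL url = locate(name);
        System.out.println(url == null
                ? name + " not found on classpath"
                : name + " -> " + url);
    }
}
```

The printed URL shows whether the file comes from a JAR or from a directory.
Inside a servlet container the webapp has its own classloader, so the check
would have to run inside the webapp (for example from a small JSP) to reflect
what Solr sees.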

For your reference, I've updated
http://wiki.apache.org/solr/ClusteringComponent to contain a procedure
working with the Jetty starter distributed in Solr's examples folder.


> <searchComponent
> class="org.apache.solr.handler.clustering.ClusteringComponent"
> name="clustering">
>  <lst name="engine">
>    <str name="name">default</str>
>    <str name="carrot.algorithm">org.carrot2.clustering.lingo.LingoClusteringAlgorithm</str>
>    <str name="LingoClusteringAlgorithm.desiredClusterCountBase">20</str>
>    <float name="carrot.lingo.threshold.clusterAssignment">0.150</float>
>    <float name="carrot.lingo.threshold.candidateClusterThreshold">0.775</float>
>

Not really related to your issue, but the above file looks a little outdated
-- the two parameters "carrot.lingo.threshold.clusterAssignment" and
"carrot.lingo.threshold.candidateClusterThreshold" are not there anymore
(but there are many others:
http://download.carrot2.org/stable/manual/#section.component.lingo). For the
most up-to-date examples, please see
http://wiki.apache.org/solr/ClusteringComponent and solrconfig.xml in
contrib\clustering\example\conf.

Cheers,

Staszek

Re: SOLR-769 clustering

Posted by Wang Guangchen <gu...@gmail.com>.
Hi Staszek,

I tried your quick and dirty hack too, but it didn't work either. Phrases like
"Carbon Atoms in the Group" with "in" still appear in my clustering labels.

What I did is:

1. Used the "jar uf carrot2-mini.jar stoplabels.en" command to replace the
stoplabels.en file.
2. Applied the clustering patch and recompiled Solr with the new
carrot2-mini.jar.
3. Deployed the new apache-solr-1.4-dev.war to Tomcat.

I am using the nightly build of Solr.

The following is the clustering setting in my solrconfig.xml, which is pretty
standard:

<lst name="defaults">
  <str name="echoParams">explicit</str>
  <str name="clustering.engine">default</str>
  <bool name="clustering.results">true</bool>
  <str name="carrot.title">name</str>
  <str name="carrot.snippet">abstract</str>
  <str name="carrot.url">id</str>
  <bool name="carrot.produceSummary">true</bool>
  <bool name="carrot.outputSubClusters">false</bool>
</lst>

<searchComponent
    class="org.apache.solr.handler.clustering.ClusteringComponent"
    name="clustering">
  <lst name="engine">
    <str name="name">default</str>
    <str name="carrot.algorithm">org.carrot2.clustering.lingo.LingoClusteringAlgorithm</str>
    <str name="LingoClusteringAlgorithm.desiredClusterCountBase">20</str>
    <float name="carrot.lingo.threshold.clusterAssignment">0.150</float>
    <float name="carrot.lingo.threshold.candidateClusterThreshold">0.775</float>
  </lst>
</searchComponent>


I am wondering whether there is any extra setting that I need to configure in
my solrconfig.xml or schema.xml, or any special parameter that I need to
enable in solrconfig.xml?

Thanks

-GC



On Tue, Sep 8, 2009 at 11:04 PM, Stanislaw Osinski <st...@gmail.com> wrote:

> Hi there,
>
> > I try to apply the stoplabels with the instructions that you given in the
> > solr clustering Wiki. But it didn't work.
> >
> > I am runing the patched solr on tomcat. So to enable the stop label. I
> > add "-cp <dir-with-your-modified-stopwords>" in to my system's
> > CATALINA_OPTS. I tried to change the file name from stoplabels.txt to
> > stoplabel.en also . It didn't work too.
> >
> > Then I also find out that in carrot manual page
> > (http://download.carrot2.org/head/manual/#section.advanced-topics.fine-tuning.stop-words).
> > It suggested to edit the stopwords files inside the carrot2-core.jar. I
> > tried this but it didn't work too.
> >
> > I am not sure what is wrong with my set up. will it be caused by any sort
> > of caching?
> >
>
> A quick and dirty hack would be to simply replace the corresponding files
> (stoplabels.*) in carrot2-mini.jar.
>
> I know the packaging of the clustering contrib has changed a bit, so let me
> see how it currently works and correct the wiki if needed.
>
> Thanks,
>
> Staszek
>

Re: SOLR-769 clustering

Posted by Stanislaw Osinski <st...@gmail.com>.
Hi there,

> I try to apply the stoplabels with the instructions that you given in the
> solr clustering Wiki. But it didn't work.
>
> I am runing the patched solr on tomcat. So to enable the stop label. I add
> "-cp <dir-with-your-modified-stopwords>" in to my system's CATALINA_OPTS. I
> tried to change the file name from stoplabels.txt to stoplabel.en also . It
> didn't work too.
>
> Then I also find out that in carrot manual page
> (http://download.carrot2.org/head/manual/#section.advanced-topics.fine-tuning.stop-words).
> It suggested to edit the stopwords files inside the carrot2-core.jar. I
> tried this but it didn't work too.
>
> I am not sure what is wrong with my set up. will it be caused by any sort
> of caching?
>

A quick and dirty hack would be to simply replace the corresponding files
(stoplabels.*) in carrot2-mini.jar.

I know the packaging of the clustering contrib has changed a bit, so let me
see how it currently works and correct the wiki if needed.

Thanks,

Staszek

Re: SOLR-769 clustering

Posted by Wang Guangchen <gu...@gmail.com>.
On Tue, Sep 8, 2009 at 9:56 PM, Grant Ingersoll <gs...@apache.org> wrote:

>
> On Sep 8, 2009, at 5:11 AM, Wang Guangchen wrote:
>
>  Hi Staszek,
>>
>> I try to apply the stoplabels with the instructions that you given in the
>> solr clustering Wiki. But it didn't work.
>>
>> I am runing the patched solr on tomcat. So to enable the stop label. I add
>> "-cp <dir-with-your-modified-stopwords>" in to my system's CATALINA_OPTS.
>> I
>> tried to change the file name from stoplabels.txt to stoplabel.en also .
>> It
>> didn't work too.
>>
>
>
> Does it work if you add them to the Solr Home lib directory, which is where
> the other clustering files get loaded from?  I haven't tried it.
>
Hi,
Thanks for your suggestion, but when I put the stoplabels.en file into the
Solr home lib directory, it didn't work either. I tried both Solr's lib
directory and "../webapp/solr/WEB-INF/lib/".



>
>
>> Then I also find out that in carrot manual page
>> (
>>
>> http://download.carrot2.org/head/manual/#section.advanced-topics.fine-tuning.stop-words
>> ).
>> It suggested to edit the stopwords files inside the carrot2-core.jar. I
>> tried this but it didn't work too.
>>
>> I am not sure what is wrong with my set up. will it be caused by any sort
>> of
>> caching? Please help.
>> Thanks in advance.
>>
>> -GC
>>
>>
>> On Fri, Apr 24, 2009 at 4:31 PM, Stanislaw Osinski <stachoo@gmail.com
>> >wrote:
>>
>>
>>>> How would we enable people via SOLR-769 to do this?
>>>>
>>>
>>>
>>> Good point, Grant! To apply the modified stopwords.* and stoplabels.*
>>> files
>>> to Solr, simply make them available in the classpath. For the example
>>> Solr
>>> runner scripts that would be something like:
>>>
>>> java -cp <dir-with-your-modified-stopwords>
>>> -Dsolr.solr.home=./clustering/solr -jar start.jar
>>>
>>> I've documented the whole tuning procedure on the Wiki:
>>>
>>> http://wiki.apache.org/solr/ClusteringComponent
>>>
>>> Cheers,
>>>
>>> S.
>>>
>>>
> --------------------------
> Grant Ingersoll
> http://www.lucidimagination.com/
>
> Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids) using
> Solr/Lucene:
> http://www.lucidimagination.com/search
>
>

Re: SOLR-769 clustering

Posted by Grant Ingersoll <gs...@apache.org>.
On Sep 8, 2009, at 5:11 AM, Wang Guangchen wrote:

> Hi Staszek,
>
> I try to apply the stoplabels with the instructions that you given in the
> solr clustering Wiki. But it didn't work.
>
> I am runing the patched solr on tomcat. So to enable the stop label. I add
> "-cp <dir-with-your-modified-stopwords>" in to my system's CATALINA_OPTS. I
> tried to change the file name from stoplabels.txt to stoplabel.en also . It
> didn't work too.


Does it work if you add them to the Solr Home lib directory, which is  
where the other clustering files get loaded from?  I haven't tried it.


>
> Then I also find out that in carrot manual page
> (http://download.carrot2.org/head/manual/#section.advanced-topics.fine-tuning.stop-words).
> It suggested to edit the stopwords files inside the carrot2-core.jar. I
> tried this but it didn't work too.
>
> I am not sure what is wrong with my set up. will it be caused by any sort of
> caching? Please help.
> Thanks in advance.
>
> -GC
>
>
> On Fri, Apr 24, 2009 at 4:31 PM, Stanislaw Osinski  
> <st...@gmail.com>wrote:
>
>>>
>>> How would we enable people via SOLR-769 to do this?
>>
>>
>> Good point, Grant! To apply the modified stopwords.* and stoplabels.* files
>> to Solr, simply make them available in the classpath. For the example Solr
>> runner scripts that would be something like:
>>
>> java -cp <dir-with-your-modified-stopwords>
>> -Dsolr.solr.home=./clustering/solr -jar start.jar
>>
>> I've documented the whole tuning procedure on the Wiki:
>>
>> http://wiki.apache.org/solr/ClusteringComponent
>>
>> Cheers,
>>
>> S.
>>

--------------------------
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids)  
using Solr/Lucene:
http://www.lucidimagination.com/search