Posted to solr-user@lucene.apache.org by Wang Guangchen <gu...@gmail.com> on 2009/09/08 11:11:08 UTC
Re: SOLR-769 clustering
Hi Staszek,
I tried to apply the stoplabels following the instructions you gave on the
Solr clustering Wiki, but it didn't work.
I am running the patched Solr on Tomcat, so to enable the stop labels I added
"-cp <dir-with-your-modified-stopwords>" to my system's CATALINA_OPTS. I
also tried changing the file name from stoplabels.txt to stoplabel.en. That
didn't work either.
Then I also found that the Carrot2 manual page
(http://download.carrot2.org/head/manual/#section.advanced-topics.fine-tuning.stop-words)
suggests editing the stopwords files inside carrot2-core.jar. I tried this,
but it didn't work either.
I am not sure what is wrong with my setup. Could it be caused by some sort
of caching? Please help.
Thanks in advance.
-GC
On Fri, Apr 24, 2009 at 4:31 PM, Stanislaw Osinski <st...@gmail.com> wrote:
> >
> > How would we enable people via SOLR-769 to do this?
>
>
> Good point, Grant! To apply the modified stopwords.* and stoplabels.* files
> to Solr, simply make them available in the classpath. For the example Solr
> runner scripts that would be something like:
>
> java -cp <dir-with-your-modified-stopwords>
> -Dsolr.solr.home=./clustering/solr -jar start.jar
>
> I've documented the whole tuning procedure on the Wiki:
>
> http://wiki.apache.org/solr/ClusteringComponent
>
> Cheers,
>
> S.
>
Re: SOLR-769 clustering
Posted by Wang Guangchen <gu...@gmail.com>.
Hi Staszek,
Thank you very much for your advice. My problem has been solved. It was
caused by the regexp in stoplabels.en: I didn't realize that a regular
expression is required in order to filter out the words. I have added the
regexp to my stoplabels.en and it works like a charm.
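For readers who hit the same problem, the following is a minimal sketch of why a bare word in a stoplabels file may have no visible effect. It assumes each non-blank line is a Java regular expression that must match the whole label, and that lines starting with '#' are comments. The class and method names are hypothetical, and this is a simplified model, not Carrot2's actual loader.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

/** Simplified model of a stoplabels-style file; NOT Carrot2's real implementation. */
final class StopLabels {

    /** Compiles one regex per non-blank line; '#' lines are skipped (assumption). */
    static List<Pattern> parse(String fileContents) {
        List<Pattern> patterns = new ArrayList<>();
        for (String line : fileContents.split("\\R")) {
            String trimmed = line.trim();
            if (trimmed.isEmpty() || trimmed.startsWith("#")) {
                continue;
            }
            patterns.add(Pattern.compile(trimmed));
        }
        return patterns;
    }

    /** A label is suppressed only if some pattern matches the entire label (assumption). */
    static boolean isStopLabel(String label, List<Pattern> patterns) {
        for (Pattern p : patterns) {
            if (p.matcher(label).matches()) {
                return true;
            }
        }
        return false;
    }
}
```

Under these assumptions, a file containing only the line `in` would suppress the label "in" and nothing else, while a line such as `(?i).*\bin\b.*` would also suppress "Carbon Atoms in the Group".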
-GC
On Wed, Sep 9, 2009 at 3:34 AM, Stanislaw Osinski <st...@gmail.com> wrote:
Re: SOLR-769 clustering
Posted by Stanislaw Osinski <st...@gmail.com>.
Hi,
It seems like the problem can be on two layers: 1) getting the right
contents of stop* files for Carrot2, 2) making sure Solr picks up the
changes.
> I tried your quick and dirty hack too, but it didn't work either. Phrases
> like "Carbon Atoms in the Group" with "in" still appear in my clustering
> labels.
>
Here most probably layer 1) applies: if you add "in" to stopwords, the Lingo
algorithm (Carrot2's default) will still create labels with "in" inside, but
will not create labels starting / ending in "in". If you'd like to eliminate
"in" completely, you'd need to put an appropriate regexp in stoplabels.*.
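As an illustration of the stopwords rule described above, here is a small sketch. It models only the stated behaviour (a label is rejected when it starts or ends with a stop word, but not when the stop word sits inside it). The class and method names are hypothetical and the stop words are assumed to be lower-case; this is not how Lingo is actually implemented.

```java
import java.util.Locale;
import java.util.Set;

/** Toy model of the "no stop word at the label edges" rule; not Carrot2 code. */
final class StopWordEdges {

    /** Returns true if the first or last word of the label is a stop word. */
    static boolean startsOrEndsWithStopWord(String label, Set<String> stopWords) {
        String[] words = label.trim().toLowerCase(Locale.ROOT).split("\\s+");
        if (words.length == 0 || words[0].isEmpty()) {
            return false; // empty label: nothing to check
        }
        return stopWords.contains(words[0])
                || stopWords.contains(words[words.length - 1]);
    }
}
```

With "in" as the only stop word, "In the Group" would be rejected by this rule, while "Carbon Atoms in the Group" would pass, which is consistent with the label still showing up after editing only stopwords.*.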
For more details, please see the Carrot2 manual:
http://download.carrot2.org/head/manual/#section.advanced-topics.fine-tuning.stop-words
http://download.carrot2.org/head/manual/#section.advanced-topics.fine-tuning.stop-regexps
The easiest way to tune the stopwords and see their impact on clusters is to
use the Carrot2 Document Clustering Workbench (see
http://wiki.apache.org/solr/ClusteringComponent).
> What I did was:
>
> 1. Used the "jar uf carrot2-mini.jar stoplabels.en" command to replace the
> stoplabels.en file.
> 2. Applied the clustering patch and recompiled Solr with the new
> carrot2-mini.jar.
> 3. Deployed the new apache-solr-1.4-dev.war to Tomcat.
>
Once you make sure the changes to stopwords.* and stoplabels.* have the
desired effect on clusters, the above procedure should do the trick. You can
also put the modified files in WEB-INF/classes of the WAR, if that's any
easier.
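To check the second layer (whether the running application can see the modified files at all), a quick classpath probe can help. The sketch below uses only the standard ClassLoader.getResource call. The class name is hypothetical, and the exact resource name Carrot2 looks up (for example a bare "stoplabels.en" versus a name with a path prefix) is an assumption to verify against your version.

```java
import java.net.URL;

/** Small diagnostic helper; the resource names passed in are up to the caller. */
final class ClasspathProbe {

    /** Returns where the current class loader finds the resource, or null if it is not visible. */
    static URL locate(String resourceName) {
        ClassLoader loader = Thread.currentThread().getContextClassLoader();
        if (loader == null) {
            loader = ClassLoader.getSystemClassLoader();
        }
        return loader.getResource(resourceName);
    }
}
```

Called from inside the web application (so that the webapp's class loader is used), for example as `ClasspathProbe.locate("stoplabels.en")`, the returned URL shows whether the copy in WEB-INF/classes or the one packed in a JAR is the one being picked up; a null result means the file is not on the classpath under that name.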
For your reference, I've updated
http://wiki.apache.org/solr/ClusteringComponent to contain a procedure
working with the Jetty starter distributed in Solr's examples folder.
> <searchComponent
>     class="org.apache.solr.handler.clustering.ClusteringComponent"
>     name="clustering">
>   <lst name="engine">
>     <str name="name">default</str>
>     <str name="carrot.algorithm">org.carrot2.clustering.lingo.LingoClusteringAlgorithm</str>
>     <str name="LingoClusteringAlgorithm.desiredClusterCountBase">20</str>
>     <float name="carrot.lingo.threshold.clusterAssignment">0.150</float>
>     <float name="carrot.lingo.threshold.candidateClusterThreshold">0.775</float>
>
Not really related to your issue, but the above file looks a little outdated:
the two parameters "carrot.lingo.threshold.clusterAssignment" and
"carrot.lingo.threshold.candidateClusterThreshold" are not there anymore
(but there are many others:
http://download.carrot2.org/stable/manual/#section.component.lingo). For
the most up-to-date examples, please see
http://wiki.apache.org/solr/ClusteringComponent and solrconfig.xml in
contrib\clustering\example\conf.
Cheers,
Staszek
Re: SOLR-769 clustering
Posted by Wang Guangchen <gu...@gmail.com>.
Hi Staszek,
I tried your quick and dirty hack too, but it didn't work either. Phrases like
"Carbon Atoms in the Group" with "in" still appear in my clustering labels.
What I did was:
1. Used the "jar uf carrot2-mini.jar stoplabels.en" command to replace the
stoplabels.en file.
2. Applied the clustering patch and recompiled Solr with the new
carrot2-mini.jar.
3. Deployed the new apache-solr-1.4-dev.war to Tomcat.
I am using the nightly build version of Solr.
The following is the clustering setting in solrconfig.xml, pretty standard:
<lst name="defaults">
  <str name="echoParams">explicit</str>
  <str name="clustering.engine">default</str>
  <bool name="clustering.results">true</bool>
  <str name="carrot.title">name</str>
  <str name="carrot.snippet">abstract</str>
  <str name="carrot.url">id</str>
  <bool name="carrot.produceSummary">true</bool>
  <bool name="carrot.outputSubClusters">false</bool>
</lst>
<searchComponent
    class="org.apache.solr.handler.clustering.ClusteringComponent"
    name="clustering">
  <lst name="engine">
    <str name="name">default</str>
    <str name="carrot.algorithm">org.carrot2.clustering.lingo.LingoClusteringAlgorithm</str>
    <str name="LingoClusteringAlgorithm.desiredClusterCountBase">20</str>
    <float name="carrot.lingo.threshold.clusterAssignment">0.150</float>
    <float name="carrot.lingo.threshold.candidateClusterThreshold">0.775</float>
  </lst>
</searchComponent>
I am wondering whether there is any extra setting that I need to configure
in my solrconfig.xml or schema.xml, or any special parameters that I need to
enable in solrconfig.xml.
Thanks,
-GC
On Tue, Sep 8, 2009 at 11:04 PM, Stanislaw Osinski <st...@gmail.com> wrote:
Re: SOLR-769 clustering
Posted by Stanislaw Osinski <st...@gmail.com>.
Hi there,
> I tried to apply the stoplabels following the instructions you gave on the
> Solr clustering Wiki, but it didn't work.
>
> I am running the patched Solr on Tomcat, so to enable the stop labels I added
> "-cp <dir-with-your-modified-stopwords>" to my system's CATALINA_OPTS. I
> also tried changing the file name from stoplabels.txt to stoplabel.en. That
> didn't work either.
>
> Then I also found that the Carrot2 manual page
> (http://download.carrot2.org/head/manual/#section.advanced-topics.fine-tuning.stop-words)
> suggests editing the stopwords files inside carrot2-core.jar. I tried this,
> but it didn't work either.
>
> I am not sure what is wrong with my setup. Could it be caused by some sort
> of caching?
>
A quick and dirty hack would be to simply replace the corresponding files
(stoplabels.*) in carrot2-mini.jar.
I know the packaging of the clustering contrib has changed a bit, so let me
see how it currently works and correct the wiki if needed.
Thanks,
Staszek
Re: SOLR-769 clustering
Posted by Wang Guangchen <gu...@gmail.com>.
On Tue, Sep 8, 2009 at 9:56 PM, Grant Ingersoll <gs...@apache.org> wrote:
> Does it work if you add them to the Solr Home lib directory, which is where
> the other clustering files get loaded from? I haven't tried it.
>
Hi,
Thanks for your suggestion, but I put the stoplabels.en file into the Solr
home lib directory and it didn't work either. I tried both Solr's lib
directory and "../webapp/solr/WEB-INF/lib/".
Re: SOLR-769 clustering
Posted by Grant Ingersoll <gs...@apache.org>.
On Sep 8, 2009, at 5:11 AM, Wang Guangchen wrote:
> Hi Staszek,
>
> I tried to apply the stoplabels following the instructions you gave on the
> Solr clustering Wiki, but it didn't work.
>
> I am running the patched Solr on Tomcat, so to enable the stop labels I added
> "-cp <dir-with-your-modified-stopwords>" to my system's CATALINA_OPTS. I
> also tried changing the file name from stoplabels.txt to stoplabel.en. That
> didn't work either.
Does it work if you add them to the Solr Home lib directory, which is
where the other clustering files get loaded from? I haven't tried it.
--------------------------
Grant Ingersoll
http://www.lucidimagination.com/
Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids)
using Solr/Lucene:
http://www.lucidimagination.com/search