Posted to user@mahout.apache.org by Bogdan Vatkov <bo...@gmail.com> on 2010/01/02 13:30:56 UTC

Stopwords work for Solr but not for Mahout

Hi,

I am using the standard solr.StopFilterFactory with my own stopwords.txt
file, which seems to work fine for Solr itself - e.g., once I define a
stopword, queries containing it no longer match.
But when I push Solr content to Mahout with the Lucene Driver I get all the
words in the dictionary and clusters - even the words that are supposed to
be stopped.
Any idea how to stop these words in Mahout?

Best regards,
Bogdan

Re: Stopwords work for Solr but not for Mahout

Posted by Grant Ingersoll <gs...@apache.org>.
OK, I just committed a change to the ClusterDumper so that it can now use a dictionary of terms to print out the values in the cells of the centroid vector.  It does this by default instead of using the vector.asFormatString() capability.  If you want the old functionality, pass in -j

I also applied the same functionality to VectorDumper.

Let me know if that helps.
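
For reference, the two modes would look something like this against the paths used later in this thread (a hypothetical invocation; exact option names on trunk may differ):

> java -cp "*" org.apache.mahout.utils.clustering.ClusterDumper --seqFileDir ./solr-clust-n2/out/clusters-2 --dictionary ./solr-clust-n2/dictionary.txt
> java -cp "*" org.apache.mahout.utils.clustering.ClusterDumper --seqFileDir ./solr-clust-n2/out/clusters-2 -j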


On Jan 2, 2010, at 3:11 PM, Drew Farris wrote:

> I've managed to get k-means clustering working, but I agree it would be very
> nice to have an end-to-end example that would allow others to get up to
> speed quickly. I think the largest holes here are related to the vacuum of a
> corpus of text into the Lucene index and the presentation of a
> human-readable display of the results. It might be interesting to also
> calculate and include some metrics such as the F-measure (in cases where we
> have a reference categorization) and scatter score (in cases where we
> don't).
> 
> The existing LDA example would be a useful starting point. It slurps
> in the Reuters-21578
> corpus <http://kdd.ics.uci.edu/databases/reuters21578/reuters21578.html>,
> converts it to text, loads it into a Lucene index, extracts vectors from the
> lucene index and runs LDA upon them.
> 
> This example uses the lucene benchmark utilities for the input to text
> conversion and lucene loading. The benchmark utilities code is readable but
> complex. It would be very nice to have a simple piece of code to handle the
> creation of the Lucene index that others can easily build upon and adapt
> to their existing corpus.
> 
> On Sat, Jan 2, 2010 at 2:10 PM, Benson Margulies <bi...@gmail.com>
> wrote:
>> As someone who tried, not hard enough, and failed, to assemble all
>> these bits in a row, I can only say that the situation cries out for
>> an end-to-end sample. I'd be willing to help lick it into shape to be
>> checked-in as such. My idea is that it should set up to vacuum-cleaner
>> up a corpus of text, push it through Lucene, pull it out as vectors,
>> tickle the pig hadoop, and deliver actual doc paths arranged by
>> cluster.
>> 

--------------------------
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem using Solr/Lucene: http://www.lucidimagination.com/search


Re: Stopwords work for Solr but not for Mahout

Posted by Grant Ingersoll <gs...@apache.org>.
I should note that I am still validating the quality of the results, and that the DIH stuff is just a sample of all the feeds I'm using.

On Jan 2, 2010, at 3:56 PM, Grant Ingersoll wrote:

> 
> On Jan 2, 2010, at 3:11 PM, Drew Farris wrote:
> 
>> I've managed to get k-means clustering working, but I agree it would be very
>> nice to have an end-to-end example that would allow others to get up to
>> speed quickly. I think the largest holes here are related to the vacuum of a
>> corpus of text into the Lucene index and the presentation of a
>> human-readable display of the results. It might be interesting to also
>> calculate and include some metrics such as the F-measure (in cases where we
>> have a reference categorization) and scatter score (in cases where we
>> don't).
>> 
>> The existing LDA example would be a useful starting point. It slurps
>> in the Reuters-21578
>> corpus <http://kdd.ics.uci.edu/databases/reuters21578/reuters21578.html>,
>> converts it to text, loads it into a Lucene index, extracts vectors from the
>> lucene index and runs LDA upon them.
>> 
>> This example uses the lucene benchmark utilities for the input to text
>> conversion and lucene loading. The benchmark utilities code is readable but
>> complex. It would be very nice to have a simple piece of code to handle the
>> creation of the Lucene index that others can easily build upon and adapt
>> to their existing corpus.
>> 
> 
> 
> +1.
> 
> I've also got this working for a bunch of RSS feeds using Solr's DataImportHandler and the following commands:
> 
> In Solr, I setup the DataImportHandler with something like:
> <dataConfig>
> 
> <dataSource name="rss" type="HttpDataSource" encoding="UTF-8"/>
> 	<document>
>   <!-- New York Times Sports feed -->
> 		<entity name="nytSportsFeed"
> 				pk="link"
> 				url="http://feeds1.nytimes.com/nyt/rss/Sports"
> 				processor="XPathEntityProcessor"
> 				forEach="/rss/channel | /rss/channel/item"
>           dataSource="rss"
>       transformer="RegexTransformer,DateFormatTransformer">
> 			<field column="source" xpath="/rss/channel/title" commonField="true" />
> 			<field column="source-link" xpath="/rss/channel/link" commonField="true" />
> 			<field column="title" xpath="/rss/channel/item/title" />
> 			<field column="id" xpath="/rss/channel/item/guid" />
> 			<field column="link" xpath="/rss/channel/item/link" />
>     <!-- Use the RegexTransformer to strip out ads -->
> 			<field column="description" xpath="/rss/channel/item/description" regex="&lt;a.*?&lt;/a&gt;" replaceWith=""/>
> 			<field column="category" xpath="/rss/channel/item/category" />
>     <!-- 'Sun, 18 May 2008 11:23:11 +0000' -->
>     <field column="pubDate" xpath="/rss/channel/item/pubDate" dateTimeFormat="EEE, dd MMM yyyy HH:mm:ss Z" />
>   </entity>
>   <entity name="nytWorld"
> 				pk="link"
> 				url="http://feeds.nytimes.com/nyt/rss/World"
> 				processor="XPathEntityProcessor"
> 				forEach="/rss/channel | /rss/channel/item"
>           dataSource="rss"
>       transformer="RegexTransformer,DateFormatTransformer">
> 			<field column="source" xpath="/rss/channel/title" commonField="true" />
> 			<field column="source-link" xpath="/rss/channel/link" commonField="true" />
> 			<field column="title" xpath="/rss/channel/item/title" />
> 			<field column="id" xpath="/rss/channel/item/guid" />
> 			<field column="link" xpath="/rss/channel/item/link" />
>     <!-- Use the RegexTransformer to strip out ads -->
> 			<field column="description" xpath="/rss/channel/item/description" regex="&lt;a.*?&lt;/a&gt;" replaceWith=""/>
> 			<field column="category" xpath="/rss/channel/item/category" />
>     <!-- 'Sun, 18 May 2008 11:23:11 +0000' -->
>     <field column="pubDate" xpath="/rss/channel/item/pubDate" dateTimeFormat="EEE, dd MMM yyyy HH:mm:ss Z" />
>   </entity>
> 
> 	</document>
> </dataConfig>
> 
> Then in my browser: http://localhost:8983/solr/dataimport?command=full-import&clean=true
> 
> Then on the command line in Mahout home:
>> mvn dependency:copy-dependencies
>> cd target/dependency
>> java -cp "*" org.apache.mahout.utils.vectors.lucene.Driver --dir [path to index]/data/index/ --output ./solr-clust-n2/part-out.vec --field desc-clustering --idField id --dictOut ./solr-clust-n2/dictionary.txt --norm 2
>> java -Xmx1024M -cp "*" org.apache.mahout.clustering.kmeans.KMeansDriver --input ./solr-clust-n2/part-out.vec --clusters ./solr-clust-n2/out/clusters  --output ./solr-clust-n2/out/ --distance org.apache.mahout.common.distance.CosineDistanceMeasure --convergence 0.001 --overwrite --k 25
>> java -Xmx1024M -cp "*" org.apache.mahout.utils.clustering.ClusterDumper --seqFileDir ./solr-clust-n2/out/clusters-2  --dictionary ./solr-clust-n2/dictionary.txt  --substring 100 --pointsDir ./solr-clust-n2/out/points/
> or:
>> java -Xmx1024M -cp "*" org.apache.mahout.utils.vectors.lucene.ClusterLabels --dir [path to index]/data/index/ --field description --idField id --seqFileDir ./solr-clust-n2/out/clusters-2  --pointsDir ./solr-clust-n2/out/points/ --minClusterSize 5 --maxLabels 10
> 
> 
> -Grant

--------------------------
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem using Solr/Lucene: http://www.lucidimagination.com/search


Re: Stopwords work for Solr but not for Mahout

Posted by Grant Ingersoll <gs...@apache.org>.
On Jan 2, 2010, at 3:11 PM, Drew Farris wrote:

> I've managed to get k-means clustering working, but I agree it would be very
> nice to have an end-to-end example that would allow others to get up to
> speed quickly. I think the largest holes here are related to the vacuum of a
> corpus of text into the Lucene index and the presentation of a
> human-readable display of the results. It might be interesting to also
> calculate and include some metrics such as the F-measure (in cases where we
> have a reference categorization) and scatter score (in cases where we
> don't).
> 
> The existing LDA example would be a useful starting point. It slurps
> in the Reuters-21578
> corpus <http://kdd.ics.uci.edu/databases/reuters21578/reuters21578.html>,
> converts it to text, loads it into a Lucene index, extracts vectors from the
> lucene index and runs LDA upon them.
> 
> This example uses the lucene benchmark utilities for the input to text
> conversion and lucene loading. The benchmark utilities code is readable but
> complex. It would be very nice to have a simple piece of code to handle the
> creation of the Lucene index that others can easily build upon and adapt
> to their existing corpus.
> 


+1.

I've also got this working for a bunch of RSS feeds using Solr's DataImportHandler and the following commands:

In Solr, I setup the DataImportHandler with something like:
<dataConfig>

  <dataSource name="rss" type="HttpDataSource" encoding="UTF-8"/>
  <document>
    <!-- New York Times Sports feed -->
    <entity name="nytSportsFeed"
            pk="link"
            url="http://feeds1.nytimes.com/nyt/rss/Sports"
            processor="XPathEntityProcessor"
            forEach="/rss/channel | /rss/channel/item"
            dataSource="rss"
            transformer="RegexTransformer,DateFormatTransformer">
      <field column="source" xpath="/rss/channel/title" commonField="true" />
      <field column="source-link" xpath="/rss/channel/link" commonField="true" />
      <field column="title" xpath="/rss/channel/item/title" />
      <field column="id" xpath="/rss/channel/item/guid" />
      <field column="link" xpath="/rss/channel/item/link" />
      <!-- Use the RegexTransformer to strip out ads -->
      <field column="description" xpath="/rss/channel/item/description" regex="&lt;a.*?&lt;/a&gt;" replaceWith=""/>
      <field column="category" xpath="/rss/channel/item/category" />
      <!-- 'Sun, 18 May 2008 11:23:11 +0000' -->
      <field column="pubDate" xpath="/rss/channel/item/pubDate" dateTimeFormat="EEE, dd MMM yyyy HH:mm:ss Z" />
    </entity>
    <entity name="nytWorld"
            pk="link"
            url="http://feeds.nytimes.com/nyt/rss/World"
            processor="XPathEntityProcessor"
            forEach="/rss/channel | /rss/channel/item"
            dataSource="rss"
            transformer="RegexTransformer,DateFormatTransformer">
      <field column="source" xpath="/rss/channel/title" commonField="true" />
      <field column="source-link" xpath="/rss/channel/link" commonField="true" />
      <field column="title" xpath="/rss/channel/item/title" />
      <field column="id" xpath="/rss/channel/item/guid" />
      <field column="link" xpath="/rss/channel/item/link" />
      <!-- Use the RegexTransformer to strip out ads -->
      <field column="description" xpath="/rss/channel/item/description" regex="&lt;a.*?&lt;/a&gt;" replaceWith=""/>
      <field column="category" xpath="/rss/channel/item/category" />
      <!-- 'Sun, 18 May 2008 11:23:11 +0000' -->
      <field column="pubDate" xpath="/rss/channel/item/pubDate" dateTimeFormat="EEE, dd MMM yyyy HH:mm:ss Z" />
    </entity>
  </document>

</dataConfig>

Then in my browser: http://localhost:8983/solr/dataimport?command=full-import&clean=true
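
The same thing from the command line, if you prefer (DIH also accepts command=status for polling progress while the import runs):

> curl 'http://localhost:8983/solr/dataimport?command=full-import&clean=true'
> curl 'http://localhost:8983/solr/dataimport?command=status'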

Then on the command line in Mahout home:
> mvn dependency:copy-dependencies
> cd target/dependency
> java -cp "*" org.apache.mahout.utils.vectors.lucene.Driver --dir [path to index]/data/index/ --output ./solr-clust-n2/part-out.vec --field desc-clustering --idField id --dictOut ./solr-clust-n2/dictionary.txt --norm 2
> java -Xmx1024M -cp "*" org.apache.mahout.clustering.kmeans.KMeansDriver --input ./solr-clust-n2/part-out.vec --clusters ./solr-clust-n2/out/clusters  --output ./solr-clust-n2/out/ --distance org.apache.mahout.common.distance.CosineDistanceMeasure --convergence 0.001 --overwrite --k 25
> java -Xmx1024M -cp "*" org.apache.mahout.utils.clustering.ClusterDumper --seqFileDir ./solr-clust-n2/out/clusters-2  --dictionary ./solr-clust-n2/dictionary.txt  --substring 100 --pointsDir ./solr-clust-n2/out/points/
or:
> java -Xmx1024M -cp "*" org.apache.mahout.utils.vectors.lucene.ClusterLabels --dir [path to index]/data/index/ --field description --idField id --seqFileDir ./solr-clust-n2/out/clusters-2  --pointsDir ./solr-clust-n2/out/points/ --minClusterSize 5 --maxLabels 10


-Grant

Re: Stopwords work for Solr but not for Mahout

Posted by Drew Farris <dr...@gmail.com>.
I've managed to get k-means clustering working, but I agree it would be very
nice to have an end-to-end example that would allow others to get up to
speed quickly. I think the largest holes here are related to the vacuum of a
corpus of text into the Lucene index and the presentation of a
human-readable display of the results. It might be interesting to also
calculate and include some metrics such as the F-measure (in cases where we
have a reference categorization) and scatter score (in cases where we
don't).

The existing LDA example would be a useful starting point. It slurps
in the Reuters-21578
corpus <http://kdd.ics.uci.edu/databases/reuters21578/reuters21578.html>,
converts it to text, loads it into a Lucene index, extracts vectors from the
lucene index and runs LDA upon them.

This example uses the lucene benchmark utilities for the input to text
conversion and lucene loading. The benchmark utilities code is readable but
complex. It would be very nice to have a simple piece of code to handle the
creation of the Lucene index that others can easily build upon and adapt
to their existing corpus.
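
For what it's worth, a minimal sketch of such a piece of code against the Lucene 2.9 API in use at the time (the field names, the hard-coded document, and the output path argument are all placeholders; TermVector.YES is the important part, since the Mahout lucene Driver reads term vectors):

import java.io.File;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class SimpleIndexer {
  // Usage: java SimpleIndexer /path/to/index
  public static void main(String[] args) throws Exception {
    IndexWriter writer = new IndexWriter(FSDirectory.open(new File(args[0])),
        new StandardAnalyzer(Version.LUCENE_29), true,
        IndexWriter.MaxFieldLength.UNLIMITED);
    Document doc = new Document();
    doc.add(new Field("id", "doc-1", Field.Store.YES, Field.Index.NOT_ANALYZED));
    // term vectors must be enabled for the Mahout lucene Driver to extract vectors
    doc.add(new Field("body", "text of the document goes here", Field.Store.YES,
        Field.Index.ANALYZED, Field.TermVector.YES));
    writer.addDocument(doc);
    writer.close();
  }
}

A real version would loop over the files of a corpus instead of adding one hard-coded document, but the writer setup and the field flags are the parts people tend to get wrong.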

On Sat, Jan 2, 2010 at 2:10 PM, Benson Margulies <bi...@gmail.com>
wrote:
> As someone who tried, not hard enough, and failed, to assemble all
> these bits in a row, I can only say that the situation cries out for
> an end-to-end sample. I'd be willing to help lick it into shape to be
> checked-in as such. My idea is that it should set up to vacuum-cleaner
> up a corpus of text, push it through Lucene, pull it out as vectors,
> tickle the pig hadoop, and deliver actual doc paths arranged by
> cluster.
>

Re: Stopwords work for Solr but not for Mahout

Posted by Benson Margulies <bi...@gmail.com>.
As someone who tried, not hard enough, and failed, to assemble all
these bits in a row, I can only say that the situation cries out for
an end-to-end sample. I'd be willing to help lick it into shape to be
checked-in as such. My idea is that it should set up to vacuum-cleaner
up a corpus of text, push it through Lucene, pull it out as vectors,
tickle the pig hadoop, and deliver actual doc paths arranged by
cluster.

On Sat, Jan 2, 2010 at 1:44 PM, Ted Dunning <te...@gmail.com> wrote:
> Since k-means is a hard clustering, that term should appear in no more than
> 2 clusters, and even that is very unlikely.  It is also very unlikely that the
> cluster explanation would return that term as a top term even if it appeared
> in just one cluster.
>
> This could be some confusion in turning the ids back into terms.  It
> definitely does indicate serious problems.
>
> On Sat, Jan 2, 2010 at 10:27 AM, Bogdan Vatkov <bo...@gmail.com>wrote:
>
>> How is this even possible - for 23,000 docs and for a term which is
>> mentioned only 2 times I have it as a top term in 9 clusters? I definitely
>> did something wrong, do you have an idea what that could be?
>>
>
>
>
> --
> Ted Dunning, CTO
> DeepDyve
>

Re: Stopwords work for Solr but not for Mahout

Posted by Bogdan Vatkov <bo...@gmail.com>.
how can I debug that?

On Sat, Jan 2, 2010 at 8:44 PM, Ted Dunning <te...@gmail.com> wrote:

> Since k-means is a hard clustering, that term should appear in no more than
> 2 clusters, and even that is very unlikely.  It is also very unlikely that the
> cluster explanation would return that term as a top term even if it
> appeared
> in just one cluster.
>
> This could be some confusion in turning the ids back into terms.  It
> definitely does indicate serious problems.
>
>

Re: Stopwords work for Solr but not for Mahout

Posted by Ted Dunning <te...@gmail.com>.
Since k-means is a hard clustering, that term should appear in no more than
2 clusters, and even that is very unlikely.  It is also very unlikely that the
cluster explanation would return that term as a top term even if it appeared
in just one cluster.

This could be some confusion in turning the ids back into terms.  It
definitely does indicate serious problems.
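
If you want to rule the id-to-term mapping out, one cheap check is to load the dictionary file yourself and see which term a given vector index maps to. A minimal sketch (the term<TAB>docFreq<TAB>index layout is an assumption about what the Lucene Driver's --dictOut writes; check your file first and adjust the column positions):

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

public class DictCheck {
  // Usage: java DictCheck dictionary.txt 12345
  public static void main(String[] args) throws IOException {
    Map<Integer, String> termByIndex = new HashMap<Integer, String>();
    BufferedReader in = new BufferedReader(new FileReader(args[0]));
    String line;
    while ((line = in.readLine()) != null) {
      String[] cols = line.split("\t");
      if (cols.length < 3) {
        continue; // skip header/count lines
      }
      try {
        termByIndex.put(Integer.valueOf(cols[2]), cols[0]);
      } catch (NumberFormatException ignored) {
        // non-numeric third column, e.g. a header row
      }
    }
    in.close();
    System.out.println("index " + args[1] + " -> "
        + termByIndex.get(Integer.valueOf(args[1])));
  }
}

If the term this prints for a given index disagrees with what ClusterDumper shows for the same cell, the mapping is the problem.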

On Sat, Jan 2, 2010 at 10:27 AM, Bogdan Vatkov <bo...@gmail.com>wrote:

> How is this even possible - for 23,000 docs and for a term which is
> mentioned only 2 times I have it as a top term in 9 clusters? I definitely
> did something wrong, do you have an idea what that could be?
>



-- 
Ted Dunning, CTO
DeepDyve

Re: Stopwords work for Solr but not for Mahout

Posted by Benson Margulies <bi...@gmail.com>.
Note that, as of this instant, the generated tests won't automatically
turn up in the Eclipse build path, due to a bug in the maven eclipse
plugin. I'm waiting for Dan Kulp to return from vacation to clarify
how we get around this in CXF before I submit a patch for it.


On Sun, Jan 3, 2010 at 8:57 PM, Jake Mannix <ja...@gmail.com> wrote:
> I get caught by this kind of thing all the time, Bogdan - one of the tricks
> with
> maven is, after building once with mvn install, to do "mvn
> eclipse:eclipse"
> or "mvn idea:idea" depending on which IDE you use.  This will generate the
> project files you can open with the IDE and it'll have all the right sources
> and
> classpaths, etc.
>
>  -jake
>
> On Sun, Jan 3, 2010 at 2:15 PM, Bogdan Vatkov <bo...@gmail.com>wrote:
>
>> Ok guys, found the issue with the compilation - for the math project - I
>> was
>> using the main/java folders as the only source folder...but it seems there
>> are some more (the ones I missed originally) classes in the
>> ../target/generated-sources/ folder ...now I compile everything in
>> Eclipse...
>> Thanks guys - once I did svn co + mvn install I realized all the sources
>> are
>> in place - just not where I was looking :)
>>
>> Best regards,
>> Bogdan
>>
>> On Sun, Jan 3, 2010 at 10:36 PM, Benson Margulies <bimargulies@gmail.com
>> >wrote:
>>
>> > The quickest thing is a new svn co to a new directory.
>> >
>> > On Sun, Jan 3, 2010 at 3:33 PM, Bogdan Vatkov <bo...@gmail.com>
>> > wrote:
>> > > how do I do "clean update" ?
>> > >
>> > > On Sun, Jan 3, 2010 at 8:00 PM, Ted Dunning <te...@gmail.com>
>> > wrote:
>> > >
>> > >> Not sure.
>> > >>
>> > >> Are you doing a clean update then mvn compile from the root dir?
>> > >>
>> > >> On Sun, Jan 3, 2010 at 4:25 AM, Bogdan Vatkov <
>> bogdan.vatkov@gmail.com
>> > >> >wrote:
>> > >>
>> > >> > But Grant said he can compile from trunk, am I missing something?
>> > >> >
>> > >>
>> > >>
>> > >>
>> > >> --
>> > >> Ted Dunning, CTO
>> > >> DeepDyve
>> > >>
>> > >
>> > >
>> > >
>> > > --
>> > > Best regards,
>> > > Bogdan
>> > >
>> >
>>
>>
>>
>> --
>> Best regards,
>> Bogdan
>>
>

Re: Stopwords work for Solr but not for Mahout

Posted by Jake Mannix <ja...@gmail.com>.
I get caught by this kind of thing all the time, Bogdan - one of the tricks
with
maven is, after building once with mvn install, to do "mvn
eclipse:eclipse"
or "mvn idea:idea" depending on which IDE you use.  This will generate the
project files you can open with the IDE and it'll have all the right sources
and
classpaths, etc.
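
Spelled out, that is just the standard maven-eclipse-plugin flow, run from the checkout root:

> mvn install
> mvn eclipse:eclipse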

  -jake

On Sun, Jan 3, 2010 at 2:15 PM, Bogdan Vatkov <bo...@gmail.com>wrote:

> Ok guys, found the issue with the compilation - for the math project - I
> was
> using the main/java folders as the only source folder...but it seems there
> are some more (the ones I missed originally) classes in the
> ../target/generated-sources/ folder ...now I compile everything in
> Eclipse...
> Thanks guys - once I did svn co + mvn install I realized all the sources
> are
> in place - just not where I was looking :)
>
> Best regards,
> Bogdan
>
> On Sun, Jan 3, 2010 at 10:36 PM, Benson Margulies <bimargulies@gmail.com
> >wrote:
>
> > The quickest thing is a new svn co to a new directory.
> >
> > On Sun, Jan 3, 2010 at 3:33 PM, Bogdan Vatkov <bo...@gmail.com>
> > wrote:
> > > how do I do "clean update" ?
> > >
> > > On Sun, Jan 3, 2010 at 8:00 PM, Ted Dunning <te...@gmail.com>
> > wrote:
> > >
> > >> Not sure.
> > >>
> > >> Are you doing a clean update then mvn compile from the root dir?
> > >>
> > >> On Sun, Jan 3, 2010 at 4:25 AM, Bogdan Vatkov <
> bogdan.vatkov@gmail.com
> > >> >wrote:
> > >>
> > >> > But Grant said he can compile from trunk, am I missing something?
> > >> >
> > >>
> > >>
> > >>
> > >> --
> > >> Ted Dunning, CTO
> > >> DeepDyve
> > >>
> > >
> > >
> > >
> > > --
> > > Best regards,
> > > Bogdan
> > >
> >
>
>
>
> --
> Best regards,
> Bogdan
>

Re: Stopwords work for Solr but not for Mahout

Posted by Bogdan Vatkov <bo...@gmail.com>.
Ok guys, found the issue with the compilation - for the math project - I was
using the main/java folders as the only source folder...but it seems there
are some more (the ones I missed originally) classes in the
../target/generated-sources/ folder ...now I compile everything in
Eclipse...
Thanks guys - once I did svn co + mvn install I realized all the sources are
in place - just not where I was looking :)

Best regards,
Bogdan

On Sun, Jan 3, 2010 at 10:36 PM, Benson Margulies <bi...@gmail.com>wrote:

> The quickest thing is a new svn co to a new directory.
>
> On Sun, Jan 3, 2010 at 3:33 PM, Bogdan Vatkov <bo...@gmail.com>
> wrote:
> > how do I do "clean update" ?
> >
> > On Sun, Jan 3, 2010 at 8:00 PM, Ted Dunning <te...@gmail.com>
> wrote:
> >
> >> Not sure.
> >>
> >> Are you doing a clean update then mvn compile from the root dir?
> >>
> >> On Sun, Jan 3, 2010 at 4:25 AM, Bogdan Vatkov <bogdan.vatkov@gmail.com
> >> >wrote:
> >>
> >> > But Grant said he can compile from trunk, am I missing something?
> >> >
> >>
> >>
> >>
> >> --
> >> Ted Dunning, CTO
> >> DeepDyve
> >>
> >
> >
> >
> > --
> > Best regards,
> > Bogdan
> >
>



-- 
Best regards,
Bogdan

Re: Stopwords work for Solr but not for Mahout

Posted by Benson Margulies <bi...@gmail.com>.
The quickest thing is a new svn co to a new directory.
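
For example (the trunk URL here is from memory of Mahout's location under the Lucene project at the time - verify it before use):

> svn co http://svn.apache.org/repos/asf/lucene/mahout/trunk mahout-fresh
> cd mahout-fresh
> mvn install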

On Sun, Jan 3, 2010 at 3:33 PM, Bogdan Vatkov <bo...@gmail.com> wrote:
> how do I do "clean update" ?
>
> On Sun, Jan 3, 2010 at 8:00 PM, Ted Dunning <te...@gmail.com> wrote:
>
>> Not sure.
>>
>> Are you doing a clean update then mvn compile from the root dir?
>>
>> On Sun, Jan 3, 2010 at 4:25 AM, Bogdan Vatkov <bogdan.vatkov@gmail.com
>> >wrote:
>>
>> > But Grant said he can compile from trunk, am I missing something?
>> >
>>
>>
>>
>> --
>> Ted Dunning, CTO
>> DeepDyve
>>
>
>
>
> --
> Best regards,
> Bogdan
>

Re: Stopwords work for Solr but not for Mahout

Posted by Bogdan Vatkov <bo...@gmail.com>.
how do I do "clean update" ?

On Sun, Jan 3, 2010 at 8:00 PM, Ted Dunning <te...@gmail.com> wrote:

> Not sure.
>
> Are you doing a clean update then mvn compile from the root dir?
>
> On Sun, Jan 3, 2010 at 4:25 AM, Bogdan Vatkov <bogdan.vatkov@gmail.com
> >wrote:
>
> > But Grant said he can compile from trunk, am I missing something?
> >
>
>
>
> --
> Ted Dunning, CTO
> DeepDyve
>



-- 
Best regards,
Bogdan

Re: Stopwords work for Solr but not for Mahout

Posted by Ted Dunning <te...@gmail.com>.
Not sure.

Are you doing a clean update then mvn compile from the root dir?

On Sun, Jan 3, 2010 at 4:25 AM, Bogdan Vatkov <bo...@gmail.com>wrote:

> But Grant said he can compile from trunk, am I missing something?
>



-- 
Ted Dunning, CTO
DeepDyve

Re: Stopwords work for Solr but not for Mahout

Posted by Bogdan Vatkov <bo...@gmail.com>.
But Grant said he can compile from trunk, am I missing something?
What about the others? Do you use trunk or the 0.2 version?

On Sun, Jan 3, 2010 at 9:08 AM, Ted Dunning <te...@gmail.com> wrote:

> This looks like Benson's new collection stuff.
>
> On Sat, Jan 2, 2010 at 6:44 PM, Bogdan Vatkov <bogdan.vatkov@gmail.com
> >wrote:
>
> > It makes sense to have it but still cannot compile the ClusterDumper from
> > trunk:
> >
> > Exception in thread "main" java.lang.Error: Unresolved compilation
> > problems:
> > The import org.apache.mahout.math.function.IntDoubleProcedure cannot be
> > resolved
> > The import org.apache.mahout.math.list.DoubleArrayList cannot be resolved
> > The import org.apache.mahout.math.list.IntArrayList cannot be resolved
> >
>
>
>
> --
> Ted Dunning, CTO
> DeepDyve
>



-- 
Best regards,
Bogdan

Re: Stopwords work for Solr but not for Mahout

Posted by Ted Dunning <te...@gmail.com>.
This looks like Benson's new collection stuff.

On Sat, Jan 2, 2010 at 6:44 PM, Bogdan Vatkov <bo...@gmail.com>wrote:

> It makes sense to have it but still cannot compile the ClusterDumper from
> trunk:
>
> Exception in thread "main" java.lang.Error: Unresolved compilation
> problems:
> The import org.apache.mahout.math.function.IntDoubleProcedure cannot be
> resolved
> The import org.apache.mahout.math.list.DoubleArrayList cannot be resolved
> The import org.apache.mahout.math.list.IntArrayList cannot be resolved
>



-- 
Ted Dunning, CTO
DeepDyve

Re: Stopwords work for Solr but not for Mahout

Posted by Bogdan Vatkov <bo...@gmail.com>.
It makes sense to have it but still cannot compile the ClusterDumper from
trunk:

Exception in thread "main" java.lang.Error: Unresolved compilation
problems:
The import org.apache.mahout.math.function.IntDoubleProcedure cannot be
resolved
The import org.apache.mahout.math.list.DoubleArrayList cannot be resolved
The import org.apache.mahout.math.list.IntArrayList cannot be resolved
IntArrayList cannot be resolved to a type
The method keys() from the type AbstractIntDoubleMap refers to the missing
type IntArrayList
IntArrayList cannot be resolved to a type
IntArrayList cannot be resolved to a type
IntArrayList cannot be resolved to a type
IntArrayList cannot be resolved to a type
DoubleArrayList cannot be resolved to a type
The method values() from the type AbstractIntDoubleMap refers to the missing
type DoubleArrayList
IntDoubleProcedure cannot be resolved to a type
The method apply(int, double) of type SparseVector.DistanceSquared must
override or implement a supertype method
The method forEachPair(IntDoubleProcedure) from the type
OpenIntDoubleHashMap refers to the missing type IntDoubleProcedure
IntDoubleProcedure cannot be resolved to a type
The method apply(int, double) of type SparseVector.AddToVector must override
or implement a supertype method
The method forEachPair(IntDoubleProcedure) from the type
OpenIntDoubleHashMap refers to the missing type IntDoubleProcedure

at org.apache.mahout.math.SparseVector.<init>(SparseVector.java:20)
at org.apache.mahout.clustering.ClusterBase.<init>(ClusterBase.java:34)
at org.apache.mahout.clustering.kmeans.Cluster.<init>(Cluster.java:140)
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at
sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
at
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
at java.lang.Class.newInstance0(Class.java:355)
at java.lang.Class.newInstance(Class.java:308)
at
org.apache.mahout.utils.clustering.ClusterDumper.printClusters(ClusterDumper.java:121)
at
org.apache.mahout.utils.clustering.ClusterDumper.main(ClusterDumper.java:253)

On Sun, Jan 3, 2010 at 4:34 AM, Drew Farris <dr...@gmail.com> wrote:

> On Sat, Jan 2, 2010 at 8:31 PM, Bogdan Vatkov <bogdan.vatkov@gmail.com
> >wrote:
>
> > Still, is there a way to print out the current convergence after each
> > iteration or something?
> >
>
> Each cluster has its own convergence which is defined as the distance
> between its center and its centroid. As a result, overall convergence is a
> binary measure defined as whether all clusters are converged -- whether
> each
> cluster's convergence is less than or equal to the convergence delta.
>
> If you are interested in the convergence measure for each cluster, you will
> need to modify computeConvergence() in o.a.m.clustering.kmeans.Cluster to
> either store or log the convergence.
>
> If there's sufficient interest in this, I can prep a patch that will allow
> convergence to be stored and dumped via ClusterDumper
>
> Drew
>



-- 
Best regards,
Bogdan

Re: Stopwords work for Solr but not for Mahout

Posted by Ted Dunning <te...@gmail.com>.
I think that total distance of documents to the nearest cluster is an
interesting convergence measure as well.  It should bottom out to some
asymptote as the clustering proceeds.
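
In symbols (notation mine): after iteration t, compute J(t) = sum over documents d of min over clusters k of distance(v_d, c_k).  J(t) should fall toward that asymptote, so the per-iteration drop J(t-1) - J(t) makes a natural stopping test.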

On Sat, Jan 2, 2010 at 6:34 PM, Drew Farris <dr...@gmail.com> wrote:

> On Sat, Jan 2, 2010 at 8:31 PM, Bogdan Vatkov <bogdan.vatkov@gmail.com
> >wrote:
>
> > Still, is there a way to print out the current convergence after each
> > iteration or something?
> >
>
> Each cluster has its own convergence which is defined as the distance
> between its center and its centroid. As a result, overall convergence is a
> binary measure defined as whether all clusters are converged -- whether
> each
> cluster's convergence is less than or equal to the convergence delta.
>
> If you are interested in the convergence measure for each cluster, you will
> need to modify computeConvergence() in o.a.m.clustering.kmeans.Cluster to
> either store or log the convergence.
>
> If there's sufficient interest in this, I can prep a patch that will allow
> convergence to be stored and dumped via ClusterDumper
>
> Drew
>



-- 
Ted Dunning, CTO
DeepDyve

Re: Stopwords work for Solr but not for Mahout

Posted by Drew Farris <dr...@gmail.com>.
On Sat, Jan 2, 2010 at 8:31 PM, Bogdan Vatkov <bo...@gmail.com>wrote:

> Still, is there a way to print out the current convergence after each
> iteration or something?
>

Each cluster has its own convergence which is defined as the distance
between its center and its centroid. As a result, overall convergence is a
binary measure defined as whether all clusters are converged -- whether each
cluster's convergence is less than or equal to the convergence delta.

If you are interested in the convergence measure for each cluster, you will
need to modify computeConvergence() in o.a.m.clustering.kmeans.Cluster to
either store or log the convergence.

If there's sufficient interest in this, I can prep a patch that will allow
convergence to be stored and dumped via ClusterDumper
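
For anyone who wants to try this before a patch lands, the change would look roughly like the following (a sketch only - the actual signature, field names, and logger in 0.2/trunk may differ, so treat this as illustrative):

  boolean computeConvergence(DistanceMeasure measure, double convergenceDelta) {
    Vector centroid = computeCentroid();
    double distanceToCentroid = measure.distance(getCenter(), centroid);
    // new: log (or store in a field) the per-cluster convergence each iteration
    log.info("Cluster " + getId() + " distance to centroid: " + distanceToCentroid);
    converged = distanceToCentroid <= convergenceDelta;
    return converged;
  }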

Drew

Re: Stopwords work for Solr but not for Mahout

Posted by Bogdan Vatkov <bo...@gmail.com>.
Still, is there a way to print out the current convergence after each
iteration or something?

On Sun, Jan 3, 2010 at 1:47 AM, Grant Ingersoll <gs...@apache.org> wrote:

>
> On Jan 2, 2010, at 6:23 PM, Bogdan Vatkov wrote:
>
> > Sorry guys, I was really mixing content, vector files, etc.
>
> whew!
>
> > I now took the example from Grant above...I am just wondering if it is ok
> to
> > have --convergence 0.001...I ran my data with this value and I had to
> kill
> > it since it went to the 5th iteration and was already running for an hour or
> so.
> > Is there a way I can monitor the current convergence value so that I
> can
> > roughly project how many iterations I will have to wait for?
>
> I don't know where I picked that number from, so I guess I'd probably
> start with a bigger number and then validate.
>
>
>


-- 
Best regards,
Bogdan

Re: Stopwords work for Solr but not for Mahout

Posted by Grant Ingersoll <gs...@apache.org>.
On Jan 2, 2010, at 6:23 PM, Bogdan Vatkov wrote:

> Sorry guys, I was really mixing content, vector files, etc.

whew!

> I now took the example from Grant above...I am just wondering if it is ok to
> have --convergence 0.001...I ran my data with this value and I had to kill
> it since it went to the 5th iteration and was already running for an hour or so.
> Is there a way I can monitor the current convergence value so that I can
> roughly project how many iterations I will have to wait for?

I don't know where I picked that number from, so I guess I'd probably start with a bigger number and then validate.



Re: Stopwords work for Solr but not for Mahout

Posted by Bogdan Vatkov <bo...@gmail.com>.
Sorry guys, I was really mixing content, vector files, etc.
I now took the example from Grant above...I am just wondering if it is ok to
have --convergence 0.001...I ran my data with this value and I had to kill
it since it went to the 5th iteration and was already running for an hour or so.
Is there a way I can monitor the current convergence value so that I can
roughly project how many iterations I will have to wait for?

> What commands are you running?
>
> Can you share more about your setup or try to reproduce in a much smaller
> environment?
>
>


-- 
Bogdan Vatkov
email: bogdan.vatkov@gmail.com

Re: Stopwords work for Solr but not for Mahout

Posted by Grant Ingersoll <gs...@apache.org>.
On Jan 2, 2010, at 1:27 PM, Bogdan Vatkov wrote:

> Thanks for the Luke hint, I will try it out but now I noticed something else
> which is very very strange - I ran k-means on 23K+ docs and with 50 clusters
> and the resulting top-term collections all look very strange - I would say
> for 90% of the top terms I get words which I barely recognize.
> I did a short check and for one particular term, which anyway sounded
> strange and which appeared in top terms for 9 of the 50 clusters, I found
> that it has "doc freq" = 2 in the Solr dictionary.
> How is this even possible - for 23,000 docs and for a term which is
> mentioned only 2 times I have it as a top term in 9 clusters? I definitely
> did something wrong, do you have an idea what that could be?

What commands are you running?

Can you share more about your setup or try to reproduce in a much smaller environment?


Re: Stopwords work for Solr but not for Mahout

Posted by Bogdan Vatkov <bo...@gmail.com>.
Thanks for the Luke hint, I will try it out but now I noticed something else
which is very very strange - I ran k-means on 23K+ docs and with 50 clusters
and the resulting top-term collections all look very strange - I would say
for 90% of the top terms I get words which I barely recognize.
I did a short check and for one particular term, which anyway sounded
strange and which appeared in top terms for 9 of the 50 clusters, I found
that it has "doc freq" = 2 in the Solr dictionary.
How is this even possible - for 23,000 docs and for a term which is
mentioned only 2 times I have it as a top term in 9 clusters? I definitely
did something wrong, do you have an idea what that could be?

Re: Stopwords work for Solr but not for Mahout

Posted by Bogdan Vatkov <bo...@gmail.com>.
Now that I checked with Luke I definitely have lots of terms in the index
which are supposed to be removed by the stopwords filtering, it seems to be
a Solr/Lucene issue.
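
Before blaming Solr/Lucene outright, the field analysis page (at http://localhost:8983/solr/admin/analysis.jsp in 1.4) is worth a look: paste a stopword into it for the msg_body field and confirm that the index-time StopFilterFactory stage actually drops it. If it does, the offending terms were most likely indexed before the stopword list changed, and the affected documents need re-indexing.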

On Sat, Jan 2, 2010 at 8:13 PM, Ted Dunning <te...@gmail.com> wrote:

> It can be very helpful to use Luke to view the index to verify that you are
> getting what you think out of your indexing process.
>
>

Re: Stopwords work for Solr but not for Mahout

Posted by Ted Dunning <te...@gmail.com>.
It can be very helpful to use Luke to view the index to verify that  
you are getting what you think out of your indexing process.

Sent from my iPhone

On Jan 2, 2010, at 8:34 AM, Bogdan Vatkov <bo...@gmail.com>  
wrote:

> I re-indexed but I cannot find a way to use the VectorDumper w/  
> Dictionary,
> I am using mahout v 0.2 and not the very latest trunk code since the  
> latter
> was not compiling and I had to use older code.
>
> On Sat, Jan 2, 2010 at 5:54 PM, Grant Ingersoll  
> <gs...@apache.org> wrote:
>
>> I assume you re-indexed and you used the VectorDumper (along with the
>> dictionary) to dump out the Vectors that were converted and  
>> verified no stop
>> words?
>>
>> On Jan 2, 2010, at 9:03 AM, Bogdan Vatkov wrote:
>>
>>> this is my Solr config:
>>>
>>>  <field name="msg_body" type="text" termVectors="true"  
>>> indexed="true"
>>> stored="true"/>
>>>
>>> and the type text is as configured by default:
>>>
>>>   <fieldType name="text" class="solr.TextField"
>>> positionIncrementGap="100">
>>>     <analyzer type="index">
>>>       <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>>>       <!-- in this example, we will only use synonyms at query time
>>>       <filter class="solr.SynonymFilterFactory"
>>> synonyms="index_synonyms.txt" ignoreCase="true" expand="false"/>
>>>       -->
>>>       <!-- Case insensitive stop word removal.
>>>         add enablePositionIncrements=true in both the index and  
>>> query
>>>         analyzers to leave a 'gap' for more accurate phrase queries.
>>>       -->
>>>       <filter class="solr.StopFilterFactory"
>>>               ignoreCase="true"
>>>               words="stopwords.txt"
>>>               enablePositionIncrements="true"
>>>               />
>>>       <filter class="solr.WordDelimiterFilterFactory"
>>> generateWordParts="1" generateNumberParts="1" catenateWords="1"
>>> catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
>>>       <filter class="solr.LowerCaseFilterFactory"/>
>>>       <filter class="solr.SnowballPorterFilterFactory"
>> language="English"
>>> protected="protwords.txt"/>
>>>     </analyzer>
>>>     <analyzer type="query">
>>>       <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>>>       <filter class="solr.SynonymFilterFactory"  
>>> synonyms="synonyms.txt"
>>> ignoreCase="true" expand="true"/>
>>>       <filter class="solr.StopFilterFactory"
>>>               ignoreCase="true"
>>>               words="stopwords.txt"
>>>               enablePositionIncrements="true"
>>>               />
>>>       <filter class="solr.WordDelimiterFilterFactory"
>>> generateWordParts="1" generateNumberParts="1" catenateWords="0"
>>> catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
>>>       <filter class="solr.LowerCaseFilterFactory"/>
>>>       <filter class="solr.SnowballPorterFilterFactory"
>> language="English"
>>> protected="protwords.txt"/>
>>>     </analyzer>
>>>   </fieldType>
>>>
>>> and I have entered quite some stopwords in the stopwords.txt file
>>>
>>> my SolrToMahout.sh file:
>>>
>>> #!/bin/bash
>>> set -x
>>> cd /store/dev/inst/mahout-0.2
>>> java -classpath
>>> /store/dev/inst/mahout-0.2/utils/target/mahout-utils-0.2.jar:$( echo
>>> /store/dev/inst/mahout-0.2/utils/target/dependency/*.jar . | sed 's/
>> /:/g')
>>> org.apache.mahout.utils.vectors.lucene.Driver --dir
>>> /store/dev/inst/apache-solr-1.4.0/example/solr/data/index \
>>>  --output /store/dev/inst/mahout-0.2/clustering-example/solr/output
>>> --field msg_body --dictOut
>>> /store/dev/inst/mahout-0.2/clustering-example/solr_dict/dict
>>>
>>> Best regards,
>>> Bogdan
>>>
>>> On Sat, Jan 2, 2010 at 3:49 PM, Grant Ingersoll  
>>> <gs...@apache.org>
>> wrote:
>>>
>>>> What do the relevant pieces of your Solr setup look like and how  
>>>> are you
>>>> invoking the Lucene driver?
>>>>
>>>> -Grant
>>
>> --------------------------
>> Grant Ingersoll
>> http://www.lucidimagination.com/
>>
>> Search the Lucene ecosystem using Solr/Lucene:
>> http://www.lucidimagination.com/search
>>
>>
>
>
> -- 
> Bogdan Vatkov
> email: bogdan.vatkov@gmail.com

Re: Stopwords work for Solr but not for Mahout

Posted by Bogdan Vatkov <bo...@gmail.com>.
just did the procedure from scratch - core, utils, and examples seem to
resolve; only math is not compiling - it reports something like 950 errors

On Sun, Jan 3, 2010 at 2:55 AM, Ted Dunning <te...@gmail.com> wrote:

> Benson?
>
> Did half of one of your patches get committed?
>
> On Sat, Jan 2, 2010 at 4:41 PM, Bogdan Vatkov <bogdan.vatkov@gmail.com
> >wrote:
>
> > I am trying to create a Java project in Eclipse from trunk but when I sync
> > e.g. in math a class GenericSorting is not compiling due to a missing
> > import
> > org.apache.mahout.math.function.IntComparator;
> > what am I doing wrong?
> >
> > > Hmm, I'm using trunk and it is compiling.  You have to do "mvn install"
> > > from the root Mahout dir, if that helps at all.
> > >
> >
>
>
>
> --
> Ted Dunning, CTO
> DeepDyve
>



-- 
Bogdan Vatkov
email: bogdan.vatkov@gmail.com

Re: Stopwords work for Solr but not for Mahout

Posted by Ted Dunning <te...@gmail.com>.
Benson?

Did half of one of your patches get committed?

On Sat, Jan 2, 2010 at 4:41 PM, Bogdan Vatkov <bo...@gmail.com>wrote:

> I am trying to create a Java project in Eclipse from trunk but when I sync
> e.g. in math a class GenericSorting is not compiling due to a missing
> import
> org.apache.mahout.math.function.IntComparator;
> what am I doing wrong?
>
> > Hmm, I'm using trunk and it is compiling.  You have to do "mvn install"
> > from the root Mahout dir, if that helps at all.
> >
>



-- 
Ted Dunning, CTO
DeepDyve

Re: Stopwords work for Solr but not for Mahout

Posted by Bogdan Vatkov <bo...@gmail.com>.
I am trying to create a Java project in Eclipse from trunk but when I sync
e.g. in math a class GenericSorting is not compiling due to a missing import
org.apache.mahout.math.function.IntComparator;
what am I doing wrong?

> Hmm, I'm using trunk and it is compiling.  You have to do "mvn install"
> from the root Mahout dir, if that helps at all.
>

Re: Stopwords work for Solr but not for Mahout

Posted by Grant Ingersoll <gs...@apache.org>.
On Jan 2, 2010, at 11:56 AM, Bogdan Vatkov wrote:

> If I use the TermVectorComponent the search results do not contain stopwords
> - which seems to be ok at this point in time.
> But when I use the Lucene Driver I can see the stop words in the dictionary
> file alone and later in the clusters.
> Is there a way that I can print the vectors with the real terms in place -
> instead of just some indexes?

No, but it should be easy enough to modify ClusterDumper to do this.  

Let me double check mine, I'm pretty sure I don't have stopwords, but I haven't checked all the way down to the actual vector.  All the Lucene Driver is doing is loading the term vector, so if it isn't in the term vector, I don't see how it can be in the Mahout vector.  Could be a bug, though.
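
For anyone modifying it: the core of that change is just walking the non-zero cells and looking each index up in the dictionary. A sketch, assuming the trunk Vector API's iterateNonZero() (method names shifted between 0.2 and trunk, so verify locally):

import java.util.Iterator;
import java.util.List;
import org.apache.mahout.math.Vector;  // package name differs on 0.2

public class VectorPrinter {
  // dictionary is the term list from --dictOut, ordered by vector index
  static void printWithTerms(Vector v, List<String> dictionary) {
    for (Iterator<Vector.Element> it = v.iterateNonZero(); it.hasNext(); ) {
      Vector.Element e = it.next();
      System.out.println(dictionary.get(e.index()) + " => " + e.get());
    }
  }
}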

> 
> On Sat, Jan 2, 2010 at 6:40 PM, Grant Ingersoll <gs...@apache.org> wrote:
> 
>> 
>> On Jan 2, 2010, at 11:34 AM, Bogdan Vatkov wrote:
>> 
>>> I re-indexed but I cannot find a way to use the VectorDumper w/
>> Dictionary,
>>> I am using mahout v 0.2 and not the very latest trunk code since the
>> latter
>>> was not compiling and I had to use older code.
>> 
>> Hmm, I'm using trunk and it is compiling.  You have to do "mvn install"
>> from the root Mahout dir, if that helps at all.
>> 
>> If you turn on the TermVectorComponent (
>> http://wiki.apache.org/solr/TermVectorComponent) in Solr, what do your
>> vectors look like?  Do they have stopwords?
>> 
>>> 
>>> On Sat, Jan 2, 2010 at 5:54 PM, Grant Ingersoll <gs...@apache.org>
>> wrote:
>>> 
>>>> I assume you re-indexed and you used the VectorDumper (along with the
>>>> dictionary) to dump out the Vectors that were converted and verified no
>> stop
>>>> words?
>>>> 
>>>> On Jan 2, 2010, at 9:03 AM, Bogdan Vatkov wrote:
>>>> 
>>>>> this is my Solr config:
>>>>> 
>>>>> <field name="msg_body" type="text" termVectors="true" indexed="true"
>>>>> stored="true"/>
>>>>> 
>>>>> and the type text is as configured by default:
>>>>> 
>>>>>  <fieldType name="text" class="solr.TextField"
>>>>> positionIncrementGap="100">
>>>>>    <analyzer type="index">
>>>>>      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>>>>>      <!-- in this example, we will only use synonyms at query time
>>>>>      <filter class="solr.SynonymFilterFactory"
>>>>> synonyms="index_synonyms.txt" ignoreCase="true" expand="false"/>
>>>>>      -->
>>>>>      <!-- Case insensitive stop word removal.
>>>>>        add enablePositionIncrements=true in both the index and query
>>>>>        analyzers to leave a 'gap' for more accurate phrase queries.
>>>>>      -->
>>>>>      <filter class="solr.StopFilterFactory"
>>>>>              ignoreCase="true"
>>>>>              words="stopwords.txt"
>>>>>              enablePositionIncrements="true"
>>>>>              />
>>>>>      <filter class="solr.WordDelimiterFilterFactory"
>>>>> generateWordParts="1" generateNumberParts="1" catenateWords="1"
>>>>> catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
>>>>>      <filter class="solr.LowerCaseFilterFactory"/>
>>>>>      <filter class="solr.SnowballPorterFilterFactory"
>>>> language="English"
>>>>> protected="protwords.txt"/>
>>>>>    </analyzer>
>>>>>    <analyzer type="query">
>>>>>      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>>>>>      <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
>>>>> ignoreCase="true" expand="true"/>
>>>>>      <filter class="solr.StopFilterFactory"
>>>>>              ignoreCase="true"
>>>>>              words="stopwords.txt"
>>>>>              enablePositionIncrements="true"
>>>>>              />
>>>>>      <filter class="solr.WordDelimiterFilterFactory"
>>>>> generateWordParts="1" generateNumberParts="1" catenateWords="0"
>>>>> catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
>>>>>      <filter class="solr.LowerCaseFilterFactory"/>
>>>>>      <filter class="solr.SnowballPorterFilterFactory"
>>>> language="English"
>>>>> protected="protwords.txt"/>
>>>>>    </analyzer>
>>>>>  </fieldType>
>>>>> 
>>>>> and I have entered quite some stopwords in the stopwords.txt file
>>>>> 
>>>>> my SolrToMahout.sh file:
>>>>> 
>>>>> #!/bin/bash
>>>>> set -x
>>>>> cd /store/dev/inst/mahout-0.2
>>>>> java -classpath
>>>>> /store/dev/inst/mahout-0.2/utils/target/mahout-utils-0.2.jar:$( echo
>>>>> /store/dev/inst/mahout-0.2/utils/target/dependency/*.jar . | sed 's/
>>>> /:/g')
>>>>> org.apache.mahout.utils.vectors.lucene.Driver --dir
>>>>> /store/dev/inst/apache-solr-1.4.0/example/solr/data/index \
>>>>> --output /store/dev/inst/mahout-0.2/clustering-example/solr/output
>>>>> --field msg_body --dictOut
>>>>> /store/dev/inst/mahout-0.2/clustering-example/solr_dict/dict
>>>>> 
>>>>> Best regards,
>>>>> Bogdan
>>>>> 
>>>>> On Sat, Jan 2, 2010 at 3:49 PM, Grant Ingersoll <gs...@apache.org>
>>>> wrote:
>>>>> 
>>>>>> What do the relevant pieces of your Solr setup look like and how are
>> you
>>>>>> invoking the Lucene driver?
>>>>>> 
>>>>>> -Grant
>>>> 
>>>> --------------------------
>>>> Grant Ingersoll
>>>> http://www.lucidimagination.com/
>>>> 
>>>> Search the Lucene ecosystem using Solr/Lucene:
>>>> http://www.lucidimagination.com/search
>>>> 
>>>> 
>>> 
>>> 
>>> --
>>> Bogdan Vatkov
>>> email: bogdan.vatkov@gmail.com
>> 
>> --------------------------
>> Grant Ingersoll
>> http://www.lucidimagination.com/
>> 
>> Search the Lucene ecosystem using Solr/Lucene:
>> http://www.lucidimagination.com/search
>> 
>> 
> 
> 
> -- 
> Bogdan Vatkov
> email: bogdan.vatkov@gmail.com

--------------------------
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem using Solr/Lucene: http://www.lucidimagination.com/search


Re: Stopwords work for Solr but not for Mahout

Posted by Bogdan Vatkov <bo...@gmail.com>.
If I use the TermVectorComponent the search results do not contain stopwords
- which seems to be ok at this point in time.
But when I use the Lucene Driver I can see the stop words in the dictionary
file alone and later in the clusters.
Is there a way that I can print the vectors with the real terms in place -
instead of just some indexes?

On Sat, Jan 2, 2010 at 6:40 PM, Grant Ingersoll <gs...@apache.org> wrote:

>
> On Jan 2, 2010, at 11:34 AM, Bogdan Vatkov wrote:
>
> > I re-indexed but I cannot find a way to use the VectorDumper w/
> Dictionary,
> > I am using mahout v 0.2 and not the very latest trunk code since the
> latter
> > was not compiling and I had to use older code.
>
> Hmm, I'm using trunk and it is compiling.  You have to do "mvn install"
> from the root Mahout dir, if that helps at all.
>
> If you turn on the TermVectorComponent (
> http://wiki.apache.org/solr/TermVectorComponent) in Solr, what do your
> vectors look like?  Do they have stopwords?
>
> >
> > On Sat, Jan 2, 2010 at 5:54 PM, Grant Ingersoll <gs...@apache.org>
> wrote:
> >
> >> I assume you re-indexed and you used the VectorDumper (along with the
> >> dictionary) to dump out the Vectors that were converted and verified no
> stop
> >> words?
> >>
> >> On Jan 2, 2010, at 9:03 AM, Bogdan Vatkov wrote:
> >>
> >>> this is my Solr config:
> >>>
> >>>  <field name="msg_body" type="text" termVectors="true" indexed="true"
> >>> stored="true"/>
> >>>
> >>> and the type text is as configured by default:
> >>>
> >>>   <fieldType name="text" class="solr.TextField"
> >>> positionIncrementGap="100">
> >>>     <analyzer type="index">
> >>>       <tokenizer class="solr.WhitespaceTokenizerFactory"/>
> >>>       <!-- in this example, we will only use synonyms at query time
> >>>       <filter class="solr.SynonymFilterFactory"
> >>> synonyms="index_synonyms.txt" ignoreCase="true" expand="false"/>
> >>>       -->
> >>>       <!-- Case insensitive stop word removal.
> >>>         add enablePositionIncrements=true in both the index and query
> >>>         analyzers to leave a 'gap' for more accurate phrase queries.
> >>>       -->
> >>>       <filter class="solr.StopFilterFactory"
> >>>               ignoreCase="true"
> >>>               words="stopwords.txt"
> >>>               enablePositionIncrements="true"
> >>>               />
> >>>       <filter class="solr.WordDelimiterFilterFactory"
> >>> generateWordParts="1" generateNumberParts="1" catenateWords="1"
> >>> catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
> >>>       <filter class="solr.LowerCaseFilterFactory"/>
> >>>       <filter class="solr.SnowballPorterFilterFactory"
> >> language="English"
> >>> protected="protwords.txt"/>
> >>>     </analyzer>
> >>>     <analyzer type="query">
> >>>       <tokenizer class="solr.WhitespaceTokenizerFactory"/>
> >>>       <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
> >>> ignoreCase="true" expand="true"/>
> >>>       <filter class="solr.StopFilterFactory"
> >>>               ignoreCase="true"
> >>>               words="stopwords.txt"
> >>>               enablePositionIncrements="true"
> >>>               />
> >>>       <filter class="solr.WordDelimiterFilterFactory"
> >>> generateWordParts="1" generateNumberParts="1" catenateWords="0"
> >>> catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
> >>>       <filter class="solr.LowerCaseFilterFactory"/>
> >>>       <filter class="solr.SnowballPorterFilterFactory"
> >> language="English"
> >>> protected="protwords.txt"/>
> >>>     </analyzer>
> >>>   </fieldType>
> >>>
> >>> and I have entered quite some stopwords in the stopwords.txt file
> >>>
> >>> my SolrToMahout.sh file:
> >>>
> >>> #!/bin/bash
> >>> set -x
> >>> cd /store/dev/inst/mahout-0.2
> >>> java -classpath
> >>> /store/dev/inst/mahout-0.2/utils/target/mahout-utils-0.2.jar:$( echo
> >>> /store/dev/inst/mahout-0.2/utils/target/dependency/*.jar . | sed 's/
> >> /:/g')
> >>> org.apache.mahout.utils.vectors.lucene.Driver --dir
> >>> /store/dev/inst/apache-solr-1.4.0/example/solr/data/index \
> >>>  --output /store/dev/inst/mahout-0.2/clustering-example/solr/output
> >>> --field msg_body --dictOut
> >>> /store/dev/inst/mahout-0.2/clustering-example/solr_dict/dict
> >>>
> >>> Best regards,
> >>> Bogdan
> >>>
> >>> On Sat, Jan 2, 2010 at 3:49 PM, Grant Ingersoll <gs...@apache.org>
> >> wrote:
> >>>
> >>>> What do the relevant pieces of your Solr setup look like and how are
> you
> >>>> invoking the Lucene driver?
> >>>>
> >>>> -Grant
> >>
> >> --------------------------
> >> Grant Ingersoll
> >> http://www.lucidimagination.com/
> >>
> >> Search the Lucene ecosystem using Solr/Lucene:
> >> http://www.lucidimagination.com/search
> >>
> >>
> >
> >
> > --
> > Bogdan Vatkov
> > email: bogdan.vatkov@gmail.com
>
> --------------------------
> Grant Ingersoll
> http://www.lucidimagination.com/
>
> Search the Lucene ecosystem using Solr/Lucene:
> http://www.lucidimagination.com/search
>
>


-- 
Bogdan Vatkov
email: bogdan.vatkov@gmail.com

Re: Stopwords work for Solr but not for Mahout

Posted by Grant Ingersoll <gs...@apache.org>.
On Jan 2, 2010, at 11:34 AM, Bogdan Vatkov wrote:

> I re-indexed but I cannot find a way to use the VectorDumper w/ Dictionary,
> I am using mahout v 0.2 and not the very latest trunk code since the latter
> was not compiling and I had to use older code.

Hmm, I'm using trunk and it is compiling.  You have to do "mvn install" from the root Mahout dir, if that helps at all.

If you turn on the TermVectorComponent (http://wiki.apache.org/solr/TermVectorComponent) in Solr, what do your vectors look like?  Do they have stopwords?
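
For reference, with the component wired into a handler (the 1.4 example solrconfig registers it under a /tvrh handler, if memory serves), a request along these lines returns the per-document vectors so you can inspect them for stopwords:

http://localhost:8983/solr/tvrh?q=*:*&tv=true&tv.tf=true&fl=id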

> 
> On Sat, Jan 2, 2010 at 5:54 PM, Grant Ingersoll <gs...@apache.org> wrote:
> 
>> I assume you re-indexed and you used the VectorDumper (along with the
>> dictionary) to dump out the Vectors that were converted and verified no stop
>> words?
>> 
>> On Jan 2, 2010, at 9:03 AM, Bogdan Vatkov wrote:
>> 
>>> this is my Solr config:
>>> 
>>>  <field name="msg_body" type="text" termVectors="true" indexed="true"
>>> stored="true"/>
>>> 
>>> and the type text is as configured by default:
>>> 
>>>   <fieldType name="text" class="solr.TextField"
>>> positionIncrementGap="100">
>>>     <analyzer type="index">
>>>       <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>>>       <!-- in this example, we will only use synonyms at query time
>>>       <filter class="solr.SynonymFilterFactory"
>>> synonyms="index_synonyms.txt" ignoreCase="true" expand="false"/>
>>>       -->
>>>       <!-- Case insensitive stop word removal.
>>>         add enablePositionIncrements=true in both the index and query
>>>         analyzers to leave a 'gap' for more accurate phrase queries.
>>>       -->
>>>       <filter class="solr.StopFilterFactory"
>>>               ignoreCase="true"
>>>               words="stopwords.txt"
>>>               enablePositionIncrements="true"
>>>               />
>>>       <filter class="solr.WordDelimiterFilterFactory"
>>> generateWordParts="1" generateNumberParts="1" catenateWords="1"
>>> catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
>>>       <filter class="solr.LowerCaseFilterFactory"/>
>>>       <filter class="solr.SnowballPorterFilterFactory"
>> language="English"
>>> protected="protwords.txt"/>
>>>     </analyzer>
>>>     <analyzer type="query">
>>>       <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>>>       <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
>>> ignoreCase="true" expand="true"/>
>>>       <filter class="solr.StopFilterFactory"
>>>               ignoreCase="true"
>>>               words="stopwords.txt"
>>>               enablePositionIncrements="true"
>>>               />
>>>       <filter class="solr.WordDelimiterFilterFactory"
>>> generateWordParts="1" generateNumberParts="1" catenateWords="0"
>>> catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
>>>       <filter class="solr.LowerCaseFilterFactory"/>
>>>       <filter class="solr.SnowballPorterFilterFactory"
>> language="English"
>>> protected="protwords.txt"/>
>>>     </analyzer>
>>>   </fieldType>
>>> 
>>> and I have entered quite some stopwords in the stopwords.txt file
>>> 
>>> my SolrToMahout.sh file:
>>> 
>>> #!/bin/bash
>>> set -x
>>> cd /store/dev/inst/mahout-0.2
>>> java -classpath
>>> /store/dev/inst/mahout-0.2/utils/target/mahout-utils-0.2.jar:$( echo
>>> /store/dev/inst/mahout-0.2/utils/target/dependency/*.jar . | sed 's/
>> /:/g')
>>> org.apache.mahout.utils.vectors.lucene.Driver --dir
>>> /store/dev/inst/apache-solr-1.4.0/example/solr/data/index \
>>>  --output /store/dev/inst/mahout-0.2/clustering-example/solr/output
>>> --field msg_body --dictOut
>>> /store/dev/inst/mahout-0.2/clustering-example/solr_dict/dict
>>> 
>>> Best regards,
>>> Bogdan
>>> 
>>> On Sat, Jan 2, 2010 at 3:49 PM, Grant Ingersoll <gs...@apache.org>
>> wrote:
>>> 
>>>> What do the relevant pieces of your Solr setup look like and how are you
>>>> invoking the Lucene driver?
>>>> 
>>>> -Grant
>> 
>> --------------------------
>> Grant Ingersoll
>> http://www.lucidimagination.com/
>> 
>> Search the Lucene ecosystem using Solr/Lucene:
>> http://www.lucidimagination.com/search
>> 
>> 
> 
> 
> -- 
> Bogdan Vatkov
> email: bogdan.vatkov@gmail.com

--------------------------
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem using Solr/Lucene: http://www.lucidimagination.com/search


Re: Stopwords work for Solr but not for Mahout

Posted by Bogdan Vatkov <bo...@gmail.com>.
I re-indexed, but I cannot find a way to use the VectorDumper with a
dictionary. I am using Mahout 0.2 rather than the very latest trunk code,
since the latter was not compiling and I had to fall back to the older
release.
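
For reference, this is roughly how I built the 0.2 utils module and
populated the target/dependency directory that my script (quoted below)
puts on the classpath; I am assuming a stock Maven setup here:

# assumes Maven and the maven-dependency-plugin defaults (target/dependency)
cd /store/dev/inst/mahout-0.2/utils
mvn package dependency:copy-dependencies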

On Sat, Jan 2, 2010 at 5:54 PM, Grant Ingersoll <gs...@apache.org> wrote:

> I assume you re-indexed and you used the VectorDumper (along with the
> dictionary) to dump out the Vectors that were converted and verified no stop
> words?
>
> On Jan 2, 2010, at 9:03 AM, Bogdan Vatkov wrote:
>
> > this is my Solr config:
> >
> >   <field name="msg_body" type="text" termVectors="true" indexed="true"
> > stored="true"/>
> >
> > and the type text is as configured by default:
> >
> >    <fieldType name="text" class="solr.TextField"
> > positionIncrementGap="100">
> >      <analyzer type="index">
> >        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
> >        <!-- in this example, we will only use synonyms at query time
> >        <filter class="solr.SynonymFilterFactory"
> > synonyms="index_synonyms.txt" ignoreCase="true" expand="false"/>
> >        -->
> >        <!-- Case insensitive stop word removal.
> >          add enablePositionIncrements=true in both the index and query
> >          analyzers to leave a 'gap' for more accurate phrase queries.
> >        -->
> >        <filter class="solr.StopFilterFactory"
> >                ignoreCase="true"
> >                words="stopwords.txt"
> >                enablePositionIncrements="true"
> >                />
> >        <filter class="solr.WordDelimiterFilterFactory"
> > generateWordParts="1" generateNumberParts="1" catenateWords="1"
> > catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
> >        <filter class="solr.LowerCaseFilterFactory"/>
> >        <filter class="solr.SnowballPorterFilterFactory"
> language="English"
> > protected="protwords.txt"/>
> >      </analyzer>
> >      <analyzer type="query">
> >        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
> >        <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
> > ignoreCase="true" expand="true"/>
> >        <filter class="solr.StopFilterFactory"
> >                ignoreCase="true"
> >                words="stopwords.txt"
> >                enablePositionIncrements="true"
> >                />
> >        <filter class="solr.WordDelimiterFilterFactory"
> > generateWordParts="1" generateNumberParts="1" catenateWords="0"
> > catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
> >        <filter class="solr.LowerCaseFilterFactory"/>
> >        <filter class="solr.SnowballPorterFilterFactory"
> language="English"
> > protected="protwords.txt"/>
> >      </analyzer>
> >    </fieldType>
> >
> > and I have entered quite some stopwords in the stopwords.txt file
> >
> > my SolrToMahout.sh file:
> >
> > #!/bin/bash
> > set -x
> > cd /store/dev/inst/mahout-0.2
> > java -classpath
> > /store/dev/inst/mahout-0.2/utils/target/mahout-utils-0.2.jar:$( echo
> > /store/dev/inst/mahout-0.2/utils/target/dependency/*.jar . | sed 's/
> /:/g')
> > org.apache.mahout.utils.vectors.lucene.Driver --dir
> > /store/dev/inst/apache-solr-1.4.0/example/solr/data/index \
> >   --output /store/dev/inst/mahout-0.2/clustering-example/solr/output
> > --field msg_body --dictOut
> > /store/dev/inst/mahout-0.2/clustering-example/solr_dict/dict
> >
> > Best regards,
> > Bogdan
> >
> > On Sat, Jan 2, 2010 at 3:49 PM, Grant Ingersoll <gs...@apache.org>
> wrote:
> >
> >> What do the relevant pieces of your Solr setup look like and how are you
> >> invoking the Lucene driver?
> >>
> >> -Grant
>
> --------------------------
> Grant Ingersoll
> http://www.lucidimagination.com/
>
> Search the Lucene ecosystem using Solr/Lucene:
> http://www.lucidimagination.com/search
>
>


-- 
Bogdan Vatkov
email: bogdan.vatkov@gmail.com

Re: Stopwords work for Solr but not for Mahout

Posted by Grant Ingersoll <gs...@apache.org>.
I assume you re-indexed, and that you used the VectorDumper (along with the dictionary) to dump out the converted Vectors and verify that no stopwords remained?
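
On trunk, the sort of invocation I have in mind looks roughly like the
sketch below. This is a sketch only, not something verified against 0.2;
the exact flags may differ in your checkout, so run the class with no
arguments to see its actual usage:

# sketch only: <utils classpath> is the same jar list used for the Driver,
# and the flag names are assumptions based on trunk at this time
java -classpath <utils classpath> \
  org.apache.mahout.utils.vectors.VectorDumper \
  --seqFile /store/dev/inst/mahout-0.2/clustering-example/solr/output \
  --dictionary /store/dev/inst/mahout-0.2/clustering-example/solr_dict/dict \
  --output /tmp/solr_vectors.txt

Then grep the text output for one of your stopwords.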

On Jan 2, 2010, at 9:03 AM, Bogdan Vatkov wrote:

> this is my Solr config:
> 
>   <field name="msg_body" type="text" termVectors="true" indexed="true"
> stored="true"/>
> 
> and the type text is as configured by default:
> 
>    <fieldType name="text" class="solr.TextField"
> positionIncrementGap="100">
>      <analyzer type="index">
>        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>        <!-- in this example, we will only use synonyms at query time
>        <filter class="solr.SynonymFilterFactory"
> synonyms="index_synonyms.txt" ignoreCase="true" expand="false"/>
>        -->
>        <!-- Case insensitive stop word removal.
>          add enablePositionIncrements=true in both the index and query
>          analyzers to leave a 'gap' for more accurate phrase queries.
>        -->
>        <filter class="solr.StopFilterFactory"
>                ignoreCase="true"
>                words="stopwords.txt"
>                enablePositionIncrements="true"
>                />
>        <filter class="solr.WordDelimiterFilterFactory"
> generateWordParts="1" generateNumberParts="1" catenateWords="1"
> catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
>        <filter class="solr.LowerCaseFilterFactory"/>
>        <filter class="solr.SnowballPorterFilterFactory" language="English"
> protected="protwords.txt"/>
>      </analyzer>
>      <analyzer type="query">
>        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>        <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
> ignoreCase="true" expand="true"/>
>        <filter class="solr.StopFilterFactory"
>                ignoreCase="true"
>                words="stopwords.txt"
>                enablePositionIncrements="true"
>                />
>        <filter class="solr.WordDelimiterFilterFactory"
> generateWordParts="1" generateNumberParts="1" catenateWords="0"
> catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
>        <filter class="solr.LowerCaseFilterFactory"/>
>        <filter class="solr.SnowballPorterFilterFactory" language="English"
> protected="protwords.txt"/>
>      </analyzer>
>    </fieldType>
> 
> and I have entered quite some stopwords in the stopwords.txt file
> 
> my SolrToMahout.sh file:
> 
> #!/bin/bash
> set -x
> cd /store/dev/inst/mahout-0.2
> java -classpath
> /store/dev/inst/mahout-0.2/utils/target/mahout-utils-0.2.jar:$( echo
> /store/dev/inst/mahout-0.2/utils/target/dependency/*.jar . | sed 's/ /:/g')
> org.apache.mahout.utils.vectors.lucene.Driver --dir
> /store/dev/inst/apache-solr-1.4.0/example/solr/data/index \
>   --output /store/dev/inst/mahout-0.2/clustering-example/solr/output
> --field msg_body --dictOut
> /store/dev/inst/mahout-0.2/clustering-example/solr_dict/dict
> 
> Best regards,
> Bogdan
> 
> On Sat, Jan 2, 2010 at 3:49 PM, Grant Ingersoll <gs...@apache.org> wrote:
> 
>> What do the relevant pieces of your Solr setup look like and how are you
>> invoking the Lucene driver?
>> 
>> -Grant

--------------------------
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem using Solr/Lucene: http://www.lucidimagination.com/search


Re: Stopwords work for Solr but not for Mahout

Posted by Bogdan Vatkov <bo...@gmail.com>.
This is my Solr config:

   <field name="msg_body" type="text" termVectors="true" indexed="true"
stored="true"/>

and the "text" field type is configured as in the default schema:

    <fieldType name="text" class="solr.TextField"
positionIncrementGap="100">
      <analyzer type="index">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <!-- in this example, we will only use synonyms at query time
        <filter class="solr.SynonymFilterFactory"
synonyms="index_synonyms.txt" ignoreCase="true" expand="false"/>
        -->
        <!-- Case insensitive stop word removal.
          add enablePositionIncrements=true in both the index and query
          analyzers to leave a 'gap' for more accurate phrase queries.
        -->
        <filter class="solr.StopFilterFactory"
                ignoreCase="true"
                words="stopwords.txt"
                enablePositionIncrements="true"
                />
        <filter class="solr.WordDelimiterFilterFactory"
generateWordParts="1" generateNumberParts="1" catenateWords="1"
catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.SnowballPorterFilterFactory" language="English"
protected="protwords.txt"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
ignoreCase="true" expand="true"/>
        <filter class="solr.StopFilterFactory"
                ignoreCase="true"
                words="stopwords.txt"
                enablePositionIncrements="true"
                />
        <filter class="solr.WordDelimiterFilterFactory"
generateWordParts="1" generateNumberParts="1" catenateWords="0"
catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.SnowballPorterFilterFactory" language="English"
protected="protwords.txt"/>
      </analyzer>
    </fieldType>

and I have entered quite a few stopwords in the stopwords.txt file.

My SolrToMahout.sh file:

#!/bin/bash
set -x
cd /store/dev/inst/mahout-0.2
java -classpath \
  /store/dev/inst/mahout-0.2/utils/target/mahout-utils-0.2.jar:$( echo /store/dev/inst/mahout-0.2/utils/target/dependency/*.jar . | sed 's/ /:/g') \
  org.apache.mahout.utils.vectors.lucene.Driver \
  --dir /store/dev/inst/apache-solr-1.4.0/example/solr/data/index \
  --output /store/dev/inst/mahout-0.2/clustering-example/solr/output \
  --field msg_body \
  --dictOut /store/dev/inst/mahout-0.2/clustering-example/solr_dict/dict
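
As a sanity check on the index itself, the top terms for the field can
be pulled straight from Solr with the stock Luke request handler
(assuming the default /admin/luke handler from the Solr 1.4 example is
enabled; adjust host and port):

# hypothetical check: lists the 50 most frequent terms in msg_body
curl 'http://localhost:8983/solr/admin/luke?fl=msg_body&numTerms=50'

If the stopwords appear among those terms, they were indexed despite the
StopFilterFactory.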

Best regards,
Bogdan

On Sat, Jan 2, 2010 at 3:49 PM, Grant Ingersoll <gs...@apache.org> wrote:

> What do the relevant pieces of your Solr setup look like and how are you
> invoking the Lucene driver?
>
> -Grant

Re: Stopwords work for Solr but not for Mahout

Posted by Grant Ingersoll <gs...@apache.org>.
What do the relevant pieces of your Solr setup look like, and how are you invoking the Lucene driver?

-Grant

On Jan 2, 2010, at 7:30 AM, Bogdan Vatkov wrote:

> Hi,
> 
> I am using the standard solr.StopFilterFactory with my own stopwords.txt
> file which seem to quite work for Solr itself - e.g. queries are not working
> with stopwords after I define a stopword.
> But when I push Solr content to Mahout with the Lucene Driver I get all the
> words in the dictionary and clusters - even the words that are supposed to
> be stopped.
> Any idea how to stop these words in Mahout?
> 
> Best regards,
> Bogdan

--------------------------
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem using Solr/Lucene: http://www.lucidimagination.com/search