Posted to java-user@lucene.apache.org by David Spencer <da...@tropo.com> on 2005/01/14 22:07:57 UTC

Fun with the Wikipedia

For my own amusement I've indexed the Wikipedia and put up pages that:
- display search results
- cluster the results using Carrot2 (my first use of this)
- display similar pages using the entire text to re-query for similar docs, and
- display similar pages using the "more like this" algorithm (still to do: 
  get this into the sandbox, sorry for the delays)
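
One way to picture the "more like this" idea is to pick the most distinctive 
terms of a page and re-query with them. The sketch below is plain Java, not 
Lucene code: the class SimilarTermsSketch, the topTerms method and the TF-IDF 
weighting are assumptions made for illustration, and the document frequencies 
are passed in as a hypothetical map rather than read from an index.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

/** Illustrative only: selects query terms for a "find similar pages" search. */
final class SimilarTermsSketch {

    /**
     * Returns up to maxTerms terms of the text, ranked by a simple TF-IDF score.
     * docFreq maps a term to the number of indexed documents containing it
     * (hypothetical input; a real system would ask the index for this).
     */
    static List<String> topTerms(String text, Map<String, Integer> docFreq,
                                 int numDocs, int maxTerms) {
        // Count how often each token occurs in the source page.
        Map<String, Integer> termFreq = new HashMap<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!token.isEmpty()) {
                termFreq.merge(token, 1, Integer::sum);
            }
        }

        // Weight each term: frequent in this page, rare in the collection.
        Map<String, Double> score = new HashMap<>();
        for (Map.Entry<String, Integer> e : termFreq.entrySet()) {
            int df = docFreq.getOrDefault(e.getKey(), 0);
            if (df == 0) {
                continue; // term is not in the index, so it cannot match anything
            }
            double idf = Math.log((double) numDocs / df) + 1.0;
            score.put(e.getKey(), e.getValue() * idf);
        }

        // Highest score first; ties broken alphabetically for a stable order.
        List<String> terms = new ArrayList<>(score.keySet());
        Comparator<String> byScoreDesc =
                Comparator.comparingDouble((String t) -> score.get(t)).reversed();
        terms.sort(byScoreDesc.thenComparing(Comparator.<String>naturalOrder()));

        int limit = Math.max(0, Math.min(maxTerms, terms.size()));
        return new ArrayList<>(terms.subList(0, limit));
    }

    public static void main(String[] args) {
        Map<String, Integer> docFreq = new HashMap<>();
        docFreq.put("lucene", 10);
        docFreq.put("search", 100);
        docFreq.put("the", 1000);
        System.out.println(
                topTerms("The Lucene search library: Lucene, the index", docFreq, 1000, 2));
    }
}
```

The selected terms would then be combined into a query against the index. 
Re-querying with the entire text roughly corresponds to skipping the cut-off 
and keeping every term.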


You start off here to search:

	http://www.searchmorph.com/kat/wikipedia.jsp


And the weblog entry goes into a bit more detail:

	http://www.searchmorph.com/weblog/index.php?id=37



It's kinda fun to explore the Wikipedia by looking for pages similar to 
other ones.

Hope people find this useful...

- Dave

PS
   I'm in the process of running the PageRank algorithm (from 
jung.sf.net) on most of the entries in the Wikipedia. It has taken over 
2 days so far....
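
For reference, the PageRank computation itself can be sketched as a power 
iteration. This is plain Java and does not use the JUNG API; the class name 
PageRankSketch, the adjacency-array input and the damping factor of 0.85 are 
assumptions for illustration only.

```java
import java.util.Arrays;

/** Illustrative only: PageRank by power iteration on a small link graph. */
final class PageRankSketch {

    /**
     * links[i] lists the pages that page i links to.
     * Returns one score per page; the scores sum to 1.
     */
    static double[] pageRank(int[][] links, double damping, int iterations) {
        int n = links.length;
        double[] rank = new double[n];
        Arrays.fill(rank, 1.0 / n); // start from a uniform distribution

        for (int it = 0; it < iterations; it++) {
            double[] next = new double[n];
            double danglingMass = 0.0;

            for (int i = 0; i < n; i++) {
                if (links[i].length == 0) {
                    // A page without outgoing links spreads its rank evenly.
                    danglingMass += rank[i];
                } else {
                    double share = rank[i] / links[i].length;
                    for (int target : links[i]) {
                        next[target] += share;
                    }
                }
            }

            // Random-jump term plus the evenly spread dangling rank.
            double base = (1.0 - damping) / n + damping * danglingMass / n;
            for (int i = 0; i < n; i++) {
                next[i] = base + damping * next[i];
            }
            rank = next;
        }
        return rank;
    }

    public static void main(String[] args) {
        int[][] links = { {1, 2}, {2}, {0} };
        System.out.println(Arrays.toString(pageRank(links, 0.85, 50)));
    }
}
```

Each iteration touches every link once, so on a graph the size of the 
Wikipedia the running time depends mostly on the number of links and on how 
many iterations are needed.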

---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org


Re: carrot2 question too - Re: Fun with the Wikipedia

Posted by David Spencer <da...@tropo.com>.
Dawid Weiss wrote:

> 
> Hi David,
> 
> I apologize about the delay in answering this one, Lucene is a busy 
> mailing list and I had a hectic last week... Again, sorry for belated 
> answer, hope you still find it useful.
Oh no problem, and yes carrot2 is useful and fun.  It's a rich package 
so it takes a while to understand all that it can do.
> 
>>> That is awesome and very inspirational!
> 
> 
> Yes, I admit what you've done with Wikipedia is quite interesting and 
> looks very good. I'm also glad you spent some time working out Carrot 
> integration with Lucene. It works quite nice.

Thanks but I just took code that I think you wrote(!) and made minor 
mods to it - here's one link:
http://www.newsarch.com/archive/mailinglist/jakarta/lucene/user/msg03928.html

I'd like to do more with Carrot2; that's where things get harder.

> 
>>> Carrot2 looks very interesting. Wondering if anybody has a list of 
>>> all the
>>
>>
>> Technically I don't think carrot2 uses lucene per-se- it's just that 
>> you can integrate the two, and ditto for Nutch - it has code that uses 
>> Carrot2.
> 
> 
> Yes, this is true. Carrot2 doesn't use all of Lucene's potential -- it 
> merely takes the output from a query (titles, urls and snippets) and 
> attempts to cluster them into some sensible groups. I think many things 
> could be improved, the most important of them is fast snippet retrieval 
> from Lucene because right now it takes 50% of the time of the 
> clustering; I've seen a post a while ago describing a faster snippet 
> generation technique, I'm sure that would give clustering a huge boost 
> speed-wise.
> 
>> And here's my question. I reread the Carrot2<->Lucene code, esp 
>> Demo.java, and there's this fragment:
>>
>>     // warm-up round (stemmer tables must be read etc).
>>     List clusters = clusterer.clusterHits(docs);
>>
>>     long clusteringStartTime = System.currentTimeMillis();
>>     clusters = clusterer.clusterHits(docs);
>>     long clusteringEndTime = System.currentTimeMillis();
>>
>> Thus it calls clusterHits() twice.
>>
>> I don't really understand how to use Carrot2 - but I think the above 
>> is just for the sake of benchmarking clusterHits() w/o the effect of 
>> 1-time initialization - and that there's no benefit of repeatedly 
>> calling clusterHits (where a benefit might be that it can find nested 
>> clusters or whatever) - is that right (that there's no benefit)?
> 
> 
> No, there is absolutely no benefit from it. It was merely to show people 
> that the clustering needs to be warmed up a bit. I should not have put 
> it in the code knowing people would be confused by it. You can safely 
> use clusterHits just once. It will just have a small delay at the first 
> invocation.
> 
> 
> Thanks for experimenting. Please BCC me if you have any urgent projects 
> -- I read Lucene's list in batches and my personal e-mail I try to keep 
> up to date with.
> 
> Dawid

thx,
  Dave



Re: carrot2 question too - Re: Fun with the Wikipedia

Posted by Dawid Weiss <da...@cs.put.poznan.pl>.

Hi Adam.

Otis and David have already provided you with pointers to my previous 
post regarding Carrot2-Lucene integration, so just a tiny note here:

> Also, when I looked at Carrot2 the pipeline is implemented over HTTP. I
> wonder how efficient that is, or can it be changed, for instance for an
> all-local implementation?

Yes, it is possible to combine components locally. It is 
even demonstrated in the sample code David Spencer mentioned.

> Has Carrot2 been integrated with Lucene, has it been used as the basis
> for a recommender system (could it be?)?

I don't know... I guess it could, but you'd have to play with the source 
code and modify it a bit to get the required functionality. I can't really 
say anything more specific because I don't know that subject in depth.

D.



Re: carrot2 question too - Re: Fun with the Wikipedia

Posted by David Spencer <da...@tropo.com>.
Otis Gospodnetic wrote:

> Adam,
> 
> Dawid posted some code that lets you use Carrot2 locally with Lucene,

See the zip URL embedded in this post for the Carrot2/Lucene code (it may 
also be in the Carrot2 CVS tree). This is what I used as the basis for the 
Wikipedia clustering pages:


http://www.newsarch.com/archive/mailinglist/jakarta/lucene/user/msg03928.html

> without the componentized pipeline system described on the Carrot2 site.



> 
> Otis
> 


RE: carrot2 question too - Re: Fun with the Wikipedia

Posted by Adam Saltiel <ad...@btinternet.com>.
OK, thanks.

Adam

> -----Original Message-----
> From: Otis Gospodnetic [mailto:otis_gospodnetic@yahoo.com]
> Sent: Monday, January 31, 2005 5:51 PM
> To: Lucene Users List; adam.saltiel@btinternet.com
> Subject: RE: carrot2 question too - Re: Fun with the Wikipedia
>
> Adam,
>
> Dawid posted some code that lets you use Carrot2 locally with Lucene,
> without the componentized pipeline system described on the Carrot2 site.
>
> Otis
>


RE: carrot2 question too - Re: Fun with the Wikipedia

Posted by Otis Gospodnetic <ot...@yahoo.com>.
Adam,

Dawid posted some code that lets you use Carrot2 locally with Lucene,
without the componentized pipeline system described on the Carrot2 site.

Otis



RE: carrot2 question too - Re: Fun with the Wikipedia

Posted by Adam Saltiel <ad...@btinternet.com>.
David, Hi,
Would you be able to comment on the coincidentally recent thread "RE: ->
Grouping Search Results by Clustering Snippets"?
Also, when I looked at Carrot2, the pipeline was implemented over HTTP. I
wonder how efficient that is, and whether it can be changed, for instance
to an all-local implementation.
Has Carrot2 been integrated with Lucene? Has it been used as the basis
for a recommender system (could it be)?
TIA.

Adam

> -----Original Message-----
> From: Dawid Weiss [mailto:dawid.weiss@cs.put.poznan.pl]
> Sent: Monday, January 31, 2005 4:12 PM
> To: Lucene Users List
> Subject: Re: carrot2 question too - Re: Fun with the Wikipedia
>
>
> Hi.
>
> Coming up with answers... a little belated, but hope you're still on:
>
> > we have been experimenting with carrot2 and are very pleased so far,
> > only one issue: there is no release not even an alpha one and the
> > dependencies seemed to be patched (jama)
>
> Yes, there is no "official" release. We just don't feel the need to tag
> the sources with an official label because Carrot is not a stand-alone
> product (rather a library... or a framework). It does not imply that the
> project is in alpha stage... quite the contrary, in fact -- it has been
> out there for a while and it seems to do a good job for most people.
>
> > is there any intentions to have any releases in the near future?
>
> I could tag a release even today if it makes you happy ;) But I hope I
> made the status of the project clear above.
>
> D.


Re: carrot2 question too - Re: Fun with the Wikipedia

Posted by Dawid Weiss <da...@cs.put.poznan.pl>.
Hi.

Coming up with answers... a little belated, but hope you're still on:

> we have been experimenting with carrot2 and are very pleased so far,
> only one issue: there is no release not even an alpha one and the
> dependencies seemed to be patched (jama)

Yes, there is no "official" release. We just don't feel the need to tag 
the sources with an official label because Carrot is not a stand-alone 
product (rather a library... or a framework). It does not imply that the 
project is in alpha stage... quite the contrary, in fact -- it has been 
out there for a while and it seems to do a good job for most people.

> is there any intentions to have any releases in the near future?

I could tag a release even today if it makes you happy ;) But I hope I 
made the status of the project clear above.

D.



Re: carrot2 question too - Re: Fun with the Wikipedia

Posted by Akmal Sarhan <as...@byteaction.de>.
Hello,

We have been experimenting with Carrot2 and are very pleased so far.
There is only one issue: there is no release, not even an alpha one, and
the dependencies seem to be patched (JAMA).
Are there any plans for a release in the near future?

thanks 

Akmal
On Monday, 17.01.2005 at 10:15 +0100, Dawid Weiss wrote:
> Hi David,
> 
> I apologize about the delay in answering this one, Lucene is a busy 
> mailing list and I had a hectic last week... Again, sorry for belated 
> answer, hope you still find it useful.
> 
> >> That is awesome and very inspirational!
> 
> Yes, I admit what you've done with Wikipedia is quite interesting and 
> looks very good. I'm also glad you spent some time working out Carrot 
> integration with Lucene. It works quite nice.
> 
> >> Carrot2 looks very interesting. Wondering if anybody has a list of all 
> >> the
> > 
> > Technically I don't think carrot2 uses lucene per-se- it's just that you 
> > can integrate the two, and ditto for Nutch - it has code that uses Carrot2.
> 
> Yes, this is true. Carrot2 doesn't use all of Lucene's potential -- it 
> merely takes the output from a query (titles, urls and snippets) and 
> attempts to cluster them into some sensible groups. I think many things 
> could be improved, the most important of them is fast snippet retrieval 
> from Lucene because right now it takes 50% of the time of the 
> clustering; I've seen a post a while ago describing a faster snippet 
> generation technique, I'm sure that would give clustering a huge boost 
> speed-wise.
> 
> > And here's my question. I reread the Carrot2<->Lucene code, esp 
> > Demo.java, and there's this fragment:
> > 
> >     // warm-up round (stemmer tables must be read etc).
> >     List clusters = clusterer.clusterHits(docs);
> > 
> >     long clusteringStartTime = System.currentTimeMillis();
> >     clusters = clusterer.clusterHits(docs);
> >     long clusteringEndTime = System.currentTimeMillis();
> > 
> > Thus it calls clusterHits() twice.
> > 
> > I don't really understand how to use Carrot2 - but I think the above is 
> > just for the sake of benchmarking clusterHits() w/o the effect of 1-time 
> > initialization - and that there's no benefit of repeatedly calling 
> > clusterHits (where a benefit might be that it can find nested clusters 
> > or whatever) - is that right (that there's no benefit)?
> 
> No, there is absolutely no benefit from it. It was merely to show people 
> that the clustering needs to be warmed up a bit. I should not have put 
> it in the code knowing people would be confused by it. You can safely 
> use clusterHits just once. It will just have a small delay at the first 
> invocation.
> 
> 
> Thanks for experimenting. Please BCC me if you have any urgent projects 
> -- I read Lucene's list in batches and my personal e-mail I try to keep 
> up to date with.
> 
> Dawid
-- 
Akmal Sarhan <as...@byteaction.de>
ByteAction GmbH




Re: carrot2 question too - Re: Fun with the Wikipedia

Posted by Dawid Weiss <da...@cs.put.poznan.pl>.
Hi David,

I apologize for the delay in answering this one; Lucene is a busy 
mailing list and I had a hectic week... Again, sorry for the belated 
answer; I hope you still find it useful.

>> That is awesome and very inspirational!

Yes, I admit what you've done with Wikipedia is quite interesting and 
looks very good. I'm also glad you spent some time working out Carrot 
integration with Lucene. It works quite nicely.

>> Carrot2 looks very interesting. Wondering if anybody has a list of all 
>> the
> 
> Technically I don't think carrot2 uses lucene per-se- it's just that you 
> can integrate the two, and ditto for Nutch - it has code that uses Carrot2.

Yes, this is true. Carrot2 doesn't use all of Lucene's potential -- it 
merely takes the output from a query (titles, urls and snippets) and 
attempts to cluster them into some sensible groups. I think many things 
could be improved. The most important is fast snippet retrieval from 
Lucene, because right now it takes 50% of the clustering time. I saw a 
post a while ago describing a faster snippet generation technique; I'm 
sure that would give clustering a huge boost speed-wise.

> And here's my question. I reread the Carrot2<->Lucene code, esp 
> Demo.java, and there's this fragment:
> 
>     // warm-up round (stemmer tables must be read etc).
>     List clusters = clusterer.clusterHits(docs);
> 
>     long clusteringStartTime = System.currentTimeMillis();
>     clusters = clusterer.clusterHits(docs);
>     long clusteringEndTime = System.currentTimeMillis();
> 
> Thus it calls clusterHits() twice.
> 
> I don't really understand how to use Carrot2 - but I think the above is 
> just for the sake of benchmarking clusterHits() w/o the effect of 1-time 
> initialization - and that there's no benefit of repeatedly calling 
> clusterHits (where a benefit might be that it can find nested clusters 
> or whatever) - is that right (that there's no benefit)?

No, there is absolutely no benefit from it. It was merely to show people 
that the clustering needs to be warmed up a bit. I should not have put 
it in the code knowing people would be confused by it. You can safely 
use clusterHits just once. It will just have a small delay at the first 
invocation.
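
The warm-up point can be sketched as follows. HitClusterer and 
LazyInitClusterer are hypothetical stand-ins, not the Carrot2 or Demo.java 
classes: the toy clusterer only simulates a one-time initialization on its 
first call, and System.nanoTime() is used here instead of 
System.currentTimeMillis().

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Illustrative only: timing a call with and without a warm-up round. */
final class WarmupTimingSketch {

    /** Hypothetical stand-in for the clusterer used in Demo.java. */
    interface HitClusterer {
        List<String> clusterHits(List<String> docs);
    }

    /** Toy clusterer whose first call pays a one-time initialization cost. */
    static final class LazyInitClusterer implements HitClusterer {
        private boolean initialized = false;
        int initCount = 0; // how many times initialization actually ran

        @Override
        public List<String> clusterHits(List<String> docs) {
            if (!initialized) {
                // Stands in for reading stemmer tables etc. on first use.
                initCount++;
                initialized = true;
            }
            // No real clustering here: the input is returned as one group.
            return new ArrayList<>(docs);
        }
    }

    /** Returns the elapsed nanoseconds of a single clusterHits() call. */
    static long timeOneCall(HitClusterer clusterer, List<String> docs) {
        long start = System.nanoTime();
        clusterer.clusterHits(docs);
        return System.nanoTime() - start;
    }

    public static void main(String[] args) {
        LazyInitClusterer clusterer = new LazyInitClusterer();
        List<String> docs = Arrays.asList("doc one", "doc two", "doc three");

        long cold = timeOneCall(clusterer, docs); // includes one-time setup
        long warm = timeOneCall(clusterer, docs); // steady-state cost only

        System.out.println("first call (ns):  " + cold);
        System.out.println("second call (ns): " + warm);
        System.out.println("initializations:  " + clusterer.initCount);
    }
}
```

Both calls return the same result; the second one exists only so that the 
measured time excludes the setup. In an application a single call is enough.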


Thanks for experimenting. Please BCC me if you have any urgent projects 
-- I read the Lucene list in batches, but I try to keep up to date with 
my personal e-mail.

Dawid



carrot2 question too - Re: Fun with the Wikipedia

Posted by David Spencer <da...@tropo.com>.
aneesha@codeintime.com wrote:

> That is awesome and very inspirational!

Thank you.

> 
> Carrot2 looks very interesting. Wondering if anybody has a list of all the

Technically I don't think Carrot2 uses Lucene per se; it's just that you 
can integrate the two, and ditto for Nutch, which has code that uses Carrot2.

This post is where the code I used as a basis came from:

http://www.newsarch.com/archive/mailinglist/jakarta/lucene/user/msg03928.html

This is the URL with the code:

http://www.cs.put.poznan.pl/dweiss/tmp/carrot2-lucene.zip


And here's my question. I reread the Carrot2<->Lucene code, esp 
Demo.java, and there's this fragment:

     // warm-up round (stemmer tables must be read etc).
     List clusters = clusterer.clusterHits(docs);

     long clusteringStartTime = System.currentTimeMillis();
     clusters = clusterer.clusterHits(docs);
     long clusteringEndTime = System.currentTimeMillis();

Thus it calls clusterHits() twice.

I don't really understand how to use Carrot2, but I think the above is 
just for the sake of benchmarking clusterHits() without the effect of 
one-time initialization, and that there's no benefit to calling 
clusterHits() repeatedly (where a benefit might be that it can find nested 
clusters or whatever). Is that right (that there's no benefit)?



> academic research projects using Lucene. The only other one that I know of
> is Striver - which uses a support vector machine to learn the ranking
> function: http://www.cs.cornell.edu/People/tj/career/

You could always search CiteSeer for mentions of Lucene too.
> 
> Aneesha


Re: Fun with the Wikipedia

Posted by an...@codeintime.com.
That is awesome and very inspirational!

Carrot2 looks very interesting. Wondering if anybody has a list of all the
academic research projects using Lucene. The only other one that I know of
is Striver - which uses a support vector machine to learn the ranking
function: http://www.cs.cornell.edu/People/tj/career/

Aneesha

