You are viewing a plain text version of this content. The canonical link for it is here.
Posted to java-user@lucene.apache.org by Dawid Weiss <da...@cs.put.poznan.pl> on 2004/10/07 10:39:26 UTC

Re: Clustering lucene's results

Hi William,

Ok, here is some demo code I've put together that shows how you can 
achieve clustering of Lucene's results. I hope this will get you started 
on your projects. If you have questions, please don't hesitate to ask -- 
cross posts to carrot2-developers would be a good idea too.

The code (plus the binaries so that you don't have to check out all of 
Carrot2 ;) are at:
http://www.cs.put.poznan.pl/dweiss/tmp/carrot2-lucene.zip

Take a look at Demo.java -- it is the main link between Lucene and 
Carrot. Play with the parameters, I used 100 as the number of search 
results to be clustered. Adjust it to your needs.

		int start = 0;
		int requiredHits = 100;

I hope the code will be self-explanatory.

Good luck,
Dawid

 From the readme file:

An example of using Carrot2 components to clustering search
results from Lucene.
===========================================================


Prerequisities
--------------

You must have an index created with Lucene and containing
documents with the following fields: url, title, summary.

The Lucene demo works with exactly these fields -- I just indexed
all of Lucene's source code and documentation using the following line:

mkdir index
java -Djava.ext.dirs=build org.apache.lucene.demo.IndexHTML -create 
-index index .

The index is now in 'index' folder.

Remember that the quality of snippets and titles heavily influences the
output of the clustering; in fact, the above example index of Lucene's 
API is
not too good because most queries will return nonsensical cluster labels
(see below).

Building Carrot2-Lucene demo
----------------------------

Basically you should have all of Carrot2 source code checked out and
issue the building command:

ant -Dcopy.dependencies=true

All of the required libraries and Carrot2 components will end up
in 'tmp/dist/deps-carrot2-lucene-example-jar' folder.

You can also spare yourself some time and download precompiled binaries
I've put at:

http://www.cs.put.poznan.pl/dweiss/tmp/carrot2-lucene.zip

Now, once you have the compiled binaries, issue the following command
(all on one line of course):

java -Djava.ext.dirs=tmp\dist;tmp\dist\deps-carrot2-lucene-example-jar \
	com.dawidweiss.carrot.lucene.Demo index query

The first argument is the location of the Lucene's index created before. 
The second argument
is a query. In the output you should have clusters and max. three 
documents from every cluster:

Results for: query
Timings: index opened in: 0,181s, search: 0,13s, clustering: 0,721s
  :> Search Lucene Rc1 Dev API
     - 
F:/Repositories/cvs.apache.org/jakarta-lucene/build/docs/api/org/apache/lucene/search/class-use/Query.html
       Uses of Class org.apache.lucene.search.Query (Lucene 1.5-rc1-dev API)
     - 
F:/Repositories/cvs.apache.org/jakarta-lucene/build/docs/api/org/apache/lucene/search/package-summary.html
       org.apache.lucene.search (Lucene 1.5-rc1-dev API)
     - 
F:/Repositories/cvs.apache.org/jakarta-lucene/build/docs/api/org/apache/lucene/search/package-use.html
       Uses of Package org.apache.lucene.search (Lucene 1.5-rc1-dev API)
       (and 19 more)

  :> Jakarta Lucene
     - F:/Repositories/cvs.apache.org/jakarta-lucene/src/java/overview.html
       Jakarta Lucene API
     - F:/Repositories/cvs.apache.org/jakarta-lucene/docs/whoweare.html
       Jakarta Lucene - Who We Are - Jakarta Lucene
     - F:/Repositories/cvs.apache.org/jakarta-lucene/docs/index.html
       Jakarta Lucene - Overview - Jakarta Lucene
       (and 12 more)

If you look at the source code of Demo.java, there are plenty of things
apt for customization -- number of results from each cluster, number of 
displayed
clusters (I would cut it to some reasonable number, say 10 or 15 -- the 
further a
cluster is from the "top", the less it is likely to be important). Also keep
in mind that some of Carrot2 components produce hierarchical clusters. 
This demonstration
works with "flat" version of Lingo algorithm, so you don't need to worry 
about it.

Hope this gets you started with using Carrot2 and Lucene.
Please let me know about any successes or failures.

Dawid

---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org


Re: Clustering lucene's results

Posted by Dawid Weiss <da...@cs.put.poznan.pl>.
Nope, because the example I showed is based on the "local interfaces" 
pipeline and output-xsltrenderer is for remote components only.

Anyway, I don't think it makes much sense -- if you need xslt badly, 
just modify the source code to output the results as XML and put an xslt 
filter on top of what it returns. Shouldn't be too hard.

Dawid

Albert Vila wrote:
> That's great, thanks dawid.
> 
> Just a question, how can I modify your code in order to use the 
> carrot2-output-xsltrenderer to output the clustering results in a html 
> page?
> 
> Can you provide an example?
> 
> Thanks
> 
> Dawid Weiss wrote:
> 
>>
>> Hi William,
>>
>> Ok, here is some demo code I've put together that shows how you can 
>> achieve clustering of Lucene's results. I hope this will get you 
>> started on your projects. If you have questions, please don't hesitate 
>> to ask -- cross posts to carrot2-developers would be a good idea too.
>>
>> The code (plus the binaries so that you don't have to check out all of 
>> Carrot2 ;) are at:
>> http://www.cs.put.poznan.pl/dweiss/tmp/carrot2-lucene.zip
>>
>> Take a look at Demo.java -- it is the main link between Lucene and 
>> Carrot. Play with the parameters, I used 100 as the number of search 
>> results to be clustered. Adjust it to your needs.
>>
>>         int start = 0;
>>         int requiredHits = 100;
>>
>> I hope the code will be self-explanatory.
>>
>> Good luck,
>> Dawid
>>
>> From the readme file:
>>
>> An example of using Carrot2 components to clustering search
>> results from Lucene.
>> ===========================================================
>>
>>
>> Prerequisities
>> --------------
>>
>> You must have an index created with Lucene and containing
>> documents with the following fields: url, title, summary.
>>
>> The Lucene demo works with exactly these fields -- I just indexed
>> all of Lucene's source code and documentation using the following line:
>>
>> mkdir index
>> java -Djava.ext.dirs=build org.apache.lucene.demo.IndexHTML -create 
>> -index index .
>>
>> The index is now in 'index' folder.
>>
>> Remember that the quality of snippets and titles heavily influences the
>> output of the clustering; in fact, the above example index of Lucene's 
>> API is
>> not too good because most queries will return nonsensical cluster labels
>> (see below).
>>
>> Building Carrot2-Lucene demo
>> ----------------------------
>>
>> Basically you should have all of Carrot2 source code checked out and
>> issue the building command:
>>
>> ant -Dcopy.dependencies=true
>>
>> All of the required libraries and Carrot2 components will end up
>> in 'tmp/dist/deps-carrot2-lucene-example-jar' folder.
>>
>> You can also spare yourself some time and download precompiled binaries
>> I've put at:
>>
>> http://www.cs.put.poznan.pl/dweiss/tmp/carrot2-lucene.zip
>>
>> Now, once you have the compiled binaries, issue the following command
>> (all on one line of course):
>>
>> java -Djava.ext.dirs=tmp\dist;tmp\dist\deps-carrot2-lucene-example-jar \
>>     com.dawidweiss.carrot.lucene.Demo index query
>>
>> The first argument is the location of the Lucene's index created 
>> before. The second argument
>> is a query. In the output you should have clusters and max. three 
>> documents from every cluster:
>>
>> Results for: query
>> Timings: index opened in: 0,181s, search: 0,13s, clustering: 0,721s
>>  :> Search Lucene Rc1 Dev API
>>     - 
>> F:/Repositories/cvs.apache.org/jakarta-lucene/build/docs/api/org/apache/lucene/search/class-use/Query.html 
>>
>>       Uses of Class org.apache.lucene.search.Query (Lucene 1.5-rc1-dev 
>> API)
>>     - 
>> F:/Repositories/cvs.apache.org/jakarta-lucene/build/docs/api/org/apache/lucene/search/package-summary.html 
>>
>>       org.apache.lucene.search (Lucene 1.5-rc1-dev API)
>>     - 
>> F:/Repositories/cvs.apache.org/jakarta-lucene/build/docs/api/org/apache/lucene/search/package-use.html 
>>
>>       Uses of Package org.apache.lucene.search (Lucene 1.5-rc1-dev API)
>>       (and 19 more)
>>
>>  :> Jakarta Lucene
>>     - 
>> F:/Repositories/cvs.apache.org/jakarta-lucene/src/java/overview.html
>>       Jakarta Lucene API
>>     - F:/Repositories/cvs.apache.org/jakarta-lucene/docs/whoweare.html
>>       Jakarta Lucene - Who We Are - Jakarta Lucene
>>     - F:/Repositories/cvs.apache.org/jakarta-lucene/docs/index.html
>>       Jakarta Lucene - Overview - Jakarta Lucene
>>       (and 12 more)
>>
>> If you look at the source code of Demo.java, there are plenty of things
>> apt for customization -- number of results from each cluster, number 
>> of displayed
>> clusters (I would cut it to some reasonable number, say 10 or 15 -- 
>> the further a
>> cluster is from the "top", the less it is likely to be important). 
>> Also keep
>> in mind that some of Carrot2 components produce hierarchical clusters. 
>> This demonstration
>> works with "flat" version of Lingo algorithm, so you don't need to 
>> worry about it.
>>
>> Hope this gets you started with using Carrot2 and Lucene.
>> Please let me know about any successes or failures.
>>
>> Dawid
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
>> For additional commands, e-mail: lucene-user-help@jakarta.apache.org
>>
>>
>>
> 

---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org


Re: Clustering lucene's results

Posted by Albert Vila <av...@imente.com>.
That's great, thanks dawid.

Just a question, how can I modify your code in order to use the 
carrot2-output-xsltrenderer to output the clustering results in a html page?

Can you provide an example?

Thanks

Dawid Weiss wrote:

>
> Hi William,
>
> Ok, here is some demo code I've put together that shows how you can 
> achieve clustering of Lucene's results. I hope this will get you 
> started on your projects. If you have questions, please don't hesitate 
> to ask -- cross posts to carrot2-developers would be a good idea too.
>
> The code (plus the binaries so that you don't have to check out all of 
> Carrot2 ;) are at:
> http://www.cs.put.poznan.pl/dweiss/tmp/carrot2-lucene.zip
>
> Take a look at Demo.java -- it is the main link between Lucene and 
> Carrot. Play with the parameters, I used 100 as the number of search 
> results to be clustered. Adjust it to your needs.
>
>         int start = 0;
>         int requiredHits = 100;
>
> I hope the code will be self-explanatory.
>
> Good luck,
> Dawid
>
> From the readme file:
>
> An example of using Carrot2 components to clustering search
> results from Lucene.
> ===========================================================
>
>
> Prerequisities
> --------------
>
> You must have an index created with Lucene and containing
> documents with the following fields: url, title, summary.
>
> The Lucene demo works with exactly these fields -- I just indexed
> all of Lucene's source code and documentation using the following line:
>
> mkdir index
> java -Djava.ext.dirs=build org.apache.lucene.demo.IndexHTML -create 
> -index index .
>
> The index is now in 'index' folder.
>
> Remember that the quality of snippets and titles heavily influences the
> output of the clustering; in fact, the above example index of Lucene's 
> API is
> not too good because most queries will return nonsensical cluster labels
> (see below).
>
> Building Carrot2-Lucene demo
> ----------------------------
>
> Basically you should have all of Carrot2 source code checked out and
> issue the building command:
>
> ant -Dcopy.dependencies=true
>
> All of the required libraries and Carrot2 components will end up
> in 'tmp/dist/deps-carrot2-lucene-example-jar' folder.
>
> You can also spare yourself some time and download precompiled binaries
> I've put at:
>
> http://www.cs.put.poznan.pl/dweiss/tmp/carrot2-lucene.zip
>
> Now, once you have the compiled binaries, issue the following command
> (all on one line of course):
>
> java -Djava.ext.dirs=tmp\dist;tmp\dist\deps-carrot2-lucene-example-jar \
>     com.dawidweiss.carrot.lucene.Demo index query
>
> The first argument is the location of the Lucene's index created 
> before. The second argument
> is a query. In the output you should have clusters and max. three 
> documents from every cluster:
>
> Results for: query
> Timings: index opened in: 0,181s, search: 0,13s, clustering: 0,721s
>  :> Search Lucene Rc1 Dev API
>     - 
> F:/Repositories/cvs.apache.org/jakarta-lucene/build/docs/api/org/apache/lucene/search/class-use/Query.html 
>
>       Uses of Class org.apache.lucene.search.Query (Lucene 1.5-rc1-dev 
> API)
>     - 
> F:/Repositories/cvs.apache.org/jakarta-lucene/build/docs/api/org/apache/lucene/search/package-summary.html 
>
>       org.apache.lucene.search (Lucene 1.5-rc1-dev API)
>     - 
> F:/Repositories/cvs.apache.org/jakarta-lucene/build/docs/api/org/apache/lucene/search/package-use.html 
>
>       Uses of Package org.apache.lucene.search (Lucene 1.5-rc1-dev API)
>       (and 19 more)
>
>  :> Jakarta Lucene
>     - 
> F:/Repositories/cvs.apache.org/jakarta-lucene/src/java/overview.html
>       Jakarta Lucene API
>     - F:/Repositories/cvs.apache.org/jakarta-lucene/docs/whoweare.html
>       Jakarta Lucene - Who We Are - Jakarta Lucene
>     - F:/Repositories/cvs.apache.org/jakarta-lucene/docs/index.html
>       Jakarta Lucene - Overview - Jakarta Lucene
>       (and 12 more)
>
> If you look at the source code of Demo.java, there are plenty of things
> apt for customization -- number of results from each cluster, number 
> of displayed
> clusters (I would cut it to some reasonable number, say 10 or 15 -- 
> the further a
> cluster is from the "top", the less it is likely to be important). 
> Also keep
> in mind that some of Carrot2 components produce hierarchical clusters. 
> This demonstration
> works with "flat" version of Lingo algorithm, so you don't need to 
> worry about it.
>
> Hope this gets you started with using Carrot2 and Lucene.
> Please let me know about any successes or failures.
>
> Dawid
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
> For additional commands, e-mail: lucene-user-help@jakarta.apache.org
>
>
>

-- 
Albert Vila
Director de proyectos I+D
http://www.imente.com
902 933 242
[iMente �La información con más beneficios�]


---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org