Posted to dev@nutch.apache.org by "robert benea (JIRA)" <ji...@apache.org> on 2005/10/04 17:48:47 UTC
[jira] Created: (NUTCH-103) Vivisimo like treeview and url redirect
Vivisimo like treeview and url redirect
---------------------------------------
Key: NUTCH-103
URL: http://issues.apache.org/jira/browse/NUTCH-103
Project: Nutch
Type: New Feature
Components: web gui
Versions: 0.8-dev
Environment: linux
Reporter: robert benea
First, I modified cluster.jsp so that the cluster view now has a Vivisimo-like look. I used JavaScript to show the tree view. Another small change is that I call the clusterer recursively twice, so that two levels of clustering are shown.
Second, I added redirect.jsp in order to log the links that were clicked during search; because of that, search.jsp is changed as well.
The code is not clean, as it all started as an experiment. I hope someone else finds it useful and cleans it up ;-).
To install it, just copy the files to where you deployed nutch.war and it will work auto-magically.
Regards,
R.
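The click logging through redirect.jsp described above could work roughly as in the following sketch. This is not the code from the attachment: the class name, the "url" parameter name and the StringBuilder used as a log are assumptions made for illustration only.

```java
import java.net.URLDecoder;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not the code shipped with this issue.
class ClickRedirect {
    // Assumed name of the redirect page and of its query parameter.
    static final String PREFIX = "redirect.jsp?url=";

    // search.jsp side: wrap a result URL so the click passes through redirect.jsp.
    static String wrap(String target) {
        return PREFIX + URLEncoder.encode(target, StandardCharsets.UTF_8);
    }

    // redirect.jsp side: recover the target from the raw (still encoded) parameter
    // value, record it, and return it so the page can redirect to it.
    // In a real JSP, request.getParameter("url") would already be decoded.
    static String unwrap(String rawParam, StringBuilder log) {
        String target = URLDecoder.decode(rawParam, StandardCharsets.UTF_8);
        log.append(target).append('\n');
        return target;
    }

    public static void main(String[] args) {
        StringBuilder log = new StringBuilder();
        String link = wrap("http://example.org/a?b=c&d=e");
        System.out.println(link);
        System.out.println(unwrap(link.substring(PREFIX.length()), log));
    }
}
```

Encoding the target keeps its own query string from being mixed up with the parameters of the redirect page.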
--
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
http://www.atlassian.com/software/jira
Re: [jira] Created: (NUTCH-103) Vivisimo like treeview and url redirect
Posted by Dawid Weiss <da...@cs.put.poznan.pl>.
> 1. AHC doesn't have any local filter that implements LocalFilterComponent,
> RawClusterProducer and so on. How can I achieve that? From a very
> superficial point of view it seems that nobody uses the AHC class.
No. We added local interfaces after AHC was implemented. It wasn't
maintained afterwards. So, you'd have to look in the code of a servlet
that implements AHC and write a corresponding local component to it.
> 2. How do the stopwords and stemmers work for STC ?
If I recall correctly, STC in its remote version relies on the input XML
stream having all the metadata with linguistic information. With the
local component... I honestly don't remember off the top of my head even
though I wrote it myself (a few years ago).
I'll work on a local demo application later this week -- I'll publish
the sources, so you'll be able to see how the local controller works and
how various components are configured. We may even create a separate
repository to collaborate and then prepare a patch to Nutch trunk.
I will start working on this later this week.
D.
Re: [jira] Created: (NUTCH-103) Vivisimo like treeview and url redirect
Posted by Robert Benea <ro...@gmail.com>.
On 10/6/05, Dawid Weiss <da...@cs.put.poznan.pl> wrote:
>
>
> > That would be great, I looked already to the code base in the plug-in
> > directory and it seems you use this call to get the clustering results:
> >
> > controller.query("lingo-nmf-km-3", "pseudo-query", requestParams);
> > am I right ?
> >
> > anyway, I want to have the type of algorithm used for clustering, picked
> up
> > from the xml file, it should be easy to do so.
>
> Yes, it is quite easy -- the controller above can be instantiated from
> an XML file or from a Beanshell script using a local controller
> component (not in the Nutch codebase yet). There are unit tests of that
> controller in Carrot2 CVS, but it has been added recently so I didn't
> have the time to integrate it in a solid working example.
Hi Dawid,
I was able to hack the Clusterer class and make it work for STC; here is my
hack ;-)
// Clustering component here.
LocalComponentFactory stcFactory = new LocalComponentFactoryBase() {
    public LocalComponent getInstance() {
        // Note: this map still holds the Lingo (lsi.*) settings from the
        // original code; it is not passed to the STC component below.
        HashMap defaults = new HashMap();
        // These are adjustments settings for the clustering algorithm...
        // You can play with them, but the values below are our 'best guess'
        // settings that we acquired experimentally.
        defaults.put("lsi.threshold.clusterAssignment", "0.150");
        defaults.put("lsi.threshold.candidateCluster", "0.775");
        // TODO: this should be eventually replaced with documents from Nutch
        // tagged with a language tag. There is no need to again determine
        // the language of a document.
        return new STCLocalFilterComponent();
    }
};
// The factory is still registered under the original Lingo key.
controller.addLocalComponentFactory("filter.lingo-old", stcFactory);
} // closes the enclosing block of the Clusterer code (not shown)
But I have two questions:
1. AHC doesn't have any local filter that implements LocalFilterComponent,
RawClusterProducer and so on. How can I achieve that? From a very
superficial point of view it seems that nobody uses the AHC class.
2. How do the stopwords and stemmers work for STC ?
> There is one potential problem that I see -- Nutch plugins require
> explicit JAR references. If you want to switch between algorithms you'll
> need to either put all Carrot2 JARs in the descriptor, put them in
> CLASSPATH before Nutch starts or do some other trickery with class
> loading.
I just put the stc.jar in the lib directory, I will optimize it later ;-).
Cheers,
R.
> I won't be able to help you until next week, but after then I'll try to
> find some time to prepare you an example of how the scriptable
> controller is used (or look at the unit tests, the component is called
> carrot2-local-controller).
>
> Dawid
>
Re: [jira] Created: (NUTCH-103) Vivisimo like treeview and url redirect
Posted by Jérôme Charron <je...@gmail.com>.
> There is one potential problem that I see -- Nutch plugins require
> explicit JAR references. If you want to switch between algorithms you'll
> need to either put all Carrot2 JARs in the descriptor, put them in
> CLASSPATH before Nutch starts or do some other trickery with class
> loading.
In the trunk only, you can now also define inter-plugin dependencies
using plugin identifiers instead of explicit JAR references. These
dependencies are then checked for availability and added to the
classloader at runtime.
Take a look at the analyze-fr and analyze-de plugins, which depend on
lib-lucene-analyzers.
You can also see that all plugins now depend on the
nutch-extensionpoints plugin.
For instance, I recently noticed that many plugins import a log4j.jar.
It would be a good idea to define a lib-log4j plugin and add a dependency
on it to each plugin that imports log4j.jar in its lib (of course, we
must take care of the log4j version used).
Jérôme
--
http://motrech.free.fr/
http://www.frutch.org/
Re: [jira] Created: (NUTCH-103) Vivisimo like treeview and url redirect
Posted by Dawid Weiss <da...@cs.put.poznan.pl>.
> That would be great, I looked already to the code base in the plug-in
> directory and it seems you use this call to get the clustering results:
>
> controller.query("lingo-nmf-km-3", "pseudo-query", requestParams);
> am I right ?
>
> anyway, I want to have the type of algorithm used for clustering, picked up
> from the xml file, it should be easy to do so.
Yes, it is quite easy -- the controller above can be instantiated from
an XML file or from a Beanshell script using a local controller
component (not in the Nutch codebase yet). There are unit tests of that
controller in Carrot2 CVS, but it has been added recently so I didn't
have the time to integrate it in a solid working example.
There is one potential problem that I see -- Nutch plugins require
explicit JAR references. If you want to switch between algorithms you'll
need to either put all Carrot2 JARs in the descriptor, put them in
CLASSPATH before Nutch starts or do some other trickery with class loading.
I won't be able to help you until next week, but after that I'll try to
find some time to prepare an example for you of how the scriptable
controller is used (or look at the unit tests; the component is called
carrot2-local-controller).
Dawid
Re: [jira] Created: (NUTCH-103) Vivisimo like treeview and url redirect
Posted by Robert Benea <ro...@gmail.com>.
On 10/5/05, Dawid Weiss <da...@cs.put.poznan.pl> wrote:
>
>
> > I am planning to take a closer look to the carrot2 implementation and
> expose
> > the other algorithms to the user,
>
> That's actually quite simple -- I was planning to do it, but have no
> time at the moment. The current Carrot2 code in Nutch is a preconfigured
> process which uses the open source Lingo clustering algorithm to cluster
> documents. But in the codebase of Carrot2 there is now a scriptable
> controller, so you could basically have external scripts configuring
> several different algorithms. It really isn't that difficult. If you
> need any help, let me know -- private e-mail or the newsgroup, whatever.
That would be great. I have already looked at the code base in the plug-in
directory, and it seems you use this call to get the clustering results:
controller.query("lingo-nmf-km-3", "pseudo-query", requestParams);
Am I right?
Anyway, I want the clustering algorithm to be picked up from the XML file;
it should be easy to do.
Any guidelines or ideas are welcome.
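A minimal sketch of picking the process name from an XML file, using only the JDK's DOM parser. The property name "clustering.process" and the value "my-stc-process" are hypothetical; the property/name/value layout mirrors the plugin.includes fragment quoted elsewhere in this thread, and the fallback is the id from the call above.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper: reads one <property> value from a Nutch-style XML config.
class ClusteringConfig {
    static String lookup(String xml, String name, String fallback) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            NodeList props = doc.getElementsByTagName("property");
            for (int i = 0; i < props.getLength(); i++) {
                Element p = (Element) props.item(i);
                NodeList names = p.getElementsByTagName("name");
                NodeList values = p.getElementsByTagName("value");
                if (names.getLength() > 0 && values.getLength() > 0
                        && name.equals(names.item(0).getTextContent().trim())) {
                    return values.item(0).getTextContent().trim();
                }
            }
            return fallback; // property not present: keep the hard-coded default
        } catch (Exception e) {
            throw new IllegalArgumentException("unreadable configuration", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<configuration><property>"
            + "<name>clustering.process</name><value>my-stc-process</value>"
            + "</property></configuration>";
        // The result would replace the literal in controller.query(...).
        System.out.println(lookup(xml, "clustering.process", "lingo-nmf-km-3"));
    }
}
```

The returned string would take the place of the literal first argument in the controller.query call, assuming a process with that id has been registered with the controller.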
> > changes to the algorithm(s) so that speed wise be as good as vivisimo (not
> > only interface wise ;-)).
>
> We don't know what Vivisimo algorithm is really like in terms of speed.
> Its authors and co-funders are excellent researchers, so I guess it
> will be a tough beast to beat :) But of course we don't have any reasons
> to be ashamed -- the open source version is quite decent.
That's the spirit, and is going to get better ;-).
> In the
> commercial version we refactored the codebase and added an optional
> native matrix computation library. The speedup is significant (which
> matters only if your servers are really under a lot of load).
>
> Dawid
Cheers,
R.
Re: [jira] Created: (NUTCH-103) Vivisimo like treeview and url redirect
Posted by Dawid Weiss <da...@cs.put.poznan.pl>.
> I am planning to take a closer look to the carrot2 implementation and expose
> the other algorithms to the user,
That's actually quite simple -- I was planning to do it, but have no
time at the moment. The current Carrot2 code in Nutch is a preconfigured
process which uses the open source Lingo clustering algorithm to cluster
documents. But in the codebase of Carrot2 there is now a scriptable
controller, so you could basically have external scripts configuring
several different algorithms. It really isn't that difficult. If you
need any help, let me know -- private e-mail or the newsgroup, whatever.
> changes to the algorithm(s) so that speed wise be as good as vivisimo (not
> only interface wise ;-)).
We don't know what Vivisimo algorithm is really like in terms of speed.
Its authors and co-funders are excellent researchers, so I guess it
will be a tough beast to beat :) But of course we don't have any reasons
to be ashamed -- the open source version is quite decent. In the
commercial version we refactored the codebase and added an optional
native matrix computation library. The speedup is significant (which
matters only if your servers are really under a lot of load).
Dawid
Re: [jira] Created: (NUTCH-103) Vivisimo like treeview and url redirect
Posted by Robert Benea <ro...@gmail.com>.
Hi Dawid,
I am planning to take a closer look at the Carrot2 implementation and expose
the other algorithms to the user and, if possible, make the necessary
changes to the algorithm(s) so that they are as good as Vivisimo speed-wise
(not only interface-wise ;-)).
Cheers,
R.
On 10/4/05, Dawid Weiss <da...@cs.put.poznan.pl> wrote:
>
>
> Hi Robert,
>
> > First, I modified cluster.jsp and now the cluster has a vivisimo
> > look. I used javascript to show the treeview. Another small change
> > is that I call the cluster recursively twice, so that two levels of
> > clustering are shown.
>
> Yep, this is the simplest way of inducing a hierarchy. There are
> algorithms that do it faster -- a whole spectrum of agglomerative or
> divisive approaches (AHC), a simple overlap-merging method used in STC
> or the commercial version of Lingo clustering component. Vivisimo most
> likely uses a single-pass variant as well.
>
> D.
>
Re: [jira] Created: (NUTCH-103) Vivisimo like treeview and url redirect
Posted by Dawid Weiss <da...@cs.put.poznan.pl>.
Hi Robert,
> First, I modified cluster.jsp and now the cluster has a vivisimo
> look. I used javascript to show the treeview. Another small change
> is that I call the cluster recursively twice, so that two levels of
> clustering are shown.
Yep, this is the simplest way of inducing a hierarchy. There are
algorithms that do it faster -- a whole spectrum of agglomerative or
divisive approaches (AHC), a simple overlap-merging method used in STC
or the commercial version of Lingo clustering component. Vivisimo most
likely uses a single-pass variant as well.
D.
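The "cluster, then cluster each cluster again" approach discussed above can be sketched as follows. This is an illustration only: the Node type and the byNextChar grouping are toy stand-ins, not Carrot2 classes, and a real setup would plug the actual flat clusterer in as the function argument.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

class TwoLevelClustering {
    // One node of the displayed tree: a label, its documents, its sub-clusters.
    static final class Node {
        final String label;
        final List<String> docs;
        final List<Node> children = new ArrayList<>();
        Node(String label, List<String> docs) { this.label = label; this.docs = docs; }
    }

    // Runs the flat clusterer, then re-runs it on each cluster's documents,
    // producing at most `depth` levels.
    static List<Node> cluster(List<String> docs,
            Function<List<String>, Map<String, List<String>>> flat, int depth) {
        List<Node> out = new ArrayList<>();
        if (depth <= 0) return out;
        for (Map.Entry<String, List<String>> e : flat.apply(docs).entrySet()) {
            Node n = new Node(e.getKey(), e.getValue());
            n.children.addAll(cluster(e.getValue(), flat, depth - 1));
            out.add(n);
        }
        return out;
    }

    // Toy flat "clusterer": groups strings by the first character that follows
    // their common prefix. Returns no groups for fewer than two documents.
    static Map<String, List<String>> byNextChar(List<String> docs) {
        Map<String, List<String>> groups = new LinkedHashMap<>();
        if (docs.size() < 2) return groups;
        String first = docs.get(0);
        int p = first.length();
        for (String d : docs) {
            int i = 0;
            while (i < p && i < d.length() && d.charAt(i) == first.charAt(i)) i++;
            p = i;
        }
        for (String d : docs) {
            String key = d.length() > p ? d.substring(0, p + 1) : d;
            groups.computeIfAbsent(key, k -> new ArrayList<>()).add(d);
        }
        return groups;
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList("apple pie", "apple tart", "banana bread");
        for (Node top : cluster(docs, TwoLevelClustering::byNextChar, 2)) {
            System.out.println(top.label + " " + top.docs);
            for (Node sub : top.children) System.out.println("  " + sub.label + " " + sub.docs);
        }
    }
}
```

Each extra level repeats the whole clustering step on every cluster, which is why the single-pass hierarchical methods mentioned above can be faster.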
Re: [jira] Commented: (NUTCH-103) Vivisimo like treeview and url redirect
Posted by Robert Benea <ro...@gmail.com>.
Yes.
Did you enable clustering?
<property>
<name>plugin.includes</name>
<value>geoPosition|protocol-httpclient|urlfilter-regex|protocol-file|parse-(text|html|js|word|rss)|index-basic|query-(phrase|site|url)|clustering-carrot2</value>
<description>Regular expression naming plugin directory names to
Cheers,
R.
P.S. Let me know if you got it working.
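To check whether a given plugin directory is covered by the plugin.includes value above, the expression can be tried with plain java.util.regex. This is a sketch under the assumption that the whole directory name has to match the expression; the class and method names are made up for the example.

```java
import java.util.regex.Pattern;

// Hypothetical check; it only exercises the regular expression itself.
class PluginIncludes {
    static boolean isIncluded(String pluginIncludes, String pluginDir) {
        // matches() requires the entire directory name to match.
        return Pattern.compile(pluginIncludes).matcher(pluginDir).matches();
    }

    public static void main(String[] args) {
        String value = "geoPosition|protocol-httpclient|urlfilter-regex|protocol-file"
            + "|parse-(text|html|js|word|rss)|index-basic|query-(phrase|site|url)"
            + "|clustering-carrot2";
        System.out.println(isIncluded(value, "clustering-carrot2"));
        System.out.println(isIncluded(value, "parse-pdf"));
    }
}
```

With plain java.util.regex, a literal line break inside the value becomes part of the alternative that follows it, so it is safest to keep the value on one line. Whether Nutch normalizes such whitespace is not stated in this thread.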
On 10/17/05, Chih How Bong <ch...@gmail.com> wrote:
>
> Thanks Robert. I will try it again.
> Do I have to configure any settings to point to any nutch plugin?
> Have good day.
>
> Bong
>
> On 10/18/05, Robert Benea <ro...@gmail.com> wrote:
> >
> > You need to untar clusty.tar inside your ROOT directory
> > (jakarta/webapps/ROOT), then open search.jsp and cluster.jsp and
> > re-save them so that the web server picks up the new versions.
> > That's about it.
> >
> > It should take about 5 minutes to get this running; you are probably
> > not very familiar with this project yet.
> >
> > Cheers,
> > R.
> >
> > On 10/17/05, Bong Chih How (JIRA) <ji...@apache.org> wrote:
> > >
> > > [
> > >
> >
> http://issues.apache.org/jira/browse/NUTCH-103?page=comments#action_12332316]
> > >
> > > Bong Chih How commented on NUTCH-103:
> > > -------------------------------------
> > >
> > > I adventitously discover this wonderful application some time ago
> > through
> > > JIRA. It enticed me to download the achieve and install in on my linux
> > box
> > > which was running nutch 0.7. After spending ~10 hours, i have no luck
> to
> > > get it work . I would appreciate if detail installation documentation
> is
> > > available here. Thanks.
> > >
> > > > Vivisimo like treeview and url redirect
> > > > ---------------------------------------
> > > >
> > > > Key: NUTCH-103
> > > > URL: http://issues.apache.org/jira/browse/NUTCH-103
> > > > Project: Nutch
> > > > Type: Improvement
> > > > Components: web gui
> > > > Versions: 0.8-dev
> > > > Environment: linux
> > > > Reporter: robert benea
> > > > Priority: Trivial
> > > > Attachments: clusty.tar
> > > >
> > > > First, I modified cluster.jsp and now the cluster has a vivisimo
> look.
> > I
> > > used javascript to show the treeview. Another small change is that I
> > call
> > > the cluster recursively twice, so that two levels of clustering are
> > shown.
> > > > Second, I added redirect.jsp in order to log the links that were
> > clicked
> > > during search and because of that search.jsp is changed as well.
> > > > The code is not clean as all started as an experiment, I hope
> someone
> > > else finds it useful and clean it up ;-).
> > > > To install it just copy the files where you deployed the nutch.war and
> > > will work auto-magically.
> > > > Regards,
> > > > R.
> > >
Re: [jira] Commented: (NUTCH-103) Vivisimo like treeview and url redirect
Posted by Chih How Bong <ch...@gmail.com>.
Thanks Robert. I will try it again.
Do I have to configure any settings to point to any nutch plugin?
Have good day.
Bong
On 10/18/05, Robert Benea <ro...@gmail.com> wrote:
>
> You need to untar clusty.tar inside your ROOT directory
> (jakarta/webapps/ROOT), then open search.jsp and cluster.jsp and
> re-save them so that the web server picks up the new versions.
> That's about it.
>
> It should take about 5 minutes to get this running; you are probably
> not very familiar with this project yet.
>
> Cheers,
> R.
>
> On 10/17/05, Bong Chih How (JIRA) <ji...@apache.org> wrote:
> >
> > [
> >
> http://issues.apache.org/jira/browse/NUTCH-103?page=comments#action_12332316]
> >
> > Bong Chih How commented on NUTCH-103:
> > -------------------------------------
> >
> > I adventitously discover this wonderful application some time ago
> through
> > JIRA. It enticed me to download the achieve and install in on my linux
> box
> > which was running nutch 0.7. After spending ~10 hours, i have no luck to
> > get it work . I would appreciate if detail installation documentation is
> > available here. Thanks.
> >
> > > Vivisimo like treeview and url redirect
> > > ---------------------------------------
> > >
> > > Key: NUTCH-103
> > > URL: http://issues.apache.org/jira/browse/NUTCH-103
> > > Project: Nutch
> > > Type: Improvement
> > > Components: web gui
> > > Versions: 0.8-dev
> > > Environment: linux
> > > Reporter: robert benea
> > > Priority: Trivial
> > > Attachments: clusty.tar
> > >
> > > First, I modified cluster.jsp and now the cluster has a vivisimo look.
> I
> > used javascript to show the treeview. Another small change is that I
> call
> > the cluster recursively twice, so that two levels of clustering are
> shown.
> > > Second, I added redirect.jsp in order to log the links that were
> clicked
> > during search and because of that search.jsp is changed as well.
> > > The code is not clean as all started as an experiment, I hope someone
> > else finds it useful and clean it up ;-).
> > > To install it just copy the files where you deployed the nutch.war and
> > will work auto-magically.
> > > Regards,
> > > R.
> >
Re: [jira] Commented: (NUTCH-103) Vivisimo like treeview and url redirect
Posted by Robert Benea <ro...@gmail.com>.
You need to untar clusty.tar inside your ROOT directory
(jakarta/webapps/ROOT), then open search.jsp and cluster.jsp and re-save
them so that the web server picks up the new versions. That's about it.
It should take about 5 minutes to get this running; you are probably not
very familiar with this project yet.
Cheers,
R.
On 10/17/05, Bong Chih How (JIRA) <ji...@apache.org> wrote:
>
> [
> http://issues.apache.org/jira/browse/NUTCH-103?page=comments#action_12332316]
>
> Bong Chih How commented on NUTCH-103:
> -------------------------------------
>
> I adventitiously discovered this wonderful application some time ago through
> JIRA. It enticed me to download the archive and install it on my Linux box,
> which was running Nutch 0.7. After spending ~10 hours, I have had no luck
> getting it to work. I would appreciate it if detailed installation
> documentation were made available here. Thanks.
>
> > Vivisimo like treeview and url redirect
> > ---------------------------------------
> >
> > Key: NUTCH-103
> > URL: http://issues.apache.org/jira/browse/NUTCH-103
> > Project: Nutch
> > Type: Improvement
> > Components: web gui
> > Versions: 0.8-dev
> > Environment: linux
> > Reporter: robert benea
> > Priority: Trivial
> > Attachments: clusty.tar
> >
> > First, I modified cluster.jsp and now the cluster has a vivisimo look. I
> used javascript to show the treeview. Another small change is that I call
> the cluster recursively twice, so that two levels of clustering are shown.
> > Second, I added redirect.jsp in order to log the links that were clicked
> during search and because of that search.jsp is changed as well.
> > The code is not clean as all started as an experiment, I hope someone
> else finds it useful and clean it up ;-).
> > To install it just copy the files where you deployed the nutch.war and
> will work auto-magically.
> > Regards,
> > R.
>
[jira] Updated: (NUTCH-103) Vivisimo like treeview and url redirect
Posted by "robert benea (JIRA)" <ji...@apache.org>.
[ http://issues.apache.org/jira/browse/NUTCH-103?page=all ]
robert benea updated NUTCH-103:
-------------------------------
Attachment: clusty.tar
> Vivisimo like treeview and url redirect
> ---------------------------------------
>
> Key: NUTCH-103
> URL: http://issues.apache.org/jira/browse/NUTCH-103
> Project: Nutch
> Type: New Feature
> Components: web gui
> Versions: 0.8-dev
> Environment: linux
> Reporter: robert benea
> Attachments: clusty.tar
>
> First, I modified cluster.jsp and now the cluster has a vivisimo look. I used javascript to show the treeview. Another small change is that I call the cluster recursively twice, so that two levels of clustering are shown.
> Second, I added redirect.jsp in order to log the links that were clicked during search and because of that search.jsp is changed as well.
> The code is not clean as all started as an experiment, I hope someone else finds it useful and clean it up ;-).
> To install it just copy the files where you deployed the nutch.war and will work auto-magically.
> Regards,
> R.
[jira] Updated: (NUTCH-103) Vivisimo like treeview and url redirect
Posted by "robert benea (JIRA)" <ji...@apache.org>.
[ http://issues.apache.org/jira/browse/NUTCH-103?page=all ]
robert benea updated NUTCH-103:
-------------------------------
Type: Improvement (was: New Feature)
Priority: Trivial (was: Major)
> Vivisimo like treeview and url redirect
> ---------------------------------------
>
> Key: NUTCH-103
> URL: http://issues.apache.org/jira/browse/NUTCH-103
> Project: Nutch
> Type: Improvement
> Components: web gui
> Versions: 0.8-dev
> Environment: linux
> Reporter: robert benea
> Priority: Trivial
> Attachments: clusty.tar
>
> First, I modified cluster.jsp and now the cluster has a vivisimo look. I used javascript to show the treeview. Another small change is that I call the cluster recursively twice, so that two levels of clustering are shown.
> Second, I added redirect.jsp in order to log the links that were clicked during search and because of that search.jsp is changed as well.
> The code is not clean as all started as an experiment, I hope someone else finds it useful and clean it up ;-).
> To install it just copy the files where you deployed the nutch.war and will work auto-magically.
> Regards,
> R.
[jira] Commented: (NUTCH-103) Vivisimo like treeview and url redirect
Posted by "Bong Chih How (JIRA)" <ji...@apache.org>.
[ http://issues.apache.org/jira/browse/NUTCH-103?page=comments#action_12332316 ]
Bong Chih How commented on NUTCH-103:
-------------------------------------
I adventitiously discovered this wonderful application some time ago through JIRA. It enticed me to download the archive and install it on my Linux box, which was running Nutch 0.7. After spending ~10 hours, I have had no luck getting it to work. I would appreciate it if detailed installation documentation were made available here. Thanks.
> Vivisimo like treeview and url redirect
> ---------------------------------------
>
> Key: NUTCH-103
> URL: http://issues.apache.org/jira/browse/NUTCH-103
> Project: Nutch
> Type: Improvement
> Components: web gui
> Versions: 0.8-dev
> Environment: linux
> Reporter: robert benea
> Priority: Trivial
> Attachments: clusty.tar
>
> First, I modified cluster.jsp and now the cluster has a vivisimo look. I used javascript to show the treeview. Another small change is that I call the cluster recursively twice, so that two levels of clustering are shown.
> Second, I added redirect.jsp in order to log the links that were clicked during search and because of that search.jsp is changed as well.
> The code is not clean as all started as an experiment, I hope someone else finds it useful and clean it up ;-).
> To install it just copy the files where you deployed the nutch.war and will work auto-magically.
> Regards,
> R.