Posted to dev@nutch.apache.org by "robert benea (JIRA)" <ji...@apache.org> on 2005/10/04 17:48:47 UTC

[jira] Created: (NUTCH-103) Vivisimo like treeview and url redirect

Vivisimo like treeview and url redirect
---------------------------------------

         Key: NUTCH-103
         URL: http://issues.apache.org/jira/browse/NUTCH-103
     Project: Nutch
        Type: New Feature
  Components: web gui  
    Versions: 0.8-dev    
 Environment: linux
    Reporter: robert benea


First, I modified cluster.jsp so that the clusters now have a Vivisimo-like look. I used JavaScript to show the treeview. Another small change is that I apply the clustering recursively (twice), so that two levels of clustering are shown.
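
A minimal sketch of the two-level idea, assuming a flat clusterer that can simply be run again on the documents of each top-level cluster. The FlatClusterer interface, the Cluster class and the toy byWord grouping are hypothetical stand-ins for illustration; they are not the cluster.jsp code or the Carrot2 API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class TwoLevelClustering {

    /** A labelled group of documents plus any sub-clusters found inside it. */
    static final class Cluster {
        final String label;
        final List<String> docs;
        final List<Cluster> children = new ArrayList<>();

        Cluster(String label, List<String> docs) {
            this.label = label;
            this.docs = docs;
        }
    }

    /** Hypothetical flat clusterer: turns a document list into one level of clusters. */
    interface FlatClusterer {
        List<Cluster> cluster(List<String> docs, int level);
    }

    /** Runs the flat clusterer, then re-runs it inside each cluster until maxLevels is reached. */
    static List<Cluster> clusterRecursively(FlatClusterer clusterer, List<String> docs,
                                            int level, int maxLevels) {
        List<Cluster> result = clusterer.cluster(docs, level);
        if (level + 1 < maxLevels) {
            for (Cluster c : result) {
                c.children.addAll(clusterRecursively(clusterer, c.docs, level + 1, maxLevels));
            }
        }
        return result;
    }

    /** Toy stand-in for a real algorithm: groups titles by the word at position 'level'. */
    static List<Cluster> byWord(List<String> docs, int level) {
        Map<String, List<String>> groups = new LinkedHashMap<>();
        for (String doc : docs) {
            String[] words = doc.split(" ");
            String key = level < words.length ? words[level] : "(other)";
            groups.computeIfAbsent(key, k -> new ArrayList<>()).add(doc);
        }
        List<Cluster> clusters = new ArrayList<>();
        groups.forEach((label, members) -> clusters.add(new Cluster(label, members)));
        return clusters;
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList("java applet", "java servlet", "python script");
        // maxLevels = 2 gives the two levels shown in the treeview.
        for (Cluster top : clusterRecursively(TwoLevelClustering::byWord, docs, 0, 2)) {
            System.out.println(top.label);
            for (Cluster child : top.children) {
                System.out.println("  " + child.label + " " + child.docs);
            }
        }
    }
}
```

Re-running the clusterer costs one extra clustering pass per top-level cluster, which is why it is simple but not the fastest way to get a hierarchy.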

Second, I added redirect.jsp in order to log the links that are clicked during a search; because of that, search.jsp is changed as well.
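
A minimal sketch of the click-logging redirect, assuming search.jsp emits links that point at redirect.jsp with the query and the target URL as parameters. The parameter names q and url, the in-memory log and the http/https check are assumptions made for illustration; the actual redirect.jsp in the attachment may work differently.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.net.URLEncoder;
import java.util.ArrayList;
import java.util.List;

final class ClickRedirect {

    /** In-memory stand-in for whatever log the real page writes to. */
    final List<String> clickLog = new ArrayList<>();

    /** Link that the search page would emit instead of the raw result URL. */
    static String redirectLink(String query, String targetUrl) {
        return "redirect.jsp?q=" + encode(query) + "&url=" + encode(targetUrl);
    }

    /** What the redirect page would do: record the click, then return the redirect target. */
    String handle(String queryString) {
        String query = "";
        String url = null;
        for (String pair : queryString.split("&")) {
            int eq = pair.indexOf('=');
            if (eq < 0) {
                continue; // ignore malformed pairs
            }
            String name = pair.substring(0, eq);
            String value = decode(pair.substring(eq + 1));
            if (name.equals("q")) {
                query = value;
            } else if (name.equals("url")) {
                url = value;
            }
        }
        // Refuse anything that is not a plain web link before logging or redirecting.
        if (url == null || !(url.startsWith("http://") || url.startsWith("https://"))) {
            throw new IllegalArgumentException("missing or unsupported target URL");
        }
        clickLog.add(query + "\t" + url);
        return url;
    }

    private static String encode(String s) {
        try {
            return URLEncoder.encode(s, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // UTF-8 is always available
        }
    }

    private static String decode(String s) {
        try {
            return URLDecoder.decode(s, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        ClickRedirect redirect = new ClickRedirect();
        String link = redirectLink("nutch clustering", "http://example.com/page?a=1&b=2");
        String target = redirect.handle(link.substring(link.indexOf('?') + 1));
        System.out.println(link);
        System.out.println(target);
        System.out.println(redirect.clickLog);
    }
}
```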

The code is not clean, as it all started as an experiment. I hope someone else finds it useful and cleans it up ;-).

To install it, just copy the files to where you deployed nutch.war and it will work auto-magically.

Regards,
R.


-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira


Re: [jira] Created: (NUTCH-103) Vivisimo like treeview and url redirect

Posted by Dawid Weiss <da...@cs.put.poznan.pl>.
> 1. AHC doesn't have any local filter that implements LocalFilterComponent,
> RawClusterProducer and so on. How can I achieve that? From a very
> superficial point of view, it seems that nobody uses the AHC class.

No. We added local interfaces after AHC was implemented. It wasn't 
maintained afterwards. So, you'd have to look in the code of a servlet 
that implements AHC and write a corresponding local component to it.

> 2. How do the stopwords and stemmers work for STC ?

If I recall correctly, STC in its remote version relies on the input XML 
stream having all the metadata with linguistic information. With the 
local component... I honestly don't remember off the top of my head even 
though I wrote it myself (a few years ago).

I'll work on a local demo application later this week -- I'll publish 
the sources, so you'll be able to see how the local controller works and 
how various components are configured. We may even create a separate 
repository to collaborate and then prepare a patch to Nutch trunk.

I will start working on this later this week.

D.


Re: [jira] Created: (NUTCH-103) Vivisimo like treeview and url redirect

Posted by Robert Benea <ro...@gmail.com>.
On 10/6/05, Dawid Weiss <da...@cs.put.poznan.pl> wrote:
>
>
> > That would be great, I looked already to the code base in the plug-in
> > directory and it seems you use this call to get the clustering results:
> >
> > controller.query("lingo-nmf-km-3", "pseudo-query", requestParams);
> > am I right ?
> >
> > anyway, I want to have the type of algorithm used for clustering, picked up
> > from the xml file, it should be easy to do so.
>
> Yes, it is quite easy -- the controller above can be instantiated from
> an XML file or from a Beanshell script using a local controller
> component (not in the Nutch codebase yet). There are unit tests of that
> controller in Carrot2 CVS, but it has been added recently so I didn't
> have the time to integrate it in a solid working example.


Hi Dawid,

I was able to hack the Clusterer class and make it work for STC; here is my
hack ;-)

// Clustering component here.
LocalComponentFactory stcFactory = new LocalComponentFactoryBase() {
    public LocalComponent getInstance() {
        // NOTE: 'defaults' is left over from the Lingo setup. It is never
        // passed to the STC component below, so it has no effect here.
        HashMap defaults = new HashMap();

        // These are adjustments settings for the clustering algorithm...
        // You can play with them, but the values below are our 'best guess'
        // settings that we acquired experimentally.
        defaults.put("lsi.threshold.clusterAssignment", "0.150");
        defaults.put("lsi.threshold.candidateCluster", "0.775");

        // TODO: this should be eventually replaced with documents from Nutch
        // tagged with a language tag. There is no need to again determine
        // the language of a document.
        return new STCLocalFilterComponent();
    }
};
// The factory is still registered under the old Lingo key.
controller.addLocalComponentFactory("filter.lingo-old", stcFactory);
} // closes the enclosing method (not shown)

But I have two questions:

1. AHC doesn't have any local filter that implements LocalFilterComponent,
RawClusterProducer and so on. How can I achieve that? From a very
superficial point of view, it seems that nobody uses the AHC class.
2. How do the stopwords and stemmers work for STC?


> There is one potential problem that I see -- Nutch plugins require
> explicit JAR references. If you want to switch between algorithms you'll
> need to either put all Carrot2 JARs in the descriptor, put them in
> CLASSPATH before Nutch starts or do some other trickery with class
> loading.


I just put stc.jar in the lib directory; I will optimize it later ;-).

Cheers,
R.

> I won't be able to help you until next week, but after then I'll try to
> find some time to prepare you an example of how the scriptable
> controller is used (or look at the unit tests; the component is called
> carrot2-local-controller).
>
> Dawid
>

Re: [jira] Created: (NUTCH-103) Vivisimo like treeview and url redirect

Posted by Jérôme Charron <je...@gmail.com>.
> There is one potential problem that I see -- Nutch plugins require
> explicit JAR references. If you want to switch between algorithms you'll
> need to either put all Carrot2 JARs in the descriptor, put them in
> CLASSPATH before Nutch starts or do some other trickery with class
> loading.

In trunk only, you can now also define inter-plugin dependencies using
plugin identifiers instead of explicit JAR references. These dependencies
are then checked for availability and added to the classloader at runtime.
Take a look at the analyze-fr and analyze-de plugins, which depend on
lib-lucene-analyzers. You can also see that all plugins now depend on the
nutch-extensionpoints plugin.
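
A minimal sketch of what checking dependencies by plugin identifier can look like: each plugin lists the identifiers it requires, a missing identifier is reported, and dependencies are ordered before the plugins that need them. The PluginDependencies class and its map-based input are hypothetical and only illustrate the idea; this is not Nutch's plugin loader.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

final class PluginDependencies {

    /**
     * Returns the given plugin and everything it requires, dependencies first.
     * 'requires' maps each available plugin id to the ids it depends on.
     */
    static List<String> resolve(String pluginId, Map<String, List<String>> requires) {
        List<String> order = new ArrayList<>();
        visit(pluginId, requires, new HashSet<>(), order);
        return order;
    }

    private static void visit(String id, Map<String, List<String>> requires,
                              Set<String> seen, List<String> order) {
        // Availability check: a required plugin must be known.
        if (!requires.containsKey(id)) {
            throw new IllegalStateException("missing plugin: " + id);
        }
        if (!seen.add(id)) {
            return; // already handled (this also stops dependency cycles)
        }
        for (String dependency : requires.get(id)) {
            visit(dependency, requires, seen, order);
        }
        order.add(id);
    }

    public static void main(String[] args) {
        Map<String, List<String>> requires = new HashMap<>();
        requires.put("nutch-extensionpoints", Collections.<String>emptyList());
        requires.put("lib-lucene-analyzers", Arrays.asList("nutch-extensionpoints"));
        requires.put("analyze-fr", Arrays.asList("nutch-extensionpoints", "lib-lucene-analyzers"));
        // The resulting order is the order in which libraries could be added to a classloader.
        System.out.println(resolve("analyze-fr", requires));
    }
}
```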

I also recently noticed that many plugins import a log4j.jar.
It would be a good idea to define a lib-log4j plugin and add a dependency
on it to each plugin that imports log4j.jar in its lib (of course, we must
take care of the log4j version used).

Jérôme

--
http://motrech.free.fr/
http://www.frutch.org/

Re: [jira] Created: (NUTCH-103) Vivisimo like treeview and url redirect

Posted by Dawid Weiss <da...@cs.put.poznan.pl>.
> That would be great, I looked already to the code base in the plug-in
> directory and it seems you use this call to get the clustering results:
> 
> controller.query("lingo-nmf-km-3", "pseudo-query", requestParams);
> am I right ?
> 
> anyway, I want to have the type of algorithm used for clustering, picked up
> from the xml file, it should be easy to do so.

Yes, it is quite easy -- the controller above can be instantiated from 
an XML file or from a Beanshell script using a local controller 
component (not in the Nutch codebase yet). There are unit tests of that 
controller in Carrot2 CVS, but it has been added recently so I didn't 
have the time to integrate it in a solid working example.
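
A minimal sketch of picking the clustering process from an XML file, using the standard java.util.Properties XML format. The property name clustering.process.id and the process id "stc" are hypothetical; only the fallback "lingo-nmf-km-3" comes from the call quoted above. This does not use the Carrot2 local controller's own XML or Beanshell configuration.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Properties;

final class ClusteringProcessConfig {

    // Hypothetical property name; not a real Nutch or Carrot2 setting.
    static final String KEY = "clustering.process.id";

    // Fallback: the process id from the controller.query(...) call quoted in this thread.
    static final String DEFAULT_PROCESS = "lingo-nmf-km-3";

    /** Reads the process id from a java.util.Properties XML document. */
    static String processId(String xml) {
        Properties props = new Properties();
        try {
            props.loadFromXML(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props.getProperty(KEY, DEFAULT_PROCESS);
    }

    public static void main(String[] args) {
        String xml =
            "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n"
            + "<!DOCTYPE properties SYSTEM \"http://java.sun.com/dtd/properties.dtd\">\n"
            + "<properties>\n"
            + "  <entry key=\"clustering.process.id\">stc</entry>\n"
            + "</properties>\n";
        // The result would replace the hard-coded first argument of controller.query(...).
        System.out.println(processId(xml));
    }
}
```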

There is one potential problem that I see -- Nutch plugins require 
explicit JAR references. If you want to switch between algorithms you'll 
need to either put all Carrot2 JARs in the descriptor, put them in 
CLASSPATH before Nutch starts or do some other trickery with class loading.

I won't be able to help you until next week, but after then I'll try to 
find some time to prepare you an example of how the scriptable 
controller is used (or look at the unit tests; the component is called 
carrot2-local-controller).

Dawid

Re: [jira] Created: (NUTCH-103) Vivisimo like treeview and url redirect

Posted by Robert Benea <ro...@gmail.com>.
On 10/5/05, Dawid Weiss <da...@cs.put.poznan.pl> wrote:
>
>
> > I am planning to take a closer look to the carrot2 implementation and expose
> > the other algorithms to the user,
>
> That's actually quite simple -- I was planning to do it, but have no
> time at the moment. The current Carrot2 code in Nutch is a preconfigured
> process which uses the open source Lingo clustering algorithm to cluster
> documents. But in the codebase of Carrot2 there is now a scriptable
> controller, so you could basically have external scripts configuring
> several different algorithms. It really isn't that difficult. If you
> need any help, let me know -- private e-mail or the newsgroup, whatever.


That would be great. I have already looked at the code base in the plug-in
directory, and it seems you use this call to get the clustering results:

controller.query("lingo-nmf-km-3", "pseudo-query", requestParams);

Am I right?

Anyway, I want the clustering algorithm to be picked up from the XML file;
it should be easy to do so.

Any guidelines or ideas are welcome.


> > changes to the algorithm(s) so that speed wise be as good as vivisimo (not
> > only interface wise ;-)).
>
> We don't know what the Vivisimo algorithm is really like in terms of speed.
> Its authors and co-founders are excellent researchers, so I guess it
> will be a tough beast to beat :) But of course we don't have any reasons
> to be ashamed -- the open source version is quite decent.


That's the spirit, and it is going to get better ;-).

> In the
> commercial version we refactored the codebase and added an optional
> native matrix computation library. The speedup is significant (which
> matters only if your servers are really under a lot of load).
>
> Dawid


Cheers,
R.

Re: [jira] Created: (NUTCH-103) Vivisimo like treeview and url redirect

Posted by Dawid Weiss <da...@cs.put.poznan.pl>.
> I am planning to take a closer look to the carrot2 implementation and expose
> the other algorithms to the user, 

That's actually quite simple -- I was planning to do it, but have no 
time at the moment. The current Carrot2 code in Nutch is a preconfigured 
process which uses the open source Lingo clustering algorithm to cluster 
documents. But in the codebase of Carrot2 there is now a scriptable 
controller, so you could basically have external scripts configuring 
several different algorithms. It really isn't that difficult. If you 
need any help, let me know -- private e-mail or the newsgroup, whatever.

> changes to the algorithm(s) so that speed wise be as good as vivisimo (not
> only interface wise ;-)).

We don't know what the Vivisimo algorithm is really like in terms of speed. 
Its authors and co-founders are excellent researchers, so I guess it 
will be a tough beast to beat :) But of course we don't have any reasons 
to be ashamed -- the open source version is quite decent. In the 
commercial version we refactored the codebase and added an optional 
native matrix computation library. The speedup is significant (which 
matters only if your servers are really under a lot of load).

Dawid

Re: [jira] Created: (NUTCH-103) Vivisimo like treeview and url redirect

Posted by Robert Benea <ro...@gmail.com>.
Hi Dawid,

I am planning to take a closer look at the Carrot2 implementation and expose
the other algorithms to the user and, if possible, also make the necessary
changes to the algorithm(s) so that it is as good as Vivisimo speed-wise (not
only interface-wise ;-)).

Cheers,
R.

On 10/4/05, Dawid Weiss <da...@cs.put.poznan.pl> wrote:
>
>
> Hi Robert,
>
> > First, I modified cluster.jsp and now the cluster has a vivisimo
> > look. I used javascript to show the treeview. Another small change
> > is that I call the cluster recursively twice, so that two levels of
> > clustering are shown.
>
> Yep, this is the simplest way of inducing a hierarchy. There are
> algorithms that do it faster -- a whole spectrum of agglomerative or
> divisive approaches (AHC), a simple overlap-merging method used in STC
> or the commercial version of Lingo clustering component. Vivisimo most
> likely uses a single-pass variant as well.
>
> D.
>

Re: [jira] Created: (NUTCH-103) Vivisimo like treeview and url redirect

Posted by Dawid Weiss <da...@cs.put.poznan.pl>.
Hi Robert,

> First, I modified cluster.jsp and now the cluster has a vivisimo
> look. I used javascript to show the treeview.  Another small change
> is that I call the cluster recursively twice, so that two levels of
> clustering are shown.

Yep, this is the simplest way of inducing a hierarchy. There are 
algorithms that do it faster -- a whole spectrum of agglomerative or 
divisive approaches (AHC), a simple overlap-merging method used in STC 
or the commercial version of Lingo clustering component. Vivisimo most 
likely uses a single-pass variant as well.
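
A minimal sketch of the overlap-merging idea associated with STC: base clusters whose document sets overlap by more than half in both directions are linked, and each connected group is merged into one cluster. The 0.5 threshold is the value commonly cited for STC and is used here as an assumption; the OverlapMerge class is hypothetical, is not Carrot2's implementation, and does not show how a hierarchy is built on top of the merged clusters.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

final class OverlapMerge {

    /** Two base clusters are similar when each shares more than half of its documents with the other. */
    static boolean similar(Set<Integer> a, Set<Integer> b) {
        Set<Integer> common = new HashSet<>(a);
        common.retainAll(b);
        return common.size() > 0.5 * a.size() && common.size() > 0.5 * b.size();
    }

    /** Merges every connected group of similar base clusters into a single cluster. */
    static List<Set<Integer>> merge(List<Set<Integer>> base) {
        int n = base.size();
        int[] parent = new int[n]; // union-find forest over base cluster indices
        for (int i = 0; i < n; i++) {
            parent[i] = i;
        }
        for (int i = 0; i < n; i++) {
            for (int j = i + 1; j < n; j++) {
                if (similar(base.get(i), base.get(j))) {
                    parent[find(parent, i)] = find(parent, j);
                }
            }
        }
        Map<Integer, Set<Integer>> merged = new LinkedHashMap<>();
        for (int i = 0; i < n; i++) {
            merged.computeIfAbsent(find(parent, i), k -> new TreeSet<>()).addAll(base.get(i));
        }
        return new ArrayList<>(merged.values());
    }

    private static int find(int[] parent, int i) {
        while (parent[i] != i) {
            parent[i] = parent[parent[i]]; // path halving
            i = parent[i];
        }
        return i;
    }

    public static void main(String[] args) {
        List<Set<Integer>> base = new ArrayList<>();
        base.add(new HashSet<>(Arrays.asList(1, 2, 3)));
        base.add(new HashSet<>(Arrays.asList(2, 3, 4)));
        base.add(new HashSet<>(Arrays.asList(7, 8)));
        // The first two clusters share documents 2 and 3, so they end up merged.
        System.out.println(merge(base));
    }
}
```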

D.

Re: [jira] Commented: (NUTCH-103) Vivisimo like treeview and url redirect

Posted by Robert Benea <ro...@gmail.com>.
Yes.

Did you enable clustering?

<property>
<name>plugin.includes</name>
<value>geoPosition|protocol-httpclient|urlfilter-regex|protocol-file|parse-(text|html|js|word|rss)|index-basic|query-(phrase|site|url)|clustering-carrot2</value>
<description>Regular expression naming plugin directory names to
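
A minimal sketch of how such a value can be checked against plugin directory names with java.util.regex. Treating the value as a pattern that must match the whole directory name is an assumption about how Nutch evaluates plugin.includes; the PluginIncludes class is hypothetical.

```java
import java.util.regex.Pattern;

final class PluginIncludes {

    // The plugin.includes value from the property above, kept on a single line.
    static final String PLUGIN_INCLUDES =
        "geoPosition|protocol-httpclient|urlfilter-regex|protocol-file|"
        + "parse-(text|html|js|word|rss)|index-basic|query-(phrase|site|url)|"
        + "clustering-carrot2";

    /** True when the whole directory name matches the includes expression. */
    static boolean included(String includesRegex, String pluginDirectory) {
        return Pattern.compile(includesRegex).matcher(pluginDirectory).matches();
    }

    public static void main(String[] args) {
        for (String name : new String[] {"clustering-carrot2", "parse-html", "parse-pdf"}) {
            System.out.println(name + " included: " + included(PLUGIN_INCLUDES, name));
        }
    }
}
```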

Cheers,
R.

P.S. Let me know if you got it working.

On 10/17/05, Chih How Bong <ch...@gmail.com> wrote:
>
> Thanks Robert. I will try it again.
> Do I have to configure any settings to point to any nutch plugin?
> Have good day.
>
> Bong

Re: [jira] Commented: (NUTCH-103) Vivisimo like treeview and url redirect

Posted by Chih How Bong <ch...@gmail.com>.
Thanks Robert. I will try it again.
Do I have to configure any settings to point to any nutch plugin?
Have a good day.

Bong

On 10/18/05, Robert Benea <ro...@gmail.com> wrote:
>
> u need to untar clusty.tar inside your ROOT directory,
> jakarta/webapps/ROOT,
> and then u need to open the files search.jsp/cluster.jsp and resave them
> so
> that the webserver will pick the new ones, that's about it.
>
> U should spend more like 5 mins on making this one run, probably u r not
> too
> familiar with this project.
>
> Cheers,
> R.

Re: [jira] Commented: (NUTCH-103) Vivisimo like treeview and url redirect

Posted by Robert Benea <ro...@gmail.com>.
You need to untar clusty.tar inside your ROOT directory (jakarta/webapps/ROOT),
then open search.jsp and cluster.jsp and resave them so that the web server
picks up the new versions. That's about it.

It should take more like 5 minutes to get this running; you are probably not
too familiar with this project.

Cheers,
R.

On 10/17/05, Bong Chih How (JIRA) <ji...@apache.org> wrote:
>
> [
> http://issues.apache.org/jira/browse/NUTCH-103?page=comments#action_12332316]
>
> Bong Chih How commented on NUTCH-103:
> -------------------------------------
>
> I adventitously discover this wonderful application some time ago through
> JIRA. It enticed me to download the achieve and install in on my linux box
> which was running nutch 0.7. After spending ~10 hours, i have no luck to
> get it work . I would appreciate if detail installation documentation is
> available here. Thanks.

[jira] Updated: (NUTCH-103) Vivisimo like treeview and url redirect

Posted by "robert benea (JIRA)" <ji...@apache.org>.
     [ http://issues.apache.org/jira/browse/NUTCH-103?page=all ]

robert benea updated NUTCH-103:
-------------------------------

    Attachment: clusty.tar

> Vivisimo like treeview and url redirect
> ---------------------------------------
>
>          Key: NUTCH-103
>          URL: http://issues.apache.org/jira/browse/NUTCH-103
>      Project: Nutch
>         Type: New Feature
>   Components: web gui
>     Versions: 0.8-dev
>  Environment: linux
>     Reporter: robert benea
>  Attachments: clusty.tar
>
> First, I modified cluster.jsp and now the cluster has a vivisimo look. I used javascript to show the treeview.  Another small change is that I call the cluster recursively twice, so that two levels of clustering are shown.
> Second, I added redirect.jsp in order to log the links that were clicked during search and because of that search.jsp is changed as well.
> The code is not clean as all started as an experiment, I hope someone else finds it useful and clean it up ;-). 
> To install it just copy the files where you deployed the nutch.war and will work auto-magically.
> Regards,
> R.



[jira] Updated: (NUTCH-103) Vivisimo like treeview and url redirect

Posted by "robert benea (JIRA)" <ji...@apache.org>.
     [ http://issues.apache.org/jira/browse/NUTCH-103?page=all ]

robert benea updated NUTCH-103:
-------------------------------

        type: Improvement  (was: New Feature)
    Priority: Trivial  (was: Major)

> Vivisimo like treeview and url redirect
> ---------------------------------------
>
>          Key: NUTCH-103
>          URL: http://issues.apache.org/jira/browse/NUTCH-103
>      Project: Nutch
>         Type: Improvement
>   Components: web gui
>     Versions: 0.8-dev
>  Environment: linux
>     Reporter: robert benea
>     Priority: Trivial
>  Attachments: clusty.tar
>
> First, I modified cluster.jsp and now the cluster has a vivisimo look. I used javascript to show the treeview.  Another small change is that I call the cluster recursively twice, so that two levels of clustering are shown.
> Second, I added redirect.jsp in order to log the links that were clicked during search and because of that search.jsp is changed as well.
> The code is not clean as all started as an experiment, I hope someone else finds it useful and clean it up ;-). 
> To install it just copy the files where you deployed the nutch.war and will work auto-magically.
> Regards,
> R.



[jira] Commented: (NUTCH-103) Vivisimo like treeview and url redirect

Posted by "Bong Chih How (JIRA)" <ji...@apache.org>.
    [ http://issues.apache.org/jira/browse/NUTCH-103?page=comments#action_12332316 ] 

Bong Chih How commented on NUTCH-103:
-------------------------------------

I adventitiously discovered this wonderful application some time ago through JIRA. It enticed me to download the archive and install it on my Linux box, which was running Nutch 0.7. After spending ~10 hours, I have had no luck getting it to work. I would appreciate it if detailed installation documentation were made available here. Thanks.

> Vivisimo like treeview and url redirect
> ---------------------------------------
>
>          Key: NUTCH-103
>          URL: http://issues.apache.org/jira/browse/NUTCH-103
>      Project: Nutch
>         Type: Improvement
>   Components: web gui
>     Versions: 0.8-dev
>  Environment: linux
>     Reporter: robert benea
>     Priority: Trivial
>  Attachments: clusty.tar
>
> First, I modified cluster.jsp and now the cluster has a vivisimo look. I used javascript to show the treeview.  Another small change is that I call the cluster recursively twice, so that two levels of clustering are shown.
> Second, I added redirect.jsp in order to log the links that were clicked during search and because of that search.jsp is changed as well.
> The code is not clean as all started as an experiment, I hope someone else finds it useful and clean it up ;-). 
> To install it just copy the files where you deployed the nutch.war and will work auto-magically.
> Regards,
> R.
