Posted to dev@nutch.apache.org by "Stefan Groschupf (JIRA)" <ji...@apache.org> on 2006/04/22 00:01:07 UTC

[jira] Updated: (NUTCH-251) Administration GUI

     [ http://issues.apache.org/jira/browse/NUTCH-251?page=all ]

Stefan Groschupf updated NUTCH-251:
-----------------------------------

    Attachment: hadoop_nutch_gui_v1.patch
                nutch_gui_v1.patch
                nutch_gui_plugins_v1.zip

This is an early preview patch of the nutch gui.
There are known issues, but it is a starting point from which we can continue building a solid administration user interface.


This patch introduces the following functionality:

+ web-based administration gui via an embedded web container
+ the gui is fully based on the plugin system, so it is customizable and extendable using plugins
+ all plugins can be internationalized
+ introduces the concept of nutch instances, a mechanism for running separately configurable nutch deployments from the same code base (e.g. intranet search, web page search)
+ pluggable authentication: currently it comes with a default user/password pair taken from the configuration, but LDAP integration, for example, could easily be added
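
To illustrate the pluggable authentication idea, here is a minimal sketch in Python. It is not the patch's code (the patch is Java and its interfaces are not shown in this mail); the names Authenticator, ConfigAuthenticator, login and the property keys gui.user / gui.password are made up for the example.

```python
import hmac
from typing import Mapping, Protocol


class Authenticator(Protocol):
    """Hypothetical plug-in interface: any object with this method can be swapped in."""

    def authenticate(self, user: str, password: str) -> bool:
        ...


class ConfigAuthenticator:
    """Default backend: one user/password pair read from the configuration."""

    def __init__(self, config: Mapping[str, str]) -> None:
        # The property names are assumptions for this sketch, not real nutch keys.
        self._user = config["gui.user"]
        self._password = config["gui.password"]

    def authenticate(self, user: str, password: str) -> bool:
        # compare_digest keeps the comparison time independent of where the strings differ.
        user_ok = hmac.compare_digest(user.encode(), self._user.encode())
        password_ok = hmac.compare_digest(password.encode(), self._password.encode())
        return user_ok and password_ok


def login(auth: Authenticator, user: str, password: str) -> str:
    """The gui only talks to the interface, so the backend can be replaced."""
    return "welcome" if auth.authenticate(user, password) else "denied"
```

An LDAP backend would be another class with the same authenticate method; login would not need to change.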

The patch comes with the following plugins:
+ admin-listing
++ required by the web ui to show all deployed plugins as tabs on a webpage

+ admin-instance
++ lists all instances and allows creating a new instance

+ admin-configuration
++ configures a nutch instance (the configuration is written to disk as nutch-site.xml)

+ admin-inject
++ injects urls into a crawlDb

+ admin-system
++ shows the status of the system

+ admin-job
++ shows the status of jobs

+ admin-crawldb-status
++ shows crawldb entries filtered by status, or shows the status of a given url (useful to check whether a page was already fetched)

+ admin-management
++ generate segment
++ fetch segment
++ parse segment (if required)
++ update crawldb
++ invert links
++ index segment
++ delete segment, parse, index etc.

+ admin-scheduling
++ quartz-based cron job management to run a time-driven "generate - fetch - updatedb - invertlinks - index" job
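
The job that the scheduler runs is a fixed sequence of steps. A rough sketch of such a sequence in Python follows; the run_step callback and the stop-at-first-failure behaviour are assumptions for the example, and the time-based trigger (quartz in the patch) is left out.

```python
from typing import Callable, List, Sequence, Tuple

# Step order as named in the plugin description.
PIPELINE = ("generate", "fetch", "updatedb", "invertlinks", "index")


def run_pipeline(
    run_step: Callable[[str], bool],
    steps: Sequence[str] = PIPELINE,
) -> Tuple[List[str], bool]:
    """Run the steps in order and stop at the first one that reports failure.

    run_step is a hypothetical callback that executes one step and returns
    True on success. The result is (completed steps, overall success).
    """
    completed: List[str] = []
    for step in steps:
        if not run_step(step):
            return completed, False
        completed.append(step)
    return completed, True
```

Stopping early is a design choice made here so that a later step never runs on the output of a failed earlier one.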


Known issues:
+ requires hadoop changes
+ locally running jobs cannot be stopped, but distributed jobs can be stopped
+ the index searcher does not look for index folders inside segment folders as in nutch 0.7, but the gui places the index folder in the segment folder
++ so the searcher is unable to find the indices
+ "put to search" does not work, since the searcher does not support dynamically adding index folders
+ the linkdb inverter does not update a linkdb but overwrites it; this is a general nutch bug, but it affects the gui as well
+ the nutch gui introduces locking by storing lock files in folders; this mechanism is ignored by the nutch command line tools
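
For readers unfamiliar with lock files, here is a minimal sketch of the general technique in Python. It is not the patch's implementation; the file name ".lock" and the names folder_lock and FolderLocked are invented for the example.

```python
import os
from contextlib import contextmanager

LOCK_NAME = ".lock"  # hypothetical name, the patch may use a different one


class FolderLocked(Exception):
    """Raised when another holder already owns the lock file."""


@contextmanager
def folder_lock(folder: str):
    """Hold an advisory lock on a folder for the duration of a with-block."""
    lock_path = os.path.join(folder, LOCK_NAME)
    try:
        # O_CREAT | O_EXCL fails if the file already exists, so only one
        # caller can create the lock file.
        fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        raise FolderLocked(folder) from None
    try:
        os.close(fd)
        yield lock_path
    finally:
        # Release the lock even if the work inside the with-block failed.
        os.remove(lock_path)
```

A lock like this is purely advisory: it only protects a folder from tools that check for the lock file, which is why command line tools that never look for it can still modify a locked folder.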



It would be great if users could test the gui, report bugs and help to improve the patch.
This is a very complex patch and it is difficult to stay in sync with the latest changes, so in case we missed something
before generating this patch and the patch does not work as expected, please don't blame us but give us some time and hints to fix the problems.


Help is welcome with the following tasks:
+ fixing language issues in javadoc, api and bundle files
+ translating bundles into more languages (currently it comes with english and german bundles)
+ testing heavily, finding bugs and providing fixes :)
+ writing help texts and documentation

How to:

+ check out the latest nutch sources

+ check out the hadoop sources
+ patch hadoop with the hadoop patch
+ build the hadoop jar
+ remove the old hadoop jar from nutch/lib
+ place the new hadoop jar in nutch/lib


+ uncompress the plugin zip file
+ place the plugins in nutch/src/plugins (a patch is not possible since svn does not support binary patches)
+ patch nutch with the nutch patch
+ start the gui with bin/nutch gui <folderWhereYourInstanceDataWillBeStored>
+ point your browser to: http://localhost:50060/general/
+ username and password are "admin" (can be changed in nutch-default.xml)
+ select the "default" instance or create a new instance.
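
The jar swap in the middle of these steps is easy to get wrong by hand, so here is a small sketch of it in Python. The function name swap_hadoop_jar and the assumption that the old jar matches hadoop-*.jar are mine; check the actual file names in your nutch/lib before relying on it.

```python
import shutil
from pathlib import Path
from typing import List


def swap_hadoop_jar(new_jar: Path, nutch_lib: Path) -> List[str]:
    """Remove existing hadoop jars from nutch_lib and copy new_jar in.

    Assumes the old jar is named hadoop-*.jar and that new_jar lives
    outside nutch_lib. Returns the names of the removed files.
    """
    removed: List[str] = []
    for old in sorted(nutch_lib.glob("hadoop-*.jar")):
        old.unlink()
        removed.append(old.name)
    shutil.copy2(new_jar, nutch_lib / new_jar.name)
    return removed
```

Other jars in nutch/lib are left alone, and the returned list makes it easy to see which file was replaced.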



Thanks to everybody who helped to get this implemented and did the first beta tests, and especially to Marko for hacking all the jsp's!
I suggest adding this patch to a nutch 0.9 branch and adding a gui component in jira to go from there.
I really hope I didn't miss anything or upload the wrong files now. :-O

> Administration GUI
> ------------------
>
>          Key: NUTCH-251
>          URL: http://issues.apache.org/jira/browse/NUTCH-251
>      Project: Nutch
>         Type: Improvement

>     Versions: 0.8-dev
>     Reporter: Stefan Groschupf
>     Priority: Minor
>      Fix For: 0.8-dev
>  Attachments: hadoop_nutch_gui_v1.patch, nutch_gui_plugins_v1.zip, nutch_gui_v1.patch
>
> Having a web based administration interface would help to make nutch administration and management much more user friendly.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira

Re: [jira] Updated: (NUTCH-251) Administration GUI

Posted by TDLN <di...@gmail.com>.

I have my local changes, so I can't use the binary distribution.
Anyway, I will have a go at it and let you know.

Rgrds, Thomas



On 5/11/06, Stefan Groschupf <sg...@media-style.com> wrote:
> Hi,
>
> the easiest way is to download one of the binary distributions.
> However, as far as I know, the patches still work and need to be applied
> to both projects.
> Stefan

Re: [jira] Updated: (NUTCH-251) Administration GUI

Posted by Stefan Groschupf <sg...@media-style.com>.

Hi,

the easiest way is to download one of the binary distributions.
However, as far as I know, the patches still work and need to be applied
to both projects.
Stefan

On 11.05.2006 at 08:38, TDLN wrote:

> Hi Stephan.
>
> I am about to get started with the Admin GUI and was wondering if
> these instructions are still valid.
>
> More specifically, is it still necessary to patch Hadoop, or has this
> patch already been integrated?
>
> Also do you know how the Nutch patches go with the latest revision?
>
> Rgrds, Thomas Delnoij

Re: [jira] Updated: (NUTCH-251) Administration GUI

Posted by TDLN <di...@gmail.com>.

Hi Stephan.

I am about to get started with the Admin GUI and was wondering if
these instructions are still valid.

More specifically, is it still necessary to patch Hadoop, or has this
patch already been integrated?

Also do you know how the Nutch patches go with the latest revision?

Rgrds, Thomas Delnoij

How to:

+ check out the latest nutch sources

+ check out the hadoop sources
+ patch hadoop with the hadoop patch
+ build the hadoop jar
+ remove the old hadoop jar from nutch/lib
+ place the new hadoop jar in nutch/lib


+ uncompress the plugin zip file
+ place the plugins in nutch/src/plugins (a patch is not possible since svn
does not support binary patches)
+ patch nutch with the nutch patch
+ start the gui with bin/nutch gui <folderWhereYourInstanceDataWillBeStored>
+ point your browser to: http://localhost:50060/general/
+ username and password are "admin" (can be changed in nutch-default.xml)
+ select the "default" instance or create a new instance.

Re: [jira] Updated: (NUTCH-251) Administration GUI

Posted by TDLN <di...@gmail.com>.

One more question: where should we post questions about the admin gui, on
the dev or the users list?

Rgrds, Thomas Delnoij

On 4/26/06, Stefan Groschupf <sg...@media-style.com> wrote:
>
> Hi Thomas,

Re: [jira] Updated: (NUTCH-251) Administration GUI

Posted by Stefan Groschupf <sg...@media-style.com>.

Hi Thomas,
> 1) I am using local filesystem - as you say that local running jobs
> can not be stopped, does that imply that the scheduling is
> dysfunctional as well on local filesystem?

The gui works fully with a local file system and a local job runner,
except for stopping a running job.
http://mail-archives.apache.org/mod_mbox/lucene-hadoop-dev/200603.mbox/%3C44170394.8070103@apache.org%3E


>
> 2) Do you think it makes sense to have a language bundle in Dutch -
> personally I don't, because I have never met a Dutch developer who doesn't
> speak English, but I could do it anyway.

Getting as many translations as possible would increase user friendliness.
However, we may need to test it and fix bugs before we start to
translate the ui into more languages.


>
> 3) Would you like to act as a filter for the bugs before we add
> them to Jira, so that we don't post any known issues / duplicates?

I created a component called administration gui; just file bug
reports under this component.
If we all use jira, then we will know what the known problems are.

>
> Thanks for all the work.

Thanks for looking into it.

Stefan

Re: [jira] Updated: (NUTCH-251) Administration GUI

Posted by TDLN <di...@gmail.com>.

Stefan,

this patch looks very interesting. I would like to test it, but first
I have several questions.

1) I am using local filesystem - as you say that local running jobs
can not be stopped, does that imply that the scheduling is
dysfunctional as well on local filesystem?

2) Do you think it makes sense to have a language bundle in Dutch -
personally I don't, because I have never met a Dutch developer who doesn't
speak English, but I could do it anyway.

3) Would you like to act as a filter for the bugs before we add
them to Jira, so that we don't post any known issues / duplicates?

Thanks for all the work.

Rgrds, Thomas Delnoij
