Posted to user@nutch.apache.org by srinivasarao v <sr...@gmail.com> on 2009/08/17 17:58:46 UTC

Indexing Images

Hello all,

I am using nutch-0.9 to index images from web sites. I have crawled some
websites. While indexing, I want to index only the images, not other
pages such as HTML pages. Can anyone help me with this?

Thank You,
Srinivas

-- 
http://cheyuta.wordpress.com
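
One possible starting point for the question above, not an answer confirmed by anyone on the list: decide per URL whether a document looks like an image, and pass only those on to the indexing step. The sketch below is plain Java with no Nutch API calls. The class name ImageUrlCheck, the method looksLikeImage and the suffix list are assumptions made for illustration, and how to wire such a check into a Nutch indexing plugin is left open here.

```java
import java.net.URI;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical helper, not part of Nutch: decides by file suffix
// whether a URL points to an image.
final class ImageUrlCheck {

    // Assumed list of image suffixes; extend as needed.
    private static final List<String> IMAGE_SUFFIXES =
            Arrays.asList(".jpg", ".jpeg", ".png", ".gif", ".bmp");

    private ImageUrlCheck() {}

    static boolean looksLikeImage(String url) {
        if (url == null) {
            return false;
        }
        String path;
        try {
            // Use only the path so query strings like "?size=large" are ignored.
            path = URI.create(url).getPath();
        } catch (IllegalArgumentException e) {
            return false; // not a parseable URL
        }
        if (path == null) {
            return false; // opaque URIs such as mailto: have no path
        }
        String lower = path.toLowerCase(Locale.ROOT);
        for (String suffix : IMAGE_SUFFIXES) {
            if (lower.endsWith(suffix)) {
                return true;
            }
        }
        return false;
    }
}
```

A suffix check is cruder than looking at the Content-Type returned by the server, so it will miss images served from URLs without a file extension.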

Re: scheduling

Posted by fa...@butterflycluster.net.
uh wow!~

never heard of this one before. thanks for this information!






Re: scheduling

Posted by rzo <rz...@gmx.de>.
hi,

you may take a look at
http://yajsw.sourceforge.net/

this is a framework for wrapping applications and for running them as 
services or daemons.
it includes a sample script for running/scheduling nutch with tomcat and 
with solr.

it comes with a system tray icon which can be used to start/stop the 
application.

- ron



Re: scheduling

Posted by Marko Bauhardt <mb...@101tec.com>.
On Aug 18, 2009, at 10:04 AM, fadzi@butterflycluster.net wrote:

>>>
>>> tried that; no joy still.

>>>
>>>
>>> are there any specifics i need to put in nutch-site.xml?

In the conf folder you have a nutch-site.xml.template.

>>>
>>>
>>> because mine is blank at the moment.


Ok. Create a nutch-site.xml file in your conf folder with the
following content:


<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

	<!-- Put site-specific property overrides in this file. -->

<configuration>

	<property>
		<name>plugin.folders</name>
		<value>src/plugin</value>
		<description>Directories where nutch plugins are located. Each
			element may be a relative or absolute path. If absolute, it is used
			as is. If relative, it is searched for on the classpath.</description>
	</property>

	<property>
		<name>http.agent.name</name>
		<value>Nutch-Gui-1.0</value>
		<description>HTTP 'User-Agent' request header. MUST NOT be empty -
			please set this to a single word uniquely related to your
			organization.

			NOTE: You should also check other related properties:

			http.robots.agents
			http.agent.description
			http.agent.url
			http.agent.email
			http.agent.version

			and set their values appropriately.

		</description>
	</property>

</configuration>

after that, start the class AdministartionApplication with the parameters
"/tmp/nutchGui" and "50060".
please delete the folder "/tmp/nutchGui" first if it already exists.

the gui should then start, and you can open the browser at
"127.0.0.1:50060/general".
the gui is currently only available in german :(.
we will translate it via i18n in the next few days.
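
The cleanup step for the working directory could be automated along these lines. This is a minimal sketch, assuming the working directory is an ordinary local folder; the class name WorkDirCleaner and its method are hypothetical and not part of nutch-gui.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper, not part of nutch-gui: removes a stale
// working directory before the application is started.
final class WorkDirCleaner {

    private WorkDirCleaner() {}

    // Deletes dir and everything below it; does nothing if dir does not exist.
    static void deleteRecursively(Path dir) throws IOException {
        if (!Files.exists(dir)) {
            return;
        }
        List<Path> ordered;
        try (Stream<Path> paths = Files.walk(dir)) {
            // Reverse order lists children before their parent directories.
            ordered = paths.sorted(Comparator.reverseOrder())
                           .collect(Collectors.toList());
        }
        for (Path p : ordered) {
            Files.delete(p);
        }
    }
}
```

Calling WorkDirCleaner.deleteRecursively(Paths.get("/tmp/nutchGui")) before the start would then take the place of deleting the folder by hand.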


marko





Re: scheduling

Posted by fa...@butterflycluster.net.
from eclipse.

Quoting Marko Bauhardt <mb...@101tec.com>:

> Do you start the gui from eclipse or from binary package?
>
> marko




Re: scheduling

Posted by Marko Bauhardt <mb...@101tec.com>.
Do you start the gui from eclipse or from binary package?

marko





Re: scheduling

Posted by fa...@butterflycluster.net.
tried that; no joy still.

are there any specifics i need to put in nutch-site.xml?

because mine is blank at the moment.






Re: scheduling

Posted by Marko Bauhardt <mb...@101tec.com>.
On Aug 18, 2009, at 9:36 AM, fadzi@butterflycluster.net wrote:

> Hi Marko,

Hi

>
>
> I am trying to run the AdminApp; what values should I use for the working
> directory and port number?

for testing i use "/tmp" as the working directory.
for the http port i use 50060 or 8080. start the app on whichever port
you want.

>
>
> I used my nutch root directory as the working directory.
>
> i am getting:
>
> java.lang.RuntimeException: x-point org.apache.nutch.admin.IGuiComponent not found, check your plugin folder
> 	at org.apache.nutch.admin.GuiComponentExtensionContainer.getGuiComponentExtensions(GuiComponentExtensionContainer.java:49)
> 	at org.apache.nutch.admin.GuiComponentExtensionContainer.getGeneralGuiComponentExtensions(GuiComponentExtensionContainer.java:36)
> 	at org.apache.nutch.admin.GuiComponentDeployer.getExtensions(GuiComponentDeployer.java:126)
> 	at org.apache.nutch.admin.GuiComponentDeployer.run(GuiComponentDeployer.java:102)


i think your plugin folder is not correct.
if you start the ui from eclipse, create a nutch-site.xml in your
conf folder, for example:

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

	<!-- Put site-specific property overrides in this file. -->

<configuration>

	<property>
		<name>plugin.folders</name>
		<value>src/plugin</value>
		<description>Directories where nutch plugins are located. Each
			element may be a relative or absolute path. If absolute, it is used
			as is. If relative, it is searched for on the classpath.</description>
	</property>

	<property>
		<name>http.agent.name</name>
		<value>Nutch-Gui-0.1</value>
		<description>HTTP 'User-Agent' request header. MUST NOT be empty -
			please set this to a single word uniquely related to your
			organization.

			NOTE: You should also check other related properties:

			http.robots.agents
			http.agent.description
			http.agent.url
			http.agent.email
			http.agent.version

			and set their values appropriately.

		</description>
	</property>

</configuration>


you should also create a conf/regex-urlfilter.txt file.


hth
marko
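
The lookup rule in the plugin.folders description (absolute paths are used as is, relative paths are searched for on the classpath) can be checked from the same Eclipse run configuration with a small diagnostic. This is only a sketch of that rule as the description words it, not the actual Nutch plugin loader; the class name PluginFolderCheck and its method are hypothetical.

```java
import java.io.File;
import java.net.URL;

// Hypothetical diagnostic, not Nutch code: follows the lookup rule
// stated in the plugin.folders property description.
final class PluginFolderCheck {

    private PluginFolderCheck() {}

    // Returns where the folder was found, or null if it cannot be resolved.
    static String resolve(String folder) {
        File f = new File(folder);
        if (f.isAbsolute()) {
            // Absolute paths are used as is.
            return f.isDirectory() ? f.getPath() : null;
        }
        // Relative paths are looked up on the classpath.
        URL url = PluginFolderCheck.class.getClassLoader().getResource(folder);
        return url != null ? url.toString() : null;
    }
}
```

If PluginFolderCheck.resolve("src/plugin") returns null when run with the same classpath as the gui, the relative value is not visible on the classpath, and an absolute path in plugin.folders may be worth trying.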




Re: scheduling

Posted by fa...@butterflycluster.net.
Hi Marko,

I am trying to run the AdminApp; what values should I use for the working
directory and port number?

I used my nutch root directory as the working directory.

i am getting:

java.lang.RuntimeException: x-point org.apache.nutch.admin.IGuiComponent not found, check your plugin folder
	at org.apache.nutch.admin.GuiComponentExtensionContainer.getGuiComponentExtensions(GuiComponentExtensionContainer.java:49)
	at org.apache.nutch.admin.GuiComponentExtensionContainer.getGeneralGuiComponentExtensions(GuiComponentExtensionContainer.java:36)
	at org.apache.nutch.admin.GuiComponentDeployer.getExtensions(GuiComponentDeployer.java:126)
	at org.apache.nutch.admin.GuiComponentDeployer.run(GuiComponentDeployer.java:102)





Re: scheduling

Posted by Marko Bauhardt <mb...@101tec.com>.
On Aug 18, 2009, at 7:04 AM, fadzi@butterflycluster.net wrote:

> hi,

Hi Fadzi,

>
>
> I have a requirement to build a simple UI for starting and stopping the
> crawler, and also a scheduling mechanism (Quartz).
>
> Has anyone attempted this before?

We have started to implement an upgrade of the nutch gui as well.
the gui supports:

+ creation of separate nutch instances
+ configuration of these instances
+ manual crawls
+ scheduled crawls
+ small statistics
+ system overview
+ url uploading
+ small simple search ui
...

you can find the project on github.com. it is a fork of the apache
nutch project from branch "tags/release-1.0".
you can check out the sources here: http://github.com/101tec/nutch/tree/nutch-gui
we plan to push a release at the end of this month.

marko



~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
101tec GmbH

Halle (Saale), Saxony-Anhalt, Germany
http://www.101tec.com




scheduling

Posted by fa...@butterflycluster.net.
hi,

I have a requirement to build a simple UI for starting and stopping the
crawler, and also a scheduling mechanism (Quartz).

Has anyone attempted this before? Any lessons learned or any
suggestions on how best to go about this?

I am just worried about issues with running the crawler inside a web
container etc.

there is the option of unix cron jobs, but that's the last option unfortunately.

thanks.
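
One way to sidestep the web container worry is to let the UI only schedule the crawl and run the crawl itself as a separate operating system process. The sketch below illustrates that idea under stated assumptions: it uses java.util.concurrent as a stand-in for Quartz, the "bin/nutch crawl ... -dir ... -depth ..." command line is an assumption about the local installation, and CrawlScheduler with all of its methods is a hypothetical name, not an existing Nutch class.

```java
import java.io.File;
import java.io.IOException;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch: runs the crawl as a separate OS process on a
// fixed schedule, so the crawler never runs inside the web container.
final class CrawlScheduler {

    private final ScheduledExecutorService executor =
            Executors.newSingleThreadScheduledExecutor();
    private final File nutchHome;

    CrawlScheduler(File nutchHome) {
        this.nutchHome = nutchHome;
    }

    // Assumed command line; adjust script, directories and depth locally.
    static List<String> buildCommand(File nutchHome, String urlDir,
                                     String crawlDir, int depth) {
        String script = new File(new File(nutchHome, "bin"), "nutch").getPath();
        return Arrays.asList(script, "crawl", urlDir,
                "-dir", crawlDir, "-depth", Integer.toString(depth));
    }

    // Starts one crawl and waits for it, so two crawls never overlap.
    int runOnce(List<String> command) throws IOException, InterruptedException {
        Process p = new ProcessBuilder(command)
                .directory(nutchHome)
                .inheritIO()
                .start();
        try {
            return p.waitFor();
        } catch (InterruptedException e) {
            p.destroy(); // stop the child process when the scheduler is stopped
            throw e;
        }
    }

    // Schedules a crawl with periodHours between runs; first run is immediate.
    void start(List<String> command, long periodHours) {
        executor.scheduleWithFixedDelay(() -> {
            try {
                int exit = runOnce(command);
                System.out.println("crawl finished with exit code " + exit);
            } catch (IOException e) {
                System.err.println("crawl failed: " + e);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }, 0, periodHours, TimeUnit.HOURS);
    }

    // Stops scheduling and interrupts a crawl that is currently running.
    void stop() {
        executor.shutdownNow();
    }
}
```

A start button in the UI would call start(...) and a stop button would call stop(). The same runOnce method could equally be invoked from a Quartz job if Quartz is a hard requirement.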