Posted to user@nutch.apache.org by Hugo Alves <hu...@gmail.com> on 2012/08/16 12:55:48 UTC

Crawl command help

Hi.

I am using Nutch 2.0 with HSQL.

I've created some plugins for parsing special content inside the company
website; the plugins parse the content and then send some data to a SQL
Server database. This is working fine. But the problem is the crawl
command. I am starting Nutch with:
./nutch crawl -depth 300 -topN 30000

In nutch-site.xml I configured the refetch interval to 30 days (the
default value), but after each cycle Nutch fetches both the newly found
pages and the old pages.
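
The interval is set through db.fetch.interval.default, which takes a value
in seconds; with the 30-day default it looks like this in my
nutch-site.xml:

  <property>
    <name>db.fetch.interval.default</name>
    <value>2592000</value>
    <description>The default number of seconds between re-fetches of a
    page (30 days).</description>
  </property>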

What am I doing wrong?

RE: recrawling

Posted by Markus Jelsma <ma...@openindex.io>.
Hi,

Pages will be recrawled when they're eligible (last fetch time + fetch
interval has passed). To force it you can use the -adddays switch on the
generator tool.
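
For example, something along these lines should pull everything forward by
a month (GeneratorJob options from memory for 2.x; check the usage output
of bin/nutch generate on your version):

  bin/nutch generate -topN 30000 -adddays 30

-adddays is added to the current time when the generator checks whether a
page is due, so pages whose interval has not yet elapsed get selected
anyway.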


 
 
-----Original message-----
> From:Stefan Scheffler <ss...@avantgarde-labs.de>
> Sent: Fri 17-Aug-2012 11:54
> To: user@nutch.apache.org
> Subject: recrawling
> 
> Hello,
> How can I do a recrawl of an existing crawldb?
> 
> With friendly regards
> Stefan Scheffler
> 
> -- 
> Stefan Scheffler
> Avantgarde Labs GmbH
> Löbauer Straße 19, 01099 Dresden
> Telefon: + 49 (0) 351 21590834
> Email: sscheffler@avantgarde-labs.de
> 
> 

recrawling

Posted by Stefan Scheffler <ss...@avantgarde-labs.de>.
Hello,
How can I do a recrawl of an existing crawldb?

With friendly regards
Stefan Scheffler

-- 
Stefan Scheffler
Avantgarde Labs GmbH
Löbauer Straße 19, 01099 Dresden
Telefon: + 49 (0) 351 21590834
Email: sscheffler@avantgarde-labs.de


Re: Crawl command help

Posted by "hugo.ma" <hu...@gmail.com>.
Again, the problem persists.

Look, for example, at the attached screenshot:

We can see that the previous fetch time is Sun, 16 Sep 2012 01:13:56 GMT
and the fetch time is Sun, 16 Sep 2012 06:27:14 GMT.

The fetch interval was not respected: with a 30-day interval this page
should not have been due again until around 16 Oct 2012.

nutche.jpg: http://lucene.472066.n3.nabble.com/file/n4001796/nutche.jpg




Re: Crawl command help

Posted by "hugo.ma" <hu...@gmail.com>.
The problem continues. I have the following script:

@echo off

rem *** Do not allow this script to permanently modify environment
rem *** variables, and enable delayed expansion so for loops can
rem *** accumulate into a variable using ! instead of %
setlocal ENABLEDELAYEDEXPANSION
SETLOCAL ENABLEEXTENSIONS
set Fdepth=300
set threads=10
rem set topN=-topN 15
set topN=
rem *** leave topN empty if you don't want to pass a -topN value

set urlDir=.\urls

rem *** Require JAVA_HOME
if "X%JAVA_HOME%" == "X" goto error


rem *** Setup the basic parameters
if "X%NUTCH_HOME%"     == "X" set NUTCH_HOME=%CD%\..\..
if "X%JAVA%"           == "X" set JAVA=%JAVA_HOME%\bin\java.exe
if "X%JAVA_HEAP_MAX%"  == "X" set JAVA_HEAP_MAX=-Xmx1000m
if "X%NUTCH_LOG_DIR%"  == "X" set NUTCH_LOG_DIR=%NUTCH_HOME%\logs
if "X%NUTCH_LOG_FILE%" == "X" set NUTCH_LOG_FILE=hadoop.log
set NUTCH_LOG_OPTS="-Dhadoop.log.dir=%NUTCH_LOG_DIR%" "-Dhadoop.log.file=%NUTCH_LOG_FILE%"
set CLASSPATH=%NUTCH_HOME%;%NUTCH_HOME%\conf;%JAVA_HOME%\lib\tools.jar

rem *** Add Nutch job file(s) to the class path (!CLASSPATH! rather than
rem *** %CLASSPATH% so the value accumulates across loop iterations)
for /f %%G IN ('dir /b ^"%NUTCH_HOME%\nutch-*.job^"') do set CLASSPATH=!CLASSPATH!;%NUTCH_HOME%\%%G

rem *** Add Nutch .jar file(s) to the class path
for /f %%G IN ('dir /b ^"%NUTCH_HOME%\lib\*.jar^"') do set CLASSPATH=!CLASSPATH!;%NUTCH_HOME%\lib\%%G

rem *** Add Nutch .jar file(s) from jetty to the class path
for /f %%G IN ('dir /b ^"%NUTCH_HOME%\lib\jetty-ext\*.jar^"') do set CLASSPATH=!CLASSPATH!;%NUTCH_HOME%\lib\jetty-ext\%%G



rem *** Revamp the path
set PATH=/bin;%CD%\..\..\bin

			echo "**************************************************************"
			echo "--------------------- NUTCH vODAFONE --------------------------"
			echo "**************************************************************"

		set steps=2

		echo "**************************************************************"
		echo "--- Inject first urls---"
		echo "**************************************************************"
		echo "----- Inject (Step 1 of %steps%) -----"
		"%JAVA%" %JAVA_HEAP_MAX% %NUTCH_LOG_OPTS% %NUTCH_OPTS% -classpath
"%CLASSPATH%" org.apache.nutch.crawl.InjectorJob %urlDir%

		echo "**************************************************************"
		echo "----- Generate, Fetch, Parse, Update (Step 2 of %steps%) -----"
		echo "**************************************************************"

		for /l %%d in (1, 1, %Fdepth%) do (

		echo "**************************************************************"
		echo "--- Beginning GENERATE at depth %%d ---"
		echo "**************************************************************"

		"%JAVA%" %JAVA_HEAP_MAX% %NUTCH_LOG_OPTS% %NUTCH_OPTS% -classpath
"%CLASSPATH%" org.apache.nutch.crawl.GeneratorJob %topN% 
		echo "batch-id"
		set /p batchid="Enter ID: " %=%
		echo !batchid!
			echo   !batchid! batch id***********
		if  NOT "%errorlevel%"=="0" (
			echo "runbot: Stopping at depth %%d. No more URLs to fetch."
				EXIT
				)
			
			echo "**************************************************************"
			echo "--- Beginning FETCH at depth %%d ---"
			echo "**************************************************************"		
			"%JAVA%" %JAVA_HEAP_MAX% %NUTCH_LOG_OPTS% %NUTCH_OPTS% -classpath
"%CLASSPATH%" org.apache.nutch.fetcher.FetcherJob -batchId %batchid%
			
		if NOT "%errorlevel%"=="0" ( 
		echo "runbot: fetch  at depth %%d failed."
			rem echo "runbot: Deleting segment $segment."
			)
		
			echo "**************************************************************"
			echo "--- Beginning PARSE at depth %%d ---"
			echo "**************************************************************"	
		
			"%JAVA%" %JAVA_HEAP_MAX% %NUTCH_LOG_OPTS% %NUTCH_OPTS% -classpath
"%CLASSPATH%" org.apache.nutch.parse.ParserJob  -batchId %batchid%
	
			if  NOT "%errorlevel%"=="0" (
			echo "runbot: Stopping at depth %%d. error in parsejob."
			EXIT
				)
					
					
			echo "**************************************************************"
			echo "--- Beginning UPDATEDB at depth %%d ---"
			echo "**************************************************************"	
			"%JAVA%" %JAVA_HEAP_MAX% %NUTCH_LOG_OPTS% %NUTCH_OPTS% -classpath
"%CLASSPATH%" org.apache.nutch.crawl.DbUpdaterJob 
			if  NOT "%errorlevel%"=="0" (
			echo "runbot: Stopping at depth %%d. error in updater."
				EXIT
				)
		)


			echo "**************************************************************"
			echo "**************************************************************"
			echo "**************************************************************"

			echo "FIM"	
			echo ""


:error
echo "ERROR: You must specify the path to your Java installation in the
JAVA_HOME environment variable
color 00

:done
rem *** Restore environment variables
echo "FIM"
endlocal

hugo.ma wrote:
> 
> For example, I put the url nabble.com in a seed file. Nutch fetches and
> parses it, and from the parse I get nabble.com/user and nabble.com/admin.
> Then in the next fetch job all three urls are fetched and parsed:
> nabble.com
> nabble.com/user
> nabble.com/admin
> 
> And this process repeats until the end of the depth. (The urls are
> fictitious.)
> 
> I left Nutch running on Tuesday around 18:00 and today I checked my SQL
> Server database; the last record was from Wednesday 10:40. It is still
> running over all the urls fetched so far, around 3400 pages.
> 
> I didn't check Nutch yesterday because it was a holiday.
> 





Re: Crawl command help

Posted by "hugo.ma" <hu...@gmail.com>.
I've spent some time looking through hadoop.log and I don't see any errors,
only the fetch and parse lists getting bigger every iteration.

Meanwhile I've converted the nutch script to Windows, the one that does the
steps individually. It looks OK, but I only got around 100 urls. I am going
to leave Nutch running until tomorrow morning and then I will check whether
the problem persists.

Thanks for the help.




Re: Crawl command help

Posted by Ferdy Galema <fe...@kalooga.com>.
I am not aware of a bug like this. You will have to debug or trace a crawl
for more details. A hint: normally a url is only fetched when its fetch
time makes it eligible, so try to track a url from start (generate) to end
(after update) and see what happens to its fetch time.
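
One way to follow a single row is the readdb tool, something like this
(in 2.x readdb maps to the WebTableReader; I am quoting the option from
memory, so check the usage output of bin/nutch readdb):

  bin/nutch readdb -url http://www.example.com/

The dump for that row should include fetchTime, prevFetchTime and
fetchInterval, which are exactly the fields to compare before and after
the updatedb step.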

Or maybe someone else has some advice.

On Thu, Aug 16, 2012 at 2:09 PM, hugo.ma <hu...@gmail.com> wrote:

> I am creating a script that does the steps individually; it is enough to
> do inject -> Loop(Generate, Fetch, Parse, Update). I don't need the index
> feature.
>
> I am writing the script because it has to run on Windows without Cygwin;
> the Crawler class already works.
>

Re: Crawl command help

Posted by "hugo.ma" <hu...@gmail.com>.
I am creating a script that does the steps individually; it is enough to do
inject -> Loop(Generate, Fetch, Parse, Update). I don't need the index
feature.

I am writing the script because it has to run on Windows without Cygwin;
the Crawler class already works.




Re: Crawl command help

Posted by "hugo.ma" <hu...@gmail.com>.
For example, I put the url nabble.com in a seed file. Nutch fetches and
parses it, and from the parse I get nabble.com/user and nabble.com/admin.
Then in the next fetch job all three urls are fetched and parsed:
nabble.com
nabble.com/user
nabble.com/admin

And this process repeats until the end of the depth. (The urls are
fictitious.)

I left Nutch running on Tuesday around 18:00 and today I checked my SQL
Server database; the last record was from Wednesday 10:40. It is still
running over all the urls fetched so far, around 3400 pages.

I didn't check Nutch yesterday because it was a holiday.






Re: Crawl command help

Posted by Ferdy Galema <fe...@kalooga.com>.
Hi,

Could you be a bit more specific about which urls are refetched? In general
it is advisable to run the different jobs explicitly, to have more control
over the crawl (inject, generate, fetch, parse, update, etc.).
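
As a minimal sketch, one explicit cycle with the 2.0 tools looks roughly
like this (commands from memory; run bin/nutch without arguments to check
the exact usage on your build):

  bin/nutch inject urls
  bin/nutch generate -topN 30000
  bin/nutch fetch <batchId>
  bin/nutch parse <batchId>
  bin/nutch updatedb

Here urls is your seed directory and <batchId> is the batch id printed by
the generate step; repeat generate through updatedb once per depth.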

Ferdy.

On Thu, Aug 16, 2012 at 12:55 PM, Hugo Alves <hu...@gmail.com> wrote:

> Hi.
>
> I am using Nutch 2.0 with HSQL.
>
> I've created some plugins for parsing special content inside the company
> website; the plugins parse the content and then send some data to a SQL
> Server database. This is working fine. But the problem is the crawl
> command. I am starting Nutch with:
> ./nutch crawl -depth 300 -topN 30000
>
> In nutch-site.xml I configured the refetch interval to 30 days (the
> default value), but after each cycle Nutch fetches both the newly found
> pages and the old pages.
>
> What am I doing wrong?
>

Re: Crawl command help

Posted by weishenyun <wl...@yahoo.com.cn>.
Are you sure that in each cycle Nutch re-crawls the old pages? Every old page?


