Posted to user@nutch.apache.org by eakarsu <ea...@gmail.com> on 2013/07/08 21:24:57 UTC
crawldb contents
I have a question on the contents of the crawldb folder with Nutch 1.6.
After I do the updatedb step, the crawldb folder includes the following. Is this
the correct result I should get?
If not, how can I fix it?
If I execute "generate" on this crawldb below, will it generate the full url
lists? My concern is that the updatedb process did not complete fully, because we
see both the "624730206" and "current" folders at the same time.
Does Nutch take care of this?
I appreciate your help
hduser@hadoopdev1:~$ hadoop dfs -ls 160milyonurls/crawldb
Warning: $HADOOP_HOME is deprecated.
Found 3 items
drwxr-xr-x - hduser supergroup 0 2013-07-05 23:55
/user/hduser/160milyonurls/crawldb/624730206
drwxr-xr-x - hduser supergroup 0 2013-07-08 18:59
/user/hduser/160milyonurls/crawldb/current
drwxr-xr-x - hduser supergroup 0 2013-07-03 14:39
/user/hduser/160milyonurls/crawldb/old
--
View this message in context: http://lucene.472066.n3.nabble.com/crawldb-contents-tp4076345.html
Sent from the Nutch - User mailing list archive at Nabble.com.
Re: crawldb contents
Posted by eakarsu <ea...@gmail.com>.
Sebastian,

Actually, the updatedb hadoop job finished successfully:

User: hduser
JobName: crawldb 160milyonurls/crawldb
JobConf: hdfs://summitdev1:54310/media/sdb/app/hadoop/tmp/mapred/staging/hduser/.staging/job_201307050940_0002/job.xml
Job-ACLs: All users are allowed
Submitted At: 5-Jul-2013 22:14:37
Launched At: 5-Jul-2013 22:14:38 (0sec)
Finished At: 5-Jul-2013 23:55:11 (1hrs, 40mins, 33sec)
Status: SUCCESS

Kind     Total  Successful  Failed  Killed  Start Time           Finish Time
Setup    1      1           0       0       5-Jul-2013 22:15:31  5-Jul-2013 22:15:32 (1sec)
Map      3043   3043        0       0       5-Jul-2013 22:14:41  5-Jul-2013 23:16:56 (1hrs, 2mins, 14sec)
Reduce   40     40          0       0       5-Jul-2013 22:18:25  5-Jul-2013 23:55:35 (1hrs, 37mins, 10sec)
Cleanup  1      1           0       0       5-Jul-2013 23:55:10  5-Jul-2013 23:55:11 (1sec)
Counter (Map / Reduce / Total)

Job Counters
  Launched reduce tasks: 0 / 0 / 40
  SLOTS_MILLIS_MAPS: 0 / 0 / 1,078,990,953
  Total time spent by all reduces waiting after reserving slots (ms): 0 / 0 / 0
  Total time spent by all maps waiting after reserving slots (ms): 0 / 0 / 0
  Rack-local map tasks: 0 / 0 / 5
  Launched map tasks: 0 / 0 / 3,043
  Data-local map tasks: 0 / 0 / 3,038
  SLOTS_MILLIS_REDUCES: 0 / 0 / 230,033,243

File Input Format Counters
  Bytes Read: 795,488,504,160 / 0 / 795,488,504,160

File Output Format Counters
  Bytes Written: 0 / 232,897,849,385 / 232,897,849,385

FileSystemCounters
  FILE_BYTES_READ: 222,822,617,457 / 277,017,265,266 / 499,839,882,723
  HDFS_BYTES_READ: 795,489,371,753 / 0 / 795,489,371,753
  FILE_BYTES_WRITTEN: 340,286,252,780 / 277,020,558,696 / 617,306,811,476
  HDFS_BYTES_WRITTEN: 0 / 232,897,849,385 / 232,897,849,385

CrawlDB status
  db_redir_temp: 0 / 22,973,233 / 22,973,233
  db_redir_perm: 0 / 24,711,774 / 24,711,774
  db_notmodified: 0 / 6,186 / 6,186
  db_unfetched: 0 / 744,252,948 / 744,252,948
  db_gone: 0 / 13,677,812 / 13,677,812
  db_fetched: 0 / 472,704,451 / 472,704,451

Map-Reduce Framework
  Map output materialized bytes: 164,837,502,212 / 0 / 164,837,502,212
  Map input records: 6,805,902,736 / 0 / 6,805,902,736
  Reduce shuffle bytes: 0 / 164,837,502,212 / 164,837,502,212
  Spilled Records: 14,135,524,721 / 12,724,232,390 / 26,859,757,111
  Map output bytes: 731,983,634,000 / 0 / 731,983,634,000
  Total committed heap usage (bytes): 1,311,373,459,456 / 21,273,116,672 / 1,332,646,576,128
  CPU time spent (ms): 75,047,000 / 32,011,850 / 107,058,850
  Map input bytes: 795,485,093,219 / 0 / 795,485,093,219
  SPLIT_RAW_BYTES: 443,407 / 0 / 443,407
  Combine input records: 0 / 0 / 0
  Reduce input records: 0 / 6,795,657,408 / 6,795,657,408
  Reduce input groups: 0 / 1,278,326,704 / 1,278,326,704
  Combine output records: 0 / 0 / 0
  Physical memory (bytes) snapshot: 810,779,176,960 / 27,269,132,288 / 838,048,309,248
  Reduce output records: 0 / 1,278,326,404 / 1,278,326,404
  Virtual memory (bytes) snapshot: 4,017,384,558,592 / 53,819,904,000 / 4,071,204,462,592
  Map output records: 6,795,657,408 / 0 / 6,795,657,408
Re: crawldb contents
Posted by Sebastian Nagel <wa...@googlemail.com>.
It should be possible to merge the CrawlDbs,
but not that way. "current" is a hard-wired
subdir, so a correct call would not contain "current":
nutch mergedb <output> crawldb1/ crawldb2/
I understand you may have lost a lot of data, but again:
> Assuming crawling is continued, the missed data will be crawled again
> (or already has been, because it happened 3 days ago).
But that is also a question of how you run the crawl.
First, you should check whether entries are really lost.
If yes, you had better run the update job again.
The segment to update the CrawlDb with should still be there.
The update job took 1.5h, which is a lot. What is your -topN?
If it's large, reduce it so that one cycle finishes within a few hours.
Then, if a job fails, the loss is tolerable; just run it again.
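Sebastian's advice maps onto the following commands (a sketch only, not runnable without the cluster from this thread: the crawldb path is taken from the messages above, `<segment_dir>` is a placeholder for the actual segment directory, and a working Nutch 1.x installation is assumed):

```shell
# Check whether entries are really lost: dump CrawlDb statistics and
# compare db_fetched/db_unfetched against the job counters reported above.
bin/nutch readdb 160milyonurls/crawldb -stats

# If entries are missing, run the update job again with the existing
# segment; <segment_dir> stands for the real segment name.
bin/nutch updatedb 160milyonurls/crawldb 160milyonurls/segments/<segment_dir>
```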
On 07/08/2013 10:49 PM, eakarsu wrote:
> Sebastian,
>
> The hadoop job result page does not render properly. There was nothing wrong
> for updatedb job.
>
> Can we merge current and 624730206 folders with command?
>
> nutch mergedb <output_crawldb> 160milyonurls/crawldb/current
> 160milyonurls/crawldb/624730206
>
> [...]
Re: crawldb contents
Posted by eakarsu <ea...@gmail.com>.
Sebastian,
The hadoop job result page does not render properly. There was nothing wrong
with the updatedb job.
Can we merge the "current" and "624730206" folders with this command?
nutch mergedb <output_crawldb> 160milyonurls/crawldb/current
160milyonurls/crawldb/624730206
User: hduser
JobName: crawldb 160milyonurls/crawldb
JobConf:
hdfs://summitdev1:54310/media/sdb/app/hadoop/tmp/mapred/staging/hduser/.staging/job_201307050940_0002/job.xml
Job-ACLs: All users are allowed
Submitted At: 5-Jul-2013 22:14:37
Launched At: 5-Jul-2013 22:14:38 (0sec)
Finished At: 5-Jul-2013 23:55:11 (1hrs, 40mins, 33sec)
Status: SUCCESS
Failure Info:
Analyse This Job
Kind     Total  Successful  Failed  Killed  Start Time           Finish Time
Setup    1      1           0       0       5-Jul-2013 22:15:31  5-Jul-2013 22:15:32 (1sec)
Map      3043   3043        0       0       5-Jul-2013 22:14:41  5-Jul-2013 23:16:56 (1hrs, 2mins, 14sec)
Reduce   40     40          0       0       5-Jul-2013 22:18:25  5-Jul-2013 23:55:35 (1hrs, 37mins, 10sec)
Cleanup  1      1           0       0       5-Jul-2013 23:55:10  5-Jul-2013 23:55:11 (1sec)
Re: crawldb contents
Posted by Sebastian Nagel <wa...@googlemail.com>.
Hi,
the folder "624730206" does indeed point to a failed (or canceled?) updatedb job.
If the job is successful, the intermediate output path (a random number)
is installed (moved) to "current".
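That install step can be illustrated with a local-filesystem sketch (an analogue only, not Nutch code: the real job writes the random-numbered directory on HDFS and performs the renames there after the job succeeds):

```shell
# Local-filesystem analogue of the CrawlDb install step (bash).
set -e
db=$(mktemp -d)/crawldb
tmp="$db/$RANDOM"                 # intermediate output dir (a random number)
mkdir -p "$tmp"
echo data > "$tmp/part-00000"     # stand-in for the reducer output
# Only on success: rotate current -> old, then install tmp -> current.
if [ -d "$db/current" ]; then
  rm -rf "$db/old"
  mv "$db/current" "$db/old"
fi
mv "$tmp" "$db/current"
# If the job died before this point, the random-numbered dir would be
# left behind next to "current" -- exactly the "624730206" symptom.
ls "$db"
```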
You should have a look at the logs around 2013-07-05 23:55.
Assuming crawling is continued, the missed data will be crawled again
(or already has been, because it happened 3 days ago).
Sebastian
On 07/08/2013 09:24 PM, eakarsu wrote:
> My concern is that updatedb process is not completed fully because we
> see "624730206" and "current" folders at the same time.
> [...]