Posted to user@nutch.apache.org by eakarsu <ea...@gmail.com> on 2013/07/08 21:24:57 UTC
crawldb contents
I have a question on the contents of the crawldb folder with Nutch 1.6.
After I do the updatedb step, the crawldb folder includes the following. Is this
the correct result I should get?
If not, how can I fix it?
If I execute "generate" on this crawldb below, will it generate the full url
lists? My concern is that the updatedb process did not complete fully, because we
see both the "624730206" and "current" folders at the same time.
Does Nutch take care of this?
I appreciate your help
hduser@hadoopdev1:~$ hadoop dfs -ls 160milyonurls/crawldb
Warning: $HADOOP_HOME is deprecated.
Found 3 items
drwxr-xr-x - hduser supergroup 0 2013-07-05 23:55
/user/hduser/160milyonurls/crawldb/624730206
drwxr-xr-x - hduser supergroup 0 2013-07-08 18:59
/user/hduser/160milyonurls/crawldb/current
drwxr-xr-x - hduser supergroup 0 2013-07-03 14:39
/user/hduser/160milyonurls/crawldb/old
--
View this message in context: http://lucene.472066.n3.nabble.com/crawldb-contents-tp4076345.html
Sent from the Nutch - User mailing list archive at Nabble.com.
Re: crawldb contents
Posted by eakarsu <ea...@gmail.com>.
Sebastian,

Actually, the updatedb hadoop job finished successfully:

User: hduser
JobName: crawldb 160milyonurls/crawldb
JobConf: hdfs://summitdev1:54310/media/sdb/app/hadoop/tmp/mapred/staging/hduser/.staging/job_201307050940_0002/job.xml
Job-ACLs: All users are allowed
Submitted At: 5-Jul-2013 22:14:37
Launched At: 5-Jul-2013 22:14:38 (0sec)
Finished At: 5-Jul-2013 23:55:11 (1hrs, 40mins, 33sec)
Status: SUCCESS

Kind     Total  Successful  Failed  Killed  Start Time           Finish Time
Setup    1      1           0       0       5-Jul-2013 22:15:31  5-Jul-2013 22:15:32 (1sec)
Map      3043   3043        0       0       5-Jul-2013 22:14:41  5-Jul-2013 23:16:56 (1hrs, 2mins, 14sec)
Reduce   40     40          0       0       5-Jul-2013 22:18:25  5-Jul-2013 23:55:35 (1hrs, 37mins, 10sec)
Cleanup  1      1           0       0       5-Jul-2013 23:55:10  5-Jul-2013 23:55:11 (1sec)
Counter (Map / Reduce / Total)

Job Counters
  Launched reduce tasks: 0 / 0 / 40
  SLOTS_MILLIS_MAPS: 0 / 0 / 1,078,990,953
  Total time spent by all reduces waiting after reserving slots (ms): 0 / 0 / 0
  Total time spent by all maps waiting after reserving slots (ms): 0 / 0 / 0
  Rack-local map tasks: 0 / 0 / 5
  Launched map tasks: 0 / 0 / 3,043
  Data-local map tasks: 0 / 0 / 3,038
  SLOTS_MILLIS_REDUCES: 0 / 0 / 230,033,243

File Input Format Counters
  Bytes Read: 795,488,504,160 / 0 / 795,488,504,160

File Output Format Counters
  Bytes Written: 0 / 232,897,849,385 / 232,897,849,385

FileSystemCounters
  FILE_BYTES_READ: 222,822,617,457 / 277,017,265,266 / 499,839,882,723
  HDFS_BYTES_READ: 795,489,371,753 / 0 / 795,489,371,753
  FILE_BYTES_WRITTEN: 340,286,252,780 / 277,020,558,696 / 617,306,811,476
  HDFS_BYTES_WRITTEN: 0 / 232,897,849,385 / 232,897,849,385

CrawlDB status
  db_redir_temp: 0 / 22,973,233 / 22,973,233
  db_redir_perm: 0 / 24,711,774 / 24,711,774
  db_notmodified: 0 / 6,186 / 6,186
  db_unfetched: 0 / 744,252,948 / 744,252,948
  db_gone: 0 / 13,677,812 / 13,677,812
  db_fetched: 0 / 472,704,451 / 472,704,451

Map-Reduce Framework
  Map output materialized bytes: 164,837,502,212 / 0 / 164,837,502,212
  Map input records: 6,805,902,736 / 0 / 6,805,902,736
  Reduce shuffle bytes: 0 / 164,837,502,212 / 164,837,502,212
  Spilled Records: 14,135,524,721 / 12,724,232,390 / 26,859,757,111
  Map output bytes: 731,983,634,000 / 0 / 731,983,634,000
  Total committed heap usage (bytes): 1,311,373,459,456 / 21,273,116,672 / 1,332,646,576,128
  CPU time spent (ms): 75,047,000 / 32,011,850 / 107,058,850
  Map input bytes: 795,485,093,219 / 0 / 795,485,093,219
  SPLIT_RAW_BYTES: 443,407 / 0 / 443,407
  Combine input records: 0 / 0 / 0
  Reduce input records: 0 / 6,795,657,408 / 6,795,657,408
  Reduce input groups: 0 / 1,278,326,704 / 1,278,326,704
  Combine output records: 0 / 0 / 0
  Physical memory (bytes) snapshot: 810,779,176,960 / 27,269,132,288 / 838,048,309,248
  Reduce output records: 0 / 1,278,326,404 / 1,278,326,404
  Virtual memory (bytes) snapshot: 4,017,384,558,592 / 53,819,904,000 / 4,071,204,462,592
  Map output records: 6,795,657,408 / 0 / 6,795,657,408
Re: crawldb contents
Posted by Sebastian Nagel <wa...@googlemail.com>.
It should be possible to merge the CrawlDbs,
but not that way. "current" is a hard-wired
subdir, so a correct call would not contain "current":
nutch mergedb <output> crawldb1/ crawldb2/
I understand you may have lost a lot of data, but again:
> Assuming crawling is continued, the missed data will be crawled again
> (or already has been, because it happened 3 days ago).
But that is also a question of how you run the crawl.
First, you should check whether entries are really lost.
If yes, you had better run the update job again.
The segment to update the CrawlDb with should still be there.
The update job took 1.5h, which is a lot. What is your -topN?
If it's large, reduce it so that one cycle finishes within a few hours.
Then, if a job fails, the loss is tolerable; just run it again.
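Sebastian's advice maps onto the following commands (a sketch only, not runnable without the cluster from this thread: the crawldb path is taken from the messages above, `<segment_dir>` is a placeholder for the actual segment directory, and a working Nutch 1.x installation is assumed):

```shell
# Check whether entries are really lost: dump CrawlDb statistics and
# compare db_fetched/db_unfetched against the job counters reported above.
bin/nutch readdb 160milyonurls/crawldb -stats

# If entries are missing, run the update job again with the existing
# segment; <segment_dir> stands for the real segment name.
bin/nutch updatedb 160milyonurls/crawldb 160milyonurls/segments/<segment_dir>
```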
On 07/08/2013 10:49 PM, eakarsu wrote:
> Sebastian,
>
> The hadoop job result page does not render properly. There was nothing wrong
> for updatedb job.
>
> Can we merge current and 624730206 folders with command?
>
> nutch mergedb <output_crawldb> 160milyonurls/crawldb/current
> 160milyonurls/crawldb/624730206
>
> [...]
Re: crawldb contents
Posted by eakarsu <ea...@gmail.com>.
Sebastian,
The hadoop job result page does not render properly. There was nothing wrong
with the updatedb job.
Can we merge the "current" and "624730206" folders with this command?
nutch mergedb <output_crawldb> 160milyonurls/crawldb/current
160milyonurls/crawldb/624730206
User: hduser
JobName: crawldb 160milyonurls/crawldb
JobConf:
hdfs://summitdev1:54310/media/sdb/app/hadoop/tmp/mapred/staging/hduser/.staging/job_201307050940_0002/job.xml
Job-ACLs: All users are allowed
Submitted At: 5-Jul-2013 22:14:37
Launched At: 5-Jul-2013 22:14:38 (0sec)
Finished At: 5-Jul-2013 23:55:11 (1hrs, 40mins, 33sec)
Status: SUCCESS
Failure Info:
Analyse This Job
Kind     Total  Successful  Failed  Killed  Start Time           Finish Time
Setup    1      1           0       0       5-Jul-2013 22:15:31  5-Jul-2013 22:15:32 (1sec)
Map      3043   3043        0       0       5-Jul-2013 22:14:41  5-Jul-2013 23:16:56 (1hrs, 2mins, 14sec)
Reduce   40     40          0       0       5-Jul-2013 22:18:25  5-Jul-2013 23:55:35 (1hrs, 37mins, 10sec)
Cleanup  1      1           0       0       5-Jul-2013 23:55:10  5-Jul-2013 23:55:11 (1sec)
Re: crawldb contents
Posted by Sebastian Nagel <wa...@googlemail.com>.
Hi,
the folder "624730206" does indeed point to a failed (or canceled?) updatedb job.
If the job is successful, the intermediate output path (a random number)
is installed (moved) to "current".
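That install step can be illustrated with a local-filesystem sketch (an analogue only, not Nutch code: the real job writes the random-numbered directory on HDFS and performs the renames there after the job succeeds):

```shell
# Local-filesystem analogue of the CrawlDb install step (bash).
set -e
db=$(mktemp -d)/crawldb
tmp="$db/$RANDOM"                 # intermediate output dir (a random number)
mkdir -p "$tmp"
echo data > "$tmp/part-00000"     # stand-in for the reducer output
# Only on success: rotate current -> old, then install tmp -> current.
if [ -d "$db/current" ]; then
  rm -rf "$db/old"
  mv "$db/current" "$db/old"
fi
mv "$tmp" "$db/current"
# If the job died before this point, the random-numbered dir would be
# left behind next to "current" -- exactly the "624730206" symptom.
ls "$db"
```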
You should have a look at the logs around 2013-07-05 23:55.
Assuming crawling is continued, the missed data will be crawled again
(or already has been, because it happened 3 days ago).
Sebastian
On 07/08/2013 09:24 PM, eakarsu wrote:
> My concern is that updatedb process is not completed fully because we
> see "624730206" and "current" folders at the same time.
> [...]