Posted to user@nutch.apache.org by Robert Irribarren <ro...@algorithms.io> on 2012/08/26 02:23:40 UTC

nutch 2.0 updatedb Killed and more concerns

I am using Nutch 2.0 and Solr 4.0 and am having minimal success. I have 3
urls, and my regex-urlfilter.xml is set to allow everything.
I ran this script:

#!/bin/bash

# Nutch crawl

export NUTCH_HOME=~/java/workspace/Nutch2.0/runtime/local

# depth in the web exploration
n=1
# number of selected urls for fetching
maxUrls=50000
# solr server
solrUrl=http://localhost:8983

for (( i = 1 ; i <= $n ; i++ ))
do

log="$NUTCH_HOME/logs/log"

# Generate
"$NUTCH_HOME/bin/nutch" generate -topN $maxUrls > "$log"

batchId=$(sed -n 's|.*batch id: \(.*\)|\1|p' < "$log")

# rename log file by appending the batch id
log2="$log$batchId"
mv "$log" "$log2"
log="$log2"

# Fetch
"$NUTCH_HOME/bin/nutch" fetch "$batchId" >> "$log"

# Parse
"$NUTCH_HOME/bin/nutch" parse "$batchId" >> "$log"

# Update
"$NUTCH_HOME/bin/nutch" updatedb >> "$log"

# Index
"$NUTCH_HOME/bin/nutch" solrindex "$solrUrl" "$batchId" >> "$log"

done
----------------------------
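For reference, here is a minimal, self-contained sketch of what the sed line in the script does to pull out the batch id. The sample log line is an assumption; the real GeneratorJob output may be worded differently, but the extraction works the same way:

```shell
# Simulate a generate log (assumed format: a line containing "batch id: <id>")
log=$(mktemp)
printf 'GeneratorJob: generated batch id: 1345941820-123456\n' > "$log"

# Same extraction as in the crawl script above: print only the text
# after "batch id: " on the matching line
batchId=$(sed -n 's|.*batch id: \(.*\)|\1|p' < "$log")
echo "$batchId"
rm -f "$log"
```

If the generate step writes no such line, batchId ends up empty and every later command in the loop gets an empty argument, so it is worth checking that the variable is non-empty before fetching.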
Of course, I run bin/nutch inject urls before the script, but when I
look at the logs I see "Skipping : different batch id", and some of the
urls I see are ones that aren't in seed.txt. I want them included in
Solr, but they aren't added.
I have 3 urls in my seed.txt.
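As a debugging sketch, you can count how many URLs a pass skipped by grepping the log. The log contents below are hypothetical, and the exact message wording is an assumption based on what I quoted above:

```shell
# Hypothetical log contents; the real Nutch message wording may differ slightly
log=$(mktemp)
cat > "$log" <<'EOF'
fetching http://example.com/a
Skipping http://example.com/b; different batch id
Skipping http://example.com/c; different batch id
EOF

# Count how many URLs were skipped for belonging to a different batch
skipped=$(grep -c 'different batch id' "$log")
echo "$skipped"
rm -f "$log"
```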

After I ran this script, I tried:
bin/nutch parse -force -all
bin/nutch updatedb
bin/nutch solrindex http://127.0.0.1:8983/solr/sites -reindex

My questions are as follows:
1. Why were the last three commands necessary?
2. How do I get all of the urls during the parse job? Even with
-force -all, I still get "different batch id" skipping.
3. In the script above, if I set generate -topN to 5, does this mean
that if a site has a link to another site, which links to another site,
and so on, all of them will be included in the fetch/parse cycle?
4. What about this command; why is it even mentioned: bin/nutch crawl
urls -solr http://127.0.0.1:8983/solr/sites -depth 3 -topN 10000
-threads 3?
5. When I run bin/nutch updatedb, it takes 1-2 minutes and then echoes
"Killed". This concerns me. Please help.
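On the "Killed" message: a bare "Killed" usually means the process received SIGKILL, often from the kernel's OOM killer when the machine runs out of memory (that cause is an assumption; checking dmesg for an oom-kill entry around that time would confirm it). A SIGKILL'ed process exits with status 137 (128 + signal number 9), which a toy example can show:

```shell
# Toy demonstration: a process terminated by SIGKILL exits with status 137
sleep 30 &
pid=$!
kill -9 "$pid"
wait "$pid"
status=$?
echo "$status"
```

If dmesg does show an OOM kill, giving the JVM more heap may help; the bin/nutch launcher honors a NUTCH_HEAPSIZE environment variable in recent versions, though whether your build reads it is worth verifying.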