Posted to user@nutch.apache.org by "Lacoursiere, Francois" <fl...@Kronos.com> on 2005/10/19 15:49:31 UTC

Missing files in fetchlist

Hello,
 
I have a small problem. I'm indexing the files of a web server (Apache)
on my intranet. One directory on that server contains 50 files. When I
run the generate and fetch commands, I see that the last 3 files are
never fetched.
 
The following 2 workarounds work:
- If I create an index.html file that links to all 50 files, then all
50 files are in the fetch list and they are indexed.
- If I create a subdirectory, leave 47 files in the parent directory and
move the other 3 files into the subdirectory, then all 50 files are in
the fetch list and they are indexed.
 
Do you have an idea what's going wrong?
 
thanks
Francois.
 
Here is the script I use to build the fetch list and index:
:
echo "** Nutch Index 1 iteration"
bin/nutch generate db segments 
s1=`ls -d segments/2* | tail -1`
echo $s1
echo "** fetch list"
bin/nutch fetchlist db -local -dumpurls $s1
bin/nutch fetch -local -threads 1 $s1
bin/nutch updatedb db $s1
 
echo "** Nutch Index 2 iteration"
bin/nutch generate db segments 
s2=`ls -d segments/2* | tail -1`
echo "** fetch list"
bin/nutch fetchlist db -local -dumpurls $s2
echo $s2
bin/nutch fetch -local -threads 1 $s2
bin/nutch updatedb db $s2
 
echo "** Nutch Index 3 iteration"
bin/nutch generate db segments 
s3=`ls -d segments/2* | tail -1`
echo "** fetch list"
bin/nutch fetchlist db -local -dumpurls $s3
echo $s3
bin/nutch fetch -local -threads 1 $s3
bin/nutch updatedb db $s3
 
bin/nutch index $s1
bin/nutch index $s2
bin/nutch index $s3
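
For reference, the three rounds above could probably be folded into a
loop. This is only a sketch: the crawl_rounds function name and the
NUTCH variable are made up here and are not part of Nutch. The bin/nutch
commands and their arguments are copied unchanged from the script above.

```shell
# Sketch only: the same generate/fetchlist/fetch/updatedb rounds as a loop.
# NUTCH and crawl_rounds are hypothetical names, not part of Nutch.
NUTCH="${NUTCH:-bin/nutch}"

crawl_rounds() {
    rounds="$1"
    segments=""
    i=1
    while [ "$i" -le "$rounds" ]; do
        echo "** Nutch Index $i iteration"
        "$NUTCH" generate db segments || return 1
        # Newest segment directory, picked the same way as in the script above.
        seg=$(ls -d segments/2* | tail -1)
        echo "** fetch list"
        "$NUTCH" fetchlist db -local -dumpurls "$seg"
        echo "$seg"
        "$NUTCH" fetch -local -threads 1 "$seg"
        "$NUTCH" updatedb db "$seg"
        segments="$segments $seg"
        i=$((i + 1))
    done
    # Index every segment produced above (assumes no spaces in segment names).
    for seg in $segments; do
        "$NUTCH" index "$seg"
    done
}
```

Calling "crawl_rounds 3" from the Nutch directory should then run the
same sequence of commands as the script above.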