Posted to user@nutch.apache.org by "Marshall, Al" <al...@sensis.com> on 2005/10/15 22:43:48 UTC

Intranet Search

I've been using Nutch on an intranet for about a month.
db.default.fetch.interval is set to 1 day. Fetching and indexing are done
each night. As of this morning, the main index and segment index are empty
and the fetchListTool produces an empty fetchlist. I've tried generating the
fetchlist with -adddays 2, 8 and 30 but the fetchlist is still empty. Next
fetch dates in the final merged segment (obtained from segread) range from
about a month ago up to 7 days from today. A dump of the WebDB shows that it
has plenty of pages (~4000) and links. The main part of the bash script that
runs each night follows:

echo "Nutch program directory: " ${nutchdir}
echo "Data directory: " ${datadir}

# iteratively build a set of data segments
for pass in 1 2 3
do
   echo "***** Fetch pass " ${pass} "*****"
   # generate a segment subdirectory and fetchlist from the database
   ${nutchdir}/bin/nutch generate ${datadir}/db ${datadir}/segments

   # get the segment subdirectory name that was just created
   seg[pass]=`ls -d ${datadir}/segments/2* | tail -1`
   echo "Created segment dir ${seg[pass]}"

   # fetch the content of the segment from its fetchlist
   ${nutchdir}/bin/nutch fetch ${seg[pass]}

   # update the nutch database with the fetched results
   ${nutchdir}/bin/nutch updatedb ${datadir}/db ${seg[pass]}

   # run several iterations of link analysis to prioritize popular pages
   ${nutchdir}/bin/nutch analyze ${datadir}/db 2
done

# merge all existing segments into 1 new one, index it and delete the old segs
${nutchdir}/bin/nutch mergesegs -dir ${datadir}/segments -i -ds

# copy new segment index into main index directory
newseg=`ls -d ${datadir}/segments/2* | tail -1`
${nutchdir}/bin/nutch merge ${datadir}/index ${newseg}
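
One thing the script above does not check: the `ls -d ... | tail -1` lookup returns the newest existing segment whether or not generate just created one, so a pass could end up re-fetching an older segment. A minimal sketch of a guard follows. The helper names latest_segment and segment_was_created are hypothetical and not part of Nutch, and whether generate ever skips creating a directory for an empty fetchlist is an assumption, not something confirmed here.

```shell
# Hypothetical helper (not part of Nutch): print the newest segment
# directory under $1, or print nothing if there is none.
latest_segment() {
    ls -d "$1"/2* 2>/dev/null | tail -1
}

# Hypothetical helper: succeed only if the newest segment changed,
# i.e. a new segment directory appeared between the two snapshots.
# $1 = newest segment before generate, $2 = newest segment after.
segment_was_created() {
    [ -n "$2" ] && [ "$1" != "$2" ]
}

# Intended use inside the fetch loop (shown as comments because it
# needs a Nutch installation to run):
#   before=$(latest_segment "${datadir}/segments")
#   ${nutchdir}/bin/nutch generate ${datadir}/db ${datadir}/segments
#   after=$(latest_segment "${datadir}/segments")
#   if ! segment_was_created "$before" "$after"; then
#       echo "generate created no new segment, stopping"
#       break
#   fi
```

This only detects a missing segment directory; it does not inspect the fetchlist inside a segment that was created.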

Why are the indexes empty?
Why are many of the "Next fetch" dates in the segment dump in the past?
What is the best way to get daily updates to work?
Is there a concise description somewhere of the outputs from the WebDB dump
and segment dump?

Thanks!

Al