Posted to user@nutch.apache.org by Manoharam Reddy <ma...@gmail.com> on 2007/05/26 12:23:26 UTC

Deleting crawl still gives proper results

After I create the crawldb by running bin/nutch crawl, I start my
Tomcat server, and it gives proper search results.

What I am wondering is why, even after I delete the 'crawl' folder,
the search page still gives proper search results. How is this
possible? Only after I restart the Tomcat server does it stop giving
results.

Re: Deleting crawl still gives proper results

Posted by Manoharam Reddy <ma...@gmail.com>.
There is no need to restart the server. You can make Tomcat reload the
new index by simply touching the web.xml file in
webapps/ROOT/WEB-INF, for example:

  touch /opt/tomcat/webapps/ROOT/WEB-INF/web.xml
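
In full, a refresh-and-reload can be scripted along these lines (a rough
sketch; the paths and crawl options are examples, adjust them to your
installation):

  #!/bin/sh
  # Example locations; adjust to your installation.
  NUTCH_HOME=/opt/nutch
  TOMCAT_HOME=/opt/tomcat

  cd "$NUTCH_HOME"

  # Crawl into a fresh directory so the live index is never half-written.
  bin/nutch crawl urls -dir crawl.new -depth 3

  # Swap the new crawl into place, keeping the old one as a backup.
  mv crawl crawl.old
  mv crawl.new crawl

  # Touch web.xml so Tomcat reloads the webapp and reopens the new index.
  touch "$TOMCAT_HOME/webapps/ROOT/WEB-INF/web.xml"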

On 5/27/07, Enzo Michelangeli <en...@gmail.com> wrote:
> The webapp seems to cache data. I have a related problem: updates to the
> indexes are only noticed after restarting Tomcat (so I have scheduled a
> nightly cron job to do that).
> [...]

Re: Deleting crawl still gives proper results

Posted by Enzo Michelangeli <en...@gmail.com>.
Not crawldb, and surely not entire files, but information about the indexes.
If you modify directory information while files are still open by a process
(e.g. by renaming a directory that contains them and creating a new directory
with the old name), the process keeps accessing the original files on disk
until it closes and reopens them (hence my question about mergesegs and
mergedb).
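
You can see the effect with a plain shell experiment, nothing
Nutch-specific about it:

  # A reader holds a file open while we rename its directory.
  mkdir -p crawl/index
  echo "old data" > crawl/index/part-00000
  tail -f crawl/index/part-00000 &   # keeps the original inode open

  mv crawl crawl.old                 # rename out from under the reader
  mkdir -p crawl/index
  echo "new data" > crawl/index/part-00000

  # tail still follows the old inode (now under crawl.old) and never
  # shows "new data" -- just as the webapp keeps serving the old index
  # until it closes and reopens its files.
  kill %1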

----- Original Message ----- 
From: "Manoharam Reddy" <ma...@gmail.com>
To: <nu...@lucene.apache.org>
Sent: Monday, May 28, 2007 1:53 PM
Subject: Re: Deleting crawl still gives proper results


> The webapp caches the whole crawldb? Can anyone please tell me where
> it caches the whole crawldb? I don't think it is possible to cache
> it in RAM. Is it cached in some location on the hard disk?
>
> Please clarify this point.
>
> [...]


Re: Deleting crawl still gives proper results

Posted by Manoharam Reddy <ma...@gmail.com>.
The webapp caches the whole crawldb? Can anyone please tell me where
it caches the whole crawldb? I don't think it is possible to cache
it in RAM. Is it cached in some location on the hard disk?

Please clarify this point.

On 5/27/07, Enzo Michelangeli <en...@gmail.com> wrote:
> The webapp seems to cache data. I have a related problem: updates to the
> indexes are only noticed after restarting Tomcat (so I have scheduled a
> nightly cron job to do that).
> [...]

Re: Deleting crawl still gives proper results

Posted by Enzo Michelangeli <en...@gmail.com>.
----- Original Message ----- 
From: "Manoharam Reddy" <ma...@gmail.com>
To: <nu...@lucene.apache.org>
Sent: Saturday, May 26, 2007 6:23 PM

> After I create the crawldb by running bin/nutch crawl, I start my
> Tomcat server, and it gives proper search results.
>
> What I am wondering is why, even after I delete the 'crawl' folder,
> the search page still gives proper search results. How is this
> possible? Only after I restart the Tomcat server does it stop giving
> results.

The webapp seems to cache data. I have a related problem: updates to the
indexes are only noticed after restarting Tomcat (so I have scheduled a
nightly cron job to do that).
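
The cron job itself is just a stop-and-start, something along these lines
(assuming a standard Tomcat layout under /opt/tomcat):

  # /etc/cron.d/tomcat-restart: bounce Tomcat nightly at 03:00
  0 3 * * * root /opt/tomcat/bin/shutdown.sh && sleep 15 && /opt/tomcat/bin/startup.sh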

Question for the Ones Who Know: in "bin/nutch mergesegs", can I use the same
directory for input and output?

For example:

 bin/nutch mergesegs crawl/segments -dir crawl/segments

Same for mergedb: can I issue:

  bin/nutch mergedb crawl/crawldb crawl/crawldb

At present I pass through temporary directories and then switch them into
place of the old ones with a couple of "mv", but I don't know whether
that's necessary, or whether it may even be harmful (for example, leaving
the webapp, unaware of the "mv", pointing to the inode of the old
directory). And I noticed that "bin/nutch mergedb" does not create the
output directory until it's done, so I wonder if the explicit use of a
temporary directory in my scripts is redundant.
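
For concreteness, the temporary-directory dance in my scripts looks
roughly like this (directory names are just examples):

  # Merge all segments into a temporary directory, then swap it in.
  bin/nutch mergesegs crawl/segments.tmp -dir crawl/segments
  mv crawl/segments crawl/segments.old
  mv crawl/segments.tmp crawl/segments

  # Same pattern for the crawldb.
  bin/nutch mergedb crawl/crawldb.tmp crawl/crawldb
  mv crawl/crawldb crawl/crawldb.old
  mv crawl/crawldb.tmp crawl/crawldb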

Enzo