Posted to dev@nutch.apache.org by "Armel T. Nene" <ar...@idna-solutions.com> on 2006/11/28 21:20:28 UTC

Indexing and Re-crawling site

Hi guys,

 

I have a few questions regarding the way Nutch indexes and the best way
to implement a re-crawl.

 

1.	Why does Nutch have to create a new index every time it indexes,
instead of merging with the old existing index? I tried changing the
value in the IndexMerger class to 'false' while creating an index, so
that Lucene doesn't recreate a new index each time it indexes. The
problem with this is that I keep getting an exception when it tries to
merge the indexes: a lock timeout exception is thrown by the
IndexMerger, and consequently the index never gets created. Is it
possible to let Nutch index by merging with an existing index? I have to
crawl about 100 GB of data, and if only a few documents have changed, I
don't want Nutch to recreate a new index because of that, but rather to
update the existing index by merging it with the new one. I need some
light on this.

 

2.	What is the best way to make Nutch re-crawl? I have implemented a
class that loops the crawl process; it has a crawl interval, which is
set in a property file, and a running status. The running status is a
Boolean variable which is set to true while the re-crawl process is
ongoing and to false when it should stop. But with this approach, it
seems that the index is not being fully generated: the values in the
index cannot be queried. The re-crawl is written in Java and calls an
underlying Ant script to run Nutch. I know most re-crawls are written as
batch scripts, but which one do you recommend: a batch script or a
loop-based Java program?
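
Roughly, the looping class looks like this (a simplified sketch, not the
actual code: the names are made up, and the real crawl step, i.e. the call
into the Ant script that runs Nutch, is stubbed out behind a Runnable):

```java
// Hypothetical sketch of a loop-based re-crawl driver; the actual Nutch
// invocation (e.g. running an Ant script) is abstracted behind a Runnable.
public class RecrawlLoop {
    private final long intervalMillis;  // crawl interval, e.g. read from a property file
    private final Runnable crawlStep;   // the actual Nutch crawl invocation goes here
    private volatile boolean running = false;
    private int cycles = 0;

    public RecrawlLoop(long intervalMillis, Runnable crawlStep) {
        this.intervalMillis = intervalMillis;
        this.crawlStep = crawlStep;
    }

    // Flips the running status so the loop exits after the current round.
    public void stop() { running = false; }

    public boolean isRunning() { return running; }

    public int getCycles() { return cycles; }

    // Runs up to maxCycles crawl rounds, sleeping for the interval between
    // rounds, until stop() is called or the cycle budget is used up.
    public void run(int maxCycles) {
        running = true;
        while (running && cycles < maxCycles) {
            crawlStep.run();
            cycles++;
            if (running && cycles < maxCycles) {
                try {
                    Thread.sleep(intervalMillis);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    break;
                }
            }
        }
        running = false;
    }
}
```

One thing worth checking in a driver like this is that each round actually
waits for the external Nutch/Ant process to finish (and verifies its exit
status) before the next round starts.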

 

3.	What is the best way to implement Nutch as a Windows service or
Unix daemon?

 

Thanks,

 

Armel


Re: Indexing and Re-crawling site

Posted by Lukas Vlcek <lu...@gmail.com>.
Hi,

One thing I realized yesterday while looking into the Hadoop code is that,
as of now, it is not possible to update an existing file in HDFS. If I
understood it correctly, HDFS is a write-once file system and you cannot
update (add or remove content in) an existing file. Since Nutch can run on
top of HDFS, it makes sense to me why it works the way it does today.

Regards,
Lukas

On 12/5/06, Armel T. Nene <ar...@idna-solutions.com> wrote:
>
>
>
> Lukas,
>
> > This is more for Nutch experts but to me it seems that new index is
> > reasonable. Besides others it means that original index is still
> searchable
> > while the new index is being created (creating a new index can take long
> > time based on your settings). Updating one document at a time in large
> index
> > is not very optimal approach I think.
>
> I see your point but indexing could be a long process only if the all the
> files need to be re-fetched entirely. I don't know how this work, if the
> fetcher re-fetches everything and checks for the STATUS_UNMODIFIED before
> creating a segments out of it. The way I set it up at the moment is; the
> crawler will crawl a set of files at a time and so on but it also checks
> to
> see if the files held in the segments are up to date. My question is, when
> creating a new index, you said the old index will still be available until
> the new one is created, but what happens if someone is running a search
> while the IndexMerger is executing? I guess the old index gets deleted and
> replaced, does that mean an exception will be thrown while the index is
> being replaced?
>
> Regards,
>
> Armel
>
>
> -----Original Message-----
> From: Lukas Vlcek [mailto:lukas.vlcek@gmail.com]
> Sent: 04 December 2006 22:12
> To: nutch-dev@lucene.apache.org
> Subject: Re: Indexing and Re-crawling site
>
> Hi,
>
> I will try to use my out-dated knowledge to answer (& confuse you on) your
> items:
>
> 1.      Why does nutch has to create a new index every time when indexing,
> > while it can just merge it with the old existing index? I try to change
> > the
> > value in the IndexMerger class to 'false' while creating an index
> > therefore
> > Lucene doesn't recreate a new index each time it is indexing. The
> problem
> > with this is, I keep on having some exception when it tries to merge the
> > indexes. There is a lock time out exception that is thrown by the
> > IndexMerger. And consequently the index that get created. Is it possible
> > to
> > let nutch index by merging it with an existing index? I have to crawl
> > about
> > 100Gb of data and if there are only a few documents that have been
> > changed,
> > I don't nutch to recreate a new index because of that but update the
> > existing index by merging it with the new one. I need some light on
> this.
>
>
> This is more for Nutch experts but to me it seems that new index is
> reasonable. Besides others it means that original index is still
> searchable
> while the new index is being created (creating a new index can take long
> time based on your settings). Updating one document at a time in large
> index
> is not very optimal approach I think.
>
> 2.      What is the best way to make nutch re-crawl? I have implemented a
> > class that loops the crawl process; it has a crawl interval which is set
> > in
> > a property file and a running status. The running status is a Boolean
> > variable which is set to true if the re-crawl process is ongoing or
> false
> > if
> > it should stop. But with this approach, it seems that the index is not
> > being
> > fully generated. The values in the index cannot be queried. The re-crawl
> > is
> > in java which calls an underlying ant script to run nutch. I know most
> > re-crawl are written as batch script but can you tell me which one do
> you
> > recommended? A batch script or a loop-based java program?
>
>
> I used to use batch and was happy with it.
>
> 3.      What is the best way of implementing nutch has a window service or
> > unix daemon?
> >
>
> Sorry - what do you mean by this?
>
> Regards,
> Lukas
>
>

RE: Indexing and Re-crawling site

Posted by "Armel T. Nene" <ar...@idna-solutions.com>.
Lukas,

 

I was wondering about running Nutch as a Windows service. I was able to
implement it as follows:

 

1.    Create a Java program that acts as a Nutch launcher and
re-crawler.

2.    Download JavaService from http://javaservice.objectweb.org/

3.    Follow the tutorial to turn your Java program into a Windows service

 

I was then able to test it on Windows Server 2003 and XP, and it works
fine. If you want me to post the code, let me know; maybe others can use it
too.
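
In outline, the launcher class from step 1 just needs static entry points
that the service wrapper can call (a simplified sketch, not the real code:
the method names JavaService expects are configurable, so check its
documentation, and the actual start method would run the re-crawl loop):

```java
// Simplified, hypothetical sketch of the launcher class from step 1.
// A wrapper such as JavaService is configured with one static method to
// call on service start and another on service stop; the names used here
// are just an example.
public class NutchServiceLauncher {
    private static volatile boolean running = false;

    // Invoked by the service wrapper when the Windows service starts.
    public static void start(String[] args) {
        running = true;
        // The real launcher would enter the re-crawl loop here, invoking
        // Nutch (e.g. via the Ant script) once per configured interval
        // for as long as 'running' stays true.
    }

    // Invoked by the service wrapper when the Windows service stops.
    public static void stop(String[] args) {
        running = false;  // signals the re-crawl loop to exit
    }

    public static boolean isRunning() { return running; }
}
```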

 

Regards,

 

Armel

 

-----Original Message-----
From: Lukas Vlcek [mailto:lukas.vlcek@gmail.com]
Sent: 04 December 2006 22:12
To: nutch-dev@lucene.apache.org
Subject: Re: Indexing and Re-crawling site

Hi,

I will try to use my out-dated knowledge to answer (& confuse you on) your
items:

1.      Why does nutch has to create a new index every time when indexing,
> while it can just merge it with the old existing index? I try to change
> the
> value in the IndexMerger class to 'false' while creating an index
> therefore
> Lucene doesn't recreate a new index each time it is indexing. The problem
> with this is, I keep on having some exception when it tries to merge the
> indexes. There is a lock time out exception that is thrown by the
> IndexMerger. And consequently the index that get created. Is it possible
> to
> let nutch index by merging it with an existing index? I have to crawl
> about
> 100Gb of data and if there are only a few documents that have been
> changed,
> I don't nutch to recreate a new index because of that but update the
> existing index by merging it with the new one. I need some light on this.


This is more for Nutch experts but to me it seems that new index is
reasonable. Besides others it means that original index is still searchable
while the new index is being created (creating a new index can take long
time based on your settings). Updating one document at a time in large index
is not very optimal approach I think.

2.      What is the best way to make nutch re-crawl? I have implemented a
> class that loops the crawl process; it has a crawl interval which is set
> in
> a property file and a running status. The running status is a Boolean
> variable which is set to true if the re-crawl process is ongoing or false
> if
> it should stop. But with this approach, it seems that the index is not
> being
> fully generated. The values in the index cannot be queried. The re-crawl
> is
> in java which calls an underlying ant script to run nutch. I know most
> re-crawl are written as batch script but can you tell me which one do you
> recommended? A batch script or a loop-based java program?


I used to use batch and was happy with it.

3.      What is the best way of implementing nutch has a window service or
> unix daemon?
>

Sorry - what do you mean by this?

Regards,
Lukas


RE: Indexing and Re-crawling site

Posted by "Armel T. Nene" <ar...@idna-solutions.com>.

Lukas,

> This is more for Nutch experts but to me it seems that new index is
> reasonable. Besides others it means that original index is still
searchable
> while the new index is being created (creating a new index can take long
> time based on your settings). Updating one document at a time in large
index
> is not very optimal approach I think.

I see your point, but indexing should only be a long process if all the
files need to be re-fetched entirely. I don't know how this works, i.e.
whether the fetcher re-fetches everything and checks for STATUS_UNMODIFIED
before creating segments out of it. The way I have set it up at the moment,
the crawler crawls a set of files at a time, and it also checks whether the
files held in the segments are up to date. My question is: when creating a
new index, you said the old index will still be available until the new one
is created, but what happens if someone is running a search while the
IndexMerger is executing? I guess the old index gets deleted and replaced;
does that mean an exception will be thrown while the index is being
replaced?
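
For what it's worth, one common pattern for replacing a live index without
breaking in-flight searches (this is a general sketch, not a claim about
how Nutch's IndexMerger actually behaves) is to build the new index in a
fresh directory and then atomically repoint a small "current" marker at it:

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Searchers resolve the marker when they open the index, so they see
// either the old directory or the new one, never a half-replaced mix.
public class IndexSwap {

    // Publishes a newly built index directory by rewriting the marker
    // file; the temp-file-plus-atomic-rename makes the switch
    // all-or-nothing on filesystems that support atomic moves.
    public static void publish(Path indexRoot, String newIndexDir) {
        try {
            Path tmp = indexRoot.resolve("current.tmp");
            Files.write(tmp, newIndexDir.getBytes(StandardCharsets.UTF_8));
            Files.move(tmp, indexRoot.resolve("current"),
                    StandardCopyOption.REPLACE_EXISTING,
                    StandardCopyOption.ATOMIC_MOVE);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Returns the index directory that searches should use right now.
    public static Path currentIndex(Path indexRoot) {
        try {
            byte[] bytes = Files.readAllBytes(indexRoot.resolve("current"));
            return indexRoot.resolve(new String(bytes, StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

With a layout like this, the old index directory can be deleted once the
last search using it has finished, instead of being removed while readers
may still hold it open.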

Regards,

Armel 


-----Original Message-----
From: Lukas Vlcek [mailto:lukas.vlcek@gmail.com] 
Sent: 04 December 2006 22:12
To: nutch-dev@lucene.apache.org
Subject: Re: Indexing and Re-crawling site

Hi,

I will try to use my out-dated knowledge to answer (& confuse you on) your
items:

1.      Why does nutch has to create a new index every time when indexing,
> while it can just merge it with the old existing index? I try to change
> the
> value in the IndexMerger class to 'false' while creating an index
> therefore
> Lucene doesn't recreate a new index each time it is indexing. The problem
> with this is, I keep on having some exception when it tries to merge the
> indexes. There is a lock time out exception that is thrown by the
> IndexMerger. And consequently the index that get created. Is it possible
> to
> let nutch index by merging it with an existing index? I have to crawl
> about
> 100Gb of data and if there are only a few documents that have been
> changed,
> I don't nutch to recreate a new index because of that but update the
> existing index by merging it with the new one. I need some light on this.


This is more for Nutch experts but to me it seems that new index is
reasonable. Besides others it means that original index is still searchable
while the new index is being created (creating a new index can take long
time based on your settings). Updating one document at a time in large index
is not very optimal approach I think.

2.      What is the best way to make nutch re-crawl? I have implemented a
> class that loops the crawl process; it has a crawl interval which is set
> in
> a property file and a running status. The running status is a Boolean
> variable which is set to true if the re-crawl process is ongoing or false
> if
> it should stop. But with this approach, it seems that the index is not
> being
> fully generated. The values in the index cannot be queried. The re-crawl
> is
> in java which calls an underlying ant script to run nutch. I know most
> re-crawl are written as batch script but can you tell me which one do you
> recommended? A batch script or a loop-based java program?


I used to use batch and was happy with it.

3.      What is the best way of implementing nutch has a window service or
> unix daemon?
>

Sorry - what do you mean by this?

Regards,
Lukas


Re: Indexing and Re-crawling site

Posted by Lukas Vlcek <lu...@gmail.com>.
Hi,

I will try to use my out-dated knowledge to answer (& confuse you on) your
items:

1.      Why does nutch has to create a new index every time when indexing,
> while it can just merge it with the old existing index? I try to change
> the
> value in the IndexMerger class to 'false' while creating an index
> therefore
> Lucene doesn't recreate a new index each time it is indexing. The problem
> with this is, I keep on having some exception when it tries to merge the
> indexes. There is a lock time out exception that is thrown by the
> IndexMerger. And consequently the index that get created. Is it possible
> to
> let nutch index by merging it with an existing index? I have to crawl
> about
> 100Gb of data and if there are only a few documents that have been
> changed,
> I don't nutch to recreate a new index because of that but update the
> existing index by merging it with the new one. I need some light on this.


This is more for the Nutch experts, but to me creating a new index seems
reasonable. Among other things, it means the original index is still
searchable while the new index is being created (creating a new index can
take a long time depending on your settings). Updating one document at a
time in a large index is not a very optimal approach, I think.

2.      What is the best way to make nutch re-crawl? I have implemented a
> class that loops the crawl process; it has a crawl interval which is set
> in
> a property file and a running status. The running status is a Boolean
> variable which is set to true if the re-crawl process is ongoing or false
> if
> it should stop. But with this approach, it seems that the index is not
> being
> fully generated. The values in the index cannot be queried. The re-crawl
> is
> in java which calls an underlying ant script to run nutch. I know most
> re-crawl are written as batch script but can you tell me which one do you
> recommended? A batch script or a loop-based java program?


I used to use batch and was happy with it.

3.      What is the best way of implementing nutch has a window service or
> unix daemon?
>

Sorry - what do you mean by this?

Regards,
Lukas