Posted to dev@nutch.apache.org by "Armel T. Nene" <ar...@idna-solutions.com> on 2007/01/23 15:50:29 UTC

How to modify crawldb values

Hi guys,

 

I want to extend Nutch to do real-time indexing on a local file system. I
have been through the source code to find ways to modify the values stored
in the CrawlDB. The idea is simple:

 

I have an external program (or a script) which checks for changes in a
directory (a url injected into the crawldb). When new changes are
recorded, the program will update the status in the crawldb and generate a
new fetch list for the fetcher to fetch. I do not want to make major changes
to the Nutch source code, as I want the program to stay compatible with
future releases. Now, I know the crawldatum is saved in the crawldb with the
url; I am not too sure, but I think the url is the key used to retrieve the
crawldatum. For my program to work successfully, I need to know the following:

 

*         How to read data from the crawldb: what data structure does it use,
and how do I reference it?

*         How to write back to the crawldb: updating information in place, or
perhaps creating a new crawldb with the changed and unchanged values.

 

This is an extract from the crawldb:

 

http://some-url.com/    Version: 4

Status: 2 (DB_fetched)

Fetch time: Thu Feb 22 12:44:05 GMT 2007

Modified time: Thu Jan 01 01:00:00 GMT 1970

Retries since fetch: 0

Retry interval: 30.0 days

Score: 1.0323955

Signature: f4c14c46074b66aad8829b8aa84cd636

Metadata: null
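
(For reference, an extract in this format can, as far as I know, be produced
with the crawldb reader tool; the paths below are only examples, so adjust
them to your layout:)

```shell
bin/nutch readdb crawl/crawldb -dump crawldb-dump          # dump whole db as text
bin/nutch readdb crawl/crawldb -url http://some-url.com/   # print a single entry
```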

 

How can I get this information with an external program and modify/update it?
Once I know how to implement that part, I can call Nutch in the usual
generate - fetch - updatedb - updatelinkdb - index - etc. way, so generate
will have the new values that I want re-indexed. This will stop the fetcher
from fetching a long list of urls (changed or unchanged, but needing a fetch
because their next_fetch_time is due). The program gets its updates from the
underlying OS, which notifies it about any changes to the files and folders
being monitored. Once the program is working with sufficient tests, I will be
willing to share the source code; it's written in Java and doesn't need any
script to launch Nutch.
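
By the usual way, I mean one pass of something like the following (0.8-style
commands; all the paths here are examples):

```shell
bin/nutch generate crawl/crawldb crawl/segments
s=`ls -d crawl/segments/2* | tail -1`     # pick the newest segment
bin/nutch fetch $s
bin/nutch updatedb crawl/crawldb $s
bin/nutch invertlinks crawl/linkdb $s
bin/nutch index crawl/indexes crawl/crawldb crawl/linkdb $s
```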

 

I will be looking forward to your kind support.

 

Armel

 

-------------------------------------------------

Armel T. Nene

iDNA Solutions

Tel: +44 (207) 257 6124

Mobile: +44 (788) 695 0483 

http://blog.idna-solutions.com

 


RE: How to modify crawldb values

Posted by "Armel T. Nene" <ar...@idna-solutions.com>.
Thanks for the reply. I'll try this, and if I encounter any problems I'll
send another email. This would be a good feature to have, and would probably
keep the project from branching into different subprojects.

Regards,

Armel

-------------------------------------------------
Armel T. Nene
iDNA Solutions
Tel: +44 (207) 257 6124
Mobile: +44 (788) 695 0483 
http://blog.idna-solutions.com
-----Original Message-----
From: Doğacan Güney [mailto:dogacan.guney@agmlab.com] 
Sent: 23 January 2007 15:06
To: nutch-dev@lucene.apache.org
Subject: Re: How to modify crawldb values

Hi,

Armel T. Nene wrote:
> For my program to work successfully, I need to know the following:
>
>  
>
> *         How to read data from the crawldb: what data structure does it use,
> and how do I reference it?
>   

Crawldb is essentially a list of <url, CrawlDatum> pairs and is stored
as a MapFile, so you can read it with MapFile.Reader.get.
> *         How to write back to the crawldb: updating information in place, or
> perhaps creating a new crawldb with the changed and unchanged values.
>   
The current FS implementation is write-once, so you can't modify it in
place. But you can read it entry by entry (possibly with MapFile.Reader.next)
and then write a new one with MapFile.Writer.





Re: How to modify crawldb values

Posted by Doğacan Güney <do...@agmlab.com>.
Hi,

Armel T. Nene wrote:
> For my program to work successfully, I need to know the following:
>
>  
>
> *         How to read data from the crawldb: what data structure does it use,
> and how do I reference it?
>   

Crawldb is essentially a list of <url, CrawlDatum> pairs and is stored
as a MapFile, so you can read it with MapFile.Reader.get.
> *         How to write back to the crawldb: updating information in place, or
> perhaps creating a new crawldb with the changed and unchanged values.
>   
The current FS implementation is write-once, so you can't modify it in
place. But you can read it entry by entry (possibly with MapFile.Reader.next)
and then write a new one with MapFile.Writer.
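
A rough sketch of that read-and-rewrite pass (untested; the part file path,
the Text key class, and the shouldRefetch hook are assumptions for
illustration - older releases may use UTF8 keys, so check your crawldb on
disk first):

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.io.MapFile;
import org.apache.hadoop.io.Text;
import org.apache.nutch.crawl.CrawlDatum;
import org.apache.nutch.util.NutchConfiguration;

public class CrawlDbRewriter {
  public static void main(String[] args) throws IOException {
    Configuration conf = NutchConfiguration.create();
    FileSystem fs = FileSystem.get(conf);

    // Crawldb data lives under <crawldb>/current/part-NNNNN (a MapFile).
    String in  = "crawl/crawldb/current/part-00000";      // example path
    String out = "crawl/crawldb-new/current/part-00000";  // example path

    MapFile.Reader reader = new MapFile.Reader(fs, in, conf);
    MapFile.Writer writer =
        new MapFile.Writer(conf, fs, out, Text.class, CrawlDatum.class);

    Text url = new Text();
    CrawlDatum datum = new CrawlDatum();
    // Reading sequentially keeps keys in sorted order, which
    // MapFile.Writer.append requires.
    while (reader.next(url, datum)) {
      if (shouldRefetch(url.toString())) {
        // Make this url due now, so the next generate picks it up.
        datum.setFetchTime(System.currentTimeMillis());
      }
      writer.append(url, datum);   // copy the (possibly modified) pair
    }
    reader.close();
    writer.close();
  }

  // Hypothetical hook: return true for files your monitor saw change.
  static boolean shouldRefetch(String url) { return false; }
}
```

You would then swap the new directory in for the old crawldb before running
generate.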
