Posted to user@nutch.apache.org by "Peters, Vijaya" <Vi...@sra.com> on 2009/12/09 18:44:35 UTC

how to force nutch to do a recrawl

I'm running Nutch 1.0 on Windows.  How do I force Nutch to do a complete
recrawl?

 

thanks,

- Vijaya

 

Vijaya Peters
SRA International, Inc.
4350 Fair Lakes Court North
Room 4004
Fairfax, VA  22033
Tel:  703-502-1184

www.sra.com <http://www.sra.com/> 
Named to FORTUNE's "100 Best Companies to Work For" list for 10
consecutive years

Please consider the environment before printing this e-mail

This electronic message transmission contains information from SRA
International, Inc. which may be confidential, privileged or
proprietary.  The information is intended for the use of the individual
or entity named above.  If you are not the intended recipient, be aware
that any disclosure, copying, distribution, or use of the contents of
this information is strictly prohibited.  If you have received this
electronic information in error, please notify us immediately by
telephone at 866-584-2143.

 


RE: how to force nutch to do a recrawl

Posted by "Peters, Vijaya" <Vi...@sra.com>.
I tried that and it worked a few times, but now I get 0 records selected for fetching.

$ bin/nutch crawl urls -dir crawl9a -depth 15 -topN 50
crawl started in: crawl9a
rootUrlDir = urls
threads = 10
depth = 15
topN = 50
Injector: starting
Injector: crawlDb: crawl9a/crawldb
Injector: urlDir: urls
Injector: Converting injected urls to crawl db entries.
Injector: Merging injected urls into crawl db.
Injector: done
Generator: Selecting best-scoring urls due for fetch.
Generator: starting
Generator: segment: crawl9a/segments/20091209124308
Generator: filtering: true
Generator: topN: 50
Generator: jobtracker is 'local', generating exactly one
Generator: 0 records selected for fetching, exiting ...
Stopping at depth=0 - no more URLs to fetch.
No URLs to fetch - check your seed list and URL filters.
crawl finished: crawl9a
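
The replies in this thread suggest checking the fetch time stored in the crawldb when the generator selects nothing. The following is a minimal sketch of reading those fields, assuming the text of one entry from a `readdb -dump` output is available as a string. A sample entry of the kind quoted in this thread is inlined here; in practice the text would come from the part-00000 file in the dump directory. The variable names are illustrative only.

```shell
# Sample crawldb dump entry (inlined for illustration; normally read from
# the part-00000 file that "bin/nutch readdb <crawldb> -dump <dir>" writes).
entry='http://mysite/	Version: 7
Status: 2 (db_fetched)
Fetch time: Fri Jan 08 15:42:33 EST 2010
Retry interval: 2592000 seconds (30 days)'

# Extract the next scheduled fetch time and the retry interval in seconds.
fetch_time=$(printf '%s\n' "$entry" | sed -n 's/^Fetch time: //p')
retry_seconds=$(printf '%s\n' "$entry" | sed -n 's/^Retry interval: \([0-9][0-9]*\) seconds.*/\1/p')

# Convert the interval to whole days (86400 seconds per day).
retry_days=$(( retry_seconds / 86400 ))

echo "next fetch due: ${fetch_time}"
echo "retry interval: ${retry_seconds} s (${retry_days} days)"
```

If the fetch time lies in the future, the page is not yet due, which would be consistent with the generator selecting 0 records; it does not rule out the seed list or URL filter causes named in the log.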

-----Original Message-----
From: xiao yang [mailto:yangxiao9901@gmail.com] 
Sent: Wednesday, December 09, 2009 1:19 PM
To: nutch-user@lucene.apache.org
Subject: Re: how to force nutch to do a recrawl

What do you mean by "recrawl"?
Does the following command meet your needs?
bin/nutch crawl urls -dir crawl -depth 3 -topN 50
Change the destination directory to one different from the last crawl's.
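
A minimal sketch of this suggestion, assuming a timestamp is an acceptable way to get a directory name that differs from the last crawl. The `crawl-` prefix and the timestamp format are illustrative choices, not something from this thread. The snippet only prints the command so it can be reviewed before it is run.

```shell
# Build a directory name that is different on every run (hypothetical
# naming scheme: "crawl-" followed by a 14-digit timestamp).
crawl_dir="crawl-$(date +%Y%m%d%H%M%S)"

# Print the crawl command instead of running it, so it can be checked first.
echo "bin/nutch crawl urls -dir ${crawl_dir} -depth 3 -topN 50"
```

A new directory means a new, empty crawldb, so this starts from the seed list again rather than updating the previous crawl.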

On Thu, Dec 10, 2009 at 1:44 AM, Peters, Vijaya <Vi...@sra.com> wrote:
> I'm running Nutch 1.0 in windows.  How do I force Nutch to do a complete
> recrawl?

RE: how to force nutch to do a recrawl

Posted by "Peters, Vijaya" <Vi...@sra.com>.
Okay.  Our fetch finishes in less than 10 minutes (just intranet).  But,
I'll set it to 2 hours. 


-----Original Message-----
From: BELLINI ADAM [mailto:mbellil@msn.com] 
Sent: Monday, December 14, 2009 11:50 AM
To: nutch-user@lucene.apache.org
Subject: RE: how to force nutch to do a recrawl


But think about one thing: if you are recrawling too many URLs and the
crawl takes more than 1 hour, your crawl will never finish, because every
time it comes across a URL it will find that its fetch time is due and
fetch it again.
To set your fetch interval well, you have to crawl a first time and see
how long the crawl takes to finish.
Say it takes 3 hours: then set the fetch interval to something like
5 hours, leaving 2 hours of slack for pages that time out and that Nutch
will retry.


I ran into this problem and my crawl took about 24 hours, because the
fetch interval was too small (smaller than the crawl time).
Thanks



> Subject: RE: how to force nutch to do a recrawl
> Date: Mon, 14 Dec 2009 11:42:40 -0500
> From: Vijaya_Peters@sra.com
> To: nutch-user@lucene.apache.org
> 
> Thanks.
> I'm on a development system, so every hour is okay.  
> I guess that's why the last time I changed the properties file it didn't
> take any effect (because crawldb won't change the fetch time
> automatically).
> 
> I'll give this a try - thanks much.
> 
> -----Original Message-----
> From: BELLINI ADAM [mailto:mbellil@msn.com] 
> Sent: Monday, December 14, 2009 11:38 AM
> To: nutch-user@lucene.apache.org
> Subject: RE: how to force nutch to do a recrawl
> 
> 
> Yes, just add those settings to nutch-site.xml and it should work.
> But are you going to recrawl every hour? I see 3600 seconds!
> 
> Another thing: you have to do an initial clean crawl with the new
> fetch interval, because the fetch time already stored in the crawldb
> will not change automatically. (In my case it didn't change; I just
> deleted the crawldb and did a clean crawl, and it works.)
> Maybe someone can tell you how to change the fetch time in the crawldb
> without deleting it for an initial clean crawl.
> 
> Thanks
> 
> 
> > Subject: RE: how to force nutch to do a recrawl
> > Date: Mon, 14 Dec 2009 11:26:31 -0500
> > From: Vijaya_Peters@sra.com
> > To: nutch-user@lucene.apache.org
> > 
> > Adam,
> > I finally got the command to work on another server (see below).  To
> > change the retry interval, should I just add the two properties to
> > nutch-site.xml (though I tried this before and it didn't work)?
> > 
> > http://mysite/	Version: 7
> > Status: 2 (db_fetched)
> > Fetch time: Fri Jan 08 15:42:33 EST 2010  
> > Modified time: Wed Dec 31 19:00:00 EST 1969
> > Retries since fetch: 0
> > Retry interval: 2592000 seconds (30 days)  
> > Score: 1.0
> > Signature: e04ab1ac06075fc273dbe1334a6c6dc5
> > Metadata: _pst_: success(1), lastModified=0
> > 
> > 
> > <property>
> > <name>db.fetch.interval.default</name>
> > <value>3600</value>
> > <description>The default number of seconds between re-fetches of 
> > a page (30 days). 
> > </description>
> > </property>
> > 
> > <property>
> > <name>db.fetch.interval.max</name>
> > <value>3600</value>
> > <description>The maximum number of seconds between re-fetches of 
> > a page (90 days). After this period every page in the db will be 
> > re-tried, no matter what is its status.  </description> 
> > </property>
> > 
> > 
> > -----Original Message-----
> > From: BELLINI ADAM [mailto:mbellil@msn.com] 
> > Sent: Friday, December 11, 2009 3:11 PM
> > To: nutch-user@lucene.apache.org
> > Subject: RE: how to force nutch to do a recrawl
> > 
> > 
> > hi,
> > 
> > You shouldn't open the .crc file; open the other one, which is
> > part-00000.
> > Use vi to edit part-00000.
> > If you can't find this file, your dump failed; just check the
> > logs/hadoop.log file.
> > 
> > 
> > 
> > 
> > 
> > 
> > > Subject: RE: how to force nutch to do a recrawl
> > > Date: Fri, 11 Dec 2009 09:14:26 -0500
> > > From: Vijaya_Peters@sra.com
> > > To: nutch-user@lucene.apache.org
> > > 
> > > Adam,
> > > I'm using cygwin to run the scripts.  I use EditPlus to edit the
> > > files.  But EditPlus won't allow me to edit the crc file.  I'll see
> > > if I can ftp the file to a unix machine.
> > > 
> > > 
> > > Vijaya Peters
> > > SRA International, Inc.
> > > 12500 Fair Lakes Circle
> > > Room 3507
> > > Fairfax, VA 22033
> > > Tel:  703-222-9207
> > > 
> > > www.sra.com
> > > 
> > > 
> > > 
> > > -----Original Message-----
> > > From: BELLINI ADAM [mailto:mbellil@msn.com]
> > > Sent: Thu 12/10/2009 6:43 PM
> > > To: nutch-user@lucene.apache.org
> > > Subject: RE: how to force nutch to do a recrawl
> > >  
> > > 
> > > 
> > > But how are you running the sh scripts?
> > > You have to use Cygwin to be able to edit Linux files.
> > > 
> > > 
> > > 
> > > 
> > > > Subject: RE: how to force nutch to do a recrawl
> > > > Date: Thu, 10 Dec 2009 16:09:13 -0500
> > > > From: Vijaya_Peters@sra.com
> > > > To: nutch-user@lucene.apache.org
> > > > 
> > > > Adam,
> > > > I'm on windows unfortunately!!  I'm using cygdrive, but it doesn't
> > > > recognize vi.  Any idea for opening it in windows?  Notepad didn't
> > > > work either.
> > > > 
> > > > -----Original Message-----
> > > > From: BELLINI ADAM [mailto:mbellil@msn.com] 
> > > > Sent: Thursday, December 10, 2009 4:01 PM
> > > > To: nutch-user@lucene.apache.org
> > > > Subject: RE: how to force nutch to do a recrawl
> > > > 
> > > > 
> > > > Just use vi or vim.
> > > > 
> > > > 
> > > > I use vi to edit the file.
> > > > 
> > > > 
> > > > 
> > > > 
> > > > 
> > > > > Subject: RE: how to force nutch to do a recrawl
> > > > > Date: Thu, 10 Dec 2009 15:58:24 -0500
> > > > > From: Vijaya_Peters@sra.com
> > > > > To: nutch-user@lucene.apache.org
> > > > > 
> > > > > Adam,
> > > > > What do I use to open a CRC file? I tried QuickSFV.  Thanks in
> > > > > advance!
> > > > > 
> > > > > -----Original Message-----
> > > > > From: BELLINI ADAM [mailto:mbellil@msn.com] 
> > > > > Sent: Thursday, December 10, 2009 3:48 PM
> > > > > To: nutch-user@lucene.apache.org
> > > > > Subject: RE: how to force nutch to do a recrawl
> > > > > 
> > > > > 
> > > > > It will not dump to the console!
> > > > > whole_db is a folder, and you have to open the file you will find
> > > > > in this folder.
> > > > > 
> > > > > 
> > > > > 
> > > > > > Subject: RE: how to force nutch to do a recrawl
> > > > > > Date: Thu, 10 Dec 2009 14:26:30 -0500
> > > > > > From: Vijaya_Peters@sra.com
> > > > > > To: nutch-user@lucene.apache.org
> > > > > > 
> > > > > > Adam,
> > > > > > I tried running that command and get the following (it created a
> > > > > > whole_db directory, but it's not dumping out the contents to the
> > > > > > console):
> > > > > > 
> > > > > > $ bin/nutch readdb crawl/crawldb/ -dump whole_db
> > > > > > CrawlDb dump: starting
> > > > > > CrawlDb db: crawl/crawldb/
> > > > > > CrawlDb dump: done
> > > > > > 
> > > > > > -----Original Message-----
> > > > > > From: BELLINI ADAM [mailto:mbellil@msn.com] 
> > > > > > Sent: Thursday, December 10, 2009 1:40 PM
> > > > > > To: nutch-user@lucene.apache.org
> > > > > > Subject: RE: how to force nutch to do a recrawl
> > > > > > 
> > > > > > 
> > > > > > hi,
> > > > > > Check the fetch time in your crawldb. You can dump the whole
> > > > > > crawldb like this:
> > > > > > 
> > > > > > ./bin/nutch readdb $your_crawl_rep/crawldb/ -dump whole_db
> > > > > > 
> > > > > > entries will look like this:
> > > > > > 
> > > > > > http://www.YOUR_URL_TO_FETCH
> > > > > > Status: 2 (db_fetched)
> > > > > > Fetch time: Thu Dec 10 09:19:18 EST 2009
> > > > > > Modified time: Wed Dec 31 19:00:00 EST 1969
> > > > > > Retries since fetch: 0
> > > > > > Retry interval: 18000 seconds (0 days)
> > > > > > Score: 0.0014977538
> > > > > > Signature: 15cdb66f15eb4dd3e1b16581b78XcfC5c
> > > > > > Metadata: _pst_: success(1), lastModified=0
> > > > > > 
> > > > > > 
> > > > > > As you can see, the next time the page will be fetched is given
> > > > > > by the fetch time:
> > > > > > 'Fetch time: Thu Dec 10 09:19:18 EST 2009'
> > > > > > Also check the retry interval: it should be your 3600.
> > > > > > 
> > > > > > hope it will help
> > > > > > 
> > > > > > 
> > > > > > > Subject: RE: how to force nutch to do a recrawl
> > > > > > > Date: Wed, 9 Dec 2009 16:06:58 -0500
> > > > > > > From: Vijaya_Peters@sra.com
> > > > > > > To: nutch-user@lucene.apache.org
> > > > > > > 
> > > > > > > Okay.  I'll dig a little deeper.  I saw a few scripts that
> > > > > > > people had created, but I couldn't get them to work.
> > > > > > > 
> > > > > > > Thanks much.
> > > > > > > 
> > > > > > > -----Original Message-----
> > > > > > > From: MilleBii [mailto:millebii@gmail.com] 
> > > > > > > Sent: Wednesday, December 09, 2009 4:05 PM
> > > > > > > To: nutch-user@lucene.apache.org
> > > > > > > Subject: Re: how to force nutch to do a recrawl
> > > > > > > 
> > > > > > > I don't think that you can use the nutch crawl command to do
> > > > > > > that; it is a one-stop-shop command.
> > > > > > > You probably want to use the individual commands.
> > > > > > > Type nutch generate to get the help and you will see the
> > > > > > > option -adddays. Read this page on the wiki to get a feel for
> > > > > > > how to proceed:
> > > > > > > http://wiki.apache.org/nutch/Crawl
> > > > > > > 
> > > > > > > 2009/12/9 Peters, Vijaya <Vi...@sra.com>
> > > > > > > 
> > > > > > > > I didn't see a setting to override in crawl-urlfilter.  How
> > > > > > > > do I set numberDays? I have regular expressions to
> > > > > > > > include/exclude certain extensions and certain urls, but
> > > > > > > > that's all I have in there.
> > > > > > > >
> > > > > > > > Please send me an example and I'll give it a try.
> > > > > > > >
> > > > > > > > Thanks!
> > > > > > > >
> > > > > > > > -----Original Message-----
> > > > > > > > From: xiao yang [mailto:yangxiao9901@gmail.com]
> > > > > > > > Sent: Wednesday, December 09, 2009 1:41 PM
> > > > > > > > To: nutch-user@lucene.apache.org
> > > > > > > > Subject: Re: how to force nutch to do a recrawl
> > > > > > > >
> > > > > > > > What about the configuration in crawl-urlfilter.txt?
> > > > > > > >
> > > > > > > > On Thu, Dec 10, 2009 at 2:29 AM, Peters, Vijaya <Vi...@sra.com> wrote:
> > > > > > > > > I tried that too.
> > > > > > > > > In nutch-site.xml, I added the properties below, but this
> > > > > > > > > had no effect.
> > > > > > > > >
> > > > > > > > > <property>
> > > > > > > > >  <name>db.default.fetch.interval</name>
> > > > > > > > >  <value>0</value>
> > > > > > > > >  <description>(DEPRECATED) The default number of days between
> > > > > > > > >  re-fetches of a page.  value was 30
> > > > > > > > >  </description>
> > > > > > > > > </property>
> > > > > > > > >
> > > > > > > > > <property>
> > > > > > > > >  <name>db.fetch.interval.default</name>
> > > > > > > > >  <value>3600</value>
> > > > > > > > >  <description>The default number of seconds between re-fetches
> > > > > > > > >  of a page (30 days). value was 2592000 (30 days)
> > > > > > > > >  </description>
> > > > > > > > > </property>
> > > > > > > > >
> > > > > > > > > <property>
> > > > > > > > >  <name>db.fetch.interval.max</name>
> > > > > > > > >  <value>3600</value>
> > > > > > > > >  <description>The maximum number of seconds between re-fetches
> > > > > > > > >  of a page (90 days). After this period every page in the db
> > > > > > > > >  will be re-tried, no matter what is its status.  value was
> > > > > > > > >  7776000
> > > > > > > > >  </description>
> > > > > > > > > </property>
> > > > > > > > >
> > > > > > > > > -----Original Message-----
> > > > > > > > > From: MilleBii [mailto:millebii@gmail.com]
> > > > > > > > > Sent: Wednesday, December 09, 2009 1:27 PM
> > > > > > > > > To: nutch-user@lucene.apache.org
> > > > > > > > > Subject: Re: how to force nutch to do a recrawl
> > > > > > > > >
> > > > > > > > > Nutch only recrawls every 30 days by default. So set
> > > > > > > > > numberDays appropriately and it will recrawl. Read
> > > > > > > > > nutch-default.xml for the details.
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > --
> > > > > > > > > -MilleBii-
> > > > > > > > >
> > > > > > > >
> > > > > > > 
> > > > > > > 
> > > > > > > 
> > > > > > > -- 
> > > > > > > -MilleBii-

RE: how to force nutch to do a recrawl

Posted by BELLINI ADAM <mb...@msn.com>.
But think about one thing: if you are recrawling too many URLs and the
crawl takes more than 1 hour, your crawl will never finish, because every
time it comes across a URL it will find that its fetch time is due and
fetch it again.
To set your fetch interval well, you have to crawl a first time and see
how long the crawl takes to finish.
Say it takes 3 hours: then set the fetch interval to something like
5 hours, leaving 2 hours of slack for pages that time out and that Nutch
will retry.


I ran into this problem and my crawl took about 24 hours, because the
fetch interval was too small (smaller than the crawl time).
Thanks
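
The rule of thumb above (measured crawl time plus a margin for retried pages) can be turned into a value in seconds for db.fetch.interval.default. This is only a sketch: the `suggest_interval` helper is hypothetical, and the 3-hour crawl and 2-hour margin are the example figures from the message, not recommendations.

```shell
# Hypothetical helper: suggest a fetch interval in seconds from a measured
# crawl duration plus a safety margin for retried pages (both in hours).
suggest_interval() {
    crawl_hours=$1
    margin_hours=$2
    echo $(( (crawl_hours + margin_hours) * 3600 ))
}

# The 3-hour crawl with a 2-hour margin from the example above.
interval=$(suggest_interval 3 2)
echo "db.fetch.interval.default: ${interval}"
```

For the example figures this gives 18000 seconds (5 hours), which would go into the `<value>` of the db.fetch.interval.default property in nutch-site.xml.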



> Subject: RE: how to force nutch to do a recrawl
> Date: Mon, 14 Dec 2009 11:42:40 -0500
> From: Vijaya_Peters@sra.com
> To: nutch-user@lucene.apache.org
> 
> Thanks.
> I'm on a development system, so every hour is okay.  
> I guess that's why the last time I changed the properties file it didn't
> take any effect (because crawldb won't change the fetch time
> automatically).
> 
> I'll give this a try - thanks much.
> 
> 
> -----Original Message-----
> From: BELLINI ADAM [mailto:mbellil@msn.com] 
> Sent: Monday, December 14, 2009 11:38 AM
> To: nutch-user@lucene.apache.org
> Subject: RE: how to force nutch to do a recrawl
> 
> 
> Yes, just add those properties to nutch-site.xml and it should work.
> But are you going to recrawl every hour? I see 3600 seconds!
> 
> Another thing: you have to make an initial clean crawl with the new
> fetch interval, because the fetch time stored in the crawldb will not
> change automatically. (In my case it didn't change; I just deleted the
> crawldb and made a clean crawl, and it works.)
> Maybe someone can tell you how to change the fetch time in the crawldb
> without deleting it for an initial clean crawl.
> 
> thx
> 
> 
> > Subject: RE: how to force nutch to do a recrawl
> > Date: Mon, 14 Dec 2009 11:26:31 -0500
> > From: Vijaya_Peters@sra.com
> > To: nutch-user@lucene.apache.org
> > 
> > Adam,
> > I finally got the command to work on another server (see below).  To
> > change the retry interval, should I just add the two properties to
> > nutch-site.xml (though I tried this before and it didn't work):
> > 
> > http://mysite/	Version: 7
> > Status: 2 (db_fetched)
> > Fetch time: Fri Jan 08 15:42:33 EST 2010  
> > Modified time: Wed Dec 31 19:00:00 EST 1969
> > Retries since fetch: 0
> > Retry interval: 2592000 seconds (30 days)  
> > Score: 1.0
> > Signature: e04ab1ac06075fc273dbe1334a6c6dc5
> > Metadata: _pst_: success(1), lastModified=0
> > 
> > 
> > <property>
> > <name>db.fetch.interval.default</name>
> > <value>3600</value>
> > <description>The default number of seconds between re-fetches of
> > a page (30 days).
> > </description>
> > </property>
> > 
> > <property>
> > <name>db.fetch.interval.max</name>
> > <value>3600</value>
> > <description>The maximum number of seconds between re-fetches of
> > a page (90 days). After this period every page in the db will be
> > re-tried, no matter what is its status.</description>
> > </property>
> > 
> > 
> > 
> > -----Original Message-----
> > From: BELLINI ADAM [mailto:mbellil@msn.com] 
> > Sent: Friday, December 11, 2009 3:11 PM
> > To: nutch-user@lucene.apache.org
> > Subject: RE: how to force nutch to do a recrawl
> > 
> > 
> > hi,
> > 
> > You shouldn't open the crc file; you have to open the other one, which
> > is part-00000.
> > Use vi to open part-00000.
> > If you cannot find this file, your dump failed; just check the
> > logs/hadoop.log file.
> > 
> > 
> > 
> > 
> > 
> > 
> > > Subject: RE: how to force nutch to do a recrawl
> > > Date: Fri, 11 Dec 2009 09:14:26 -0500
> > > From: Vijaya_Peters@sra.com
> > > To: nutch-user@lucene.apache.org
> > > 
> > > Adam,
> > > I'm using Cygwin to run the scripts.  I use EditPlus to edit the
> > > files.  But EditPlus won't allow me to edit the crc file.  I'll see if
> > > I can ftp the file to a unix machine.
> > > 
> > > 
> > > 
> > > 
> > > 
> > > -----Original Message-----
> > > From: BELLINI ADAM [mailto:mbellil@msn.com]
> > > Sent: Thu 12/10/2009 6:43 PM
> > > To: nutch-user@lucene.apache.org
> > > Subject: RE: how to force nutch to do a recrawl
> > >  
> > > 
> > > 
> > > But how are you running the sh scripts?
> > > You have to use Cygwin to be able to edit Linux files.
> > > 
> > > 
> > > 
> > > 
> > > > Subject: RE: how to force nutch to do a recrawl
> > > > Date: Thu, 10 Dec 2009 16:09:13 -0500
> > > > From: Vijaya_Peters@sra.com
> > > > To: nutch-user@lucene.apache.org
> > > > 
> > > > Adam,
> > > > I'm on Windows unfortunately!!  I'm using cygdrive, but it doesn't
> > > > recognize vi.  Any idea for opening it in Windows?  Notepad didn't
> > > > work either.
> > > > 
> > > > 
> > > > -----Original Message-----
> > > > From: BELLINI ADAM [mailto:mbellil@msn.com] 
> > > > Sent: Thursday, December 10, 2009 4:01 PM
> > > > To: nutch-user@lucene.apache.org
> > > > Subject: RE: how to force nutch to do a recrawl
> > > > 
> > > > 
> > > > Just use vi or vim.
> > > > 
> > > > I use vi to edit the file.
> > > > 
> > > > 
> > > > 
> > > > 
> > > > 
> > > > > Subject: RE: how to force nutch to do a recrawl
> > > > > Date: Thu, 10 Dec 2009 15:58:24 -0500
> > > > > From: Vijaya_Peters@sra.com
> > > > > To: nutch-user@lucene.apache.org
> > > > > 
> > > > > Adam,
> > > > > What do I use to open a CRC file? I tried QuickSFV.  Thanks in
> > > > > advance!
> > > > > 
> > > > > 
> > > > > -----Original Message-----
> > > > > From: BELLINI ADAM [mailto:mbellil@msn.com] 
> > > > > Sent: Thursday, December 10, 2009 3:48 PM
> > > > > To: nutch-user@lucene.apache.org
> > > > > Subject: RE: how to force nutch to do a recrawl
> > > > > 
> > > > > 
> > > > > It will not dump to the console!
> > > > > whole_db is a folder, and you have to open the file you will find
> > > > > in this folder.
> > > > > 
> > > > > 
> > > > > 
> > > > > > Subject: RE: how to force nutch to do a recrawl
> > > > > > Date: Thu, 10 Dec 2009 14:26:30 -0500
> > > > > > From: Vijaya_Peters@sra.com
> > > > > > To: nutch-user@lucene.apache.org
> > > > > > 
> > > > > > Adam,
> > > > > > I tried running that command and get the following (it created a
> > > > > > whole_db directory, but it's not dumping out the contents to the
> > > > > > console):
> > > > > > 
> > > > > > $ bin/nutch readdb crawl/crawldb/ -dump whole_db
> > > > > > CrawlDb dump: starting
> > > > > > CrawlDb db: crawl/crawldb/
> > > > > > CrawlDb dump: done
> > > > > > 
> > > > > > -----Original Message-----
> > > > > > From: BELLINI ADAM [mailto:mbellil@msn.com] 
> > > > > > Sent: Thursday, December 10, 2009 1:40 PM
> > > > > > To: nutch-user@lucene.apache.org
> > > > > > Subject: RE: how to force nutch to do a recrawl
> > > > > > 
> > > > > > 
> > > > > > hi,
> > > > > > > Check the fetch time in your crawldb. You can dump the whole
> > > > > > > crawldb like this:
> > > > > > 
> > > > > > ./bin/nutch readdb $your_crawl_rep/crawldb/ -dump whole_db
> > > > > > 
> > > > > > entries will look like this:
> > > > > > 
> > > > > > http://www.YOUR_URL_TO_FETCH
> > > > > > Status: 2 (db_fetched)
> > > > > > Fetch time: Thu Dec 10 09:19:18 EST 2009
> > > > > > Modified time: Wed Dec 31 19:00:00 EST 1969
> > > > > > Retries since fetch: 0
> > > > > > Retry interval: 18000 seconds (0 days)
> > > > > > Score: 0.0014977538
> > > > > > Signature: 15cdb66f15eb4dd3e1b16581b78XcfC5c
> > > > > > Metadata: _pst_: success(1), lastModified=0
> > > > > > 
> > > > > > 
> > > > > > > As you can see, the next time the page will be fetched is given
> > > > > > > by the fetch time:
> > > > > > > 'Fetch time: Thu Dec 10 09:19:18 EST 2009'
> > > > > > > Also check the retry interval: it should be your 3600.
> > > > > > 
> > > > > > hope it will help
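
A minimal sketch, in Python, of reading one entry of such a dump and checking whether the page is due for fetching. The sample record, the example.com URL, and the function names are illustrative assumptions; only the "Fetch time:" and "Retry interval:" line formats come from the entry shown above. The time-zone abbreviation is dropped, so times are treated as local.

```python
import re
from datetime import datetime

# Hypothetical record in the layout shown in the dump above.
SAMPLE = """http://www.example.com/\tVersion: 7
Status: 2 (db_fetched)
Fetch time: Thu Dec 10 09:19:18 EST 2009
Retry interval: 18000 seconds (0 days)
"""


def parse_record(text):
    """Extract URL, fetch time and retry interval from one dump entry."""
    url = text.strip().splitlines()[0].split()[0]
    fetch = re.search(r"^Fetch time: (.+)$", text, re.M).group(1).split()
    # fetch == ['Thu', 'Dec', '10', '09:19:18', 'EST', '2009']; skip the zone,
    # because parsing zone abbreviations with %Z is platform-dependent.
    naive = " ".join(fetch[:4] + fetch[5:])
    fetch_time = datetime.strptime(naive, "%a %b %d %H:%M:%S %Y")
    interval = int(
        re.search(r"^Retry interval: (\d+) seconds", text, re.M).group(1)
    )
    return {"url": url, "fetch_time": fetch_time, "retry_interval": interval}


def is_due(record, now):
    """A page is selected for fetching only once its fetch time has passed."""
    return record["fetch_time"] <= now


if __name__ == "__main__":
    rec = parse_record(SAMPLE)
    print(rec["url"], rec["retry_interval"], is_due(rec, datetime.now()))
```

If every entry has a fetch time in the future, nothing is due, which would be consistent with the "0 records selected for fetching" message reported in this thread.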
> > > > > > 
> > > > > > 
> > > > > > > Subject: RE: how to force nutch to do a recrawl
> > > > > > > Date: Wed, 9 Dec 2009 16:06:58 -0500
> > > > > > > From: Vijaya_Peters@sra.com
> > > > > > > To: nutch-user@lucene.apache.org
> > > > > > > 
> > > > > > > > Okay.  I'll dig a little deeper.  I saw a few scripts that people
> > > > > > > > had created, but I couldn't get them to work.
> > > > > > > 
> > > > > > > Thanks much.
> > > > > > > 
> > > > > > > 
> > > > > > > -----Original Message-----
> > > > > > > From: MilleBii [mailto:millebii@gmail.com] 
> > > > > > > Sent: Wednesday, December 09, 2009 4:05 PM
> > > > > > > To: nutch-user@lucene.apache.org
> > > > > > > Subject: Re: how to force nutch to do a recrawl
> > > > > > > 
> > > > > > > > I don't think you can use the nutch crawl command to do that; it
> > > > > > > > is a one-stop-shop command.
> > > > > > > > You probably want to use the individual commands.
> > > > > > > > Type nutch generate to get the help and you will see the option
> > > > > > > > -adddays. Read this page on the wiki to get a feel for how to
> > > > > > > > do it:
> > > > > > > http://wiki.apache.org/nutch/Crawl
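
A minimal sketch, in Python, of choosing a value for -adddays and assembling the generate call. It assumes that -adddays moves the generator's notion of "now" forward by that many days and that generate takes the crawldb and segments directories followed by -topN and -adddays; check the usage text printed by bin/nutch generate on your version. The function names and directory paths are hypothetical.

```python
import math
from datetime import datetime

SECONDS_PER_DAY = 86400


def adddays_needed(fetch_time, now):
    """Whole days to add to 'now' so an entry with this fetch time is due."""
    remaining = (fetch_time - now).total_seconds()
    return max(0, math.ceil(remaining / SECONDS_PER_DAY))


def generate_command(crawldb, segments_dir, adddays, top_n=None):
    """Build the argument list for the generate step (not executed here)."""
    cmd = ["bin/nutch", "generate", crawldb, segments_dir]
    if top_n is not None:
        cmd += ["-topN", str(top_n)]
    if adddays > 0:
        cmd += ["-adddays", str(adddays)]
    return cmd


if __name__ == "__main__":
    # Hypothetical example: an entry due on Jan 08 2010, checked on Dec 14 2009.
    days = adddays_needed(datetime(2010, 1, 8, 15, 42, 33),
                          datetime(2009, 12, 14, 11, 26, 31))
    print(" ".join(generate_command("crawl/crawldb", "crawl/segments", days, 50)))
```

The generated segment would then go through the fetch and updatedb steps that the wiki page describes.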
> > > > > > > 
> > > > > > > 2009/12/9 Peters, Vijaya <Vi...@sra.com>
> > > > > > > 
> > > > > > > > > I didn't see a setting to override in crawl-urlfilter.  How do I
> > > > > > > > > set numberDays? I have regular expressions to include/exclude
> > > > > > > > > certain extensions and certain urls, but that's all I have in
> > > > > > > > > there.
> > > > > > > >
> > > > > > > > Please send me an example and I'll give it a try.
> > > > > > > >
> > > > > > > > Thanks!
> > > > > > > >
> > > > > > > >
> > > > > > > > -----Original Message-----
> > > > > > > > From: xiao yang [mailto:yangxiao9901@gmail.com]
> > > > > > > > Sent: Wednesday, December 09, 2009 1:41 PM
> > > > > > > > To: nutch-user@lucene.apache.org
> > > > > > > > Subject: Re: how to force nutch to do a recrawl
> > > > > > > >
> > > > > > > > What about the configuration in crawl-urlfilter.txt?
> > > > > > > >
> > > > > > > > > On Thu, Dec 10, 2009 at 2:29 AM, Peters, Vijaya <Vi...@sra.com>
> > > > > > > > > wrote:
> > > > > > > > > > I tried that too.
> > > > > > > > > > In nutch-site.xml, I added the below, but this had no effect.
> > > > > > > > >
> > > > > > > > > > <property>
> > > > > > > > > >  <name>db.default.fetch.interval</name>
> > > > > > > > > >  <value>0</value>
> > > > > > > > > >  <description>(DEPRECATED) The default number of days between
> > > > > > > > > >  re-fetches of a page.  value was 30
> > > > > > > > > >  </description>
> > > > > > > > > > </property>
> > > > > > > > > >
> > > > > > > > > > <property>
> > > > > > > > > >  <name>db.fetch.interval.default</name>
> > > > > > > > > >  <value>3600</value>
> > > > > > > > > >  <description>The default number of seconds between re-fetches
> > > > > > > > > >  of a page (30 days). value was 2592000 (30 days)
> > > > > > > > > >  </description>
> > > > > > > > > > </property>
> > > > > > > > > >
> > > > > > > > > > <property>
> > > > > > > > > >  <name>db.fetch.interval.max</name>
> > > > > > > > > >  <value>3600</value>
> > > > > > > > > >  <description>The maximum number of seconds between re-fetches
> > > > > > > > > >  of a page (90 days). After this period every page in the db
> > > > > > > > > >  will be re-tried, no matter what is its status.  value was
> > > > > > > > > >  7776000
> > > > > > > > > >  </description>
> > > > > > > > > > </property>
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > -----Original Message-----
> > > > > > > > > From: MilleBii [mailto:millebii@gmail.com]
> > > > > > > > > Sent: Wednesday, December 09, 2009 1:27 PM
> > > > > > > > > To: nutch-user@lucene.apache.org
> > > > > > > > > Subject: Re: how to force nutch to do a recrawl
> > > > > > > > >
> > > > > > > > > > Nutch only recrawls every 30 days by default. So set numberDays
> > > > > > > > > > accordingly and it will recrawl. Read nutch-default.xml to get
> > > > > > > > > > the details.
> > > > > > > > >
> > > > > > > > > 2009/12/9, xiao yang <ya...@gmail.com>:
> > > > > > > > > >> What do you mean by "recrawl"?
> > > > > > > > > >> Does the following command meet your needs?
> > > > > > > > > >> bin/nutch crawl urls -dir crawl -depth 3 -topN 50
> > > > > > > > > >> Change the destination directory to a different one from the
> > > > > > > > > >> last crawl.
> > > > > > > > >>
> > > > > > > > > >> On Thu, Dec 10, 2009 at 1:44 AM, Peters, Vijaya <Vi...@sra.com>
> > > > > > > > > >> wrote:
> > > > > > > > > >>> I'm running Nutch 1.0 in windows.  How do I force Nutch to do a
> > > > > > > > > >>> complete recrawl?
> > > > > > > > >>>
> > > > > > > > >>>
> > > > > > > > >>>
> > > > > > > > >>> thanks,
> > > > > > > > >>>
> > > > > > > > >>> - Vijaya
> > > > > > > > >>>
> > > > > > > > >>>
> > > > > > > > >>>
> > > > > > > > >>
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > --
> > > > > > > > > -MilleBii-
> > > > > > > > >
> > > > > > > >
> > > > > > > 
> > > > > > > 
> > > > > > > 
> > > > > > > -- 
> > > > > > > -MilleBii-

RE: how to force nutch to do a recrawl

Posted by "Peters, Vijaya" <Vi...@sra.com>.
Thanks.
I'm on a development system, so every hour is okay.  
I guess that's why the last time I changed the properties file it didn't
take any effect (because crawldb won't change the fetch time
automatically).

I'll give this a try - thanks much.


-----Original Message-----
From: BELLINI ADAM [mailto:mbellil@msn.com] 
Sent: Monday, December 14, 2009 11:38 AM
To: nutch-user@lucene.apache.org
Subject: RE: how to force nutch to do a recrawl


Yes, just add those properties to nutch-site.xml and it should work.
But are you going to recrawl every hour? I see 3600 seconds!

Another thing: you have to make an initial clean crawl with the new
fetch interval, because the fetch time stored in the crawldb will not
change automatically. (In my case it didn't change; I just deleted the
crawldb and made a clean crawl, and it works.)
Maybe someone can tell you how to change the fetch time in the crawldb
without deleting it for an initial clean crawl.

thx
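
A minimal sketch, in Python, for checking whether an existing crawldb still carries the old interval after the configuration change. It scans the text of a readdb dump for "Retry interval:" lines and lists the values that differ from the configured one. The function name and the sample text are illustrative assumptions; only the line format comes from the dump quoted in this thread.

```python
import re


def stale_intervals(dump_text, configured_seconds):
    """Return the retry intervals in a dump that differ from the configured one."""
    found = re.findall(r"^Retry interval: (\d+) seconds", dump_text, re.M)
    return [int(s) for s in found if int(s) != configured_seconds]


if __name__ == "__main__":
    # Hypothetical two-entry dump: one old entry, one written after the change.
    dump = (
        "http://mysite/\tVersion: 7\n"
        "Retry interval: 2592000 seconds (30 days)\n"
        "\n"
        "http://mysite/new\tVersion: 7\n"
        "Retry interval: 3600 seconds (0 days)\n"
    )
    print(stale_intervals(dump, 3600))  # entries still on the old interval
```

A non-empty result would suggest the entries were written before the change, which is the situation the clean crawl described above works around.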


> Subject: RE: how to force nutch to do a recrawl
> Date: Mon, 14 Dec 2009 11:26:31 -0500
> From: Vijaya_Peters@sra.com
> To: nutch-user@lucene.apache.org
> 
> Adam,
> I finally got the command to work on another server (see below).  To
> change the retry interval, should I just add the two properties to
> nutch-site.xml (though I tried this before and it didn't work):
> 
> http://mysite/	Version: 7
> Status: 2 (db_fetched)
> Fetch time: Fri Jan 08 15:42:33 EST 2010  
> Modified time: Wed Dec 31 19:00:00 EST 1969
> Retries since fetch: 0
> Retry interval: 2592000 seconds (30 days)  
> Score: 1.0
> Signature: e04ab1ac06075fc273dbe1334a6c6dc5
> Metadata: _pst_: success(1), lastModified=0
> 
> 
> <property>
> <name>db.fetch.interval.default</name>
> <value>3600</value>
> <description>The default number of seconds between re-fetches of
> a page (30 days).
> </description>
> </property>
> 
> <property>
> <name>db.fetch.interval.max</name>
> <value>3600</value>
> <description>The maximum number of seconds between re-fetches of
> a page (90 days). After this period every page in the db will be
> re-tried, no matter what is its status.</description>
> </property>
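
A minimal sketch, in Python, for reading these two properties back out of a configuration file and sanity-checking them. It assumes the property elements sit inside a configuration root element, which the excerpt above does not show; the function names and the embedded sample are illustrative.

```python
import xml.etree.ElementTree as ET

# Hypothetical file content; the <configuration> wrapper is an assumption.
SITE_XML = """<configuration>
  <property>
    <name>db.fetch.interval.default</name>
    <value>3600</value>
  </property>
  <property>
    <name>db.fetch.interval.max</name>
    <value>3600</value>
  </property>
</configuration>"""


def read_properties(xml_text):
    """Return a {name: value} dict for every <property> element."""
    root = ET.fromstring(xml_text)
    return {
        prop.findtext("name").strip(): prop.findtext("value").strip()
        for prop in root.iter("property")
    }


def check_intervals(props):
    """Return a list of problems found in the two interval settings."""
    default = int(props["db.fetch.interval.default"])
    maximum = int(props["db.fetch.interval.max"])
    problems = []
    if default <= 0:
        problems.append("db.fetch.interval.default must be positive")
    if maximum < default:
        problems.append(
            "db.fetch.interval.max is smaller than db.fetch.interval.default"
        )
    return problems


if __name__ == "__main__":
    print(check_intervals(read_properties(SITE_XML)))
```

Note that both values are in seconds, so 3600 means one hour even though the copied descriptions still mention 30 and 90 days.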
> 
> 
> 
> -----Original Message-----
> From: BELLINI ADAM [mailto:mbellil@msn.com] 
> Sent: Friday, December 11, 2009 3:11 PM
> To: nutch-user@lucene.apache.org
> Subject: RE: how to force nutch to do a recrawl
> 
> 
> hi,
> 
> You shouldn't open the crc file; you have to open the other one, which
> is part-00000.
> Use vi to open part-00000.
> If you cannot find this file, your dump failed; just check the
> logs/hadoop.log file.
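
A minimal sketch, in Python, for locating the readable part file in the dump folder without an editor, which may help on Windows. It assumes the dump folder contains files named part-NNNNN plus checksum files ending in .crc; the function names are illustrative.

```python
from pathlib import Path


def dump_part_files(dump_dir):
    """Return the part-* files of a readdb dump, skipping .crc checksum files."""
    return sorted(
        p for p in Path(dump_dir).glob("part-*")
        if p.is_file() and p.suffix != ".crc"
    )


def head(path, n=10):
    """Return the first n lines of a text file."""
    with open(path, encoding="utf-8", errors="replace") as fh:
        return [line.rstrip("\n") for _, line in zip(range(n), fh)]


if __name__ == "__main__":
    parts = dump_part_files("whole_db")
    if not parts:
        print("no part file found: the dump may have failed, see logs/hadoop.log")
    for part in parts:
        print("\n".join(head(part)))
```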
> 
> 
> 
> 
> 
> 
> > Subject: RE: how to force nutch to do a recrawl
> > Date: Fri, 11 Dec 2009 09:14:26 -0500
> > From: Vijaya_Peters@sra.com
> > To: nutch-user@lucene.apache.org
> > 
> > Adam,
> > I'm using Cygwin to run the scripts.  I use EditPlus to edit the
> > files.  But EditPlus won't allow me to edit the crc file.  I'll see if
> > I can ftp the file to a unix machine.
> > 
> > 
> > 
> > 
> > 
> > -----Original Message-----
> > From: BELLINI ADAM [mailto:mbellil@msn.com]
> > Sent: Thu 12/10/2009 6:43 PM
> > To: nutch-user@lucene.apache.org
> > Subject: RE: how to force nutch to do a recrawl
> >  
> > 
> > 
> > But how are you running the sh scripts?
> > You have to use Cygwin to be able to edit Linux files.
> > 
> > 
> > 
> > 
> > > Subject: RE: how to force nutch to do a recrawl
> > > Date: Thu, 10 Dec 2009 16:09:13 -0500
> > > From: Vijaya_Peters@sra.com
> > > To: nutch-user@lucene.apache.org
> > > 
> > > Adam,
> > > I'm on Windows unfortunately!!  I'm using cygdrive, but it doesn't
> > > recognize vi.  Any idea for opening it in Windows?  Notepad didn't
> > > work either.
> > > 
> > > 
> > > -----Original Message-----
> > > From: BELLINI ADAM [mailto:mbellil@msn.com] 
> > > Sent: Thursday, December 10, 2009 4:01 PM
> > > To: nutch-user@lucene.apache.org
> > > Subject: RE: how to force nutch to do a recrawl
> > > 
> > > 
> > > Just use vi or vim.
> > > 
> > > I use vi to edit the file.
> > > 
> > > 
> > > 
> > > 
> > > 
> > > > Subject: RE: how to force nutch to do a recrawl
> > > > Date: Thu, 10 Dec 2009 15:58:24 -0500
> > > > From: Vijaya_Peters@sra.com
> > > > To: nutch-user@lucene.apache.org
> > > > 
> > > > Adam,
> > > > What do I use to open a CRC file? I tried QuickSFV.  Thanks in
> > > > advance!
> > > > 
> > > > 
> > > > -----Original Message-----
> > > > From: BELLINI ADAM [mailto:mbellil@msn.com] 
> > > > Sent: Thursday, December 10, 2009 3:48 PM
> > > > To: nutch-user@lucene.apache.org
> > > > Subject: RE: how to force nutch to do a recrawl
> > > > 
> > > > 
> > > > it will not dump to the console !
> > > > whole_db is a folder and you have to edit the file you will find
> in
> > > this
> > > > folder
> > > > 
> > > > 
> > > > 
> > > > > Subject: RE: how to force nutch to do a recrawl
> > > > > Date: Thu, 10 Dec 2009 14:26:30 -0500
> > > > > From: Vijaya_Peters@sra.com
> > > > > To: nutch-user@lucene.apache.org
> > > > > 
> > > > > Adam,
> > > > > I tried running that command and get the following (it created
a
> > > > > whole_db directory, but it's not dumping out the contents to
the
> > > > > console):
> > > > > 
> > > > > $ bin/nutch readdb crawl/crawldb/ -dump whole_db
> > > > > CrawlDb dump: starting
> > > > > CrawlDb db: crawl/crawldb/
> > > > > CrawlDb dump: done
> > > > > 
> > > > > Vijaya Peters
> > > > > SRA International, Inc.
> > > > > 4350 Fair Lakes Court North
> > > > > Room 4004
> > > > > Fairfax, VA  22033
> > > > > Tel:  703-502-1184
> > > > > 
> > > > > www.sra.com
> > > > > Named to FORTUNE's "100 Best Companies to Work For" list for
10
> > > > > consecutive years
> > > > > P Please consider the environment before printing this e-mail
> > > > > This electronic message transmission contains information from
> SRA
> > > > > International, Inc. which may be confidential, privileged or
> > > > > proprietary.  The information is intended for the use of the
> > > > individual
> > > > > or entity named above.  If you are not the intended recipient,
> be
> > > > aware
> > > > > that any disclosure, copying, distribution, or use of the
> contents
> > > of
> > > > > this information is strictly prohibited.  If you have received
> this
> > > > > electronic information in error, please notify us immediately
by
> > > > > telephone at 866-584-2143.
> > > > > -----Original Message-----
> > > > > From: BELLINI ADAM [mailto:mbellil@msn.com] 
> > > > > Sent: Thursday, December 10, 2009 1:40 PM
> > > > > To: nutch-user@lucene.apache.org
> > > > > Subject: RE: how to force nutch to do a recrawl
> > > > > 
> > > > > 
> > > > > hi,
> > > > > check the fetch time in your crawldb...you can dump all the
> crawldb
> > > > like
> > > > > this:
> > > > > 
> > > > > ./bin/nutch readdb $your_crawl_rep/crawldb/ -dump whole_db
> > > > > 
> > > > > entries will look like this:
> > > > > 
> > > > > http://www.YOUR_URL_TO_FETCH
> > > > > Status: 2 (db_fetched)
> > > > > Fetch time: Thu Dec 10 09:19:18 EST 2009
> > > > > Modified time: Wed Dec 31 19:00:00 EST 1969
> > > > > Retries since fetch: 0
> > > > > Retry interval: 18000 seconds (0 days)
> > > > > Score: 0.0014977538
> > > > > Signature: 15cdb66f15eb4dd3e1b16581b78XcfC5c
> > > > > Metadata: _pst_: success(1), lastModified=0
> > > > > 
> > > > > 
> > > > > as you see the next time the page will be fetched is in fetch
> time
> > > :
> > > > > 'Fetch time: Thu Dec 10 09:19:18 EST 2009'
> > > > > and check the rety interval : it should be your 3600. 
> > > > > 
> > > > > hope it will help
> > > > > 
> > > > > 
> > > > > > Subject: RE: how to force nutch to do a recrawl
> > > > > > Date: Wed, 9 Dec 2009 16:06:58 -0500
> > > > > > From: Vijaya_Peters@sra.com
> > > > > > To: nutch-user@lucene.apache.org
> > > > > > 
> > > > > > Okay.  I'll dig a little deeper.  I saw a few scripts that
> people
> > > > had
> > > > > > created, but I couldn't get them to work.
> > > > > > 
> > > > > > Thanks much.
> > > > > > 
> > > > > > Vijaya Peters
> > > > > > SRA International, Inc.
> > > > > > 4350 Fair Lakes Court North
> > > > > > Room 4004
> > > > > > Fairfax, VA  22033
> > > > > > Tel:  703-502-1184
> > > > > > 
> > > > > > www.sra.com
> > > > > > Named to FORTUNE's "100 Best Companies to Work For" list for
> 10
> > > > > > consecutive years
> > > > > > P Please consider the environment before printing this
e-mail
> > > > > > This electronic message transmission contains information
from
> SRA
> > > > > > International, Inc. which may be confidential, privileged or
> > > > > > proprietary.  The information is intended for the use of the
> > > > > individual
> > > > > > or entity named above.  If you are not the intended
recipient,
> be
> > > > > aware
> > > > > > that any disclosure, copying, distribution, or use of the
> contents
> > > > of
> > > > > > this information is strictly prohibited.  If you have
received
> > > this
> > > > > > electronic information in error, please notify us
immediately
> by
> > > > > > telephone at 866-584-2143.
> > > > > > 
> > > > > > -----Original Message-----
> > > > > > From: MilleBii [mailto:millebii@gmail.com] 
> > > > > > Sent: Wednesday, December 09, 2009 4:05 PM
> > > > > > To: nutch-user@lucene.apache.org
> > > > > > Subject: Re: how to force nutch to do a recrawl
> > > > > > 
> > > > > > I don't that you can use nutch crawl command to do that,
this
> is a
> > > > one
> > > > > > stop
> > > > > > shop command.
> > > > > > You probably want to use individual commands.
> > > > > > Type nutch generate to get the help and you will see the
> option
> > > > > > -adddays,
> > > > > > read that page on the wiki to get a feel how you should do:
> > > > > > http://wiki.apache.org/nutch/Crawl
> > > > > > 
> > > > > > 2009/12/9 Peters, Vijaya <Vi...@sra.com>
> > > > > > 
> > > > > > > I didn't see a setting to override in crawl-urlfilter.
How
> do I
> > > > set
> > > > > > > numberDays? I have regular expressions to include/exclude
> > > certain
> > > > > > extensions
> > > > > > > and certain urls, but that's all I have in there.
> > > > > > >
> > > > > > > Please send me an example and I'll give it a try.
> > > > > > >
> > > > > > > Thanks!
> > > > > > >
> > > > > > > Vijaya Peters
> > > > > > > SRA International, Inc.
> > > > > > > 4350 Fair Lakes Court North
> > > > > > > Room 4004
> > > > > > > Fairfax, VA  22033
> > > > > > > Tel:  703-502-1184
> > > > > > >
> > > > > > > www.sra.com
> > > > > > > Named to FORTUNE's "100 Best Companies to Work For" list
for
> 10
> > > > > > consecutive
> > > > > > > years
> > > > > > > P Please consider the environment before printing this
> e-mail
> > > > > > > This electronic message transmission contains information
> from
> > > SRA
> > > > > > > International, Inc. which may be confidential, privileged
or
> > > > > > proprietary.
> > > > > > >  The information is intended for the use of the individual
> or
> > > > entity
> > > > > > named
> > > > > > > above.  If you are not the intended recipient, be aware
that
> any
> > > > > > disclosure,
> > > > > > > copying, distribution, or use of the contents of this
> > > information
> > > > is
> > > > > > > strictly prohibited.  If you have received this electronic
> > > > > information
> > > > > > in
> > > > > > > error, please notify us immediately by telephone at
> > > 866-584-2143.
> > > > > > >
> > > > > > > -----Original Message-----
> > > > > > > From: xiao yang [mailto:yangxiao9901@gmail.com]
> > > > > > > Sent: Wednesday, December 09, 2009 1:41 PM
> > > > > > > To: nutch-user@lucene.apache.org
> > > > > > > Subject: Re: how to force nutch to do a recrawl
> > > > > > >
> > > > > > > What about the configuration in crawl-urlfilter.txt?
> > > > > > >
> > > > > > > On Thu, Dec 10, 2009 at 2:29 AM, Peters, Vijaya
> > > > > > <Vi...@sra.com>
> > > > > > > wrote:
> > > > > > > > I tried that too.
> > > > > > > > in Nutch-site.xml, I added in the below, but this had no
> > > effect.
> > > > > > > >
> > > > > > > > <property>
> > > > > > > >  <name>db.default.fetch.interval</name>
> > > > > > > >  <value>0</value>
> > > > > > > >  <description>(DEPRECATED) The default number of days
> between
> > > > > > re-fetches
> > > > > > > of a page.  value was 30
> > > > > > > >  </description>
> > > > > > > > </property>
> > > > > > > >
> > > > > > > > <property>
> > > > > > > >  <name>db.fetch.interval.default</name>
> > > > > > > >  <value>3600</value>
> > > > > > > >  <description>The default number of seconds between
> re-fetches
> > > > of
> > > > > a
> > > > > > page
> > > > > > > (30 days). value was 2592000 (30 days)
> > > > > > > >  </description>
> > > > > > > > </property>
> > > > > > > >
> > > > > > > > <property>
> > > > > > > >  <name>db.fetch.interval.max</name>
> > > > > > > >  <value>3600</value>
> > > > > > > >  <description>The maximum number of seconds between
> re-fetches
> > > > of
> > > > > a
> > > > > > page
> > > > > > > >  (90 days). After this period every page in the db will
be
> > > > > re-tried,
> > > > > > no
> > > > > > > >  matter what is its status.  value was 7776000
> > > > > > > >  </description>
> > > > > > > > </property>
> > > > > > > >
> > > > > > > > Vijaya Peters
> > > > > > > > SRA International, Inc.
> > > > > > > > 4350 Fair Lakes Court North
> > > > > > > > Room 4004
> > > > > > > > Fairfax, VA  22033
> > > > > > > > Tel:  703-502-1184
> > > > > > > >
> > > > > > > > www.sra.com
> > > > > > > > Named to FORTUNE's "100 Best Companies to Work For" list
> for
> > > 10
> > > > > > > consecutive years
> > > > > > > > P Please consider the environment before printing this
> e-mail
> > > > > > > > This electronic message transmission contains
information
> from
> > > > SRA
> > > > > > > International, Inc. which may be confidential, privileged
or
> > > > > > proprietary.
> > > > > > >  The information is intended for the use of the individual
> or
> > > > entity
> > > > > > named
> > > > > > > above.  If you are not the intended recipient, be aware
that
> any
> > > > > > disclosure,
> > > > > > > copying, distribution, or use of the contents of this
> > > information
> > > > is
> > > > > > > strictly prohibited.  If you have received this electronic
> > > > > information
> > > > > > in
> > > > > > > error, please notify us immediately by telephone at
> > > 866-584-2143.
> > > > > > > >
> > > > > > > > -----Original Message-----
> > > > > > > > From: MilleBii [mailto:millebii@gmail.com]
> > > > > > > > Sent: Wednesday, December 09, 2009 1:27 PM
> > > > > > > > To: nutch-user@lucene.apache.org
> > > > > > > > Subject: Re: how to force nutch to do a recrawl
> > > > > > > >
> > > > > > > > Nutch only recrawl every 30 days by default. So you set
> the
> > > > > > numberDays
> > > > > > > > adequately and it wil recrawl read nutch-default.xml to
> get
> > > the
> > > > > > > > details
> > > > > > > >
> > > > > > > > 2009/12/9, xiao yang <ya...@gmail.com>:
> > > > > > > >> What do you mean by "recrawl"?
> > > > > > > >> Does the following command meets what you need?
> > > > > > > >> bin/nutch crawl urls -dir crawl -depth 3 -topN 50
> > > > > > > >> Change the destination directory to a different one
with
> the
> > > > last
> > > > > > crawl.
> > > > > > > >>
> > > > > > > >> On Thu, Dec 10, 2009 at 1:44 AM, Peters, Vijaya
> > > > > > <Vi...@sra.com>
> > > > > > > >> wrote:
> > > > > > > >>> I'm running Nutch 1.0 in windows.  How do I force
Nutch
> to
> > > do
> > > > a
> > > > > > > complete
> > > > > > > >>> recrawl?
> > > > > > > >>>
> > > > > > > >>>
> > > > > > > >>>
> > > > > > > >>> thanks,
> > > > > > > >>>
> > > > > > > >>> - Vijaya
> > > > > > > >>>
> > > > > > > >>>
> > > > > > > >>>
> > > > > > > >>> Vijaya Peters
> > > > > > > >>> SRA International, Inc.
> > > > > > > >>> 4350 Fair Lakes Court North
> > > > > > > >>> Room 4004
> > > > > > > >>> Fairfax, VA  22033
> > > > > > > >>> Tel:  703-502-1184
> > > > > > > >>>
> > > > > > > >>> www.sra.com <http://www.sra.com/>
> > > > > > > >>> Named to FORTUNE's "100 Best Companies to Work For"
list
> for
> > > > 10
> > > > > > > >>> consecutive years
> > > > > > > >>>
> > > > > > > >>> P Please consider the environment before printing this
> > > e-mail
> > > > > > > >>>
> > > > > > > >>> This electronic message transmission contains
> information
> > > from
> > > > > SRA
> > > > > > > >>> International, Inc. which may be confidential,
> privileged or
> > > > > > > >>> proprietary.  The information is intended for the use
of
> the
> > > > > > individual
> > > > > > > >>> or entity named above.  If you are not the intended
> > > recipient,
> > > > > be
> > > > > > aware
> > > > > > > >>> that any disclosure, copying, distribution, or use of
> the
> > > > > contents
> > > > > > of
> > > > > > > >>> this information is strictly prohibited.  If you have
> > > received
> > > > > > this
> > > > > > > >>> electronic information in error, please notify us
> > > immediately
> > > > by
> > > > > > > >>> telephone at 866-584-2143.
> > > > > > > >>>
> > > > > > > >>>
> > > > > > > >>>
> > > > > > > >>>
> > > > > > > >>
> > > > > > > >
> > > > > > > >
> > > > > > > > --
> > > > > > > > -MilleBii-
> > > > > > > >
> > > > > > >
> > > > > > 
> > > > > > 
> > > > > > 
> > > > > > -- 
> > > > > > -MilleBii-

RE: how to force nutch to do a recrawl

Posted by BELLINI ADAM <mb...@msn.com>.
Yes, just add those properties to nutch-site.xml and it should work. But are you really going to recrawl every hour? I see 3600 seconds!

Another thing: you have to make an initial clean crawl with the new setting in place, because the fetch time already stored in the crawldb will not change automatically. (In my case it didn't change; I just deleted the crawldb and made a clean crawl, and it works.)
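
To see which entries still carry the old interval, the text dump produced by readdb -dump can be scanned. This is only a minimal sketch, assuming the dump layout quoted in this thread (a URL line containing a tab and "Version:", followed by a "Retry interval:" line); the function name stale_intervals is made up for illustration and is not part of Nutch.

```shell
# Print "URL interval" for every entry in a readdb text dump whose
# retry interval differs from the expected one.
#   $1 = dump file (for example whole_db/part-00000)
#   $2 = expected retry interval in seconds (for example 3600)
# Assumption: the dump layout matches the entries quoted in this thread.
stale_intervals() {
  awk -v want="$2" '
    # URL line: "<url><TAB>Version: N"; remember the URL of the current entry
    /\tVersion: /       { split($0, parts, "\t"); url = parts[1] }
    # "Retry interval: <seconds> seconds (<days> days)"; field 3 is the seconds
    /^Retry interval: / { if ($3 != want) print url, $3 }
  ' "$1"
}
```

For example, stale_intervals whole_db/part-00000 3600 would list each URL that is still on a different interval, such as the 2592000-second (30-day) default.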
Maybe someone can tell you how to change the fetch time in the crawldb without deleting it for an initial clean crawl.
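
One option already mentioned in this thread is the -adddays option of nutch generate, which adds the given number of days to the current time so that entries whose fetch time still lies in the future can be selected. Below is a rough sketch of one generate/fetch/updatedb round, assuming the individual Nutch 1.0 commands and a crawl directory named crawl. The names recrawl_once, NUTCH and CRAWL_DIR are illustrative only, and this has not been checked on a Windows/cygwin setup.

```shell
# Assumptions: run from the Nutch install directory; NUTCH and CRAWL_DIR
# are illustrative variables, not something Nutch itself defines.
NUTCH="${NUTCH:-bin/nutch}"
CRAWL_DIR="${CRAWL_DIR:-crawl}"

# Run one generate/fetch/updatedb round.
#   $1 = number of days to add to the current time when selecting URLs
recrawl_once() {
  # Select URLs as if the clock were $1 days ahead
  "$NUTCH" generate "$CRAWL_DIR/crawldb" "$CRAWL_DIR/segments" -topN 50 -adddays "$1" || return 1
  # Segment directories are named by timestamp, so the last one sorted is the newest
  segment=$(ls -d "$CRAWL_DIR"/segments/* | sort | tail -n 1)
  "$NUTCH" fetch "$segment" || return 1
  # Merge the fetch results back into the crawldb
  "$NUTCH" updatedb "$CRAWL_DIR/crawldb" "$segment"
}
```

With the 30-day default interval, a call such as recrawl_once 31 should make every fetched page eligible again. Whether the stored retry interval changes afterwards is a separate question and is not addressed by this sketch.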

thx


> Subject: RE: how to force nutch to do a recrawl
> Date: Mon, 14 Dec 2009 11:26:31 -0500
> From: Vijaya_Peters@sra.com
> To: nutch-user@lucene.apache.org
> 
> Adam,
> I finally got the command to work on another server (see below).  To
> change the retry interval, should I just add the two properties to
> nutch-site.xml (though I tried this before and it didn't work):
> 
> http://mysite/	Version: 7
> Status: 2 (db_fetched)
> Fetch time: Fri Jan 08 15:42:33 EST 2010  
> Modified time: Wed Dec 31 19:00:00 EST 1969
> Retries since fetch: 0
> Retry interval: 2592000 seconds (30 days)  
> Score: 1.0
> Signature: e04ab1ac06075fc273dbe1334a6c6dc5
> Metadata: _pst_: success(1), lastModified=0
> 
> 
> <property>
> <name>db.fetch.interval.default</name>
> <value>3600</value>
> <description>The default number of seconds between re-fetches of
> a page (30 days).
> </description>
> </property>
> 
> <property>
> <name>db.fetch.interval.max</name>
> <value>3600</value>
> <description>The maximum number of seconds between re-fetches of 
> a page(90 days). After this period every page in the db will be 
> re-tried, no matter what is its status.  </description> 
> </property>
> 
> 
> 
> -----Original Message-----
> From: BELLINI ADAM [mailto:mbellil@msn.com] 
> Sent: Friday, December 11, 2009 3:11 PM
> To: nutch-user@lucene.apache.org
> Subject: RE: how to force nutch to do a recrawl
> 
> 
> hi,
> 
> You shouldn't open the .crc file; you have to open the other one, which is
> part-00000.
> Use vi to edit part-00000.
> If you don't find this file, your dump failed... just check the
> logs/hadoop.log file.
> 
> 
> 
> 
> 
> 
> > Subject: RE: how to force nutch to do a recrawl
> > Date: Fri, 11 Dec 2009 09:14:26 -0500
> > From: Vijaya_Peters@sra.com
> > To: nutch-user@lucene.apache.org
> > 
> > Adam,
> > I'm using cygwin to run the scripts.  I use EditPlus to edit the
> > files.  But EditPlus won't allow me to edit the crc file.  I'll see if I
> > can ftp the file to a unix machine.
> > 
> > 
> > 
> > 
> > 
> > -----Original Message-----
> > From: BELLINI ADAM [mailto:mbellil@msn.com]
> > Sent: Thu 12/10/2009 6:43 PM
> > To: nutch-user@lucene.apache.org
> > Subject: RE: how to force nutch to do a recrawl
> >  
> > 
> > 
> > But how are you running sh scripts?
> > You have to use cygwin to be able to edit linux files.
> > 
> > 
> > 
> > 
> > > Subject: RE: how to force nutch to do a recrawl
> > > Date: Thu, 10 Dec 2009 16:09:13 -0500
> > > From: Vijaya_Peters@sra.com
> > > To: nutch-user@lucene.apache.org
> > > 
> > > Adam,
> > > I'm on windows unfortunately!!  I'm using cygdrive, but it doesn't
> > > recognize vi.  Any idea for opening it in windows?  Notepad didn't
> > > work either.
> > > 
> > > 
> > > -----Original Message-----
> > > From: BELLINI ADAM [mailto:mbellil@msn.com] 
> > > Sent: Thursday, December 10, 2009 4:01 PM
> > > To: nutch-user@lucene.apache.org
> > > Subject: RE: how to force nutch to do a recrawl
> > > 
> > > 
> > > just use vi or vim
> > > 
> > > 
> > > i use vi to edit the file
> > > 
> > > 
> > > 
> > > 
> > > 
> > > > Subject: RE: how to force nutch to do a recrawl
> > > > Date: Thu, 10 Dec 2009 15:58:24 -0500
> > > > From: Vijaya_Peters@sra.com
> > > > To: nutch-user@lucene.apache.org
> > > > 
> > > > Adam,
> > > > What do I use to open a CRC file? I tried QuickSFV.  Thanks in
> > > > advance!
> > > > 
> > > > 
> > > > -----Original Message-----
> > > > From: BELLINI ADAM [mailto:mbellil@msn.com] 
> > > > Sent: Thursday, December 10, 2009 3:48 PM
> > > > To: nutch-user@lucene.apache.org
> > > > Subject: RE: how to force nutch to do a recrawl
> > > > 
> > > > 
> > > > It will not dump to the console!
> > > > whole_db is a folder, and you have to edit the file you will find
> > > > in this folder.
> > > > 
> > > > 
> > > > 
> > > > > Subject: RE: how to force nutch to do a recrawl
> > > > > Date: Thu, 10 Dec 2009 14:26:30 -0500
> > > > > From: Vijaya_Peters@sra.com
> > > > > To: nutch-user@lucene.apache.org
> > > > > 
> > > > > Adam,
> > > > > I tried running that command and get the following (it created a
> > > > > whole_db directory, but it's not dumping out the contents to the
> > > > > console):
> > > > > 
> > > > > $ bin/nutch readdb crawl/crawldb/ -dump whole_db
> > > > > CrawlDb dump: starting
> > > > > CrawlDb db: crawl/crawldb/
> > > > > CrawlDb dump: done
> > > > > 
> > > > > -----Original Message-----
> > > > > From: BELLINI ADAM [mailto:mbellil@msn.com] 
> > > > > Sent: Thursday, December 10, 2009 1:40 PM
> > > > > To: nutch-user@lucene.apache.org
> > > > > Subject: RE: how to force nutch to do a recrawl
> > > > > 
> > > > > 
> > > > > hi,
> > > > > check the fetch time in your crawldb... you can dump the whole
> > > > > crawldb like this:
> > > > > 
> > > > > ./bin/nutch readdb $your_crawl_rep/crawldb/ -dump whole_db
> > > > > 
> > > > > entries will look like this:
> > > > > 
> > > > > http://www.YOUR_URL_TO_FETCH
> > > > > Status: 2 (db_fetched)
> > > > > Fetch time: Thu Dec 10 09:19:18 EST 2009
> > > > > Modified time: Wed Dec 31 19:00:00 EST 1969
> > > > > Retries since fetch: 0
> > > > > Retry interval: 18000 seconds (0 days)
> > > > > Score: 0.0014977538
> > > > > Signature: 15cdb66f15eb4dd3e1b16581b78XcfC5c
> > > > > Metadata: _pst_: success(1), lastModified=0
> > > > > 
> > > > > 
> > > > > As you see, the next time the page will be fetched is given by the
> > > > > fetch time: 'Fetch time: Thu Dec 10 09:19:18 EST 2009'.
> > > > > Also check the retry interval: it should be your 3600.
> > > > > 
> > > > > hope it will help
> > > > > 
> > > > > 
> > > > > > Subject: RE: how to force nutch to do a recrawl
> > > > > > Date: Wed, 9 Dec 2009 16:06:58 -0500
> > > > > > From: Vijaya_Peters@sra.com
> > > > > > To: nutch-user@lucene.apache.org
> > > > > > 
> > > > > > Okay.  I'll dig a little deeper.  I saw a few scripts that
> > > > > > people had created, but I couldn't get them to work.
> > > > > > 
> > > > > > Thanks much.
> > > > > > 
> > > > > > 
> > > > > > -----Original Message-----
> > > > > > From: MilleBii [mailto:millebii@gmail.com] 
> > > > > > Sent: Wednesday, December 09, 2009 4:05 PM
> > > > > > To: nutch-user@lucene.apache.org
> > > > > > Subject: Re: how to force nutch to do a recrawl
> > > > > > 
> > > > > > I don't think you can use the nutch crawl command to do that; it
> > > > > > is a one-stop-shop command.
> > > > > > You probably want to use the individual commands.
> > > > > > Type nutch generate to get the help and you will see the option
> > > > > > -adddays. Read this page on the wiki to get a feel for how to do it:
> > > > > > http://wiki.apache.org/nutch/Crawl
> > > > > > 
> > > > > > 2009/12/9 Peters, Vijaya <Vi...@sra.com>
> > > > > > 
> > > > > > > I didn't see a setting to override in crawl-urlfilter.  How do I
> > > > > > > set numberDays? I have regular expressions to include/exclude
> > > > > > > certain extensions and certain urls, but that's all I have in there.
> > > > > > >
> > > > > > > Please send me an example and I'll give it a try.
> > > > > > >
> > > > > > > Thanks!
> > > > > > >
> > > > > > >
> > > > > > > -----Original Message-----
> > > > > > > From: xiao yang [mailto:yangxiao9901@gmail.com]
> > > > > > > Sent: Wednesday, December 09, 2009 1:41 PM
> > > > > > > To: nutch-user@lucene.apache.org
> > > > > > > Subject: Re: how to force nutch to do a recrawl
> > > > > > >
> > > > > > > What about the configuration in crawl-urlfilter.txt?
> > > > > > >
> > > > > > > On Thu, Dec 10, 2009 at 2:29 AM, Peters, Vijaya
> > > > > > <Vi...@sra.com>
> > > > > > > wrote:
> > > > > > > > I tried that too.
> > > > > > > > In nutch-site.xml, I added the properties below, but this had no
> > > > > > > > effect.
> > > > > > > >
> > > > > > > > <property>
> > > > > > > >  <name>db.default.fetch.interval</name>
> > > > > > > >  <value>0</value>
> > > > > > > >  <description>(DEPRECATED) The default number of days between
> > > > > > > >  re-fetches of a page.  value was 30
> > > > > > > >  </description>
> > > > > > > > </property>
> > > > > > > >
> > > > > > > > <property>
> > > > > > > >  <name>db.fetch.interval.default</name>
> > > > > > > >  <value>3600</value>
> > > > > > > >  <description>The default number of seconds between re-fetches
> > > > > > > >  of a page (30 days). value was 2592000 (30 days)
> > > > > > > >  </description>
> > > > > > > > </property>
> > > > > > > >
> > > > > > > > <property>
> > > > > > > >  <name>db.fetch.interval.max</name>
> > > > > > > >  <value>3600</value>
> > > > > > > >  <description>The maximum number of seconds between re-fetches
> > > > > > > >  of a page (90 days). After this period every page in the db will
> > > > > > > >  be re-tried, no matter what is its status.  value was 7776000
> > > > > > > >  </description>
> > > > > > > > </property>
> > > > > > > >
> > > > > > > >
> > > > >  		 	   		  
> > > > >
> _________________________________________________________________
> > > > > Windows Live: Friends get your Flickr, Yelp, and Digg updates
> when
> > > > they
> > > > > e-mail you.
> > > > > http://go.microsoft.com/?linkid=9691817
> > > >  		 	   		  
> > > > _________________________________________________________________
> > > > Windows Live: Make it easier for your friends to see what you're
> up to
> > > > on Facebook.
> > > > http://go.microsoft.com/?linkid=9691816
> > >  		 	   		  
> > > _________________________________________________________________
> > > Windows Live: Make it easier for your friends to see what you're up
> to
> > > on Facebook.
> > > http://go.microsoft.com/?linkid=9691816
> >  		 	   		  
> > _________________________________________________________________
> > Eligible CDN College & University students can upgrade to Windows 7
> before Jan 3 for only $39.99. Upgrade now!
> > http://go.microsoft.com/?linkid=9691819
> > 
>  		 	   		  
> _________________________________________________________________
> Windows Live: Make it easier for your friends to see what you're up to
> on Facebook.
> http://go.microsoft.com/?linkid=9691816
 		 	   		  
_________________________________________________________________
Eligible CDN College & University students can upgrade to Windows 7 before Jan 3 for only $39.99. Upgrade now!
http://go.microsoft.com/?linkid=9691819

RE: how to force nutch to do a recrawl

Posted by "Peters, Vijaya" <Vi...@sra.com>.
Adam,
I finally got the command to work on another server (see the entry below).
To change the retry interval, should I just add the two properties below
to nutch-site.xml? I tried this before and it didn't work.

http://mysite/	Version: 7
Status: 2 (db_fetched)
Fetch time: Fri Jan 08 15:42:33 EST 2010  
Modified time: Wed Dec 31 19:00:00 EST 1969
Retries since fetch: 0
Retry interval: 2592000 seconds (30 days)  
Score: 1.0
Signature: e04ab1ac06075fc273dbe1334a6c6dc5
Metadata: _pst_: success(1), lastModified=0


<property>
<name>db.fetch.interval.default</name>
<value>3600</value>
<description>The default number of seconds between re-fetches of
a page (30 days).
</description>
</property>

<property>
<name>db.fetch.interval.max</name>
<value>3600</value>
<description>The maximum number of seconds between re-fetches of
a page (90 days). After this period every page in the db will be
re-tried, no matter what is its status.</description>
</property>
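
A possible reading of the entry above, offered as a sketch and not a confirmed diagnosis: the Retry interval stored for the page is still 2592000 seconds, so its Fetch time falls 30 days after the last fetch, and the generator would normally skip the page until then. The -adddays option of nutch generate, suggested in this thread, adds a number of days to the generator's current time, which can make such a page count as due. The helper below is hypothetical (adddays_needed is not a Nutch command) and only does the arithmetic; the two-day elapsed time in the example is an assumed figure.

```shell
#!/bin/sh
# Hypothetical helper (not part of Nutch): number of whole days to pass
# to -adddays so that a page with the given retry interval counts as due.
#   $1 = retry interval in seconds ("Retry interval:" in the readdb dump)
#   $2 = seconds elapsed since the page was last fetched (assumed known)
adddays_needed() {
    interval=$1
    elapsed=$2
    remaining=$(( interval - elapsed ))
    if [ "$remaining" -le 0 ]; then
        echo 0    # the page is already due
    else
        # round the remaining seconds up to whole days
        echo $(( (remaining + 86399) / 86400 ))
    fi
}

# Example: 30-day interval, page fetched 2 days (172800 seconds) ago
adddays_needed 2592000 172800
```

The result could then be passed to the generator, for example `bin/nutch generate crawl/crawldb crawl/segments -adddays 28`, where the crawl/ paths are assumptions about the local layout.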


Vijaya Peters


RE: how to force nutch to do a recrawl

Posted by BELLINI ADAM <mb...@msn.com>.
Hi,

You shouldn't open the crc file; open the other file, part-00000, instead.
Use vi to edit part-00000.
If that file is missing, your dump failed; check the logs/hadoop.log file.
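
Since part-00000 is plain text, an editor is not strictly needed to inspect it. As a minimal sketch, assuming the dump was written to whole_db as in the readdb command used in this thread and that entries follow the layout shown in this thread, the fields of interest can be filtered with grep. summarize_dump is a hypothetical helper name, not part of Nutch.

```shell
#!/bin/sh
# Hypothetical helper: print only the URL, fetch time and retry interval
# lines of a readdb dump file.
#   $1 = path to the dump file, e.g. whole_db/part-00000
summarize_dump() {
    grep -E '^(https?://|Fetch time:|Retry interval:)' "$1"
}

# Run against the dump directory used in this thread, if it exists.
if [ -f whole_db/part-00000 ]; then
    summarize_dump whole_db/part-00000
else
    echo "whole_db/part-00000 not found; check logs/hadoop.log" >&2
fi
```

The same filter works under cygwin, so the file does not have to be copied to a unix machine first.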






> Subject: RE: how to force nutch to do a recrawl
> Date: Fri, 11 Dec 2009 09:14:26 -0500
> From: Vijaya_Peters@sra.com
> To: nutch-user@lucene.apache.org
> 
> Adam,
> I'm using cygwin to run the scripts.  I use EditPlus to edit the files.  But EditPlus won't allow me to edit the crc file.  I'll see if I can ftp the file to a unix machine.
> 
> 
> Vijaya Peters
> SRA International, Inc.
> 12500 Fair Lakes Circle
> Room 3507
> Fairfax, VA 22033
> Tel:  703-222-9207
> 
> www.sra.com
> This electronic message transmission contains information from SRA International, Inc. which may be confidential, privileged or proprietary.  The information is intended for the use of the individual or entity named above.  If you are not the intended recipient, be aware that any disclosure, copying, distribution, or use of the contents of this information is strictly prohibited.  If you have received this electronic information in error, please notify us immediately by telephone at 866-584-2143.
> 
> 
> 
> -----Original Message-----
> From: BELLINI ADAM [mailto:mbellil@msn.com]
> Sent: Thu 12/10/2009 6:43 PM
> To: nutch-user@lucene.apache.org
> Subject: RE: how to force nutch to do a recrawl
>  
> 
> 
> bu8t how you are running sh scripts...
> you have to use cygwin to be able to edit linux files
> 
> 
> 
> 
> > Subject: RE: how to force nutch to do a recrawl
> > Date: Thu, 10 Dec 2009 16:09:13 -0500
> > From: Vijaya_Peters@sra.com
> > To: nutch-user@lucene.apache.org
> > 
> > Adam,
> > I'm on windows unfortunately!!  I'm using cygdrive, but it doesn't
> > recognize vi.  Any idea for opening it in windows?  Notepad didn't work
> > either.
> > 
> > Vijaya Peters
> > SRA International, Inc.
> > 4350 Fair Lakes Court North
> > Room 4004
> > Fairfax, VA  22033
> > Tel:  703-502-1184
> > 
> > www.sra.com
> > Named to FORTUNE's "100 Best Companies to Work For" list for 10
> > consecutive years
> > P Please consider the environment before printing this e-mail
> > This electronic message transmission contains information from SRA
> > International, Inc. which may be confidential, privileged or
> > proprietary.  The information is intended for the use of the individual
> > or entity named above.  If you are not the intended recipient, be aware
> > that any disclosure, copying, distribution, or use of the contents of
> > this information is strictly prohibited.  If you have received this
> > electronic information in error, please notify us immediately by
> > telephone at 866-584-2143.
> > 
> > -----Original Message-----
> > From: BELLINI ADAM [mailto:mbellil@msn.com] 
> > Sent: Thursday, December 10, 2009 4:01 PM
> > To: nutch-user@lucene.apache.org
> > Subject: RE: how to force nutch to do a recrawl
> > 
> > 
> > jus use vi or vim
> > 
> > 
> > i use vi to edit the file
> > 
> > 
> > 
> > 
> > 
> > > Subject: RE: how to force nutch to do a recrawl
> > > Date: Thu, 10 Dec 2009 15:58:24 -0500
> > > From: Vijaya_Peters@sra.com
> > > To: nutch-user@lucene.apache.org
> > > 
> > > Adam,
> > > What do I use to open a CRC file? I tried QuickSFV.  Thanks in
> > advance!
> > > 
> > > Vijaya Peters
> > > SRA International, Inc.
> > > 4350 Fair Lakes Court North
> > > Room 4004
> > > Fairfax, VA  22033
> > > Tel:  703-502-1184
> > > 
> > > www.sra.com
> > > Named to FORTUNE's "100 Best Companies to Work For" list for 10
> > > consecutive years
> > > P Please consider the environment before printing this e-mail
> > > This electronic message transmission contains information from SRA
> > > International, Inc. which may be confidential, privileged or
> > > proprietary.  The information is intended for the use of the
> > individual
> > > or entity named above.  If you are not the intended recipient, be
> > aware
> > > that any disclosure, copying, distribution, or use of the contents of
> > > this information is strictly prohibited.  If you have received this
> > > electronic information in error, please notify us immediately by
> > > telephone at 866-584-2143.
> > > 
> > > -----Original Message-----
> > > From: BELLINI ADAM [mailto:mbellil@msn.com] 
> > > Sent: Thursday, December 10, 2009 3:48 PM
> > > To: nutch-user@lucene.apache.org
> > > Subject: RE: how to force nutch to do a recrawl
> > > 
> > > 
> > > it will not dump to the console !
> > > whole_db is a folder and you have to edit the file you will find in
> > this
> > > folder
> > > 
> > > 
> > > 
> > > > Subject: RE: how to force nutch to do a recrawl
> > > > Date: Thu, 10 Dec 2009 14:26:30 -0500
> > > > From: Vijaya_Peters@sra.com
> > > > To: nutch-user@lucene.apache.org
> > > > 
> > > > Adam,
> > > > I tried running that command and get the following (it created a
> > > > whole_db directory, but it's not dumping out the contents to the
> > > > console):
> > > > 
> > > > $ bin/nutch readdb crawl/crawldb/ -dump whole_db
> > > > CrawlDb dump: starting
> > > > CrawlDb db: crawl/crawldb/
> > > > CrawlDb dump: done
> > > > 
> > > > Vijaya Peters
> > > > SRA International, Inc.
> > > > 4350 Fair Lakes Court North
> > > > Room 4004
> > > > Fairfax, VA  22033
> > > > Tel:  703-502-1184
> > > > 
> > > > www.sra.com
> > > > Named to FORTUNE's "100 Best Companies to Work For" list for 10
> > > > consecutive years
> > > > P Please consider the environment before printing this e-mail
> > > > This electronic message transmission contains information from SRA
> > > > International, Inc. which may be confidential, privileged or
> > > > proprietary.  The information is intended for the use of the
> > > individual
> > > > or entity named above.  If you are not the intended recipient, be
> > > aware
> > > > that any disclosure, copying, distribution, or use of the contents
> > of
> > > > this information is strictly prohibited.  If you have received this
> > > > electronic information in error, please notify us immediately by
> > > > telephone at 866-584-2143.
> > > > -----Original Message-----
> > > > From: BELLINI ADAM [mailto:mbellil@msn.com] 
> > > > Sent: Thursday, December 10, 2009 1:40 PM
> > > > To: nutch-user@lucene.apache.org
> > > > Subject: RE: how to force nutch to do a recrawl
> > > > 
> > > > 
> > > > hi,
> > > > check the fetch time in your crawldb...you can dump all the crawldb
> > > like
> > > > this:
> > > > 
> > > > ./bin/nutch readdb $your_crawl_rep/crawldb/ -dump whole_db
> > > > 
> > > > entries will look like this:
> > > > 
> > > > http://www.YOUR_URL_TO_FETCH
> > > > Status: 2 (db_fetched)
> > > > Fetch time: Thu Dec 10 09:19:18 EST 2009
> > > > Modified time: Wed Dec 31 19:00:00 EST 1969
> > > > Retries since fetch: 0
> > > > Retry interval: 18000 seconds (0 days)
> > > > Score: 0.0014977538
> > > > Signature: 15cdb66f15eb4dd3e1b16581b78XcfC5c
> > > > Metadata: _pst_: success(1), lastModified=0
> > > > 
> > > > 
> > > > as you see the next time the page will be fetched is in fetch time
> > :
> > > > 'Fetch time: Thu Dec 10 09:19:18 EST 2009'
> > > > and check the rety interval : it should be your 3600. 
> > > > 
> > > > hope it will help
> > > > 
> > > > 
> > > > > Subject: RE: how to force nutch to do a recrawl
> > > > > Date: Wed, 9 Dec 2009 16:06:58 -0500
> > > > > From: Vijaya_Peters@sra.com
> > > > > To: nutch-user@lucene.apache.org
> > > > > 
> > > > > Okay.  I'll dig a little deeper.  I saw a few scripts that people
> > > had
> > > > > created, but I couldn't get them to work.
> > > > > 
> > > > > Thanks much.
> > > > > 
> > > > > Vijaya Peters
> > > > > 
> > > > > -----Original Message-----
> > > > > From: MilleBii [mailto:millebii@gmail.com] 
> > > > > Sent: Wednesday, December 09, 2009 4:05 PM
> > > > > To: nutch-user@lucene.apache.org
> > > > > Subject: Re: how to force nutch to do a recrawl
> > > > > 
> > > > > I don't think you can use the nutch crawl command to do that; it is a
> > > > > one-stop-shop command.
> > > > > You probably want to use the individual commands.
> > > > > Type nutch generate to get the help and you will see the -adddays
> > > > > option. Read this page on the wiki to get a feel for how to proceed:
> > > > > http://wiki.apache.org/nutch/Crawl
> > > > > 
> > > > > 2009/12/9 Peters, Vijaya <Vi...@sra.com>
> > > > > 
> > > > > > I didn't see a setting to override in crawl-urlfilter.  How do I set
> > > > > > numberDays? I have regular expressions to include/exclude certain
> > > > > > extensions and certain urls, but that's all I have in there.
> > > > > >
> > > > > > Please send me an example and I'll give it a try.
> > > > > >
> > > > > > Thanks!
> > > > > >
> > > > > > Vijaya Peters
> > > > > >
> > > > > > -----Original Message-----
> > > > > > From: xiao yang [mailto:yangxiao9901@gmail.com]
> > > > > > Sent: Wednesday, December 09, 2009 1:41 PM
> > > > > > To: nutch-user@lucene.apache.org
> > > > > > Subject: Re: how to force nutch to do a recrawl
> > > > > >
> > > > > > What about the configuration in crawl-urlfilter.txt?
> > > > > >
> > > > > > On Thu, Dec 10, 2009 at 2:29 AM, Peters, Vijaya
> > > > > <Vi...@sra.com>
> > > > > > wrote:
> > > > > > > I tried that too.
> > > > > > > In nutch-site.xml, I added the properties below, but this had no
> > > > > > > effect.
> > > > > > >
> > > > > > > <property>
> > > > > > >  <name>db.default.fetch.interval</name>
> > > > > > >  <value>0</value>
> > > > > > >  <description>(DEPRECATED) The default number of days between
> > > > > > >  re-fetches of a page.  value was 30
> > > > > > >  </description>
> > > > > > > </property>
> > > > > > >
> > > > > > > <property>
> > > > > > >  <name>db.fetch.interval.default</name>
> > > > > > >  <value>3600</value>
> > > > > > >  <description>The default number of seconds between re-fetches of
> > > > > > >  a page (30 days). value was 2592000 (30 days)
> > > > > > >  </description>
> > > > > > > </property>
> > > > > > >
> > > > > > > <property>
> > > > > > >  <name>db.fetch.interval.max</name>
> > > > > > >  <value>3600</value>
> > > > > > >  <description>The maximum number of seconds between re-fetches of
> > > > > > >  a page (90 days). After this period every page in the db will be
> > > > > > >  re-tried, no matter what is its status.  value was 7776000
> > > > > > >  </description>
> > > > > > > </property>
> > > > > > >
> > > > > > > Vijaya Peters
> > > > > > >
> > > > > > > -----Original Message-----
> > > > > > > From: MilleBii [mailto:millebii@gmail.com]
> > > > > > > Sent: Wednesday, December 09, 2009 1:27 PM
> > > > > > > To: nutch-user@lucene.apache.org
> > > > > > > Subject: Re: how to force nutch to do a recrawl
> > > > > > >
> > > > > > > Nutch only recrawls every 30 days by default. So set numberDays
> > > > > > > accordingly and it will recrawl. Read nutch-default.xml for the
> > > > > > > details.
> > > > > > >
> > > > > > > 2009/12/9, xiao yang <ya...@gmail.com>:
> > > > > > > >> What do you mean by "recrawl"?
> > > > > > > >> Does the following command do what you need?
> > > > > > > >> bin/nutch crawl urls -dir crawl -depth 3 -topN 50
> > > > > > > >> Change the destination directory to one different from the last
> > > > > > > >> crawl's.
> > > > > > >>
> > > > > > >> On Thu, Dec 10, 2009 at 1:44 AM, Peters, Vijaya
> > > > > <Vi...@sra.com>
> > > > > > >> wrote:
> > > > > > > >>> I'm running Nutch 1.0 in windows.  How do I force Nutch to do a
> > > > > > > >>> complete recrawl?
> > > > > > >>>
> > > > > > >>>
> > > > > > >>>
> > > > > > >>> thanks,
> > > > > > >>>
> > > > > > >>> - Vijaya
> > > > > > >>>
> > > > > > >>>
> > > > > > >>>
> > > > > > >>> Vijaya Peters
> > > > > > >>>
> > > > > > >>>
> > > > > > >>>
> > > > > > >>>
> > > > > > >>
> > > > > > >
> > > > > > >
> > > > > > > --
> > > > > > > -MilleBii-
> > > > > > >
> > > > > >
> > > > > 
> > > > > 
> > > > > 
> > > > > -- 
> > > > > -MilleBii-

RE: how to force nutch to do a recrawl

Posted by "Peters, Vijaya" <Vi...@sra.com>.
Adam,
I'm using cygwin to run the scripts.  I use EditPlus to edit the files.  But EditPlus won't allow me to edit the crc file.  I'll see if I can ftp the file to a unix machine.
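
As far as I know, the .crc file in the dump folder is only a Hadoop checksum file, so no editor will show anything useful in it; the readable records should sit next to it in a plain-text file named like part-00000. A minimal Python sketch for finding and printing those files, assuming that layout (whole_db is the folder name from the readdb command quoted in this thread; find_dump_files is a hypothetical helper, not part of Nutch):

```python
from pathlib import Path


def find_dump_files(dump_dir):
    """Return the readable dump files in dump_dir, skipping checksum files.

    Assumption: the dump is plain text in files named part-*, with hidden
    .crc checksum files alongside (for example .part-00000.crc).
    """
    root = Path(dump_dir)
    if not root.is_dir():
        return []
    return sorted(
        p for p in root.iterdir()
        if p.is_file() and p.name.startswith("part-") and p.suffix != ".crc"
    )


if __name__ == "__main__":
    # Print every record file found in the dump folder, line by line.
    for part in find_dump_files("whole_db"):
        print(f"== {part} ==")
        with part.open(encoding="utf-8", errors="replace") as fh:
            for line in fh:
                print(line.rstrip("\n"))
```

Run from the directory that contains whole_db; if the folder is missing or holds no part-* files, the script prints nothing.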


Vijaya Peters
SRA International, Inc.
12500 Fair Lakes Circle
Room 3507
Fairfax, VA 22033
Tel:  703-222-9207

www.sra.com



-----Original Message-----
From: BELLINI ADAM [mailto:mbellil@msn.com]
Sent: Thu 12/10/2009 6:43 PM
To: nutch-user@lucene.apache.org
Subject: RE: how to force nutch to do a recrawl
 


But how are you running the sh scripts?
You have to use cygwin to be able to edit linux files.




> Subject: RE: how to force nutch to do a recrawl
> Date: Thu, 10 Dec 2009 16:09:13 -0500
> From: Vijaya_Peters@sra.com
> To: nutch-user@lucene.apache.org
> 
> Adam,
> I'm on windows unfortunately!!  I'm using cygdrive, but it doesn't
> recognize vi.  Any idea for opening it in windows?  Notepad didn't work
> either.
> 
> Vijaya Peters
> 
> -----Original Message-----
> From: BELLINI ADAM [mailto:mbellil@msn.com] 
> Sent: Thursday, December 10, 2009 4:01 PM
> To: nutch-user@lucene.apache.org
> Subject: RE: how to force nutch to do a recrawl
> 
> 
> just use vi or vim
> 
> 
> i use vi to edit the file
> 
> 
> 
> 
> 
> > Subject: RE: how to force nutch to do a recrawl
> > Date: Thu, 10 Dec 2009 15:58:24 -0500
> > From: Vijaya_Peters@sra.com
> > To: nutch-user@lucene.apache.org
> > 
> > Adam,
> > What do I use to open a CRC file? I tried QuickSFV.  Thanks in advance!
> > 
> > Vijaya Peters
> > 
> > -----Original Message-----
> > From: BELLINI ADAM [mailto:mbellil@msn.com] 
> > Sent: Thursday, December 10, 2009 3:48 PM
> > To: nutch-user@lucene.apache.org
> > Subject: RE: how to force nutch to do a recrawl
> > 
> > 
> > It will not dump to the console!
> > whole_db is a folder, and you have to open the file you will find in
> > that folder.
> > 
> > 
> > 
> > > Subject: RE: how to force nutch to do a recrawl
> > > Date: Thu, 10 Dec 2009 14:26:30 -0500
> > > From: Vijaya_Peters@sra.com
> > > To: nutch-user@lucene.apache.org
> > > 
> > > Adam,
> > > I tried running that command and get the following (it created a
> > > whole_db directory, but it's not dumping out the contents to the
> > > console):
> > > 
> > > $ bin/nutch readdb crawl/crawldb/ -dump whole_db
> > > CrawlDb dump: starting
> > > CrawlDb db: crawl/crawldb/
> > > CrawlDb dump: done
> > > 
> > > Vijaya Peters
> > > -----Original Message-----
> > > From: BELLINI ADAM [mailto:mbellil@msn.com] 
> > > Sent: Thursday, December 10, 2009 1:40 PM
> > > To: nutch-user@lucene.apache.org
> > > Subject: RE: how to force nutch to do a recrawl
> > > 
> > > 
> > > hi,
> > > check the fetch time in your crawldb. You can dump the whole crawldb
> > > like this:
> > > 
> > > ./bin/nutch readdb $your_crawl_rep/crawldb/ -dump whole_db
> > > 
> > > entries will look like this:
> > > 
> > > http://www.YOUR_URL_TO_FETCH
> > > Status: 2 (db_fetched)
> > > Fetch time: Thu Dec 10 09:19:18 EST 2009
> > > Modified time: Wed Dec 31 19:00:00 EST 1969
> > > Retries since fetch: 0
> > > Retry interval: 18000 seconds (0 days)
> > > Score: 0.0014977538
> > > Signature: 15cdb66f15eb4dd3e1b16581b78XcfC5c
> > > Metadata: _pst_: success(1), lastModified=0
> > > 
> > > 
> > > As you can see, the next time the page will be fetched is the fetch
> > > time: 'Fetch time: Thu Dec 10 09:19:18 EST 2009'.
> > > Also check the retry interval: it should be your 3600.
> > > 
> > > hope it will help
> > > 
> > > 
> > > > Subject: RE: how to force nutch to do a recrawl
> > > > Date: Wed, 9 Dec 2009 16:06:58 -0500
> > > > From: Vijaya_Peters@sra.com
> > > > To: nutch-user@lucene.apache.org
> > > > 
> > > > Okay.  I'll dig a little deeper.  I saw a few scripts that people had
> > > > created, but I couldn't get them to work.
> > > > 
> > > > Thanks much.
> > > > 
> > > > Vijaya Peters
> > > > 
> > > > -----Original Message-----
> > > > From: MilleBii [mailto:millebii@gmail.com] 
> > > > Sent: Wednesday, December 09, 2009 4:05 PM
> > > > To: nutch-user@lucene.apache.org
> > > > Subject: Re: how to force nutch to do a recrawl
> > > > 
> > > > I don't think you can use the nutch crawl command to do that; it is a
> > > > one-stop-shop command.
> > > > You probably want to use the individual commands.
> > > > Type nutch generate to get the help and you will see the -adddays
> > > > option. Read this page on the wiki to get a feel for how to proceed:
> > > > http://wiki.apache.org/nutch/Crawl
> > > > 
> > > > 2009/12/9 Peters, Vijaya <Vi...@sra.com>
> > > > 
> > > > > I didn't see a setting to override in crawl-urlfilter.  How do I set
> > > > > numberDays? I have regular expressions to include/exclude certain
> > > > > extensions and certain urls, but that's all I have in there.
> > > > >
> > > > > Please send me an example and I'll give it a try.
> > > > >
> > > > > Thanks!
> > > > >
> > > > > Vijaya Peters
> > > > >
> > > > > -----Original Message-----
> > > > > From: xiao yang [mailto:yangxiao9901@gmail.com]
> > > > > Sent: Wednesday, December 09, 2009 1:41 PM
> > > > > To: nutch-user@lucene.apache.org
> > > > > Subject: Re: how to force nutch to do a recrawl
> > > > >
> > > > > What about the configuration in crawl-urlfilter.txt?
> > > > >
> > > > > On Thu, Dec 10, 2009 at 2:29 AM, Peters, Vijaya
> > > > <Vi...@sra.com>
> > > > > wrote:
> > > > > > I tried that too.
> > > > > > In nutch-site.xml, I added the properties below, but this had no
> > > > > > effect.
> > > > > >
> > > > > > <property>
> > > > > >  <name>db.default.fetch.interval</name>
> > > > > >  <value>0</value>
> > > > > >  <description>(DEPRECATED) The default number of days between
> > > > > >  re-fetches of a page.  value was 30
> > > > > >  </description>
> > > > > > </property>
> > > > > >
> > > > > > <property>
> > > > > >  <name>db.fetch.interval.default</name>
> > > > > >  <value>3600</value>
> > > > > >  <description>The default number of seconds between re-fetches of
> > > > > >  a page (30 days). value was 2592000 (30 days)
> > > > > >  </description>
> > > > > > </property>
> > > > > >
> > > > > > <property>
> > > > > >  <name>db.fetch.interval.max</name>
> > > > > >  <value>3600</value>
> > > > > >  <description>The maximum number of seconds between re-fetches of
> > > > > >  a page (90 days). After this period every page in the db will be
> > > > > >  re-tried, no matter what is its status.  value was 7776000
> > > > > >  </description>
> > > > > > </property>
> > > > > >
> > > > > > Vijaya Peters
> > > > > >
> > > > > > -----Original Message-----
> > > > > > From: MilleBii [mailto:millebii@gmail.com]
> > > > > > Sent: Wednesday, December 09, 2009 1:27 PM
> > > > > > To: nutch-user@lucene.apache.org
> > > > > > Subject: Re: how to force nutch to do a recrawl
> > > > > >
> > > > > > Nutch only recrawls every 30 days by default. So set numberDays
> > > > > > accordingly and it will recrawl. Read nutch-default.xml for the
> > > > > > details.
> > > > > >
> > > > > > 2009/12/9, xiao yang <ya...@gmail.com>:
> > > > > > >> What do you mean by "recrawl"?
> > > > > > >> Does the following command do what you need?
> > > > > > >> bin/nutch crawl urls -dir crawl -depth 3 -topN 50
> > > > > > >> Change the destination directory to one different from the last
> > > > > > >> crawl's.
> > > > > >>
> > > > > >> On Thu, Dec 10, 2009 at 1:44 AM, Peters, Vijaya
> > > > <Vi...@sra.com>
> > > > > >> wrote:
> > > > > > >>> I'm running Nutch 1.0 in windows.  How do I force Nutch to do a
> > > > > > >>> complete recrawl?
> > > > > >>>
> > > > > >>>
> > > > > >>>
> > > > > >>> thanks,
> > > > > >>>
> > > > > >>> - Vijaya
> > > > > >>>
> > > > > >>>
> > > > > >>>
> > > > > >>> Vijaya Peters
> > > > > >>>
> > > > > >>>
> > > > > >>>
> > > > > >>>
> > > > > >>
> > > > > >
> > > > > >
> > > > > > --
> > > > > > -MilleBii-
> > > > > >
> > > > >
> > > > 
> > > > 
> > > > 
> > > > -- 
> > > > -MilleBii-


RE: how to force nutch to do a recrawl

Posted by BELLINI ADAM <mb...@msn.com>.

But how are you running the sh scripts?
You have to use cygwin to be able to edit linux files.
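
If opening the dump in an editor stays awkward on Windows, the check suggested earlier in this thread (compare each entry's fetch time with the current time) can also be scripted. The sketch below is only an illustration under stated assumptions: it parses records in the layout quoted in this thread, ignores the time zone token, and models -adddays as simply moving the comparison time forward by that many days, which is my understanding of the generator and not taken from the Nutch source. The names parse_fetch_time and due_urls are hypothetical.

```python
from datetime import datetime, timedelta

# Month abbreviations as they appear in the dump, mapped to month numbers.
MONTHS = {m: i for i, m in enumerate(
    "Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec".split(), start=1)}


def parse_fetch_time(text):
    """Parse 'Thu Dec 10 09:19:18 EST 2009' into a naive datetime.

    The zone token (EST) is ignored, so 'now' must be given in that zone.
    """
    _weekday, month, day, clock, _zone, year = text.split()
    hour, minute, second = (int(x) for x in clock.split(":"))
    return datetime(int(year), MONTHS[month], int(day), hour, minute, second)


def due_urls(dump_text, now, add_days=0):
    """Return URLs whose fetch time is not later than now + add_days."""
    cutoff = now + timedelta(days=add_days)
    due, url = [], None
    for line in dump_text.splitlines():
        line = line.strip()
        if line.startswith(("http://", "https://")):
            url = line.split()[0]          # URL line starts a new record
        elif line.startswith("Fetch time:") and url is not None:
            if parse_fetch_time(line[len("Fetch time:"):]) <= cutoff:
                due.append(url)
    return due


# Sample record copied from the entry quoted in this thread.
SAMPLE = """\
http://www.YOUR_URL_TO_FETCH
Status: 2 (db_fetched)
Fetch time: Thu Dec 10 09:19:18 EST 2009
Retry interval: 18000 seconds (0 days)
"""

now = datetime(2009, 12, 9, 16, 0, 0)      # assumed "current" time, same zone
print(due_urls(SAMPLE, now))               # entry is not due yet
print(due_urls(SAMPLE, now, add_days=1))   # due once one day is added
```

With the sample entry, nothing is due at the assumed current time, which would match a generator reporting 0 records selected; adding one day makes the entry eligible.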




> Subject: RE: how to force nutch to do a recrawl
> Date: Thu, 10 Dec 2009 16:09:13 -0500
> From: Vijaya_Peters@sra.com
> To: nutch-user@lucene.apache.org
> 
> Adam,
> I'm on windows unfortunately!!  I'm using cygdrive, but it doesn't
> recognize vi.  Any idea for opening it in windows?  Notepad didn't work
> either.
> 
> Vijaya Peters
> 
> -----Original Message-----
> From: BELLINI ADAM [mailto:mbellil@msn.com] 
> Sent: Thursday, December 10, 2009 4:01 PM
> To: nutch-user@lucene.apache.org
> Subject: RE: how to force nutch to do a recrawl
> 
> 
> just use vi or vim
> 
> 
> i use vi to edit the file
> 
> 
> 
> 
> 
> > Subject: RE: how to force nutch to do a recrawl
> > Date: Thu, 10 Dec 2009 15:58:24 -0500
> > From: Vijaya_Peters@sra.com
> > To: nutch-user@lucene.apache.org
> > 
> > Adam,
> > What do I use to open a CRC file? I tried QuickSFV.  Thanks in advance!
> > 
> > Vijaya Peters
> > 
> > -----Original Message-----
> > From: BELLINI ADAM [mailto:mbellil@msn.com] 
> > Sent: Thursday, December 10, 2009 3:48 PM
> > To: nutch-user@lucene.apache.org
> > Subject: RE: how to force nutch to do a recrawl
> > 
> > 
> > It will not dump to the console!
> > whole_db is a folder, and you have to open the file you will find in
> > that folder.
> > 
> > 
> > 
> > > Subject: RE: how to force nutch to do a recrawl
> > > Date: Thu, 10 Dec 2009 14:26:30 -0500
> > > From: Vijaya_Peters@sra.com
> > > To: nutch-user@lucene.apache.org
> > > 
> > > Adam,
> > > I tried running that command and get the following (it created a
> > > whole_db directory, but it's not dumping out the contents to the
> > > console):
> > > 
> > > $ bin/nutch readdb crawl/crawldb/ -dump whole_db
> > > CrawlDb dump: starting
> > > CrawlDb db: crawl/crawldb/
> > > CrawlDb dump: done
> > > 
> > > Vijaya Peters
> > > -----Original Message-----
> > > From: BELLINI ADAM [mailto:mbellil@msn.com] 
> > > Sent: Thursday, December 10, 2009 1:40 PM
> > > To: nutch-user@lucene.apache.org
> > > Subject: RE: how to force nutch to do a recrawl
> > > 
> > > 
> > > hi,
> > > check the fetch time in your crawldb. You can dump the whole crawldb
> > > like this:
> > > 
> > > ./bin/nutch readdb $your_crawl_rep/crawldb/ -dump whole_db
> > > 
> > > entries will look like this:
> > > 
> > > http://www.YOUR_URL_TO_FETCH
> > > Status: 2 (db_fetched)
> > > Fetch time: Thu Dec 10 09:19:18 EST 2009
> > > Modified time: Wed Dec 31 19:00:00 EST 1969
> > > Retries since fetch: 0
> > > Retry interval: 18000 seconds (0 days)
> > > Score: 0.0014977538
> > > Signature: 15cdb66f15eb4dd3e1b16581b78XcfC5c
> > > Metadata: _pst_: success(1), lastModified=0
> > > 
> > > 
> > > As you can see, the next time the page will be fetched is the fetch
> > > time: 'Fetch time: Thu Dec 10 09:19:18 EST 2009'.
> > > Also check the retry interval: it should be your 3600.
> > > 
> > > hope it will help
> > > 
> > > 
> > > > Subject: RE: how to force nutch to do a recrawl
> > > > Date: Wed, 9 Dec 2009 16:06:58 -0500
> > > > From: Vijaya_Peters@sra.com
> > > > To: nutch-user@lucene.apache.org
> > > > 
> > > > Okay.  I'll dig a little deeper.  I saw a few scripts that people had
> > > > created, but I couldn't get them to work.
> > > > 
> > > > Thanks much.
> > > > 
> > > > Vijaya Peters
> > > > 
> > > > -----Original Message-----
> > > > From: MilleBii [mailto:millebii@gmail.com] 
> > > > Sent: Wednesday, December 09, 2009 4:05 PM
> > > > To: nutch-user@lucene.apache.org
> > > > Subject: Re: how to force nutch to do a recrawl
> > > > 
> > > > I don't that you can use nutch crawl command to do that, this is a
> > one
> > > > stop
> > > > shop command.
> > > > You probably want to use individual commands.
> > > > Type nutch generate to get the help and you will see the option
> > > > -adddays,
> > > > read that page on the wiki to get a feel how you should do:
> > > > http://wiki.apache.org/nutch/Crawl
> > > > 
> > > > 2009/12/9 Peters, Vijaya <Vi...@sra.com>
> > > > 
> > > > > I didn't see a setting to override in crawl-urlfilter.  How do I
> > set
> > > > > numberDays? I have regular expressions to include/exclude
> certain
> > > > extensions
> > > > > and certain urls, but that's all I have in there.
> > > > >
> > > > > Please send me an example and I'll give it a try.
> > > > >
> > > > > Thanks!
> > > > >
> > > > > Vijaya Peters
> > > > > SRA International, Inc.
> > > > > 4350 Fair Lakes Court North
> > > > > Room 4004
> > > > > Fairfax, VA  22033
> > > > > Tel:  703-502-1184
> > > > >
> > > > > www.sra.com
> > > > > Named to FORTUNE's "100 Best Companies to Work For" list for 10
> > > > consecutive
> > > > > years
> > > > > P Please consider the environment before printing this e-mail
> > > > > This electronic message transmission contains information from
> SRA
> > > > > International, Inc. which may be confidential, privileged or
> > > > proprietary.
> > > > >  The information is intended for the use of the individual or
> > entity
> > > > named
> > > > > above.  If you are not the intended recipient, be aware that any
> > > > disclosure,
> > > > > copying, distribution, or use of the contents of this
> information
> > is
> > > > > strictly prohibited.  If you have received this electronic
> > > information
> > > > in
> > > > > error, please notify us immediately by telephone at
> 866-584-2143.
> > > > >
> > > > > -----Original Message-----
> > > > > From: xiao yang [mailto:yangxiao9901@gmail.com]
> > > > > Sent: Wednesday, December 09, 2009 1:41 PM
> > > > > To: nutch-user@lucene.apache.org
> > > > > Subject: Re: how to force nutch to do a recrawl
> > > > >
> > > > > What about the configuration in crawl-urlfilter.txt?
> > > > >
> > > > > On Thu, Dec 10, 2009 at 2:29 AM, Peters, Vijaya
> > > > <Vi...@sra.com>
> > > > > wrote:
> > > > > > I tried that too.
> > > > > > in Nutch-site.xml, I added in the below, but this had no
> effect.
> > > > > >
> > > > > > <property>
> > > > > >  <name>db.default.fetch.interval</name>
> > > > > >  <value>0</value>
> > > > > >  <description>(DEPRECATED) The default number of days between
> > > > re-fetches
> > > > > of a page.  value was 30
> > > > > >  </description>
> > > > > > </property>
> > > > > >
> > > > > > <property>
> > > > > >  <name>db.fetch.interval.default</name>
> > > > > >  <value>3600</value>
> > > > > >  <description>The default number of seconds between re-fetches
> > of
> > > a
> > > > page
> > > > > (30 days). value was 2592000 (30 days)
> > > > > >  </description>
> > > > > > </property>
> > > > > >
> > > > > > <property>
> > > > > >  <name>db.fetch.interval.max</name>
> > > > > >  <value>3600</value>
> > > > > >  <description>The maximum number of seconds between re-fetches
> > of
> > > a
> > > > page
> > > > > >  (90 days). After this period every page in the db will be
> > > re-tried,
> > > > no
> > > > > >  matter what is its status.  value was 7776000
> > > > > >  </description>
> > > > > > </property>
> > > > > >
> > > > > > Vijaya Peters
> > > > > > SRA International, Inc.
> > > > > > 4350 Fair Lakes Court North
> > > > > > Room 4004
> > > > > > Fairfax, VA  22033
> > > > > > Tel:  703-502-1184
> > > > > >
> > > > > > www.sra.com
> > > > > > Named to FORTUNE's "100 Best Companies to Work For" list for
> 10
> > > > > consecutive years
> > > > > > P Please consider the environment before printing this e-mail
> > > > > > This electronic message transmission contains information from
> > SRA
> > > > > International, Inc. which may be confidential, privileged or
> > > > proprietary.
> > > > >  The information is intended for the use of the individual or
> > entity
> > > > named
> > > > > above.  If you are not the intended recipient, be aware that any
> > > > disclosure,
> > > > > copying, distribution, or use of the contents of this
> information
> > is
> > > > > strictly prohibited.  If you have received this electronic
> > > information
> > > > in
> > > > > error, please notify us immediately by telephone at
> 866-584-2143.
> > > > > >
> > > > > > -----Original Message-----
> > > > > > From: MilleBii [mailto:millebii@gmail.com]
> > > > > > Sent: Wednesday, December 09, 2009 1:27 PM
> > > > > > To: nutch-user@lucene.apache.org
> > > > > > Subject: Re: how to force nutch to do a recrawl
> > > > > >
> > > > > > Nutch only recrawl every 30 days by default. So you set the
> > > > numberDays
> > > > > > adequately and it wil recrawl read nutch-default.xml to get
> the
> > > > > > details
> > > > > >
> > > > > > 2009/12/9, xiao yang <ya...@gmail.com>:
> > > > > >> What do you mean by "recrawl"?
> > > > > >> Does the following command meets what you need?
> > > > > >> bin/nutch crawl urls -dir crawl -depth 3 -topN 50
> > > > > >> Change the destination directory to a different one with the
> > last
> > > > crawl.
> > > > > >>
> > > > > >> On Thu, Dec 10, 2009 at 1:44 AM, Peters, Vijaya
> > > > <Vi...@sra.com>
> > > > > >> wrote:
> > > > > >>> I'm running Nutch 1.0 in windows.  How do I force Nutch to
> do
> > a
> > > > > complete
> > > > > >>> recrawl?
> > > > > >>>
> > > > > >>>
> > > > > >>>
> > > > > >>> thanks,
> > > > > >>>
> > > > > >>> - Vijaya
> > > > > >>>
> > > > > >>>
> > > > > >>>
> > > > > >>> Vijaya Peters
> > > > > >>> SRA International, Inc.
> > > > > >>> 4350 Fair Lakes Court North
> > > > > >>> Room 4004
> > > > > >>> Fairfax, VA  22033
> > > > > >>> Tel:  703-502-1184
> > > > > >>>
> > > > > >>> www.sra.com <http://www.sra.com/>
> > > > > >>> Named to FORTUNE's "100 Best Companies to Work For" list for
> > 10
> > > > > >>> consecutive years
> > > > > >>>
> > > > > >>> P Please consider the environment before printing this
> e-mail
> > > > > >>>
> > > > > >>> This electronic message transmission contains information
> from
> > > SRA
> > > > > >>> International, Inc. which may be confidential, privileged or
> > > > > >>> proprietary.  The information is intended for the use of the
> > > > individual
> > > > > >>> or entity named above.  If you are not the intended
> recipient,
> > > be
> > > > aware
> > > > > >>> that any disclosure, copying, distribution, or use of the
> > > contents
> > > > of
> > > > > >>> this information is strictly prohibited.  If you have
> received
> > > > this
> > > > > >>> electronic information in error, please notify us
> immediately
> > by
> > > > > >>> telephone at 866-584-2143.
> > > > > >>>
> > > > > >>>
> > > > > >>>
> > > > > >>>
> > > > > >>
> > > > > >
> > > > > >
> > > > > > --
> > > > > > -MilleBii-
> > > > > >
> > > > >
> > > > 
> > > > 
> > > > 
> > > > -- 
> > > > -MilleBii-

RE: how to force nutch to do a recrawl

Posted by "Peters, Vijaya" <Vi...@sra.com>.
Adam,
I'm on Windows, unfortunately! I'm using cygdrive, but it doesn't
recognize vi. Any idea how to open it in Windows? Notepad didn't work
either.



RE: how to force nutch to do a recrawl

Posted by BELLINI ADAM <mb...@msn.com>.
Just use vi or vim.

I use vi to edit the file.
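
If vi is not available in your Cygwin shell, something along these lines might work instead. It is only a sketch and rests on assumptions that are not confirmed in this thread: that the readable dump is the part-* file (for example part-00000) inside whole_db, that the .crc file next to it is just a checksum and not the data, and that Notepad fails because the file uses Unix line endings. show_dump and to_crlf are made-up helper names, not Nutch commands.

```shell
#!/bin/sh
# Sketch only. Assumptions (not confirmed in this thread):
#  - whole_db is the directory written by
#    `bin/nutch readdb <crawldb> -dump whole_db`
#  - the readable text is in file(s) named part-* inside it
#  - dot-files ending in .crc are checksums, not the dump itself

# Print the first lines of every part-* file in a dump directory.
show_dump() {
  dump_dir="$1"
  max_lines="${2:-40}"
  for f in "$dump_dir"/part-*; do
    [ -f "$f" ] || continue        # glob matched nothing, skip
    echo "== $f =="
    head -n "$max_lines" "$f"
  done
}

# Write a copy with Windows (CRLF) line endings so that an editor
# expecting CRLF shows one field per line.
to_crlf() {
  awk '{ sub(/\r$/, ""); printf "%s\r\n", $0 }' "$1" > "$2"
}

# Possible usage from the Nutch directory:
#   show_dump whole_db 20
#   to_crlf whole_db/part-00000 whole_db.txt
```

The entries printed by show_dump should have the same layout as the crawldb entry quoted earlier in the thread (Status, Fetch time, Retry interval and so on), so the fetch time and retry interval can be read straight from the console.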





> Subject: RE: how to force nutch to do a recrawl
> Date: Thu, 10 Dec 2009 15:58:24 -0500
> From: Vijaya_Peters@sra.com
> To: nutch-user@lucene.apache.org
> 
> Adam,
> What do I use to open a CRC file? I tried QuickSFV.  Thanks in advance!
> 
> 
> -----Original Message-----
> From: BELLINI ADAM [mailto:mbellil@msn.com] 
> Sent: Thursday, December 10, 2009 3:48 PM
> To: nutch-user@lucene.apache.org
> Subject: RE: how to force nutch to do a recrawl
> 
> 
> it will not dump to the console !
> whole_db is a folder and you have to edit the file you will find in this
> folder
> 
> 
> 
> > Subject: RE: how to force nutch to do a recrawl
> > Date: Thu, 10 Dec 2009 14:26:30 -0500
> > From: Vijaya_Peters@sra.com
> > To: nutch-user@lucene.apache.org
> > 
> > Adam,
> > I tried running that command and get the following (it created a
> > whole_db directory, but it's not dumping out the contents to the
> > console):
> > 
> > $ bin/nutch readdb crawl/crawldb/ -dump whole_db
> > CrawlDb dump: starting
> > CrawlDb db: crawl/crawldb/
> > CrawlDb dump: done
> > 
> > Vijaya Peters
> > SRA International, Inc.
> > 4350 Fair Lakes Court North
> > Room 4004
> > Fairfax, VA  22033
> > Tel:  703-502-1184
> > 
> > www.sra.com
> > Named to FORTUNE's "100 Best Companies to Work For" list for 10
> > consecutive years
> > P Please consider the environment before printing this e-mail
> > This electronic message transmission contains information from SRA
> > International, Inc. which may be confidential, privileged or
> > proprietary.  The information is intended for the use of the
> individual
> > or entity named above.  If you are not the intended recipient, be
> aware
> > that any disclosure, copying, distribution, or use of the contents of
> > this information is strictly prohibited.  If you have received this
> > electronic information in error, please notify us immediately by
> > telephone at 866-584-2143.
> > -----Original Message-----
> > From: BELLINI ADAM [mailto:mbellil@msn.com] 
> > Sent: Thursday, December 10, 2009 1:40 PM
> > To: nutch-user@lucene.apache.org
> > Subject: RE: how to force nutch to do a recrawl
> > 
> > 
> > hi,
> > check the fetch time in your crawldb...you can dump all the crawldb
> like
> > this:
> > 
> > ./bin/nutch readdb $your_crawl_rep/crawldb/ -dump whole_db
> > 
> > entries will look like this:
> > 
> > http://www.YOUR_URL_TO_FETCH
> > Status: 2 (db_fetched)
> > Fetch time: Thu Dec 10 09:19:18 EST 2009
> > Modified time: Wed Dec 31 19:00:00 EST 1969
> > Retries since fetch: 0
> > Retry interval: 18000 seconds (0 days)
> > Score: 0.0014977538
> > Signature: 15cdb66f15eb4dd3e1b16581b78XcfC5c
> > Metadata: _pst_: success(1), lastModified=0
> > 
> > 
> > as you see the next time the page will be fetched is in fetch time  :
> > 'Fetch time: Thu Dec 10 09:19:18 EST 2009'
> > and check the rety interval : it should be your 3600. 
> > 
> > hope it will help
> > 
> > 
> > > Subject: RE: how to force nutch to do a recrawl
> > > Date: Wed, 9 Dec 2009 16:06:58 -0500
> > > From: Vijaya_Peters@sra.com
> > > To: nutch-user@lucene.apache.org
> > > 
> > > Okay.  I'll dig a little deeper.  I saw a few scripts that people
> had
> > > created, but I couldn't get them to work.
> > > 
> > > Thanks much.
> > > 
> > > Vijaya Peters
> > > SRA International, Inc.
> > > 4350 Fair Lakes Court North
> > > Room 4004
> > > Fairfax, VA  22033
> > > Tel:  703-502-1184
> > > 
> > > www.sra.com
> > > Named to FORTUNE's "100 Best Companies to Work For" list for 10
> > > consecutive years
> > > P Please consider the environment before printing this e-mail
> > > This electronic message transmission contains information from SRA
> > > International, Inc. which may be confidential, privileged or
> > > proprietary.  The information is intended for the use of the
> > individual
> > > or entity named above.  If you are not the intended recipient, be
> > aware
> > > that any disclosure, copying, distribution, or use of the contents
> of
> > > this information is strictly prohibited.  If you have received this
> > > electronic information in error, please notify us immediately by
> > > telephone at 866-584-2143.
> > > 
> > > -----Original Message-----
> > > From: MilleBii [mailto:millebii@gmail.com] 
> > > Sent: Wednesday, December 09, 2009 4:05 PM
> > > To: nutch-user@lucene.apache.org
> > > Subject: Re: how to force nutch to do a recrawl
> > > 
> > > I don't that you can use nutch crawl command to do that, this is a
> one
> > > stop
> > > shop command.
> > > You probably want to use individual commands.
> > > Type nutch generate to get the help and you will see the option
> > > -adddays,
> > > read that page on the wiki to get a feel how you should do:
> > > http://wiki.apache.org/nutch/Crawl
> > > 
> > > 2009/12/9 Peters, Vijaya <Vi...@sra.com>
> > > 
> > > > I didn't see a setting to override in crawl-urlfilter.  How do I
> set
> > > > numberDays? I have regular expressions to include/exclude certain
> > > extensions
> > > > and certain urls, but that's all I have in there.
> > > >
> > > > Please send me an example and I'll give it a try.
> > > >
> > > > Thanks!
> > > >
> > > > Vijaya Peters
> > > > SRA International, Inc.
> > > > 4350 Fair Lakes Court North
> > > > Room 4004
> > > > Fairfax, VA  22033
> > > > Tel:  703-502-1184
> > > >
> > > > www.sra.com
> > > > Named to FORTUNE's "100 Best Companies to Work For" list for 10
> > > consecutive
> > > > years
> > > > P Please consider the environment before printing this e-mail
> > > > This electronic message transmission contains information from SRA
> > > > International, Inc. which may be confidential, privileged or
> > > proprietary.
> > > >  The information is intended for the use of the individual or
> entity
> > > named
> > > > above.  If you are not the intended recipient, be aware that any
> > > disclosure,
> > > > copying, distribution, or use of the contents of this information
> is
> > > > strictly prohibited.  If you have received this electronic
> > information
> > > in
> > > > error, please notify us immediately by telephone at 866-584-2143.
> > > >
> > > > -----Original Message-----
> > > > From: xiao yang [mailto:yangxiao9901@gmail.com]
> > > > Sent: Wednesday, December 09, 2009 1:41 PM
> > > > To: nutch-user@lucene.apache.org
> > > > Subject: Re: how to force nutch to do a recrawl
> > > >
> > > > What about the configuration in crawl-urlfilter.txt?
> > > >
> > > > On Thu, Dec 10, 2009 at 2:29 AM, Peters, Vijaya
> > > <Vi...@sra.com>
> > > > wrote:
> > > > > I tried that too.
> > > > > in Nutch-site.xml, I added in the below, but this had no effect.
> > > > >
> > > > > <property>
> > > > >  <name>db.default.fetch.interval</name>
> > > > >  <value>0</value>
> > > > >  <description>(DEPRECATED) The default number of days between
> > > re-fetches
> > > > of a page.  value was 30
> > > > >  </description>
> > > > > </property>
> > > > >
> > > > > <property>
> > > > >  <name>db.fetch.interval.default</name>
> > > > >  <value>3600</value>
> > > > >  <description>The default number of seconds between re-fetches
> of
> > a
> > > page
> > > > (30 days). value was 2592000 (30 days)
> > > > >  </description>
> > > > > </property>
> > > > >
> > > > > <property>
> > > > >  <name>db.fetch.interval.max</name>
> > > > >  <value>3600</value>
> > > > >  <description>The maximum number of seconds between re-fetches
> of
> > a
> > > page
> > > > >  (90 days). After this period every page in the db will be
> > re-tried,
> > > no
> > > > >  matter what is its status.  value was 7776000
> > > > >  </description>
> > > > > </property>
> > > > >

RE: how to force nutch to do a recrawl

Posted by "Peters, Vijaya" <Vi...@sra.com>.
Adam,
What do I use to open a CRC file? I tried QuickSFV.  Thanks in advance!


RE: how to force nutch to do a recrawl

Posted by BELLINI ADAM <mb...@msn.com>.
It will not dump to the console!
whole_db is a folder; you have to open the file you will find in this folder with a text editor.
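
For reference, a minimal sketch of reading the dump from a shell (for example Cygwin on Windows). It assumes the dump went to whole_db as in the command above, and that the text output sits in a file named part-00000, which is the usual name when running in local mode. The .crc file next to it is only a Hadoop checksum and is not meant to be opened. The helper name show_dump is made up for this example and is not part of Nutch.

```shell
# show_dump is a hypothetical helper, not part of Nutch.
# Prints every plain-text part file found in a readdb dump directory.
# Hidden checksum files such as .part-00000.crc are not matched by the glob.
show_dump() {
  dump_dir=$1
  for part in "$dump_dir"/part-*; do
    # Skip the unexpanded pattern when the directory has no part files.
    [ -f "$part" ] || continue
    cat "$part"
  done
}

# Usage (assumes the dump directory from the thread exists):
# show_dump whole_db | less
```

Opening part-00000 directly in any text editor should work just as well; the function only saves looking up the file name.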




RE: how to force nutch to do a recrawl

Posted by "Peters, Vijaya" <Vi...@sra.com>.
Adam,
I tried running that command and got the following (it created a
whole_db directory, but it's not dumping the contents to the
console):

$ bin/nutch readdb crawl/crawldb/ -dump whole_db
CrawlDb dump: starting
CrawlDb db: crawl/crawldb/
CrawlDb dump: done

-----Original Message-----
From: BELLINI ADAM [mailto:mbellil@msn.com] 
Sent: Thursday, December 10, 2009 1:40 PM
To: nutch-user@lucene.apache.org
Subject: RE: how to force nutch to do a recrawl


hi,
check the fetch time in your crawldb...you can dump all the crawldb like
this:

./bin/nutch readdb $your_crawl_rep/crawldb/ -dump whole_db

entries will look like this:

http://www.YOUR_URL_TO_FETCH
Status: 2 (db_fetched)
Fetch time: Thu Dec 10 09:19:18 EST 2009
Modified time: Wed Dec 31 19:00:00 EST 1969
Retries since fetch: 0
Retry interval: 18000 seconds (0 days)
Score: 0.0014977538
Signature: 15cdb66f15eb4dd3e1b16581b78XcfC5c
Metadata: _pst_: success(1), lastModified=0


as you see the next time the page will be fetched is in fetch time  :
'Fetch time: Thu Dec 10 09:19:18 EST 2009'
and check the rety interval : it should be your 3600. 

hope it will help


> Subject: RE: how to force nutch to do a recrawl
> Date: Wed, 9 Dec 2009 16:06:58 -0500
> From: Vijaya_Peters@sra.com
> To: nutch-user@lucene.apache.org
> 
> Okay.  I'll dig a little deeper.  I saw a few scripts that people had
> created, but I couldn't get them to work.
> 
> Thanks much.
> 
> Vijaya Peters
> SRA International, Inc.
> 4350 Fair Lakes Court North
> Room 4004
> Fairfax, VA  22033
> Tel:  703-502-1184
> 
> www.sra.com
> Named to FORTUNE's "100 Best Companies to Work For" list for 10
> consecutive years
> P Please consider the environment before printing this e-mail
> This electronic message transmission contains information from SRA
> International, Inc. which may be confidential, privileged or
> proprietary.  The information is intended for the use of the
individual
> or entity named above.  If you are not the intended recipient, be
aware
> that any disclosure, copying, distribution, or use of the contents of
> this information is strictly prohibited.  If you have received this
> electronic information in error, please notify us immediately by
> telephone at 866-584-2143.
> 
> -----Original Message-----
> From: MilleBii [mailto:millebii@gmail.com] 
> Sent: Wednesday, December 09, 2009 4:05 PM
> To: nutch-user@lucene.apache.org
> Subject: Re: how to force nutch to do a recrawl
> 
> I don't that you can use nutch crawl command to do that, this is a one
> stop
> shop command.
> You probably want to use individual commands.
> Type nutch generate to get the help and you will see the option
> -adddays,
> read that page on the wiki to get a feel how you should do:
> http://wiki.apache.org/nutch/Crawl
> 
> 2009/12/9 Peters, Vijaya <Vi...@sra.com>
> 
> > I didn't see a setting to override in crawl-urlfilter.  How do I set
> > numberDays? I have regular expressions to include/exclude certain
> extensions
> > and certain urls, but that's all I have in there.
> >
> > Please send me an example and I'll give it a try.
> >
> > Thanks!
> >
> > Vijaya Peters
> > SRA International, Inc.
> > 4350 Fair Lakes Court North
> > Room 4004
> > Fairfax, VA  22033
> > Tel:  703-502-1184
> >
> > www.sra.com
> > Named to FORTUNE's "100 Best Companies to Work For" list for 10
> consecutive
> > years
> > P Please consider the environment before printing this e-mail
> > This electronic message transmission contains information from SRA
> > International, Inc. which may be confidential, privileged or
> proprietary.
> >  The information is intended for the use of the individual or entity
> named
> > above.  If you are not the intended recipient, be aware that any
> disclosure,
> > copying, distribution, or use of the contents of this information is
> > strictly prohibited.  If you have received this electronic
information
> in
> > error, please notify us immediately by telephone at 866-584-2143.
> >
> > -----Original Message-----
> > From: xiao yang [mailto:yangxiao9901@gmail.com]
> > Sent: Wednesday, December 09, 2009 1:41 PM
> > To: nutch-user@lucene.apache.org
> > Subject: Re: how to force nutch to do a recrawl
> >
> > What about the configuration in crawl-urlfilter.txt?
> >
> > On Thu, Dec 10, 2009 at 2:29 AM, Peters, Vijaya
> <Vi...@sra.com>
> > wrote:
> > > I tried that too.
> > > in Nutch-site.xml, I added in the below, but this had no effect.
> > >
> > > <property>
> > >  <name>db.default.fetch.interval</name>
> > >  <value>0</value>
> > >  <description>(DEPRECATED) The default number of days between
> re-fetches
> > of a page.  value was 30
> > >  </description>
> > > </property>
> > >
> > > <property>
> > >  <name>db.fetch.interval.default</name>
> > >  <value>3600</value>
> > >  <description>The default number of seconds between re-fetches of
a
> page
> > (30 days). value was 2592000 (30 days)
> > >  </description>
> > > </property>
> > >
> > > <property>
> > >  <name>db.fetch.interval.max</name>
> > >  <value>3600</value>
> > >  <description>The maximum number of seconds between re-fetches of
a
> page
> > >  (90 days). After this period every page in the db will be
re-tried,
> no
> > >  matter what is its status.  value was 7776000
> > >  </description>
> > > </property>
> > >
> > > Vijaya Peters
> > > SRA International, Inc.
> > > 4350 Fair Lakes Court North
> > > Room 4004
> > > Fairfax, VA  22033
> > > Tel:  703-502-1184
> > >
> > > www.sra.com
> > > Named to FORTUNE's "100 Best Companies to Work For" list for 10
> > consecutive years
> > > P Please consider the environment before printing this e-mail
> > > This electronic message transmission contains information from SRA
> > International, Inc. which may be confidential, privileged or
> proprietary.
> >  The information is intended for the use of the individual or entity
> named
> > above.  If you are not the intended recipient, be aware that any
> disclosure,
> > copying, distribution, or use of the contents of this information is
> > strictly prohibited.  If you have received this electronic
information
> in
> > error, please notify us immediately by telephone at 866-584-2143.
> > >
> > > -----Original Message-----
> > > From: MilleBii [mailto:millebii@gmail.com]
> > > Sent: Wednesday, December 09, 2009 1:27 PM
> > > To: nutch-user@lucene.apache.org
> > > Subject: Re: how to force nutch to do a recrawl
> > >
> > > Nutch only recrawl every 30 days by default. So you set the
> numberDays
> > > adequately and it wil recrawl read nutch-default.xml to get the
> > > details
> > >
> > > 2009/12/9, xiao yang <ya...@gmail.com>:
> > >> What do you mean by "recrawl"?
> > >> Does the following command meets what you need?
> > >> bin/nutch crawl urls -dir crawl -depth 3 -topN 50
> > >> Change the destination directory to a different one with the last
> crawl.
> > >>
> > >> On Thu, Dec 10, 2009 at 1:44 AM, Peters, Vijaya
> <Vi...@sra.com>
> > >> wrote:
> > >>> I'm running Nutch 1.0 in windows.  How do I force Nutch to do a
> > complete
> > >>> recrawl?
> > >>>
> > >>>
> > >>>
> > >>> thanks,
> > >>>
> > >>> - Vijaya
> > >>
> > >
> > >
> > > --
> > > -MilleBii-
> > >
> >
> 
> 
> 
> -- 
> -MilleBii-
 		 	   		  

RE: how to force nutch to do a recrawl

Posted by BELLINI ADAM <mb...@msn.com>.
Hi,
Check the fetch time in your crawldb. You can dump the whole crawldb like this:

./bin/nutch readdb $your_crawl_rep/crawldb/ -dump whole_db

entries will look like this:

http://www.YOUR_URL_TO_FETCH
Status: 2 (db_fetched)
Fetch time: Thu Dec 10 09:19:18 EST 2009
Modified time: Wed Dec 31 19:00:00 EST 1969
Retries since fetch: 0
Retry interval: 18000 seconds (0 days)
Score: 0.0014977538
Signature: 15cdb66f15eb4dd3e1b16581b78XcfC5c
Metadata: _pst_: success(1), lastModified=0


As you can see, the next time the page will be fetched is given by the fetch time: 'Fetch time: Thu Dec 10 09:19:18 EST 2009'.
Also check the retry interval: it should match the 3600 seconds you configured.
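
For a large dump, this check can be scripted instead of read by eye. The following is only a minimal Python sketch and not part of Nutch. It assumes the dump entries look like the example above, it drops the time zone abbreviation (so `now` has to be given in the same zone as the dump), and the names parse_dump and due_urls are made up for illustration.

```python
from datetime import datetime

def parse_dump(text):
    """Return a list of (url, fetch_time, retry_interval_seconds) tuples."""
    entries = []
    url = fetch_time = None
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("http"):
            # A real dump may append more fields after the URL; keep the URL only.
            url = line.split()[0]
        elif line.startswith("Fetch time:"):
            parts = line[len("Fetch time:"):].split()
            # parts = [weekday, month, day, time, zone, year]; the zone is dropped.
            stamp = " ".join(parts[:4] + parts[5:])
            fetch_time = datetime.strptime(stamp, "%a %b %d %H:%M:%S %Y")
        elif line.startswith("Retry interval:"):
            # "Retry interval: 18000 seconds (0 days)" -> third token is the number.
            entries.append((url, fetch_time, int(line.split()[2])))
            url = fetch_time = None
    return entries

def due_urls(entries, now):
    """URLs whose stored fetch time is not in the future relative to `now`."""
    return [url for url, fetch_time, _ in entries
            if fetch_time is not None and fetch_time <= now]

sample = """http://www.YOUR_URL_TO_FETCH
Status: 2 (db_fetched)
Fetch time: Thu Dec 10 09:19:18 EST 2009
Retry interval: 18000 seconds (0 days)
"""

entries = parse_dump(sample)
print(due_urls(entries, datetime(2009, 12, 9, 17, 0, 0)))    # before the fetch time
print(due_urls(entries, datetime(2009, 12, 10, 10, 0, 0)))   # after the fetch time
```

If the first call prints an empty list for your own dump, the generator has nothing it considers due yet, which would be consistent with "0 records selected for fetching".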

Hope this helps.


> Subject: RE: how to force nutch to do a recrawl
> Date: Wed, 9 Dec 2009 16:06:58 -0500
> From: Vijaya_Peters@sra.com
> To: nutch-user@lucene.apache.org
> 
> Okay.  I'll dig a little deeper.  I saw a few scripts that people had
> created, but I couldn't get them to work.
> 
> Thanks much.
> 
> -----Original Message-----
> From: MilleBii [mailto:millebii@gmail.com] 
> Sent: Wednesday, December 09, 2009 4:05 PM
> To: nutch-user@lucene.apache.org
> Subject: Re: how to force nutch to do a recrawl
> 
> I don't think you can use the nutch crawl command to do that; it is a
> one-stop-shop command.
> You probably want to use the individual commands instead.
> Type nutch generate to get the help and you will see the -adddays option.
> Read this page on the wiki to get a feel for how to do it:
> http://wiki.apache.org/nutch/Crawl
> 
 		 	   		  

RE: how to force nutch to do a recrawl

Posted by "Peters, Vijaya" <Vi...@sra.com>.
Okay.  I'll dig a little deeper.  I saw a few scripts that people had
created, but I couldn't get them to work.

Thanks much.

-----Original Message-----
From: MilleBii [mailto:millebii@gmail.com] 
Sent: Wednesday, December 09, 2009 4:05 PM
To: nutch-user@lucene.apache.org
Subject: Re: how to force nutch to do a recrawl

I don't think you can use the nutch crawl command to do that; it is a
one-stop-shop command.
You probably want to use the individual commands instead.
Type nutch generate to get the help and you will see the -adddays option.
Read this page on the wiki to get a feel for how to do it:
http://wiki.apache.org/nutch/Crawl


Re: how to force nutch to do a recrawl

Posted by MilleBii <mi...@gmail.com>.
I don't think you can use the nutch crawl command to do that; it is a
one-stop-shop command.
You probably want to use the individual commands instead.
Type nutch generate to get the help and you will see the -adddays option.
Read this page on the wiki to get a feel for how to do it:
http://wiki.apache.org/nutch/Crawl
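
As a rough illustration of what "individual commands" means here, the sketch below builds the command lines for one generate / fetch / updatedb round. It is a hypothetical Python helper, not something shipped with Nutch. The crawl directory, the example segment name, the -topN value and the -adddays value of 31 (one day more than the 30-day default interval) are all assumptions. It only prints the commands; it does not run them.

```python
def recrawl_commands(crawl_dir, segment, top_n=50, add_days=31, nutch="bin/nutch"):
    """Build the argv lists for one generate / fetch / updatedb round.

    `segment` is the directory that generate creates under
    <crawl_dir>/segments; its name is only known once step 1 has run.
    """
    crawldb = crawl_dir + "/crawldb"
    segments = crawl_dir + "/segments"
    return [
        # 1. select URLs, pretending the clock is add_days days ahead
        [nutch, "generate", crawldb, segments,
         "-topN", str(top_n), "-adddays", str(add_days)],
        # 2. fetch the segment that step 1 created
        [nutch, "fetch", segment],
        # 3. write the fetch results back into the crawldb
        [nutch, "updatedb", crawldb, segment],
    ]

def newest_segment(names):
    """Segment directories are named by timestamp, so the newest sorts last."""
    return max(names)

# Example values only: directory and segment names are placeholders.
for argv in recrawl_commands("crawl9a", "crawl9a/segments/20091209124308"):
    print(" ".join(argv))
```

The segment name has to be looked up after generate has finished, for example by passing a listing of the segments directory to newest_segment. The later steps that the crawl command also performs (link inversion and indexing) are not covered by this sketch.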

2009/12/9 Peters, Vijaya <Vi...@sra.com>

> I didn't see a setting to override in crawl-urlfilter.  How do I set
> numberDays? I have regular expressions to include/exclude certain extensions
> and certain urls, but that's all I have in there.
>
> Please send me an example and I'll give it a try.
>
> Thanks!
>


-- 
-MilleBii-

RE: how to force nutch to do a recrawl

Posted by "Peters, Vijaya" <Vi...@sra.com>.
I didn't see a setting to override in crawl-urlfilter.  How do I set numberDays? I have regular expressions to include/exclude certain extensions and certain urls, but that's all I have in there.

Please send me an example and I'll give it a try.

Thanks!

-----Original Message-----
From: xiao yang [mailto:yangxiao9901@gmail.com] 
Sent: Wednesday, December 09, 2009 1:41 PM
To: nutch-user@lucene.apache.org
Subject: Re: how to force nutch to do a recrawl

What about the configuration in crawl-urlfilter.txt?


Re: how to force nutch to do a recrawl

Posted by xiao yang <ya...@gmail.com>.
What about the configuration in crawl-urlfilter.txt?

On Thu, Dec 10, 2009 at 2:29 AM, Peters, Vijaya <Vi...@sra.com> wrote:
> I tried that too.
> in Nutch-site.xml, I added in the below, but this had no effect.
>
> <property>
>  <name>db.default.fetch.interval</name>
>  <value>0</value>
>  <description>(DEPRECATED) The default number of days between re-fetches of a page.  value was 30
>  </description>
> </property>
>
> <property>
>  <name>db.fetch.interval.default</name>
>  <value>3600</value>
>  <description>The default number of seconds between re-fetches of a page (30 days). value was 2592000 (30 days)
>  </description>
> </property>
>
> <property>
>  <name>db.fetch.interval.max</name>
>  <value>3600</value>
>  <description>The maximum number of seconds between re-fetches of a page
>  (90 days). After this period every page in the db will be re-tried, no
>  matter what is its status.  value was 7776000
>  </description>
> </property>
>

RE: how to force nutch to do a recrawl

Posted by "Peters, Vijaya" <Vi...@sra.com>.
I tried that too.
In nutch-site.xml, I added the properties below, but this had no effect.

<property>
  <name>db.default.fetch.interval</name>
  <value>0</value>
  <description>(DEPRECATED) The default number of days between re-fetches of a page.  value was 30
  </description>
</property>

<property>
  <name>db.fetch.interval.default</name>
  <value>3600</value>
  <description>The default number of seconds between re-fetches of a page (30 days). value was 2592000 (30 days)
  </description>
</property>

<property>
  <name>db.fetch.interval.max</name>
  <value>3600</value>
  <description>The maximum number of seconds between re-fetches of a page
  (90 days). After this period every page in the db will be re-tried, no
  matter what is its status.  value was 7776000
  </description>
</property>
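
Since the deprecated property is expressed in days and the newer ones in seconds, it can help to print what the configured values actually amount to. This is a small Python sketch under stated assumptions: the XML string is a trimmed stand-in for the real nutch-site.xml (wrapped in the usual configuration root element), and the helper names fetch_intervals and describe are invented for this example.

```python
import xml.etree.ElementTree as ET

def fetch_intervals(xml_text):
    """Map each db.fetch.interval.* property name to its value in seconds."""
    root = ET.fromstring(xml_text)
    result = {}
    for prop in root.iter("property"):
        name = prop.findtext("name", default="").strip()
        if name.startswith("db.fetch.interval."):
            result[name] = int(prop.findtext("value").strip())
    return result

def describe(seconds):
    """Render a number of seconds in hours and days as well."""
    return "%d s = %.2f h = %.3f days" % (seconds, seconds / 3600, seconds / 86400)

# Trimmed stand-in for nutch-site.xml, using the values from this message.
site_xml = """<configuration>
<property>
  <name>db.fetch.interval.default</name>
  <value>3600</value>
</property>
<property>
  <name>db.fetch.interval.max</name>
  <value>3600</value>
</property>
</configuration>"""

for name, seconds in fetch_intervals(site_xml).items():
    print(name, "->", describe(seconds))
```

With the values above, both properties come out as one hour, compared with the defaults of 30 and 90 days quoted in the descriptions.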

-----Original Message-----
From: MilleBii [mailto:millebii@gmail.com] 
Sent: Wednesday, December 09, 2009 1:27 PM
To: nutch-user@lucene.apache.org
Subject: Re: how to force nutch to do a recrawl

Nutch only recrawls every 30 days by default. So set numberDays
accordingly and it will recrawl. Read nutch-default.xml for the
details.

2009/12/9, xiao yang <ya...@gmail.com>:
> What do you mean by "recrawl"?
> Does the following command meets what you need?
> bin/nutch crawl urls -dir crawl -depth 3 -topN 50
> Change the destination directory to a different one with the last crawl.
>
> On Thu, Dec 10, 2009 at 1:44 AM, Peters, Vijaya <Vi...@sra.com>
> wrote:
>> I'm running Nutch 1.0 in windows.  How do I force Nutch to do a complete
>> recrawl?


-- 
-MilleBii-

Re: how to force nutch to do a recrawl

Posted by MilleBii <mi...@gmail.com>.
Nutch only recrawls every 30 days by default. So set the numberDays
appropriately and it will recrawl. Read nutch-default.xml to get the
details.
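
The idea above can be sketched in a few lines of Python. This is a simplified model, not Nutch's actual scheduling code: the function is_due and its add_days parameter are hypothetical names, and the sketch assumes that "numberDays" means shifting the clock forward when deciding which pages are due.

```python
from datetime import datetime, timedelta

THIRTY_DAYS = 30 * 24 * 3600  # default interval in seconds, per the message above


def is_due(last_fetch, interval_seconds, now, add_days=0):
    """Simplified model: a page is due once the (shifted) current time
    reaches its last fetch time plus the fetch interval."""
    effective_now = now + timedelta(days=add_days)
    return effective_now >= last_fetch + timedelta(seconds=interval_seconds)


last_fetch = datetime(2009, 12, 9, 12, 0)
now = last_fetch + timedelta(hours=2)  # second crawl, two hours later

print(is_due(last_fetch, THIRTY_DAYS, now))               # False: not due yet
print(is_due(last_fetch, 3600, now))                      # True: 1-hour interval has passed
print(is_due(last_fetch, THIRTY_DAYS, now, add_days=31))  # True: clock shifted past 30 days
```

Under this model, a crawl repeated shortly after the previous one finds nothing due, which would be consistent with a run that selects 0 records for fetching.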

2009/12/9, xiao yang <ya...@gmail.com>:
> What do you mean by "recrawl"?
> Does the following command meet your needs?
> bin/nutch crawl urls -dir crawl -depth 3 -topN 50
> Change the destination directory to one different from the last crawl's.
>
> On Thu, Dec 10, 2009 at 1:44 AM, Peters, Vijaya <Vi...@sra.com>
> wrote:
>> I'm running Nutch 1.0 in windows.  How do I force Nutch to do a complete
>> recrawl?


-- 
-MilleBii-

Re: how to force nutch to do a recrawl

Posted by xiao yang <ya...@gmail.com>.
What do you mean by "recrawl"?
Does the following command meet your needs?
bin/nutch crawl urls -dir crawl -depth 3 -topN 50
Change the destination directory to one different from the last crawl's.

On Thu, Dec 10, 2009 at 1:44 AM, Peters, Vijaya <Vi...@sra.com> wrote:
> I'm running Nutch 1.0 in windows.  How do I force Nutch to do a complete
> recrawl?