Posted to user@nutch.apache.org by is...@thomson.com on 2005/06/02 03:06:41 UTC

Intranet crawl and re-fetch - newbie question

Hello,

I have a newbie question:

I have run and completed an intranet crawl (bin/nutch crawl mySite myDB).
Since I would like to recrawl in a few days, I changed the Nutch default fetch interval parameter to 3 days (instead of 30).
How do I perform the recrawl? Do I just launch a new intranet crawl using the same parameters?
If I do, will the fetch download only new or modified pages, or will it download everything again?

Thanks for any help

Isabelle

Isabelle.Moulinier@thomson.com
Ph: 651 687 3424
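
A minimal sketch of the scheduling idea behind a re-fetch interval, assuming a page becomes due once its last fetch time plus the interval (in days) has passed. The function names are hypothetical and this is not Nutch's code; it only illustrates what changing the interval from 30 days to 3 would mean for a tool that honours it.

```python
from datetime import datetime, timedelta


def next_fetch_time(last_fetch: datetime, interval_days: int) -> datetime:
    """Earliest time at which the page should be fetched again (assumed rule)."""
    return last_fetch + timedelta(days=interval_days)


def is_due(last_fetch: datetime, interval_days: int, now: datetime) -> bool:
    """True when 'now' has reached or passed the next fetch time."""
    return now >= next_fetch_time(last_fetch, interval_days)


# Illustrative timestamps only.
last = datetime(2005, 6, 2, 3, 6, 41)
print(is_due(last, 3, datetime(2005, 6, 4)))   # False: only two days have passed
print(is_due(last, 3, datetime(2005, 6, 6)))   # True: the 3-day interval has elapsed
print(is_due(last, 30, datetime(2005, 6, 6)))  # False under a 30-day interval
```

Whether the one-step crawl command consults this interval at all is a separate question, which the replies in this thread address.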




Re: Intranet crawl and re-fetch - newbie question

Posted by "Daniel D." <nu...@gmail.com>.
Hi Piotr,

 Thanks for the information. 

You are right, those URLs (generated with -refetchonly) are not being
fetched. In my bullet #4 I said that they were fetched because I was misled
by the presence of data files (even though they were very small and I didn't
check the content).

I'm trying to understand how to start with an initial set of URLs and continue
fetching new URLs and re-fetching existing URLs (when they are due for re-fetch).

I will also post the questions below on the nutch-dev list.

 
   1. I have set db.default.fetch.interval to 1 (in nutch-default.xml),
   but I have noticed that the nextFetch field in the Page object is being set to
   current time + 7 days while the URL link data is being read from the fetchlist.
   Can somebody explain why, or am I not reading the code correctly?
   2. I have modified the code to ignore the nextFetch value coming from the
   fetchlist, meaning that nextFetch stays equal to its initial value, the
   current time. After I run the following commands: fetch, db update and
   generate db segments, I get a new fetchlist, but this list doesn't include my
   original sites, even though their next fetch time should already be in the
   past. Can somebody help me understand when those URLs will be fetched?
   3. It looks like the fetcher fails to extract links from http://www.eltweb.com.
   I know that there are some formats (apparently some HTML variations too)
   that are not supported. Where can I find information on what is currently
   supported?
   4. Some of the outlinks discovered during the fetch (for instance
   http://www.webct.com/software/viewpage?name=software_campus_edition or
   http://v.extreme-dm.com/?login=cguilfor) are being ignored (not
   included in the next fetchlist after executing the [generate db segments]
   command). Is there a known reason for this? Is there some documentation
   describing the supported URL types?
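
One possible explanation for item 4, which nothing in this thread confirms, is a URL filter rule that rejects URLs containing query-style characters. The sketch below assumes a hypothetical rule list in a "+pattern / -pattern" style where the first matching rule decides; the rules are illustrative and not copied from any Nutch configuration, so compare them with the filter file in your own conf directory.

```python
import re

# Hypothetical rules: the first rule whose pattern is found in the URL decides.
# "-" rejects the URL, "+" accepts it.
RULES = [
    ("-", re.compile(r"^(file|ftp|mailto):")),  # assumed: skip non-HTTP schemes
    ("-", re.compile(r"[?*!@=]")),              # assumed: skip probable query URLs
    ("+", re.compile(r".")),                    # accept everything else
]


def accepts(url: str) -> bool:
    """Return True if the URL passes the filter, False if it is rejected."""
    for sign, pattern in RULES:
        if pattern.search(url):
            return sign == "+"
    return False  # no rule matched (e.g. an empty string)


for url in (
    "http://www.webct.com/software",
    "http://www.webct.com/software/viewpage?name=software_campus_edition",
    "http://v.extreme-dm.com/?login=cguilfor",
    "mailto:chuck@powa.org",
):
    print(accepts(url), url)
```

Under these assumed rules only the first URL is accepted, which would match the pattern of missing outlinks described in item 4. If a rule like the second one is present and undesired, removing or narrowing it would be the place to start.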

Thanks,

Daniel


On 6/8/05, Piotr Kosiorowski <pk...@gmail.com> wrote: 
> 
> Hello Daniel,
> I raised the -refetchonly question on the nutch-dev list two days ago (subject:
> -refetchonly investigation). I described my tests and code findings
> there. If you are interested you can check it there, but for me the most
> important part is Doug's answer, so I will cite it here:
> <cite>
> The original rationale for the "-refetchonly" option was to permit
> indexing of all of the urls known to the database, with anchor text,
> but without fetching them. Thus one can, e.g., provide an index of 10M
> urls while only actually fetching 1M urls. I have never actually used
> this feature myself. I don't know whether other folks have ever used
> it successfully, nor whether such a feature is in fact desired.
> </cite>
> 
> I do not personally find such a feature useful, but maybe it is for
> somebody. I would like to add a feature that allows one to generate
> a fetchlist containing only urls that were already fetched (and,
> for symmetry, the opposite: urls that were never fetched). At the
> moment I am a bit busy with my personal life and work, but I have it on
> my TODO list (I will get back to your questions then too).
> Regards
> Piotr
> 
> 
> 
>

Re: Intranet crawl and re-fetch - newbie question

Posted by Piotr Kosiorowski <pk...@gmail.com>.
Hello Daniel,
I raised the -refetchonly question on the nutch-dev list two days ago (subject:
-refetchonly investigation). I described my tests and code findings
there. If you are interested you can check it there, but for me the most
important part is Doug's answer, so I will cite it here:
<cite>
The original rationale for the "-refetchonly" option was to permit
indexing of all of the urls known to the database, with anchor text,
but without fetching them.  Thus one can, e.g., provide an index of 10M
urls while only actually fetching 1M urls.  I have never actually used
this feature myself.  I don't know whether other folks have ever used
it successfully, nor whether such a feature is in fact desired.
</cite>

I do not personally find such a feature useful, but maybe it is for
somebody. I would like to add a feature that allows one to generate
a fetchlist containing only urls that were already fetched (and,
for symmetry, the opposite: urls that were never fetched). At the
moment I am a bit busy with my personal life and work, but I have it on
my TODO list (I will get back to your questions then too).
Regards
Piotr
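
The proposed feature (a fetchlist of only already-fetched URLs, or only never-fetched ones) can be sketched as a simple partition. This is an illustration under stated assumptions, not Nutch code: the PageRecord type, its last_fetched field, and the function name are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class PageRecord:
    """Hypothetical stand-in for a page entry in the web database."""
    url: str
    last_fetched: Optional[float]  # epoch seconds; None means never fetched


def split_by_fetch_status(pages: List[PageRecord]) -> Tuple[List[str], List[str]]:
    """Return (already fetched URLs, never fetched URLs), preserving input order."""
    fetched = [p.url for p in pages if p.last_fetched is not None]
    unfetched = [p.url for p in pages if p.last_fetched is None]
    return fetched, unfetched


pages = [
    PageRecord("http://www.hypermail.org/", 1118198968.0),  # illustrative timestamp
    PageRecord("http://www.hypermail.org/docs.html", None),
    PageRecord("http://www.webct.com/", 1118198970.0),      # illustrative timestamp
]
fetched, unfetched = split_by_fetch_status(pages)
print(fetched)    # the two seed URLs
print(unfetched)  # the newly discovered outlink
```

A generator built this way could emit either list as the next fetchlist, which is the symmetry described above.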


Daniel D. wrote:
> Hi,
> 
> I have run some tests to verify (as nobody has confirmed this yet) how
> -refetchonly behaves and would like to share the results with you. I
> will also add some questions at the end.
> 
> I'm using Nutch v6.
> For test purposes I have modified the code to create a log file with some URL
> information. I have also changed the code in test 2 to modify the
> fetch interval (see below).
> 
> Test 1:
> I created a DB and injected 3 URLs. The re-fetch interval was set to
> 1 (db.default.fetch.interval).
> 1. I ran fetch. I'm attaching log_10_7_days.txt to show the
> results of the fetch. Please pay attention to the nextFetch date. Even
> though fetchInterval is 1, the nextFetch date was 7 days ahead. I think this
> nextFetch is being read from the fetchlist. (Question #1)
> 2. I updated the DB.
> 3. I created the segments with the -refetchonly option. Results of
> nutch fetchlist -dumpurls … attached as test1_dumpurls.txt.
> You can see that only new URLs were included. But URLs of the
> following form:
> http://www.webct.com/software/viewpage?name=software_campus_edition or
> http://v.extreme-dm.com/?login=cguilfor
> were not included. (Question #2)
> 4. I ran fetch on the new segment (created in #3). Results are in
> log_10_7_refetch.txt. You will see that all URLs from
> test1_dumpurls.txt were fetched but no outlinks were recorded. (Question #3)
> 
> 
> Test 2: After realizing that nextFetch was 7 days ahead, I modified the code
> to ignore the value loaded from the fetchlist and kept it equal to the
> current time (assigned at initialization).
> 
> I created a DB and injected 3 URLs. The re-fetch interval was set to
> 1 (db.default.fetch.interval).
> 1. I ran fetch. I'm attaching log_10_0_days.txt to show the
> results of the fetch. Please pay attention to the nextFetch date.
> 2. I updated the DB.
> 3. I created the segments with the -refetchonly option. Results of
> nutch fetchlist -dumpurls … attached as test2_dumpurls.txt. Note that
> even though the current time had passed the nextFetch date, I found
> exactly the same list of URLs as in test 1.
> You can see that only new URLs were included. But URLs of the
> following form:
> http://www.webct.com/software/viewpage?name=software_campus_edition or
> http://v.extreme-dm.com/?login=cguilfor were not included. (Question #2)
> 4. I ran fetch on the new segment (created in #3). Results are in
> log_10_0_refetch.txt. You will see that all URLs from
> test2_dumpurls.txt were fetched but no outlinks were recorded. (Question #3)
> 
> Questions:
> 1. Why, when db.default.fetch.interval is 1, is the Page object's nextFetch
> variable 7 days ahead?
> 2. Why did creating the segments with -refetchonly exclude the URLs of the
> following form (I think those containing a question mark):
> http://www.webct.com/software/viewpage?name=software_campus_edition or
> http://v.extreme-dm.com/?login=cguilfor
> 3. Why does fetching the fetchlist created with -refetchonly not store
> outlinks in the results?
> 
> Hope my results will help to understand how it works.
> 
> Guys, please find time to answer those questions, as this would greatly help
> my work.
> 
> Thanks,
> Daniel.
> 
> 
> On 6/6/05, Piotr Kosiorowski <pkosiorowski@gmail.com> wrote:
>  > As far as I know, crawl (named Intranet crawling in the tutorial) assumes
>  > you refetch everything from scratch every time you run it. Whole Web
>  > crawling allows you to control what you want to crawl and recrawl in
>  > more detail, but some parameters might not work as I would expect (e.g.
>  > -refetchonly). Support for checking whether a page was modified since the
>  > last fetch time is currently missing (although as I understand there is
>  > some work going on in this direction:
>  > http://issues.apache.org/jira/browse/NUTCH-61 )
>  > Regards
>  > Piotr
>  >
> 
> 
> ------------------------------------------------------------------------
> 
> 
> ==================================================
> URL: http://www.hypermail.org/
> Number of OUT links: 0
> fetchInterval: 1
> nextFetch: Tue Jun 07 22:49:28 EDT 2005
> Score: 1.0
> NextScore: 1.0
> 
> 
> Number of anchors: 0
> Outlinks Count: 18
> Outlink: toUrl: http://www.hypermail.org/docs.html anchor: Documentation
> Outlink: toUrl: http://dev.hypermail.org/openfaq/ anchor: OpenFAQ
> Outlink: toUrl: http://www.hypermail.org/lists.html anchor: Mailing Lists
> Outlink: toUrl: http://www.hypermail.org/mail-archive/archives.html anchor: Mailing List Archives
> Outlink: toUrl: http://www.hypermail.org/dist anchor: Download Hypermail Software
> Outlink: toUrl: http://www.hypermail.org/cvs.html anchor: CVS Server Access
> Outlink: toUrl: http://cvsweb.hypermail.org/ anchor: Browsing the CVS Baseline
> Outlink: toUrl: http://www.hypermail.org/submit-patches.html anchor: Submitting Patches
> Outlink: toUrl: mailto:hypermail@hypermail.org anchor: Suggestions
> Outlink: toUrl: http://www.hypermail.org/using.html anchor: Lists Using Hypermail
> Outlink: toUrl: http://www.hypermail.org/net-resources.html anchor: Net.Resources
> Outlink: toUrl: http://www.hypermail.org/others.html anchor: The Others
> Outlink: toUrl: http://www.hypermail.org/credits.html anchor: Credits
> Outlink: toUrl: http://www.hypermail.org/copyright.html anchor: Copyright
> Outlink: toUrl: http://home.netscape.com/comprod/mirror/index.html anchor: Download
> Outlink: toUrl: http://www.hypermail.org/navbar.html anchor: 
> Outlink: toUrl: http://www.hypermail.org/firstpage.html anchor: 
> Outlink: toUrl: http://www.hypermail.org/search.html anchor: 
> 
> ==================================================
> URL: http://www.powa.org/
> Number of OUT links: 0
> fetchInterval: 1
> nextFetch: Tue Jun 07 22:49:29 EDT 2005
> Score: 1.0
> NextScore: 1.0
> 
> 
> Number of anchors: 0
> Outlinks Count: 16
> Outlink: toUrl: http://my.powa.org/modules.php?name=Your_Account anchor: Login
> Outlink: toUrl: http://www.webenglishteacher.com/ anchor: 
> Outlink: toUrl: http://www.eltweb.com/ anchor: 
> Outlink: toUrl: http://www.rockhillpress.com/ anchor: 
> Outlink: toUrl: http://members.tripod.com/~DoctorAhClem/ahclem.html anchor: 
> Outlink: toUrl: http://webcrawler.com/select/ anchor: 
> Outlink: toUrl: http://www.excite.com/apple/guide/Arts_and_Humanities/Books_and_Literature/Writing/reviews.html anchor: 
> Outlink: toUrl: http://www.studyweb.com/ anchor: 
> Outlink: toUrl: http://www.schoolzone.co.uk/ anchor: 
> Outlink: toUrl: http://www.awesomelibrary.org/ratings.html anchor: 
> Outlink: toUrl: http://www.kn.pacbell.com/wired/bluewebn/ anchor: 
> Outlink: toUrl: http://www.homeworkspot.com/high/english/essaywriting.htm anchor: 
> Outlink: toUrl: http://www.links2go.com/topic/Writing anchor: 
> Outlink: toUrl: http://www.cs.wisc.edu/scout/report anchor: 
> Outlink: toUrl: http://v.extreme-dm.com/?login=cguilfor anchor: 
> Outlink: toUrl: mailto:chuck@powa.org anchor: Chuck Guilford
> 
> ==================================================
> URL: http://www.webct.com/
> Number of OUT links: 0
> fetchInterval: 1
> nextFetch: Tue Jun 07 22:49:30 EDT 2005
> Score: 1.0
> NextScore: 1.0
> 
> 
> Number of anchors: 0
> Outlinks Count: 62
> Outlink: toUrl: http://www.webct.com/entrypage anchor: 
> Outlink: toUrl: http://www.webct.com/entrypage anchor: Home
> Outlink: toUrl: http://www.webct.com/software anchor: Software
> Outlink: toUrl: http://www.webct.com/services anchor: Services
> Outlink: toUrl: http://www.webct.com/techsupport anchor: Support
> Outlink: toUrl: http://www.webct.com/success anchor: Customer Success
> Outlink: toUrl: http://www.webct.com/content anchor: Digital Content
> Outlink: toUrl: http://www.webct.com/powerlinks anchor: WebCT PowerLinks
> Outlink: toUrl: http://www.webct.com/vision anchor: Vision
> Outlink: toUrl: http://www.webct.com/company anchor: About Us
> Outlink: toUrl: http://www.webct.com/software/viewpage?name=software_campus_edition anchor: WebCT Campus Edition
> Outlink: toUrl: http://www.webct.com/software/viewpage?name=software_vista anchor: WebCT Vista
> Outlink: toUrl: http://www.webct.com/software/viewpage?name=software_languages anchor: Languages
> Outlink: toUrl: http://www.webct.com/services/viewpage?name=services_technical_solutions anchor: Technical Solutions
> Outlink: toUrl: http://www.webct.com/services/viewpage?name=services_professional_development anchor: Professional Development
> Outlink: toUrl: http://www.webct.com/services/viewpage?name=services_hosting anchor: Hosting Services
> Outlink: toUrl: http://www.webct.com/services/viewpage?name=services_support_options anchor: Support Services
> Outlink: toUrl: http://www.webct.com/services/viewpage?name=services_expanding_campus_edition anchor: WebCT Campus Edition
> Outlink: toUrl: http://www.webct.com/services/viewpage?name=services_getting_started_vista anchor: WebCT Vista
> Outlink: toUrl: http://www.webct.com/support anchor: WebCT Support
> Outlink: toUrl: http://www.webct.com/ask_drc anchor: Ask Dr. C
> Outlink: toUrl: http://www.webct.com/support/viewpage?name=company_documentation_index anchor: Documentation
> Outlink: toUrl: http://download.webct.com/ anchor: Software Downloads
> Outlink: toUrl: http://www.webct.com/techsupport/viewpage?name=techsupport_license_faq anchor: License Keys
> Outlink: toUrl: http://www.webct.com/success/viewpage?name=success_case_studies anchor: Case Studies
> Outlink: toUrl: http://www.webct.com/exemplary anchor: Exemplary Courses
> Outlink: toUrl: http://www.webct.com/institutes anchor: WebCT Institutes
> Outlink: toUrl: http://www.webct.com/worldwide anchor: WebCT Worldwide
> Outlink: toUrl: http://www.webct.com/content/viewpage?name=content_instructors anchor: Instructors
> Outlink: toUrl: http://www.webct.com/content/viewpage?name=content_admin anchor: Administrators
> Outlink: toUrl: http://www.webct.com/content/viewpage?name=content_access anchor: Student Access Codes
> Outlink: toUrl: http://www.webct.com/content/viewpage?name=content_customer_care anchor: Help
> Outlink: toUrl: http://www.webct.com/powerlinks/viewpage?name=powerlinks_network anchor: PowerLinks Network
> Outlink: toUrl: http://www.webct.com/powerlinks/viewpage?name=powerlinks_showcase anchor: PowerLinks Showcase
> Outlink: toUrl: http://www.webct.com/developers anchor: Vista Developers Network
> Outlink: toUrl: http://www.webct.com/company/viewpage?name=company_webct_customers anchor: Customers
> Outlink: toUrl: http://www.webct.com/company/viewpage?name=company_management_team anchor: Leadership
> Outlink: toUrl: http://www.webct.com/company/viewpage?name=company_manage_investors anchor: Investors
> Outlink: toUrl: http://www.webct.com/company/viewpage?name=company_partners anchor: Partners
> Outlink: toUrl: http://www.webct.com/company/viewpage?name=company_press_kit anchor: Press
> Outlink: toUrl: http://www.webct.com/company/viewpage?name=company_events anchor: Events
> Outlink: toUrl: http://www.webct.com/company/viewpage?name=company_jobs anchor: Jobs
> Outlink: toUrl: http://www.webct.com/company/viewpage?name=company_contact_us anchor: Contact Us
> Outlink: toUrl: http://www.webct.com/ce6 anchor: 
> Outlink: toUrl: http://www.webct.com/service/ViewContent?contentID=26162711 anchor: Innovative e-learning project
> Outlink: toUrl: http://www.webct.com/service/ViewContent?contentID=26052806 anchor: WebCT to give sneak preview
> Outlink: toUrl: http://www.webct.com/2005 anchor: WebCT Impact 2005
> Outlink: toUrl: http://www.webct.com/company/service/selectnewsletters anchor: Subscribe to WebCT Newsletter
> Outlink: toUrl: http://www.webct.com/vision anchor: Learn how WebCT can help your institution achieve learning without limits
> Outlink: toUrl: http://www.webct.com/events anchor: Events
> Outlink: toUrl: http://www.webct.com/software/viewpage?name=software_languages anchor: Languages
> Outlink: toUrl: http://www.webct.com/students anchor: Students
> Outlink: toUrl: http://www.webct.com/faculty anchor: Faculty
> Outlink: toUrl: http://www.webct.com/workshops anchor: Online Workshops
> Outlink: toUrl: http://www.webct.com/seminars anchor: Online Seminars
> Outlink: toUrl: http://www.webct.com/ask_drc anchor: Ask Dr. C
> Outlink: toUrl: http://www.webct.com/ce6 anchor: CE 6 Upgrade
> Outlink: toUrl: http://www.webct.com/company/viewpage?name=company_contact_us anchor: Contact Us
> Outlink: toUrl: http://www.webct.com/communities/servicepolicy anchor: Terms of Service
> Outlink: toUrl: http://www.webct.com/communities/privacypolicy anchor: Privacy Policy
> Outlink: toUrl: http://www.webct.com/company/viewpage?name=company_how_to_apply anchor: Employment
> Outlink: toUrl: http://www.webct.com/communities/viewpage?name=communities_site_map anchor: Site Map
> 
> 
> ------------------------------------------------------------------------
> 
> 
> ==================================================
> URL: http://www.eltweb.com/
> Number of OUT links: 0
> fetchInterval: 1
> nextFetch: Tue Jun 07 23:07:57 EDT 2005
> Score: 1.0
> NextScore: 1.0
> 
> 
> Number of anchors: 0
> Outlinks Count: 0
> 
> ==================================================
> URL: http://www.webct.com/students
> Number of OUT links: 0
> fetchInterval: 1
> nextFetch: Tue Jun 07 23:07:57 EDT 2005
> Score: 1.0
> NextScore: 1.0
> 
> 
> Number of anchors: 1
> Anchors: Students
> Outlinks Count: 0
> 
> ==================================================
> URL: http://www.hypermail.org/navbar.html
> Number of OUT links: 0
> fetchInterval: 1
> nextFetch: Tue Jun 07 23:07:57 EDT 2005
> Score: 1.0
> NextScore: 1.0
> 
> 
> Number of anchors: 0
> Outlinks Count: 0
> 
> ==================================================
> URL: http://www.cs.wisc.edu/scout/report
> Number of OUT links: 0
> fetchInterval: 1
> nextFetch: Tue Jun 07 23:07:57 EDT 2005
> Score: 1.0
> NextScore: 1.0
> 
> 
> Number of anchors: 0
> Outlinks Count: 0
> 
> ==================================================
> URL: http://www.webct.com/worldwide
> Number of OUT links: 0
> fetchInterval: 1
> nextFetch: Tue Jun 07 23:07:57 EDT 2005
> Score: 1.0
> NextScore: 1.0
> 
> 
> Number of anchors: 1
> Anchors: WebCT Worldwide
> Outlinks Count: 0
> 
> ==================================================
> URL: http://www.webct.com/techsupport
> Number of OUT links: 0
> fetchInterval: 1
> nextFetch: Tue Jun 07 23:07:57 EDT 2005
> Score: 1.0
> NextScore: 1.0
> 
> 
> Number of anchors: 1
> Anchors: Support
> Outlinks Count: 0
> 
> ==================================================
> URL: http://www.webct.com/support
> Number of OUT links: 0
> fetchInterval: 1
> nextFetch: Tue Jun 07 23:07:57 EDT 2005
> Score: 1.0
> NextScore: 1.0
> 
> 
> Number of anchors: 1
> Anchors: WebCT Support
> Outlinks Count: 0
> 
> ==================================================
> URL: http://www.kn.pacbell.com/wired/bluewebn/
> Number of OUT links: 0
> fetchInterval: 1
> nextFetch: Tue Jun 07 23:07:57 EDT 2005
> Score: 1.0
> NextScore: 1.0
> 
> 
> Number of anchors: 0
> Outlinks Count: 0
> 
> ==================================================
> URL: http://www.rockhillpress.com/
> Number of OUT links: 0
> fetchInterval: 1
> nextFetch: Tue Jun 07 23:07:57 EDT 2005
> Score: 1.0
> NextScore: 1.0
> 
> 
> Number of anchors: 0
> Outlinks Count: 0
> 
> ==================================================
> URL: http://www.hypermail.org/mail-archive/archives.html
> Number of OUT links: 0
> fetchInterval: 1
> nextFetch: Tue Jun 07 23:07:57 EDT 2005
> Score: 1.0
> NextScore: 1.0
> 
> 
> Number of anchors: 1
> Anchors: Mailing List Archives
> Outlinks Count: 0
> 
> ==================================================
> URL: http://www.webct.com/powerlinks
> Number of OUT links: 0
> fetchInterval: 1
> nextFetch: Tue Jun 07 23:07:57 EDT 2005
> Score: 1.0
> NextScore: 1.0
> 
> 
> Number of anchors: 1
> Anchors: WebCT PowerLinks
> Outlinks Count: 0
> 
> ==================================================
> URL: http://dev.hypermail.org/openfaq/
> Number of OUT links: 0
> fetchInterval: 1
> nextFetch: Tue Jun 07 23:07:57 EDT 2005
> Score: 1.0
> NextScore: 1.0
> 
> 
> Number of anchors: 1
> Anchors: OpenFAQ
> Outlinks Count: 0
> 
> ==================================================
> URL: http://www.hypermail.org/dist
> Number of OUT links: 0
> fetchInterval: 1
> nextFetch: Tue Jun 07 23:07:57 EDT 2005
> Score: 1.0
> NextScore: 1.0
> 
> 
> Number of anchors: 1
> Anchors: Download Hypermail Software
> Outlinks Count: 0
> 
> ==================================================
> URL: http://www.hypermail.org/credits.html
> Number of OUT links: 0
> fetchInterval: 1
> nextFetch: Tue Jun 07 23:07:57 EDT 2005
> Score: 1.0
> NextScore: 1.0
> 
> 
> Number of anchors: 1
> Anchors: Credits
> Outlinks Count: 0
> 
> ==================================================
> URL: http://www.hypermail.org/copyright.html
> Number of OUT links: 0
> fetchInterval: 1
> nextFetch: Tue Jun 07 23:07:57 EDT 2005
> Score: 1.0
> NextScore: 1.0
> 
> 
> Number of anchors: 1
> Anchors: Copyright
> Outlinks Count: 0
> 
> ==================================================
> URL: http://www.hypermail.org/search.html
> Number of OUT links: 0
> fetchInterval: 1
> nextFetch: Tue Jun 07 23:07:57 EDT 2005
> Score: 1.0
> NextScore: 1.0
> 
> 
> Number of anchors: 0
> Outlinks Count: 0
> 
> ==================================================
> URL: http://www.webct.com/company/service/selectnewsletters
> Number of OUT links: 0
> fetchInterval: 1
> nextFetch: Tue Jun 07 23:07:57 EDT 2005
> Score: 1.0
> NextScore: 1.0
> 
> 
> Number of anchors: 1
> Anchors: Subscribe to WebCT Newsletter
> Outlinks Count: 0
> 
> ==================================================
> URL: http://www.webct.com/company
> Number of OUT links: 0
> fetchInterval: 1
> nextFetch: Tue Jun 07 23:07:57 EDT 2005
> Score: 1.0
> NextScore: 1.0
> 
> 
> Number of anchors: 1
> Anchors: About Us
> Outlinks Count: 0
> 
> ==================================================
> URL: http://www.webct.com/developers
> Number of OUT links: 0
> fetchInterval: 1
> nextFetch: Tue Jun 07 23:07:57 EDT 2005
> Score: 1.0
> NextScore: 1.0
> 
> 
> Number of anchors: 1
> Anchors: Vista Developers Network
> Outlinks Count: 0
> 
> ==================================================
> URL: http://www.excite.com/apple/guide/Arts_and_Humanities/Books_and_Literature/Writing/reviews.html
> Number of OUT links: 0
> fetchInterval: 1
> nextFetch: Tue Jun 07 23:07:57 EDT 2005
> Score: 1.0
> NextScore: 1.0
> 
> 
> Number of anchors: 0
> Outlinks Count: 0
> 
> ==================================================
> URL: http://www.webct.com/services
> Number of OUT links: 0
> fetchInterval: 1
> nextFetch: Tue Jun 07 23:07:57 EDT 2005
> Score: 1.0
> NextScore: 1.0
> 
> 
> Number of anchors: 1
> Anchors: Services
> Outlinks Count: 0
> 
> ==================================================
> URL: http://members.tripod.com/~DoctorAhClem/ahclem.html
> Number of OUT links: 0
> fetchInterval: 1
> nextFetch: Tue Jun 07 23:07:57 EDT 2005
> Score: 1.0
> NextScore: 1.0
> 
> 
> Number of anchors: 0
> Outlinks Count: 0
> 
> ==================================================
> URL: http://www.hypermail.org/firstpage.html
> Number of OUT links: 0
> fetchInterval: 1
> nextFetch: Tue Jun 07 23:07:57 EDT 2005
> Score: 1.0
> NextScore: 1.0
> 
> 
> Number of anchors: 0
> Outlinks Count: 0
> 
> ==================================================
> URL: http://www.homeworkspot.com/high/english/essaywriting.htm
> Number of OUT links: 0
> fetchInterval: 1
> nextFetch: Tue Jun 07 23:07:57 EDT 2005
> Score: 1.0
> NextScore: 1.0
> 
> 
> Number of anchors: 0
> Outlinks Count: 0
> 
> ==================================================
> URL: http://www.hypermail.org/using.html
> Number of OUT links: 0
> fetchInterval: 1
> nextFetch: Tue Jun 07 23:07:57 EDT 2005
> Score: 1.0
> NextScore: 1.0
> 
> 
> Number of anchors: 1
> Anchors: Lists Using Hypermail
> Outlinks Count: 0
> 
> ==================================================
> URL: http://www.webct.com/communities/privacypolicy
> Number of OUT links: 0
> fetchInterval: 1
> nextFetch: Tue Jun 07 23:07:57 EDT 2005
> Score: 1.0
> NextScore: 1.0
> 
> 
> Number of anchors: 1
> Anchors: Privacy Policy
> Outlinks Count: 0
> 
> ==================================================
> URL: http://home.netscape.com/comprod/mirror/index.html
> Number of OUT links: 0
> fetchInterval: 1
> nextFetch: Tue Jun 07 23:07:57 EDT 2005
> Score: 1.0
> NextScore: 1.0
> 
> 
> Number of anchors: 1
> Anchors: Download
> Outlinks Count: 0
> 
> ==================================================
> URL: http://www.links2go.com/topic/Writing
> Number of OUT links: 0
> fetchInterval: 1
> nextFetch: Tue Jun 07 23:07:57 EDT 2005
> Score: 1.0
> NextScore: 1.0
> 
> 
> Number of anchors: 0
> Outlinks Count: 0
> 
> ==================================================
> URL: http://www.webct.com/seminars
> Number of OUT links: 0
> fetchInterval: 1
> nextFetch: Tue Jun 07 23:07:57 EDT 2005
> Score: 1.0
> NextScore: 1.0
> 
> 
> Number of anchors: 1
> Anchors: Online Seminars
> Outlinks Count: 0
> 
> ==================================================
> URL: http://www.webct.com/institutes
> Number of OUT links: 0
> fetchInterval: 1
> nextFetch: Tue Jun 07 23:07:57 EDT 2005
> Score: 1.0
> NextScore: 1.0
> 
> 
> Number of anchors: 1
> Anchors: WebCT Institutes
> Outlinks Count: 0
> 
> ==================================================
> URL: http://www.webct.com/events
> Number of OUT links: 0
> fetchInterval: 1
> nextFetch: Tue Jun 07 23:07:57 EDT 2005
> Score: 1.0
> NextScore: 1.0
> 
> 
> Number of anchors: 1
> Anchors: Events
> Outlinks Count: 0
> 
> ==================================================
> URL: http://www.webct.com/ce6
> Number of OUT links: 0
> fetchInterval: 1
> nextFetch: Tue Jun 07 23:07:57 EDT 2005
> Score: 1.0
> NextScore: 1.0
> 
> 
> Number of anchors: 0
> Outlinks Count: 0
> 
> ==================================================
> URL: http://www.hypermail.org/lists.html
> Number of OUT links: 0
> fetchInterval: 1
> nextFetch: Tue Jun 07 23:07:57 EDT 2005
> Score: 1.0
> NextScore: 1.0
> 
> 
> Number of anchors: 1
> Anchors: Mailing Lists
> Outlinks Count: 0
> 
> ==================================================
> URL: http://www.webct.com/success
> Number of OUT links: 0
> fetchInterval: 1
> nextFetch: Tue Jun 07 23:07:57 EDT 2005
> Score: 1.0
> NextScore: 1.0
> 
> 
> Number of anchors: 1
> Anchors: Customer Success
> Outlinks Count: 0
> 
> ==================================================
> URL: http://www.studyweb.com/
> Number of OUT links: 0
> fetchInterval: 1
> nextFetch: Tue Jun 07 23:07:58 EDT 2005
> Score: 1.0
> NextScore: 1.0
> 
> 
> Number of anchors: 0
> Outlinks Count: 0
> 
> ==================================================
> URL: http://www.webct.com/2005
> Number of OUT links: 0
> fetchInterval: 1
> nextFetch: Tue Jun 07 23:07:58 EDT 2005
> Score: 1.0
> NextScore: 1.0
> 
> 
> Number of anchors: 1
> Anchors: WebCT Impact 2005
> Outlinks Count: 0
> 
> ==================================================
> URL: http://www.webct.com/entrypage
> Number of OUT links: 0
> fetchInterval: 1
> nextFetch: Tue Jun 07 23:07:58 EDT 2005
> Score: 1.0
> NextScore: 1.0
> 
> 
> Number of anchors: 0
> Outlinks Count: 0
> 
> ==================================================
> URL: http://download.webct.com/
> Number of OUT links: 0
> fetchInterval: 1
> nextFetch: Tue Jun 07 23:07:58 EDT 2005
> Score: 1.0
> NextScore: 1.0
> 
> 
> Number of anchors: 1
> Anchors: Software Downloads
> Outlinks Count: 0
> 
> ==================================================
> URL: http://www.hypermail.org/cvs.html
> Number of OUT links: 0
> fetchInterval: 1
> nextFetch: Tue Jun 07 23:07:58 EDT 2005
> Score: 1.0
> NextScore: 1.0
> 
> 
> Number of anchors: 1
> Anchors: CVS Server Access
> Outlinks Count: 0
> 
> ==================================================
> URL: http://www.hypermail.org/net-resources.html
> Number of OUT links: 0
> fetchInterval: 1
> nextFetch: Tue Jun 07 23:07:58 EDT 2005
> Score: 1.0
> NextScore: 1.0
> 
> 
> Number of anchors: 1
> Anchors: Net.Resources
> Outlinks Count: 0
> 
> ==================================================
> URL: http://cvsweb.hypermail.org/
> Number of OUT links: 0
> fetchInterval: 1
> nextFetch: Tue Jun 07 23:07:58 EDT 2005
> Score: 1.0
> NextScore: 1.0
> 
> 
> Number of anchors: 1
> Anchors: Browsing the CVS Baseline
> Outlinks Count: 0
> 
> ==================================================
> URL: http://www.webct.com/exemplary
> Number of OUT links: 0
> fetchInterval: 1
> nextFetch: Tue Jun 07 23:07:58 EDT 2005
> Score: 1.0
> NextScore: 1.0
> 
> 
> Number of anchors: 1
> Anchors: Exemplary Courses
> Outlinks Count: 0
> 
> ==================================================
> URL: http://www.hypermail.org/docs.html
> Number of OUT links: 0
> fetchInterval: 1
> nextFetch: Tue Jun 07 23:07:58 EDT 2005
> Score: 1.0
> NextScore: 1.0
> 
> 
> Number of anchors: 1
> Anchors: Documentation
> Outlinks Count: 0
> 
> ==================================================
> URL: http://www.awesomelibrary.org/ratings.html
> Number of OUT links: 0
> fetchInterval: 1
> nextFetch: Tue Jun 07 23:07:58 EDT 2005
> Score: 1.0
> NextScore: 1.0
> 
> 
> Number of anchors: 0
> Outlinks Count: 0
> 
> ==================================================
> URL: http://www.webct.com/ask_drc
> Number of OUT links: 0
> fetchInterval: 1
> nextFetch: Tue Jun 07 23:07:58 EDT 2005
> Score: 1.0
> NextScore: 1.0
> 
> 
> Number of anchors: 1
> Anchors: Ask Dr. C
> Outlinks Count: 0
> 
> ==================================================
> URL: http://www.schoolzone.co.uk/
> Number of OUT links: 0
> fetchInterval: 1
> nextFetch: Tue Jun 07 23:07:58 EDT 2005
> Score: 1.0
> NextScore: 1.0
> 
> 
> Number of anchors: 0
> Outlinks Count: 0
> 
> ==================================================
> URL: http://www.webct.com/vision
> Number of OUT links: 0
> fetchInterval: 1
> nextFetch: Tue Jun 07 23:07:58 EDT 2005
> Score: 1.0
> NextScore: 1.0
> 
> 
> Number of anchors: 1
> Anchors: Vision
> Outlinks Count: 0
> 
> ==================================================
> URL: http://www.webct.com/content
> Number of OUT links: 0
> fetchInterval: 1
> nextFetch: Tue Jun 07 23:07:58 EDT 2005
> Score: 1.0
> NextScore: 1.0
> 
> 
> Number of anchors: 1
> Anchors: Digital Content
> Outlinks Count: 0
> 
> ==================================================
> URL: http://www.webct.com/software
> Number of OUT links: 0
> fetchInterval: 1
> nextFetch: Tue Jun 07 23:07:58 EDT 2005
> Score: 1.0
> NextScore: 1.0
> 
> 
> Number of anchors: 1
> Anchors: Software
> Outlinks Count: 0
> 
> ==================================================
> URL: http://webcrawler.com/select/
> Number of OUT links: 0
> fetchInterval: 1
> nextFetch: Tue Jun 07 23:07:58 EDT 2005
> Score: 1.0
> NextScore: 1.0
> 
> 
> Number of anchors: 0
> Outlinks Count: 0
> 
> ==================================================
> URL: http://www.hypermail.org/submit-patches.html
> Number of OUT links: 0
> fetchInterval: 1
> nextFetch: Tue Jun 07 23:07:58 EDT 2005
> Score: 1.0
> NextScore: 1.0
> 
> 
> Number of anchors: 1
> Anchors: Submitting Patches
> Outlinks Count: 0
> 
> ==================================================
> URL: http://www.webct.com/communities/servicepolicy
> Number of OUT links: 0
> fetchInterval: 1
> nextFetch: Tue Jun 07 23:07:58 EDT 2005
> Score: 1.0
> NextScore: 1.0
> 
> 
> Number of anchors: 1
> Anchors: Terms of Service
> Outlinks Count: 0
> 
> ==================================================
> URL: http://www.hypermail.org/others.html
> Number of OUT links: 0
> fetchInterval: 1
> nextFetch: Tue Jun 07 23:07:58 EDT 2005
> Score: 1.0
> NextScore: 1.0
> 
> 
> Number of anchors: 1
> Anchors: The Others
> Outlinks Count: 0
> 
> ==================================================
> URL: http://www.webct.com/workshops
> Number of OUT links: 0
> fetchInterval: 1
> nextFetch: Tue Jun 07 23:07:58 EDT 2005
> Score: 1.0
> NextScore: 1.0
> 
> 
> Number of anchors: 1
> Anchors: Online Workshops
> Outlinks Count: 0
> 
> ==================================================
> URL: http://www.webenglishteacher.com/
> Number of OUT links: 0
> fetchInterval: 1
> nextFetch: Tue Jun 07 23:07:58 EDT 2005
> Score: 1.0
> NextScore: 1.0
> 
> 
> Number of anchors: 0
> Outlinks Count: 0
> 
> ==================================================
> URL: http://www.webct.com/faculty
> Number of OUT links: 0
> fetchInterval: 1
> nextFetch: Tue Jun 07 23:07:58 EDT 2005
> Score: 1.0
> NextScore: 1.0
> 
> 
> Number of anchors: 1
> Anchors: Faculty
> Outlinks Count: 0
> 
> 
> ------------------------------------------------------------------------
> 
> 
> ==================================================
> URL: http://www.hypermail.org/
> Number of OUT links: 0
> fetchInterval: 1
> nextFetch: Tue Jun 14 20:59:25 EDT 2005
> Score: 1.0
> NextScore: 1.0
> 
> 
> Number of anchors: 0
> Outlinks Count: 18
> Outlink: toUrl: http://www.hypermail.org/docs.html anchor: Documentation
> Outlink: toUrl: http://dev.hypermail.org/openfaq/ anchor: OpenFAQ
> Outlink: toUrl: http://www.hypermail.org/lists.html anchor: Mailing Lists
> Outlink: toUrl: http://www.hypermail.org/mail-archive/archives.html anchor: Mailing List Archives
> Outlink: toUrl: http://www.hypermail.org/dist anchor: Download Hypermail Software
> Outlink: toUrl: http://www.hypermail.org/cvs.html anchor: CVS Server Access
> Outlink: toUrl: http://cvsweb.hypermail.org/ anchor: Browsing the CVS Baseline
> Outlink: toUrl: http://www.hypermail.org/submit-patches.html anchor: Submitting Patches
> Outlink: toUrl: mailto:hypermail@hypermail.org anchor: Suggestions
> Outlink: toUrl: http://www.hypermail.org/using.html anchor: Lists Using Hypermail
> Outlink: toUrl: http://www.hypermail.org/net-resources.html anchor: Net.Resources
> Outlink: toUrl: http://www.hypermail.org/others.html anchor: The Others
> Outlink: toUrl: http://www.hypermail.org/credits.html anchor: Credits
> Outlink: toUrl: http://www.hypermail.org/copyright.html anchor: Copyright
> Outlink: toUrl: http://home.netscape.com/comprod/mirror/index.html anchor: Download
> Outlink: toUrl: http://www.hypermail.org/navbar.html anchor: 
> Outlink: toUrl: http://www.hypermail.org/firstpage.html anchor: 
> Outlink: toUrl: http://www.hypermail.org/search.html anchor: 
> 
> ==================================================
> URL: http://www.powa.org/
> Number of OUT links: 0
> fetchInterval: 1
> nextFetch: Tue Jun 14 20:59:25 EDT 2005
> Score: 1.0
> NextScore: 1.0
> 
> 
> Number of anchors: 0
> Outlinks Count: 16
> Outlink: toUrl: http://my.powa.org/modules.php?name=Your_Account anchor: Login
> Outlink: toUrl: http://www.webenglishteacher.com/ anchor: 
> Outlink: toUrl: http://www.eltweb.com/ anchor: 
> Outlink: toUrl: http://www.rockhillpress.com/ anchor: 
> Outlink: toUrl: http://members.tripod.com/~DoctorAhClem/ahclem.html anchor: 
> Outlink: toUrl: http://webcrawler.com/select/ anchor: 
> Outlink: toUrl: http://www.excite.com/apple/guide/Arts_and_Humanities/Books_and_Literature/Writing/reviews.html anchor: 
> Outlink: toUrl: http://www.studyweb.com/ anchor: 
> Outlink: toUrl: http://www.schoolzone.co.uk/ anchor: 
> Outlink: toUrl: http://www.awesomelibrary.org/ratings.html anchor: 
> Outlink: toUrl: http://www.kn.pacbell.com/wired/bluewebn/ anchor: 
> Outlink: toUrl: http://www.homeworkspot.com/high/english/essaywriting.htm anchor: 
> Outlink: toUrl: http://www.links2go.com/topic/Writing anchor: 
> Outlink: toUrl: http://www.cs.wisc.edu/scout/report anchor: 
> Outlink: toUrl: http://v.extreme-dm.com/?login=cguilfor anchor: 
> Outlink: toUrl: mailto:chuck@powa.org anchor: Chuck Guilford
> 
> ==================================================
> URL: http://www.webct.com/
> Number of OUT links: 0
> fetchInterval: 1
> nextFetch: Tue Jun 14 20:59:25 EDT 2005
> Score: 1.0
> NextScore: 1.0
> 
> 
> Number of anchors: 0
> Outlinks Count: 62
> Outlink: toUrl: http://www.webct.com/entrypage anchor: 
> Outlink: toUrl: http://www.webct.com/entrypage anchor: Home
> Outlink: toUrl: http://www.webct.com/software anchor: Software
> Outlink: toUrl: http://www.webct.com/services anchor: Services
> Outlink: toUrl: http://www.webct.com/techsupport anchor: Support
> Outlink: toUrl: http://www.webct.com/success anchor: Customer Success
> Outlink: toUrl: http://www.webct.com/content anchor: Digital Content
> Outlink: toUrl: http://www.webct.com/powerlinks anchor: WebCT PowerLinks
> Outlink: toUrl: http://www.webct.com/vision anchor: Vision
> Outlink: toUrl: http://www.webct.com/company anchor: About Us
> Outlink: toUrl: http://www.webct.com/software/viewpage?name=software_campus_edition anchor: WebCT Campus Edition
> Outlink: toUrl: http://www.webct.com/software/viewpage?name=software_vista anchor: WebCT Vista
> Outlink: toUrl: http://www.webct.com/software/viewpage?name=software_languages anchor: Languages
> Outlink: toUrl: http://www.webct.com/services/viewpage?name=services_technical_solutions anchor: Technical Solutions
> Outlink: toUrl: http://www.webct.com/services/viewpage?name=services_professional_development anchor: Professional Development
> Outlink: toUrl: http://www.webct.com/services/viewpage?name=services_hosting anchor: Hosting Services
> Outlink: toUrl: http://www.webct.com/services/viewpage?name=services_support_options anchor: Support Services
> Outlink: toUrl: http://www.webct.com/services/viewpage?name=services_expanding_campus_edition anchor: WebCT Campus Edition
> Outlink: toUrl: http://www.webct.com/services/viewpage?name=services_getting_started_vista anchor: WebCT Vista
> Outlink: toUrl: http://www.webct.com/support anchor: WebCT Support
> Outlink: toUrl: http://www.webct.com/ask_drc anchor: Ask Dr. C
> Outlink: toUrl: http://www.webct.com/support/viewpage?name=company_documentation_index anchor: Documentation
> Outlink: toUrl: http://download.webct.com/ anchor: Software Downloads
> Outlink: toUrl: http://www.webct.com/techsupport/viewpage?name=techsupport_license_faq anchor: License Keys
> Outlink: toUrl: http://www.webct.com/success/viewpage?name=success_case_studies anchor: Case Studies
> Outlink: toUrl: http://www.webct.com/exemplary anchor: Exemplary Courses
> Outlink: toUrl: http://www.webct.com/institutes anchor: WebCT Institutes
> Outlink: toUrl: http://www.webct.com/worldwide anchor: WebCT Worldwide
> Outlink: toUrl: http://www.webct.com/content/viewpage?name=content_instructors anchor: Instructors
> Outlink: toUrl: http://www.webct.com/content/viewpage?name=content_admin anchor: Administrators
> Outlink: toUrl: http://www.webct.com/content/viewpage?name=content_access anchor: Student Access Codes
> Outlink: toUrl: http://www.webct.com/content/viewpage?name=content_customer_care anchor: Help
> Outlink: toUrl: http://www.webct.com/powerlinks/viewpage?name=powerlinks_network anchor: PowerLinks Network
> Outlink: toUrl: http://www.webct.com/powerlinks/viewpage?name=powerlinks_showcase anchor: PowerLinks Showcase
> Outlink: toUrl: http://www.webct.com/developers anchor: Vista Developers Network
> Outlink: toUrl: http://www.webct.com/company/viewpage?name=company_webct_customers anchor: Customers
> Outlink: toUrl: http://www.webct.com/company/viewpage?name=company_management_team anchor: Leadership
> Outlink: toUrl: http://www.webct.com/company/viewpage?name=company_manage_investors anchor: Investors
> Outlink: toUrl: http://www.webct.com/company/viewpage?name=company_partners anchor: Partners
> Outlink: toUrl: http://www.webct.com/company/viewpage?name=company_press_kit anchor: Press
> Outlink: toUrl: http://www.webct.com/company/viewpage?name=company_events anchor: Events
> Outlink: toUrl: http://www.webct.com/company/viewpage?name=company_jobs anchor: Jobs
> Outlink: toUrl: http://www.webct.com/company/viewpage?name=company_contact_us anchor: Contact Us
> Outlink: toUrl: http://www.webct.com/ce6 anchor: 
> Outlink: toUrl: http://www.webct.com/service/ViewContent?contentID=26162711 anchor: Innovative e-learning project
> Outlink: toUrl: http://www.webct.com/service/ViewContent?contentID=26052806 anchor: WebCT to give sneak preview
> Outlink: toUrl: http://www.webct.com/2005 anchor: WebCT Impact 2005
> Outlink: toUrl: http://www.webct.com/company/service/selectnewsletters anchor: Subscribe to WebCT Newsletter
> Outlink: toUrl: http://www.webct.com/vision anchor: Learn how WebCT can help your institution achieve learning without limits
> Outlink: toUrl: http://www.webct.com/events anchor: Events
> Outlink: toUrl: http://www.webct.com/software/viewpage?name=software_languages anchor: Languages
> Outlink: toUrl: http://www.webct.com/students anchor: Students
> Outlink: toUrl: http://www.webct.com/faculty anchor: Faculty
> Outlink: toUrl: http://www.webct.com/workshops anchor: Online Workshops
> Outlink: toUrl: http://www.webct.com/seminars anchor: Online Seminars
> Outlink: toUrl: http://www.webct.com/ask_drc anchor: Ask Dr. C
> Outlink: toUrl: http://www.webct.com/ce6 anchor: CE 6 Upgrade
> Outlink: toUrl: http://www.webct.com/company/viewpage?name=company_contact_us anchor: Contact Us
> Outlink: toUrl: http://www.webct.com/communities/servicepolicy anchor: Terms of Service
> Outlink: toUrl: http://www.webct.com/communities/privacypolicy anchor: Privacy Policy
> Outlink: toUrl: http://www.webct.com/company/viewpage?name=company_how_to_apply anchor: Employment
> Outlink: toUrl: http://www.webct.com/communities/viewpage?name=communities_site_map anchor: Site Map
> 
> 
> ------------------------------------------------------------------------
> 
> 
> ==================================================
> URL: http://www.eltweb.com/
> Number of OUT links: 0
> fetchInterval: 1
> nextFetch: Tue Jun 14 21:39:08 EDT 2005
> Score: 1.0
> NextScore: 1.0
> 
> 
> Number of anchors: 0
> Outlinks Count: 0
> 
> ==================================================
> URL: http://www.webct.com/students
> Number of OUT links: 0
> fetchInterval: 1
> nextFetch: Tue Jun 14 21:39:08 EDT 2005
> Score: 1.0
> NextScore: 1.0
> 
> 
> Number of anchors: 1
> Anchors: Students
> Outlinks Count: 0
> 
> ==================================================
> URL: http://www.hypermail.org/navbar.html
> Number of OUT links: 0
> fetchInterval: 1
> nextFetch: Tue Jun 14 21:39:08 EDT 2005
> Score: 1.0
> NextScore: 1.0
> 
> 
> Number of anchors: 0
> Outlinks Count: 0
> 
> ==================================================
> URL: http://www.cs.wisc.edu/scout/report
> Number of OUT links: 0
> fetchInterval: 1
> nextFetch: Tue Jun 14 21:39:08 EDT 2005
> Score: 1.0
> NextScore: 1.0
> 
> 
> Number of anchors: 0
> Outlinks Count: 0
> 
> ==================================================
> URL: http://www.webct.com/worldwide
> Number of OUT links: 0
> fetchInterval: 1
> nextFetch: Tue Jun 14 21:39:08 EDT 2005
> Score: 1.0
> NextScore: 1.0
> 
> 
> Number of anchors: 1
> Anchors: WebCT Worldwide
> Outlinks Count: 0
> 
> ==================================================
> URL: http://www.webct.com/techsupport
> Number of OUT links: 0
> fetchInterval: 1
> nextFetch: Tue Jun 14 21:39:08 EDT 2005
> Score: 1.0
> NextScore: 1.0
> 
> 
> Number of anchors: 1
> Anchors: Support
> Outlinks Count: 0
> 
> ==================================================
> URL: http://www.webct.com/support
> Number of OUT links: 0
> fetchInterval: 1
> nextFetch: Tue Jun 14 21:39:08 EDT 2005
> Score: 1.0
> NextScore: 1.0
> 
> 
> Number of anchors: 1
> Anchors: WebCT Support
> Outlinks Count: 0
> 
> ==================================================
> URL: http://www.kn.pacbell.com/wired/bluewebn/
> Number of OUT links: 0
> fetchInterval: 1
> nextFetch: Tue Jun 14 21:39:08 EDT 2005
> Score: 1.0
> NextScore: 1.0
> 
> 
> Number of anchors: 0
> Outlinks Count: 0
> 
> ==================================================
> URL: http://www.rockhillpress.com/
> Number of OUT links: 0
> fetchInterval: 1
> nextFetch: Tue Jun 14 21:39:08 EDT 2005
> Score: 1.0
> NextScore: 1.0
> 
> 
> Number of anchors: 0
> Outlinks Count: 0
> 
> ==================================================
> URL: http://www.hypermail.org/mail-archive/archives.html
> Number of OUT links: 0
> fetchInterval: 1
> nextFetch: Tue Jun 14 21:39:08 EDT 2005
> Score: 1.0
> NextScore: 1.0
> 
> 
> Number of anchors: 1
> Anchors: Mailing List Archives
> Outlinks Count: 0
> 
> ==================================================
> URL: http://www.webct.com/powerlinks
> Number of OUT links: 0
> fetchInterval: 1
> nextFetch: Tue Jun 14 21:39:08 EDT 2005
> Score: 1.0
> NextScore: 1.0
> 
> 
> Number of anchors: 1
> Anchors: WebCT PowerLinks
> Outlinks Count: 0
> 
> ==================================================
> URL: http://dev.hypermail.org/openfaq/
> Number of OUT links: 0
> fetchInterval: 1
> nextFetch: Tue Jun 14 21:39:08 EDT 2005
> Score: 1.0
> NextScore: 1.0
> 
> 
> Number of anchors: 1
> Anchors: OpenFAQ
> Outlinks Count: 0
> 
> ==================================================
> URL: http://www.hypermail.org/dist
> Number of OUT links: 0
> fetchInterval: 1
> nextFetch: Tue Jun 14 21:39:08 EDT 2005
> Score: 1.0
> NextScore: 1.0
> 
> 
> Number of anchors: 1
> Anchors: Download Hypermail Software
> Outlinks Count: 0
> 
> ==================================================
> URL: http://www.hypermail.org/credits.html
> Number of OUT links: 0
> fetchInterval: 1
> nextFetch: Tue Jun 14 21:39:08 EDT 2005
> Score: 1.0
> NextScore: 1.0
> 
> 
> Number of anchors: 1
> Anchors: Credits
> Outlinks Count: 0
> 
> ==================================================
> URL: http://www.hypermail.org/copyright.html
> Number of OUT links: 0
> fetchInterval: 1
> nextFetch: Tue Jun 14 21:39:08 EDT 2005
> Score: 1.0
> NextScore: 1.0
> 
> 
> Number of anchors: 1
> Anchors: Copyright
> Outlinks Count: 0
> 
> ==================================================
> URL: http://www.hypermail.org/search.html
> Number of OUT links: 0
> fetchInterval: 1
> nextFetch: Tue Jun 14 21:39:08 EDT 2005
> Score: 1.0
> NextScore: 1.0
> 
> 
> Number of anchors: 0
> Outlinks Count: 0
> 
> ==================================================
> URL: http://www.webct.com/company/service/selectnewsletters
> Number of OUT links: 0
> fetchInterval: 1
> nextFetch: Tue Jun 14 21:39:08 EDT 2005
> Score: 1.0
> NextScore: 1.0
> 
> 
> Number of anchors: 1
> Anchors: Subscribe to WebCT Newsletter
> Outlinks Count: 0
> 
> ==================================================
> URL: http://www.webct.com/company
> Number of OUT links: 0
> fetchInterval: 1
> nextFetch: Tue Jun 14 21:39:08 EDT 2005
> Score: 1.0
> NextScore: 1.0
> 
> 
> Number of anchors: 1
> Anchors: About Us
> Outlinks Count: 0
> 
> ==================================================
> URL: http://www.webct.com/developers
> Number of OUT links: 0
> fetchInterval: 1
> nextFetch: Tue Jun 14 21:39:08 EDT 2005
> Score: 1.0
> NextScore: 1.0
> 
> 
> Number of anchors: 1
> Anchors: Vista Developers Network
> Outlinks Count: 0
> 
> ==================================================
> URL: http://www.excite.com/apple/guide/Arts_and_Humanities/Books_and_Literature/Writing/reviews.html
> Number of OUT links: 0
> fetchInterval: 1
> nextFetch: Tue Jun 14 21:39:08 EDT 2005
> Score: 1.0
> NextScore: 1.0
> 
> 
> Number of anchors: 0
> Outlinks Count: 0
> 
> ==================================================
> URL: http://www.webct.com/services
> Number of OUT links: 0
> fetchInterval: 1
> nextFetch: Tue Jun 14 21:39:08 EDT 2005
> Score: 1.0
> NextScore: 1.0
> 
> 
> Number of anchors: 1
> Anchors: Services
> Outlinks Count: 0
> 
> ==================================================
> URL: http://members.tripod.com/~DoctorAhClem/ahclem.html
> Number of OUT links: 0
> fetchInterval: 1
> nextFetch: Tue Jun 14 21:39:08 EDT 2005
> Score: 1.0
> NextScore: 1.0
> 
> 
> Number of anchors: 0
> Outlinks Count: 0
> 
> ==================================================
> URL: http://www.hypermail.org/firstpage.html
> Number of OUT links: 0
> fetchInterval: 1
> nextFetch: Tue Jun 14 21:39:08 EDT 2005
> Score: 1.0
> NextScore: 1.0
> 
> 
> Number of anchors: 0
> Outlinks Count: 0
> 
> ==================================================
> URL: http://www.homeworkspot.com/high/english/essaywriting.htm
> Number of OUT links: 0
> fetchInterval: 1
> nextFetch: Tue Jun 14 21:39:08 EDT 2005
> Score: 1.0
> NextScore: 1.0
> 
> 
> Number of anchors: 0
> Outlinks Count: 0
> 
> ==================================================
> URL: http://www.hypermail.org/using.html
> Number of OUT links: 0
> fetchInterval: 1
> nextFetch: Tue Jun 14 21:39:08 EDT 2005
> Score: 1.0
> NextScore: 1.0
> 
> 
> Number of anchors: 1
> Anchors: Lists Using Hypermail
> Outlinks Count: 0
> 
> ==================================================
> URL: http://www.webct.com/communities/privacypolicy
> Number of OUT links: 0
> fetchInterval: 1
> nextFetch: Tue Jun 14 21:39:08 EDT 2005
> Score: 1.0
> NextScore: 1.0
> 
> 
> Number of anchors: 1
> Anchors: Privacy Policy
> Outlinks Count: 0
> 
> ==================================================
> URL: http://home.netscape.com/comprod/mirror/index.html
> Number of OUT links: 0
> fetchInterval: 1
> nextFetch: Tue Jun 14 21:39:08 EDT 2005
> Score: 1.0
> NextScore: 1.0
> 
> 
> Number of anchors: 1
> Anchors: Download
> Outlinks Count: 0
> 
> ==================================================
> URL: http://www.links2go.com/topic/Writing
> Number of OUT links: 0
> fetchInterval: 1
> nextFetch: Tue Jun 14 21:39:08 EDT 2005
> Score: 1.0
> NextScore: 1.0
> 
> 
> Number of anchors: 0
> Outlinks Count: 0
> 
> ==================================================
> URL: http://www.webct.com/seminars
> Number of OUT links: 0
> fetchInterval: 1
> nextFetch: Tue Jun 14 21:39:08 EDT 2005
> Score: 1.0
> NextScore: 1.0
> 
> 
> Number of anchors: 1
> Anchors: Online Seminars
> Outlinks Count: 0
> 
> ==================================================
> URL: http://www.webct.com/institutes
> Number of OUT links: 0
> fetchInterval: 1
> nextFetch: Tue Jun 14 21:39:08 EDT 2005
> Score: 1.0
> NextScore: 1.0
> 
> 
> Number of anchors: 1
> Anchors: WebCT Institutes
> Outlinks Count: 0
> 
> ==================================================
> URL: http://www.webct.com/events
> Number of OUT links: 0
> fetchInterval: 1
> nextFetch: Tue Jun 14 21:39:08 EDT 2005
> Score: 1.0
> NextScore: 1.0
> 
> 
> Number of anchors: 1
> Anchors: Events
> Outlinks Count: 0
> 
> ==================================================
> URL: http://www.webct.com/ce6
> Number of OUT links: 0
> fetchInterval: 1
> nextFetch: Tue Jun 14 21:39:08 EDT 2005
> Score: 1.0
> NextScore: 1.0
> 
> 
> Number of anchors: 0
> Outlinks Count: 0
> 
> ==================================================
> URL: http://www.hypermail.org/lists.html
> Number of OUT links: 0
> fetchInterval: 1
> nextFetch: Tue Jun 14 21:39:08 EDT 2005
> Score: 1.0
> NextScore: 1.0
> 
> 
> Number of anchors: 1
> Anchors: Mailing Lists
> Outlinks Count: 0
> 
> ==================================================
> URL: http://www.webct.com/success
> Number of OUT links: 0
> fetchInterval: 1
> nextFetch: Tue Jun 14 21:39:08 EDT 2005
> Score: 1.0
> NextScore: 1.0
> 
> 
> Number of anchors: 1
> Anchors: Customer Success
> Outlinks Count: 0
> 
> ==================================================
> URL: http://www.studyweb.com/
> Number of OUT links: 0
> fetchInterval: 1
> nextFetch: Tue Jun 14 21:39:08 EDT 2005
> Score: 1.0
> NextScore: 1.0
> 
> 
> Number of anchors: 0
> Outlinks Count: 0
> 
> ==================================================
> URL: http://www.webct.com/2005
> Number of OUT links: 0
> fetchInterval: 1
> nextFetch: Tue Jun 14 21:39:08 EDT 2005
> Score: 1.0
> NextScore: 1.0
> 
> 
> Number of anchors: 1
> Anchors: WebCT Impact 2005
> Outlinks Count: 0
> 
> ==================================================
> URL: http://www.webct.com/entrypage
> Number of OUT links: 0
> fetchInterval: 1
> nextFetch: Tue Jun 14 21:39:08 EDT 2005
> Score: 1.0
> NextScore: 1.0
> 
> 
> Number of anchors: 0
> Outlinks Count: 0
> 
> ==================================================
> URL: http://download.webct.com/
> Number of OUT links: 0
> fetchInterval: 1
> nextFetch: Tue Jun 14 21:39:08 EDT 2005
> Score: 1.0
> NextScore: 1.0
> 
> 
> Number of anchors: 1
> Anchors: Software Downloads
> Outlinks Count: 0
> 
> ==================================================
> URL: http://www.hypermail.org/cvs.html
> Number of OUT links: 0
> fetchInterval: 1
> nextFetch: Tue Jun 14 21:39:08 EDT 2005
> Score: 1.0
> NextScore: 1.0
> 
> 
> Number of anchors: 1
> Anchors: CVS Server Access
> Outlinks Count: 0
> 
> ==================================================
> URL: http://www.hypermail.org/net-resources.html
> Number of OUT links: 0
> fetchInterval: 1
> nextFetch: Tue Jun 14 21:39:08 EDT 2005
> Score: 1.0
> NextScore: 1.0
> 
> 
> Number of anchors: 1
> Anchors: Net.Resources
> Outlinks Count: 0
> 
> ==================================================
> URL: http://cvsweb.hypermail.org/
> Number of OUT links: 0
> fetchInterval: 1
> nextFetch: Tue Jun 14 21:39:08 EDT 2005
> Score: 1.0
> NextScore: 1.0
> 
> 
> Number of anchors: 1
> Anchors: Browsing the CVS Baseline
> Outlinks Count: 0
> 
> ==================================================
> URL: http://www.webct.com/exemplary
> Number of OUT links: 0
> fetchInterval: 1
> nextFetch: Tue Jun 14 21:39:08 EDT 2005
> Score: 1.0
> NextScore: 1.0
> 
> 
> Number of anchors: 1
> Anchors: Exemplary Courses
> Outlinks Count: 0
> 
> ==================================================
> URL: http://www.hypermail.org/docs.html
> Number of OUT links: 0
> fetchInterval: 1
> nextFetch: Tue Jun 14 21:39:08 EDT 2005
> Score: 1.0
> NextScore: 1.0
> 
> 
> Number of anchors: 1
> Anchors: Documentation
> Outlinks Count: 0
> 
> ==================================================
> URL: http://www.awesomelibrary.org/ratings.html
> Number of OUT links: 0
> fetchInterval: 1
> nextFetch: Tue Jun 14 21:39:08 EDT 2005
> Score: 1.0
> NextScore: 1.0
> 
> 
> Number of anchors: 0
> Outlinks Count: 0
> 
> ==================================================
> URL: http://www.webct.com/ask_drc
> Number of OUT links: 0
> fetchInterval: 1
> nextFetch: Tue Jun 14 21:39:08 EDT 2005
> Score: 1.0
> NextScore: 1.0
> 
> 
> Number of anchors: 1
> Anchors: Ask Dr. C
> Outlinks Count: 0
> 
> ==================================================
> URL: http://www.schoolzone.co.uk/
> Number of OUT links: 0
> fetchInterval: 1
> nextFetch: Tue Jun 14 21:39:08 EDT 2005
> Score: 1.0
> NextScore: 1.0
> 
> 
> Number of anchors: 0
> Outlinks Count: 0
> 
> ==================================================
> URL: http://www.webct.com/vision
> Number of OUT links: 0
> fetchInterval: 1
> nextFetch: Tue Jun 14 21:39:08 EDT 2005
> Score: 1.0
> NextScore: 1.0
> 
> 
> Number of anchors: 1
> Anchors: Vision
> Outlinks Count: 0
> 
> ==================================================
> URL: http://www.webct.com/content
> Number of OUT links: 0
> fetchInterval: 1
> nextFetch: Tue Jun 14 21:39:08 EDT 2005
> Score: 1.0
> NextScore: 1.0
> 
> 
> Number of anchors: 1
> Anchors: Digital Content
> Outlinks Count: 0
> 
> ==================================================
> URL: http://www.webct.com/software
> Number of OUT links: 0
> fetchInterval: 1
> nextFetch: Tue Jun 14 21:39:08 EDT 2005
> Score: 1.0
> NextScore: 1.0
> 
> 
> Number of anchors: 1
> Anchors: Software
> Outlinks Count: 0
> 
> ==================================================
> URL: http://webcrawler.com/select/
> Number of OUT links: 0
> fetchInterval: 1
> nextFetch: Tue Jun 14 21:39:08 EDT 2005
> Score: 1.0
> NextScore: 1.0
> 
> 
> Number of anchors: 0
> Outlinks Count: 0
> 
> ==================================================
> URL: http://www.hypermail.org/submit-patches.html
> Number of OUT links: 0
> fetchInterval: 1
> nextFetch: Tue Jun 14 21:39:08 EDT 2005
> Score: 1.0
> NextScore: 1.0
> 
> 
> Number of anchors: 1
> Anchors: Submitting Patches
> Outlinks Count: 0
> 
> ==================================================
> URL: http://www.webct.com/communities/servicepolicy
> Number of OUT links: 0
> fetchInterval: 1
> nextFetch: Tue Jun 14 21:39:08 EDT 2005
> Score: 1.0
> NextScore: 1.0
> 
> 
> Number of anchors: 1
> Anchors: Terms of Service
> Outlinks Count: 0
> 
> ==================================================
> URL: http://www.hypermail.org/others.html
> Number of OUT links: 0
> fetchInterval: 1
> nextFetch: Tue Jun 14 21:39:08 EDT 2005
> Score: 1.0
> NextScore: 1.0
> 
> 
> Number of anchors: 1
> Anchors: The Others
> Outlinks Count: 0
> 
> ==================================================
> URL: http://www.webct.com/workshops
> Number of OUT links: 0
> fetchInterval: 1
> nextFetch: Tue Jun 14 21:39:08 EDT 2005
> Score: 1.0
> NextScore: 1.0
> 
> 
> Number of anchors: 1
> Anchors: Online Workshops
> Outlinks Count: 0
> 
> ==================================================
> URL: http://www.webenglishteacher.com/
> Number of OUT links: 0
> fetchInterval: 1
> nextFetch: Tue Jun 14 21:39:08 EDT 2005
> Score: 1.0
> NextScore: 1.0
> 
> 
> Number of anchors: 0
> Outlinks Count: 0
> 
> ==================================================
> URL: http://www.webct.com/faculty
> Number of OUT links: 0
> fetchInterval: 1
> nextFetch: Tue Jun 14 21:39:08 EDT 2005
> Score: 1.0
> NextScore: 1.0
> 
> 
> Number of anchors: 1
> Anchors: Faculty
> Outlinks Count: 0
> 
> 
> ------------------------------------------------------------------------
> 
> admin@dss ~/nutch
> $ nutch fetchlist -dumpurls segments/20050607213908
> run java in C:\j2sdk1.4.2_06
> 050607 214600 No NutchFileSystem indicated, so defaulting to local fs.
> 050607 214600 loading file:/D:/nutch-0.6/conf/nutch-default.xml
> 050607 214601 loading file:/D:/nutch-0.6/conf/nutch-site.xml
> Recno 0: http://www.eltweb.com/
> Recno 1: http://www.webct.com/students
> Recno 2: http://www.hypermail.org/navbar.html
> Recno 3: http://www.cs.wisc.edu/scout/report
> Recno 4: http://www.webct.com/worldwide
> Recno 5: http://www.webct.com/techsupport
> Recno 6: http://www.webct.com/support
> Recno 7: http://www.kn.pacbell.com/wired/bluewebn/
> Recno 8: http://www.rockhillpress.com/
> Recno 9: http://www.hypermail.org/mail-archive/archives.html
> Recno 10: http://www.webct.com/powerlinks
> Recno 11: http://dev.hypermail.org/openfaq/
> Recno 12: http://www.hypermail.org/dist
> Recno 13: http://www.hypermail.org/credits.html
> Recno 14: http://www.hypermail.org/copyright.html
> Recno 15: http://www.hypermail.org/search.html
> Recno 16: http://www.webct.com/company/service/selectnewsletters
> Recno 17: http://www.webct.com/company
> Recno 18: http://www.webct.com/developers
> Recno 19: http://www.excite.com/apple/guide/Arts_and_Humanities/Books_and_Literature/Writing/reviews.html
> Recno 20: http://www.webct.com/services
> Recno 21: http://members.tripod.com/~DoctorAhClem/ahclem.html
> Recno 22: http://www.hypermail.org/firstpage.html
> Recno 23: http://www.homeworkspot.com/high/english/essaywriting.htm
> Recno 24: http://www.hypermail.org/using.html
> Recno 25: http://www.webct.com/communities/privacypolicy
> Recno 26: http://home.netscape.com/comprod/mirror/index.html
> Recno 27: http://www.links2go.com/topic/Writing
> Recno 28: http://www.webct.com/seminars
> Recno 29: http://www.webct.com/institutes
> Recno 30: http://www.webct.com/events
> Recno 31: http://www.webct.com/ce6
> Recno 32: http://www.hypermail.org/lists.html
> Recno 33: http://www.webct.com/success
> Recno 34: http://www.studyweb.com/
> Recno 35: http://www.webct.com/2005
> Recno 36: http://www.webct.com/entrypage
> Recno 37: http://download.webct.com/
> Recno 38: http://www.hypermail.org/cvs.html
> Recno 39: http://www.hypermail.org/net-resources.html
> Recno 40: http://cvsweb.hypermail.org/
> Recno 41: http://www.webct.com/exemplary
> Recno 42: http://www.hypermail.org/docs.html
> Recno 43: http://www.awesomelibrary.org/ratings.html
> Recno 44: http://www.webct.com/ask_drc
> Recno 45: http://www.schoolzone.co.uk/
> Recno 46: http://www.webct.com/vision
> Recno 47: http://www.webct.com/content
> Recno 48: http://www.webct.com/software
> Recno 49: http://webcrawler.com/select/
> Recno 50: http://www.hypermail.org/submit-patches.html
> Recno 51: http://www.webct.com/communities/servicepolicy
> Recno 52: http://www.hypermail.org/others.html
> Recno 53: http://www.webct.com/workshops
> Recno 54: http://www.webenglishteacher.com/
> Recno 55: http://www.webct.com/faculty
> 
> 
> ------------------------------------------------------------------------
> 
> $ nutch fetchlist -dumpurls segments/20050607230550
> run java in C:\j2sdk1.4.2_06
> 050607 230638 No NutchFileSystem indicated, so defaulting to local fs.
> 050607 230638 loading file:/D:/nutch-0.6/conf/nutch-default.xml
> 050607 230638 loading file:/D:/nutch-0.6/conf/nutch-site.xml
> Recno 0: http://www.eltweb.com/
> Recno 1: http://www.webct.com/students
> Recno 2: http://www.hypermail.org/navbar.html
> Recno 3: http://www.cs.wisc.edu/scout/report
> Recno 4: http://www.webct.com/worldwide
> Recno 5: http://www.webct.com/techsupport
> Recno 6: http://www.webct.com/support
> Recno 7: http://www.kn.pacbell.com/wired/bluewebn/
> Recno 8: http://www.rockhillpress.com/
> Recno 9: http://www.hypermail.org/mail-archive/archives.html
> Recno 10: http://www.webct.com/powerlinks
> Recno 11: http://dev.hypermail.org/openfaq/
> Recno 12: http://www.hypermail.org/dist
> Recno 13: http://www.hypermail.org/credits.html
> Recno 14: http://www.hypermail.org/copyright.html
> Recno 15: http://www.hypermail.org/search.html
> Recno 16: http://www.webct.com/company/service/selectnewsletters
> Recno 17: http://www.webct.com/company
> Recno 18: http://www.webct.com/developers
> Recno 19: http://www.excite.com/apple/guide/Arts_and_Humanities/Books_and_Literature/Writing/reviews.html
> Recno 20: http://www.webct.com/services
> Recno 21: http://members.tripod.com/~DoctorAhClem/ahclem.html
> Recno 22: http://www.hypermail.org/firstpage.html
> Recno 23: http://www.homeworkspot.com/high/english/essaywriting.htm
> Recno 24: http://www.hypermail.org/using.html
> Recno 25: http://www.webct.com/communities/privacypolicy
> Recno 26: http://home.netscape.com/comprod/mirror/index.html
> Recno 27: http://www.links2go.com/topic/Writing
> Recno 28: http://www.webct.com/seminars
> Recno 29: http://www.webct.com/institutes
> Recno 30: http://www.webct.com/events
> Recno 31: http://www.webct.com/ce6
> Recno 32: http://www.hypermail.org/lists.html
> Recno 33: http://www.webct.com/success
> Recno 34: http://www.studyweb.com/
> Recno 35: http://www.webct.com/2005
> Recno 36: http://www.webct.com/entrypage
> Recno 37: http://download.webct.com/
> Recno 38: http://www.hypermail.org/cvs.html
> Recno 39: http://www.hypermail.org/net-resources.html
> Recno 40: http://cvsweb.hypermail.org/
> Recno 41: http://www.webct.com/exemplary
> Recno 42: http://www.hypermail.org/docs.html
> Recno 43: http://www.awesomelibrary.org/ratings.html
> Recno 44: http://www.webct.com/ask_drc
> Recno 45: http://www.schoolzone.co.uk/
> Recno 46: http://www.webct.com/vision
> Recno 47: http://www.webct.com/content
> Recno 48: http://www.webct.com/software
> Recno 49: http://webcrawler.com/select/
> Recno 50: http://www.hypermail.org/submit-patches.html
> Recno 51: http://www.webct.com/communities/servicepolicy
> Recno 52: http://www.hypermail.org/others.html
> Recno 53: http://www.webct.com/workshops
> Recno 54: http://www.webenglishteacher.com/
> Recno 55: http://www.webct.com/faculty



Re: Intranet crawl and re-fetch - newbie question

Posted by "Daniel D." <nu...@gmail.com>.
Hi,

I have run some tests to verify how -refetchonly behaves (as nobody has 
confirmed this yet) and would like to share the results with you. I have 
also added some questions at the end.

I'm using Nutch 0.6. 
For test purposes I have modified the code to create a log file with some 
URL information. For Test 2 I have also changed the code to modify the fetch 
interval (see below). 

Test 1:
I created a DB and injected 3 URLs. The re-fetch interval 
(db.default.fetch.interval) was set to 1. 
1. I ran fetch. I'm attaching log_10_7_days.txt so you can see the results 
of the fetch. Please pay attention to the nextFetch date: even though the 
fetch interval is 1, the nextFetch date was 7 days ahead. I think this 
nextFetch value is being read from the fetchlist. (Question #1)
2. I updated the DB.
3. I created the segments with the -refetchonly option. The results of 
nutch fetchlist -dumpurls … are attached as test1_dumpurls.txt.
You can see that only new URLs were included. But URLs of the following 
form were not included: 
http://www.webct.com/software/viewpage?name=software_campus_edition or 
http://v.extreme-dm.com/?login=cguilfor (Question #2)
4. I ran fetch on the new segment (created in step 3). The results are in 
log_10_7_refetch.txt. You will see that all URLs from test1_dumpurls.txt 
were fetched but no outlinks were recorded. (Question #3)
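
To make the expectation behind Question #1 concrete: with a fetch interval 
of 1 day, one would expect the next fetch date to fall one day after the 
fetch, not seven. Below is a minimal Python sketch of that arithmetic. The 
next_fetch helper and the timestamp are hypothetical and only illustrate 
the expectation; this is not Nutch's code.

```python
from datetime import datetime, timedelta


def next_fetch(fetched_at: datetime, interval_days: int) -> datetime:
    """Return the expected next fetch time (illustrative helper, not Nutch code)."""
    return fetched_at + timedelta(days=interval_days)


# Timestamp chosen purely for illustration.
fetched_at = datetime(2005, 6, 7, 23, 6, 38)

expected = next_fetch(fetched_at, 1)  # what an interval of 1 day suggests
observed = next_fetch(fetched_at, 7)  # the 7-day offset reported in Test 1

print(expected.date(), observed.date())
```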


Test 2: After realizing that nextFetch was 7 days ahead, I modified the code 
to ignore the value loaded from the fetchlist and keep nextFetch equal to 
the current time (assigned at initialization).

I created a DB and injected 3 URLs. The re-fetch interval 
(db.default.fetch.interval) was set to 1. 
1. I ran fetch. I'm attaching log_10_0_days.txt so you can see the results 
of the fetch. Please pay attention to the nextFetch date. 
2. I updated the DB.
3. I created the segments with the -refetchonly option. The results of 
nutch fetchlist -dumpurls … are attached as test2_dumpurls.txt. Note that 
even though the current time had passed the nextFetch date, I found exactly 
the same list of URLs as in Test 1!
You can see that only new URLs were included. But URLs of the following 
form were not included: 
http://www.webct.com/software/viewpage?name=software_campus_edition or 
http://v.extreme-dm.com/?login=cguilfor (Question #2)
4. I ran fetch on the new segment (created in step 3). The results are in 
log_10_0_refetch.txt. You will see that all URLs from test2_dumpurls.txt 
were fetched but no outlinks were recorded. (Question #3)
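
The expectation in Test 2 is that a refetch-only fetchlist would contain the 
pages whose nextFetch time has already passed. Below is a rough Python 
sketch of that selection rule, assuming each page carries a next_fetch 
timestamp. The Page class, the due_for_refetch function and the sample URLs 
are hypothetical and do not mirror Nutch's internals.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List


@dataclass
class Page:
    url: str
    next_fetch: datetime  # hypothetical field, modeled on the nextFetch value in the logs


def due_for_refetch(pages: List[Page], now: datetime) -> List[Page]:
    """Select the pages whose next fetch time is not in the future."""
    return [p for p in pages if p.next_fetch <= now]


now = datetime(2005, 6, 8, 12, 0, 0)
pages = [
    Page("http://www.webct.com/", now - timedelta(hours=1)),     # already due
    Page("http://www.hypermail.org/", now + timedelta(days=7)),  # due in a week
]

due = due_for_refetch(pages, now)
print([p.url for p in due])
```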

Questions:
1. Why is the Page object's nextFetch variable 7 days ahead when 
db.default.fetch.interval is 1?
2. Why did creating the segments with -refetchonly exclude URLs of the 
following form (I think because they contain a question mark): 
http://www.webct.com/software/viewpage?name=software_campus_edition or 
http://v.extreme-dm.com/?login=cguilfor
3. Why does fetching the fetchlist created with -refetchonly not store 
outlinks in the results?
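
One possibility worth checking for Question #2 is a URL filter rule that 
rejects URLs containing query characters. The regex URL filter files 
distributed with Nutch typically contain a rule of the form -[?*!@=]. 
Whether such a rule is what dropped these URLs in this setup is an 
assumption, not something confirmed in this thread. Below is a Python 
sketch of how a rule like that behaves; the accept function is hypothetical 
and is not Nutch code.

```python
import re

# Modeled on a "-[?*!@=]" rule: reject any URL containing one of these characters.
QUERY_CHARS = re.compile(r"[?*!@=]")


def accept(url: str) -> bool:
    """Return True if the URL passes the illustrative filter rule."""
    return QUERY_CHARS.search(url) is None


urls = [
    "http://www.webct.com/software",
    "http://www.webct.com/software/viewpage?name=software_campus_edition",
    "http://v.extreme-dm.com/?login=cguilfor",
]

kept = [u for u in urls if accept(u)]
print(kept)
```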

I hope my results help in understanding how this works. 

Guys, please find time to answer these questions, as this would greatly 
help my work.

Thanks,
Daniel.


On 6/6/05, Piotr Kosiorowski <pk...@gmail.com> wrote:
> As far as I know crawl - (named Intranet crawling in tutorial) - assumes
> you refetch everything from scratch every time you run it. Whole Web
> crawling allows you to control what you want to crawl and recrawl with
> more details but some parameters might not work as I would expect (eg.
> -refetchonly). Support for checking if page was modified from last fetch
> time is currently missing (although as I understand there is some work
> going on in this direction: http://issues.apache.org/jira/browse/NUTCH-61)
> Regards
> Piotr
>

Re: Intranet crawl and re-fetch - newbie question

Posted by Piotr Kosiorowski <pk...@gmail.com>.
As far as I know, crawl (named Intranet crawling in the tutorial) assumes 
you refetch everything from scratch every time you run it. Whole Web 
crawling allows you to control what you want to crawl and recrawl in 
more detail, but some parameters might not work as I would expect (e.g. 
-refetchonly). Support for checking whether a page was modified since the 
last fetch is currently missing (although as I understand there is some 
work going on in this direction: http://issues.apache.org/jira/browse/NUTCH-61 )
Regards
Piotr
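
A common way to detect whether a page changed since the last fetch is to 
compare a digest of its content with the digest stored from the previous 
fetch. The Python sketch below illustrates that general idea only. It is 
not what NUTCH-61 implements, and content_digest and is_modified are 
hypothetical names.

```python
import hashlib
from typing import Optional


def content_digest(content: bytes) -> str:
    """Return a hex digest identifying the page content."""
    return hashlib.sha256(content).hexdigest()


def is_modified(previous_digest: Optional[str], content: bytes) -> bool:
    """A page counts as modified if it is new or its digest differs."""
    return previous_digest is None or previous_digest != content_digest(content)


old = content_digest(b"<html>version 1</html>")

print(is_modified(old, b"<html>version 1</html>"))   # unchanged content
print(is_modified(old, b"<html>version 2</html>"))   # changed content
print(is_modified(None, b"<html>version 1</html>"))  # never fetched before
```

A crawler using this approach could skip re-indexing pages whose digest is 
unchanged, although it would still have to download them to compute the 
digest.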



isabelle.moulinier@thomson.com wrote:
> Hello,
> 
> I have a newbie question:
> 
> I have launched and completed an intranet crawling (bin/nutch crawl mySite myDB).
> Since I would like to recrawl in a few days, I changed the nutch default parameter to 3 days (instead of 30).
> How do I perform the recrawl? Do I just launch a new intranet crawling using the same parameters? 
> If I do, will the fetching only download new or modified pages, or will it download everything again?
> 
> Thanks for any help
> 
> Isabelle
> 
> Isabelle.Moulinier@thomson.com
> Ph: 651 687 3424
> 
> 
> 
> 


Re: Intranet crawl and re-fetch - newbie question

Posted by Jack Tang <hi...@gmail.com>.
Hi 

I was focused on Nutch a month ago, then was interrupted, and here I am now.
I would like to confirm one thing: does the Nutch hosted in svn support recrawling now?
If yes, could you please tell me the config params? Thanks

/Jack

On 6/2/05, isabelle.moulinier@thomson.com
<is...@thomson.com> wrote:
> Hello,
> 
> I have a newbie question:
> 
> I have launched and completed an intranet crawling (bin/nutch crawl mySite myDB).
> Since I would like to recrawl in a few days, I changed the nutch default parameter to 3 days (instead of 30).
> How do I perform the recrawl? Do I just launch a new intranet crawling using the same parameters?
> If I do, will the fetching only download new or modified pages, or will it download everything again?
> 
> Thanks for any help
> 
> Isabelle
> 
> Isabelle.Moulinier@thomson.com
> Ph: 651 687 3424
> 
> 
> 
>