Posted to user@nutch.apache.org by Gabriele Kahlout <ga...@mysimpatico.com> on 2011/03/16 08:44:45 UTC

What's wrong crawling a google site? Why is the time limit 0?

$  bin/nutch inject crawl/crawldb dmoz
Injector: starting at 2011-03-15 22:17:40
Injector: crawlDb: crawl/crawldb
Injector: urlDir: dmoz
Injector: Converting injected urls to crawl db entries.
Injector: Merging injected urls into crawl db.
Injector: finished at 2011-03-15 22:17:53, elapsed: 00:00:13
$  bin/nutch generate crawl/crawldb crawl/segments
Generator: starting at 2011-03-15 22:18:33
Generator: Selecting best-scoring urls due for fetch.
Generator: filtering: true
Generator: normalizing: true
Generator: jobtracker is 'local', generating exactly one partition.
Generator: Partitioning selected urls for politeness.
Generator: segment: crawl/segments/20110315221842
Generator: finished at 2011-03-15 22:18:47, elapsed: 00:00:13
$ s1=`ls -d crawl/segments/2* | tail -1`
$  bin/nutch fetch $s1
Fetcher: Your 'http.agent.name' value should be listed first in
'http.robots.agents' property.
Fetcher: starting at 2011-03-15 22:18:59
Fetcher: segment: crawl/segments/20110315221842
Fetcher: threads: 10
*QueueFeeder finished: total 1 records + hit by time limit :0*
*fetching http://sites.google.com/a/mysimpatico.com/home/dp4j*
-finishing thread FetcherThread, activeThreads=5
-finishing thread FetcherThread, activeThreads=8
-finishing thread FetcherThread, activeThreads=7
-finishing thread FetcherThread, activeThreads=6
-finishing thread FetcherThread, activeThreads=5
-finishing thread FetcherThread, activeThreads=1
-finishing thread FetcherThread, activeThreads=2
-finishing thread FetcherThread, activeThreads=3
-finishing thread FetcherThread, activeThreads=4
-activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
-finishing thread FetcherThread, activeThreads=0
-activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
-activeThreads=0
Fetcher: finished at 2011-03-15 22:19:10, elapsed: 00:00:10
$  bin/nutch updatedb crawl/crawldb $s1
CrawlDb update: starting at 2011-03-15 22:19:17
CrawlDb update: db: crawl/crawldb
CrawlDb update: segments: [crawl/segments/20110315221842]
CrawlDb update: additions allowed: true
CrawlDb update: URL normalizing: false
CrawlDb update: URL filtering: false
CrawlDb update: Merging segment data into db.
CrawlDb update: finished at 2011-03-15 22:19:26, elapsed: 00:00:08
$  bin/nutch invertlinks crawl/linkdb -dir crawl/segments
LinkDb: starting at 2011-03-15 22:19:37
LinkDb: linkdb: crawl/linkdb
LinkDb: URL normalize: true
LinkDb: URL filter: true
LinkDb: adding segment:
file:/users/simpatico/nutch-1.2/crawl/segments/20110315221842
LinkDb: finished at 2011-03-15 22:19:44, elapsed: 00:00:06
$  bin/nutch index crawl/indexes crawl/crawldb crawl/linkdb
crawl/segments/*
Indexer: starting at 2011-03-15 22:19:48
Indexer: finished at 2011-03-15 22:20:02, elapsed: 00:00:13
$  bin/nutch org.apache.nutch.searcher.NutchBean dp4j
Total hits: 0
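
For reference, the same sequence condensed into one sketch. The loop depth
is an assumption: the tutorial repeats generate/fetch/updatedb per round,
so URLs discovered (or merely recorded) in one round get fetched in the
next.

bin/nutch inject crawl/crawldb dmoz
for i in 1 2 3; do                       # round count is illustrative
  bin/nutch generate crawl/crawldb crawl/segments
  s=`ls -d crawl/segments/2* | tail -1`  # newest segment
  bin/nutch fetch $s
  bin/nutch updatedb crawl/crawldb $s
done
bin/nutch invertlinks crawl/linkdb -dir crawl/segments
bin/nutch index crawl/indexes crawl/crawldb crawl/linkdb crawl/segments/*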


-- 
Regards,
K. Gabriele

--- unchanged since 20/9/10 ---
P.S. If the subject contains "[LON]" or the addressee acknowledges the
receipt within 48 hours then I don't resend the email.
subject(this) ∈ L(LON*) ∨ ∃x. (x ∈ MyInbox ∧ Acknowledges(x, this) ∧ time(x)
< Now + 48h) ⇒ ¬resend(I, this).

If an email is sent by a sender that is not a trusted contact or the email
does not contain a valid code then the email is not received. A valid code
starts with a hyphen and ends with "X".
∀x. x ∈ MyInbox ⇒ from(x) ∈ MySafeSenderList ∨ (∃y. y ∈ subject(x) ∧ y ∈
L(-[a-z]+[0-9]X)).

Re: What's wrong crawling a google site? Why is the time limit 0?

Posted by Gabriele Kahlout <ga...@mysimpatico.com>.
QueueFeeder finished: total 1 records + *hit by time limit :0*
fetching http://*localhost*:8080/qui/scoringtest.html
-finishing thread FetcherThread, activeThreads=1
-finishing thread FetcherThread, activeThreads=1
-finishing thread FetcherThread, activeThreads=1
-finishing thread FetcherThread, activeThreads=1
-finishing thread FetcherThread, activeThreads=1
-finishing thread FetcherThread, activeThreads=1
-finishing thread FetcherThread, activeThreads=1
-finishing thread FetcherThread, activeThreads=1
-finishing thread FetcherThread, activeThreads=1
-activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
-finishing thread FetcherThread, activeThreads=0
-activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
-activeThreads=0
Fetcher: finished at 2011-03-24 11:56:37, elapsed: 00:00:13


http://localhost:8080/qui/scoringtest.html      Version: 7
Status: 2 (db_*fetched*)
Fetch time: Sat Apr 23 12:56:34 CEST 2011
Modified time: Thu Jan 01 01:00:00 CET 1970
Retries since fetch: 0
Retry interval: 2592000 seconds (30 days)
Score: 1.0
Signature: 0028561eab448dbcf15f287fd705a152
Metadata: _pst_: *success*(1), lastModified=0
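
(For anyone reproducing the localhost control above: any local web server
does. A sketch, assuming Python 2 is on the PATH; the directory is
illustrative.

$ cd /path/to/testpages && python -m SimpleHTTPServer 8080)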



Re: What's wrong crawling a google site? Why is the time limit 0?

Posted by Gabriele Kahlout <ga...@mysimpatico.com>.
Now even the info in readdb is not enough to troubleshoot why a page is not
fetched.

$ bin/nutch readdb crawl/crawldb -url
http://en.wikipedia.org/wiki/Artificial_Christmas_tree
URL: http://en.wikipedia.org/wiki/Artificial_Christmas_tree
Version: 7
Status: 1 (db_unfetched)
Fetch time: Wed Mar 23 07:27:08 CET 2011
Modified time: Thu Jan 01 01:00:00 CET 1970
Retries since fetch: 0
Retry interval: 2592000 seconds (30 days)
Score: 1.0
Signature: null
Metadata:

It's not robots.txt, and there's nothing wrong with the URL, so why is it unfetched?
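
One way to dig deeper than readdb (a sketch; the segment picked below is
illustrative, and the URL only shows up if the generator selected it) is to
read the per-URL entry out of the segment itself, which records the
protocol status that the crawldb summary omits:

$ s=`ls -d crawl/segments/2* | tail -1`
$ bin/nutch readseg -get $s http://en.wikipedia.org/wiki/Artificial_Christmas_tree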


Re: What's wrong crawling a google site? Why is the time limit 0?

Posted by Gabriele Kahlout <ga...@mysimpatico.com>.
On Tue, Mar 22, 2011 at 3:43 PM, Gabriele Kahlout
<ga...@mysimpatico.com> wrote:

> //ok, but:
> //$ bin/nutch org.apache.nutch.searcher.NutchBean artificial
> // Total hits: 0
> Am I right in saying that parse is an extra step not mentioned in the
> tutorial <http://wiki.apache.org/nutch/NutchTutorial>?
>

Looks like it since:
bin/nutch parse crawl/segments/20110322155911
ParseSegment: starting at 2011-03-22 15:59:45
ParseSegment: segment: crawl/segments/20110322155911
Exception in thread "main" java.io.IOException: *Segment already parsed!*
    at
org.apache.nutch.parse.ParseOutputFormat.checkOutputSpecs(ParseOutputFormat.java:80)
    at
org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:772)
    at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:730)
    at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1249)
    at org.apache.nutch.parse.ParseSegment.parse(ParseSegment.java:156)
    at org.apache.nutch.parse.ParseSegment.run(ParseSegment.java:177)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
    at org.apache.nutch.parse.ParseSegment.main(ParseSegment.java:163)
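
For what it's worth, the error above suggests no step is missing: in Nutch
1.2 the fetcher parses while it fetches when fetcher.parse is true (which
"Segment already parsed!" indicates was the case here), so a separate parse
run is only needed when fetching with parsing turned off. A sketch of
inspecting what a segment already holds (path as in the log):

$ bin/nutch readseg -list crawl/segments/20110322155911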


Re: What's wrong crawling a google site? Why is the time limit 0?

Posted by Gabriele Kahlout <ga...@mysimpatico.com>.
On Tue, Mar 22, 2011 at 2:51 PM, Markus Jelsma
<ma...@openindex.io> wrote:

>
>
> On Tuesday 22 March 2011 14:38:13 Gabriele Kahlout wrote:
> > On Tue, Mar 22, 2011 at 1:33 PM, Markus Jelsma
> > > You can check the contents of the CrawlDB by using nutch readdb.
> > >
> > > $ bin/nutch readdb crawl/crawldb -url
> >
> > http://sites.google.com/a/mysimpatico.com/home/dp4j
> > URL: http://sites.google.com/a/mysimpatico.com/home/dp4j
> > not found
> > michaela:nutch-1.2 simpatico$ bin/nutch readdb crawl/crawldb -url
> > http://sites.google.com
> > URL: http://sites.google.com
> > not found
>
> If they're not in the crawldb then you a) didn't inject them or b) the
> url's
> didn't pass the filters. Also, take care that the url parameter of readdb
> only
> accepts the exact url, missing a slash can make a difference here.





> Use -dump
> to inspect the complete db.
>
>
http://docs.google.com/document/d/1mN4HxhsdfwkRzqkgLPl46R4KffR-Ww42LtQRiiLtQIE
Version: 7
Status: 4 (db_redir_temp)
Fetch time: Thu Apr 21 16:26:40 CEST 2011
Modified time: Thu Jan 01 01:00:00 CET 1970
Retries since fetch: 0
Retry interval: 2592000 seconds (30 days)
Score: 1.0
Signature: null
Metadata: _pst_: temp_moved(13), lastModified=0:
http://docs.google.com/document/d/1mN4HxhsdfwkRzqkgLPl46R4KffR-Ww42LtQRiiLtQIE/

http://docs.google.com/document/d/1mN4HxhsdfwkRzqkgLPl46R4KffR-Ww42LtQRiiLtQIE/
Version: 7
Status: 1 (db_unfetched)
Fetch time: Tue Mar 22 15:27:04 CET 2011
Modified time: Thu Jan 01 01:00:00 CET 1970
Retries since fetch: 0
Retry interval: 2592000 seconds (30 days)
Score: 1.0
Signature: null
Metadata: _pst_: temp_moved(13), lastModified=0:
http://docs.google.com/document/d/1mN4HxhsdfwkRzqkgLPl46R4KffR-Ww42LtQRiiLtQIE/
_repr_: http://docs.google.com/document/d/1mN4HxhsdfwkRzqkgLPl46R4KffR-Ww42LtQRiiLtQIE

http://singinst.org/upload/artificial-intelligence-risk.pdf    Version: 7
Status: 2 (*db_fetched)*
Fetch time: Thu Apr 21 16:26:39 CEST 2011
Modified time: Thu Jan 01 01:00:00 CET 1970
Retries since fetch: 0
Retry interval: 2592000 seconds (30 days)
Score: 1.0
Signature: f15bcfec04913f6772a4ce7a22a0caf3
Metadata: _pst_: *success*(1), lastModified=0
//ok, but:
//$ bin/nutch org.apache.nutch.searcher.NutchBean artificial
// Total hits: 0
Am I right in saying that parse is an extra step not mentioned in the
tutorial <http://wiki.apache.org/nutch/NutchTutorial>?


http://sites.google.com/a/mysimpatico.com/home/dp4j    Version: 7
Status: 4 (db_redir_temp)
Fetch time: Thu Apr 21 16:26:40 CEST 2011
Modified time: Thu Jan 01 01:00:00 CET 1970
Retries since fetch: 0
Retry interval: 2592000 seconds (30 days)
Score: 1.0
Signature: null
Metadata: _pst_: temp_moved(13), lastModified=0:
https://sites.google.com/a/mysimpatico.com/home/dp4j

http://wsdownload.bbc.co.uk/learningenglish/pdf/2011/03/110303122858_110303_6min_heart.pdf
Version: 7
Status: 3 (db_gone)
Fetch time: Fri May 06 16:26:37 CEST 2011
Modified time: Thu Jan 01 01:00:00 CET 1970
Retries since fetch: 0
Retry interval: 3888000 seconds (45 days)
Score: 1.0
Signature: null
Metadata: _pst_: *robots_denied*(18), lastModified=0 //fair enough

http://www.egamaster.com/datos/politica_fr.pdf    Version: 7
Status: 2 (db_fetched)
Fetch time: Thu Apr 21 16:26:40 CEST 2011
Modified time: Thu Jan 01 01:00:00 CET 1970
Retries since fetch: 0
Retry interval: 2592000 seconds (30 days)
Score: 1.0
Signature: efc86a0a75896a35d4d255e30429db12
Metadata: _pst_: success(1), lastModified=0

https://sites.google.com/a/mysimpatico.com/home/dp4j    Version: 7
Status: 1 (db_unfetched)
Fetch time: Tue Mar 22 15:27:04 CET 2011
Modified time: Thu Jan 01 01:00:00 CET 1970
Retries since fetch: 0
Retry interval: 2592000 seconds (30 days)
Score: 1.0
Signature: null
Metadata: _pst_: *temp_moved*(13), lastModified=0:
https://sites.google.com/a/mysimpatico.com/home/dp4j
_repr_: http://sites.google.com/a/mysimpatico.com/home/dp4j

It seems like a redirect issue.

<property>
  <name>http.redirect.max</name>
  <value>0</value>
  <description>The maximum number of redirects the fetcher will follow when
  trying to fetch a page. If set to negative or 0, fetcher won't immediately
  follow redirected URLs, *instead it will record them for later fetching.*
  </description>
</property>

<property>
  <name>db.fetch.retry.max</name>
  <value>3</value>
  <description>The maximum number of times a url that has encountered
  recoverable errors is generated for fetch.</description>
</property>
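
So with http.redirect.max at 0, the temp_moved entries above are expected:
the fetcher records the http -> https redirect target instead of following
it. Two ways out, as a sketch (values and paths are illustrative):

# (a) run another generate/fetch/updatedb round, so the recorded
#     redirect targets become due and get fetched
bin/nutch generate crawl/crawldb crawl/segments
s2=`ls -d crawl/segments/2* | tail -1`
bin/nutch fetch $s2
bin/nutch updatedb crawl/crawldb $s2

# (b) or override http.redirect.max with a positive value in
#     conf/nutch-site.xml so redirects are followed immediately

Note that https URLs also need an https-capable protocol plugin
(protocol-httpclient) in plugin.includes, and that invertlinks and index
must be re-run over the new segments (the indexer refuses an existing
crawl/indexes, so remove or merge it first) before NutchBean can return
hits.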



$ bin/nutch readdb crawl/crawldb -stats -sort -dump
CrawlDb statistics start: crawl/crawldb
Statistics for CrawlDb: crawl/crawldb
TOTAL urls:    7
retry 0:    7
min score:    1.0
avg score:    1.0
max score:    1.0
status 1 (db_unfetched):    2
   docs.google.com :    1
   sites.google.com :    1
status 2 (db_fetched):    2
   singinst.org :    1
   www.egamaster.com :    1
status 3 (db_gone):    1
   wsdownload.bbc.co.uk :    1
status 4 (db_redir_temp):    2
   docs.google.com :    1
   sites.google.com :    1
CrawlDb statistics: done
*Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 4
    at org.apache.nutch.crawl.CrawlDbReader.main(CrawlDbReader.java:504)*
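
The ArrayIndexOutOfBoundsException is only argument parsing: -dump is a
separate readdb mode that takes its own output directory, so it cannot be
tacked onto -stats -sort. A sketch of a standalone dump (output path
illustrative):

$ bin/nutch readdb crawl/crawldb -dump crawldb-dump
$ less crawldb-dump/part-00000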






Re: What's wrong crawling a google site? Why is the time limit 0?

Posted by Markus Jelsma <ma...@openindex.io>.

On Tuesday 22 March 2011 14:38:13 Gabriele Kahlout wrote:
> On Tue, Mar 22, 2011 at 1:33 PM, Markus Jelsma
> > You can check the contents of the CrawlDB by using nutch readdb.
> > 
> > $ bin/nutch readdb crawl/crawldb -url
> 
> http://sites.google.com/a/mysimpatico.com/home/dp4j
> URL: http://sites.google.com/a/mysimpatico.com/home/dp4j
> not found
> michaela:nutch-1.2 simpatico$ bin/nutch readdb crawl/crawldb -url
> http://sites.google.com
> URL: http://sites.google.com
> not found

If they're not in the crawldb then either a) you didn't inject them, or b) the
URLs didn't pass the filters. Also, take care that the url parameter of readdb
only accepts the exact URL; a missing slash can make a difference here. Use
-dump to inspect the complete db.


-- 
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350

Re: What's wrong crawling a google site? Why is the time limit 0?

Posted by Gabriele Kahlout <ga...@mysimpatico.com>.
On Tue, Mar 22, 2011 at 1:33 PM, Markus Jelsma
<ma...@openindex.io> wrote:

> If there's nothing produced by the QueueFeeder there can be several things
> going on:
> - you haven't injected URL's to be fetched
>

I have injected them.



> - your regex-urlfilter doesn't pass the URL's injected (if you use regex-
> urlfilter)
>
It does.


> - the page is not due for fetching (fetchInterval)
>
I've run rm -r on the crawl dir before starting.


> You can check the contents of the CrawlDB by using nutch readdb.
>
> $ bin/nutch readdb crawl/crawldb -url
http://sites.google.com/a/mysimpatico.com/home/dp4j
URL: http://sites.google.com/a/mysimpatico.com/home/dp4j
not found
michaela:nutch-1.2 simpatico$ bin/nutch readdb crawl/crawldb -url
http://sites.google.com
URL: http://sites.google.com
not found
michaela:nutch-1.2 simpatico$ bin/nutch readdb crawl/crawldb -stats -sort
CrawlDb statistics start: crawl/crawldb
Statistics for CrawlDb: crawl/crawldb
TOTAL urls:    7
retry 0:    7
min score:    1.0
avg score:    1.0
max score:    1.0
status 1 (db_unfetched):    2
   docs.google.com :    1
   sites.google.com :    1
status 2 (db_fetched):    2
   singinst.org :    1
   www.egamaster.com :    1
status 3 (db_gone):    1
   wsdownload.bbc.co.uk :    1
status 4 (db_redir_temp):    2
   docs.google.com :    1
   sites.google.com :    1
CrawlDb statistics: done
michaela:nutch-1.2 simpatico$ bin/nutch readdb crawl/crawldb -url
sites.google.com
URL: sites.google.com
not found

Fetcher: starting at 2011-03-22 14:31:27
Fetcher: segment: crawl/segments/20110322143119
Fetcher: threads: 10
QueueFeeder finished: total 5 records + hit by time limit :0
fetching
http://docs.google.com/document/d/1mN4HxhsdfwkRzqkgLPl46R4KffR-Ww42LtQRiiLtQIE
fetching http://sites.google.com/a/mysimpatico.com/home/dp4j*
fetching http://singinst.org/upload/artificial-intelligence-risk.pdf
fetching
http://wsdownload.bbc.co.uk/learningenglish/pdf/2011/03/110303122858_110303_6min_heart.pdf
fetching http://www.egamaster.com/datos/politica_fr.pdf
-finishing thread FetcherThread, activeThreads=9
-finishing thread FetcherThread, activeThreads=8
-finishing thread FetcherThread, activeThreads=7
-finishing thread FetcherThread, activeThreads=6
-finishing thread FetcherThread, activeThreads=5
-finishing thread FetcherThread, activeThreads=4
-activeThreads=4, spinWaiting=0, fetchQueues.totalSize=0
-activeThreads=4, spinWaiting=0, fetchQueues.totalSize=0
-finishing thread FetcherThread, activeThreads=3
-finishing thread FetcherThread, activeThreads=2
-activeThreads=2, spinWaiting=0, fetchQueues.totalSize=0
-activeThreads=2, spinWaiting=0, fetchQueues.totalSize=0
-activeThreads=2, spinWaiting=0, fetchQueues.totalSize=0
-activeThreads=2, spinWaiting=0, fetchQueues.totalSize=0
Error parsing: http://www.egamaster.com/datos/politica_fr.pdf: failed(2,0):
expected='endstream' actual=''
org.apache.pdfbox.io.PushBackInputStream@43582a7c
-finishing thread FetcherThread, activeThreads=1
-activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
Error parsing: http://singinst.org/upload/artificial-intelligence-risk.pdf:
failed(2,0): null
-finishing thread FetcherThread, activeThreads=0
-activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
-activeThreads=0
Fetcher: finished at 2011-03-22 14:31:46, elapsed: 00:00:18








Re: What's wrong crawling a google site? Why is the time limit 0?

Posted by Markus Jelsma <ma...@openindex.io>.
If there's nothing produced by the QueueFeeder there can be several things
going on:
- you haven't injected URLs to be fetched
- your regex-urlfilter doesn't pass the injected URLs (if you use
regex-urlfilter)
- the page is not due for fetching (fetchInterval)

You can check the contents of the CrawlDB by using nutch readdb.
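
A sketch mapping each cause above to a quick check (URLFilterChecker is the
stock filter-testing class in Nutch 1.x; the URL is illustrative):

$ bin/nutch readdb crawl/crawldb -stats     # cause 1: were any URLs injected?
$ echo "http://sites.google.com/a/mysimpatico.com/home/dp4j" | \
    bin/nutch org.apache.nutch.net.URLFilterChecker -allCombined   # cause 2: + means the filters pass it
$ bin/nutch readdb crawl/crawldb -url \
    http://sites.google.com/a/mysimpatico.com/home/dp4j            # cause 3: check "Fetch time"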


-- 
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350

Re: What's wrong crawling a google site? Why is the time limit 0?

Posted by Julien Nioche <li...@gmail.com>.
"hit by time limit" => see https://issues.apache.org/jira/browse/NUTCH-770




-- 
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com

Re: What's wrong crawling a google site? Why is the time limit 0?

Posted by ts egge <th...@googlemail.com>.
That's what I'm interested in as well. "hit by time limit": are these pages
due for fetching because of the 30-day period, or the value defined by the
-adddays argument?
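
(If the due-date is the suspect, generate can pretend the clock has
advanced; a sketch, with an illustrative value:

$ bin/nutch generate crawl/crawldb crawl/segments -adddays 31

which selects URLs whose fetch interval would otherwise not have elapsed.)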


Re: What's wrong crawling a google site? Why is the time limit 0?

Posted by Gabriele Kahlout <ga...@mysimpatico.com>.
Where is *QueueFeeder finished: total 1 records + hit by time limit :0 *
documented?
I also get it for

http://wsdownload.bbc.co.uk/learningenglish/pdf/2011/03/110303122858_110303_6min_heart.pdf
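
From a quick look at the Nutch source, the line seems to come from the
QueueFeeder thread in Fetcher.java, and the trailing ":0" appears to be a
counter rather than a limit: the number of queued records dropped because
the fetch time limit (fetcher.timelimit.mins, disabled by default) had
expired. One way to locate it (the path assumes a Nutch 1.2 source
checkout):

$ grep -rn "hit by time limit" src/java/org/apache/nutch/fetcher/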



On Wed, Mar 16, 2011 at 5:25 PM, Gabriele Kahlout
<ga...@mysimpatico.com>wrote:

> Thank you, but the page is accessible even without https.
>
>  *QueueFeeder finished: total 1 records + hit by time limit :0*
>> *fetching http://sites.google.com/a/mysimpatico.com/home/dp4j*
>> -finishing thread FetcherThread, activeThreads=5
>>
>
> My understanding is that google prevents the crawler from fetching the
> page. Correct?
> Otherwise, why is the index empty?
>
> Same here:
>
> QueueFeeder finished: total 1 records + hit by time limit :0
> fetching
> http://docs.google.com/document/d/1mN4HxhsdfwkRzqkgLPl46R4KffR-Ww42LtQRiiLtQIE
>
>


-- 
Regards,
K. Gabriele


Re: What's wrong crawling a google site? Why is the time limit 0?

Posted by Gabriele Kahlout <ga...@mysimpatico.com>.
Thank you, but the page is accessible even without https.

*QueueFeeder finished: total 1 records + hit by time limit :0*
> *fetching http://sites.google.com/a/mysimpatico.com/home/dp4j*
> -finishing thread FetcherThread, activeThreads=5
>

My understanding is that google prevents the crawler from fetching the page.
Correct?
Otherwise, why is the index empty?

Same here:
QueueFeeder finished: total 1 records + hit by time limit :0
fetching
http://docs.google.com/document/d/1mN4HxhsdfwkRzqkgLPl46R4KffR-Ww42LtQRiiLtQIE
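
One way to check whether robots rules are doing the blocking (a quick
diagnostic sketch, using the standard per-host robots.txt location):

$ curl -s http://sites.google.com/robots.txt | head
$ curl -s http://docs.google.com/robots.txt | head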





-- 
Regards,
K. Gabriele


RE: What's wrong crawling a google site? Why is the time limit 0?

Posted by "McGibbney, Lewis John" <Le...@gcu.ac.uk>.
Hi Kahlout,

I am not sure about the time limit == 0; however, from the looks of the URL, there are more than three forward slashes.
The relevant rule in regex-urlfilter.txt (and likewise in crawl-urlfilter.txt) is

# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
-.*(/[^/]+)/[^/]+\1/[^/]+\1/
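
To make the rule concrete: the backreference \1 means a captured path segment
has to recur twice more, each time separated by one other segment, before the
URL is dropped. A rough way to exercise the pattern outside Nutch (perl is
assumed to be available; Nutch itself applies the rule via java.util.regex):

$ perl -ne 'print "filtered: $_" if m{(/[^/]+)/[^/]+\1/[^/]+\1/}' <<'EOF'
http://example.com/foo/bar/foo/baz/foo/qux
http://example.com/a/b/c/d
EOF
filtered: http://example.com/foo/bar/foo/baz/foo/qux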

If you comment out the above rule in both files and then re-crawl, you get the following:

Fetcher: starting at 2011-03-16 09:48:54
Fetcher: segment: crawl/segments/20110316094511
Fetcher: threads: 10
QueueFeeder finished: total 1 records + hit by time limit :0
fetching https://sites.google.com/a/mysimpatico.com/home/dp4j
fetch of https://sites.google.com/a/mysimpatico.com/home/dp4j failed with: org.apache.nutch.protocol.ProtocolNotFound: protocol not found for url=https
-finishing thread FetcherThread, activeThreads=9
-finishing thread FetcherThread, activeThreads=8
-finishing thread FetcherThread, activeThreads=7
-finishing thread FetcherThread, activeThreads=6
-finishing thread FetcherThread, activeThreads=5
-finishing thread FetcherThread, activeThreads=4
-finishing thread FetcherThread, activeThreads=3
-finishing thread FetcherThread, activeThreads=2
-finishing thread FetcherThread, activeThreads=1
-finishing thread FetcherThread, activeThreads=0
-activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
-activeThreads=0
Fetcher: finished at 2011-03-16 09:49:05, elapsed: 00:00:10

This suggests that you need to use the protocol-httpclient plugin as your HTTP protocol, since the default protocol-http plugin cannot fetch https URLs (hence the "protocol not found for url=https" above).
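
Concretely, that would mean swapping protocol-http for protocol-httpclient in
the plugin.includes property of conf/nutch-site.xml, along these lines (the
other plugin names mirror the 1.2 defaults and may differ in your setup):

<property>
  <name>plugin.includes</name>
  <value>protocol-httpclient|urlfilter-regex|parse-(text|html|js)|index-(basic|anchor)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
  <description>protocol-httpclient replaces protocol-http so that https
  URLs can be fetched.</description>
</property>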

Hope this is of some guidance
Thanks Lewis
________________________________________
From: Gabriele Kahlout [gabriele@mysimpatico.com]
Sent: 16 March 2011 07:44
To: user@nutch.apache.org
Subject: What's wrong crawling a google site? Why is the time limit 0?
