Posted to user@nutch.apache.org by Tom Chiverton <tc...@extravision.com> on 2016/10/17 15:38:00 UTC

Trouble fetching PDFs to pass to Tika (I think)

A site I am trying to index has its HTML content on one domain, and
some linked PDFs on another (an Amazon S3 bucket).


So I have set up my plugin.includes in nutch-site.xml:


<value>protocol-httpclient|urlfilter-regex|index-(basic|anchor|more|metadata)|query-(basic|site|url|lang)|indexer-solr|nutch-extensionpoints|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)|parse-(html|tika|metatags)</value>
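
For reference, that value sits inside the usual property element in nutch-site.xml (a sketch, with the value abbreviated to the full string above):

    <property>
      <name>plugin.includes</name>
      <value>protocol-httpclient|urlfilter-regex|...|parse-(html|tika|metatags)</value>
    </property>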


and made sure regex-urlfilter.txt is OK with it all.
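
Roughly, the accept rules there look like this (a sketch, with www.example.com standing in for the real primary host):

    # accept the primary site and the S3 bucket that holds the PDFs
    +^https?://www\.example\.com/
    +^https?://s3-eu-west-1\.amazonaws\.com/
    # reject everything else
    -.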


But I observe some oddness during fetching, and can't locate the PDFs in 
the Solr collection.

All the content on the PDF domain flies past with no pause:

Using queue mode : byHost
Fetcher: threads: 10
Fetcher: throughput threshold: -1
Fetcher: throughput threshold sequence: 5
fetching https://s3-eu-west-1.amazonaws.com/.... (queue crawl delay=5000ms)
-finishing thread FetcherThread8, activeThreads=0
-finishing thread FetcherThread9, activeThreads=0
0/0 spinwaiting/active, 0 pages, 0 errors, 0.0 0 pages/s, 0 0 kb/s, 0 URLs in 0 queues
-activeThreads=0

and then it hits the primary domain and starts pausing between each fetch.
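
The 5 s pause per URL lines up with fetcher.server.delay (5.0 seconds by default in nutch-default.xml); for reference, a sketch of how it would be overridden in nutch-site.xml:

    <!-- delay, in seconds, between successive requests to the same host -->
    <property>
      <name>fetcher.server.delay</name>
      <value>5.0</value>
    </property>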

Turning the log level for the fetcher up to debug, I see:

DEBUG fetcher.FetcherJob - Denied by robots.txt: https://s3-eu-west-1.

but there is no robots.txt at the root of the Amazon S3 URL:
https://s3-eu-west-1.amazonaws.com/robots.txt returns a 403!

Any ideas what could be up?

-- 
*Tom Chiverton*
Lead Developer
Extravision - www.extravision.com


Re: Trouble fetching PDFs to pass to Tika (I think)

Posted by Tom Chiverton <tc...@extravision.com>.
That's only in nutch-default.xml, and is set to the default, which is true.
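
For completeness, values in nutch-site.xml override nutch-default.xml, so it could be pinned explicitly with something like:

    <!-- treat a 403 on /robots.txt as "no robots file", i.e. allow the host -->
    <property>
      <name>http.robots.403.allow</name>
      <value>true</value>
    </property>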

Good idea though!

Tom


On 17/10/16 17:27, Julien Nioche wrote:
> Hi Tom
>
> You haven't modified the value for the config below by any chance?
>
> <property>
>   <name>http.robots.403.allow</name>
>   <value>true</value>
>   <description>Some servers return HTTP status 403 (Forbidden) if
>   /robots.txt doesn't exist. This should probably mean that we are
>   allowed to crawl the site nonetheless. If this is set to false,
>   then such sites will be treated as forbidden.</description>
> </property>
>
> The default value (true) should work fine.
>
> Julien


Re: Trouble fetching PDFs to pass to Tika (I think)

Posted by Julien Nioche <li...@gmail.com>.
Hi Tom

You haven't modified the value for the config below by any chance?

<property>
  <name>http.robots.403.allow</name>
  <value>true</value>
  <description>Some servers return HTTP status 403 (Forbidden) if
  /robots.txt doesn't exist. This should probably mean that we are
  allowed to crawl the site nonetheless. If this is set to false,
  then such sites will be treated as forbidden.</description>
</property>

The default value (true) should work fine.

Julien


-- 

*Open Source Solutions for Text Engineering*

http://www.digitalpebble.com
http://digitalpebble.blogspot.com/
#digitalpebble <http://twitter.com/digitalpebble>