Posted to user@nutch.apache.org by Tom Chiverton <tc...@extravision.com> on 2016/10/17 15:38:00 UTC
Trouble fetching PDFs to pass to Tika (I think)
A site I am trying to index has its HTML content on one domain, and
some linked PDFs on another (an Amazon S3 bucket).
So I have set up my plugin.includes in site.xml:
<value>protocol-httpclient|urlfilter-regex|index-(basic|anchor|more|metadata)|query-(basic|site|url|lang)|indexer-solr|nutch-extensionpoints|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)|parse-(html|tika|metatags)</value>
and made sure regexp-urlfilter.xml is OK with it all.
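For context, here is a sketch of how that plugin list would sit in the site config file, assuming the standard plugin.includes property name from nutch-default.xml:

```xml
<!-- Sketch: the plugin list above wrapped in its property block.
     The property name plugin.includes is the one defined in nutch-default.xml. -->
<property>
  <name>plugin.includes</name>
  <value>protocol-httpclient|urlfilter-regex|index-(basic|anchor|more|metadata)|query-(basic|site|url|lang)|indexer-solr|nutch-extensionpoints|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)|parse-(html|tika|metatags)</value>
</property>
```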
But I observe some oddness during fetching, and can't locate the PDFs in
the Solr collection.
All the content on the PDF domain flies past with no pause:
-finishing thread FetcherThread8, activeThreads=0
-finishing thread FetcherThread9, activeThreads=0
0/0 spinwaiting/active, 0 pages, 0 errors, 0.0 0 pages/s, 0 0 kb/s, 0
URLs in 0 queues
-activeThreads=0
Using queue mode : byHost
Fetcher: threads: 10
Fetcher: throughput threshold: -1
Fetcher: throughput threshold sequence: 5
fetching https://s3-eu-west-1.amazonaws.com/.... (queue crawl delay=5000ms)
and then it hits the primary domain and starts pausing between each:
Turning the log level for the fetcher to debug I see
DEBUG fetcher.FetcherJob - Denied by robots.txt: https://s3-eu-west-1.
but there is no robots.txt in the root of the Amazon S3 URL -
https://s3-eu-west-1.amazonaws.com/robots.txt is a 403 !
Any ideas what could be up?
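For what it's worth, the decision the fetcher is making here can be sketched as a tiny function. This is a hypothetical illustration of the http.robots.403.allow semantics documented in nutch-default.xml, not Nutch's actual code:

```python
def robots_allows_crawl(status_code: int, allow_403: bool = True) -> bool:
    """Sketch of how a crawler might interpret the robots.txt HTTP status.

    Mirrors the documented meaning of Nutch's http.robots.403.allow:
    a 403 on robots.txt is treated as "no robots file, crawling allowed"
    when the flag is true, and as "site forbidden" when it is false.
    Illustration only, not Nutch's implementation.
    """
    if 200 <= status_code < 300:
        # A real robots.txt exists; its rules would be parsed and applied.
        return True  # placeholder for "allowed, subject to the parsed rules"
    if status_code == 404:
        # No robots.txt at all: crawling is allowed.
        return True
    if status_code == 403:
        return allow_403
    # Other errors (e.g. 5xx): be conservative and treat as disallowed.
    return False

# With the default setting, the S3 bucket's 403 should NOT block fetching:
print(robots_allows_crawl(403, allow_403=True))   # True
print(robots_allows_crawl(403, allow_403=False))  # False
```

So with the flag at its default of true, a 403 on robots.txt should behave like a missing file rather than a blanket deny.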
--
*Tom Chiverton*
Lead Developer
e: tc@extravision.com <ma...@extravision.com>
p: 0161 817 2922
t: @extravision <http://www.twitter.com/extravision>
w: www.extravision.com <http://www.extravision.com/>
Extravision - email worth seeing <http://www.extravision.com/>
Registered in the UK at: 107 Timber Wharf, 33 Worsley Street,
Manchester, M15 4LD.
Company Reg No: 05017214 VAT: GB 824 5386 19
This e-mail is intended solely for the person to whom it is addressed
and may contain confidential or privileged information.
Any views or opinions presented in this e-mail are solely of the author
and do not necessarily represent those of Extravision Ltd.
Re: Trouble fetching PDFs to pass to Tika (I think)
Posted by Tom Chiverton <tc...@extravision.com>.
That's only in nutch-default.xml, and is set to the default, which is true.
Good idea though !
Tom
On 17/10/16 17:27, Julien Nioche wrote:
> Hi Tom
>
> You haven't modified the value for the config below by any chance?
>
> <property>
>   <name>http.robots.403.allow</name>
>   <value>true</value>
>   <description>Some servers return HTTP status 403 (Forbidden) if
>   /robots.txt doesn't exist. This should probably mean that we are
>   allowed to crawl the site nonetheless. If this is set to false,
>   then such sites will be treated as forbidden.</description>
> </property>
>
>
> The default value (true) should work fine.
>
> Julien
>
>
> On 17 October 2016 at 16:38, Tom Chiverton <tc@extravision.com
> <ma...@extravision.com>> wrote:
>
> A site I am trying to index has its HTML content on one domain,
> and some linked PDFs on another (an Amazon S3 bucket).
>
>
> So I have set up my plugin.includes in site.xml :
>
>
> <value>protocol-httpclient|urlfilter-regex|index-(basic|anchor|more|metadata)|query-(basic|site|url|lang)|indexer-solr|nutch-extensionpoints|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)|parse-(html|tika|metatags)</value>
>
>
> and made sure regexp-urlfilter.xml is OK with it all.
>
>
> But I observe some oddness during fetching, and can't locate the
> PDFs in the Solr collection.
>
> All the content on the PDF domain flies past with no pause:
>
> -finishing thread FetcherThread8, activeThreads=0
> -finishing thread FetcherThread9, activeThreads=0
> 0/0 spinwaiting/active, 0 pages, 0 errors, 0.0 0 pages/s, 0 0
> kb/s, 0 URLs in 0 queues
> -activeThreads=0
> Using queue mode : byHost
> Fetcher: threads: 10
> Fetcher: throughput threshold: -1
> Fetcher: throughput threshold sequence: 5
> fetching https://s3-eu-west-1.amazonaws.com/
> <https://s3-eu-west-1.amazonaws.com/>.... (queue crawl delay=5000ms)
>
> and then it hits the primary domain and starts pausing between each :
>
> Turning the log level for the fetcher to debug I see
>
> DEBUG fetcher.FetcherJob - Denied by robots.txt: https://s3-eu-west-1.
>
> but there is no robots.txt in the root of the Amazon S3 URL -
> https://s3-eu-west-1.amazonaws.com/robots.txt
> <https://s3-eu-west-1.amazonaws.com/robots.txt> is a 403 !
>
> Any ideas what could be up ?
>
Re: Trouble fetching PDFs to pass to Tika (I think)
Posted by Julien Nioche <li...@gmail.com>.
Hi Tom
You haven't modified the value for the config below by any chance?
<property>
  <name>http.robots.403.allow</name>
  <value>true</value>
  <description>Some servers return HTTP status 403 (Forbidden) if
  /robots.txt doesn't exist. This should probably mean that we are
  allowed to crawl the site nonetheless. If this is set to false,
  then such sites will be treated as forbidden.</description>
</property>
The default value (true) should work fine.
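If it had been overridden somewhere, forcing it back explicitly in nutch-site.xml would look like this (a sketch; the site config takes precedence over nutch-default.xml):

```xml
<!-- Sketch: explicit override in nutch-site.xml, which wins over nutch-default.xml -->
<property>
  <name>http.robots.403.allow</name>
  <value>true</value>
</property>
```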
Julien
On 17 October 2016 at 16:38, Tom Chiverton <tc...@extravision.com> wrote:
> A site I am trying to index has its HTML content on one domain, and some
> linked PDFs on another (an Amazon S3 bucket).
>
>
> So I have set up my plugin.includes in site.xml :
>
>
> <value>protocol-httpclient|urlfilter-regex|index-(basic|anchor|more|metadata)|query-(basic|site|url|lang)|indexer-solr|nutch-extensionpoints|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)|parse-(html|tika|metatags)</value>
>
>
> and made sure regexp-urlfilter.xml is OK with it all.
>
> But I observe some oddness during fetching, and can't locate the PDFs in
> the Solr collection.
>
> All the content on the PDF domain flies past with no pause:
>
> -finishing thread FetcherThread8, activeThreads=0
> -finishing thread FetcherThread9, activeThreads=0
> 0/0 spinwaiting/active, 0 pages, 0 errors, 0.0 0 pages/s, 0 0 kb/s, 0 URLs
> in 0 queues
> -activeThreads=0
> Using queue mode : byHost
> Fetcher: threads: 10
> Fetcher: throughput threshold: -1
> Fetcher: throughput threshold sequence: 5
> fetching https://s3-eu-west-1.amazonaws.com/.... (queue crawl
> delay=5000ms)
>
> and then it hits the primary domain and starts pausing between each :
>
> Turning the log level for the fetcher to debug I see
>
> DEBUG fetcher.FetcherJob - Denied by robots.txt: https://s3-eu-west-1.
>
> but there is no robots.txt in the root of the Amazon S3 URL -
> https://s3-eu-west-1.amazonaws.com/robots.txt is a 403 !
>
> Any ideas what could be up ?
>
> --
> *Tom Chiverton*
> Lead Developer
> e: tc@extravision.com
> p: 0161 817 2922
> t: @extravision <http://www.twitter.com/extravision>
> w: www.extravision.com
> [image: Extravision - email worth seeing] <http://www.extravision.com/>
> Registered in the UK at: 107 Timber Wharf, 33 Worsley Street, Manchester,
> M15 4LD.
> Company Reg No: 05017214 VAT: GB 824 5386 19
>
> This e-mail is intended solely for the person to whom it is addressed and
> may contain confidential or privileged information.
> Any views or opinions presented in this e-mail are solely of the author
> and do not necessarily represent those of Extravision Ltd.
>
--
*Open Source Solutions for Text Engineering*
http://www.digitalpebble.com
http://digitalpebble.blogspot.com/
#digitalpebble <http://twitter.com/digitalpebble>