Posted to user@nutch.apache.org by "Musshorn, Kris T CTR USARMY RDECOM ARL (US)" <kr...@mail.mil> on 2016/09/07 11:41:13 UTC

RE: [Non-DoD Source] RE: indexing metatags with Nutch 1.12 (UNCLASSIFIED)

CLASSIFICATION: UNCLASSIFIED

nutch-site.xml with:
<property>
		<name>
			plugin.includes
		</name>
		<value>
			protocol-httpclient|urlfilter-regex|parse-(html|tika|metatags)|index-(basic|anchor|metadata)|scoring-opic|urlnormalizer-(pass|regex|basic)
		</value>
		<description>
			item needed to parse metatags out of html.
		</description>
</property>

Throws the same errors.
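
Note: a compact form of the same property may be worth trying. This is only a sketch, and it rests on an unconfirmed assumption: that the line breaks and tabs around the value end up inside the regular expression, in which case the first alternative (protocol-httpclient) and the last one would no longer match their plugin ids. Nothing in this thread verifies that for Nutch 1.12. The plugin list itself is unchanged from the property above.

```xml
<!-- Sketch only: same plugin list as above, with name and value kept on single lines. -->
<!-- Assumption (not confirmed in this thread): whitespace around the value may affect plugin matching. -->
<property>
  <name>plugin.includes</name>
  <value>protocol-httpclient|urlfilter-regex|parse-(html|tika|metatags)|index-(basic|anchor|metadata)|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
  <description>Plugins needed to parse metatags out of HTML.</description>
</property>
```

If the errors persist with this form, whitespace can be ruled out as the cause.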

Thanks,
Kris

~~~~~~~~~~~~~~~~~~~~~~~~~~
Kris T. Musshorn
FileMaker Developer - Contractor – Catapult Technology Inc.      
US Army Research Lab 
Aberdeen Proving Ground 
Application Management & Development Branch 
410-278-7251
kris.t.musshorn.ctr@mail.mil
~~~~~~~~~~~~~~~~~~~~~~~~~~

-----Original Message-----
From: Markus Jelsma [mailto:markus.jelsma@openindex.io] 
Sent: Tuesday, September 06, 2016 6:24 PM
To: user@nutch.apache.org
Subject: [Non-DoD Source] RE: indexing metatags with Nutch 1.12


Hm, this is odd. You have protocol-http configured, so https should work as is. Switch to protocol-httpclient to confirm whether the problem lies in protocol-http. Protocol-httpclient has supported https for much longer than protocol-http.

If it works with protocol-httpclient, then protocol-http has some odd problem that nobody has noticed before.
M.

 
 
-----Original message-----
> From:Kris Musshorn <mu...@comcast.net>
> Sent: Tuesday 6th September 2016 23:26
> To: user@nutch.apache.org
> Subject: RE: indexing metatags with Nutch 1.12
> 
> Markus,
> 
> Here is the nutch-site.xml that was in place when Nutch threw the errors I posted previously.
> 
> -----Original Message-----
> From: Markus Jelsma [mailto:markus.jelsma@openindex.io] 
> Sent: Tuesday, September 6, 2016 3:02 PM
> To: user@nutch.apache.org
> Subject: RE: indexing metatags with Nutch 1.12
> 
> Well, so we did add https to protocol-http's plugin.xml. Does your plugin.includes actually contain a protocol-* plugin?
> 
> 
>  
>  
> -----Original message-----
> > From:KRIS MUSSHORN <mu...@comcast.net>
> > Sent: Tuesday 6th September 2016 20:39
> > To: user@nutch.apache.org
> > Subject: Re: indexing metatags with Nutch 1.12
> > 
> > Markus, 
> > I'm not sure how to answer your question.
> > Here are two XML files for your consideration.
> > 
> > Kris
> > 
> > ----------- 
> > From: "Markus Jelsma" <ma...@openindex.io>
> > To: user@nutch.apache.org
> > Sent: Tuesday, September 6, 2016 2:30:39 PM
> > Subject: RE: indexing metatags with Nutch 1.12
> > 
> > Well, this is certainly not a metatag indexing problem. You need to use protocol-httpclient for https, or configure protocol-http's plugin.xml to declare https support the same way protocol-httpclient's plugin.xml does.
> > 
> > On the other hand, when we added support for https to protocol-http, did we forget to add it to the plugin.xml?
> > 
> > 
> > 
> >  
> >  
> > -----Original message-----
> > > From:KRIS MUSSHORN <mu...@comcast.net>
> > > Sent: Tuesday 6th September 2016 19:29
> > > To: user@nutch.apache.org
> > > Subject: indexing metatags with Nutch 1.12
> > > 
> > > https://wiki.apache.org/nutch/IndexMetatags
> > > 
> > > As soon as I switch to nutch-site_v2, Nutch throws protocol-not-found errors during the crawl.
> > > 
> > > 2016-09-06 12:23:53,102 INFO  fetcher.Fetcher - -activeThreads=50, spinWaiting=50, fetchQueues.totalSize=442, fetchQueues.getQueueCount=1
> > > 2016-09-06 12:23:53,576 INFO  fetcher.FetcherThread - fetching https://snip/inside/events/events_summary/documents/Harford_Co_Sheriff_Special_Brief.pdf (queue crawl delay=500ms)
> > > 2016-09-06 12:23:53,576 INFO  fetcher.FetcherThread - fetch of https://snip/inside/events/events_summary/documents/Harford_Co_Sheriff_Special_Brief.pdf failed with: org.apache.nutch.protocol.ProtocolNotFound: protocol not found for url=https
> > >     at org.apache.nutch.protocol.ProtocolFactory.getProtocol(ProtocolFactory.java:84)
> > >     at org.apache.nutch.fetcher.FetcherThread.run(FetcherThread.java:257) 
> > > 
> > > How can I fix this?
> > > 
> > > Kris
> > > 
> > 
> 

