Posted to user@nutch.apache.org by webdev1977 <we...@gmail.com> on 2011/08/01 20:49:23 UTC

Re: Fetched pages has no content

So I am not crazy, the protocol-httpclient IS broken!? I have been wondering
for a week or two what has changed between 1.2 and 1.3 that would have
caused such a problem.  

Is there a JIRA open for the issue?

--
View this message in context: http://lucene.472066.n3.nabble.com/Fetched-pages-has-no-content-tp3171881p3216734.html
Sent from the Nutch - User mailing list archive at Nabble.com.

Re: Fetched pages has no content

Posted by webdev1977 <we...@gmail.com>.
Both are in the list, but I guess that since parse-html is listed first, it wins.

--
View this message in context: http://lucene.472066.n3.nabble.com/Fetched-pages-has-no-content-tp3171881p3218585.html

Re: Fetched pages has no content

Posted by Julien Nioche <li...@gmail.com>.
Which parser are you using for HTML: parse-html or parse-tika?

On 1 August 2011 20:00, webdev1977 <we...@gmail.com> wrote:

> I had protocol-httpclient working in 1.2 and sending certificates for a
> group of sites. I moved the plugin over to the 1.3 environment and it
> won't work. I am having the same issue as the OP: no content is parsed
> for the seed URL. I see the page come in on debug.wire:
> <html>.... https://domain.com/test.php?id=123 link ...</html>
> but then it does nothing with the links in it. I have tried changing my
> filters multiple times and it just won't parse them. I also ran the
> ParseChecker class and I get "0" outlinks.

-- 
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com

Re: Fetched pages has no content

Posted by webdev1977 <we...@gmail.com>.
I had protocol-httpclient working in 1.2 and sending certificates for a group
of sites. I moved the plugin over to the 1.3 environment and it won't work. I
am having the same issue as the OP: no content is parsed for the seed URL. I
see the page come in on debug.wire:
<html>.... https://domain.com/test.php?id=123 link ...</html>
but then it does nothing with the links in it. I have tried changing my
filters multiple times and it just won't parse them. I also ran the
ParseChecker class and I get "0" outlinks.

--
View this message in context: http://lucene.472066.n3.nabble.com/Fetched-pages-has-no-content-tp3171881p3216762.html

Re: Fetched pages has no content

Posted by Markus Jelsma <ma...@openindex.io>.
What do you mean? protocol-http was already the default protocol plugin in
1.2.
Are you looking for a Jira issue for rebuilding protocol-httpclient with the
latest version? There is none, but you are of course free to create one
yourself.

> So I am not crazy, the protocol-httpclient IS broken!? I have been
> wondering for a week or two what has changed between 1.2 and 1.3 that
> would have caused such a problem.
> 
> Is there a JIRA open for the issue?