Posted to user@nutch.apache.org by Anders Rask <an...@gmail.com> on 2011/07/15 15:04:55 UTC

Fetched pages has no content

Hi!

We are using Nutch to crawl a bunch of websites and index them to Solr. At
the moment we are in the process of upgrading from Nutch 1.1 to Nutch 1.3
and at the same time going from one server to two servers.

Unfortunately we are stuck with a problem which we haven't seen in the old
environment. Several of the pages that we are fetching contain no content
when they are stored in the segment. The following is an excerpt from
"readseg" on a segment containing such a page:

----

Recno:: 5
URL:: http://www.uu.se/news/news_item.php?typ=pm&id=1381

Content::
Version: -1
url: http://www.uu.se/news/news_item.php?typ=pm&id=1381
base: http://www.uu.se/news/news_item.php?typ=pm&id=1381
contentType: text/html
metadata: Date=Fri, 15 Jul 2011 09:02:38 GMT Content-Length=7195
nutch.crawl.score=1.0 _fst_=33 nutch.segment.name=20110715110049
Connection=close Content-Type=text/html Server=Apache
Content:

----
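
One way to find every record with this symptom in a larger dump is to compare the Content-Length header with what is actually stored after the "Content:" line. The following is a minimal Python sketch, not part of Nutch. The function name find_empty_records is hypothetical, and the record layout is assumed only from the readseg excerpt above: a "Recno::" line starts each record, the HTTP headers appear in the "metadata:" line, and the stored body follows the "Content:" line.

```python
import re


def find_empty_records(dump_text):
    """Return URLs whose headers announce a body but whose stored content is empty.

    Assumes the text layout shown in the readseg excerpt; other Nutch
    versions or dump options may print records differently.
    """
    suspicious = []
    # Each record is assumed to start with a "Recno:: N" line.
    for record in re.split(r"(?m)^Recno:: \d+\s*$", dump_text)[1:]:
        url = re.search(r"(?m)^URL:: (\S+)", record)
        length = re.search(r"Content-Length=(\d+)", record)
        # Everything after the single-colon "Content:" line is treated as the
        # body; the "Content::" section header does not match this pattern.
        body = re.search(r"(?ms)^Content:[ \t]*$(.*)", record)
        if not (url and length and body):
            continue
        if int(length.group(1)) > 0 and not body.group(1).strip():
            suspicious.append(url.group(1))
    return suspicious
```

Fed the text of a dump, this would list URLs such as the one above, where the server reported 7195 bytes but the segment holds nothing. Records without a Content-Length header are skipped rather than guessed at.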

The fetch logs say nothing unusual about retrieving this page:
2011-07-15 11:02:37,500 INFO org.apache.nutch.fetcher.Fetcher: fetching
http://www.uu.se/news/news_item.php?typ=pm&id=1381

There seems to be nothing strange about the page itself and a very similar
page (http://www.uu.se/news/news_item.php?id=1421&typ=pm) is crawled and
indexed without any problems.

Anyone have any ideas about what might be wrong here?


Best regards,
--Anders Rask
www.findwise.com

Re: Fetched pages has no content

Posted by webdev1977 <we...@gmail.com>.

Both are in the list, but I guess parse-html wins since it is listed first.

--
View this message in context: http://lucene.472066.n3.nabble.com/Fetched-pages-has-no-content-tp3171881p3218585.html
Sent from the Nutch - User mailing list archive at Nabble.com.

Re: Fetched pages has no content

Posted by Julien Nioche <li...@gmail.com>.

Which parser are you using for html? parse-html or parse-tika?

On 1 August 2011 20:00, webdev1977 <we...@gmail.com> wrote:

> I had protocol-httpclient working in 1.2, sending certificates for a group
> of sites. I moved the plugin over to the 1.3 environment and it won't
> work. I am having the same issue as the OP: no content is parsed for the
> seed URL. I see it come in on debug.wire... <html>....
> https://domain.com/test.php?id=123 link ...</html>..
> but then it does nothing with the links. I have tried changing my
> filters multiple times and it just won't parse them. I also ran the
> ParserChecker class and I get "0" outlinks.



-- 
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com

Re: Fetched pages has no content

Posted by webdev1977 <we...@gmail.com>.

I had protocol-httpclient working in 1.2, sending certificates for a group
of sites. I moved the plugin over to the 1.3 environment and it won't
work. I am having the same issue as the OP: no content is parsed for the seed
URL. I see it come in on debug.wire... <html>....
https://domain.com/test.php?id=123 link ...</html>..
but then it does nothing with the links. I have tried changing my
filters multiple times and it just won't parse them. I also ran the
ParserChecker class and I get "0" outlinks.


Re: Fetched pages has no content

Posted by Markus Jelsma <ma...@openindex.io>.

What do you mean? protocol-http was already the default protocol plugin in
1.2.
Are you looking for a Jira issue for rebuilding protocol-httpclient with the
latest version? There is none, but you are of course free to create one
yourself.

> So I am not crazy, the protocol-httpclient IS broken!? I have been
> wondering for a week or two what has changed between 1.2 and 1.3 that
> would have caused such a problem.
> 
> Is there a JIRA open for the issue?
> 

Re: Fetched pages has no content

Posted by webdev1977 <we...@gmail.com>.

So I am not crazy, the protocol-httpclient IS broken!? I have been wondering
for a week or two what has changed between 1.2 and 1.3 that would have
caused such a problem.  

Is there a JIRA open for the issue?


Re: Fetched pages has no content

Posted by Julien Nioche <li...@gmail.com>.

protocol-httpclient is broken and needs replacing

On 19 July 2011 23:10, Anders Rask <an...@gmail.com> wrote:

> Hi guys!
>
> I experimented some more, and it seems I'm only getting these problems when
> using protocol-httpclient. It works fine when I use protocol-http.
>
> Could you please try and see if you get the same behavior?
>
>
> Best regards,
> --Anders Rask
> www.findwise.com



-- 
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com

Re: Fetched pages has no content

Posted by Anders Rask <an...@gmail.com>.

Hi guys!

I experimented some more, and it seems I'm only getting these problems when
using protocol-httpclient. It works fine when I use protocol-http.

Could you please try and see if you get the same behavior?


Best regards,
--Anders Rask
www.findwise.com


Re: Fetched pages has no content

Posted by Anders Rask <an...@gmail.com>.

Thank you for your quick responses!

Our custom parser runs an embedded OpenPipeline (
http://www.openpipeline.org/), which in turn runs a Tika parser to parse the
content.

I have tried running inject, generate, fetch, readseg with standard Nutch
now and it works fine with that page. So if the problem is in our custom
parser then how is the parser involved in the fetch command?

Markus, you mentioned changes to the HTML parse API between versions 1.1 and
1.2. I checked the CHANGES.txt file but couldn't find anything about this. Do
you have more information?


Best regards,
--Anders Rask
www.findwise.com


Re: Fetched pages has no content

Posted by Julien Nioche <li...@gmail.com>.

As pointed out by Markus the logs show that the content has been properly
fetched. Moreover


> ./nutch org.apache.nutch.parse.ParserChecker 'http://www.uu.se/news/news_item.php?typ=pm&id=1381'


works fine. Double-check your custom parser; it is likely to be the source
of the problem.

BTW: what does your custom parser do? Is it an HtmlParseFilter? If so, which
parser are you using for HTML: parse-html or parse-tika?

Julien






-- 
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com

Re: Fetched pages has no content

Posted by Markus Jelsma <ma...@openindex.io>.

Judging from the segment, those URLs are fetched and parsed. I think maybe
some HTML parse APIs have changed between your 1.1 and 1.2 versions. If
ParserChecker shows the same issue, then it's most likely a parse plugin problem
with the new version. Can you check?

> Hi,
> 
> If you have a look at your regex-urlfilter.txt it will by default be
> rejecting ? in the URL. Please test with line edited (or commented out) and
> see if the problem fades.
> 
> On Mon, Jul 18, 2011 at 10:11 AM, Anders Rask <an...@gmail.com> wrote:
> > Hi Markus!
> > 
> > We are using a custom parser, but I don't think that the problem is in
> > the parsing. I got the same problem when trying the ParserChecker. I
> > also tried the following:
> > 
> > I injected the following seeds:
> > 
> > http://www.uu.se/news/news_item.php?id=1423&typ=pm
> > http://www.uu.se/news/news_item.php?id=1421&typ=pm
> > http://www.uu.se/news/news_item.php?id=1489&typ=artikel
> > http://www.uu.se/news/news_item.php?id=1407&typ=pm
> > http://www.uu.se/news/news_item.php?id=1234&typ=artikel
> > http://www.uu.se/news/news_item.php?id=1233&typ=artikel
> > http://www.uu.se/news/news_item.php?id=1180&typ=artikel
> > http://www.uu.se/news/news_item.php?typ=pm&id=1381
> > http://www.uu.se/
> > 
> > Then generated a segment, fetched that segment and then did a readseg
> > with -noparse, -noparsedata and -noparsetext.
> > 
> > I have attached the readseg dump and it shows no content for:
> > http://www.uu.se/news/news_item.php?typ=pm&id=1381
> > 
> > Can the problem somehow be in the configurations for the fetcher?
> > 
> > 
> > Best regards,
> > --Anders Rask
> > www.findwise.com
> > 
> > 
> > 2011/7/15 Markus Jelsma <ma...@openindex.io>
> > 
> >> What parser are you using? What does bin/nutch
> >> org.apache.nutch.parse.ParserChecker say? Here it outputs the content
> >> fine with parse-tika enabled.
> >> 
> >> --
> >> Markus Jelsma - CTO - Openindex
> >> http://www.linkedin.com/in/markus17
> >> 050-8536620 / 06-50258350

Re: Fetched pages has no content

Posted by lewis john mcgibbney <le...@gmail.com>.

Hi,

If you have a look at your regex-urlfilter.txt, you will see that by default
it rejects URLs containing ?. Please test with that line edited (or commented
out) and see if the problem goes away.
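
As a quick way to see which of the seed URLs such a rule would reject, here
is a minimal Python sketch. It only mimics the first-match-wins behaviour of
regex-urlfilter.txt and does not call Nutch itself. The rule set below (the
stock "-[?*!@=]" line followed by a catch-all "+.") is an assumption; the
file on your system may differ, and the helper names parse_rules and accepts
are made up for this illustration.

```python
import re

# Assumed rule set: the stock "probable query" rule plus a catch-all accept.
# The actual regex-urlfilter.txt in your conf directory may look different.
RULES = """
# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]
# accept anything else
+.
"""


def parse_rules(text):
    """Turn '+regex' / '-regex' lines into (accept, compiled_pattern) pairs."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        sign, pattern = line[0], line[1:]
        if sign not in "+-":
            raise ValueError(f"rule must start with + or -: {line!r}")
        rules.append((sign == "+", re.compile(pattern)))
    return rules


def accepts(url, rules):
    """The first rule whose pattern is found in the URL decides.

    A URL that matches no rule at all is rejected.
    """
    for accept, pattern in rules:
        if pattern.search(url):
            return accept
    return False


rules = parse_rules(RULES)
for url in (
    "http://www.uu.se/news/news_item.php?typ=pm&id=1381",
    "http://www.uu.se/",
):
    print(url, "->", "accepted" if accepts(url, rules) else "rejected")
```

With the assumed rules, every news_item.php URL is rejected because of the
"?" and "=" characters. Note that a URL rejected by the filter would normally
not be fetched at all, so this check is mainly useful for ruling the filter
in or out.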

On Mon, Jul 18, 2011 at 10:11 AM, Anders Rask <an...@gmail.com> wrote:

> Hi Markus!
>
> We are using a custom parser, but I don't think that the problem is in the
> parsing. I got the same problem when trying the ParserChecker. I also tried
> the following:
>
> I injected the following seeds:
>
> http://www.uu.se/news/news_item.php?id=1423&typ=pm
> http://www.uu.se/news/news_item.php?id=1421&typ=pm
> http://www.uu.se/news/news_item.php?id=1489&typ=artikel
> http://www.uu.se/news/news_item.php?id=1407&typ=pm
> http://www.uu.se/news/news_item.php?id=1234&typ=artikel
> http://www.uu.se/news/news_item.php?id=1233&typ=artikel
> http://www.uu.se/news/news_item.php?id=1180&typ=artikel
> http://www.uu.se/news/news_item.php?typ=pm&id=1381
> http://www.uu.se/
>
> Then generated a segment, fetched that segment and then did a readseg with
> -noparse, -noparsedata and -noparsetext.
>
> I have attached the readseg dump and it shows no content for:
> http://www.uu.se/news/news_item.php?typ=pm&id=1381
>
> Can the problem somehow be in the configurations for the fetcher?
>
>
> Best regards,
> --Anders Rask
> www.findwise.com


-- 
*Lewis*

Re: Fetched pages has no content

Posted by Anders Rask <an...@gmail.com>.

Hi Markus!

We are using a custom parser, but I don't think that the problem is in the
parsing. I got the same problem when trying the ParserChecker. I also tried
the following:

I injected the following seeds:

http://www.uu.se/news/news_item.php?id=1423&typ=pm
http://www.uu.se/news/news_item.php?id=1421&typ=pm
http://www.uu.se/news/news_item.php?id=1489&typ=artikel
http://www.uu.se/news/news_item.php?id=1407&typ=pm
http://www.uu.se/news/news_item.php?id=1234&typ=artikel
http://www.uu.se/news/news_item.php?id=1233&typ=artikel
http://www.uu.se/news/news_item.php?id=1180&typ=artikel
http://www.uu.se/news/news_item.php?typ=pm&id=1381
http://www.uu.se/

I then generated a segment, fetched it, and ran readseg with
-noparse, -noparsedata and -noparsetext.

I have attached the readseg dump and it shows no content for:
http://www.uu.se/news/news_item.php?typ=pm&id=1381
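
For larger dumps, the same check can be scripted. The following is a rough
Python sketch that lists the URLs whose raw "Content:" body is blank. It
assumes the dump layout shown in the excerpt from the original post
("Recno::" and "URL::" lines, then a "Content::" section ending in a
"Content:" line). The function name empty_content_urls and the SAMPLE text,
including the HTML of the first record, are made up for illustration.

```python
import re

# Made-up sample modelled on the readseg excerpt; not real crawl output.
SAMPLE = """Recno:: 0
URL:: http://www.uu.se/

Content::
Version: -1
url: http://www.uu.se/
contentType: text/html
Content:
<html><body>Uppsala universitet</body></html>

Recno:: 5
URL:: http://www.uu.se/news/news_item.php?typ=pm&id=1381

Content::
Version: -1
url: http://www.uu.se/news/news_item.php?typ=pm&id=1381
contentType: text/html
Content:

"""


def empty_content_urls(dump_text):
    """Return the URLs of records whose raw 'Content:' body is blank."""
    empty = []
    # Each record is assumed to start with a "Recno:: N" line.
    for record in re.split(r"(?m)^Recno:: \d+[ \t]*$", dump_text)[1:]:
        url_match = re.search(r"(?m)^URL:: (\S+)", record)
        # "Content:" alone on a line marks the raw body ("Content::" does not).
        body_match = re.search(r"(?m)^Content:[ \t]*$", record)
        if not url_match or not body_match:
            continue
        body = record[body_match.end():]
        # Stop at the next section header such as "CrawlDatum::", if any.
        next_section = re.search(r"(?m)^[A-Za-z]+::[ \t]*$", body)
        if next_section:
            body = body[:next_section.start()]
        if not body.strip():
            empty.append(url_match.group(1))
    return empty


for url in empty_content_urls(SAMPLE):
    print("no content:", url)
```

To run it against a real dump, read the dump file produced by readseg into a
string and pass that to empty_content_urls instead of SAMPLE.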

Can the problem somehow be in the configurations for the fetcher?


Best regards,
--Anders Rask
www.findwise.com


2011/7/15 Markus Jelsma <ma...@openindex.io>

> What parser are you using? What does bin/nutch
> org.apache.nutch.parse.ParserChecker say? Here it outputs the content fine
> with parse-tika enabled.

Re: Fetched pages has no content

Posted by Markus Jelsma <ma...@openindex.io>.

What parser are you using? What does bin/nutch 
org.apache.nutch.parse.ParserChecker say? Here it outputs the content fine 
with parse-tika enabled.

On Friday 15 July 2011 15:04:55 Anders Rask wrote:
> Hi!
> 
> We are using Nutch to crawl a bunch of websites and index them to Solr. At
> the moment we are in the process of upgrading from Nutch 1.1 to Nutch 1.3
> and in the same time going from one server to two servers.
> 
> Unfortunately we are stuck with a problem which we haven't seen in the old
> environment. Several of the pages that we are fetching contain no content
> when they are stored in the segment. The following is an excerpt from
> "readseg" on a segment containing such a page:
> 
> ----
> 
> Recno:: 5
> URL:: http://www.uu.se/news/news_item.php?typ=pm&id=1381
> 
> Content::
> Version: -1
> url: http://www.uu.se/news/news_item.php?typ=pm&id=1381
> base: http://www.uu.se/news/news_item.php?typ=pm&id=1381
> contentType: text/html
> metadata: Date=Fri, 15 Jul 2011 09:02:38 GMT Content-Length=7195
> nutch.crawl.score=1.0 _fst_=33 nutch.segment.name=20110715110049
> Connection=close Content-Type=text/html Server=Apache
> Content:
> 
> ----
> 
> The fetch logs say nothing unusual about retrieving this page:
> 2011-07-15 11:02:37,500 INFO org.apache.nutch.fetcher.Fetcher: fetching
> http://www.uu.se/news/news_item.php?typ=pm&id=1381
> 
> There seems to be nothing strange about the page itself and a very similar
> page (http://www.uu.se/news/news_item.php?id=1421&typ=pm) is crawled and
> indexed without any problems.
> 
> Anyone have any ideas about what might be wrong here?
> 
> 
> Best regards,
> --Anders Rask
> www.findwise.com

-- 
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350