Posted to user@nutch.apache.org by srinivasarao v <sr...@gmail.com> on 2009/01/08 15:58:27 UTC

Problem with Parsing in Nutch

Hi all,

I am crawling "http://en.wikipedia.org/wiki/Hyderabad,_Andhra_Pradesh" with
depth 2 as part of an experiment. Of all the outlinks from this page, only
some are getting parsed. Outlinks such as
"http://upload.wikimedia.org/wikipedia/commons/thumb/9/9e/India_Palace_.jpg/180px-India_Palace_.jpg"
are not parsed at the next depth. I set db.max.outlinks.per.page to -1 so
that all the outlinks would be processed, but it made no difference: they
are still not getting parsed.

Can anyone tell me why some of the outlinks are not getting parsed, and
suggest a way around this?

Thank You
Srinivas
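
[Editor's sketch] For readers following along: an override like the one described above normally lives in conf/nutch-site.xml, which uses the Hadoop-style <property>/<name>/<value> layout. The snippet below is a minimal sketch that reads such a setting back to confirm it is in effect. The XML content is illustrative, not the poster's actual file, and the helper name read_property is hypothetical.

```python
import xml.etree.ElementTree as ET

# Illustrative nutch-site.xml content (an assumption, not the poster's real file).
NUTCH_SITE_XML = """<?xml version="1.0"?>
<configuration>
  <property>
    <name>db.max.outlinks.per.page</name>
    <value>-1</value>
  </property>
</configuration>
"""


def read_property(xml_text, name):
    """Return the value of the named property, or None if it is not set."""
    root = ET.fromstring(xml_text)
    for prop in root.findall("property"):
        if prop.findtext("name") == name:
            return prop.findtext("value")
    return None


# A negative value is what the poster used to lift the per-page outlink cap.
max_outlinks = int(read_property(NUTCH_SITE_XML, "db.max.outlinks.per.page"))
print(max_outlinks)
```

In practice the same helper could be pointed at the text of the real conf/nutch-site.xml. As far as I know, this setting only caps how many outlinks are kept per page; it does not decide whether a particular URL is later fetched or parsed.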

Re: Problem with Parsing in Nutch

Posted by srinivasarao v <sr...@gmail.com>.

I'm sorry, I forgot to mention that I have an image-parser plugin.
I also cannot see that URL in the output at depth 2. When Nutch fetches a
URL it prints "fetching 'url'", but no such line appears for this URL in
the output.

On Thu, Jan 8, 2009 at 9:40 PM, Ian.huang <yi...@hotmail.com> wrote:

> I think that is because Nutch does not provide a parser for the URL you are
> accessing, say, a jpg in your example.
>
> Check plugin.includes in nutch-default.xml and see whether such a parser is
> configured.
>
> Ian

Re: Problem with Parsing in Nutch

Posted by "Ian.huang" <yi...@hotmail.com>.

I think that is because Nutch does not provide a parser for the URL you are
accessing, say, a jpg in your example.

Check plugin.includes in nutch-default.xml and see whether such a parser is
configured.

Ian
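
[Editor's sketch] The check suggested above can be tried offline. The plugin.includes value is a regular expression matched against plugin ids, so a plugin whose id does not match is never loaded. The snippet below is a minimal sketch under stated assumptions: the PLUGIN_INCLUDES value is illustrative only (copy the real one from your own configuration), the plugin id "parse-image" is a hypothetical stand-in for the poster's image-parser plugin, and the helper name is_plugin_enabled is made up for this example.

```python
import re

# Illustrative value only; read the actual plugin.includes from your own config.
PLUGIN_INCLUDES = "protocol-http|urlfilter-regex|parse-(text|html|js)|index-basic"


def is_plugin_enabled(includes_regex, plugin_id):
    """Return True if the whole plugin id matches the includes expression."""
    return re.fullmatch(includes_regex, plugin_id) is not None


# "parse-image" is a hypothetical id standing in for a custom image parser.
for plugin_id in ("parse-html", "parse-image"):
    print(plugin_id, is_plugin_enabled(PLUGIN_INCLUDES, plugin_id))
```

If the custom plugin's id does not match, one option is to extend the expression in conf/nutch-site.xml, which overrides nutch-default.xml, rather than editing the defaults file directly.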
