Posted to user@nutch.apache.org by Carlos Pérez Miguel <cp...@gmail.com> on 2017/08/09 10:09:16 UTC

problems extracting outlinks

Hi,

While crawling a site, I found that the crawl stopped earlier than expected
because many of the URLs being downloaded were of the form:

http://www.domain.com/something/"http://www.domain.com"

After reading the HTML of the pages containing those outlinks, I found that
the outlinks are not present in the source code, so I guess there may be
something wrong either in the page content or in the parsing done by Nutch.
How can I find out which one is the problem? I am a little lost with this one.

To reproduce the problem:

$ bin/nutch parsechecker \
  https://www.seguroscatalanaoccidente.com/cat/particulars/vida/assegurances-de-vida/vida-proteccio

And within the results we can see this particular outlink:
 outlink: toUrl: https://www.seguroscatalanaoccidente.com/cat/particulars/vida/assegurances-de-vida/"http://www.seguroscatalanaoccidente.com" anchor: www.seguroscatalanaoccidente.com

Is there any way to solve or avoid this? Maybe with the regex-urlfilter
file?

Thanks

Carlos Pérez Miguel
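
As a rough illustration of the regex-urlfilter idea raised above, the sketch below emulates in Python how an ordered list of accept/reject patterns could drop such URLs. It assumes the behaviour described in the comments of the stock regex-urlfilter.txt (the first matching pattern decides, and a URL matching no pattern is ignored). The rule list and the accept() helper are hypothetical names made up for this example, not Nutch code. Only double quotes (literal or percent-encoded as %22) are rejected, because apostrophes may occur in legitimate URLs.

```python
import re

# Hypothetical rules in the style of regex-urlfilter.txt:
# '-' rejects a matching URL, '+' accepts it.
RULES = [
    ("-", re.compile(r'"|%22')),  # reject URLs containing a double quote
    ("+", re.compile(r".")),      # accept anything else
]

def accept(url):
    """Return True if the first matching rule is an accept rule."""
    for sign, pattern in RULES:
        if pattern.search(url):
            return sign == "+"
    # Assumption: a URL that matches no rule is ignored.
    return False

bad = ('https://www.seguroscatalanaoccidente.com/cat/particulars/vida/'
       'assegurances-de-vida/"http://www.seguroscatalanaoccidente.com"')
good = ('https://www.seguroscatalanaoccidente.com/cat/particulars/vida/'
        'assegurances-de-vida/vida-proteccio')

print(accept(bad), accept(good))
```

In regex-urlfilter.txt itself, the equivalent would presumably be a single reject line such as -("|%22) placed before the final accept rule. This only keeps the malformed URLs out of the crawl; it does not address why they are extracted in the first place.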

Re: problems extracting outlinks

Posted by Sebastian Nagel <wa...@googlemail.com>.

Hi Carlos,

Thanks for the follow-up. I've checked the mentioned link with Nutch 1.14:
- with parse-html the link is missing (as are some others)
- with parse-tika it's extracted as expected: a self-referential link, with the anchor part removed

That's a hint that we should take a closer look at the problem.
Please, open an issue on
  https://issues.apache.org/jira/browse/NUTCH

Thanks,
Sebastian


On 08/09/2017 08:10 PM, Carlos Pérez Miguel wrote:
> Hi Sebastian,
> 
> Thank you for your answer. I am using Nutch 1.12, with the same plugins as
> you. I am using this older version because I run a modified build (the
> modifications do not touch those plugins). I guess something has changed in
> the parse-html plugin since my version.
> 
> Anyway, I think I found a clue about what is happening. This page is in
> Catalan, a language in which apostrophes (single quotes) are common. Most of
> the attributes in the HTML code are delimited by single quotes, and some of
> the attribute values contain single quotes as well, so I think the parser
> is confused by that. For example, on line 278 of that page we can see this
> tag:
> 
> <div data-group='#servei-d'atencio-al-client' class="sublevel">
> 
> Thanks,
> Carlos
> 
> Carlos Pérez Miguel

Re: problems extracting outlinks

Posted by Carlos Pérez Miguel <cp...@gmail.com>.

Hi Sebastian,

Thank you for your answer. I am using Nutch 1.12, with the same plugins as
you. I am using this older version because I run a modified build (the
modifications do not touch those plugins). I guess something has changed in
the parse-html plugin since my version.

Anyway, I think I found a clue about what is happening. This page is in
Catalan, a language in which apostrophes (single quotes) are common. Most of
the attributes in the HTML code are delimited by single quotes, and some of
the attribute values contain single quotes as well, so I think the parser
is confused by that. For example, on line 278 of that page we can see this
tag:

<div data-group='#servei-d'atencio-al-client' class="sublevel">
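
To make the ambiguity concrete, here is a small Python sketch. It is not how parse-html or any real HTML parser is implemented; it only applies the simple rule that a single-quoted attribute value ends at the next single quote, and shows what is then left over in the tag.

```python
import re

# The tag from line 278 of the page, copied verbatim.
tag = """<div data-group='#servei-d'atencio-al-client' class="sublevel">"""

# A single-quoted value runs up to the next single quote, so the
# apostrophe inside "d'atencio" closes the value early.
match = re.search(r"data-group='([^']*)'", tag)
value = match.group(1)

# Everything after that closing quote must still be interpreted somehow,
# including a stray quote that no longer has a partner.
rest = tag[match.end():]

print(value)
print(rest)
```

The leftover text contains an unpaired single quote, so how the rest of the tag (and possibly the markup after it) is read depends on how a given parser recovers from the error.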

Thanks,
Carlos

Carlos Pérez Miguel

2017-08-09 18:47 GMT+02:00 Sebastian Nagel <wa...@googlemail.com>:

> Hi Carlos,
>
> Sorry, but I'm not able to reproduce the problem using Nutch 1.14-SNAPSHOT
> and the following call:
>
> $ bin/nutch parsechecker -Dplugin.includes='protocol-http|parse-html' \
>   https://www.seguroscatalanaoccidente.com/cat/particulars/vida/assegurances-de-vida/vida-proteccio
>
> Could you tell us which Nutch version you are using and which plugins are
> enabled?
>
> Thanks,
> Sebastian

Re: problems extracting outlinks

Posted by Sebastian Nagel <wa...@googlemail.com>.

Hi Carlos,

Sorry, but I'm not able to reproduce the problem using Nutch 1.14-SNAPSHOT and the following call:

$ bin/nutch parsechecker -Dplugin.includes='protocol-http|parse-html' \
  https://www.seguroscatalanaoccidente.com/cat/particulars/vida/assegurances-de-vida/vida-proteccio

Could you tell us which Nutch version you are using and which plugins are enabled?

Thanks,
Sebastian


On 08/09/2017 12:09 PM, Carlos Pérez Miguel wrote:
> Hi,
> 
> While crawling a site, I found that the crawl stopped earlier than expected
> because many of the URLs being downloaded were of the form:
> 
> http://www.domain.com/something/"http://www.domain.com"
> 
> After reading the HTML of the pages containing those outlinks, I found that
> the outlinks are not present in the source code, so I guess there may be
> something wrong either in the page content or in the parsing done by Nutch.
> How can I find out which one is the problem? I am a little lost with this one.
> 
> To reproduce the problem:
> 
> $ bin/nutch parsechecker \
>   https://www.seguroscatalanaoccidente.com/cat/particulars/vida/assegurances-de-vida/vida-proteccio
> 
> And within the results we can see this particular outlink:
>  outlink: toUrl: https://www.seguroscatalanaoccidente.com/cat/particulars/vida/assegurances-de-vida/"http://www.seguroscatalanaoccidente.com" anchor: www.seguroscatalanaoccidente.com
> 
> Is there any way to solve or avoid this? Maybe with the regex-urlfilter
> file?
> 
> Thanks
> 
> Carlos Pérez Miguel
>