Posted to dev@nutch.apache.org by Yann Levreau <ya...@gmail.com> on 2014/06/15 11:20:31 UTC

nutch elpais.com

Hi everyone!

I'm sorry to disturb you, but I need some help getting the
outlinks of http://elpais.com.
I use Nutch 2.2.1.

The web page is parsed correctly: in debug I can see all the outlinks in the
Parse object.
I use these basic plugins:

protocol-http|parse-html|index-(basic|anchor)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)

But the outlinks are never written to HBase (with either http://elpais.com or
http://www.elpais.com).
If I parse www.nytimes.com instead, the outlinks are written normally and
added to the fetch list.

Any ideas?
Thanks
Yann

==> I have the same issue with http://www.lemonde.fr

Re: nutch elpais.com

Posted by Yann Levreau <ya...@gmail.com>.
You're right, I need to clean up these config files. I think these plugins
came from Nutch 1.7 (bad copy/paste :) )
I have news about my issue. There were actually two issues:

1) Outlinks are not set in the WebPage:

In ParseUtil.java (line 195), we have:

    if (ParseStatusUtils.isSuccess(pstatus)) {
      if (pstatus.getMinorCode() == ParseStatusCodes.SUCCESS_REDIRECT) {
        String newUrl = ParseStatusUtils.getMessage(pstatus);
        int refreshTime = Integer.parseInt(ParseStatusUtils.getArg(pstatus, 1));

When the minor code is ParseStatusCodes.SUCCESS_REDIRECT (100), the outlinks
are not set in the WebPage even though they are present in the Parse. This is
due to line 219 in HtmlParser.java:




    ParseStatus status = new ParseStatus();
    status.setMajorCode(ParseStatusCodes.SUCCESS);
    if (metaTags.getRefresh()) {
      -----> status.setMinorCode(ParseStatusCodes.SUCCESS_REDIRECT); <------
      status.addToArgs(new Utf8(metaTags.getRefreshHref().toString()));
      status.addToArgs(new Utf8(Integer.toString(metaTags.getRefreshTime())));
    }


Replacing ParseStatusCodes.SUCCESS_REDIRECT with
ParseStatusCodes.SUCCESS corrects the behavior of ParseUtil.java.
But maybe I'm wrong to do this: ParseStatusCodes.SUCCESS_REDIRECT is
probably there for a good reason.
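
To make the reported behavior easier to discuss, here is a minimal, self-contained Java sketch of the control flow described in 1). It is not Nutch code: the class RedirectOutlinkSketch, the Page type and both store methods are hypothetical stand-ins, and the constant values are assumptions (100 is the value quoted above). It only contrasts a redirect branch that drops the outlinks with one possible alternative that records the redirect and still keeps them.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical model of the behavior described in the message; not Nutch code. */
final class RedirectOutlinkSketch {

    // Stand-in status codes (assumed values; 100 is the one quoted in the message).
    static final int SUCCESS_OK = 0;
    static final int SUCCESS_REDIRECT = 100;

    /** Hypothetical stand-in for the stored page row. */
    static final class Page {
        final List<String> outlinks = new ArrayList<>();
        String redirectTarget;
    }

    /** Mirrors the reported behavior: the redirect branch never copies the outlinks. */
    static Page storeAsReported(int minorCode, String refreshHref, List<String> parsedOutlinks) {
        Page page = new Page();
        if (minorCode == SUCCESS_REDIRECT) {
            page.redirectTarget = refreshHref; // parsed outlinks are ignored on this path
        } else {
            page.outlinks.addAll(parsedOutlinks);
        }
        return page;
    }

    /** One possible alternative: record the redirect target and keep the outlinks too. */
    static Page storeKeepingOutlinks(int minorCode, String refreshHref, List<String> parsedOutlinks) {
        Page page = new Page();
        if (minorCode == SUCCESS_REDIRECT) {
            page.redirectTarget = refreshHref;
        }
        page.outlinks.addAll(parsedOutlinks);
        return page;
    }

    public static void main(String[] args) {
        // Placeholder URLs, not taken from the crawled sites.
        List<String> parsed = Arrays.asList("http://example.com/a", "http://example.com/b");
        String refresh = "http://example.com/refresh";

        Page reported = storeAsReported(SUCCESS_REDIRECT, refresh, parsed);
        Page alternative = storeKeepingOutlinks(SUCCESS_REDIRECT, refresh, parsed);

        System.out.println("reported behavior, outlinks stored: " + reported.outlinks.size());
        System.out.println("alternative, outlinks stored: " + alternative.outlinks.size());
    }
}
```

Compared with replacing SUCCESS_REDIRECT by SUCCESS in the parser, the second method keeps the redirect information, which may matter if other code relies on that minor code.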

2) With www.nytimes.com:

Web pages are redirects. For example,

    http://www.nytimes.com/2014/06/17/world/middleeast/iraq.html

leads to

    http://www.nytimes.com/glogin?URI=http%3A%2F%2Fwww.nytimes.com%2F2014%2F06%2F17%2Fworld%2Fmiddleeast%2Firaq.html%3F_r%3D0

which leads to

    http://www.nytimes.com/2014/06/17/world/middleeast/iraq.html?_r=0

and so on.
We never get the content of this page. But maybe this is by design and
there is a better way to crawl this site...
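
As an illustration of the redirect chain in 2), the following Java sketch does two things under stated assumptions. It decodes the URI parameter of the glogin URL quoted above with java.net.URLDecoder, which shows that the login page points back at the article with ?_r=0 appended. It also follows a redirect table with a hop limit and stops on a repeated URL. The class RedirectChainSketch, its methods and the example.com table are hypothetical, and nothing here performs real HTTP requests or reflects how Nutch handles redirects.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper for inspecting redirect chains; not part of Nutch. */
final class RedirectChainSketch {

    private static final String MARKER = "?URI=";

    /** Returns the decoded value of the URI parameter, or null if the URL has none. */
    static String decodeUriParam(String url) {
        int start = url.indexOf(MARKER);
        if (start < 0) {
            return null;
        }
        try {
            return URLDecoder.decode(url.substring(start + MARKER.length()), "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is always available on a Java platform.
            throw new IllegalStateException(e);
        }
    }

    /**
     * Walks a redirect table (source URL to target URL) from start.
     * Stops at a URL with no redirect, at a URL already visited, or after maxHops redirects.
     */
    static List<String> follow(String start, Map<String, String> redirects, int maxHops) {
        List<String> visited = new ArrayList<>();
        String current = start;
        while (current != null && visited.size() <= maxHops && !visited.contains(current)) {
            visited.add(current);
            current = redirects.get(current);
        }
        return visited;
    }

    public static void main(String[] args) {
        String login = "http://www.nytimes.com/glogin?URI=http%3A%2F%2Fwww.nytimes.com"
                + "%2F2014%2F06%2F17%2Fworld%2Fmiddleeast%2Firaq.html%3F_r%3D0";
        System.out.println(decodeUriParam(login));

        // Placeholder table that loops: a -> b -> a.
        Map<String, String> table = new HashMap<>();
        table.put("http://example.com/a", "http://example.com/b");
        table.put("http://example.com/b", "http://example.com/a");
        System.out.println(follow("http://example.com/a", table, 5));
    }
}
```

A hop limit plus a visited check is a common way to keep a crawler from following such a chain indefinitely; whether the chain ends for a client that accepts cookies is not something this sketch can tell.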

I'm sorry if this mailing list is the wrong place for this. I'm sending it
just in case some of you have had the same issue.
Thanks a lot!

Yann




2014-06-16 2:13 GMT-07:00 Julien Nioche <li...@gmail.com>:

> Salut Yann,
>
> Not really answering your question but where did you get this config from?
> Some of its elements have been long deprecated (query-*, response-*,
> summary-*)
>
> Julien
>
> --
>
> Open Source Solutions for Text Engineering
>
> http://digitalpebble.blogspot.com/
> http://www.digitalpebble.com
> http://twitter.com/digitalpebble
>

Re: nutch elpais.com

Posted by Julien Nioche <li...@gmail.com>.
Salut Yann,

Not really answering your question but where did you get this config from?
Some of its elements have been long deprecated (query-*, response-*,
summary-*)

Julien


-- 

Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble