Posted to dev@nutch.apache.org by Mahmoud Gzawi <gz...@gmail.com> on 2015/03/20 14:52:47 UTC

multiple parses from one page

Hi everyone.

Is there any way to extract multiple parses from one page in Nutch 2.x?
Can anyone give me a hint about where I should start digging?

Thanks.

Re: Problem with redirection

Posted by Mahmoud GZAWI <gz...@gmail.com>.

Hi Sebastian.
Thank you for your reply and sorry for answering late.

I'm using Nutch 2.3.
You were right, the URL normalizers were causing the links to change.
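
For readers hitting the same symptom, here is a minimal sketch of how a session-stripping normalizer changes the redirect target from this thread. The pattern and the function name `strip_session_id` are simplified, hypothetical stand-ins, not the rule shipped in Nutch's regex-normalize.xml. If the fetcher records the redirect under one form of the URL and later looks it up under the other, the two no longer match.

```python
import re

# Assumed, simplified pattern: drop a ";jsessionid=..." path parameter
# up to the next "?" or "#" (or the end of the URL).
SESSION_ID = re.compile(r";jsessionid=[^?#]*", re.IGNORECASE)

def strip_session_id(url):
    """Return the URL with any ;jsessionid=... path parameter removed."""
    return SESSION_ID.sub("", url)

# Redirect target reported by parsechecker in this thread.
redirect_target = (
    "https://www.abudhabi.ae/portal/public/ar/homepage"
    ";jsessionid=rNytVMVLN8l8B5rQ1dhFys5vLxh3ytyJJ9Q9vJX1xMqQ0nKHQtn8"
    "!1467609500!-303895625!1426887979336?_adf.ctrl-state=ynvgo25c6_4"
)

normalized = strip_session_id(redirect_target)
print(normalized)
print("changed by normalizer:", normalized != redirect_target)
```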



2015-03-22 12:03 GMT+01:00 Sebastian Nagel <wa...@googlemail.com>:

> Hi Mahmoud,
>
> which version of Nutch 2.x is used exactly?
> Are all URLs in the redirect chain really accepted by URL filters?
> Do URL normalizers change URLs (esp. ";jsessionid=...")?
>
> Thanks,
> Sebastian

Re: Problem with redirection

Posted by Sebastian Nagel <wa...@googlemail.com>.

Hi Mahmoud,

which version of Nutch 2.x is used exactly?
Are all URLs in the redirect chain really accepted by URL filters?
Do URL normalizers change URLs (esp. ";jsessionid=...")?

Thanks,
Sebastian
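
One way to check the first question offline is to replay the redirect chain against the filter rules. The sketch below is an assumption-laden model, not Nutch code: it uses the "+regex" / "-regex" line format of regex-urlfilter.txt, assumes the first matching rule decides, and assumes a URL matched by no rule is rejected. The two rules are modelled on ones commonly found in the default file; `load_rules` and `accepts` are hypothetical helper names.

```python
import re

# Assumed rule file contents, in the "+regex" / "-regex" line format.
RULES_TEXT = """
# skip URLs containing certain characters as probable queries
-[?*!@=]
# accept anything else
+.
"""

def load_rules(text):
    """Parse rule lines into (accept, compiled_regex) pairs, skipping comments."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        sign, pattern = line[0], line[1:]
        if sign not in "+-":
            raise ValueError("rule must start with + or -: " + line)
        rules.append((sign == "+", re.compile(pattern)))
    return rules

def accepts(rules, url):
    """First matching rule decides; a URL matched by no rule is rejected."""
    for accept, pattern in rules:
        if pattern.search(url):
            return accept
    return False

RULES = load_rules(RULES_TEXT)

# Redirect chain reported in this thread.
CHAIN = [
    "http://www.abudhabi.ae",
    "https://www.abudhabi.ae/",
    "https://www.abudhabi.ae/portal/faces/link?docName=homepage",
    "https://www.abudhabi.ae/portal/public/ar/homepage"
    ";jsessionid=rNytVMVLN8l8B5rQ1dhFys5vLxh3ytyJJ9Q9vJX1xMqQ0nKHQtn8"
    "!1467609500!-303895625!1426887979336?_adf.ctrl-state=ynvgo25c6_4",
]

for url in CHAIN:
    print("accepted" if accepts(RULES, url) else "rejected", url)
```

Under these assumptions an empty rule list rejects every URL, so commenting out all lines of the file (including a catch-all accept rule) is not the same as disabling the filter.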

On 03/20/2015 10:56 PM, Mahmoud Gzawi wrote:
> Hi everyone,
> 
> I have a problem with redirection when crawling this site: http://www.abudhabi.ae
> 
> $ bin/nutch parsechecker 'http://www.abudhabi.ae'
> gives:
> fetching: http://www.abudhabi.ae
> Fetch failed with protocol status: TEMP_MOVED: https://www.abudhabi.ae/
> 
> With the new URL from the TEMP_MOVED response:
> $ bin/nutch parsechecker 'https://www.abudhabi.ae/'
> gives:
> fetching: https://www.abudhabi.ae/
> Fetch failed with protocol status: MOVED: https://www.abudhabi.ae/portal/faces/link?docName=homepage
> 
> $ bin/nutch parsechecker 'https://www.abudhabi.ae/portal/faces/link?docName=homepage'
> gives:
> fetching: https://www.abudhabi.ae/portal/faces/link?docName=homepage
> Fetch failed with protocol status: TEMP_MOVED:
> https://www.abudhabi.ae/portal/public/ar/homepage;jsessionid=rNytVMVLN8l8B5rQ1dhFys5vLxh3ytyJJ9Q9vJX1xMqQ0nKHQtn8!1467609500!-303895625!1426887979336?_adf.ctrl-state=ynvgo25c6_4
> 
> 
> $ bin/nutch parsechecker
> 'https://www.abudhabi.ae/portal/public/ar/homepage;jsessionid=rNytVMVLN8l8B5rQ1dhFys5vLxh3ytyJJ9Q9vJX1xMqQ0nKHQtn8!1467609500!-303895625!1426887979336?_adf.ctrl-state=ynvgo25c6_4'
> 
> gives:
> fetching:
> https://www.abudhabi.ae/portal/public/ar/homepage;jsessionid=rNytVMVLN8l8B5rQ1dhFys5vLxh3ytyJJ9Q9vJX1xMqQ0nKHQtn8!1467609500!-303895625!1426887979336?_adf.ctrl-state=ynvgo25c6_4
> 
> parsing:
> https://www.abudhabi.ae/portal/public/ar/homepage;jsessionid=rNytVMVLN8l8B5rQ1dhFys5vLxh3ytyJJ9Q9vJX1xMqQ0nKHQtn8!1467609500!-303895625!1426887979336?_adf.ctrl-state=ynvgo25c6_4
> 
> contentType: text/html
> signature: 35b57b41538448fb349ea17d6566c981
> ---------
> Url
> ---------------
> 
> https://www.abudhabi.ae/portal/public/ar/homepage;jsessionid=rNytVMVLN8l8B5rQ1dhFys5vLxh3ytyJJ9Q9vJX1xMqQ0nKHQtn8!1467609500!-303895625!1426887979336?_adf.ctrl-state=ynvgo25c6_4
> 
> ---------
> Metadata
> ---------
> 
> OriginalCharEncoding :     utf-8
> CharEncodingForConversion :     utf-8
> _rs_ :     �
> ---------
> Outlinks
> ---------
> 
>   outlink: toUrl:
> https://www.abudhabi.ae/portal/public/ar/homepage;jsessionid=rNytVMVLN8l8B5rQ1dhFys5vLxh3ytyJJ9Q9vJX1xMqQ0nKHQtn8!1467609500!-303895625!1426887979336
> anchor:
> ....
> ---------
> Headers
> ---------
> 
> X-Frame-Options :     sameorigin
> Date :     Fri, 20 Mar 2015 21:47:43 GMT
> Vary :     Accept-Encoding
> Content-Encoding :     gzip
> Via :     web01
> Set-Cookie : TS2a6b03=c230722c6f33dbf6d343c17f54e0d7547c5ff57bc615414f550c957ea05e78c67b18aea0;
> path=/portal
> Connection :     close
> Content-Type :     text/html;charset=utf-8
> 
> 
> So the last link was parsed successfully. But when I try to crawl the site I don't get any documents. I
> tried changing http.redirect.max to 5, I deactivated all the lines in regex-urlfilter.txt,
> and I also tried running the crawl command bin/crawl with 100 rounds, but I still don't get any
> parsed documents.
> 
> Can somebody help!

Problem with redirection

Posted by Mahmoud Gzawi <gz...@gmail.com>.

Hi everyone,

I have a problem with redirection when crawling this site: 
http://www.abudhabi.ae

$ bin/nutch parsechecker 'http://www.abudhabi.ae'
gives:
fetching: http://www.abudhabi.ae
Fetch failed with protocol status: TEMP_MOVED: https://www.abudhabi.ae/

With the new URL from the TEMP_MOVED response:
$ bin/nutch parsechecker 'https://www.abudhabi.ae/'
gives:
fetching: https://www.abudhabi.ae/
Fetch failed with protocol status: MOVED: 
https://www.abudhabi.ae/portal/faces/link?docName=homepage

$ bin/nutch parsechecker 
'https://www.abudhabi.ae/portal/faces/link?docName=homepage'
gives:
fetching: https://www.abudhabi.ae/portal/faces/link?docName=homepage
Fetch failed with protocol status: TEMP_MOVED: 
https://www.abudhabi.ae/portal/public/ar/homepage;jsessionid=rNytVMVLN8l8B5rQ1dhFys5vLxh3ytyJJ9Q9vJX1xMqQ0nKHQtn8!1467609500!-303895625!1426887979336?_adf.ctrl-state=ynvgo25c6_4

$ bin/nutch parsechecker 
'https://www.abudhabi.ae/portal/public/ar/homepage;jsessionid=rNytVMVLN8l8B5rQ1dhFys5vLxh3ytyJJ9Q9vJX1xMqQ0nKHQtn8!1467609500!-303895625!1426887979336?_adf.ctrl-state=ynvgo25c6_4'
gives:
fetching: 
https://www.abudhabi.ae/portal/public/ar/homepage;jsessionid=rNytVMVLN8l8B5rQ1dhFys5vLxh3ytyJJ9Q9vJX1xMqQ0nKHQtn8!1467609500!-303895625!1426887979336?_adf.ctrl-state=ynvgo25c6_4
parsing: 
https://www.abudhabi.ae/portal/public/ar/homepage;jsessionid=rNytVMVLN8l8B5rQ1dhFys5vLxh3ytyJJ9Q9vJX1xMqQ0nKHQtn8!1467609500!-303895625!1426887979336?_adf.ctrl-state=ynvgo25c6_4
contentType: text/html
signature: 35b57b41538448fb349ea17d6566c981
---------
Url
---------------

https://www.abudhabi.ae/portal/public/ar/homepage;jsessionid=rNytVMVLN8l8B5rQ1dhFys5vLxh3ytyJJ9Q9vJX1xMqQ0nKHQtn8!1467609500!-303895625!1426887979336?_adf.ctrl-state=ynvgo25c6_4
---------
Metadata
---------

OriginalCharEncoding :     utf-8
CharEncodingForConversion :     utf-8
_rs_ :     �
---------
Outlinks
---------

   outlink: toUrl: 
https://www.abudhabi.ae/portal/public/ar/homepage;jsessionid=rNytVMVLN8l8B5rQ1dhFys5vLxh3ytyJJ9Q9vJX1xMqQ0nKHQtn8!1467609500!-303895625!1426887979336 
anchor:
....
---------
Headers
---------

X-Frame-Options :     sameorigin
Date :     Fri, 20 Mar 2015 21:47:43 GMT
Vary :     Accept-Encoding
Content-Encoding :     gzip
Via :     web01
Set-Cookie : 
TS2a6b03=c230722c6f33dbf6d343c17f54e0d7547c5ff57bc615414f550c957ea05e78c67b18aea0; 
path=/portal
Connection :     close
Content-Type :     text/html;charset=utf-8


So the last link was parsed successfully. But when I try to crawl the site 
I don't get any documents. I tried changing http.redirect.max to 5, I 
deactivated all the lines in regex-urlfilter.txt, and I also tried 
running the crawl command bin/crawl with 100 rounds, but I still don't 
get any parsed documents.

Can somebody help!
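
The chain above has three hops before a page is actually parsed. The sketch below replays it from a hard-coded table (no network access) to show how a hop limit interacts with it. It assumes the limit behaves as a simple hop counter; how Nutch 2.x actually applies http.redirect.max may differ, and `follow` is a hypothetical helper.

```python
FINAL = (
    "https://www.abudhabi.ae/portal/public/ar/homepage"
    ";jsessionid=rNytVMVLN8l8B5rQ1dhFys5vLxh3ytyJJ9Q9vJX1xMqQ0nKHQtn8"
    "!1467609500!-303895625!1426887979336?_adf.ctrl-state=ynvgo25c6_4"
)

# Redirects as reported by parsechecker in this message (source -> target).
REDIRECTS = {
    "http://www.abudhabi.ae": "https://www.abudhabi.ae/",
    "https://www.abudhabi.ae/":
        "https://www.abudhabi.ae/portal/faces/link?docName=homepage",
    "https://www.abudhabi.ae/portal/faces/link?docName=homepage": FINAL,
}

def follow(url, max_redirects):
    """Follow recorded redirects; return (final_url or None, hops taken)."""
    hops = 0
    while url in REDIRECTS:
        if hops >= max_redirects:
            return None, hops  # limit reached before a fetchable page
        url = REDIRECTS[url]
        hops += 1
    return url, hops

for limit in (2, 5):
    final_url, hops = follow("http://www.abudhabi.ae", limit)
    print("limit", limit, "->", "gave up" if final_url is None else "reached page", "after", hops, "hops")
```

With this model a limit of 5 is enough for the chain, which suggests the limit alone does not explain the missing documents.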

Re: multiple parses from one page

Posted by Mahmoud Gzawi <gz...@gmail.com>.

By the way, I'm using a modified version of the xpath-filter plugin. The 
problem is that Nutch returns only one parse per page, so when I produce 
multiple parses in the plugin, Nutch overwrites them and returns only 
the last one.

On 20/03/2015 20:54, Mahmoud Gzawi wrote:
> Hi Sebastian,
> Thanks for your reply,
>
> I mean the first option:
>
> - separate parse trees (DOM trees) for parts of a document,
>   e.g., chapters, sections, tables, and other structural elements
>
> Let's say I have an HTML page with several sections. I need to extract 
> every section (using XPath) as a parse and index it as a separate 
> document. Every parse will have its own metadata, outlinks, content, 
> title ...
>
> Thanks,
>
> On 20/03/2015 20:38, Sebastian Nagel wrote:
>> Hi Mahmoud,
>>
>> what is meant by "multiple parses"?
>>
>> - separate parse trees (DOM trees) for parts of a document,
>>    e.g., chapters, sections, tables, and other structural elements
>> - interpreting the same documents with multiple parsers,
>>    e.g., different HTML parsers
>> - parses of multi-document containers, e.g. zip files
>>
>> Thanks,
>> Sebastian
>>
>>
>> On 03/20/2015 02:52 PM, Mahmoud Gzawi wrote:
>>> Hi everyone.
>>>
>>> Is there any way to extract multiple parses from one page in Nutch 
>>> 2.x? Can anyone give me a hint about where
>>> I should start digging?
>>>
>>> Thanks.
>

Re: multiple parses from one page

Posted by Mahmoud Gzawi <gz...@gmail.com>.

Hi Sebastian,
Thanks for your reply,

I mean the first option:

- separate parse trees (DOM trees) for parts of a document,
   e.g., chapters, sections, tables, and other structural elements

Let's say I have an HTML page with several sections. I need to extract 
every section (using XPath) as a parse and index it as a separate 
document. Every parse will have its own metadata, outlinks, content, 
title ...

Thanks,
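
To make the request concrete, here is a rough sketch of the intended result, written outside Nutch with Python's standard library. It is only an illustration: the sample page, the `.//section` XPath, the `url#id` key scheme and the function name `split_into_parses` are all assumptions, and it does not use the Nutch parse API. Each section becomes its own entry with a distinct key, so a later entry cannot overwrite an earlier one.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample page with two sections.
PAGE = """<html><body>
<section id="intro"><h2>Intro</h2><p>First part. <a href="/a">A</a></p></section>
<section id="details"><h2>Details</h2><p>Second part. <a href="/b">B</a></p></section>
</body></html>"""

def split_into_parses(base_url, html, section_xpath=".//section"):
    """Return one parse-like dict per matched section, keyed by a unique URL."""
    root = ET.fromstring(html)
    parses = {}
    for i, section in enumerate(root.findall(section_xpath)):
        # Distinct key per section: base URL plus fragment (id or position).
        key = f"{base_url}#{section.get('id', i)}"
        heading = section.find(".//h2")
        parses[key] = {
            "title": heading.text if heading is not None else "",
            # Collapse whitespace in the section's text content.
            "content": " ".join(" ".join(section.itertext()).split()),
            "outlinks": [a.get("href") for a in section.findall(".//a[@href]")],
        }
    return parses

parses = split_into_parses("http://example.com/page", PAGE)
for key, parse in parses.items():
    print(key, parse)
```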

On 20/03/2015 20:38, Sebastian Nagel wrote:
> Hi Mahmoud,
>
> what is meant by "multiple parses"?
>
> - separate parse trees (DOM trees) for parts of a document,
>    e.g., chapters, sections, tables, and other structural elements
> - interpreting the same documents with multiple parsers,
>    e.g., different HTML parsers
> - parses of multi-document containers, e.g. zip files
>
> Thanks,
> Sebastian
>
>
> On 03/20/2015 02:52 PM, Mahmoud Gzawi wrote:
>> Hi everyone.
>>
>> Is there any way to extract multiple parses from one page in Nutch 2.x? Can anyone give me a hint about where
>> I should start digging?
>>
>> Thanks.

Re: multiple parses from one page

Posted by Sebastian Nagel <wa...@googlemail.com>.

Hi Mahmoud,

what is meant by "multiple parses"?

- separate parse trees (DOM trees) for parts of a document,
  e.g., chapters, sections, tables, and other structural elements
- interpreting the same documents with multiple parsers,
  e.g., different HTML parsers
- parses of multi-document containers, e.g. zip files

Thanks,
Sebastian


On 03/20/2015 02:52 PM, Mahmoud Gzawi wrote:
> Hi everyone.
> 
> Is there any way to extract multiple parses from one page in Nutch 2.x? Can anyone give me a hint about where
> I should start digging?
> 
> Thanks.