Posted to dev@nutch.apache.org by Mahmoud Gzawi <gz...@gmail.com> on 2015/03/20 14:52:47 UTC
multiple parses from one page
Hi everyone.
Is there any way to extract multiple parses from one page in Nutch 2.x?
Can anyone give hints on where I should start digging?
Thanks.
Re: Problem with redirection
Posted by Mahmoud GZAWI <gz...@gmail.com>.
Hi Sebastian.
Thank you for your reply and sorry for answering late.
I'm using Nutch 2.3.
You were right: the URL normalizers were causing the links to change.
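For anyone hitting the same symptom, here is a minimal sketch of how a session-ID normalization rule can rewrite a redirect target. The regex and the `normalize` helper are hypothetical illustrations, not rules copied from a Nutch configuration file; the URL is the last redirect target from the original report.

```python
import re

# Hypothetical rule (an assumption for illustration): drop a
# ";jsessionid=..." path parameter, up to the query string or fragment.
JSESSIONID = re.compile(r";jsessionid=[^?#]*", re.IGNORECASE)


def normalize(url: str) -> str:
    """Return the URL with any ;jsessionid=... path parameter removed."""
    return JSESSIONID.sub("", url)


redirect_target = (
    "https://www.abudhabi.ae/portal/public/ar/homepage"
    ";jsessionid=rNytVMVLN8l8B5rQ1dhFys5vLxh3ytyJJ9Q9vJX1xMqQ0nKHQtn8"
    "!1467609500!-303895625!1426887979336?_adf.ctrl-state=ynvgo25c6_4"
)

normalized = normalize(redirect_target)
print(normalized)
# The stored URL no longer equals the URL the server redirected to.
print(normalized != redirect_target)
```

If the server only serves the page under the URL that still carries the session ID, a crawler that stores the normalized form may be redirected again on the next fetch.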
2015-03-22 12:03 GMT+01:00 Sebastian Nagel <wa...@googlemail.com>:
> Hi Mahmoud,
>
> which version of Nutch 2.x is used exactly?
> Are all URLs in the redirect chain really accepted by URL filters?
> Do URL normalizers change URLs (esp. ";jsessionid=...")?
>
> Thanks,
> Sebastian
Re: Problem with redirection
Posted by Sebastian Nagel <wa...@googlemail.com>.
Hi Mahmoud,
which version of Nutch 2.x is used exactly?
Are all URLs in the redirect chain really accepted by URL filters?
Do URL normalizers change URLs (esp. ";jsessionid=...")?
Thanks,
Sebastian
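The first question can be checked offline with a rough sketch like the one below. The rule list and the `accepted` helper are assumptions written in the "+regex / -regex" style of regex-urlfilter.txt; they are not the contents of any particular installation, and the "no matching rule means rejected" behaviour is a property of this sketch.

```python
import re

# Hypothetical rules (assumptions for illustration). The first matching
# rule decides; in this sketch a URL that matches no rule is rejected.
RULES = [
    ("-", re.compile(r"[?*!@=]")),  # reject URLs containing these characters
    ("+", re.compile(r".")),        # accept everything else
]


def accepted(url, rules=RULES):
    """Return True if the first rule matching the URL is a '+' rule."""
    for sign, pattern in rules:
        if pattern.search(url):
            return sign == "+"
    return False


chain = [
    "http://www.abudhabi.ae",
    "https://www.abudhabi.ae/",
    "https://www.abudhabi.ae/portal/faces/link?docName=homepage",
]
for url in chain:
    print(accepted(url), url)
```

With these assumed rules the third URL is rejected because of the `?` and `=` characters, so the redirect chain would stop there even though each URL can be fetched individually.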
On 03/20/2015 10:56 PM, Mahmoud Gzawi wrote:
> Hi everyone,
>
> I have a problem with redirection when crawling this site: http://www.abudhabi.ae
>
> $ bin/nutch parsechecker 'http://www.abudhabi.ae'
> gives:
> fetching: http://www.abudhabi.ae
> Fetch failed with protocol status: TEMP_MOVED: https://www.abudhabi.ae/
>
> Retrying with the TEMP_MOVED target:
> $ bin/nutch parsechecker 'https://www.abudhabi.ae/'
> gives:
> fetching: https://www.abudhabi.ae/
> Fetch failed with protocol status: MOVED: https://www.abudhabi.ae/portal/faces/link?docName=homepage
>
> $ bin/nutch parsechecker 'https://www.abudhabi.ae/portal/faces/link?docName=homepage'
> gives:
> fetching: https://www.abudhabi.ae/portal/faces/link?docName=homepage
> Fetch failed with protocol status: TEMP_MOVED:
> https://www.abudhabi.ae/portal/public/ar/homepage;jsessionid=rNytVMVLN8l8B5rQ1dhFys5vLxh3ytyJJ9Q9vJX1xMqQ0nKHQtn8!1467609500!-303895625!1426887979336?_adf.ctrl-state=ynvgo25c6_4
>
>
> $ bin/nutch parsechecker
> 'https://www.abudhabi.ae/portal/public/ar/homepage;jsessionid=rNytVMVLN8l8B5rQ1dhFys5vLxh3ytyJJ9Q9vJX1xMqQ0nKHQtn8!1467609500!-303895625!1426887979336?_adf.ctrl-state=ynvgo25c6_4'
>
> gives:
> fetching:
> https://www.abudhabi.ae/portal/public/ar/homepage;jsessionid=rNytVMVLN8l8B5rQ1dhFys5vLxh3ytyJJ9Q9vJX1xMqQ0nKHQtn8!1467609500!-303895625!1426887979336?_adf.ctrl-state=ynvgo25c6_4
>
> parsing:
> https://www.abudhabi.ae/portal/public/ar/homepage;jsessionid=rNytVMVLN8l8B5rQ1dhFys5vLxh3ytyJJ9Q9vJX1xMqQ0nKHQtn8!1467609500!-303895625!1426887979336?_adf.ctrl-state=ynvgo25c6_4
>
> contentType: text/html
> signature: 35b57b41538448fb349ea17d6566c981
> ---------
> Url
> ---------------
>
> https://www.abudhabi.ae/portal/public/ar/homepage;jsessionid=rNytVMVLN8l8B5rQ1dhFys5vLxh3ytyJJ9Q9vJX1xMqQ0nKHQtn8!1467609500!-303895625!1426887979336?_adf.ctrl-state=ynvgo25c6_4
>
> ---------
> Metadata
> ---------
>
> OriginalCharEncoding : utf-8
> CharEncodingForConversion : utf-8
> _rs_ : �
> ---------
> Outlinks
> ---------
>
> outlink: toUrl:
> https://www.abudhabi.ae/portal/public/ar/homepage;jsessionid=rNytVMVLN8l8B5rQ1dhFys5vLxh3ytyJJ9Q9vJX1xMqQ0nKHQtn8!1467609500!-303895625!1426887979336
> anchor:
> ....
> ---------
> Headers
> ---------
>
> X-Frame-Options : sameorigin
> Date : Fri, 20 Mar 2015 21:47:43 GMT
> Vary : Accept-Encoding
> Content-Encoding : gzip
> Via : web01
> Set-Cookie : TS2a6b03=c230722c6f33dbf6d343c17f54e0d7547c5ff57bc615414f550c957ea05e78c67b18aea0;
> path=/portal
> Connection : close
> Content-Type : text/html;charset=utf-8
>
>
> So the last link was parsed successfully. But when I try to crawl the site, I don't get any documents. I
> tried changing http.redirect.max to 5, I deactivated all the lines in regex-urlfilter.txt,
> and I also tried running the crawl command bin/crawl with 100 rounds, but I still don't get any
> parsed documents.
>
> Can somebody help?
Problem with redirection
Posted by Mahmoud Gzawi <gz...@gmail.com>.
Hi everyone,
I have a problem with redirection when crawling this site:
http://www.abudhabi.ae
$ bin/nutch parsechecker 'http://www.abudhabi.ae'
gives:
fetching: http://www.abudhabi.ae
Fetch failed with protocol status: TEMP_MOVED: https://www.abudhabi.ae/
Retrying with the TEMP_MOVED target:
$ bin/nutch parsechecker 'https://www.abudhabi.ae/'
gives:
fetching: https://www.abudhabi.ae/
Fetch failed with protocol status: MOVED:
https://www.abudhabi.ae/portal/faces/link?docName=homepage
$ bin/nutch parsechecker
'https://www.abudhabi.ae/portal/faces/link?docName=homepage'
gives:
fetching: https://www.abudhabi.ae/portal/faces/link?docName=homepage
Fetch failed with protocol status: TEMP_MOVED:
https://www.abudhabi.ae/portal/public/ar/homepage;jsessionid=rNytVMVLN8l8B5rQ1dhFys5vLxh3ytyJJ9Q9vJX1xMqQ0nKHQtn8!1467609500!-303895625!1426887979336?_adf.ctrl-state=ynvgo25c6_4
$ bin/nutch parsechecker
'https://www.abudhabi.ae/portal/public/ar/homepage;jsessionid=rNytVMVLN8l8B5rQ1dhFys5vLxh3ytyJJ9Q9vJX1xMqQ0nKHQtn8!1467609500!-303895625!1426887979336?_adf.ctrl-state=ynvgo25c6_4'
gives:
fetching:
https://www.abudhabi.ae/portal/public/ar/homepage;jsessionid=rNytVMVLN8l8B5rQ1dhFys5vLxh3ytyJJ9Q9vJX1xMqQ0nKHQtn8!1467609500!-303895625!1426887979336?_adf.ctrl-state=ynvgo25c6_4
parsing:
https://www.abudhabi.ae/portal/public/ar/homepage;jsessionid=rNytVMVLN8l8B5rQ1dhFys5vLxh3ytyJJ9Q9vJX1xMqQ0nKHQtn8!1467609500!-303895625!1426887979336?_adf.ctrl-state=ynvgo25c6_4
contentType: text/html
signature: 35b57b41538448fb349ea17d6566c981
---------
Url
---------------
https://www.abudhabi.ae/portal/public/ar/homepage;jsessionid=rNytVMVLN8l8B5rQ1dhFys5vLxh3ytyJJ9Q9vJX1xMqQ0nKHQtn8!1467609500!-303895625!1426887979336?_adf.ctrl-state=ynvgo25c6_4
---------
Metadata
---------
OriginalCharEncoding : utf-8
CharEncodingForConversion : utf-8
_rs_ : �
---------
Outlinks
---------
outlink: toUrl:
https://www.abudhabi.ae/portal/public/ar/homepage;jsessionid=rNytVMVLN8l8B5rQ1dhFys5vLxh3ytyJJ9Q9vJX1xMqQ0nKHQtn8!1467609500!-303895625!1426887979336
anchor:
....
---------
Headers
---------
X-Frame-Options : sameorigin
Date : Fri, 20 Mar 2015 21:47:43 GMT
Vary : Accept-Encoding
Content-Encoding : gzip
Via : web01
Set-Cookie :
TS2a6b03=c230722c6f33dbf6d343c17f54e0d7547c5ff57bc615414f550c957ea05e78c67b18aea0;
path=/portal
Connection : close
Content-Type : text/html;charset=utf-8
So the last link was parsed successfully. But when I try to crawl the site,
I don't get any documents. I tried changing http.redirect.max to 5, I
deactivated all the lines in regex-urlfilter.txt, and I also tried
running the crawl command bin/crawl with 100 rounds, but I still don't
get any parsed documents.
Can somebody help?
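The redirect chain reported above has three hops, which a redirect limit of 5 should cover. The sketch below is a simplified offline model of a redirect limit, not Nutch's fetcher code; the `follow` helper and the `REDIRECTS` table are hypothetical and only replay the URLs from this report.

```python
FINAL = (
    "https://www.abudhabi.ae/portal/public/ar/homepage"
    ";jsessionid=rNytVMVLN8l8B5rQ1dhFys5vLxh3ytyJJ9Q9vJX1xMqQ0nKHQtn8"
    "!1467609500!-303895625!1426887979336?_adf.ctrl-state=ynvgo25c6_4"
)

# Redirects observed with parsechecker, replayed as a lookup table.
REDIRECTS = {
    "http://www.abudhabi.ae": "https://www.abudhabi.ae/",
    "https://www.abudhabi.ae/":
        "https://www.abudhabi.ae/portal/faces/link?docName=homepage",
    "https://www.abudhabi.ae/portal/faces/link?docName=homepage": FINAL,
}


def follow(start, redirects, max_redirects):
    """Follow redirects from start; return the final URL,
    or None if more than max_redirects hops would be needed."""
    url = start
    hops = 0
    while url in redirects:
        if hops >= max_redirects:
            return None
        url = redirects[url]
        hops += 1
    return url


print(follow("http://www.abudhabi.ae", REDIRECTS, 5) == FINAL)
print(follow("http://www.abudhabi.ae", REDIRECTS, 2))
```

In this model a limit of 3 or more reaches the final page, so the limit alone does not explain the missing documents; every URL in the chain must also survive URL filtering and normalization unchanged.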
Re: multiple parses from one page
Posted by Mahmoud Gzawi <gz...@gmail.com>.
By the way, I'm using a modification of the xpath-filter plugin. The
problem is that Nutch returns only one parse, so when I have multiple
parses in the plugin, Nutch overwrites all the parses and returns only
the last one.
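To illustrate the goal outside of Nutch, here is a minimal sketch that splits one page into per-section records, each stored under its own derived key so that later sections do not overwrite earlier ones. It does not use any Nutch API. The `split_sections` helper, the `section` class name, and the `url#id` key scheme are all assumptions, and the sample page is well-formed XHTML so the standard-library XML parser can read it (real-world HTML would need an HTML parser).

```python
import xml.etree.ElementTree as ET

# Hypothetical sample page with two sections.
PAGE = """<html><body>
<div class="section" id="intro"><h2>Intro</h2><p>First part.</p></div>
<div class="section" id="details"><h2>Details</h2><p>Second part.</p></div>
</body></html>"""


def split_sections(url, xhtml):
    """Return one record per section, keyed by url#id."""
    root = ET.fromstring(xhtml)
    parses = {}
    # ElementTree supports this limited XPath subset.
    for div in root.findall(".//div[@class='section']"):
        key = f"{url}#{div.get('id')}"
        title = div.findtext("h2", default="")
        # Collapse whitespace in the section's text content.
        text = " ".join(" ".join(div.itertext()).split())
        parses[key] = {"title": title, "text": text}
    return parses


for key, parse in split_sections("http://example.com/page", PAGE).items():
    print(key, parse["title"])
```

Storing every section under the page URL itself would leave only the last record, which matches the overwrite described above; distinct keys are what keep the sections apart.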
On 20/03/2015 20:54, Mahmoud Gzawi wrote:
> Hi Sebastian,
> Thanks for your reply,
>
> I think the answer is:
>
> - separate parse trees (DOM trees) for parts of a document,
> e.g., chapters, sections, tables, and other structural elements
>
> Let's say I have an HTML page with several sections. I need to extract
> (using XPath) every section as a parse and index it as a separate
> document. Every parse will have its own metadata, outlinks, content,
> title ...
>
> Thanks,
>
> On 20/03/2015 20:38, Sebastian Nagel wrote:
>> Hi Mahmoud,
>>
>> what is meant by "multiple parses"?
>>
>> - separate parse trees (DOM trees) for parts of a document,
>> e.g., chapters, sections, tables, and other structural elements
>> - interpreting the same documents with multiple parsers,
>> e.g., different HTML parsers
>> - parses of multi-document containers, e.g. zip files
>>
>> Thanks,
>> Sebastian
>>
>>
>> On 03/20/2015 02:52 PM, Mahmoud Gzawi wrote:
>>> Hi everyone.
>>>
>>> Is there any way to extract multiple parses from one page in Nutch
>>> 2.x? Can anyone give hints on where
>>> I should start digging?
>>>
>>> Thanks.
>
Re: multiple parses from one page
Posted by Mahmoud Gzawi <gz...@gmail.com>.
Hi Sebastian,
Thanks for your reply,
I think the answer is:
- separate parse trees (DOM trees) for parts of a document,
e.g., chapters, sections, tables, and other structural elements
Let's say I have an HTML page with several sections. I need to extract
(using XPath) every section as a parse and index it as a separate
document. Every parse will have its own metadata, outlinks, content,
title ...
Thanks,
On 20/03/2015 20:38, Sebastian Nagel wrote:
> Hi Mahmoud,
>
> what is meant by "multiple parses"?
>
> - separate parse trees (DOM trees) for parts of a document,
> e.g., chapters, sections, tables, and other structural elements
> - interpreting the same documents with multiple parsers,
> e.g., different HTML parsers
> - parses of multi-document containers, e.g. zip files
>
> Thanks,
> Sebastian
>
>
> On 03/20/2015 02:52 PM, Mahmoud Gzawi wrote:
>> Hi everyone.
>>
>> Is there any way to extract multiple parses from one page in Nutch 2.x? Can anyone give hints on where
>> I should start digging?
>>
>> Thanks.
Re: multiple parses from one page
Posted by Sebastian Nagel <wa...@googlemail.com>.
Hi Mahmoud,
what is meant by "multiple parses"?
- separate parse trees (DOM trees) for parts of a document,
e.g., chapters, sections, tables, and other structural elements
- interpreting the same documents with multiple parsers,
e.g., different HTML parsers
- parses of multi-document containers, e.g. zip files
Thanks,
Sebastian
On 03/20/2015 02:52 PM, Mahmoud Gzawi wrote:
> Hi everyone.
>
> Is there any way to extract multiple parses from one page in Nutch 2.x? Can anyone give hints on where
> I should start digging?
>
> Thanks.