Posted to user@nutch.apache.org by Canan GİRGİN <ca...@gmail.com> on 2013/03/25 21:17:51 UTC

parsechecker and redirection

Hi,

I use the "bin/nutch parsechecker" command (Nutch 2.1). It works fine. But when I
try the parsechecker command with a redirected page, parseFilters returns wrong
results, because the parse text contains the redirect description.

Is there any problem?

Thanks, Canan

Nutch 2.1 / Ubuntu 12.04 / MySQL

Re: parsechecker and redirection

Posted by Lewis John Mcgibbney <le...@gmail.com>.
Hi Alex,
We need to fix this.
Can you please open an issue in the Jira and we can address?
Thank you very much in advance.
Lewis

On Mon, Mar 25, 2013 at 4:53 PM, <al...@aim.com> wrote:

> Hello,
>
> I would like to let you know that Nutch 2.x currently does not index
> redirected pages, regardless of whether they are parsed.
>
> Thanks.
> Alex.

Re: parsechecker and redirection

Posted by al...@aim.com.
Hello,

I would like to let you know that Nutch 2.x currently does not index redirected pages, regardless of whether they are parsed.

Thanks.
Alex.

 
 

Re: parsechecker and redirection

Posted by Lewis John Mcgibbney <le...@gmail.com>.
Hi Seb,
I've commented on the tickets. I am happy to commit the patches for the 1st
and 3rd.
Please let me know whether you want me to commit them or whether you will do it.
Thanks
Lewis

On Mon, Mar 25, 2013 at 3:51 PM, Sebastian Nagel <wastl.nagel@googlemail.com> wrote:

> Hi Lewis,
>
> let's address NUTCH-1038, NUTCH-1389, NUTCH-1419, and NUTCH-1501!


-- 
*Lewis*

Re: parsechecker and redirection

Posted by Sebastian Nagel <wa...@googlemail.com>.
Hi Lewis,

let's address NUTCH-1038, NUTCH-1389, NUTCH-1419, and NUTCH-1501!



Re: parsechecker and redirection

Posted by Lewis John Mcgibbney <le...@gmail.com>.
Thanks for the clarification on this one, Seb.
I was aware that you were clued up on this and hoped you would drop in.


-- 
*Lewis*

Re: parsechecker and redirection

Posted by Canan GİRGİN <ca...@gmail.com>.
Hi,

Thanks for the quick replies.
I opened a new Jira issue [0] about following redirects.
NUTCH-1419 is a useful step.


[0]: https://issues.apache.org/jira/browse/NUTCH-1546




Re: parsechecker and redirection

Posted by Sebastian Nagel <wa...@googlemail.com>.
Hi Canan, hi Lewis,

parsechecker cannot follow redirects, and that is also the case in trunk / 1.x.

It would be nice, at least, if parsechecker would report
clearly that there is a redirect. Currently, you have to check the
content metadata for the redirect target, which is easy to overlook.

% nutch parsechecker http://apachecon.eu
...
Content Metadata: Date=Mon, 25 Mar 2013 21:51:22 GMT Location=http://www.apachecon.eu/
...
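
Until parsechecker reports redirects itself, this check can be scripted. The
following is only a minimal sketch in Python, assuming the output contains a
"Content Metadata:" line in the format shown above. The function name
redirect_target is hypothetical and not part of Nutch, and a Location value
on its own does not prove that the response was a redirect.

```python
import re
from typing import Optional

# Match "Location=<value>" but not e.g. "Content-Location=<value>".
_LOCATION = re.compile(r"(?<![\w-])Location=(\S+)")


def redirect_target(parsechecker_output: str) -> Optional[str]:
    """Return the Location value from the "Content Metadata:" line, if any.

    Assumes the line format seen in the parsechecker output quoted above;
    other Nutch versions may print the metadata differently.
    """
    for line in parsechecker_output.splitlines():
        if line.strip().startswith("Content Metadata:"):
            match = _LOCATION.search(line)
            if match:
                return match.group(1)
    return None
```

If the function returns a URL, parsechecker can be run again on that URL to
check the parse of the page the redirect points to.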

There is already NUTCH-1419: report redirect and do not parse.
@Lewis: I'll review the latest patch soon, so we can sort this out.

@Canan: feel free to open a new Jira to make parsechecker follow redirects. Thanks!

Sebastian




Re: parsechecker and redirection

Posted by Lewis John Mcgibbney <le...@gmail.com>.
Hi Canan,
Thank you for bringing this up. I just noticed that 2.x does not have the
following configurable property in nutch-default.xml:

<property>
  <name>http.redirect.max</name>
  <value>0</value>
  <description>The maximum number of redirects the fetcher will follow when
  trying to fetch a page. If set to negative or 0, fetcher won't immediately
  follow redirected URLs, instead it will record them for later fetching.
  </description>
</property>
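
To illustrate the behaviour that description specifies, here is a minimal
sketch in Python. It is not the Nutch fetcher code: fetch_with_redirects, the
fetch callback and the returned tuple are assumptions made for illustration
only. Giving up once the limit is exceeded is also an assumption.

```python
from typing import Callable, List, Optional, Tuple
from urllib.parse import urljoin

# Hypothetical fetch callback: returns (HTTP status, Location header or None).
FetchFn = Callable[[str], Tuple[int, Optional[str]]]

REDIRECT_CODES = {301, 302, 303, 307, 308}


def fetch_with_redirects(
    url: str, fetch: FetchFn, redirect_max: int
) -> Tuple[Optional[str], List[str]]:
    """Return (url_actually_fetched, urls_recorded_for_later).

    redirect_max <= 0: do not follow; record the redirect target instead.
    redirect_max  > 0: follow at most that many redirects, then give up.
    """
    deferred: List[str] = []
    current = url
    followed = 0
    while True:
        status, location = fetch(current)
        if status not in REDIRECT_CODES or location is None:
            return current, deferred          # not a redirect: done
        target = urljoin(current, location)   # resolve relative Location
        if redirect_max <= 0:
            deferred.append(target)           # record for later fetching
            return None, deferred
        if followed >= redirect_max:
            return None, deferred             # limit reached: give up
        current = target
        followed += 1
```

With a value of 0 the redirect from http://apachecon.eu is only recorded;
with a value of 1 or more the target page itself would be fetched.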

I've also looked over the trunk and 2.x branches and it seems that with
regards to handling redirects, trunk is more functionally capable.
I don't have time to look into this just now.
You could start by looking into the trunk code before the 2.x code to
see how redirects should be handled and how a configurable depth can be
specified for fetching such URLs.
It seems that we need to add such functionality to 2.x.
Contributions would be very very welcome on this issue.
Lewis




-- 
*Lewis*