Posted to user@nutch.apache.org by Andy Xue <an...@gmail.com> on 2012/05/31 03:34:04 UTC

"nutch-site.xml" not robust

Hi all:

I ran into the following problem with "*nutch-site.xml*" while using Nutch
trunk.
When listing multiple scoring filters in the property
"*scoring.filter.order*", it is vital that no spaces, newlines, or tabs are
placed in front of the first value. E.g.:
This is fine:
<value>org.apache.nutch.scoring.opic.OPICScoringFilter myFilter</value>

Either of these will generate an exception:
<value> org.apache.nutch.scoring.opic.OPICScoringFilter myFilter</value>
<value>
org.apache.nutch.scoring.opic.OPICScoringFilter
myFilter
</value>

The reason: in *org.apache.nutch.scoring.ScoringFilters*, the statement
(on line 59) "orderedFilters = order.split("\\s+");" splits this string on
whitespace. Leading whitespace produces an empty string as the first
array element, which then results in a ClassNotFound /
NullPointer exception.
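
To illustrate the split behaviour described above, here is a minimal standalone sketch. It is not the Nutch source: the class name WhitespaceSplitDemo and both helper methods are hypothetical, and trimming before the split is only one possible fix.

```java
import java.util.Arrays;

// Hypothetical demo class; not part of Nutch.
final class WhitespaceSplitDemo {

    // Mirrors the quoted statement: split on runs of whitespace, no trimming.
    static String[] naiveSplit(String order) {
        return order.split("\\s+");
    }

    // One possible fix (an assumption, not the actual patch): trim first,
    // so leading whitespace cannot produce an empty first element.
    static String[] trimmedSplit(String order) {
        String trimmed = order.trim();
        return trimmed.isEmpty() ? new String[0] : trimmed.split("\\s+");
    }

    public static void main(String[] args) {
        // Value as it would look when written across several lines.
        String value = "\norg.apache.nutch.scoring.opic.OPICScoringFilter\nmyFilter\n";

        // Three elements: the first one is the empty string.
        System.out.println(Arrays.toString(naiveSplit(value)));

        // Two elements: only the two filter names.
        System.out.println(Arrays.toString(trimmedSplit(value)));
    }
}
```

Trailing whitespace is harmless here, because String.split drops trailing empty strings; only the leading whitespace produces the extra element.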


This is easy to fix, of course, but what concerns me is that I suspect
other properties have the same problem (i.e., the value content must
immediately follow the *<value>* tag). That is not robust.

Any thoughts?

Regards
Andy

Re: "nutch-site.xml" not robust

Posted by Lewis John Mcgibbney <le...@gmail.com>.
Hi Andy,

This is really useful, and I can imagine it was a right pain to debug.
I think it is up to you whether to open more tickets... as you say,
this sounds like a common aspect of the way config is read from
nutch-site.xml.

If you come across any obvious ones which you think you could patch
(and you have time :)) then by all means patch them; the contributions
would be great.

Thanks

Lewis

On Tue, Jun 12, 2012 at 7:25 AM, Andy Xue <an...@gmail.com> wrote:
> Hi all:
>
> As I suspected, this problem affects more properties than the
> ones I described in NUTCH-1385.
> For instance, the property "plugin.includes":
>
>      <value>plugin_1|plugin_2</value>
> This is fine, it will load both plugins.
>
>      <value>plugin_1|plugin_2
>      </value>
> This is not fine since (I guess) the program will try to find a plugin
> named "plugin_2\n" (maybe not precise, but you get the idea).
>
> I spent hours debugging this before I finally found it. The cause
> is that my editor automatically reformats a long line by splitting it into
> multiple lines.
>
> So the rule here is: no matter how long a property value is, do not spread
> it across multiple lines. Otherwise something unexpected will happen.
>
> At this point, I'm not sure whether I should submit another ticket, because
> I don't know exactly which properties are affected by this problem. Just a
> heads-up for all of you who might encounter the same problem in the future.
>
> Regards
> Andy



-- 
Lewis

Re: "nutch-site.xml" not robust

Posted by Andy Xue <an...@gmail.com>.
Hi all:

As I suspected, this problem affects more properties than the
ones I described in NUTCH-1385.
For instance, the property "plugin.includes":

      <value>plugin_1|plugin_2</value>
This is fine, it will load both plugins.

      <value>plugin_1|plugin_2
      </value>
This is not fine since (I guess) the program will try to find a plugin
named "plugin_2\n" (maybe not precise, but you get the idea).
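
To make the guess above concrete, here is a small sketch. It assumes, and this is only an assumption, that the "plugin.includes" value is compiled as a regular expression and that a plugin id has to match it completely. PluginIncludesDemo and included() are hypothetical names, not Nutch code.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; not part of Nutch.
final class PluginIncludesDemo {

    // True if the plugin id matches the configured value as a whole.
    // Assumption: the value is used as a regex with full-match semantics.
    static boolean included(String configuredValue, String pluginId) {
        return Pattern.compile(configuredValue).matcher(pluginId).matches();
    }

    public static void main(String[] args) {
        String oneLine = "plugin_1|plugin_2";
        // Same value, but with the closing tag on the next line.
        String wrapped = "plugin_1|plugin_2\n      ";

        System.out.println(included(oneLine, "plugin_2"));        // true
        System.out.println(included(wrapped, "plugin_2"));        // false
        System.out.println(included(wrapped, "plugin_1"));        // true
        System.out.println(included(wrapped.trim(), "plugin_2")); // true
    }
}
```

Under that assumption the newline and indentation become part of the last alternative only, which would explain why plugin_1 still loads while plugin_2 silently does not.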

I spent hours debugging this before I finally found it. The cause
is that my editor automatically reformats a long line by splitting it into
multiple lines.

So the rule here is: no matter how long a property value is, do not spread
it across multiple lines. Otherwise something unexpected will happen.

At this point, I'm not sure whether I should submit another ticket, because
I don't know exactly which properties are affected by this problem. Just a
heads-up for all of you who might encounter the same problem in the future.

Regards
Andy


On 9 June 2012 11:42, Andy Xue <an...@gmail.com> wrote:

> Hi Lewis:
>
> Sorry for the delay. Sure, I'll open a ticket in a bit.
>
> Regards
> Andy

Re: "nutch-site.xml" not robust

Posted by Andy Xue <an...@gmail.com>.
Hi Lewis:

Sorry for the delay. Sure, I'll open a ticket in a bit.

Regards
Andy


On 7 June 2012 21:28, Lewis John Mcgibbney <le...@gmail.com> wrote:

> Hi Andy,
> Even opening a ticket and getting it logged would be great.
> Thanks
> Lewis

Re: "nutch-site.xml" not robust

Posted by Lewis John Mcgibbney <le...@gmail.com>.
Hi Andy,
Even opening a ticket and getting it logged would be great.
Thanks
Lewis

On Wed, Jun 6, 2012 at 3:53 AM, Andy Xue <an...@gmail.com> wrote:
> Hi Lewis:
>
> I'll try to find time to do it. Thanks for the reply.
>
> Regards
> Andy



-- 
Lewis

Re: "nutch-site.xml" not robust

Posted by Andy Xue <an...@gmail.com>.
Hi Lewis:

I'll try to find time to do it. Thanks for the reply.

Regards
Andy



On 31 May 2012 20:37, Lewis John Mcgibbney <le...@gmail.com> wrote:

> Hi Andy,
>
> This is a good catch, and I would suggest you open an issue on the Jira
> and submit a patch for the few instances where this actually
> occurs... e.g. I think there are currently 4 such instances in
> nutch-default which concern the ordering of such tools. Admittedly,
> though, I haven't dug down into the code to see whether it is consistent,
> as you assume...
>
> If you begin by investigating (and patching if necessary) these parts,
> then this would make a nice patch. As you are using trunk, I wouldn't
> imagine it would take you too long.
>
> Thanks very much
>
> Lewis

Re: "nutch-site.xml" not robust

Posted by Lewis John Mcgibbney <le...@gmail.com>.
Hi Andy,

This is a good catch, and I would suggest you open an issue on the Jira
and submit a patch for the few instances where this actually
occurs... e.g. I think there are currently 4 such instances in
nutch-default which concern the ordering of such tools. Admittedly,
though, I haven't dug down into the code to see whether it is consistent,
as you assume...

If you begin by investigating (and patching if necessary) these parts,
then this would make a nice patch. As you are using trunk, I wouldn't
imagine it would take you too long.

Thanks very much

Lewis

On Thu, May 31, 2012 at 2:34 AM, Andy Xue <an...@gmail.com> wrote:
> Hi all:
>
> I ran into the following problem with "*nutch-site.xml*" while using Nutch
> trunk.
> When listing multiple scoring filters in the property
> "*scoring.filter.order*", it is vital that no spaces, newlines, or tabs are
> placed in front of the first value. E.g.:
> This is fine:
> <value>org.apache.nutch.scoring.opic.OPICScoringFilter myFilter</value>
>
> Either of these will generate an exception:
> <value> org.apache.nutch.scoring.opic.OPICScoringFilter myFilter</value>
> <value>
> org.apache.nutch.scoring.opic.OPICScoringFilter
> myFilter
> </value>
>
> The reason: in *org.apache.nutch.scoring.ScoringFilters*, the statement
> (on line 59) "orderedFilters = order.split("\\s+");" splits this string on
> whitespace. Leading whitespace produces an empty string as the first
> array element, which then results in a ClassNotFound /
> NullPointer exception.
>
>
> This is easy to fix, of course, but what concerns me is that I suspect
> other properties have the same problem (i.e., the value content must
> immediately follow the *<value>* tag). That is not robust.
>
> Any thoughts?
>
> Regards
> Andy



-- 
Lewis