Posted to user@nutch.apache.org by Ar...@csiro.au on 2015/04/17 06:31:40 UTC

A bug in org.apache.nutch.parse.ParseUtil?

Hi,

From reading the code, it is clear that it is designed to try several parsers on a document in sequence until one of them parses it successfully. In practice, this does not work, because these lines

if (parseResult != null && !parseResult.isEmpty())
        return parseResult;

break the loop even if the parsing has failed, because parseResult is not empty anyway: it contains a ParseData with ParseStatus.FAILED.
This is easy to fix, for example, by adding a line before the two lines mentioned above:

if ( parseResult != null ) parseResult.filter() ;

This will remove failed ParseData objects from the parseResult and leave it empty if the parsing was unsuccessful. I believe that this fix is important because it allows backup parsers to be used as originally designed and thus increases index completeness.
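
To make the effect concrete, here is a minimal, self-contained sketch of the loop described above. It uses simplified stand-in types (Entry, Result, firstAccepted and the applyFilter flag are all hypothetical names invented for this illustration), not the real Nutch classes, and it only assumes that filter() drops failed entries, as stated above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

// Simplified stand-ins for illustration only; these are NOT the real Nutch classes.
final class ParserFallbackSketch {

    // Stand-in for one parse outcome (the real code pairs ParseData with a ParseStatus).
    static final class Entry {
        final String url;
        final boolean success;

        Entry(String url, boolean success) {
            this.url = url;
            this.success = success;
        }
    }

    // Stand-in for ParseResult: a container of parse outcomes.
    static final class Result {
        private final List<Entry> entries = new ArrayList<>();

        Result add(String url, boolean success) {
            entries.add(new Entry(url, success));
            return this;
        }

        boolean isEmpty() {
            return entries.isEmpty();
        }

        // Assumed behaviour, as described in the message: drop the failed entries.
        void filter() {
            entries.removeIf(e -> !e.success);
        }
    }

    // Tries each parser in order and returns the index of the parser whose
    // result was accepted, or -1 if no result was accepted.
    static int firstAccepted(List<Function<String, Result>> parsers, String url, boolean applyFilter) {
        for (int i = 0; i < parsers.size(); i++) {
            Result parseResult = parsers.get(i).apply(url);
            if (applyFilter && parseResult != null) {
                parseResult.filter(); // the proposed extra line
            }
            if (parseResult != null && !parseResult.isEmpty()) {
                return i; // corresponds to "return parseResult" in the existing check
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        Function<String, Result> failing = url -> new Result().add(url, false);
        Function<String, Result> working = url -> new Result().add(url, true);
        List<Function<String, Result>> parsers = List.of(failing, working);

        // Without the filter the failed result is non-empty, so the loop stops at parser 0.
        System.out.println(firstAccepted(parsers, "http://example.org/", false)); // prints 0
        // With the filter the failed result becomes empty, so the backup parser 1 is used.
        System.out.println(firstAccepted(parsers, "http://example.org/", true)); // prints 1
    }
}
```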

Regards,
Arkadi



Re: A bug in org.apache.nutch.parse.ParseUtil?

Posted by "Mattmann, Chris A (3980)" <ch...@jpl.nasa.gov>.
Sounds great, Arkadi (isAnySuccess()). Please submit a pull
request and/or patch when you get a chance. This sounds like
a needed change for sure.

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: chris.a.mattmann@nasa.gov
WWW:  http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++






-----Original Message-----
From: "Arkadi.Kosmynin@csiro.au" <Ar...@csiro.au>
Reply-To: "user@nutch.apache.org" <us...@nutch.apache.org>
Date: Tuesday, April 21, 2015 at 12:20 AM
To: "user@nutch.apache.org" <us...@nutch.apache.org>
Subject: RE: A bug in org.apache.nutch.parse.ParseUtil?

>Hi Sebastian,
>
>Yes, I considered parseResult.isSuccess(), but the problem is, it returns
>success only if all parses were successful. So, if the first parser
>succeeds, it will break the loop, else all parsers will be used - I don't
>think this was the idea.
>
>If retaining ParseStatus of failed parses is important, perhaps a similar
>isAnySuccess() function could help.
>
>Regards,
>
>Arkadi


RE: A bug in org.apache.nutch.parse.ParseUtil?

Posted by Ar...@csiro.au.
Hi Sebastian,

Yes, I considered parseResult.isSuccess(), but the problem is, it returns success only if all parses were successful. So, if the first parser succeeds, it will break the loop, else all parsers will be used - I don't think this was the idea.

If retaining ParseStatus of failed parses is important, perhaps a similar isAnySuccess() function could help.
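
The difference between the two checks can be sketched as follows. ParseOutcomes is a hypothetical stand-in that holds one success flag per parse; it is not the real org.apache.nutch.parse.ParseResult, and isAnySuccess() here is only an illustration of the suggested semantics.

```java
import java.util.List;

// Hypothetical stand-in for illustration; not the real org.apache.nutch.parse.ParseResult.
final class ParseOutcomes {
    private final List<Boolean> successFlags;

    ParseOutcomes(List<Boolean> successFlags) {
        this.successFlags = successFlags;
    }

    // Semantics described above for isSuccess(): true only if every parse succeeded.
    boolean isSuccess() {
        return successFlags.stream().allMatch(ok -> ok);
    }

    // Suggested counterpart: true if at least one parse succeeded.
    boolean isAnySuccess() {
        return successFlags.stream().anyMatch(ok -> ok);
    }

    public static void main(String[] args) {
        ParseOutcomes mixed = new ParseOutcomes(List.of(true, false));
        System.out.println(mixed.isSuccess());    // prints false
        System.out.println(mixed.isAnySuccess()); // prints true
    }
}
```

Neither check modifies the result, so the ParseStatus of the failed parses stays available to the caller. Note that in this sketch an empty list makes isSuccess() return true and isAnySuccess() return false; how the real class treats an empty result is not covered here.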

Regards,

Arkadi

-----Original Message-----
From: Sebastian Nagel [mailto:wastl.nagel@googlemail.com] 
Sent: Saturday, 18 April 2015 7:37 AM
To: user@nutch.apache.org
Subject: Re: A bug in org.apache.nutch.parse.ParseUtil?

Hi Arkadi,

Agreed, that's a bug.

> if ( parseResult != null ) parseResult.filter() ;

parseResult.isSuccess()
  would do the check without modifying the ParseResult

In case the fall-back parsers also fail, it could be useful to return one (the first? the last?) failed ParseResult. Luckily, the parser places a meaningful error message or minor ParseStatus in it, which could be used by the caller for diagnostics.

Thanks,
Sebastian



Re: A bug in org.apache.nutch.parse.ParseUtil?

Posted by Sebastian Nagel <wa...@googlemail.com>.
Hi Arkadi,

Agreed, that's a bug.

> if ( parseResult != null ) parseResult.filter() ;

parseResult.isSuccess()
  would do the check without modifying the ParseResult

In case the fall-back parsers also fail, it could be useful to
return one (the first? the last?) failed ParseResult. Luckily, the parser
places a meaningful error message or minor ParseStatus in it, which
could be used by the caller for diagnostics.
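
A rough sketch of that idea, using hypothetical stand-in types (Outcome and FallbackWithDiagnostics are invented for this illustration and are not part of the Nutch API). It returns the first successful outcome and otherwise falls back to the last failed one; picking the last rather than the first failure is an arbitrary choice here, since that question is left open above.

```java
import java.util.List;
import java.util.function.Function;

// Hypothetical stand-ins for illustration; not the real Nutch API.
final class FallbackWithDiagnostics {

    // One parser's outcome: a success flag plus a message the caller can log.
    static final class Outcome {
        final boolean success;
        final String message;

        Outcome(boolean success, String message) {
            this.success = success;
            this.message = message;
        }
    }

    // Returns the first successful outcome. If every parser fails, returns the
    // last failed outcome so its message is still available for diagnostics.
    // Returns null only if no parser produced any outcome at all.
    static Outcome parse(List<Function<String, Outcome>> parsers, String content) {
        Outcome lastFailure = null;
        for (Function<String, Outcome> parser : parsers) {
            Outcome outcome = parser.apply(content);
            if (outcome == null) {
                continue; // this parser produced nothing; try the next one
            }
            if (outcome.success) {
                return outcome;
            }
            lastFailure = outcome; // remember the failure, keep trying
        }
        return lastFailure;
    }

    public static void main(String[] args) {
        Function<String, Outcome> first = c -> new Outcome(false, "first parser: unsupported format");
        Function<String, Outcome> second = c -> new Outcome(false, "second parser: truncated content");

        Outcome result = parse(List.of(first, second), "raw bytes");
        System.out.println(result.success); // prints false
        System.out.println(result.message); // prints second parser: truncated content
    }
}
```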

Thanks,
Sebastian



Re: A bug in org.apache.nutch.parse.ParseUtil?

Posted by "Mattmann, Chris A (3980)" <ch...@jpl.nasa.gov>.
Awesome Arkadi. This sounds legit.

Can you scope this?

https://github.com/apache/nutch/#contributing


File an issue and then push a PR, and I’ll be sure to merge it.

Cheers!

Chris






