Posted to dev@nutch.apache.org by "Chris A. Mattmann (JIRA)" <ji...@apache.org> on 2006/11/23 18:20:04 UTC

[jira] Closed: (NUTCH-406) Metadata tries to write null values

     [ http://issues.apache.org/jira/browse/NUTCH-406?page=all ]

Chris A. Mattmann closed NUTCH-406.
-----------------------------------


Patch applied to trunk:

http://svn.apache.org/viewvc?view=rev&revision=478619




> Metadata tries to write null values
> -----------------------------------
>
>                 Key: NUTCH-406
>                 URL: http://issues.apache.org/jira/browse/NUTCH-406
>             Project: Nutch
>          Issue Type: Bug
>    Affects Versions: 0.9.0
>            Reporter: Doğacan Güney
>         Assigned To: Chris A. Mattmann
>             Fix For: 0.9.0
>
>         Attachments: NUTCH-406.patch, NUTCH-406.patch
>
>
> During parsing, some URLs (especially PDFs, it seems) may create <some_key, null> pairs in ParseData's parseMeta.
> When Metadata.write() tries to write such a pair, it throws a NullPointerException (NPE).
> The stack trace looks something like this:
>         at org.apache.hadoop.io.Text.encode(Text.java:373)
>         at org.apache.hadoop.io.Text.encode(Text.java:354)
>         at org.apache.hadoop.io.Text.writeString(Text.java:394)
>         at org.apache.nutch.metadata.Metadata.write(Metadata.java:214)
> I can consistently reproduce this using the following URL:
> http://www.efesbev.com/corporate_governance/pdf/MergerAgreement.pdf
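
The failure mode described in the report can be illustrated outside Nutch. The following is a minimal sketch, not the Nutch code: the class NullValueWriteRepro and its methods are hypothetical, and java.io.DataOutputStream.writeUTF stands in for Hadoop's Text.writeString. Like the call in the stack trace above, it fails with a NullPointerException when handed a null string.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in for a metadata writer; not the Nutch Metadata class.
class NullValueWriteRepro {

  // Naive writer: hands every value, including null, to the string encoder.
  static void naiveWrite(Map<String, String> meta, DataOutputStream out) throws IOException {
    out.writeInt(meta.size());
    for (Map.Entry<String, String> entry : meta.entrySet()) {
      out.writeUTF(entry.getKey());
      // writeUTF dereferences its argument, so a null value fails here.
      out.writeUTF(entry.getValue());
    }
  }

  // Returns true if writing the map fails with a NullPointerException.
  static boolean failsWithNpe(Map<String, String> meta) throws IOException {
    try {
      naiveWrite(meta, new DataOutputStream(new ByteArrayOutputStream()));
      return false;
    } catch (NullPointerException expected) {
      return true;
    }
  }

  public static void main(String[] args) throws IOException {
    Map<String, String> meta = new LinkedHashMap<>();
    meta.put("some_key", null); // the <some_key, null> pair from the report
    System.out.println("NPE on null value: " + failsWithNpe(meta));
  }
}
```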

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

       

Re: [jira] Closed: (NUTCH-406) Metadata tries to write null values

Posted by Andrzej Bialecki <ab...@getopt.org>.
Chris Mattmann wrote:
>> 4. The issue could have been iterated on in JIRA a bit further so that all of
>> these could have been caught before the commit.
>>     
>
> This is true: however, I thought that the point of bringing in new people
> was to move forward on some of these critical issues that keep moving their
> way down the priority stack? The issues that you raise above (e.g.,
> whitespace v. tabs, and "unnecessary comments"), although relevant points,
> really had nothing to do with the fix itself. I wanted to get the fix into
> the sources before everyone went away for Thanksgiving (at least here in the
> U.S.), so that users could pull it down sooner rather than later. Is this
> not the correct policy? I'm a n00b, so I dunno ;)
>   

My practice is to let a fix mature for a day or two (or three during a
holiday season), even if it seems innocuous. The reason is that people
quite often come back with valuable and totally unexpected insights
(peer review) _when_ and if they have had a chance to see the fix, and
considering different time zones, occupations, and workloads, this may
take a day or two even with the best intentions... If a fix is complicated,
I explicitly ask for feedback.

-- 
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: [jira] Closed: (NUTCH-406) Metadata tries to write null values

Posted by Sami Siren <ss...@gmail.com>.
Chris Mattmann wrote:
>> 2. You left some unnecessary comments in the source; the bug history is
>> already in JIRA and the commit logs
> 
> I would disagree with this statement: no comment is "unnecessary". What if
> the users don't look into JIRA, or don't scan through the commit logs? The
> change that we just made was critical, though subtle, and a user could gloss
> over the fact that only non-null values get written now. BTW, I'm a fan of

That kind of information should go in the proper place: the javadoc for a
method or class, where it is really visible. I would not store history
information in Java sources; there are tools that serve that purpose
better.

> more comments, rather than less ;)

Don't get me wrong, I have no problem with comments if they serve a
purpose and are in the proper place.

>> 4. The issue could have been iterated on in JIRA a bit further so that all of
>> these could have been caught before the commit.
> 
> This is true: however, I thought that the point of bringing in new people
> was to move forward on some of these critical issues that keep moving their
> way down the priority stack? The issues that you raise above (e.g.,
> whitespace v. tabs, and "unnecessary comments"), although relevant points,
> really had nothing to do with the fix itself. I wanted to get the fix into
> the sources before everyone went away for Thanksgiving (at least here in the
> U.S.), so that users could pull it down sooner rather than later. Is this

IMO there's no point in rushing into things - Nutch will still be there tomorrow.

> not the correct policy? I'm a n00b, so I dunno ;)

we're all noobs

--
  Sami Siren

Re: [jira] Closed: (NUTCH-406) Metadata tries to write null values

Posted by Chris Mattmann <ch...@jpl.nasa.gov>.
Hi Sami,


On 11/23/06 9:45 AM, "Sami Siren" <ss...@gmail.com> wrote:

> Couple of points:
> 
> 1. You used tabs

I just installed a new version of Eclipse and forgot to change the default
preference from tabs to spaces. I've gone ahead and changed this in my
Eclipse, and I will shortly commit an update that uses spaces instead of
tabs.

> 2. You left some unnecessary comments in the source; the bug history is
> already in JIRA and the commit logs

I would disagree with this statement: no comment is "unnecessary". What if
the users don't look into JIRA, or don't scan through the commit logs? The
change that we just made was critical, though subtle, and a user could gloss
over the fact that only non-null values get written now. BTW, I'm a fan of
more comments, rather than less ;)

> 3. Why not add a test case?

Good point. I'll add a test case for this in TestMetadata.
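
A check along these lines might look like the following. This is only a sketch under stated assumptions, not the code from TestMetadata or from the committed patch: the class NullSafeRoundTrip and its methods are hypothetical, writeUTF/readUTF stand in for Hadoop's Text.writeString/readString, metadata is modeled as a map from a name to an array of values, and null values are assumed to be skipped at write time ("only non-null values get written"). Keys are assumed to be non-null.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of a null-skipping writer plus a round-trip check.
class NullSafeRoundTrip {

  // Writes only non-null values; the count prefix reflects what is written.
  static void write(Map<String, String[]> meta, DataOutput out) throws IOException {
    out.writeInt(meta.size());
    for (Map.Entry<String, String[]> entry : meta.entrySet()) {
      List<String> kept = new ArrayList<>();
      String[] values = entry.getValue();
      if (values != null) {
        for (String value : values) {
          if (value != null) {
            kept.add(value);
          }
        }
      }
      out.writeUTF(entry.getKey());
      out.writeInt(kept.size());
      for (String value : kept) {
        out.writeUTF(value);
      }
    }
  }

  // Reads back the layout produced by write().
  static Map<String, String[]> read(DataInput in) throws IOException {
    int keyCount = in.readInt();
    Map<String, String[]> meta = new LinkedHashMap<>();
    for (int i = 0; i < keyCount; i++) {
      String key = in.readUTF();
      String[] values = new String[in.readInt()];
      for (int j = 0; j < values.length; j++) {
        values[j] = in.readUTF();
      }
      meta.put(key, values);
    }
    return meta;
  }

  // Serializes to a byte array and deserializes again.
  static Map<String, String[]> roundTrip(Map<String, String[]> meta) throws IOException {
    ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    write(meta, new DataOutputStream(bytes));
    return read(new DataInputStream(new ByteArrayInputStream(bytes.toByteArray())));
  }

  public static void main(String[] args) throws IOException {
    Map<String, String[]> meta = new LinkedHashMap<>();
    meta.put("title", new String[] {"Merger Agreement", null});
    meta.put("some_key", new String[] {null});
    Map<String, String[]> copy = roundTrip(meta);
    System.out.println(Arrays.toString(copy.get("title")));
    System.out.println(copy.get("some_key").length);
  }
}
```

In this sketch a key whose values are all null survives the round trip with an empty value array; whether such keys should instead be dropped entirely is a design choice the thread does not settle.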

> 4. The issue could have been iterated on in JIRA a bit further so that all of
> these could have been caught before the commit.

This is true: however, I thought that the point of bringing in new people
was to move forward on some of these critical issues that keep moving their
way down the priority stack? The issues that you raise above (e.g.,
whitespace v. tabs, and "unnecessary comments"), although relevant points,
really had nothing to do with the fix itself. I wanted to get the fix into
the sources before everyone went away for Thanksgiving (at least here in the
U.S.), so that users could pull it down sooner rather than later. Is this
not the correct policy? I'm a n00b, so I dunno ;)

Cheers,
  Chris
 




Re: [jira] Closed: (NUTCH-406) Metadata tries to write null values

Posted by Sami Siren <ss...@gmail.com>.
Couple of points:

1. You used tabs
2. You left some unnecessary comments in the source; the bug history is
already in JIRA and the commit logs
3. Why not add a test case?
4. The issue could have been iterated on in JIRA a bit further so that all of
these could have been caught before the commit.

--
  Sami Siren



