Posted to user@nutch.apache.org by Julien Nioche <li...@gmail.com> on 2012/12/01 09:37:21 UTC

Re: Wrong ParseData in segment

Hi

Could this have anything to do with the serialization? Objects are reused in
the writers; if a field is not reset, its old value carries over as-is to the
following object.
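
To illustrate the kind of bug I mean, here is a minimal sketch in plain Java.
It uses no Hadoop or Nutch classes; Record, readBuggy and readFixed are
made-up names for illustration, not actual Nutch code. One instance is reused
for two records, and the buggy reader never clears the metadata map, so record
A inherits a key that only record B set.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ReuseSketch {

    /** Hypothetical stand-in for an object that is reused across records. */
    static class Record {
        String url;
        final Map<String, String> meta = new HashMap<>();

        // Buggy: url is overwritten, but meta is only added to, never cleared.
        void readBuggy(String url, Map<String, String> incoming) {
            this.url = url;
            meta.putAll(incoming);
        }

        // Fixed: every field is reset before the new values are loaded.
        void readFixed(String url, Map<String, String> incoming) {
            this.url = url;
            meta.clear();
            meta.putAll(incoming);
        }
    }

    /** Loads record B, then record A, into the same reused instance. */
    static List<String> run(boolean fixed) {
        Map<String, String> metaB = new HashMap<>();
        metaB.put("custom", "from-B");                 // only B sets this key
        Map<String, String> metaA = new HashMap<>();   // A has no custom key

        List<String> urls = Arrays.asList("B", "A");
        List<Map<String, String>> metas = Arrays.asList(metaB, metaA);

        Record reused = new Record();
        List<String> seen = new ArrayList<>();
        for (int i = 0; i < urls.size(); i++) {
            if (fixed) {
                reused.readFixed(urls.get(i), metas.get(i));
            } else {
                reused.readBuggy(urls.get(i), metas.get(i));
            }
            seen.add(reused.url + " custom=" + reused.meta.get("custom"));
        }
        return seen;
    }

    public static void main(String[] args) {
        System.out.println(run(false)); // [B custom=from-B, A custom=from-B]
        System.out.println(run(true));  // [B custom=from-B, A custom=null]
    }
}
```

In the buggy run, A ends up with B's value only because B happened to be
processed just before it on the same instance, which would also explain why
such a leak shows up rarely and is hard to reproduce.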

Ju

On 30 November 2012 21:22, Markus Jelsma <ma...@openindex.io> wrote:

> Hi
>
> In our case it is really in the segment, and ends up in the index. Are
> there any known issues with parse filters? In that filter we do set the
> Parse object as a class attribute, but we reset it with the new Parse object
> right after filter() is called.
>
> I also cannot see how the custom Tika ContentHandler could be the issue: a
> new ContentHandler is created for each parse and passed to the
> TeeContentHandler, just like all other ContentHandlers.
>
> I assume an individual parse is completely isolated from another because
> all those objects are created new for each record.
>
> Does anyone have a clue, however slight? Or any general tips on this, or
> how to attempt to reproduce it?
>
>
> Thanks
>
> -----Original message-----
> > From:Sebastian Nagel <wa...@googlemail.com>
> > Sent: Fri 30-Nov-2012 21:04
> > To: user@nutch.apache.org
> > Subject: Re: Wrong ParseData in segment
> >
> > Hi Markus,
> >
> > sounds somewhat similar to NUTCH-1252 but that was rather trivial
> > and easy to reproduce.
> >
> > Sebastian
> >
> > 2012/11/30 Markus Jelsma <ma...@openindex.io>:
> > > Hi,
> > >
> > > We've got an issue where one in a few thousand records partially
> > > contains another record's ParseMeta data. To be specific, record A
> > > ends up with the ParseMeta data of record B that is added by one of
> > > our custom parse plugins. I'm unsure as to where the problem really
> > > is because the parse plugin receives data from a modified parser
> > > plugin that in turn adds a custom Tika ContentHandler.
> > >
> > > Because I'm unable to reproduce this I had to inspect the code for
> > > places where an object is reused but an attribute is not reset. To
> > > me, that would be the most obvious problem, but until now I've been
> > > unsuccessful in finding the issue!
> > >
> > > Regardless of how remote the chance is of someone having had some
> > > similar issue: does anyone have some ideas to share?
> > >
> > > Thanks,
> > > Markus
> >
>



-- 
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble