
Posted to user@nutch.apache.org by Bai Shen <ba...@gmail.com> on 2011/12/23 16:27:21 UTC

Re: Multiple values encountered for non multivalued field

Okay, I've modified my local version so that the indexer-more plugin only
adds its title when there's not an existing title.
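
The change described above could look roughly like the following sketch. It
is not the actual patch: it is written in Python rather than Nutch's Java, it
models the index document as a plain dict mapping a field name to a list of
values, and the names add_fallback_title and header_filename are hypothetical.

```python
def add_fallback_title(doc, header_filename):
    """Use the Content-Disposition filename as title only if no title exists.

    `doc` is a stand-in for the index document (assumption): a dict mapping
    a field name to a list of values. `header_filename` is the filename taken
    from the HTTP header, or None when the header is absent.
    """
    # Empty or whitespace-only titles do not count as an existing title.
    has_title = any(t.strip() for t in doc.get("title", []))
    if not has_title and header_filename:
        # No usable title from the parser: fall back to the filename.
        doc["title"] = [header_filename]
    return doc


# The parser already supplied a title, so the filename is not added.
print(add_fallback_title({"title": ["Parsed PDF title"]}, "report.pdf"))
# No title at all, so the filename becomes the single title value.
print(add_fallback_title({}, "report.pdf"))
```

With this guard the title field never receives a second value from the
header, so a single-valued Solr field would not be violated by this path.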

How do I create and submit a patch?

On Thu, Nov 3, 2011 at 3:00 PM, Markus Jelsma <ma...@openindex.io> wrote:

> Ah you're right. There's an issue for this. You're welcome to submit a
> patch:
>
> https://issues.apache.org/jira/browse/NUTCH-1140
>
> I'll mark it for 1.5, seems it isn't yet.
>
> > Actually, it turns out it's a Nutch issue.  Tika outputs the correct
> > title for the pdf.  However, the indexer-more plugin is adding in the
> > filename due to the HTTP header "Content-Disposition".
> >
> > Is there a way to turn this off while keeping the other functionality of
> > the plugin?  I'd prefer not to have a bunch of tweaks in the Nutch code.
> >
> > On Wed, Nov 2, 2011 at 10:11 AM, Markus Jelsma <ma...@openindex.io> wrote:
> > > The output is a bit misleading indeed. The file has two valid titles
> > > and two are being extracted. The title and the filename are both seen
> > > as titles by Tika.
> > >
> > > You can spot this behaviour better using the indexchecker tool.
> > >
> > > Please consult the Tika wiki, docs or mailing list on how to proceed.
> > > Either that or make your Solr schema field for title multiValued and
> > > deal with it appropriately in your search front-end.
> > >
> > > Cheers
> > >
> > > On Wednesday 02 November 2011 15:02:11 Bai Shen wrote:
> > > > Found it right after I asked. :)  BTW, the command is wrong on the
> > > > wiki.  I need to get around to making an account so I can fix things.
> > > >
> > > > I ran it on the pdf url and it only gives me one title.  But it's
> > > > pretty long.  Could that be the problem?
> > > >
> > > > The url is
> > > > http://www.sipri.org/yearbook/2011/files/SIPRIYB11summaryNL.pdf if you
> > > > want to check yourself.
> > > >
> > > > On Wed, Nov 2, 2011 at 9:18 AM, Markus Jelsma <ma...@openindex.io> wrote:
> > > > > bin/nutch parsechecker <url>
> > > > >
> > > > > see also:
> > > > > http://wiki.apache.org/nutch/CommandLineOptions
> > > > >
> > > > > On Wednesday 02 November 2011 14:16:10 Bai Shen wrote:
> > > > > > Parsechecker tool?  Where do I find that?
> > > > > >
> > > > > > > On Tue, Nov 1, 2011 at 4:56 PM, Markus Jelsma <ma...@openindex.io> wrote:
> > > > > > > > > I'm running the latest version of 1.4  We just rebuilt it
> > > > > > > > > last week. Is that patch included?
> > > > > > > >
> > > > > > > > Yes, so you actually have more than one non-zero length titles
> > > > > > > > coming from your parser. Please try the parsechecker tool and
> > > > > > > > confirm, but i'm not sure it is capable of showing multiple
> > > > > > > > titles.
> > > > > > > >
> > > > > > > > > And where would it get multiple titles from?
> > > > > > > >
> > > > > > > > Most likely from PDF or other document types. You can check
> > > > > > > > with a stand-alone Tika.
> > > > > > > >
> > > > > > > > > How do I tell what the titles
> > > > > > > > > are so I can see if they're valid or not?
> > > > > > > >
> > > > > > > > > On Tue, Nov 1, 2011 at 4:33 PM, Markus Jelsma <ma...@openindex.io> wrote:
> > > > > > > > > > This should work around the problem in most cases. The
> > > > > > > > > > parser can output two titles of which one is actually
> > > > > > > > > > empty. This patch (in 1.4) skips empty titles.
> > > > > > > > > >
> > > > > > > > > > If this doesn't work you really have two _valid_ titles
> > > > > > > > > > coming from your document.
> > > > > > > > > >
> > > > > > > > > > https://issues.apache.org/jira/browse/NUTCH-1004
> > > > > > > > >
> > > > > > > > > > > It looks like the issue I'm encountering is the same one
> > > > > > > > > > > as here.
> > > > > > > > > > >
> > > > > > > > > > > http://lucene.472066.n3.nabble.com/multiple-values-encountered-for-non-multiValued-field-title-td1446817.html
> > > > > > > > > > >
> > > > > > > > > > > I'm not really sure what the linked bug is since that
> > > > > > > > > > > involves the HTML parser and I'm seeing this problem with
> > > > > > > > > > > a PDF file.
> > > > > > > > > > >
> > > > > > > > > > > On Tue, Nov 1, 2011 at 3:41 PM, Bai Shen <baishen.lists@gmail.com> wrote:
> > > > > > > > > > > > I'm getting an exception when I try to commit to Solr.
> > > > > > > > > > > > Looking at the Solr log, it's showing that title is
> > > > > > > > > > > > getting multiple values when it's not a multivalue
> > > > > > > > > > > > field.  None of my code does anything with the title,
> > > > > > > > > > > > so I'm not sure why this is happening.
> > > > > > > > > > > >
> > > > > > > > > > > > How can I look at the pending commit and determine why
> > > > > > > > > > > > and/or delete the extraneous values?  The document in
> > > > > > > > > > > > question is a pdf if that makes a difference.
> > > > >
> > > > > --
> > > > > Markus Jelsma - CTO - Openindex
> > > > > http://www.linkedin.com/in/markus17
> > > > > 050-8536620 / 06-50258350
> > >
> > > --
> > > Markus Jelsma - CTO - Openindex
> > > http://www.linkedin.com/in/markus17
> > > 050-8536620 / 06-50258350
>

Re: Multiple values encountered for non multivalued field

Posted by Markus Jelsma <ma...@openindex.io>.

In Jira, after you have created an account, you can open a new issue and
attach a patch.


-- 
Markus Jelsma - CTO - Openindex