Posted to user@nutch.apache.org by Markus Jelsma <ma...@buyways.nl> on 2010/09/09 18:06:02 UTC

multiple values encountered for non multiValued field title

Hi,

 

I've got something weird again (using Nutch 1.2): a document fetched and parsed by Nutch doesn't comply with my Solr schema. It attempts to send more than one value to a non-multiValued field, the title field.

 

It's a PNG file that hasn't been caught by the regex-urlfilter (I'll need to fix that later), but does it generate two titles? Not according to the parser checker:

 

bin/nutch org.apache.nutch.parse.ParserChecker http://portal.groningen.nl/uploads/fckconnector/582ab124-ff1e-4aad-9d45-4cdc8babbc81

 

Is there more I can do to debug or fix this one?

 

Cheers,

RE: Re: multiple values encountered for non multiValued field title

Posted by Markus Jelsma <ma...@buyways.nl>.
I see. I'd better try to build Tika myself and upgrade it in Nutch so I can verify.
-----Original message-----
From: Ken Krugler <kk...@transpac.com>
Sent: Thu 09-09-2010 21:07
To: user@nutch.apache.org; 
Subject: Re: multiple values encountered for non multiValued field title


On Sep 9, 2010, at 11:42am, Markus Jelsma wrote:

> Ah, so Tika could be the troublemaker here, but I cannot reproduce it
> yet; Nutch hasn't got 0.8 yet. Also, you're talking about HTML
> documents while my examples were a PNG and a PDF file. Does your
> issue solve that as well?

Not sure, but see this comment on my issue: https://issues.apache.org/jira/browse/TIKA-478?focusedCommentId=12897956&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12897956

If the PNG and PDF parsers also emit head elements such as <title>,  
then (unverified) I think you'd get an empty title and then the real  
title. Don't know if that would cause multiple values (if one is empty).

-- Ken

--------------------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c   w e b   m i n i n g






Re: multiple values encountered for non multiValued field title

Posted by Ken Krugler <kk...@transpac.com>.
On Sep 9, 2010, at 11:42am, Markus Jelsma wrote:

> Ah, so Tika could be the troublemaker here, but I cannot reproduce it
> yet; Nutch hasn't got 0.8 yet. Also, you're talking about HTML
> documents while my examples were a PNG and a PDF file. Does your
> issue solve that as well?

Not sure, but see this comment on my issue: https://issues.apache.org/jira/browse/TIKA-478?focusedCommentId=12897956&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12897956

If the PNG and PDF parsers also emit head elements such as <title>,  
then (unverified) I think you'd get an empty title and then the real  
title. Don't know if that would cause multiple values (if one is empty).
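
If an empty title followed by the real one is indeed what arrives, one possible workaround is to collapse such fields to a single value before the document is sent to Solr. Below is a minimal sketch in Python, assuming the document is available as a dict that maps field names to lists of string values. The helpers pick_single_value and collapse_fields are hypothetical names for this example and are not part of Nutch, Tika or Solr.

```python
def pick_single_value(values):
    """Return the first non-empty value (stripped), or None if all are empty."""
    for value in values:
        if value is not None and value.strip():
            return value.strip()
    return None


def collapse_fields(doc, single_valued):
    """Return a copy of doc in which the named fields hold at most one value.

    doc maps field names to lists of strings (an assumption for this sketch).
    Fields listed in single_valued keep only their first non-empty value and
    are dropped entirely when every value is empty.
    """
    result = {}
    for name, values in doc.items():
        if name in single_valued:
            chosen = pick_single_value(values)
            if chosen is not None:
                result[name] = [chosen]
        else:
            result[name] = list(values)
    return result
```

Calling collapse_fields({"title": ["", "Real title"]}, {"title"}) would keep only "Real title". This hides the symptom rather than explaining where the second value comes from, so it is best treated as a stopgap.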

-- Ken

--------------------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c   w e b   m i n i n g






RE: Re: multiple values encountered for non multiValued field title

Posted by Markus Jelsma <ma...@buyways.nl>.
Ah, so Tika could be the troublemaker here, but I cannot reproduce it yet; Nutch hasn't got 0.8 yet. Also, you're talking about HTML documents while my examples were a PNG and a PDF file. Does your issue solve that as well?
 
-----Original message-----
From: Ken Krugler <kk...@transpac.com>
Sent: Thu 09-09-2010 20:08
To: user@nutch.apache.org; 
Subject: Re: multiple values encountered for non multiValued field title


On Sep 9, 2010, at 10:18am, Markus Jelsma wrote:

> Hi,
>
> Luke is my friend and I use it often; the problem here is that it
> fails before it is added to my Solr index. Even without further
> analysis it is clear that, for some reason, the title field for some
> documents got two values instead of one. Debugging Tika is not yet
> something I'm capable of ;)
>
> I need to know why Nutch parsed the file and came up with two
> values; it may be a Tika bug or some other weirdness I haven't
> thought about.

I've fixed up some issues related to element ordering in Tika, see https://issues.apache.org/jira/browse/TIKA-478
 - though I think most of what I was seeing was triggered by changes
made in 0.8-SNAPSHOT.

Given the issues I ran into while trying to fix element ordering  
problems, I could easily see cases of busted HTML triggering the  
generation of two <title> elements.

For cases where I care, I usually do a simple state machine, e.g. only  
process the first <title> element inside of a <head> element.

-- Ken

--------------------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c   w e b   m i n i n g






Re: multiple values encountered for non multiValued field title

Posted by Ken Krugler <kk...@transpac.com>.
On Sep 9, 2010, at 10:18am, Markus Jelsma wrote:

> Hi,
>
> Luke is my friend and I use it often; the problem here is that it
> fails before it is added to my Solr index. Even without further
> analysis it is clear that, for some reason, the title field for some
> documents got two values instead of one. Debugging Tika is not yet
> something I'm capable of ;)
>
> I need to know why Nutch parsed the file and came up with two
> values; it may be a Tika bug or some other weirdness I haven't
> thought about.

I've fixed up some issues related to element ordering in Tika, see https://issues.apache.org/jira/browse/TIKA-478
 - though I think most of what I was seeing was triggered by changes
made in 0.8-SNAPSHOT.

Given the issues I ran into while trying to fix element ordering  
problems, I could easily see cases of busted HTML triggering the  
generation of two <title> elements.

For cases where I care, I usually do a simple state machine, e.g. only  
process the first <title> element inside of a <head> element.
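
The state machine described above might look roughly like the sketch below. It is written in Python with the standard xml.sax module purely to illustrate the logic and is not Tika's or Nutch's own code. The class name FirstTitleHandler and the helper first_title are made up for this example, and the input is assumed to be well-formed XHTML.

```python
import xml.sax


class FirstTitleHandler(xml.sax.ContentHandler):
    """Keeps the text of the first <title> inside <head>; ignores all others."""

    def __init__(self):
        super().__init__()
        self.in_head = False
        self.in_title = False
        self.done = False  # becomes True once the first title has been closed
        self.parts = []

    def startElement(self, name, attrs):
        if name == "head":
            self.in_head = True
        elif name == "title" and self.in_head and not self.done:
            self.in_title = True

    def characters(self, content):
        # Text is only collected while inside the accepted <title>.
        if self.in_title:
            self.parts.append(content)

    def endElement(self, name):
        if name == "title" and self.in_title:
            self.in_title = False
            self.done = True
        elif name == "head":
            self.in_head = False

    @property
    def title(self):
        return "".join(self.parts).strip()


def first_title(xhtml):
    """Parse an XHTML string and return the first head title (may be empty)."""
    handler = FirstTitleHandler()
    xml.sax.parseString(xhtml.encode("utf-8"), handler)
    return handler.title
```

Note that with this rule an empty first title wins over a later non-empty one; skipping empty titles would need one more check when the title element closes.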

-- Ken

--------------------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c   w e b   m i n i n g






RE: Re: multiple values encountered for non multiValued field title

Posted by Markus Jelsma <ma...@buyways.nl>.
Hi,

 

Luke is my friend and I use it often; the problem here is that it fails before it is added to my Solr index. Even without further analysis it is clear that, for some reason, the title field for some documents got two values instead of one. Debugging Tika is not yet something I'm capable of ;)

 

I need to know why Nutch parsed the file and came up with two values; it may be a Tika bug or some other weirdness I haven't thought about.
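
One way to narrow this down before anything reaches Solr would be to compare a parsed document's fields against the schema and list the single-valued fields that carry more than one value. The Python sketch below assumes the document is a dict mapping field names to lists of values, and it only looks at the multiValued attribute on each field element; a real schema can also set this elsewhere, for example on the field type. The schema fragment and the function names are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Made-up schema fragment for illustration only.
SCHEMA = """
<schema name="example" version="1.2">
  <fields>
    <field name="title" type="text" stored="true" indexed="true"/>
    <field name="cc" type="string" stored="true" indexed="true" multiValued="true"/>
  </fields>
</schema>
"""


def single_valued_fields(schema_xml):
    """Names of <field> elements that do not declare multiValued="true"."""
    root = ET.fromstring(schema_xml.strip())
    return {
        field.get("name")
        for field in root.iter("field")
        if field.get("multiValued", "false").lower() != "true"
    }


def find_violations(doc, schema_xml):
    """Map each single-valued field to its values when doc holds more than one."""
    single = single_valued_fields(schema_xml)
    return {
        name: values
        for name, values in doc.items()
        if name in single and len(values) > 1
    }
```

For a document such as {"title": ["", "Real title"], "cc": ["by", "nc"]}, find_violations reports only the title field, which also shows whether one of the two values is empty.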

 

Cheers
 
-----Original message-----
From: André Ricardo <an...@gmail.com>
Sent: Thu 09-09-2010 18:37
To: user@nutch.apache.org; 
Subject: Re: multiple values encountered for non multiValued field title

Hello Markus,

I hope I understood your problem. Here goes my opinion:

I don't know if you are aware that you can look inside indexes with a tool
named "Luke"; with this tool you can see exactly whether there is more than one
value for a field in a document.
Also, in apache-solr-1.4.1/example/solr/conf/schema.xml you can add
multiValued="true" if a field has more than one value.

For example, the Creative Commons field has a lot of values for the same
document (by, nc, us, etc.).
I have
<field name="cc" multiValued="true" type="string" stored="true"
indexed="true"/>

Hope this helps,
André Ricardo


Re: Re: Re: multiple values encountered for non multiValued field title

Posted by Max Lynch <ih...@gmail.com>.
I never had trouble with multiple values for the filesize field.  Only title
and maybe one other one when I was using the feed parser plugin.

I tore my hair out trying to figure it out, and then just decided to set
multiValued="true" on those fields that failed and it worked fine.


On Thu, Sep 9, 2010 at 12:24 PM, Markus Jelsma <ma...@buyways.nl> wrote:

> Because of the semantics of certain fields. It isn't very helpful to set a
> filesize field to accept multiple values; it loses its meaning. Also, if this
> is a bug or misconfiguration or whatever, keeping the field single-valued
> prevents bad documents from entering the index.
>
> Of course, I could be wrong on the title field issue, but at least I am
> unaware of documents that can have two titles. Even if a document could have
> an alternate title, it should be stored in a separate field to derive
> meaning from it.

RE: Re: Re: multiple values encountered for non multiValued field title

Posted by Markus Jelsma <ma...@buyways.nl>.
Because of the semantics of certain fields. It isn't very helpful to set a filesize field to accept multiple values; it loses its meaning. Also, if this is a bug or misconfiguration or whatever, keeping the field single-valued prevents bad documents from entering the index.

 

Of course, I could be wrong on the title field issue, but at least I am unaware of documents that can have two titles. Even if a document could have an alternate title, it should be stored in a separate field to derive meaning from it.
 
-----Original message-----
From: Max Lynch <ih...@gmail.com>
Sent: Thu 09-09-2010 19:19
To: user@nutch.apache.org; 
Subject: Re: Re: multiple values encountered for non multiValued field title

On Thu, Sep 9, 2010 at 12:14 PM, Markus Jelsma <ma...@buyways.nl>wrote:

> Thanks for the suggestion, but setting all fields to accept multiple values
> isn't such a good idea.
>


Could you give me some insight into why?


Thanks.

Re: Re: multiple values encountered for non multiValued field title

Posted by Max Lynch <ih...@gmail.com>.
On Thu, Sep 9, 2010 at 12:14 PM, Markus Jelsma <ma...@buyways.nl>wrote:

> Thanks for the suggestion, but setting all fields to accept multiple values
> isn't such a good idea.
>


Could you give me some insight into why?


Thanks.

RE: Re: multiple values encountered for non multiValued field title

Posted by Markus Jelsma <ma...@buyways.nl>.
Thanks for the suggestion, but setting all fields to accept multiple values isn't such a good idea. 
 
-----Original message-----
From: Max Lynch <ih...@gmail.com>
Sent: Thu 09-09-2010 19:12
To: user@nutch.apache.org; 
Subject: Re: multiple values encountered for non multiValued field title

Markus,
I had lots of problems with Nutch and my Solr schema. More than just the
title field had multiple values. I just set them all to multiValued="true"
since I couldn't trust solrindex or my schema.


 

Re: multiple values encountered for non multiValued field title

Posted by Max Lynch <ih...@gmail.com>.
Markus,
I had lots of problems with Nutch and my Solr schema. More than just the
title field had multiple values. I just set them all to multiValued="true"
since I couldn't trust solrindex or my schema.

2010/9/9 André Ricardo <an...@gmail.com>

> Hello Markus,
>
> I hope I understood your problem. Here goes my opinion:
>
> I don't know if you are aware that you can look inside indexes with a tool
> named "Luke"; with this tool you can see exactly whether there is more than one
> value for a field in a document.
> Also, in apache-solr-1.4.1/example/solr/conf/schema.xml you can add
> multiValued="true" if a field has more than one value.
>
> For example, the Creative Commons field has a lot of values for the same
> document (by, nc, us, etc.).
> I have
> <field name="cc" multiValued="true" type="string" stored="true"
> indexed="true"/>
>
> Hope this helps,
> André Ricardo

Re: multiple values encountered for non multiValued field title

Posted by André Ricardo <an...@gmail.com>.
Hello Markus,

I hope I understood your problem. Here goes my opinion:

I don't know if you are aware that you can look inside indexes with a tool
named "Luke"; with this tool you can see exactly whether there is more than one
value for a field in a document.
Also, in apache-solr-1.4.1/example/solr/conf/schema.xml you can add
multiValued="true" if a field has more than one value.

For example, the Creative Commons field has a lot of values for the same
document (by, nc, us, etc.).
I have
<field name="cc" multiValued="true" type="string" stored="true"
indexed="true"/>

Hope this helps,
André Ricardo

On Thu, Sep 9, 2010 at 5:28 PM, Markus Jelsma <ma...@buyways.nl> wrote:

> And now there's also a PDF giving this kind of trouble:
>
> http://gemeente.groningen.nl/assets/pdf/adviesrapport-wijkkranten-15012007.pdf

RE: multiple values encountered for non multiValued field title

Posted by Markus Jelsma <ma...@buyways.nl>.
 And now there's also a PDF giving this kind of trouble:

http://gemeente.groningen.nl/assets/pdf/adviesrapport-wijkkranten-15012007.pdf
 