Posted to user@nutch.apache.org by John Lafitte <jl...@brandextract.com> on 2014/02/24 20:31:22 UTC

multivalues returned unexpectedly

I am using Nutch 1.7 and Solr 4.6.1.  I'm having a problem indexing an RSS
feed that has channel/title and then channel/image/title: Nutch tries to add
both of them, and solrindex then fails because title isn't multivalued.

I've used nutch indexchecker and I see the two titles being returned.  The
extra title is the value in the filename parameter of the
Content-Disposition HTTP header.  I only see one title when I run nutch
readseg, so I'm a little confused.

I have made title multivalued in the Solr schema and it seems to work that
way, but it seems wrong to me.  Documents shouldn't have more than one
title.  What is the correct way to fix this?

Re: multivalues returned unexpectedly

Posted by John Lafitte <jl...@brandextract.com>.

I think I found it already documented; I just wasn't searching for the
right plugin:

https://issues.apache.org/jira/browse/NUTCH-1140

There is a patch there, I will try that.  Thanks for the help!



Re: multivalues returned unexpectedly

Posted by Sebastian Nagel <wa...@googlemail.com>.

Hi John,

Reproduced. It's the index-more plugin which adds the second title
from the Content-Disposition header field. If index-more is removed
from plugin.includes, the second title disappears:

% bin/nutch indexchecker -Dplugin.includes="parse-tika|index-basic|protocol-http" \
     http://www.microsoft-careers.com/feeds/microsoft%20job2web/?src=RSS

Maybe that's an option for a quick workaround.

You can also open an issue at https://issues.apache.org/jira/browse/Nutch.
We'll check it. The authors of index-more explicitly add (with the intention to overwrite?)
the Content-Disposition title, cf. the code comments:

  // Reset title if we see non-standard HTTP header "Content-Disposition".
  // It's a good indication that content provider wants filename therein
  // be used as the title of this url.

  // Patterns used to extract filename from possible non-standard
  // HTTP header "Content-Disposition". Typically it looks like:
  // Content-Disposition: inline; filename="foo.ppt"
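
The extraction those comments describe can be tried outside of Nutch. The following is only a sketch in shell, not the plugin's actual Java code; the helper name extract_filename and the exact patterns are assumptions made for illustration.

```shell
#!/bin/sh
# Sketch only: mimics the idea described in the index-more comments,
# not the plugin's real implementation.

# extract_filename (hypothetical helper): print the filename parameter of a
# Content-Disposition header value, or nothing if there is none.
extract_filename() {
  # Try the quoted form first: filename="foo.ppt"
  quoted=$(printf '%s\n' "$1" | sed -n 's/.*filename="\([^"]*\)".*/\1/p')
  if [ -n "$quoted" ]; then
    printf '%s\n' "$quoted"
    return 0
  fi
  # Fall back to the unquoted form: filename=foo.ppt
  printf '%s\n' "$1" | sed -n 's/.*filename=\([^;[:space:]"]*\).*/\1/p'
}

extract_filename 'Content-Disposition: inline; filename="jobexport.xml"'
```

Run against the header from this thread, the sketch prints jobexport.xml, which is the value that shows up as the second title.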

Thanks,
Sebastian



Re: multivalues returned unexpectedly

Posted by John Lafitte <jl...@brandextract.com>.

Here is an example of the feed:

http://www.microsoft-careers.com/feeds/microsoft%20job2web/?src=RSS

bin/nutch indexchecker http://www.microsoft-careers.com/feeds/microsoft%20job2web/?src=RSS

It returns:
title : Microsoft - Custom Search microsoft-job2web
title : jobexport.xml
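
To spot such duplicates before they reach a single-valued Solr field, the output can be scanned for fields that occur more than once. This is a rough sketch that assumes the "field : value" line format shown above; the helper name multi_valued_fields is hypothetical and not part of Nutch.

```shell
#!/bin/sh
# multi_valued_fields (hypothetical helper): read "field : value" lines on
# stdin and print every field that occurs more than once, with its count.
multi_valued_fields() {
  awk -F'[ \t]*:[ \t]+' 'NF >= 2 { count[$1]++ }
       END { for (f in count) if (count[f] > 1) print f, count[f] }' | sort
}

# Sample input copied from the indexchecker output in this thread.
multi_valued_fields <<'EOF'
title : Microsoft - Custom Search microsoft-job2web
title : jobexport.xml
EOF
```

In practice the output of bin/nutch indexchecker would be piped into the function instead of the sample lines.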



Re: multivalues returned unexpectedly

Posted by Chear Huang <ch...@neurosky.com>.

Hi,
I have a small problem using Nutch to crawl a website. Could
someone tell me what the problem is when running the crawl?

InjectorJob: org.apache.gora.util.GoraException: java.lang.RuntimeException: org.apache.hadoop.hbase.MasterNotRunningException: Retried 10 times
    at org.apache.gora.store.DataStoreFactory.createDataStore(DataStoreFactory.java:167)
    at org.apache.gora.store.DataStoreFactory.createDataStore(DataStoreFactory.java:135)
    at org.apache.nutch.storage.StorageUtils.createWebStore(StorageUtils.java:75)
    at org.apache.nutch.crawl.InjectorJob.run(InjectorJob.java:221)
    at org.apache.nutch.crawl.InjectorJob.inject(InjectorJob.java:251)
    at org.apache.nutch.crawl.InjectorJob.run(InjectorJob.java:273)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
    at org.apache.nutch.crawl.InjectorJob.main(InjectorJob.java:282)
Caused by: java.lang.RuntimeException: org.apache.hadoop.hbase.MasterNotRunningException: Retried 10 times
    at org.apache.gora.hbase.store.HBaseStore.initialize(HBaseStore.java:127)
    at org.apache.gora.store.DataStoreFactory.initializeDataStore(DataStoreFactory.java:102)
    at org.apache.gora.store.DataStoreFactory.createDataStore(DataStoreFactory.java:161)
    ... 7 more
Caused by: org.apache.hadoop.hbase.MasterNotRunningException: Retried 10 times
    at org.apache.hadoop.hbase.client.HBaseAdmin.<init>(HBaseAdmin.java:127)
    at org.apache.gora.hbase.store.HBaseStore.initialize(HBaseStore.java:109)
    ... 9 more


Re: multivalues returned unexpectedly

Posted by Sebastian Nagel <wa...@googlemail.com>.

> https://issues.apache.org/jira/browse/NUTCH-1140
Thanks for digging this up!

> Why is index-more adding this?
Maybe to have some title for MIME types which have no title
(e.g., plain text). That could be the intention.
The code is old (> 9 years) and the web has changed since.
The original RFC http://www.ietf.org/rfc/rfc1806.txt
for the Content-Disposition header is even older (1995).



Re: multivalues returned unexpectedly

Posted by John Lafitte <jl...@brandextract.com>.

Okay, I invoked it the way you mentioned and I get the same result.
However, I tried it without index-more included and I no longer have the
additional title.  Why is index-more adding this?



Re: multivalues returned unexpectedly

Posted by Sebastian Nagel <wa...@googlemail.com>.

> I'm not sure I'm allowed to post it publicly.
A minimalistic and anonymized example would be fine.
However, if it's really the HTTP header it will
be hard to make it reproducible.

> I'm using the default parser-plugins.xml which shows parse-tika before
> feed.  I don't have feed in my plugin.includes, but if I modify
> parser-plugins.xml and plugin.includes to try to favor the feed I still get
> the same results.  I might be doing something wrong.

It's possible to set plugin.includes (and other properties) just for
tools like indexchecker, parsechecker, etc.:

% bin/nutch indexchecker -Dplugin.includes="feed|index-(basic|more)|protocol-http" .../rss.xml
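
Whether a given plugin.includes value would activate index-more can be checked roughly from the shell. The sketch below assumes that Nutch matches the whole plugin id against the expression, and it uses grep's extended regular expressions, which may differ from Java's in corner cases; the helper name plugin_enabled is hypothetical.

```shell
#!/bin/sh
# plugin_enabled (hypothetical helper): succeed if plugin id $2 is fully
# matched by the plugin.includes regular expression $1.
plugin_enabled() {
  printf '%s\n' "$2" | grep -Eq "^($1)\$"
}

# The two plugin.includes values used in this thread.
for includes in 'parse-tika|index-basic|protocol-http' \
                'feed|index-(basic|more)|protocol-http'; do
  if plugin_enabled "$includes" index-more; then
    echo "index-more enabled by: $includes"
  else
    echo "index-more disabled by: $includes"
  fi
done
```

Only the second value enables index-more, which fits the observation that the extra title disappears when the plugin is left out.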



Re: multivalues returned unexpectedly

Posted by John Lafitte <jl...@brandextract.com>.

I think the channel/image/title idea was probably wrong.  It looks like the
extra title field is actually the HTTP header Content-Disposition: inline;
filename="jobexport.xml".  I can email you the URL of the specific RSS feed
I'm using for this issue privately, but since it's a client site
I'm not sure I'm allowed to post it publicly.

I'm using the default parser-plugins.xml which shows parse-tika before
feed.  I don't have feed in my plugin.includes, but if I modify
parser-plugins.xml and plugin.includes to try to favor the feed I still get
the same results.  I might be doing something wrong.


Re: multivalues returned unexpectedly

Posted by Sebastian Nagel <wa...@googlemail.com>.

Hi John,

Can you attach a (short) example document to reproduce the problem?
I was not able to reproduce it with the example in http://de.wikipedia.org/wiki/RSS
which contains channel/image/title.

Which parser plugin is used: "feed" or "parse-tika"?
(If in doubt, please add the value of the property "plugin.includes".)

Sebastian



Re: multivalues returned unexpectedly

Posted by Matthew Stevens <ma...@matthewstevens.org>.

unsubscribe
