Posted to user@nutch.apache.org by Chear Huang <ch...@neurosky.com> on 2014/03/13 16:23:29 UTC

Re: multivalues returned unexpectedly

Hi,
I have a small problem using Nutch to crawl a website. Could
someone tell me what is going wrong when the crawl runs?

InjectorJob: org.apache.gora.util.GoraException:
java.lang.RuntimeException:
org.apache.hadoop.hbase.MasterNotRunningException: Retried 10 times
at org.apache.gora.store.DataStoreFactory.createDataStore(DataStoreFactory.java:167)
at org.apache.gora.store.DataStoreFactory.createDataStore(DataStoreFactory.java:135)
at org.apache.nutch.storage.StorageUtils.createWebStore(StorageUtils.java:75)
at org.apache.nutch.crawl.InjectorJob.run(InjectorJob.java:221)
at org.apache.nutch.crawl.InjectorJob.inject(InjectorJob.java:251)
at org.apache.nutch.crawl.InjectorJob.run(InjectorJob.java:273)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.nutch.crawl.InjectorJob.main(InjectorJob.java:282)
Caused by: java.lang.RuntimeException:
org.apache.hadoop.hbase.MasterNotRunningException: Retried 10 times
at org.apache.gora.hbase.store.HBaseStore.initialize(HBaseStore.java:127)
at org.apache.gora.store.DataStoreFactory.initializeDataStore(DataStoreFactory.java:102)
at org.apache.gora.store.DataStoreFactory.createDataStore(DataStoreFactory.java:161)
... 7 more
Caused by: org.apache.hadoop.hbase.MasterNotRunningException: Retried 10 times
at org.apache.hadoop.hbase.client.HBaseAdmin.<init>(HBaseAdmin.java:127)
at org.apache.gora.hbase.store.HBaseStore.initialize(HBaseStore.java:109)
... 9 more
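
The MasterNotRunningException at the bottom of the trace indicates that the
HBase client could not find a running HBase master. As a first check, here is
a minimal sketch that only tests whether the relevant TCP ports accept
connections. The host (localhost) and the ports are assumptions, not values
taken from the trace: 2181 is the usual ZooKeeper client port and 60000 was
the default master port in HBase releases before 1.0. Adjust both to match
your hbase-site.xml. The helper name is_port_open is hypothetical.

```python
import socket


def is_port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and name resolution failures.
        return False


if __name__ == "__main__":
    # Assumed defaults (not from the original post); change to your setup.
    checks = (("ZooKeeper", 2181), ("HBase master", 60000))
    for name, port in checks:
        state = "reachable" if is_port_open("localhost", port) else "NOT reachable"
        print(f"{name} on localhost:{port}: {state}")
```

An open port does not prove that HBase is healthy, but a closed one suggests
the master (or ZooKeeper) was never started or is listening elsewhere.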

On Tue, Feb 25, 2014 at 6:01 AM, Sebastian Nagel
<wa...@googlemail.com> wrote:
>> https://issues.apache.org/jira/browse/NUTCH-1140
> Thanks for digging this up!
>
>> Why is index-more adding this?
> Maybe, to have some title for MIME types
> which have no title (e.g., plain text).
> That could be the intention.
> The code is old (> 9 years) and the web
> has changed since. The original
> RFC http://www.ietf.org/rfc/rfc1806.txt
> for the content-disposition header
> is even older (1995).
>
>
> On 02/24/2014 10:40 PM, John Lafitte wrote:
>> Okay, I invoked it the way you mentioned and I get the same result.
>>  However, I tried it without index-more included and I no longer have the
>> additional title.  Why is index-more adding this?
>>
>>
>> On Mon, Feb 24, 2014 at 3:24 PM, Sebastian Nagel <wastl.nagel@googlemail.com> wrote:
>>
>>>> I'm not sure I'm allowed to post it publicly.
>>> A minimalistic and anonymized example would be fine.
>>> However, if it's really the HTTP header it will
>>> be hard to make it reproducible.
>>>
>>>> I'm using the default parser-plugins.xml which shows parse-tika before
>>>> feed.  I don't have feed in my plugin.includes, but if I modify
>>>> parser-plugins.xml and plugin.includes to try to favor the feed I still
>>>> get the same results.  I might be doing something wrong.
>>>
>>> It's possible to set plugin.includes (and other properties) just for
>>> tools like indexchecker, parsechecker, etc:
>>>
>>> % bin/nutch indexchecker -Dplugin.includes="feed|index-(basic|more)|protocol-http" .../rss.xml
>>>
>>>
>>> On 02/24/2014 09:59 PM, John Lafitte wrote:
>>>> I think the channel/image/title idea was probably wrong.  It looks like
>>>> the extra title field is actually the HTTP header Content-Disposition:
>>>> inline; filename="jobexport.xml".  I can email you the URL privately of the
>>>> specific RSS feed I'm using for this issue, but since it's a client site
>>>> I'm not sure I'm allowed to post it publicly.
>>>>
>>>> I'm using the default parser-plugins.xml which shows parse-tika before
>>>> feed.  I don't have feed in my plugin.includes, but if I modify
>>>> parser-plugins.xml and plugin.includes to try to favor the feed I still
>>>> get the same results.  I might be doing something wrong.
>>>>
>>>>
>>>>
>>>>
>>>> On Mon, Feb 24, 2014 at 2:20 PM, Sebastian Nagel <wastl.nagel@googlemail.com> wrote:
>>>>
>>>>> Hi John,
>>>>>
>>>>> can you attach a (short) example document to reproduce the problem?
>>>>> I was not able to reproduce it with the example in
>>>>> http://de.wikipedia.org/wiki/RSS
>>>>> which contains channel/image/title.
>>>>>
>>>>> Which parser plugin is used: "feed" or "parse-tika"?
>>>>> (In doubt, please, add the value of property "plugin.includes")
>>>>>
>>>>> Sebastian
>>>>>
>>>>>
>>>>> On 02/24/2014 08:31 PM, John Lafitte wrote:
>>>>>> I am using Nutch 1.7 and Solr 4.6.1.  I'm having a problem with indexing
>>>>>> RSS that has channel/title and then channel/image/title: it tries to add
>>>>>> both of them, then fails when doing solrindex because title isn't
>>>>>> multivalued.
>>>>>>
>>>>>> I've used nutch indexchecker and I see the two titles being returned.  The
>>>>>> extra title is the value that is in the content-disposition: filename HTTP
>>>>>> header.  I only see one title when I run nutch readseg.  So I'm a little
>>>>>> confused why it's
>>>>>>
>>>>>> I have made title multivalued in the Solr schema and it seems to work that
>>>>>> way, but it seems wrong to me.  Documents shouldn't have more than one
>>>>>> title.  What is the correct way to fix this?
>>>>>>
>>>>>
>>>>>
>>>>
>>>
>>>
>>
>