
Posted to user@nutch.apache.org by Tolga <to...@ozses.net> on 2012/05/22 22:20:16 UTC

parse.ParserFactory

Hi,

I crawl / index PDF files just fine, but I get the following warning.

parse.ParserFactory - ParserFactory: Plugin: parse-pdf mapped to 
contentType application/pdf via parse-plugins.xml, but not enabled via 
plugin.includes in nutch-default.xml.

I've got the value 
protocol-http|urlfilter-regex|parse-(html|tika|js|msexcel|mspowerpoint|msword|oo|pdf|swf|zip)|index-(basic|anchor)|scoring-opic|urlnormalizer-(pass|regex|basic) 
for plugin.includes property in nutch-default.xml. What am I missing?

Regards,

Re: parse.ParserFactory

Posted by Tolga <to...@ozses.net>.

...and also, nutch-site.xml is blank here, so I'm sure it's not being 
used at all.

On 5/29/12 11:34 AM, Julien Nioche wrote:
>> I am doing it in NUTCH_HOME/runtime/local/conf. I thought I could use
>> nutch-default.xml, and nutch-site.xml just overrode nutch-default.xml.
>
> that's the case. I was just mentioning a recommended practice, not a strict
> requirement

Re: parse.ParserFactory

Posted by Julien Nioche <li...@gmail.com>.

> I am doing it in NUTCH_HOME/runtime/local/conf. I thought I could use
> nutch-default.xml, and nutch-site.xml just overrode nutch-default.xml.


That's the case. I was just mentioning a recommended practice, not a strict
requirement.





-- 
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble

Re: parse.ParserFactory

Posted by Tolga <to...@ozses.net>.

I am doing it in NUTCH_HOME/runtime/local/conf. I thought I could use 
nutch-default.xml, and nutch-site.xml just overrode nutch-default.xml.

On 5/29/12 9:48 AM, Julien Nioche wrote:
> If you are seeing this warning, then parse-pdf IS being used. You should
> modify nutch-site.xml rather than nutch-default.xml, and my bet is that you
> are doing this in NUTCH_HOME/conf and not in NUTCH_HOME/runtime/local/conf
> (see the tutorial on the wiki).

Re: parse.ParserFactory

Posted by Julien Nioche <li...@gmail.com>.

If you are seeing this warning, then parse-pdf IS being used. You should
modify nutch-site.xml rather than nutch-default.xml, and my bet is that you
are doing this in NUTCH_HOME/conf and not in NUTCH_HOME/runtime/local/conf
(see the tutorial on the wiki).
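
For reference, a minimal sketch of such an override. It assumes the file is NUTCH_HOME/runtime/local/conf/nutch-site.xml and that it uses the same configuration/property layout as nutch-default.xml. The value shown is the one Tolga quotes in this thread and is only an illustration; substitute your own plugin list.

```xml
<?xml version="1.0"?>
<!-- Sketch of NUTCH_HOME/runtime/local/conf/nutch-site.xml (path assumed from this thread). -->
<configuration>
  <!-- A property set here overrides the property of the same name in nutch-default.xml. -->
  <property>
    <name>plugin.includes</name>
    <value>protocol-http|urlfilter-regex|parse-(html|tika|js|swf|zip|xml)|index-(basic|anchor)|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
  </property>
</configuration>
```

Keeping local changes in nutch-site.xml leaves nutch-default.xml untouched, so it is easier to see what differs from the shipped defaults.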



On 29 May 2012 07:31, Tolga <to...@ozses.net> wrote:

> Hi,
>
> I know this issue should have been closed, but I thought I'd continue this
> rather than starting a new thread.
>
> Anyway, I'm getting this: parse.ParserFactory - ParserFactory: Plugin:
> parse-pdf mapped to contentType application/pdf via parse-plugins.xml, but
> not enabled via plugin.includes in nutch-default.xml and I have tika in my
> nutch-default.xml:
> <value>protocol-http|urlfilter-regex|parse-(html|tika|js|swf|zip|xml)|index-(basic|anchor)|scoring-opic|urlnormalizer-(pass|regex|basic)</value>.
> What's the point of seeing this warning if I already have tika? This should
> be removed IMHO.
>
> Regards,



Re: parse.ParserFactory

Posted by Tolga <to...@ozses.net>.

Hi,

I know this issue should have been closed, but I thought I'd continue 
this rather than starting a new thread.

Anyway, I'm still getting this warning:

parse.ParserFactory - ParserFactory: Plugin: parse-pdf mapped to 
contentType application/pdf via parse-plugins.xml, but not enabled via 
plugin.includes in nutch-default.xml.

I already have tika in plugin.includes in my nutch-default.xml:

<value>protocol-http|urlfilter-regex|parse-(html|tika|js|swf|zip|xml)|index-(basic|anchor)|scoring-opic|urlnormalizer-(pass|regex|basic)</value>

What's the point of seeing this warning if I already have tika? This 
should be removed IMHO.

Regards,

On 5/23/12 12:27 AM, Lewis John Mcgibbney wrote:
> Unless you're using <= Nutch 1.2 you should not be using
> msexcel|mspowerpoint|msword|oo|pdf| within your plugin.includes... all
> of these document formats are (and have been for some time)
> implemented as Apache Tika parsers.
>
> hth

Re: parse.ParserFactory

Posted by Lewis John Mcgibbney <le...@gmail.com>.

Unless you're using <= Nutch 1.2 you should not be using
msexcel|mspowerpoint|msword|oo|pdf| within your plugin.includes... all
of these document formats are (and have been for some time)
implemented as Apache Tika parsers.

hth
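
One possible way to line the mapping up with this advice, sketched under two assumptions: that conf/parse-plugins.xml uses mimeType/plugin entries, and that your copy still maps application/pdf to parse-pdf (which is what the warning text suggests). The fragment points that content type at parse-tika instead. Check your own parse-plugins.xml first, since the shipped mappings may differ between releases.

```xml
<!-- Fragment of conf/parse-plugins.xml (layout assumed; not a complete file). -->
<mimeType name="application/pdf">
  <!-- The warning reports parse-pdf here; parse-tika is the Tika-based parser named in this thread. -->
  <plugin id="parse-tika" />
</mimeType>
```

With the mapping and plugin.includes both naming parse-tika, the parser factory should no longer look for a parse-pdf plugin that is not enabled.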



On Tue, May 22, 2012 at 9:20 PM, Tolga <to...@ozses.net> wrote:
> Hi,
>
> I crawl / index PDF files just fine, but I get the following warning.
>
> parse.ParserFactory - ParserFactory: Plugin: parse-pdf mapped to contentType
> application/pdf via parse-plugins.xml, but not enabled via plugin.includes
> in nutch-default.xml.
>
> I've got the value
> protocol-http|urlfilter-regex|parse-(html|tika|js|msexcel|mspowerpoint|msword|oo|pdf|swf|zip)|index-(basic|anchor)|scoring-opic|urlnormalizer-(pass|regex|basic)
> for plugin.includes property in nutch-default.xml. What am I missing?
>
> Regards,



-- 
Lewis