Posted to user@nutch.apache.org by Tolga <to...@ozses.net> on 2012/05/23 12:44:32 UTC

Apparently far from last question :)

Hi,

I put the lines

<mimeType name="application/x-excel">
  <plugin id="parse-tika" />
  <plugin id="feed" />
</mimeType>

in parse-plugins.xml, but I still can't crawl xls files. Why is that?

Regards,

Re: Apparently far from last question :)

Posted by Tolga <to...@ozses.net>.
I added that block because I noticed the xls file wasn't being crawled.
After adding it, the file still isn't crawled.

On 5/23/12 2:05 PM, Lewis John Mcgibbney wrote:
> There is absolutely no requirement to add this configuration to this file.
> If you look at the XML file in question, one of the first XML
> configuration blocks says
>
> <!--  by default if the mimeType is set to *, or
>          if it can't be determined, use parse-tika -->
> 	<mimeType name="*">
> 	<plugin id="parse-tika" />
> 	</mimeType>
>
> Just remove your unnecessary config and Tika will do the work for you :0)
>
> Lewis

RE: Apparently far from last question :)

Posted by Markus Jelsma <ma...@openindex.io>.

You can inspect the CrawlDB with the readdb tool to check whether the URL is there.
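
A minimal sketch of that check, assuming a Nutch 1.x install where the script is run from the runtime directory, the CrawlDB lives at crawl/crawldb, and the spreadsheet is at the hypothetical URL http://www.example.com/files/report.xls. None of these paths appear in the thread; adjust them to your own crawl.

```shell
#!/bin/sh
# Assumed locations (not from the thread); change these to match your setup.
NUTCH="bin/nutch"
CRAWLDB="crawl/crawldb"
XLS_URL="http://www.example.com/files/report.xls"   # hypothetical URL

if [ -x "$NUTCH" ]; then
  # Summary of the CrawlDB: total URLs and counts per status.
  "$NUTCH" readdb "$CRAWLDB" -stats
  # Look up the CrawlDB entry for this single URL.
  "$NUTCH" readdb "$CRAWLDB" -url "$XLS_URL"
else
  echo "nutch script not found at $NUTCH; run this from the Nutch runtime directory" >&2
fi
```

If the URL is not in the CrawlDB at all, Nutch never discovered it. It only learns URLs from the injected seed list and from outlinks on pages it has fetched, so a file that nothing links to would have to be added to the seed list.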
 
-----Original message-----
> From:Tolga <to...@ozses.net>
> Sent: Wed 23-May-2012 14:21
> To: user@nutch.apache.org
> Subject: Re: Apparently far from last question :)
> 
> My colleague has just made me realize something. Is it possible that 
> this xls file wasn't crawled because there isn't a link to it within the 
> website?
> 
> Regards,
> 

Re: Apparently far from last question :)

Posted by Tolga <to...@ozses.net>.
My colleague has just made me realize something. Is it possible that 
this xls file wasn't crawled because there isn't a link to it within the 
website?

Regards,


Re: Apparently far from last question :)

Posted by Lewis John Mcgibbney <le...@gmail.com>.
There is absolutely no requirement to add this configuration to this file.
If you look at the XML file in question, one of the first XML
configuration blocks says

<!--  by default if the mimeType is set to *, or
        if it can't be determined, use parse-tika -->
	<mimeType name="*">
	  <plugin id="parse-tika" />
	</mimeType>

Just remove your unnecessary config and Tika will do the work for you :0)

Lewis

On Wed, May 23, 2012 at 11:44 AM, Tolga <to...@ozses.net> wrote:
> Hi,
>
> I put the lines <mimeType name="application/x-excel">
> <plugin id="parse-tika" />
> <plugin id="feed" />
> </mimeType>
>
> in parse-plugins.xml, but I still can't crawl xls files. Why is that?
>
> Regards,



-- 
Lewis