Posted to user@nutch.apache.org by Tolga <to...@ozses.net> on 2012/05/23 12:44:32 UTC
Apparently far from last question :)
Hi,
I put the lines

<mimeType name="application/x-excel">
  <plugin id="parse-tika" />
  <plugin id="feed" />
</mimeType>

in parse-plugins.xml, but I still can't crawl xls files. Why is that?
Regards,
Re: Apparently far from last question :)
Posted by Tolga <to...@ozses.net>.
I added that because I noticed the file wasn't being crawled. After I added it,
the file still wasn't crawled.
On 5/23/12 2:05 PM, Lewis John Mcgibbney wrote:
> There is absolutely no requirement to add this configuration to this file.
> If you look at the XML file in question, one of the first XML
> configuration blocks says
>
> <!-- by default if the mimeType is set to *, or
> if it can't be determined, use parse-tika -->
> <mimeType name="*">
> <plugin id="parse-tika" />
> </mimeType>
>
> Just remove your unnecessary config and Tika will do the work for you :0)
>
> Lewis
>
RE: Apparently far from last question :)
Posted by Markus Jelsma <ma...@openindex.io>.
You can inspect the CrawlDb with the readdb tool to check whether the URL is in there.
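A rough sketch of that check, assuming a Nutch 1.x layout where the commands are run from the runtime directory. The crawldb path and the URL below are placeholders, not values from this thread:

```shell
# Hypothetical values: substitute your own crawldb path and the URL of the xls file.
CRAWLDB="crawl/crawldb"
URL="http://www.example.com/files/report.xls"

if [ -x bin/nutch ]; then
  # Print the CrawlDb record for this one URL, if Nutch knows about it.
  bin/nutch readdb "$CRAWLDB" -url "$URL"
  # Print overall statistics for the CrawlDb (URL counts per status).
  bin/nutch readdb "$CRAWLDB" -stats
else
  echo "bin/nutch not found; run this from the Nutch runtime directory" >&2
fi
```

If the URL does not show up in the CrawlDb at all, Nutch has never discovered it, so the parser would not have been asked to handle it in the first place.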
-----Original message-----
> From:Tolga <to...@ozses.net>
> Sent: Wed 23-May-2012 14:21
> To: user@nutch.apache.org
> Subject: Re: Apparently far from last question :)
>
> My colleague has just made me realize something. Is it possible that
> this xls file wasn't crawled because there isn't a link to it within the
> website?
>
> Regards,
>
Re: Apparently far from last question :)
Posted by Tolga <to...@ozses.net>.
My colleague has just made me realize something. Is it possible that
this xls file wasn't crawled because there isn't a link to it within the
website?
Regards,
Re: Apparently far from last question :)
Posted by Lewis John Mcgibbney <le...@gmail.com>.
There is absolutely no requirement to add this configuration to this file.
If you look at the XML file in question, one of the first XML
configuration blocks says
<!-- by default if the mimeType is set to *, or
if it can't be determined, use parse-tika -->
<mimeType name="*">
<plugin id="parse-tika" />
</mimeType>
Just remove your unnecessary config and Tika will do the work for you :0)
Lewis