Posted to user@tika.apache.org by Naga Vijay <na...@gmail.com> on 2017/08/21 23:14:51 UTC

Extending Tika

Hello,

I am using the latest version of Tika (with Tesseract).

Some of the words in an embedded image in a Microsoft doc are misspelt in
the Tika output.

What is the best way to handle this?

Can I extend Tika to read from, say, a cache of key-value pairs to
correct its output?

Please suggest.

Thanks
Naga

Re: Extending Tika

Posted by John Patrick <nh...@gmail.com>.

Just to confirm about the misspelt words: if you open the Word doc, do
you see them spelt the same way?

i.e.
1) is the Word doc itself wrong, or
2) is Tika reading something incorrectly?

If it's (2), then I would patch Tika to correct the parser.

If it's (1), then I would extend the parser currently being used, which
can then do what you need.
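
For the key-value correction idea, here is a minimal sketch in plain Java
of the lookup step on its own, assuming the text has already been
extracted (for example the String returned by Tika.parseToString). The
class name OcrCorrector, its methods, and the sample entries are
hypothetical and are not part of Tika; the map stands in for whatever
cache you end up using.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/**
 * Hypothetical helper (not part of Tika): replaces known OCR misreadings
 * in already-extracted text using a key-value lookup table.
 */
final class OcrCorrector {
    // A "word" is a run of letters or digits; everything else is copied through unchanged.
    private static final Pattern WORD = Pattern.compile("[\\p{L}\\p{N}]+");

    private final Map<String, String> corrections = new HashMap<>();

    OcrCorrector(Map<String, String> corrections) {
        // Keys are stored lower-cased so that lookups are case-insensitive.
        for (Map.Entry<String, String> e : corrections.entrySet()) {
            this.corrections.put(e.getKey().toLowerCase(Locale.ROOT), e.getValue());
        }
    }

    /** Returns the text with every word found in the table replaced by its correction. */
    String correct(String text) {
        Matcher m = WORD.matcher(text);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String word = m.group();
            String fixed = corrections.getOrDefault(word.toLowerCase(Locale.ROOT), word);
            // quoteReplacement keeps '$' and '\' in a correction from being treated as group references.
            m.appendReplacement(out, Matcher.quoteReplacement(fixed));
        }
        m.appendTail(out);
        return out.toString();
    }
}
```

You would build it once from your table, e.g. new OcrCorrector(map), and
call correct(text) on each extracted string. If you would rather do this
inside Tika than after it, the same lookup could probably sit in a
subclass of org.apache.tika.sax.ContentHandlerDecorator that rewrites
characters() events, though words split across events would need
buffering, which this sketch does not attempt.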

On 24 August 2017 at 01:34, Naga Vijay <na...@gmail.com> wrote:
>
> (+) dev@tika.apache.org
>
> On Mon, Aug 21, 2017 at 4:14 PM, Naga Vijay <na...@gmail.com> wrote:
>>
>> Hello,
>>
>> I am using latest version of Tika (with Tesseract).
>>
>> Some of the words in embedded image in a Microsoft doc are mis-spelt in
>> the Tika output.
>>
>> What is the best way to handle this?
>>
>> Can I extend Tika to read from a cache having key-value pairs to correct
>> the output of Tika?
>>
>> Please suggest.
>>
>> Thanks
>> Naga
>
>

Re: Extending Tika

Posted by Naga Vijay <na...@gmail.com>.

(+) dev@tika.apache.org

On Mon, Aug 21, 2017 at 4:14 PM, Naga Vijay <na...@gmail.com> wrote:

> Hello,
>
> I am using latest version of Tika (with Tesseract).
>
> Some of the words in embedded image in a Microsoft doc are mis-spelt in
> the Tika output.
>
> What is the best way to handle this?
>
> Can I extend Tika to read from a cache having key-value pairs to correct
> the output of Tika?
>
> Please suggest.
>
> Thanks
> Naga
>