Posted to user@tika.apache.org by Eliott <el...@gmail.com> on 2011/03/10 16:18:21 UTC

extracting text from tiff files from jackrabbit

Dear Users!

We are using Tika indirectly in a project based on Jackrabbit. During
the final phase of this project it came to my attention that TIFF files
can also store the image and the OCR-ed text in the same file, just
like PDFs do. Since we have many such files, we have a business need to
extract text from these TIFFs so that we can do full-text searches. As I
understand it, Tika does not support this functionality for TIFFs, while
PDFs work fine. Is there any special reason for this?

Has anybody written a text extractor, or does anybody know of a library,
that can get the text layer from these files?

thanks in advance
eliott

Re: extracting text from tiff files from jackrabbit

Posted by Paul Jakubik <pa...@purediscovery.com>.

Hi Eliott,

I think the answer to your question is that Tika does not perform OCR on any
format.

Some PDF files contain text and layout information instead of images. In
this case, a PDF text extractor can calculate how the text will be rendered
on a page and from that information figure out what text goes together and
extract it.

In other words, while PDF text extractors work much harder than text
extractors for simpler formats, they are still starting with text embedded
in the format instead of using OCR to identify characters in an image. Tika
does not extract text from PDFs if the PDF only contains images.

I know even less about the TIFF format than I do about the PDF format, but I
think it strictly contains image data, and the only way to get body text
from a TIFF is through OCR. Since Tika doesn't perform OCR, I don't think
you can get body text from a TIFF using Tika.

I hope this helps.

Paul

On Fri, Mar 11, 2011 at 10:16 AM, Eliott <el...@gmail.com> wrote:

> Hi!
>
> Can anybody point me in the right direction? This text in the TIFF seems to
> be stored in a special tag used by Microsoft and some other applications.
>
> regards
> eliott

Re: extracting text from tiff files from jackrabbit

Posted by Jukka Zitting <jz...@adobe.com>.

Hi,

On 03/11/2011 05:16 PM, Eliott wrote:
> Can anybody point me in the right direction? This text in the TIFF seems
> to be stored in a special tag used by Microsoft and some other applications.

Can you identify which tag this is? Perhaps we could teach the TIFF 
parser in Tika to spot and use such tags.

-- 
Jukka Zitting
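
One way to answer the question of which tag holds the text is to dump the tag numbers from the file's image file directories (IFDs) and look at the private ones. The following is a minimal sketch in Python, not anything Tika provides: the function names list_tiff_tags and private_tags are made up for illustration, it assumes a classic TIFF (magic number 42, not BigTIFF), and it does no bounds checking on damaged files. It relies on the TIFF 6.0 convention that private tags are numbered 32768 or higher.

```python
import struct

# Byte size of each TIFF 6.0 field type (1=BYTE, 2=ASCII, 3=SHORT, 4=LONG, ...).
TYPE_SIZES = {1: 1, 2: 1, 3: 2, 4: 4, 5: 8, 6: 1,
              7: 1, 8: 2, 9: 4, 10: 8, 11: 4, 12: 8}


def list_tiff_tags(data):
    """Return (tag, field_type, count, raw_bytes) for every entry in every IFD."""
    if data[:2] == b"II":
        endian = "<"   # little-endian ("Intel") byte order
    elif data[:2] == b"MM":
        endian = ">"   # big-endian ("Motorola") byte order
    else:
        raise ValueError("not a TIFF file: bad byte-order mark")

    magic, ifd_offset = struct.unpack(endian + "HI", data[2:8])
    if magic != 42:
        raise ValueError("not a classic TIFF file (magic number is not 42)")

    entries = []
    seen = set()  # guards against IFD chains that loop back on themselves
    while ifd_offset and ifd_offset not in seen:
        seen.add(ifd_offset)
        (entry_count,) = struct.unpack_from(endian + "H", data, ifd_offset)
        for i in range(entry_count):
            pos = ifd_offset + 2 + 12 * i
            tag, ftype, count = struct.unpack_from(endian + "HHI", data, pos)
            # Unknown field types are treated as 1 byte per value (assumption).
            size = TYPE_SIZES.get(ftype, 1) * count
            if size <= 4:
                # Small values are stored inline in the entry's last 4 bytes.
                raw = data[pos + 8:pos + 8 + size]
            else:
                # Larger values live elsewhere; the entry holds their offset.
                (value_offset,) = struct.unpack_from(endian + "I", data, pos + 8)
                raw = data[value_offset:value_offset + size]
            entries.append((tag, ftype, count, raw))
        # The offset of the next IFD follows the last entry (0 means none).
        (ifd_offset,) = struct.unpack_from(
            endian + "I", data, ifd_offset + 2 + 12 * entry_count)
    return entries


def private_tags(data):
    """Keep only entries in the private tag range (32768 and above)."""
    return [entry for entry in list_tiff_tags(data) if entry[0] >= 32768]
```

Reading a file with open(path, "rb").read() and passing the bytes to private_tags should give a short list of candidate tag numbers. Any entry whose raw bytes look like readable text is worth reporting back to the list.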

Re: extracting text from tiff files from jackrabbit

Posted by Eliott <el...@gmail.com>.

Hi!

Can anybody point me in the right direction? This text in the TIFF seems
to be stored in a special tag used by Microsoft and some other applications.

regards
eliott

