Posted to user@tika.apache.org by Eliott <el...@gmail.com> on 2011/03/10 16:18:21 UTC
extracting text from tiff files from jackrabbit
Dear Users!
We are using Tika indirectly in a project based on Jackrabbit. During
the final phase of this project it came to my attention that TIFF files
can also store the image and the OCR-ed text in the same
file, just like PDFs do. Since we have many such files, we have a
business need to extract text from these TIFFs to be able to do full
text searches. As I understand it, Tika does not support this functionality
for TIFFs, while PDFs work fine. Is there any special reason
for this?
Has anybody written a text extractor, or does anybody know of a library,
that can get the text layer from these files?
thanks in advance
eliott
Re: extracting text from tiff files from jackrabbit
Posted by Paul Jakubik <pa...@purediscovery.com>.
Hi Eliott,
I think the answer to your question is that Tika does not perform OCR on any
format.
Some PDF files contain text and layout information instead of images. In
this case, a PDF text extractor can calculate how the text will be rendered
on a page and from that information figure out what text goes together and
extract it.
In other words, while PDF text extractors work much harder than text
extractors for simpler formats, they are still starting with text embedded
in the format instead of using OCR to identify characters in an image. Tika
does not extract text from PDFs if the PDF only contains images.
I know even less about the TIFF format than I do about the PDF format, but I
think it strictly contains image data and the only way to get body text
from a TIFF is through OCR. Since Tika doesn't perform OCR, I don't think
you can get body text from a TIFF using Tika.
I hope this helps.
Paul
Re: extracting text from tiff files from jackrabbit
Posted by Jukka Zitting <jz...@adobe.com>.
Hi,
On 03/11/2011 05:16 PM, Eliott wrote:
> Can anybody point me in the right direction? This text in TIFF files seems
> to be stored in a special tag used by Microsoft and some other applications.
Can you identify which tag this is? Perhaps we could teach the TIFF
parser in Tika to spot and use such tags.
--
Jukka Zitting
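One way to find out which tag holds the text is to dump the tag numbers of a sample file and look for private ones. The following is a minimal Python sketch, not part of Tika. It assumes a classic (non-BigTIFF) file, and the names list_tiff_tags and private_tags are made up for illustration. It relies only on the TIFF 6.0 layout (an 8-byte header followed by IFDs made of 12-byte entries) and on the TIFF 6.0 convention that tag numbers of 32768 and above are private. It walks the top-level IFD chain only and does not follow sub-IFDs.

```python
import struct

# TIFF 6.0 reserves tag numbers >= 32768 for private (vendor-specific) use.
PRIVATE_TAG_START = 32768


def list_tiff_tags(data: bytes):
    """Return (tag, field_type, count) for each entry in the top-level IFD chain."""
    if data[:2] == b"II":
        endian = "<"  # little-endian ("Intel") byte order
    elif data[:2] == b"MM":
        endian = ">"  # big-endian ("Motorola") byte order
    else:
        raise ValueError("not a TIFF file: bad byte-order mark")

    magic, ifd_offset = struct.unpack(endian + "HI", data[2:8])
    if magic != 42:
        raise ValueError("not a classic TIFF file (BigTIFF is not handled)")

    entries = []
    seen = set()  # guards against a corrupt file whose IFD chain loops
    while ifd_offset != 0 and ifd_offset not in seen:
        seen.add(ifd_offset)
        (num_entries,) = struct.unpack_from(endian + "H", data, ifd_offset)
        for i in range(num_entries):
            # Each entry is 12 bytes: tag, field type, value count, value/offset.
            entry_pos = ifd_offset + 2 + 12 * i
            tag, field_type, count = struct.unpack_from(endian + "HHI", data, entry_pos)
            entries.append((tag, field_type, count))
        # The offset of the next IFD follows the last entry; 0 ends the chain.
        (ifd_offset,) = struct.unpack_from(endian + "I", data, ifd_offset + 2 + 12 * num_entries)
    return entries


def private_tags(data: bytes):
    """Return the sorted, de-duplicated private tag numbers found in the file."""
    return sorted({tag for tag, _, _ in list_tiff_tags(data) if tag >= PRIVATE_TAG_START})
```

Calling private_tags(open("sample.tif", "rb").read()) on one of the affected files (the file name is a placeholder) should narrow down the candidates. An entry with field type 2 (ASCII) and a large count would be a plausible place to look for embedded text, although that is only a guess until the tag is identified.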
Re: extracting text from tiff files from jackrabbit
Posted by Eliott <el...@gmail.com>.
Hi!
Can anybody point me in the right direction? This text in TIFF files seems
to be stored in a special tag used by Microsoft and some other applications.
regards
eliott