Posted to user@tika.apache.org by "ron.vandenbranden" <ro...@kantl.be> on 2016/04/13 12:53:19 UTC

disable extraction of images

Hi,

I've just happily discovered Tika and am sorting out how well it fits 
our needs.

I'm trying to create a searchable index for PDF files that contain typed 
pages and pages with scanned text facsimiles. Some of those facsimiles 
are scans of printed source materials, in which case Tika seems to be 
able to index their text content as well. Impressive though that is, 
we're currently only interested in the actual text content of the PDF, 
not the text on the images in the PDF.

Is it possible to disable text extraction from images inside a PDF file? 
I'm testing with the CLI tika app, which has "extractInlineImages" set 
to false by default, if I'm not mistaken. Yet, the text of the images 
still is present in the generated HTML output. Am I missing something 
obvious?

Kind regards,

Ron

Re: disable extraction of images

Posted by Jukka Zitting <ju...@gmail.com>.

Hi,

Some scanning software includes OCR features and embeds hidden text behind
the scanned images to make the resulting PDF searchable. I suspect this may
be happening in your case.

It would be technically possible to detect such hidden text and have an
option for excluding it from the output, but IIRC such a feature doesn't
currently exist in Tika or the underlying PDFBox library.

Best,

Jukka Zitting

On Wed, Apr 13, 2016 at 8:52 AM ron.vandenbranden <
ron.vandenbranden@kantl.be> wrote:

> Hi again,
>
>
> On 13/04/2016 13:18, ron.vandenbranden wrote:
> >
> > I wasn't aware of tesseract; I definitely don't have it on my classpath.
> > I'm just testing with the stand-alone tika jar file. My Java skills are
> > close to zero (apart from copy/paste and recompiling things). Could you
> > tell me how to configure this for the standalone jar file, please?
> >
>
> Ok, answering my own question: per the documentation at
> https://tika.apache.org/1.12/gettingstarted.html, I got the CLI app
> working with a configuration file, using the following command line
> arguments:
>
>   java -jar tika-app-1.12.jar --gui --config=tika-config.xml
>
> I'm using the example configuration file from
> https://wiki.apache.org/tika/TikaOCR#Disable_Tika_OCR, excluding the
> TesseractOCRParser.
>
> Yet, this does not seem to change anything: the image content is still
> extracted. Any idea what could be wrong?
>
> Best,
>
> Ron
>

Re: disable extraction of images

Posted by "ron.vandenbranden" <ro...@kantl.be>.

Hi again,

On 13/04/2016 13:18, ron.vandenbranden wrote:
>
> I wasn't aware of tesseract; I definitely don't have it on my 
> classpath. I'm just testing with the stand-alone tika jar file. My 
> Java skills are close to zero (apart from copy/paste and recompiling 
> things). Could you tell me how to configure this for the standalone 
> jar file, please?
>

Ok, answering my own question: per the documentation at 
https://tika.apache.org/1.12/gettingstarted.html, I got the CLI app 
working with a configuration file, using the following command line 
arguments:

   java -jar tika-app-1.12.jar --gui --config=tika-config.xml

I'm using the example configuration file from 
https://wiki.apache.org/tika/TikaOCR#Disable_Tika_OCR, excluding the 
TesseractOCRParser.
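
For reference, a configuration of that kind looks roughly like the 
sketch below. It is a minimal example assuming the Tika 1.x class names 
(org.apache.tika.parser.DefaultParser and 
org.apache.tika.parser.ocr.TesseractOCRParser); other versions may need 
different names. It only switches off Tesseract-based OCR and does not 
touch text that is already stored in the PDF itself.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- tika-config.xml: keep the default parsers, but leave out Tesseract OCR -->
<properties>
  <parsers>
    <parser class="org.apache.tika.parser.DefaultParser">
      <!-- Exclude the OCR parser so images are not run through Tesseract -->
      <parser-exclude class="org.apache.tika.parser.ocr.TesseractOCRParser"/>
    </parser>
  </parsers>
</properties>
```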

Yet, this does not seem to change anything: the image content is still 
extracted. Any idea what could be wrong?

Best,

Ron

Re: disable extraction of images

Posted by "ron.vandenbranden" <ro...@kantl.be>.

Thanks,

I wasn't aware of tesseract; I definitely don't have it on my classpath. 
I'm just testing with the stand-alone tika jar file. My Java skills are 
close to zero (apart from copy/paste and recompiling things). Could you 
tell me how to configure this for the standalone jar file, please?

In the end, I'll be using Tika embedded in another app (the eXist native 
XML database), which uses two jars: tika-core and tika-parsers. How would 
I go about disabling tesseract there?

Apologies for the low-level questions; any help is much appreciated!

Best,

Ron

On 13/04/2016 12:56, Nick Burch wrote:
> On Wed, 13 Apr 2016, ron.vandenbranden wrote:
>> Is it possible to disable text extraction from images inside a PDF 
>> file? I'm testing with the CLI tika app, which has 
>> "extractInlineImages" set to false by default, if I'm not mistaken. 
>> Yet, the text of the images still is present in the generated HTML 
>> output. Am I missing something obvious?
>
> Yup, see "Disable Tika OCR" in https://wiki.apache.org/tika/TikaOCR 
> (or remove tesseract from your path!)
>
> Nick
>
>

Re: disable extraction of images

Posted by Nick Burch <ap...@gagravarr.org>.

On Wed, 13 Apr 2016, ron.vandenbranden wrote:
> Is it possible to disable text extraction from images inside a PDF file? 
> I'm testing with the CLI tika app, which has "extractInlineImages" set 
> to false by default, if I'm not mistaken. Yet, the text of the images 
> still is present in the generated HTML output. Am I missing something 
> obvious?

Yup, see "Disable Tika OCR" in https://wiki.apache.org/tika/TikaOCR (or 
remove tesseract from your path!)

Nick