Posted to user@tika.apache.org by Dave Meikle <lo...@gmail.com> on 2013/03/10 00:53:18 UTC

Re: Tika and invisible text from pdf

Hi Brad,

On 21 Feb 2013, at 11:28, Brad Stallion <br...@yahoo.com> wrote:

> I'm extracting text from PDF files using my own SAX handler. The problem is that I get both visible and invisible text, i.e. text contained in invisible parts of the layout.
> How can I identify the invisible parts?

We use PDFBox under the hood in Tika.  Have you tried asking on their user list?

Cheers,
Dave

Re: Tika and invisible text from pdf

Posted by Brad Stallion <br...@yahoo.com>.
Hi Dave,
no, not yet, good idea.
If there is some parameter to tune in PDFBox, how can I access it directly?
Thanks



>________________________________
> From: Dave Meikle <lo...@gmail.com>
>To: user@tika.apache.org; Brad Stallion <br...@yahoo.com> 
>Sent: Sunday, 10 March 2013 0:53
>Subject: Re: Tika and invisible text from pdf
> 
>Hi Brad,
>
>On 21 Feb 2013, at 11:28, Brad Stallion <br...@yahoo.com> wrote:
>
>> I'm extracting text from PDF files using my own SAX handler. The problem is that I get both visible and invisible text, i.e. text contained in invisible parts of the layout.
>> How can I identify the invisible parts?
>
>We use PDFBox under the hood in Tika.  Have you tried asking on their user list?
>
>Cheers,
>Dave
>
>