Posted to user@tika.apache.org by Brad Stallion <br...@yahoo.com> on 2013/02/21 12:28:02 UTC

Tika and invisible text from pdf

Hi all,

I'm extracting text from PDF files using my own SAX handler. The problem is that I get both visible and invisible text, i.e. text contained in invisible parts of the layout.
How can I identify the invisible parts?

I've also asked on Stack Overflow:

http://stackoverflow.com/questions/14956556/tika-and-invisible-text-from-pdf

Thanks a lot for your help!

bye

Re: Tika and invisible text from pdf

Posted by Brad Stallion <br...@yahoo.com>.
I've found something interesting here:

http://www.pdflib.com/fileadmin/pdflib/pdf/manuals/TET-4.1-manual.pdf


"Area of text extraction. By default, TET will extract all text from the visible page area. 
Using the clippingarea option of open_page( ) (see Table 10.10, page 172) you can change 
this to any of the PDF page box entries (e.g. TrimBox). With the keyword unlimited all 
text regardless of any page boxes can be extracted. The default value cropbox instructs 
TET to extract text within the area which is visible in Acrobat."

How can I have the same behavior using Tika?
Thanks a lot
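
For what it's worth, here is a minimal sketch of the "clip to the visible page area" idea from the TET quote, written against the JDK only. It assumes you can already get each run of text together with its bounding box in page coordinates; the TextRun class, the visibleText method and the page size used in main are hypothetical names and values made up for illustration, not part of Tika or PDFBox. It only covers text that lies outside a page box, not text hidden by other means (for example, text drawn with an invisible rendering mode or covered by an image).

```java
import java.awt.geom.Rectangle2D;
import java.util.ArrayList;
import java.util.List;

public class CropBoxFilter {

    /** Hypothetical holder: a run of text plus its bounding box in page coordinates. */
    public static final class TextRun {
        final String text;
        final Rectangle2D bounds;

        public TextRun(String text, Rectangle2D bounds) {
            this.text = text;
            this.bounds = bounds;
        }
    }

    /** Keeps only the runs whose bounding box overlaps the given crop box. */
    public static List<String> visibleText(List<TextRun> runs, Rectangle2D cropBox) {
        List<String> kept = new ArrayList<>();
        for (TextRun run : runs) {
            if (cropBox.intersects(run.bounds)) {
                kept.add(run.text);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Assumed crop box: a US Letter page, 612 x 792 points.
        Rectangle2D cropBox = new Rectangle2D.Double(0, 0, 612, 792);

        List<TextRun> runs = new ArrayList<>();
        runs.add(new TextRun("on the page", new Rectangle2D.Double(72, 700, 100, 12)));
        runs.add(new TextRun("off the page", new Rectangle2D.Double(700, 700, 100, 12)));

        System.out.println(visibleText(runs, cropBox));
    }
}
```

PDFBox also ships a PDFTextStripperByArea class for region-based extraction, which may be the closer real-world equivalent, but whether it reproduces the TET behavior for a given file is something to verify rather than assume.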




>________________________________
> From: Brad Stallion <br...@yahoo.com>
>To: "user@tika.apache.org" <us...@tika.apache.org>
>Sent: Thursday, 21 February 2013 14:10
>Subject: Re: Tika and invisible text from pdf
> 
>
>Hi Samir and thanks for your response.
>I've already tried and it makes no difference, at least with default settings.
>I attach a small pdf that shows what I mean: how do extract only "visible text"?
>
>
>If you try pdftotext (I'm using ubuntu 12.10), it skips the invisible text.
>
>
>Thanks
>
>
>
>>________________________________
>> From: samir pendharkar <sa...@gmail.com>
>>To: user@tika.apache.org; Brad Stallion <br...@yahoo.com>
>>Sent: Thursday, 21 February 2013 13:21
>>Subject: Re: Tika and invisible text from pdf
>> 
>>
>>In such cases what works best is look at the "Structured Text" view in TIKA GUI.
>>
>>You might be able to skip tags that you don't want in the output(assuming invisible part is in some different tag). 
>>
>>
>>
>>On Thu, Feb 21, 2013 at 4:58 PM, Brad Stallion <br...@yahoo.com> wrote:
>>
>>Hi all,
>>>
>>>I'm extracting text from PDF files using my own sax handler. The problem is that I get both visible and invisible text, i.e. text contained in invisible parts of the layout.
>>>How can I identify the invisible parts?
>>>
>>>I've asked to stack overflow as well:
>>>
>>>http://stackoverflow.com/questions/14956556/tika-and-invisible-text-from-pdf
>>>
>>>Thanks a lot for your help!
>>>
>>>bye
>>>
>>
>>
>>
>
>

Re: Tika and invisible text from pdf

Posted by Brad Stallion <br...@yahoo.com>.
Hi Samir and thanks for your response.
I've already tried that and it makes no difference, at least with the default settings.
I've attached a small PDF that shows what I mean: how do I extract only the "visible text"?

If you try pdftotext (I'm using Ubuntu 12.10), it skips the invisible text.

Thanks



>________________________________
> From: samir pendharkar <sa...@gmail.com>
>To: user@tika.apache.org; Brad Stallion <br...@yahoo.com>
>Sent: Thursday, 21 February 2013 13:21
>Subject: Re: Tika and invisible text from pdf
> 
>
>In such cases what works best is look at the "Structured Text" view in TIKA GUI.
>
>You might be able to skip tags that you don't want in the output(assuming invisible part is in some different tag). 
>
>
>
>On Thu, Feb 21, 2013 at 4:58 PM, Brad Stallion <br...@yahoo.com> wrote:
>
>Hi all,
>>
>>I'm extracting text from PDF files using my own sax handler. The problem is that I get both visible and invisible text, i.e. text contained in invisible parts of the layout.
>>How can I identify the invisible parts?
>>
>>I've asked to stack overflow as well:
>>
>>http://stackoverflow.com/questions/14956556/tika-and-invisible-text-from-pdf
>>
>>Thanks a lot for your help!
>>
>>bye
>>
>
>
>

Re: Tika and invisible text from pdf

Posted by samir pendharkar <sa...@gmail.com>.
In such cases what works best is to look at the "Structured Text" view in the
Tika GUI.
You might be able to skip tags that you don't want in the output (assuming
the invisible part is in a different tag).
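
A rough sketch of the tag-skipping idea, using only the SAX classes from the JDK. It assumes the unwanted text really does sit inside an element you can name, which may not hold for a given PDF. TagSkippingHandler, its extract helper and the choice of "div" as the element to drop are all hypothetical and only for illustration. It matches on qName; a handler that receives namespace-aware events might need to match on localName instead.

```java
import java.io.StringReader;
import java.util.Collections;
import java.util.Set;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

/** Collects character data, ignoring everything nested inside skipped elements. */
public class TagSkippingHandler extends DefaultHandler {
    private final Set<String> skipped;              // element names to drop
    private final StringBuilder text = new StringBuilder();
    private int skipDepth = 0;                      // > 0 while inside a skipped element

    public TagSkippingHandler(Set<String> skipped) {
        this.skipped = skipped;
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        // Once inside a skipped element, count every nested element as well.
        if (skipDepth > 0 || skipped.contains(qName)) {
            skipDepth++;
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (skipDepth > 0) {
            skipDepth--;
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (skipDepth == 0) {
            text.append(ch, start, length);
        }
    }

    public String getText() {
        return text.toString();
    }

    /** Parses a well-formed XHTML string and returns the text outside the skipped elements. */
    public static String extract(String xhtml, Set<String> skipped) throws Exception {
        TagSkippingHandler handler = new TagSkippingHandler(skipped);
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xhtml)), handler);
        return handler.getText();
    }

    public static void main(String[] args) throws Exception {
        String xhtml = "<html><body><p>kept</p><div>dropped <p>also dropped</p></div></body></html>";
        System.out.println(extract(xhtml, Collections.singleton("div")));
    }
}
```

The depth counter is there so that text in elements nested inside a skipped element is dropped too, and so that a skipped element containing another element of the same name is handled correctly.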


On Thu, Feb 21, 2013 at 4:58 PM, Brad Stallion <br...@yahoo.com> wrote:

> Hi all,
>
> I'm extracting text from PDF files using my own sax handler. The problem
> is that I get both visible and invisible text, i.e. text contained in
> invisible parts of the layout.
> How can I identify the invisible parts?
>
> I've asked to stack overflow as well:
>
>
> http://stackoverflow.com/questions/14956556/tika-and-invisible-text-from-pdf
>
> Thanks a lot for your help!
>
> bye
>

Re: Tika and invisible text from pdf

Posted by Brad Stallion <br...@yahoo.com>.
Hi Dave,
No, not yet. Good idea.
If PDFBox has a parameter I can tune for this, how can I access it directly?
Thanks



>________________________________
> From: Dave Meikle <lo...@gmail.com>
>To: user@tika.apache.org; Brad Stallion <br...@yahoo.com>
>Sent: Sunday, 10 March 2013 0:53
>Subject: Re: Tika and invisible text from pdf
> 
>Hi Brad,
>
>On 21 Feb 2013, at 11:28, Brad Stallion <br...@yahoo.com> wrote:
>
>> I'm extracting text from PDF files using my own sax handler. The problem is that I get both visible and invisible text, i.e. text contained in invisible parts of the layout.
>> How can I identify the invisible parts?
>
>We use PDFBox under the hood in Tika.  Have you tried asking on their user list?
>
>Cheers,
>Dave
>
>

Re: Tika and invisible text from pdf

Posted by Dave Meikle <lo...@gmail.com>.
Hi Brad,

On 21 Feb 2013, at 11:28, Brad Stallion <br...@yahoo.com> wrote:

> I'm extracting text from PDF files using my own sax handler. The problem is that I get both visible and invisible text, i.e. text contained in invisible parts of the layout.
> How can I identify the invisible parts?

We use PDFBox under the hood in Tika.  Have you tried asking on their user list?

Cheers,
Dave