Posted to users@pdfbox.apache.org by "Brangs, Erik" <E....@dnb.de> on 2023/12/13 10:23:44 UTC

Text extraction from a certain PDF uses up multiple GB of memory

Hi,

We ran into problems when extracting text from the PDF at https://d-nb.info/1312454512/34 . We were using PDFBox 3.0.0, and the text extraction used up multiple GB of memory. The problem can also be reproduced with PDFBox 4.0.0-SNAPSHOT and PDFBox 3.0.2-SNAPSHOT. Is there room for improvement in PDFBox's text extraction for this case, or is this just a badly generated PDF?

-- 
Erik Brangs
Deutsche Nationalbibliothek
Informationstechnik
Adickesallee 1
60322 Frankfurt am Main
Telefon: +49 69 1525-1792
Telefax: +49 69 1525-1799
mailto:e.brangs@dnb.de
https://www.dnb.de


---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe@pdfbox.apache.org
For additional commands, e-mail: users-help@pdfbox.apache.org

Re: Text extraction from a certain PDF uses up multiple GB of memory

Posted by Andreas Lehmkühler <an...@lehmi.de.INVALID>.

Looks like https://issues.apache.org/jira/browse/PDFBOX-5479

On 13.12.23 at 14:50, Tilman Hausherr wrote:
> On 13.12.2023 11:23, Brangs, Erik wrote:
>> Hi,
>>
>> we ran into problems when doing text extraction from the PDF at https://d-nb.info/1312454512/34 . We were using PDFBox 3.0.0 to extract the text and the text extraction used up multiple GB of memory. The problem can be reproduced with PDFBox 4.0.0-SNAPSHOT and PDFBox 3.0.2-SNAPSHOT. Is there room for improvement in text extraction in PDFBox for this case or is this just a badly generated PDF?
>>
> Yeah, it's a weird PDF: it has different font objects that point to
> the same font file (see FontFile2). So the font is opened each time and
> all tables are read and stored. And since 3.0 we read many more tables
> than in 2.0.
> Tilman
> 

Re: Text extraction from a certain PDF uses up multiple GB of memory

Posted by Tilman Hausherr <TH...@t-online.de>.

On 13.12.2023 11:23, Brangs, Erik wrote:
> Hi,
>
> we ran into problems when doing text extraction from the PDF at https://d-nb.info/1312454512/34 . We were using PDFBox 3.0.0 to extract the text and the text extraction used up multiple GB of memory. The problem can be reproduced with PDFBox 4.0.0-SNAPSHOT and PDFBox 3.0.2-SNAPSHOT. Is there room for improvement in text extraction in PDFBox for this case or is this just a badly generated PDF?
>
Yeah, it's a weird PDF: it has different font objects that point to
the same font file (see FontFile2). So the font is opened each time and
all tables are read and stored. And since 3.0 we read many more tables
than in 2.0.
Tilman
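
To illustrate the effect described above, here is a minimal sketch in plain Java. It is not PDFBox code and does not claim to show how PDFBox handles fonts or what PDFBOX-5479 changes. The types FontFileStream and ParsedFont and the methods loadNaive and loadCached are hypothetical stand-ins. The sketch only shows why several font objects sharing one font file can multiply memory use when each reference is parsed separately, and how keying a cache on the identity of the font file would avoid that.

```java
import java.util.ArrayList;
import java.util.IdentityHashMap;
import java.util.List;
import java.util.Map;

// Illustration only: these types are hypothetical stand-ins, not PDFBox classes.
class FontCacheSketch {

    /** Stand-in for an embedded font program, e.g. the stream behind FontFile2. */
    static final class FontFileStream {
        final byte[] data;
        FontFileStream(byte[] data) { this.data = data; }
    }

    /** Stand-in for a parsed font that keeps its own copy of all tables in memory. */
    static final class ParsedFont {
        final byte[] tables;
        ParsedFont(FontFileStream source) { this.tables = source.data.clone(); }
    }

    /** Parses the font file once per font object, even when the file is shared. */
    static List<ParsedFont> loadNaive(List<FontFileStream> fontFileRefs) {
        List<ParsedFont> fonts = new ArrayList<>();
        for (FontFileStream ref : fontFileRefs) {
            fonts.add(new ParsedFont(ref));
        }
        return fonts;
    }

    /** Parses each distinct font file once and reuses the parsed result. */
    static List<ParsedFont> loadCached(List<FontFileStream> fontFileRefs) {
        // Keyed on object identity: the same stream object means the same font file.
        Map<FontFileStream, ParsedFont> cache = new IdentityHashMap<>();
        List<ParsedFont> fonts = new ArrayList<>();
        for (FontFileStream ref : fontFileRefs) {
            fonts.add(cache.computeIfAbsent(ref, ParsedFont::new));
        }
        return fonts;
    }

    /** Counts distinct ParsedFont instances by object identity. */
    static int countDistinct(List<ParsedFont> fonts) {
        Map<ParsedFont, Boolean> seen = new IdentityHashMap<>();
        for (ParsedFont font : fonts) {
            seen.put(font, Boolean.TRUE);
        }
        return seen.size();
    }

    public static void main(String[] args) {
        // Three font objects that all point to the same font file.
        FontFileStream shared = new FontFileStream(new byte[1024]);
        List<FontFileStream> refs = List.of(shared, shared, shared);
        System.out.println("naive:  " + countDistinct(loadNaive(refs)));
        System.out.println("cached: " + countDistinct(loadCached(refs)));
    }
}
```

With three references to one shared font file, the naive variant holds three parsed copies and the cached variant holds one. Whether such a cache is appropriate inside a real PDF library depends on details this sketch ignores, such as font lifetime and when resources are closed.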