Posted to user@tika.apache.org by "Ribeaud, Christian (Ext)" <ch...@novartis.com> on 2019/11/14 10:18:55 UTC

Parsing huge PDF (400Mb, 2700 pages)

Hi,

My application handles all kinds of documents (mainly PDFs). In a few cases, huge PDFs (up to 500MB) can be expected.

At around 400MB I hit a wall: parsing takes ages (although it is quite fast at the beginning). I've tried several ideas, but none of them brought the desired improvement.

I have the impression that memory plays a role. I have no more than 3GB (and I think this should be enough, as we are streaming the document and using an event-based XML parser).
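
For reference, the event-based setup described above presumably looks roughly like the following minimal Tika sketch (the file name and the choice of BodyContentHandler are assumptions, not the actual application code):

    import java.io.InputStream;
    import java.nio.file.Files;
    import java.nio.file.Paths;

    import org.apache.tika.metadata.Metadata;
    import org.apache.tika.parser.AutoDetectParser;
    import org.apache.tika.parser.ParseContext;
    import org.apache.tika.sax.BodyContentHandler;

    public class ExtractText {
        public static void main(String[] args) throws Exception {
            AutoDetectParser parser = new AutoDetectParser();
            // -1 disables the default character write limit on the handler
            BodyContentHandler handler = new BodyContentHandler(-1);
            Metadata metadata = new Metadata();
            try (InputStream in = Files.newInputStream(Paths.get("big.pdf"))) {
                parser.parse(in, handler, metadata, new ParseContext());
            }
            System.out.println("Extracted " + handler.toString().length() + " characters");
        }
    }

Note that even with a SAX-style ContentHandler on the Tika side, the PDF parser underneath (PDFBox) still builds the full document tree, as discussed later in this thread.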

Are there things I should be aware of?

Any hint would be very welcome. Thanks and have a nice day,

christian


Re: Parsing huge PDF (400Mb, 2700 pages)

Posted by Tilman Hausherr <TH...@t-online.de>.
On 14.11.2019 at 11:18, Ribeaud, Christian (Ext) wrote:
> I have the impression that memory plays a role. I have no more than 3GB

Can you share the file? I have 64GB so I could tell how much it really uses.

Tilman


Re: Parsing huge PDF (400Mb, 2700 pages)

Posted by John Patrick <nh...@gmail.com>.
Could you recreate a test PDF with a similar page count and file size
that behaves the same as your real PDF (see the sketch below)? It's
probably the only way people can help, unless you investigate it
further yourself.

Also, as people have mentioned, PDFs are compressed, so when decompressed
they can be much larger; a 100MB PDF might be 150MB or 2GB uncompressed,
depending entirely on what is actually in the PDF.
They are also not sequential files, so the 1st page might require data
from all over the file; if you are going page by page, you probably
have to load and parse the whole file into memory.
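
For illustration, a large test PDF along these lines could be generated with PDFBox 2.x roughly as in the sketch below (page count, text and output name are placeholders; to approach a real 400MB file you would additionally embed images or other large streams):

    import java.io.IOException;

    import org.apache.pdfbox.pdmodel.PDDocument;
    import org.apache.pdfbox.pdmodel.PDPage;
    import org.apache.pdfbox.pdmodel.PDPageContentStream;
    import org.apache.pdfbox.pdmodel.font.PDType1Font;

    public class GenerateTestPdf {
        public static void main(String[] args) throws IOException {
            int pages = 2700; // page count taken from the subject line
            try (PDDocument doc = new PDDocument()) {
                for (int i = 1; i <= pages; i++) {
                    PDPage page = new PDPage();
                    doc.addPage(page);
                    // write a single line of text per page
                    try (PDPageContentStream cs = new PDPageContentStream(doc, page)) {
                        cs.beginText();
                        cs.setFont(PDType1Font.HELVETICA, 12);
                        cs.newLineAtOffset(50, 700);
                        cs.showText("Test page " + i);
                        cs.endText();
                    }
                }
                doc.save("test-2700-pages.pdf");
            }
        }
    }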


On Thu, 14 Nov 2019 at 18:23, Maruan Sahyoun <sa...@fileaffairs.de> wrote:
>
> Mhmm - not running on AWS Lambda, but I do have an application handling PDFs with up to 30,000 pages and it takes only 2 minutes.
> Although the environments are not comparable, it would be good to get a better idea of the content of the PDFs. Maybe there is
> something in there causing that long runtime.
>
> Could you share it privately?
>
> BR
> Maruan


Re: Parsing huge PDF (400Mb, 2700 pages)

Posted by Maruan Sahyoun <sa...@fileaffairs.de>.
Mhmm - not running on AWS Lambda, but I do have an application handling PDFs with up to 30,000 pages and it takes only 2 minutes.
Although the environments are not comparable, it would be good to get a better idea of the content of the PDFs. Maybe there is
something in there causing that long runtime.

Could you share it privately?

BR
Maruan

> Hi,
> 
> I've read about the on-demand parser. I might have a look.
> 
> Unfortunately, I am NOT allowed to share the PDF.
> 
> What I am trying to do is the following: I am writing an AWS Lambda function that parses the PDF page by page. The text should be extracted and sent to Elasticsearch.
> 
> Because of the Lambda environment, I have limited resources: 3GB of memory and 15 minutes of runtime at most.
> 
> This setup works marvelously with the majority of the PDFs. With the ones bigger than around 400MB, I overrun the time limit.
> 
> The problem is NOT Tika related, it is PDFBox related (I did a check). So I will have to find another strategy for the time being.
> 
> Thanks to all for the feedback. Very appreciated.
> 
> Kind regards and have a nice evening,
> 
> christian
> 
-- 
Maruan Sahyoun

FileAffairs GmbH
Josef-Schappe-Straße 21
40882 Ratingen

Tel: +49 (2102) 89497 88
Fax: +49 (2102) 89497 91
sahyoun@fileaffairs.de
www.fileaffairs.de

Geschäftsführer: Maruan Sahyoun
Handelsregister: AG Düsseldorf, HRB 53837
UST.-ID: DE248275827


RE: Parsing huge PDF (400Mb, 2700 pages)

Posted by "Ribeaud, Christian (Ext)" <ch...@novartis.com>.
Hi,

I've read about the on-demand parser. I might have a look.

Unfortunately, I am NOT allowed to share the PDF.

What I am trying to do is the following: I am writing an AWS Lambda function that parses the PDF page by page. The text should be extracted and sent to Elasticsearch.
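
For illustration, the page-by-page extraction presumably looks something like this PDFBox 2.x sketch (the file path is a placeholder and the Elasticsearch client code is omitted):

    import java.io.File;

    import org.apache.pdfbox.pdmodel.PDDocument;
    import org.apache.pdfbox.text.PDFTextStripper;

    public class PerPageExtractor {
        public static void main(String[] args) throws Exception {
            try (PDDocument doc = PDDocument.load(new File("/tmp/big.pdf"))) {
                PDFTextStripper stripper = new PDFTextStripper();
                for (int p = 1; p <= doc.getNumberOfPages(); p++) {
                    // restrict the stripper to a single page per iteration
                    stripper.setStartPage(p);
                    stripper.setEndPage(p);
                    String pageText = stripper.getText(doc);
                    // send pageText to Elasticsearch here
                }
            }
        }
    }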

Because of the Lambda environment, I have limited resources: 3GB of memory and 15 minutes of runtime at most.

This setup works marvelously with the majority of the PDFs. With the ones bigger than around 400MB, I overrun the time limit.

The problem is NOT Tika related, it is PDFBox related (I did a check). So I will have to find another strategy for the time being.

Thanks to all for the feedback. Very appreciated.

Kind regards and have a nice evening,

christian


Re: Parsing huge PDF (400Mb, 2700 pages)

Posted by Tilman Hausherr <TH...@t-online.de>.
The PDF can be much bigger than 3GB when decompressed.

What you could try

1) using a scratch file (will be even slower) when opening the document - see the sketch below
2) the on-demand parser, see
https://issues.apache.org/jira/browse/PDFBOX-4569

there is a branch on the SVN server; you have to build from source.
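
For option 1, a minimal PDFBox 2.x sketch (the file name is a placeholder) would be:

    import java.io.File;

    import org.apache.pdfbox.io.MemoryUsageSetting;
    import org.apache.pdfbox.pdmodel.PDDocument;

    public class ScratchFileLoad {
        public static void main(String[] args) throws Exception {
            // Buffer the document in a temporary scratch file instead of the Java heap;
            // MemoryUsageSetting.setupMixed(maxBytes) is a middle ground (heap first, then disk).
            try (PDDocument doc = PDDocument.load(new File("big.pdf"),
                    MemoryUsageSetting.setupTempFileOnly())) {
                System.out.println("Pages: " + doc.getNumberOfPages());
            }
        }
    }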

Tilman


Re: Parsing huge PDF (400Mb, 2700 pages)

Posted by John Lussmyer <Co...@CasaDelGato.com>.
On Thu Nov 14 08:32:20 PST 2019 sahyoun@fileaffairs.de said:
>well - PDF is not really easily streamable, as
>
>- it's organized as a random access format
>- the reference table about the objects forming the PDF is at the end of the file, so you have to read the last parts first and
>then move back

While the PDF file itself can't be usefully streamed, the CONTENT streams can be.
Those are usually 99.99% of the file size.


--

Try my Sensible Email package!  https://sourceforge.net/projects/sensibleemail/

Re: Parsing huge PDF (400Mb, 2700 pages)

Posted by Maruan Sahyoun <sa...@fileaffairs.de>.
well - PDF is not really easily streamable, as 

- it's organized as a random access format
- the reference table about the objects forming the PDF is at the end of the file, so you have to read the last parts first and
then move back
- objects making up the content can be spread around the file
- pages can be organized in trees
- page resources such as images or fonts may be shared across pages
- the information/content of these resources may be sitting before or after the page objects
- PDFs can be incrementally changed, so information in a section might be outdated by a revision which comes later in the file

...

so it's more similar to building a DOM from an XML and handling that than to stream-parsing an XML.

That doesn't mean that there are no ways to improve the current parsing ...

BR
Maruan
  
-- 
Maruan Sahyoun

FileAffairs GmbH
Josef-Schappe-Straße 21
40882 Ratingen

Tel: +49 (2102) 89497 88
Fax: +49 (2102) 89497 91
sahyoun@fileaffairs.de
www.fileaffairs.de

Geschäftsführer: Maruan Sahyoun
Handelsregister: AG Düsseldorf, HRB 53837
UST.-ID: DE248275827


Re: Parsing huge PDF (400Mb, 2700 pages)

Posted by John Patrick <nh...@gmail.com>.
What JDK are you using?
Java 8? 11? 13? i.e. a version that is currently in active support?
Are you using the latest release of that version?
Have you switched on GC logging and seen whether that is the issue?
Is it constantly doing GC? You might need to tweak the JVM arguments depending
on which GC you are using.

If you take a look at the classic GC diagram from, say, here:
https://geekspearls.blogspot.com/2016/02/how-java-garbage-collection-works.html

If your file is 400MB and it isn't streamed, then your eden space might
need to be larger than its default value; otherwise eden will fill
instantly and objects will be moved straight to tenured.

The GC logs should give you an idea.
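
For example, GC logging can be switched on with standard JVM flags (heap size, log file and jar name below are only placeholders):

    # JDK 8 and earlier
    java -Xmx3g -verbose:gc -XX:+PrintGCDetails -Xloggc:gc.log -jar extractor.jar big.pdf

    # JDK 9 and later (unified logging)
    java -Xmx3g -Xlog:gc*:file=gc.log -jar extractor.jar big.pdf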

Have you tried restarting, to see if it's faster when it is the first file
to process?

John



RE: Parsing huge PDF (400Mb, 2700 pages)

Posted by "Ribeaud, Christian (Ext)" <ch...@novartis.com>.
Good evening,

No, I am NOT using tika-server. And uh, I am a bit surprised to hear (read) that PDFBox does NOT stream the PDF.
So let's wait for the PDFBox colleagues' feedback. Thanks anyway for yours.

christian


Re: Parsing huge PDF (400Mb, 2700 pages)

Posted by Tim Allison <ta...@apache.org>.
CC'ing colleagues on PDFBox...any recommendations?

Sergey's recommendation is great for documents that can be parsed via
streaming.  However, PDFBox does not currently parse PDFs in a streaming
mode.  It builds the full document tree -- PDFBox colleagues let me know if
I'm wrong.

On Thu, Nov 14, 2019 at 5:51 AM Sergey Beryozkin <sb...@gmail.com>
wrote:

> Hi,
> Are you using tika-server? If yes, and you can submit the data using a
> multipart/form-data payload, then it may help: CXF (used by tika-server)
> should make a best effort to save the multipart payloads to temp
> locations on disk, and thus minimize the memory requirements.
>
> Cheers, Sergey

Re: Parsing huge PDF (400Mb, 2700 pages)

Posted by Sergey Beryozkin <sb...@gmail.com>.
Hi,
Are you using tika-server? If yes, and you can submit the data using a
multipart/form-data payload, then it may help: CXF (used by tika-server)
should make a best effort to save the multipart payloads to temp
locations on disk, and thus minimize the memory requirements.
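
For illustration, against a locally running tika-server (default port 9998; the file name is a placeholder, and the multipart form endpoint and field name should be double-checked against the tika-server documentation):

    # plain PUT of the raw bytes to the standard /tika endpoint
    curl -T big.pdf -H "Accept: text/plain" http://localhost:9998/tika

    # multipart/form-data upload, which lets CXF spool the payload to disk
    curl -F upload=@big.pdf http://localhost:9998/tika/form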

Cheers, Sergey

