Posted to user@tika.apache.org by Slava G <sl...@gmail.com> on 2019/02/26 16:55:58 UTC

Very slow PDF parsing.

Hi,

I have a large PDF (about 65 MB) that contains mainly text and some images.

Parsing this PDF can take about 2 days or even more (Tika 1.19.1 running
on a Xeon server with a 4-core CPU, 30 GB RAM and an SSD, running CentOS
Linux).

Please advise if there is anything I can do to speed it up. Or maybe it's a
bug in PDFBox?

Whenever I print the Java stack, it is always in this code path:

    at org.apache.pdfbox.cos.COSString.equals(COSString.java:259)
    at java.util.HashMap$TreeNode.find(Unknown Source)
    at java.util.HashMap$TreeNode.find(Unknown Source)
    at java.util.HashMap$TreeNode.find(Unknown Source)
    at java.util.HashMap$TreeNode.find(Unknown Source)
    at java.util.HashMap$TreeNode.find(Unknown Source)
    at java.util.HashMap$TreeNode.find(Unknown Source)
    at java.util.HashMap$TreeNode.find(Unknown Source)
    at java.util.HashMap$TreeNode.find(Unknown Source)
    at java.util.HashMap$TreeNode.find(Unknown Source)
    at java.util.HashMap$TreeNode.getTreeNode(Unknown Source)
    at java.util.HashMap.getNode(Unknown Source)
    at java.util.HashMap.containsKey(Unknown Source)
    at java.util.HashSet.contains(Unknown Source)
    at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:390)
    at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decryptDictionary(SecurityHandler.java:517)
    at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:404)
    at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decryptArray(SecurityHandler.java:577)
    at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:408)
    at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decryptDictionary(SecurityHandler.java:517)
    at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:404)
    at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decryptDictionary(SecurityHandler.java:517)
    at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:404)
    at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decryptArray(SecurityHandler.java:577)
    at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:408)
    at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decryptDictionary(SecurityHandler.java:517)
    at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:404)
    at org.apache.pdfbox.pdfparser.COSParser.parseFileObject(COSParser.java:946)
    at org.apache.pdfbox.pdfparser.COSParser.parseObjectDynamically(COSParser.java:874)
    at org.apache.pdfbox.pdfparser.COSParser.parseObjectDynamically(COSParser.java:794)
    at org.apache.pdfbox.pdfparser.COSParser.parseDictObjects(COSParser.java:754)
    at org.apache.pdfbox.pdfparser.PDFParser.initialParse(PDFParser.java:185)
    at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:220)
    at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1028)
    at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:984)
    at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:152)
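
The run of HashMap$TreeNode.find frames ending in an equals() call is what a
HashSet.contains() lookup looks like when many keys share one hash bucket and
that bucket has been converted to a tree. The sketch below only illustrates
this general java.util.HashMap behaviour; it is not a diagnosis of this PDF.
CollidingKeyDemo and BadKey are hypothetical names, and the constant hash code
is an assumption chosen to force every key into the same bucket.

```java
import java.util.HashSet;
import java.util.Set;

class CollidingKeyDemo {

    /** Hypothetical key: every instance has the same hash code, so all keys share one bucket. */
    static final class BadKey {
        static long equalsCalls = 0;
        final int id;

        BadKey(int id) {
            this.id = id;
        }

        @Override
        public int hashCode() {
            return 42; // constant on purpose: forces collisions
        }

        @Override
        public boolean equals(Object other) {
            equalsCalls++;
            return other instanceof BadKey && ((BadKey) other).id == id;
        }
    }

    /** Fills a HashSet with n colliding keys, then counts the equals() calls made by n lookups. */
    static long equalsCallsForLookups(int n) {
        Set<BadKey> seen = new HashSet<>();
        for (int i = 0; i < n; i++) {
            seen.add(new BadKey(i));
        }
        BadKey.equalsCalls = 0;
        for (int i = 0; i < n; i++) {
            seen.contains(new BadKey(i));
        }
        return BadKey.equalsCalls;
    }

    public static void main(String[] args) {
        int n = 2000;
        System.out.println(n + " lookups needed " + equalsCallsForLookups(n) + " equals() calls");
    }
}
```

Because BadKey is not Comparable, a lookup cannot pick a side of the tree and
may have to search both subtrees, so the number of equals() calls grows much
faster than the number of lookups.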


P.S. Btw, the PDF is not encrypted at all.

Thanks
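
Printing the Java stack repeatedly, as described above, is a simple form of
sampling: if the same frames show up every time, that is where the time goes.
A minimal sketch of automating this with the standard Thread.getStackTrace()
call follows. StackSampler, sampleTopFrames, runDemo and spin are hypothetical
names introduced for illustration, and spin() is only a stand-in for the
long-running parse.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.CountDownLatch;

class StackSampler {

    private static volatile boolean running;

    /** Stand-in for a long-running parse: spins until 'running' is cleared. */
    static void spin(CountDownLatch started) {
        started.countDown();
        while (running) {
            // busy loop; a real workload would be parsing here
        }
    }

    /** Samples the target thread's top stack frame and counts how often each one appears. */
    static Map<String, Integer> sampleTopFrames(Thread target, int samples, long pauseMillis) {
        Map<String, Integer> histogram = new HashMap<>();
        for (int i = 0; i < samples; i++) {
            StackTraceElement[] stack = target.getStackTrace();
            if (stack.length > 0) {
                String frame = stack[0].getClassName() + "." + stack[0].getMethodName();
                histogram.merge(frame, 1, Integer::sum);
            }
            try {
                Thread.sleep(pauseMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                break;
            }
        }
        return histogram;
    }

    /** Starts the stand-in worker, samples it 20 times, stops it, and returns the histogram. */
    static Map<String, Integer> runDemo() {
        running = true;
        CountDownLatch started = new CountDownLatch(1);
        Thread worker = new Thread(() -> spin(started), "parser-stand-in");
        worker.setDaemon(true);
        worker.start();
        try {
            started.await();
            return sampleTopFrames(worker, 20, 5);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return new HashMap<>();
        } finally {
            running = false;
        }
    }

    public static void main(String[] args) {
        runDemo().forEach((frame, hits) -> System.out.println(hits + "  " + frame));
    }
}
```

For a process that is already running, the JDK's jstack tool gives the same
kind of snapshot without any code changes.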

Re: Very slow PDF parsing.

Posted by Slava G <sl...@gmail.com>.

I checked this specific Linux server, and no, tesseract is not installed.
I'm configuring the PDF parser with these parameters:

PDFParser tmpPdf = new PDFParser();
PDFParserConfig config = tmpPdf.getPDFParserConfig();
config.setMaxMainMemoryBytes(31457280); // 30 * 1024 * 1024 bytes
config.setExtractAcroFormContent(false);
config.setExtractBookmarksText(false);
config.setCatchIntermediateIOExceptions(true);
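
For reference, the configured value works out to 30 MiB, and according to the
Tika snippet Tim Allison quotes elsewhere in this thread, any non-negative
limit selects PDFBox's mixed mode (main memory up to the limit, then temporary
files) instead of main memory only. The sketch below mirrors only that
arithmetic and that branch; MemoryBudget, Mode, modeFor and toMebibytes are
hypothetical names and are not part of Tika or PDFBox.

```java
class MemoryBudget {

    /** Hypothetical labels for the two branches in the quoted Tika snippet. */
    enum Mode { MAIN_MEMORY_ONLY, MIXED }

    /** Mirrors the quoted check: a non-negative limit selects the mixed mode. */
    static Mode modeFor(long maxMainMemoryBytes) {
        return maxMainMemoryBytes >= 0 ? Mode.MIXED : Mode.MAIN_MEMORY_ONLY;
    }

    /** Converts a byte count to whole mebibytes (1 MiB = 1024 * 1024 bytes). */
    static long toMebibytes(long bytes) {
        return bytes / (1024L * 1024L);
    }

    public static void main(String[] args) {
        long configured = 31457280L; // value passed to setMaxMainMemoryBytes above
        System.out.println(toMebibytes(configured) + " MiB, mode " + modeFor(configured));
    }
}
```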

On Tue, Feb 26, 2019 at 7:13 PM Tim Allison <ta...@apache.org> wrote:

> Sorry...that's an OCR tool.  One thing that can slow down processing
> dramatically is if you have tesseract installed (try typing 'tesseract' on
> your commandline) and if you've turned it on for PDFs.  I suspect this
> isn't your problem, though.
>
>
>
> On Tue, Feb 26, 2019 at 12:08 PM Slava G <sl...@gmail.com> wrote:
>
>> Thanks Tim,
>> To be honest, I'm embarrassed to say I don't know what tesseract is in
>> this context 🙂
>>
>> Thanks
>>
>> On Tue, Feb 26, 2019, 19:04 Tim Allison <ta...@apache.org> wrote:
>>
>>> Thank you, Slava!
>>>
>>> Do you have tesseract installed?
>>>
>>> Colleagues on PDFBox, any recommendations?

Re: Fwd: Very slow PDF parsing.

Posted by Konstantin Gribov <gr...@gmail.com>.

Follow-up: it seems to be fixed, so this is no longer an issue for me. Sorry
for the noise on the lists.

-- 
Best regards,
Konstantin Gribov.


On Thu, Mar 21, 2019 at 7:56 PM Konstantin Gribov <gr...@gmail.com> wrote:

> Slava,
>
> Could you please forward this PDF to private@tika.apache.org (the private,
> Tika PMC-only list)? I had similar issues with some PDFs but was unable to
> get them from the client to look into them with a profiler.
>
> --
> Best regards,
> Konstantin Gribov.
>
>
> On Thu, Feb 28, 2019 at 7:27 PM Slava G <sl...@gmail.com> wrote:
>
>> Tim, which email address should I send the PDF to?
>> Thanks
>>
>> On Thu, Feb 28, 2019 at 3:57 PM Slava G <sl...@gmail.com> wrote:
>>
>>> I will, once I get the customer's approval.
>>> Meanwhile, I can run any checks if you tell me what to check.
>>> Thanks
>>>
>>> On Thu, Feb 28, 2019 at 3:56 PM Tim Allison <ta...@apache.org> wrote:
>>>
>>>> Any chance you can share the file directly with me or someone else on the
>>>> PDFBox team?
>>>>
>>>> On Wed, Feb 27, 2019 at 11:24 AM Slava G <sl...@gmail.com> wrote:
>>>>
>>>> > After 3h 40m it's still parsing using PDFBox 2.0.14 app...
>>>> > Thanks
>>>> >
>>>> > On Wed, Feb 27, 2019 at 3:29 PM Slava G <sl...@gmail.com> wrote:
>>>> >
>>>> >> With 2.0.14 it has been running for 40 minutes with no result and is
>>>> >> still working... It seems the issue is still there.
>>>> >> Thanks
>>>> >>
>>>> >> On Wed, Feb 27, 2019 at 2:52 PM Slava G <sl...@gmail.com> wrote:
>>>> >>
>>>> >>> Checking with 2.0.14. Started as an app. Will update soon.
>>>> >>>
>>>> >>> On Wed, Feb 27, 2019 at 2:47 PM Tim Allison <ta...@apache.org> wrote:
>>>> >>>
>>>> >>>> Any chance you could try with the 2.0.14 release candidate...unless you
>>>> >>>> have already?
>>>> >>>>
>>>> >>>> https://dist.apache.org/repos/dist/dev/pdfbox/2.0.14/
>>>> >>>>
>>>> >>>>
>>>> >>>> On Wed, Feb 27, 2019 at 3:04 AM Slava G <sl...@gmail.com> wrote:
>>>> >>>>
>>>> >>>>> Well, I ran the PDFBox app to extract text (as was suggested); so far
>>>> >>>>> 2 hours and still counting...
>>>> >>>>> It seems to be a PDFBox issue.
>>>> >>>>>
>>>> >>>>> On Wed, Feb 27, 2019 at 9:51 AM JB Data31 <jb...@gmail.com> wrote:
>>>> >>>>>
>>>> >>>>>> Why don't you run a basic test with tika-server in a separate thread
>>>> >>>>>> and a *wget* or *curl* bash client to parse your 65 MB PDF?
>>>> >>>>>> That could make the problem easier to investigate.
>>>> >>>>>>
>>>> >>>>>> @*JB*Δ <http://jbigdata.fr/jbigdata/index.html>
>>>> >>>>>>
>>>> >>>>>>
>>>> >>>>>>
>>>> >>>>>> On Tue, Feb 26, 2019 at 11:05 PM, Cristian Vat <cristian.vat@gmail.com>
>>>> >>>>>> wrote:
>>>> >>>>>>
>>>> >>>>>>> Just looking at the stack trace: it won't be the same anymore due to
>>>> >>>>>>> PDFBOX-4453. That change is in the not-yet-released PDFBox 2.0.14 and
>>>> >>>>>>> alters how decryption is handled. Not sure if it's related, though.
>>>> >>>>>>>
>>>> >>>>>>> Can you reproduce the problem without Tika, using just the PDFBox
>>>> >>>>>>> command-line ExtractText command
>>>> >>>>>>> ( https://pdfbox.apache.org/2.0/commandline.html ) on that file?
>>>> >>>>>>>
>>>> >>>>>>>
>>>> >>>>>>> On Tue, Feb 26, 2019 at 8:24 PM Slava G <sl...@gmail.com> wrote:
>>>> >>>>>>>
>>>> >>>>>>>> This is the code:
>>>> >>>>>>>> InputStream inputStream = TikaInputStream.get(inputFile.toPath());
>>>> >>>>>>>> PDFParser tmpPdf = new PDFParser();
>>>> >>>>>>>> PDFParserConfig config = tmpPdf.getPDFParserConfig();
>>>> >>>>>>>> config.setMaxMainMemoryBytes(31457280);
>>>> >>>>>>>> config.setExtractAcroFormContent(false);
>>>> >>>>>>>> config.setExtractBookmarksText(false);
>>>> >>>>>>>> config.setCatchIntermediateIOExceptions(true);
>>>> >>>>>>>> Metadata metadata = new Metadata();
>>>> >>>>>>>> metadata.set(HttpHelper.CONTENT_TYPE, "application/pdf");
>>>> >>>>>>>> tmpPdf.parse(inputStream, textHandler, metadata, new ParseContext());
>>>> >>>>>>>>
>>>> >>>>>>>>
>>>> >>>>>>>> On Tue, Feb 26, 2019 at 8:02 PM Tim Allison <tallison@apache.org>
>>>> >>>>>>>> wrote:
>>>> >>>>>>>>
>>>> >>>>>>>>>
>>>> >>>>>>>>> This is the default in Tika, where the default for
>>>> >>>>>>>>> maxMainMemoryBytes=500MB.
>>>> >>>>>>>>>
>>>> >>>>>>>>> Slava, how are you calling this in Tika?  With a TikaInputStream
>>>> >>>>>>>>> via tika-app or tika-server or something else?
>>>> >>>>>>>>>
>>>> >>>>>>>>> MemoryUsageSetting memoryUsageSetting =
>>>> >>>>>>>>>         MemoryUsageSetting.setupMainMemoryOnly();
>>>> >>>>>>>>> if (localConfig.getMaxMainMemoryBytes() >= 0) {
>>>> >>>>>>>>>     memoryUsageSetting =
>>>> >>>>>>>>>             MemoryUsageSetting.setupMixed(localConfig.getMaxMainMemoryBytes());
>>>> >>>>>>>>> }
>>>> >>>>>>>>> if (tstream != null && tstream.hasFile()) {
>>>> >>>>>>>>>     // File based -- send file directly to PDFBox
>>>> >>>>>>>>>     pdfDocument = PDDocument.load(tstream.getPath().toFile(),
>>>> >>>>>>>>>             password, memoryUsageSetting);
>>>> >>>>>>>>> } else {
>>>> >>>>>>>>>     pdfDocument = PDDocument.load(new CloseShieldInputStream(stream),
>>>> >>>>>>>>>             password, memoryUsageSetting);
>>>> >>>>>>>>> }
>>>> >>>>>>>>>
>>>> >>>>>>>>> On Tue, Feb 26, 2019 at 12:43 PM Tilman Hausherr <
>>>> >>>>>>>>> THausherr@t-online.de> wrote:
>>>> >>>>>>>>>
>>>> >>>>>>>>>> Hi,
>>>> >>>>>>>>>>
>>>> >>>>>>>>>> As usual, it would be nice to have the PDF, so that we could run
>>>> >>>>>>>>>> the profiler.
>>>> >>>>>>>>>>
>>>> >>>>>>>>>> The HashSet is used to avoid decrypting objects twice.
>>>> >>>>>>>>>>
>>>> >>>>>>>>>> The "not encrypted" file is likely encrypted with an empty user
>>>> >>>>>>>>>> password.
>>>> >>>>>>>>>>
>>>> >>>>>>>>>> It would also be interesting to hear what parameter is passed to
>>>> >>>>>>>>>> MemoryUsageSetting when load() is called.
>>>> >>>>>>>>>>
>>>> >>>>>>>>>> Tilman
>>>> >>>>>>>>>>
>>>> >>>>>>>>>>
>>>> >>>>>>>>>>
>>>> >>>>>>>>>> On 26.02.2019 at 18:14, Tim Allison wrote:
>>>> >>>>>>>>>> > PDFBox Colleagues,
>>>> >>>>>>>>>> >    Any ideas?
>>>> >>>>>>>>>> >
>>>> >>>>>>>>>> > ---------- Forwarded message ---------
>>>> >>>>>>>>>> > From: Tim Allison <ta...@apache.org>
>>>> >>>>>>>>>> > Date: Tue, Feb 26, 2019 at 12:13 PM
>>>> >>>>>>>>>> > Subject: Re: Very slow PDF parsing.
>>>> >>>>>>>>>> > To: <us...@tika.apache.org>
>>>> >>>>>>>>>> >
>>>> >>>>>>>>>> >
>>>> >>>>>>>>>> > Sorry...that's an OCR tool.  One thing that can slow down processing
>>>> >>>>>>>>>> > dramatically is if you have tesseract installed (try typing 'tesseract' on
>>>> >>>>>>>>>> > your commandline) and if you've turned it on for PDFs.  I suspect this
>>>> >>>>>>>>>> > isn't your problem, though.
>>>> >>>>>>>>>> >
>>>> >>>>>>>>>> >
>>>> >>>>>>>>>> >
>>>> >>>>>>>>>> > On Tue, Feb 26, 2019 at 12:08 PM Slava G <slavago@gmail.com> wrote:
>>>> >>>>>>>>>> >
>>>> >>>>>>>>>> >> Thanks Tim,
>>>> >>>>>>>>>> >> To be honest, I'm embarrassed to say I don't know what tesseract is in
>>>> >>>>>>>>>> >> this context 🙂
>>>> >>>>>>>>>> >>
>>>> >>>>>>>>>> >> Thanks
>>>> >>>>>>>>>> >>
>>>> >>>>>>>>>> >> On Tue, Feb 26, 2019, 19:04 Tim Allison <tallison@apache.org> wrote:
>>>> >>>>>>>>>> >>
>>>> >>>>>>>>>> >>> Thank you, Slava!
>>>> >>>>>>>>>> >>>
>>>> >>>>>>>>>> >>> Do you have tesseract installed?
>>>> >>>>>>>>>> >>>
>>>> >>>>>>>>>> >>> Colleagues on PDFBox, any recommendations?
>>>> >>>>>>>>>> >>>
>>>>
>>>
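
Tilman's remark in the quoted thread, that the HashSet is used to avoid
decrypting objects twice, describes a common "visit each object once" guard
for recursive traversals. The sketch below shows that pattern with plain Java
maps and lists standing in for PDF dictionaries and arrays; it is not PDFBox's
SecurityHandler code, which is not shown in this thread. VisitOnce,
countContainers and identitySet are hypothetical names, and tracking visited
objects by reference (rather than by equals()) is a design choice made for
this sketch, not something the thread states about PDFBox.

```java
import java.util.Collections;
import java.util.IdentityHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

class VisitOnce {

    /** Walks nested maps and lists, counting each container object exactly once. */
    static int countContainers(Object node, Set<Object> visited) {
        Iterable<?> children;
        if (node instanceof Map) {
            children = ((Map<?, ?>) node).values();
        } else if (node instanceof List) {
            children = (List<?>) node;
        } else {
            return 0; // leaf value: nothing to recurse into
        }
        if (!visited.add(node)) {
            return 0; // already processed: skip it, which also breaks reference cycles
        }
        int count = 1;
        for (Object child : children) {
            count += countContainers(child, visited);
        }
        return count;
    }

    /** A set that compares members by reference (==) instead of equals()/hashCode(). */
    static Set<Object> identitySet() {
        return Collections.newSetFromMap(new IdentityHashMap<>());
    }
}
```

With a reference-based set, membership checks never call equals() or
hashCode() on the tracked objects, so the guard stays cheap even when many
objects have equal content, and it is safe on structures that refer back to
themselves.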

Re: Fwd: Very slow PDF parsing.

Posted by Konstantin Gribov <gr...@gmail.com>.

Follow up, it seems to be fixed, so not actual for me anymore. Sorry for
this bit of noise in lists)

-- 
Best regards,
Konstantin Gribov.


On Thu, Mar 21, 2019 at 7:56 PM Konstantin Gribov <gr...@gmail.com> wrote:

> Slava,
>
> Could you please forward this pdf to private@tika.apache.org (Tika PMC
> only private list)? I had similar issues with some pdf but were unable to
> get them from client to look into it with profiler.
>
> --
> Best regards,
> Konstantin Gribov.
>
>
> On Thu, Feb 28, 2019 at 7:27 PM Slava G <sl...@gmail.com> wrote:
>
>> Tim, to what email to send you the PDF ?
>> Thanks
>>
>> On Thu, Feb 28, 2019 at 3:57 PM Slava G <sl...@gmail.com> wrote:
>>
>>> I'll once I'll get customer's approval.
>>> Meanwhile I can do any checks, if you can specify what to check.
>>> Thanks
>>>
>>> On Thu, Feb 28, 2019 at 3:56 PM Tim Allison <ta...@apache.org> wrote:
>>>
>>>> Any chance you can share the file directly w me or someone else on the
>>>> PDFBox team?
>>>>
>>>> On Wed, Feb 27, 2019 at 11:24 AM Slava G <sl...@gmail.com> wrote:
>>>>
>>>> > After 3h 40m it's still parsing using PDFBox 2.0.14 app...
>>>> > Thanks
>>>> >
>>>> > On Wed, Feb 27, 2019 at 3:29 PM Slava G <sl...@gmail.com> wrote:
>>>> >
>>>> >> With 2.0.14 it's 40 minutes running, no result, still working...
>>>> >> Seems that issue is still there.
>>>> >> Thanks
>>>> >>
>>>> >> On Wed, Feb 27, 2019 at 2:52 PM Slava G <sl...@gmail.com> wrote:
>>>> >>
>>>> >>> Checking with 2.0.14. Started as an app. Will update soon.
>>>> >>>
>>>> >>> On Wed, Feb 27, 2019 at 2:47 PM Tim Allison <ta...@apache.org>
>>>> wrote:
>>>> >>>
>>>> >>>> Any chance you could try with the 2.0.14 release
>>>> candidate...unless you
>>>> >>>> have already?
>>>> >>>>
>>>> >>>> https://dist.apache.org/repos/dist/dev/pdfbox/2.0.14/
>>>> >>>>
>>>> >>>>
>>>> >>>> On Wed, Feb 27, 2019 at 3:04 AM Slava G <sl...@gmail.com> wrote:
>>>> >>>>
>>>> >>>>> Well, I ran (as was suggested) PDFBox app to extract text , so
>>>> far 2
>>>> >>>>> hours and still counting...
>>>> >>>>> It's seems to be a PDFBox issue.
>>>> >>>>>
>>>> >>>>> On Wed, Feb 27, 2019 at 9:51 AM JB Data31 <jb...@gmail.com>
>>>> wrote:
>>>> >>>>>
>>>> >>>>>> Why don't you do a basic test with tika server in a 3thrd and a
>>>> >>>>>> *wget* or *curl* bash client to parse your 65Mo PDF.
>>>> >>>>>> It can be easier to investigate the problem.
>>>> >>>>>>
>>>> >>>>>> @*JB*Δ <http://jbigdata.fr/jbigdata/index.html>
>>>> >>>>>>
>>>> >>>>>>
>>>> >>>>>>
>>>> >>>>>> Le mar. 26 févr. 2019 à 23:05, Cristian Vat <
>>>> cristian.vat@gmail.com>
>>>> >>>>>> a écrit :
>>>> >>>>>>
>>>> >>>>>>> Just looking at the stack trace it won't be the same anymore
>>>> due to
>>>> >>>>>>> PDFBOX-4453
>>>> >>>>>>> Some changes present in not yet released pdfbox 2.0.14 and it
>>>> >>>>>>> changes how decryption is handled. Not sure if related though.
>>>> >>>>>>>
>>>> >>>>>>> Can you duplicate the problem without Tika using just PDFBox
>>>> >>>>>>> command-line ExtractText command (
>>>> >>>>>>> https://pdfbox.apache.org/2.0/commandline.html ) on that file?
>>>> >>>>>>>
>>>> >>>>>>>
>>>> >>>>>>> On Tue, Feb 26, 2019 at 8:24 PM Slava G <sl...@gmail.com>
>>>> wrote:
>>>> >>>>>>>
>>>> >>>>>>>> This is the code :
>>>> >>>>>>>> InputStream in = TikaInputStream.get(inputFile.toPath());
>>>> >>>>>>>> PDFParser tmpPdf = new PDFParser();
>>>> >>>>>>>> PDFParserConfig config = tmpPdf.getPDFParserConfig();
>>>> >>>>>>>> config.setMaxMainMemoryBytes(31457280);
>>>> >>>>>>>> config.setExtractAcroFormContent(false);
>>>> >>>>>>>> config.setExtractBookmarksText(false);
>>>> >>>>>>>> config.setCatchIntermediateIOExceptions(true);
>>>> >>>>>>>> Metadata metadata = new Metadata();
>>>> >>>>>>>> metadata.set(HttpHelper.CONTENT_TYPE, "application/pdf");
>>>> >>>>>>>> tmpPdf.parse(inputStream, textHandler, this.metadata, new
>>>> >>>>>>>> ParseContext());
>>>> >>>>>>>>
>>>> >>>>>>>>
>>>> >>>>>>>> On Tue, Feb 26, 2019 at 8:02 PM Tim Allison <
>>>> tallison@apache.org>
>>>> >>>>>>>> wrote:
>>>> >>>>>>>>
>>>> >>>>>>>>>
>>>> >>>>>>>>> This is the default in Tika, where the default for
>>>> >>>>>>>>> maxMainMemoryBytes=500MB.
>>>> >>>>>>>>>
>>>> >>>>>>>>> Slava, how are you calling this in Tika?  With a
>>>> TikaInputStream
>>>> >>>>>>>>> via tika-app or tika-server or something else?
>>>> >>>>>>>>>
>>>> >>>>>>>>> MemoryUsageSetting memoryUsageSetting =
>>>> >>>>>>>>> MemoryUsageSetting.setupMainMemoryOnly();
>>>> >>>>>>>>> if (localConfig.getMaxMainMemoryBytes() >= 0) {
>>>> >>>>>>>>> memoryUsageSetting =
>>>> >>>>>>>>>
>>>> MemoryUsageSetting.setupMixed(localConfig.getMaxMainMemoryBytes());
>>>> >>>>>>>>> }
>>>> >>>>>>>>> if (tstream != null && tstream.hasFile()) {
>>>> >>>>>>>>> // File based -- send file directly to PDFBox
>>>> >>>>>>>>> pdfDocument = PDDocument.load(tstream.getPath().toFile(),
>>>> >>>>>>>>> password, memoryUsageSetting);
>>>> >>>>>>>>> } else {
>>>> >>>>>>>>> pdfDocument = PDDocument.load(new
>>>> CloseShieldInputStream(stream),
>>>> >>>>>>>>> password, memoryUsageSetting);
>>>> >>>>>>>>> }
>>>> >>>>>>>>>
>>>> >>>>>>>>> On Tue, Feb 26, 2019 at 12:43 PM Tilman Hausherr <
>>>> >>>>>>>>> THausherr@t-online.de> wrote:
>>>> >>>>>>>>>
>>>> >>>>>>>>>> Hi,
>>>> >>>>>>>>>>
>>>> >>>>>>>>>> As usual, it would be nice to have the PDF, so that we could
>>>> run
>>>> >>>>>>>>>> the
>>>> >>>>>>>>>> profiler.
>>>> >>>>>>>>>>
>>>> >>>>>>>>>> The HashSet is used to avoid decrypting objects twice.
>>>> >>>>>>>>>>
>>>> >>>>>>>>>> The "not encrypted" file is likely encrypted with an empty
>>>> user
>>>> >>>>>>>>>> password.
>>>> >>>>>>>>>>
>>>> >>>>>>>>>> It would also be interesting to hear what parameter is
>>>> passed to
>>>> >>>>>>>>>> MemoryUsageSetting when load() is called.
>>>> >>>>>>>>>>
>>>> >>>>>>>>>> Tilman
>>>> >>>>>>>>>>
>>>> >>>>>>>>>>
>>>> >>>>>>>>>>
>>>> >>>>>>>>>> Am 26.02.2019 um 18:14 schrieb Tim Allison:
>>>> >>>>>>>>>> > PDFBox Colleagues,
>>>> >>>>>>>>>> >    Any ideas?
>>>> >>>>>>>>>> >
>>>> >>>>>>>>>> > ---------- Forwarded message ---------
>>>> >>>>>>>>>> > From: Tim Allison <ta...@apache.org>
>>>> >>>>>>>>>> > Date: Tue, Feb 26, 2019 at 12:13 PM
>>>> >>>>>>>>>> > Subject: Re: Very slow PDF parsing.
>>>> >>>>>>>>>> > To: <us...@tika.apache.org>
>>>> >>>>>>>>>> >
>>>> >>>>>>>>>> >
>>>> >>>>>>>>>> > Sorry...that's an OCR tool.  One thing that can slow down
>>>> >>>>>>>>>> processing
>>>> >>>>>>>>>> > dramatically is if you have tesseract installed (try typing
>>>> >>>>>>>>>> 'tesseract' on
>>>> >>>>>>>>>> > your commandline) and if you've turned it on for PDFs.  I
>>>> >>>>>>>>>> suspect this
>>>> >>>>>>>>>> > isn't your problem, though.
>>>> >>>>>>>>>> >
>>>> >>>>>>>>>> >
>>>> >>>>>>>>>> >
>>>> >>>>>>>>>> > On Tue, Feb 26, 2019 at 12:08 PM Slava G <
>>>> slavago@gmail.com>
>>>> >>>>>>>>>> wrote:
>>>> >>>>>>>>>> >
>>>> >>>>>>>>>> >> Thanks Tim,
>>>> >>>>>>>>>> >> But frankly speaking, it's a shame, but don't know what is
>>>> >>>>>>>>>> tessercat is in
>>>> >>>>>>>>>> >> this context 🙂
>>>> >>>>>>>>>> >>
>>>> >>>>>>>>>> >> Thanks
>>>> >>>>>>>>>> >>
>>>> >>>>>>>>>> >> On Tue, Feb 26, 2019, 19:04 Tim Allison <
>>>> tallison@apache.org>
>>>> >>>>>>>>>> wrote:
>>>> >>>>>>>>>> >>
>>>> >>>>>>>>>> >>> Thank you, Slava!
>>>> >>>>>>>>>> >>>
>>>> >>>>>>>>>> >>> Do you have tesseract installed?
>>>> >>>>>>>>>> >>>
>>>> >>>>>>>>>> >>> Colleagues on PDFBox, any recommendations?
>>>> >>>>>>>>>> >>>
>>>> >>>>>>>>>> >>> On Tue, Feb 26, 2019 at 11:56 AM Slava G <
>>>> slavago@gmail.com>
>>>> >>>>>>>>>> wrote:
>>>> >>>>>>>>>> >>>> Hi,
>>>> >>>>>>>>>> >>>>
>>>> >>>>>>>>>> >>>> I have large PDF (about 65mb) that contains mainly text
>>>> and
>>>> >>>>>>>>>> some images.
>>>> >>>>>>>>>> >>>>
>>>> >>>>>>>>>> >>>> Parsing of such PDF can take about 2 days or even more
>>>> (TIKA
>>>> >>>>>>>>>> 1.19.1
>>>> >>>>>>>>>> >>> running on XEON server with 4 cores CPU and 30GB RAM
>>>> with SSD
>>>> >>>>>>>>>> disk, running
>>>> >>>>>>>>>> >>> CentOS Linux).
>>>> >>>>>>>>>> >>>> Please advise if there anything I can do to speedup.Or
>>>> maybe
>>>> >>>>>>>>>> it's a bug
>>>> >>>>>>>>>> >>> in PDFBox ?
>>>> >>>>>>>>>> >>>> When I'm printing java stack , I see all the time in
>>>> this
>>>> >>>>>>>>>> stack :
>>>> >>>>>>>>>> >>>>
>>>> >>>>>>>>>> >>>> at
>>>> org.apache.pdfbox.cos.COSString.equals(COSString.java:259)
>>>> >>>>>>>>>> >>>>
>>>> >>>>>>>>>> >>>> at java.util.HashMap$TreeNode.find(Unknown Source)
>>>> >>>>>>>>>> >>>>
>>>> >>>>>>>>>> >>>> at java.util.HashMap$TreeNode.find(Unknown Source)
>>>> >>>>>>>>>> >>>>
>>>> >>>>>>>>>> >>>> at java.util.HashMap$TreeNode.find(Unknown Source)
>>>> >>>>>>>>>> >>>>
>>>> >>>>>>>>>> >>>> at java.util.HashMap$TreeNode.find(Unknown Source)
>>>> >>>>>>>>>> >>>>
>>>> >>>>>>>>>> >>>> at java.util.HashMap$TreeNode.find(Unknown Source)
>>>> >>>>>>>>>> >>>>
>>>> >>>>>>>>>> >>>> at java.util.HashMap$TreeNode.find(Unknown Source)
>>>> >>>>>>>>>> >>>>
>>>> >>>>>>>>>> >>>> at java.util.HashMap$TreeNode.find(Unknown Source)
>>>> >>>>>>>>>> >>>>
>>>> >>>>>>>>>> >>>> at java.util.HashMap$TreeNode.find(Unknown Source)
>>>> >>>>>>>>>> >>>>
>>>> >>>>>>>>>> >>>> at java.util.HashMap$TreeNode.find(Unknown Source)
>>>> >>>>>>>>>> >>>>
>>>> >>>>>>>>>> >>>> at java.util.HashMap$TreeNode.getTreeNode(Unknown Source)
>>>> >>>>>>>>>> >>>>
>>>> >>>>>>>>>> >>>> at java.util.HashMap.getNode(Unknown Source)
>>>> >>>>>>>>>> >>>>
>>>> >>>>>>>>>> >>>> at java.util.HashMap.containsKey(Unknown Source)
>>>> >>>>>>>>>> >>>>
>>>> >>>>>>>>>> >>>> at java.util.HashSet.contains(Unknown Source)
>>>> >>>>>>>>>> >>>>
>>>> >>>>>>>>>> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:390)
>>>> >>>>>>>>>> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decryptDictionary(SecurityHandler.java:517)
>>>> >>>>>>>>>> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:404)
>>>> >>>>>>>>>> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decryptArray(SecurityHandler.java:577)
>>>> >>>>>>>>>> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:408)
>>>> >>>>>>>>>> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decryptDictionary(SecurityHandler.java:517)
>>>> >>>>>>>>>> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:404)
>>>> >>>>>>>>>> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decryptDictionary(SecurityHandler.java:517)
>>>> >>>>>>>>>> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:404)
>>>> >>>>>>>>>> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decryptArray(SecurityHandler.java:577)
>>>> >>>>>>>>>> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:408)
>>>> >>>>>>>>>> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decryptDictionary(SecurityHandler.java:517)
>>>> >>>>>>>>>> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:404)
>>>> >>>>>>>>>> >>>> at org.apache.pdfbox.pdfparser.COSParser.parseFileObject(COSParser.java:946)
>>>> >>>>>>>>>> >>>> at org.apache.pdfbox.pdfparser.COSParser.parseObjectDynamically(COSParser.java:874)
>>>> >>>>>>>>>> >>>> at org.apache.pdfbox.pdfparser.COSParser.parseObjectDynamically(COSParser.java:794)
>>>> >>>>>>>>>> >>>> at org.apache.pdfbox.pdfparser.COSParser.parseDictObjects(COSParser.java:754)
>>>> >>>>>>>>>> >>>> at org.apache.pdfbox.pdfparser.PDFParser.initialParse(PDFParser.java:185)
>>>> >>>>>>>>>> >>>> at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:220)
>>>> >>>>>>>>>> >>>> at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1028)
>>>> >>>>>>>>>> >>>> at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:984)
>>>> >>>>>>>>>> >>>> at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:152)
>>>> >>>>>>>>>> >>>>
>>>> >>>>>>>>>> >>>>
>>>> >>>>>>>>>> >>>> P.S. Btw, the PDF is not encrypted at all.
>>>> >>>>>>>>>> >>>>
>>>> >>>>>>>>>> >>>> Thanks
>>>> >>>>>>>>>>
>>>> >>>>>>>>>>
>>>> >>>>>>>>>>
>>>> >>>>>>>>>>
>>>> >>>>>>>>>>
>>>> ---------------------------------------------------------------------
>>>> >>>>>>>>>> To unsubscribe, e-mail: users-unsubscribe@pdfbox.apache.org
>>>> >>>>>>>>>> For additional commands, e-mail:
>>>> users-help@pdfbox.apache.org
>>>> >>>>>>>>>>
>>>> >>>>>>>>>>
>>>>
>>>
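
A note on the stack trace quoted in this thread: the nested
HashMap$TreeNode.find frames are informative on their own. In the JDK's
HashMap (Java 8 and later), find() only calls itself recursively when keys in
the same bin have equal hash codes and cannot be ordered as Comparable, so a
lookup has to walk the bin and call equals() on its entries. Whether that is
exactly what happens to the COSString keys in this particular file is an
assumption until someone profiles the PDF. The sketch below is not PDFBox
code; it uses a hypothetical Key class to reproduce only the general effect,
counting equals() calls for a HashSet used as a "seen" set.

```java
import java.util.HashSet;
import java.util.Set;

public class CollidingKeyDemo {

    /** Hypothetical key type for illustration only; this is NOT PDFBox's COSString. */
    static final class Key {
        static long equalsCalls = 0;   // global counter, reset before each run
        final int id;
        final boolean collide;

        Key(int id, boolean collide) {
            this.id = id;
            this.collide = collide;
        }

        @Override
        public int hashCode() {
            // When 'collide' is set, every key lands in the same hash bin.
            return collide ? 42 : Integer.hashCode(id);
        }

        @Override
        public boolean equals(Object o) {
            equalsCalls++;
            return o instanceof Key && ((Key) o).id == id;
        }
    }

    /** Inserts n distinct keys with a contains-then-add pattern and returns the equals() call count. */
    static long equalsCallsFor(int n, boolean collide) {
        Set<Key> seen = new HashSet<>();
        Key.equalsCalls = 0;
        for (int i = 0; i < n; i++) {
            Key k = new Key(i, collide);
            if (!seen.contains(k)) {   // same call as HashSet.contains in the trace
                seen.add(k);
            }
        }
        return Key.equalsCalls;
    }

    public static void main(String[] args) {
        int n = 5000;
        System.out.println("distinct hashes : " + equalsCallsFor(n, false) + " equals() calls");
        System.out.println("colliding hashes: " + equalsCallsFor(n, true) + " equals() calls");
    }
}
```

With colliding hashes the number of equals() calls grows roughly with the
square of the number of keys, which is the kind of growth that could turn a
single load() into hours. It does not show that this is the cause here.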

Re: Fwd: Very slow PDF parsing.

Posted by Konstantin Gribov <gr...@gmail.com>.

Slava,

Could you please forward this PDF to private@tika.apache.org (the private
list, visible to the Tika PMC only)? I had similar issues with some PDFs but
was unable to get them from the client to look into with a profiler.

-- 
Best regards,
Konstantin Gribov.


On Thu, Feb 28, 2019 at 7:27 PM Slava G <sl...@gmail.com> wrote:

> Tim, which email address should I send the PDF to?
> Thanks
>
> On Thu, Feb 28, 2019 at 3:57 PM Slava G <sl...@gmail.com> wrote:
>
>> I will, once I get the customer's approval.
>> Meanwhile I can do any checks, if you can specify what to check.
>> Thanks
>>
>> On Thu, Feb 28, 2019 at 3:56 PM Tim Allison <ta...@apache.org> wrote:
>>
>>> Any chance you can share the file directly w me or someone else on the
>>> PDFBox team?
>>>
>>> On Wed, Feb 27, 2019 at 11:24 AM Slava G <sl...@gmail.com> wrote:
>>>
>>> > After 3h 40m it's still parsing using PDFBox 2.0.14 app...
>>> > Thanks
>>> >
>>> > On Wed, Feb 27, 2019 at 3:29 PM Slava G <sl...@gmail.com> wrote:
>>> >
>>> >> With 2.0.14 it's 40 minutes running, no result, still working...
>>> >> Seems that issue is still there.
>>> >> Thanks
>>> >>
>>> >> On Wed, Feb 27, 2019 at 2:52 PM Slava G <sl...@gmail.com> wrote:
>>> >>
>>> >>> Checking with 2.0.14. Started as an app. Will update soon.
>>> >>>
>>> >>> On Wed, Feb 27, 2019 at 2:47 PM Tim Allison <ta...@apache.org> wrote:
>>> >>>
>>> >>>> Any chance you could try with the 2.0.14 release candidate...unless
>>> >>>> you have already?
>>> >>>>
>>> >>>> https://dist.apache.org/repos/dist/dev/pdfbox/2.0.14/
>>> >>>>
>>> >>>>
>>> >>>> On Wed, Feb 27, 2019 at 3:04 AM Slava G <sl...@gmail.com> wrote:
>>> >>>>
>>> >>>>> Well, I ran (as was suggested) the PDFBox app to extract text; so far
>>> >>>>> 2 hours and still counting...
>>> >>>>> It seems to be a PDFBox issue.
>>> >>>>>
>>> >>>>> On Wed, Feb 27, 2019 at 9:51 AM JB Data31 <jb...@gmail.com> wrote:
>>> >>>>>
>>> >>>>>> Why don't you do a basic test with tika server in a 3thrd and a
>>> >>>>>> *wget* or *curl* bash client to parse your 65 MB PDF.
>>> >>>>>> It can be easier to investigate the problem.
>>> >>>>>>
>>> >>>>>> @*JB*Δ <http://jbigdata.fr/jbigdata/index.html>
>>> >>>>>>
>>> >>>>>>
>>> >>>>>>
>>> >>>>>> On Tue, Feb 26, 2019 at 11:05 PM Cristian Vat <cristian.vat@gmail.com>
>>> >>>>>> wrote:
>>> >>>>>>
>>> >>>>>>> Just looking at the stack trace, it won't be the same anymore due to
>>> >>>>>>> PDFBOX-4453.
>>> >>>>>>> Some changes are present in the not-yet-released PDFBox 2.0.14 that
>>> >>>>>>> change how decryption is handled. Not sure if related, though.
>>> >>>>>>>
>>> >>>>>>> Can you duplicate the problem without Tika using just PDFBox
>>> >>>>>>> command-line ExtractText command (
>>> >>>>>>> https://pdfbox.apache.org/2.0/commandline.html ) on that file?
>>> >>>>>>>
>>> >>>>>>>
>>> >>>>>>> On Tue, Feb 26, 2019 at 8:24 PM Slava G <sl...@gmail.com> wrote:
>>> >>>>>>>
>>> >>>>>>>> This is the code:
>>> >>>>>>>> InputStream inputStream = TikaInputStream.get(inputFile.toPath());
>>> >>>>>>>> PDFParser tmpPdf = new PDFParser();
>>> >>>>>>>> PDFParserConfig config = tmpPdf.getPDFParserConfig();
>>> >>>>>>>> config.setMaxMainMemoryBytes(31457280); // 30 MiB
>>> >>>>>>>> config.setExtractAcroFormContent(false);
>>> >>>>>>>> config.setExtractBookmarksText(false);
>>> >>>>>>>> config.setCatchIntermediateIOExceptions(true);
>>> >>>>>>>> Metadata metadata = new Metadata();
>>> >>>>>>>> metadata.set(HttpHelper.CONTENT_TYPE, "application/pdf");
>>> >>>>>>>> tmpPdf.parse(inputStream, textHandler, metadata, new ParseContext());
>>> >>>>>>>>
>>> >>>>>>>>
>>> >>>>>>>> On Tue, Feb 26, 2019 at 8:02 PM Tim Allison <tallison@apache.org>
>>> >>>>>>>> wrote:
>>> >>>>>>>>
>>> >>>>>>>>>
>>> >>>>>>>>> This is the default in Tika, where the default for
>>> >>>>>>>>> maxMainMemoryBytes=500MB.
>>> >>>>>>>>>
>>> >>>>>>>>> Slava, how are you calling this in Tika?  With a TikaInputStream
>>> >>>>>>>>> via tika-app or tika-server or something else?
>>> >>>>>>>>>
>>> >>>>>>>>> MemoryUsageSetting memoryUsageSetting =
>>> >>>>>>>>>         MemoryUsageSetting.setupMainMemoryOnly();
>>> >>>>>>>>> if (localConfig.getMaxMainMemoryBytes() >= 0) {
>>> >>>>>>>>>     memoryUsageSetting =
>>> >>>>>>>>>         MemoryUsageSetting.setupMixed(localConfig.getMaxMainMemoryBytes());
>>> >>>>>>>>> }
>>> >>>>>>>>> if (tstream != null && tstream.hasFile()) {
>>> >>>>>>>>>     // File based -- send file directly to PDFBox
>>> >>>>>>>>>     pdfDocument = PDDocument.load(tstream.getPath().toFile(),
>>> >>>>>>>>>             password, memoryUsageSetting);
>>> >>>>>>>>> } else {
>>> >>>>>>>>>     pdfDocument = PDDocument.load(new CloseShieldInputStream(stream),
>>> >>>>>>>>>             password, memoryUsageSetting);
>>> >>>>>>>>> }
>>> >>>>>>>>>
>>> >>>>>>>>> On Tue, Feb 26, 2019 at 12:43 PM Tilman Hausherr <
>>> >>>>>>>>> THausherr@t-online.de> wrote:
>>> >>>>>>>>>
>>> >>>>>>>>>> Hi,
>>> >>>>>>>>>>
>>> >>>>>>>>>> As usual, it would be nice to have the PDF, so that we could run
>>> >>>>>>>>>> the profiler.
>>> >>>>>>>>>>
>>> >>>>>>>>>> The HashSet is used to avoid decrypting objects twice.
>>> >>>>>>>>>>
>>> >>>>>>>>>> The "not encrypted" file is likely encrypted with an empty user
>>> >>>>>>>>>> password.
>>> >>>>>>>>>>
>>> >>>>>>>>>> It would also be interesting to hear what parameter is passed to
>>> >>>>>>>>>> MemoryUsageSetting when load() is called.
>>> >>>>>>>>>>
>>> >>>>>>>>>> Tilman
>>> >>>>>>>>>>
>>> >>>>>>>>>>
>>> >>>>>>>>>>
>>> >>>>>>>>>> On 26.02.2019 at 18:14, Tim Allison wrote:
>>> >>>>>>>>>> > PDFBox Colleagues,
>>> >>>>>>>>>> >    Any ideas?
>>> >>>>>>>>>> >
>>> >>>>>>>>>> > ---------- Forwarded message ---------
>>> >>>>>>>>>> > From: Tim Allison <ta...@apache.org>
>>> >>>>>>>>>> > Date: Tue, Feb 26, 2019 at 12:13 PM
>>> >>>>>>>>>> > Subject: Re: Very slow PDF parsing.
>>> >>>>>>>>>> > To: <us...@tika.apache.org>
>>> >>>>>>>>>> >
>>> >>>>>>>>>> >
>>> >>>>>>>>>> > Sorry...that's an OCR tool.  One thing that can slow down
>>> >>>>>>>>>> > processing dramatically is if you have tesseract installed (try
>>> >>>>>>>>>> > typing 'tesseract' on your commandline) and if you've turned it
>>> >>>>>>>>>> > on for PDFs.  I suspect this isn't your problem, though.
>>> >>>>>>>>>> >
>>> >>>>>>>>>> >
>>> >>>>>>>>>> >
>>> >>>>>>>>>> > On Tue, Feb 26, 2019 at 12:08 PM Slava G <slavago@gmail.com> wrote:
>>> >>>>>>>>>> >
>>> >>>>>>>>>> >> Thanks Tim,
>>> >>>>>>>>>> >> But frankly speaking, it's a shame, but I don't know what
>>> >>>>>>>>>> >> tesseract is in this context 🙂
>>> >>>>>>>>>> >>
>>> >>>>>>>>>> >> Thanks
>>> >>>>>>>>>> >>
>>> >>>>>>>>>> >> On Tue, Feb 26, 2019, 19:04 Tim Allison <tallison@apache.org> wrote:
>>> >>>>>>>>>> >>
>>> >>>>>>>>>> >>> Thank you, Slava!
>>> >>>>>>>>>> >>>
>>> >>>>>>>>>> >>> Do you have tesseract installed?
>>> >>>>>>>>>> >>>
>>> >>>>>>>>>> >>> Colleagues on PDFBox, any recommendations?
>>> >>>>>>>>>> >>>
>>> >>>>>>>>>> >>> On Tue, Feb 26, 2019 at 11:56 AM Slava G <slavago@gmail.com> wrote:
>>> >>>>>>>>>> >>>> Hi,
>>> >>>>>>>>>> >>>>
>>> >>>>>>>>>> >>>> I have large PDF (about 65mb) that contains mainly text and
>>> >>>>>>>>>> >>>> some images.
>>> >>>>>>>>>> >>>>
>>> >>>>>>>>>> >>>> Parsing of such PDF can take about 2 days or even more (TIKA
>>> >>>>>>>>>> >>>> 1.19.1 running on XEON server with 4 cores CPU and 30GB RAM
>>> >>>>>>>>>> >>>> with SSD disk, running CentOS Linux).
>>> >>>>>>>>>> >>>>
>>> >>>>>>>>>> >>>> Please advise if there anything I can do to speedup.Or maybe
>>> >>>>>>>>>> >>>> it's a bug in PDFBox ?
>>> >>>>>>>>>> >>>>
>>> >>>>>>>>>> >>>> When I'm printing java stack , I see all the time in this stack :
>>> >>>>>>>>>> >>>>
>>> >>>>>>>>>> >>>> at org.apache.pdfbox.cos.COSString.equals(COSString.java:259)
>>> >>>>>>>>>> >>>> at java.util.HashMap$TreeNode.find(Unknown Source)
>>> >>>>>>>>>> >>>> at java.util.HashMap$TreeNode.find(Unknown Source)
>>> >>>>>>>>>> >>>> at java.util.HashMap$TreeNode.find(Unknown Source)
>>> >>>>>>>>>> >>>> at java.util.HashMap$TreeNode.find(Unknown Source)
>>> >>>>>>>>>> >>>> at java.util.HashMap$TreeNode.find(Unknown Source)
>>> >>>>>>>>>> >>>> at java.util.HashMap$TreeNode.find(Unknown Source)
>>> >>>>>>>>>> >>>> at java.util.HashMap$TreeNode.find(Unknown Source)
>>> >>>>>>>>>> >>>> at java.util.HashMap$TreeNode.find(Unknown Source)
>>> >>>>>>>>>> >>>> at java.util.HashMap$TreeNode.find(Unknown Source)
>>> >>>>>>>>>> >>>> at java.util.HashMap$TreeNode.getTreeNode(Unknown Source)
>>> >>>>>>>>>> >>>> at java.util.HashMap.getNode(Unknown Source)
>>> >>>>>>>>>> >>>> at java.util.HashMap.containsKey(Unknown Source)
>>> >>>>>>>>>> >>>> at java.util.HashSet.contains(Unknown Source)
>>> >>>>>>>>>> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:390)
>>> >>>>>>>>>> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decryptDictionary(SecurityHandler.java:517)
>>> >>>>>>>>>> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:404)
>>> >>>>>>>>>> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decryptArray(SecurityHandler.java:577)
>>> >>>>>>>>>> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:408)
>>> >>>>>>>>>> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decryptDictionary(SecurityHandler.java:517)
>>> >>>>>>>>>> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:404)
>>> >>>>>>>>>> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decryptDictionary(SecurityHandler.java:517)
>>> >>>>>>>>>> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:404)
>>> >>>>>>>>>> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decryptArray(SecurityHandler.java:577)
>>> >>>>>>>>>> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:408)
>>> >>>>>>>>>> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decryptDictionary(SecurityHandler.java:517)
>>> >>>>>>>>>> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:404)
>>> >>>>>>>>>> >>>> at org.apache.pdfbox.pdfparser.COSParser.parseFileObject(COSParser.java:946)
>>> >>>>>>>>>> >>>> at org.apache.pdfbox.pdfparser.COSParser.parseObjectDynamically(COSParser.java:874)
>>> >>>>>>>>>> >>>> at org.apache.pdfbox.pdfparser.COSParser.parseObjectDynamically(COSParser.java:794)
>>> >>>>>>>>>> >>>> at org.apache.pdfbox.pdfparser.COSParser.parseDictObjects(COSParser.java:754)
>>> >>>>>>>>>> >>>> at org.apache.pdfbox.pdfparser.PDFParser.initialParse(PDFParser.java:185)
>>> >>>>>>>>>> >>>> at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:220)
>>> >>>>>>>>>> >>>> at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1028)
>>> >>>>>>>>>> >>>> at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:984)
>>> >>>>>>>>>> >>>> at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:152)
>>> >>>>>>>>>> >>>>
>>> >>>>>>>>>> >>>>
>>> >>>>>>>>>> >>>> P.S. Btw, the PDF is not encrypted at all.
>>> >>>>>>>>>> >>>>
>>> >>>>>>>>>> >>>> Thanks
>>>
>>
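
On Tilman's point in the quoted thread that a "not encrypted" file is likely
encrypted with an empty user password: one check that does not require a full
(and here very slow) load is to look for an /Encrypt entry in the raw file
bytes, since an encrypted PDF references its encryption dictionary from the
trailer. The sketch below is only a heuristic under that assumption. The class
and method names are made up for illustration, and a match inside an
uncompressed stream or string would be a false positive. Once a document does
load, PDDocument.isEncrypted() is the authoritative answer.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

public class EncryptKeyCheck {

    /** True if the bytes contain the PDF name "/Encrypt" terminated by whitespace or a delimiter. */
    static boolean mentionsEncrypt(byte[] pdf) {
        byte[] needle = "/Encrypt".getBytes(StandardCharsets.US_ASCII);
        outer:
        for (int i = 0; i + needle.length <= pdf.length; i++) {
            for (int j = 0; j < needle.length; j++) {
                if (pdf[i + j] != needle[j]) {
                    continue outer;
                }
            }
            // Skip longer names such as /EncryptMetadata: the name must end here.
            int next = i + needle.length;
            if (next == pdf.length || endsName(pdf[next])) {
                return true;
            }
        }
        return false;
    }

    /** PDF whitespace and delimiter characters, any of which ends a name token. */
    static boolean endsName(byte b) {
        switch (b) {
            case 0x00: case 0x09: case 0x0A: case 0x0C: case 0x0D: case 0x20:
            case '(': case ')': case '<': case '>': case '[': case ']':
            case '{': case '}': case '/': case '%':
                return true;
            default:
                return false;
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java EncryptKeyCheck <file.pdf>");
            return;
        }
        // Reads the whole file into memory; fine for a file of about 65 MB.
        byte[] pdf = Files.readAllBytes(Paths.get(args[0]));
        System.out.println(mentionsEncrypt(pdf)
                ? "Found an /Encrypt entry: the file is probably encrypted."
                : "No /Encrypt entry found.");
    }
}
```

A positive result would be consistent with the SecurityHandler.decrypt frames
in the trace even though the file opens without a password prompt.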

Re: Fwd: Very slow PDF parsing.

Posted by Slava G <sl...@gmail.com>.

Tim, which email address should I send the PDF to?
Thanks

On Thu, Feb 28, 2019 at 3:57 PM Slava G <sl...@gmail.com> wrote:

> I will, once I get the customer's approval.
> Meanwhile I can do any checks, if you can specify what to check.
> Thanks
>
> On Thu, Feb 28, 2019 at 3:56 PM Tim Allison <ta...@apache.org> wrote:
>
>> Any chance you can share the file directly w me or someone else on the
>> PDFBox team?
>>
>> On Wed, Feb 27, 2019 at 11:24 AM Slava G <sl...@gmail.com> wrote:
>>
>> > After 3h 40m it's still parsing using PDFBox 2.0.14 app...
>> > Thanks
>> >
>> > On Wed, Feb 27, 2019 at 3:29 PM Slava G <sl...@gmail.com> wrote:
>> >
>> >> With 2.0.14 it's 40 minutes running, no result, still working...
>> >> Seems that issue is still there.
>> >> Thanks
>> >>
>> >> On Wed, Feb 27, 2019 at 2:52 PM Slava G <sl...@gmail.com> wrote:
>> >>
>> >>> Checking with 2.0.14. Started as an app. Will update soon.
>> >>>
>> >>> On Wed, Feb 27, 2019 at 2:47 PM Tim Allison <ta...@apache.org> wrote:
>> >>>
>> >>>> Any chance you could try with the 2.0.14 release candidate...unless
>> >>>> you have already?
>> >>>>
>> >>>> https://dist.apache.org/repos/dist/dev/pdfbox/2.0.14/
>> >>>>
>> >>>>
>> >>>> On Wed, Feb 27, 2019 at 3:04 AM Slava G <sl...@gmail.com> wrote:
>> >>>>
>> >>>>> Well, I ran (as was suggested) the PDFBox app to extract text; so far
>> >>>>> 2 hours and still counting...
>> >>>>> It seems to be a PDFBox issue.
>> >>>>>
>> >>>>> On Wed, Feb 27, 2019 at 9:51 AM JB Data31 <jb...@gmail.com> wrote:
>> >>>>>
>> >>>>>> Why don't you do a basic test with tika server in a 3thrd and a
>> >>>>>> *wget* or *curl* bash client to parse your 65 MB PDF.
>> >>>>>> It can be easier to investigate the problem.
>> >>>>>>
>> >>>>>> @*JB*Δ <http://jbigdata.fr/jbigdata/index.html>
>> >>>>>>
>> >>>>>>
>> >>>>>>
>> >>>>>> On Tue, Feb 26, 2019 at 11:05 PM Cristian Vat <cristian.vat@gmail.com>
>> >>>>>> wrote:
>> >>>>>>
>> >>>>>>> Just looking at the stack trace, it won't be the same anymore due to
>> >>>>>>> PDFBOX-4453.
>> >>>>>>> Some changes are present in the not-yet-released PDFBox 2.0.14 that
>> >>>>>>> change how decryption is handled. Not sure if related, though.
>> >>>>>>>
>> >>>>>>> Can you duplicate the problem without Tika using just PDFBox
>> >>>>>>> command-line ExtractText command (
>> >>>>>>> https://pdfbox.apache.org/2.0/commandline.html ) on that file?
>> >>>>>>>
>> >>>>>>>
>> >>>>>>> On Tue, Feb 26, 2019 at 8:24 PM Slava G <sl...@gmail.com> wrote:
>> >>>>>>>
>> >>>>>>>> This is the code:
>> >>>>>>>> InputStream inputStream = TikaInputStream.get(inputFile.toPath());
>> >>>>>>>> PDFParser tmpPdf = new PDFParser();
>> >>>>>>>> PDFParserConfig config = tmpPdf.getPDFParserConfig();
>> >>>>>>>> config.setMaxMainMemoryBytes(31457280); // 30 MiB
>> >>>>>>>> config.setExtractAcroFormContent(false);
>> >>>>>>>> config.setExtractBookmarksText(false);
>> >>>>>>>> config.setCatchIntermediateIOExceptions(true);
>> >>>>>>>> Metadata metadata = new Metadata();
>> >>>>>>>> metadata.set(HttpHelper.CONTENT_TYPE, "application/pdf");
>> >>>>>>>> tmpPdf.parse(inputStream, textHandler, metadata, new ParseContext());
>> >>>>>>>>
>> >>>>>>>>
>> >>>>>>>> On Tue, Feb 26, 2019 at 8:02 PM Tim Allison <tallison@apache.org
>> >
>> >>>>>>>> wrote:
>> >>>>>>>>
>> >>>>>>>>>
>> >>>>>>>>> This is the default in Tika, where the default for
>> >>>>>>>>> maxMainMemoryBytes=500MB.
>> >>>>>>>>>
>> >>>>>>>>> Slava, how are you calling this in Tika?  With a TikaInputStream
>> >>>>>>>>> via tika-app or tika-server or something else?
>> >>>>>>>>>
>> >>>>>>>>> MemoryUsageSetting memoryUsageSetting =
>> >>>>>>>>>         MemoryUsageSetting.setupMainMemoryOnly();
>> >>>>>>>>> if (localConfig.getMaxMainMemoryBytes() >= 0) {
>> >>>>>>>>>     memoryUsageSetting =
>> >>>>>>>>>             MemoryUsageSetting.setupMixed(localConfig.getMaxMainMemoryBytes());
>> >>>>>>>>> }
>> >>>>>>>>> if (tstream != null && tstream.hasFile()) {
>> >>>>>>>>>     // File based -- send file directly to PDFBox
>> >>>>>>>>>     pdfDocument = PDDocument.load(tstream.getPath().toFile(),
>> >>>>>>>>>             password, memoryUsageSetting);
>> >>>>>>>>> } else {
>> >>>>>>>>>     pdfDocument = PDDocument.load(new CloseShieldInputStream(stream),
>> >>>>>>>>>             password, memoryUsageSetting);
>> >>>>>>>>> }
>> >>>>>>>>>
>> >>>>>>>>> On Tue, Feb 26, 2019 at 12:43 PM Tilman Hausherr <THausherr@t-online.de> wrote:
>> >>>>>>>>>
>> >>>>>>>>>> Hi,
>> >>>>>>>>>>
>> >>>>>>>>>> As usual, it would be nice to have the PDF, so that we could run
>> >>>>>>>>>> the profiler.
>> >>>>>>>>>>
>> >>>>>>>>>> The HashSet is used to avoid decrypting objects twice.
>> >>>>>>>>>>
>> >>>>>>>>>> The "not encrypted" file is likely encrypted with an empty user
>> >>>>>>>>>> password.
>> >>>>>>>>>>
>> >>>>>>>>>> It would also be interesting to hear what parameter is passed to
>> >>>>>>>>>> MemoryUsageSetting when load() is called.
>> >>>>>>>>>>
>> >>>>>>>>>> Tilman
>> >>>>>>>>>>
>> >>>>>>>>>>
>> >>>>>>>>>>
>> >>>>>>>>>> Am 26.02.2019 um 18:14 schrieb Tim Allison:
>> >>>>>>>>>> > PDFBox Colleagues,
>> >>>>>>>>>> >    Any ideas?
>> >>>>>>>>>> >
>> >>>>>>>>>> > ---------- Forwarded message ---------
>> >>>>>>>>>> > From: Tim Allison <ta...@apache.org>
>> >>>>>>>>>> > Date: Tue, Feb 26, 2019 at 12:13 PM
>> >>>>>>>>>> > Subject: Re: Very slow PDF parsing.
>> >>>>>>>>>> > To: <us...@tika.apache.org>
>> >>>>>>>>>> >
>> >>>>>>>>>> >
>> >>>>>>>>>> > Sorry...that's an OCR tool.  One thing that can slow down processing
>> >>>>>>>>>> > dramatically is if you have tesseract installed (try typing 'tesseract' on
>> >>>>>>>>>> > your commandline) and if you've turned it on for PDFs.  I suspect this
>> >>>>>>>>>> > isn't your problem, though.
>> >>>>>>>>>> >
>> >>>>>>>>>> >
>> >>>>>>>>>> >
>> >>>>>>>>>> > On Tue, Feb 26, 2019 at 12:08 PM Slava G <sl...@gmail.com> wrote:
>> >>>>>>>>>> >
>> >>>>>>>>>> >> Thanks Tim,
>> >>>>>>>>>> >> Frankly, I'm embarrassed to admit it, but I don't know what tesseract is
>> >>>>>>>>>> >> in this context 🙂
>> >>>>>>>>>> >>
>> >>>>>>>>>> >> Thanks
>> >>>>>>>>>> >>
>> >>>>>>>>>> >> On Tue, Feb 26, 2019, 19:04 Tim Allison <tallison@apache.org> wrote:
>> >>>>>>>>>> >>
>> >>>>>>>>>> >>> Thank you, Slava!
>> >>>>>>>>>> >>>
>> >>>>>>>>>> >>> Do you have tesseract installed?
>> >>>>>>>>>> >>>
>> >>>>>>>>>> >>> Colleagues on PDFBox, any recommendations?
>> >>>>>>>>>> >>>
>>
>
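
A note on the stack trace discussed above: the time is spent in HashSet.contains, which descends through HashMap tree bins and calls COSString.equals. Tilman mentions that the HashSet exists to avoid decrypting objects twice. The sketch below only illustrates the difference between a content-based "already seen" set and an identity-based one. Blob and countNew are hypothetical names made up for this example; this is not PDFBox code, and the thread does not say whether PDFBox uses either approach.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.IdentityHashMap;
import java.util.Set;

final class VisitedSetSketch {

    /** Hypothetical stand-in for a parsed string object; NOT PDFBox's COSString. */
    static final class Blob {
        final byte[] bytes;

        Blob(byte[] bytes) {
            this.bytes = bytes;
        }

        // Content-based equality: every lookup in a HashSet may end up here.
        @Override
        public boolean equals(Object o) {
            return o instanceof Blob && Arrays.equals(bytes, ((Blob) o).bytes);
        }

        @Override
        public int hashCode() {
            return Arrays.hashCode(bytes);
        }
    }

    /** Returns how many of the given objects were not yet in the visited set. */
    static int countNew(Set<Object> visited, Object... objects) {
        int added = 0;
        for (Object o : objects) {
            if (visited.add(o)) {
                added++;
            }
        }
        return added;
    }

    public static void main(String[] args) {
        Blob a = new Blob(new byte[] {1, 2, 3});
        Blob b = new Blob(new byte[] {1, 2, 3}); // distinct instance, equal content

        // Uses hashCode()/equals(): a and b count as the same entry.
        Set<Object> byEquals = new HashSet<>();
        // Uses reference identity only: equals() is never called.
        Set<Object> byIdentity = Collections.newSetFromMap(new IdentityHashMap<>());

        System.out.println("equals-based:   " + countNew(byEquals, a, b, a));
        System.out.println("identity-based: " + countNew(byIdentity, a, b, a));
    }
}
```

With an equals-based set, two distinct objects holding the same bytes are treated as one entry, and lookups among many colliding keys fall back to repeated equals calls. An identity-based set answers "has this exact instance been processed?" without touching the content.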

Re: Fwd: Very slow PDF parsing.

Posted by Slava G <sl...@gmail.com>.

Tim, which email address should I send the PDF to?
Thanks

On Thu, Feb 28, 2019 at 3:57 PM Slava G <sl...@gmail.com> wrote:

> I will, once I get the customer's approval.
> Meanwhile, I can run any checks you like if you tell me what to check.
> Thanks
>
> On Thu, Feb 28, 2019 at 3:56 PM Tim Allison <ta...@apache.org> wrote:
>
>> Any chance you can share the file directly with me or someone else on the
>> PDFBox team?
>>
>> On Wed, Feb 27, 2019 at 11:24 AM Slava G <sl...@gmail.com> wrote:
>>
>> > After 3h 40m it's still parsing using PDFBox 2.0.14 app...
>> > Thanks
>> >
>> > On Wed, Feb 27, 2019 at 3:29 PM Slava G <sl...@gmail.com> wrote:
>> >
>> >> With 2.0.14 it has been running for 40 minutes with no result, still working...
>> >> It seems the issue is still there.
>> >> Thanks
>> >>
>> >> On Wed, Feb 27, 2019 at 2:52 PM Slava G <sl...@gmail.com> wrote:
>> >>
>> >>> Checking with 2.0.14. Started as an app. Will update soon.
>> >>>
>> >>> On Wed, Feb 27, 2019 at 2:47 PM Tim Allison <ta...@apache.org> wrote:
>> >>>
>> >>>> Any chance you could try with the 2.0.14 release candidate...unless you
>> >>>> have already?
>> >>>>
>> >>>> https://dist.apache.org/repos/dist/dev/pdfbox/2.0.14/
>> >>>>
>> >>>>
>> >>>> On Wed, Feb 27, 2019 at 3:04 AM Slava G <sl...@gmail.com> wrote:
>> >>>>
>> >>>>> Well, I ran the PDFBox app to extract text, as suggested; so far 2
>> >>>>> hours and still counting...
>> >>>>> It seems to be a PDFBox issue.
>> >>>>>
>> >>>>> On Wed, Feb 27, 2019 at 9:51 AM JB Data31 <jb...@gmail.com> wrote:
>> >>>>>
>>
>

Re: Fwd: Very slow PDF parsing.

Posted by Slava G <sl...@gmail.com>.

I will, once I get the customer's approval.
Meanwhile, I can run any checks you like if you tell me what to check.
Thanks

On Thu, Feb 28, 2019 at 3:56 PM Tim Allison <ta...@apache.org> wrote:

> Any chance you can share the file directly with me or someone else on the
> PDFBox team?
>

Re: Fwd: Very slow PDF parsing.

Posted by Slava G <sl...@gmail.com>.

I will, once I get the customer's approval.
Meanwhile, I can run any checks if you tell me what to check.
Thanks

On Thu, Feb 28, 2019 at 3:56 PM Tim Allison <ta...@apache.org> wrote:

> Any chance you can share the file directly w me or someone else on the
> PDFBox team?

Re: Fwd: Very slow PDF parsing.

Posted by Tim Allison <ta...@apache.org>.

Any chance you can share the file directly with me or someone else on the
PDFBox team?

On Wed, Feb 27, 2019 at 11:24 AM Slava G <sl...@gmail.com> wrote:

> After 3h 40m it's still parsing using PDFBox 2.0.14 app...
> Thanks
>
> On Wed, Feb 27, 2019 at 3:29 PM Slava G <sl...@gmail.com> wrote:
>
>> With 2.0.14 it's 40 minutes running, no result, still working...
>> Seems that issue is still there.
>> Thanks
>>
>> On Wed, Feb 27, 2019 at 2:52 PM Slava G <sl...@gmail.com> wrote:
>>
>>> Checking with 2.0.14. Started as an app. Will update soon.
>>>
>>> On Wed, Feb 27, 2019 at 2:47 PM Tim Allison <ta...@apache.org> wrote:
>>>
>>>> Any chance you could try with the 2.0.14 release candidate...unless you
>>>> have already?
>>>>
>>>> https://dist.apache.org/repos/dist/dev/pdfbox/2.0.14/
>>>>
>>>>
>>>> On Wed, Feb 27, 2019 at 3:04 AM Slava G <sl...@gmail.com> wrote:
>>>>
>>>>> Well, I ran (as was suggested) the PDFBox app to extract text; so far 2
>>>>> hours and still counting...
>>>>> It seems to be a PDFBox issue.
>>>>>
>>>>> On Wed, Feb 27, 2019 at 9:51 AM JB Data31 <jb...@gmail.com> wrote:
>>>>>
>>>>>> Why don't you do a basic test with tika server in a 3thrd and a
>>>>>> *wget* or *curl* bash client to parse your 65Mo PDF.
>>>>>> It can be easier to investigate the problem.
>>>>>>
>>>>>> @*JB*Δ <http://jbigdata.fr/jbigdata/index.html>
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Tue, Feb 26, 2019 at 23:05, Cristian Vat <cr...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Just looking at the stack trace it won't be the same anymore due to
>>>>>>> PDFBOX-4453
>>>>>>> Some changes present in not yet released pdfbox 2.0.14 and it
>>>>>>> changes how decryption is handled. Not sure if related though.
>>>>>>>
>>>>>>> Can you duplicate the problem without Tika using just PDFBox
>>>>>>> command-line ExtractText command (
>>>>>>> https://pdfbox.apache.org/2.0/commandline.html ) on that file?
>>>>>>>
>>>>>>>
>>>>>>> On Tue, Feb 26, 2019 at 8:24 PM Slava G <sl...@gmail.com> wrote:
>>>>>>>
>>>>>>>> This is the code :
>>>>>>>> InputStream in = TikaInputStream.get(inputFile.toPath());
>>>>>>>> PDFParser tmpPdf = new PDFParser();
>>>>>>>> PDFParserConfig config = tmpPdf.getPDFParserConfig();
>>>>>>>> config.setMaxMainMemoryBytes(31457280);
>>>>>>>> config.setExtractAcroFormContent(false);
>>>>>>>> config.setExtractBookmarksText(false);
>>>>>>>> config.setCatchIntermediateIOExceptions(true);
>>>>>>>> Metadata metadata = new Metadata();
>>>>>>>> metadata.set(HttpHelper.CONTENT_TYPE, "application/pdf");
>>>>>>>> tmpPdf.parse(inputStream, textHandler, this.metadata, new
>>>>>>>> ParseContext());
>>>>>>>>
>>>>>>>>
>>>>>>>> On Tue, Feb 26, 2019 at 8:02 PM Tim Allison <ta...@apache.org>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>>
>>>>>>>>> This is the default in Tika, where the default for
>>>>>>>>> maxMainMemoryBytes=500MB.
>>>>>>>>>
>>>>>>>>> Slava, how are you calling this in Tika?  With a TikaInputStream
>>>>>>>>> via tika-app or tika-server or something else?
>>>>>>>>>
>>>>>>>>> MemoryUsageSetting memoryUsageSetting =
>>>>>>>>> MemoryUsageSetting.setupMainMemoryOnly();
>>>>>>>>> if (localConfig.getMaxMainMemoryBytes() >= 0) {
>>>>>>>>> memoryUsageSetting =
>>>>>>>>> MemoryUsageSetting.setupMixed(localConfig.getMaxMainMemoryBytes());
>>>>>>>>> }
>>>>>>>>> if (tstream != null && tstream.hasFile()) {
>>>>>>>>> // File based -- send file directly to PDFBox
>>>>>>>>> pdfDocument = PDDocument.load(tstream.getPath().toFile(),
>>>>>>>>> password, memoryUsageSetting);
>>>>>>>>> } else {
>>>>>>>>> pdfDocument = PDDocument.load(new CloseShieldInputStream(stream),
>>>>>>>>> password, memoryUsageSetting);
>>>>>>>>> }
>>>>>>>>>
>>>>>>>>> On Tue, Feb 26, 2019 at 12:43 PM Tilman Hausherr <
>>>>>>>>> THausherr@t-online.de> wrote:
>>>>>>>>>
>>>>>>>>>> Hi,
>>>>>>>>>>
>>>>>>>>>> As usual, it would be nice to have the PDF, so that we could run
>>>>>>>>>> the
>>>>>>>>>> profiler.
>>>>>>>>>>
>>>>>>>>>> The HashSet is used to avoid decrypting objects twice.
>>>>>>>>>>
>>>>>>>>>> The "not encrypted" file is likely encrypted with an empty user
>>>>>>>>>> password.
>>>>>>>>>>
>>>>>>>>>> It would also be interesting to hear what parameter is passed to
>>>>>>>>>> MemoryUsageSetting when load() is called.
>>>>>>>>>>
>>>>>>>>>> Tilman
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On 26.02.2019 at 18:14, Tim Allison wrote:
>>>>>>>>>> > PDFBox Colleagues,
>>>>>>>>>> >    Any ideas?
>>>>>>>>>> >
>>>>>>>>>> > ---------- Forwarded message ---------
>>>>>>>>>> > From: Tim Allison <ta...@apache.org>
>>>>>>>>>> > Date: Tue, Feb 26, 2019 at 12:13 PM
>>>>>>>>>> > Subject: Re: Very slow PDF parsing.
>>>>>>>>>> > To: <us...@tika.apache.org>
>>>>>>>>>> >
>>>>>>>>>> >
>>>>>>>>>> > Sorry...that's an OCR tool.  One thing that can slow down processing
>>>>>>>>>> > dramatically is if you have tesseract installed (try typing 'tesseract' on
>>>>>>>>>> > your commandline) and if you've turned it on for PDFs.  I suspect this
>>>>>>>>>> > isn't your problem, though.
>>>>>>>>>> >
>>>>>>>>>> >
>>>>>>>>>> >
>>>>>>>>>> > On Tue, Feb 26, 2019 at 12:08 PM Slava G <sl...@gmail.com>
>>>>>>>>>> wrote:
>>>>>>>>>> >
>>>>>>>>>> >> Thanks Tim,
>>>>>>>>>> >> But frankly speaking, it's a shame, but I don't know what tesseract is in
>>>>>>>>>> >> this context 🙂
>>>>>>>>>> >>
>>>>>>>>>> >> Thanks
>>>>>>>>>> >>
>>>>>>>>>> >> On Tue, Feb 26, 2019, 19:04 Tim Allison <ta...@apache.org>
>>>>>>>>>> wrote:
>>>>>>>>>> >>
>>>>>>>>>> >>> Thank you, Slava!
>>>>>>>>>> >>>
>>>>>>>>>> >>> Do you have tesseract installed?
>>>>>>>>>> >>>
>>>>>>>>>> >>> Colleagues on PDFBox, any recommendations?
>>>>>>>>>> >>>
>>>>>>>>>> >>> On Tue, Feb 26, 2019 at 11:56 AM Slava G <sl...@gmail.com>
>>>>>>>>>> wrote:
>>>>>>>>>> >>>> Hi,
>>>>>>>>>> >>>>
>>>>>>>>>> >>>> I have large PDF (about 65mb) that contains mainly text and some images.
>>>>>>>>>> >>>>
>>>>>>>>>> >>>> Parsing of such PDF can take about 2 days or even more (TIKA 1.19.1
>>>>>>>>>> >>>> running on XEON server with 4 cores CPU and 30GB RAM with SSD disk, running
>>>>>>>>>> >>>> CentOS Linux).
>>>>>>>>>> >>>> Please advise if there anything I can do to speedup.Or maybe it's a bug
>>>>>>>>>> >>>> in PDFBox ?
>>>>>>>>>> >>>> When I'm printing java stack , I see all the time in this stack :
>>>>>>>>>> >>>>
>>>>>>>>>> >>>> at org.apache.pdfbox.cos.COSString.equals(COSString.java:259)
>>>>>>>>>> >>>>
>>>>>>>>>> >>>> at java.util.HashMap$TreeNode.find(Unknown Source)
>>>>>>>>>> >>>>
>>>>>>>>>> >>>> at java.util.HashMap$TreeNode.find(Unknown Source)
>>>>>>>>>> >>>>
>>>>>>>>>> >>>> at java.util.HashMap$TreeNode.find(Unknown Source)
>>>>>>>>>> >>>>
>>>>>>>>>> >>>> at java.util.HashMap$TreeNode.find(Unknown Source)
>>>>>>>>>> >>>>
>>>>>>>>>> >>>> at java.util.HashMap$TreeNode.find(Unknown Source)
>>>>>>>>>> >>>>
>>>>>>>>>> >>>> at java.util.HashMap$TreeNode.find(Unknown Source)
>>>>>>>>>> >>>>
>>>>>>>>>> >>>> at java.util.HashMap$TreeNode.find(Unknown Source)
>>>>>>>>>> >>>>
>>>>>>>>>> >>>> at java.util.HashMap$TreeNode.find(Unknown Source)
>>>>>>>>>> >>>>
>>>>>>>>>> >>>> at java.util.HashMap$TreeNode.find(Unknown Source)
>>>>>>>>>> >>>>
>>>>>>>>>> >>>> at java.util.HashMap$TreeNode.getTreeNode(Unknown Source)
>>>>>>>>>> >>>>
>>>>>>>>>> >>>> at java.util.HashMap.getNode(Unknown Source)
>>>>>>>>>> >>>>
>>>>>>>>>> >>>> at java.util.HashMap.containsKey(Unknown Source)
>>>>>>>>>> >>>>
>>>>>>>>>> >>>> at java.util.HashSet.contains(Unknown Source)
>>>>>>>>>> >>>>
>>>>>>>>>> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:390)
>>>>>>>>>> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decryptDictionary(SecurityHandler.java:517)
>>>>>>>>>> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:404)
>>>>>>>>>> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decryptArray(SecurityHandler.java:577)
>>>>>>>>>> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:408)
>>>>>>>>>> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decryptDictionary(SecurityHandler.java:517)
>>>>>>>>>> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:404)
>>>>>>>>>> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decryptDictionary(SecurityHandler.java:517)
>>>>>>>>>> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:404)
>>>>>>>>>> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decryptArray(SecurityHandler.java:577)
>>>>>>>>>> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:408)
>>>>>>>>>> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decryptDictionary(SecurityHandler.java:517)
>>>>>>>>>> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:404)
>>>>>>>>>> >>>> at org.apache.pdfbox.pdfparser.COSParser.parseFileObject(COSParser.java:946)
>>>>>>>>>> >>>> at org.apache.pdfbox.pdfparser.COSParser.parseObjectDynamically(COSParser.java:874)
>>>>>>>>>> >>>> at org.apache.pdfbox.pdfparser.COSParser.parseObjectDynamically(COSParser.java:794)
>>>>>>>>>> >>>> at org.apache.pdfbox.pdfparser.COSParser.parseDictObjects(COSParser.java:754)
>>>>>>>>>> >>>> at org.apache.pdfbox.pdfparser.PDFParser.initialParse(PDFParser.java:185)
>>>>>>>>>> >>>> at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:220)
>>>>>>>>>> >>>> at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1028)
>>>>>>>>>> >>>> at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:984)
>>>>>>>>>> >>>> at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:152)
>>>>>>>>>> >>>>
>>>>>>>>>> >>>>
>>>>>>>>>> >>>> P.S. Btw, the PDF is not encrypted at all.
>>>>>>>>>> >>>>
>>>>>>>>>> >>>> Thanks
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> ---------------------------------------------------------------------
>>>>>>>>>> To unsubscribe, e-mail: users-unsubscribe@pdfbox.apache.org
>>>>>>>>>> For additional commands, e-mail: users-help@pdfbox.apache.org
>>>>>>>>>>
>>>>>>>>>>

Re: Fwd: Very slow PDF parsing.

Posted by Tim Allison <ta...@apache.org>.

Thank you, Tilman!

On Thu, Feb 28, 2019 at 2:19 PM Tilman Hausherr <TH...@t-online.de> wrote:
>
> Thanks, I got the file. It has about 1000 objects, but many more objects
> are created. So I think this is a bug and not related to the size.
>
> The hashmap in decryption seems suspicious to me... Coincidentally,
> today I discovered IdentityHashMap, which may have been what I was
> searching for a few weeks ago in PDFBOX-4453. Using that one opens the
> file in a few seconds.
>
> Tilman

Re: Fwd: Very slow PDF parsing.

Posted by Tilman Hausherr <TH...@t-online.de>.

Thanks, I got the file. It has about 1000 objects, but many more objects
are created. So I think this is a bug and not related to the size.

The hashmap in decryption seems suspicious to me... Coincidentally,
today I discovered IdentityHashMap, which may have been what I was
searching for a few weeks ago in PDFBOX-4453. Using that one opens the
file in a few seconds.

Tilman
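
For readers following along: the stack trace quoted in this thread sits in
COSString.equals() underneath HashSet.contains(), and the message above
reports that switching to IdentityHashMap opens the file in a few seconds.
The sketch below is not the PDFBox change itself. It only illustrates, using
the JDK alone, how an identity-based set differs from a HashSet. The names
IdentitySetDemo, Blob and newIdentitySet are invented for this illustration,
and Blob is a hypothetical stand-in for a parsed string object.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.IdentityHashMap;
import java.util.Set;

public class IdentitySetDemo {

    /** Hypothetical stand-in for a parsed string object; not a PDFBox class. */
    static final class Blob {
        static long equalsCalls = 0; // counts content comparisons
        private final byte[] bytes;

        Blob(byte[] bytes) {
            this.bytes = bytes.clone();
        }

        @Override
        public boolean equals(Object o) {
            equalsCalls++;
            return o instanceof Blob && Arrays.equals(bytes, ((Blob) o).bytes);
        }

        @Override
        public int hashCode() {
            return Arrays.hashCode(bytes);
        }
    }

    /** Builds a set that compares members with == instead of equals(). */
    static <T> Set<T> newIdentitySet() {
        return Collections.newSetFromMap(new IdentityHashMap<T, Boolean>());
    }

    public static void main(String[] args) {
        Blob a = new Blob(new byte[] {1, 2, 3});
        Blob b = new Blob(new byte[] {1, 2, 3}); // same content, different object

        // A HashSet looks members up by hashCode() and equals().
        Set<Blob> byValue = new HashSet<>();
        byValue.add(a);
        System.out.println("HashSet contains b: " + byValue.contains(b));

        // An identity set looks members up by reference only.
        Set<Blob> byIdentity = newIdentitySet();
        byIdentity.add(a);
        System.out.println("identity set contains a: " + byIdentity.contains(a));
        System.out.println("identity set contains b: " + byIdentity.contains(b));
        System.out.println("equals() calls so far: " + Blob.equalsCalls);
    }
}
```

An identity set answers "have I already seen this exact object?" using ==
and System.identityHashCode(), so a lookup never runs a content comparison.
The trade-off is that two distinct objects with equal content count as
different members, which is only appropriate when object identity is what
the caller actually wants to track.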




Re: Fwd: Very slow PDF parsing.

Posted by Slava G <sl...@gmail.com>.

Thanks.
The app is still running, already for an hour.
I've requested the customer's permission to share the file and am waiting
for his approval.
Once I get an answer from him, I will let you know.
Thanks

On Wed, Feb 27, 2019 at 7:05 PM Tilman Hausherr <TH...@t-online.de>
wrote:

> Yes, will do. Use a sharehoster (e.g. filedropper.com ) and put the file
> into an encrypted ZIP. Please send the link and the password to
> tilman at snafu dot de. Make sure you're not breaking any laws by
> sending the file.
>
> Tilman
>
>
> On 27.02.2019 at 17:33, Slava G wrote:
> > As this is a customer file, I can share it privately, and I'll ask you to
> > dispose of it after the investigation is done.
> > So, how can I share it with you?
> > Checking now with the 2.0.6 app. Will update...
> >
> >
> > On Wed, Feb 27, 2019, 18:28 Tilman Hausherr <TH...@t-online.de>
> wrote:
> >
> >> We really need the file to find out what's going on.
> >>
> >> If you can't share it, you'll have to investigate yourself by using the
> >> profiler. Before that, try with old 2.0.* versions to see if these are
> >> faster.
> >>
> >> Tilman
> >>
> >> On 27.02.2019 at 17:23, Slava G wrote:
> >>> After 3h 40m it's still parsing using PDFBox 2.0.14 app...
> >>> Thanks
> >>>
> >>> On Wed, Feb 27, 2019 at 3:29 PM Slava G <sl...@gmail.com> wrote:
> >>>
> >>>> With 2.0.14 it's 40 minutes running, no result, still working...
> >>>> Seems that issue is still there.
> >>>> Thanks
> >>>>
> >>>> On Wed, Feb 27, 2019 at 2:52 PM Slava G <sl...@gmail.com> wrote:
> >>>>
> >>>>> Checking with 2.0.14. Started as an app. Will update soon.
> >>>>>
> >>>>> On Wed, Feb 27, 2019 at 2:47 PM Tim Allison <ta...@apache.org>
> >> wrote:
> >>>>>> Any chance you could try with the 2.0.14 release candidate...unless
> >> you
> >>>>>> have already?
> >>>>>>
> >>>>>> https://dist.apache.org/repos/dist/dev/pdfbox/2.0.14/
> >>>>>>
> >>>>>>
> >>>>>> On Wed, Feb 27, 2019 at 3:04 AM Slava G <sl...@gmail.com> wrote:
> >>>>>>
> >>>>>>> Well, I ran (as was suggested) PDFBox app to extract text , so far
> 2
> >>>>>>> hours and still counting...
> >>>>>>> It's seems to be a PDFBox issue.
> >>>>>>>
> >>>>>>> On Wed, Feb 27, 2019 at 9:51 AM JB Data31 <jb...@gmail.com>
> >> wrote:
> >>>>>>>> Why don't you do a basic test with tika server in a 3thrd and a
> >> *wget*
> >>>>>>>> or *curl* bash client to parse your 65Mo PDF.
> >>>>>>>> It can be easier to investigate the problem.
> >>>>>>>>
> >>>>>>>> @*JB*Δ <http://jbigdata.fr/jbigdata/index.html>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> On Tue, Feb 26, 2019 at 23:05, Cristian Vat <cristian.vat@gmail.com>
> >>>>>>>> wrote:
> >>>>>>>>
> >>>>>>>>> Just looking at the stack trace it won't be the same anymore due
> to
> >>>>>>>>> PDFBOX-4453
> >>>>>>>>> Some changes present in not yet released pdfbox 2.0.14 and it
> >> changes
> >>>>>>>>> how decryption is handled. Not sure if related though.
> >>>>>>>>>
> >>>>>>>>> Can you duplicate the problem without Tika using just PDFBox
> >>>>>>>>> command-line ExtractText command (
> >>>>>>>>> https://pdfbox.apache.org/2.0/commandline.html ) on that file?
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> On Tue, Feb 26, 2019 at 8:24 PM Slava G <sl...@gmail.com>
> wrote:
> >>>>>>>>>
> >>>>>>>>>> This is the code :
> >>>>>>>>>> InputStream in = TikaInputStream.get(inputFile.toPath());
> >>>>>>>>>> PDFParser tmpPdf = new PDFParser();
> >>>>>>>>>> PDFParserConfig config = tmpPdf.getPDFParserConfig();
> >>>>>>>>>> config.setMaxMainMemoryBytes(31457280);
> >>>>>>>>>> config.setExtractAcroFormContent(false);
> >>>>>>>>>> config.setExtractBookmarksText(false);
> >>>>>>>>>> config.setCatchIntermediateIOExceptions(true);
> >>>>>>>>>> Metadata metadata = new Metadata();
> >>>>>>>>>> metadata.set(HttpHelper.CONTENT_TYPE, "application/pdf");
> >>>>>>>>>> tmpPdf.parse(inputStream, textHandler, this.metadata, new
> >>>>>>>>>> ParseContext());
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> On Tue, Feb 26, 2019 at 8:02 PM Tim Allison <
> tallison@apache.org>
> >>>>>>>>>> wrote:
> >>>>>>>>>>
> >>>>>>>>>>> This is the default in Tika, where the default for
> >>>>>>>>>>> maxMainMemoryBytes=500MB.
> >>>>>>>>>>>
> >>>>>>>>>>> Slava, how are you calling this in Tika?  With a
> TikaInputStream
> >>>>>>>>>>> via tika-app or tika-server or something else?
> >>>>>>>>>>>
> >>>>>>>>>>> MemoryUsageSetting memoryUsageSetting =
> >>>>>>>>>>> MemoryUsageSetting.setupMainMemoryOnly();
> >>>>>>>>>>> if (localConfig.getMaxMainMemoryBytes() >= 0) {
> >>>>>>>>>>> memoryUsageSetting =
> >>>>>>>>>>>
> >> MemoryUsageSetting.setupMixed(localConfig.getMaxMainMemoryBytes());
> >>>>>>>>>>> }
> >>>>>>>>>>> if (tstream != null && tstream.hasFile()) {
> >>>>>>>>>>> // File based -- send file directly to PDFBox
> >>>>>>>>>>> pdfDocument = PDDocument.load(tstream.getPath().toFile(),
> >> password,
> >>>>>>>>>>> memoryUsageSetting);
> >>>>>>>>>>> } else {
> >>>>>>>>>>> pdfDocument = PDDocument.load(new
> CloseShieldInputStream(stream),
> >>>>>>>>>>> password, memoryUsageSetting);
> >>>>>>>>>>> }
> >>>>>>>>>>>
> >>>>>>>>>>> On Tue, Feb 26, 2019 at 12:43 PM Tilman Hausherr <
> >>>>>>>>>>> THausherr@t-online.de> wrote:
> >>>>>>>>>>>
> >>>>>>>>>>>> Hi,
> >>>>>>>>>>>>
> >>>>>>>>>>>> As usual, it would be nice to have the PDF, so that we could
> run
> >>>>>>>>>>>> the
> >>>>>>>>>>>> profiler.
> >>>>>>>>>>>>
> >>>>>>>>>>>> The HashSet is used to avoid decrypting objects twice.
> >>>>>>>>>>>>
> >>>>>>>>>>>> The "not encrypted" file is likely encrypted with an empty
> user
> >>>>>>>>>>>> password.
> >>>>>>>>>>>>
> >>>>>>>>>>>> It would also be interesting to hear what parameter is passed
> to
> >>>>>>>>>>>> MemoryUsageSetting when load() is called.
> >>>>>>>>>>>>
> >>>>>>>>>>>> Tilman
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>> On 26.02.2019 at 18:14, Tim Allison wrote:

Re: Fwd: Very slow PDF parsing.

Posted by Tilman Hausherr <TH...@t-online.de>.

Yes, will do. Use a file-sharing service (e.g. filedropper.com) and put the
file into an encrypted ZIP. Please send the link and the password to
tilman at snafu dot de. Make sure you're not breaking any laws by
sending the file.

Tilman


On 27.02.2019 at 17:33, Slava G wrote:
> As this is a customer file, I can share it privately, and I'll ask you to
> dispose of it once the investigation is done.
> So, how can I share it with you?
> Checking now with the 2.0.6 app. Will update...
>
>
> On Wed, Feb 27, 2019, 18:28 Tilman Hausherr <TH...@t-online.de> wrote:
>
>> We really need the file to find out what's going on.
>>
>> If you can't share it, you'll have to investigate yourself by using the
>> profiler. Before that, try with old 2.0.* versions to see if these are
>> faster.
>>
>> Tilman
>>
>> On 27.02.2019 at 17:23, Slava G wrote:
>>> After 3h 40m it's still parsing using PDFBox 2.0.14 app...
>>> Thanks
>>>
>>> On Wed, Feb 27, 2019 at 3:29 PM Slava G <sl...@gmail.com> wrote:
>>>
>>>> With 2.0.14 it's 40 minutes running, no result, still working...
>>>> Seems that issue is still there.
>>>> Thanks
>>>>
>>>> On Wed, Feb 27, 2019 at 2:52 PM Slava G <sl...@gmail.com> wrote:
>>>>
>>>>> Checking with 2.0.14. Started as an app. Will update soon.
>>>>>
>>>>> On Wed, Feb 27, 2019 at 2:47 PM Tim Allison <ta...@apache.org> wrote:
>>>>>> Any chance you could try with the 2.0.14 release candidate...unless you
>>>>>> have already?
>>>>>>
>>>>>> https://dist.apache.org/repos/dist/dev/pdfbox/2.0.14/
>>>>>>
>>>>>>
>>>>>> On Wed, Feb 27, 2019 at 3:04 AM Slava G <sl...@gmail.com> wrote:
>>>>>>
>>>>>>> Well, I ran the PDFBox app to extract text (as suggested); it has been
>>>>>>> 2 hours so far and still counting...
>>>>>>> It seems to be a PDFBox issue.
>>>>>>>
>>>>>>> On Wed, Feb 27, 2019 at 9:51 AM JB Data31 <jb...@gmail.com> wrote:
>>>>>>>> Why don't you do a basic test with tika server in a 3thrd and a *wget*
>>>>>>>> or *curl* bash client to parse your 65 MB PDF?
>>>>>>>> It may make the problem easier to investigate.
>>>>>>>>
>>>>>>>> @*JB*Δ <http://jbigdata.fr/jbigdata/index.html>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Tue, Feb 26, 2019 at 23:05, Cristian Vat <cristian.vat@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Just looking at the stack trace: it won't be the same anymore due to
>>>>>>>>> PDFBOX-4453.
>>>>>>>>> Some changes in the not-yet-released PDFBox 2.0.14 alter
>>>>>>>>> how decryption is handled. Not sure if that's related, though.
>>>>>>>>>
>>>>>>>>> Can you duplicate the problem without Tika using just PDFBox
>>>>>>>>> command-line ExtractText command (
>>>>>>>>> https://pdfbox.apache.org/2.0/commandline.html ) on that file?
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Tue, Feb 26, 2019 at 8:24 PM Slava G <sl...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> This is the code :
>>>>>>>>>> InputStream in = TikaInputStream.get(inputFile.toPath());
>>>>>>>>>> PDFParser tmpPdf = new PDFParser();
>>>>>>>>>> PDFParserConfig config = tmpPdf.getPDFParserConfig();
>>>>>>>>>> config.setMaxMainMemoryBytes(31457280);
>>>>>>>>>> config.setExtractAcroFormContent(false);
>>>>>>>>>> config.setExtractBookmarksText(false);
>>>>>>>>>> config.setCatchIntermediateIOExceptions(true);
>>>>>>>>>> Metadata metadata = new Metadata();
>>>>>>>>>> metadata.set(HttpHelper.CONTENT_TYPE, "application/pdf");
>>>>>>>>>> tmpPdf.parse(inputStream, textHandler, this.metadata, new
>>>>>>>>>> ParseContext());
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Tue, Feb 26, 2019 at 8:02 PM Tim Allison <ta...@apache.org>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> The code below is what Tika does by default; the default for
>>>>>>>>>>> maxMainMemoryBytes is 500 MB.
>>>>>>>>>>>
>>>>>>>>>>> Slava, how are you calling this in Tika?  With a TikaInputStream,
>>>>>>>>>>> via tika-app or tika-server, or something else?
>>>>>>>>>>>
>>>>>>>>>>> MemoryUsageSetting memoryUsageSetting =
>>>>>>>>>>>         MemoryUsageSetting.setupMainMemoryOnly();
>>>>>>>>>>> if (localConfig.getMaxMainMemoryBytes() >= 0) {
>>>>>>>>>>>     memoryUsageSetting =
>>>>>>>>>>>             MemoryUsageSetting.setupMixed(localConfig.getMaxMainMemoryBytes());
>>>>>>>>>>> }
>>>>>>>>>>> if (tstream != null && tstream.hasFile()) {
>>>>>>>>>>>     // File based -- send file directly to PDFBox
>>>>>>>>>>>     pdfDocument = PDDocument.load(tstream.getPath().toFile(), password,
>>>>>>>>>>>             memoryUsageSetting);
>>>>>>>>>>> } else {
>>>>>>>>>>>     pdfDocument = PDDocument.load(new CloseShieldInputStream(stream),
>>>>>>>>>>>             password, memoryUsageSetting);
>>>>>>>>>>> }
>>>>>>>>>>>
>>>>>>>>>>> On Tue, Feb 26, 2019 at 12:43 PM Tilman Hausherr <
>>>>>>>>>>> THausherr@t-online.de> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi,
>>>>>>>>>>>>
>>>>>>>>>>>> As usual, it would be nice to have the PDF, so that we could run the
>>>>>>>>>>>> profiler.
>>>>>>>>>>>>
>>>>>>>>>>>> The HashSet is used to avoid decrypting objects twice.
>>>>>>>>>>>>
>>>>>>>>>>>> The "not encrypted" file is likely encrypted with an empty user
>>>>>>>>>>>> password.
>>>>>>>>>>>>
>>>>>>>>>>>> It would also be interesting to hear what parameter is passed to
>>>>>>>>>>>> MemoryUsageSetting when load() is called.
>>>>>>>>>>>>
>>>>>>>>>>>> Tilman
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On 26.02.2019 at 18:14, Tim Allison wrote:
>>>>>>>>>>>>> PDFBox Colleagues,
>>>>>>>>>>>>>      Any ideas?
>>>>>>>>>>>>>
>>>>>>>>>>>>> ---------- Forwarded message ---------
>>>>>>>>>>>>> From: Tim Allison <ta...@apache.org>
>>>>>>>>>>>>> Date: Tue, Feb 26, 2019 at 12:13 PM
>>>>>>>>>>>>> Subject: Re: Very slow PDF parsing.
>>>>>>>>>>>>> To: <us...@tika.apache.org>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> Sorry...that's an OCR tool.  One thing that can slow down processing
>>>>>>>>>>>>> dramatically is if you have tesseract installed (try typing 'tesseract' on
>>>>>>>>>>>>> your commandline) and if you've turned it on for PDFs.  I suspect this
>>>>>>>>>>>>> isn't your problem, though.
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Tue, Feb 26, 2019 at 12:08 PM Slava G <sl...@gmail.com> wrote:
>>>>>>>>>>>>>> Thanks Tim,
>>>>>>>>>>>>>> Frankly speaking, I'm ashamed to admit it, but I don't know what
>>>>>>>>>>>>>> tesseract is in this context 🙂
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Tue, Feb 26, 2019, 19:04 Tim Allison <ta...@apache.org> wrote:
>>>>>>>>>>>>>>> Thank you, Slava!
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Do you have tesseract installed?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Colleagues on PDFBox, any recommendations?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Tue, Feb 26, 2019 at 11:56 AM Slava G <sl...@gmail.com> wrote:
>>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I have a large PDF (about 65 MB) that contains mainly text and some images.
>>>>>>>>>>>>>>>> Parsing such a PDF can take about 2 days or even more (Tika 1.19.1 running
>>>>>>>>>>>>>>>> on a Xeon server with a 4-core CPU, 30 GB RAM and an SSD, running CentOS Linux).
>>>>>>>>>>>>>>>> Please advise if there is anything I can do to speed it up. Or maybe it's a
>>>>>>>>>>>>>>>> bug in PDFBox?
>>>>>>>>>>>>>>>> When I print the Java stack, I always see this stack:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> at org.apache.pdfbox.cos.COSString.equals(COSString.java:259)
>>>>>>>>>>>>>>>> at java.util.HashMap$TreeNode.find(Unknown Source)
>>>>>>>>>>>>>>>> at java.util.HashMap$TreeNode.find(Unknown Source)
>>>>>>>>>>>>>>>> at java.util.HashMap$TreeNode.find(Unknown Source)
>>>>>>>>>>>>>>>> at java.util.HashMap$TreeNode.find(Unknown Source)
>>>>>>>>>>>>>>>> at java.util.HashMap$TreeNode.find(Unknown Source)
>>>>>>>>>>>>>>>> at java.util.HashMap$TreeNode.find(Unknown Source)
>>>>>>>>>>>>>>>> at java.util.HashMap$TreeNode.find(Unknown Source)
>>>>>>>>>>>>>>>> at java.util.HashMap$TreeNode.find(Unknown Source)
>>>>>>>>>>>>>>>> at java.util.HashMap$TreeNode.find(Unknown Source)
>>>>>>>>>>>>>>>> at java.util.HashMap$TreeNode.getTreeNode(Unknown Source)
>>>>>>>>>>>>>>>> at java.util.HashMap.getNode(Unknown Source)
>>>>>>>>>>>>>>>> at java.util.HashMap.containsKey(Unknown Source)
>>>>>>>>>>>>>>>> at java.util.HashSet.contains(Unknown Source)
>>>>>>>>>>>>>>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:390)
>>>>>>>>>>>>>>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decryptDictionary(SecurityHandler.java:517)
>>>>>>>>>>>>>>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:404)
>>>>>>>>>>>>>>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decryptArray(SecurityHandler.java:577)
>>>>>>>>>>>>>>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:408)
>>>>>>>>>>>>>>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decryptDictionary(SecurityHandler.java:517)
>>>>>>>>>>>>>>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:404)
>>>>>>>>>>>>>>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decryptDictionary(SecurityHandler.java:517)
>>>>>>>>>>>>>>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:404)
>>>>>>>>>>>>>>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decryptArray(SecurityHandler.java:577)
>>>>>>>>>>>>>>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:408)
>>>>>>>>>>>>>>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decryptDictionary(SecurityHandler.java:517)
>>>>>>>>>>>>>>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:404)
>>>>>>>>>>>>>>>> at org.apache.pdfbox.pdfparser.COSParser.parseFileObject(COSParser.java:946)
>>>>>>>>>>>>>>>> at org.apache.pdfbox.pdfparser.COSParser.parseObjectDynamically(COSParser.java:874)
>>>>>>>>>>>>>>>> at org.apache.pdfbox.pdfparser.COSParser.parseObjectDynamically(COSParser.java:794)
>>>>>>>>>>>>>>>> at org.apache.pdfbox.pdfparser.COSParser.parseDictObjects(COSParser.java:754)
>>>>>>>>>>>>>>>> at org.apache.pdfbox.pdfparser.PDFParser.initialParse(PDFParser.java:185)
>>>>>>>>>>>>>>>> at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:220)
>>>>>>>>>>>>>>>> at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1028)
>>>>>>>>>>>>>>>> at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:984)
>>>>>>>>>>>>>>>> at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:152)
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> P.S. Btw, the PDF is not encrypted at all.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Thanks


---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe@pdfbox.apache.org
For additional commands, e-mail: users-help@pdfbox.apache.org
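
The stack trace quoted above sits in HashMap$TreeNode.find calling COSString.equals underneath HashSet.contains. The thread does not establish the cause, so the following is only a general illustration, not a diagnosis of PDFBox: a HashSet lookup can degrade badly when many distinct keys share one hash bucket and the keys are not Comparable, because the tree bin then has to be searched with repeated equals() calls. The classes CollidingKey and SpreadKey are hypothetical and exist only for this sketch.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.concurrent.atomic.AtomicLong;

class CollisionDemo {
    // Counts every equals() call made on the demo keys.
    static final AtomicLong EQUALS_CALLS = new AtomicLong();

    // Hypothetical key: every instance reports the same hash code,
    // so all keys end up in a single (treeified) bucket.
    static final class CollidingKey {
        final int id;
        CollidingKey(int id) { this.id = id; }
        @Override public int hashCode() { return 42; } // deliberately constant
        @Override public boolean equals(Object o) {
            EQUALS_CALLS.incrementAndGet();
            return o instanceof CollidingKey && ((CollidingKey) o).id == id;
        }
    }

    // Hypothetical well-behaved key: hash codes are spread across buckets.
    static final class SpreadKey {
        final int id;
        SpreadKey(int id) { this.id = id; }
        @Override public int hashCode() { return Integer.hashCode(id); }
        @Override public boolean equals(Object o) {
            EQUALS_CALLS.incrementAndGet();
            return o instanceof SpreadKey && ((SpreadKey) o).id == id;
        }
    }

    static Object key(boolean colliding, int id) {
        return colliding ? new CollidingKey(id) : new SpreadKey(id);
    }

    // Fills a HashSet with n keys, then returns how many equals() calls
    // were needed to look each of them up once.
    static long equalsCallsForLookups(boolean colliding, int n) {
        Set<Object> set = new HashSet<>();
        for (int i = 0; i < n; i++) {
            set.add(key(colliding, i));
        }
        EQUALS_CALLS.set(0); // only count the lookups, not the inserts
        for (int i = 0; i < n; i++) {
            set.contains(key(colliding, i));
        }
        return EQUALS_CALLS.get();
    }

    public static void main(String[] args) {
        int n = 2000;
        System.out.println("spread keys:    " + equalsCallsForLookups(false, n) + " equals() calls");
        System.out.println("colliding keys: " + equalsCallsForLookups(true, n) + " equals() calls");
    }
}
```

If a profiler shows a similar pattern on the real file, the hash codes of the objects stored in the set would be the first thing to inspect.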

Re: Fwd: Very slow PDF parsing.

Posted by Slava G <sl...@gmail.com>.

As this is a customer file, I can share it privately, and I'll ask you to
dispose of it once the investigation is done.
So, how can I share it with you?
Checking now with the 2.0.6 app. Will update...



Re: Fwd: Very slow PDF parsing.

Posted by Tilman Hausherr <TH...@t-online.de>.

We really need the file to find out what's going on.

If you can't share it, you'll have to investigate yourself by using a
profiler. Before that, try older 2.0.* versions to see whether they are
faster.

Tilman
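
For anyone who cannot share the file and has no profiler at hand, repeatedly printing the stack (as done earlier in this thread) can be automated. The sketch below is a minimal sampling approach using only the JDK; StackSampler, sampleTopFrames and busyLoop are names made up for this example, and busyLoop is merely a stand-in for the thread that would call the parser. Running jstack against the process a few times gives similar information without any code.

```java
import java.util.HashMap;
import java.util.Map;

class StackSampler {
    static volatile long sink; // keeps the stand-in workload from being optimized away

    // Samples the top stack frame of the target thread up to `samples` times,
    // sleeping `intervalMillis` between samples, and counts each frame seen.
    static Map<String, Integer> sampleTopFrames(Thread target, int samples, long intervalMillis)
            throws InterruptedException {
        Map<String, Integer> counts = new HashMap<>();
        for (int i = 0; i < samples && target.isAlive(); i++) {
            StackTraceElement[] stack = target.getStackTrace();
            if (stack.length > 0) {
                String top = stack[0].getClassName() + "." + stack[0].getMethodName();
                counts.merge(top, 1, Integer::sum);
            }
            Thread.sleep(intervalMillis);
        }
        return counts;
    }

    // Stand-in workload; in practice this would be the thread doing the parsing.
    static void busyLoop() {
        long x = 0;
        while (!Thread.currentThread().isInterrupted()) {
            x += System.nanoTime() % 7;
            sink = x;
        }
    }

    public static void main(String[] args) throws Exception {
        Thread worker = new Thread(StackSampler::busyLoop, "worker");
        worker.setDaemon(true);
        worker.start();
        Map<String, Integer> counts = sampleTopFrames(worker, 50, 20);
        worker.interrupt();
        // Frames that show up in most samples are where the time is going.
        counts.forEach((frame, n) -> System.out.println(n + "  " + frame));
    }
}
```

Sampling only the top frame is a simplification; recording a few frames below it would show which caller is responsible.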

Am 27.02.2019 um 17:23 schrieb Slava G:
> After 3h 40m it's still parsing using PDFBox 2.0.14 app...
> Thanks
>
> On Wed, Feb 27, 2019 at 3:29 PM Slava G <sl...@gmail.com> wrote:
>
>> With 2.0.14 it's 40 minutes running, no result, still working...
>> Seems that issue is still there.
>> Thanks
>>
>> On Wed, Feb 27, 2019 at 2:52 PM Slava G <sl...@gmail.com> wrote:
>>
>>> Checking with 2.0.14. Started as an app. Will update soon.
>>>
>>> On Wed, Feb 27, 2019 at 2:47 PM Tim Allison <ta...@apache.org> wrote:
>>>
>>>> Any chance you could try with the 2.0.14 release candidate...unless you
>>>> have already?
>>>>
>>>> https://dist.apache.org/repos/dist/dev/pdfbox/2.0.14/
>>>>
>>>>
>>>> On Wed, Feb 27, 2019 at 3:04 AM Slava G <sl...@gmail.com> wrote:
>>>>
>>>>> Well, I ran (as was suggested) PDFBox app to extract text , so far 2
>>>>> hours and still counting...
>>>>> It's seems to be a PDFBox issue.
>>>>>
>>>>> On Wed, Feb 27, 2019 at 9:51 AM JB Data31 <jb...@gmail.com> wrote:
>>>>>
>>>>>> Why don't you do a basic test with tika server in a 3thrd and a *wget*
>>>>>> or *curl* bash client to parse your 65Mo PDF.
>>>>>> It can be easier to investigate the problem.
>>>>>>
>>>>>> @*JB*Δ <http://jbigdata.fr/jbigdata/index.html>
>>>>>>
>>>>>>
>>>>>>
>>>>>> Le mar. 26 févr. 2019 à 23:05, Cristian Vat <cr...@gmail.com>
>>>>>> a écrit :
>>>>>>
>>>>>>> Just looking at the stack trace it won't be the same anymore due to
>>>>>>> PDFBOX-4453
>>>>>>> Some changes present in not yet released pdfbox 2.0.14 and it changes
>>>>>>> how decryption is handled. Not sure if related though.
>>>>>>>
>>>>>>> Can you duplicate the problem without Tika using just PDFBox
>>>>>>> command-line ExtractText command (
>>>>>>> https://pdfbox.apache.org/2.0/commandline.html ) on that file?
>>>>>>>
>>>>>>>
>>>>>>> On Tue, Feb 26, 2019 at 8:24 PM Slava G <sl...@gmail.com> wrote:
>>>>>>>
>>>>>>>> This is the code :
>>>>>>>> InputStream in = TikaInputStream.get(inputFile.toPath());
>>>>>>>> PDFParser tmpPdf = new PDFParser();
>>>>>>>> PDFParserConfig config = tmpPdf.getPDFParserConfig();
>>>>>>>> config.setMaxMainMemoryBytes(31457280);
>>>>>>>> config.setExtractAcroFormContent(false);
>>>>>>>> config.setExtractBookmarksText(false);
>>>>>>>> config.setCatchIntermediateIOExceptions(true);
>>>>>>>> Metadata metadata = new Metadata();
>>>>>>>> metadata.set(HttpHelper.CONTENT_TYPE, "application/pdf");
>>>>>>>> tmpPdf.parse(inputStream, textHandler, this.metadata, new
>>>>>>>> ParseContext());
>>>>>>>>
>>>>>>>>
>>>>>>>> On Tue, Feb 26, 2019 at 8:02 PM Tim Allison <ta...@apache.org>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> This is the default in Tika, where the default for
>>>>>>>>> maxMainMemoryBytes=500MB.
>>>>>>>>>
>>>>>>>>> Slava, how are you calling this in Tika?  With a TikaInputStream
>>>>>>>>> via tika-app or tika-server or something else?
>>>>>>>>>
>>>>>>>>> MemoryUsageSetting memoryUsageSetting =
>>>>>>>>> MemoryUsageSetting.setupMainMemoryOnly();
>>>>>>>>> if (localConfig.getMaxMainMemoryBytes() >= 0) {
>>>>>>>>> memoryUsageSetting =
>>>>>>>>> MemoryUsageSetting.setupMixed(localConfig.getMaxMainMemoryBytes());
>>>>>>>>> }
>>>>>>>>> if (tstream != null && tstream.hasFile()) {
>>>>>>>>> // File based -- send file directly to PDFBox
>>>>>>>>> pdfDocument = PDDocument.load(tstream.getPath().toFile(), password,
>>>>>>>>> memoryUsageSetting);
>>>>>>>>> } else {
>>>>>>>>> pdfDocument = PDDocument.load(new CloseShieldInputStream(stream),
>>>>>>>>> password, memoryUsageSetting);
>>>>>>>>> }
>>>>>>>>>
>>>>>>>>> On Tue, Feb 26, 2019 at 12:43 PM Tilman Hausherr <
>>>>>>>>> THausherr@t-online.de> wrote:
>>>>>>>>>
>>>>>>>>>> Hi,
>>>>>>>>>>
>>>>>>>>>> As usual, it would be nice to have the PDF, so that we could run
>>>>>>>>>> the
>>>>>>>>>> profiler.
>>>>>>>>>>
>>>>>>>>>> The HashSet is used to avoid decrypting objects twice.
>>>>>>>>>>
>>>>>>>>>> The "not encrypted" file is likely encrypted with an empty user
>>>>>>>>>> password.
>>>>>>>>>>
>>>>>>>>>> It would also be interesting to hear what parameter is passed to
>>>>>>>>>> MemoryUsageSetting when load() is called.
>>>>>>>>>>
>>>>>>>>>> Tilman
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On 26.02.2019 at 18:14, Tim Allison wrote:
>>>>>>>>>>> PDFBox Colleagues,
>>>>>>>>>>>     Any ideas?
>>>>>>>>>>>
>>>>>>>>>>> ---------- Forwarded message ---------
>>>>>>>>>>> From: Tim Allison <ta...@apache.org>
>>>>>>>>>>> Date: Tue, Feb 26, 2019 at 12:13 PM
>>>>>>>>>>> Subject: Re: Very slow PDF parsing.
>>>>>>>>>>> To: <us...@tika.apache.org>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Sorry...that's an OCR tool.  One thing that can slow down processing
>>>>>>>>>>> dramatically is if you have tesseract installed (try typing 'tesseract' on
>>>>>>>>>>> your commandline) and if you've turned it on for PDFs.  I suspect this
>>>>>>>>>>> isn't your problem, though.
>>>>>>>>>>>
>>>>>>>>>>> On Tue, Feb 26, 2019 at 12:08 PM Slava G <sl...@gmail.com> wrote:
>>>>>>>>>>>> Thanks Tim,
>>>>>>>>>>>> But frankly speaking, it's a shame, but don't know what is tessercat is in
>>>>>>>>>>>> this context 🙂
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks
>>>>>>>>>>>>
>>>>>>>>>>>> On Tue, Feb 26, 2019, 19:04 Tim Allison <ta...@apache.org> wrote:
>>>>>>>>>>>>> Thank you, Slava!
>>>>>>>>>>>>>
>>>>>>>>>>>>> Do you have tesseract installed?
>>>>>>>>>>>>>
>>>>>>>>>>>>> Colleagues on PDFBox, any recommendations?
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Tue, Feb 26, 2019 at 11:56 AM Slava G <sl...@gmail.com> wrote:
>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I have large PDF (about 65mb) that contains mainly text and some images.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Parsing of such PDF can take about 2 days or even more (TIKA 1.19.1
>>>>>>>>>>>>>> running on XEON server with 4 cores CPU and 30GB RAM with SSD disk, running
>>>>>>>>>>>>>> CentOS Linux).
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Please advise if there anything I can do to speedup.Or maybe it's a bug
>>>>>>>>>>>>>> in PDFBox ?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> When I'm printing java stack , I see all the time in this stack :
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> at org.apache.pdfbox.cos.COSString.equals(COSString.java:259)
>>>>>>>>>>>>>> at java.util.HashMap$TreeNode.find(Unknown Source)
>>>>>>>>>>>>>> at java.util.HashMap$TreeNode.find(Unknown Source)
>>>>>>>>>>>>>> at java.util.HashMap$TreeNode.find(Unknown Source)
>>>>>>>>>>>>>> at java.util.HashMap$TreeNode.find(Unknown Source)
>>>>>>>>>>>>>> at java.util.HashMap$TreeNode.find(Unknown Source)
>>>>>>>>>>>>>> at java.util.HashMap$TreeNode.find(Unknown Source)
>>>>>>>>>>>>>> at java.util.HashMap$TreeNode.find(Unknown Source)
>>>>>>>>>>>>>> at java.util.HashMap$TreeNode.find(Unknown Source)
>>>>>>>>>>>>>> at java.util.HashMap$TreeNode.find(Unknown Source)
>>>>>>>>>>>>>> at java.util.HashMap$TreeNode.getTreeNode(Unknown Source)
>>>>>>>>>>>>>> at java.util.HashMap.getNode(Unknown Source)
>>>>>>>>>>>>>> at java.util.HashMap.containsKey(Unknown Source)
>>>>>>>>>>>>>> at java.util.HashSet.contains(Unknown Source)
>>>>>>>>>>>>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:390)
>>>>>>>>>>>>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decryptDictionary(SecurityHandler.java:517)
>>>>>>>>>>>>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:404)
>>>>>>>>>>>>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decryptArray(SecurityHandler.java:577)
>>>>>>>>>>>>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:408)
>>>>>>>>>>>>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decryptDictionary(SecurityHandler.java:517)
>>>>>>>>>>>>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:404)
>>>>>>>>>>>>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decryptDictionary(SecurityHandler.java:517)
>>>>>>>>>>>>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:404)
>>>>>>>>>>>>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decryptArray(SecurityHandler.java:577)
>>>>>>>>>>>>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:408)
>>>>>>>>>>>>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decryptDictionary(SecurityHandler.java:517)
>>>>>>>>>>>>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:404)
>>>>>>>>>>>>>> at org.apache.pdfbox.pdfparser.COSParser.parseFileObject(COSParser.java:946)
>>>>>>>>>>>>>> at org.apache.pdfbox.pdfparser.COSParser.parseObjectDynamically(COSParser.java:874)
>>>>>>>>>>>>>> at org.apache.pdfbox.pdfparser.COSParser.parseObjectDynamically(COSParser.java:794)
>>>>>>>>>>>>>> at org.apache.pdfbox.pdfparser.COSParser.parseDictObjects(COSParser.java:754)
>>>>>>>>>>>>>> at org.apache.pdfbox.pdfparser.PDFParser.initialParse(PDFParser.java:185)
>>>>>>>>>>>>>> at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:220)
>>>>>>>>>>>>>> at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1028)
>>>>>>>>>>>>>> at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:984)
>>>>>>>>>>>>>> at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:152)
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> P.S. Btw, the PDF is not encrypted at all.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> ---------------------------------------------------------------------
>>>>>>>>>> To unsubscribe, e-mail: users-unsubscribe@pdfbox.apache.org
>>>>>>>>>> For additional commands, e-mail: users-help@pdfbox.apache.org
>>>>>>>>>>
>>>>>>>>>>



Re: Fwd: Very slow PDF parsing.

Posted by Tim Allison <ta...@apache.org>.

Any chance you can share the file directly with me or someone else on the
PDFBox team?
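
For reference, the run of HashMap$TreeNode.find frames under HashSet.contains
in the trace quoted below is, as far as I can tell from the JDK 8 source, what
a lookup looks like when many keys share one hash value and are not Comparable:
the bucket becomes a tree with no usable ordering, so each lookup walks it and
calls equals() on node after node. A minimal stdlib-only sketch of that effect
follows. The Key class and its constant hashCode() are hypothetical stand-ins,
not PDFBox's COSString, and whether this is what actually happens with this
PDF is an assumption until someone profiles the file.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical demo class; nothing here comes from PDFBox or Tika.
class CollidingKeyDemo {
    // Counts every equals() call on a Key; a crude stand-in for a profiler.
    static long equalsCalls = 0;

    // Key that is deliberately NOT Comparable. When collide is true, every
    // instance reports the same hash code, so all keys land in one bucket.
    static final class Key {
        final int id;
        final boolean collide;

        Key(int id, boolean collide) {
            this.id = id;
            this.collide = collide;
        }

        @Override
        public int hashCode() {
            return collide ? 42 : Integer.hashCode(id);
        }

        @Override
        public boolean equals(Object o) {
            equalsCalls++;
            return o instanceof Key && ((Key) o).id == id;
        }
    }

    // Fills a HashSet with n keys, then performs n lookups of keys that are
    // not in the set and returns how many equals() calls those lookups made.
    static long equalsCallsForMisses(int n, boolean collide) {
        Set<Key> seen = new HashSet<>();
        for (int i = 0; i < n; i++) {
            seen.add(new Key(i, collide));
        }
        equalsCalls = 0; // ignore the calls made while filling the set
        for (int i = n; i < 2 * n; i++) {
            seen.contains(new Key(i, collide));
        }
        return equalsCalls;
    }

    public static void main(String[] args) {
        int n = 2000;
        System.out.println("spread hashes:    " + equalsCallsForMisses(n, false));
        System.out.println("colliding hashes: " + equalsCallsForMisses(n, true));
    }
}
```

With spread hashes the failed lookups land in buckets that are empty or nearly
so; with colliding hashes each one has to scan the whole bucket, so the cost
grows with the number of entries already in the set.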

On Wed, Feb 27, 2019 at 11:24 AM Slava G <sl...@gmail.com> wrote:

> After 3h 40m it's still parsing using PDFBox 2.0.14 app...
> Thanks
>
> On Wed, Feb 27, 2019 at 3:29 PM Slava G <sl...@gmail.com> wrote:
>
>> With 2.0.14 it's 40 minutes running, no result, still working...
>> Seems that issue is still there.
>> Thanks
>>
>> On Wed, Feb 27, 2019 at 2:52 PM Slava G <sl...@gmail.com> wrote:
>>
>>> Checking with 2.0.14. Started as an app. Will update soon.
>>>
>>> On Wed, Feb 27, 2019 at 2:47 PM Tim Allison <ta...@apache.org> wrote:
>>>
>>>> Any chance you could try with the 2.0.14 release candidate...unless you
>>>> have already?
>>>>
>>>> https://dist.apache.org/repos/dist/dev/pdfbox/2.0.14/
>>>>
>>>>
>>>> On Wed, Feb 27, 2019 at 3:04 AM Slava G <sl...@gmail.com> wrote:
>>>>
>>>>> Well, I ran the PDFBox app (as was suggested) to extract text; so far 2
>>>>> hours and still counting...
>>>>> It seems to be a PDFBox issue.
>>>>>
>>>>> On Wed, Feb 27, 2019 at 9:51 AM JB Data31 <jb...@gmail.com> wrote:
>>>>>
>>>>>> Why don't you do a basic test with tika server in a 3thrd and a
>>>>>> *wget* or *curl* bash client to parse your 65Mo PDF.
>>>>>> It can be easier to investigate the problem.
>>>>>>
>>>>>> @*JB*Δ <http://jbigdata.fr/jbigdata/index.html>
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Tue, Feb 26, 2019 at 11:05 PM Cristian Vat <cr...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Just looking at the stack trace it won't be the same anymore due to
>>>>>>> PDFBOX-4453
>>>>>>> Some changes present in not yet released pdfbox 2.0.14 and it
>>>>>>> changes how decryption is handled. Not sure if related though.
>>>>>>>
>>>>>>> Can you duplicate the problem without Tika using just PDFBox
>>>>>>> command-line ExtractText command (
>>>>>>> https://pdfbox.apache.org/2.0/commandline.html ) on that file?
>>>>>>>
>>>>>>>
>>>>>>> On Tue, Feb 26, 2019 at 8:24 PM Slava G <sl...@gmail.com> wrote:
>>>>>>>
>>>>>>>> This is the code :
>>>>>>>> InputStream in = TikaInputStream.get(inputFile.toPath());
>>>>>>>> PDFParser tmpPdf = new PDFParser();
>>>>>>>> PDFParserConfig config = tmpPdf.getPDFParserConfig();
>>>>>>>> config.setMaxMainMemoryBytes(31457280);
>>>>>>>> config.setExtractAcroFormContent(false);
>>>>>>>> config.setExtractBookmarksText(false);
>>>>>>>> config.setCatchIntermediateIOExceptions(true);
>>>>>>>> Metadata metadata = new Metadata();
>>>>>>>> metadata.set(HttpHelper.CONTENT_TYPE, "application/pdf");
>>>>>>>> tmpPdf.parse(inputStream, textHandler, this.metadata, new
>>>>>>>> ParseContext());
>>>>>>>>
>>>>>>>>
>>>>>>>> On Tue, Feb 26, 2019 at 8:02 PM Tim Allison <ta...@apache.org>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>>
>>>>>>>>> This is the default in Tika, where the default for
>>>>>>>>> maxMainMemoryBytes=500MB.
>>>>>>>>>
>>>>>>>>> Slava, how are you calling this in Tika?  With a TikaInputStream
>>>>>>>>> via tika-app or tika-server or something else?
>>>>>>>>>
>>>>>>>>> MemoryUsageSetting memoryUsageSetting =
>>>>>>>>>     MemoryUsageSetting.setupMainMemoryOnly();
>>>>>>>>> if (localConfig.getMaxMainMemoryBytes() >= 0) {
>>>>>>>>>     memoryUsageSetting =
>>>>>>>>>         MemoryUsageSetting.setupMixed(localConfig.getMaxMainMemoryBytes());
>>>>>>>>> }
>>>>>>>>> if (tstream != null && tstream.hasFile()) {
>>>>>>>>>     // File based -- send file directly to PDFBox
>>>>>>>>>     pdfDocument = PDDocument.load(tstream.getPath().toFile(), password,
>>>>>>>>>         memoryUsageSetting);
>>>>>>>>> } else {
>>>>>>>>>     pdfDocument = PDDocument.load(new CloseShieldInputStream(stream),
>>>>>>>>>         password, memoryUsageSetting);
>>>>>>>>> }
>>>>>>>>>
>>>>>>>>> On Tue, Feb 26, 2019 at 12:43 PM Tilman Hausherr <
>>>>>>>>> THausherr@t-online.de> wrote:
>>>>>>>>>
>>>>>>>>>> Hi,
>>>>>>>>>>
>>>>>>>>>> As usual, it would be nice to have the PDF, so that we could run
>>>>>>>>>> the
>>>>>>>>>> profiler.
>>>>>>>>>>
>>>>>>>>>> The HashSet is used to avoid decrypting objects twice.
>>>>>>>>>>
>>>>>>>>>> The "not encrypted" file is likely encrypted with an empty user
>>>>>>>>>> password.
>>>>>>>>>>
>>>>>>>>>> It would also be interesting to hear what parameter is passed to
>>>>>>>>>> MemoryUsageSetting when load() is called.
>>>>>>>>>>
>>>>>>>>>> Tilman
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On 26.02.2019 at 18:14, Tim Allison wrote:
>>>>>>>>>> > PDFBox Colleagues,
>>>>>>>>>> >    Any ideas?
>>>>>>>>>> >
>>>>>>>>>> > ---------- Forwarded message ---------
>>>>>>>>>> > From: Tim Allison <ta...@apache.org>
>>>>>>>>>> > Date: Tue, Feb 26, 2019 at 12:13 PM
>>>>>>>>>> > Subject: Re: Very slow PDF parsing.
>>>>>>>>>> > To: <us...@tika.apache.org>
>>>>>>>>>> >
>>>>>>>>>> >
>>>>>>>>>> > Sorry...that's an OCR tool.  One thing that can slow down processing
>>>>>>>>>> > dramatically is if you have tesseract installed (try typing 'tesseract' on
>>>>>>>>>> > your commandline) and if you've turned it on for PDFs.  I suspect this
>>>>>>>>>> > isn't your problem, though.
>>>>>>>>>> >
>>>>>>>>>> > On Tue, Feb 26, 2019 at 12:08 PM Slava G <sl...@gmail.com> wrote:
>>>>>>>>>> >
>>>>>>>>>> >> Thanks Tim,
>>>>>>>>>> >> But frankly speaking, it's a shame, but don't know what is tessercat is in
>>>>>>>>>> >> this context 🙂
>>>>>>>>>> >>
>>>>>>>>>> >> Thanks
>>>>>>>>>> >>
>>>>>>>>>> >> On Tue, Feb 26, 2019, 19:04 Tim Allison <ta...@apache.org> wrote:
>>>>>>>>>> >>
>>>>>>>>>> >>> Thank you, Slava!
>>>>>>>>>> >>>
>>>>>>>>>> >>> Do you have tesseract installed?
>>>>>>>>>> >>>
>>>>>>>>>> >>> Colleagues on PDFBox, any recommendations?
>>>>>>>>>> >>>
>>>>>>>>>> >>> On Tue, Feb 26, 2019 at 11:56 AM Slava G <sl...@gmail.com> wrote:
>>>>>>>>>> >>>> Hi,
>>>>>>>>>> >>>>
>>>>>>>>>> >>>> I have large PDF (about 65mb) that contains mainly text and some images.
>>>>>>>>>> >>>>
>>>>>>>>>> >>>> Parsing of such PDF can take about 2 days or even more (TIKA 1.19.1
>>>>>>>>>> >>>> running on XEON server with 4 cores CPU and 30GB RAM with SSD disk, running
>>>>>>>>>> >>>> CentOS Linux).
>>>>>>>>>> >>>>
>>>>>>>>>> >>>> Please advise if there anything I can do to speedup.Or maybe it's a bug
>>>>>>>>>> >>>> in PDFBox ?
>>>>>>>>>> >>>>
>>>>>>>>>> >>>> When I'm printing java stack , I see all the time in this stack :
>>>>>>>>>> >>>>
>>>>>>>>>> >>>> at org.apache.pdfbox.cos.COSString.equals(COSString.java:259)
>>>>>>>>>> >>>> at java.util.HashMap$TreeNode.find(Unknown Source)
>>>>>>>>>> >>>> at java.util.HashMap$TreeNode.find(Unknown Source)
>>>>>>>>>> >>>> at java.util.HashMap$TreeNode.find(Unknown Source)
>>>>>>>>>> >>>> at java.util.HashMap$TreeNode.find(Unknown Source)
>>>>>>>>>> >>>> at java.util.HashMap$TreeNode.find(Unknown Source)
>>>>>>>>>> >>>> at java.util.HashMap$TreeNode.find(Unknown Source)
>>>>>>>>>> >>>> at java.util.HashMap$TreeNode.find(Unknown Source)
>>>>>>>>>> >>>> at java.util.HashMap$TreeNode.find(Unknown Source)
>>>>>>>>>> >>>> at java.util.HashMap$TreeNode.find(Unknown Source)
>>>>>>>>>> >>>> at java.util.HashMap$TreeNode.getTreeNode(Unknown Source)
>>>>>>>>>> >>>> at java.util.HashMap.getNode(Unknown Source)
>>>>>>>>>> >>>> at java.util.HashMap.containsKey(Unknown Source)
>>>>>>>>>> >>>> at java.util.HashSet.contains(Unknown Source)
>>>>>>>>>> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:390)
>>>>>>>>>> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decryptDictionary(SecurityHandler.java:517)
>>>>>>>>>> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:404)
>>>>>>>>>> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decryptArray(SecurityHandler.java:577)
>>>>>>>>>> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:408)
>>>>>>>>>> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decryptDictionary(SecurityHandler.java:517)
>>>>>>>>>> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:404)
>>>>>>>>>> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decryptDictionary(SecurityHandler.java:517)
>>>>>>>>>> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:404)
>>>>>>>>>> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decryptArray(SecurityHandler.java:577)
>>>>>>>>>> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:408)
>>>>>>>>>> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decryptDictionary(SecurityHandler.java:517)
>>>>>>>>>> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:404)
>>>>>>>>>> >>>> at org.apache.pdfbox.pdfparser.COSParser.parseFileObject(COSParser.java:946)
>>>>>>>>>> >>>> at org.apache.pdfbox.pdfparser.COSParser.parseObjectDynamically(COSParser.java:874)
>>>>>>>>>> >>>> at org.apache.pdfbox.pdfparser.COSParser.parseObjectDynamically(COSParser.java:794)
>>>>>>>>>> >>>> at org.apache.pdfbox.pdfparser.COSParser.parseDictObjects(COSParser.java:754)
>>>>>>>>>> >>>> at org.apache.pdfbox.pdfparser.PDFParser.initialParse(PDFParser.java:185)
>>>>>>>>>> >>>> at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:220)
>>>>>>>>>> >>>> at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1028)
>>>>>>>>>> >>>> at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:984)
>>>>>>>>>> >>>> at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:152)
>>>>>>>>>> >>>>
>>>>>>>>>> >>>> P.S. Btw, the PDF is not encrypted at all.
>>>>>>>>>> >>>>
>>>>>>>>>> >>>> Thanks
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> ---------------------------------------------------------------------
>>>>>>>>>> To unsubscribe, e-mail: users-unsubscribe@pdfbox.apache.org
>>>>>>>>>> For additional commands, e-mail: users-help@pdfbox.apache.org
>>>>>>>>>>
>>>>>>>>>>

Re: Fwd: Very slow PDF parsing.

Posted by Slava G <sl...@gmail.com>.

After 3h 40m it's still parsing using PDFBox 2.0.14 app...
Thanks

On Wed, Feb 27, 2019 at 3:29 PM Slava G <sl...@gmail.com> wrote:

> With 2.0.14 it's 40 minutes running, no result, still working...
> Seems that issue is still there.
> Thanks
>
> On Wed, Feb 27, 2019 at 2:52 PM Slava G <sl...@gmail.com> wrote:
>
>> Checking with 2.0.14. Started as an app. Will update soon.
>>
>> On Wed, Feb 27, 2019 at 2:47 PM Tim Allison <ta...@apache.org> wrote:
>>
>>> Any chance you could try with the 2.0.14 release candidate...unless you
>>> have already?
>>>
>>> https://dist.apache.org/repos/dist/dev/pdfbox/2.0.14/
>>>
>>>
>>> On Wed, Feb 27, 2019 at 3:04 AM Slava G <sl...@gmail.com> wrote:
>>>
>>>> Well, I ran the PDFBox app (as was suggested) to extract text; so far 2
>>>> hours and still counting...
>>>> It seems to be a PDFBox issue.
>>>>
>>>> On Wed, Feb 27, 2019 at 9:51 AM JB Data31 <jb...@gmail.com> wrote:
>>>>
>>>>> Why don't you do a basic test with tika server in a 3thrd and a *wget*
>>>>> or *curl* bash client to parse your 65Mo PDF.
>>>>> It can be easier to investigate the problem.
>>>>>
>>>>> @*JB*Δ <http://jbigdata.fr/jbigdata/index.html>
>>>>>
>>>>>
>>>>>
>>>>> On Tue, Feb 26, 2019 at 11:05 PM Cristian Vat <cr...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Just looking at the stack trace it won't be the same anymore due to
>>>>>> PDFBOX-4453
>>>>>> Some changes present in not yet released pdfbox 2.0.14 and it changes
>>>>>> how decryption is handled. Not sure if related though.
>>>>>>
>>>>>> Can you duplicate the problem without Tika using just PDFBox
>>>>>> command-line ExtractText command (
>>>>>> https://pdfbox.apache.org/2.0/commandline.html ) on that file?
>>>>>>
>>>>>>
>>>>>> On Tue, Feb 26, 2019 at 8:24 PM Slava G <sl...@gmail.com> wrote:
>>>>>>
>>>>>>> This is the code :
>>>>>>> InputStream in = TikaInputStream.get(inputFile.toPath());
>>>>>>> PDFParser tmpPdf = new PDFParser();
>>>>>>> PDFParserConfig config = tmpPdf.getPDFParserConfig();
>>>>>>> config.setMaxMainMemoryBytes(31457280);
>>>>>>> config.setExtractAcroFormContent(false);
>>>>>>> config.setExtractBookmarksText(false);
>>>>>>> config.setCatchIntermediateIOExceptions(true);
>>>>>>> Metadata metadata = new Metadata();
>>>>>>> metadata.set(HttpHelper.CONTENT_TYPE, "application/pdf");
>>>>>>> tmpPdf.parse(inputStream, textHandler, this.metadata, new
>>>>>>> ParseContext());
>>>>>>>
>>>>>>>
>>>>>>> On Tue, Feb 26, 2019 at 8:02 PM Tim Allison <ta...@apache.org>
>>>>>>> wrote:
>>>>>>>
>>>>>>>>
>>>>>>>> This is the default in Tika, where the default for
>>>>>>>> maxMainMemoryBytes=500MB.
>>>>>>>>
>>>>>>>> Slava, how are you calling this in Tika?  With a TikaInputStream
>>>>>>>> via tika-app or tika-server or something else?
>>>>>>>>
>>>>>>>> MemoryUsageSetting memoryUsageSetting =
>>>>>>>>     MemoryUsageSetting.setupMainMemoryOnly();
>>>>>>>> if (localConfig.getMaxMainMemoryBytes() >= 0) {
>>>>>>>>     memoryUsageSetting =
>>>>>>>>         MemoryUsageSetting.setupMixed(localConfig.getMaxMainMemoryBytes());
>>>>>>>> }
>>>>>>>> if (tstream != null && tstream.hasFile()) {
>>>>>>>>     // File based -- send file directly to PDFBox
>>>>>>>>     pdfDocument = PDDocument.load(tstream.getPath().toFile(), password,
>>>>>>>>         memoryUsageSetting);
>>>>>>>> } else {
>>>>>>>>     pdfDocument = PDDocument.load(new CloseShieldInputStream(stream),
>>>>>>>>         password, memoryUsageSetting);
>>>>>>>> }
>>>>>>>>
>>>>>>>> On Tue, Feb 26, 2019 at 12:43 PM Tilman Hausherr <
>>>>>>>> THausherr@t-online.de> wrote:
>>>>>>>>
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> As usual, it would be nice to have the PDF, so that we could run
>>>>>>>>> the
>>>>>>>>> profiler.
>>>>>>>>>
>>>>>>>>> The HashSet is used to avoid decrypting objects twice.
>>>>>>>>>
>>>>>>>>> The "not encrypted" file is likely encrypted with an empty user
>>>>>>>>> password.
>>>>>>>>>
>>>>>>>>> It would also be interesting to hear what parameter is passed to
>>>>>>>>> MemoryUsageSetting when load() is called.
>>>>>>>>>
>>>>>>>>> Tilman
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On 26.02.2019 at 18:14, Tim Allison wrote:
>>>>>>>>> > PDFBox Colleagues,
>>>>>>>>> >    Any ideas?
>>>>>>>>> >
>>>>>>>>> > ---------- Forwarded message ---------
>>>>>>>>> > From: Tim Allison <ta...@apache.org>
>>>>>>>>> > Date: Tue, Feb 26, 2019 at 12:13 PM
>>>>>>>>> > Subject: Re: Very slow PDF parsing.
>>>>>>>>> > To: <us...@tika.apache.org>
>>>>>>>>> >
>>>>>>>>> >
>>>>>>>>> > Sorry...that's an OCR tool.  One thing that can slow down processing
>>>>>>>>> > dramatically is if you have tesseract installed (try typing 'tesseract' on
>>>>>>>>> > your commandline) and if you've turned it on for PDFs.  I suspect this
>>>>>>>>> > isn't your problem, though.
>>>>>>>>> >
>>>>>>>>> > On Tue, Feb 26, 2019 at 12:08 PM Slava G <sl...@gmail.com> wrote:
>>>>>>>>> >
>>>>>>>>> >> Thanks Tim,
>>>>>>>>> >> But frankly speaking, it's a shame, but don't know what is tessercat is in
>>>>>>>>> >> this context 🙂
>>>>>>>>> >>
>>>>>>>>> >> Thanks
>>>>>>>>> >>
>>>>>>>>> >> On Tue, Feb 26, 2019, 19:04 Tim Allison <ta...@apache.org> wrote:
>>>>>>>>> >>
>>>>>>>>> >>> Thank you, Slava!
>>>>>>>>> >>>
>>>>>>>>> >>> Do you have tesseract installed?
>>>>>>>>> >>>
>>>>>>>>> >>> Colleagues on PDFBox, any recommendations?
>>>>>>>>> >>>
>>>>>>>>> >>> On Tue, Feb 26, 2019 at 11:56 AM Slava G <sl...@gmail.com> wrote:
>>>>>>>>> >>>> Hi,
>>>>>>>>> >>>>
>>>>>>>>> >>>> I have large PDF (about 65mb) that contains mainly text and some images.
>>>>>>>>> >>>>
>>>>>>>>> >>>> Parsing of such PDF can take about 2 days or even more (TIKA 1.19.1
>>>>>>>>> >>>> running on XEON server with 4 cores CPU and 30GB RAM with SSD disk, running
>>>>>>>>> >>>> CentOS Linux).
>>>>>>>>> >>>>
>>>>>>>>> >>>> Please advise if there anything I can do to speedup.Or maybe it's a bug
>>>>>>>>> >>>> in PDFBox ?
>>>>>>>>> >>>>
>>>>>>>>> >>>> When I'm printing java stack , I see all the time in this stack :
>>>>>>>>> >>>>
>>>>>>>>> >>>> at org.apache.pdfbox.cos.COSString.equals(COSString.java:259)
>>>>>>>>> >>>> at java.util.HashMap$TreeNode.find(Unknown Source)
>>>>>>>>> >>>> at java.util.HashMap$TreeNode.find(Unknown Source)
>>>>>>>>> >>>> at java.util.HashMap$TreeNode.find(Unknown Source)
>>>>>>>>> >>>> at java.util.HashMap$TreeNode.find(Unknown Source)
>>>>>>>>> >>>> at java.util.HashMap$TreeNode.find(Unknown Source)
>>>>>>>>> >>>> at java.util.HashMap$TreeNode.find(Unknown Source)
>>>>>>>>> >>>> at java.util.HashMap$TreeNode.find(Unknown Source)
>>>>>>>>> >>>> at java.util.HashMap$TreeNode.find(Unknown Source)
>>>>>>>>> >>>> at java.util.HashMap$TreeNode.find(Unknown Source)
>>>>>>>>> >>>> at java.util.HashMap$TreeNode.getTreeNode(Unknown Source)
>>>>>>>>> >>>> at java.util.HashMap.getNode(Unknown Source)
>>>>>>>>> >>>> at java.util.HashMap.containsKey(Unknown Source)
>>>>>>>>> >>>> at java.util.HashSet.contains(Unknown Source)
>>>>>>>>> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:390)
>>>>>>>>> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decryptDictionary(SecurityHandler.java:517)
>>>>>>>>> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:404)
>>>>>>>>> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decryptArray(SecurityHandler.java:577)
>>>>>>>>> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:408)
>>>>>>>>> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decryptDictionary(SecurityHandler.java:517)
>>>>>>>>> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:404)
>>>>>>>>> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decryptDictionary(SecurityHandler.java:517)
>>>>>>>>> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:404)
>>>>>>>>> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decryptArray(SecurityHandler.java:577)
>>>>>>>>> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:408)
>>>>>>>>> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decryptDictionary(SecurityHandler.java:517)
>>>>>>>>> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:404)
>>>>>>>>> >>>> at org.apache.pdfbox.pdfparser.COSParser.parseFileObject(COSParser.java:946)
>>>>>>>>> >>>> at org.apache.pdfbox.pdfparser.COSParser.parseObjectDynamically(COSParser.java:874)
>>>>>>>>> >>>> at org.apache.pdfbox.pdfparser.COSParser.parseObjectDynamically(COSParser.java:794)
>>>>>>>>> >>>> at org.apache.pdfbox.pdfparser.COSParser.parseDictObjects(COSParser.java:754)
>>>>>>>>> >>>> at org.apache.pdfbox.pdfparser.PDFParser.initialParse(PDFParser.java:185)
>>>>>>>>> >>>> at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:220)
>>>>>>>>> >>>> at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1028)
>>>>>>>>> >>>> at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:984)
>>>>>>>>> >>>> at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:152)
>>>>>>>>> >>>>
>>>>>>>>> >>>> P.S. Btw, the PDF is not encrypted at all.
>>>>>>>>> >>>>
>>>>>>>>> >>>> Thanks
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> ---------------------------------------------------------------------
>>>>>>>>> To unsubscribe, e-mail: users-unsubscribe@pdfbox.apache.org
>>>>>>>>> For additional commands, e-mail: users-help@pdfbox.apache.org
>>>>>>>>>
>>>>>>>>>

Re: Fwd: Very slow PDF parsing.

Posted by Slava G <sl...@gmail.com>.

After 3h 40m it's still parsing using PDFBox 2.0.14 app...
Thanks

On Wed, Feb 27, 2019 at 3:29 PM Slava G <sl...@gmail.com> wrote:

> With 2.0.14 it's 40 minutes running, no result, still working...
> Seems that issue is still there.
> Thanks
>
> On Wed, Feb 27, 2019 at 2:52 PM Slava G <sl...@gmail.com> wrote:
>
>> Checking with 2.0.14. Started as an app. Will update soon.
>>
>> On Wed, Feb 27, 2019 at 2:47 PM Tim Allison <ta...@apache.org> wrote:
>>
>>> Any chance you could try with the 2.0.14 release candidate...unless you
>>> have already?
>>>
>>> https://dist.apache.org/repos/dist/dev/pdfbox/2.0.14/
>>>
>>>
>>> On Wed, Feb 27, 2019 at 3:04 AM Slava G <sl...@gmail.com> wrote:
>>>
>>>> Well, I ran the PDFBox app (as was suggested) to extract text; so far 2
>>>> hours and still counting...
>>>> It seems to be a PDFBox issue.
>>>>
>>>> On Wed, Feb 27, 2019 at 9:51 AM JB Data31 <jb...@gmail.com> wrote:
>>>>
>>>>> Why don't you do a basic test with tika server in a 3thrd and a *wget*
>>>>> or *curl* bash client to parse your 65Mo PDF.
>>>>> It can be easier to investigate the problem.
>>>>>
>>>>> @*JB*Δ <http://jbigdata.fr/jbigdata/index.html>
>>>>>
>>>>>
>>>>>
>>>>> On Tue, Feb 26, 2019 at 11:05 PM Cristian Vat <cr...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Just looking at the stack trace it won't be the same anymore due to
>>>>>> PDFBOX-4453
>>>>>> Some changes present in not yet released pdfbox 2.0.14 and it changes
>>>>>> how decryption is handled. Not sure if related though.
>>>>>>
>>>>>> Can you duplicate the problem without Tika using just PDFBox
>>>>>> command-line ExtractText command (
>>>>>> https://pdfbox.apache.org/2.0/commandline.html ) on that file?
>>>>>>
>>>>>>
>>>>>> On Tue, Feb 26, 2019 at 8:24 PM Slava G <sl...@gmail.com> wrote:
>>>>>>
>>>>>>> This is the code :
>>>>>>> InputStream in = TikaInputStream.get(inputFile.toPath());
>>>>>>> PDFParser tmpPdf = new PDFParser();
>>>>>>> PDFParserConfig config = tmpPdf.getPDFParserConfig();
>>>>>>> config.setMaxMainMemoryBytes(31457280);
>>>>>>> config.setExtractAcroFormContent(false);
>>>>>>> config.setExtractBookmarksText(false);
>>>>>>> config.setCatchIntermediateIOExceptions(true);
>>>>>>> Metadata metadata = new Metadata();
>>>>>>> metadata.set(HttpHelper.CONTENT_TYPE, "application/pdf");
>>>>>>> tmpPdf.parse(inputStream, textHandler, this.metadata, new
>>>>>>> ParseContext());
>>>>>>>
>>>>>>>
>>>>>>> On Tue, Feb 26, 2019 at 8:02 PM Tim Allison <ta...@apache.org>
>>>>>>> wrote:
>>>>>>>
>>>>>>>>
>>>>>>>> This is the default in Tika, where the default for
>>>>>>>> maxMainMemoryBytes=500MB.
>>>>>>>>
>>>>>>>> Slava, how are you calling this in Tika?  With a TikaInputStream
>>>>>>>> via tika-app or tika-server or something else?
>>>>>>>>
>>>>>>>> MemoryUsageSetting memoryUsageSetting =
>>>>>>>>     MemoryUsageSetting.setupMainMemoryOnly();
>>>>>>>> if (localConfig.getMaxMainMemoryBytes() >= 0) {
>>>>>>>>     memoryUsageSetting =
>>>>>>>>         MemoryUsageSetting.setupMixed(localConfig.getMaxMainMemoryBytes());
>>>>>>>> }
>>>>>>>> if (tstream != null && tstream.hasFile()) {
>>>>>>>>     // File based -- send file directly to PDFBox
>>>>>>>>     pdfDocument = PDDocument.load(tstream.getPath().toFile(), password,
>>>>>>>>         memoryUsageSetting);
>>>>>>>> } else {
>>>>>>>>     pdfDocument = PDDocument.load(new CloseShieldInputStream(stream),
>>>>>>>>         password, memoryUsageSetting);
>>>>>>>> }
>>>>>>>>
>>>>>>>> On Tue, Feb 26, 2019 at 12:43 PM Tilman Hausherr <
>>>>>>>> THausherr@t-online.de> wrote:
>>>>>>>>
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> As usual, it would be nice to have the PDF, so that we could run
>>>>>>>>> the
>>>>>>>>> profiler.
>>>>>>>>>
>>>>>>>>> The HashSet is used to avoid decrypting objects twice.
>>>>>>>>>
>>>>>>>>> The "not encrypted" file is likely encrypted with an empty user
>>>>>>>>> password.
>>>>>>>>>
>>>>>>>>> It would also be interesting to hear what parameter is passed to
>>>>>>>>> MemoryUsageSetting when load() is called.
>>>>>>>>>
>>>>>>>>> Tilman
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On 26.02.2019 at 18:14, Tim Allison wrote:
>>>>>>>>> > PDFBox Colleagues,
>>>>>>>>> >    Any ideas?
>>>>>>>>> >
>>>>>>>>> > ---------- Forwarded message ---------
>>>>>>>>> > From: Tim Allison <ta...@apache.org>
>>>>>>>>> > Date: Tue, Feb 26, 2019 at 12:13 PM
>>>>>>>>> > Subject: Re: Very slow PDF parsing.
>>>>>>>>> > To: <us...@tika.apache.org>
>>>>>>>>> >
>>>>>>>>> >
>>>>>>>>> > Sorry...that's an OCR tool.  One thing that can slow down
>>>>>>>>> processing
>>>>>>>>> > dramatically is if you have tesseract installed (try typing
>>>>>>>>> 'tesseract' on
>>>>>>>>> > your commandline) and if you've turned it on for PDFs.  I
>>>>>>>>> suspect this
>>>>>>>>> > isn't your problem, though.
>>>>>>>>> >
>>>>>>>>> >
>>>>>>>>> >
>>>>>>>>> > On Tue, Feb 26, 2019 at 12:08 PM Slava G <sl...@gmail.com>
>>>>>>>>> wrote:
>>>>>>>>> >
>>>>>>>>> >> Thanks Tim,
>>>>>>>>> >> But frankly speaking, I'm ashamed to say I don't know what
>>>>>>>>> >> tesseract is in this context 🙂
>>>>>>>>> >>
>>>>>>>>> >> Thanks
>>>>>>>>> >>
>>>>>>>>> >> On Tue, Feb 26, 2019, 19:04 Tim Allison <ta...@apache.org>
>>>>>>>>> wrote:
>>>>>>>>> >>
>>>>>>>>> >>> Thank you, Slava!
>>>>>>>>> >>>
>>>>>>>>> >>> Do you have tesseract installed?
>>>>>>>>> >>>
>>>>>>>>> >>> Colleagues on PDFBox, any recommendations?
>>>>>>>>> >>>
>>>>>>>>> >>> On Tue, Feb 26, 2019 at 11:56 AM Slava G <sl...@gmail.com>
>>>>>>>>> wrote:
>>>>>>>>> >>>> Hi,
>>>>>>>>> >>>>
>>>>>>>>> >>>> I have large PDF (about 65mb) that contains mainly text and
>>>>>>>>> some images.
>>>>>>>>> >>>>
>>>>>>>>> >>>> Parsing of such PDF can take about 2 days or even more (TIKA
>>>>>>>>> 1.19.1
>>>>>>>>> >>> running on XEON server with 4 cores CPU and 30GB RAM with SSD
>>>>>>>>> disk, running
>>>>>>>>> >>> CentOS Linux).
>>>>>>>>> >>>> Please advise if there is anything I can do to speed it up. Or
>>>>>>>>> >>>> maybe it's a bug in PDFBox?
>>>>>>>>> >>>> When I'm printing java stack , I see all the time in this
>>>>>>>>> stack :
>>>>>>>>> >>>>
>>>>>>>>> >>>> at org.apache.pdfbox.cos.COSString.equals(COSString.java:259)
>>>>>>>>> >>>>
>>>>>>>>> >>>> at java.util.HashMap$TreeNode.find(Unknown Source)
>>>>>>>>> >>>>
>>>>>>>>> >>>> at java.util.HashMap$TreeNode.find(Unknown Source)
>>>>>>>>> >>>>
>>>>>>>>> >>>> at java.util.HashMap$TreeNode.find(Unknown Source)
>>>>>>>>> >>>>
>>>>>>>>> >>>> at java.util.HashMap$TreeNode.find(Unknown Source)
>>>>>>>>> >>>>
>>>>>>>>> >>>> at java.util.HashMap$TreeNode.find(Unknown Source)
>>>>>>>>> >>>>
>>>>>>>>> >>>> at java.util.HashMap$TreeNode.find(Unknown Source)
>>>>>>>>> >>>>
>>>>>>>>> >>>> at java.util.HashMap$TreeNode.find(Unknown Source)
>>>>>>>>> >>>>
>>>>>>>>> >>>> at java.util.HashMap$TreeNode.find(Unknown Source)
>>>>>>>>> >>>>
>>>>>>>>> >>>> at java.util.HashMap$TreeNode.find(Unknown Source)
>>>>>>>>> >>>>
>>>>>>>>> >>>> at java.util.HashMap$TreeNode.getTreeNode(Unknown Source)
>>>>>>>>> >>>>
>>>>>>>>> >>>> at java.util.HashMap.getNode(Unknown Source)
>>>>>>>>> >>>>
>>>>>>>>> >>>> at java.util.HashMap.containsKey(Unknown Source)
>>>>>>>>> >>>>
>>>>>>>>> >>>> at java.util.HashSet.contains(Unknown Source)
>>>>>>>>> >>>>
>>>>>>>>> >>>> at
>>>>>>>>> >>>
>>>>>>>>> org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:390)
>>>>>>>>> >>>> at
>>>>>>>>> >>>
>>>>>>>>> org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decryptDictionary(SecurityHandler.java:517)
>>>>>>>>> >>>> at
>>>>>>>>> >>>
>>>>>>>>> org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:404)
>>>>>>>>> >>>> at
>>>>>>>>> >>>
>>>>>>>>> org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decryptArray(SecurityHandler.java:577)
>>>>>>>>> >>>> at
>>>>>>>>> >>>
>>>>>>>>> org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:408)
>>>>>>>>> >>>> at
>>>>>>>>> >>>
>>>>>>>>> org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decryptDictionary(SecurityHandler.java:517)
>>>>>>>>> >>>> at
>>>>>>>>> >>>
>>>>>>>>> org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:404)
>>>>>>>>> >>>> at
>>>>>>>>> >>>
>>>>>>>>> org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decryptDictionary(SecurityHandler.java:517)
>>>>>>>>> >>>> at
>>>>>>>>> >>>
>>>>>>>>> org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:404)
>>>>>>>>> >>>> at
>>>>>>>>> >>>
>>>>>>>>> org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decryptArray(SecurityHandler.java:577)
>>>>>>>>> >>>> at
>>>>>>>>> >>>
>>>>>>>>> org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:408)
>>>>>>>>> >>>> at
>>>>>>>>> >>>
>>>>>>>>> org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decryptDictionary(SecurityHandler.java:517)
>>>>>>>>> >>>> at
>>>>>>>>> >>>
>>>>>>>>> org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:404)
>>>>>>>>> >>>> at
>>>>>>>>> >>>
>>>>>>>>> org.apache.pdfbox.pdfparser.COSParser.parseFileObject(COSParser.java:946)
>>>>>>>>> >>>> at
>>>>>>>>> >>>
>>>>>>>>> org.apache.pdfbox.pdfparser.COSParser.parseObjectDynamically(COSParser.java:874)
>>>>>>>>> >>>> at
>>>>>>>>> >>>
>>>>>>>>> org.apache.pdfbox.pdfparser.COSParser.parseObjectDynamically(COSParser.java:794)
>>>>>>>>> >>>> at
>>>>>>>>> >>>
>>>>>>>>> org.apache.pdfbox.pdfparser.COSParser.parseDictObjects(COSParser.java:754)
>>>>>>>>> >>>> at
>>>>>>>>> >>>
>>>>>>>>> org.apache.pdfbox.pdfparser.PDFParser.initialParse(PDFParser.java:185)
>>>>>>>>> >>>> at
>>>>>>>>> org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:220)
>>>>>>>>> >>>>
>>>>>>>>> >>>> at
>>>>>>>>> org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1028)
>>>>>>>>> >>>>
>>>>>>>>> >>>> at
>>>>>>>>> org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:984)
>>>>>>>>> >>>>
>>>>>>>>> >>>> at
>>>>>>>>> org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:152)
>>>>>>>>> >>>>
>>>>>>>>> >>>>
>>>>>>>>> >>>> P.S. Btw, the PDF is not encrypted at all.
>>>>>>>>> >>>>
>>>>>>>>> >>>> Thanks
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> ---------------------------------------------------------------------
>>>>>>>>> To unsubscribe, e-mail: users-unsubscribe@pdfbox.apache.org
>>>>>>>>> For additional commands, e-mail: users-help@pdfbox.apache.org
>>>>>>>>>
>>>>>>>>>

Re: Fwd: Very slow PDF parsing.

Posted by Slava G <sl...@gmail.com>.

With 2.0.14 it has been running for 40 minutes with no result, and it is
still working... It seems the issue is still there.
Thanks
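
For what it's worth, the repeated HashMap$TreeNode.find frames that end in
COSString.equals in the stack trace are consistent with many keys sharing one
hash bucket in the HashSet. Below is a minimal, self-contained sketch of that
effect using only the JDK. The Key class and the countEqualsCalls helper are
hypothetical and are not PDFBox code; whether this is what actually happens
with this PDF is an assumption that only a profiler run on the file could
confirm.

```java
import java.util.HashSet;
import java.util.Set;

public class CollidingKeysDemo {
    // Counts how often equals() is invoked; single-threaded demo only.
    static long equalsCalls = 0;

    // Hypothetical key type, NOT a PDFBox class. It is deliberately not
    // Comparable, so a treeified HashMap bin cannot order equal-hash keys.
    static final class Key {
        final int id;
        final boolean collide;

        Key(int id, boolean collide) {
            this.id = id;
            this.collide = collide;
        }

        @Override
        public int hashCode() {
            // With collide == true every key lands in the same bucket.
            return collide ? 42 : Integer.hashCode(id);
        }

        @Override
        public boolean equals(Object o) {
            equalsCalls++;
            return o instanceof Key && ((Key) o).id == id;
        }
    }

    // Fills a HashSet with n keys, then counts the equals() calls needed
    // for n successful contains() lookups.
    static long countEqualsCalls(int n, boolean collide) {
        Set<Key> seen = new HashSet<>();
        for (int i = 0; i < n; i++) {
            seen.add(new Key(i, collide));
        }
        equalsCalls = 0;
        for (int i = 0; i < n; i++) {
            seen.contains(new Key(i, collide));
        }
        return equalsCalls;
    }

    public static void main(String[] args) {
        int n = 2000;
        System.out.println("equals() calls, distinct hashes:  " + countEqualsCalls(n, false));
        System.out.println("equals() calls, colliding hashes: " + countEqualsCalls(n, true));
    }
}
```

With colliding hash codes each lookup has to walk a large part of the bucket
and call equals() along the way, so the total work grows roughly
quadratically with the number of entries instead of linearly.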

On Wed, Feb 27, 2019 at 2:52 PM Slava G <sl...@gmail.com> wrote:

> Checking with 2.0.14. Started as an app. Will update soon.
>
> On Wed, Feb 27, 2019 at 2:47 PM Tim Allison <ta...@apache.org> wrote:
>
>> Any chance you could try with the 2.0.14 release candidate...unless you
>> have already?
>>
>> https://dist.apache.org/repos/dist/dev/pdfbox/2.0.14/
>>
>>
>> On Wed, Feb 27, 2019 at 3:04 AM Slava G <sl...@gmail.com> wrote:
>>
>>> Well, I ran (as was suggested) PDFBox app to extract text , so far 2
>>> hours and still counting...
>>> It seems to be a PDFBox issue.
>>>
>>> On Wed, Feb 27, 2019 at 9:51 AM JB Data31 <jb...@gmail.com> wrote:
>>>
>>>> Why don't you do a basic test with tika server in a 3thrd and a *wget*
>>>> or *curl* bash client to parse your 65 MB PDF.
>>>> It can be easier to investigate the problem.
>>>>
>>>> @*JB*Δ <http://jbigdata.fr/jbigdata/index.html>
>>>>
>>>>
>>>>
>>>> On Tue, Feb 26, 2019 at 11:05 PM, Cristian Vat <cr...@gmail.com>
>>>> wrote:
>>>>
>>>>> Just looking at the stack trace: it won't be the same anymore due to
>>>>> PDFBOX-4453. That change is in the not-yet-released PDFBox 2.0.14 and
>>>>> alters how decryption is handled. Not sure if it's related, though.
>>>>>
>>>>> Can you duplicate the problem without Tika using just PDFBox
>>>>> command-line ExtractText command (
>>>>> https://pdfbox.apache.org/2.0/commandline.html ) on that file?
>>>>>
>>>>>
>>>>> On Tue, Feb 26, 2019 at 8:24 PM Slava G <sl...@gmail.com> wrote:
>>>>>
>>>>>> This is the code :
>>>>>> InputStream in = TikaInputStream.get(inputFile.toPath());
>>>>>> PDFParser tmpPdf = new PDFParser();
>>>>>> PDFParserConfig config = tmpPdf.getPDFParserConfig();
>>>>>> config.setMaxMainMemoryBytes(31457280);
>>>>>> config.setExtractAcroFormContent(false);
>>>>>> config.setExtractBookmarksText(false);
>>>>>> config.setCatchIntermediateIOExceptions(true);
>>>>>> Metadata metadata = new Metadata();
>>>>>> metadata.set(HttpHelper.CONTENT_TYPE, "application/pdf");
>>>>>> tmpPdf.parse(inputStream, textHandler, this.metadata, new
>>>>>> ParseContext());
>>>>>>
>>>>>>
>>>>>> On Tue, Feb 26, 2019 at 8:02 PM Tim Allison <ta...@apache.org>
>>>>>> wrote:
>>>>>>
>>>>>>>
>>>>>>> This is the default in Tika, where the default for
>>>>>>> maxMainMemoryBytes=500MB.
>>>>>>>
>>>>>>> Slava, how are you calling this in Tika?  With a TikaInputStream via
>>>>>>> tika-app or tika-server or something else?
>>>>>>>
>>>>>>> MemoryUsageSetting memoryUsageSetting =
>>>>>>> MemoryUsageSetting.setupMainMemoryOnly();
>>>>>>> if (localConfig.getMaxMainMemoryBytes() >= 0) {
>>>>>>> memoryUsageSetting =
>>>>>>> MemoryUsageSetting.setupMixed(localConfig.getMaxMainMemoryBytes());
>>>>>>> }
>>>>>>> if (tstream != null && tstream.hasFile()) {
>>>>>>> // File based -- send file directly to PDFBox
>>>>>>> pdfDocument = PDDocument.load(tstream.getPath().toFile(), password,
>>>>>>> memoryUsageSetting);
>>>>>>> } else {
>>>>>>> pdfDocument = PDDocument.load(new CloseShieldInputStream(stream),
>>>>>>> password, memoryUsageSetting);
>>>>>>> }
>>>>>>>
>>>>>>> On Tue, Feb 26, 2019 at 12:43 PM Tilman Hausherr <
>>>>>>> THausherr@t-online.de> wrote:
>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> As usual, it would be nice to have the PDF, so that we could run
>>>>>>>> the
>>>>>>>> profiler.
>>>>>>>>
>>>>>>>> The HashSet is used to avoid decrypting objects twice.
>>>>>>>>
>>>>>>>> The "not encrypted" file is likely encrypted with an empty user
>>>>>>>> password.
>>>>>>>>
>>>>>>>> It would also be interesting to hear what parameter is passed to
>>>>>>>> MemoryUsageSetting when load() is called.
>>>>>>>>
>>>>>>>> Tilman
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On 26.02.2019 at 18:14, Tim Allison wrote:
>>>>>>>> > PDFBox Colleagues,
>>>>>>>> >    Any ideas?
>>>>>>>> >
>>>>>>>> > ---------- Forwarded message ---------
>>>>>>>> > From: Tim Allison <ta...@apache.org>
>>>>>>>> > Date: Tue, Feb 26, 2019 at 12:13 PM
>>>>>>>> > Subject: Re: Very slow PDF parsing.
>>>>>>>> > To: <us...@tika.apache.org>
>>>>>>>> >
>>>>>>>> >
>>>>>>>> > Sorry...that's an OCR tool.  One thing that can slow down
>>>>>>>> processing
>>>>>>>> > dramatically is if you have tesseract installed (try typing
>>>>>>>> 'tesseract' on
>>>>>>>> > your commandline) and if you've turned it on for PDFs.  I suspect
>>>>>>>> this
>>>>>>>> > isn't your problem, though.
>>>>>>>> >
>>>>>>>> >
>>>>>>>> >
>>>>>>>> > On Tue, Feb 26, 2019 at 12:08 PM Slava G <sl...@gmail.com>
>>>>>>>> wrote:
>>>>>>>> >
>>>>>>>> >> Thanks Tim,
>>>>>>>> >> But frankly speaking, I'm ashamed to say I don't know what
>>>>>>>> >> tesseract is in this context 🙂
>>>>>>>> >>
>>>>>>>> >> Thanks
>>>>>>>> >>
>>>>>>>> >> On Tue, Feb 26, 2019, 19:04 Tim Allison <ta...@apache.org>
>>>>>>>> wrote:
>>>>>>>> >>
>>>>>>>> >>> Thank you, Slava!
>>>>>>>> >>>
>>>>>>>> >>> Do you have tesseract installed?
>>>>>>>> >>>
>>>>>>>> >>> Colleagues on PDFBox, any recommendations?
>>>>>>>> >>>
>>>>>>>> >>> On Tue, Feb 26, 2019 at 11:56 AM Slava G <sl...@gmail.com>
>>>>>>>> wrote:
>>>>>>>> >>>> Hi,
>>>>>>>> >>>>
>>>>>>>> >>>> I have large PDF (about 65mb) that contains mainly text and
>>>>>>>> some images.
>>>>>>>> >>>>
>>>>>>>> >>>> Parsing of such PDF can take about 2 days or even more (TIKA
>>>>>>>> 1.19.1
>>>>>>>> >>> running on XEON server with 4 cores CPU and 30GB RAM with SSD
>>>>>>>> disk, running
>>>>>>>> >>> CentOS Linux).
>>>>>>>> >>>> Please advise if there is anything I can do to speed it up. Or
>>>>>>>> >>>> maybe it's a bug in PDFBox?
>>>>>>>> >>>> When I'm printing java stack , I see all the time in this
>>>>>>>> stack :
>>>>>>>> >>>>
>>>>>>>> >>>> at org.apache.pdfbox.cos.COSString.equals(COSString.java:259)
>>>>>>>> >>>>
>>>>>>>> >>>> at java.util.HashMap$TreeNode.find(Unknown Source)
>>>>>>>> >>>>
>>>>>>>> >>>> at java.util.HashMap$TreeNode.find(Unknown Source)
>>>>>>>> >>>>
>>>>>>>> >>>> at java.util.HashMap$TreeNode.find(Unknown Source)
>>>>>>>> >>>>
>>>>>>>> >>>> at java.util.HashMap$TreeNode.find(Unknown Source)
>>>>>>>> >>>>
>>>>>>>> >>>> at java.util.HashMap$TreeNode.find(Unknown Source)
>>>>>>>> >>>>
>>>>>>>> >>>> at java.util.HashMap$TreeNode.find(Unknown Source)
>>>>>>>> >>>>
>>>>>>>> >>>> at java.util.HashMap$TreeNode.find(Unknown Source)
>>>>>>>> >>>>
>>>>>>>> >>>> at java.util.HashMap$TreeNode.find(Unknown Source)
>>>>>>>> >>>>
>>>>>>>> >>>> at java.util.HashMap$TreeNode.find(Unknown Source)
>>>>>>>> >>>>
>>>>>>>> >>>> at java.util.HashMap$TreeNode.getTreeNode(Unknown Source)
>>>>>>>> >>>>
>>>>>>>> >>>> at java.util.HashMap.getNode(Unknown Source)
>>>>>>>> >>>>
>>>>>>>> >>>> at java.util.HashMap.containsKey(Unknown Source)
>>>>>>>> >>>>
>>>>>>>> >>>> at java.util.HashSet.contains(Unknown Source)
>>>>>>>> >>>>
>>>>>>>> >>>> at
>>>>>>>> >>>
>>>>>>>> org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:390)
>>>>>>>> >>>> at
>>>>>>>> >>>
>>>>>>>> org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decryptDictionary(SecurityHandler.java:517)
>>>>>>>> >>>> at
>>>>>>>> >>>
>>>>>>>> org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:404)
>>>>>>>> >>>> at
>>>>>>>> >>>
>>>>>>>> org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decryptArray(SecurityHandler.java:577)
>>>>>>>> >>>> at
>>>>>>>> >>>
>>>>>>>> org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:408)
>>>>>>>> >>>> at
>>>>>>>> >>>
>>>>>>>> org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decryptDictionary(SecurityHandler.java:517)
>>>>>>>> >>>> at
>>>>>>>> >>>
>>>>>>>> org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:404)
>>>>>>>> >>>> at
>>>>>>>> >>>
>>>>>>>> org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decryptDictionary(SecurityHandler.java:517)
>>>>>>>> >>>> at
>>>>>>>> >>>
>>>>>>>> org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:404)
>>>>>>>> >>>> at
>>>>>>>> >>>
>>>>>>>> org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decryptArray(SecurityHandler.java:577)
>>>>>>>> >>>> at
>>>>>>>> >>>
>>>>>>>> org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:408)
>>>>>>>> >>>> at
>>>>>>>> >>>
>>>>>>>> org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decryptDictionary(SecurityHandler.java:517)
>>>>>>>> >>>> at
>>>>>>>> >>>
>>>>>>>> org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:404)
>>>>>>>> >>>> at
>>>>>>>> >>>
>>>>>>>> org.apache.pdfbox.pdfparser.COSParser.parseFileObject(COSParser.java:946)
>>>>>>>> >>>> at
>>>>>>>> >>>
>>>>>>>> org.apache.pdfbox.pdfparser.COSParser.parseObjectDynamically(COSParser.java:874)
>>>>>>>> >>>> at
>>>>>>>> >>>
>>>>>>>> org.apache.pdfbox.pdfparser.COSParser.parseObjectDynamically(COSParser.java:794)
>>>>>>>> >>>> at
>>>>>>>> >>>
>>>>>>>> org.apache.pdfbox.pdfparser.COSParser.parseDictObjects(COSParser.java:754)
>>>>>>>> >>>> at
>>>>>>>> >>>
>>>>>>>> org.apache.pdfbox.pdfparser.PDFParser.initialParse(PDFParser.java:185)
>>>>>>>> >>>> at
>>>>>>>> org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:220)
>>>>>>>> >>>>
>>>>>>>> >>>> at
>>>>>>>> org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1028)
>>>>>>>> >>>>
>>>>>>>> >>>> at
>>>>>>>> org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:984)
>>>>>>>> >>>>
>>>>>>>> >>>> at
>>>>>>>> org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:152)
>>>>>>>> >>>>
>>>>>>>> >>>>
>>>>>>>> >>>> P.S. Btw, the PDF is not encrypted at all.
>>>>>>>> >>>>
>>>>>>>> >>>> Thanks

Re: Fwd: Very slow PDF parsing.

Posted by Slava G <sl...@gmail.com>.

Checking with 2.0.14. Started as an app. Will update soon.
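
For callers that cannot afford to block for hours while a run like this is
in progress, one option is to put a time limit around the parse call. Below
is a minimal sketch using only java.util.concurrent. The runWithTimeout
helper is a hypothetical name, and the lambdas in main are stand-ins for the
real Tika or PDFBox call. Note that cancelling only interrupts the worker
thread; a CPU-bound parse that never checks for interruption may keep
running in the background, so running the parser in a separate process that
can be killed is the more robust variant.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class ParseWithTimeout {

    // Hypothetical helper: runs the task on a daemon worker thread and
    // returns its result, or null if it does not finish within the limit.
    static <T> T runWithTimeout(Callable<T> task, long timeout, TimeUnit unit)
            throws Exception {
        ExecutorService executor = Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r, "pdf-parse");
            t.setDaemon(true); // do not keep the JVM alive for a stuck parse
            return t;
        });
        Future<T> future = executor.submit(task);
        try {
            return future.get(timeout, unit);
        } catch (TimeoutException e) {
            future.cancel(true); // best effort: interrupts the worker thread
            return null;
        } catch (ExecutionException e) {
            // Rethrow the task's own exception where possible.
            Throwable cause = e.getCause();
            if (cause instanceof Exception) {
                throw (Exception) cause;
            }
            throw e;
        } finally {
            executor.shutdownNow();
        }
    }

    public static void main(String[] args) throws Exception {
        // Stand-ins for the real parse call.
        String fast = runWithTimeout(() -> "extracted text", 1, TimeUnit.SECONDS);
        String slow = runWithTimeout(() -> {
            Thread.sleep(5_000);
            return "never returned";
        }, 200, TimeUnit.MILLISECONDS);
        System.out.println("fast = " + fast + ", slow = " + slow);
    }
}
```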

On Wed, Feb 27, 2019 at 2:47 PM Tim Allison <ta...@apache.org> wrote:

> Any chance you could try with the 2.0.14 release candidate...unless you
> have already?
>
> https://dist.apache.org/repos/dist/dev/pdfbox/2.0.14/
>
>
> On Wed, Feb 27, 2019 at 3:04 AM Slava G <sl...@gmail.com> wrote:
>
>> Well, I ran (as was suggested) PDFBox app to extract text , so far 2
>> hours and still counting...
>> It seems to be a PDFBox issue.
>>
>> On Wed, Feb 27, 2019 at 9:51 AM JB Data31 <jb...@gmail.com> wrote:
>>
>>> Why don't you do a basic test with tika server in a 3thrd and a *wget*
>>> or *curl* bash client to parse your 65 MB PDF.
>>> It can be easier to investigate the problem.
>>>
>>> @*JB*Δ <http://jbigdata.fr/jbigdata/index.html>
>>>
>>>
>>>
>>> On Tue, Feb 26, 2019 at 11:05 PM, Cristian Vat <cr...@gmail.com>
>>> wrote:
>>>
>>>> Just looking at the stack trace: it won't be the same anymore due to
>>>> PDFBOX-4453. That change is in the not-yet-released PDFBox 2.0.14 and
>>>> alters how decryption is handled. Not sure if it's related, though.
>>>>
>>>> Can you duplicate the problem without Tika using just PDFBox
>>>> command-line ExtractText command (
>>>> https://pdfbox.apache.org/2.0/commandline.html ) on that file?
>>>>
>>>>
>>>> On Tue, Feb 26, 2019 at 8:24 PM Slava G <sl...@gmail.com> wrote:
>>>>
>>>>> This is the code :
>>>>> InputStream in = TikaInputStream.get(inputFile.toPath());
>>>>> PDFParser tmpPdf = new PDFParser();
>>>>> PDFParserConfig config = tmpPdf.getPDFParserConfig();
>>>>> config.setMaxMainMemoryBytes(31457280);
>>>>> config.setExtractAcroFormContent(false);
>>>>> config.setExtractBookmarksText(false);
>>>>> config.setCatchIntermediateIOExceptions(true);
>>>>> Metadata metadata = new Metadata();
>>>>> metadata.set(HttpHelper.CONTENT_TYPE, "application/pdf");
>>>>> tmpPdf.parse(inputStream, textHandler, this.metadata, new
>>>>> ParseContext());
>>>>>
>>>>>
>>>>> On Tue, Feb 26, 2019 at 8:02 PM Tim Allison <ta...@apache.org>
>>>>> wrote:
>>>>>
>>>>>>
>>>>>> This is the default in Tika, where the default for
>>>>>> maxMainMemoryBytes=500MB.
>>>>>>
>>>>>> Slava, how are you calling this in Tika?  With a TikaInputStream via
>>>>>> tika-app or tika-server or something else?
>>>>>>
>>>>>> MemoryUsageSetting memoryUsageSetting =
>>>>>> MemoryUsageSetting.setupMainMemoryOnly();
>>>>>> if (localConfig.getMaxMainMemoryBytes() >= 0) {
>>>>>> memoryUsageSetting =
>>>>>> MemoryUsageSetting.setupMixed(localConfig.getMaxMainMemoryBytes());
>>>>>> }
>>>>>> if (tstream != null && tstream.hasFile()) {
>>>>>> // File based -- send file directly to PDFBox
>>>>>> pdfDocument = PDDocument.load(tstream.getPath().toFile(), password,
>>>>>> memoryUsageSetting);
>>>>>> } else {
>>>>>> pdfDocument = PDDocument.load(new CloseShieldInputStream(stream),
>>>>>> password, memoryUsageSetting);
>>>>>> }
>>>>>>
>>>>>> On Tue, Feb 26, 2019 at 12:43 PM Tilman Hausherr <
>>>>>> THausherr@t-online.de> wrote:
>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> As usual, it would be nice to have the PDF, so that we could run the
>>>>>>> profiler.
>>>>>>>
>>>>>>> The HashSet is used to avoid decrypting objects twice.
>>>>>>>
>>>>>>> The "not encrypted" file is likely encrypted with an empty user
>>>>>>> password.
>>>>>>>
>>>>>>> It would also be interesting to hear what parameter is passed to
>>>>>>> MemoryUsageSetting when load() is called.
>>>>>>>
>>>>>>> Tilman
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On 26.02.2019 at 18:14, Tim Allison wrote:
>>>>>>> > PDFBox Colleagues,
>>>>>>> >    Any ideas?
>>>>>>> >
>>>>>>> > ---------- Forwarded message ---------
>>>>>>> > From: Tim Allison <ta...@apache.org>
>>>>>>> > Date: Tue, Feb 26, 2019 at 12:13 PM
>>>>>>> > Subject: Re: Very slow PDF parsing.
>>>>>>> > To: <us...@tika.apache.org>
>>>>>>> >
>>>>>>> >
>>>>>>> > Sorry...that's an OCR tool.  One thing that can slow down
>>>>>>> processing
>>>>>>> > dramatically is if you have tesseract installed (try typing
>>>>>>> 'tesseract' on
>>>>>>> > your commandline) and if you've turned it on for PDFs.  I suspect
>>>>>>> this
>>>>>>> > isn't your problem, though.
>>>>>>> >
>>>>>>> >
>>>>>>> >
>>>>>>> > On Tue, Feb 26, 2019 at 12:08 PM Slava G <sl...@gmail.com>
>>>>>>> wrote:
>>>>>>> >
>>>>>>> >> Thanks Tim,
>>>>>>> >> But frankly speaking, I'm ashamed to say I don't know what
>>>>>>> >> tesseract is in this context 🙂
>>>>>>> >>
>>>>>>> >> Thanks
>>>>>>> >>
>>>>>>> >> On Tue, Feb 26, 2019, 19:04 Tim Allison <ta...@apache.org>
>>>>>>> wrote:
>>>>>>> >>
>>>>>>> >>> Thank you, Slava!
>>>>>>> >>>
>>>>>>> >>> Do you have tesseract installed?
>>>>>>> >>>
>>>>>>> >>> Colleagues on PDFBox, any recommendations?
>>>>>>> >>>
>>>>>>> >>> On Tue, Feb 26, 2019 at 11:56 AM Slava G <sl...@gmail.com>
>>>>>>> wrote:
>>>>>>> >>>> Hi,
>>>>>>> >>>>
>>>>>>> >>>> I have large PDF (about 65mb) that contains mainly text and
>>>>>>> some images.
>>>>>>> >>>>
>>>>>>> >>>> Parsing of such PDF can take about 2 days or even more (TIKA
>>>>>>> 1.19.1
>>>>>>> >>> running on XEON server with 4 cores CPU and 30GB RAM with SSD
>>>>>>> disk, running
>>>>>>> >>> CentOS Linux).
>>>>>>> >>>> Please advise if there is anything I can do to speed it up. Or
>>>>>>> >>>> maybe it's a bug in PDFBox?
>>>>>>> >>>> When I'm printing java stack , I see all the time in this stack
>>>>>>> :
>>>>>>> >>>>
>>>>>>> >>>> at org.apache.pdfbox.cos.COSString.equals(COSString.java:259)
>>>>>>> >>>>
>>>>>>> >>>> at java.util.HashMap$TreeNode.find(Unknown Source)
>>>>>>> >>>>
>>>>>>> >>>> at java.util.HashMap$TreeNode.find(Unknown Source)
>>>>>>> >>>>
>>>>>>> >>>> at java.util.HashMap$TreeNode.find(Unknown Source)
>>>>>>> >>>>
>>>>>>> >>>> at java.util.HashMap$TreeNode.find(Unknown Source)
>>>>>>> >>>>
>>>>>>> >>>> at java.util.HashMap$TreeNode.find(Unknown Source)
>>>>>>> >>>>
>>>>>>> >>>> at java.util.HashMap$TreeNode.find(Unknown Source)
>>>>>>> >>>>
>>>>>>> >>>> at java.util.HashMap$TreeNode.find(Unknown Source)
>>>>>>> >>>>
>>>>>>> >>>> at java.util.HashMap$TreeNode.find(Unknown Source)
>>>>>>> >>>>
>>>>>>> >>>> at java.util.HashMap$TreeNode.find(Unknown Source)
>>>>>>> >>>>
>>>>>>> >>>> at java.util.HashMap$TreeNode.getTreeNode(Unknown Source)
>>>>>>> >>>>
>>>>>>> >>>> at java.util.HashMap.getNode(Unknown Source)
>>>>>>> >>>>
>>>>>>> >>>> at java.util.HashMap.containsKey(Unknown Source)
>>>>>>> >>>>
>>>>>>> >>>> at java.util.HashSet.contains(Unknown Source)
>>>>>>> >>>>
>>>>>>> >>>> at
>>>>>>> >>>
>>>>>>> org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:390)
>>>>>>> >>>> at
>>>>>>> >>>
>>>>>>> org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decryptDictionary(SecurityHandler.java:517)
>>>>>>> >>>> at
>>>>>>> >>>
>>>>>>> org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:404)
>>>>>>> >>>> at
>>>>>>> >>>
>>>>>>> org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decryptArray(SecurityHandler.java:577)
>>>>>>> >>>> at
>>>>>>> >>>
>>>>>>> org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:408)
>>>>>>> >>>> at
>>>>>>> >>>
>>>>>>> org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decryptDictionary(SecurityHandler.java:517)
>>>>>>> >>>> at
>>>>>>> >>>
>>>>>>> org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:404)
>>>>>>> >>>> at
>>>>>>> >>>
>>>>>>> org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decryptDictionary(SecurityHandler.java:517)
>>>>>>> >>>> at
>>>>>>> >>>
>>>>>>> org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:404)
>>>>>>> >>>> at
>>>>>>> >>>
>>>>>>> org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decryptArray(SecurityHandler.java:577)
>>>>>>> >>>> at
>>>>>>> >>>
>>>>>>> org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:408)
>>>>>>> >>>> at
>>>>>>> >>>
>>>>>>> org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decryptDictionary(SecurityHandler.java:517)
>>>>>>> >>>> at
>>>>>>> >>>
>>>>>>> org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:404)
>>>>>>> >>>> at
>>>>>>> >>>
>>>>>>> org.apache.pdfbox.pdfparser.COSParser.parseFileObject(COSParser.java:946)
>>>>>>> >>>> at
>>>>>>> >>>
>>>>>>> org.apache.pdfbox.pdfparser.COSParser.parseObjectDynamically(COSParser.java:874)
>>>>>>> >>>> at
>>>>>>> >>>
>>>>>>> org.apache.pdfbox.pdfparser.COSParser.parseObjectDynamically(COSParser.java:794)
>>>>>>> >>>> at
>>>>>>> >>>
>>>>>>> org.apache.pdfbox.pdfparser.COSParser.parseDictObjects(COSParser.java:754)
>>>>>>> >>>> at
>>>>>>> >>>
>>>>>>> org.apache.pdfbox.pdfparser.PDFParser.initialParse(PDFParser.java:185)
>>>>>>> >>>> at
>>>>>>> org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:220)
>>>>>>> >>>>
>>>>>>> >>>> at
>>>>>>> org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1028)
>>>>>>> >>>>
>>>>>>> >>>> at
>>>>>>> org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:984)
>>>>>>> >>>>
>>>>>>> >>>> at
>>>>>>> org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:152)
>>>>>>> >>>>
>>>>>>> >>>>
>>>>>>> >>>> P.S. Btw, the PDF is not encrypted at all.
>>>>>>> >>>>
>>>>>>> >>>> Thanks
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> ---------------------------------------------------------------------
>>>>>>> To unsubscribe, e-mail: users-unsubscribe@pdfbox.apache.org
>>>>>>> For additional commands, e-mail: users-help@pdfbox.apache.org
>>>>>>>
>>>>>>>

Re: Fwd: Very slow PDF parsing.

Posted by Slava G <sl...@gmail.com>.

Checking with 2.0.14. Started as an app. Will update soon.

On Wed, Feb 27, 2019 at 2:47 PM Tim Allison <ta...@apache.org> wrote:

> Any chance you could try with the 2.0.14 release candidate...unless you
> have already?
>
> https://dist.apache.org/repos/dist/dev/pdfbox/2.0.14/
>
>
> On Wed, Feb 27, 2019 at 3:04 AM Slava G <sl...@gmail.com> wrote:
>
>> Well, I ran (as was suggested) PDFBox app to extract text , so far 2
>> hours and still counting...
>> It's seems to be a PDFBox issue.
>>
>> On Wed, Feb 27, 2019 at 9:51 AM JB Data31 <jb...@gmail.com> wrote:
>>
>>> Why don't you do a basic test with tika server in a 3thrd and a *wget*
>>> or *curl* bash client to parse your 65Mo PDF.
>>> It can be easier to investigate the problem.
>>>
>>> @*JB*Δ <http://jbigdata.fr/jbigdata/index.html>
>>>
>>>
>>>
>>> On Tue, Feb 26, 2019 at 11:05 PM, Cristian Vat <cr...@gmail.com> wrote:
>>>
>>>> Just looking at the stack trace it won't be the same anymore due to
>>>> PDFBOX-4453
>>>> Some changes present in not yet released pdfbox 2.0.14 and it changes
>>>> how decryption is handled. Not sure if related though.
>>>>
>>>> Can you duplicate the problem without Tika using just PDFBox
>>>> command-line ExtractText command (
>>>> https://pdfbox.apache.org/2.0/commandline.html ) on that file?
>>>>
>>>>
>>>> On Tue, Feb 26, 2019 at 8:24 PM Slava G <sl...@gmail.com> wrote:
>>>>
>>>>> This is the code :
>>>>> InputStream in = TikaInputStream.get(inputFile.toPath());
>>>>> PDFParser tmpPdf = new PDFParser();
>>>>> PDFParserConfig config = tmpPdf.getPDFParserConfig();
>>>>> config.setMaxMainMemoryBytes(31457280);
>>>>> config.setExtractAcroFormContent(false);
>>>>> config.setExtractBookmarksText(false);
>>>>> config.setCatchIntermediateIOExceptions(true);
>>>>> Metadata metadata = new Metadata();
>>>>> metadata.set(HttpHelper.CONTENT_TYPE, "application/pdf");
>>>>> tmpPdf.parse(inputStream, textHandler, this.metadata, new
>>>>> ParseContext());
>>>>>
>>>>>
>>>>> On Tue, Feb 26, 2019 at 8:02 PM Tim Allison <ta...@apache.org>
>>>>> wrote:
>>>>>
>>>>>>
>>>>>> This is the default in Tika, where the default for
>>>>>> maxMainMemoryBytes=500MB.
>>>>>>
>>>>>> Slava, how are you calling this in Tika?  With a TikaInputStream via
>>>>>> tika-app or tika-server or something else?
>>>>>>
>>>>>> MemoryUsageSetting memoryUsageSetting =
>>>>>> MemoryUsageSetting.setupMainMemoryOnly();
>>>>>> if (localConfig.getMaxMainMemoryBytes() >= 0) {
>>>>>> memoryUsageSetting =
>>>>>> MemoryUsageSetting.setupMixed(localConfig.getMaxMainMemoryBytes());
>>>>>> }
>>>>>> if (tstream != null && tstream.hasFile()) {
>>>>>> // File based -- send file directly to PDFBox
>>>>>> pdfDocument = PDDocument.load(tstream.getPath().toFile(), password,
>>>>>> memoryUsageSetting);
>>>>>> } else {
>>>>>> pdfDocument = PDDocument.load(new CloseShieldInputStream(stream),
>>>>>> password, memoryUsageSetting);
>>>>>> }
>>>>>>
>>>>>> On Tue, Feb 26, 2019 at 12:43 PM Tilman Hausherr <
>>>>>> THausherr@t-online.de> wrote:
>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> As usual, it would be nice to have the PDF, so that we could run the
>>>>>>> profiler.
>>>>>>>
>>>>>>> The HashSet is used to avoid decrypting objects twice.
>>>>>>>
>>>>>>> The "not encrypted" file is likely encrypted with an empty user
>>>>>>> password.
>>>>>>>
>>>>>>> It would also be interesting to hear what parameter is passed to
>>>>>>> MemoryUsageSetting when load() is called.
>>>>>>>
>>>>>>> Tilman
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On 26.02.2019 at 18:14, Tim Allison wrote:
>>>>>>> > PDFBox Colleagues,
>>>>>>> >    Any ideas?
>>>>>>> >
>>>>>>> > ---------- Forwarded message ---------
>>>>>>> > From: Tim Allison <ta...@apache.org>
>>>>>>> > Date: Tue, Feb 26, 2019 at 12:13 PM
>>>>>>> > Subject: Re: Very slow PDF parsing.
>>>>>>> > To: <us...@tika.apache.org>
>>>>>>> >
>>>>>>> >
>>>>>>> > Sorry...that's an OCR tool.  One thing that can slow down
>>>>>>> processing
>>>>>>> > dramatically is if you have tesseract installed (try typing
>>>>>>> 'tesseract' on
>>>>>>> > your commandline) and if you've turned it on for PDFs.  I suspect
>>>>>>> this
>>>>>>> > isn't your problem, though.
>>>>>>> >
>>>>>>> >
>>>>>>> >
>>>>>>> > On Tue, Feb 26, 2019 at 12:08 PM Slava G <sl...@gmail.com>
>>>>>>> wrote:
>>>>>>> >
>>>>>>> >> Thanks Tim,
>>>>>>> >> But frankly speaking, it's a shame, but don't know what is
>>>>>>> tessercat is in
>>>>>>> >> this context 🙂
>>>>>>> >>
>>>>>>> >> Thanks
>>>>>>> >>
>>>>>>> >> On Tue, Feb 26, 2019, 19:04 Tim Allison <ta...@apache.org>
>>>>>>> wrote:
>>>>>>> >>
>>>>>>> >>> Thank you, Slava!
>>>>>>> >>>
>>>>>>> >>> Do you have tesseract installed?
>>>>>>> >>>
>>>>>>> >>> Colleagues on PDFBox, any recommendations?
>>>>>>> >>>
>>>>>>> >>> On Tue, Feb 26, 2019 at 11:56 AM Slava G <sl...@gmail.com>
>>>>>>> wrote:
>>>>>>> >>>> Hi,
>>>>>>> >>>>
>>>>>>> >>>> I have large PDF (about 65mb) that contains mainly text and
>>>>>>> some images.
>>>>>>> >>>>
>>>>>>> >>>> Parsing of such PDF can take about 2 days or even more (TIKA
>>>>>>> 1.19.1
>>>>>>> >>> running on XEON server with 4 cores CPU and 30GB RAM with SSD
>>>>>>> disk, running
>>>>>>> >>> CentOS Linux).
>>>>>>> >>>> Please advise if there anything I can do to speedup.Or maybe
>>>>>>> it's a bug
>>>>>>> >>> in PDFBox ?
>>>>>>> >>>> When I'm printing java stack , I see all the time in this stack
>>>>>>> :
>>>>>>> >>>>
>>>>>>> >>>> at org.apache.pdfbox.cos.COSString.equals(COSString.java:259)
>>>>>>> >>>>
>>>>>>> >>>> at java.util.HashMap$TreeNode.find(Unknown Source)
>>>>>>> >>>>
>>>>>>> >>>> at java.util.HashMap$TreeNode.find(Unknown Source)
>>>>>>> >>>>
>>>>>>> >>>> at java.util.HashMap$TreeNode.find(Unknown Source)
>>>>>>> >>>>
>>>>>>> >>>> at java.util.HashMap$TreeNode.find(Unknown Source)
>>>>>>> >>>>
>>>>>>> >>>> at java.util.HashMap$TreeNode.find(Unknown Source)
>>>>>>> >>>>
>>>>>>> >>>> at java.util.HashMap$TreeNode.find(Unknown Source)
>>>>>>> >>>>
>>>>>>> >>>> at java.util.HashMap$TreeNode.find(Unknown Source)
>>>>>>> >>>>
>>>>>>> >>>> at java.util.HashMap$TreeNode.find(Unknown Source)
>>>>>>> >>>>
>>>>>>> >>>> at java.util.HashMap$TreeNode.find(Unknown Source)
>>>>>>> >>>>
>>>>>>> >>>> at java.util.HashMap$TreeNode.getTreeNode(Unknown Source)
>>>>>>> >>>>
>>>>>>> >>>> at java.util.HashMap.getNode(Unknown Source)
>>>>>>> >>>>
>>>>>>> >>>> at java.util.HashMap.containsKey(Unknown Source)
>>>>>>> >>>>
>>>>>>> >>>> at java.util.HashSet.contains(Unknown Source)
>>>>>>> >>>>
>>>>>>> >>>> at
>>>>>>> >>>
>>>>>>> org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:390)
>>>>>>> >>>> at
>>>>>>> >>>
>>>>>>> org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decryptDictionary(SecurityHandler.java:517)
>>>>>>> >>>> at
>>>>>>> >>>
>>>>>>> org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:404)
>>>>>>> >>>> at
>>>>>>> >>>
>>>>>>> org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decryptArray(SecurityHandler.java:577)
>>>>>>> >>>> at
>>>>>>> >>>
>>>>>>> org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:408)
>>>>>>> >>>> at
>>>>>>> >>>
>>>>>>> org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decryptDictionary(SecurityHandler.java:517)
>>>>>>> >>>> at
>>>>>>> >>>
>>>>>>> org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:404)
>>>>>>> >>>> at
>>>>>>> >>>
>>>>>>> org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decryptDictionary(SecurityHandler.java:517)
>>>>>>> >>>> at
>>>>>>> >>>
>>>>>>> org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:404)
>>>>>>> >>>> at
>>>>>>> >>>
>>>>>>> org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decryptArray(SecurityHandler.java:577)
>>>>>>> >>>> at
>>>>>>> >>>
>>>>>>> org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:408)
>>>>>>> >>>> at
>>>>>>> >>>
>>>>>>> org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decryptDictionary(SecurityHandler.java:517)
>>>>>>> >>>> at
>>>>>>> >>>
>>>>>>> org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:404)
>>>>>>> >>>> at
>>>>>>> >>>
>>>>>>> org.apache.pdfbox.pdfparser.COSParser.parseFileObject(COSParser.java:946)
>>>>>>> >>>> at
>>>>>>> >>>
>>>>>>> org.apache.pdfbox.pdfparser.COSParser.parseObjectDynamically(COSParser.java:874)
>>>>>>> >>>> at
>>>>>>> >>>
>>>>>>> org.apache.pdfbox.pdfparser.COSParser.parseObjectDynamically(COSParser.java:794)
>>>>>>> >>>> at
>>>>>>> >>>
>>>>>>> org.apache.pdfbox.pdfparser.COSParser.parseDictObjects(COSParser.java:754)
>>>>>>> >>>> at
>>>>>>> >>>
>>>>>>> org.apache.pdfbox.pdfparser.PDFParser.initialParse(PDFParser.java:185)
>>>>>>> >>>> at
>>>>>>> org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:220)
>>>>>>> >>>>
>>>>>>> >>>> at
>>>>>>> org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1028)
>>>>>>> >>>>
>>>>>>> >>>> at
>>>>>>> org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:984)
>>>>>>> >>>>
>>>>>>> >>>> at
>>>>>>> org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:152)
>>>>>>> >>>>
>>>>>>> >>>>
>>>>>>> >>>> P.S. Btw, the PDF is not encrypted at all.
>>>>>>> >>>>
>>>>>>> >>>> Thanks
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> ---------------------------------------------------------------------
>>>>>>> To unsubscribe, e-mail: users-unsubscribe@pdfbox.apache.org
>>>>>>> For additional commands, e-mail: users-help@pdfbox.apache.org
>>>>>>>
>>>>>>>

Re: Fwd: Very slow PDF parsing.

Posted by Tim Allison <ta...@apache.org>.

Any chance you could try with the 2.0.14 release candidate...unless you
have already?

https://dist.apache.org/repos/dist/dev/pdfbox/2.0.14/


On Wed, Feb 27, 2019 at 3:04 AM Slava G <sl...@gmail.com> wrote:

> Well, I ran (as was suggested) PDFBox app to extract text , so far 2 hours
> and still counting...
> It's seems to be a PDFBox issue.
>
> On Wed, Feb 27, 2019 at 9:51 AM JB Data31 <jb...@gmail.com> wrote:
>
>> Why don't you do a basic test with tika server in a 3thrd and a *wget*
>> or *curl* bash client to parse your 65Mo PDF.
>> It can be easier to investigate the problem.
>>
>> @*JB*Δ <http://jbigdata.fr/jbigdata/index.html>
>>
>>
>>
>> On Tue, Feb 26, 2019 at 11:05 PM, Cristian Vat <cr...@gmail.com> wrote:
>>
>>> Just looking at the stack trace it won't be the same anymore due to
>>> PDFBOX-4453
>>> Some changes present in not yet released pdfbox 2.0.14 and it changes
>>> how decryption is handled. Not sure if related though.
>>>
>>> Can you duplicate the problem without Tika using just PDFBox
>>> command-line ExtractText command (
>>> https://pdfbox.apache.org/2.0/commandline.html ) on that file?
>>>
>>>
>>> On Tue, Feb 26, 2019 at 8:24 PM Slava G <sl...@gmail.com> wrote:
>>>
>>>> This is the code :
>>>> InputStream in = TikaInputStream.get(inputFile.toPath());
>>>> PDFParser tmpPdf = new PDFParser();
>>>> PDFParserConfig config = tmpPdf.getPDFParserConfig();
>>>> config.setMaxMainMemoryBytes(31457280);
>>>> config.setExtractAcroFormContent(false);
>>>> config.setExtractBookmarksText(false);
>>>> config.setCatchIntermediateIOExceptions(true);
>>>> Metadata metadata = new Metadata();
>>>> metadata.set(HttpHelper.CONTENT_TYPE, "application/pdf");
>>>> tmpPdf.parse(inputStream, textHandler, this.metadata, new
>>>> ParseContext());
>>>>
>>>>
>>>> On Tue, Feb 26, 2019 at 8:02 PM Tim Allison <ta...@apache.org>
>>>> wrote:
>>>>
>>>>>
>>>>> This is the default in Tika, where the default for
>>>>> maxMainMemoryBytes=500MB.
>>>>>
>>>>> Slava, how are you calling this in Tika?  With a TikaInputStream via
>>>>> tika-app or tika-server or something else?
>>>>>
>>>>> MemoryUsageSetting memoryUsageSetting =
>>>>> MemoryUsageSetting.setupMainMemoryOnly();
>>>>> if (localConfig.getMaxMainMemoryBytes() >= 0) {
>>>>> memoryUsageSetting =
>>>>> MemoryUsageSetting.setupMixed(localConfig.getMaxMainMemoryBytes());
>>>>> }
>>>>> if (tstream != null && tstream.hasFile()) {
>>>>> // File based -- send file directly to PDFBox
>>>>> pdfDocument = PDDocument.load(tstream.getPath().toFile(), password,
>>>>> memoryUsageSetting);
>>>>> } else {
>>>>> pdfDocument = PDDocument.load(new CloseShieldInputStream(stream),
>>>>> password, memoryUsageSetting);
>>>>> }
>>>>>
>>>>> On Tue, Feb 26, 2019 at 12:43 PM Tilman Hausherr <
>>>>> THausherr@t-online.de> wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> As usual, it would be nice to have the PDF, so that we could run the
>>>>>> profiler.
>>>>>>
>>>>>> The HashSet is used to avoid decrypting objects twice.
>>>>>>
>>>>>> The "not encrypted" file is likely encrypted with an empty user
>>>>>> password.
>>>>>>
>>>>>> It would also be interesting to hear what parameter is passed to
>>>>>> MemoryUsageSetting when load() is called.
>>>>>>
>>>>>> Tilman
>>>>>>
>>>>>>
>>>>>>
>>>>>> On 26.02.2019 at 18:14, Tim Allison wrote:
>>>>>> > PDFBox Colleagues,
>>>>>> >    Any ideas?
>>>>>> >
>>>>>> > ---------- Forwarded message ---------
>>>>>> > From: Tim Allison <ta...@apache.org>
>>>>>> > Date: Tue, Feb 26, 2019 at 12:13 PM
>>>>>> > Subject: Re: Very slow PDF parsing.
>>>>>> > To: <us...@tika.apache.org>
>>>>>> >
>>>>>> >
>>>>>> > Sorry...that's an OCR tool.  One thing that can slow down processing
>>>>>> > dramatically is if you have tesseract installed (try typing
>>>>>> 'tesseract' on
>>>>>> > your commandline) and if you've turned it on for PDFs.  I suspect
>>>>>> this
>>>>>> > isn't your problem, though.
>>>>>> >
>>>>>> >
>>>>>> >
>>>>>> > On Tue, Feb 26, 2019 at 12:08 PM Slava G <sl...@gmail.com> wrote:
>>>>>> >
>>>>>> >> Thanks Tim,
>>>>>> >> But frankly speaking, it's a shame, but don't know what is
>>>>>> tessercat is in
>>>>>> >> this context 🙂
>>>>>> >>
>>>>>> >> Thanks
>>>>>> >>
>>>>>> >> On Tue, Feb 26, 2019, 19:04 Tim Allison <ta...@apache.org>
>>>>>> wrote:
>>>>>> >>
>>>>>> >>> Thank you, Slava!
>>>>>> >>>
>>>>>> >>> Do you have tesseract installed?
>>>>>> >>>
>>>>>> >>> Colleagues on PDFBox, any recommendations?
>>>>>> >>>
>>>>>> >>> On Tue, Feb 26, 2019 at 11:56 AM Slava G <sl...@gmail.com>
>>>>>> wrote:
>>>>>> >>>> Hi,
>>>>>> >>>>
>>>>>> >>>> I have large PDF (about 65mb) that contains mainly text and some
>>>>>> images.
>>>>>> >>>>
>>>>>> >>>> Parsing of such PDF can take about 2 days or even more (TIKA
>>>>>> 1.19.1
>>>>>> >>> running on XEON server with 4 cores CPU and 30GB RAM with SSD
>>>>>> disk, running
>>>>>> >>> CentOS Linux).
>>>>>> >>>> Please advise if there anything I can do to speedup.Or maybe
>>>>>> it's a bug
>>>>>> >>> in PDFBox ?
>>>>>> >>>> When I'm printing java stack , I see all the time in this stack :
>>>>>> >>>>
>>>>>> >>>> at org.apache.pdfbox.cos.COSString.equals(COSString.java:259)
>>>>>> >>>>
>>>>>> >>>> at java.util.HashMap$TreeNode.find(Unknown Source)
>>>>>> >>>>
>>>>>> >>>> at java.util.HashMap$TreeNode.find(Unknown Source)
>>>>>> >>>>
>>>>>> >>>> at java.util.HashMap$TreeNode.find(Unknown Source)
>>>>>> >>>>
>>>>>> >>>> at java.util.HashMap$TreeNode.find(Unknown Source)
>>>>>> >>>>
>>>>>> >>>> at java.util.HashMap$TreeNode.find(Unknown Source)
>>>>>> >>>>
>>>>>> >>>> at java.util.HashMap$TreeNode.find(Unknown Source)
>>>>>> >>>>
>>>>>> >>>> at java.util.HashMap$TreeNode.find(Unknown Source)
>>>>>> >>>>
>>>>>> >>>> at java.util.HashMap$TreeNode.find(Unknown Source)
>>>>>> >>>>
>>>>>> >>>> at java.util.HashMap$TreeNode.find(Unknown Source)
>>>>>> >>>>
>>>>>> >>>> at java.util.HashMap$TreeNode.getTreeNode(Unknown Source)
>>>>>> >>>>
>>>>>> >>>> at java.util.HashMap.getNode(Unknown Source)
>>>>>> >>>>
>>>>>> >>>> at java.util.HashMap.containsKey(Unknown Source)
>>>>>> >>>>
>>>>>> >>>> at java.util.HashSet.contains(Unknown Source)
>>>>>> >>>>
>>>>>> >>>> at
>>>>>> >>>
>>>>>> org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:390)
>>>>>> >>>> at
>>>>>> >>>
>>>>>> org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decryptDictionary(SecurityHandler.java:517)
>>>>>> >>>> at
>>>>>> >>>
>>>>>> org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:404)
>>>>>> >>>> at
>>>>>> >>>
>>>>>> org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decryptArray(SecurityHandler.java:577)
>>>>>> >>>> at
>>>>>> >>>
>>>>>> org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:408)
>>>>>> >>>> at
>>>>>> >>>
>>>>>> org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decryptDictionary(SecurityHandler.java:517)
>>>>>> >>>> at
>>>>>> >>>
>>>>>> org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:404)
>>>>>> >>>> at
>>>>>> >>>
>>>>>> org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decryptDictionary(SecurityHandler.java:517)
>>>>>> >>>> at
>>>>>> >>>
>>>>>> org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:404)
>>>>>> >>>> at
>>>>>> >>>
>>>>>> org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decryptArray(SecurityHandler.java:577)
>>>>>> >>>> at
>>>>>> >>>
>>>>>> org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:408)
>>>>>> >>>> at
>>>>>> >>>
>>>>>> org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decryptDictionary(SecurityHandler.java:517)
>>>>>> >>>> at
>>>>>> >>>
>>>>>> org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:404)
>>>>>> >>>> at
>>>>>> >>>
>>>>>> org.apache.pdfbox.pdfparser.COSParser.parseFileObject(COSParser.java:946)
>>>>>> >>>> at
>>>>>> >>>
>>>>>> org.apache.pdfbox.pdfparser.COSParser.parseObjectDynamically(COSParser.java:874)
>>>>>> >>>> at
>>>>>> >>>
>>>>>> org.apache.pdfbox.pdfparser.COSParser.parseObjectDynamically(COSParser.java:794)
>>>>>> >>>> at
>>>>>> >>>
>>>>>> org.apache.pdfbox.pdfparser.COSParser.parseDictObjects(COSParser.java:754)
>>>>>> >>>> at
>>>>>> >>>
>>>>>> org.apache.pdfbox.pdfparser.PDFParser.initialParse(PDFParser.java:185)
>>>>>> >>>> at
>>>>>> org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:220)
>>>>>> >>>>
>>>>>> >>>> at
>>>>>> org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1028)
>>>>>> >>>>
>>>>>> >>>> at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:984)
>>>>>> >>>>
>>>>>> >>>> at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:152)
>>>>>> >>>>
>>>>>> >>>>
>>>>>> >>>> P.S. Btw, the PDF is not encrypted at all.
>>>>>> >>>>
>>>>>> >>>> Thanks
>>>>>>
>>>>>>
>>>>>>
>>>>>> ---------------------------------------------------------------------
>>>>>> To unsubscribe, e-mail: users-unsubscribe@pdfbox.apache.org
>>>>>> For additional commands, e-mail: users-help@pdfbox.apache.org
>>>>>>
>>>>>>

Re: Fwd: Very slow PDF parsing.

Posted by Slava G <sl...@gmail.com>.

Well, I ran the PDFBox app to extract text (as was suggested); 2 hours so far
and still counting...
It seems to be a PDFBox issue.
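
For readers following the stack trace quoted in this thread: one way a
HashSet.contains() call can end up deep inside HashMap$TreeNode.find() and
equals() is when many distinct keys share the same hash code. The sketch
below shows that effect using only the JDK. It is an illustration under that
assumption, not a confirmed diagnosis of this PDF, and the Key class is
hypothetical (it is not PDFBox's COSString). It counts equals() calls while
filling a content-based HashSet and, for contrast, an identity-based set.

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.IdentityHashMap;
import java.util.Set;

class CollidingKeyDemo {

    /** Hypothetical key: content-based equals(), deliberately constant hashCode(). */
    static final class Key {
        static long equalsCalls = 0;   // counts every equals() invocation
        final String value;

        Key(String value) {
            this.value = value;
        }

        @Override
        public int hashCode() {
            return 42;                 // every key lands in the same bucket
        }

        @Override
        public boolean equals(Object o) {
            equalsCalls++;
            return o instanceof Key && ((Key) o).value.equals(value);
        }
    }

    /** Adds n distinct keys to the set and returns how many equals() calls that took. */
    static long fill(Set<Key> set, int n) {
        Key.equalsCalls = 0;
        for (int i = 0; i < n; i++) {
            set.add(new Key("k" + i));
        }
        return Key.equalsCalls;
    }

    public static void main(String[] args) {
        int n = 2000;
        long hashed = fill(new HashSet<Key>(), n);
        // Identity-based set: compares keys with ==, never calls equals().
        long identity = fill(
                Collections.newSetFromMap(new IdentityHashMap<Key, Boolean>()), n);
        System.out.println("equals() calls, HashSet:      " + hashed);
        System.out.println("equals() calls, identity set: " + identity);
    }
}
```

With colliding hash codes, every insertion has to compare the new key against
keys already in the bucket, so the equals() count grows much faster than the
number of keys, while the identity-based set never calls equals() at all.
Whether this matches the real cause here would need a profiler run on the
actual file, as Tilman suggests.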

On Wed, Feb 27, 2019 at 9:51 AM JB Data31 <jb...@gmail.com> wrote:

> Why don't you do a basic test with tika server in a 3thrd and a *wget* or
> *curl* bash client to parse your 65Mo PDF.
> It can be easier to investigate the problem.
>
> @*JB*Δ <http://jbigdata.fr/jbigdata/index.html>
>
>
>
> On Tue, Feb 26, 2019 at 11:05 PM, Cristian Vat <cr...@gmail.com> wrote:
>
>> Just looking at the stack trace it won't be the same anymore due to
>> PDFBOX-4453
>> Some changes present in not yet released pdfbox 2.0.14 and it changes how
>> decryption is handled. Not sure if related though.
>>
>> Can you duplicate the problem without Tika using just PDFBox command-line
>> ExtractText command ( https://pdfbox.apache.org/2.0/commandline.html )
>> on that file?
>>
>>
>> On Tue, Feb 26, 2019 at 8:24 PM Slava G <sl...@gmail.com> wrote:
>>
>>> This is the code :
>>> InputStream in = TikaInputStream.get(inputFile.toPath());
>>> PDFParser tmpPdf = new PDFParser();
>>> PDFParserConfig config = tmpPdf.getPDFParserConfig();
>>> config.setMaxMainMemoryBytes(31457280);
>>> config.setExtractAcroFormContent(false);
>>> config.setExtractBookmarksText(false);
>>> config.setCatchIntermediateIOExceptions(true);
>>> Metadata metadata = new Metadata();
>>> metadata.set(HttpHelper.CONTENT_TYPE, "application/pdf");
>>> tmpPdf.parse(inputStream, textHandler, this.metadata, new
>>> ParseContext());
>>>
>>>
>>> On Tue, Feb 26, 2019 at 8:02 PM Tim Allison <ta...@apache.org> wrote:
>>>
>>>>
>>>> This is the default in Tika, where the default for
>>>> maxMainMemoryBytes=500MB.
>>>>
>>>> Slava, how are you calling this in Tika?  With a TikaInputStream via
>>>> tika-app or tika-server or something else?
>>>>
>>>> MemoryUsageSetting memoryUsageSetting =
>>>>     MemoryUsageSetting.setupMainMemoryOnly();
>>>> if (localConfig.getMaxMainMemoryBytes() >= 0) {
>>>>     memoryUsageSetting =
>>>>         MemoryUsageSetting.setupMixed(localConfig.getMaxMainMemoryBytes());
>>>> }
>>>> if (tstream != null && tstream.hasFile()) {
>>>>     // File based -- send file directly to PDFBox
>>>>     pdfDocument = PDDocument.load(tstream.getPath().toFile(), password,
>>>>         memoryUsageSetting);
>>>> } else {
>>>>     pdfDocument = PDDocument.load(new CloseShieldInputStream(stream),
>>>>         password, memoryUsageSetting);
>>>> }
>>>>
>>>> On Tue, Feb 26, 2019 at 12:43 PM Tilman Hausherr <TH...@t-online.de>
>>>> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> As usual, it would be nice to have the PDF, so that we could run the
>>>>> profiler.
>>>>>
>>>>> The HashSet is used to avoid decrypting objects twice.
>>>>>
>>>>> The "not encrypted" file is likely encrypted with an empty user
>>>>> password.
>>>>>
>>>>> It would also be interesting to hear what parameter is passed to
>>>>> MemoryUsageSetting when load() is called.
>>>>>
>>>>> Tilman
>>>>>
>>>>>
>>>>>
>>>>> On 26.02.2019 at 18:14, Tim Allison wrote:
>>>>> > PDFBox Colleagues,
>>>>> >    Any ideas?
>>>>> >
>>>>> > ---------- Forwarded message ---------
>>>>> > From: Tim Allison <ta...@apache.org>
>>>>> > Date: Tue, Feb 26, 2019 at 12:13 PM
>>>>> > Subject: Re: Very slow PDF parsing.
>>>>> > To: <us...@tika.apache.org>
>>>>> >
>>>>> >
>>>>> > Sorry...that's an OCR tool.  One thing that can slow down processing
>>>>> > dramatically is if you have tesseract installed (try typing
>>>>> 'tesseract' on
>>>>> > your commandline) and if you've turned it on for PDFs.  I suspect
>>>>> this
>>>>> > isn't your problem, though.
>>>>> >
>>>>> >
>>>>> >
>>>>> > On Tue, Feb 26, 2019 at 12:08 PM Slava G <sl...@gmail.com> wrote:
>>>>> >
>>>>> >> Thanks Tim,
>>>>> >> But frankly speaking, I'm ashamed to say I don't know what
>>>>> >> tesseract is in this context 🙂
>>>>> >>
>>>>> >> Thanks
>>>>> >>
>>>>> >> On Tue, Feb 26, 2019, 19:04 Tim Allison <ta...@apache.org>
>>>>> wrote:
>>>>> >>
>>>>> >>> Thank you, Slava!
>>>>> >>>
>>>>> >>> Do you have tesseract installed?
>>>>> >>>
>>>>> >>> Colleagues on PDFBox, any recommendations?
>>>>> >>>

Re: Fwd: Very slow PDF parsing.

Posted by JB Data31 <jb...@gmail.com>.

Why don't you run a basic test with Tika server in a separate thread and a
*wget* or *curl* bash client to parse your 65 MB PDF?
That could make the problem easier to investigate.
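
As a rough illustration of the test suggested above, here is a minimal Java
sketch of what the *curl* client would do: PUT the file to a running Tika
server and time the response. It assumes the server was started separately
and listens on its default port 9998 with the /tika endpoint. The class name
TikaServerProbe, the helper putFile and the ten-minute read timeout are made
up for this example.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

final class TikaServerProbe {

    // PUTs the file to the endpoint and returns the response body as text.
    // readTimeoutMillis bounds how long a single read waits for the server.
    static String putFile(String endpoint, Path file, int readTimeoutMillis) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) new URL(endpoint).openConnection();
        conn.setRequestMethod("PUT");
        conn.setDoOutput(true);
        conn.setRequestProperty("Accept", "text/plain");
        conn.setRequestProperty("Content-Type", "application/pdf");
        conn.setConnectTimeout(10_000);
        conn.setReadTimeout(readTimeoutMillis);
        // Stream the upload instead of buffering a large file in memory.
        conn.setFixedLengthStreamingMode(Files.size(file));
        try {
            try (OutputStream out = conn.getOutputStream()) {
                Files.copy(file, out);
            }
            int status = conn.getResponseCode();
            if (status != HttpURLConnection.HTTP_OK) {
                throw new IOException("Unexpected HTTP status " + status);
            }
            ByteArrayOutputStream body = new ByteArrayOutputStream();
            try (InputStream in = conn.getInputStream()) {
                byte[] chunk = new byte[8192];
                int n;
                while ((n = in.read(chunk)) != -1) {
                    body.write(chunk, 0, n);
                }
            }
            return new String(body.toByteArray(), StandardCharsets.UTF_8);
        } finally {
            conn.disconnect();
        }
    }

    public static void main(String[] args) throws IOException {
        // Assumption: tika-server is already running locally on port 9998.
        long start = System.nanoTime();
        String text = putFile("http://localhost:9998/tika", Paths.get(args[0]), 600_000);
        long seconds = (System.nanoTime() - start) / 1_000_000_000L;
        System.out.println("Extracted " + text.length() + " chars in " + seconds + " s");
    }
}
```

If the request hits the read timeout even with this bare client, that would
point at the parsing itself rather than at the calling code.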

@*JB*Δ <http://jbigdata.fr/jbigdata/index.html>



On Tue, Feb 26, 2019 at 23:05, Cristian Vat <cr...@gmail.com> wrote:

> Just looking at the stack trace: it won't look the same anymore because of
> PDFBOX-4453. That change is in the not-yet-released PDFBox 2.0.14 and
> alters how decryption is handled. Not sure if it's related, though.
>
> Can you reproduce the problem without Tika, using just the PDFBox
> command-line ExtractText command
> ( https://pdfbox.apache.org/2.0/commandline.html ) on that file?
>
>
> On Tue, Feb 26, 2019 at 8:24 PM Slava G <sl...@gmail.com> wrote:
>
>> This is the code :
>> InputStream in = TikaInputStream.get(inputFile.toPath());
>> PDFParser tmpPdf = new PDFParser();
>> PDFParserConfig config = tmpPdf.getPDFParserConfig();
>> config.setMaxMainMemoryBytes(31457280);
>> config.setExtractAcroFormContent(false);
>> config.setExtractBookmarksText(false);
>> config.setCatchIntermediateIOExceptions(true);
>> Metadata metadata = new Metadata();
>> metadata.set(HttpHelper.CONTENT_TYPE, "application/pdf");
>> tmpPdf.parse(inputStream, textHandler, this.metadata, new ParseContext());
>>
>>
>> On Tue, Feb 26, 2019 at 8:02 PM Tim Allison <ta...@apache.org> wrote:
>>
>>>
>>> This is the default in Tika, where the default for
>>> maxMainMemoryBytes=500MB.
>>>
>>> Slava, how are you calling this in Tika?  With a TikaInputStream via
>>> tika-app or tika-server or something else?
>>>
>>> MemoryUsageSetting memoryUsageSetting =
>>>     MemoryUsageSetting.setupMainMemoryOnly();
>>> if (localConfig.getMaxMainMemoryBytes() >= 0) {
>>>     memoryUsageSetting =
>>>         MemoryUsageSetting.setupMixed(localConfig.getMaxMainMemoryBytes());
>>> }
>>> if (tstream != null && tstream.hasFile()) {
>>>     // File based -- send file directly to PDFBox
>>>     pdfDocument = PDDocument.load(tstream.getPath().toFile(), password,
>>>         memoryUsageSetting);
>>> } else {
>>>     pdfDocument = PDDocument.load(new CloseShieldInputStream(stream),
>>>         password, memoryUsageSetting);
>>> }
>>>
>>> On Tue, Feb 26, 2019 at 12:43 PM Tilman Hausherr <TH...@t-online.de>
>>> wrote:
>>>
>>>> Hi,
>>>>
>>>> As usual, it would be nice to have the PDF, so that we could run the
>>>> profiler.
>>>>
>>>> The HashSet is used to avoid decrypting objects twice.
>>>>
>>>> The "not encrypted" file is likely encrypted with an empty user
>>>> password.
>>>>
>>>> It would also be interesting to hear what parameter is passed to
>>>> MemoryUsageSetting when load() is called.
>>>>
>>>> Tilman
>>>>
>>>>
>>>>
>>>> On 26.02.2019 at 18:14, Tim Allison wrote:
>>>> > PDFBox Colleagues,
>>>> >    Any ideas?
>>>> >
>>>> > ---------- Forwarded message ---------
>>>> > From: Tim Allison <ta...@apache.org>
>>>> > Date: Tue, Feb 26, 2019 at 12:13 PM
>>>> > Subject: Re: Very slow PDF parsing.
>>>> > To: <us...@tika.apache.org>
>>>> >
>>>> >
>>>> > Sorry...that's an OCR tool.  One thing that can slow down processing
>>>> > dramatically is if you have tesseract installed (try typing
>>>> 'tesseract' on
>>>> > your commandline) and if you've turned it on for PDFs.  I suspect this
>>>> > isn't your problem, though.
>>>> >
>>>> >
>>>> >
>>>> > On Tue, Feb 26, 2019 at 12:08 PM Slava G <sl...@gmail.com> wrote:
>>>> >
>>>> >> Thanks Tim,
>>>> >> But frankly speaking, I'm ashamed to say I don't know what
>>>> >> tesseract is in this context 🙂
>>>> >>
>>>> >> Thanks
>>>> >>
>>>> >> On Tue, Feb 26, 2019, 19:04 Tim Allison <ta...@apache.org> wrote:
>>>> >>
>>>> >>> Thank you, Slava!
>>>> >>>
>>>> >>> Do you have tesseract installed?
>>>> >>>
>>>> >>> Colleagues on PDFBox, any recommendations?
>>>> >>>

Re: Fwd: Very slow PDF parsing.

Posted by Tilman Hausherr <TH...@t-online.de>.

Yes, that was changed; it will use even more memory, although I believe
this isn't the main culprit in your file.
My suspicion is that the file has many pages and is also a tagged PDF,
and/or has huge content streams (e.g. long vector graphics).

Tilman

On 27.02.2019 at 00:05, Cristian Vat wrote:
> Just looking at the stack trace: it won't look the same anymore because of
> PDFBOX-4453. That change is in the not-yet-released PDFBox 2.0.14 and
> alters how decryption is handled. Not sure if it's related, though.
>
> Can you reproduce the problem without Tika, using just the PDFBox
> command-line ExtractText command
> ( https://pdfbox.apache.org/2.0/commandline.html ) on that file?
>
>
> On Tue, Feb 26, 2019 at 8:24 PM Slava G <sl...@gmail.com> wrote:
>
>> This is the code :
>> InputStream in = TikaInputStream.get(inputFile.toPath());
>> PDFParser tmpPdf = new PDFParser();
>> PDFParserConfig config = tmpPdf.getPDFParserConfig();
>> config.setMaxMainMemoryBytes(31457280);
>> config.setExtractAcroFormContent(false);
>> config.setExtractBookmarksText(false);
>> config.setCatchIntermediateIOExceptions(true);
>> Metadata metadata = new Metadata();
>> metadata.set(HttpHelper.CONTENT_TYPE, "application/pdf");
>> tmpPdf.parse(inputStream, textHandler, this.metadata, new ParseContext());
>>
>>
>> On Tue, Feb 26, 2019 at 8:02 PM Tim Allison <ta...@apache.org> wrote:
>>
>>> This is the default in Tika, where the default for
>>> maxMainMemoryBytes=500MB.
>>>
>>> Slava, how are you calling this in Tika?  With a TikaInputStream via
>>> tika-app or tika-server or something else?
>>>
>>> MemoryUsageSetting memoryUsageSetting =
>>>     MemoryUsageSetting.setupMainMemoryOnly();
>>> if (localConfig.getMaxMainMemoryBytes() >= 0) {
>>>     memoryUsageSetting =
>>>         MemoryUsageSetting.setupMixed(localConfig.getMaxMainMemoryBytes());
>>> }
>>> if (tstream != null && tstream.hasFile()) {
>>>     // File based -- send file directly to PDFBox
>>>     pdfDocument = PDDocument.load(tstream.getPath().toFile(), password,
>>>         memoryUsageSetting);
>>> } else {
>>>     pdfDocument = PDDocument.load(new CloseShieldInputStream(stream),
>>>         password, memoryUsageSetting);
>>> }
>>>
>>> On Tue, Feb 26, 2019 at 12:43 PM Tilman Hausherr <TH...@t-online.de>
>>> wrote:
>>>
>>>> Hi,
>>>>
>>>> As usual, it would be nice to have the PDF, so that we could run the
>>>> profiler.
>>>>
>>>> The HashSet is used to avoid decrypting objects twice.
>>>>
>>>> The "not encrypted" file is likely encrypted with an empty user password.
>>>>
>>>> It would also be interesting to hear what parameter is passed to
>>>> MemoryUsageSetting when load() is called.
>>>>
>>>> Tilman
>>>>
>>>>
>>>>
>>>> On 26.02.2019 at 18:14, Tim Allison wrote:
>>>>> PDFBox Colleagues,
>>>>>     Any ideas?
>>>>>
>>>>> ---------- Forwarded message ---------
>>>>> From: Tim Allison <ta...@apache.org>
>>>>> Date: Tue, Feb 26, 2019 at 12:13 PM
>>>>> Subject: Re: Very slow PDF parsing.
>>>>> To: <us...@tika.apache.org>
>>>>>
>>>>>
>>>>> Sorry...that's an OCR tool.  One thing that can slow down processing
>>>>> dramatically is if you have tesseract installed (try typing
>>>> 'tesseract' on
>>>>> your commandline) and if you've turned it on for PDFs.  I suspect this
>>>>> isn't your problem, though.
>>>>>
>>>>>
>>>>>
>>>>> On Tue, Feb 26, 2019 at 12:08 PM Slava G <sl...@gmail.com> wrote:
>>>>>
>>>>>> Thanks Tim,
>>>>>> But frankly speaking, I'm ashamed to say I don't know what
>>>>>> tesseract is in this context 🙂
>>>>>>
>>>>>> Thanks
>>>>>>
>>>>>> On Tue, Feb 26, 2019, 19:04 Tim Allison <ta...@apache.org> wrote:
>>>>>>
>>>>>>> Thank you, Slava!
>>>>>>>
>>>>>>> Do you have tesseract installed?
>>>>>>>
>>>>>>> Colleagues on PDFBox, any recommendations?
>>>>>>>


---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe@pdfbox.apache.org
For additional commands, e-mail: users-help@pdfbox.apache.org

Re: Fwd: Very slow PDF parsing.

Posted by Cristian Vat <cr...@gmail.com>.

Just looking at the stack trace: it won't look the same anymore because of
PDFBOX-4453. That change is in the not-yet-released PDFBox 2.0.14 and
alters how decryption is handled. Not sure if it's related, though.

Can you reproduce the problem without Tika, using just the PDFBox
command-line ExtractText command
( https://pdfbox.apache.org/2.0/commandline.html ) on that file?
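
A note on reading the stack trace: recursive HashMap$TreeNode.find frames
show up when many keys in one bucket share the same hash code and the key
class is not Comparable, so a lookup falls back to calling equals() on many
entries. The minimal sketch below shows that effect with a hypothetical Key
class. It is not PDFBox's COSString, and whether this is what happens inside
SecurityHandler for this particular file is an assumption. It counts equals()
calls instead of measuring time, so the result does not depend on the machine.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.concurrent.atomic.AtomicLong;

final class CollidingKeysDemo {

    // Counts how often equals() runs, as a deterministic stand-in for lookup cost.
    static final AtomicLong EQUALS_CALLS = new AtomicLong();

    // Hypothetical key type: value equality, deliberately not Comparable.
    static final class Key {
        private final int id;
        private final boolean collide;

        Key(int id, boolean collide) {
            this.id = id;
            this.collide = collide;
        }

        @Override
        public boolean equals(Object other) {
            EQUALS_CALLS.incrementAndGet();
            return other instanceof Key && ((Key) other).id == id;
        }

        @Override
        public int hashCode() {
            // With collide == true every key lands in the same bucket.
            return collide ? 42 : Integer.hashCode(id);
        }
    }

    // Fills a set with n keys, then returns how many equals() calls
    // were needed to look up each of those n keys once.
    static long equalsCallsForLookups(int n, boolean collide) {
        Set<Key> seen = new HashSet<>();
        for (int i = 0; i < n; i++) {
            seen.add(new Key(i, collide));
        }
        EQUALS_CALLS.set(0);
        for (int i = 0; i < n; i++) {
            if (!seen.contains(new Key(i, collide))) {
                throw new IllegalStateException("key " + i + " should be present");
            }
        }
        return EQUALS_CALLS.get();
    }

    public static void main(String[] args) {
        int n = 2_000;
        System.out.println("distinct hashes : " + equalsCallsForLookups(n, false) + " equals() calls");
        System.out.println("identical hashes: " + equalsCallsForLookups(n, true) + " equals() calls");
    }
}
```

With identical hash codes the count grows much faster than the number of
lookups, which would be consistent with a thread that is always found inside
HashSet.contains.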


On Tue, Feb 26, 2019 at 8:24 PM Slava G <sl...@gmail.com> wrote:

> This is the code :
> InputStream in = TikaInputStream.get(inputFile.toPath());
> PDFParser tmpPdf = new PDFParser();
> PDFParserConfig config = tmpPdf.getPDFParserConfig();
> config.setMaxMainMemoryBytes(31457280);
> config.setExtractAcroFormContent(false);
> config.setExtractBookmarksText(false);
> config.setCatchIntermediateIOExceptions(true);
> Metadata metadata = new Metadata();
> metadata.set(HttpHelper.CONTENT_TYPE, "application/pdf");
> tmpPdf.parse(inputStream, textHandler, this.metadata, new ParseContext());
>
>
> On Tue, Feb 26, 2019 at 8:02 PM Tim Allison <ta...@apache.org> wrote:
>
>>
>> This is the default in Tika, where the default for
>> maxMainMemoryBytes=500MB.
>>
>> Slava, how are you calling this in Tika?  With a TikaInputStream via
>> tika-app or tika-server or something else?
>>
>> MemoryUsageSetting memoryUsageSetting =
>>     MemoryUsageSetting.setupMainMemoryOnly();
>> if (localConfig.getMaxMainMemoryBytes() >= 0) {
>>     memoryUsageSetting =
>>         MemoryUsageSetting.setupMixed(localConfig.getMaxMainMemoryBytes());
>> }
>> if (tstream != null && tstream.hasFile()) {
>>     // File based -- send file directly to PDFBox
>>     pdfDocument = PDDocument.load(tstream.getPath().toFile(), password,
>>         memoryUsageSetting);
>> } else {
>>     pdfDocument = PDDocument.load(new CloseShieldInputStream(stream),
>>         password, memoryUsageSetting);
>> }
>>
>> On Tue, Feb 26, 2019 at 12:43 PM Tilman Hausherr <TH...@t-online.de>
>> wrote:
>>
>>> Hi,
>>>
>>> As usual, it would be nice to have the PDF, so that we could run the
>>> profiler.
>>>
>>> The HashSet is used to avoid decrypting objects twice.
>>>
>>> The "not encrypted" file is likely encrypted with an empty user password.
>>>
>>> It would also be interesting to hear what parameter is passed to
>>> MemoryUsageSetting when load() is called.
>>>
>>> Tilman
>>>
>>>
>>>
>>> On 26.02.2019 at 18:14, Tim Allison wrote:
>>> > PDFBox Colleagues,
>>> >    Any ideas?
>>> >
>>> > ---------- Forwarded message ---------
>>> > From: Tim Allison <ta...@apache.org>
>>> > Date: Tue, Feb 26, 2019 at 12:13 PM
>>> > Subject: Re: Very slow PDF parsing.
>>> > To: <us...@tika.apache.org>
>>> >
>>> >
>>> > Sorry...that's an OCR tool.  One thing that can slow down processing
>>> > dramatically is if you have tesseract installed (try typing
>>> 'tesseract' on
>>> > your commandline) and if you've turned it on for PDFs.  I suspect this
>>> > isn't your problem, though.
>>> >
>>> >
>>> >
>>> > On Tue, Feb 26, 2019 at 12:08 PM Slava G <sl...@gmail.com> wrote:
>>> >
>>> >> Thanks Tim,
>>> >> But frankly speaking, I'm ashamed to say I don't know what
>>> >> tesseract is in this context 🙂
>>> >>
>>> >> Thanks
>>> >>
>>> >> On Tue, Feb 26, 2019, 19:04 Tim Allison <ta...@apache.org> wrote:
>>> >>
>>> >>> Thank you, Slava!
>>> >>>
>>> >>> Do you have tesseract installed?
>>> >>>
>>> >>> Colleagues on PDFBox, any recommendations?
>>> >>>

Re: Fwd: Very slow PDF parsing.

Posted by Cristian Vat <cr...@gmail.com>.

Just looking at the stack trace: it won't look the same anymore because of
PDFBOX-4453. That change is in the not-yet-released PDFBox 2.0.14 and
alters how decryption is handled. Not sure if it's related, though.

Can you reproduce the problem without Tika, using just the PDFBox
command-line ExtractText command
( https://pdfbox.apache.org/2.0/commandline.html ) on that file?


On Tue, Feb 26, 2019 at 8:24 PM Slava G <sl...@gmail.com> wrote:

> This is the code :
> InputStream in = TikaInputStream.get(inputFile.toPath());
> PDFParser tmpPdf = new PDFParser();
> PDFParserConfig config = tmpPdf.getPDFParserConfig();
> config.setMaxMainMemoryBytes(31457280);
> config.setExtractAcroFormContent(false);
> config.setExtractBookmarksText(false);
> config.setCatchIntermediateIOExceptions(true);
> Metadata metadata = new Metadata();
> metadata.set(HttpHelper.CONTENT_TYPE, "application/pdf");
> tmpPdf.parse(inputStream, textHandler, this.metadata, new ParseContext());
>
>
> On Tue, Feb 26, 2019 at 8:02 PM Tim Allison <ta...@apache.org> wrote:
>
>>
>> This is the default in Tika, where the default for
>> maxMainMemoryBytes=500MB.
>>
>> Slava, how are you calling this in Tika?  With a TikaInputStream via
>> tika-app or tika-server or something else?
>>
>> MemoryUsageSetting memoryUsageSetting =
>> MemoryUsageSetting.setupMainMemoryOnly();
>> if (localConfig.getMaxMainMemoryBytes() >= 0) {
>> memoryUsageSetting =
>> MemoryUsageSetting.setupMixed(localConfig.getMaxMainMemoryBytes());
>> }
>> if (tstream != null && tstream.hasFile()) {
>> // File based -- send file directly to PDFBox
>> pdfDocument = PDDocument.load(tstream.getPath().toFile(), password,
>> memoryUsageSetting);
>> } else {
>> pdfDocument = PDDocument.load(new CloseShieldInputStream(stream),
>> password, memoryUsageSetting);
>> }
>>
>> On Tue, Feb 26, 2019 at 12:43 PM Tilman Hausherr <TH...@t-online.de>
>> wrote:
>>
>>> Hi,
>>>
>>> As usual, it would be nice to have the PDF, so that we could run the
>>> profiler.
>>>
>>> The HashSet is used to avoid decrypting objects twice.
>>>
>>> The "not encrypted" file is likely encrypted with an empty user password.
>>>
>>> It would also be interesting to hear what parameter is passed to
>>> MemoryUsageSetting when load() is called.
>>>
>>> Tilman
>>>
>>>
>>>
>>> On 26.02.2019 at 18:14, Tim Allison wrote:
>>> > PDFBox Colleagues,
>>> >    Any ideas?
>>> >
>>> > ---------- Forwarded message ---------
>>> > From: Tim Allison <ta...@apache.org>
>>> > Date: Tue, Feb 26, 2019 at 12:13 PM
>>> > Subject: Re: Very slow PDF parsing.
>>> > To: <us...@tika.apache.org>
>>> >
>>> >
>>> > Sorry...that's an OCR tool.  One thing that can slow down processing
>>> > dramatically is if you have tesseract installed (try typing 'tesseract'
>>> > on your commandline) and if you've turned it on for PDFs.  I suspect this
>>> > isn't your problem, though.
>>> >
>>> >
>>> >
>>> > On Tue, Feb 26, 2019 at 12:08 PM Slava G <sl...@gmail.com> wrote:
>>> >
>>> >> Thanks Tim,
>>> >> But frankly speaking, it's a shame, but don't know what is tessercat is
>>> >> in this context 🙂
>>> >>
>>> >> Thanks
>>> >>
>>> >> On Tue, Feb 26, 2019, 19:04 Tim Allison <ta...@apache.org> wrote:
>>> >>
>>> >>> Thank you, Slava!
>>> >>>
>>> >>> Do you have tesseract installed?
>>> >>>
>>> >>> Colleagues on PDFBox, any recommendations?
>>> >>>
>>> >>> On Tue, Feb 26, 2019 at 11:56 AM Slava G <sl...@gmail.com> wrote:
>>> >>>> Hi,
>>> >>>>
>>> >>>> I have large PDF (about 65mb) that contains mainly text and some images.
>>> >>>>
>>> >>>> Parsing of such PDF can take about 2 days or even more (TIKA 1.19.1
>>> >>>> running on XEON server with 4 cores CPU and 30GB RAM with SSD disk,
>>> >>>> running CentOS Linux).
>>> >>>>
>>> >>>> Please advise if there anything I can do to speedup.Or maybe it's a
>>> >>>> bug in PDFBox ?
>>> >>>>
>>> >>>> When I'm printing java stack , I see all the time in this stack :
>>> >>>>
>>> >>>> at org.apache.pdfbox.cos.COSString.equals(COSString.java:259)
>>> >>>> at java.util.HashMap$TreeNode.find(Unknown Source)
>>> >>>> at java.util.HashMap$TreeNode.find(Unknown Source)
>>> >>>> at java.util.HashMap$TreeNode.find(Unknown Source)
>>> >>>> at java.util.HashMap$TreeNode.find(Unknown Source)
>>> >>>> at java.util.HashMap$TreeNode.find(Unknown Source)
>>> >>>> at java.util.HashMap$TreeNode.find(Unknown Source)
>>> >>>> at java.util.HashMap$TreeNode.find(Unknown Source)
>>> >>>> at java.util.HashMap$TreeNode.find(Unknown Source)
>>> >>>> at java.util.HashMap$TreeNode.find(Unknown Source)
>>> >>>> at java.util.HashMap$TreeNode.getTreeNode(Unknown Source)
>>> >>>> at java.util.HashMap.getNode(Unknown Source)
>>> >>>> at java.util.HashMap.containsKey(Unknown Source)
>>> >>>> at java.util.HashSet.contains(Unknown Source)
>>> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:390)
>>> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decryptDictionary(SecurityHandler.java:517)
>>> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:404)
>>> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decryptArray(SecurityHandler.java:577)
>>> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:408)
>>> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decryptDictionary(SecurityHandler.java:517)
>>> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:404)
>>> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decryptDictionary(SecurityHandler.java:517)
>>> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:404)
>>> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decryptArray(SecurityHandler.java:577)
>>> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:408)
>>> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decryptDictionary(SecurityHandler.java:517)
>>> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:404)
>>> >>>> at org.apache.pdfbox.pdfparser.COSParser.parseFileObject(COSParser.java:946)
>>> >>>> at org.apache.pdfbox.pdfparser.COSParser.parseObjectDynamically(COSParser.java:874)
>>> >>>> at org.apache.pdfbox.pdfparser.COSParser.parseObjectDynamically(COSParser.java:794)
>>> >>>> at org.apache.pdfbox.pdfparser.COSParser.parseDictObjects(COSParser.java:754)
>>> >>>> at org.apache.pdfbox.pdfparser.PDFParser.initialParse(PDFParser.java:185)
>>> >>>> at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:220)
>>> >>>> at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1028)
>>> >>>> at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:984)
>>> >>>> at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:152)
>>> >>>>
>>> >>>> P.S. Btw, the PDF is not encrypted at all.
>>> >>>>
>>> >>>> Thanks
>>>
>>>
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: users-unsubscribe@pdfbox.apache.org
>>> For additional commands, e-mail: users-help@pdfbox.apache.org
>>>
>>>

Re: Fwd: Very slow PDF parsing.

Posted by Slava G <sl...@gmail.com>.

This is the code:
// textHandler is a ContentHandler created elsewhere (not shown).
InputStream inputStream = TikaInputStream.get(inputFile.toPath());
PDFParser tmpPdf = new PDFParser();
PDFParserConfig config = tmpPdf.getPDFParserConfig();
config.setMaxMainMemoryBytes(31457280); // 30 MiB
config.setExtractAcroFormContent(false);
config.setExtractBookmarksText(false);
config.setCatchIntermediateIOExceptions(true);
Metadata metadata = new Metadata();
metadata.set(HttpHelper.CONTENT_TYPE, "application/pdf");
tmpPdf.parse(inputStream, textHandler, metadata, new ParseContext());


On Tue, Feb 26, 2019 at 8:02 PM Tim Allison <ta...@apache.org> wrote:

>
> This is the default in Tika, where the default for
> maxMainMemoryBytes=500MB.
>
> Slava, how are you calling this in Tika?  With a TikaInputStream via
> tika-app or tika-server or something else?
>
> MemoryUsageSetting memoryUsageSetting =
> MemoryUsageSetting.setupMainMemoryOnly();
> if (localConfig.getMaxMainMemoryBytes() >= 0) {
> memoryUsageSetting =
> MemoryUsageSetting.setupMixed(localConfig.getMaxMainMemoryBytes());
> }
> if (tstream != null && tstream.hasFile()) {
> // File based -- send file directly to PDFBox
> pdfDocument = PDDocument.load(tstream.getPath().toFile(), password,
> memoryUsageSetting);
> } else {
> pdfDocument = PDDocument.load(new CloseShieldInputStream(stream),
> password, memoryUsageSetting);
> }

Re: Fwd: Very slow PDF parsing.

Posted by Tilman Hausherr <TH...@t-online.de>.

That is likely too small. It should be retested with a higher value or 
with memory only.

Tilman

On 26.02.2019 at 19:02, Tim Allison wrote:
> This is the default in Tika, where the default for maxMainMemoryBytes=500MB.
>
> Slava, how are you calling this in Tika?  With a TikaInputStream via
> tika-app or tika-server or something else?
>
> MemoryUsageSetting memoryUsageSetting =
> MemoryUsageSetting.setupMainMemoryOnly();
> if (localConfig.getMaxMainMemoryBytes() >= 0) {
> memoryUsageSetting =
> MemoryUsageSetting.setupMixed(localConfig.getMaxMainMemoryBytes());
> }
> if (tstream != null && tstream.hasFile()) {
> // File based -- send file directly to PDFBox
> pdfDocument = PDDocument.load(tstream.getPath().toFile(), password,
> memoryUsageSetting);
> } else {
> pdfDocument = PDDocument.load(new CloseShieldInputStream(stream), password,
> memoryUsageSetting);
> }

Re: Fwd: Very slow PDF parsing.

Posted by Tim Allison <ta...@apache.org>.

This is the default in Tika, where the default for maxMainMemoryBytes=500MB.

Slava, how are you calling this in Tika?  With a TikaInputStream via
tika-app or tika-server or something else?

MemoryUsageSetting memoryUsageSetting = MemoryUsageSetting.setupMainMemoryOnly();
if (localConfig.getMaxMainMemoryBytes() >= 0) {
    memoryUsageSetting =
            MemoryUsageSetting.setupMixed(localConfig.getMaxMainMemoryBytes());
}
if (tstream != null && tstream.hasFile()) {
    // File based -- send file directly to PDFBox
    pdfDocument = PDDocument.load(tstream.getPath().toFile(), password,
            memoryUsageSetting);
} else {
    pdfDocument = PDDocument.load(new CloseShieldInputStream(stream), password,
            memoryUsageSetting);
}
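
Read as plain logic, the snippet above keeps main-memory-only buffering when
maxMainMemoryBytes is negative and switches to mixed buffering (main memory up
to the limit, then temporary files) otherwise. The sketch below restates just
that branch in standalone Java without any Tika or PDFBox classes.
MemorySettingSketch and describe are hypothetical names, and 31457280 is the
value from the code posted elsewhere in this thread.

```java
/** Standalone restatement (hypothetical names) of the branch in the snippet above. */
class MemorySettingSketch {
    static final long MIB = 1024L * 1024L;

    /** Mirrors the branch: a non-negative limit selects mixed buffering. */
    static String describe(long maxMainMemoryBytes) {
        if (maxMainMemoryBytes >= 0) {
            return "mixed: up to " + (maxMainMemoryBytes / MIB)
                    + " MiB in main memory, rest in temp files";
        }
        return "main memory only";
    }

    public static void main(String[] args) {
        System.out.println(describe(31457280L));  // the limit from the posted code
        System.out.println(describe(500L * MIB)); // roughly the 500MB default mentioned above
        System.out.println(describe(-1L));        // a negative value skips setupMixed
    }
}
```

31457280 bytes is 30 MiB, less than half the size of the roughly 65 MB file,
so in mixed mode some buffered data may spill to temporary files. Going by the
snippet, passing a negative value to setMaxMainMemoryBytes would keep the
main-memory-only setting.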

On Tue, Feb 26, 2019 at 12:43 PM Tilman Hausherr <TH...@t-online.de>
wrote:

> Hi,
>
> As usual, it would be nice to have the PDF, so that we could run the
> profiler.
>
> The HashSet is used to avoid decrypting objects twice.
>
> The "not encrypted" file is likely encrypted with an empty user password.
>
> It would also be interesting to hear what parameter is passed to
> MemoryUsageSetting when load() is called.
>
> Tilman

Re: Fwd: Very slow PDF parsing.

Posted by Tim Allison <ta...@apache.org>.

Below is what Tika does by default. maxMainMemoryBytes defaults to 500MB, so
the setupMixed() branch applies.

Slava, how are you calling this in Tika?  With a TikaInputStream via
tika-app or tika-server or something else?

// Start with main-memory-only buffering
MemoryUsageSetting memoryUsageSetting =
        MemoryUsageSetting.setupMainMemoryOnly();
// A non-negative limit switches to mixed buffering (main memory up to the
// limit, temp file beyond it)
if (localConfig.getMaxMainMemoryBytes() >= 0) {
    memoryUsageSetting =
            MemoryUsageSetting.setupMixed(localConfig.getMaxMainMemoryBytes());
}
if (tstream != null && tstream.hasFile()) {
    // File based -- send file directly to PDFBox
    pdfDocument = PDDocument.load(tstream.getPath().toFile(), password,
            memoryUsageSetting);
} else {
    // Stream based -- shield the caller's stream from being closed by PDFBox
    pdfDocument = PDDocument.load(new CloseShieldInputStream(stream), password,
            memoryUsageSetting);
}

On Tue, Feb 26, 2019 at 12:43 PM Tilman Hausherr <TH...@t-online.de>
wrote:

> Hi,
>
> As usual, it would be nice to have the PDF, so that we could run the
> profiler.
>
> The HashSet is used to avoid decrypting objects twice.
>
> The "not encrypted" file is likely encrypted with an empty user password.
>
> It would also be interesting to hear what parameter is passed to
> MemoryUsageSetting when load() is called.
>
> Tilman
>
>
>
> Am 26.02.2019 um 18:14 schrieb Tim Allison:
> > PDFBox Colleagues,
> >    Any ideas?
> >
> > ---------- Forwarded message ---------
> > From: Tim Allison <ta...@apache.org>
> > Date: Tue, Feb 26, 2019 at 12:13 PM
> > Subject: Re: Very slow PDF parsing.
> > To: <us...@tika.apache.org>
> >
> >
> > Sorry...that's an OCR tool.  One thing that can slow down processing
> > dramatically is if you have tesseract installed (try typing 'tesseract'
> on
> > your commandline) and if you've turned it on for PDFs.  I suspect this
> > isn't your problem, though.
> >
> >
> >
> > On Tue, Feb 26, 2019 at 12:08 PM Slava G <sl...@gmail.com> wrote:
> >
> >> Thanks Tim,
> >> But frankly speaking, it's a shame, but don't know what is tessercat is
> in
> >> this context 🙂
> >>
> >> Thanks
> >>
> >> On Tue, Feb 26, 2019, 19:04 Tim Allison <ta...@apache.org> wrote:
> >>
> >>> Thank you, Slava!
> >>>
> >>> Do you have tesseract installed?
> >>>
> >>> Colleagues on PDFBox, any recommendations?
> >>>
> >>> On Tue, Feb 26, 2019 at 11:56 AM Slava G <sl...@gmail.com> wrote:
> >>>> Hi,
> >>>>
> >>>> I have large PDF (about 65mb) that contains mainly text and some
> images.
> >>>>
> >>>> Parsing of such PDF can take about 2 days or even more (TIKA 1.19.1
> >>> running on XEON server with 4 cores CPU and 30GB RAM with SSD disk,
> running
> >>> CentOS Linux).
> >>>> Please advise if there anything I can do to speedup.Or maybe it's a
> bug
> >>> in PDFBox ?
> >>>> When I'm printing java stack , I see all the time in this stack :
> >>>>
> >>>> at org.apache.pdfbox.cos.COSString.equals(COSString.java:259)
> >>>>
> >>>> at java.util.HashMap$TreeNode.find(Unknown Source)
> >>>>
> >>>> at java.util.HashMap$TreeNode.find(Unknown Source)
> >>>>
> >>>> at java.util.HashMap$TreeNode.find(Unknown Source)
> >>>>
> >>>> at java.util.HashMap$TreeNode.find(Unknown Source)
> >>>>
> >>>> at java.util.HashMap$TreeNode.find(Unknown Source)
> >>>>
> >>>> at java.util.HashMap$TreeNode.find(Unknown Source)
> >>>>
> >>>> at java.util.HashMap$TreeNode.find(Unknown Source)
> >>>>
> >>>> at java.util.HashMap$TreeNode.find(Unknown Source)
> >>>>
> >>>> at java.util.HashMap$TreeNode.find(Unknown Source)
> >>>>
> >>>> at java.util.HashMap$TreeNode.getTreeNode(Unknown Source)
> >>>>
> >>>> at java.util.HashMap.getNode(Unknown Source)
> >>>>
> >>>> at java.util.HashMap.containsKey(Unknown Source)
> >>>>
> >>>> at java.util.HashSet.contains(Unknown Source)
> >>>>
> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:390)
> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decryptDictionary(SecurityHandler.java:517)
> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:404)
> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decryptArray(SecurityHandler.java:577)
> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:408)
> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decryptDictionary(SecurityHandler.java:517)
> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:404)
> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decryptDictionary(SecurityHandler.java:517)
> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:404)
> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decryptArray(SecurityHandler.java:577)
> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:408)
> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decryptDictionary(SecurityHandler.java:517)
> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:404)
> >>>> at org.apache.pdfbox.pdfparser.COSParser.parseFileObject(COSParser.java:946)
> >>>> at org.apache.pdfbox.pdfparser.COSParser.parseObjectDynamically(COSParser.java:874)
> >>>> at org.apache.pdfbox.pdfparser.COSParser.parseObjectDynamically(COSParser.java:794)
> >>>> at org.apache.pdfbox.pdfparser.COSParser.parseDictObjects(COSParser.java:754)
> >>>> at org.apache.pdfbox.pdfparser.PDFParser.initialParse(PDFParser.java:185)
> >>>> at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:220)
> >>>>
> >>>> at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1028)
> >>>>
> >>>> at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:984)
> >>>>
> >>>> at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:152)
> >>>>
> >>>>
> >>>> P.S. Btw, the PDF is not encrypted at all.
> >>>>
> >>>> Thanks
>
>
>
>
>

Re: Fwd: Very slow PDF parsing.

Posted by Tilman Hausherr <TH...@t-online.de>.

Hi,

As usual, it would be nice to have the PDF, so that we could run the 
profiler.

The HashSet is used to avoid decrypting objects twice.
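
The stack trace quoted in this thread shows HashSet.contains descending
through nested HashMap$TreeNode.find calls and ending in COSString.equals.
That is the shape a lookup takes when many keys share one hash bucket and the
keys are not Comparable. The sketch below reproduces only that general effect
with the JDK. It is not PDFBox code: EqualsCountDemo and Key are hypothetical
names, and whether the reported PDF really produces colliding hashes is an
assumption that only a profiler run on the file could confirm.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical demo, not PDFBox code: counts equals() calls made by HashSet lookups.
class EqualsCountDemo {

    static final AtomicLong EQUALS_CALLS = new AtomicLong();

    // A key whose hash code can be forced to a constant to simulate collisions.
    static final class Key {
        final int id;
        final boolean collide;

        Key(int id, boolean collide) {
            this.id = id;
            this.collide = collide;
        }

        @Override
        public int hashCode() {
            return collide ? 42 : Integer.hashCode(id);
        }

        @Override
        public boolean equals(Object o) {
            EQUALS_CALLS.incrementAndGet();
            return o instanceof Key && ((Key) o).id == id;
        }
    }

    // Fills a set with n keys, looks each one up again, and returns the
    // number of equals() calls made by the lookups alone.
    static long equalsCallsForLookups(int n, boolean collide) {
        Set<Key> seen = new HashSet<>();
        for (int i = 0; i < n; i++) {
            seen.add(new Key(i, collide));
        }
        EQUALS_CALLS.set(0);
        for (int i = 0; i < n; i++) {
            if (!seen.contains(new Key(i, collide))) {
                throw new IllegalStateException("missing key " + i);
            }
        }
        return EQUALS_CALLS.get();
    }

    public static void main(String[] args) {
        int n = 2000;
        System.out.println("distinct hashes:  "
                + equalsCallsForLookups(n, false) + " equals() calls");
        System.out.println("colliding hashes: "
                + equalsCallsForLookups(n, true) + " equals() calls");
    }
}
```

With distinct hash codes each lookup needs about one equals() call. With
colliding hash codes every lookup has to walk the tree bin, so the total grows
roughly with the square of the number of entries.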

The "not encrypted" file is likely encrypted with an empty user password.
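
An encrypted PDF carries an /Encrypt entry in its trailer (or cross-reference
stream dictionary), even when the user password is empty and viewers open the
file without prompting. A rough way to check for that without loading the
document is a plain byte search. The sketch below uses only the JDK;
EncryptKeyCheck is a hypothetical helper, and the search is a heuristic: it
assumes the key appears as plain ASCII and could in principle also match
unrelated bytes inside stream data.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical helper, JDK only: heuristic search for an /Encrypt key in a PDF.
class EncryptKeyCheck {

    private static final byte[] NEEDLE =
            "/Encrypt".getBytes(StandardCharsets.US_ASCII);

    // Returns true if the bytes contain the ASCII sequence "/Encrypt".
    static boolean containsEncryptKey(byte[] data) {
        outer:
        for (int i = 0; i + NEEDLE.length <= data.length; i++) {
            for (int j = 0; j < NEEDLE.length; j++) {
                if (data[i + j] != NEEDLE[j]) {
                    continue outer;
                }
            }
            return true;
        }
        return false;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java EncryptKeyCheck.java <file.pdf>");
            return;
        }
        // Reads the whole file into memory, which is fine for a one-off
        // check of a file in the tens of megabytes.
        byte[] data = Files.readAllBytes(Paths.get(args[0]));
        System.out.println(containsEncryptKey(data)
                ? "Found /Encrypt: the file is probably encrypted."
                : "No /Encrypt key found.");
    }
}
```

If the document does load, PDFBox's PDDocument.isEncrypted() gives the
authoritative answer; the byte search is only a quick pre-check.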

It would also be interesting to hear what parameter is passed to 
MemoryUsageSetting when load() is called.

Tilman



Am 26.02.2019 um 18:14 schrieb Tim Allison:
> PDFBox Colleagues,
>    Any ideas?
>
> ---------- Forwarded message ---------
> From: Tim Allison <ta...@apache.org>
> Date: Tue, Feb 26, 2019 at 12:13 PM
> Subject: Re: Very slow PDF parsing.
> To: <us...@tika.apache.org>
>
>
> Sorry...that's an OCR tool.  One thing that can slow down processing
> dramatically is if you have tesseract installed (try typing 'tesseract' on
> your commandline) and if you've turned it on for PDFs.  I suspect this
> isn't your problem, though.
>
>
>
> On Tue, Feb 26, 2019 at 12:08 PM Slava G <sl...@gmail.com> wrote:
>
>> Thanks Tim,
>> But frankly speaking, it's a shame, but don't know what is tessercat is in
>> this context 🙂
>>
>> Thanks
>>
>> On Tue, Feb 26, 2019, 19:04 Tim Allison <ta...@apache.org> wrote:
>>
>>> Thank you, Slava!
>>>
>>> Do you have tesseract installed?
>>>
>>> Colleagues on PDFBox, any recommendations?
>>>
>>> On Tue, Feb 26, 2019 at 11:56 AM Slava G <sl...@gmail.com> wrote:
>>>> Hi,
>>>>
>>>> I have large PDF (about 65mb) that contains mainly text and some images.
>>>>
>>>> Parsing of such PDF can take about 2 days or even more (TIKA 1.19.1
>>> running on XEON server with 4 cores CPU and 30GB RAM with SSD disk, running
>>> CentOS Linux).
>>>> Please advise if there anything I can do to speedup.Or maybe it's a bug
>>> in PDFBox ?
>>>> When I'm printing java stack , I see all the time in this stack :
>>>>
>>>> at org.apache.pdfbox.cos.COSString.equals(COSString.java:259)
>>>>
>>>> at java.util.HashMap$TreeNode.find(Unknown Source)
>>>>
>>>> at java.util.HashMap$TreeNode.find(Unknown Source)
>>>>
>>>> at java.util.HashMap$TreeNode.find(Unknown Source)
>>>>
>>>> at java.util.HashMap$TreeNode.find(Unknown Source)
>>>>
>>>> at java.util.HashMap$TreeNode.find(Unknown Source)
>>>>
>>>> at java.util.HashMap$TreeNode.find(Unknown Source)
>>>>
>>>> at java.util.HashMap$TreeNode.find(Unknown Source)
>>>>
>>>> at java.util.HashMap$TreeNode.find(Unknown Source)
>>>>
>>>> at java.util.HashMap$TreeNode.find(Unknown Source)
>>>>
>>>> at java.util.HashMap$TreeNode.getTreeNode(Unknown Source)
>>>>
>>>> at java.util.HashMap.getNode(Unknown Source)
>>>>
>>>> at java.util.HashMap.containsKey(Unknown Source)
>>>>
>>>> at java.util.HashSet.contains(Unknown Source)
>>>>
>>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:390)
>>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decryptDictionary(SecurityHandler.java:517)
>>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:404)
>>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decryptArray(SecurityHandler.java:577)
>>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:408)
>>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decryptDictionary(SecurityHandler.java:517)
>>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:404)
>>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decryptDictionary(SecurityHandler.java:517)
>>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:404)
>>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decryptArray(SecurityHandler.java:577)
>>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:408)
>>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decryptDictionary(SecurityHandler.java:517)
>>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:404)
>>>> at org.apache.pdfbox.pdfparser.COSParser.parseFileObject(COSParser.java:946)
>>>> at org.apache.pdfbox.pdfparser.COSParser.parseObjectDynamically(COSParser.java:874)
>>>> at org.apache.pdfbox.pdfparser.COSParser.parseObjectDynamically(COSParser.java:794)
>>>> at org.apache.pdfbox.pdfparser.COSParser.parseDictObjects(COSParser.java:754)
>>>> at org.apache.pdfbox.pdfparser.PDFParser.initialParse(PDFParser.java:185)
>>>> at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:220)
>>>>
>>>> at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1028)
>>>>
>>>> at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:984)
>>>>
>>>> at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:152)
>>>>
>>>>
>>>> P.S. Btw, the PDF is not encrypted at all.
>>>>
>>>> Thanks




Fwd: Very slow PDF parsing.

Posted by Tim Allison <ta...@apache.org>.

PDFBox Colleagues,
  Any ideas?

---------- Forwarded message ---------
From: Tim Allison <ta...@apache.org>
Date: Tue, Feb 26, 2019 at 12:13 PM
Subject: Re: Very slow PDF parsing.
To: <us...@tika.apache.org>


Sorry...that's an OCR tool.  One thing that can slow down processing
dramatically is if you have tesseract installed (try typing 'tesseract' on
your commandline) and if you've turned it on for PDFs.  I suspect this
isn't your problem, though.



On Tue, Feb 26, 2019 at 12:08 PM Slava G <sl...@gmail.com> wrote:

> Thanks Tim,
> But frankly speaking, it's a shame, but don't know what is tessercat is in
> this context 🙂
>
> Thanks
>
> On Tue, Feb 26, 2019, 19:04 Tim Allison <ta...@apache.org> wrote:
>
>> Thank you, Slava!
>>
>> Do you have tesseract installed?
>>
>> Colleagues on PDFBox, any recommendations?
>>
>> On Tue, Feb 26, 2019 at 11:56 AM Slava G <sl...@gmail.com> wrote:
>> >
>> > Hi,
>> >
>> > I have large PDF (about 65mb) that contains mainly text and some images.
>> >
>> > Parsing of such PDF can take about 2 days or even more (TIKA 1.19.1
>> running on XEON server with 4 cores CPU and 30GB RAM with SSD disk, running
>> CentOS Linux).
>> >
>> > Please advise if there anything I can do to speedup.Or maybe it's a bug
>> in PDFBox ?
>> >
>> > When I'm printing java stack , I see all the time in this stack :
>> >
>> > at org.apache.pdfbox.cos.COSString.equals(COSString.java:259)
>> >
>> > at java.util.HashMap$TreeNode.find(Unknown Source)
>> >
>> > at java.util.HashMap$TreeNode.find(Unknown Source)
>> >
>> > at java.util.HashMap$TreeNode.find(Unknown Source)
>> >
>> > at java.util.HashMap$TreeNode.find(Unknown Source)
>> >
>> > at java.util.HashMap$TreeNode.find(Unknown Source)
>> >
>> > at java.util.HashMap$TreeNode.find(Unknown Source)
>> >
>> > at java.util.HashMap$TreeNode.find(Unknown Source)
>> >
>> > at java.util.HashMap$TreeNode.find(Unknown Source)
>> >
>> > at java.util.HashMap$TreeNode.find(Unknown Source)
>> >
>> > at java.util.HashMap$TreeNode.getTreeNode(Unknown Source)
>> >
>> > at java.util.HashMap.getNode(Unknown Source)
>> >
>> > at java.util.HashMap.containsKey(Unknown Source)
>> >
>> > at java.util.HashSet.contains(Unknown Source)
>> >
>> > at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:390)
>> >
>> > at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decryptDictionary(SecurityHandler.java:517)
>> >
>> > at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:404)
>> >
>> > at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decryptArray(SecurityHandler.java:577)
>> >
>> > at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:408)
>> >
>> > at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decryptDictionary(SecurityHandler.java:517)
>> >
>> > at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:404)
>> >
>> > at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decryptDictionary(SecurityHandler.java:517)
>> >
>> > at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:404)
>> >
>> > at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decryptArray(SecurityHandler.java:577)
>> >
>> > at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:408)
>> >
>> > at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decryptDictionary(SecurityHandler.java:517)
>> >
>> > at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:404)
>> >
>> > at org.apache.pdfbox.pdfparser.COSParser.parseFileObject(COSParser.java:946)
>> >
>> > at org.apache.pdfbox.pdfparser.COSParser.parseObjectDynamically(COSParser.java:874)
>> >
>> > at org.apache.pdfbox.pdfparser.COSParser.parseObjectDynamically(COSParser.java:794)
>> >
>> > at org.apache.pdfbox.pdfparser.COSParser.parseDictObjects(COSParser.java:754)
>> >
>> > at org.apache.pdfbox.pdfparser.PDFParser.initialParse(PDFParser.java:185)
>> >
>> > at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:220)
>> >
>> > at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1028)
>> >
>> > at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:984)
>> >
>> > at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:152)
>> >
>> >
>> > P.S. Btw, the PDF is not encrypted at all.
>> >
>> > Thanks
>>
>

Re: Very slow PDF parsing.

Posted by Tim Allison <ta...@apache.org>.

Sorry, I should have explained: Tesseract is an OCR tool. Processing can slow
down dramatically if you have Tesseract installed (try typing 'tesseract' on
your command line) and have turned it on for PDFs. I suspect this isn't your
problem, though.
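
The command-line check can also be done from Java on the server itself. The
sketch below uses only the JDK and walks the PATH environment variable looking
for an executable named tesseract. FindOnPath is a hypothetical helper, and it
assumes the Java process sees the same PATH as your shell, which may not hold
for a service started by an init system.

```java
import java.io.File;
import java.nio.file.Files;
import java.nio.file.InvalidPathException;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Optional;

// Hypothetical helper, JDK only: looks for an executable on a PATH-style string.
class FindOnPath {

    // Returns the first executable regular file called `name` (or name + ".exe")
    // found in the directories listed in pathValue, if any.
    static Optional<Path> find(String name, String pathValue) {
        if (pathValue == null || pathValue.isEmpty()) {
            return Optional.empty();
        }
        for (String dir : pathValue.split(File.pathSeparator)) {
            if (dir.isEmpty()) {
                continue;
            }
            for (String candidate : new String[] {name, name + ".exe"}) {
                try {
                    Path p = Paths.get(dir, candidate);
                    if (Files.isRegularFile(p) && Files.isExecutable(p)) {
                        return Optional.of(p);
                    }
                } catch (InvalidPathException e) {
                    // Skip PATH entries that are not valid paths on this platform.
                }
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        Optional<Path> hit = find("tesseract", System.getenv("PATH"));
        System.out.println(hit.map(p -> "tesseract found at " + p)
                .orElse("tesseract not found on PATH"));
    }
}
```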



On Tue, Feb 26, 2019 at 12:08 PM Slava G <sl...@gmail.com> wrote:

> Thanks Tim,
> But frankly speaking, it's a shame, but don't know what is tessercat is in
> this context 🙂
>
> Thanks
>
> On Tue, Feb 26, 2019, 19:04 Tim Allison <ta...@apache.org> wrote:
>
>> Thank you, Slava!
>>
>> Do you have tesseract installed?
>>
>> Colleagues on PDFBox, any recommendations?
>>
>> On Tue, Feb 26, 2019 at 11:56 AM Slava G <sl...@gmail.com> wrote:
>> >
>> > Hi,
>> >
>> > I have large PDF (about 65mb) that contains mainly text and some images.
>> >
>> > Parsing of such PDF can take about 2 days or even more (TIKA 1.19.1
>> running on XEON server with 4 cores CPU and 30GB RAM with SSD disk, running
>> CentOS Linux).
>> >
>> > Please advise if there anything I can do to speedup.Or maybe it's a bug
>> in PDFBox ?
>> >
>> > When I'm printing java stack , I see all the time in this stack :
>> >
>> > at org.apache.pdfbox.cos.COSString.equals(COSString.java:259)
>> >
>> > at java.util.HashMap$TreeNode.find(Unknown Source)
>> >
>> > at java.util.HashMap$TreeNode.find(Unknown Source)
>> >
>> > at java.util.HashMap$TreeNode.find(Unknown Source)
>> >
>> > at java.util.HashMap$TreeNode.find(Unknown Source)
>> >
>> > at java.util.HashMap$TreeNode.find(Unknown Source)
>> >
>> > at java.util.HashMap$TreeNode.find(Unknown Source)
>> >
>> > at java.util.HashMap$TreeNode.find(Unknown Source)
>> >
>> > at java.util.HashMap$TreeNode.find(Unknown Source)
>> >
>> > at java.util.HashMap$TreeNode.find(Unknown Source)
>> >
>> > at java.util.HashMap$TreeNode.getTreeNode(Unknown Source)
>> >
>> > at java.util.HashMap.getNode(Unknown Source)
>> >
>> > at java.util.HashMap.containsKey(Unknown Source)
>> >
>> > at java.util.HashSet.contains(Unknown Source)
>> >
>> > at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:390)
>> >
>> > at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decryptDictionary(SecurityHandler.java:517)
>> >
>> > at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:404)
>> >
>> > at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decryptArray(SecurityHandler.java:577)
>> >
>> > at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:408)
>> >
>> > at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decryptDictionary(SecurityHandler.java:517)
>> >
>> > at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:404)
>> >
>> > at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decryptDictionary(SecurityHandler.java:517)
>> >
>> > at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:404)
>> >
>> > at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decryptArray(SecurityHandler.java:577)
>> >
>> > at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:408)
>> >
>> > at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decryptDictionary(SecurityHandler.java:517)
>> >
>> > at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:404)
>> >
>> > at org.apache.pdfbox.pdfparser.COSParser.parseFileObject(COSParser.java:946)
>> >
>> > at org.apache.pdfbox.pdfparser.COSParser.parseObjectDynamically(COSParser.java:874)
>> >
>> > at org.apache.pdfbox.pdfparser.COSParser.parseObjectDynamically(COSParser.java:794)
>> >
>> > at org.apache.pdfbox.pdfparser.COSParser.parseDictObjects(COSParser.java:754)
>> >
>> > at org.apache.pdfbox.pdfparser.PDFParser.initialParse(PDFParser.java:185)
>> >
>> > at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:220)
>> >
>> > at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1028)
>> >
>> > at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:984)
>> >
>> > at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:152)
>> >
>> >
>> > P.S. Btw, the PDF is not encrypted at all.
>> >
>> > Thanks
>>
>

Re: Very slow PDF parsing.

Posted by Slava G <sl...@gmail.com>.

Thanks Tim,
Frankly speaking, I'm embarrassed to admit that I don't know what tesseract
is in this context 🙂

Thanks

On Tue, Feb 26, 2019, 19:04 Tim Allison <ta...@apache.org> wrote:

> Thank you, Slava!
>
> Do you have tesseract installed?
>
> Colleagues on PDFBox, any recommendations?

Re: Very slow PDF parsing.

Posted by Tim Allison <ta...@apache.org>.

Thank you, Slava!

Do you have tesseract installed?

Colleagues on PDFBox, any recommendations?
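
For context on the tesseract question: Tesseract is an external OCR engine, and Tika can hand images to it when the `tesseract` binary is available on the PATH, which can add a lot of processing time. The following is only a rough sketch of one way to check for the binary from Java; the class and method names are made up for illustration and are not part of Tika or PDFBox.

```java
import java.io.IOException;
import java.io.InputStream;

public class TesseractCheck {

    /**
     * Hypothetical helper: returns true if running "<command> --version"
     * starts successfully and exits with status 0.
     */
    static boolean isOnPath(String command) {
        try {
            Process p = new ProcessBuilder(command, "--version")
                    .redirectErrorStream(true) // merge stderr into stdout
                    .start();
            // Drain the output so the child process cannot block on a full pipe.
            try (InputStream in = p.getInputStream()) {
                byte[] buf = new byte[4096];
                while (in.read(buf) != -1) {
                    // discard
                }
            }
            return p.waitFor() == 0;
        } catch (IOException e) {
            // Thrown when the executable cannot be found or started.
            return false;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println("tesseract on PATH: " + isOnPath("tesseract"));
    }
}
```

If this prints `false`, OCR is unlikely to be what slows the parse down. Note that the stack trace in this thread is inside `PDDocument.load`, so the time is being spent while the document is loaded.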

On Tue, Feb 26, 2019 at 11:56 AM Slava G <sl...@gmail.com> wrote:
>
> Hi,
>
> I have large PDF (about 65mb) that contains mainly text and some images.
>
> Parsing of such PDF can take about 2 days or even more (TIKA 1.19.1 running on XEON server with 4 cores CPU and 30GB RAM with SSD disk, running CentOS Linux).
>
> Please advise if there anything I can do to speedup.Or maybe it's a bug in PDFBox ?
>
> When I'm printing java stack , I see all the time in this stack :
>
> at org.apache.pdfbox.cos.COSString.equals(COSString.java:259)
>
> at java.util.HashMap$TreeNode.find(Unknown Source)
>
> at java.util.HashMap$TreeNode.find(Unknown Source)
>
> at java.util.HashMap$TreeNode.find(Unknown Source)
>
> at java.util.HashMap$TreeNode.find(Unknown Source)
>
> at java.util.HashMap$TreeNode.find(Unknown Source)
>
> at java.util.HashMap$TreeNode.find(Unknown Source)
>
> at java.util.HashMap$TreeNode.find(Unknown Source)
>
> at java.util.HashMap$TreeNode.find(Unknown Source)
>
> at java.util.HashMap$TreeNode.find(Unknown Source)
>
> at java.util.HashMap$TreeNode.getTreeNode(Unknown Source)
>
> at java.util.HashMap.getNode(Unknown Source)
>
> at java.util.HashMap.containsKey(Unknown Source)
>
> at java.util.HashSet.contains(Unknown Source)
>
> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:390)
>
> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decryptDictionary(SecurityHandler.java:517)
>
> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:404)
>
> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decryptArray(SecurityHandler.java:577)
>
> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:408)
>
> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decryptDictionary(SecurityHandler.java:517)
>
> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:404)
>
> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decryptDictionary(SecurityHandler.java:517)
>
> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:404)
>
> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decryptArray(SecurityHandler.java:577)
>
> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:408)
>
> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decryptDictionary(SecurityHandler.java:517)
>
> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:404)
>
> at org.apache.pdfbox.pdfparser.COSParser.parseFileObject(COSParser.java:946)
>
> at org.apache.pdfbox.pdfparser.COSParser.parseObjectDynamically(COSParser.java:874)
>
> at org.apache.pdfbox.pdfparser.COSParser.parseObjectDynamically(COSParser.java:794)
>
> at org.apache.pdfbox.pdfparser.COSParser.parseDictObjects(COSParser.java:754)
>
> at org.apache.pdfbox.pdfparser.PDFParser.initialParse(PDFParser.java:185)
>
> at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:220)
>
> at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1028)
>
> at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:984)
>
> at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:152)
>
>
> P.S. Btw, the PDF is not encrypted at all.
>
> Thanks
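
A note on reading the trace above: the run of `HashMap$TreeNode.find` frames ending in `COSString.equals` is what a `HashSet.contains` call looks like when many keys land in the same hash bucket and the set has to fall back to comparing them one by one. Whether that is the actual cause here is not confirmed in this thread. The sketch below uses a made-up `Key` class (not PDFBox code) to show how the number of `equals()` calls grows when every key reports the same hash code.

```java
import java.util.HashSet;
import java.util.Set;

public class CollidingKeysDemo {

    /** Number of equals() calls made since the last reset. */
    static long equalsCalls = 0;

    /** Hypothetical key; when collide is true every instance has the same hash. */
    static final class Key {
        final int id;
        final boolean collide;

        Key(int id, boolean collide) {
            this.id = id;
            this.collide = collide;
        }

        @Override
        public int hashCode() {
            return collide ? 42 : id;
        }

        @Override
        public boolean equals(Object o) {
            equalsCalls++;
            return o instanceof Key && ((Key) o).id == id;
        }
    }

    /** Inserts n distinct keys with a contains() check first; returns equals() calls. */
    static long countEqualsCalls(int n, boolean collide) {
        equalsCalls = 0;
        Set<Key> seen = new HashSet<>();
        for (int i = 0; i < n; i++) {
            Key k = new Key(i, collide);
            if (!seen.contains(k)) {
                seen.add(k);
            }
        }
        return equalsCalls;
    }

    public static void main(String[] args) {
        int n = 5_000;
        System.out.println("distinct hashes:  " + countEqualsCalls(n, false) + " equals() calls");
        System.out.println("colliding hashes: " + countEqualsCalls(n, true) + " equals() calls");
    }
}
```

With colliding hashes the work per lookup grows with the size of the set, so the total grows roughly quadratically with the number of keys. That would be one way a large document with many similar objects could end up spending most of its time in `equals()`.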