
Posted to user@tika.apache.org by Konstantin Gribov <gr...@gmail.com> on 2019/03/21 16:56:40 UTC

Re: Fwd: Very slow PDF parsing.

Slava,

Could you please forward this PDF to private@tika.apache.org (the Tika
PMC-only private list)? I had similar issues with some PDFs but was unable
to get them from the client to look into them with a profiler.

-- 
Best regards,
Konstantin Gribov.


On Thu, Feb 28, 2019 at 7:27 PM Slava G <sl...@gmail.com> wrote:

> Tim, to what email to send you the PDF ?
> Thanks
>
> On Thu, Feb 28, 2019 at 3:57 PM Slava G <sl...@gmail.com> wrote:
>
>> I'll once I'll get customer's approval.
>> Meanwhile I can do any checks, if you can specify what to check.
>> Thanks
>>
>> On Thu, Feb 28, 2019 at 3:56 PM Tim Allison <ta...@apache.org> wrote:
>>
>>> Any chance you can share the file directly w me or someone else on the
>>> PDFBox team?
>>>
>>> On Wed, Feb 27, 2019 at 11:24 AM Slava G <sl...@gmail.com> wrote:
>>>
>>> > After 3h 40m it's still parsing using PDFBox 2.0.14 app...
>>> > Thanks
>>> >
>>> > On Wed, Feb 27, 2019 at 3:29 PM Slava G <sl...@gmail.com> wrote:
>>> >
>>> >> With 2.0.14 it's 40 minutes running, no result, still working...
>>> >> Seems that issue is still there.
>>> >> Thanks
>>> >>
>>> >> On Wed, Feb 27, 2019 at 2:52 PM Slava G <sl...@gmail.com> wrote:
>>> >>
>>> >>> Checking with 2.0.14. Started as an app. Will update soon.
>>> >>>
>>> >>> On Wed, Feb 27, 2019 at 2:47 PM Tim Allison <ta...@apache.org>
>>> wrote:
>>> >>>
>>> >>>> Any chance you could try with the 2.0.14 release candidate...unless
>>> you
>>> >>>> have already?
>>> >>>>
>>> >>>> https://dist.apache.org/repos/dist/dev/pdfbox/2.0.14/
>>> >>>>
>>> >>>>
>>> >>>> On Wed, Feb 27, 2019 at 3:04 AM Slava G <sl...@gmail.com> wrote:
>>> >>>>
>>> >>>>> Well, I ran (as was suggested) PDFBox app to extract text , so far
>>> 2
>>> >>>>> hours and still counting...
>>> >>>>> It's seems to be a PDFBox issue.
>>> >>>>>
>>> >>>>> On Wed, Feb 27, 2019 at 9:51 AM JB Data31 <jb...@gmail.com>
>>> wrote:
>>> >>>>>
>>> >>>>>> Why don't you do a basic test with tika server in a 3thrd and a
>>> >>>>>> *wget* or *curl* bash client to parse your 65Mo PDF.
>>> >>>>>> It can be easier to investigate the problem.
>>> >>>>>>
>>> >>>>>> @*JB*Δ <http://jbigdata.fr/jbigdata/index.html>
>>> >>>>>>
>>> >>>>>>
>>> >>>>>>
>>> >>>>>> On Tue, Feb 26, 2019 at 11:05 PM Cristian Vat <cristian.vat@gmail.com> wrote:
>>> >>>>>>
>>> >>>>>>> Just looking at the stack trace it won't be the same anymore due
>>> to
>>> >>>>>>> PDFBOX-4453
>>> >>>>>>> Some changes present in not yet released pdfbox 2.0.14 and it
>>> >>>>>>> changes how decryption is handled. Not sure if related though.
>>> >>>>>>>
>>> >>>>>>> Can you duplicate the problem without Tika using just PDFBox
>>> >>>>>>> command-line ExtractText command (
>>> >>>>>>> https://pdfbox.apache.org/2.0/commandline.html ) on that file?
>>> >>>>>>>
>>> >>>>>>>
>>> >>>>>>> On Tue, Feb 26, 2019 at 8:24 PM Slava G <sl...@gmail.com>
>>> wrote:
>>> >>>>>>>
>>> >>>>>>>> This is the code :
>>> >>>>>>>> InputStream in = TikaInputStream.get(inputFile.toPath());
>>> >>>>>>>> PDFParser tmpPdf = new PDFParser();
>>> >>>>>>>> PDFParserConfig config = tmpPdf.getPDFParserConfig();
>>> >>>>>>>> config.setMaxMainMemoryBytes(31457280);
>>> >>>>>>>> config.setExtractAcroFormContent(false);
>>> >>>>>>>> config.setExtractBookmarksText(false);
>>> >>>>>>>> config.setCatchIntermediateIOExceptions(true);
>>> >>>>>>>> Metadata metadata = new Metadata();
>>> >>>>>>>> metadata.set(HttpHelper.CONTENT_TYPE, "application/pdf");
>>> >>>>>>>> tmpPdf.parse(inputStream, textHandler, this.metadata, new
>>> >>>>>>>> ParseContext());
>>> >>>>>>>>
>>> >>>>>>>>
>>> >>>>>>>> On Tue, Feb 26, 2019 at 8:02 PM Tim Allison <
>>> tallison@apache.org>
>>> >>>>>>>> wrote:
>>> >>>>>>>>
>>> >>>>>>>>>
>>> >>>>>>>>> This is the default in Tika, where the default for
>>> >>>>>>>>> maxMainMemoryBytes=500MB.
>>> >>>>>>>>>
>>> >>>>>>>>> Slava, how are you calling this in Tika?  With a
>>> TikaInputStream
>>> >>>>>>>>> via tika-app or tika-server or something else?
>>> >>>>>>>>>
>>> >>>>>>>>> MemoryUsageSetting memoryUsageSetting =
>>> >>>>>>>>> MemoryUsageSetting.setupMainMemoryOnly();
>>> >>>>>>>>> if (localConfig.getMaxMainMemoryBytes() >= 0) {
>>> >>>>>>>>> memoryUsageSetting =
>>> >>>>>>>>>
>>> MemoryUsageSetting.setupMixed(localConfig.getMaxMainMemoryBytes());
>>> >>>>>>>>> }
>>> >>>>>>>>> if (tstream != null && tstream.hasFile()) {
>>> >>>>>>>>> // File based -- send file directly to PDFBox
>>> >>>>>>>>> pdfDocument = PDDocument.load(tstream.getPath().toFile(),
>>> >>>>>>>>> password, memoryUsageSetting);
>>> >>>>>>>>> } else {
>>> >>>>>>>>> pdfDocument = PDDocument.load(new
>>> CloseShieldInputStream(stream),
>>> >>>>>>>>> password, memoryUsageSetting);
>>> >>>>>>>>> }
>>> >>>>>>>>>
>>> >>>>>>>>> On Tue, Feb 26, 2019 at 12:43 PM Tilman Hausherr <
>>> >>>>>>>>> THausherr@t-online.de> wrote:
>>> >>>>>>>>>
>>> >>>>>>>>>> Hi,
>>> >>>>>>>>>>
>>> >>>>>>>>>> As usual, it would be nice to have the PDF, so that we could
>>> run
>>> >>>>>>>>>> the
>>> >>>>>>>>>> profiler.
>>> >>>>>>>>>>
>>> >>>>>>>>>> The HashSet is used to avoid decrypting objects twice.
>>> >>>>>>>>>>
>>> >>>>>>>>>> The "not encrypted" file is likely encrypted with an empty
>>> user
>>> >>>>>>>>>> password.
>>> >>>>>>>>>>
>>> >>>>>>>>>> It would also be interesting to hear what parameter is passed
>>> to
>>> >>>>>>>>>> MemoryUsageSetting when load() is called.
>>> >>>>>>>>>>
>>> >>>>>>>>>> Tilman
>>> >>>>>>>>>>
>>> >>>>>>>>>>
>>> >>>>>>>>>>
>>> >>>>>>>>>> On 26.02.2019 at 18:14, Tim Allison wrote:
>>> >>>>>>>>>> > PDFBox Colleagues,
>>> >>>>>>>>>> >    Any ideas?
>>> >>>>>>>>>> >
>>> >>>>>>>>>> > ---------- Forwarded message ---------
>>> >>>>>>>>>> > From: Tim Allison <ta...@apache.org>
>>> >>>>>>>>>> > Date: Tue, Feb 26, 2019 at 12:13 PM
>>> >>>>>>>>>> > Subject: Re: Very slow PDF parsing.
>>> >>>>>>>>>> > To: <us...@tika.apache.org>
>>> >>>>>>>>>> >
>>> >>>>>>>>>> >
>>> >>>>>>>>>> > Sorry...that's an OCR tool.  One thing that can slow down
>>> >>>>>>>>>> processing
>>> >>>>>>>>>> > dramatically is if you have tesseract installed (try typing
>>> >>>>>>>>>> 'tesseract' on
>>> >>>>>>>>>> > your commandline) and if you've turned it on for PDFs.  I
>>> >>>>>>>>>> suspect this
>>> >>>>>>>>>> > isn't your problem, though.
>>> >>>>>>>>>> >
>>> >>>>>>>>>> >
>>> >>>>>>>>>> >
>>> >>>>>>>>>> > On Tue, Feb 26, 2019 at 12:08 PM Slava G <slavago@gmail.com
>>> >
>>> >>>>>>>>>> wrote:
>>> >>>>>>>>>> >
>>> >>>>>>>>>> >> Thanks Tim,
>>> >>>>>>>>>> >> But frankly speaking, it's a shame, but don't know what is
>>> >>>>>>>>>> tessercat is in
>>> >>>>>>>>>> >> this context 🙂
>>> >>>>>>>>>> >>
>>> >>>>>>>>>> >> Thanks
>>> >>>>>>>>>> >>
>>> >>>>>>>>>> >> On Tue, Feb 26, 2019, 19:04 Tim Allison <
>>> tallison@apache.org>
>>> >>>>>>>>>> wrote:
>>> >>>>>>>>>> >>
>>> >>>>>>>>>> >>> Thank you, Slava!
>>> >>>>>>>>>> >>>
>>> >>>>>>>>>> >>> Do you have tesseract installed?
>>> >>>>>>>>>> >>>
>>> >>>>>>>>>> >>> Colleagues on PDFBox, any recommendations?
>>> >>>>>>>>>> >>>
>>> >>>>>>>>>> >>> On Tue, Feb 26, 2019 at 11:56 AM Slava G <
>>> slavago@gmail.com>
>>> >>>>>>>>>> wrote:
>>> >>>>>>>>>> >>>> Hi,
>>> >>>>>>>>>> >>>>
>>> >>>>>>>>>> >>>> I have large PDF (about 65mb) that contains mainly text
>>> and
>>> >>>>>>>>>> some images.
>>> >>>>>>>>>> >>>>
>>> >>>>>>>>>> >>>> Parsing of such PDF can take about 2 days or even more
>>> (TIKA
>>> >>>>>>>>>> 1.19.1
>>> >>>>>>>>>> >>> running on XEON server with 4 cores CPU and 30GB RAM with
>>> SSD
>>> >>>>>>>>>> disk, running
>>> >>>>>>>>>> >>> CentOS Linux).
>>> >>>>>>>>>> >>>> Please advise if there anything I can do to speedup.Or
>>> maybe
>>> >>>>>>>>>> it's a bug
>>> >>>>>>>>>> >>> in PDFBox ?
>>> >>>>>>>>>> >>>> When I'm printing java stack , I see all the time in this
>>> >>>>>>>>>> stack :
>>> >>>>>>>>>> >>>>
>>> >>>>>>>>>> >>>> at
>>> org.apache.pdfbox.cos.COSString.equals(COSString.java:259)
>>> >>>>>>>>>> >>>>
>>> >>>>>>>>>> >>>> at java.util.HashMap$TreeNode.find(Unknown Source)
>>> >>>>>>>>>> >>>>
>>> >>>>>>>>>> >>>> at java.util.HashMap$TreeNode.find(Unknown Source)
>>> >>>>>>>>>> >>>>
>>> >>>>>>>>>> >>>> at java.util.HashMap$TreeNode.find(Unknown Source)
>>> >>>>>>>>>> >>>>
>>> >>>>>>>>>> >>>> at java.util.HashMap$TreeNode.find(Unknown Source)
>>> >>>>>>>>>> >>>>
>>> >>>>>>>>>> >>>> at java.util.HashMap$TreeNode.find(Unknown Source)
>>> >>>>>>>>>> >>>>
>>> >>>>>>>>>> >>>> at java.util.HashMap$TreeNode.find(Unknown Source)
>>> >>>>>>>>>> >>>>
>>> >>>>>>>>>> >>>> at java.util.HashMap$TreeNode.find(Unknown Source)
>>> >>>>>>>>>> >>>>
>>> >>>>>>>>>> >>>> at java.util.HashMap$TreeNode.find(Unknown Source)
>>> >>>>>>>>>> >>>>
>>> >>>>>>>>>> >>>> at java.util.HashMap$TreeNode.find(Unknown Source)
>>> >>>>>>>>>> >>>>
>>> >>>>>>>>>> >>>> at java.util.HashMap$TreeNode.getTreeNode(Unknown Source)
>>> >>>>>>>>>> >>>>
>>> >>>>>>>>>> >>>> at java.util.HashMap.getNode(Unknown Source)
>>> >>>>>>>>>> >>>>
>>> >>>>>>>>>> >>>> at java.util.HashMap.containsKey(Unknown Source)
>>> >>>>>>>>>> >>>>
>>> >>>>>>>>>> >>>> at java.util.HashSet.contains(Unknown Source)
>>> >>>>>>>>>> >>>>
>>> >>>>>>>>>> >>>> at
>>> >>>>>>>>>> >>>
>>> >>>>>>>>>>
>>> org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:390)
>>> >>>>>>>>>> >>>> at
>>> >>>>>>>>>> >>>
>>> >>>>>>>>>>
>>> org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decryptDictionary(SecurityHandler.java:517)
>>> >>>>>>>>>> >>>> at
>>> >>>>>>>>>> >>>
>>> >>>>>>>>>>
>>> org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:404)
>>> >>>>>>>>>> >>>> at
>>> >>>>>>>>>> >>>
>>> >>>>>>>>>>
>>> org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decryptArray(SecurityHandler.java:577)
>>> >>>>>>>>>> >>>> at
>>> >>>>>>>>>> >>>
>>> >>>>>>>>>>
>>> org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:408)
>>> >>>>>>>>>> >>>> at
>>> >>>>>>>>>> >>>
>>> >>>>>>>>>>
>>> org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decryptDictionary(SecurityHandler.java:517)
>>> >>>>>>>>>> >>>> at
>>> >>>>>>>>>> >>>
>>> >>>>>>>>>>
>>> org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:404)
>>> >>>>>>>>>> >>>> at
>>> >>>>>>>>>> >>>
>>> >>>>>>>>>>
>>> org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decryptDictionary(SecurityHandler.java:517)
>>> >>>>>>>>>> >>>> at
>>> >>>>>>>>>> >>>
>>> >>>>>>>>>>
>>> org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:404)
>>> >>>>>>>>>> >>>> at
>>> >>>>>>>>>> >>>
>>> >>>>>>>>>>
>>> org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decryptArray(SecurityHandler.java:577)
>>> >>>>>>>>>> >>>> at
>>> >>>>>>>>>> >>>
>>> >>>>>>>>>>
>>> org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:408)
>>> >>>>>>>>>> >>>> at
>>> >>>>>>>>>> >>>
>>> >>>>>>>>>>
>>> org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decryptDictionary(SecurityHandler.java:517)
>>> >>>>>>>>>> >>>> at
>>> >>>>>>>>>> >>>
>>> >>>>>>>>>>
>>> org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:404)
>>> >>>>>>>>>> >>>> at
>>> >>>>>>>>>> >>>
>>> >>>>>>>>>>
>>> org.apache.pdfbox.pdfparser.COSParser.parseFileObject(COSParser.java:946)
>>> >>>>>>>>>> >>>> at
>>> >>>>>>>>>> >>>
>>> >>>>>>>>>>
>>> org.apache.pdfbox.pdfparser.COSParser.parseObjectDynamically(COSParser.java:874)
>>> >>>>>>>>>> >>>> at
>>> >>>>>>>>>> >>>
>>> >>>>>>>>>>
>>> org.apache.pdfbox.pdfparser.COSParser.parseObjectDynamically(COSParser.java:794)
>>> >>>>>>>>>> >>>> at
>>> >>>>>>>>>> >>>
>>> >>>>>>>>>>
>>> org.apache.pdfbox.pdfparser.COSParser.parseDictObjects(COSParser.java:754)
>>> >>>>>>>>>> >>>> at
>>> >>>>>>>>>> >>>
>>> >>>>>>>>>>
>>> org.apache.pdfbox.pdfparser.PDFParser.initialParse(PDFParser.java:185)
>>> >>>>>>>>>> >>>> at
>>> >>>>>>>>>>
>>> org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:220)
>>> >>>>>>>>>> >>>>
>>> >>>>>>>>>> >>>> at
>>> >>>>>>>>>>
>>> org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1028)
>>> >>>>>>>>>> >>>>
>>> >>>>>>>>>> >>>> at
>>> >>>>>>>>>> org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:984)
>>> >>>>>>>>>> >>>>
>>> >>>>>>>>>> >>>> at
>>> >>>>>>>>>> org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:152)
>>> >>>>>>>>>> >>>>
>>> >>>>>>>>>> >>>>
>>> >>>>>>>>>> >>>> P.S. Btw, the PDF is not encrypted at all.
>>> >>>>>>>>>> >>>>
>>> >>>>>>>>>> >>>> Thanks
>>> >>>>>>>>>>
>>> >>>>>>>>>>
>>> >>>>>>>>>>
>>> >>>>>>>>>>
>>> >>>>>>>>>>
>>> ---------------------------------------------------------------------
>>> >>>>>>>>>> To unsubscribe, e-mail: users-unsubscribe@pdfbox.apache.org
>>> >>>>>>>>>> For additional commands, e-mail: users-help@pdfbox.apache.org
>>> >>>>>>>>>>
>>> >>>>>>>>>>
>>>
>>
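
Note on the stack trace quoted above: the repeated HashMap$TreeNode.find
frames ending in COSString.equals are what java.util.HashMap produces when a
lookup lands in a bucket holding many keys with the same hash code. If those
keys are not Comparable, the lookup has to call equals() on essentially every
key in that bucket. The thread does not confirm that this is the root cause
for this particular PDF, so the following is only a minimal sketch of the
general effect. It uses no PDFBox classes, and the names HashCollisionDemo,
Key and countEqualsCalls are hypothetical, chosen for illustration.

```java
import java.util.HashSet;
import java.util.Set;

class HashCollisionDemo {
    // Counts how often Key.equals() is invoked (single-threaded demo only).
    static long equalsCalls = 0;

    // Hypothetical key type: not Comparable, optionally with a constant hash.
    static final class Key {
        final int id;
        final boolean collide;

        Key(int id, boolean collide) {
            this.id = id;
            this.collide = collide;
        }

        @Override
        public int hashCode() {
            // With collide == true every key lands in the same bucket.
            return collide ? 42 : Integer.hashCode(id);
        }

        @Override
        public boolean equals(Object o) {
            equalsCalls++;
            return o instanceof Key && ((Key) o).id == id;
        }
    }

    // Fills a HashSet with n keys, then counts the equals() calls that a
    // single contains() lookup for an absent key needs.
    static long countEqualsCalls(int n, boolean collide) {
        Set<Key> set = new HashSet<>();
        for (int i = 0; i < n; i++) {
            set.add(new Key(i, collide));
        }
        equalsCalls = 0; // ignore the calls made while inserting
        set.contains(new Key(-1, collide));
        return equalsCalls;
    }

    public static void main(String[] args) {
        int n = 5000;
        System.out.println("colliding hashes:   "
                + countEqualsCalls(n, true) + " equals() calls per lookup");
        System.out.println("distributed hashes: "
                + countEqualsCalls(n, false) + " equals() calls per lookup");
    }
}
```

With colliding hashes, each lookup costs on the order of n equals() calls, so
a parser that performs one lookup per object does quadratic work overall. If
equals() itself compares long byte arrays, the cost grows further.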

Re: Fwd: Very slow PDF parsing.

Posted by Konstantin Gribov <gr...@gmail.com>.

Follow-up: it seems to be fixed, so this is no longer relevant for me. Sorry
for the noise on the lists.

-- 
Best regards,
Konstantin Gribov.


On Thu, Mar 21, 2019 at 7:56 PM Konstantin Gribov <gr...@gmail.com> wrote:

> Slava,
>
> Could you please forward this pdf to private@tika.apache.org (Tika PMC
> only private list)? I had similar issues with some pdf but were unable to
> get them from client to look into it with profiler.
>
> --
> Best regards,
> Konstantin Gribov.
>
>
> On Thu, Feb 28, 2019 at 7:27 PM Slava G <sl...@gmail.com> wrote:
>
>> Tim, to what email to send you the PDF ?
>> Thanks
>>
>> On Thu, Feb 28, 2019 at 3:57 PM Slava G <sl...@gmail.com> wrote:
>>
>>> I'll once I'll get customer's approval.
>>> Meanwhile I can do any checks, if you can specify what to check.
>>> Thanks
>>>
>>> On Thu, Feb 28, 2019 at 3:56 PM Tim Allison <ta...@apache.org> wrote:
>>>
>>>> Any chance you can share the file directly w me or someone else on the
>>>> PDFBox team?
>>>>
>>>> On Wed, Feb 27, 2019 at 11:24 AM Slava G <sl...@gmail.com> wrote:
>>>>
>>>> > After 3h 40m it's still parsing using PDFBox 2.0.14 app...
>>>> > Thanks
>>>> >
>>>> > On Wed, Feb 27, 2019 at 3:29 PM Slava G <sl...@gmail.com> wrote:
>>>> >
>>>> >> With 2.0.14 it's 40 minutes running, no result, still working...
>>>> >> Seems that issue is still there.
>>>> >> Thanks
>>>> >>
>>>> >> On Wed, Feb 27, 2019 at 2:52 PM Slava G <sl...@gmail.com> wrote:
>>>> >>
>>>> >>> Checking with 2.0.14. Started as an app. Will update soon.
>>>> >>>
>>>> >>> On Wed, Feb 27, 2019 at 2:47 PM Tim Allison <ta...@apache.org>
>>>> wrote:
>>>> >>>
>>>> >>>> Any chance you could try with the 2.0.14 release
>>>> candidate...unless you
>>>> >>>> have already?
>>>> >>>>
>>>> >>>> https://dist.apache.org/repos/dist/dev/pdfbox/2.0.14/
>>>> >>>>
>>>> >>>>
>>>> >>>> On Wed, Feb 27, 2019 at 3:04 AM Slava G <sl...@gmail.com> wrote:
>>>> >>>>
>>>> >>>>> Well, I ran (as was suggested) PDFBox app to extract text , so
>>>> far 2
>>>> >>>>> hours and still counting...
>>>> >>>>> It's seems to be a PDFBox issue.
>>>> >>>>>
>>>> >>>>> On Wed, Feb 27, 2019 at 9:51 AM JB Data31 <jb...@gmail.com>
>>>> wrote:
>>>> >>>>>
>>>> >>>>>> Why don't you do a basic test with tika server in a 3thrd and a
>>>> >>>>>> *wget* or *curl* bash client to parse your 65Mo PDF.
>>>> >>>>>> It can be easier to investigate the problem.
>>>> >>>>>>
>>>> >>>>>> @*JB*Δ <http://jbigdata.fr/jbigdata/index.html>
>>>> >>>>>>
>>>> >>>>>>
>>>> >>>>>>
>>>> >>>>>> On Tue, Feb 26, 2019 at 11:05 PM Cristian Vat <cristian.vat@gmail.com> wrote:
>>>> >>>>>>
>>>> >>>>>>> Just looking at the stack trace it won't be the same anymore
>>>> due to
>>>> >>>>>>> PDFBOX-4453
>>>> >>>>>>> Some changes present in not yet released pdfbox 2.0.14 and it
>>>> >>>>>>> changes how decryption is handled. Not sure if related though.
>>>> >>>>>>>
>>>> >>>>>>> Can you duplicate the problem without Tika using just PDFBox
>>>> >>>>>>> command-line ExtractText command (
>>>> >>>>>>> https://pdfbox.apache.org/2.0/commandline.html ) on that file?
>>>> >>>>>>>
>>>> >>>>>>>
>>>> >>>>>>> On Tue, Feb 26, 2019 at 8:24 PM Slava G <sl...@gmail.com>
>>>> wrote:
>>>> >>>>>>>
>>>> >>>>>>>> This is the code :
>>>> >>>>>>>> InputStream in = TikaInputStream.get(inputFile.toPath());
>>>> >>>>>>>> PDFParser tmpPdf = new PDFParser();
>>>> >>>>>>>> PDFParserConfig config = tmpPdf.getPDFParserConfig();
>>>> >>>>>>>> config.setMaxMainMemoryBytes(31457280);
>>>> >>>>>>>> config.setExtractAcroFormContent(false);
>>>> >>>>>>>> config.setExtractBookmarksText(false);
>>>> >>>>>>>> config.setCatchIntermediateIOExceptions(true);
>>>> >>>>>>>> Metadata metadata = new Metadata();
>>>> >>>>>>>> metadata.set(HttpHelper.CONTENT_TYPE, "application/pdf");
>>>> >>>>>>>> tmpPdf.parse(inputStream, textHandler, this.metadata, new
>>>> >>>>>>>> ParseContext());
>>>> >>>>>>>>
>>>> >>>>>>>>
>>>> >>>>>>>> On Tue, Feb 26, 2019 at 8:02 PM Tim Allison <
>>>> tallison@apache.org>
>>>> >>>>>>>> wrote:
>>>> >>>>>>>>
>>>> >>>>>>>>>
>>>> >>>>>>>>> This is the default in Tika, where the default for
>>>> >>>>>>>>> maxMainMemoryBytes=500MB.
>>>> >>>>>>>>>
>>>> >>>>>>>>> Slava, how are you calling this in Tika?  With a
>>>> TikaInputStream
>>>> >>>>>>>>> via tika-app or tika-server or something else?
>>>> >>>>>>>>>
>>>> >>>>>>>>> MemoryUsageSetting memoryUsageSetting =
>>>> >>>>>>>>> MemoryUsageSetting.setupMainMemoryOnly();
>>>> >>>>>>>>> if (localConfig.getMaxMainMemoryBytes() >= 0) {
>>>> >>>>>>>>> memoryUsageSetting =
>>>> >>>>>>>>>
>>>> MemoryUsageSetting.setupMixed(localConfig.getMaxMainMemoryBytes());
>>>> >>>>>>>>> }
>>>> >>>>>>>>> if (tstream != null && tstream.hasFile()) {
>>>> >>>>>>>>> // File based -- send file directly to PDFBox
>>>> >>>>>>>>> pdfDocument = PDDocument.load(tstream.getPath().toFile(),
>>>> >>>>>>>>> password, memoryUsageSetting);
>>>> >>>>>>>>> } else {
>>>> >>>>>>>>> pdfDocument = PDDocument.load(new
>>>> CloseShieldInputStream(stream),
>>>> >>>>>>>>> password, memoryUsageSetting);
>>>> >>>>>>>>> }
>>>> >>>>>>>>>
>>>> >>>>>>>>> On Tue, Feb 26, 2019 at 12:43 PM Tilman Hausherr <
>>>> >>>>>>>>> THausherr@t-online.de> wrote:
>>>> >>>>>>>>>
>>>> >>>>>>>>>> Hi,
>>>> >>>>>>>>>>
>>>> >>>>>>>>>> As usual, it would be nice to have the PDF, so that we could
>>>> run
>>>> >>>>>>>>>> the
>>>> >>>>>>>>>> profiler.
>>>> >>>>>>>>>>
>>>> >>>>>>>>>> The HashSet is used to avoid decrypting objects twice.
>>>> >>>>>>>>>>
>>>> >>>>>>>>>> The "not encrypted" file is likely encrypted with an empty
>>>> user
>>>> >>>>>>>>>> password.
>>>> >>>>>>>>>>
>>>> >>>>>>>>>> It would also be interesting to hear what parameter is
>>>> passed to
>>>> >>>>>>>>>> MemoryUsageSetting when load() is called.
>>>> >>>>>>>>>>
>>>> >>>>>>>>>> Tilman
>>>> >>>>>>>>>>
>>>> >>>>>>>>>>
>>>> >>>>>>>>>>
>>>> >>>>>>>>>> On 26.02.2019 at 18:14, Tim Allison wrote:
>>>> >>>>>>>>>> > PDFBox Colleagues,
>>>> >>>>>>>>>> >    Any ideas?
>>>> >>>>>>>>>> >
>>>> >>>>>>>>>> > ---------- Forwarded message ---------
>>>> >>>>>>>>>> > From: Tim Allison <ta...@apache.org>
>>>> >>>>>>>>>> > Date: Tue, Feb 26, 2019 at 12:13 PM
>>>> >>>>>>>>>> > Subject: Re: Very slow PDF parsing.
>>>> >>>>>>>>>> > To: <us...@tika.apache.org>
>>>> >>>>>>>>>> >
>>>> >>>>>>>>>> >
>>>> >>>>>>>>>> > Sorry...that's an OCR tool.  One thing that can slow down
>>>> >>>>>>>>>> processing
>>>> >>>>>>>>>> > dramatically is if you have tesseract installed (try typing
>>>> >>>>>>>>>> 'tesseract' on
>>>> >>>>>>>>>> > your commandline) and if you've turned it on for PDFs.  I
>>>> >>>>>>>>>> suspect this
>>>> >>>>>>>>>> > isn't your problem, though.
>>>> >>>>>>>>>> >
>>>> >>>>>>>>>> >
>>>> >>>>>>>>>> >
>>>> >>>>>>>>>> > On Tue, Feb 26, 2019 at 12:08 PM Slava G <
>>>> slavago@gmail.com>
>>>> >>>>>>>>>> wrote:
>>>> >>>>>>>>>> >
>>>> >>>>>>>>>> >> Thanks Tim,
>>>> >>>>>>>>>> >> But frankly speaking, it's a shame, but don't know what is
>>>> >>>>>>>>>> tessercat is in
>>>> >>>>>>>>>> >> this context 🙂
>>>> >>>>>>>>>> >>
>>>> >>>>>>>>>> >> Thanks
>>>> >>>>>>>>>> >>
>>>> >>>>>>>>>> >> On Tue, Feb 26, 2019, 19:04 Tim Allison <
>>>> tallison@apache.org>
>>>> >>>>>>>>>> wrote:
>>>> >>>>>>>>>> >>
>>>> >>>>>>>>>> >>> Thank you, Slava!
>>>> >>>>>>>>>> >>>
>>>> >>>>>>>>>> >>> Do you have tesseract installed?
>>>> >>>>>>>>>> >>>
>>>> >>>>>>>>>> >>> Colleagues on PDFBox, any recommendations?
>>>> >>>>>>>>>> >>>
>>>> >>>>>>>>>> >>> On Tue, Feb 26, 2019 at 11:56 AM Slava G <
>>>> slavago@gmail.com>
>>>> >>>>>>>>>> wrote:
>>>> >>>>>>>>>> >>>> Hi,
>>>> >>>>>>>>>> >>>>
>>>> >>>>>>>>>> >>>> I have large PDF (about 65mb) that contains mainly text
>>>> and
>>>> >>>>>>>>>> some images.
>>>> >>>>>>>>>> >>>>
>>>> >>>>>>>>>> >>>> Parsing of such PDF can take about 2 days or even more
>>>> (TIKA
>>>> >>>>>>>>>> 1.19.1
>>>> >>>>>>>>>> >>> running on XEON server with 4 cores CPU and 30GB RAM
>>>> with SSD
>>>> >>>>>>>>>> disk, running
>>>> >>>>>>>>>> >>> CentOS Linux).
>>>> >>>>>>>>>> >>>> Please advise if there anything I can do to speedup.Or
>>>> maybe
>>>> >>>>>>>>>> it's a bug
>>>> >>>>>>>>>> >>> in PDFBox ?
>>>> >>>>>>>>>> >>>> When I'm printing java stack , I see all the time in
>>>> this
>>>> >>>>>>>>>> stack :
>>>> >>>>>>>>>> >>>>
>>>> >>>>>>>>>> >>>> at
>>>> org.apache.pdfbox.cos.COSString.equals(COSString.java:259)
>>>> >>>>>>>>>> >>>>
>>>> >>>>>>>>>> >>>> at java.util.HashMap$TreeNode.find(Unknown Source)
>>>> >>>>>>>>>> >>>>
>>>> >>>>>>>>>> >>>> at java.util.HashMap$TreeNode.find(Unknown Source)
>>>> >>>>>>>>>> >>>>
>>>> >>>>>>>>>> >>>> at java.util.HashMap$TreeNode.find(Unknown Source)
>>>> >>>>>>>>>> >>>>
>>>> >>>>>>>>>> >>>> at java.util.HashMap$TreeNode.find(Unknown Source)
>>>> >>>>>>>>>> >>>>
>>>> >>>>>>>>>> >>>> at java.util.HashMap$TreeNode.find(Unknown Source)
>>>> >>>>>>>>>> >>>>
>>>> >>>>>>>>>> >>>> at java.util.HashMap$TreeNode.find(Unknown Source)
>>>> >>>>>>>>>> >>>>
>>>> >>>>>>>>>> >>>> at java.util.HashMap$TreeNode.find(Unknown Source)
>>>> >>>>>>>>>> >>>>
>>>> >>>>>>>>>> >>>> at java.util.HashMap$TreeNode.find(Unknown Source)
>>>> >>>>>>>>>> >>>>
>>>> >>>>>>>>>> >>>> at java.util.HashMap$TreeNode.find(Unknown Source)
>>>> >>>>>>>>>> >>>>
>>>> >>>>>>>>>> >>>> at java.util.HashMap$TreeNode.getTreeNode(Unknown
>>>> Source)
>>>> >>>>>>>>>> >>>>
>>>> >>>>>>>>>> >>>> at java.util.HashMap.getNode(Unknown Source)
>>>> >>>>>>>>>> >>>>
>>>> >>>>>>>>>> >>>> at java.util.HashMap.containsKey(Unknown Source)
>>>> >>>>>>>>>> >>>>
>>>> >>>>>>>>>> >>>> at java.util.HashSet.contains(Unknown Source)
>>>> >>>>>>>>>> >>>>
>>>> >>>>>>>>>> >>>> at
>>>> >>>>>>>>>> >>>
>>>> >>>>>>>>>>
>>>> org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:390)
>>>> >>>>>>>>>> >>>> at
>>>> >>>>>>>>>> >>>
>>>> >>>>>>>>>>
>>>> org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decryptDictionary(SecurityHandler.java:517)
>>>> >>>>>>>>>> >>>> at
>>>> >>>>>>>>>> >>>
>>>> >>>>>>>>>>
>>>> org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:404)
>>>> >>>>>>>>>> >>>> at
>>>> >>>>>>>>>> >>>
>>>> >>>>>>>>>>
>>>> org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decryptArray(SecurityHandler.java:577)
>>>> >>>>>>>>>> >>>> at
>>>> >>>>>>>>>> >>>
>>>> >>>>>>>>>>
>>>> org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:408)
>>>> >>>>>>>>>> >>>> at
>>>> >>>>>>>>>> >>>
>>>> >>>>>>>>>>
>>>> org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decryptDictionary(SecurityHandler.java:517)
>>>> >>>>>>>>>> >>>> at
>>>> >>>>>>>>>> >>>
>>>> >>>>>>>>>>
>>>> org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:404)
>>>> >>>>>>>>>> >>>> at
>>>> >>>>>>>>>> >>>
>>>> >>>>>>>>>>
>>>> org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decryptDictionary(SecurityHandler.java:517)
>>>> >>>>>>>>>> >>>> at
>>>> >>>>>>>>>> >>>
>>>> >>>>>>>>>>
>>>> org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:404)
>>>> >>>>>>>>>> >>>> at
>>>> >>>>>>>>>> >>>
>>>> >>>>>>>>>>
>>>> org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decryptArray(SecurityHandler.java:577)
>>>> >>>>>>>>>> >>>> at
>>>> >>>>>>>>>> >>>
>>>> >>>>>>>>>>
>>>> org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:408)
>>>> >>>>>>>>>> >>>> at
>>>> >>>>>>>>>> >>>
>>>> >>>>>>>>>>
>>>> org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decryptDictionary(SecurityHandler.java:517)
>>>> >>>>>>>>>> >>>> at
>>>> >>>>>>>>>> >>>
>>>> >>>>>>>>>>
>>>> org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:404)
>>>> >>>>>>>>>> >>>> at
>>>> >>>>>>>>>> >>>
>>>> >>>>>>>>>>
>>>> org.apache.pdfbox.pdfparser.COSParser.parseFileObject(COSParser.java:946)
>>>> >>>>>>>>>> >>>> at
>>>> >>>>>>>>>> >>>
>>>> >>>>>>>>>>
>>>> org.apache.pdfbox.pdfparser.COSParser.parseObjectDynamically(COSParser.java:874)
>>>> >>>>>>>>>> >>>> at
>>>> >>>>>>>>>> >>>
>>>> >>>>>>>>>>
>>>> org.apache.pdfbox.pdfparser.COSParser.parseObjectDynamically(COSParser.java:794)
>>>> >>>>>>>>>> >>>> at
>>>> >>>>>>>>>> >>>
>>>> >>>>>>>>>>
>>>> org.apache.pdfbox.pdfparser.COSParser.parseDictObjects(COSParser.java:754)
>>>> >>>>>>>>>> >>>> at
>>>> >>>>>>>>>> >>>
>>>> >>>>>>>>>>
>>>> org.apache.pdfbox.pdfparser.PDFParser.initialParse(PDFParser.java:185)
>>>> >>>>>>>>>> >>>> at
>>>> >>>>>>>>>>
>>>> org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:220)
>>>> >>>>>>>>>> >>>>
>>>> >>>>>>>>>> >>>> at
>>>> >>>>>>>>>>
>>>> org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1028)
>>>> >>>>>>>>>> >>>>
>>>> >>>>>>>>>> >>>> at
>>>> >>>>>>>>>>
>>>> org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:984)
>>>> >>>>>>>>>> >>>>
>>>> >>>>>>>>>> >>>> at
>>>> >>>>>>>>>>
>>>> org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:152)
>>>> >>>>>>>>>> >>>>
>>>> >>>>>>>>>> >>>>
>>>> >>>>>>>>>> >>>> P.S. Btw, the PDF is not encrypted at all.
>>>> >>>>>>>>>> >>>>
>>>> >>>>>>>>>> >>>> Thanks
>>>> >>>>>>>>>>
>>>> >>>>>>>>>>
>>>> >>>>>>>>>>
>>>> >>>>>>>>>>
>>>> >>>>>>>>>>
>>>> ---------------------------------------------------------------------
>>>> >>>>>>>>>> To unsubscribe, e-mail: users-unsubscribe@pdfbox.apache.org
>>>> >>>>>>>>>> For additional commands, e-mail:
>>>> users-help@pdfbox.apache.org
>>>> >>>>>>>>>>
>>>> >>>>>>>>>>
>>>>
>>>
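
Note on diagnosing a parse that runs for hours: when the file cannot be
shared, one option is to run the parse on a worker thread with a time limit
and print that thread's stack once the limit is exceeded. This yields the
same kind of trace as the one posted in this thread without waiting days.
The sketch below uses only the JDK. The names ParseWatchdog and
runWithTimeout are hypothetical, and the Runnable passed in is assumed to
wrap whatever parse call is being investigated.

```java
class ParseWatchdog {

    // Runs the task on a daemon thread. Returns true if it finished within
    // timeoutMillis (which must be > 0); otherwise prints the worker's
    // current stack to stderr, requests interruption, and returns false.
    static boolean runWithTimeout(Runnable task, long timeoutMillis) {
        Thread worker = new Thread(task, "parse-worker");
        worker.setDaemon(true); // do not keep the JVM alive for a stuck task
        worker.start();
        try {
            worker.join(timeoutMillis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the caller's interrupt
        }
        if (!worker.isAlive()) {
            return true;
        }
        System.err.println("Task still running after " + timeoutMillis
                + " ms; current stack:");
        for (StackTraceElement frame : worker.getStackTrace()) {
            System.err.println("\tat " + frame);
        }
        // Only a request: CPU-bound code that never checks the interrupt
        // flag will keep running until the JVM exits.
        worker.interrupt();
        return false;
    }

    public static void main(String[] args) {
        // Stand-in for a slow parse: sleeps far longer than the limit.
        boolean finished = runWithTimeout(() -> {
            try {
                Thread.sleep(60_000);
            } catch (InterruptedException e) {
                // interrupted by the watchdog; just return
            }
        }, 500);
        System.out.println("finished in time: " + finished);
    }
}
```

Calling runWithTimeout several times with different limits gives a crude
sampling profile: if every dump ends in the same frames, that is where the
time is going.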

Re: Fwd: Very slow PDF parsing.

Posted by Konstantin Gribov <gr...@gmail.com>.

Follow up, it seems to be fixed, so not actual for me anymore. Sorry for
this bit of noise in lists)

-- 
Best regards,
Konstantin Gribov.


On Thu, Mar 21, 2019 at 7:56 PM Konstantin Gribov <gr...@gmail.com> wrote:

> Slava,
>
> Could you please forward this pdf to private@tika.apache.org (Tika PMC
> only private list)? I had similar issues with some pdf but were unable to
> get them from client to look into it with profiler.
>
> --
> Best regards,
> Konstantin Gribov.
>
>
> On Thu, Feb 28, 2019 at 7:27 PM Slava G <sl...@gmail.com> wrote:
>
>> Tim, to what email to send you the PDF ?
>> Thanks
>>
>> On Thu, Feb 28, 2019 at 3:57 PM Slava G <sl...@gmail.com> wrote:
>>
>>> I'll once I'll get customer's approval.
>>> Meanwhile I can do any checks, if you can specify what to check.
>>> Thanks
>>>
>>> On Thu, Feb 28, 2019 at 3:56 PM Tim Allison <ta...@apache.org> wrote:
>>>
>>>> Any chance you can share the file directly w me or someone else on the
>>>> PDFBox team?
>>>>
>>>> On Wed, Feb 27, 2019 at 11:24 AM Slava G <sl...@gmail.com> wrote:
>>>>
>>>> > After 3h 40m it's still parsing using PDFBox 2.0.14 app...
>>>> > Thanks
>>>> >
>>>> > On Wed, Feb 27, 2019 at 3:29 PM Slava G <sl...@gmail.com> wrote:
>>>> >
>>>> >> With 2.0.14 it's 40 minutes running, no result, still working...
>>>> >> Seems that issue is still there.
>>>> >> Thanks
>>>> >>
>>>> >> On Wed, Feb 27, 2019 at 2:52 PM Slava G <sl...@gmail.com> wrote:
>>>> >>
>>>> >>> Checking with 2.0.14. Started as an app. Will update soon.
>>>> >>>
>>>> >>> On Wed, Feb 27, 2019 at 2:47 PM Tim Allison <ta...@apache.org>
>>>> wrote:
>>>> >>>
>>>> >>>> Any chance you could try with the 2.0.14 release
>>>> candidate...unless you
>>>> >>>> have already?
>>>> >>>>
>>>> >>>> https://dist.apache.org/repos/dist/dev/pdfbox/2.0.14/
>>>> >>>>
>>>> >>>>
>>>> >>>> On Wed, Feb 27, 2019 at 3:04 AM Slava G <sl...@gmail.com> wrote:
>>>> >>>>
>>>> >>>>> Well, I ran (as was suggested) PDFBox app to extract text , so
>>>> far 2
>>>> >>>>> hours and still counting...
>>>> >>>>> It's seems to be a PDFBox issue.
>>>> >>>>>
>>>> >>>>> On Wed, Feb 27, 2019 at 9:51 AM JB Data31 <jb...@gmail.com>
>>>> wrote:
>>>> >>>>>
>>>> >>>>>> Why don't you do a basic test with tika server in a 3thrd and a
>>>> >>>>>> *wget* or *curl* bash client to parse your 65Mo PDF.
>>>> >>>>>> It can be easier to investigate the problem.
>>>> >>>>>>
>>>> >>>>>> @*JB*Δ <http://jbigdata.fr/jbigdata/index.html>
>>>> >>>>>>
>>>> >>>>>>
>>>> >>>>>>
>>>> >>>>>> Le mar. 26 févr. 2019 à 23:05, Cristian Vat <
>>>> cristian.vat@gmail.com>
>>>> >>>>>> a écrit :
>>>> >>>>>>
>>>> >>>>>>> Just looking at the stack trace it won't be the same anymore
>>>> due to
>>>> >>>>>>> PDFBOX-4453
>>>> >>>>>>> Some changes present in not yet released pdfbox 2.0.14 and it
>>>> >>>>>>> changes how decryption is handled. Not sure if related though.
>>>> >>>>>>>
>>>> >>>>>>> Can you duplicate the problem without Tika using just PDFBox
>>>> >>>>>>> command-line ExtractText command (
>>>> >>>>>>> https://pdfbox.apache.org/2.0/commandline.html ) on that file?
>>>> >>>>>>>
>>>> >>>>>>>
>>>> >>>>>>> On Tue, Feb 26, 2019 at 8:24 PM Slava G <sl...@gmail.com>
>>>> wrote:
>>>> >>>>>>>
>>>> >>>>>>>> This is the code :
>>>> >>>>>>>> InputStream in = TikaInputStream.get(inputFile.toPath());
>>>> >>>>>>>> PDFParser tmpPdf = new PDFParser();
>>>> >>>>>>>> PDFParserConfig config = tmpPdf.getPDFParserConfig();
>>>> >>>>>>>> config.setMaxMainMemoryBytes(31457280);
>>>> >>>>>>>> config.setExtractAcroFormContent(false);
>>>> >>>>>>>> config.setExtractBookmarksText(false);
>>>> >>>>>>>> config.setCatchIntermediateIOExceptions(true);
>>>> >>>>>>>> Metadata metadata = new Metadata();
>>>> >>>>>>>> metadata.set(HttpHelper.CONTENT_TYPE, "application/pdf");
>>>> >>>>>>>> tmpPdf.parse(inputStream, textHandler, this.metadata, new
>>>> >>>>>>>> ParseContext());
>>>> >>>>>>>>
>>>> >>>>>>>>
>>>> >>>>>>>> On Tue, Feb 26, 2019 at 8:02 PM Tim Allison <
>>>> tallison@apache.org>
>>>> >>>>>>>> wrote:
>>>> >>>>>>>>
>>>> >>>>>>>>>
>>>> >>>>>>>>> This is the default in Tika, where the default for
>>>> >>>>>>>>> maxMainMemoryBytes=500MB.
>>>> >>>>>>>>>
>>>> >>>>>>>>> Slava, how are you calling this in Tika?  With a
>>>> TikaInputStream
>>>> >>>>>>>>> via tika-app or tika-server or something else?
>>>> >>>>>>>>>
>>>> >>>>>>>>> MemoryUsageSetting memoryUsageSetting =
>>>> >>>>>>>>> MemoryUsageSetting.setupMainMemoryOnly();
>>>> >>>>>>>>> if (localConfig.getMaxMainMemoryBytes() >= 0) {
>>>> >>>>>>>>> memoryUsageSetting =
>>>> >>>>>>>>>
>>>> MemoryUsageSetting.setupMixed(localConfig.getMaxMainMemoryBytes());
>>>> >>>>>>>>> }
>>>> >>>>>>>>> if (tstream != null && tstream.hasFile()) {
>>>> >>>>>>>>> // File based -- send file directly to PDFBox
>>>> >>>>>>>>> pdfDocument = PDDocument.load(tstream.getPath().toFile(),
>>>> >>>>>>>>> password, memoryUsageSetting);
>>>> >>>>>>>>> } else {
>>>> >>>>>>>>> pdfDocument = PDDocument.load(new
>>>> CloseShieldInputStream(stream),
>>>> >>>>>>>>> password, memoryUsageSetting);
>>>> >>>>>>>>> }
>>>> >>>>>>>>>
>>>> >>>>>>>>> On Tue, Feb 26, 2019 at 12:43 PM Tilman Hausherr <
>>>> >>>>>>>>> THausherr@t-online.de> wrote:
>>>> >>>>>>>>>
>>>> >>>>>>>>>> Hi,
>>>> >>>>>>>>>>
>>>> >>>>>>>>>> As usual, it would be nice to have the PDF, so that we could
>>>> run
>>>> >>>>>>>>>> the
>>>> >>>>>>>>>> profiler.
>>>> >>>>>>>>>>
>>>> >>>>>>>>>> The HashSet is used to avoid decrypting objects twice.
>>>> >>>>>>>>>>
>>>> >>>>>>>>>> The "not encrypted" file is likely encrypted with an empty
>>>> user
>>>> >>>>>>>>>> password.
>>>> >>>>>>>>>>
>>>> >>>>>>>>>> It would also be interesting to hear what parameter is
>>>> passed to
>>>> >>>>>>>>>> MemoryUsageSetting when load() is called.
>>>> >>>>>>>>>>
>>>> >>>>>>>>>> Tilman
>>>> >>>>>>>>>>
>>>> >>>>>>>>>>
>>>> >>>>>>>>>>
>>>> >>>>>>>>>> Am 26.02.2019 um 18:14 schrieb Tim Allison:
>>>> >>>>>>>>>> > PDFBox Colleagues,
>>>> >>>>>>>>>> >    Any ideas?
>>>> >>>>>>>>>> >
>>>> >>>>>>>>>> > ---------- Forwarded message ---------
>>>> >>>>>>>>>> > From: Tim Allison <ta...@apache.org>
>>>> >>>>>>>>>> > Date: Tue, Feb 26, 2019 at 12:13 PM
>>>> >>>>>>>>>> > Subject: Re: Very slow PDF parsing.
>>>> >>>>>>>>>> > To: <us...@tika.apache.org>
>>>> >>>>>>>>>> >
>>>> >>>>>>>>>> >
>>>> >>>>>>>>>> > Sorry...that's an OCR tool.  One thing that can slow down
>>>> >>>>>>>>>> > processing dramatically is if you have tesseract installed
>>>> >>>>>>>>>> > (try typing 'tesseract' on your commandline) and if you've
>>>> >>>>>>>>>> > turned it on for PDFs.  I suspect this isn't your problem,
>>>> >>>>>>>>>> > though.
>>>> >>>>>>>>>> >
>>>> >>>>>>>>>> >
>>>> >>>>>>>>>> >
>>>> >>>>>>>>>> > On Tue, Feb 26, 2019 at 12:08 PM Slava G <slavago@gmail.com> wrote:
>>>> >>>>>>>>>> >
>>>> >>>>>>>>>> >> Thanks Tim,
>>>> >>>>>>>>>> >> But frankly speaking, I'm ashamed to say I don't know what
>>>> >>>>>>>>>> >> tesseract is in this context 🙂
>>>> >>>>>>>>>> >>
>>>> >>>>>>>>>> >> Thanks
>>>> >>>>>>>>>> >>
>>>> >>>>>>>>>> >> On Tue, Feb 26, 2019, 19:04 Tim Allison <tallison@apache.org> wrote:
>>>> >>>>>>>>>> >>
>>>> >>>>>>>>>> >>> Thank you, Slava!
>>>> >>>>>>>>>> >>>
>>>> >>>>>>>>>> >>> Do you have tesseract installed?
>>>> >>>>>>>>>> >>>
>>>> >>>>>>>>>> >>> Colleagues on PDFBox, any recommendations?
>>>> >>>>>>>>>> >>>
>>>> >>>>>>>>>> >>> On Tue, Feb 26, 2019 at 11:56 AM Slava G <slavago@gmail.com> wrote:
>>>> >>>>>>>>>> >>>> Hi,
>>>> >>>>>>>>>> >>>>
>>>> >>>>>>>>>> >>>> I have a large PDF (about 65 MB) that contains mainly text and
>>>> >>>>>>>>>> >>>> some images.
>>>> >>>>>>>>>> >>>>
>>>> >>>>>>>>>> >>>> Parsing such a PDF can take about 2 days or even more (Tika
>>>> >>>>>>>>>> >>>> 1.19.1 running on a Xeon server with a 4-core CPU, 30 GB RAM
>>>> >>>>>>>>>> >>>> and an SSD disk, running CentOS Linux).
>>>> >>>>>>>>>> >>>>
>>>> >>>>>>>>>> >>>> Please advise if there is anything I can do to speed it up. Or
>>>> >>>>>>>>>> >>>> maybe it's a bug in PDFBox?
>>>> >>>>>>>>>> >>>>
>>>> >>>>>>>>>> >>>> When I print the Java stack, I see this stack all the time:
>>>> >>>>>>>>>> >>>>
>>>> >>>>>>>>>> >>>> at org.apache.pdfbox.cos.COSString.equals(COSString.java:259)
>>>> >>>>>>>>>> >>>> at java.util.HashMap$TreeNode.find(Unknown Source)
>>>> >>>>>>>>>> >>>> at java.util.HashMap$TreeNode.find(Unknown Source)
>>>> >>>>>>>>>> >>>> at java.util.HashMap$TreeNode.find(Unknown Source)
>>>> >>>>>>>>>> >>>> at java.util.HashMap$TreeNode.find(Unknown Source)
>>>> >>>>>>>>>> >>>> at java.util.HashMap$TreeNode.find(Unknown Source)
>>>> >>>>>>>>>> >>>> at java.util.HashMap$TreeNode.find(Unknown Source)
>>>> >>>>>>>>>> >>>> at java.util.HashMap$TreeNode.find(Unknown Source)
>>>> >>>>>>>>>> >>>> at java.util.HashMap$TreeNode.find(Unknown Source)
>>>> >>>>>>>>>> >>>> at java.util.HashMap$TreeNode.find(Unknown Source)
>>>> >>>>>>>>>> >>>> at java.util.HashMap$TreeNode.getTreeNode(Unknown Source)
>>>> >>>>>>>>>> >>>> at java.util.HashMap.getNode(Unknown Source)
>>>> >>>>>>>>>> >>>> at java.util.HashMap.containsKey(Unknown Source)
>>>> >>>>>>>>>> >>>> at java.util.HashSet.contains(Unknown Source)
>>>> >>>>>>>>>> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:390)
>>>> >>>>>>>>>> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decryptDictionary(SecurityHandler.java:517)
>>>> >>>>>>>>>> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:404)
>>>> >>>>>>>>>> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decryptArray(SecurityHandler.java:577)
>>>> >>>>>>>>>> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:408)
>>>> >>>>>>>>>> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decryptDictionary(SecurityHandler.java:517)
>>>> >>>>>>>>>> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:404)
>>>> >>>>>>>>>> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decryptDictionary(SecurityHandler.java:517)
>>>> >>>>>>>>>> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:404)
>>>> >>>>>>>>>> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decryptArray(SecurityHandler.java:577)
>>>> >>>>>>>>>> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:408)
>>>> >>>>>>>>>> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decryptDictionary(SecurityHandler.java:517)
>>>> >>>>>>>>>> >>>> at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:404)
>>>> >>>>>>>>>> >>>> at org.apache.pdfbox.pdfparser.COSParser.parseFileObject(COSParser.java:946)
>>>> >>>>>>>>>> >>>> at org.apache.pdfbox.pdfparser.COSParser.parseObjectDynamically(COSParser.java:874)
>>>> >>>>>>>>>> >>>> at org.apache.pdfbox.pdfparser.COSParser.parseObjectDynamically(COSParser.java:794)
>>>> >>>>>>>>>> >>>> at org.apache.pdfbox.pdfparser.COSParser.parseDictObjects(COSParser.java:754)
>>>> >>>>>>>>>> >>>> at org.apache.pdfbox.pdfparser.PDFParser.initialParse(PDFParser.java:185)
>>>> >>>>>>>>>> >>>> at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:220)
>>>> >>>>>>>>>> >>>> at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1028)
>>>> >>>>>>>>>> >>>> at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:984)
>>>> >>>>>>>>>> >>>> at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:152)
>>>> >>>>>>>>>> >>>>
>>>> >>>>>>>>>> >>>>
>>>> >>>>>>>>>> >>>> P.S. Btw, the PDF is not encrypted at all.
>>>> >>>>>>>>>> >>>>
>>>> >>>>>>>>>> >>>> Thanks
>>>> >>>>>>>>>>
>>>> >>>>>>>>>>
>>>> >>>>>>>>>>
>>>> >>>>>>>>>>
>>>> >>>>>>>>>>
>>>> ---------------------------------------------------------------------
>>>> >>>>>>>>>> To unsubscribe, e-mail: users-unsubscribe@pdfbox.apache.org
>>>> >>>>>>>>>> For additional commands, e-mail:
>>>> users-help@pdfbox.apache.org
>>>> >>>>>>>>>>
>>>> >>>>>>>>>>
>>>>
>>>
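
For readers following the quoted stack trace: the hot path is HashSet.contains() ending in COSString.equals(), and Tilman notes the HashSet exists only to avoid decrypting the same object twice. The sketch below is a rough illustration of how an equals()-based "visited" set differs from an identity-based one. It is not PDFBox code and not a claim about the actual cause or fix. The Token class and countFirstVisits() helper are hypothetical names invented for this example; only java.util classes are used.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.IdentityHashMap;
import java.util.Set;

public class VisitedSetSketch {

    /** Hypothetical stand-in for a parsed string object; NOT PDFBox's COSString. */
    static final class Token {
        final byte[] bytes;

        Token(byte[] bytes) {
            this.bytes = bytes;
        }

        // Content-based equality: every lookup may compare byte arrays.
        @Override
        public boolean equals(Object o) {
            return o instanceof Token && Arrays.equals(bytes, ((Token) o).bytes);
        }

        @Override
        public int hashCode() {
            return Arrays.hashCode(bytes);
        }
    }

    /** Counts how many of the given objects the set sees for the first time. */
    static int countFirstVisits(Set<Object> visited, Object... objects) {
        int firstVisits = 0;
        for (Object o : objects) {
            if (visited.add(o)) {
                firstVisits++;
            }
        }
        return firstVisits;
    }

    public static void main(String[] args) {
        Token a = new Token(new byte[] {1, 2, 3});
        Token b = new Token(new byte[] {1, 2, 3}); // equal content, different instance

        // equals()-based set: a and b collapse into one entry.
        int byEquals = countFirstVisits(new HashSet<>(), a, b, a);

        // Identity-based set: membership is decided by reference (==),
        // so equals() and hashCode() of Token are never consulted.
        int byIdentity = countFirstVisits(
                Collections.newSetFromMap(new IdentityHashMap<>()), a, b, a);

        System.out.println("equals-based set:   " + byEquals);
        System.out.println("identity-based set: " + byIdentity);
    }
}
```

With an equals()-based set, two distinct instances with the same content count as one object, and each lookup can end up in content comparisons. An identity-based set tracks each instance separately and never calls equals(). Whether that trade-off suits SecurityHandler is something only profiling the actual PDF can settle.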