Posted to dev@pdfbox.apache.org by Tilman Hausherr <TH...@t-online.de> on 2016/09/14 16:52:13 UTC

Re: PDFBox 2.0.3 TIKA comparison

On 14.09.2016 at 18:38, Allison, Timothy B. wrote:
> https://github.com/tballison/share/blob/master/tika_comparisons/reports_tika_20160904_dev.zip
>
> This run was against the full corpus, not just PDFs.  I used a fairly recent nightly build of PDFBox and POI's 3.15-rc1.
>
> The one apparent major new exception for PDF files was apparently fixed before 2.0.3.  So, please ignore that one!
>
> There are some regressions in content extraction, but overall, content extraction looks to have improved quite a bit.  Looks like ~2 million more "common English words" via Tilman's methodology.
>
> Let me know if you have any questions.

I wonder what happened here:
commoncrawl2/SH/SHMSOEBK4QOJO5CY7BIWWDH6GHSTOXYM

metadata went from 6766 to 4134.

Is this a TIKA thing, or is this because of a change from xmpbox to jempbox?

Tilman



---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@pdfbox.apache.org
For additional commands, e-mail: dev-help@pdfbox.apache.org

RE: PDFBox 2.0.3 TIKA comparison

Posted by "Allison, Timothy B." <ta...@mitre.org>.

Perfect.  Thank you!

-----Original Message-----
From: Andreas Lehmkuehler [mailto:andreas@lehmi.de] 
Sent: Thursday, September 15, 2016 8:31 AM
To: dev@pdfbox.apache.org
Subject: Re: PDFBox 2.0.3 TIKA comparison

On 15.09.2016 at 13:52, Allison, Timothy B. wrote:
>> The one apparent major new exception for PDF files was apparently fixed before 2.0.3.  So, please ignore that one!
>
> Wait... if possible, please confirm that you did fix this recently (within the last week or two).  I ran pdfbox-app (2.0.3) on a handful of triggering files and didn't get the exception. However, it is possible that multithreading might trigger this exception.


I fixed that 2 days ago; it's part of the RC.

BR
Andreas

Re: PDFBox 2.0.3 TIKA comparison

Posted by Andreas Lehmkuehler <an...@lehmi.de>.

On 15.09.2016 at 13:52, Allison, Timothy B. wrote:
>> The one apparent major new exception for PDF files was apparently fixed before 2.0.3.  So, please ignore that one!
>
> Wait... if possible, please confirm that you did fix this recently (within the last week or two).  I ran pdfbox-app (2.0.3) on a handful of triggering files and didn't get the exception. However, it is possible that multithreading might trigger this exception.


I fixed that 2 days ago; it's part of the RC.

BR
Andreas

RE: PDFBox 2.0.3 TIKA comparison

Posted by "Allison, Timothy B." <ta...@mitre.org>.

If this doesn't look like something you've recently fixed, I can rerun with the actual 2.0.3-rc1 (only on pdfs!) and see if I'm still getting this exception.

-----Original Message-----
From: Allison, Timothy B. [mailto:tallison@mitre.org] 
Sent: Thursday, September 15, 2016 7:53 AM
To: dev@pdfbox.apache.org
Subject: RE: PDFBox 2.0.3 TIKA comparison
Importance: High

> The one apparent major new exception for PDF files was apparently fixed before 2.0.3.  So, please ignore that one!

Wait... if possible, please confirm that you did fix this recently (within the last week or two).  I ran pdfbox-app (2.0.3) on a handful of triggering files and didn't get the exception. However, it is possible that multithreading might trigger this exception.


RE: PDFBox 2.0.3 TIKA comparison

Posted by "Allison, Timothy B." <ta...@mitre.org>.

> The one apparent major new exception for PDF files was apparently fixed before 2.0.3.  So, please ignore that one!

Wait... if possible, please confirm that you did fix this recently (within the last week or two).  I ran pdfbox-app (2.0.3) on a handful of triggering files and didn't get the exception. However, it is possible that multithreading might trigger this exception.

java.lang.NullPointerException
	at org.apache.pdfbox.pdmodel.font.encoding.Encoding.overwrite(Encoding.java:118)
	at org.apache.pdfbox.pdmodel.font.encoding.DictionaryEncoding.applyDifferences(DictionaryEncoding.java:151)
	at org.apache.pdfbox.pdmodel.font.encoding.DictionaryEncoding.<init>(DictionaryEncoding.java:128)
	at org.apache.pdfbox.pdmodel.font.PDSimpleFont.readEncoding(PDSimpleFont.java:129)
	at org.apache.pdfbox.pdmodel.font.PDTrueTypeFont.<init>(PDTrueTypeFont.java:209)
	at org.apache.pdfbox.pdmodel.font.PDFontFactory.createFont(PDFontFactory.java:75)
	at org.apache.pdfbox.pdmodel.PDResources.getFont(PDResources.java:143)
	at org.apache.pdfbox.contentstream.operator.text.SetFontAndSize.process(SetFontAndSize.java:60)
	at org.apache.pdfbox.contentstream.PDFStreamEngine.processOperator(PDFStreamEngine.java:815)
	at org.apache.pdfbox.contentstream.PDFStreamEngine.processStreamOperators(PDFStreamEngine.java:472)
	at org.apache.pdfbox.contentstream.PDFStreamEngine.processStream(PDFStreamEngine.java:446)
	at org.apache.pdfbox.contentstream.PDFStreamEngine.processPage(PDFStreamEngine.java:149)
	at org.apache.pdfbox.text.LegacyPDFStreamEngine.processPage(LegacyPDFStreamEngine.java:139)
	at org.apache.pdfbox.text.PDFTextStripper.processPage(PDFTextStripper.java:391)
	at org.apache.tika.parser.pdf.PDF2XHTML.processPage(PDF2XHTML.java:143)
	at org.apache.pdfbox.text.PDFTextStripper.processPages(PDFTextStripper.java:319)
	at org.apache.pdfbox.text.PDFTextStripper.writeText(PDFTextStripper.java:266)
	at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:111)
	at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:146)
	at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
	at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
	at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
	at org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:188)
	at org.apache.tika.parser.DigestingParser.parse(DigestingParser.java:74)
	at org.apache.tika.parser.RecursiveParserWrapper.parse(RecursiveParserWrapper.java:158)
	at org.apache.tika.batch.FileResourceConsumer.parse(FileResourceConsumer.java:407)
	at org.apache.tika.batch.fs.RecursiveParserWrapperFSConsumer.processFileResource(RecursiveParserWrapperFSConsumer.java:104)
	at org.apache.tika.batch.FileResourceConsumer._processFileResource(FileResourceConsumer.java:182)
	at org.apache.tika.batch.FileResourceConsumer.call(FileResourceConsumer.java:115)
	at org.apache.tika.batch.FileResourceConsumer.call(FileResourceConsumer.java:50)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:745)


Re: PDFBox 2.0.3 TIKA comparison

Posted by Andreas Lehmkuehler <an...@lehmi.de>.

On 14.09.2016 at 22:42, Allison, Timothy B. wrote:
>> Q ("TOP_10_MORE_IN_B") shows a lot of good entries, so yes, "content extraction looks to have improved quite a bit" :-)
>
> Y, absolutely.  Thank _you_ for reviewing the output and all of your other work, of course!
Tim, we have to thank you for running those tests again!!

BR
Andreas

RE: PDFBox 2.0.3 TIKA comparison

Posted by "Allison, Timothy B." <ta...@mitre.org>.

> Q ("TOP_10_MORE_IN_B") shows a lot of good entries, so yes, "content extraction looks to have improved quite a bit" :-)

Y, absolutely.  Thank _you_ for reviewing the output and all of your other work, of course!

Cheers,

          Tim

-----Original Message-----
From: Tilman Hausherr [mailto:THausherr@t-online.de] 
Sent: Wednesday, September 14, 2016 2:50 PM
To: dev@pdfbox.apache.org
Subject: Re: PDFBox 2.0.3 TIKA comparison


> On 14.09.2016 at 18:38, Allison, Timothy B. wrote:
>>
>>
>> There are some regressions in content extraction, but overall, 
>> content extraction looks to have improved quite a bit.  Looks like ~2 
>> million more "common English words" via Tilman's methodology.

After some wandering around I finally looked at content extraction only, at column P ("TOP_10_MORE_IN_A") for cells with meaningful words.
It turned out that all files were from Delaware courts, so I've decided to look only at one single file, Y5TLCWTIAE3FYDVJTV2TXRZGXLEDUNSW.
The extracted text with 2.0.2 and 2.0.3 is

IN THE  COUR T OF  CHAN CER Y O F TH E STA TE OF  D ELA WARE

in 2.0.1 and 1.8 it is

IN THE COURT OF CHANCERY OF THE STATE OF DELAWARE

For 1.8 the explanation is that text extraction takes words, while in
2.* each character is taken alone.

The bad result in 2.0.3 is because of an incorrect /W array. The space has a width of 3, while other characters have widths between 200 and 722. So PDFBox believes that there are spaces where there are none.

The only mystery that remains is why it worked in 2.0.1. Maybe that one took an average glyph width for spaces, or the width value from the font itself. I'll find this out later, but it isn't a high priority. A look at column Q ("TOP_10_MORE_IN_B") shows a lot of good entries, so yes, "content extraction looks to have improved quite a bit" :-)

Thanks for testing!

Tilman




Re: PDFBox 2.0.3 TIKA comparison

Posted by John Hewson <jo...@jahewson.com>.

> On 15 Sep 2016, at 09:02, Tilman Hausherr <TH...@t-online.de> wrote:
> 
> On 14.09.2016 at 20:50, Tilman Hausherr wrote:
>> 
>>> On 14.09.2016 at 18:38, Allison, Timothy B. wrote:
>>>> 
>>>> 
>>>> There are some regressions in content extraction, but overall, content extraction looks to have improved quite a bit.  Looks like ~2 million more "common English words" via Tilman's methodology. 
>> 
>> After some wandering around I finally looked at content extraction only, at column P ("TOP_10_MORE_IN_A") for cells with meaningful words.
>> It turned out that all files were from Delaware courts, so I've decided to look only at one single file, Y5TLCWTIAE3FYDVJTV2TXRZGXLEDUNSW.
>> The extracted text with 2.0.2 and 2.0.3 is
>> 
>> IN THE  COUR T OF  CHAN CER Y O F TH E STA TE OF  D ELA WARE
>> 
>> in 2.0.1 and 1.8 it is
>> 
>> IN THE COURT OF CHANCERY OF THE STATE OF DELAWARE
>> 
>> For 1.8 the explanation is that text extraction takes words, while in 2.* each character is taken alone.
>> 
>> The bad result in 2.0.3 is because of an incorrect /W array. The space has a width of 3, while other characters have widths between 200 and 722. So PDFBox believes that there are spaces where there are none. 
> 
> The story is different: the space width used for text extraction is NOT taken from the space glyph (whose width is 250, not 3; the table is a ranges array), but from an average of all glyph widths.

Ok, good. I was just about to investigate that remark in your previous email because the Widths array overrides any embedded font widths, so strictly speaking can’t contain a “bad” width, as whatever it contains is defined to be the width. We even stretch glyphs to fit that width (as Acrobat does).

> It's a good thing I looked back through the history. The breaking change was in rev 1744613 (PDFBOX-3354) and is related to the calculation of the average glyph width. Before rev 1744613 the averageWidth was always 0 (due to a bug likely introduced accidentally in some refactoring), and text extraction replaced that 0 with a default value (1000).

I’m not convinced that we should be using average widths at all. In the absence of justification, typographic tradition defines a space as being between 0.2 and 0.3 em (where 1em = the font size in pt). 250 would be a sensible default, unless the font contains a space character (with an empty path, so we know it is really a space).

Perhaps this could go on the wish list for “new text extraction”.
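
To make the suggestion above concrete, here is a minimal Java sketch of that fallback. It is not PDFBox code: the class name, the method, and both parameters are hypothetical, and it assumes widths are given in the usual glyph units of 1/1000 em, so 250 corresponds to 0.25 em.

```java
/** Illustration only: picks a space width in 1/1000 em glyph units. */
class SpaceWidthSketch {

    // 0.25 em, the middle of the traditional 0.2 to 0.3 em range.
    static final float DEFAULT_SPACE_WIDTH = 250f;

    /**
     * @param spaceGlyphWidth width of the font's space glyph, or null if the font has none
     * @param hasEmptyPath    true if that glyph draws nothing, so it really is a space
     * @return the glyph's own width when it can be trusted, otherwise the default
     */
    static float spaceWidth(Float spaceGlyphWidth, boolean hasEmptyPath) {
        if (spaceGlyphWidth != null && spaceGlyphWidth > 0 && hasEmptyPath) {
            return spaceGlyphWidth;
        }
        return DEFAULT_SPACE_WIDTH;
    }

    public static void main(String[] args) {
        System.out.println(spaceWidth(278f, true));   // trusted space glyph
        System.out.println(spaceWidth(null, false));  // no space glyph: default
        System.out.println(spaceWidth(600f, false));  // glyph has an outline: default
    }
}
```

The point of the design is that no average over other glyphs is involved, so zero-width entries elsewhere in the font cannot skew the result.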

— John

> Starting with rev 1744613 an average width was actually calculated, but because the /W ranges array contains so many 0 values (its ranges run up to CID 65534), the result was unreliable:
> 
> /W [1 1 0 2 3 250 4 10 0 11
> 12 333 13 14 0 15 15 250 16 16
> 333 17 17 250 18 18 277 19 19 0
> 20 23 500 24 35 0 36 36 722 37
> 37 666 38 39 722 40 40 666 41 41
> 610 42 43 777 44 44 389 45 45 0
> 46 46 777 47 47 666 48 48 943 49
> 49 722 50 50 777 51 51 610 52 52
> 0 53 53 722 54 54 556 55 55 666
> 56 57 722 59 59 0 60 60 722 61
> 67 0 68 68 500 69 69 556 70 70
> 443 71 71 556 72 72 443 73 73 333
> 74 74 500 75 75 556 76 76 277 77
> 77 0 78 78 556 79 79 277 80 80
> 833 81 81 556 82 82 500 83 84 556
> 85 85 443 86 86 389 87 87 333 88
> 88 556 89 89 0 90 90 722 91 92
> 500 93 178 0 179 180 500 181 181 0
> 182 182 333 183 751 0 752 752 198 753
> 794 0 795 795 612 796 1126 0 1127 1127
> 125 1129 1129 2000 1130 65534 0]
> 
> Solution: ignore widths that are <= 0. Zero values are already ignored in PDFont, but not in PDCIDFont.
> 
> Average width before the fix: 0.52861196. After the fix: 549.8571.
> 
> I'll open an issue and commit a fix after sending this. It won't be in 2.0.3, but in 2.0.4.
> 
> Tilman
> 

RE: PDFBox 2.0.3 TIKA comparison

Posted by "Allison, Timothy B." <ta...@mitre.org>.

Great.  Thank you!

-----Original Message-----
From: Tilman Hausherr [mailto:THausherr@t-online.de] 
Sent: Thursday, September 15, 2016 12:03 PM
To: dev@pdfbox.apache.org
Subject: Re: PDFBox 2.0.3 TIKA comparison

On 14.09.2016 at 20:50, Tilman Hausherr wrote:
>
>> On 14.09.2016 at 18:38, Allison, Timothy B. wrote:
>>>
>>>
>>> There are some regressions in content extraction, but overall, 
>>> content extraction looks to have improved quite a bit.  Looks like
>>> ~2 million more "common English words" via Tilman's methodology. 
>
> After some wandering around I finally looked at content extraction 
> only, at column P ("TOP_10_MORE_IN_A") for cells with meaningful words.
> It turned out that all files were from Delaware courts, so I've 
> decided to look only at one single file, 
> Y5TLCWTIAE3FYDVJTV2TXRZGXLEDUNSW.
> The extracted text with 2.0.2 and 2.0.3 is
>
> IN THE  COUR T OF  CHAN CER Y O F TH E STA TE OF  D ELA WARE
>
> in 2.0.1 and 1.8 it is
>
> IN THE COURT OF CHANCERY OF THE STATE OF DELAWARE
>
> For 1.8 the explanation is that text extraction takes words, while in
> 2.* each character is taken alone.
>
> The bad result in 2.0.3 is because of an incorrect /W array. The space 
> has a width of 3, while other characters have widths between 200 and 
> 722. So PDFBox believes that there are spaces where there are none.

The story is different: the space width used for text extraction is NOT taken from the space glyph (whose width is 250, not 3; the table is a ranges array), but from an average of all glyph widths. It's a good thing I looked back through the history. The breaking change was in rev 1744613 (PDFBOX-3354) and is related to the calculation of the average glyph width. Before rev 1744613 the averageWidth was always 0 (due to a bug likely introduced accidentally in some refactoring), and text extraction replaced that 0 with a default value (1000).

Starting with rev 1744613 an average width was actually calculated, but because the /W ranges array contains so many 0 values (its ranges run up to CID 65534), the result was unreliable:

/W [1 1 0 2 3 250 4 10 0 11
12 333 13 14 0 15 15 250 16 16
333 17 17 250 18 18 277 19 19 0
20 23 500 24 35 0 36 36 722 37
37 666 38 39 722 40 40 666 41 41
610 42 43 777 44 44 389 45 45 0
46 46 777 47 47 666 48 48 943 49
49 722 50 50 777 51 51 610 52 52
0 53 53 722 54 54 556 55 55 666
56 57 722 59 59 0 60 60 722 61
67 0 68 68 500 69 69 556 70 70
443 71 71 556 72 72 443 73 73 333
74 74 500 75 75 556 76 76 277 77
77 0 78 78 556 79 79 277 80 80
833 81 81 556 82 82 500 83 84 556
85 85 443 86 86 389 87 87 333 88
88 556 89 89 0 90 90 722 91 92
500 93 178 0 179 180 500 181 181 0
182 182 333 183 751 0 752 752 198 753
794 0 795 795 612 796 1126 0 1127 1127
125 1129 1129 2000 1130 65534 0]

Solution: ignore widths that are <= 0. Zero values are already ignored in PDFont, but not in PDCIDFont.

Average width before the fix: 0.52861196. After the fix: 549.8571.

I'll open an issue and commit a fix after sending this. It won't be in 2.0.3, but in 2.0.4.

Tilman


Re: PDFBox 2.0.3 TIKA comparison

Posted by Tilman Hausherr <TH...@t-online.de>.

On 14.09.2016 at 20:50, Tilman Hausherr wrote:
>
>> On 14.09.2016 at 18:38, Allison, Timothy B. wrote:
>>>
>>>
>>> There are some regressions in content extraction, but overall, 
>>> content extraction looks to have improved quite a bit.  Looks like 
>>> ~2 million more "common English words" via Tilman's methodology. 
>
> After some wandering around I finally looked at content extraction 
> only, at column P ("TOP_10_MORE_IN_A") for cells with meaningful words.
> It turned out that all files were from Delaware courts, so I've 
> decided to look only at one single file, 
> Y5TLCWTIAE3FYDVJTV2TXRZGXLEDUNSW.
> The extracted text with 2.0.2 and 2.0.3 is
>
> IN THE  COUR T OF  CHAN CER Y O F TH E STA TE OF  D ELA WARE
>
> in 2.0.1 and 1.8 it is
>
> IN THE COURT OF CHANCERY OF THE STATE OF DELAWARE
>
> For 1.8 the explanation is that text extraction takes words, while in 
> 2.* each character is taken alone.
>
> The bad result in 2.0.3 is because of an incorrect /W array. The space 
> has a width of 3, while other characters have widths between 200 and 
> 722. So PDFBox believes that there are spaces where there are none. 

The story is different: the space width used for text extraction is NOT 
taken from the space glyph (whose width is 250, not 3; the table is a 
ranges array), but from an average of all glyph widths. It's a good thing 
I looked back through the history. The breaking change was in rev 1744613 
(PDFBOX-3354) and is related to the calculation of the average glyph 
width. Before rev 1744613 the averageWidth was always 0 (due to a bug 
likely introduced accidentally in some refactoring), and text extraction 
replaced that 0 with a default value (1000).

Starting with rev 1744613 an average width was actually calculated, but 
because the /W ranges array contains so many 0 values (its ranges run up 
to CID 65534), the result was unreliable:

/W [1 1 0 2 3 250 4 10 0 11
12 333 13 14 0 15 15 250 16 16
333 17 17 250 18 18 277 19 19 0
20 23 500 24 35 0 36 36 722 37
37 666 38 39 722 40 40 666 41 41
610 42 43 777 44 44 389 45 45 0
46 46 777 47 47 666 48 48 943 49
49 722 50 50 777 51 51 610 52 52
0 53 53 722 54 54 556 55 55 666
56 57 722 59 59 0 60 60 722 61
67 0 68 68 500 69 69 556 70 70
443 71 71 556 72 72 443 73 73 333
74 74 500 75 75 556 76 76 277 77
77 0 78 78 556 79 79 277 80 80
833 81 81 556 82 82 500 83 84 556
85 85 443 86 86 389 87 87 333 88
88 556 89 89 0 90 90 722 91 92
500 93 178 0 179 180 500 181 181 0
182 182 333 183 751 0 752 752 198 753
794 0 795 795 612 796 1126 0 1127 1127
125 1129 1129 2000 1130 65534 0]

Solution: ignore widths that are <= 0. Zero values are already ignored 
in PDFont, but not in PDCIDFont.

Average width before the fix: 0.52861196. After the fix: 549.8571.
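
For anyone who wants to check those two numbers, here is a minimal Java sketch. It is not the PDFBox implementation: the class and method names are made up for illustration, and it only understands the range form of /W (first CID, last CID, width) used in the array above. It expands the ranges into one width per CID and averages them, once over all widths and once over positive widths only.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Illustration only: averages CID widths from a /W array in range form. */
class AverageWidthSketch {

    // The /W array quoted above, read as triples: firstCid, lastCid, width.
    static final int[] W = {
        1, 1, 0,  2, 3, 250,  4, 10, 0,  11, 12, 333,  13, 14, 0,  15, 15, 250,
        16, 16, 333,  17, 17, 250,  18, 18, 277,  19, 19, 0,  20, 23, 500,
        24, 35, 0,  36, 36, 722,  37, 37, 666,  38, 39, 722,  40, 40, 666,
        41, 41, 610,  42, 43, 777,  44, 44, 389,  45, 45, 0,  46, 46, 777,
        47, 47, 666,  48, 48, 943,  49, 49, 722,  50, 50, 777,  51, 51, 610,
        52, 52, 0,  53, 53, 722,  54, 54, 556,  55, 55, 666,  56, 57, 722,
        59, 59, 0,  60, 60, 722,  61, 67, 0,  68, 68, 500,  69, 69, 556,
        70, 70, 443,  71, 71, 556,  72, 72, 443,  73, 73, 333,  74, 74, 500,
        75, 75, 556,  76, 76, 277,  77, 77, 0,  78, 78, 556,  79, 79, 277,
        80, 80, 833,  81, 81, 556,  82, 82, 500,  83, 84, 556,  85, 85, 443,
        86, 86, 389,  87, 87, 333,  88, 88, 556,  89, 89, 0,  90, 90, 722,
        91, 92, 500,  93, 178, 0,  179, 180, 500,  181, 181, 0,  182, 182, 333,
        183, 751, 0,  752, 752, 198,  753, 794, 0,  795, 795, 612,
        796, 1126, 0,  1127, 1127, 125,  1129, 1129, 2000,  1130, 65534, 0
    };

    /** Expands the range triples into a CID -> width map. */
    static Map<Integer, Integer> expand(int[] w) {
        Map<Integer, Integer> widths = new LinkedHashMap<>();
        for (int i = 0; i + 2 < w.length; i += 3) {
            for (int cid = w[i]; cid <= w[i + 1]; cid++) {
                widths.put(cid, w[i + 2]);
            }
        }
        return widths;
    }

    /** Averages the widths; optionally skips entries that are <= 0. */
    static float average(Map<Integer, Integer> widths, boolean skipNonPositive) {
        float total = 0;
        int count = 0;
        for (int width : widths.values()) {
            if (skipNonPositive && width <= 0) {
                continue;
            }
            total += width;
            count++;
        }
        return count == 0 ? 0 : total / count;
    }

    public static void main(String[] args) {
        Map<Integer, Integer> widths = expand(W);
        System.out.println("CIDs covered:         " + widths.size());
        System.out.println("average, all widths:  " + average(widths, false));
        System.out.println("average, widths > 0:  " + average(widths, true));
    }
}
```

The array covers 65532 CIDs, of which only 63 have a positive width, and those 63 widths sum to 34641. Dividing by 65532 gives roughly 0.5286, dividing by 63 gives roughly 549.857, which agrees with the figures above.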

I'll open an issue and commit a fix after sending this. It won't be in 
2.0.3, but in 2.0.4.

Tilman


Re: PDFBox 2.0.3 TIKA comparison

Posted by Tilman Hausherr <TH...@t-online.de>.

> On 14.09.2016 at 18:38, Allison, Timothy B. wrote:
>>
>>
>> There are some regressions in content extraction, but overall, 
>> content extraction looks to have improved quite a bit.  Looks like ~2 
>> million more "common English words" via Tilman's methodology. 

After some wandering around I finally looked at content extraction only, 
at column P ("TOP_10_MORE_IN_A") for cells with meaningful words.
It turned out that all files were from Delaware courts, so I've decided 
to look only at one single file, Y5TLCWTIAE3FYDVJTV2TXRZGXLEDUNSW.
The extracted text with 2.0.2 and 2.0.3 is

IN THE  COUR T OF  CHAN CER Y O F TH E STA TE OF  D ELA WARE

in 2.0.1 and 1.8 it is

IN THE COURT OF CHANCERY OF THE STATE OF DELAWARE

For 1.8 the explanation is that text extraction takes words, while in 
2.* each character is taken alone.

The bad result in 2.0.3 is because of an incorrect /W array. The space 
has a width of 3, while other characters have widths between 200 and 
722. So PDFBox believes that there are spaces where there are none.

The only mystery that remains is why it worked in 2.0.1. Maybe that one 
took an average glyph width for spaces, or the width value from the font 
itself. I'll find this out later, but it isn't a high priority. A look 
at column Q ("TOP_10_MORE_IN_B") shows a lot of good entries, so yes, 
"content extraction looks to have improved quite a bit" :-)

Thanks for testing!

Tilman




RE: PDFBox 2.0.3 TIKA comparison

Posted by "Allison, Timothy B." <ta...@mitre.org>.

That was caused by a cap we placed in Tika on the extraction of XMP history: TIKA-1999 [1]

We haven't switched to XMPBox...still on JempBox from 1.8.x.

[1] https://issues.apache.org/jira/browse/TIKA-1999

-----Original Message-----
From: Tilman Hausherr [mailto:THausherr@t-online.de] 
Sent: Wednesday, September 14, 2016 12:52 PM
To: dev@pdfbox.apache.org
Subject: Re: PDFBox 2.0.3 TIKA comparison

On 14.09.2016 at 18:38, Allison, Timothy B. wrote:
> https://github.com/tballison/share/blob/master/tika_comparisons/reports_tika_20160904_dev.zip
>
> This run was against the full corpus, not just PDFs.  I used a fairly recent nightly build of PDFBox and POI's 3.15-rc1.
>
> The one apparent major new exception for PDF files was apparently fixed before 2.0.3.  So, please ignore that one!
>
> There are some regressions in content extraction, but overall, content extraction looks to have improved quite a bit.  Looks like ~2 million more "common English words" via Tilman's methodology.
>
> Let me know if you have any questions.

I wonder what happened here:
commoncrawl2/SH/SHMSOEBK4QOJO5CY7BIWWDH6GHSTOXYM

metadata went from 6766 to 4134.

Is this a TIKA thing, or is this because of a change from xmpbox to jempbox?

Tilman


