Posted to dev@pdfbox.apache.org by GitBox <gi...@apache.org> on 2020/10/26 16:04:47 UTC

[GitHub] [pdfbox] SchwingSK opened a new pull request #89: PDFBOX-5002: fix word detection in PDFTextStripper

SchwingSK opened a new pull request #89:
URL: https://github.com/apache/pdfbox/pull/89


   The problem lay in the fact that maxHeightForLine is kept even when the text font changes (which is intentional, so as not to trigger a new line when there is sub/superscript). In this case, that leads PDFTextStripper to merge two lines that should be separate.
   The patch assumes that when the current character is separated from the previous one, maxHeightForLine has to be reset.
   This breaks only one test, eu-001.pdf, and it should, as the new code correctly detects two lines where only one was detected before.
   
   (the patch has been tested with mvn clean test on the 2.0.21 branch with commit bdf2ae77e693cc73d4cdeb9a95c6ac2845d11ead applied, as the current 2.0 branch does not pass tests)
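
   To illustrate the idea described above, here is a minimal, self-contained sketch. It is not the actual patch and does not use PDFBox classes: the Glyph type, the groupIntoLines and overlaps helpers, the resetOnGap flag, the gapThreshold parameter, and the coordinate convention (baseline y growing downward, a glyph covering y - height to y) are all assumptions made for illustration. Only the names maxHeightForLine and the notion of resetting it on a gap come from the description.

```java
import java.util.ArrayList;
import java.util.List;

class LineHeightResetSketch {

    /** Hypothetical glyph, not PDFBox's TextPosition. Covers [y - height, y] vertically. */
    static final class Glyph {
        final String text;
        final float x, y, width, height;

        Glyph(String text, float x, float y, float width, float height) {
            this.text = text;
            this.x = x;
            this.y = y;
            this.width = width;
            this.height = height;
        }
    }

    /** True when the glyph and the current line share vertical space at a baseline. */
    static boolean overlaps(float y, float height, float lineY, float lineHeight) {
        return (lineY <= y && lineY >= y - height)       // line baseline lies inside the glyph
            || (y <= lineY && y >= lineY - lineHeight);  // glyph baseline lies inside the line
    }

    /**
     * Groups glyphs (in stream order) into lines. When resetOnGap is true, the
     * remembered line height is dropped whenever the current glyph is separated
     * from the previous one by more than gapThreshold (an assumed unit).
     */
    static List<String> groupIntoLines(List<Glyph> glyphs, float gapThreshold, boolean resetOnGap) {
        List<String> lines = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        float maxYForLine = Float.NEGATIVE_INFINITY;
        float maxHeightForLine = 0f;
        Glyph previous = null;

        for (Glyph g : glyphs) {
            if (previous != null) {
                float gap = g.x - (previous.x + previous.width);
                if (resetOnGap && gap > gapThreshold) {
                    // Assumed behaviour: forget the tall glyph seen earlier on this line.
                    maxHeightForLine = 0f;
                }
                if (!overlaps(g.y, g.height, maxYForLine, maxHeightForLine)) {
                    // No vertical overlap with the line so far: start a new line.
                    lines.add(current.toString());
                    current.setLength(0);
                    maxYForLine = g.y;
                    maxHeightForLine = 0f;
                }
            }
            current.append(g.text);
            maxYForLine = Math.max(maxYForLine, g.y);
            maxHeightForLine = Math.max(maxHeightForLine, g.height);
            previous = g;
        }
        if (current.length() > 0) {
            lines.add(current.toString());
        }
        return lines;
    }
}
```

   In this sketch, a tall glyph followed after a wide gap by smaller glyphs on a higher baseline is merged into a single line when the height is kept, and split into two lines when the height is reset on the gap. Glyphs that share a baseline stay on one line in both modes, and a superscript with no gap before it still stays attached to its line.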


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@pdfbox.apache.org
For additional commands, e-mail: dev-help@pdfbox.apache.org


[GitHub] [pdfbox] SchwingSK edited a comment on pull request #89: PDFBOX-5002: fix word detection in PDFTextStripper

Posted by GitBox <gi...@apache.org>.
SchwingSK edited a comment on pull request #89:
URL: https://github.com/apache/pdfbox/pull/89#issuecomment-717151484


   After testing with 14646 PDFs, I reduced the five-space rule to a single space (my feeling was wrong ;) ), as it gives even better results and does not break any additional TestTextStripper tests.
   5 spaces: 965 pages with at least one space fixed out of 14841 pages
   1 space: 1083 pages with at least one space fixed out of 14841 pages
   




[GitHub] [pdfbox] asfgit closed pull request #89: PDFBOX-5002: fix word detection in PDFTextStripper

Posted by GitBox <gi...@apache.org>.
asfgit closed pull request #89:
URL: https://github.com/apache/pdfbox/pull/89


   




[GitHub] [pdfbox] SchwingSK commented on pull request #89: PDFBOX-5002: fix word detection in PDFTextStripper

Posted by GitBox <gi...@apache.org>.
SchwingSK commented on pull request #89:
URL: https://github.com/apache/pdfbox/pull/89#issuecomment-717151484


   After testing with 14646 PDFs, I reduced the five-space rule to a single space, as it gives even better results and does not break any additional TestTextStripper tests.
   5 spaces: 965 pages with at least one space fixed out of 14841 pages
   1 space: 1083 pages with at least one space fixed out of 14841 pages
   

