Posted to dev@pdfbox.apache.org by "Roman (JIRA)" <ji...@apache.org> on 2017/03/06 08:44:32 UTC

[jira] [Comment Edited] (PDFBOX-3710) Text Stripper in 2.0 lost some texts - regression

    [ https://issues.apache.org/jira/browse/PDFBOX-3710?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15896904#comment-15896904 ] 

Roman edited comment on PDFBOX-3710 at 3/6/17 8:43 AM:
-------------------------------------------------------

Also, please note that when our app was based on 1.8, the user could copy and paste these 4 problematic lines of text. The text was corrupted, though: most characters were readable but came out in uppercase, while the word OVERALL was completely broken. I'm printing these 4 lines below:

/6%2!,,

4HEAPPLICANTSACADEMICEXTRACURRICULARANDPERSONALCHARACTERISTICS

2ELEVANTCONTEXTFORTHEAPPLICANTSPERFORMANCEANDINVOLVEMENTSUCHASPARTICULARITIESOFFAMILYSITUATIONORRESPONSIBILITIESAFTER
SCHOOLWORKOBLIGATIONSSIBLINGCHILDCARE

 /BSERVEDPROBLEMATICBEHAVIORSPERHAPSSEPARABLEFROMACADEMICPERFORMANCETHATANADMISSIONCOMMITTEESHOULDEXPLOREFURTHER 
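
The four pasted strings follow a regular pattern: each character code is 0x20 lower than the character one would expect, so "/6%2!,," corresponds to "OVERALL" and "4HE" to "The". The sketch below only illustrates that observation on the strings above. The class ShiftedTextDemo and its unshift method are hypothetical helpers, not part of PDFBox, and the sketch makes no claim about why 1.8 produced these codes or why 2.0 drops the glyphs.

```java
// Hypothetical helper for illustration only; not part of PDFBox.
final class ShiftedTextDemo {
    // Assumption read off the pasted strings: every character code is 0x20 too low.
    static final int OFFSET = 0x20;

    // Adds the offset back to each character of the pasted text.
    static String unshift(String pasted) {
        StringBuilder sb = new StringBuilder(pasted.length());
        for (int i = 0; i < pasted.length(); i++) {
            sb.append((char) (pasted.charAt(i) + OFFSET));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Prefixes of the lines pasted in the comment above.
        String[] samples = {"/6%2!,,", "4HE", "2ELEVANT", "/BSERVED"};
        for (String s : samples) {
            System.out.println(s + " -> " + unshift(s));
        }
    }
}
```

Under this reading, spaces and punctuation in the original text would fall below the printable range after the shift, which is consistent with the pasted lines having no word breaks.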





> Text Stripper in 2.0 lost some texts - regression
> -------------------------------------------------
>
>                 Key: PDFBOX-3710
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-3710
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>            Reporter: Roman
>         Attachments: highlight19.pdf_page1-marked-1.png, highlight19.pdf_page1.pdf, regression_in_blue.png
>
>
> After migrating our app from PDFBox 1.8 to 2.0, we noticed a regression: 4 lines of text have disappeared. These are the 3 lines of text that follow the black bullets, plus the word "OVERALL", which is placed above them in the table.
> The problematic PDF is attached: [^highlight19.pdf_page1.pdf]
> Also attached is the result of the [DrawPrintTextLocations|https://apache.googlesource.com/pdfbox/+/trunk/examples/src/main/java/org/apache/pdfbox/examples/util/DrawPrintTextLocations.java] example:
> [highlight19.pdf_page1-marked-1.png|https://issues.apache.org/jira/secure/attachment/12856229/highlight19.pdf_page1-marked-1.png]
> Notice that the Unicode values and the red and blue boxes are missing for the problematic text. The main problem is that these glyphs are absent from the *textPositions* parameter that is passed to the *writeString* function (line #275). In version 1.8 these characters ARE present, so their positions and character codes could be extracted without problems in our app.
> Also attached is a picture of the regression in our app: [^regression_in_blue.png]. Here, blue boxes are drawn where text WAS present and has since disappeared.
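
The comparison described above (glyphs that were present under 1.8 and are absent under 2.0) can be sketched as a set difference over extracted glyphs. This is a minimal sketch under stated assumptions, not the reporter's actual code. GlyphDiff and Glyph are hypothetical names, and the input lists are assumed to have been collected separately, for example from the textPositions list passed to writeString under each PDFBox version, with coordinates rounded to integers so that small floating-point differences between versions do not matter.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper for illustration only; not part of PDFBox.
final class GlyphDiff {
    // Hypothetical value type: one extracted glyph with its text and rounded page position.
    static final class Glyph {
        final String unicode;
        final int x;
        final int y;

        Glyph(String unicode, int x, int y) {
            this.unicode = unicode;
            this.x = x;
            this.y = y;
        }

        // Key used to match the same glyph across two extraction runs.
        String key() {
            return unicode + "@" + x + "," + y;
        }
    }

    // Returns the glyphs found in the "before" run that have no match in the "after" run.
    static List<Glyph> missing(List<Glyph> before, List<Glyph> after) {
        Set<String> seen = new HashSet<>();
        for (Glyph g : after) {
            seen.add(g.key());
        }
        List<Glyph> result = new ArrayList<>();
        for (Glyph g : before) {
            if (!seen.contains(g.key())) {
                result.add(g);
            }
        }
        return result;
    }
}
```

The positions of the returned glyphs are where a viewer could draw boxes like the blue ones in the attached picture.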


