Posted to dev@pdfbox.apache.org by "Andreas Lehmkühler (Jira)" <ji...@apache.org> on 2020/12/28 13:53:00 UTC
[jira] [Commented] (PDFBOX-2138) Corrupted words when using PDFTextStripper

    [ https://issues.apache.org/jira/browse/PDFBOX-2138?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17255596#comment-17255596 ] 

Andreas Lehmkühler commented on PDFBOX-2138:
--------------------------------------------

The PDF uses three marked content sequences, each of which contains the text. They look like different versions of the source document to me. However, PDFBox extracts all of those sequences and merges the results into one text. If the sort option is active, the text of all sequences is merged at the very same starting position and becomes totally unreadable.

We have to find a way to choose the correct version (marked content sequence) of the text when extracting, and to omit the others. Does anyone know how to identify the correct one? I tried to find the answer in the specs but failed.
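
To make the merge effect concrete, here is a minimal, self-contained toy model in plain Java. It is only a sketch under stated assumptions: the class MarkedContentMergeSketch, its Glyph type and all method names are hypothetical and are not PDFBox types or methods, and the one-unit-per-character layout is a simplification. It only illustrates why sorting glyphs from several sequences that share a starting position garbles the text, and that restricting extraction to one sequence keeps it readable. It does not answer which sequence is the correct one.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Toy model only: none of these names are PDFBox types or methods. */
public class MarkedContentMergeSketch {

    /** Hypothetical positioned glyph: source sequence, x position, character. */
    static final class Glyph {
        final int sequence;
        final int x;
        final char ch;

        Glyph(int sequence, int x, char ch) {
            this.sequence = sequence;
            this.x = x;
            this.ch = ch;
        }
    }

    /** Lays out every version left to right, all starting at x = 0. */
    static List<Glyph> layoutAll(String... versions) {
        List<Glyph> glyphs = new ArrayList<>();
        for (int seq = 0; seq < versions.length; seq++) {
            for (int i = 0; i < versions[seq].length(); i++) {
                glyphs.add(new Glyph(seq, i, versions[seq].charAt(i)));
            }
        }
        return glyphs;
    }

    /** Sorts by x only (List.sort is stable) and joins the characters. */
    static String join(List<Glyph> glyphs) {
        List<Glyph> sorted = new ArrayList<>(glyphs);
        sorted.sort(Comparator.comparingInt((Glyph g) -> g.x));
        StringBuilder sb = new StringBuilder();
        for (Glyph g : sorted) {
            sb.append(g.ch);
        }
        return sb.toString();
    }

    /** Text when all sequences are merged before sorting. */
    static String mergedSortedText(String... versions) {
        return join(layoutAll(versions));
    }

    /** Text when only the sequence at the given index is kept. */
    static String singleSequenceText(int index, String... versions) {
        List<Glyph> kept = new ArrayList<>();
        for (Glyph g : layoutAll(versions)) {
            if (g.sequence == index) {
                kept.add(g);
            }
        }
        return join(kept);
    }

    public static void main(String[] args) {
        String[] versions = {"risk", "risks", "risky"};
        System.out.println(mergedSortedText(versions));      // prints "rrriiissskkksy"
        System.out.println(singleSequenceText(1, versions)); // prints "risks"
    }
}
```

In this model the index passed to singleSequenceText is chosen by hand; in a real document the open question from the comment above remains how to pick that sequence.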



> Corrupted words when using PDFTextStripper
> ------------------------------------------
>
>                 Key: PDFBOX-2138
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-2138
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 1.8.5, 1.8.6, 2.0.0
>         Environment: Windows 7 / 64 bit
>            Reporter: Walter Kehl
>            Priority: Major
>             Fix For: 3.0.0 PDFBox
>
>         Attachments: PDFBOX-2138.pdf, PDFBOX-2138.txt, banking-banana-skins-2014.pdf, banking-banana-skins-2014.txt
>
>
> I am using PDFTextStripper (embedded into another application) to get
> the raw text of PDFs so far with good results but recently a PDF file
> has appeared where the output of the PDFTextStripper was corrupted. I
> got sentences like:
>
> "There is al o con ern that b nkers may be pushed to misprice risk
> (No. 6) by the pres ures of c mpetition and an abunda ce of central b
> nk-provided liquidity."
>
> Additionally some portions of text appear twice in the output: first
> correctly and then corrupted. I have attached an output created with
> PDFBox's command line options.
> If you compare lines 357-365 with lines 421-429 you see that it is
> the same paragraph, first ok and then with characters missing. In the
> original source this paragraph is unique.
> The same seems to happen for the other instances where text is corrupted.
>
> I also tried it directly on the command line with the same results: input and output files are attached.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
