Posted to dev@pdfbox.apache.org by "Orbel Mkrtchyan (JIRA)" <ji...@apache.org> on 2014/05/02 12:07:15 UTC

[jira] [Updated] (PDFBOX-2053) Issue with PDFBox position reading

     [ https://issues.apache.org/jira/browse/PDFBOX-2053?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Orbel Mkrtchyan updated PDFBOX-2053:
------------------------------------

    Attachment: test.pdf

> Issue with PDFBox position reading
> ----------------------------------
>
>                 Key: PDFBOX-2053
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-2053
>             Project: PDFBox
>          Issue Type: Bug
>    Affects Versions: 1.8.3
>            Reporter: Orbel Mkrtchyan
>         Attachments: test.pdf
>
>
> Using PDFBox 1.8.4,
> bug #1:
> 		PDDocument doc = new PDDocument();
> 		doc.load("test-pcc7247.pdf");
> 		doc.save("out.pdf");
> 		doc.close();
> The resulting file is corrupted, contains 0 pages and cannot be viewed by Acrobat Reader.
> bug #2: Consider the following code snippet. The extractor is invoked like this:
>       Extractor extractor = new Extractor();
>       extractor.writeText(pdDoc, output);
> where Extractor is defined like this:
> public class Extractor extends PDFTextStripper {
> ...
>     protected void writePage() throws IOException
>     {
>         for( int i = 0; i < charactersByArticle.size(); i++)
>         {
>             List<TextPosition> textList = charactersByArticle.get( i );
>             Iterator textIter = textList.iterator();
>             while( textIter.hasNext() )
>             {
>                 TextPosition position = (TextPosition)textIter.next();
> In this code, the position variable correctly iterates through the letters of the first line of the provided PDF document, but its coordinates (x, y, widths, etc.) are identical for every letter. To be clear, each position corresponds to exactly one letter, and the length of its widths array is always 1, so every letter in a line reports the same coordinates. The expected behaviour is either distinct coordinates for each letter, or a widths[] array that contains the widths of all characters in the line.
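
A note on the bug #1 reproduction, offered as a possible explanation rather than a confirmed diagnosis: in PDFBox 1.8.x, PDDocument.load(String) is a static factory method that returns the loaded document. Calling it through an instance compiles, but the return value is discarded and the instance created with new PDDocument() stays empty, which would be consistent with an output file of 0 pages. The sketch below shows the same Java behaviour with a hypothetical stand-in class. Doc, StaticLoadPitfall, and the page count of 3 are assumptions made for illustration and are not part of PDFBox.

```java
// Hypothetical stand-in for a document class; this is NOT the PDFBox API.
final class Doc {
    private final int pageCount;

    // Comparable to `new PDDocument()`: creates an empty document.
    Doc() {
        this(0);
    }

    private Doc(int pageCount) {
        this.pageCount = pageCount;
    }

    // Static factory: returns a NEW object and never modifies an existing one.
    static Doc load(String fileName) {
        return new Doc(3); // assumption: pretend the file on disk has 3 pages
    }

    int getNumberOfPages() {
        return pageCount;
    }
}

final class StaticLoadPitfall {
    // Same pattern as the reproduction: the result of load() is thrown away.
    static int pagesWhenResultIsDiscarded() {
        Doc doc = new Doc();
        doc.load("test.pdf");          // resolves to the static Doc.load(...)
        return doc.getNumberOfPages(); // still the empty document
    }

    // Keeping the returned object gives the loaded document.
    static int pagesWhenResultIsKept() {
        Doc doc = Doc.load("test.pdf");
        return doc.getNumberOfPages();
    }

    public static void main(String[] args) {
        System.out.println(pagesWhenResultIsDiscarded()); // prints 0
        System.out.println(pagesWhenResultIsKept());      // prints 3
    }
}
```

If this reading applies, the reproduction for bug #1 would assign the result instead, i.e. PDDocument doc = PDDocument.load("test-pcc7247.pdf"); followed by the same save and close calls. Whether that fully accounts for the corrupted output is not established by the report itself.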



--
This message was sent by Atlassian JIRA
(v6.2#6252)