Posted to dev@pdfbox.apache.org by "Mark Looi (JIRA)" <ji...@apache.org> on 2010/09/30 20:53:32 UTC

[jira] Created: (PDFBOX-846) TextExtraction mixes case of text

TextExtraction mixes case of text
---------------------------------

                 Key: PDFBOX-846
                 URL: https://issues.apache.org/jira/browse/PDFBOX-846
             Project: PDFBox
          Issue Type: Bug
          Components: Text extraction
    Affects Versions: 1.2.1
         Environment: Windows server, .NET
            Reporter: Mark Looi


When we run text extraction on a file like this one, http://www.organictogo.com/pdf/catering/Menu_WA_032509.pdf, the text "THAI VEGGIE WRAP" (in all caps) is extracted as:
"ThAI VeGGIe wRAP". However, examining the PDF shows that it looks like this: "Thai V eggi e Wrap". The related text on the following lines, such as "Crisp red cabbage, cucumbers, carrots and lettuce with Thai", parses just fine.

We are using this code to get the text in C#:

    byte[] pdfData = myWebClient.DownloadData(pdfUrl);
    string text = string.Empty;

    ByteArrayInputStream stream = new ByteArrayInputStream(pdfData);
    PDDocument doc = PDDocument.load(stream);
    PDFTextStripper stripper = new PDFTextStripper();
    text = stripper.getText(doc);
    doc.close();


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (PDFBOX-846) TextExtraction mixes case of text

Posted by "Andreas Lehmkühler (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/PDFBOX-846?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andreas Lehmkühler updated PDFBOX-846:
--------------------------------------

    Attachment: PDFBOX846-Menu_WA_032509.pdf

[jira] Assigned: (PDFBOX-846) TextExtraction mixes case of text

Posted by "Andreas Lehmkühler (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/PDFBOX-846?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andreas Lehmkühler reassigned PDFBOX-846:
-----------------------------------------

    Assignee: Andreas Lehmkühler

[jira] Commented: (PDFBOX-846) TextExtraction mixes case of text

Posted by "Mark Looi (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/PDFBOX-846?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12922439#action_12922439 ] 

Mark Looi commented on PDFBOX-846:
----------------------------------

Thanks Andreas. Hey, do you think you'll be able to post the .NET version
soon? That's the one we use. Much appreciated.

Mark.
Phone: (425) 941 2378 | twitter.com/marklooi | www.looiconsulting.com


On Sat, Oct 16, 2010 at 10:46 AM, Andreas Lehmkühler (JIRA) <jira@apache.org



[jira] Updated: (PDFBOX-846) TextExtraction mixes case of text

Posted by "Andreas Lehmkühler (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/PDFBOX-846?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andreas Lehmkühler updated PDFBOX-846:
--------------------------------------

    Attachment: PDFBOX846-Menu_WA_032509.txt

[jira] Resolved: (PDFBOX-846) TextExtraction mixes case of text

Posted by "Andreas Lehmkühler (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/PDFBOX-846?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andreas Lehmkühler resolved PDFBOX-846.
---------------------------------------

       Resolution: Fixed
    Fix Version/s: 1.3.0

I fixed the calculation of the space width in revision 1023338. The scaling of the text matrix and the CTM wasn't taken into account before, which confused the algorithm that calculates whether a space has to be added or not.
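
To illustrate the kind of problem described in this comment, here is a minimal, self-contained Java sketch. It is not the code from revision 1023338 and does not use the PDFBox API; the class name, the method names, the 0.5 threshold factor, and all numbers are assumptions chosen for illustration. It only shows how a space width computed without the text matrix and CTM scaling can come out far too small, so that an ordinary gap between two letters is mistaken for a word break.

```java
public class SpaceWidthSketch {

    /**
     * Width of a space in page units. Glyph widths are given in 1/1000 of a
     * text-space unit, so the value is scaled by the font size and by the
     * horizontal scaling of both the text matrix and the CTM.
     */
    public static double spaceWidthDisplay(double spaceGlyphWidth, double fontSize,
                                           double textMatrixXScale, double ctmXScale) {
        return spaceGlyphWidth / 1000.0 * fontSize * textMatrixXScale * ctmXScale;
    }

    /**
     * Returns true when the horizontal gap between the end of one glyph and
     * the start of the next is wide enough to be treated as a word break.
     * The 0.5 factor is a hypothetical tolerance, not PDFBox's value.
     */
    public static boolean needsSpace(double prevEndX, double nextStartX, double spaceWidth) {
        double threshold = 0.5 * spaceWidth;
        return nextStartX - prevEndX > threshold;
    }

    public static void main(String[] args) {
        double glyphWidth = 250; // space glyph width in 1/1000 text-space units
        double fontSize = 1;     // font size 1; the real size comes from the text matrix
        double tmScale = 12;     // text matrix scales text by 12
        double ctmScale = 1;     // CTM leaves the scale unchanged

        // Space width with the matrix scaling ignored, and with it applied.
        double unscaled = spaceWidthDisplay(glyphWidth, fontSize, 1, 1);
        double scaled = spaceWidthDisplay(glyphWidth, fontSize, tmScale, ctmScale);

        double gap = 0.4; // small gap between two letters of the same word, in page units

        System.out.println(needsSpace(100.0, 100.0 + gap, unscaled)); // true: spurious space
        System.out.println(needsSpace(100.0, 100.0 + gap, scaled));   // false: same word
    }
}
```

With the scaling ignored, the threshold is 0.125 page units and the 0.4 gap is reported as a space; with the scaling applied, the threshold is 1.5 and the letters stay together.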

[jira] Updated: (PDFBOX-846) TextExtraction mixes case of text

Posted by "Andreas Lehmkühler (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/PDFBOX-846?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andreas Lehmkühler updated PDFBOX-846:
--------------------------------------

    Attachment:     (was: PDFBOX846-Menu_WA_032509.txt)

[jira] Updated: (PDFBOX-846) TextExtraction mixes case of text

Posted by "Andreas Lehmkühler (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/PDFBOX-846?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andreas Lehmkühler updated PDFBOX-846:
--------------------------------------

    Attachment: PDFBOX846-Menu_WA_032509.txt

First of all, you should activate sorting by calling stripper.setSortByPosition(true).

I'm attaching the extraction result from the current trunk version (1003396). It looks quite good, but there are still some extra spaces.
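
As a rough picture of what sorting by position means, here is a small, self-contained Java sketch. It is not PDFBox's implementation; the Fragment class, the coordinate convention (y increasing down the page), and the sample coordinates are all hypothetical. The idea is that text fragments can appear in the content stream in a different order than they appear on the page, and sorting them top to bottom and then left to right restores the reading order.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

public class SortByPositionSketch {

    /** Hypothetical text fragment: a string and the position of its left edge. */
    public static final class Fragment {
        final String text;
        final double x; // distance from the left edge of the page
        final double y; // distance from the top edge of the page

        public Fragment(String text, double x, double y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    /** Sorts fragments top to bottom, then left to right, and concatenates them. */
    public static String joinSorted(List<Fragment> fragments) {
        List<Fragment> sorted = new ArrayList<>(fragments);
        sorted.sort(Comparator.comparingDouble((Fragment f) -> f.y)
                              .thenComparingDouble(f -> f.x));
        StringBuilder sb = new StringBuilder();
        for (Fragment f : sorted) {
            sb.append(f.text);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Fragments listed in a scrambled "stream" order, all on the same line.
        List<Fragment> streamOrder = Arrays.asList(
                new Fragment("VEGGIE ", 30, 10),
                new Fragment("WRAP", 60, 10),
                new Fragment("THAI ", 10, 10));

        System.out.println(joinSorted(streamOrder)); // prints "THAI VEGGIE WRAP"
    }
}
```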
