Posted to dev@tika.apache.org by "Michael McCandless (JIRA)" <ji...@apache.org> on 2011/09/19 20:03:08 UTC

[jira] [Created] (TIKA-724) PDF text sometimes has extra space between letters

PDF text sometimes has extra space between letters
--------------------------------------------------

                 Key: TIKA-724
                 URL: https://issues.apache.org/jira/browse/TIKA-724
             Project: Tika
          Issue Type: Bug
          Components: parser
            Reporter: Michael McCandless
         Attachments: extraSpaces.pdf

I have a PDF with simple text "Here is some formatted text", but when
I extract with Tika I get extra spaces inserted:

{noformat}
H e re  i s  so me  fo rma tte d  te x t
{noformat}

When I created the text in this PDF (I used the PDFpen tool on OS X),
I set the style of the text to "loosen" (i.e., increase the space slightly
between the letters), so I think Tika (PDFBox) is trying to "respect"
that whitespace, but it'd be nice to turn this off (if it won't mess
up other places where we DO want the whitespace).

When I copy/paste the text, it is copied correctly.


--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (TIKA-724) PDF text sometimes has extra space between letters

Posted by "Michael McCandless (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-724?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13152133#comment-13152133 ] 

Michael McCandless commented on TIKA-724:
-----------------------------------------

Alas, no, I don't believe you can control this from Solr today; maybe open a Solr issue?

Likewise for TikaCLI: it would be nice to expose the option there too.  Maybe open an issue and put together a patch?  Thanks!
                

        

[jira] [Commented] (TIKA-724) PDF text sometimes has extra space between letters

Posted by "Michael McCandless (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-724?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13130514#comment-13130514 ] 

Michael McCandless commented on TIKA-724:
-----------------------------------------

I dug into this one some more.

Handling space between words is tricky in PDF!  This is because a PDF
need not actually include space characters; instead it can (and does!)
simply place the glyphs at x/y positions with added whitespace between
them.  This easily happens for whitespace-delimited languages too.

Yet, sometimes PDFs do include the space characters themselves (the
attached PDF is such an example).  Ideally we would be able to somehow
detect this (e.g., if the PDF is encoded differently internally, or
something), but I don't know how to do this, or whether it's even possible.

So for the time being I made a simple addition to PDFParser, adding an
option set/getEnableAutoSpace, defaulting to enabled (i.e., keeping
today's behavior).  So at least if an app hits PDFs like the one
attached here, or somehow knows its PDFs always include explicit
space characters, it can disable this option.
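
To make the trade-off concrete, here is a minimal, self-contained Java
sketch.  It is NOT PDFBox's or Tika's actual code: the Glyph class, the
layout and extract helpers, and the fixed gap threshold are all
hypothetical names and values made up for illustration.  It only assumes
what is described above, namely that an extractor sees positioned glyphs
and, when auto-spacing is enabled, guesses a space wherever the gap
between two glyphs looks large.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model for illustration only; not PDFBox or Tika code.
class AutoSpaceSketch {

    /** One positioned glyph: the character, its left edge, and its width. */
    static final class Glyph {
        final char ch;
        final double x;
        final double width;

        Glyph(char ch, double x, double width) {
            this.ch = ch;
            this.x = x;
            this.width = width;
        }
    }

    /**
     * Lays text out left to right. "tracking" is extra space added after
     * every character (the "loosen" style). If explicitSpaces is false,
     * a space only advances the position and emits no glyph.
     */
    static List<Glyph> layout(String text, double glyphWidth,
                              double tracking, boolean explicitSpaces) {
        List<Glyph> glyphs = new ArrayList<>();
        double x = 0;
        for (char ch : text.toCharArray()) {
            if (ch != ' ' || explicitSpaces) {
                glyphs.add(new Glyph(ch, x, glyphWidth));
            }
            x += glyphWidth + tracking;
        }
        return glyphs;
    }

    /**
     * Rebuilds a string from glyphs. With enableAutoSpace, a space is
     * guessed whenever the gap to the previous glyph exceeds gapThreshold.
     */
    static String extract(List<Glyph> glyphs, boolean enableAutoSpace,
                          double gapThreshold) {
        StringBuilder out = new StringBuilder();
        Glyph prev = null;
        for (Glyph g : glyphs) {
            // Skip the guess next to an explicit space glyph.
            if (enableAutoSpace && prev != null
                    && prev.ch != ' ' && g.ch != ' ') {
                double gap = g.x - (prev.x + prev.width);
                if (gap > gapThreshold) {
                    out.append(' ');
                }
            }
            out.append(g.ch);
            prev = g;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Loosened text that already contains explicit space glyphs.
        List<Glyph> loosened = layout("some text", 10, 4, true);
        System.out.println(extract(loosened, true, 3));   // s o m e t e x t
        System.out.println(extract(loosened, false, 3));  // some text

        // Normal spacing, but word gaps exist only as positions.
        List<Glyph> positional = layout("some text", 10, 0, false);
        System.out.println(extract(positional, true, 3));  // some text
        System.out.println(extract(positional, false, 3)); // sometext
    }
}
```

In this toy model, turning the guess off fixes the loosened text but
glues words together when the file has no explicit space glyphs, which
is why a per-application switch seems a reasonable stopgap.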

                

        

[jira] [Commented] (TIKA-724) PDF text sometimes has extra space between letters

Posted by "Ravish Bhagdev (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-724?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13151954#comment-13151954 ] 

Ravish Bhagdev commented on TIKA-724:
-------------------------------------

And is there also a way to set this flag in tika.config?
                

        

[jira] [Updated] (TIKA-724) PDF text sometimes has extra space between letters

Posted by "Michael McCandless (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/TIKA-724?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless updated TIKA-724:
------------------------------------

    Attachment: TIKA-724.patch

Patch.
                

        

[jira] [Commented] (TIKA-724) PDF text sometimes has extra space between letters

Posted by "Ravish Bhagdev (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-724?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13151950#comment-13151950 ] 

Ravish Bhagdev commented on TIKA-724:
-------------------------------------

Is there a way to control this flag from Solr?  I would have expected to be able to add something to solrconfig.xml to control it.

As I typed this I realized this might not be the place to ask, so: is there a way to control this from the command line in tika-app?
                

        

[jira] [Commented] (TIKA-724) PDF text sometimes has extra space between letters

Posted by "Ravish Bhagdev (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-724?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13159941#comment-13159941 ] 

Ravish Bhagdev commented on TIKA-724:
-------------------------------------

OK, will open the issue with Solr/Lucene.  Many thanks for your help.
                

        

[jira] [Assigned] (TIKA-724) PDF text sometimes has extra space between letters

Posted by "Michael McCandless (Assigned) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/TIKA-724?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless reassigned TIKA-724:
---------------------------------------

    Assignee: Michael McCandless
    

        

[jira] [Updated] (TIKA-724) PDF text sometimes has extra space between letters

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/TIKA-724?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless updated TIKA-724:
------------------------------------

    Attachment: extraSpaces.pdf


        

[jira] [Resolved] (TIKA-724) PDF text sometimes has extra space between letters

Posted by "Michael McCandless (Resolved) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/TIKA-724?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless resolved TIKA-724.
-------------------------------------

       Resolution: Fixed
    Fix Version/s: 1.0
    