Posted to dev@tika.apache.org by "Matt Sheppard (JIRA)" <ji...@apache.org> on 2012/05/02 06:36:49 UTC

[jira] [Created] (TIKA-911) Converted PDF document contains question marks in place of spaces and inconsistent case

Matt Sheppard created TIKA-911:
----------------------------------

             Summary: Converted PDF document contains question marks in place of spaces and inconsistent case
                 Key: TIKA-911
                 URL: https://issues.apache.org/jira/browse/TIKA-911
             Project: Tika
          Issue Type: Bug
    Affects Versions: 1.1
            Reporter: Matt Sheppard


The PDF document at http://www.grdc.com.au/uploads/documents/Rust%20Biosecurity%20Brochure.pdf, when converted with tika v1.1 using

{code}
$ java -jar tika-app-1.1.jar Rust\ Biosecurity\ Brochure.pdf
{code}

Produces substantially worse output than xpdf's pdftotext program.

Specifically, we see...

Some 'spaces' replaced with question marks

{noformat}
...
<body><div class="page"><p/>
<p>How can I help?
When you're overseas:
• ?wherever?possible,?don't?visit?crops?—?contact?with?
</p>
<p>growing?crops?greatly?increases?the?risk?of?contaminating?
footwear?or?clothing;?
...
{noformat}

and some odd case conversions

{noformat}
<p>stem rust in wheat.  
 (soURce: BRAd collIs)</p>
<p/>
</div>
{noformat}

(The original document seems to contain "SOURCE: BRAD COLLIS" all in upper case.)


To compare that with pdftotext

{code}
$ ./xpdfbin-linux-3.03/bin32/pdftotext -enc UTF-8 -q ~/Rust\ Biosecurity\ Brochure.pdf
{code}

This does not output the question marks, and produces "Source: BRAD COLLIS" at the end there, both of which seem to be improvements. Note that it does, however, produce a number of ^G characters which are not desirable.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

       

[jira] [Commented] (TIKA-911) Converted PDF document contains question marks in place of spaces and inconsistent case

Posted by "Matt Sheppard (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-911?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13266502#comment-13266502 ] 

Matt Sheppard commented on TIKA-911:
------------------------------------

Confirmed that it still occurs for me on a different mac (with freshly downloaded PDF and tika-app-1.1.jar).

{noformat}
mercury:Downloads matt$ system_profiler SPSoftwareDataType
Software:

    System Software Overview:

      System Version: Mac OS X 10.7.3 (11D50d)
      Kernel Version: Darwin 11.3.0
      Boot Volume: Macintosh HD
      Boot Mode: Normal
      Computer Name: Mercury
      User Name: Matthew Sheppard (matt)
      Secure Virtual Memory: Enabled
      64-bit Kernel and Extensions: Yes
      Time since boot: 3 days 1:10

mercury:Downloads matt$ java -version
java version "1.6.0_31"
Java(TM) SE Runtime Environment (build 1.6.0_31-b04-415-11M3635)
Java HotSpot(TM) 64-Bit Server VM (build 20.6-b01-415, mixed mode)
mercury:Downloads matt$ java -jar tika-app-1.1.jar Rust\ Biosecurity\ Brochure.pdf 
<?xml version="1.0" encoding="UTF-8"?><html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta name="xmpTPg:NPages" content="2"/>
<meta name="Creation-Date" content="2008-06-06T02:53:07Z"/>
<meta name="trapped" content="False"/>
<meta name="created" content="Fri Jun 06 12:53:07 EST 2008"/>
<meta name="Content-Length" content="755665"/>
<meta name="Last-Modified" content="2008-06-06T02:53:23Z"/>
<meta name="producer" content="Adobe PDF Library 7.0"/>
<meta name="Content-Type" content="application/pdf"/>
<meta name="resourceName" content="Rust Biosecurity Brochure.pdf"/>
<meta name="creator" content="Adobe InDesign CS2 (4.0.5)"/>
<title/>
</head>
<body><div class="page"><p/>
<p>How can I help?
When you’re overseas:
• �wherever�possible,�don’t�visit�crops�—�contact�with�
</p>
<p>growing�crops�greatly�increases�the�risk�of�contaminating�
footwear�or�clothing;�
...[snip]...
<p>Initial detection  
points of exotic wheat 
rust incursions
</p>
<p>stem rust in wheat.  
 (soURce: BRAd collIs)</p>
<p/>
</div>
</body></html>
{noformat}

Note that the ?s reported appear to display differently on this machine.

Will attach a copy of the output as a file for reference.
                


       

[jira] [Updated] (TIKA-911) Converted PDF document contains question marks in place of spaces and inconsistent case

Posted by "Chris A. Mattmann (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/TIKA-911?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris A. Mattmann updated TIKA-911:
-----------------------------------

    Component/s: parser
    


       

[jira] [Updated] (TIKA-911) Converted PDF document contains question marks in place of spaces and inconsistent case

Posted by "Matt Sheppard (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/TIKA-911?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Matt Sheppard updated TIKA-911:
-------------------------------

    Attachment: Rust Biosecurity Brochure.pdf.html
    


       

[jira] [Commented] (TIKA-911) Converted PDF document contains question marks in place of spaces and inconsistent case

Posted by "Matt Sheppard (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-911?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13266486#comment-13266486 ] 

Matt Sheppard commented on TIKA-911:
------------------------------------

Interesting - I was running Mac OS 10.7.3. Will confirm the version of java when I'm back in the office.
                


       

[jira] [Commented] (TIKA-911) Converted PDF document contains question marks in place of spaces and inconsistent case

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-911?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13266476#comment-13266476 ] 

Michael McCandless commented on TIKA-911:
-----------------------------------------

Hmm, I can't reproduce these issues.

I downloaded the PDF from the URL, downloaded tika-app-1.1.jar, ran java -jar tika-app-1.1.jar Rust\ Biosecurity\ Brochure.pdf, and I don't see the ? for spaces nor the mixed casing.  I'm using Java 1.7.0_04 on Ubuntu 12.04.
                


       

[jira] [Updated] (TIKA-911) Converted PDF document contains question marks in place of spaces and inconsistent case

Posted by "Matt Sheppard (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/TIKA-911?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Matt Sheppard updated TIKA-911:
-------------------------------

    Attachment: Rust Biosecurity Brochure.pdf

Attached the PDF document in case it is removed from the source site.
                


       

[jira] [Commented] (TIKA-911) Converted PDF document contains question marks in place of spaces and inconsistent case

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-911?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13266801#comment-13266801 ] 

Michael McCandless commented on TIKA-911:
-----------------------------------------

So strange ... I tested on a Mac (10.6.8) with Java 1.6.0_31, and I don't see the ? for spaces nor the mixed case.

Hmm, my header has a different content-length than yours:

{noformat}
<meta name="xmpTPg:NPages" content="2"/>
<meta name="Creation-Date" content="2012-05-02T10:25:00Z"/>
<meta name="created" content="Wed May 02 06:25:00 EDT 2012"/>
<meta name="Content-Length" content="639985"/>
<meta name="Last-Modified" content="2012-05-02T10:25:00Z"/>
<meta name="producer" content="Mac OS X 10.6.8 Quartz PDFContext"/>
<meta name="Content-Type" content="application/pdf"/>
<meta name="resourceName" content="Rust Biosecurity Brochure.pdf"/>
<meta name="creator" content="Adobe InDesign CS2 (4.0.5)"/>
{noformat}

OK! If I used the PDF attached to the issue, I indeed see these problems (I had downloaded from the web site).  Maybe the web site has since changed/fixed the PDF?  Hmm.

So, the extra characters (where there should be spaces) are U+FFFD (the Unicode replacement character); Tika outputs this whenever there is a character it can't safely output into the XHTML (this is done in SafeContentHandler.java).  Tika used to (before 0.10) simply replace such characters with space (ASCII 32), so, to get back to pre-0.10 behaviour you can replace U+FFFD with space.
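
A minimal sketch of that workaround, assuming the extracted text is already in hand as a Java String. The class and method names are hypothetical and not part of Tika; this only post-processes the output and does not change what Tika itself emits:

```java
public class ReplacementCharWorkaround {

    // Hypothetical helper (not a Tika API): swaps every U+FFFD
    // replacement character for a plain ASCII space, which roughly
    // mimics the pre-0.10 output described above.
    static String replaceWithSpaces(String text) {
        return text.replace('\uFFFD', ' ');
    }

    public static void main(String[] args) {
        // Sample modelled on the output quoted in this issue.
        String extracted = "growing\uFFFDcrops\uFFFDgreatly";
        System.out.println(replaceWithSpaces(extracted));
    }
}
```

Note that this replaces every U+FFFD, including any that stand in for characters other than spaces, so it may hide other extraction problems.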

Not sure about the mixed case issue...

                
