Posted to dev@tika.apache.org by Staffan <so...@gmail.com> on 2010/11/11 10:14:52 UTC

Single line in extracted PDF contents

Hi,

Current trunk/0.8RC seems to concatenate the PDF body from PDFBox into
one line. Last time I tested trunk, about a month ago, it did not. See
the following command line output:

$> java -jar pdfbox-app-1.3.1.jar ExtractText -console docs/shortpdf.pdf
1   ·   untitled 3   ·   2010-02-13 09:52   ·   Staffan Olsson
PDF Title For Short Document
veryshortpdfcontents

$> java -jar tika-app-0.9-20101110.175016-3.jar docs/shortpdf.pdf
<?xml version="1.0" encoding="UTF-8"?>
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta name="Content-Length" content="27166"/>
<meta name="subject" content="The PDF subject"/>
<meta name="Author" content="The PDF Author"/>
<meta name="Last-Modified" content="2010-02-13T08:52:56Z"/>
<meta name="AAPL:Keywords" content="keywordinsaveaspdf someotherkeyword"/>
<meta name="creator" content="Smultron"/>
<meta name="xmpTPg:NPages" content="1"/>
<meta name="Creation-Date" content="2010-02-13T08:52:56Z"/>
<meta name="created" content="Sat Feb 13 09:52:56 CET 2010"/>
<meta name="producer" content="Mac OS X 10.6.2 Quartz PDFContext"/>
<meta name="Content-Type" content="application/pdf"/>
<meta name="resourceName" content="shortpdf.pdf"/>
<meta name="Keywords" content="keywordinsaveaspdf someotherkeyword"/>
<title>PDF Title For Short
Document</title>
</head>
<body>
<div class="page">
<p>1   ·   untitled 3   ·   2010-02-13 09:52   ·   Staffan OlssonPDF
Title For Short Documentveryshortpdfcontents</p>
</div>
</body>
</html>

$> java -jar tika-app-0.7.jar docs/shortpdf.pdf
<?xml version="1.0" encoding="UTF-8"?>
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title>PDF Title For Short Document</title>
</head>
<body>
<div class="page">
<p>1      untitled 3      2010-02-13 09:52      Staan Olsson
PDF Title For Short Document
veryshortpdfcontents</p>
</div>
</body>
</html>

Should I report a bug?

/Staffan

Re: Single line in extracted PDF contents

Posted by Staffan <so...@gmail.com>.
On Thu, Nov 11, 2010 at 10:14 AM, Staffan <so...@gmail.com> wrote:
> Hi,
>
> Current trunk/0.8RC seems to concatenate the PDF body from PDFBox into
> one line. Last time I tested trunk, about a month ago, it did not. See
> the following command line output:
>
I have now had time to write a unit test and track the regression down
to a specific revision. No fix yet. See
https://issues.apache.org/jira/browse/TIKA-548.
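
For illustration only, and not the test attached to TIKA-548: a check
for this symptom could be sketched in plain Java as below. It assumes
the extracted body text is already available as a string. The class
name LineBreakCheck and the method lineBreaksPreserved are hypothetical
and not part of Tika or PDFBox.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper, not part of Tika or PDFBox.
class LineBreakCheck {

    // Returns true if every expected line appears in the body as a line of
    // its own (after trimming), i.e. it was not fused with a neighbour.
    public static boolean lineBreaksPreserved(String body, List<String> expectedLines) {
        Set<String> actual = new HashSet<>();
        for (String line : body.split("\r?\n")) {
            actual.add(line.trim());
        }
        return actual.containsAll(expectedLines);
    }

    public static void main(String[] args) {
        List<String> expected = Arrays.asList(
                "PDF Title For Short Document",
                "veryshortpdfcontents");

        // Body text shaped like the tika-app-0.7 output: one line per PDF line.
        String separated = "1 untitled 3\n"
                + "PDF Title For Short Document\n"
                + "veryshortpdfcontents";

        // Body text shaped like the trunk output: lines run together.
        String fused = "1 untitled 3 Staffan OlssonPDF Title For Short "
                + "Documentveryshortpdfcontents";

        System.out.println(lineBreaksPreserved(separated, expected)); // true
        System.out.println(lineBreaksPreserved(fused, expected));     // false
    }
}
```

A real test would obtain the body string from the parser output for
docs/shortpdf.pdf instead of the hard-coded samples used here.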

/Staffan