Posted to dev@tika.apache.org by Staffan <so...@gmail.com> on 2010/11/11 10:14:52 UTC
Single line in extracted PDF contents
Hi,
Current trunk/0.8RC seems to concatenate the PDF body from PDFBox into
one line. Last time I tested trunk, about a month ago, it did not. See
the following command line output:
$> java -jar pdfbox-app-1.3.1.jar ExtractText -console docs/shortpdf.pdf
1 · untitled 3 · 2010-02-13 09:52 · Staffan Olsson
PDF Title For Short Document
veryshortpdfcontents
$> java -jar tika-app-0.9-20101110.175016-3.jar docs/shortpdf.pdf
<?xml version="1.0" encoding="UTF-8"?>
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta name="Content-Length" content="27166"/>
<meta name="subject" content="The PDF subject"/>
<meta name="Author" content="The PDF Author"/>
<meta name="Last-Modified" content="2010-02-13T08:52:56Z"/>
<meta name="AAPL:Keywords" content="keywordinsaveaspdf someotherkeyword"/>
<meta name="creator" content="Smultron"/>
<meta name="xmpTPg:NPages" content="1"/>
<meta name="Creation-Date" content="2010-02-13T08:52:56Z"/>
<meta name="created" content="Sat Feb 13 09:52:56 CET 2010"/>
<meta name="producer" content="Mac OS X 10.6.2 Quartz PDFContext"/>
<meta name="Content-Type" content="application/pdf"/>
<meta name="resourceName" content="shortpdf.pdf"/>
<meta name="Keywords" content="keywordinsaveaspdf someotherkeyword"/>
<title>PDF Title For Short
Document</title>
</head>
<body>
<div class="page">
<p>1 · untitled 3 · 2010-02-13 09:52 · Staffan OlssonPDF
Title For Short Documentveryshortpdfcontents</p>
</div>
</body>
</html>
$> java -jar tika-app-0.7.jar docs/shortpdf.pdf
<?xml version="1.0" encoding="UTF-8"?>
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title>PDF Title For Short Document</title>
</head>
<body>
<div class="page">
<p>1 untitled 3 2010-02-13 09:52 Staan Olsson
PDF Title For Short Document
veryshortpdfcontents</p>
</div>
</body>
</html>
Should I report a bug?
/Staffan
Re: Single line in extracted PDF contents
Posted by Staffan <so...@gmail.com>.
On Thu, Nov 11, 2010 at 10:14 AM, Staffan <so...@gmail.com> wrote:
> Hi,
>
> Current trunk/0.8RC seems to concatenate the PDF body from PDFBox into
> one line. Last time I tested trunk, about a month ago, it did not. See
> the following command line output:
>
I have now had time to write a unit test and track the regression to a
specific revision. No solution yet. See
https://issues.apache.org/jira/browse/TIKA-548.
/Staffan
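For anyone who wants to check for the same symptom, here is a minimal sketch of the kind of assertion such a test could make. It is not the test attached to TIKA-548; the class name LineBreakCheck and the method linesAreSeparated are hypothetical, and the sample strings are shortened from the output above. It only verifies that each expected line in the extracted text is followed by whitespace rather than running straight into the next line.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of Tika or PDFBox.
public class LineBreakCheck {

    /**
     * Returns true if every expected line occurs in the extracted text, in order,
     * and each line except the last is immediately followed by a whitespace character.
     */
    public static boolean linesAreSeparated(String extracted, List<String> expectedLines) {
        int pos = 0;
        for (int i = 0; i < expectedLines.size(); i++) {
            String line = expectedLines.get(i);
            int start = extracted.indexOf(line, pos);
            if (start < 0) {
                return false; // line is missing or out of order
            }
            int end = start + line.length();
            boolean isLast = (i == expectedLines.size() - 1);
            if (!isLast) {
                // The character right after this line must be whitespace,
                // otherwise the next line has been concatenated onto it.
                if (end >= extracted.length()
                        || !Character.isWhitespace(extracted.charAt(end))) {
                    return false;
                }
            }
            pos = end;
        }
        return true;
    }

    public static void main(String[] args) {
        List<String> expected = Arrays.asList(
                "Staffan Olsson",
                "PDF Title For Short Document",
                "veryshortpdfcontents");

        // Shaped like the tika-app-0.7 body text: lines kept apart.
        String separated =
                "Staffan Olsson\nPDF Title For Short Document\nveryshortpdfcontents";
        // Shaped like the current trunk body text: lines run together.
        String concatenated =
                "Staffan OlssonPDF Title For Short Documentveryshortpdfcontents";

        System.out.println(linesAreSeparated(separated, expected));    // true
        System.out.println(linesAreSeparated(concatenated, expected)); // false
    }
}
```

In a real test the input string would come from the parser output for shortpdf.pdf instead of the hard-coded samples.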