Posted to user@tika.apache.org by Ganesh <em...@yahoo.co.in> on 2010/12/03 06:26:38 UTC

PDF text extracted without spaces

Hello all,

I am a newbie with Tika and am using the latest version, 0.8. I extracted text from a PDF document but found that spaces and newlines are missing, so indexing the data gives wrong results. Could anyone in this group help me?

Regards
Ganesh
   

Re: PDF text extracted without spaces

Posted by Ganesh <em...@yahoo.co.in>.
Exactly the same issue. The spaces and newlines are not extracted properly.

When could we expect the new release?

Regards
Ganesh 


RE: PDF text extracted without spaces

Posted by Jukka Zitting <jz...@adobe.com>.
Hi,

From: Ganesh [mailto:emailgane@yahoo.co.in]
> I am a newbie with Tika and am using the latest version, 0.8. I extracted
> text from a PDF document but found that spaces and newlines are missing, so
> indexing the data gives wrong results. Could anyone in this group help me?

That's an unfortunate regression that got included in the 0.8 release. See TIKA-548 [1] for the details.

The problem is fixed in the latest 0.9-SNAPSHOT version, and we probably should cut a new release soon with this fix.

[1] https://issues.apache.org/jira/browse/TIKA-548

BR,

Jukka Zitting

Re: PDF text extracted without spaces

Posted by Grant Ingersoll <gs...@apache.org>.
Can you share more about how you are using it? Also, can you show a test case?

-Grant

On Dec 3, 2010, at 12:26 AM, Ganesh wrote:

> Hello all,
> 
> I am a newbie with Tika and am using the latest version, 0.8. I extracted text from a PDF document but found that spaces and newlines are missing, so indexing the data gives wrong results. Could anyone in this group help me?
> 
> Regards
> Ganesh

--------------------------
Grant Ingersoll
http://www.lucidimagination.com