You are viewing a plain text version of this content. The canonical link for it is here.

Posted to java-user@lucene.apache.org by Ganesh <em...@yahoo.co.in> on 2010/12/03 06:35:22 UTC

PDF text extracted without spaces

Hello all,

I know, this is not the right group to ask this question, thought some of you guys might have experienced.  

I newbie with Tika. I am using latest version 0.8 version. I extracted text from PDF document but found spaces and new line missing. Indexing the data gives wrong result. Could any one in this group could help me? I am using tika directly to extract the contents, which later gets indexed.

Regards
Ganesh    
Send free SMS to your Friends on Mobile from your Yahoo! Messenger. Download Now! http://messenger.yahoo.com/download.php

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: PDF text extracted without spaces

Posted by Ganesh <em...@yahoo.co.in>.

Thanks for the link. I downloaded the tool and pdftotext extracts correctly. I also dropped a mail to the tika user group. It is a regression in latest release https://issues.apache.org/jira/browse/TIKA-548. Hopefully coming release will have a fix.

Regards
Ganesh

----- Original Message ----- 
From: "Ralph Seward" <rj...@gmail.com>
To: <ja...@lucene.apache.org>
Sent: Friday, December 03, 2010 8:21 PM
Subject: Re: PDF text extracted without spaces


pdftotext has usually worked quite well for my purposes. More info at
http://www.foolabs.com/xpdf/about.html .

"Xpdf runs under the X Window System on UNIX, VMS, and OS/2. The non-X
components (pdftops, pdftotext, etc.) also run on Win32 systems and should
run on pretty much any system with a decent C++ compiler."

Ralph

On Fri, Dec 3, 2010 at 9:35 AM, Hans Merkl <hm...@rightonpoint.us> wrote:
>
> pdftotext is much better and faster from my experience.
>
>
> On Fri, Dec 3, 2010 at 08:52, Fabiano Nunes <fa...@nunes.me> wrote:
>
> > Have you ever tried other extractor tool than PDFBox? I used to extract
> > contents with pdfbox: its capability of extract contents wasn't a
problem,
> > but its lack of structure information was.
> > You can try poppler-utils (pdftotext) to extract contents with
> > layout structure.
> >
> > Fabiano Nunes
> >
> >
> >
> >
> >
> > On Fri, Dec 3, 2010 at 10:08 AM, Ian Lea <ia...@gmail.com> wrote:
> >
> > > Maybe https://issues.apache.org/jira/browse/TIKA-548 is relevant.
> > > Have you tried asking on the tika mailing list?
> > > http://tika.apache.org/mail-lists.html.
> > >
> > >
> > > --
> > > Ian.
> > >
> > >
> > > On Fri, Dec 3, 2010 at 11:55 AM, Ganesh <em...@yahoo.co.in> wrote:
> > > > I first extract the contents from documents using tika and latter
index
> > > it with Lucene. The problem is the extracted text from PDF using tika
has
> > no
> > > whitespaces.
> > > >
> > > > Regards
> > > > Ganesh
> > > >
> > > >
> > > > ----- Original Message -----
> > > > From: "McGibbney, Lewis John" <Le...@gcu.ac.uk>
> > > > To: <ja...@lucene.apache.org>
> > > > Sent: Friday, December 03, 2010 4:40 PM
> > > > Subject: RE: PDF text extracted without spaces
> > > >
> > > >
> > > >> Hi Ganesh
> > > >>
> > > >> I encountered this same problem last week. I was thinking if it was
> > > possible to include at minimum a WhitespaceAnalyzer somewhere within
Tika
> > > which would solve the problem. I am not sure of how this would be done
as
> > I
> > > am not familiar with Tika codebase.
> > > >>
> > > >> Unfortunately I don't think that the solution to the first part of
> > this
> > > problem lies within the java-user mailing list.
> > > >>
> > > >> When were you sending extracted contents to Lucene... at what later
> > > stage?
> > > >>
> > > >> Thank you
> > > >>
> > > >> Lewis
> > > >>
> > > >> -----Original Message-----
> > > >> From: Ganesh [mailto:emailgane@yahoo.co.in]
> > > >> Sent: 03 December 2010 10:44
> > > >> To: java-user@lucene.apache.org
> > > >> Subject: Re: PDF text extracted without spaces
> > > >>
> > > >> The main problem is i am not getting whitespace and newline char.
This
> > > is happening only for PDF documents.
> > > >>
> > > >> Sample outoput: Someofthedifferencesare but it should be Some of
the
> > > differences are
> > > >>
> > > >> Regards
> > > >> Ganesh
> > > >>
> > > >> ----- Original Message -----
> > > >> From: "Alexander Aristov" <al...@gmail.com>
> > > >> To: <ja...@lucene.apache.org>
> > > >> Sent: Friday, December 03, 2010 2:39 PM
> > > >> Subject: Re: PDF text extracted without spaces
> > > >>
> > > >>
> > > >>> anyway even if you get correct whitespaces and new lines this
won't
> > > affect
> > > >>> indexing.
> > > >>>
> > > >>> Best Regards
> > > >>> Alexander Aristov
> > > >>>
> > > >>>
> > > >>> On 3 December 2010 10:00, Lance Norskog <go...@gmail.com> wrote:
> > > >>>
> > > >>>> The text should come out as a stream of words with space, but
> > without
> > > >>>> any of the formatting in the PDF. Extraction is only good enough
to
> > > >>>> tell you that a word is somewhere inside a PDF file.  Can you
post a
> > > >>>> short bit of the text that it extracted?
> > > >>>>
> > > >>>> Also, you should try this test on different PDF files that were
made
> > > >>>> with different software.
> > > >>>>
> > > >>>> On Thu, Dec 2, 2010 at 9:35 PM, Ganesh <em...@yahoo.co.in>
> > wrote:
> > > >>>> > Hello all,
> > > >>>> >
> > > >>>> > I know, this is not the right group to ask this question,
thought
> > > some of
> > > >>>> you guys might have experienced.
> > > >>>> >
> > > >>>> > I newbie with Tika. I am using latest version 0.8 version. I
> > > extracted
> > > >>>> text from PDF document but found spaces and new line missing.
> > Indexing
> > > the
> > > >>>> data gives wrong result. Could any one in this group could help
me?
> > I
> > > am
> > > >>>> using tika directly to extract the contents, which later gets
> > indexed.
> > > >>>> >
> > > >>>> > Regards
> > > >>>> > Ganesh
> > > >>>> > Send free SMS to your Friends on Mobile from your Yahoo!
> > Messenger.
> > > >>>> Download Now! http://messenger.yahoo.com/download.php
> > > >>>> >
> > > >>>> >
> > > ---------------------------------------------------------------------
> > > >>>> > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> > > >>>> > For additional commands, e-mail:
java-user-help@lucene.apache.org
> > > >>>> >
> > > >>>> >
> > > >>>>
> > > >>>>
> > > >>>>
> > > >>>> --
> > > >>>> Lance Norskog
> > > >>>> goksron@gmail.com
> > > >>>>
> > > >>>>
> > ---------------------------------------------------------------------
> > > >>>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> > > >>>> For additional commands, e-mail: java-user-help@lucene.apache.org
> > > >>>>
> > > >>>>
> > > >>>
> > > >> Send free SMS to your Friends on Mobile from your Yahoo! Messenger.
> > > Download Now! http://messenger.yahoo.com/download.php
> > > >>
> > > >>
---------------------------------------------------------------------
> > > >> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> > > >> For additional commands, e-mail: java-user-help@lucene.apache.org
> > > >>
> > > >> Email has been scanned for viruses by Altman Technologies' email
> > > management service - www.altman.co.uk/emailsystems
> > > >>
> > > >> Glasgow Caledonian University is a registered Scottish charity,
number
> > > SC021474
> > > >>
> > > >> Winner: Times Higher Education’s Widening Participation Initiative
of
> > > the Year 2009 and Herald Society’s Education Initiative of the Year
2009
> > > >>
> > >
> >
http://www.gcu.ac.uk/newsevents/news/bycategory/theuniversity/1/name,6219,en.html
> > > >>
> > > > Send free SMS to your Friends on Mobile from your Yahoo! Messenger.
> > > Download Now! http://messenger.yahoo.com/download.php
> > > >
> > > >
---------------------------------------------------------------------
> > > > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> > > > For additional commands, e-mail: java-user-help@lucene.apache.org
> > > >
> > > >
> > >
> > > ---------------------------------------------------------------------
> > > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> > > For additional commands, e-mail: java-user-help@lucene.apache.org
> > >
> > >
> >
>
>
>
> --
>
> Hans Merkl
> Right On Point, LLC
> 215 Victor Parkway, Suite E
> Annapolis, MD 21403
>
> Phone: (443) 951-4324
> E-mail: hmerkl@rightonpoint.us

Send free SMS to your Friends on Mobile from your Yahoo! Messenger. Download Now! http://messenger.yahoo.com/download.php

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: PDF text extracted without spaces

Posted by Ralph Seward <rj...@gmail.com>.

pdftotext has usually worked quite well for my purposes. More info at
http://www.foolabs.com/xpdf/about.html .

"Xpdf runs under the X Window System on UNIX, VMS, and OS/2. The non-X
components (pdftops, pdftotext, etc.) also run on Win32 systems and should
run on pretty much any system with a decent C++ compiler."

Ralph

On Fri, Dec 3, 2010 at 9:35 AM, Hans Merkl <hm...@rightonpoint.us> wrote:
>
> pdftotext is much better and faster from my experience.
>
>
> On Fri, Dec 3, 2010 at 08:52, Fabiano Nunes <fa...@nunes.me> wrote:
>
> > Have you ever tried other extractor tool than PDFBox? I used to extract
> > contents with pdfbox: its capability of extract contents wasn't a
problem,
> > but its lack of structure information was.
> > You can try poppler-utils (pdftotext) to extract contents with
> > layout structure.
> >
> > Fabiano Nunes
> >
> >
> >
> >
> >
> > On Fri, Dec 3, 2010 at 10:08 AM, Ian Lea <ia...@gmail.com> wrote:
> >
> > > Maybe https://issues.apache.org/jira/browse/TIKA-548 is relevant.
> > > Have you tried asking on the tika mailing list?
> > > http://tika.apache.org/mail-lists.html.
> > >
> > >
> > > --
> > > Ian.
> > >
> > >
> > > On Fri, Dec 3, 2010 at 11:55 AM, Ganesh <em...@yahoo.co.in> wrote:
> > > > I first extract the contents from documents using tika and latter
index
> > > it with Lucene. The problem is the extracted text from PDF using tika
has
> > no
> > > whitespaces.
> > > >
> > > > Regards
> > > > Ganesh
> > > >
> > > >
> > > > ----- Original Message -----
> > > > From: "McGibbney, Lewis John" <Le...@gcu.ac.uk>
> > > > To: <ja...@lucene.apache.org>
> > > > Sent: Friday, December 03, 2010 4:40 PM
> > > > Subject: RE: PDF text extracted without spaces
> > > >
> > > >
> > > >> Hi Ganesh
> > > >>
> > > >> I encountered this same problem last week. I was thinking if it was
> > > possible to include at minimum a WhitespaceAnalyzer somewhere within
Tika
> > > which would solve the problem. I am not sure of how this would be done
as
> > I
> > > am not familiar with Tika codebase.
> > > >>
> > > >> Unfortunately I don't think that the solution to the first part of
> > this
> > > problem lies within the java-user mailing list.
> > > >>
> > > >> When were you sending extracted contents to Lucene... at what later
> > > stage?
> > > >>
> > > >> Thank you
> > > >>
> > > >> Lewis
> > > >>
> > > >> -----Original Message-----
> > > >> From: Ganesh [mailto:emailgane@yahoo.co.in]
> > > >> Sent: 03 December 2010 10:44
> > > >> To: java-user@lucene.apache.org
> > > >> Subject: Re: PDF text extracted without spaces
> > > >>
> > > >> The main problem is i am not getting whitespace and newline char.
This
> > > is happening only for PDF documents.
> > > >>
> > > >> Sample outoput: Someofthedifferencesare but it should be Some of
the
> > > differences are
> > > >>
> > > >> Regards
> > > >> Ganesh
> > > >>
> > > >> ----- Original Message -----
> > > >> From: "Alexander Aristov" <al...@gmail.com>
> > > >> To: <ja...@lucene.apache.org>
> > > >> Sent: Friday, December 03, 2010 2:39 PM
> > > >> Subject: Re: PDF text extracted without spaces
> > > >>
> > > >>
> > > >>> anyway even if you get correct whitespaces and new lines this
won't
> > > affect
> > > >>> indexing.
> > > >>>
> > > >>> Best Regards
> > > >>> Alexander Aristov
> > > >>>
> > > >>>
> > > >>> On 3 December 2010 10:00, Lance Norskog <go...@gmail.com> wrote:
> > > >>>
> > > >>>> The text should come out as a stream of words with space, but
> > without
> > > >>>> any of the formatting in the PDF. Extraction is only good enough
to
> > > >>>> tell you that a word is somewhere inside a PDF file.  Can you
post a
> > > >>>> short bit of the text that it extracted?
> > > >>>>
> > > >>>> Also, you should try this test on different PDF files that were
made
> > > >>>> with different software.
> > > >>>>
> > > >>>> On Thu, Dec 2, 2010 at 9:35 PM, Ganesh <em...@yahoo.co.in>
> > wrote:
> > > >>>> > Hello all,
> > > >>>> >
> > > >>>> > I know, this is not the right group to ask this question,
thought
> > > some of
> > > >>>> you guys might have experienced.
> > > >>>> >
> > > >>>> > I newbie with Tika. I am using latest version 0.8 version. I
> > > extracted
> > > >>>> text from PDF document but found spaces and new line missing.
> > Indexing
> > > the
> > > >>>> data gives wrong result. Could any one in this group could help
me?
> > I
> > > am
> > > >>>> using tika directly to extract the contents, which later gets
> > indexed.
> > > >>>> >
> > > >>>> > Regards
> > > >>>> > Ganesh
> > > >>>> > Send free SMS to your Friends on Mobile from your Yahoo!
> > Messenger.
> > > >>>> Download Now! http://messenger.yahoo.com/download.php
> > > >>>> >
> > > >>>> >
> > > ---------------------------------------------------------------------
> > > >>>> > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> > > >>>> > For additional commands, e-mail:
java-user-help@lucene.apache.org
> > > >>>> >
> > > >>>> >
> > > >>>>
> > > >>>>
> > > >>>>
> > > >>>> --
> > > >>>> Lance Norskog
> > > >>>> goksron@gmail.com
> > > >>>>
> > > >>>>
> > ---------------------------------------------------------------------
> > > >>>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> > > >>>> For additional commands, e-mail: java-user-help@lucene.apache.org
> > > >>>>
> > > >>>>
> > > >>>
> > > >> Send free SMS to your Friends on Mobile from your Yahoo! Messenger.
> > > Download Now! http://messenger.yahoo.com/download.php
> > > >>
> > > >>
---------------------------------------------------------------------
> > > >> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> > > >> For additional commands, e-mail: java-user-help@lucene.apache.org
> > > >>
> > > >> Email has been scanned for viruses by Altman Technologies' email
> > > management service - www.altman.co.uk/emailsystems
> > > >>
> > > >> Glasgow Caledonian University is a registered Scottish charity,
number
> > > SC021474
> > > >>
> > > >> Winner: Times Higher Education’s Widening Participation Initiative
of
> > > the Year 2009 and Herald Society’s Education Initiative of the Year
2009
> > > >>
> > >
> >
http://www.gcu.ac.uk/newsevents/news/bycategory/theuniversity/1/name,6219,en.html
> > > >>
> > > > Send free SMS to your Friends on Mobile from your Yahoo! Messenger.
> > > Download Now! http://messenger.yahoo.com/download.php
> > > >
> > > >
---------------------------------------------------------------------
> > > > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> > > > For additional commands, e-mail: java-user-help@lucene.apache.org
> > > >
> > > >
> > >
> > > ---------------------------------------------------------------------
> > > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> > > For additional commands, e-mail: java-user-help@lucene.apache.org
> > >
> > >
> >
>
>
>
> --
>
> Hans Merkl
> Right On Point, LLC
> 215 Victor Parkway, Suite E
> Annapolis, MD 21403
>
> Phone: (443) 951-4324
> E-mail: hmerkl@rightonpoint.us

Re: PDF text extracted without spaces

Posted by Hans Merkl <hm...@rightonpoint.us>.

pdftotext is much better and faster from my experience.


On Fri, Dec 3, 2010 at 08:52, Fabiano Nunes <fa...@nunes.me> wrote:

> Have you ever tried other extractor tool than PDFBox? I used to extract
> contents with pdfbox: its capability of extract contents wasn't a problem,
> but its lack of structure information was.
> You can try poppler-utils (pdftotext) to extract contents with
> layout structure.
>
> Fabiano Nunes
>
>
>
>
>
> On Fri, Dec 3, 2010 at 10:08 AM, Ian Lea <ia...@gmail.com> wrote:
>
> > Maybe https://issues.apache.org/jira/browse/TIKA-548 is relevant.
> > Have you tried asking on the tika mailing list?
> > http://tika.apache.org/mail-lists.html.
> >
> >
> > --
> > Ian.
> >
> >
> > On Fri, Dec 3, 2010 at 11:55 AM, Ganesh <em...@yahoo.co.in> wrote:
> > > I first extract the contents from documents using tika and latter index
> > it with Lucene. The problem is the extracted text from PDF using tika has
> no
> > whitespaces.
> > >
> > > Regards
> > > Ganesh
> > >
> > >
> > > ----- Original Message -----
> > > From: "McGibbney, Lewis John" <Le...@gcu.ac.uk>
> > > To: <ja...@lucene.apache.org>
> > > Sent: Friday, December 03, 2010 4:40 PM
> > > Subject: RE: PDF text extracted without spaces
> > >
> > >
> > >> Hi Ganesh
> > >>
> > >> I encountered this same problem last week. I was thinking if it was
> > possible to include at minimum a WhitespaceAnalyzer somewhere within Tika
> > which would solve the problem. I am not sure of how this would be done as
> I
> > am not familiar with Tika codebase.
> > >>
> > >> Unfortunately I don't think that the solution to the first part of
> this
> > problem lies within the java-user mailing list.
> > >>
> > >> When were you sending extracted contents to Lucene... at what later
> > stage?
> > >>
> > >> Thank you
> > >>
> > >> Lewis
> > >>
> > >> -----Original Message-----
> > >> From: Ganesh [mailto:emailgane@yahoo.co.in]
> > >> Sent: 03 December 2010 10:44
> > >> To: java-user@lucene.apache.org
> > >> Subject: Re: PDF text extracted without spaces
> > >>
> > >> The main problem is i am not getting whitespace and newline char. This
> > is happening only for PDF documents.
> > >>
> > >> Sample outoput: Someofthedifferencesare but it should be Some of the
> > differences are
> > >>
> > >> Regards
> > >> Ganesh
> > >>
> > >> ----- Original Message -----
> > >> From: "Alexander Aristov" <al...@gmail.com>
> > >> To: <ja...@lucene.apache.org>
> > >> Sent: Friday, December 03, 2010 2:39 PM
> > >> Subject: Re: PDF text extracted without spaces
> > >>
> > >>
> > >>> anyway even if you get correct whitespaces and new lines this won't
> > affect
> > >>> indexing.
> > >>>
> > >>> Best Regards
> > >>> Alexander Aristov
> > >>>
> > >>>
> > >>> On 3 December 2010 10:00, Lance Norskog <go...@gmail.com> wrote:
> > >>>
> > >>>> The text should come out as a stream of words with space, but
> without
> > >>>> any of the formatting in the PDF. Extraction is only good enough to
> > >>>> tell you that a word is somewhere inside a PDF file.  Can you post a
> > >>>> short bit of the text that it extracted?
> > >>>>
> > >>>> Also, you should try this test on different PDF files that were made
> > >>>> with different software.
> > >>>>
> > >>>> On Thu, Dec 2, 2010 at 9:35 PM, Ganesh <em...@yahoo.co.in>
> wrote:
> > >>>> > Hello all,
> > >>>> >
> > >>>> > I know, this is not the right group to ask this question, thought
> > some of
> > >>>> you guys might have experienced.
> > >>>> >
> > >>>> > I newbie with Tika. I am using latest version 0.8 version. I
> > extracted
> > >>>> text from PDF document but found spaces and new line missing.
> Indexing
> > the
> > >>>> data gives wrong result. Could any one in this group could help me?
> I
> > am
> > >>>> using tika directly to extract the contents, which later gets
> indexed.
> > >>>> >
> > >>>> > Regards
> > >>>> > Ganesh
> > >>>> > Send free SMS to your Friends on Mobile from your Yahoo!
> Messenger.
> > >>>> Download Now! http://messenger.yahoo.com/download.php
> > >>>> >
> > >>>> >
> > ---------------------------------------------------------------------
> > >>>> > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> > >>>> > For additional commands, e-mail: java-user-help@lucene.apache.org
> > >>>> >
> > >>>> >
> > >>>>
> > >>>>
> > >>>>
> > >>>> --
> > >>>> Lance Norskog
> > >>>> goksron@gmail.com
> > >>>>
> > >>>>
> ---------------------------------------------------------------------
> > >>>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> > >>>> For additional commands, e-mail: java-user-help@lucene.apache.org
> > >>>>
> > >>>>
> > >>>
> > >> Send free SMS to your Friends on Mobile from your Yahoo! Messenger.
> > Download Now! http://messenger.yahoo.com/download.php
> > >>
> > >> ---------------------------------------------------------------------
> > >> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> > >> For additional commands, e-mail: java-user-help@lucene.apache.org
> > >>
> > >> Email has been scanned for viruses by Altman Technologies' email
> > management service - www.altman.co.uk/emailsystems
> > >>
> > >> Glasgow Caledonian University is a registered Scottish charity, number
> > SC021474
> > >>
> > >> Winner: Times Higher Education’s Widening Participation Initiative of
> > the Year 2009 and Herald Society’s Education Initiative of the Year 2009
> > >>
> >
> http://www.gcu.ac.uk/newsevents/news/bycategory/theuniversity/1/name,6219,en.html
> > >>
> > > Send free SMS to your Friends on Mobile from your Yahoo! Messenger.
> > Download Now! http://messenger.yahoo.com/download.php
> > >
> > > ---------------------------------------------------------------------
> > > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> > > For additional commands, e-mail: java-user-help@lucene.apache.org
> > >
> > >
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> > For additional commands, e-mail: java-user-help@lucene.apache.org
> >
> >
>



-- 

Hans Merkl
Right On Point, LLC
215 Victor Parkway, Suite E
Annapolis, MD 21403

Phone: (443) 951-4324
E-mail: hmerkl@rightonpoint.us

Re: PDF text extracted without spaces

Posted by Fabiano Nunes <fa...@nunes.me>.

Have you ever tried other extractor tool than PDFBox? I used to extract
contents with pdfbox: its capability of extract contents wasn't a problem,
but its lack of structure information was.
You can try poppler-utils (pdftotext) to extract contents with
layout structure.

Fabiano Nunes





On Fri, Dec 3, 2010 at 10:08 AM, Ian Lea <ia...@gmail.com> wrote:

> Maybe https://issues.apache.org/jira/browse/TIKA-548 is relevant.
> Have you tried asking on the tika mailing list?
> http://tika.apache.org/mail-lists.html.
>
>
> --
> Ian.
>
>
> On Fri, Dec 3, 2010 at 11:55 AM, Ganesh <em...@yahoo.co.in> wrote:
> > I first extract the contents from documents using tika and latter index
> it with Lucene. The problem is the extracted text from PDF using tika has no
> whitespaces.
> >
> > Regards
> > Ganesh
> >
> >
> > ----- Original Message -----
> > From: "McGibbney, Lewis John" <Le...@gcu.ac.uk>
> > To: <ja...@lucene.apache.org>
> > Sent: Friday, December 03, 2010 4:40 PM
> > Subject: RE: PDF text extracted without spaces
> >
> >
> >> Hi Ganesh
> >>
> >> I encountered this same problem last week. I was thinking if it was
> possible to include at minimum a WhitespaceAnalyzer somewhere within Tika
> which would solve the problem. I am not sure of how this would be done as I
> am not familiar with Tika codebase.
> >>
> >> Unfortunately I don't think that the solution to the first part of this
> problem lies within the java-user mailing list.
> >>
> >> When were you sending extracted contents to Lucene... at what later
> stage?
> >>
> >> Thank you
> >>
> >> Lewis
> >>
> >> -----Original Message-----
> >> From: Ganesh [mailto:emailgane@yahoo.co.in]
> >> Sent: 03 December 2010 10:44
> >> To: java-user@lucene.apache.org
> >> Subject: Re: PDF text extracted without spaces
> >>
> >> The main problem is i am not getting whitespace and newline char. This
> is happening only for PDF documents.
> >>
> >> Sample outoput: Someofthedifferencesare but it should be Some of the
> differences are
> >>
> >> Regards
> >> Ganesh
> >>
> >> ----- Original Message -----
> >> From: "Alexander Aristov" <al...@gmail.com>
> >> To: <ja...@lucene.apache.org>
> >> Sent: Friday, December 03, 2010 2:39 PM
> >> Subject: Re: PDF text extracted without spaces
> >>
> >>
> >>> anyway even if you get correct whitespaces and new lines this won't
> affect
> >>> indexing.
> >>>
> >>> Best Regards
> >>> Alexander Aristov
> >>>
> >>>
> >>> On 3 December 2010 10:00, Lance Norskog <go...@gmail.com> wrote:
> >>>
> >>>> The text should come out as a stream of words with space, but without
> >>>> any of the formatting in the PDF. Extraction is only good enough to
> >>>> tell you that a word is somewhere inside a PDF file.  Can you post a
> >>>> short bit of the text that it extracted?
> >>>>
> >>>> Also, you should try this test on different PDF files that were made
> >>>> with different software.
> >>>>
> >>>> On Thu, Dec 2, 2010 at 9:35 PM, Ganesh <em...@yahoo.co.in> wrote:
> >>>> > Hello all,
> >>>> >
> >>>> > I know, this is not the right group to ask this question, thought
> some of
> >>>> you guys might have experienced.
> >>>> >
> >>>> > I newbie with Tika. I am using latest version 0.8 version. I
> extracted
> >>>> text from PDF document but found spaces and new line missing. Indexing
> the
> >>>> data gives wrong result. Could any one in this group could help me? I
> am
> >>>> using tika directly to extract the contents, which later gets indexed.
> >>>> >
> >>>> > Regards
> >>>> > Ganesh
> >>>> > Send free SMS to your Friends on Mobile from your Yahoo! Messenger.
> >>>> Download Now! http://messenger.yahoo.com/download.php
> >>>> >
> >>>> >
> ---------------------------------------------------------------------
> >>>> > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> >>>> > For additional commands, e-mail: java-user-help@lucene.apache.org
> >>>> >
> >>>> >
> >>>>
> >>>>
> >>>>
> >>>> --
> >>>> Lance Norskog
> >>>> goksron@gmail.com
> >>>>
> >>>> ---------------------------------------------------------------------
> >>>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> >>>> For additional commands, e-mail: java-user-help@lucene.apache.org
> >>>>
> >>>>
> >>>
> >> Send free SMS to your Friends on Mobile from your Yahoo! Messenger.
> Download Now! http://messenger.yahoo.com/download.php
> >>
> >> ---------------------------------------------------------------------
> >> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> >> For additional commands, e-mail: java-user-help@lucene.apache.org
> >>
> >> Email has been scanned for viruses by Altman Technologies' email
> management service - www.altman.co.uk/emailsystems
> >>
> >> Glasgow Caledonian University is a registered Scottish charity, number
> SC021474
> >>
> >> Winner: Times Higher Education’s Widening Participation Initiative of
> the Year 2009 and Herald Society’s Education Initiative of the Year 2009
> >>
> http://www.gcu.ac.uk/newsevents/news/bycategory/theuniversity/1/name,6219,en.html
> >>
> > Send free SMS to your Friends on Mobile from your Yahoo! Messenger.
> Download Now! http://messenger.yahoo.com/download.php
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> > For additional commands, e-mail: java-user-help@lucene.apache.org
> >
> >
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>

Re: PDF text extracted without spaces

Posted by Ian Lea <ia...@gmail.com>.

Maybe https://issues.apache.org/jira/browse/TIKA-548 is relevant.
Have you tried asking on the tika mailing list?
http://tika.apache.org/mail-lists.html.


--
Ian.


On Fri, Dec 3, 2010 at 11:55 AM, Ganesh <em...@yahoo.co.in> wrote:
> I first extract the contents from documents using tika and latter index it with Lucene. The problem is the extracted text from PDF using tika has no whitespaces.
>
> Regards
> Ganesh
>
>
> ----- Original Message -----
> From: "McGibbney, Lewis John" <Le...@gcu.ac.uk>
> To: <ja...@lucene.apache.org>
> Sent: Friday, December 03, 2010 4:40 PM
> Subject: RE: PDF text extracted without spaces
>
>
>> Hi Ganesh
>>
>> I encountered this same problem last week. I was thinking if it was possible to include at minimum a WhitespaceAnalyzer somewhere within Tika which would solve the problem. I am not sure of how this would be done as I am not familiar with Tika codebase.
>>
>> Unfortunately I don't think that the solution to the first part of this problem lies within the java-user mailing list.
>>
>> When were you sending extracted contents to Lucene... at what later stage?
>>
>> Thank you
>>
>> Lewis
>>
>> -----Original Message-----
>> From: Ganesh [mailto:emailgane@yahoo.co.in]
>> Sent: 03 December 2010 10:44
>> To: java-user@lucene.apache.org
>> Subject: Re: PDF text extracted without spaces
>>
>> The main problem is i am not getting whitespace and newline char. This is happening only for PDF documents.
>>
>> Sample outoput: Someofthedifferencesare but it should be Some of the differences are
>>
>> Regards
>> Ganesh
>>
>> ----- Original Message -----
>> From: "Alexander Aristov" <al...@gmail.com>
>> To: <ja...@lucene.apache.org>
>> Sent: Friday, December 03, 2010 2:39 PM
>> Subject: Re: PDF text extracted without spaces
>>
>>
>>> anyway even if you get correct whitespaces and new lines this won't affect
>>> indexing.
>>>
>>> Best Regards
>>> Alexander Aristov
>>>
>>>
>>> On 3 December 2010 10:00, Lance Norskog <go...@gmail.com> wrote:
>>>
>>>> The text should come out as a stream of words with space, but without
>>>> any of the formatting in the PDF. Extraction is only good enough to
>>>> tell you that a word is somewhere inside a PDF file.  Can you post a
>>>> short bit of the text that it extracted?
>>>>
>>>> Also, you should try this test on different PDF files that were made
>>>> with different software.
>>>>
>>>> On Thu, Dec 2, 2010 at 9:35 PM, Ganesh <em...@yahoo.co.in> wrote:
>>>> > Hello all,
>>>> >
>>>> > I know, this is not the right group to ask this question, thought some of
>>>> you guys might have experienced.
>>>> >
>>>> > I newbie with Tika. I am using latest version 0.8 version. I extracted
>>>> text from PDF document but found spaces and new line missing. Indexing the
>>>> data gives wrong result. Could any one in this group could help me? I am
>>>> using tika directly to extract the contents, which later gets indexed.
>>>> >
>>>> > Regards
>>>> > Ganesh
>>>> > Send free SMS to your Friends on Mobile from your Yahoo! Messenger.
>>>> Download Now! http://messenger.yahoo.com/download.php
>>>> >
>>>> > ---------------------------------------------------------------------
>>>> > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>>>> > For additional commands, e-mail: java-user-help@lucene.apache.org
>>>> >
>>>> >
>>>>
>>>>
>>>>
>>>> --
>>>> Lance Norskog
>>>> goksron@gmail.com
>>>>
>>>> ---------------------------------------------------------------------
>>>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>>>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>>>
>>>>
>>>
>> Send free SMS to your Friends on Mobile from your Yahoo! Messenger. Download Now! http://messenger.yahoo.com/download.php
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>
>> Email has been scanned for viruses by Altman Technologies' email management service - www.altman.co.uk/emailsystems
>>
>> Glasgow Caledonian University is a registered Scottish charity, number SC021474
>>
>> Winner: Times Higher Education’s Widening Participation Initiative of the Year 2009 and Herald Society’s Education Initiative of the Year 2009
>> http://www.gcu.ac.uk/newsevents/news/bycategory/theuniversity/1/name,6219,en.html
>>
> Send free SMS to your Friends on Mobile from your Yahoo! Messenger. Download Now! http://messenger.yahoo.com/download.php
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: PDF text extracted without spaces

Posted by Ganesh <em...@yahoo.co.in>.

I first extract the contents from documents using tika and latter index it with Lucene. The problem is the extracted text from PDF using tika has no whitespaces. 

Regards
Ganesh


----- Original Message ----- 
From: "McGibbney, Lewis John" <Le...@gcu.ac.uk>
To: <ja...@lucene.apache.org>
Sent: Friday, December 03, 2010 4:40 PM
Subject: RE: PDF text extracted without spaces


> Hi Ganesh
> 
> I encountered this same problem last week. I was thinking if it was possible to include at minimum a WhitespaceAnalyzer somewhere within Tika which would solve the problem. I am not sure of how this would be done as I am not familiar with Tika codebase.
> 
> Unfortunately I don't think that the solution to the first part of this problem lies within the java-user mailing list.
> 
> When were you sending extracted contents to Lucene... at what later stage?
> 
> Thank you
> 
> Lewis
> 
> -----Original Message-----
> From: Ganesh [mailto:emailgane@yahoo.co.in]
> Sent: 03 December 2010 10:44
> To: java-user@lucene.apache.org
> Subject: Re: PDF text extracted without spaces
> 
> The main problem is i am not getting whitespace and newline char. This is happening only for PDF documents.
> 
> Sample outoput: Someofthedifferencesare but it should be Some of the differences are
> 
> Regards
> Ganesh
> 
> ----- Original Message -----
> From: "Alexander Aristov" <al...@gmail.com>
> To: <ja...@lucene.apache.org>
> Sent: Friday, December 03, 2010 2:39 PM
> Subject: Re: PDF text extracted without spaces
> 
> 
>> anyway even if you get correct whitespaces and new lines this won't affect
>> indexing.
>>
>> Best Regards
>> Alexander Aristov
>>
>>
>> On 3 December 2010 10:00, Lance Norskog <go...@gmail.com> wrote:
>>
>>> The text should come out as a stream of words with space, but without
>>> any of the formatting in the PDF. Extraction is only good enough to
>>> tell you that a word is somewhere inside a PDF file.  Can you post a
>>> short bit of the text that it extracted?
>>>
>>> Also, you should try this test on different PDF files that were made
>>> with different software.
>>>
>>> On Thu, Dec 2, 2010 at 9:35 PM, Ganesh <em...@yahoo.co.in> wrote:
>>> > Hello all,
>>> >
>>> > I know, this is not the right group to ask this question, thought some of
>>> you guys might have experienced.
>>> >
>>> > I newbie with Tika. I am using latest version 0.8 version. I extracted
>>> text from PDF document but found spaces and new line missing. Indexing the
>>> data gives wrong result. Could any one in this group could help me? I am
>>> using tika directly to extract the contents, which later gets indexed.
>>> >
>>> > Regards
>>> > Ganesh
>>> > Send free SMS to your Friends on Mobile from your Yahoo! Messenger.
>>> Download Now! http://messenger.yahoo.com/download.php
>>> >
>>> > ---------------------------------------------------------------------
>>> > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>>> > For additional commands, e-mail: java-user-help@lucene.apache.org
>>> >
>>> >
>>>
>>>
>>>
>>> --
>>> Lance Norskog
>>> goksron@gmail.com
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>>
>>>
>>
> Send free SMS to your Friends on Mobile from your Yahoo! Messenger. Download Now! http://messenger.yahoo.com/download.php
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
> 
> Email has been scanned for viruses by Altman Technologies' email management service - www.altman.co.uk/emailsystems
> 
> Glasgow Caledonian University is a registered Scottish charity, number SC021474
> 
> Winner: Times Higher Education’s Widening Participation Initiative of the Year 2009 and Herald Society’s Education Initiative of the Year 2009
> http://www.gcu.ac.uk/newsevents/news/bycategory/theuniversity/1/name,6219,en.html
>
Send free SMS to your Friends on Mobile from your Yahoo! Messenger. Download Now! http://messenger.yahoo.com/download.php

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

RE: PDF text extracted without spaces

Posted by "McGibbney, Lewis John" <Le...@gcu.ac.uk>.

Hi Ganesh

I encountered this same problem last week. I was thinking if it was possible to include at minimum a WhitespaceAnalyzer somewhere within Tika which would solve the problem. I am not sure of how this would be done as I am not familiar with Tika codebase.

Unfortunately I don't think that the solution to the first part of this problem lies within the java-user mailing list.

When were you sending extracted contents to Lucene... at what later stage?

Thank you

Lewis

-----Original Message-----
From: Ganesh [mailto:emailgane@yahoo.co.in]
Sent: 03 December 2010 10:44
To: java-user@lucene.apache.org
Subject: Re: PDF text extracted without spaces

The main problem is i am not getting whitespace and newline char. This is happening only for PDF documents.

Sample outoput: Someofthedifferencesare but it should be Some of the differences are

Regards
Ganesh

----- Original Message -----
From: "Alexander Aristov" <al...@gmail.com>
To: <ja...@lucene.apache.org>
Sent: Friday, December 03, 2010 2:39 PM
Subject: Re: PDF text extracted without spaces


> anyway even if you get correct whitespaces and new lines this won't affect
> indexing.
>
> Best Regards
> Alexander Aristov
>
>
> On 3 December 2010 10:00, Lance Norskog <go...@gmail.com> wrote:
>
>> The text should come out as a stream of words with space, but without
>> any of the formatting in the PDF. Extraction is only good enough to
>> tell you that a word is somewhere inside a PDF file.  Can you post a
>> short bit of the text that it extracted?
>>
>> Also, you should try this test on different PDF files that were made
>> with different software.
>>
>> On Thu, Dec 2, 2010 at 9:35 PM, Ganesh <em...@yahoo.co.in> wrote:
>> > Hello all,
>> >
>> > I know, this is not the right group to ask this question, thought some of
>> you guys might have experienced.
>> >
>> > I newbie with Tika. I am using latest version 0.8 version. I extracted
>> text from PDF document but found spaces and new line missing. Indexing the
>> data gives wrong result. Could any one in this group could help me? I am
>> using tika directly to extract the contents, which later gets indexed.
>> >
>> > Regards
>> > Ganesh
>> > Send free SMS to your Friends on Mobile from your Yahoo! Messenger.
>> Download Now! http://messenger.yahoo.com/download.php
>> >
>> > ---------------------------------------------------------------------
>> > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>> > For additional commands, e-mail: java-user-help@lucene.apache.org
>> >
>> >
>>
>>
>>
>> --
>> Lance Norskog
>> goksron@gmail.com
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>
>>
>
Send free SMS to your Friends on Mobile from your Yahoo! Messenger. Download Now! http://messenger.yahoo.com/download.php

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Email has been scanned for viruses by Altman Technologies' email management service - www.altman.co.uk/emailsystems

Glasgow Caledonian University is a registered Scottish charity, number SC021474

Winner: Times Higher Education’s Widening Participation Initiative of the Year 2009 and Herald Society’s Education Initiative of the Year 2009
http://www.gcu.ac.uk/newsevents/news/bycategory/theuniversity/1/name,6219,en.html

Re: PDF text extracted without spaces

Posted by Ganesh <em...@yahoo.co.in>.

The main problem is i am not getting whitespace and newline char. This is happening only for PDF documents. 

Sample outoput: Someofthedifferencesare but it should be Some of the differences are

Regards
Ganesh

----- Original Message ----- 
From: "Alexander Aristov" <al...@gmail.com>
To: <ja...@lucene.apache.org>
Sent: Friday, December 03, 2010 2:39 PM
Subject: Re: PDF text extracted without spaces


> anyway even if you get correct whitespaces and new lines this won't affect
> indexing.
> 
> Best Regards
> Alexander Aristov
> 
> 
> On 3 December 2010 10:00, Lance Norskog <go...@gmail.com> wrote:
> 
>> The text should come out as a stream of words with space, but without
>> any of the formatting in the PDF. Extraction is only good enough to
>> tell you that a word is somewhere inside a PDF file.  Can you post a
>> short bit of the text that it extracted?
>>
>> Also, you should try this test on different PDF files that were made
>> with different software.
>>
>> On Thu, Dec 2, 2010 at 9:35 PM, Ganesh <em...@yahoo.co.in> wrote:
>> > Hello all,
>> >
>> > I know, this is not the right group to ask this question, thought some of
>> you guys might have experienced.
>> >
>> > I newbie with Tika. I am using latest version 0.8 version. I extracted
>> text from PDF document but found spaces and new line missing. Indexing the
>> data gives wrong result. Could any one in this group could help me? I am
>> using tika directly to extract the contents, which later gets indexed.
>> >
>> > Regards
>> > Ganesh
>> > Send free SMS to your Friends on Mobile from your Yahoo! Messenger.
>> Download Now! http://messenger.yahoo.com/download.php
>> >
>> > ---------------------------------------------------------------------
>> > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>> > For additional commands, e-mail: java-user-help@lucene.apache.org
>> >
>> >
>>
>>
>>
>> --
>> Lance Norskog
>> goksron@gmail.com
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>
>>
>
Send free SMS to your Friends on Mobile from your Yahoo! Messenger. Download Now! http://messenger.yahoo.com/download.php

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: PDF text extracted without spaces

Posted by Alexander Aristov <al...@gmail.com>.

anyway even if you get correct whitespaces and new lines this won't affect
indexing.

Best Regards
Alexander Aristov


On 3 December 2010 10:00, Lance Norskog <go...@gmail.com> wrote:

> The text should come out as a stream of words with space, but without
> any of the formatting in the PDF. Extraction is only good enough to
> tell you that a word is somewhere inside a PDF file.  Can you post a
> short bit of the text that it extracted?
>
> Also, you should try this test on different PDF files that were made
> with different software.
>
> On Thu, Dec 2, 2010 at 9:35 PM, Ganesh <em...@yahoo.co.in> wrote:
> > Hello all,
> >
> > I know, this is not the right group to ask this question, thought some of
> you guys might have experienced.
> >
> > I newbie with Tika. I am using latest version 0.8 version. I extracted
> text from PDF document but found spaces and new line missing. Indexing the
> data gives wrong result. Could any one in this group could help me? I am
> using tika directly to extract the contents, which later gets indexed.
> >
> > Regards
> > Ganesh
> > Send free SMS to your Friends on Mobile from your Yahoo! Messenger.
> Download Now! http://messenger.yahoo.com/download.php
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> > For additional commands, e-mail: java-user-help@lucene.apache.org
> >
> >
>
>
>
> --
> Lance Norskog
> goksron@gmail.com
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>

Re: PDF text extracted without spaces

Posted by Lance Norskog <go...@gmail.com>.

The text should come out as a stream of words with space, but without
any of the formatting in the PDF. Extraction is only good enough to
tell you that a word is somewhere inside a PDF file.  Can you post a
short bit of the text that it extracted?

Also, you should try this test on different PDF files that were made
with different software.

On Thu, Dec 2, 2010 at 9:35 PM, Ganesh <em...@yahoo.co.in> wrote:
> Hello all,
>
> I know, this is not the right group to ask this question, thought some of you guys might have experienced.
>
> I newbie with Tika. I am using latest version 0.8 version. I extracted text from PDF document but found spaces and new line missing. Indexing the data gives wrong result. Could any one in this group could help me? I am using tika directly to extract the contents, which later gets indexed.
>
> Regards
> Ganesh
> Send free SMS to your Friends on Mobile from your Yahoo! Messenger. Download Now! http://messenger.yahoo.com/download.php
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>



-- 
Lance Norskog
goksron@gmail.com

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org