Posted to java-user@lucene.apache.org by "WuDG@infoPro.cn" <wu...@infopro.cn> on 2004/09/08 07:42:41 UTC
pdf in Chinese
Hi all,
I use PDFBox to parse PDF files into Lucene documents. When I parse Chinese
PDF files, PDFBox does not always succeed.
Does anyone have any advice?
---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org
Re: pdf in Chinese
Posted by Ben Litchfield <be...@csh.rit.edu>.
This appears to be more of a PDFBox issue than a Lucene issue; please post
an issue to the PDFBox site.
Also note that, because of certain encodings that a PDF writer can use, it
is impossible to extract text from all PDF documents.
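Because extraction can fail quietly for some encodings, one option is to sanity-check the extracted text before handing it to Lucene. The following is only a rough sketch: it assumes the text has already been extracted into a String by whatever library is in use, and the class name, method names and threshold are hypothetical, not part of PDFBox or Lucene.

```java
class ExtractionCheck {

    /** Fraction of non-whitespace code points that belong to the Han script. */
    static double hanRatio(String text) {
        int total = 0;
        int han = 0;
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            if (Character.isWhitespace(cp)) {
                continue; // ignore spaces and line breaks
            }
            total++;
            if (Character.UnicodeScript.of(cp) == Character.UnicodeScript.HAN) {
                han++;
            }
        }
        return total == 0 ? 0.0 : (double) han / total;
    }

    /** Heuristic only: true if at least minRatio of the visible characters are Han. */
    static boolean looksLikeChinese(String text, double minRatio) {
        return hanRatio(text) >= minRatio;
    }

    public static void main(String[] args) {
        // "\u4e2d\u6587\u6587\u6863" is a four-character Chinese sample string.
        String good = "\u4e2d\u6587\u6587\u6863";
        String garbled = "?????";
        System.out.println(looksLikeChinese(good, 0.5));
        System.out.println(looksLikeChinese(garbled, 0.5));
    }
}
```

A document that is expected to be Chinese but fails this check could be logged and skipped rather than indexed as garbage. The threshold would need tuning for documents that mix Chinese with Latin text.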
Ben
On Wed, 8 Sep 2004, WuDG@infoPro.cn wrote:
> It is not about the analyzer; I need to read the text from the PDF file first.
Re: PDF->Text Performance comparison
Posted by Ben Litchfield <be...@csh.rit.edu>.
> 1) I tried to migrate to newer versions (0.6.4, 0.6.5, 0.6.6), but every time I had
> problems parsing the same PDF documents that worked well with
> 0.6.3. I described my problems here:
> https://sourceforge.net/tracker/?func=detail&atid=552832&aid=1021691&group_id=78314
I am waiting for a response from you on this issue. Try to log in to SF
when posting bugs so that you get a notification when it is updated.
> 2) When I started with 0.6.3 I experienced performance problems
> too, especially with large PDF documents (I had several larger
> than 20 MB). I changed the source slightly, wrapping the following line
> of the BaseParser class:
I will give that a try, thanks for letting me know.
Ben
Re: PDF->Text Performance comparison
Posted by Maxim Patramanskij <ma...@osua.de>.
Hello Ben,
I've been using PDFBox for the last year, but only version 0.6.3,
for two reasons:
1) I tried to migrate to newer versions (0.6.4, 0.6.5, 0.6.6), but every time I had
problems parsing the same PDF documents that worked well with
0.6.3. I described my problems here:
https://sourceforge.net/tracker/?func=detail&atid=552832&aid=1021691&group_id=78314
2) When I started with 0.6.3 I experienced performance problems
too, especially with large PDF documents (I had several larger
than 20 MB). I changed the source slightly, wrapping the following line
of the BaseParser class:
out = stream.createFilteredStream( streamLength );
to
out = new BufferedOutputStream(stream.createFilteredStream( streamLength ));
The performance increase I got was huge:
parsing a 21 MB PDF document to text took 78 seconds before the
modification and 12 seconds after it, so more than 6 times faster.
I also tried using buffered streams in some other places, but the
effect was not as visible. I hope this change can also be incorporated into
the current 0.6.6 release, and then the benchmarks may end up on PDFBox's side
:)
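To illustrate why wrapping a stream in a BufferedOutputStream can matter this much, here is a small self-contained sketch. It does not use PDFBox at all: CountingSink is a hypothetical stand-in for whatever createFilteredStream returns, and the assumption that the parser writes one byte at a time is mine, not something confirmed for BaseParser. The sketch only counts how many write calls reach the underlying stream with and without the wrapper.

```java
import java.io.BufferedOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;

class BufferingSketch {

    /** Hypothetical sink that counts how many write calls actually reach it. */
    static final class CountingSink extends OutputStream {
        long calls = 0;

        @Override
        public void write(int b) {
            calls++;
        }

        @Override
        public void write(byte[] buf, int off, int len) {
            calls++;
        }
    }

    /** Writes n bytes one at a time, then flushes. */
    private static void writeByteByByte(OutputStream out, int n) {
        try {
            for (int i = 0; i < n; i++) {
                out.write(i & 0xFF);
            }
            out.flush();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Number of calls that reach the sink when nothing sits in between. */
    static long callsUnbuffered(int n) {
        CountingSink sink = new CountingSink();
        writeByteByByte(sink, n);
        return sink.calls;
    }

    /** Number of calls that reach the sink through an 8 KB buffer. */
    static long callsBuffered(int n) {
        CountingSink sink = new CountingSink();
        writeByteByByte(new BufferedOutputStream(sink, 8192), n);
        return sink.calls;
    }

    public static void main(String[] args) {
        int n = 1_000_000;
        System.out.println("unbuffered calls: " + callsUnbuffered(n));
        System.out.println("buffered calls:   " + callsBuffered(n));
    }
}
```

If each call to the underlying stream is expensive (a filter, a file, a synchronized method), cutting the number of calls by a factor of several thousand can plausibly account for a speedup like the one reported above, although the actual cause inside PDFBox is not shown here.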
Max
BL> On Wed, 8 Sep 2004, Chas Emerick wrote:
>> PDFTextStream: fast PDF text extraction for Java applications
>> http://snowtide.com/home/PDFTextStream/
BL> For those that have not seen, snowtide.com has done a performance
BL> comparison against several Java PDF->Text libraries, including Snowtide's
BL> PDFTextStream, PDFBox, Etymon PJ and JPedal. It appears to be fairly well
BL> done.
BL> http://snowtide.com/home/PDFTextStream/Performance
BL> PDFBox: slow PDF text extraction for Java applications
BL> http://www.pdfbox.org
BL> :)
BL> Ben
--
Best regards,
Maxim mailto:max@osua.de
Re: PDF->Text Performance comparison
Posted by Chas Emerick <ce...@snowtide.com>.
Ben,
Wow, thanks for the plug! :-)
Truthfully, I was worried that our open-source brethren might feel
slighted by the comparison -- that's partially why we wanted to make
sure it was as thorough and transparent as possible so that anyone
could review the results for themselves. I'm glad that you're not at
all sore.
Chas Emerick | cemerick@snowtide.com
PDFTextStream: fast PDF text extraction for Java applications
http://snowtide.com/home/PDFTextStream/
On Sep 8, 2004, at 10:41 AM, Ben Litchfield wrote:
>
> On Wed, 8 Sep 2004, Chas Emerick wrote:
>> PDFTextStream: fast PDF text extraction for Java applications
>> http://snowtide.com/home/PDFTextStream/
>
>
> For those that have not seen, snowtide.com has done a performance
> comparison against several Java PDF->Text libraries, including Snowtide's
> PDFTextStream, PDFBox, Etymon PJ and JPedal. It appears to be fairly well
> done.
>
> http://snowtide.com/home/PDFTextStream/Performance
>
>
> PDFBox: slow PDF text extraction for Java applications
> http://www.pdfbox.org
>
> :)
>
> Ben
Re: PDF->Text Performance comparison
Posted by Ben Litchfield <be...@csh.rit.edu>.
Yes, that and a few other adjectives, but I didn't want to get carried
away.
Ben
On Wed, 8 Sep 2004, Doug Cutting wrote:
> Ben Litchfield wrote:
> > PDFBox: slow PDF text extraction for Java applications
> > http://www.pdfbox.org
>
> Shouldn't that read, "PDFBox: *free* slow PDF text extraction for Java
> applications, with Lucene integration"?
>
> Doug
>
Re: PDF->Text Performance comparison
Posted by Doug Cutting <cu...@apache.org>.
Ben Litchfield wrote:
> PDFBox: slow PDF text extraction for Java applications
> http://www.pdfbox.org
Shouldn't that read, "PDFBox: *free* slow PDF text extraction for Java
applications, with Lucene integration"?
Doug
PDF->Text Performance comparison
Posted by Ben Litchfield <be...@csh.rit.edu>.
On Wed, 8 Sep 2004, Chas Emerick wrote:
> PDFTextStream: fast PDF text extraction for Java applications
> http://snowtide.com/home/PDFTextStream/
For those that have not seen, snowtide.com has done a performance
comparison against several Java PDF->Text libraries, including Snowtide's
PDFTextStream, PDFBox, Etymon PJ and JPedal. It appears to be fairly well
done.
http://snowtide.com/home/PDFTextStream/Performance
PDFBox: slow PDF text extraction for Java applications
http://www.pdfbox.org
:)
Ben
Re: pdf in Chinese
Posted by Chas Emerick <ce...@snowtide.com>.
I'm not aware of any Java library that can reliably extract Chinese
text from PDF documents. We're planning on supporting Chinese,
Japanese, and Korean in version 2 of PDFTextStream, but there's no
doubt that it's a huge challenge.
Chas Emerick | cemerick@snowtide.com
PDFTextStream: fast PDF text extraction for Java applications
http://snowtide.com/home/PDFTextStream/
On Sep 8, 2004, at 5:58 AM, WuDG@infoPro.cn wrote:
> It is not about the analyzer; I need to read the text from the PDF file first.
Re: pdf in Chinese
Posted by "WuDG@infoPro.cn" <wu...@infopro.cn>.
It is not about the analyzer; I need to read the text from the PDF file first.
----- Original Message -----
From: "Chandan Tamrakar" <ch...@ccnep.com.np>
To: "Lucene Users List" <lu...@jakarta.apache.org>
Sent: Wednesday, September 08, 2004 4:15 PM
Subject: Re: pdf in Chinese
> Which analyzer are you using to index Chinese PDF documents?
> I think you should use CJKAnalyzer.
Re: pdf in Chinese
Posted by Chandan Tamrakar <ch...@ccnep.com.np>.
Which analyzer are you using to index Chinese PDF documents?
I think you should use CJKAnalyzer.
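For context on why the analyzer choice matters once the text has been extracted: as far as I understand it, CJKAnalyzer indexes Chinese text as overlapping pairs of adjacent characters rather than whitespace-separated words. The sketch below shows that idea only. It is not Lucene code, the class and method names are made up, and it assumes the input is a single run of CJK characters with no Latin text or punctuation.

```java
import java.util.ArrayList;
import java.util.List;

class BigramSketch {

    /** Splits a run of characters into overlapping two-character tokens. */
    static List<String> bigrams(String text) {
        int[] cps = text.codePoints().toArray();
        List<String> tokens = new ArrayList<>();
        if (cps.length == 1) {
            // A lone character is kept as its own token (a choice made for this sketch).
            tokens.add(new String(cps, 0, 1));
            return tokens;
        }
        for (int i = 0; i + 1 < cps.length; i++) {
            tokens.add(new String(cps, i, 2));
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Four-character Chinese sample string, written with Unicode escapes.
        for (String token : bigrams("\u4e2d\u6587\u6587\u6863")) {
            System.out.println(token);
        }
    }
}
```

Searching then works on the same pairs, so a two-character query matches wherever those two characters appear next to each other in the indexed text.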
----- Original Message -----
From: "WuDG@infoPro.cn" <wu...@infopro.cn>
To: <lu...@jakarta.apache.org>
Sent: Wednesday, September 08, 2004 11:27 AM
Subject: pdf in Chinese
> Hi all,
> I use PDFBox to parse PDF files into Lucene documents. When I parse Chinese
> PDF files, PDFBox does not always succeed.
> Does anyone have any advice?