Posted to java-user@lucene.apache.org by "WuDG@infoPro.cn" <wu...@infopro.cn> on 2004/09/08 07:42:41 UTC

pdf in Chinese

Hi all,
    I use PDFBox to parse PDF files into Lucene documents. When I parse Chinese
PDF files, PDFBox does not always succeed.
    Does anyone have any advice?


---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org


Re: pdf in Chinese

Posted by Ben Litchfield <be...@csh.rit.edu>.
This appears to be more of a PDFBox issue than a Lucene issue. Please post
an issue to the PDFBox site.

Also note that, because of certain encodings that a PDF writer can use, it
is not possible to extract text from every PDF document.
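
One way to catch such documents before they reach the index is to
sanity-check whatever text the extractor returns. The following is only a
sketch under assumptions: ExtractionCheck, hanRatio, looksLikeChinese and the
threshold are hypothetical names and values, not part of PDFBox or Lucene.
The check simply asks what share of the non-whitespace characters are Han
ideographs.

```java
class ExtractionCheck {
    // Share of non-whitespace code points that belong to the Han script.
    // Returns 0.0 for empty or whitespace-only input.
    static double hanRatio(String text) {
        long total = 0;
        long han = 0;
        int i = 0;
        while (i < text.length()) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            if (Character.isWhitespace(cp)) {
                continue;
            }
            total++;
            if (Character.UnicodeScript.of(cp) == Character.UnicodeScript.HAN) {
                han++;
            }
        }
        return total == 0 ? 0.0 : (double) han / total;
    }

    // Assumption: a document expected to be Chinese should contain at least
    // minRatio Han characters; the right threshold depends on the corpus.
    static boolean looksLikeChinese(String text, double minRatio) {
        return hanRatio(text) >= minRatio;
    }
}
```

A document that is expected to be Chinese but yields a very low ratio could
be logged and skipped rather than indexed as garbage.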

Ben

On Wed, 8 Sep 2004, WuDG@infoPro.cn wrote:

> It is not about the analyzer; I need to read the text from the PDF file first.


Re: PDF->Text Performance comparison

Posted by Ben Litchfield <be...@csh.rit.edu>.
>  1) I tried to migrate to newer versions (0.6.4, 0.6.5, 0.6.6), but every time I had
>  problems parsing the same PDF documents that worked well with
>  0.6.3. I described my problems here:
>   https://sourceforge.net/tracker/?func=detail&atid=552832&aid=1021691&group_id=78314

I am waiting for a response from you on this issue. Try to log in to SF
when posting bugs so that you get a notification when the issue is updated.



>  2) When I started with 0.6.3, I experienced performance problems
>  too, especially with large PDF documents (I had several of more
>  than 20 MB). I changed the source a bit, wrapping the following line
>  of the BaseParser class:

I will give that a try, thanks for letting me know.

Ben



Re: PDF->Text Performance comparison

Posted by Maxim Patramanskij <ma...@osua.de>.
Hello Ben,

I've been using PDFBox for the last year, but only version 0.6.3,
for two reasons:

 1) I tried to migrate to newer versions (0.6.4, 0.6.5, 0.6.6), but every time I had
 problems parsing the same PDF documents that worked well with
 0.6.3. I described my problems here:
  https://sourceforge.net/tracker/?func=detail&atid=552832&aid=1021691&group_id=78314

 2) When I started with 0.6.3, I experienced performance problems
 too, especially with large PDF documents (I had several of more
 than 20 MB). I changed the source a bit, wrapping the following line
 of the BaseParser class:

            out = stream.createFilteredStream( streamLength );

            to
            
            out = new BufferedOutputStream(stream.createFilteredStream( streamLength ));
            

 The performance increase I got was huge:
 parsing a 21 MB PDF document to text took 78 seconds before the
 modification and 12 seconds after it, so more than 6 times faster.
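
 Why the wrapper helps can be shown without PDFBox at all. The following
 sketch assumes a parser that writes one byte at a time; CountingSink and
 BufferingDemo are hypothetical names used only for illustration, and the
 8192-byte buffer size is an arbitrary choice. It counts how many write
 calls reach the underlying stream with and without a BufferedOutputStream
 in between.

```java
import java.io.BufferedOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;

// Hypothetical stand-in for an expensive underlying stream: it only counts calls.
class CountingSink extends OutputStream {
    long calls = 0;
    long bytes = 0;

    @Override
    public void write(int b) {
        calls++;
        bytes++;
    }

    @Override
    public void write(byte[] buf, int off, int len) {
        calls++;
        bytes += len;
    }
}

class BufferingDemo {
    // Writes n bytes one at a time and returns how many write calls
    // actually reached the underlying sink.
    static long underlyingCalls(int n, boolean buffered) {
        CountingSink sink = new CountingSink();
        OutputStream out = buffered ? new BufferedOutputStream(sink, 8192) : sink;
        try {
            for (int i = 0; i < n; i++) {
                out.write(i & 0xFF);
            }
            out.flush(); // push any bytes still held in the buffer
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        if (sink.bytes != n) {
            throw new IllegalStateException("sink did not receive all bytes");
        }
        return sink.calls;
    }

    public static void main(String[] args) {
        int n = 1_000_000;
        System.out.println("unbuffered calls: " + underlyingCalls(n, false));
        System.out.println("buffered calls:   " + underlyingCalls(n, true));
    }
}
```

 Each call that reaches the underlying stream may be costly (for example a
 filter step or a disk write), so collapsing many single-byte writes into a
 few block writes is a plausible explanation for the speedup reported
 above, although the actual cause inside PDFBox 0.6.3 is not shown here.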

 I also tried using buffered streams in some other places, but the
 effect was not as visible. I hope this change can be incorporated into
 the current 0.6.6 release, and then the benchmarks may come out in PDFBox's favor
 :)


 Max


BL> On Wed, 8 Sep 2004, Chas Emerick wrote:
>> PDFTextStream: fast PDF text extraction for Java applications
>> http://snowtide.com/home/PDFTextStream/


BL> For those that have not seen, snowtide.com has done a performance
BL> comparison against several Java PDF->Text libraries, including Snowtide's
BL> PDFTextStream, PDFBox, Etymon PJ and JPedal.  It appears to be fairly well
BL> done.

BL> http://snowtide.com/home/PDFTextStream/Performance


BL> PDFBox: slow PDF text extraction for Java applications
BL> http://www.pdfbox.org

BL> :)

BL> Ben






-- 
Best regards,
 Maxim                            mailto:max@osua.de




Re: PDF->Text Performance comparison

Posted by Chas Emerick <ce...@snowtide.com>.
Ben,

Wow, thanks for the plug! :-)

Truthfully, I was worried that our open-source brethren might feel 
slighted by the comparison -- that's partially why we wanted to make 
sure it was as thorough and transparent as possible so that anyone 
could review the results for themselves.  I'm glad that you're not at 
all sore.

Chas Emerick   |   cemerick@snowtide.com

PDFTextStream: fast PDF text extraction for Java applications
http://snowtide.com/home/PDFTextStream/

On Sep 8, 2004, at 10:41 AM, Ben Litchfield wrote:

>
> On Wed, 8 Sep 2004, Chas Emerick wrote:
>> PDFTextStream: fast PDF text extraction for Java applications
>> http://snowtide.com/home/PDFTextStream/
>
>
> For those that have not seen, snowtide.com has done a performance
> comparison against several Java PDF->Text libraries, including 
> Snowtide's
> PDFTextStream, PDFBox, Etymon PJ and JPedal.  It appears to be fairly 
> well
> done.
>
> http://snowtide.com/home/PDFTextStream/Performance
>
>
> PDFBox: slow PDF text extraction for Java applications
> http://www.pdfbox.org
>
> :)
>
> Ben




Re: PDF->Text Performance comparison

Posted by Ben Litchfield <be...@csh.rit.edu>.
Yes, that and a few other adjectives, but I didn't want to get carried
away.

Ben


On Wed, 8 Sep 2004, Doug Cutting wrote:

> Ben Litchfield wrote:
> > PDFBox: slow PDF text extraction for Java applications
> > http://www.pdfbox.org
>
> Shouldn't that read, "PDFBox: *free* slow PDF text extraction for Java
> applications, with Lucene integration"?
>
> Doug


Re: PDF->Text Performance comparison

Posted by Doug Cutting <cu...@apache.org>.
Ben Litchfield wrote:
> PDFBox: slow PDF text extraction for Java applications
> http://www.pdfbox.org

Shouldn't that read, "PDFBox: *free* slow PDF text extraction for Java 
applications, with Lucene integration"?

Doug



PDF->Text Performance comparison

Posted by Ben Litchfield <be...@csh.rit.edu>.
On Wed, 8 Sep 2004, Chas Emerick wrote:
> PDFTextStream: fast PDF text extraction for Java applications
> http://snowtide.com/home/PDFTextStream/


For those that have not seen, snowtide.com has done a performance
comparison against several Java PDF->Text libraries, including Snowtide's
PDFTextStream, PDFBox, Etymon PJ and JPedal.  It appears to be fairly well
done.

http://snowtide.com/home/PDFTextStream/Performance


PDFBox: slow PDF text extraction for Java applications
http://www.pdfbox.org

:)

Ben




Re: pdf in Chinese

Posted by Chas Emerick <ce...@snowtide.com>.
I'm not aware of any Java library that can reliably extract Chinese 
text from PDF documents.  We're planning on supporting Chinese, 
Japanese, and Korean in version 2 of PDFTextStream, but there's no 
doubt that it's a huge challenge.

Chas Emerick   |   cemerick@snowtide.com

PDFTextStream: fast PDF text extraction for Java applications
http://snowtide.com/home/PDFTextStream/

On Sep 8, 2004, at 5:58 AM, WuDG@infoPro.cn wrote:

> It is not about the analyzer; I need to read the text from the PDF file first.


Re: pdf in Chinese

Posted by "WuDG@infoPro.cn" <wu...@infopro.cn>.
It is not about the analyzer; I need to read the text from the PDF file first.

----- Original Message ----- 
From: "Chandan Tamrakar" <ch...@ccnep.com.np>
To: "Lucene Users List" <lu...@jakarta.apache.org>
Sent: Wednesday, September 08, 2004 4:15 PM
Subject: Re: pdf in Chinese


> Which analyzer are you using to index Chinese PDF documents?
> I think you should use CJKAnalyzer.


Re: pdf in Chinese

Posted by Chandan Tamrakar <ch...@ccnep.com.np>.
Which analyzer are you using to index Chinese PDF documents?
I think you should use CJKAnalyzer.
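
For context, CJK analyzers of this kind typically index runs of Chinese
characters as overlapping two-character tokens, since Chinese text has no
spaces between words. The sketch below only illustrates that idea in plain
Java; BigramSketch and hanBigrams are hypothetical names, and this is not
the Lucene CJKAnalyzer API. It only helps once the text has been extracted
from the PDF.

```java
import java.util.ArrayList;
import java.util.List;

class BigramSketch {
    // Returns overlapping two-character tokens for every run of Han
    // characters in the text. Non-Han characters only separate runs.
    static List<String> hanBigrams(String text) {
        List<String> tokens = new ArrayList<>();
        int[] cps = text.codePoints().toArray();
        int runStart = -1;
        for (int i = 0; i <= cps.length; i++) {
            boolean han = i < cps.length
                    && Character.UnicodeScript.of(cps[i]) == Character.UnicodeScript.HAN;
            if (han) {
                if (runStart < 0) {
                    runStart = i; // a new run of Han characters begins
                }
            } else if (runStart >= 0) {
                emitRun(cps, runStart, i, tokens);
                runStart = -1;
            }
        }
        return tokens;
    }

    // Emits bigrams for cps[start, end); a single-character run is kept as is.
    private static void emitRun(int[] cps, int start, int end, List<String> out) {
        if (end - start == 1) {
            out.add(new String(cps, start, 1));
            return;
        }
        for (int i = start; i + 1 < end; i++) {
            out.add(new String(cps, i, 2));
        }
    }
}
```

A four-character run therefore produces three tokens, and a query run
through the same function matches on any shared pair of adjacent characters.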
----- Original Message ----- 
From: "WuDG@infoPro.cn" <wu...@infopro.cn>
To: <lu...@jakarta.apache.org>
Sent: Wednesday, September 08, 2004 11:27 AM
Subject: pdf in Chinese


> Hi all,
>     I use PDFBox to parse PDF files into Lucene documents. When I parse Chinese
> PDF files, PDFBox does not always succeed.
>     Does anyone have any advice?