Posted to dev@pdfbox.apache.org by Nicolas Paris <ni...@gmail.com> on 2016/03/01 13:33:45 UTC
ExtractText command line tool - modify code
Hello,
My use case is extracting text from the same PDF in two ways: once sorted and
once unsorted.
This process takes 2 seconds, which is too long (I have 1M PDFs to extract).
I wonder whether it would be feasible to modify the code (
https://github.com/apache/pdfbox/blob/trunk/tools/src/main/java/org/apache/pdfbox/tools/ExtractText.java)
in order to combine the two actions into one.
The output would be something like
extractSorted
separator
extractNonSorted
And the command line would be "pdfbox..extractText -combine -nonSort -sort".
Maybe this is not a good idea. In that case, do you have any advice on how to
improve extraction performance?
Thanks in advance,
Re: ExtractText command line tool - modify code
Posted by Tilman Hausherr <TH...@t-online.de>.
On 01.03.2016 at 20:29, Nicolas Paris wrote:
> You mean the GitHub version I cloned and compiled is not RC3?
Sorry, that is of course the latest snapshot (mirror), so you do already
have max speed.
Tilman
---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@pdfbox.apache.org
For additional commands, e-mail: dev-help@pdfbox.apache.org
Re: ExtractText command line tool - modify code
Posted by Nicolas Paris <ni...@gmail.com>.
2016-03-01 19:28 GMT+01:00 Tilman Hausherr <TH...@t-online.de>:
>
> You could write a program that runs both extractions in parallel (it should
> use different PDDocument objects).
>
I made it work just by editing the Java file I mentioned, at line
230: I added a second
stripper.writeText( document, output );
call with a different configuration, which doubles performance for the use case
described in my previous email. I could do that in 2 threads, but I already
run the command in multiple Linux processes.
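For readers following along, the change described above amounts to parsing the PDF once and running two extraction passes over the same document object. Below is a minimal sketch of that pattern, not the actual patch to ExtractText.java. The names CombinedExtract, Pass and extractCombined are hypothetical. The PDFBox calls shown in the trailing comment (PDDocument.load, PDFTextStripper.setSortByPosition, PDFTextStripper.getText) are assumed from the PDFBox 2.x API and are kept out of the compiled code so that the sketch runs with the JDK alone.

```java
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical helper: one parsed document, two sequential extraction passes.
final class CombinedExtract {

    /** One configured extraction pass over an already loaded document of type D. */
    interface Pass<D> {
        String extract(D document) throws IOException;
    }

    /**
     * Runs both passes one after the other on the same document and joins
     * their output with the separator. The caller parses the document only
     * once, which is where the time saving is expected to come from.
     */
    static <D> String extractCombined(D document, String separator,
                                      Pass<D> first, Pass<D> second) {
        try {
            String firstText = first.extract(document);
            String secondText = second.extract(document);
            return firstText + separator + secondText;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Assumed wiring with PDFBox 2.x (not compiled here):
    //
    //   PDFTextStripper sorted = new PDFTextStripper();
    //   sorted.setSortByPosition(true);
    //   PDFTextStripper unsorted = new PDFTextStripper();
    //   unsorted.setSortByPosition(false);
    //   try (PDDocument document = PDDocument.load(file)) {
    //       String both = extractCombined(document, "\nseparator\n",
    //                                     sorted::getText, unsorted::getText);
    //   }
}
```

With stand-in lambdas, extractCombined("doc", "|", d -> "S:" + d, d -> "U:" + d) returns "S:doc|U:doc". The two passes run sequentially on purpose: sharing one document object between threads is what the earlier advice about separate PDDocument objects warns against.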
> Re performance - the current snapshot is a bit faster than RC3, thanks to
> PDFBOX-3224, which improved performance by about 20%.
>
You mean the GitHub version I cloned and compiled is not RC3?
> I don't have a suggestion how to improve performance... use a fast
> computer with enough memory. Or try other products:
>
> https://pdfliberation.wordpress.com/
Thanks for the link, I didn't know about them. Actually, I have already
tested others (Python pdfminer, Linux pdf2html), but the ability to "sort"
the text is very important for my PDFs.
>
>
> But I think PDFBox is not that bad, considering this project:
> https://github.com/jsonstein/HRC-emails-PDF2TXT
>
> Tilman
Re: ExtractText command line tool - modify code
Posted by Tilman Hausherr <TH...@t-online.de>.
On 01.03.2016 at 13:33, Nicolas Paris wrote:
> Maybe this is not a good idea. In that case, do you have any advice on how to
> improve extraction performance?
You could write a program that runs both extractions in parallel (it
should use different PDDocument objects).
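As an illustration of the parallel approach, here is a minimal sketch that runs two extraction tasks on a two-thread pool. The class and method names (ParallelExtract, extractBoth) are hypothetical, and the PDFBox calls in the trailing comment are assumed from the PDFBox 2.x API rather than compiled here. The point it tries to show is that each task loads its own PDDocument, so nothing is shared between threads.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical helper: run the sorted and unsorted extraction concurrently.
final class ParallelExtract {

    /**
     * Submits both tasks to a two-thread pool and waits for both results.
     * Each task is expected to open and close its own document, so no
     * document object is shared between threads.
     */
    static List<String> extractBoth(Callable<String> sortedTask,
                                    Callable<String> unsortedTask) {
        ExecutorService pool = Executors.newFixedThreadPool(2);
        try {
            Future<String> sorted = pool.submit(sortedTask);
            Future<String> unsorted = pool.submit(unsortedTask);
            return Arrays.asList(sorted.get(), unsorted.get());
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while extracting", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("extraction failed", e.getCause());
        } finally {
            pool.shutdown();
        }
    }

    // Assumed shape of one task with PDFBox 2.x (not compiled here):
    //
    //   Callable<String> sortedTask = () -> {
    //       try (PDDocument document = PDDocument.load(file)) {
    //           PDFTextStripper stripper = new PDFTextStripper();
    //           stripper.setSortByPosition(true);
    //           return stripper.getText(document);
    //       }
    //   };
}
```

Note that this variant parses the file twice, so it trades extra CPU work for shorter wall-clock time per file. On a machine whose cores are already kept busy by several extraction processes, it may bring little or no gain.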
Re performance - the current snapshot is a bit faster than RC3, thanks
to PDFBOX-3224, which improved performance by about 20%.
I don't have a suggestion for how to improve performance... use a fast
computer with enough memory. Or try other products:
https://pdfliberation.wordpress.com/
But I think PDFBox is not that bad, considering this project:
https://github.com/jsonstein/HRC-emails-PDF2TXT
Tilman