Posted to users@pdfbox.apache.org by "robyp7 ." <ro...@gmail.com> on 2015/10/09 10:34:11 UTC

how to use PDDocument.loadNonSeq, large pdf stripper/parsing text technique

Hi,

I have some questions about parsing PDFs:

1) What is the purpose of the PDDocument.loadNonSeq method that takes a
scratch/temporary file?

2) I have a big PDF and I need to parse it and get its text content. I use
PDDocument.load() and then PDFTextStripper to extract the text page by page
(PDFTextStripper has setStartPage(n) and setEndPage(n), and I increment n by
one on every loop iteration). Is loadNonSeq more memory-efficient than load?

For example:

File pdfFile = new File("mypdf.pdf");
File tmpFile = new File("result.tmp");
// RandomAccessFile is org.apache.pdfbox.io.RandomAccessFile (the scratch file);
// READ_WRITE is assumed to hold the file mode string "rw"
PDDocument doc = PDDocument.loadNonSeq(pdfFile,
        new RandomAccessFile(tmpFile, READ_WRITE));
int numPages = doc.getNumberOfPages();
for (int index = 1; index <= numPages; index++) {
    PDFTextStripper stripper = new PDFTextStripper();
    Writer destination = new StringWriter();
    String xml = "";
    // restrict extraction to the current page (page numbers are 1-based)
    stripper.setStartPage(index);
    stripper.setEndPage(index);
    stripper.writeText(doc, destination);
    .... // filter the text, then convert it to XML
}
doc.close(); // release the document and its scratch file

Is the code above a correct use of loadNonSeq, and is it good practice to
read a PDF page by page to avoid wasting memory?
I read page by page because I need to write the text to XML using an
in-memory DOM (with this stripping technique, I decided to produce one XML
document per page).

Thank you very much

Roby
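
Editorial sketch for the "one XML document per page" step described above. It uses only the JDK's DOM and transformer classes and takes the page text as a plain String, so it does not depend on PDFBox. The class name PageXmlSketch, the method pageToXml, and the element names "page" and "line" are hypothetical, chosen only for illustration; the real filtering rules and XML format are not given in the thread.

```java
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

class PageXmlSketch {

    /** Builds a small DOM tree for one page of text and serializes it to a string. */
    public static String pageToXml(int pageNumber, String pageText) {
        try {
            Document dom = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder().newDocument();
            Element page = dom.createElement("page");      // hypothetical element name
            page.setAttribute("number", Integer.toString(pageNumber));
            dom.appendChild(page);

            // One <line> element per non-empty line of extracted text.
            for (String line : pageText.split("\\R")) {
                String trimmed = line.trim();
                if (trimmed.isEmpty()) {
                    continue;
                }
                Element lineElement = dom.createElement("line");
                lineElement.setTextContent(trimmed);       // escaping happens on output
                page.appendChild(lineElement);
            }

            Transformer transformer = TransformerFactory.newInstance().newTransformer();
            transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            transformer.transform(new DOMSource(dom), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("could not build XML for page " + pageNumber, e);
        }
    }

    public static void main(String[] args) {
        // In the loop from the question, pageText would be destination.toString().
        System.out.println(pageToXml(1, "First line\n\nSecond & last line"));
    }
}
```

Because each DOM tree holds a single page and is discarded after serialization, the XML side of the pipeline keeps only one page in memory at a time, whatever the PDF loader does.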

Re: how to use PDDocument.loadNonSeq, large pdf stripper/parsing text technique

Posted by "robyp7 ." <ro...@gmail.com>.
Thank you, Tilman!
I have decided to use Apache Tika. It uses a SAX handler to produce XHTML, and
I wrote my own SAX handler for my specific XML format.
The latest version of Tika uses the latest PDFBox version, and I found a
loadNonSeq method call inside the Tika parser library, so I think it is a
good idea to rely on that robust code instead of my code above. Bye
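
Editorial sketch of the custom SAX handler idea mentioned above. It is not the poster's handler, which the thread does not show. It assumes that the incoming XHTML wraps each page in a div element with class="page" and puts the text in p elements; that shape is an assumption here, so check it against the XHTML your Tika version actually emits. The class name PageCollectingHandler and its methods are hypothetical. To keep the example self-contained it feeds the handler from the JDK's own SAX parser instead of Tika.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class PageCollectingHandler extends DefaultHandler {
    private final List<String> pages = new ArrayList<>();
    private StringBuilder current;  // non-null only while inside a page div
    private int nestedDivs;         // depth of div elements nested inside the page div

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if (!"div".equals(qName)) {
            return;
        }
        if (current == null && "page".equals(atts.getValue("class"))) {
            current = new StringBuilder();  // a new page starts here
            nestedDivs = 0;
        } else if (current != null) {
            nestedDivs++;
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (current != null) {
            current.append(ch, start, length);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (current == null) {
            return;
        }
        if ("p".equals(qName)) {
            current.append('\n');           // keep paragraphs on separate lines
        } else if ("div".equals(qName)) {
            if (nestedDivs > 0) {
                nestedDivs--;
            } else {
                pages.add(current.toString().trim());  // page div closed
                current = null;
            }
        }
    }

    public List<String> getPages() {
        return pages;
    }

    /** Convenience wrapper for the demo: parses an XHTML string with the JDK SAX parser. */
    public static List<String> parse(String xhtml) {
        try {
            PageCollectingHandler handler = new PageCollectingHandler();
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xhtml)), handler);
            return handler.getPages();
        } catch (Exception e) {
            throw new IllegalStateException("could not parse XHTML", e);
        }
    }

    public static void main(String[] args) {
        String xhtml = "<html><body><div class=\"page\"><p>Hello</p><p>World</p></div>"
                + "<div class=\"page\"><p>Page two</p></div></body></html>";
        System.out.println(parse(xhtml));
    }
}
```

With Tika, a handler like this would be passed to the parser in place of the default content handler. The per-page strings it collects could then be converted to the target XML format one page at a time.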

2015-10-09 19:40 GMT+02:00 Tilman Hausherr <TH...@t-online.de>:


Re: how to use PDDocument.loadNonSeq, large pdf stripper/parsing text technique

Posted by Tilman Hausherr <TH...@t-online.de>.
On 09.10.2015 at 10:34, robyp7 . wrote:
> Hi,
>
> I have some questions about parsing PDFs:
>
> 1) What is the purpose of the PDDocument.loadNonSeq method that takes a
> scratch/temporary file?

saves memory

>
>
> 2) I have a big PDF and I need to parse it and get its text content. I use
> PDDocument.load() and then PDFTextStripper to extract the text page by page
> (PDFTextStripper has setStartPage(n) and setEndPage(n), and I increment n by
> one on every loop iteration). Is loadNonSeq more memory-efficient than load?

Don't know, but loadNonSeq is the correct parser. load() is an outdated 
parsing method. So you might get wrong results with load() in some rare 
cases. In the upcoming 2.0 version, the old parser will be removed anyway.

>
> For example:
>
> File pdfFile = new File("mypdf.pdf");
> File tmpFile = new File("result.tmp");
> // RandomAccessFile is org.apache.pdfbox.io.RandomAccessFile (the scratch file);
> // READ_WRITE is assumed to hold the file mode string "rw"
> PDDocument doc = PDDocument.loadNonSeq(pdfFile,
>         new RandomAccessFile(tmpFile, READ_WRITE));
> int numPages = doc.getNumberOfPages();
> for (int index = 1; index <= numPages; index++) {
>     PDFTextStripper stripper = new PDFTextStripper();
>     Writer destination = new StringWriter();
>     String xml = "";
>     // restrict extraction to the current page (page numbers are 1-based)
>     stripper.setStartPage(index);
>     stripper.setEndPage(index);
>     stripper.writeText(doc, destination);
>     .... // filter the text, then convert it to XML
> }
> doc.close(); // release the document and its scratch file
>
> Is the code above a correct use of loadNonSeq, and is it good practice to
> read a PDF page by page to avoid wasting memory?
> I read page by page because I need to write the text to XML using an
> in-memory DOM (with this stripping technique, I decided to produce one XML
> document per page).

If your results need to be separated by page, then your code is OK.

Tilman



---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe@pdfbox.apache.org
For additional commands, e-mail: users-help@pdfbox.apache.org