Posted to dev@pdfbox.apache.org by "Tilman Hausherr (JIRA)" <ji...@apache.org> on 2013/12/13 21:54:07 UTC
[jira] [Commented] (PDFBOX-1808) PDFTextStripper.getText - hight memory usage

    [ https://issues.apache.org/jira/browse/PDFBOX-1808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13847891#comment-13847891 ] 

Tilman Hausherr commented on PDFBOX-1808:
-----------------------------------------

What happens if you extract text from several PDFs within the same program run, or from the same PDF several times? Does memory usage keep growing, or does it stay the same?

I'm asking this to clarify whether 1) PDFBox is just using a lot of memory or 2) PDFBox has a memory leak.
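
One way to tell the two cases apart is to repeat the same extraction several times and compare the used heap after each run. The following is only a minimal sketch, not code from PDFBox: LeakCheck, usedHeapBytes and sampleUsedHeap are hypothetical names, and the workload in main is a stand-in that builds a large string. In a real test the workload would load the PDF, call PDFTextStripper.getText and close the document.

```java
import java.util.function.Supplier;

/** Hypothetical helper: samples used heap after repeated runs of a workload. */
final class LeakCheck {

    /** Used heap in bytes: the heap the JVM has reserved minus the free part of it. */
    static long usedHeapBytes() {
        Runtime rt = Runtime.getRuntime();
        return rt.totalMemory() - rt.freeMemory();
    }

    /**
     * Runs the workload the given number of times and records the used heap
     * after each run. The result of each run is dropped before sampling, so
     * memory that is still in use afterwards is held from somewhere else.
     */
    static long[] sampleUsedHeap(Supplier<?> workload, int runs) {
        long[] samples = new long[runs];
        for (int i = 0; i < runs; i++) {
            workload.get();   // result is discarded immediately
            System.gc();      // only a hint; the JVM may ignore it
            samples[i] = usedHeapBytes();
        }
        return samples;
    }

    public static void main(String[] args) {
        // Stand-in workload (assumption): replace with the real text extraction.
        Supplier<String> workload = () -> {
            StringBuilder sb = new StringBuilder();
            for (int i = 0; i < 100_000; i++) {
                sb.append("page ").append(i).append('\n');
            }
            return sb.toString();
        };
        long[] samples = sampleUsedHeap(workload, 5);
        for (int i = 0; i < samples.length; i++) {
            System.out.printf("run %d: used heap = %.1f MB%n", i + 1, samples[i] / 1_000_000.0);
        }
    }
}
```

If the samples level off after the first run or two, that points to case 1 (high but bounded usage). If they keep climbing with every run, that points to case 2 (a leak). Because System.gc() is only a hint, small fluctuations between runs are expected.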

If you are using NetBeans, its profiler has some useful features. It helped me find a bug in PDFBOX-1694.

> PDFTextStripper.getText - hight memory usage
> --------------------------------------------
>
>                 Key: PDFBOX-1808
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-1808
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 1.8.2, 1.8.3
>         Environment: Windows 7
> Java jdk 1.7.0_45
>            Reporter: Guyenot Jeremy
>            Priority: Critical
>              Labels: performance
>   Original Estimate: 72h
>  Remaining Estimate: 72h
>
> Hello,
> I'm trying to extract text from PDFs, but I find that PDFTextStripper uses a lot of memory.
> With a PDF that has 2676 pages (4.6 MB in size), it uses 1.5 GB of memory.
> I also notice that the memory isn't freed after the getText method is called.
> You can see my code below:
> double virgule = Math.pow(10, 2);
> System.out.println("START - Total memory (Mo): " + Math.round((Runtime.getRuntime().totalMemory()/1000000) * virgule) / virgule);
> PDDocument cd = PDDocument.load(file);
> System.out.println("PDDocument getNumberOfPages - Nombre de pages: " + cd.getNumberOfPages());
> System.out.println("PDDocument load - Total memory (Mo): " + Math.round((Runtime.getRuntime().totalMemory()/1000000) * virgule) / virgule);
> String pdfText = "";
> try {
>     PDFTextStripper stripper = new PDFTextStripper();
>     pdfText = stripper.getText(cd);
>     System.out.println("PDFTextStripper getText - Total memory (Mo): " + Math.round((Runtime.getRuntime().totalMemory()/1000000) * virgule) / virgule);
>     stripper.resetEngine();
>     stripper = null;
>     System.out.println("PDFTextStripper resetEngine - Total memory (Mo): " + Math.round((Runtime.getRuntime().totalMemory()/1000000) * virgule) / virgule);
> } finally {
>     if (cd != null) {
>         cd.close();
>         cd = null;
>         System.out.println("PDDocument close - Total memory (Mo): " + Math.round((Runtime.getRuntime().totalMemory()/1000000) * virgule) / virgule);
>     }
> }
> retour = new TextField(fieldName, pdfText, Field.Store.NO);
> System.out.println("TextField - Total memory (Mo): " + Math.round((Runtime.getRuntime().totalMemory()/1000000) * virgule) / virgule);
> And the result in my output window:
> START - Total memory (Mo): 95.0
> PDDocument getNumberOfPages - Nombre de pages: 2676
> PDDocument load - Total memory (Mo): 121.0
> PDFTextStripper getText - Total memory (Mo): 757.0
> PDFTextStripper resetEngine - Total memory (Mo): 757.0
> PDDocument close - Total memory (Mo): 757.0
> TextField - Total memory (Mo): 757.0
> pdfText - Total memory (Mo): 757.0
> I also tried calling System.gc(), but the memory usage stays the same.
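
A side note on the numbers in the quoted report: Runtime.totalMemory() returns the amount of heap the JVM currently has reserved, not the amount actually in use. Depending on the garbage collector and its settings, the JVM may keep that heap reserved after the objects in it have been collected, so a constant "Total memory" value does not by itself show that memory was not freed. The sketch below prints both figures around a temporary allocation; HeapSnapshot is a hypothetical name, and the 20 MB buffer is an arbitrary stand-in for the extracted text.

```java
/** Hypothetical value class: one reading of the JVM heap counters. */
final class HeapSnapshot {
    final long totalBytes;  // heap currently reserved by the JVM
    final long freeBytes;   // unused part of that reserved heap

    private HeapSnapshot(long totalBytes, long freeBytes) {
        this.totalBytes = totalBytes;
        this.freeBytes = freeBytes;
    }

    static HeapSnapshot take() {
        Runtime rt = Runtime.getRuntime();
        return new HeapSnapshot(rt.totalMemory(), rt.freeMemory());
    }

    /** Heap actually occupied by objects (live or not yet collected). */
    long usedBytes() {
        return totalBytes - freeBytes;
    }

    String describe(String label) {
        return String.format("%s: total = %.1f MB, used = %.1f MB",
                label, totalBytes / 1_000_000.0, usedBytes() / 1_000_000.0);
    }

    public static void main(String[] args) {
        System.out.println(take().describe("start"));

        byte[] buffer = new byte[20_000_000];   // stand-in for a large result
        buffer[0] = 1;                          // touch it so it is really used
        System.out.println(take().describe("after allocation"));

        buffer = null;                          // drop the only reference
        System.gc();                            // only a hint; the JVM may ignore it
        System.out.println(take().describe("after release"));
    }
}
```

The figure to watch for a leak is the used value after the release step, not the total value.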



--
This message was sent by Atlassian JIRA
(v6.1.4#6159)