Posted to users@pdfbox.apache.org by Jérôme Wacongne <ch...@c4-soft.com> on 2011/02/01 18:02:41 UTC

undeleted pdf box temp files (+~JF*.tmp)

Hi,

I have an issue with PDFBox leaving temp files (+~JF*.tmp) in the JVM temp
folder.
After enough runs, that folder fills up.

Here are the Groovy source code and unit test to reproduce it (ok.pdf is
parsed with no error; ko.pdf has one corrupted object, but parsing can
continue):

package com.c4soft.mvntest

import org.apache.pdfbox.pdmodel.PDDocument
import org.apache.pdfbox.util.PDFTextStripper

/**
 * Example Groovy class.
 */
class MvnTest {
    static main(args) {
        if (!args || args.size() != 2) {
            println 'Exactly two arguments are required:\n- path to the pdf to read\n- path to the file where to put the pdf content'
            return
        }
        File pdfFile = new File(args[0])
        if (!pdfFile.exists()) {
            println "${args[0]} does not exist"
            return
        }
        InputStream pdfStream = new FileInputStream(pdfFile)
        OutputStream text = new FileOutputStream(args[1])
        PDDocument document
        try {
            // second argument (force): keep parsing past corrupted objects
            document = PDDocument.load(pdfStream, true)
            PDFTextStripper stripper = new PDFTextStripper('UTF-8')
            text << stripper.getText(document)
        } finally {
            // null-safe: load() may throw before document is assigned
            document?.close()
            text.close()
            pdfStream.close()
        }
    }
}




package com.c4soft.mvntest

import static org.junit.Assert.*
import java.util.regex.Pattern
import groovy.io.FileType
import org.junit.Test

class MvnTestTest {
    @Test
    public void testMain() {
        File tmpDir = new File(System.getProperty('java.io.tmpdir'))
        Pattern p = ~/\+~JF.+\.tmp/
        // start from a clean state: remove leftovers from earlier runs
        tmpDir.eachFileMatch(FileType.FILES, p) { File tmpFile ->
            tmpFile.delete()
        }
        Integer cnt = 0
        tmpDir.eachFileMatch(FileType.FILES, p) { File tmpFile ->
            cnt++
        }
        assertEquals(0, cnt)
        OutputStream out = new FileOutputStream('out.txt')
        // 20 runs on the clean PDF, counting temp files after each run
        Integer i = 1
        while (i < 21) {
            MvnTest.main('src/test/resources/ok.pdf', 'ok.txt')
            cnt = 0
            tmpDir.eachFileMatch(FileType.FILES, p) { File tmpFile ->
                cnt++
            }
            out << "${i} ok: ${cnt}" << '\n'
            i++
        }
        out << '\n'
        // 20 runs on the PDF with one corrupted object
        i = 1
        while (i < 21) {
            MvnTest.main('src/test/resources/ko.pdf', 'ko.txt')
            cnt = 0
            tmpDir.eachFileMatch(FileType.FILES, p) { File tmpFile ->
                cnt++
            }
            out << "${i} ko: ${cnt}" << '\n'
            i++
        }
        out.close()
    }
}


And here is the corresponding output (count of PDFBox temp files in the JVM
temp folder after each run):
1 ok: 35
2 ok: 36
3 ok: 37
4 ok: 42
5 ok: 43
6 ok: 46
7 ok: 47
8 ok: 50
9 ok: 50
10 ok: 53
11 ok: 53
12 ok: 56
13 ok: 65
14 ok: 68
15 ok: 73
16 ok: 75
17 ok: 80
18 ok: 81
19 ok: 84
20 ok: 85

1 ko: 62
2 ko: 62
3 ko: 62
4 ko: 62
5 ko: 62
6 ko: 62
7 ko: 62
8 ko: 62
9 ko: 71
10 ko: 65
11 ko: 67
12 ko: 68
13 ko: 68
14 ko: 68
15 ko: 68
16 ko: 68
17 ko: 68
18 ko: 77
19 ko: 71
20 ko: 73

After JUnit returns, 3 tmp files are left in the temp folder.

Has anyone ever noticed something like that?
Am I misusing the API?

Regards,
Ch4mp

Re: undeleted pdf box temp files (+~JF*.tmp)

Posted by Jérôme Wacongne <ch...@c4-soft.com>.
Thank you Kevin :D

2011/2/12 Kevin Jackson <kj...@ebscohost.com>

> I submitted a patch with Jira PDFBOX-959 which fixed this problem if you
> are only doing text extraction.
>
> Kevin
>
> -----Original Message-----
> From: Jérôme Wacongne [mailto:ch4mp@c4-soft.com]
> Sent: Friday, February 11, 2011 5:40 PM
> To: users@pdfbox.apache.org
> Subject: Re: undeleted pdf box temp files (+~JF*.tmp)
>
> I'm glad others noticed because it is a production issue for me.
> We had to set up a batch job that cleans old temp files out of the JVM tmp
> folder so that it does not fill up. Kind of a dirty workaround.
>
> Best regards
>
> 2011/2/11 Ryan McKinley <ry...@gmail.com>
>
> > >
> > > Has anyone ever noticed something like that?
> > > Am I misusing the API?
> > >
> >
> > I get the same behavior.... lots and lots of +~JF*.tmp files...
> >
> > last time i looked into it, it is a tempfile that java makes when you
> > reference fonts.
> >
> > would be nice to somehow clean them up automatically though.
> >
> > ryan
> >
>

Re: undeleted pdf box temp files (+~JF*.tmp)

Posted by Andreas Lehmkuehler <an...@lehmi.de>.
Hi,

Am 12.02.2011 02:05, schrieb Kevin Jackson:
> I submitted a patch with Jira PDFBOX-959 which fixed this problem if you are only doing text extraction.
I added the patch in revision 1070073.

Thanks again for the contribution!

BR
Andreas Lehmkühler

RE: undeleted pdf box temp files (+~JF*.tmp)

Posted by Kevin Jackson <kj...@ebscohost.com>.
I submitted a patch with Jira PDFBOX-959 which fixed this problem if you are only doing text extraction.

Kevin

-----Original Message-----
From: Jérôme Wacongne [mailto:ch4mp@c4-soft.com] 
Sent: Friday, February 11, 2011 5:40 PM
To: users@pdfbox.apache.org
Subject: Re: undeleted pdf box temp files (+~JF*.tmp)

I'm glad others noticed because it is a production issue for me.
We had to set up a batch job that cleans old temp files out of the JVM tmp
folder so that it does not fill up. Kind of a dirty workaround.

Best regards

2011/2/11 Ryan McKinley <ry...@gmail.com>

> >
> > Has anyone ever noticed something like that?
> > Am I misusing the API?
> >
>
> I get the same behavior.... lots and lots of +~JF*.tmp files...
>
> last time i looked into it, it is a tempfile that java makes when you
> reference fonts.
>
> would be nice to somehow clean them up automatically though.
>
> ryan
>

Re: undeleted pdf box temp files (+~JF*.tmp)

Posted by Jérôme Wacongne <ch...@c4-soft.com>.
I'm glad others noticed because it is a production issue for me.
We had to set up a batch job that cleans old temp files out of the JVM tmp
folder so that it does not fill up. Kind of a dirty workaround.

Best regards

2011/2/11 Ryan McKinley <ry...@gmail.com>

> >
> > Has anyone ever noticed something like that?
> > Am I misusing the API?
> >
>
> I get the same behavior.... lots and lots of +~JF*.tmp files...
>
> last time i looked into it, it is a tempfile that java makes when you
> reference fonts.
>
> would be nice to somehow clean them up automatically though.
>
> ryan
>
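
For anyone stuck with the same workaround, the cleanup batch described in this message could look roughly like the sketch below. It is not the job actually used in production here: the class name FontTempFileCleaner, the method deleteStale and the 24-hour threshold are all made up for illustration, and it assumes Java 8 or later (java.nio.file and java.time). It only touches files named +~JF*.tmp and skips anything modified recently, on the assumption that a recent file may still belong to a running JVM.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.time.Duration;
import java.time.Instant;

/** Hypothetical cleanup job: deletes stale +~JF*.tmp files from a directory. */
public class FontTempFileCleaner {

    /**
     * Deletes regular files named +~JF*.tmp in dir whose last-modified time
     * is older than maxAge, and returns how many were deleted.
     */
    public static int deleteStale(Path dir, Duration maxAge) throws IOException {
        Instant cutoff = Instant.now().minus(maxAge);
        int deleted = 0;
        // The glob is matched against the file name only, not the full path.
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(dir, "+~JF*.tmp")) {
            for (Path file : stream) {
                if (!Files.isRegularFile(file)) {
                    continue;
                }
                if (Files.getLastModifiedTime(file).toInstant().isBefore(cutoff)) {
                    try {
                        if (Files.deleteIfExists(file)) {
                            deleted++;
                        }
                    } catch (IOException e) {
                        // The file may still be locked by another process; skip it.
                    }
                }
            }
        }
        return deleted;
    }

    public static void main(String[] args) throws IOException {
        Path tmpDir = Paths.get(System.getProperty("java.io.tmpdir"));
        // 24 hours is an arbitrary threshold chosen for this sketch.
        int n = deleteStale(tmpDir, Duration.ofHours(24));
        System.out.println("Deleted " + n + " stale font temp files");
    }
}
```

The age threshold is the important design choice: deleting every match regardless of age could remove a file that a live JVM is still reading.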

Re: undeleted pdf box temp files (+~JF*.tmp)

Posted by Andreas Lehmkuehler <an...@lehmi.de>.
Hi,

Am 11.02.2011 22:54, schrieb Ryan McKinley:
>>
>> Has anyone ever noticed something like that?
>> Am I misusing the API?
>>
>
> I get the same behavior.... lots and lots of +~JF*.tmp files...
>
> last time i looked into it, it is a tempfile that java makes when you
> reference fonts.
>
> would be nice to somehow clean them up automatically though.
This may be related to PDFBOX-959 [1]


BR
Andreas Lehmkühler

[1] https://issues.apache.org/jira/browse/PDFBOX-959

Re: undeleted pdf box temp files (+~JF*.tmp)

Posted by Ryan McKinley <ry...@gmail.com>.
>
> Has anyone ever noticed something like that?
> Am I misusing the API?
>

I get the same behavior.... lots and lots of +~JF*.tmp files...

last time i looked into it, it is a tempfile that java makes when you
reference fonts.

would be nice to somehow clean them up automatically though.

ryan
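
To check the font explanation above outside of PDFBox, a small standalone program along these lines might help. It is only a sketch under assumptions: the class name FontTempFileDemo and the helper countFontTempFiles are invented for this example, the font path is supplied by the caller, and the behaviour it probes (java.awt.Font.createFont(int, InputStream) copying the stream into a +~JF*.tmp file, while createFont(int, File) should read the file in place) is what the OpenJDK implementations I know of do, not something the Javadoc promises. No expected counts are given because they depend on the JDK and on what else is running.

```java
import java.awt.Font;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

/** Hypothetical demo: watch +~JF*.tmp files appear when AWT fonts are created. */
public class FontTempFileDemo {

    /** Counts the entries named +~JF*.tmp directly inside dir. */
    public static int countFontTempFiles(Path dir) throws IOException {
        int count = 0;
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(dir, "+~JF*.tmp")) {
            for (Path ignored : stream) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 1) {
            System.out.println("Usage: FontTempFileDemo <path to a TrueType font>");
            return;
        }
        File ttf = new File(args[0]);
        Path tmpDir = Paths.get(System.getProperty("java.io.tmpdir"));
        System.out.println("before: " + countFontTempFiles(tmpDir));

        // Creating a font from a stream: the JDK has to buffer the data somewhere.
        try (InputStream in = new FileInputStream(ttf)) {
            Font fromStream = Font.createFont(Font.TRUETYPE_FONT, in);
            System.out.println("after createFont(InputStream): " + countFontTempFiles(tmpDir));
        }

        // Creating a font from a File: no copy should be needed.
        Font fromFile = Font.createFont(Font.TRUETYPE_FONT, ttf);
        System.out.println("after createFont(File): " + countFontTempFiles(tmpDir));
    }
}
```

If the count grows only after the stream variant, that would be consistent with the temp files coming from AWT font creation and not from PDFBox's own file handling.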