Posted to java-user@lucene.apache.org by Bertrand VENZAL <be...@cirso.fr> on 2005/01/31 19:01:11 UTC

Use an executable from java ...

Hi all,

I have a problem running a conversion tool that turns a PDF into HTML
under Linux. I have an executable, "pdftohtml", which works correctly in
batch mode, and it also works when I call it from Java under Windows 2000,
BUT it does not work at all on the server under Linux. I am using the
following code.

scommand = "/bin/sh -c \"myCommand fileName output\" ";

Runtime runtime = Runtime.getRuntime();
Process proc = runtime.exec(scommand);
proc.waitFor();


I am running my code under Red Hat Linux with a standard shell.
Is there another way to do the same thing, or am I missing something?
Any help will be greatly appreciated.

Thanks
Bertrand



Re: Use an executable from java ...

Posted by Ben Litchfield <be...@csh.rit.edu>.
I will assume you are asking this question on the lucene mailing list
because you now want to index that PDF document.

Have you tried PDFBox?  It can't create an html file for you but it can
extract text.

Ben
http://www.pdfbox.org




---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org


Re: Use an executable from java ...

Posted by Ben Litchfield <be...@csh.rit.edu>.
Kristian,

I assume all of your comments refer to the 0.7.0 version of PDFBox.  There
were some great improvements in that version in terms of speed and
accuracy.

> That's curious, because in our experience pdftotext was able to
> convert 33% more PDF documents than PDFBox.

Depending on the set of PDF documents, you will notice different results.
I welcome any bug reports (if they don't already exist) on that 33% that
are not working for you.  In particular, PDFBox needs some work on
non-English languages.


> That's good. Our application supports alternative conversion pipelines
> that provide fallback mechanisms. If the first converter cannot convert a
> document, a second converter is called. So PDFBox is our fallback
> converter.


Well, at least PDFBox made it as the "fallback".  :)

Ben
http://www.pdfbox.org



Re: Use an executable from java ...

Posted by Kristian Hermsdorf <kr...@ifbus.de>.
Hi Christiaan

> Just to defend PDFBox: we actually recently decided to move in the
> opposite direction.

I didn't want to offend PDFBox *g*

> We just removed pdftotext from our application and are now using PDFBox
> 0.7.0 for all our PDF processing. Before we were using them both in
> parallel: pdftotext for fast text extraction and PDFBox for all metadata
> such as titles, authors, etc.

pdftotext is able to produce HTML output which contains this metadata as well.
Converting from PDF to HTML and parsing the HTML is (in our tests) still twice as fast as PDFBox.

> Upon closer inspection of the output, we also saw that pdftotext was not
> able to extract text from a significant number of PDFs (9 out of 113
> documents, all perfectly readable PDF documents) while PDFBox performed
> flawlessly. For us, quality is of greater concern than speed.

That's curious, because in our experience pdftotext was able to convert 33% more PDF documents than PDFBox.

> Finally, I must say that the speed and quality of Ben's replies to bug
> reports and suggestions are very impressive, giving us confidence that
> future problems will be handled satisfactorily.

That's good. Our application supports alternative conversion pipelines that provide fallback mechanisms. If the first converter cannot convert a document, a second converter is called. So PDFBox is our fallback converter.
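
A fallback pipeline of this kind could be sketched roughly as follows. This is only an illustration of the idea, not our actual code: the Converter interface, the FallbackPipeline class and the two sample converters are hypothetical names invented for the example, and a real converter would wrap pdftotext or PDFBox instead of returning a fixed string.

```java
import java.util.Arrays;
import java.util.List;

class FallbackPipeline {

    /** Hypothetical converter abstraction; not part of PDFBox or pdftotext. */
    interface Converter {
        String convert(String path) throws Exception;
    }

    /**
     * Tries each converter in order and returns the first successful result.
     * If every converter fails, the last failure is attached as the cause.
     */
    static String convertWithFallback(List<Converter> converters, String path) throws Exception {
        Exception lastFailure = null;
        for (Converter converter : converters) {
            try {
                return converter.convert(path);
            } catch (Exception e) {
                // Remember the failure and move on to the next converter.
                lastFailure = e;
            }
        }
        throw new Exception("No converter could handle " + path, lastFailure);
    }

    public static void main(String[] args) throws Exception {
        // Stand-ins for real converters (assumption: the primary one fails here).
        Converter primary = path -> { throw new IllegalStateException("primary failed"); };
        Converter fallback = path -> "text of " + path;

        String text = convertWithFallback(Arrays.asList(primary, fallback), "sample.pdf");
        System.out.println(text);
    }
}
```

The order of the list decides which tool is the "fallback", so swapping the two converters is a one-line change.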

Greetings
Kristian

-- 
ACRONYM: Acronym Causing Recursion, Obviously Numbing Your Mind  

Kristian Hermsdorf

Interface Projects GmbH
Tolkewitzer Straße  49		
01277 Dresden			


tel.: ++49-351-3 18 09 39

mail: Kristian.Hermsdorf@interface-business.de
priv: kristian@entropus.de




Re: Use an executable from java ...

Posted by Christiaan Fluit <ch...@aduna.biz>.
Kristian Hermsdorf wrote:

> We're using pdftotext as well, because PDFBox is really slow. If your 
> application has to work under Windows you will probably experience some 
> mysterious Java VM crashes while executing external processes in batch mode. 
> (This is because of a bug in the Windows VM... we implemented our own 
> Process with JNI to work around this bug).

Just to defend PDFBox: we actually recently decided to move in the 
opposite direction.

We just removed pdftotext from our application and are now using PDFBox 
0.7.0 for all our PDF processing. Before we were using them both in 
parallel: pdftotext for fast text extraction and PDFBox for all metadata 
such as titles, authors, etc.

One reason for this is that with version 0.7.0 the difference in 
performance was only marginal on our test set of 113 PDF documents from 
various sources. Of course the difference will be bigger when you are 
only extracting text, because in the old situation we had to let two 
tools process the same file.

Upon closer inspection of the output, we also saw that pdftotext was not 
able to extract text from a significant number of PDFs (9 out of 113 
documents, all perfectly readable PDF documents) while PDFBox performed 
flawlessly. For us, quality is of greater concern than speed.

Finally, I must say that the speed and quality of Ben's replies to bug 
reports and suggestions are very impressive, giving us confidence that 
future problems will be handled satisfactorily.


Regards,

Chris


Re: Use an executable from java ...

Posted by Kristian Hermsdorf <kr...@ifbus.de>.
Hi

> I have a problem running a conversion tool that turns a PDF into HTML
> under Linux. I have an executable, "pdftohtml", which works correctly in
> batch mode, and it also works when I call it from Java under Windows 2000,
> BUT it does not work at all on the server under Linux. I am using the
> following code....

You've got to read the process's stdout and stderr while the process is running. If you don't read those streams, the process will block after it has written some (about 8k) bytes to its stdout/stderr.
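
A minimal sketch of that approach, assuming Java 5 or later (ProcessBuilder): stderr is merged into stdout so a single reader can drain both, and the output is consumed before waitFor() is called. The class name ExternalCommand and the echo command are placeholders for illustration only. The arguments are passed as a list because Runtime.exec(String) splits the command on whitespace without honoring the escaped quotes, so "/bin/sh -c" would not receive the quoted command as one argument; that may also be part of the problem with the code above.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.Arrays;
import java.util.List;

class ExternalCommand {

    /** Exit code and combined stdout/stderr text of a finished process. */
    static final class Result {
        public final int exitCode;
        public final String output;

        Result(int exitCode, String output) {
            this.exitCode = exitCode;
            this.output = output;
        }
    }

    /** Runs the command, draining its output so the child never blocks on a full pipe. */
    static Result run(List<String> command) throws IOException, InterruptedException {
        ProcessBuilder builder = new ProcessBuilder(command);
        // Merge stderr into stdout so one reader is enough.
        builder.redirectErrorStream(true);
        Process proc = builder.start();

        StringBuilder output = new StringBuilder();
        try (BufferedReader reader =
                 new BufferedReader(new InputStreamReader(proc.getInputStream()))) {
            String line;
            // Read until the child closes its output, i.e. until it exits.
            while ((line = reader.readLine()) != null) {
                output.append(line).append('\n');
            }
        }
        // Safe to wait now: everything the child wrote has been consumed.
        int exitCode = proc.waitFor();
        return new Result(exitCode, output.toString());
    }

    public static void main(String[] args) throws Exception {
        // Placeholder command; replace the last element with the real converter call.
        Result result = run(Arrays.asList("/bin/sh", "-c", "echo converted; echo warning 1>&2"));
        System.out.println("exit code: " + result.exitCode);
        System.out.print(result.output);
    }
}
```

If stdout and stderr have to be kept apart, each stream needs its own reader thread instead of redirectErrorStream(true).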

We're using pdftotext as well, because PDFBox is really slow. If your application has to work under Windows you will probably experience some mysterious Java VM crashes while executing external processes in batch mode. (This is because of a bug in the Windows VM... we implemented our own Process with JNI to work around this bug).

Greetings,
Kristian




Re: Use an executable from java ...

Posted by Kelvin Tan <ke...@relevanz.com>.
Check out http://www.javaworld.com/javaworld/jw-12-2000/jw-1229-traps.html which provides some pointers and code which should be helpful.

Cheers,
Kelvin
http://www.supermind.org
