Posted to pylucene-dev@lucene.apache.org by Christian Heimes <li...@cheimes.de> on 2009/02/28 18:22:37 UTC

Wrapping PDFBox with JCC

Hello!

I'm trying to wrap pdfbox with JCC and have run into multiple issues. At
first the list of required jars kept growing. I had to remove two
packages and one class to keep the list small. Is there a better way to
omit some packages from getting wrapped? --exclude didn't do what I
expected it to do.

Once I got the wrapping right I ran into another issue. The generated
code failed to compile:

gcc -pthread -fno-strict-aliasing -DNDEBUG -g -fwrapv -O2 -Wall
-Wstrict-prototypes -fPIC -DPYTHON=1 -I/usr/lib/jvm/java-6-sun/include
-I/usr/lib/jvm/java-6-sun/include/linux -Ibuild/_pdfbox
-I/usr/lib/python2.5/site-packages/JCC-2.1-py2.5-linux-x86_64.egg/jcc/sources
-I/usr/include/python2.5 -c build/_pdfbox/__init__.cpp -o
build/temp.linux-x86_64-2.5/build/_pdfbox/__init__.o
-fno-strict-aliasing -Wno-write-strings
cc1plus: warning: command line option "-Wstrict-prototypes" is valid
for Ada/C/ObjC but not for C++
In file included from build/_pdfbox/__init__.cpp:597:
build/_pdfbox/com/ibm/icu/impl/duration/impl/XMLRecordReader.h:52:
error: expected unqualified-id before 'const'
build/_pdfbox/com/ibm/icu/impl/duration/impl/XMLRecordReader.h:52:
error: expected `)' before 'const'
In file included from build/_pdfbox/__init__.cpp:598:
build/_pdfbox/com/ibm/icu/impl/duration/impl/XMLRecordWriter.h:54:
error: expected unqualified-id before 'const'
build/_pdfbox/com/ibm/icu/impl/duration/impl/XMLRecordWriter.h:54:
error: expected `)' before 'const'
In file included from
build/_pdfbox/org/apache/pdfbox/examples/util/PrintImageLocations.h:4,
                 from build/_pdfbox/__init__.cpp:2497:
...


XMLRecordReader.h
-----------------
...
XMLRecordReader(const XMLRecordReader& obj) : java::lang::Object(obj) {}

jboolean bool(const java::lang::String&) const; # <<< line 52
JArray<jboolean> boolArray(const java::lang::String&) const;
jchar character(const java::lang::String&) const;
JArray<jchar> characterArray(const java::lang::String&) const;
jboolean close() const;
...

I'm using JCC 2.1 (from pypi.python.org) on an AMD64 Linux box with
Python 2.5 and java-6-sun-1.6.0.10.


Project
=======
homepage: http://incubator.apache.org/pdfbox/
svn repository: http://svn.apache.org/repos/asf/incubator/pdfbox/trunk


Remove some files and packages
==============================
rm -r src/main/java/org/apache/pdfbox/ant
rm -r src/main/java/org/apache/pdfbox/searchengine
rm src/main/java/org/apache/pdfbox/pdmodel/encryption/PublicKeySecurityHandler.java


Patch one file
==============

src/main/java/org/apache/pdfbox/pdmodel/encryption/SecurityHandlersManager.java
@@ -21,8 +21,6 @@
 import java.security.Security;
 import java.util.Hashtable;

-import org.bouncycastle.jce.provider.BouncyCastleProvider;
-
 /**
  * This class manages security handlers for the application. It follows the singleton pattern.
  * To be usable, security managers must be registered in it. Security managers are retrieved by
@@ -69,10 +67,6 @@
                 StandardSecurityHandler.FILTER,
                 StandardSecurityHandler.class,
                 StandardProtectionPolicy.class);
-            this.registerHandler(
-                PublicKeySecurityHandler.FILTER,
-                PublicKeySecurityHandler.class,
-                PublicKeyProtectionPolicy.class);
         }
         catch(Exception e)
         {
@@ -144,7 +138,6 @@
         {
             instance = new SecurityHandlersManager();
         }
-        Security.addProvider(new BouncyCastleProvider());

         return instance;
     }


build
=====

python2.5 -m jcc.__main__ \
    --jar lib/apache-pdfbox-0.8.0-incubator-dev.jar \
    --jar external/junit.jar \
    --jar external/FontBox-0.2.0-dev.jar \
    --jar external/JempBox-0.2.0.jar \
    --jar external/icu4j-4_0.jar \
    --package java.lang \
    --python pdfbox --version 0.8.0 --files 2 --build

Re: Wrapping PDFBox with JCC

Posted by Christian Heimes <li...@cheimes.de>.
Andi Vajda wrote:
> Something's not set up right and the error is not properly reported.
> To debug this, I'd use gdb to step through the code.
> If you send me a self-contained test case that reproduces this, I can
> give it a try.

I'll prepare a test case tomorrow. Thanks again!

Christian

Re: Wrapping PDFBox with JCC

Posted by Andi Vajda <va...@apache.org>.
On Mon, 9 Mar 2009, Christian Heimes wrote:

> Andi Vajda wrote:
>> At first quick glance, you declare a native startPage method but your
>> python class is not implementing it. On the other hand, it has a
>> startArticle method.
>
> Oh, you are right. But that doesn't fix the issue. It's still failing
> with the error message "SystemError: NULL result without error in
> PyObject_Call". I added some System.out.println() calls to the Java code.
> The error is caused by "startDocument(document);".
>
> Do you have any tips on how to debug the code?

Something's not set up right and the error is not properly reported.
To debug this, I'd use gdb to step through the code.
If you send me a self-contained test case that reproduces this, I can give
it a try.

Andi..


Re: Wrapping PDFBox with JCC

Posted by Christian Heimes <li...@cheimes.de>.
Andi Vajda wrote:
> At first quick glance, you declare a native startPage method but your
> python class is not implementing it. On the other hand, it has a
> startArticle method.

Oh, you are right. But that doesn't fix the issue. It's still failing
with the error message "SystemError: NULL result without error in
PyObject_Call". I added some System.out.println() calls to the Java code.
The error is caused by "startDocument(document);".

Do you have any tips on how to debug the code?

Re: Wrapping PDFBox with JCC

Posted by Andi Vajda <va...@apache.org>.
On Mon, 9 Mar 2009, Christian Heimes wrote:

>    public native void pythonDecRef();
>
>    public native void processTextPosition( TextPosition text );
>    public native void startDocument(PDDocument pdf);
>    public native void startPage(PDPage page);
> }
>
>
> pdfbox.initVM(classpath=pdfbox.CLASSPATH)
>
> class Stripper(pdfbox.PyPDFTextStripper):
>    """
>    """
>    def processTextPosition(self, text):
>        print text
>
>    def startDocument(self, doc):
>        print doc
>
>    def startArticle(self, isltr):
>        print isltr
>

At first quick glance, you declare a native startPage method but your python 
class is not implementing it. On the other hand, it has a startArticle 
method.
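
A mismatch like this can be caught before the JVM is involved at all. The
following is only a sketch: missing_methods and the NATIVE list are
hypothetical names, not part of JCC, and the Stripper class here is a plain
stand-in that does not inherit from the generated wrapper. It looks only at
methods defined directly on the class, so methods inherited from a Python
base class would be reported as missing.

```python
def missing_methods(cls, native_names):
    """Return the names in native_names that cls does not define itself."""
    own = vars(cls)  # only attributes defined directly on cls
    return [name for name in native_names if not callable(own.get(name))]


class Stripper(object):
    # Stand-in for the Python class from the test case; no JCC required.
    def processTextPosition(self, text):
        pass

    def startDocument(self, doc):
        pass

    def startArticle(self, isltr):
        pass


# Methods declared 'native' in the Java class PyPDFTextStripper.
NATIVE = ["processTextPosition", "startDocument", "startPage"]

print(missing_methods(Stripper, NATIVE))
```

With the class as posted, the only name reported is startPage.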

Andi..

Re: Wrapping PDFBox with JCC

Posted by Christian Heimes <li...@cheimes.de>.
Andi Vajda wrote:
> > After both these fixes, I was able to build wrappers for pdfbox:
> >
> >   >>> from pdfbox import *
> >   >>> initVM(CLASSPATH, vmargs='-Djava.awt.headless=true')
> >   <jcc.JCCEnv object at 0x295c0>
> >   >>>
> >
> > This is all checked into rev 751772.
> >
> > Please let me know if this works for you, I'd like to get a PyLucene
> > 2.4.1 release started now that Java Lucene 2.4.1 has been released. If I
> > broke something while doing these non-trivial fixes, now is the time to
> > find out.

Thanks Andi!

I was able to build a pdfbox wrapper with your changes, too. The changes
to setup.py make it much easier to get the script working. Good work!

As a JCC and Java newbie I didn't understand the difference between
--jar, --include and --classpath at first. Could you please extend the
README to explain the three options?

Today I started to play with subclassable Python wrappers. I couldn't
get the appended example to work. I ran into several issues like
"SystemError: NULL result without error in PyObject_Call". Could you
have a look, please? The jar with PyPDFTextStripper was wrapped together
with the pdfbox jar.

public class PyPDFTextStripper extends PDFTextStripper {

    private PDDocument document;
    private long pythonObject;

    public PyPDFTextStripper(String filename) throws IOException
    {
        System.out.println( "Loading: " + filename );
        document = PDDocument.load(filename);
        List allPages = document.getDocumentCatalog().getAllPages();
        startDocument(document);
        for( int i=0; i<allPages.size(); i++ )
        {
            PDPage page = (PDPage)allPages.get( i );
            System.out.println( "Processing page: " + i );
            PDStream contents = page.getContents();
            if( contents != null )
            {
                processStream(page, page.findResources(),
                              page.getContents().getStream());
            }
        }
    }

    public void pythonExtension(long pythonObject)
    {
        this.pythonObject = pythonObject;
    }

    public long pythonExtension()
    {
        return this.pythonObject;
    }

    public void finalize()
        throws Throwable
    {
        pythonDecRef();
    }

    public native void pythonDecRef();

    public native void processTextPosition( TextPosition text );
    public native void startDocument(PDDocument pdf);
    public native void startPage(PDPage page);
}


pdfbox.initVM(classpath=pdfbox.CLASSPATH)

class Stripper(pdfbox.PyPDFTextStripper):
    """
    """
    def processTextPosition(self, text):
        print text

    def startDocument(self, doc):
        print doc

    def startArticle(self, isltr):
        print isltr




Re: Wrapping PDFBox with JCC

Posted by Andi Vajda <va...@apache.org>.
On Sat, 28 Feb 2009, Christian Heimes wrote:

> Christian Heimes wrote:
>> Thanks Andi! Your proposed solution worked like a charm and the file
>> compiles. However the next file breaks with another error. This time it
>> didn't help to add "operator" to the list of RES
>
> Follow up:
> Apparently JCC doesn't check the list of RESERVED words when it creates
> namespaces. I tried a quick fix but it's much more work than I thought.
> Both cpp.py and python.py must be changed. Every #include, namespace and
> class related line has to check for RESERVED and mangle the name if
> necessary.

  Hi Christian,

Thank you for sending in detailed instructions on how to reproduce this (in
an earlier message). I believe I have now fixed the bug with Java packages or
classes named using C++ reserved words. As with methods or fields, a '$' is
added to the namespace or class name.

Along the way, I also fixed a long-standing bug where static methods would
be shadowed in Python wrappers by non-static methods of the same name.

Now, when a Java class has static and non-static methods with the same name,
the static methods are suffixed with '_' on the Python class. JCC emits a
warning to stderr when this occurs.

   >>> Long.toString_(9L)
   u'9'
   >>> Long(9L).toString()
   u'9'
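
The naming rule can be illustrated without a JVM. This is a sketch of the
rule as described above, not JCC's actual implementation; the function
python_method_names and its return shape are made up for illustration.

```python
def python_method_names(static_names, instance_names):
    """Map each Java method to the name it would get on the Python class.

    Keys are (kind, java_name) pairs, values are the Python-side names.
    A static method gets a trailing '_' only when an instance method of
    the same name exists.
    """
    clashes = set(static_names) & set(instance_names)
    names = {}
    for name in instance_names:
        names[("instance", name)] = name
    for name in static_names:
        names[("static", name)] = name + "_" if name in clashes else name
    return names


# Method names loosely modelled on java.lang.Long.
names = python_method_names(
    static_names=["toString", "valueOf"],
    instance_names=["toString", "intValue"],
)
print(names[("static", "toString")])
print(names[("static", "valueOf")])
print(names[("instance", "toString")])
```

Only the clashing static method is renamed; valueOf keeps its name because
no instance method shares it.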

After both these fixes, I was able to build wrappers for pdfbox:

   >>> from pdfbox import *
   >>> initVM(CLASSPATH, vmargs='-Djava.awt.headless=true')
   <jcc.JCCEnv object at 0x295c0>
   >>>

This is all checked into rev 751772.

Please let me know if this works for you, I'd like to get a PyLucene 2.4.1 
release started now that Java Lucene 2.4.1 has been released. If I broke 
something while doing these non-trivial fixes, now is the time to find out.

Thanks !

Andi..

Re: Wrapping PDFBox with JCC

Posted by Christian Heimes <li...@cheimes.de>.
Christian Heimes wrote:
> Thanks Andi! Your proposed solution worked like a charm and the file
> compiles. However the next file breaks with another error. This time it
> didn't help to add "operator" to the list of RES

Follow up:
Apparently JCC doesn't check the list of RESERVED words when it creates
namespaces. I tried a quick fix but it's much more work than I thought.
Both cpp.py and python.py must be changed. Every #include, namespace and
class related line has to check for RESERVED and mangle the name if
necessary.
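
To show the kind of change this means, here is a rough sketch of applying
one mangling function to every package segment, both for namespace
declarations and for include paths. It is not taken from cpp.py: the
RESERVED set below is abbreviated, the helper names are hypothetical, and
the '$' suffix is assumed to be applied to namespaces the same way the
thread describes it for methods and fields.

```python
# Abbreviated, hypothetical list; JCC keeps its own RESERVED list in cpp.py.
RESERVED = {"bool", "operator", "namespace", "delete", "template"}


def mangle(name):
    """Append '$' to a name that collides with a C++ reserved word."""
    return name + "$" if name in RESERVED else name


def namespace_lines(package):
    """Emit nested C++ namespace open/close lines for a Java package."""
    parts = [mangle(part) for part in package.split(".")]
    lines = []
    for depth, part in enumerate(parts):
        lines.append("    " * depth + "namespace %s {" % part)
    for depth in reversed(range(len(parts))):
        lines.append("    " * depth + "}")
    return lines


def include_path(class_name):
    """Build a header path whose segments match the mangled namespaces."""
    return "/".join(mangle(part) for part in class_name.split(".")) + ".h"


print("\n".join(namespace_lines("org.apache.pdfbox.util.operator")))
print(include_path("org.apache.pdfbox.util.operator.OperatorProcessor"))
```

The point of routing everything through one mangle function is that the
namespace name, the header path and any qualified class reference stay in
agreement; mangling only one of them would still break the build.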

Christian

Re: Wrapping PDFBox with JCC

Posted by Andi Vajda <va...@apache.org>.
Ah well, using a reserved word as a C++ namespace name wasn't planned.
Sneakier, but clearly a bug in JCC.

Andi..

On Feb 28, 2009, at 19:08, Christian Heimes <li...@cheimes.de> wrote:

> Andi Vajda schrieb:
>>
>> On Sat, 28 Feb 2009, Christian Heimes wrote:
>>
>>> Once I got the wrapping right I ran into another issue. The generated
>>> code failed to compile:
>>>
>>> ...
>>>
>>>
>>> XMLRecordReader.h
>>> -----------------
>>> ...
>>> XMLRecordReader(const XMLRecordReader& obj) : java::lang::Object(obj) {}
>>>
>>> jboolean bool(const java::lang::String&) const; # <<< line 52
>>
>> That looks like a function called 'bool'. It's very likely that 'bool'
>> is already taken by a type declared in a system or language header file.
>>
>> Line 72 in JCC's cpp.py file is a list of reserved words called 'RESERVED'.
>> You can probably work around this problem by adding 'bool' to this list,
>> rebuilding JCC and reattempting your program.
>
> Thanks Andi! Your proposed solution worked like a charm and the file
> compiles. However the next file breaks with another error. This time it
> didn't help to add "operator" to the list of RES
>
> In file included from
> build/_pdfbox/org/apache/pdfbox/examples/util/PrintImageLocations.h:4,
>                 from build/_pdfbox/__init__.cpp:2497:
> build/_pdfbox/org/apache/pdfbox/util/PDFStreamEngine.h:10: error:
> expected identifier before 'operator'
> build/_pdfbox/org/apache/pdfbox/util/PDFStreamEngine.h:10: error:
> expected type-specifier before '{' token
> build/_pdfbox/org/apache/pdfbox/util/PDFStreamEngine.h:94: error:
> expected ',' or '...'
> build/_pdfbox/__init__.cpp:4226: error: expected identifier before
> 'operator'
>
>
> namespace org {
>    namespace apache {
>        namespace pdfbox {
>            namespace util {
>                namespace operator { # <<< line 10
>                    class OperatorProcessor;
>                }
>                class Matrix;
>            }
>
> Christian

Re: Wrapping PDFBox with JCC

Posted by Christian Heimes <li...@cheimes.de>.
Andi Vajda schrieb:
> 
> On Sat, 28 Feb 2009, Christian Heimes wrote:
> 
>> Once I got the wrapping right I ran into another issue. The generated
>> code failed to compile:
>>
>> ...
>>
>>
>> XMLRecordReader.h
>> -----------------
>> ...
>> XMLRecordReader(const XMLRecordReader& obj) : java::lang::Object(obj) {}
>>
>> jboolean bool(const java::lang::String&) const; # <<< line 52
> 
> That looks like a function called 'bool'. It's very likely that 'bool'
> is already taken by a type declared in a system or language header file.
> 
> Line 72 in JCC's cpp.py file is a list of reserved words called 'RESERVED'.
> You can probably work around this problem by adding 'bool' to this list
> rebuilding JCC and reattempting your program.

Thanks Andi! Your proposed solution worked like a charm and the file
compiles. However the next file breaks with another error. This time it
didn't help to add "operator" to the list of RES

In file included from
build/_pdfbox/org/apache/pdfbox/examples/util/PrintImageLocations.h:4,
                 from build/_pdfbox/__init__.cpp:2497:
build/_pdfbox/org/apache/pdfbox/util/PDFStreamEngine.h:10: error:
expected identifier before 'operator'
build/_pdfbox/org/apache/pdfbox/util/PDFStreamEngine.h:10: error:
expected type-specifier before '{' token
build/_pdfbox/org/apache/pdfbox/util/PDFStreamEngine.h:94: error:
expected ',' or '...'
build/_pdfbox/__init__.cpp:4226: error: expected identifier before
'operator'


namespace org {
    namespace apache {
        namespace pdfbox {
            namespace util {
                namespace operator { # <<< line 10
                    class OperatorProcessor;
                }
                class Matrix;
            }

Christian

Re: Wrapping PDFBox with JCC

Posted by Andi Vajda <va...@apache.org>.
On Sat, 28 Feb 2009, Christian Heimes wrote:

> Once I got the wrapping right I ran into another issue. The generated
> code failed to compile:
>
> ...
>
>
> XMLRecordReader.h
> -----------------
> ...
> XMLRecordReader(const XMLRecordReader& obj) : java::lang::Object(obj) {}
>
> jboolean bool(const java::lang::String&) const; # <<< line 52

That looks like a function called 'bool'. It's very likely that 'bool' is 
already taken by a type declared in a system or language header file.

Line 72 in JCC's cpp.py file is a list of reserved words called 'RESERVED'.
You can probably work around this problem by adding 'bool' to this list,
rebuilding JCC and reattempting your program.
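
In C++, 'bool' is a language keyword, as are a number of other words that
are perfectly legal identifiers in Java. A quick way to find candidates for
the RESERVED list before starting a long build is to scan the Java names for
such words. The sketch below is an assumption-laden helper, not part of JCC:
CPP_ONLY_KEYWORDS is a partial list and collisions is a made-up function.

```python
# Partial list of C++ keywords that are valid Java identifiers.
CPP_ONLY_KEYWORDS = {
    "auto", "bool", "delete", "explicit", "extern", "friend", "inline",
    "mutable", "namespace", "operator", "register", "signed", "sizeof",
    "struct", "template", "typedef", "typename", "union", "unsigned",
    "using", "virtual",
}


def collisions(java_names):
    """Return the sorted, de-duplicated names that are C++ keywords."""
    return sorted({name for name in java_names if name in CPP_ONLY_KEYWORDS})


# Method names from XMLRecordReader plus the segments of one pdfbox package.
method_names = ["bool", "boolArray", "character", "characterArray", "close"]
package_parts = "org.apache.pdfbox.util.operator".split(".")

print(collisions(method_names + package_parts))
```

For the names above this flags 'bool' and 'operator', the two words that
come up in this thread.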

Andi..
