Posted to pylucene-dev@lucene.apache.org by Andi Vajda <va...@apache.org> on 2009/03/10 17:12:46 UTC

Re: PDFBox testcase

On Tue, 10 Mar 2009, Christian Heimes wrote:

> I've attached an isolated testcase for you. You'll surely recognize the
> make file. It's based on your make file from PyLucene. I hope you don't
> mind ;)

Thank you, that is very helpful in debugging this.

But first, please do not contact me off list. Use the 
pylucene-dev@lucene.apache.org mailing list. Your issue is of interest to 
others.

The reason for the error is that you're calling one of your native extension 
methods, startDocument, from the PyPDFTextStripper constructor.

While this is valid Java, it violates an unstated constraint of the code 
generated by JCC: after the Java constructor returns, JCC-generated code 
finishes initializing the object by calling the pythonExtension(pythonObject) 
method.

The problem with this sequence of events is that if you call a native 
extension method from the constructor, the Python object that the native 
method needs to call into is not yet set on the Java instance. In other 
words, inside the constructor, native extension methods such as 
startDocument() depend on instance state that is not yet set.
In order to set that state, the object has to be constructed first, so we're 
in a bit of a catch-22 here.
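
As a rough illustration of that ordering problem, here is a pure-Python 
analogy. It does not use JCC or PDFBox; JavaSide, python_extension, 
construct and the RuntimeError check are all made-up stand-ins for the 
generated Java class, pythonExtension(pythonObject) and the JCC glue. The 
real failure surfaces as the SystemError shown in the traceback below, not 
as this RuntimeError.

```python
class JavaSide:
    """Hypothetical stand-in for the Java extension class."""

    def __init__(self, call_hook_in_constructor):
        # During construction the link back to Python does not exist yet.
        self.python_object = None
        if call_hook_in_constructor:
            self.start_document()  # too early: nothing to dispatch to

    def python_extension(self, python_object):
        # Mirrors pythonExtension(pythonObject): runs after the constructor.
        self.python_object = python_object

    def start_document(self):
        # Stand-in for a native extension method that forwards to Python.
        if self.python_object is None:
            raise RuntimeError("python object not set yet")
        return self.python_object.start_document()


class Stripper:
    """Hypothetical Python subclass providing the extension method."""

    def start_document(self):
        return "started"


def construct(call_hook_in_constructor):
    java = JavaSide(call_hook_in_constructor)  # step 1: constructor runs
    java.python_extension(Stripper())          # step 2: Python self is set
    return java
```

Calling construct(False) works and start_document() can be used afterwards; 
construct(True) fails in step 1, before step 2 ever gets a chance to run.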

It is possible to remove this constraint by changing the extension protocol 
such that _all_ extension class constructors take a first parameter, the 
'pythonObject' long (in fact, the Python instance pointer, the Python self), 
and assign it to the pythonObject instance variable. This is ugly though, so 
it needs more thought. At the very least, some code should be added to check 
for this condition.
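
To make the idea concrete, here is the same kind of pure-Python analogy for 
the changed protocol. This is only a sketch of the proposal, not anything 
JCC implements; JavaSide, Stripper and their methods are hypothetical names.

```python
class JavaSide:
    """Hypothetical stand-in: the constructor receives the Python self first."""

    def __init__(self, python_object):
        # The link to Python is stored before any other constructor code runs.
        self.python_object = python_object
        self.start_document()  # now safe to call from the constructor

    def start_document(self):
        # Stand-in for a native extension method that forwards to Python.
        return self.python_object.start_document()


class Stripper:
    """Hypothetical Python side that records the calls it receives."""

    def __init__(self):
        self.events = []
        # The Python object already exists here, so it can be handed over.
        self.java = JavaSide(self)

    def start_document(self):
        self.events.append("start_document")
```

The cost is the one mentioned above: every extension class constructor would 
have to carry the extra leading parameter.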

In the meantime, the workaround is simple: move the offending code to its 
own method and call it after the constructor returns.
I attached the modified PyPDFTextStripper.java class and test case that now 
work.
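
In the same pure-Python analogy, the workaround looks roughly like this. The 
names JavaSide, setup and make_stripper are invented for the sketch and are 
not taken from the attached PyPDFTextStripper.java.

```python
class JavaSide:
    """Hypothetical stand-in for the Java extension class."""

    def __init__(self):
        # The constructor no longer calls any extension method.
        self.python_object = None

    def python_extension(self, python_object):
        # Mirrors pythonExtension(pythonObject): runs after the constructor.
        self.python_object = python_object

    def start_document(self):
        # Stand-in for a native extension method that forwards to Python.
        return self.python_object.start_document()

    def setup(self):
        # The code that used to live in the constructor.
        self.start_document()


class Stripper:
    """Hypothetical Python side that records the calls it receives."""

    def __init__(self):
        self.events = []

    def start_document(self):
        self.events.append("start_document")


def make_stripper():
    py = Stripper()
    java = JavaSide()          # constructor returns without calling hooks
    java.python_extension(py)  # the link to Python is established
    java.setup()               # only now run the former constructor code
    return py
```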

Andi..

>
> $ python2.5 tests/test_textstripper.py
> Loading: /home/heimes/software/misc/pdfbox/pdfbox-0.8.0/test/input/warp.pdf
> E
> ======================================================================
> ERROR: test_subclass (__main__.TestTextStripper)
> ----------------------------------------------------------------------
> Traceback (most recent call last):
>  File "tests/test_textstripper.py", line 24, in test_subclass
>    Stripper(PDF)
> SystemError: NULL result without error in PyObject_Call
>
> ----------------------------------------------------------------------
> Ran 1 test in 0.264s
>
> FAILED (errors=1)
>

Re: PDFBox testcase

Posted by Christian Heimes <li...@cheimes.de>.
Andi Vajda wrote:
>> I've attached an isolated testcase for you. You'll surely recognize the
>> make file. It's based on your make file from PyLucene. I hope you don't
>> mind ;)
> 
> Thank you, that is very helpful in debugging this.
> 
> But first, please do not contact me off list. Use the
> pylucene-dev@lucene.apache.org mailing list. Your issue is of interest
> to others.

Understood! I didn't want to spam a mailing list with an attachment.
Most mailing lists I'm subscribed to have a no-attachment policy.


> The reason for the error is that you're calling one of your native
> extension methods, startDocument, from the PyPDFTextStripper constructor.
> 
> While this is valid Java, it violates an unstated constraint of the code
> generated by JCC: after the Java constructor returns, JCC-generated code
> finishes initializing the object by calling the
> pythonExtension(pythonObject) method.
> 
> The problem with this sequence of events is that if you call a native
> extension method from the constructor, the python object to call a
> method on from that native method is not yet set on the Java instance.
> In other words, inside the constructor, the native extension methods
> such as startDocument() depend on state on the instance that is not yet
> set.
> In order to set that state, the object has to be constructed first, so
> we're in a bit of a catch-22 here.

Unstated indeed ... :)

Thanks again for your analysis and explanation of my issue.
(Un)fortunately I'm very good at hitting obscure corner cases. The
limitation isn't an issue for me. You've already shown me one way to
work around it. I suggest that you add a note to the corresponding
chapter of the README.

I have permission from my employer to release the code once it's ready. I'd
like to get your opinion regarding the packaging and setup of my code.

Christian