Posted to pylucene-dev@lucene.apache.org by Andi Vajda <va...@apache.org> on 2009/04/06 18:26:33 UTC

Re: JCC: wrapping iText / issues with standard library macros / suggestions

On Mon, 6 Apr 2009, Jonas Maurus wrote:

> seeing that last month had a thread on wrapping PDFBox, obviously
> there's some demand for a fully-featured PDF library for Python :-).
> So for one of my projects I started working on wrapping iText with JCC
> today and I want to state that I'm really impressed! JCC rocks.

Thanks !

> Besides the fact that JCC could use a "--help" parameter, it all went
> very smoothly. I just ran into two small problems:

Most of the command line flags are documented here
http://lucene.apache.org/pylucene/jcc/documentation/readme.html#use
A --help flag would be nice indeed. Would you like to contribute a patch ?

>  * I had to do some trial-and-error recompiling because JCC doesn't
> include subtypes in the dependencies which means, for example, that
> when you really should use a FileOutputStream, iText usually only
> imports an OutputStream (probably a calculated dependency from the
> library's method signatures), so it needs some fiddling with --package
> to get all necessary classes in.

Yes, unless the API you're wrapping directly states FileOutputStream, JCC 
can't guess that that's what you need. If you'd like to have wrappers 
generated for FileOutputStream but none of the classes you're already 
generating wrappers for mention it, you need to add it to the list of 
classes you want wrappers for.

The earlier link also documents this behaviour. The reason for this is to 
avoid runaway transitive dependency closures. The code generated by JCC can 
easily get huge.

> Is there a good way for forcing the import of a whole package?

No, for two reasons:
   - because there is no good way to find all the classes in a whole package
     (it's limited by what can be found by your classpath)
   - again, runaway wrappers will cause runaway dependencies and a huge
     amount of code, most of it not needed, to be generated.

The --package flag tells JCC to generate wrappers for classes in that 
package found via dependencies. If a method's signature depends on a class 
in a package you didn't mention, and the dependency can be done away with 
(the class is not among the superclasses or interfaces), the method is 
skipped. The earlier link also documents this.
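
The dependency rule described above can be modelled with a small Python 
sketch. This is not JCC's code: the API table, plan_wrappers() and 
package_of() are hypothetical names made up for illustration, and 
superclasses and interfaces (which JCC always pulls in) are not modelled.

```python
# Simplified, hypothetical model of an API:
# class name -> {method name: classes mentioned in its signature}.
API = {
    "com.lowagie.text.pdf.PdfWriter": {
        "getInstance": ["com.lowagie.text.Document", "java.io.OutputStream"],
    },
    "com.lowagie.text.Document": {"open": []},
    "java.io.OutputStream": {"close": []},
    "java.io.FileOutputStream": {"getChannel": ["java.nio.channels.FileChannel"]},
}


def package_of(class_name):
    """Return the package part of a fully qualified class name."""
    return class_name.rpartition(".")[0]


def plan_wrappers(api, listed, packages):
    """Return (classes to wrap, skipped (class, method) pairs).

    A dependency is followed only if it is listed explicitly or lives in
    one of the requested packages; otherwise the method using it is skipped.
    """
    listed = set(listed)
    packages = set(packages)
    wrapped, skipped = set(), []
    todo = sorted(listed)
    while todo:
        cls = todo.pop()
        if cls in wrapped:
            continue
        wrapped.add(cls)
        for method, deps in api.get(cls, {}).items():
            if all(d in listed or package_of(d) in packages for d in deps):
                todo.extend(deps)
            else:
                skipped.append((cls, method))
    return wrapped, sorted(skipped)


writer = "com.lowagie.text.pdf.PdfWriter"
wrapped, skipped = plan_wrappers(API, [writer], ["com.lowagie.text", "java.io"])
print(sorted(wrapped))
print(skipped)
```

In this model, FileOutputStream is never reached from PdfWriter because no 
signature mentions it, so it has to be listed explicitly. Leaving java.io 
out of the packages makes getInstance a skipped method.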

> The way it is now, I really don't use more than 5% of iText's API and 
> can't really tell if the wrapper contains all necessary external classes

Two situations here:
   1. you're trying to generate wrappers for the 5% you're using:
      get to know those 5% and be sure to have all you need either in what
      these classes pull in or via --package or direct class listing

   2. you're trying to generate wrappers for the entire API (like PyLucene):
      same as above, get to know the entire API, and port all unit tests and
      samples to python to test that you've got the coverage you need
      (assuming the tests and samples have good coverage). It is also
      likely (as is the case with PyLucene) that you will have to provide
      extension point classes when the Java API you're wrapping expects its
      users to provide subclasses or interface implementations of their own
      (for example, PyLucene custom analyzers).

> to use iText properly. I'm a bit lost there and would welcome pointers and 
> ideas on how to do this correctly. What happens if a Python program uses 
> multiple JCC-wrapped JVMs, would the wrapped types 
> "itext.FileOutputStream" and "lib2.FileOutputStream" collide?

A given process may embed only one JavaVM. If you want to use multiple 
JCC-built extensions in the same process be sure to use shared mode:
http://lucene.apache.org/pylucene/jcc/documentation/install.html#shared

> I also haven't started to identify any iterables or sequences that can
> be made "pythonic", using JCC's built-in extensions. Is there a good
> way to grep for those?

No, not really. You have to know the API you're generating wrappers for to 
pick these out. If you don't know them, then you're not missing them and 
your question is moot unless you're trying to generate wrappers for an 
entire API. Getting to know the API you're wrapping intimately is going to 
make it a much more usable Python extension in the long run.

>  * when compiling the extension, JCC generated the following code:
> namespace com {
>    namespace lowagie {
>        namespace text {
>            namespace pdf {
>                [...]
>                    static PdfName *DOMAIN;
>                [...]
>
> which led to this error:
> build/_itext/com/lowagie/text/pdf/PdfName.h:173: error: expected ';'
> before numeric constant
>
> This happens, because "DOMAIN", unfortunately, collides with the macro
>    #define DOMAIN 1
> in <math.h>.

Yes, this is a well known problem. JCC has a hardcoded list of words that 
can lead to such unfortunate collisions. I need to add another command line 
argument that makes it possible to add more such reserved words. Any use 
of words in this list is mangled.
Currently, that list of words is in cpp.py, line 71:

   RESERVED = set(['delete', 'and', 'or', 'not', 'xor', 'union', 'NULL',
                   'register', 'const', 'bool', 'operator'])

JCC 2.2 does a much better job than earlier versions at this already.
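
As a rough illustration of what mangling a reserved word means, here is a 
minimal Python sketch. The RESERVED set is the one quoted above; the 
cpp_name() helper and the trailing-underscore suffix are assumptions for 
illustration only, so check cpp.py for the mangling JCC actually applies.

```python
RESERVED = set(['delete', 'and', 'or', 'not', 'xor', 'union', 'NULL',
                'register', 'const', 'bool', 'operator'])


def cpp_name(name, reserved=RESERVED, suffix='_'):
    """Return an identifier that is safe to emit in generated C++.

    The suffix is a hypothetical choice, not necessarily what JCC uses.
    """
    if name in reserved:
        return name + suffix
    return name


# 'DOMAIN' is not in the hardcoded list, so it passes through unchanged
# and would still collide with the macro from <math.h>.
print(cpp_name('DOMAIN'))

# Once the word is added to the reserved set, it is mangled as well.
print(cpp_name('DOMAIN', RESERVED | set(['DOMAIN'])))
```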

Andi..

Re: JCC: wrapping iText / issues with standard library macros / suggestions

Posted by Andi Vajda <va...@apache.org>.

On Mon, 6 Apr 2009, Andi Vajda wrote:

> Yes, this is a well known problem. JCC has a hardcoded list of words that can 
> lead to such unfortunate collisions. I need to add another command line 
> argument that makes it possible to add more such reserved words.

You can now add --reserved DOMAIN on the command line and that word is added 
to the list of reserved words. To specify more than one reserved word, use 
--reserved multiple times or pass a comma-separated list of words.
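
The two accepted forms (repeating the flag, or passing a comma-separated 
list) can be sketched with Python's argparse. This is only an illustration 
of the behaviour described above, not JCC's own option handling; 
parse_reserved() is a hypothetical helper.

```python
import argparse


def parse_reserved(argv):
    """Collect reserved words from repeated and/or comma-separated flags."""
    parser = argparse.ArgumentParser()
    # action='append' gathers one list entry per --reserved occurrence.
    parser.add_argument('--reserved', action='append', default=[])
    # Other command line arguments are ignored in this sketch.
    args, _unknown = parser.parse_known_args(argv)
    words = set()
    for value in args.reserved:
        words.update(w.strip() for w in value.split(',') if w.strip())
    return words


print(sorted(parse_reserved(['--reserved', 'DOMAIN',
                             '--reserved', 'asm,EOF'])))
```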

This is available from svn trunk and from the pylucene_2_4 branch.

Andi..

Re: JCC: wrapping iText / issues with standard library macros / suggestions

Posted by Andi Vajda <va...@apache.org>.

  Hi Jonas,

> I didn't get my hopes up anyway for an easier way to do this than
> "just learn everything there is to the API you're trying to wrap". For

Consider that before JCC, PyLucene, for which JCC was originally written, 
had its wrappers written by hand: tens of thousands of lines of boilerplate 
C++ code. And before they were handwritten, SWIG was used instead. 'Easier' 
is a matter of perspective :)

>> A given process may embed only one JavaVM. If you want to use multiple
>> JCC-built extensions in the same process be sure to use shared mode:
>> http://lucene.apache.org/pylucene/jcc/documentation/install.html#shared
>
> Yes, I do. I installed the patch for setuptools and created a
> shared-mode library. In this case, would the two "FileOutputStream"
> wrappers be recognized as the same type by two libraries hosted in the
> same JVM? (the question is probably stupid, but I don't know a lot
> about JCC, yet)

The Python types for FileOutputStream would be different. One from module 
'a' and another one from module 'b'. The Java classes for FileOutputStream 
would be the same provided they came from the same class loader. This is an 
oversimplification of course as class loaders can behave in subtle ways.

When a Python instance of a wrapper is passed to Java, it is unwrapped, only 
the Java instance is actually used.
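
The distinction between the Python wrapper types and the wrapped Java 
instance can be pictured with a toy model in plain Python. Nothing here 
uses JCC: JavaInstance, make_wrapper_type() and unwrap() are hypothetical 
stand-ins for the Java object, the per-module wrapper types and the 
unwrapping step.

```python
class JavaInstance(object):
    """Stand-in for a Java object living in the single embedded JVM."""

    def __init__(self, class_name):
        self.class_name = class_name


def make_wrapper_type(module_name):
    """Build a separate Python wrapper type, as each extension would."""

    class FileOutputStream(object):
        def __init__(self, java_instance):
            self.java_instance = java_instance

    FileOutputStream.__module__ = module_name
    return FileOutputStream


def unwrap(wrapper):
    """Return the wrapped Java instance, ignoring the Python wrapper type."""
    return wrapper.java_instance


stream = JavaInstance("java.io.FileOutputStream")
A = make_wrapper_type("a")
B = make_wrapper_type("b")
from_a, from_b = A(stream), B(stream)

print(A is B)                            # distinct Python types
print(unwrap(from_a) is unwrap(from_b))  # same underlying instance
```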

If, in spite of the expected simplicity of this premise (note the careful 
wording here :) problems were to arise in sharing objects between different 
JCC-built Python extensions, they could be worked around by simply 
generating a new extension that is the union of both. This also has the 
advantage that this union is smaller than the sum since the intersection 
would not be wrapped twice (only one wrapper for FileOutputStream, for 
example).

I say 'simply' here because, if the JCC invocations for building both 
extensions are known, you're a compile away from generating the union.

Then, you may also hit bugs since this --shared method is rather 
infrequently used to its full extent. I've been using this to implement Java 
servlets in Python under Tomcat (sort of a reverse embedding). Shared 
mode there is essential and represents, probably, the most extensive use of 
it that I'm aware of.

Andi..

Re: JCC: wrapping iText / issues with standard library macros / suggestions

Posted by Jonas Maurus <jo...@gmail.com>.

Hi Andi

>> Besides the fact that JCC could use a "--help" parameter, it all went
>> very smoothly. I just ran into two small problems:
>
> Most of the command line flags are documented here
> http://lucene.apache.org/pylucene/jcc/documentation/readme.html#use
> A --help flag would be nice indeed. Would you like to contribute a patch ?

I'll see what I can do about that :-)

[...]
>> The way it is now, I really don't use more than 5% of iText's API and
>> can't really tell if the wrapper contains all necessary external classes
>
> Two situations here:
>  1. you're trying to generate wrappers for the 5% you're using:
>     get to know those 5% and be sure to have all you need either in what
>     these classes pull in or via --package or direct class listing
>
>  2. you're trying to generate wrappers for the entire API (like PyLucene):
>     same as above, get to know the entire API, and port all unit tests and
>     samples to python to test that you've got the coverage you need
>     (assuming the tests and samples have good coverage). It may also be
>     likely (as is the case with PyLucene) that you might have to provide
>     extension point classes when the Java API you're wrapping expects its
>     users to provide subclasses or interface implementations of their own.
>     (for example, PyLucene custom analyzers).

I didn't get my hopes up anyway for an easier way to do this than
"just learn everything there is to the API you're trying to wrap". For
the moment the wrapper works perfectly for me, but I would like to
make it available to the internet at large... I'll probably settle for
a "website with building instructions, a release clearly marked
'alpha' and a rapid release cycle"-type of pet-project :-).

>> to use iText properly. I'm a bit lost there and would welcome pointers and
>> ideas on how to do this correctly. What happens if a Python program uses
>> multiple JCC-wrapped JVMs, would the wrapped types "itext.FileOutputStream"
>> and "lib2.FileOutputStream" collide?
>
> A given process may embed only one JavaVM. If you want to use multiple
> JCC-built extensions in the same process be sure to use shared mode:
> http://lucene.apache.org/pylucene/jcc/documentation/install.html#shared

Yes, I do. I installed the patch for setuptools and created a
shared-mode library. In this case, would the two "FileOutputStream"
wrappers be recognized as the same type by two libraries hosted in the
same JVM? (the question is probably stupid, but I don't know a lot
about JCC, yet)

[...]
> Yes, this is a well known problem. JCC has a hardcoded list of words that
> can lead to such unfortunate collisions. I need to add another command line
> argument that makes it possible to add more such reserved words. Any use of
> words in this list is mangled.
> Currently, that list of words is in cpp.py, line 71:
>
>  RESERVED = set(['delete', 'and', 'or', 'not', 'xor', 'union', 'NULL',
>                  'register', 'const', 'bool', 'operator'])
>
> JCC 2.2 does a much better job than earlier versions at this already.

I saw your follow-up on the new "--reserved" functionality. This is
much appreciated!

cheers
Jonas