Posted to pylucene-dev@lucene.apache.org by Chris Wilson <ch...@aptivate.org> on 2012/02/01 14:27:03 UTC

Changes to enable easy_install of packages using JCC

Dear sirs,

I have been working on integrating Apache Tika (in Java) with our 
open source intranet application (in Python/Django) using JCC, as 
described here:

http://blog.aptivate.org/2012/02/01/content-indexing-in-django-using-apache-tika/

In order to make it easy to install Tika (which normally requires mystic 
incantations of JCC) I have packaged it up with jar files and a setup.py 
script. This required some changes to JCC. I hope you will consider these 
for inclusion in your project. I don't believe that they break backwards 
compatibility.

Changes implemented by the attached patch and visible online (formatted) 
at <https://github.com/aptivate/jcc/commits/master>:

* Allow calling cpp.jcc with a --maxheap argument to reduce the heap size, 
as the default doesn't fit in memory on a reasonably small virtual 
machine.

* Allow calling cpp.jcc with --egg-info to generate the egg_info, without 
doing a build.

* Allow calling cpp.jcc with --extra-setup-arg <arg> to pass additional 
arguments to the setup() function call.
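
As a rough illustration of how a setup.py might use the three options listed above, here is a minimal sketch that only assembles the argument list. The helper name build_jcc_args, the "jcc" placeholder in the first slot, the --python option, and the example values ("64m", "zip_safe=False") are assumptions made for illustration; only --jar, --build, --maxheap, --egg-info and --extra-setup-arg are named in this thread, and the exact value formats they accept should be checked against JCC's __main__.py.

```python
def build_jcc_args(jars, module, maxheap=None, egg_info=False,
                   extra_setup_args=()):
    """Assemble an argv-style list that a setup.py could hand to cpp.jcc().

    Hypothetical helper: the layout mirrors a command line, with a
    placeholder program name in the first slot (an assumption).
    """
    args = ["jcc"]                      # assumed argv[0] placeholder
    for jar in jars:
        args += ["--jar", jar]          # JAR files to wrap
    args += ["--python", module]        # assumed option naming the extension
    if maxheap:
        args += ["--maxheap", maxheap]  # smaller heap for small VMs
    for extra in extra_setup_args:
        args += ["--extra-setup-arg", extra]  # forwarded to setup()
    # Either generate egg_info only, or do a normal build.
    args.append("--egg-info" if egg_info else "--build")
    return args


example_args = build_jcc_args(["tika.jar"], "tika", maxheap="64m",
                              egg_info=True,
                              extra_setup_args=["zip_safe=False"])
print(example_args)
# With JCC installed, a setup.py would then call something like:
#   from jcc import cpp
#   cpp.jcc(example_args)
```

Keeping the list construction separate from the cpp.jcc() call makes it possible to inspect or log the arguments without JCC or a compiler being present.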

Changes that require more work:

* Can JCC please not fail completely if setuptools hasn't been patched? 
Can it monkeypatch it instead, or at least fall back to non-shared mode?

* Why does JCC use non-standard command line arguments like --build and 
--install? Can it be modified to make it easier to invoke from a 
setup.py-style environment, such as exporting a setup() function as 
setuptools does?

* Could JCC be used to generate dynamic proxies at runtime (with a 
performance cost) in Python, to avoid the need for a compiler?

* Could JCC generate a source distribution (sdist) that could be uploaded 
to pypi?

* "setup.py develop" is still broken in the current implementation

* JCC silently skips wrapping methods whose return type it doesn't know 
(for example because I forgot to include a JAR file), which requires a lot 
of debugging to track down and fix. This is doubly hard because it only 
seems to work when installed, so I can't monkey patch it on the fly to 
investigate problems; I have to remember to run "setup.py install" each time.

Thanks in advance for your consideration.

Cheers, Chris.
-- 
Aptivate | http://www.aptivate.org | Phone: +44 1223 760887
The Humanitarian Centre, Fenner's, Gresham Road, Cambridge CB1 2ES

Aptivate is a not-for-profit company registered in England and Wales
with company number 04980791.

Re: Changes to enable easy_install of packages using JCC

Posted by Andi Vajda <va...@apache.org>.
On Feb 2, 2012, at 15:22, Bill Janssen <ja...@parc.com> wrote:

> Andi Vajda <va...@apache.org> wrote:
> 
>>> I see that distutils2 has functions like "link_shared_lib", and
>>> "link_shared_object", which is a good sign.  Whether they work or not is
>>> another matter.
>> 
>> As far as I know, distutils and setuptools also had the capability to
>> link with a vanilla shared library. Here, I'm talking about these
>> systems helping us with building a vanilla shared library that uses
>> Python but that is not a Python extension.
> 
> Ah, but in distutils2, "link_shared_lib" means, link, producing a shared
> library.  Tell you what, I'll download it, play around with it a bit, and
> see if it knows what we need to know (at least, on OS X).

Ooooh, that would be very welcome news indeed. Thank you for looking into it.

Andi..

> 
> Bill

Re: Changes to enable easy_install of packages using JCC

Posted by Bill Janssen <ja...@parc.com>.
Andi Vajda <va...@apache.org> wrote:

> > I see that distutils2 has functions like "link_shared_lib", and
> > "link_shared_object", which is a good sign.  Whether they work or not is
> > another matter.
> 
> As far as I know, distutils and setuptools also had the capability to
> link with a vanilla shared library. Here, I'm talking about these
> systems helping us with building a vanilla shared library that uses
> Python but that is not a Python extension.

Ah, but in distutils2, "link_shared_lib" means, link, producing a shared
library.  Tell you what, I'll download it, play around with it a bit, and
see if it knows what we need to know (at least, on OS X).

Bill

Re: Changes to enable easy_install of packages using JCC

Posted by Andi Vajda <va...@apache.org>.
On Thu, 2 Feb 2012, Bill Janssen wrote:

> Andi Vajda <va...@apache.org> wrote:
>
>>> I think the right thing to do is to
>>>
>>> 1.  re-write the current jcc setup.py to use distutils2, and then
>>
>> Is distutils2 supported on older Python 2.x versions like 2.4, 2.5 ?
>> I'd be happy to drop support for older releases for sure. We currently go all the way back to 2.3.5.
>
> At https://bitbucket.org/tarek/distutils2/wiki/Home, it says,
> ``distutils2 will be distributed as a third party module compatible with
> Python 2.4-3.2 under the name "distutils2"''.  So, yes.
>
>>> 2.  at the PyLucene level, add a configure.ac script which figures out
>>>    the proper settings for those six or so defines in the Makefile,
>>>    which is a fairly trivial configure script.
>>
>> If I understand this correctly, your configure script does not take
>> care of the libjcc.so build part, the 'cause' of the setuptools
>> patching mess.
>
> That's correct.  It just figures out the environment variables and
> parameters to set so that setuptools will work properly.  The right
> thing to do is to eliminate setuptools.

Yes, by replacing it with distutils2, correct ? Not by a configure script.
The only reason setuptools is used is because of this vanilla library 
business. Otherwise, plain distutils works just fine.

>> Unless distutils2 handles building vanilla shared
>> libraries (as opposed to python extension shared libraries), we'd
>> still be stuck with that problem.
>
> The question for JCC is whether to abandon setup.py (and do everything
> with autotools and configure), or to move setup.py onto a newer and
> probably better platform (distutils2 instead of setuptools).  Or, I
> suppose, some combo of the two.

That could be a question too but not the one I was asking. It seems that JCC 
could integrate completely (as a Compiler ?) into distutils/distutils2. But 
where is the vanilla shared library build support going to come from then ?

> "configure" would provide better support for building shared libraries,
> but less understanding of and support for Python specifics, like
> building extensions.  Moving to distutils2 instead would probably be
> less work.
>
> I see that distutils2 has functions like "link_shared_lib", and
> "link_shared_object", which is a good sign.  Whether they work or not is
> another matter.

As far as I know, distutils and setuptools also had the capability to link 
with a vanilla shared library. Here, I'm talking about these systems helping 
us with building a vanilla shared library that uses Python but that is not a 
Python extension.

Andi..

Re: Changes to enable easy_install of packages using JCC

Posted by Bill Janssen <ja...@parc.com>.
Andi Vajda <va...@apache.org> wrote:

> > I think the right thing to do is to
> > 
> > 1.  re-write the current jcc setup.py to use distutils2, and then
> 
> Is distutils2 supported on older Python 2.x versions like 2.4, 2.5 ?
> I'd be happy to drop support for older releases for sure. We currently go all the way back to 2.3.5.

At https://bitbucket.org/tarek/distutils2/wiki/Home, it says,
``distutils2 will be distributed as a third party module compatible with
Python 2.4-3.2 under the name "distutils2"''.  So, yes.

> > 2.  at the PyLucene level, add a configure.ac script which figures out
> >    the proper settings for those six or so defines in the Makefile,
> >    which is a fairly trivial configure script.
> 
> If I understand this correctly, your configure script does not take
> care of the libjcc.so build part, the 'cause' of the setuptools
> patching mess.

That's correct.  It just figures out the environment variables and
parameters to set so that setuptools will work properly.  The right
thing to do is to eliminate setuptools.

> Unless distutils2 handles building vanilla shared
> libraries (as opposed to python extension shared libraries), we'd
> still be stuck with that problem.

The question for JCC is whether to abandon setup.py (and do everything
with autotools and configure), or to move setup.py onto a newer and
probably better platform (distutils2 instead of setuptools).  Or, I
suppose, some combo of the two.

"configure" would provide better support for building shared libraries,
but less understanding of and support for Python specifics, like
building extensions.  Moving to distutils2 instead would probably be
less work.

I see that distutils2 has functions like "link_shared_lib", and
"link_shared_object", which is a good sign.  Whether they work or not is
another matter.

Bill

Re: Changes to enable easy_install of packages using JCC

Posted by Andi Vajda <va...@apache.org>.
On Feb 2, 2012, at 8:29, Bill Janssen <ja...@parc.com> wrote:

> Andi Vajda <va...@apache.org> wrote:
> 
>> 
>> On Feb 1, 2012, at 20:49, Bill Janssen <ja...@parc.com> wrote:
>> 
>>> Andi Vajda <va...@apache.org> wrote:
>>> 
>>>> Seriously, though, I think that the right thing to do to better
>>>> integrate JCC with distutils/setuptools/distribute/pip/etc... is to
>>>> make it into a distutils 'compiler'. This requires some work, though,
>>>> and I haven't done it in all these years. Anyone with the itch to hack
>>>> on distutils is welcome to take that on.
>>> 
>>> The future here is Python 3.3's "packaging" (aka "distutils2", on
>>> earlier versions of Python):
>>> 
>>> try:
>>>   import packaging
>>> except ImportError:
>>>   import distutils2
>>> 
>>> 
>>>> Additionally, issue 43 is all about using the distutils/setuptools
>>>> compiler and linker invocation machinery for building a vanilla shared
>>>> library (as opposed to a Python extension). On Linux this is a bit
>>>> cumbersome. On Windows, a little less so. On Mac OS X, it just works.
>>>> 
>>>> The alternative would be to write a 'configure' script for that part
>>>> of the JCC build. A configure script would also solve the chicken/egg
>>>> problem of building that library on Windows (the first time, the build
>>>> needs to be done twice for the import library to be in the right
>>>> place).
>>>> 
>>>> Currently, I'm leaning towards the configure script solution since
>>>> none of the projects mentioned above seems to have taken issue 43 on
>>>> (by simply integrating my patches) in all these years and PyLucene's
>>>> issue 13 is currently blocked:
>>>> https://issues.apache.org/jira/browse/PYLUCENE-13?focusedCommentId=13162273&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13162273
>>>> 
>>>> I have very little itch to dabble in configure scripts either so I've
>>>> been dragging my feet. If someone were to step forward with a patch
>>>> for that, I'd be delighted in ripping out all this patching
>>>> brittleness.
>>> 
>>> I have big chunks of such a configure script written (for Windows/OS
>>> X/Linux) which I'd be happy to contribute.
>> 
>> That would be cool. What are the missing chunks made of ?
>> 
>> Andi..
> 
> Hard to say what the *missing* chunks are made of -- chocolate?
> 
> What I've got is a configure.in script for UpLib, part of which is
> designed to figure out the necessary parameters to build JCC and
> PyLucene properly on those systems.  But it still uses setuptools,
> which is still going to be broken.
> 
> I think the right thing to do is to
> 
> 1.  re-write the current jcc setup.py to use distutils2, and then

Is distutils2 supported on older Python 2.x versions like 2.4, 2.5 ?
I'd be happy to drop support for older releases for sure. We currently go all the way back to 2.3.5.

> 2.  at the PyLucene level, add a configure.ac script which figures out
>    the proper settings for those six or so defines in the Makefile,
>    which is a fairly trivial configure script.

If I understand this correctly, your configure script does not take care of the libjcc.so build part, the 'cause' of the setuptools patching mess. Unless distutils2 handles building vanilla shared libraries (as opposed to python extension shared libraries), we'd still be stuck with that problem.

Andi..

> 
> Bill

Re: Changes to enable easy_install of packages using JCC

Posted by Bill Janssen <ja...@parc.com>.
Andi Vajda <va...@apache.org> wrote:

> 
> On Feb 1, 2012, at 20:49, Bill Janssen <ja...@parc.com> wrote:
> 
> > Andi Vajda <va...@apache.org> wrote:
> > 
> >> Seriously, though, I think that the right thing to do to better
> >> integrate JCC with distutils/setuptools/distribute/pip/etc... is to
> >> make it into a distutils 'compiler'. This requires some work, though,
> >> and I haven't done it in all these years. Anyone with the itch to hack
> >> on distutils is welcome to take that on.
> > 
> > The future here is Python 3.3's "packaging" (aka "distutils2", on
> > earlier versions of Python):
> > 
> > try:
> >    import packaging
> > except ImportError:
> >    import distutils2
> > 
> > 
> >> Additionally, issue 43 is all about using the distutils/setuptools
> >> compiler and linker invocation machinery for building a vanilla shared
> >> library (as opposed to a Python extension). On Linux this is a bit
> >> cumbersome. On Windows, a little less so. On Mac OS X, it just works.
> >> 
> >> The alternative would be to write a 'configure' script for that part
> >> of the JCC build. A configure script would also solve the chicken/egg
> >> problem of building that library on Windows (the first time, the build
> >> needs to be done twice for the import library to be in the right
> >> place).
> >> 
> >> Currently, I'm leaning towards the configure script solution since
> >> none of the projects mentioned above seems to have taken issue 43 on
> >> (by simply integrating my patches) in all these years and PyLucene's
> >> issue 13 is currently blocked:
> >> https://issues.apache.org/jira/browse/PYLUCENE-13?focusedCommentId=13162273&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13162273
> >> 
> >> I have very little itch to dabble in configure scripts either so I've
> >> been dragging my feet. If someone were to step forward with a patch
> >> for that, I'd be delighted in ripping out all this patching
> >> brittleness.
> > 
> > I have big chunks of such a configure script written (for Windows/OS
> > X/Linux) which I'd be happy to contribute.
> 
> That would be cool. What are the missing chunks made of ?
> 
> Andi..

Hard to say what the *missing* chunks are made of -- chocolate?

What I've got is a configure.in script for UpLib, part of which is
designed to figure out the necessary parameters to build JCC and
PyLucene properly on those systems.  But it still uses setuptools,
which is still going to be broken.

I think the right thing to do is to

1.  re-write the current jcc setup.py to use distutils2, and then

2.  at the PyLucene level, add a configure.ac script which figures out
    the proper settings for those six or so defines in the Makefile,
    which is a fairly trivial configure script.

Bill

Re: Changes to enable easy_install of packages using JCC

Posted by Andi Vajda <va...@apache.org>.
On Feb 1, 2012, at 20:49, Bill Janssen <ja...@parc.com> wrote:

> Andi Vajda <va...@apache.org> wrote:
> 
>> Seriously, though, I think that the right thing to do to better
>> integrate JCC with distutils/setuptools/distribute/pip/etc... is to
>> make it into a distutils 'compiler'. This requires some work, though,
>> and I haven't done it in all these years. Anyone with the itch to hack
>> on distutils is welcome to take that on.
> 
> The future here is Python 3.3's "packaging" (aka "distutils2", on
> earlier versions of Python):
> 
> try:
>    import packaging
> except ImportError:
>    import distutils2
> 
> 
>> Additionally, issue 43 is all about using the distutils/setuptools
>> compiler and linker invocation machinery for building a vanilla shared
>> library (as opposed to a Python extension). On Linux this is a bit
>> cumbersome. On Windows, a little less so. On Mac OS X, it just works.
>> 
>> The alternative would be to write a 'configure' script for that part
>> of the JCC build. A configure script would also solve the chicken/egg
>> problem of building that library on Windows (the first time, the build
>> needs to be done twice for the import library to be in the right
>> place).
>> 
>> Currently, I'm leaning towards the configure script solution since
>> none of the projects mentioned above seems to have taken issue 43 on
>> (by simply integrating my patches) in all these years and PyLucene's
>> issue 13 is currently blocked:
>> https://issues.apache.org/jira/browse/PYLUCENE-13?focusedCommentId=13162273&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13162273
>> 
>> I have very little itch to dabble in configure scripts either so I've
>> been dragging my feet. If someone were to step forward with a patch
>> for that, I'd be delighted in ripping out all this patching
>> brittleness.
> 
> I have big chunks of such a configure script written (for Windows/OS
> X/Linux) which I'd be happy to contribute.

That would be cool. What are the missing chunks made of ?

Andi..

> 
> Bill

Re: Changes to enable easy_install of packages using JCC

Posted by Bill Janssen <ja...@parc.com>.
Andi Vajda <va...@apache.org> wrote:

> Seriously, though, I think that the right thing to do to better
> integrate JCC with distutils/setuptools/distribute/pip/etc... is to
> make it into a distutils 'compiler'. This requires some work, though,
> and I haven't done it in all these years. Anyone with the itch to hack
> on distutils is welcome to take that on.

The future here is Python 3.3's "packaging" (aka "distutils2", on
earlier versions of Python):

try:
    import packaging
except ImportError:
    import distutils2
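
A slightly more defensive variant of the fallback import above, offered only as a sketch: it walks a list of candidate module names and reports which one, if any, could be imported. The helper name first_importable is hypothetical, and whether "packaging" or "distutils2" is present depends entirely on the Python installation at hand.

```python
import importlib


def first_importable(names):
    """Return (name, module) for the first importable name, else (None, None)."""
    for name in names:
        try:
            return name, importlib.import_module(name)
        except ImportError:
            continue  # try the next candidate
    return None, None


# Candidate names taken from the message above; either may be absent.
backend_name, backend = first_importable(["packaging", "distutils2"])
if backend is None:
    print("no packaging backend found; falling back to the existing setup.py")
else:
    print("using", backend_name)
```

Returning (None, None) instead of raising lets the caller decide how to degrade when neither module is installed.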


> Additionally, issue 43 is all about using the distutils/setuptools
> compiler and linker invocation machinery for building a vanilla shared
> library (as opposed to a Python extension). On Linux this is a bit
> cumbersome. On Windows, a little less so. On Mac OS X, it just works.
> 
> The alternative would be to write a 'configure' script for that part
> of the JCC build. A configure script would also solve the chicken/egg
> problem of building that library on Windows (the first time, the build
> needs to be done twice for the import library to be in the right
> place).
> 
> Currently, I'm leaning towards the configure script solution since
> none of the projects mentioned above seems to have taken issue 43 on
> (by simply integrating my patches) in all these years and PyLucene's
> issue 13 is currently blocked:
> https://issues.apache.org/jira/browse/PYLUCENE-13?focusedCommentId=13162273&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13162273
> 
> I have very little itch to dabble in configure scripts either so I've
> been dragging my feet. If someone were to step forward with a patch
> for that, I'd be delighted in ripping out all this patching
> brittleness.

I have big chunks of such a configure script written (for Windows/OS
X/Linux) which I'd be happy to contribute.

Bill

Re: Changes to enable easy_install of packages using JCC

Posted by Chris Wilson <ch...@aptivate.org>.
Hi Andi,

On Sat, 4 Feb 2012, Andi Vajda wrote:

> I integrated your patches with rev 1240624. I moved a few changes around:
> parameters to their section in __main__.py and 'maxstack' hardcoding to
> where it used to be.
>
> Thank you for your contribution.

Thanks :)

Cheers, Chris.


Re: Changes to enable easy_install of packages using JCC

Posted by Andi Vajda <va...@apache.org>.
  Hi Chris,

On Wed, 1 Feb 2012, Andi Vajda wrote:

>>> No objections to these patches in principle but it would be easier for me 
>>> to integrate them if you could provide patches computed from the svn 
>>> repository of JCC: 
>>> http://svn.apache.org/repos/asf/lucene/pylucene/trunk/jcc/ Your patches 
>>> seem to be small enough so I should be able to do without but it would be 
>>> nicer if I didn't have to guess...
>> 
>> I think the patch that I attached was already based on trunk. The git 
>> repository includes the .svn directories, points to trunk, and I generated 
>> the patch using "svn diff".
>
> Sorry, I missed that you indeed had attached a patch last time.
> (to be continued...)
>
>>> Also, please write small descriptions for these new command line flags to 
>>> go into JCC's __main__.py file:
>>> http://svn.apache.org/repos/asf/lucene/pylucene/trunk/jcc/jcc/__main__.py
>> 
>> Done, new patch attached.
>
> Thank you !

I integrated your patches with rev 1240624.
I moved a few changes around: parameters to their section in __main__.py and 
'maxstack' hardcoding to where it used to be.

Thank you for your contribution.

Andi..

>
>>> This mess of setuptools patching was meant to be *temporary* until 
>>> setuptools' issue 43 was fixed. As you can see, I filed this bug 3 1/2 
>>> years ago, http://bugs.python.org/setuptools/issue43, and my patch for 
>>> issue 43 still hasn't been accepted, rejected, integrated, anything'ed... 
>>> Dormant. For over three years.
>> 
>> Sorry about that. I've had similar experience with bugs reported against 
>> ubuntu, hibernate, rails... :(
>>
>>>>  * Why does JCC use non-standard command line arguments like --build and
>>>>  --install? Can it be modified to make it easier to invoke from a
>>>>  setup.py-style environment, such as exporting a setup() function as
>>>>  setuptools does?
>>> 
>>> What standard are you referring to ?
>>> The python extension module build/install/deploy story on Python keeps 
>>> evolving... Add Python 3.x support into the mix, and the mess is complete.
>>> 
>>> Seriously, though, I think that the right thing to do to better integrate 
>>> JCC with distutils/setuptools/distribute/pip/etc... is to make it into a 
>>> distutils 'compiler'. This requires some work, though, and I haven't done 
>>> it in all these years. Anyone with the itch to hack on distutils is welcome 
>>> to take that on.
>> 
>> I'm afraid I don't fully understand how distutils works, it seems to be 
>> sparsely documented, and I don't have a lot of time and energy to work on 
>> refactoring jcc. I am a bit surprised that we can't just generate a source 
>> distribution containing the jars, .cpp files and a setup.py which does the 
>> rest like any other Python extension.
>
> Same here. I don't know distutils too well and whenever I tried to dig into 
> it, I quickly gave up. I don't know what it means to "just generate a source 
> distribution".
>
> If they contain .class files, JAR files are not source files. My 
> understanding could be wrong here, but I don't think they're even compatible 
> between 32- and 64-bit VMs. Or is that incompatible between Java 5 and 6 ?
>
>>> I have very little itch to dabble in configure scripts either so I've been 
>>> dragging my feet. If someone were to step forward with a patch for that, 
>>> I'd be delighted in ripping out all this patching brittleness.
>> 
>> How would a configure script solve the problem and what would it have to 
>> do? Generate the .cpp files? How does it integrate with Python extensions?
>
> A configure script for building libjcc.dylib (libjcc.so on Linux, jcc.dll on 
> Windows, etc...) would take care of doing what setuptools + the issue43 patch 
> is doing for us currently: invoking the C++ compiler and linker against the 
> correct Python headers and Libraries to produce a vanilla shared library. 
> With such a configure script, there is no longer a need to patch setuptools.
>
>>> That is a whole different project. If I remember correctly, the JPype 
>>> project is (or was) taking that approach: http://jpype.sourceforge.net
>> 
>> OK, thanks.
>>
>>>>  * Could JCC generate a source distribution (sdist) that could be
>>>>    uploaded to pypi?
>>> 
>>> You mean a source distribution that includes the Java sources of all the 
>>> libraries/classes wrapped ?
>> 
>> I was thinking more of the jars. Something like 
>> https://github.com/aptivate/python-tika that doesn't depend on jcc any 
>> more.
>>
>>>>  * "setup.py develop" is still broken in the current implementation
>>> 
>>> I'm not familiar with this 'develop' command, nor was I aware that it is 
>>> broken. What is it supposed to be doing and how is it broken ?
>> 
>> http://packages.python.org/distribute/setuptools.html#development-mode
>> 
>> It seems that when invoked this way, my setup.py (from python-tika) which 
>> calls jcc ends up creating build/_tika as a file (not a directory).
>> 
>> For example, this command:
>>
>>  sudo pip install -e git+https://github.com/aptivate/python-tika#egg=tika
>> 
>> (note the -e for editable mode) results in this:
>>
>>  Running setup.py develop for tika
>>  ...
>>    Traceback (most recent call last):
>>      File "<string>", line 1, in <module>
>>      File "/tmp/src/tika/setup.py", line 108, in <module>
>>        cpp.jcc(jcc_args)
>>      File 
>> "/usr/local/lib/python2.6/dist-packages/JCC-2.12-py2.6-linux-i686.egg/jcc/cpp.py", 
>> line 587, in jcc
>>        os.makedirs(cppdir)
>>      File "/usr/lib/python2.6/os.py", line 157, in makedirs
>>        mkdir(name, mode)
>>    OSError: [Errno 17] File exists: 'build/_tika'
>> 
>> That file appears to contain the source code for the JCCEnv.cpp wrapper.
>
> Please, file a bug with the explanation above. Not that I promise to fix it 
> (a patch would be welcome, of course) but this failure should be logged at 
> least.
>
>>> A patch could be written to noisily emit a warning on all methods that are 
>>> skipped. Silently wrapping everything would simply wrap the entire JDK by 
>>> transitive closure and produce a huge library, assuming you'd have the 
>>> patience to watch it compile.
>>> 
>>> The skipping of methods whose signatures contain types that are not on the 
>>> 'wrap this' list (explicit or implicit) is by design. Not being able to 
>>> request emitting a warning is a problem.
>> 
>> Perhaps it's useful to (automatically) emit warnings for classes in the JAR 
>> files included with --jar or an explicit class name, but not those in 
>> --include files or otherwise automatically included (e.g. the JDK 
>> classpath)?
>
> That'd be possible but has the potential of being very noisy...
> It's worth a try, for sure.
>
> Andi..
>
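
The warn-on-skip idea discussed in this message could look roughly like the sketch below. It is not JCC's code: the function select_methods, the tuple layout for a method, and the type names are all hypothetical, chosen only to show methods being skipped by design while still telling the user why.

```python
import warnings


def select_methods(methods, wrapped_types, warn=True):
    """Keep methods whose return and parameter types are all wrapped.

    methods: iterable of (name, return_type, param_types) tuples
             (a made-up representation for this sketch).
    wrapped_types: set of type names on the 'wrap this' list.
    """
    kept = []
    for name, return_type, param_types in methods:
        missing = [t for t in [return_type] + list(param_types)
                   if t not in wrapped_types]
        if missing:
            if warn:
                warnings.warn("skipping %s: unwrapped type(s) %s"
                              % (name, ", ".join(missing)))
            continue  # skipped by design, but no longer silently
        kept.append(name)
    return kept


# Hypothetical input: one method is fine, one refers to an unwrapped class.
methods = [
    ("getName", "java.lang.String", []),
    ("parse", "void", ["com.example.MissingType"]),
]
wrapped = {"void", "java.lang.String"}

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    kept = select_methods(methods, wrapped)

print(kept)
for w in caught:
    print(w.message)
```

Making the warning optional (the warn flag here) is one way to address the noise concern raised above: it could be switched on only for classes that come from explicitly listed JAR files.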

Re: Changes to enable easy_install of packages using JCC

Posted by Andi Vajda <va...@apache.org>.
  Hi Chris,

On Wed, 1 Feb 2012, Chris Wilson wrote:

> Thank you for your quick and positive reply :)
>
> On Wed, 1 Feb 2012, Andi Vajda wrote:
>
>>>  I have been working on integrating Apache Tika (in Java) with our open
>>>  source intranet application (in Python/Django) using JCC...
>> 
>> Using Maven there helped considerably with getting all the pieces on the 
>> Java side.
>
> Although I used maven for an initial compile of Tika, I realised that it 
> would work just as well if I downloaded pre-built jar files, which I did from 
> http://repo1.maven.org/maven2/org/apache/tika/.
>
>> Your remark about not needing JCC's shared library mode is probably correct 
>> right now but as soon as anyone brings in another JCC-built library into 
>> the same process as yours, shared mode is going to be required since the 
>> Java VM can only be initialized once per process.
>
> I understand that, but I'm prepared to live with that limitation for now, as 
> this is likely to be the only Java library that I integrate into this 
> Python/Django application. I tried hard to find pure Python solutions, but 
> Tika is simply miles ahead of the competition.
>
>> No objections to these patches in principle but it would be easier for me 
>> to integrate them if you could provide patches computed from the svn 
>> repository of JCC: 
>> http://svn.apache.org/repos/asf/lucene/pylucene/trunk/jcc/ Your patches 
>> seem to be small enough so I should be able to do without but it would be 
>> nicer if I didn't have to guess...
>
> I think the patch that I attached was already based on trunk. The git 
> repository includes the .svn directories, points to trunk, and I generated 
> the patch using "svn diff".

Sorry, I missed that you indeed had attached a patch last time.
(to be continued...)

>> Also, please write small descriptions for these new command line flags to 
>> go into JCC's __main__.py file:
>> http://svn.apache.org/repos/asf/lucene/pylucene/trunk/jcc/jcc/__main__.py
>
> Done, new patch attached.

Thank you !

>> This mess of setuptools patching was meant to be *temporary* until 
>> setuptools' issue 43 was fixed. As you can see, I filed this bug 3 1/2 
>> years ago, http://bugs.python.org/setuptools/issue43, and my patch for 
>> issue 43 still hasn't been accepted, rejected, integrated, anything'ed... 
>> Dormant. For over three years.
>
> Sorry about that. I've had similar experience with bugs reported against 
> ubuntu, hibernate, rails... :(
>
>>>  * Why does JCC use non-standard command line arguments like --build and
>>>  --install? Can it be modified to make it easier to invoke from a
>>>  setup.py-style environment, such as exporting a setup() function as
>>>  setuptools does?
>> 
>> What standard are you referring to ?
>> The python extension module build/install/deploy story on Python keeps 
>> evolving... Add Python 3.x support into the mix, and the mess is complete.
>> 
>> Seriously, though, I think that the right thing to do to better integrate 
>> JCC with distutils/setuptools/distribute/pip/etc... is to make it into a 
>> distutils 'compiler'. This requires some work, though, and I haven't done 
>> it in all these years. Anyone with the itch to hack on distutils is welcome 
>> to take that on.
>
> I'm afraid I don't fully understand how distutils works, it seems to be 
> sparsely documented, and I don't have a lot of time and energy to work on 
> refactoring jcc. I am a bit surprised that we can't just generate a source 
> distribution containing the jars, .cpp files and a setup.py which does the 
> rest like any other Python extension.

Same here. I don't know distutils too well and whenever I tried to dig into 
it, I quickly gave up. I don't know what it means to "just generate a source 
distribution".

If they contain .class files, JAR files are not source files. My 
understanding could be wrong here, but I don't think they're even compatible 
between 32- and 64-bit VMs. Or is that incompatible between Java 5 and 6 ?

>> I have very little itch to dabble in configure scripts either so I've been 
>> dragging my feet. If someone were to step forward with a patch for that, 
>> I'd be delighted in ripping out all this patching brittleness.
>
> How would a configure script solve the problem and what would it have to do? 
> Generate the .cpp files? How does it integrate with Python extensions?

A configure script for building libjcc.dylib (libjcc.so on Linux, jcc.dll on 
Windows, etc...) would take care of doing what setuptools + the issue43 
patch is doing for us currently: invoking the C++ compiler and linker 
against the correct Python headers and Libraries to produce a vanilla shared 
library. With such a configure script, there is no longer a need to patch 
setuptools.
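
To make concrete what such a script would have to discover, here is a small sketch that asks a current Python for the header directory, library directory and library name through the standard sysconfig module. It does not build libjcc; the function name python_build_settings is hypothetical, and some of these variables can be None on platforms (Windows, for instance) where the build configuration does not define them.

```python
import sysconfig


def python_build_settings():
    """Collect the Python-side values a compile/link step would need."""
    paths = sysconfig.get_paths()
    return {
        # Directory containing Python.h
        "include_dir": paths["include"],
        # Directory holding the Python library; may be None
        "lib_dir": sysconfig.get_config_var("LIBDIR"),
        # File name of the Python library to link against; may be None
        "python_library": sysconfig.get_config_var("LDLIBRARY"),
        # Platform suffix for plain shared libraries; may be None
        "shared_suffix": sysconfig.get_config_var("SHLIB_SUFFIX"),
    }


settings = python_build_settings()
for key in sorted(settings):
    print("%s = %s" % (key, settings[key]))
```

A configure script would compute the same kind of values by probing the system, then pass them to the C++ compiler and linker directly instead of going through a patched setuptools.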

>> That is a whole different project. If I remember correctly, the JPype 
>> project is (or was) taking that approach: http://jpype.sourceforge.net
>
> OK, thanks.
>
>>>  * Could JCC generate a source distribution (sdist) that could be
>>>    uploaded to pypi?
>> 
>> You mean a source distribution that includes the Java sources of all the 
>> libraries/classes wrapped ?
>
> I was thinking more of the jars. Something like 
> https://github.com/aptivate/python-tika that doesn't depend on jcc any more.
>
>>>  * "setup.py develop" is still broken in the current implementation
>> 
>> I'm not familiar with this 'develop' command, nor was I aware that it is 
>> broken. What is it supposed to be doing and how is it broken ?
>
> http://packages.python.org/distribute/setuptools.html#development-mode
>
> It seems that when invoked this way, my setup.py (from python-tika) which 
> calls jcc ends up creating build/_tika as a file (not a directory).
>
> For example, this command:
>
>  sudo pip install -e git+https://github.com/aptivate/python-tika#egg=tika
>
> (note the -e for editable mode) results in this:
>
>  Running setup.py develop for tika
>  ...
>    Traceback (most recent call last):
>      File "<string>", line 1, in <module>
>      File "/tmp/src/tika/setup.py", line 108, in <module>
>        cpp.jcc(jcc_args)
>      File "/usr/local/lib/python2.6/dist-packages/JCC-2.12-py2.6-linux-i686.egg/jcc/cpp.py", line 587, in jcc
>        os.makedirs(cppdir)
>      File "/usr/lib/python2.6/os.py", line 157, in makedirs
>        mkdir(name, mode)
>    OSError: [Errno 17] File exists: 'build/_tika'
>
> That file appears to contain the source code for the JCCEnv.cpp wrapper.

Please, file a bug with the explanation above. Not that I promise to fix it 
(a patch would be welcome, of course) but this failure should be logged at 
least.

>> A patch could be written to noisily emit a warning on all methods that are 
>> skipped. Silently wrapping everything would simply wrap the entire JDK by 
>> transitive closure and produce a huge library, assuming you'd have the 
>> patience to watch it compile.
>> 
>> The skipping of methods whose signatures contain types that are not on the 
>> 'wrap this' list (explicit or implicit) is by design. Not being able to 
>> request emitting a warning is a problem.
>
> Perhaps it's useful to (automatically) emit warnings for classes in the JAR 
> files included with --jar or an explicit class name, but not those in 
> --include files or otherwise automatically included (e.g. the JDK classpath)?

That'd be possible but has the potential of being very noisy...
It's worth a try, for sure.

Andi..

Re: Changes to enable easy_install of packages using JCC

Posted by Chris Wilson <ch...@aptivate.org>.
Hi Andi,

Thank you for your quick and positive reply :)

On Wed, 1 Feb 2012, Andi Vajda wrote:

>>  I have been working on integrating Apache Tika (in Java) with our open
>>  source intranet application (in Python/Django) using JCC...
>
> Using Maven there helped considerably with getting all the pieces on the 
> Java side.

Although I used maven for an initial compile of Tika, I realised that it 
would work just as well if I downloaded pre-built jar files, which I did 
from http://repo1.maven.org/maven2/org/apache/tika/.

> Your remark about not needing JCC's shared library mode is probably 
> correct right now but as soon as anyone brings in another JCC-built 
> library into the same process as yours, shared mode is going to be 
> required since the Java VM can only be initialized once per process.

I understand that, but I'm prepared to live with that limitation for now, 
as this is likely to be the only Java library that I integrate into this 
Python/Django application. I tried hard to find pure Python solutions, but 
Tika is simply miles ahead of the competition.

> No objections to these patches in principle but it would be easier for 
> me to integrate them if you could provide patches computed from the svn 
> repository of JCC: 
> http://svn.apache.org/repos/asf/lucene/pylucene/trunk/jcc/ Your patches 
> seem to be small enough so I should be able to do without but it would 
> be nicer if I didn't have to guess...

I think the patch that I attached was already based on trunk. The git 
repository includes the .svn directories, points to trunk, and I generated 
the patch using "svn diff".

> Also, please write small descriptions for these new command line flags to go 
> into JCC's __main__.py file:
> http://svn.apache.org/repos/asf/lucene/pylucene/trunk/jcc/jcc/__main__.py

Done, new patch attached.

> This mess of setuptools patching was meant to be *temporary* until 
> setuptools' issue 43 was fixed. As you can see, I filed this bug 3 1/2 
> years ago, http://bugs.python.org/setuptools/issue43, and my patch for 
> issue 43 still hasn't been accepted, rejected, integrated, 
> anything'ed... Dormant. For over three years.

Sorry about that. I've had similar experience with bugs reported against 
ubuntu, hibernate, rails... :(

>>  * Why does JCC use non-standard command line arguments like --build and
>>  --install? Can it be modified to make it easier to invoke from a
>>  setup.py-style environment, such as exporting a setup() function as
>>  setuptools does?
>
> What standard are you referring to ?
> The extension module build/install/deploy story on Python keeps 
> evolving... Add Python 3.x support into the mix, and the mess is complete.
>
> Seriously, though, I think that the right thing to do to better integrate JCC 
> with distutils/setuptools/distribute/pip/etc... is to make it into a 
> distutils 'compiler'. This requires some work, though, and I haven't done it 
> in all these years. Anyone with the itch to hack on distutils is welcome to 
> take that on.

I'm afraid I don't fully understand how distutils works, it seems to be 
sparsely documented, and I don't have a lot of time and energy to work on 
refactoring jcc. I am a bit surprised that we can't just generate a source 
distribution containing the jars, .cpp files and a setup.py which does the 
rest like any other Python extension.

> I have very little itch to dabble in configure scripts either so I've 
> been dragging my feet. If someone were to step forward with a patch for 
> that, I'd be delighted in ripping out all this patching brittleness.

How would a configure script solve the problem and what would it have to 
do? Generate the .cpp files? How does it integrate with Python extensions?

> That is a whole different project. If I remember correctly, the JPype 
> project is (or was) taking that approach: http://jpype.sourceforge.net

OK, thanks.

>>  * Could JCC generate a source distribution (sdist) that could be
>>    uploaded to pypi?
>
> You mean a source distribution that includes the Java sources of all the 
> libraries/classes wrapped ?

I was thinking more of the jars. Something like 
https://github.com/aptivate/python-tika that doesn't depend on jcc any 
more.

>>  * "setup.py develop" is still broken in the current implementation
>
> I'm not familiar with this 'develop' command, nor was I aware that it is 
> broken. What is it supposed to be doing and how is it broken ?

http://packages.python.org/distribute/setuptools.html#development-mode

It seems that when invoked this way, my setup.py (from python-tika) which 
calls jcc ends up creating build/_tika as a file (not a directory).

For example, this command:

   sudo pip install -e git+https://github.com/aptivate/python-tika#egg=tika

(note the -e for editable mode) results in this:

   Running setup.py develop for tika
   ...
     Traceback (most recent call last):
       File "<string>", line 1, in <module>
       File "/tmp/src/tika/setup.py", line 108, in <module>
         cpp.jcc(jcc_args)
       File "/usr/local/lib/python2.6/dist-packages/JCC-2.12-py2.6-linux-i686.egg/jcc/cpp.py", line 587, in jcc
         os.makedirs(cppdir)
       File "/usr/lib/python2.6/os.py", line 157, in makedirs
         mkdir(name, mode)
     OSError: [Errno 17] File exists: 'build/_tika'

That file appears to contain the source code for the JCCEnv.cpp wrapper.
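
For the bug report, the failure mode can be shown in isolation: os.makedirs raises Errno 17 whenever the target path already exists, including when it exists as a regular file. Below is a minimal sketch of a more defensive check. The helper name ensure_directory is hypothetical and this is not JCC's code; it only illustrates telling "directory already there" apart from "a file is in the way".

```python
import errno
import os


def ensure_directory(path):
    """Create `path` as a directory, tolerating one that already exists.

    Hypothetical helper for illustration; this is not JCC's code.
    Raises OSError with a clearer message if `path` exists but is not
    a directory, which is the situation in the traceback above.
    """
    if os.path.isdir(path):
        return
    if os.path.exists(path):
        raise OSError(errno.EEXIST,
                      "expected a directory but found a file", path)
    os.makedirs(path)
```

This does not explain why build/_tika ends up as a file under "setup.py develop"; it would only make the error easier to recognise.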

> A patch could be written to noisily emit a warning on all methods that are 
> skipped. Silently wrapping everything would simply wrap the entire JDK by 
> transitive closure and produce a huge library, assuming you'd have the 
> patience to watch it compile.
>
> The skipping of methods whose signatures contain types that are not on the 
> 'wrap this' list (explicit or implicit) is by design. Not being able to 
> request emitting a warning is a problem.

Perhaps it's useful to (automatically) emit warnings for classes in the 
JAR files included with --jar or an explicit class name, but not those in 
--include files or otherwise automatically included (e.g. the JDK 
classpath)?

> Thank you very much for your interest and contributions !

Thanks again for your help :)

Cheers, Chris.
-- 
Aptivate | http://www.aptivate.org | Phone: +44 1223 760887
The Humanitarian Centre, Fenner's, Gresham Road, Cambridge CB1 2ES

Aptivate is a not-for-profit company registered in England and Wales
with company number 04980791.

Re: Changes to enable easy_install of packages using JCC

Posted by Andi Vajda <va...@apache.org>.
  Hello,

Comments and replies inline...

On Wed, 1 Feb 2012, Chris Wilson wrote:

> I have been working on integrating Apache Tika (in Java) with our open source 
> intranet application (in Python/Django) using JCC, as described here:
>
> http://blog.aptivate.org/2012/02/01/content-indexing-in-django-using-apache-tika/

Very cool. I had done a Tika build with JCC some time ago and found that it 
required a very long list of parameters and that it integrates with a 
large number of Java libraries. Using Maven there helped considerably with 
getting all the pieces on the Java side.

Your remark about not needing JCC's shared library mode is probably correct 
right now but as soon as anyone brings in another JCC-built library into the 
same process as yours, shared mode is going to be required since the Java VM 
can only be initialized once per process.

> In order to make it easy to install Tika (which normally requires mystic 
> incantations of JCC) I have packaged it up with jar files and a setup.py 
> script. This required some changes to JCC. I hope you will consider these for 
> inclusion in your project. I don't believe that they break backwards 
> compatibility.
>
> Changes implemented by the attached patch and visible online (formatted) at 
> <https://github.com/aptivate/jcc/commits/master>:
>
> * Allow calling cpp.jcc with a --maxheap argument to reduce the heap size, as 
> the default doesn't fit in memory on a reasonably small virtual machine.
>
> * Allow calling cpp.jcc with --egg-info to generate the egg_info, without 
> doing a build.
>
> * Allow calling cpp.jcc with --extra-setup-arg <arg> to pass additional 
> arguments to the setup() function call.

No objections to these patches in principle but it would be easier for me to 
integrate them if you could provide patches computed from the svn repository 
of JCC: http://svn.apache.org/repos/asf/lucene/pylucene/trunk/jcc/
Your patches seem to be small enough so I should be able to do without but 
it would be nicer if I didn't have to guess...

Also, please write small descriptions for these new command line flags to go 
into JCC's __main__.py file:
http://svn.apache.org/repos/asf/lucene/pylucene/trunk/jcc/jcc/__main__.py

> Changes that require more work:
>
> * Can JCC please not fail completely if setuptools hasn't been patched? Can 
> it monkeypatch it instead, or at least fall back to non-shared mode?

This mess of setuptools patching was meant to be *temporary* until 
setuptools' issue 43 was fixed. As you can see, I filed this bug 3 1/2 years 
ago, http://bugs.python.org/setuptools/issue43, and my patch for issue 43 
still hasn't been accepted, rejected, integrated, anything'ed... Dormant. 
For over three years.

If one doesn't want support for shared mode:
   - add a NO_SHARED environment variable during build
   - don't use --shared with JCC during builds
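
As a rough sketch of the first point, the build of JCC itself could be driven from Python with the variable set in the child environment. The function name no_shared_build is made up for illustration, and the value "1" is an assumption: all that is said above is that the variable should be present during the build.

```python
import os
import sys


def no_shared_build(setup_py="setup.py"):
    """Return (argv, env) for building JCC with NO_SHARED set.

    Sketch only: the value "1" is an assumption, since the message
    above only says the variable should be present during the build.
    """
    env = dict(os.environ)      # copy, so the parent environment is untouched
    env["NO_SHARED"] = "1"
    argv = [sys.executable, setup_py, "build"]
    return argv, env
```

From the JCC source directory, the pair could then be passed to subprocess.check_call(argv, env=env).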

> * Why does JCC use non-standard command line arguments like --build and 
> --install? Can it be modified to make it easier to invoke from a 
> setup.py-style environment, such as exporting a setup() function as 
> setuptools does?

What standard are you referring to ?
The extension module build/install/deploy story on Python keeps 
evolving... Add Python 3.x support into the mix, and the mess is complete.

Seriously, though, I think that the right thing to do to better integrate 
JCC with distutils/setuptools/distribute/pip/etc... is to make it into a 
distutils 'compiler'. This requires some work, though, and I haven't done it 
in all these years. Anyone with the itch to hack on distutils is welcome to 
take that on.

Additionally, issue 43 is all about using the distutils/setuptools compiler 
and linker invocation machinery for building a vanilla shared library (as 
opposed to a Python extension). On Linux this is a bit cumbersome. On 
Windows, a little less so. On Mac OS X, it just works.

The alternative would be to write a 'configure' script for that part of the 
JCC build. A configure script would also solve the chicken/egg problem of 
building that library on Windows (the first time, the build needs to be done 
twice for the import library to be in the right place).

Currently, I'm leaning towards the configure script solution since none of 
the projects mentioned above seems to have taken issue 43 on (by simply 
integrating my patches) in all these years and PyLucene's issue 13 is 
currently blocked: 
https://issues.apache.org/jira/browse/PYLUCENE-13?focusedCommentId=13162273&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13162273

I have very little itch to dabble in configure scripts either so I've been 
dragging my feet. If someone were to step forward with a patch for that, 
I'd be delighted in ripping out all this patching brittleness.

> * Could JCC be used to generate dynamic proxies at runtime (with a 
> performance cost) in Python, to avoid the need for a compiler?

That is a whole different project. If I remember correctly, the JPype 
project is (or was) taking that approach: http://jpype.sourceforge.net

> * Could JCC generate a source distribution (sdist) that could be uploaded to 
> pypi?

You mean a source distribution that includes the Java sources of all the 
libraries/classes wrapped ?

> * "setup.py develop" is still broken in the current implementation

I'm not familiar with this 'develop' command, nor was I aware that it is broken.
What is it supposed to be doing and how is it broken ?

> * JCC silently skips wrapping methods whose return type it doesn't know (for 
> example because I forgot to include a JAR file) which requires a lot of 
> debugging to track down and fix. This is doubly hard because it only seems to 
> work when installed, so I can't monkey patch it on the fly to investigate 
> problems, I have to remember to "setup.py install" each time.

A patch could be written to noisily emit a warning on all methods that are 
skipped. Silently wrapping everything would simply wrap the entire JDK by 
transitive closure and produce a huge library, assuming you'd have the 
patience to watch it compile.

The skipping of methods whose signatures contain types that are not on the 
'wrap this' list (explicit or implicit) is by design. Not being able 
to request emitting a warning is a problem.
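
To illustrate what such a patch might do, here is a toy model of the rule: a method is kept only if every type in its signature is on the 'wrap this' list, and each skipped method produces a warning naming the missing types. The function select_methods and its data layout are invented for this sketch and do not reflect JCC's actual internals.

```python
import warnings


def select_methods(methods, wrapped_types):
    """Split methods into those that can be wrapped and those skipped.

    `methods` maps a method name to the list of type names appearing in
    its signature (return type and parameters). A method is skipped when
    any of those types is not in `wrapped_types`; each skip emits a
    warning naming the missing types.

    Toy model for illustration; JCC's real data structures differ.
    """
    kept, skipped = [], []
    for name, types in sorted(methods.items()):
        missing = [t for t in types if t not in wrapped_types]
        if missing:
            warnings.warn("skipping %s: unwrapped type(s) %s"
                          % (name, ", ".join(missing)))
            skipped.append(name)
        else:
            kept.append(name)
    return kept, skipped
```

Because the warnings go through the standard warnings module, they could be silenced or turned into errors with the usual filters, which may help with the noise concern.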

Thank you very much for your interest and contributions !

Andi..