
Posted to pylucene-dev@lucene.apache.org by Alban Mouton <al...@univ-ubs.fr> on 2010/04/09 13:44:50 UTC

Python wrapper for Tika using JCC

Hello,

I didn't find anything specific on the web about how to do this, except for
this mail:
http://www.mail-archive.com/pylucene-dev@lucene.apache.org/msg00577.html
The JCC documentation might be enough, but it won't hurt to add a few specifics.

Counting me, that makes at least two people who needed it. Enough to set up a
small wiki page, in my opinion:
http://redmine.djity.net/projects/pythontika/wiki

The wrapper seems to work fine, but it hasn't been tested much yet.

Thanks to the JCC developers, it's a very useful piece of software!

Alban Mouton

Re: Python wrapper for Tika using JCC

Posted by Andi Vajda <va...@apache.org>.

On Fri, 9 Apr 2010, Alban Mouton wrote:

> Actually I chose the lazy solution and included the tika-app jar, which, if
> I am correct, already contains all the dependencies that you included one
> by one (it's a standalone version of Tika). I haven't tested all formats
> with my version of the wrapper, but I think it should be fine as it is
> (maybe a few more classes or --package, but probably no --include).

--include is actually neat because all jars included this way are added to
the classpath at runtime and are bundled into the tika egg. In other words,
you can then take that egg and install it elsewhere without having to worry
about carrying your Maven repository around or manually setting your
classpath during initVM() or via the environment.
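
For comparison, the manual route that --include spares you could look roughly like the sketch below. It assumes a wrapper built without --include. The helper name build_classpath is made up for illustration, and the two jar paths are simply taken from the command quoted elsewhere in this thread. The only JCC features relied on are the classpath keyword of initVM() and, as an assumption, the CLASSPATH attribute that JCC-built modules such as PyLucene's expose.

```python
import os


def build_classpath(jar_paths):
    """Join jar paths into one JVM classpath string, expanding '~'."""
    return os.pathsep.join(os.path.expanduser(p) for p in jar_paths)


# Illustrative dependency list; with --include this step is unnecessary
# because the jars travel inside the egg.
jars = [
    "~/.m2/repository/org/apache/poi/poi/3.6/poi-3.6.jar",
    "~/.m2/repository/log4j/log4j/1.2.14/log4j-1.2.14.jar",
]
classpath = build_classpath(jars)

# With a wrapper built without --include, the JVM would then be started
# along these lines (assumption: the generated module exposes CLASSPATH):
#   import tika
#   tika.initVM(classpath=tika.CLASSPATH + os.pathsep + classpath)
```

Every machine running the wrapper would need those jars at the same paths, which is exactly the bookkeeping --include avoids.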

Andi..

> Tika is very cool indeed! And it covers functionality that is missing in
> the Python world (to my knowledge, anyway), so this wrapper might be
> useful.
>
> Alban

Re: Python wrapper for Tika using JCC

Posted by Andi Vajda <va...@apache.org>.

On Fri, 9 Apr 2010, Alban Mouton wrote:

> Hello,
>
> I didn't find anything specific on the web about how to do this, except for
> this mail:
> http://www.mail-archive.com/pylucene-dev@lucene.apache.org/msg00577.html
> The JCC documentation might be enough, but it won't hurt to add a few
> specifics.
>
> Counting me, that makes at least two people who needed it. Enough to set up
> a small wiki page, in my opinion:
> http://redmine.djity.net/projects/pythontika/wiki
>
> The wrapper seems to work fine, but it hasn't been tested much yet.
>
> Thanks to the JCC developers, it's a very useful piece of software!

A guy at work asked me the same question a couple of days ago.
Not knowing Tika, I helped him with the JCC command line aspects. He used a
Tika example similar to the one on your wiki page.

Lucene and Tika are a bit different in that Tika depends on a long list of
third-party libraries, all helpfully downloaded by Maven into the local Maven
repository in your home directory as you build Tika. Lucene, by contrast, is
standalone.

The approach we took was to --jar the tika core and tika parsers jars, which
gives access from Python to all public classes in these two jar files, and to
use --include for all the other dependencies we found, so that we avoid
generating wrappers for them but still get them included in the resulting
tika egg. We also needed to allow wrapper generation for the java.io and
org.xml.sax packages by using --package, and to explicitly request
java.io.FileInputStream. I noticed you used class names with --package
there...

This is the JCC command we used for wrapping Tika 0.7 with Python 2.6.2:

python -m jcc.__main__ --shared --python tika --version 0.7 \
   --build --install \
   --jar tika-core/target/tika-core-0.7.jar \
   --jar tika-parsers/target/tika-parsers-0.7.jar \
   --package java.io java.io.FileInputStream \
   --package org.xml.sax \
   --include ~/.m2/repository/com/drewnoakes/metadata-extractor/2.4.0-beta-1/metadata-extractor-2.4.0-beta-1.jar \
   --include ~/.m2/repository/org/apache/poi/poi/3.6/poi-3.6.jar \
   --include ~/.m2/repository/asm/asm/3.1/asm-3.1.jar \
   --include ~/.m2/repository/org/apache/poi/poi-ooxml/3.6/poi-ooxml-3.6.jar \
   --include ~/.m2/repository/org/apache/xmlbeans/xmlbeans/2.3.0/xmlbeans-2.3.0.jar \
   --include ~/.m2/repository/org/apache/pdfbox/pdfbox/1.1.0/pdfbox-1.1.0.jar \
   --include ~/.m2/repository/commons-logging/commons-logging/1.1.1/commons-logging-1.1.1.jar \
   --include ~/.m2/repository/log4j/log4j/1.2.14/log4j-1.2.14.jar \
   --include ~/.m2/repository/org/apache/commons/commons-compress/1.0/commons-compress-1.0.jar \
   --include ~/.m2/repository/org/apache/poi/poi-scratchpad/3.6/poi-scratchpad-3.6.jar \
   --include ~/.m2/repository/org/apache/poi/poi-ooxml-schemas/3.6/poi-ooxml-schemas-3.6.jar \
   --include ~/.m2/repository/org/ccil/cowan/tagsoup/tagsoup/1.2/tagsoup-1.2.jar \
   --include ~/.m2/repository/org/apache/pdfbox/fontbox/1.1.0/fontbox-1.1.0.jar
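
Since every --include path above follows the standard Maven local repository layout (group id split into directories, then artifact id, version, and artifact-version.jar), the argument list could be generated instead of typed. The following is only a sketch: the helper names maven_jar_path and jcc_args are made up for illustration, and it uses no JCC flags beyond those in the command above.

```python
import os


def maven_jar_path(group, artifact, version, repo="~/.m2/repository"):
    """Path of a jar in a local Maven repository (standard layout)."""
    parts = [os.path.expanduser(repo)]
    parts += group.split(".")
    parts += [artifact, version, "%s-%s.jar" % (artifact, version)]
    return os.path.join(*parts)


def jcc_args(jars, packages, classes, includes):
    """Build the JCC argument list used in this thread.

    jars:     jar files to generate wrappers for (--jar)
    packages: extra packages allowed for wrapper generation (--package)
    classes:  extra classes requested explicitly (bare arguments)
    includes: (group, artifact, version) tuples bundled via --include
    """
    args = ["python", "-m", "jcc.__main__", "--shared",
            "--python", "tika", "--version", "0.7",
            "--build", "--install"]
    for jar in jars:
        args += ["--jar", jar]
    for package in packages:
        args += ["--package", package]
    args += classes
    for coords in includes:
        args += ["--include", maven_jar_path(*coords)]
    return args


args = jcc_args(
    jars=["tika-core/target/tika-core-0.7.jar",
          "tika-parsers/target/tika-parsers-0.7.jar"],
    packages=["java.io", "org.xml.sax"],
    classes=["java.io.FileInputStream"],
    includes=[("org.apache.poi", "poi", "3.6"),
              ("asm", "asm", "3.1")],
)
# The list could then be handed to subprocess.call(args).
```

Adding a dependency then means adding one tuple rather than one long path.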

We kept adding --include pairs until we were able to run the example code we
had in mind, which looked like:
   >>> from tika import *
   >>> initVM()
   >>> metadata = Metadata()
   >>> handler = MetadataHandler(metadata, "foo")
   >>> parser = AutoDetectParser()
   >>> parser.parse(FileInputStream("image.jpg"), handler, metadata)
   >>> metadata
   <Metadata: Number of Components=3 Model=HP psc1300 Image Height=728 pixels 
Data Precision=8 bits YCbCr Positioning=Datum point Reference 
Black/White=[0,128,128] [255,255,255] Component 1=Y component: Quantization 
table 0, Sampling factors 2 horiz/2 vert Component 2=Cb component: 
Quantization table 1, Sampling factors 1 horiz/1 vert Component 3=Cr 
component: Quantization table 1, Sampling factors 1 horiz/1 vert X 
Resolution=200 dots per inch Resolution Unit=Inch Image Width=1114 pixels 
Content-Type=image/jpeg Y Resolution=200 dots per inch Make=HP >

Very cool, Tika !!

I'm sure more --include pairs are necessary to support other formats we
haven't tested, but you get the idea...
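
When a format fails because a Java class cannot be found, one way to figure out which jar to --include next is to search the local Maven repository for that class. The sketch below is not part of JCC or Tika; the function name find_jars_with_class is made up for illustration, and it relies only on the fact that a jar is a zip archive holding one .class entry per class.

```python
import os
import zipfile


def find_jars_with_class(root, class_name):
    """Return the jars under root that contain the given class.

    class_name is fully qualified, e.g. "org.example.Foo".
    """
    entry = class_name.replace(".", "/") + ".class"
    matches = []
    for dirpath, _dirnames, filenames in os.walk(os.path.expanduser(root)):
        for name in filenames:
            if not name.endswith(".jar"):
                continue
            path = os.path.join(dirpath, name)
            try:
                with zipfile.ZipFile(path) as jar:
                    if entry in jar.namelist():
                        matches.append(path)
            except (zipfile.BadZipFile, OSError):
                continue  # skip corrupt or unreadable jars
    return sorted(matches)


# Possible use, with the class name taken from the Java error message:
#   find_jars_with_class("~/.m2/repository",
#                        "org.apache.pdfbox.pdmodel.PDDocument")
```

Each path it returns is a candidate for another --include on the JCC command line.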

I hope this helps !

Andi..