Posted to user@tika.apache.org by Kevin Milburn <km...@gpslsolutions.com> on 2012/07/05 19:10:47 UTC

using tika with eclipse

Hi

I've been trying to add tika 1.1 support  to an Eclipse RCP application 
but am struggling to get the parsers loaded.
I have both tika-core-1.1.jar and tika-bundle-1.1.jar plugins added to 
the target and selected within product and have confirmed both plugins 
are present in the running program.

The fundamental problem appears to be that the TikaConfig is ultimately 
reaching ServiceLoader.findServiceResources,  looking for 
META-INF/services/org.apache.tika.parser.Parser.   While doing so, it 
only appears to check the org.apache.tika.core plugin, which doesn't 
contain it, so no Parsers are available.

Any ideas where I may have gone wrong or how to get it working?

TIA
Kevin.
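
For reference, the per-class-loader lookup described above can be reproduced with the JDK alone. This is a minimal sketch, not Tika's own code: the names SpiLookup and findServiceFiles are made up for illustration. It asks one class loader for every registration file of a service name; in an OSGi runtime each bundle has its own class loader, so a file packaged in one bundle need not be visible from another.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.URL;
import java.util.Collections;
import java.util.List;

// Hypothetical helper, not part of Tika: lists the SPI registration files
// that a single class loader can see for a given service interface name.
final class SpiLookup {

    private SpiLookup() {
    }

    static List<URL> findServiceFiles(ClassLoader loader, String serviceName) {
        try {
            // Only resources visible to this loader (and its parents) are returned.
            return Collections.list(loader.getResources("META-INF/services/" + serviceName));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Called with the class loader of a tika-core class and the name "org.apache.tika.parser.Parser", an empty list would be consistent with the symptom reported here.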

Re: using tika with eclipse

Posted by Nick Burch <ni...@alfresco.com>.
On Wed, 18 Jul 2012, rodgersh wrote:
> And here is my custom-mimetypes.xml file:
>
> <?xml version="1.0" encoding="UTF-8"?>
> <mime-info>
>   <mime-type type="image/nitf">
>       <alias type="image/ntf"/>
>       <glob pattern="*.nitf"/>
>   </mime-type>
> </mime-info>

I've no idea about OSGi, so I can't comment on what you need to do to have 
it look at your extra file. Hopefully one of our OSGi experts can help you 
with the appropriate incantation / jar file blessing / etc.

However, I do know about mimetypes in Tika, so I've fixed your problem 
that way - see TIKA-957. As of r1363160 Tika should now know about NITF 
files, and should have some mime magic for them (works on the few sample 
files I tried).

Nick

Re: using tika with eclipse

Posted by rodgersh <hu...@lmco.com>.
I have a very similar issue, but using Tika on Karaf vs. eclipse.

I am using Tika v1.2 and Karaf v2.2.7 on Windows 7.

I have made an OSGi bundle that uses Tika and provides a
getFileExtensionForMimeType(...) method. I have added a
org/apache/tika/mime/custom-mimetypes.xml file to my src/main/resources
directory. I have made a custom parser and added a
META-INF/services/org.apache.tika.parser.Parser file that lists it (although
I am not trying to use the custom parser yet).

When another bundle invokes this bundle's getFileExtensionForMimeType(...)
method it works for mime types that Tika supports by default, but it does
not find the mime types in my custom-mimetypes.xml file.

It's like this custom mime types file is not found by the OSGi container.
Any help is appreciated.

Here is my method's code:

public String getFileExtensionForMimeType( String contentType ) throws MimeTypeException
    {
        //TikaConfig config = TikaConfig.getDefaultConfig();   // this did not work for custom mime types
        TikaConfig config = null;
        try
        {
            config = new TikaConfig( this.getClass().getClassLoader() );
        }
        catch ( IOException e )
        {
            logger.warn( "Error creating TikaConfig with ClassLoader", e );
            return null;
        }

        MimeTypes mimeTypes = config.getMimeRepository();
        String extension = null;
        try
        {
            MimeType mimeType = mimeTypes.forName( contentType );
            extension = mimeType.getExtension();
        }
        catch ( Exception e )
        {
            logger.warn( "Exception caught getting file extension for mime type " + contentType, e );
        }

        logger.debug( "mimeType = " + contentType + ",   file extension = [" + extension + "]" );

        return extension;
    }


And here is my custom-mimetypes.xml file:

<?xml version="1.0" encoding="UTF-8"?>
 <mime-info>
   <mime-type type="image/nitf">
       <alias type="image/ntf"/>
       <glob pattern="*.nitf"/>
   </mime-type>
 </mime-info>


I have verified my input is "image/nitf" mime type. This method worked when
the input was "application/octet-stream", it returned ".bin"
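
A quick way to rule out the file itself is to parse it outside Tika. The sketch below is only a check under stated assumptions: MimeInfoCheck and globsByType are made-up names, and it uses nothing but the JDK XML parser, so it says nothing about how Tika or the OSGi container locate the file. It maps each declared mime type to its glob patterns.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper, not part of Tika: reads a mime-info document with the
// JDK XML parser and maps each declared type to its glob patterns.
final class MimeInfoCheck {

    private MimeInfoCheck() {
    }

    static Map<String, List<String>> globsByType(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            Map<String, List<String>> result = new LinkedHashMap<>();
            NodeList types = doc.getElementsByTagName("mime-type");
            for (int i = 0; i < types.getLength(); i++) {
                Element type = (Element) types.item(i);
                // Collect the pattern attribute of every glob under this type.
                List<String> globs = new ArrayList<>();
                NodeList globNodes = type.getElementsByTagName("glob");
                for (int j = 0; j < globNodes.getLength(); j++) {
                    globs.add(((Element) globNodes.item(j)).getAttribute("pattern"));
                }
                result.put(type.getAttribute("type"), globs);
            }
            return result;
        } catch (Exception e) {
            throw new IllegalArgumentException("Not a readable mime-info document", e);
        }
    }
}
```

If the file parses and lists the expected type and glob, the remaining suspect is whether the resource is visible to the class loader that Tika ends up using.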



Re: using tika with eclipse

Posted by Nick Burch <ni...@alfresco.com>.
On Fri, 6 Jul 2012, Kevin Milburn wrote:
> It appears my main mistake is trying to use Tika or TikaConfig, like 
> every example I've found has done, which appears to be completely 
> incompatible with using Tika in an OSGI environment!   :(
>
> e.g.  the following produces no output,  despite the file containing text.
>    Tika tika = new Tika();
>    System.out.print(tika.parseToString(new FileInputStream(xmlFile)));

Once you work out the appropriate incantation, any chance you could write 
something up for the Tika wiki about it? <http://wiki.apache.org/tika/>

(As you may have gathered, there aren't a lot of people using Tika with 
OSGi yet, so the trail you blaze can hopefully help others later!)

Cheers
Nick

Re: using tika with eclipse

Posted by Jukka Zitting <ju...@gmail.com>.
Hi,

On Mon, Jul 16, 2012 at 1:13 PM, Kevin Milburn
<km...@gpslsolutions.com> wrote:
> It would be nice if the Tika and TikaConfig classes had greater awareness of
> the OSGI environment as they currently perform redundant work trying to load
> the services files which they'll never find.

Note that there are cases where people embed the tika-core jar into a
larger bundle that also comes with some of the parser libraries, or
where a client bundle uses Tika with parser services loaded from the
class loader of the client bundle. In such cases it's a good idea to
also use the Java service provider mechanism to load services.

And in any case the static service loading is a fairly cheap operation
that's typically only done once during the lifetime of an application
or a bundle.
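
As an illustration of the "load once through the client's class loader" idea, here is a minimal sketch. It uses the JDK's java.util.ServiceLoader as a stand-in and does not reproduce Tika's own loader class; StaticServices is a made-up name.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.ServiceLoader;

// Hypothetical sketch using the JDK's java.util.ServiceLoader as a stand-in;
// Tika has its own loader class, which this does not reproduce.
final class StaticServices<T> {

    private final Class<T> type;
    private final ClassLoader loader;
    private List<T> cached; // filled on first use, then reused

    StaticServices(Class<T> type, ClassLoader loader) {
        this.type = type;
        this.loader = loader;
    }

    // Scans the META-INF/services entries once through the given class loader.
    synchronized List<T> get() {
        if (cached == null) {
            List<T> found = new ArrayList<>();
            for (T service : ServiceLoader.load(type, loader)) {
                found.add(service);
            }
            cached = Collections.unmodifiableList(found);
        }
        return cached;
    }
}
```

A client bundle would pass its own class loader, so providers registered inside that bundle are found even though tika-core's loader cannot see them. Later calls reuse the cached list.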

BR,

Jukka Zitting

Re: using tika with eclipse

Posted by Kevin Milburn <km...@gpslsolutions.com>.
On 2012/07/06 22:31, Jukka Zitting wrote:
> On Fri, Jul 6, 2012 at 7:47 PM, Kevin Milburn
> <km...@gpslsolutions.com> wrote:
>> I've tested this by modifying the tika-core/pom.xml (see attached), and
>> adding the following line:
>>
>>              <Bundle-Activator>
>>                org.apache.tika.config.TikaActivator
>>              </Bundle-Activator>
>> + <Bundle-ActivationPolicy>lazy</Bundle-ActivationPolicy>
>>
>> Any chance of this for the 1.2 release?
> Sure, I just committed it, see https://issues.apache.org/jira/browse/TIKA-951.
>

Thanks for that,    I've tested the latest snapshot (and RC1) and things 
behave themselves a lot better.

It would be nice if the Tika and TikaConfig classes had greater 
awareness of the OSGI environment as they currently perform redundant 
work trying to load the services files which they'll never find.

Thanks again
Kevin.

p.s.  For those trying to get Tika to work in Eclipse, you need to do 
something along these lines.

Change the Target Definition  (or create a new one)
         On the Definition tab, add the location of the tika-bundle and 
tika-core jars
         On the Content tab,  make sure the core and bundle plugins are 
selected
         Set as Target Platform

In each plugin that needs Tika support,  add org.apache.tika.core to the 
plugin's dependencies

Change the Product Configuration  (or create a new one),
         On the Dependencies tab, add org.apache.tika.core and o.a.t.bundle
         On the Configuration tab, add o.a.t.bundle to the Start levels, 
and set Auto-Start to true.
         On the Overview tab, Test the product by launching a runtime 
instance of it.


Re: using tika with eclipse

Posted by Jukka Zitting <ju...@gmail.com>.
Hi,

On Fri, Jul 6, 2012 at 7:47 PM, Kevin Milburn
<km...@gpslsolutions.com> wrote:
> Eclipse is a finicky beast, even if a bundle has an Activator it won't be
> activated if the Bundle-ActivationPolicy is not set, unless the product is
> modified to explicitly auto start the bundle.

Interesting, I didn't know that.

> Ideally,  it would be preferable to set the Bundle-ActivationPolicy to lazy
> to allow Eclipse (and others?) to do the right thing without needless
> complication.

Sounds like a good idea!

> I've tested this by modifying the tika-core/pom.xml (see attached), and
> adding the following line:
>
>             <Bundle-Activator>
>               org.apache.tika.config.TikaActivator
>             </Bundle-Activator>
> + <Bundle-ActivationPolicy>lazy</Bundle-ActivationPolicy>
>
> Any chance of this for the 1.2 release?

Sure, I just committed it, see https://issues.apache.org/jira/browse/TIKA-951.

> p.s.  an alternative method of obtaining access to the Detector and Parser
> involves something like this in your own bundle's activator:

The reason why we use ServiceTrackers instead is that we want to
support deployments where new parser and detector services can be
added or removed dynamically from the running system.

BR,

Jukka Zitting

Re: using tika with eclipse

Posted by Kevin Milburn <km...@gpslsolutions.com>.
On 2012/07/06 17:43, Jukka Zitting wrote:
> You'll want to make sure that both the tika-bundle and tika-core 
> bundles are actually started/activated by the OSGi environment, as 
> otherwise the relevant Activators that Tika uses to hook up with the 
> available services won't get started. 
Bingo,  having spent much time on why the Parsers were not behaving,  
it's actually the tika-core bundle that is not activating.
Eclipse is a finicky beast, even if a bundle has an Activator it won't 
be activated if the Bundle-ActivationPolicy is not set, unless the 
product is modified to explicitly auto start the bundle.

Ideally,  it would be preferable to set the Bundle-ActivationPolicy to 
lazy to allow Eclipse (and others?) to do the right thing without 
needless complication.

I've tested this by modifying the tika-core/pom.xml (see attached), and 
adding the following line:

             <Bundle-Activator>
               org.apache.tika.config.TikaActivator
             </Bundle-Activator>
+ <Bundle-ActivationPolicy>lazy</Bundle-ActivationPolicy>

Any chance of this for the 1.2 release?

Thanks for the help.
Kevin.

p.s.  an alternative method of obtaining access to the Detector and 
Parser involves something like this in your own bundle's activator:

import org.apache.tika.detect.Detector;
import org.apache.tika.parser.Parser;
...
     @Override
     public void start(BundleContext context) throws Exception {
         super.start(context);
         detector = (Detector) context.getService(
                 context.getServiceReference(Detector.class.getName()));
         parser = (Parser) context.getService(
                 context.getServiceReference(Parser.class.getName()));
     }


Re: using tika with eclipse

Posted by Jukka Zitting <ju...@gmail.com>.
Hi,

On Fri, Jul 6, 2012 at 6:27 PM, Kevin Milburn
<km...@gpslsolutions.com> wrote:
> It appears my main mistake is trying to use Tika or TikaConfig, like
> every example I've found has done, which appears to be completely
> incompatible with using Tika in an OSGI environment!   :(

That shouldn't be the case. What's the code you're using?

You'll want to make sure that both the tika-bundle and tika-core
bundles are actually started/activated by the OSGi environment, as
otherwise the relevant Activators that Tika uses to hook up with the
available services won't get started.

Adding a breakpoint or a System.out print to the
o.a.t.config.TikaActivator class in tika-core and the
o.a.t.parser.internal.Activator class in tika-parsers/-bundle should
help make sure that these Activators really are being invoked by the
OSGi environment.

> e.g.  the following produces no output,  despite the file containing text.
>     Tika tika = new Tika();
>     System.out.print(tika.parseToString(new FileInputStream(xmlFile)));

See the BundleIT test case inside the tika-bundle component. That's a
pretty similar piece of code that works fine in an OSGi environment.

BR,

Jukka Zitting

Re: using tika with eclipse

Posted by Kevin Milburn <km...@gpslsolutions.com>.
On 2012/07/06 16:14, Jukka Zitting wrote:
> The tika-bundle should start up Parser and Detector services that 
> tika-core will then access through the OSGi framework.

OK,   I've done a bit more debugging,  and think I know where I've gone 
wrong.
Having got a breakpoint in the right place,  I can see that the Parser 
and Detector services are being generated correctly.

It appears my main mistake is trying to use Tika or TikaConfig, like 
every example I've found has done, which appears to be completely 
incompatible with using Tika in an OSGI environment!   :(

e.g.  the following produces no output,  despite the file containing text.
     Tika tika = new Tika();
     System.out.print(tika.parseToString(new FileInputStream(xmlFile)));



Re: using tika with eclipse

Posted by Jukka Zitting <ju...@gmail.com>.
Hi,

On Fri, Jul 6, 2012 at 5:00 PM, Kevin Milburn
<km...@gpslsolutions.com> wrote:
> On 2012/07/05 18:22, Jukka Zitting wrote:
>> upgrade to the latest 1.2 SNAPSHOT where declarative services is no longer
>> needed (see https://issues.apache.org/jira/browse/TIKA-896).
>
> I've built and installed the 1.2 SNAPSHOT, but it has made no difference.

Hmm, do you start/activate the bundles after deploying them to the
OSGi environment? I've seen some OSGi setups that only resolve bundles
by default, which only makes the contained classes available, but
doesn't start up the services provided by the bundles.

> It still suffers from the same fundamental problem that the ServiceLoader
> (in tika-core) cannot find "META-INF/services/org.apache.tika.parser.Parser"
> (in tika-bundle).

It's not supposed to. The tika-bundle should start up Parser and
Detector services that tika-core will then access through the OSGi
framework.

As you mentioned, OSGi and SPI don't work that well together, which is
why we're using the OSGi services when Tika gets deployed to an OSGi
environment.

BR,

Jukka Zitting

Re: using tika with eclipse

Posted by Kevin Milburn <km...@gpslsolutions.com>.
On 2012/07/05 18:22, Jukka Zitting wrote:
> upgrade to the latest 1.2 SNAPSHOT where declarative services is no 
> longer needed (see https://issues.apache.org/jira/browse/TIKA-896). 

I've built and installed the 1.2 SNAPSHOT, but it has made no difference.

It still suffers from the same fundamental problem that the 
ServiceLoader (in tika-core) cannot find 
"META-INF/services/org.apache.tika.parser.Parser" (in tika-bundle).

Is there any guidance anywhere on how to set up an Eclipse RCP 
application to use the bundles?

Kevin..



Re: using tika with eclipse

Posted by Jukka Zitting <ju...@gmail.com>.
Hi,

On Thu, Jul 5, 2012 at 7:10 PM, Kevin Milburn
<km...@gpslsolutions.com> wrote:
> Any ideas where I may have gone wrong or how to get it working?

In an OSGi environment Tika makes the Parser and Detector
implementations available as OSGi services that tika-core then
automatically picks up for use with things like AutoDetectParser and
the Tika facade.

In 1.1 you need declarative services support for that to happen, which
is probably why you don't see the parsers coming up in your
deployment. You can either deploy Tika 1.1 with declarative services,
or upgrade to the latest 1.2 SNAPSHOT where declarative services is no
longer needed (see https://issues.apache.org/jira/browse/TIKA-896).

BR,

Jukka Zitting

Re: using tika with eclipse

Posted by Kevin Milburn <km...@gpslsolutions.com>.
On 2012/07/05 18:26, Uwe Schindler wrote:
> Do you have the JAR files in classpath or do you extract them and merge all
> class files and resources? This happens, e.g. if you ask Eclipse to create
> one uber-jar containing everything. The problem that then appears is that
> the META-INF files coming from separate jar files overwrite each
> other. SPI relies on actual jar packages as deployment units.

The JAR files (which are pulled from a Maven repository) have been added 
to the plugins section of the RCP product and are both loaded (i.e. on 
the apps classpath).  The problem stems from the tika-bundle not being 
on the classpath of tika-core bundle.

I could repackage the tika-core and tika-bundle into a single 
OSGI-Bundle,  effectively replicating the bundle before the 1.0 
release.  However,  this would seem to defeat the purpose of the 
OSGi-bundles provided by the tika project.

Also, from what I can gather,  SPI is the cause of the problem, as OSGI 
and SPI are largely incompatible.


RE: using tika with eclipse

Posted by Uwe Schindler <uw...@thetaphi.de>.
Do you have the JAR files in classpath or do you extract them and merge all
class files and resources? This happens, e.g. if you ask Eclipse to create
one uber-jar containing everything. The problem that then appears is that
the META-INF files coming from separate jar files overwrite each
other. SPI relies on actual jar packages as deployment units.

If you only add the unmodified jar files to classpath, this should work. The
same applies by the way for Solr and Lucene 4.0, which also use SPI for
their codec infrastructure.
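
To make the overwrite problem concrete: when several jars each carry a META-INF/services file of the same name, a merged jar has to combine the entries rather than keep only the last copy. The following is a minimal sketch of that combining step only. ServiceFileMerger is a made-up name and is not taken from Eclipse or any build tool.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper, not taken from any build tool: combines the contents of
// same-named META-INF/services files instead of letting the last one win.
final class ServiceFileMerger {

    private ServiceFileMerger() {
    }

    static Set<String> merge(List<String> fileContents) {
        Set<String> providers = new LinkedHashSet<>();
        for (String content : fileContents) {
            for (String line : content.split("\\R")) {
                // Provider files allow '#' comments and blank lines.
                int hash = line.indexOf('#');
                String name = (hash >= 0 ? line.substring(0, hash) : line).trim();
                if (!name.isEmpty()) {
                    providers.add(name);
                }
            }
        }
        return providers;
    }
}
```

Each input string is the text of one jar's copy of the file. The result keeps every provider class name once, in first-seen order.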

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: uwe@thetaphi.de

