Posted to user@tika.apache.org by Shalom Ben-Zvi <sh...@gmail.com> on 2012/03/08 05:45:04 UTC

tika in servicemix - empty result, parsers not found.

Hi all,
We are using Tika in a web app to extract text from PDF documents with
no problem.
Now we want to move some of our code to run in ServiceMix, and Tika no
longer works there: it just produces empty results. There is no
exception or anything, just an empty result.
I think it all comes down to a classpath issue with ServiceMix that I
can't figure out how to fix.

I installed tika-bundle-1.0 and tika-core-1.0 into servicemix.
I'm invoking Tika from my bundle, actually a camel route.

this is the code I tried:

            Tika tika = new Tika();
            tika.setMaxStringLength(-1);
            Metadata metadata = new Metadata();
            Reader in = tika.parse(
                    new FileInputStream("/myopt/books/ejb-3_0-fr-spec-persistence.pdf"), metadata);
            org.apache.commons.io.IOUtils.copy(in,
                    new FileOutputStream("/tmp/tikatest/tika-text.txt"));


Running the same code fragment outside of ServiceMix produces the full
13000 lines of text.


Debugging the code, I found that
org.apache.tika.config.ServiceLoader#loadServiceProviders can't find the
parser list at META-INF/services/org.apache.tika.parser.Parser, although
the file is there. As a result, the DefaultParser is initialized with an
empty parser list, and when parse is called on the Tika object,
getParser(metadata) in org.apache.tika.parser.CompositeParser#parse
returns the EmptyParser.
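
One way to narrow this down is to ask each class loader directly whether it can see the service file. The following is a minimal sketch using only JDK calls; ServiceFileProbe is a hypothetical helper name, not part of Tika, and it is an assumption that the class loaders checked here are the ones relevant to the lookup inside Tika.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.URL;
import java.util.Collections;
import java.util.List;

// Hypothetical diagnostic helper (not part of Tika): lists the copies of a
// META-INF/services file that a given class loader can actually see.
class ServiceFileProbe {

    // Returns the URL of every META-INF/services/<serviceName> resource
    // visible to the given loader; an empty list means the loader sees none.
    static List<URL> find(ClassLoader loader, String serviceName) {
        try {
            return Collections.list(loader.getResources("META-INF/services/" + serviceName));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String name = "org.apache.tika.parser.Parser";
        ClassLoader own = ServiceFileProbe.class.getClassLoader();
        ClassLoader context = Thread.currentThread().getContextClassLoader();
        System.out.println("own loader:     " + find(own, name));
        if (context != null) {
            System.out.println("context loader: " + find(context, name));
        }
    }
}
```

Calling find from inside the bundle that invokes Tika, with that bundle's own class loader, would show whether the file is visible from there. In an OSGi container each bundle has its own class loader, so a file that exists in one bundle is not necessarily visible from another.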

I tried putting the files org.apache.tika.parser.Parser and
org.apache.tika.detect.Detector in my own bundle under the same location,
but that didn't help.

I tried to provide Tika with a Tika config file that looks like this:
<properties>

    <mimeTypeRepository resource="/org/apache/tika/mime/tika-mimetypes.xml" magic="false"/>

    <parsers>
        <parser name="parse-pdf" class="org.apache.tika.parser.pdf.PDFParser">
            <mime>application/pdf</mime>
        </parser>
    </parsers>

</properties>

and initialized tika:
tikaConfig = new TikaConfig(configFile);
tika = new Tika(tikaConfig);

But then I got an exception saying that the PDF parser was not found,
although the class is there, of course:
org.apache.tika.exception.TikaException: Configured parser class not found: org.apache.tika.parser.pdf.PDFParser
Caused by: java.lang.ClassNotFoundException: org.apache.tika.parser.pdf.PDFParser not found by org.apache.tika.core [314]
        at org.apache.felix.framework.ModuleImpl.findClassOrResourceByDelegation(ModuleImpl.java:787)
        at org.apache.felix.framework.ModuleImpl.access$400(ModuleImpl.java:71)
        at org.apache.felix.framework.ModuleImpl$ModuleClassLoader.loadClass(ModuleImpl.java:1768)
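
The "not found by org.apache.tika.core [314]" part of the message indicates that the lookup went through the tika-core bundle's class loader. A class can exist in the container and still be invisible to a particular loader. The sketch below illustrates that general effect with plain JDK classes only; LoaderVisibilityDemo is a hypothetical name, and the isolated URLClassLoader merely stands in for a loader that cannot see the class. It is not how Felix or Tika resolve classes.

```java
import java.net.URL;
import java.net.URLClassLoader;

// Hypothetical illustration (not Tika or Felix code): the same class name
// resolves through one class loader and fails through another.
class LoaderVisibilityDemo {

    // True if the loader can resolve the class name, false if the lookup
    // ends in ClassNotFoundException.
    static boolean canLoad(ClassLoader loader, String className) {
        try {
            Class.forName(className, false, loader);
            return true;
        } catch (ClassNotFoundException e) {
            return false;
        }
    }

    // A loader with no URLs and no parent: it delegates only to the JDK
    // bootstrap loader, so application classes are invisible to it.
    static ClassLoader isolatedLoader() {
        return new URLClassLoader(new URL[0], null);
    }

    public static void main(String[] args) {
        String self = LoaderVisibilityDemo.class.getName();
        System.out.println("own loader:      "
                + canLoad(LoaderVisibilityDemo.class.getClassLoader(), self));
        System.out.println("isolated loader: " + canLoad(isolatedLoader(), self));
    }
}
```

Applied to the trace above, the question is which bundle's class loader is asked for org.apache.tika.parser.pdf.PDFParser, and whether that bundle can see the package that contains it.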


I also tried installing tika-parsers into ServiceMix, although I'm sure
it's not necessary since tika-bundle contains them all. That didn't help
either.

I tried installing the regular Maven Tika jars with all their
dependencies into ServiceMix, but I got the same result.

I've been trying all night to make it work, and I've run out of ideas.

Thank you for any help.

Shalom






Re: tika in servicemix - empty result, parsers not found.

Posted by Shalom Ben-Zvi <sh...@gmail.com>.
Thanks, Nick.
The Tika bundle does not contain tika-core. It contains all the necessary
dependencies and the parsers, which is why I had to install tika-core. I
think that if the bundle contained tika-core there wouldn't be a problem.



On 03/08/12 13:15, Nick Burch wrote:
> On Thu, 8 Mar 2012, Shalom Ben-Zvi wrote:
>> I installed tika-bundle-1.0 and tika-core-1.0 into servicemix.
>> I'm invoking Tika from my bundle, actually a camel route.
>
> I don't know a lot about OSGi, but that might be your issue - you have
> some bits of Tika coming from a bundle, and some bits from outside. Do
> you not need to ensure that everything is in the same place? Also, the
> bundle normally does have Tika Core in it
>
> Nick

Re: tika in servicemix - empty result, parsers not found.

Posted by Shalom Ben-Zvi <sh...@gmail.com>.
Hi,
I'm sure that tika-bundle-1.0 isn't working, at least for me in ServiceMix.
I switched to tika-bundle-0.9 and Tika 0.9 in my code, and everything
works again.
I looked at the Tika source code, and tika-core is not bundled in
tika-bundle.

Thank you.


Re: tika in servicemix - empty result, parsers not found.

Posted by Nick Burch <ni...@alfresco.com>.
On Thu, 8 Mar 2012, Shalom Ben-Zvi wrote:
> I installed tika-bundle-1.0 and tika-core-1.0 into servicemix.
> I'm invoking Tika from my bundle, actually a camel route.

I don't know a lot about OSGi, but that might be your issue - you have 
some bits of Tika coming from a bundle, and some bits from outside. Do you 
not need to ensure that everything is in the same place? Also, the bundle 
normally does have Tika Core in it

Nick