Posted to user@tika.apache.org by Stephan Mühlstrasser <st...@pdflib.com> on 2012/02/14 13:20:41 UTC

Problem with overriding built-in parser

Hi,

I'm trying to override the built-in PDF parser with another one. I 
looked through the mailing list archive and found the following hints 
on how to override a built-in parser:

http://mail-archives.apache.org/mod_mbox/tika-user/201105.mbox/%3CBANLkTimp4omHywv_ptOmqEX9v-%2BW4e7fVA%40mail.gmail.com%3E

https://issues.apache.org/jira/browse/TIKA-527

Is there any documentation of the syntax of the configuration file 
available?

The problem is that using the proposed method does not work for me. Any 
use of the configuration file apparently sends Tika into an endless 
recursion, even without overriding a built-in parser in the 
configuration file.

If I understand it correctly, the following configuration file should 
have the same effect as the built-in configuration:

> $ cat tika-config.xml
> <properties>
> <parsers>
> <parser class="org.apache.tika.parser.DefaultParser"/>
> </parsers>
> </properties>

But if I provide that to Tika, after a while the command line 
application is terminated with an exception:

> $ java -Dtika.config=tika-config.xml -jar tika-app-1.0.jar --list-parsers
> Exception in thread "main" java.lang.OutOfMemoryError: GC overhead limit exceeded
>         at java.util.Arrays.copyOfRange(Arrays.java:3209)
>         at java.lang.String.<init>(String.java:216)
>         at java.lang.StringBuilder.toString(StringBuilder.java:430)
>         at org.apache.tika.mime.MediaType.toString(MediaType.java:237)
>         at org.apache.tika.detect.MagicDetector.<init>(MagicDetector.java:142)
>         at org.apache.tika.mime.MimeTypesReader.readMatch(MimeTypesReader.java:254)
>         at org.apache.tika.mime.MimeTypesReader.readMatches(MimeTypesReader.java:202)
>         at org.apache.tika.mime.MimeTypesReader.readMagic(MimeTypesReader.java:186)
>         at org.apache.tika.mime.MimeTypesReader.readMimeType(MimeTypesReader.java:152)
>         at org.apache.tika.mime.MimeTypesReader.read(MimeTypesReader.java:124)
>         at org.apache.tika.mime.MimeTypesReader.read(MimeTypesReader.java:107)
>         at org.apache.tika.mime.MimeTypesFactory.create(MimeTypesFactory.java:63)
>         at org.apache.tika.mime.MimeTypesFactory.create(MimeTypesFactory.java:91)
>         at org.apache.tika.mime.MimeTypesFactory.create(MimeTypesFactory.java:147)
>         at org.apache.tika.mime.MimeTypes.getDefaultMimeTypes(MimeTypes.java:455)
>         at org.apache.tika.config.TikaConfig.typesFromDomElement(TikaConfig.java:273)
>         at org.apache.tika.config.TikaConfig.<init>(TikaConfig.java:161)
>         at org.apache.tika.config.TikaConfig.getDefaultConfig(TikaConfig.java:237)
>         at org.apache.tika.mime.MediaTypeRegistry.getDefaultRegistry(MediaTypeRegistry.java:42)
>         at org.apache.tika.parser.DefaultParser.<init>(DefaultParser.java:52)
>         at sun.reflect.GeneratedConstructorAccessor4.newInstance(Unknown Source)
>         at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
>         at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
>         at java.lang.Class.newInstance0(Class.java:355)
>         at java.lang.Class.newInstance(Class.java:308)
>         at org.apache.tika.config.TikaConfig.parserFromDomElement(TikaConfig.java:288)
>         at org.apache.tika.config.TikaConfig.<init>(TikaConfig.java:162)
>         at org.apache.tika.config.TikaConfig.getDefaultConfig(TikaConfig.java:237)
>         at org.apache.tika.mime.MediaTypeRegistry.getDefaultRegistry(MediaTypeRegistry.java:42)
>         at org.apache.tika.parser.DefaultParser.<init>(DefaultParser.java:52)
>         at sun.reflect.GeneratedConstructorAccessor4.newInstance(Unknown Source)
>         at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
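
Read from the bottom up, the frames in the trace repeat in a cycle: 
TikaConfig.getDefaultConfig, TikaConfig.<init>, parserFromDomElement, 
Class.newInstance, DefaultParser.<init>, 
MediaTypeRegistry.getDefaultRegistry, and then 
TikaConfig.getDefaultConfig again. The following is only a minimal 
sketch of that shape, not Tika's code: ConfigModel, DefaultParserModel 
and the LOADING flag are hypothetical names invented for illustration. 
It shows how a re-entrancy guard could turn such a cycle into a 
readable error instead of an OutOfMemoryError.

```java
/**
 * Simplified model of the cycle visible in the stack trace. All class and
 * method names here are hypothetical stand-ins, not Tika's implementation.
 */
public class ConfigCycleSketch {

    /** True while a default configuration is being built on this thread. */
    private static final ThreadLocal<Boolean> LOADING =
            ThreadLocal.withInitial(() -> Boolean.FALSE);

    /** Stand-in for a parser whose constructor asks for the default config. */
    static final class DefaultParserModel {
        DefaultParserModel() {
            ConfigModel.getDefaultConfig();
        }
    }

    /** Stand-in for a configuration that instantiates the parser it names. */
    static final class ConfigModel {
        private ConfigModel() {
            // Models a config file entry that names the default parser.
            new DefaultParserModel();
        }

        static ConfigModel getDefaultConfig() {
            if (LOADING.get()) {
                // Second entry on the same thread: report instead of recursing.
                throw new IllegalStateException(
                        "configuration is being loaded recursively: a configured "
                        + "parser requested the default configuration");
            }
            LOADING.set(Boolean.TRUE);
            try {
                return new ConfigModel();
            } finally {
                // Reset the guard so a later attempt starts cleanly.
                LOADING.set(Boolean.FALSE);
            }
        }
    }

    /** Returns the guard's error message, or null if loading succeeded. */
    static String tryLoad() {
        try {
            ConfigModel.getDefaultConfig();
            return null;
        } catch (IllegalStateException e) {
            return e.getMessage();
        }
    }

    public static void main(String[] args) {
        System.out.println(tryLoad());
    }
}
```

Without the guard, the two constructors in this model would call each 
other until the JVM gives up, which is the behaviour the trace suggests.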

Is this a bug in Tika, or am I doing something wrong?

Thanks
Stephan

-- 
_______________________________________________________________
Stephan Mühlstrasser   stm@pdflib.com            www.pdflib.com
   PDFlib GmbH, Franziska-Bilek-Weg 9, 80339 München,  Germany
        Court of registry/Amtsgericht München HRB 129497
  Managing Directors/Geschäftsführer: Thomas Merz, Petra Porst
---------------------------------------------------------------
     PDFlib: powerful toolkits for PDF developers since 1997
_______ See www.pdflib.com/products for product details________



Re: Problem with overriding built-in parser

Posted by Stephan Mühlstrasser <st...@pdflib.com>.
Am 17.02.12 13:22, schrieb Nick Burch:
> On Fri, 17 Feb 2012, Stephan Mühlstrasser wrote:
>>> That's not a unit test though - yours needs to be run manually. If we
>>> can run it automatically, we can add it to the test suite to make
>>> sure it doesn't get broken in future.
>>
>> I understand, here is the reproduction as a unit test:
>
> Looks good, thanks!
>
> Any chance you could open a new issue in JIRA, and attach it there?
> We'll need to decide what to do in the case of missing entries in the
> config file (abort vs silently put in the default), by having it in JIRA
> we won't forget it :)

I created TIKA-866 and attached the unit test.

Best Regards
Stephan



Re: Problem with overriding built-in parser

Posted by Nick Burch <ni...@alfresco.com>.
On Fri, 17 Feb 2012, Stephan Mühlstrasser wrote:
>> That's not a unit test though - yours needs to be run manually. If we 
>> can run it automatically, we can add it to the test suite to make sure 
>> it doesn't get broken in future.
>
> I understand, here is the reproduction as a unit test:

Looks good, thanks!

Any chance you could open a new issue in JIRA, and attach it there? We'll 
need to decide what to do in the case of missing entries in the config 
file (abort vs silently put in the default), by having it in JIRA we won't 
forget it :)
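
To make the two options concrete, here is a rough sketch of "abort" 
versus "silently put in the default" for a missing entry. It is not 
based on TikaConfig; the Policy enum, the resolve method, the section 
keys and the placeholder values are all assumptions made up for 
illustration.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch of two ways to treat a missing config entry. */
public class MissingEntryPolicySketch {

    enum Policy { ABORT, USE_DEFAULT }

    /**
     * Resolves one section of a configuration.
     *
     * @param section    key of the wanted section, e.g. "detector" (assumed name)
     * @param configured sections that were present in the config file
     * @param defaults   built-in fallbacks
     * @param policy     what to do when the section is missing
     */
    static String resolve(String section, Map<String, String> configured,
                          Map<String, String> defaults, Policy policy) {
        String value = configured.get(section);
        if (value != null) {
            return value;
        }
        if (policy == Policy.ABORT) {
            throw new IllegalArgumentException(
                    "configuration file has no entry for '" + section + "'");
        }
        // USE_DEFAULT: fall back without reporting anything.
        return defaults.get(section);
    }

    public static void main(String[] args) {
        Map<String, String> configured = new HashMap<>();
        configured.put("parsers", "org.apache.tika.parser.DefaultParser");

        Map<String, String> defaults = new HashMap<>();
        defaults.put("detector", "built-in detector (placeholder)");

        System.out.println(
                resolve("detector", configured, defaults, Policy.USE_DEFAULT));
        try {
            resolve("detector", configured, defaults, Policy.ABORT);
        } catch (IllegalArgumentException e) {
            System.out.println("aborted: " + e.getMessage());
        }
    }
}
```

The trade-off is the usual one: aborting points the user at the gap 
immediately, while defaulting keeps short config files working but can 
hide a typo in a section name.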

Cheers
Nick

Re: Problem with overriding built-in parser

Posted by Stephan Mühlstrasser <st...@pdflib.com>.
Am 16.02.12 17:22, schrieb Nick Burch:
> On Thu, 16 Feb 2012, Stephan Mühlstrasser wrote:

> That's not a unit test though - yours needs to be run manually. If we
> can run it automatically, we can add it to the test suite to make sure
> it doesn't get broken in future.

I understand, here is the reproduction as a unit test:

package org.apache.tika;

import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;

import junit.framework.TestCase;

import org.junit.Before;

/**
  * Provoke endless recursion with a small configuration file that loads the
  * default parser, but omits mimetypes and detector. If this is an
  * insufficient configuration file, Tika should report an error. Instead it
  * terminates with an OutOfMemoryError.
  *
  * @author stm@pdflib.com
  */
public class ConfigFile extends TestCase {

     File configFile;

     @Before
     public void setUp() throws Exception {
         configFile = File.createTempFile("tika-config", ".xml");
         configFile.deleteOnExit();
         OutputStreamWriter osw = new OutputStreamWriter(
                 new FileOutputStream(configFile), "UTF-8");
         osw.write("<properties><parsers><parser "
                 + "class=\"org.apache.tika.parser.DefaultParser\"/></parsers></properties>\n");
         osw.close();
     }

     public void test() throws IOException {
         System.setProperty("tika.config", configFile.getAbsolutePath());

         new Tika();
     }
}

Best Regards
Stephan



Re: Problem with overriding built-in parser

Posted by Nick Burch <ni...@alfresco.com>.
On Thu, 16 Feb 2012, Stephan Mühlstrasser wrote:
>> Are you able to produce a unit test that shows the problem?
>
> That's what I was trying to provide with the example in my previous message:

That's not a unit test though - yours needs to be run manually. If we can 
run it automatically, we can add it to the test suite to make sure it 
doesn't get broken in future.

Nick

Re: Problem with overriding built-in parser

Posted by Stephan Mühlstrasser <st...@pdflib.com>.
Hi Nick,

thanks for your reply.

Am 16.02.12 16:51, schrieb Nick Burch:
> On Tue, 14 Feb 2012, Stephan Mühlstrasser wrote:
>> https://issues.apache.org/jira/browse/TIKA-527
>...
>
>> The problem is that using the proposed method does not work for me.
>> Any use of the configuration file apparently sends Tika into an
>> endless recursion, even without overriding a built-in parser in the
>> configuration file.
>
> Are you able to produce a unit test that shows the problem?

That's what I was trying to provide with the example in my previous message:

>
>> If I understand it correctly, the following configuration file should
>> have the same effect as the built-in configuration:
>>
>>> $ cat tika-config.xml
>>> <properties>
>>> <parsers>
>>> <parser class="org.apache.tika.parser.DefaultParser"/>
>>> </parsers>
>>> </properties>

If you invoke the Tika CLI application with this configuration file, the 
error occurs. Just start it like this: "java 
-Dtika.config=tika-config.xml -jar tika-app-1.0.jar --list-parsers".

> Ah, I'm not sure that's correct. I think you also need to give a
> mimetypes and a detector. Looking at lines 145 to 172 of TikaConfig, it
> seems that you either get the defaults with no config, or specify them
> all with your own config
>

Ok, I see now in the source what you mean. Then the example in TIKA-527 
is not complete, as it does not have mimetypes and a detector.

In the meantime, I got my override working yesterday by packaging a 
META-INF/services/org.apache.tika.parser.Parser file into the JAR file 
together with my parser, so I no longer need the configuration file 
approach. But I think it could still be considered a bug if an 
incorrect/insufficient configuration file sends Tika into an endless 
recursion instead of producing a meaningful error message.
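
For anyone unfamiliar with the provider-file convention, here is a 
small self-contained sketch of it. It uses the JDK's own 
java.util.ServiceLoader, which is an assumption for illustration only; 
it does not show how Tika itself reads the file. DocParser and 
MyPdfParser are hypothetical stand-ins for the parser interface and a 
replacement parser, and a temporary directory plays the role of the 
JAR.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.URL;
import java.net.URLClassLoader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.ServiceLoader;

/** Sketch of the META-INF/services convention with hypothetical types. */
public class ServiceFileSketch {

    /** Hypothetical stand-in for a parser interface. */
    public interface DocParser {
        String name();
    }

    /** Hypothetical replacement parser; it needs a public no-arg constructor. */
    public static class MyPdfParser implements DocParser {
        public MyPdfParser() { }

        @Override
        public String name() {
            return "my-pdf-parser";
        }
    }

    /** Writes a provider file, loads providers through it, returns their names. */
    static List<String> discover() {
        List<String> names = new ArrayList<>();
        try {
            // Same layout a JAR would have: META-INF/services/<interface name>
            Path root = Files.createTempDirectory("services-sketch");
            Path metaInf = root.resolve("META-INF");
            Path services = Files.createDirectories(metaInf.resolve("services"));
            Path providerFile = services.resolve(DocParser.class.getName());
            // The file lists one implementation class name per line.
            Files.write(providerFile,
                    (MyPdfParser.class.getName() + "\n")
                            .getBytes(StandardCharsets.UTF_8));

            try (URLClassLoader loader = new URLClassLoader(
                    new URL[] { root.toUri().toURL() },
                    ServiceFileSketch.class.getClassLoader())) {
                for (DocParser parser : ServiceLoader.load(DocParser.class, loader)) {
                    names.add(parser.name());
                }
            }

            // Remove the temporary layout again, innermost entry first.
            Files.deleteIfExists(providerFile);
            Files.deleteIfExists(services);
            Files.deleteIfExists(metaInf);
            Files.deleteIfExists(root);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return names;
    }

    public static void main(String[] args) {
        System.out.println(discover());
    }
}
```

In a real JAR the file would be named after the actual parser interface 
(org.apache.tika.parser.Parser in my case) and would contain the fully 
qualified name of the replacement parser class.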

Thanks
Stephan



Re: Problem with overriding built-in parser

Posted by Nick Burch <ni...@alfresco.com>.
On Tue, 14 Feb 2012, Stephan Mühlstrasser wrote:
> https://issues.apache.org/jira/browse/TIKA-527
>
> Is there any documentation of the syntax of the configuration file 
> available?

You could look at the code that processes the file, but the example in that 
JIRA ought to cover most use cases.


> The problem is that using the proposed method does not work for me. Any 
> use of the configuration file apparently sends Tika into an endless 
> recursion, even without overriding a built-in parser in the 
> configuration file.

Are you able to produce a unit test that shows the problem?


> If I understand it correctly, the following configuration file should have 
> the same effect as the built-in configuration:
>
>> $ cat tika-config.xml
>> <properties>
>> <parsers>
>> <parser class="org.apache.tika.parser.DefaultParser"/>
>> </parsers>
>> </properties>

Ah, I'm not sure that's correct. I think you also need to give a mimetypes 
and a detector. Looking at lines 145 to 172 of TikaConfig, it seems that 
you either get the defaults with no config, or specify them all with your 
own config

Nick