Posted to dev@tika.apache.org by Ken Krugler <kk...@transpac.com> on 2010/09/10 05:22:50 UTC

Error thrown with TikaConfig() constructor

Hi all,

In the past, we'd build our Hadoop job jars using a dependency on
Tika-parsers but excluding the supporting jars for types that we know we
don't need to process (e.g. Microsoft docs, PDFs, etc). This
dramatically reduces the size of the resulting Hadoop job jar.

With 0.8-SNAPSHOT, the TikaConfig(Classpath) constructor now finds and  
instantiates all Parser-based classes found on the classpath. Which,  
as expected, triggers a storm of Exceptions and Errors.

I'm wondering how best to handle this type of configuration, in a way  
that's relatively resilient to Tika configuration changes and my  
target set of formats.

The quick & cheesy hack is to change the TikaConfig constructor to  
catch exceptions thrown by parser instantiation, and ignore (or log)  
them. But that seems likely to create lots of pain and suffering for  
people who have broken setups, as it fails slowly & silently.
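
To make the trade-off concrete, here is a rough sketch of what the "catch and log" variant might look like. It does not use Tika's real classes: LenientLoader and loadAll are hypothetical names, and the class names in main are placeholders. A missing dependency usually surfaces as a NoClassDefFoundError, which is an Error rather than an Exception, so the sketch catches LinkageError as well.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.logging.Logger;

// Hypothetical helper, not part of Tika: instantiate each named class and
// skip (but log) the ones that cannot be loaded or constructed.
class LenientLoader {
    private static final Logger LOG = Logger.getLogger(LenientLoader.class.getName());

    static List<Object> loadAll(ClassLoader loader, List<String> classNames) {
        List<Object> loaded = new ArrayList<>();
        for (String name : classNames) {
            try {
                Class<?> clazz = Class.forName(name, true, loader);
                loaded.add(clazz.getDeclaredConstructor().newInstance());
            } catch (ReflectiveOperationException | LinkageError e) {
                // ClassNotFoundException, NoClassDefFoundError, NoSuchMethodError
                // and similar failures all end up here and are only logged.
                LOG.warning("Skipping " + name + ": " + e);
            }
        }
        return loaded;
    }

    public static void main(String[] args) {
        List<Object> found = loadAll(ClassLoader.getSystemClassLoader(),
                Arrays.asList("java.util.ArrayList", "com.example.MissingParser"));
        System.out.println(found.size() + " of 2 classes instantiated");
    }
}
```

The weakness is visible in the catch block: a genuinely broken parser is skipped with nothing more than a log line, exactly like a deliberately excluded one.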

We could try to avoid triggering the construction of TikaConfig, and  
do our own dispatching based on mime-types, but that seems both kludgy  
and brittle.

We could build a custom version of Tika that only includes the parser  
classes we use, but that also seems brittle.

Any other thoughts/options?

Thanks,

-- Ken

--------------------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c   w e b   m i n i n g

Re: Error thrown with TikaConfig() constructor

Posted by Ken Krugler <kk...@transpac.com>.

On Sep 13, 2010, at 2:42am, Nick Burch wrote:

> On Sat, 11 Sep 2010, Jukka Zitting wrote:
>> The reason why I originally didn't want to simply catch and ignore  
>> the potential exceptions in the TikaConfig constructor was the lack  
>> of a good error reporting mechanism. The trick of insulating the  
>> external library dependencies to separate extractor classes nicely  
>> solved that problem by delaying the exceptions to the actual  
>> parse() method calls on specific document types, which obviously  
>> would then give the end user a much better idea of what's wrong.
>
> My thinking on exceptions while creating the parser is:
> * ClassNotFound for parser class - throw the exception, as the user has
>   specified a parser that isn't there
> * Any other ClassNotFound - warning, as this means that a dependency is
>   missing, but that may be what the user wanted

If you use this approach, then you'd also want to do this special  
handling for the NoSuchMethodError, as that was getting thrown by Tika  
0.7-SNAPSHOT when POI support was excluded.

> * Any other problem - throw the exception, as there is a fault with the
>   parser, and there's a fair chance that this is a custom parser
>   that has broken. (The standard Tika parsers shouldn't do this!)

Interesting idea - I'm worried that there will be new exceptions when  
future versions of Tika change their parser implementations.

-- Ken


Re: Error thrown with TikaConfig() constructor

Posted by Nick Burch <ni...@alfresco.com>.

On Sat, 11 Sep 2010, Jukka Zitting wrote:
> The reason why I originally didn't want to simply catch and ignore the 
> potential exceptions in the TikaConfig constructor was the lack of a 
> good error reporting mechanism. The trick of insulating the external 
> library dependencies to separate extractor classes nicely solved that 
> problem by delaying the exceptions to the actual parse() method calls on 
> specific document types, which obviously would then give the end user a 
> much better idea of what's wrong.

My thinking on exceptions while creating the parser is:
* ClassNotFound for parser class - throw the exception, as the user has
   specified a parser that isn't there
* Any other ClassNotFound - warning, as this means that a dependency is
   missing, but that may be what the user wanted
* Any other problem - throw the exception, as there is a fault with the
   parser, and there's a fair chance that this is a custom parser
   that has broken. (The standard Tika parsers shouldn't do this!)
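
As a rough illustration (not actual Tika code), those three rules could be written down like this. ParserLoadPolicy, Action and classify are made-up names, and reading the missing class name out of the exception message is an assumption that holds for typical JVM-generated errors but is not guaranteed.

```java
// Hypothetical sketch of the three rules; none of these names exist in Tika.
class ParserLoadPolicy {
    enum Action { FAIL, WARN }

    // parserClassName is the fully qualified name listed in the configuration.
    static Action classify(String parserClassName, Throwable problem) {
        boolean missingClass = problem instanceof ClassNotFoundException
                || problem instanceof NoClassDefFoundError;
        if (!missingClass) {
            return Action.FAIL;          // rule 3: the parser itself is at fault
        }
        // Assumption: the message holds the missing class name. NoClassDefFoundError
        // uses '/' separators, so normalise before comparing.
        String message = problem.getMessage();
        String missing = message == null ? "" : message.replace('/', '.');
        return missing.equals(parserClassName)
                ? Action.FAIL            // rule 1: the configured parser is absent
                : Action.WARN;           // rule 2: only a dependency is absent
    }

    public static void main(String[] args) {
        String parser = "com.example.MyParser";
        System.out.println(classify(parser, new ClassNotFoundException(parser)));
        System.out.println(classify(parser,
                new NoClassDefFoundError("org/apache/poi/poifs/filesystem/DirectoryEntry")));
        System.out.println(classify(parser, new IllegalStateException("bug in parser")));
    }
}
```

Other linkage problems (a NoSuchMethodError from a mismatched library version, say) fall under rule 3 in this sketch; whether they should be warnings instead is a policy decision.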

Nick

Re: Error thrown with TikaConfig() constructor

Posted by Oleg Tikhonov <ol...@gmail.com>.

Here are the situations I can think of where you would want to
implement a custom classloader:
1. You need a different hierarchy for loading classes, as in OSGi for instance.
The Hollywood principle, if you like.
2. You need to run different versions of classes or jars. For example,
you want to load class A with version 1.1.2, while class B needs version
2.3.4.
3. You need to edit the byte-code of a class at runtime and reload it.
4. Most obviously, you need to load classes from the network; the default
classloader loads classes that are stored locally.
5. You want to create classes dynamically and load them on the fly.
6. You want to run multiple Java applications inside a single JVM.

BR,
Oleg.


On Sun, Sep 12, 2010 at 4:46 PM, Ken Krugler <kk...@transpac.com> wrote:

>
> On Sep 11, 2010, at 1:17pm, Ken Krugler wrote:
>
>  On Fri, Sep 10, 2010 at 10:31 PM, Nick Burch <ni...@alfresco.com>
>>> wrote:
>>>
>>>> Quite a lot of OfficeParser does depend on poifs code though, as well as
>>>> a
>>>> few bits that depend on some of the less common POI text extractors.
>>>>
>>>
>>> It looks like a number of our other new parsers also have direct
>>> dependencies to external libraries, so this problem is not just
>>> related to the OfficeParser class.
>>>
>>> The basic problem here is that the service loader used by the default
>>> TikaConfig constructor throws an exception when it can't load a class
>>> listed in a org.apache.tika.parser.Parser service file. The solution I
>>> implemented in TIKA-378 for the 0.7 release was to move the external
>>> parser library references to separate extractor classes so that the
>>> parser class could be instantiated without problems. Unfortunately
>>> this was a one-off solution that obviously hasn't survived further
>>> development in the svn trunk.
>>>
>>> The reason why I originally didn't want to simply catch and ignore the
>>> potential exceptions in the TikaConfig constructor was the lack of a
>>> good error reporting mechanism. The trick of insulating the external
>>> library dependencies to separate extractor classes nicely solved that
>>> problem by delaying the exceptions to the actual parse() method calls
>>> on specific document types, which obviously would then give the end
>>> user a much better idea of what's wrong.
>>>
>>> Perhaps the best solution would actually be to combine the above
>>> approaches, i.e. to strive to maintain the parser/extractor separation
>>> where possible and to use a catch block in the TikaConfig constructor
>>> to catch and ignore any problems that the insulation approach fails to
>>> address.
>>>
>>
>> IIRC, the main concern about this approach is when people are using custom
>> parsers, where instantiation exceptions can happen due to bugs in the actual
>> parser (versus explicitly excluded jars). Quietly ignoring these errors
>> leads to late failing, which can be a bad thing.
>>
>> What I would propose is two changes:
>>
>> 1. Add a new TikaConfig(ClassLoader, Class<Parser>...) constructor that
>> can be used to instantiate all parsers from the variable list that are
>> found using the ClassLoader. For example:
>>
>>   public TikaConfig(ClassLoader loader, Class<Parser>...targetParsers)
>>           throws MimeTypeException, IOException {
>>       for (Class<Parser> parserClass : targetParsers) {
>>           ParseContext context = new ParseContext();
>>
>>           try {
>>               Parser parser = parserClass.newInstance();
>>               for (MediaType type : parser.getSupportedTypes(context)) {
>>                   parsers.put(type, parser);
>>               }
>>           } catch (InstantiationException e) {
>>               throw new IOException(e);
>>           } catch (IllegalAccessException e) {
>>               throw new IOException(e);
>>           }
>>       }
>>
>>       mimeTypes = MimeTypesFactory.create("tika-mimetypes.xml");
>>   }
>>
>
> So after looking again at the code snippet I threw together above, it's not
> using the provided Classloader. I could iterate over parsers and
> catch/ignore errors to parsers not in the provided list, but that seems less
> than clean.
>
> I don't have much experience with classloaders - I see that each instance
> of a Class has a classloader associated with it, so mapping from its
> classloader to the provided classloader would need something like:
>
>    Class<Parser> resolvedClass =
>            (Class<Parser>) loader.loadClass(parserClass.getCanonicalName());
>    Parser parser = resolvedClass.newInstance();
>
> But that also seems clunky. Any other suggestions?
>
> As an aside, what's the standard use case for specifying an explicit
> classloader? I haven't seen this used in other projects, so I'm curious.
>
> Thanks,
>
>
> -- Ken
>


-- 
Best regards, Oleg.

Re: Error thrown with TikaConfig() constructor

Posted by Ken Krugler <kk...@transpac.com>.

Hi Jukka,

> On Sun, Sep 12, 2010 at 5:46 PM, Ken Krugler
> <kk...@transpac.com> wrote:
>> But that also seems clunky. Any other suggestions?
>
> A simpler approach would be to simply pass a list of already
> instantiated Parser objects to AutoDetectParser, like this:
>
>    public AutoDetectParser(Detector detector, Parser... parsers) {
>        setDetector(detector);
>        Map<MediaType, Parser> map = new HashMap<MediaType, Parser>();
>        ParseContext context = new ParseContext();
>        for (Parser parser : parsers) {
>            for (MediaType type : parser.getSupportedTypes(context)) {
>                map.put(type, parser);
>            }
>        }
>        setParsers(map);
>    }

Thanks for the suggestion. This would work for the current 0.8 code  
base, so I might just go ahead and add that.

But I found a few other places that called  
TikaConfig.getDefaultConfig() besides AutoDetectParser():
	
  - Tika()
  - MediaTypeRegistry.getDefaultRegistry()

These don't seem to be used outside of test code, but I could easily  
see people adding calls to them (and getDefaultConfig).

Depending on not having any calls to this from anywhere else in the  
Tika sub-system seems fragile, so a more resilient solution would be  
good. Especially since this is the second time this problem has bitten  
me during a big parse job (20M+ documents).

-- Ken


> BTW, the need to pass a MediaType->Parser map to
> CompositeParser.setParsers() is a remnant of the time when we didn't
> have the Parser.getSupportedTypes() method. Nowadays it would probably
> be better to simply pass a collection of parsers and use
> getSupportedTypes() calls for dispatch during CompositeParser.parse().
>
>> As an aside, what's the standard use case for specifying an explicit
>> classloader? I haven't seen this used in other projects, so I'm  
>> curious.
>
> See TIKA-419 [1] for the relevant background.
>
> [1] https://issues.apache.org/jira/browse/TIKA-419


Re: Error thrown with TikaConfig() constructor

Posted by Jukka Zitting <ju...@gmail.com>.

Hi,

On Sun, Sep 12, 2010 at 5:46 PM, Ken Krugler
<kk...@transpac.com> wrote:
> But that also seems clunky. Any other suggestions?

A simpler approach would be to simply pass a list of already
instantiated Parser objects to AutoDetectParser, like this:

    public AutoDetectParser(Detector detector, Parser... parsers) {
        setDetector(detector);
        Map<MediaType, Parser> map = new HashMap<MediaType, Parser>();
        ParseContext context = new ParseContext();
        for (Parser parser : parsers) {
            for (MediaType type : parser.getSupportedTypes(context)) {
                map.put(type, parser);
            }
        }
        setParsers(map);
    }

BTW, the need to pass a MediaType->Parser map to
CompositeParser.setParsers() is a remnant of the time when we didn't
have the Parser.getSupportedTypes() method. Nowadays it would probably
be better to simply pass a collection of parsers and use
getSupportedTypes() calls for dispatch during CompositeParser.parse().
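
A minimal sketch of that idea, using simplified stand-ins rather than the real Tika types: MiniParser and MiniCompositeParser are hypothetical names, and media types are plain strings here, whereas the real Parser interface works with MediaType, ParseContext, streams and SAX handlers.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Locale;
import java.util.Set;

class DispatchDemo {
    public static void main(String[] args) {
        MiniCompositeParser composite = new MiniCompositeParser(new UpperCaseTextParser());
        System.out.println(composite.parse("text/plain", "hello"));
    }
}

// Simplified stand-in for Tika's Parser interface (an assumption for this sketch).
interface MiniParser {
    Set<String> getSupportedTypes();
    String parse(String input);
}

// Holds a plain collection of parsers and picks one at parse time,
// instead of being handed a pre-built type-to-parser map.
class MiniCompositeParser {
    private final List<MiniParser> parsers;

    MiniCompositeParser(MiniParser... parsers) {
        this.parsers = Arrays.asList(parsers);
    }

    String parse(String mediaType, String input) {
        for (MiniParser parser : parsers) {
            if (parser.getSupportedTypes().contains(mediaType)) {
                return parser.parse(input);
            }
        }
        throw new IllegalArgumentException("No parser for " + mediaType);
    }
}

// Toy parser used only to exercise the dispatch logic.
class UpperCaseTextParser implements MiniParser {
    public Set<String> getSupportedTypes() {
        return Collections.singleton("text/plain");
    }

    public String parse(String input) {
        return input.toUpperCase(Locale.ROOT);
    }
}
```

The linear scan keeps the sketch short; a real implementation would probably cache the lookup per media type.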

> As an aside, what's the standard use case for specifying an explicit
> classloader? I haven't seen this used in other projects, so I'm curious.

See TIKA-419 [1] for the relevant background.

[1] https://issues.apache.org/jira/browse/TIKA-419

BR,

Jukka Zitting

Re: Error thrown with TikaConfig() constructor

Posted by Ken Krugler <kk...@transpac.com>.

On Sep 11, 2010, at 1:17pm, Ken Krugler wrote:

>> On Fri, Sep 10, 2010 at 10:31 PM, Nick Burch <ni...@alfresco.com> wrote:
>>> Quite a lot of OfficeParser does depend on poifs code though, as well as a
>>> few bits that depend on some of the less common POI text extractors.
>>
>> It looks like a number of our other new parsers also have direct
>> dependencies to external libraries, so this problem is not just
>> related to the OfficeParser class.
>>
>> The basic problem here is that the service loader used by the default
>> TikaConfig constructor throws an exception when it can't load a class
>> listed in a org.apache.tika.parser.Parser service file. The solution I
>> implemented in TIKA-378 for the 0.7 release was to move the external
>> parser library references to separate extractor classes so that the
>> parser class could be instantiated without problems. Unfortunately
>> this was a one-off solution that obviously hasn't survived further
>> development in the svn trunk.
>>
>> The reason why I originally didn't want to simply catch and ignore the
>> potential exceptions in the TikaConfig constructor was the lack of a
>> good error reporting mechanism. The trick of insulating the external
>> library dependencies to separate extractor classes nicely solved that
>> problem by delaying the exceptions to the actual parse() method calls
>> on specific document types, which obviously would then give the end
>> user a much better idea of what's wrong.
>>
>> Perhaps the best solution would actually be to combine the above
>> approaches, i.e. to strive to maintain the parser/extractor separation
>> where possible and to use a catch block in the TikaConfig constructor
>> to catch and ignore any problems that the insulation approach fails to
>> address.
>
> IIRC, the main concern about this approach is when people are using  
> custom parsers, where instantiation exceptions can happen due to  
> bugs in the actual parser (versus explicitly excluded jars). Quietly  
> ignoring these errors leads to late failing, which can be a bad thing.
>
> What I would propose is two changes:
>
> 1. Add a new TikaConfig(ClassLoader, Class<Parser>...) constructor
> that can be used to instantiate all parsers from the variable list
> that are found using the ClassLoader. For example:
>
>    public TikaConfig(ClassLoader loader, Class<Parser>... targetParsers)
>            throws MimeTypeException, IOException {
>        for (Class<Parser> parserClass : targetParsers) {
>            ParseContext context = new ParseContext();
>
>            try {
>                Parser parser = parserClass.newInstance();
>                for (MediaType type : parser.getSupportedTypes(context)) {
>                    parsers.put(type, parser);
>                }
>            } catch (InstantiationException e) {
>                throw new IOException(e);
>            } catch (IllegalAccessException e) {
>                throw new IOException(e);
>            }
>        }
>
>        mimeTypes = MimeTypesFactory.create("tika-mimetypes.xml");
>    }

So after looking again at the code snippet I threw together above,  
it's not using the provided Classloader. I could iterate over parsers  
and catch/ignore errors to parsers not in the provided list, but that  
seems less than clean.

I don't have much experience with classloaders - I see that each
instance of a Class has a classloader associated with it, so mapping
from its classloader to the provided classloader would need something
like:

     Class<Parser> resolvedClass =
             (Class<Parser>) loader.loadClass(parserClass.getCanonicalName());
     Parser parser = resolvedClass.newInstance();

But that also seems clunky. Any other suggestions?
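
For what it's worth, a slightly more defensive version of that lookup might look like the sketch below. LoaderResolve and instantiate are hypothetical names, and StringBuilder/CharSequence stand in for a parser class and the Parser interface so that the example runs without Tika on the classpath.

```java
// Hypothetical helper: re-resolve a class through a specific ClassLoader.
class LoaderResolve {
    static <T> T instantiate(ClassLoader loader, Class<? extends T> declared, Class<T> type)
            throws ReflectiveOperationException {
        // getName() returns the binary name that loadClass() expects;
        // getCanonicalName() differs for nested classes (Outer.Inner vs Outer$Inner).
        Class<?> resolved = loader.loadClass(declared.getName());
        // asSubclass() replaces the unchecked cast and throws ClassCastException
        // if the loader sees a different copy of the target type.
        return resolved.asSubclass(type).getDeclaredConstructor().newInstance();
    }

    public static void main(String[] args) throws Exception {
        ClassLoader loader = ClassLoader.getSystemClassLoader();
        CharSequence value = instantiate(loader, StringBuilder.class, CharSequence.class);
        System.out.println(value.getClass().getName());
    }
}
```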

As an aside, what's the standard use case for specifying an explicit  
classloader? I haven't seen this used in other projects, so I'm curious.

Thanks,

-- Ken


Re: Error thrown with TikaConfig() constructor

Posted by Ken Krugler <kk...@transpac.com>.

> On Fri, Sep 10, 2010 at 10:31 PM, Nick Burch <ni...@alfresco.com> wrote:
>> Quite a lot of OfficeParser does depend on poifs code though, as well as a
>> few bits that depend on some of the less common POI text extractors.
>
> It looks like a number of our other new parsers also have direct
> dependencies to external libraries, so this problem is not just
> related to the OfficeParser class.
>
> The basic problem here is that the service loader used by the default
> TikaConfig constructor throws an exception when it can't load a class
> listed in a org.apache.tika.parser.Parser service file. The solution I
> implemented in TIKA-378 for the 0.7 release was to move the external
> parser library references to separate extractor classes so that the
> parser class could be instantiated without problems. Unfortunately
> this was a one-off solution that obviously hasn't survived further
> development in the svn trunk.
>
> The reason why I originally didn't want to simply catch and ignore the
> potential exceptions in the TikaConfig constructor was the lack of a
> good error reporting mechanism. The trick of insulating the external
> library dependencies to separate extractor classes nicely solved that
> problem by delaying the exceptions to the actual parse() method calls
> on specific document types, which obviously would then give the end
> user a much better idea of what's wrong.
>
> Perhaps the best solution would actually be to combine the above
> approaches, i.e. to strive to maintain the parser/extractor separation
> where possible and to use a catch block in the TikaConfig constructor
> to catch and ignore any problems that the insulation approach fails to
> address.

IIRC, the main concern about this approach is when people are using  
custom parsers, where instantiation exceptions can happen due to bugs  
in the actual parser (versus explicitly excluded jars). Quietly  
ignoring these errors leads to late failing, which can be a bad thing.

What I would propose is two changes:

1. Add a new TikaConfig(ClassLoader, Class<Parser>...) constructor
that can be used to instantiate all parsers from the variable list
that are found using the ClassLoader. For example:

     public TikaConfig(ClassLoader loader, Class<Parser>... targetParsers)
             throws MimeTypeException, IOException {
         for (Class<Parser> parserClass : targetParsers) {
             ParseContext context = new ParseContext();

             try {
                 Parser parser = parserClass.newInstance();
                 for (MediaType type : parser.getSupportedTypes(context)) {
                     parsers.put(type, parser);
                 }
             } catch (InstantiationException e) {
                 throw new IOException(e);
             } catch (IllegalAccessException e) {
                 throw new IOException(e);
             }
         }

         mimeTypes = MimeTypesFactory.create("tika-mimetypes.xml");
     }

2. Add a TikaConfig.setDefaultConfig() static method, so that callers  
can set the default config that might get used in various places.
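
To sketch what I mean by the second change: MiniConfig below is a hypothetical stand-in for TikaConfig, and the fallback name "built-in" is made up. It only shows the idea of a default that callers can replace before anything else asks for it.

```java
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical stand-in for TikaConfig; only the "settable default" idea is shown.
class MiniConfig {
    private static final AtomicReference<MiniConfig> DEFAULT =
            new AtomicReference<MiniConfig>();

    final String name;

    MiniConfig(String name) {
        this.name = name;
    }

    // Callers install their own configuration once, before any parsing starts.
    static void setDefaultConfig(MiniConfig config) {
        DEFAULT.set(config);
    }

    // Falls back to a freshly built config when nothing has been installed.
    static MiniConfig getDefaultConfig() {
        MiniConfig current = DEFAULT.get();
        return current != null ? current : new MiniConfig("built-in");
    }

    public static void main(String[] args) {
        System.out.println(getDefaultConfig().name);
        setDefaultConfig(new MiniConfig("hadoop-job"));
        System.out.println(getDefaultConfig().name);
    }
}
```

A shared default like this is only safe if the installed object is not modified afterwards.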

One question here is that the current TikaConfig.getDefaultConfig()  
method has this comment:

      * Provides a default configuration (TikaConfig).  Currently creates a
      * new instance each time it's called; we may be able to have it
      * return a shared instance once it is completely immutable.

Any insight into this comment? I see that it was based on https://issues.apache.org/jira/browse/TIKA-34

From what I can tell, making TikaConfig immutable would require
wrapping the parsers map in an unmodifiable map, and somewhat more
serious modifications to MediaTypes (registry, types, magics, xmls,
patterns) to be able to create an immutable version of that.
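
The map part is the easy bit. A sketch, with FrozenParsers as a hypothetical class and plain strings standing in for both MediaType and Parser:

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch; String stands in for both MediaType and Parser.
class FrozenParsers {
    private final Map<String, String> parsers;

    FrozenParsers(Map<String, String> source) {
        // Copy first so later changes to 'source' cannot leak in, then wrap
        // the copy so callers cannot modify it either.
        this.parsers = Collections.unmodifiableMap(new HashMap<String, String>(source));
    }

    Map<String, String> getParsers() {
        return parsers;
    }

    public static void main(String[] args) {
        Map<String, String> source = new HashMap<String, String>();
        source.put("text/html", "HtmlParser");
        FrozenParsers frozen = new FrozenParsers(source);
        source.put("application/pdf", "PDFParser");   // not visible through 'frozen'
        System.out.println(frozen.getParsers().size());
        try {
            frozen.getParsers().put("text/plain", "TXTParser");
        } catch (UnsupportedOperationException e) {
            System.out.println("read-only");
        }
    }
}
```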

The above changes would let me instantiate the TikaConfig that I need,
without having to dup/edit/keep in sync any XML files, and make sure
that all of the Tika code base uses this particular configuration.

Thoughts?

-- Ken


Re: Error thrown with TikaConfig() constructor

Posted by Jukka Zitting <ju...@gmail.com>.

Hi,

On Fri, Sep 10, 2010 at 10:31 PM, Nick Burch <ni...@alfresco.com> wrote:
> Quite a lot of OfficeParser does depend on poifs code though, as well as a
> few bits that depend on some of the less common POI text extractors.

It looks like a number of our other new parsers also have direct
dependencies to external libraries, so this problem is not just
related to the OfficeParser class.

The basic problem here is that the service loader used by the default
TikaConfig constructor throws an exception when it can't load a class
listed in a org.apache.tika.parser.Parser service file. The solution I
implemented in TIKA-378 for the 0.7 release was to move the external
parser library references to separate extractor classes so that the
parser class could be instantiated without problems. Unfortunately
this was a one-off solution that obviously hasn't survived further
development in the svn trunk.

The reason why I originally didn't want to simply catch and ignore the
potential exceptions in the TikaConfig constructor was the lack of a
good error reporting mechanism. The trick of insulating the external
library dependencies to separate extractor classes nicely solved that
problem by delaying the exceptions to the actual parse() method calls
on specific document types, which obviously would then give the end
user a much better idea of what's wrong.
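
The insulation trick can be illustrated with a small, self-contained sketch. None of the names below are Tika or POI classes: LazyParser, HeavyExtractor and InsulationDemo are hypothetical, and the static flag is only there to make the timing visible. The JVM initializes a class on first active use, so constructing the parser does not initialize the extractor that stands in for the library-dependent code.

```java
// Hypothetical demo of the parser/extractor split; no Tika or POI classes involved.
class InsulationDemo {
    static boolean extractorInitialized = false;

    public static void main(String[] args) {
        LazyParser parser = new LazyParser();   // does not initialize HeavyExtractor
        System.out.println("after new: " + extractorInitialized);
        parser.parse("doc");
        System.out.println("after parse: " + extractorInitialized);
    }
}

// The parser has no fields or static initializers that reference the library.
class LazyParser {
    String parse(String input) {
        // First use of HeavyExtractor: this is where a missing library would surface.
        return new HeavyExtractor().extract(input);
    }
}

// Stands in for a class whose code depends on an optional external library.
class HeavyExtractor {
    static {
        InsulationDemo.extractorInitialized = true;
    }

    String extract(String input) {
        return "extracted:" + input;
    }
}
```

With a really missing jar the details depend on when the JVM chooses to load and verify the referenced classes, so this is an illustration of the principle rather than a guarantee.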

Perhaps the best solution would actually be to combine the above
approaches, i.e. to strive to maintain the parser/extractor separation
where possible and to use a catch block in the TikaConfig constructor
to catch and ignore any problems that the insulation approach fails to
address.

BR,

Jukka Zitting

Re: Error thrown with TikaConfig() constructor

Posted by Nick Burch <ni...@alfresco.com>.

On Fri, 10 Sep 2010, Ken Krugler wrote:
> The issue is that the definitions of the types that are supported come from 
> POI:
>
>       Collections.unmodifiableSet(new HashSet<MediaType>(Arrays.asList(
>       	POIFSDocumentType.WORKBOOK.type,
>       	POIFSDocumentType.OLE10_NATIVE.type,

POIFSDocumentType is actually a Tika class, not a POI one. However, 
POIFSDocumentType does depend on several POI classes, as it contains both 
a list of POI types and a detector for them.

Quite a lot of OfficeParser does depend on poifs code though, as well as a 
few bits that depend on some of the less common POI text extractors.

Nick

Re: Error thrown with TikaConfig() constructor

Posted by Ken Krugler <kk...@transpac.com>.

Hi Jukka,

On Sep 10, 2010, at 5:35am, Jukka Zitting wrote:

> Hi,
>
> On Fri, Sep 10, 2010 at 5:22 AM, Ken Krugler
> <kk...@transpac.com> wrote:
>> With 0.8-SNAPSHOT, the TikaConfig(Classpath) constructor now finds and
>> instantiates all Parser-based classes found on the classpath. Which, as
>> expected, triggers a storm of Exceptions and Errors.
>
> Which errors are you seeing? In TIKA-378 [1] I tried to make the
> TikaConfig(Classpath) behave better in such situations by making many
> of our Parser classes loadable even when the respective parser library
> is not available (I usually moved the direct class dependencies to a
> separate Extractor class). I'm not sure how well that work has
> survived recent changes in trunk.

Here's the stack trace.

      java.lang.NoClassDefFoundError: org/apache/poi/poifs/filesystem/DirectoryEntry
        at org.apache.tika.parser.microsoft.OfficeParser.<clinit>(OfficeParser.java:55)
        at java.lang.Class.forName0(Native Method)
        at java.lang.Class.forName(Class.java:247)
        at sun.misc.Service$LazyIterator.next(Service.java:271)
        at org.apache.tika.config.TikaConfig.<init>(TikaConfig.java:170)
        at org.apache.tika.config.TikaConfig.<init>(TikaConfig.java:189)
        at org.apache.tika.config.TikaConfig.getDefaultConfig(TikaConfig.java:268)
        at org.apache.tika.parser.AutoDetectParser.<init>(AutoDetectParser.java:51)

> [1] https://issues.apache.org/jira/browse/TIKA-378

The issue is that the definitions of the types that are supported come  
from POI:

         Collections.unmodifiableSet(new HashSet<MediaType>(Arrays.asList(
                 POIFSDocumentType.WORKBOOK.type,
                 POIFSDocumentType.OLE10_NATIVE.type,
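
A small simulation of why this hurts, with made-up class names (StaticInitDemo, BrokenTypes) and a thrown IllegalStateException standing in for the missing POI class. The point is that a failure inside a static initializer comes out as an Error, so a plain catch (Exception e) never sees it, and the class stays unusable afterwards.

```java
class StaticInitDemo {
    // Touches BrokenTypes and reports which kind of Throwable comes out.
    static String firstUse() {
        try {
            return BrokenTypes.NAME;
        } catch (Exception e) {
            return "caught as Exception";        // never reached in this demo
        } catch (LinkageError e) {
            return e.getClass().getSimpleName();
        }
    }

    public static void main(String[] args) {
        System.out.println(firstUse());   // initialization fails on first use
        System.out.println(firstUse());   // the class is now permanently unusable
    }
}

// Hypothetical class whose static state cannot be built, similar in spirit to a
// static field that is assembled from types in a library that is not present.
class BrokenTypes {
    static final String NAME = compute();

    private static String compute() {
        throw new IllegalStateException("simulated missing dependency");
    }
}
```

In this simulation the first access reports an ExceptionInInitializerError and later ones a NoClassDefFoundError; in the real trace above, the NoClassDefFoundError for the POI class is thrown directly from the OfficeParser static initializer.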

-- Ken


Re: Error thrown with TikaConfig() constructor

Posted by Jukka Zitting <ju...@gmail.com>.

Hi,

On Fri, Sep 10, 2010 at 5:22 AM, Ken Krugler
<kk...@transpac.com> wrote:
> With 0.8-SNAPSHOT, the TikaConfig(Classpath) constructor now finds and
> instantiates all Parser-based classes found on the classpath. Which, as
> expected, triggers a storm of Exceptions and Errors.

Which errors are you seeing? In TIKA-378 [1] I tried to make the
TikaConfig(Classpath) behave better in such situations by making many
of our Parser classes loadable even when the respective parser library
is not available (I usually moved the direct class dependencies to a
separate Extractor class). I'm not sure how well that work has
survived recent changes in trunk.

[1] https://issues.apache.org/jira/browse/TIKA-378

BR,

Jukka Zitting

Re: Error thrown with TikaConfig() constructor

Posted by Oleg Tikhonov <ol...@gmail.com>.

+1 to Nick's suggestion.

On Fri, Sep 10, 2010 at 12:35 PM, Nick Burch <ni...@alfresco.com> wrote:

> On Thu, 9 Sep 2010, Ken Krugler wrote:
>
>> I'm wondering how best to handle this type of configuration, in a way
>> that's relatively resilient to Tika configuration changes and my target set
>> of formats.
>>
>
> Would it not make more sense to use the xml based TikaConfig constructor
> (file, inputstream etc), rather than the default one? You should be able to
> list just the parsers you want
>
> Nick
>

Re: Error thrown with TikaConfig() constructor

Posted by Nick Burch <ni...@alfresco.com>.

On Thu, 9 Sep 2010, Ken Krugler wrote:
> I'm wondering how best to handle this type of configuration, in a way 
> that's relatively resilient to Tika configuration changes and my target 
> set of formats.

Would it not make more sense to use the XML-based TikaConfig constructor 
(file, inputstream etc), rather than the default one? You should be able 
to list just the parsers you want.

Nick