Posted to dev@tika.apache.org by kbennett <kb...@bbsinc.biz> on 2007/09/24 19:22:01 UTC

Providing a Default Tika Configuration

All -

I think it would be convenient for Tika users to not need to specify a Tika
configuration resource, if they want the default configuration.  Most users
(like me ;) ) trust you guys to come up with a correct configuration.  After
all, Tika developers are the ones providing the library that ties all the
parsers together; it would be reasonable to expect that those developers
could come up with a reasonable default configuration.

One way this could be accomplished would be to move the default
configuration file from (root)/config.xml to
org.apache.tika.tika-config.xml.  Then we could do a
getResource("/org/apache/tika/tika-config.xml") or equivalent and be fairly
confident that we are not accidentally getting a different file.  If the
user chose not to specify a configuration (by passing null, or calling a
method that does not require one), then this default configuration could be
used.
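
A minimal sketch of that fallback, under stated assumptions: ConfigSource
and its open() methods are hypothetical names, not Tika API, and only the
resource path is taken from the proposal above.  A null argument is read as
"use the bundled default".

```java
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical helper, not part of Tika: chooses the stream from which the
// configuration is read.
final class ConfigSource {

    // Absolute classpath location proposed above for the default file.
    static final String DEFAULT_RESOURCE = "/org/apache/tika/tika-config.xml";

    private ConfigSource() {}

    // Returns the caller's stream if one was given, otherwise the default.
    static InputStream open(InputStream userConfig) throws IOException {
        return open(userConfig, DEFAULT_RESOURCE);
    }

    // Same as above, with the default location passed in explicitly.
    static InputStream open(InputStream userConfig, String defaultResource)
            throws IOException {
        if (userConfig != null) {
            return userConfig;
        }
        // The leading slash makes the lookup absolute, so a file with the
        // same name in another package cannot be picked up by accident.
        InputStream in = ConfigSource.class.getResourceAsStream(defaultResource);
        if (in == null) {
            throw new FileNotFoundException(
                    "Default configuration not on classpath: " + defaultResource);
        }
        return in;
    }
}
```

Failing loudly when the default file is missing is a design choice made for
this sketch; returning an empty configuration would be another option.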

I can post a JIRA issue if you like, but wanted to get your feedback first
in case I'm misunderstanding something.

Thanks,
- Keith

-- 
View this message in context: http://www.nabble.com/Providing-a-Default-Tika-Configuration-tf4510478.html#a12864244
Sent from the Apache Tika - Development mailing list archive at Nabble.com.


Re: Providing a Default Tika Configuration

Posted by kbennett <kb...@bbsinc.biz>.
Jukka -

It looks to me like TikaConfig is now almost completely immutable.  It
can return a ParserConfig, but that is immutable since you removed the
setContents() and made the getContents() return an immutable map.  The
MimeTypes object, also available from the TikaConfig instance, is almost
completely immutable, except that 1) it contains an add() method, and 2)
that a MimeType instance managed by MimeTypes has a setLevel() method.  But
it looks like those mutabilities could be removed by refactoring.

That being the case, are we close to making a TikaConfig object totally
reusable?  Would you like me to look at refactoring MimeTypes to make it
immutable?
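
As a rough illustration of what such a refactoring could look like: the
sketch below moves all mutation into a builder and fixes the level at
construction time.  MimeEntry, MimeRegistry and Builder are hypothetical
stand-ins, not the actual MimeType and MimeTypes classes, whose fields and
methods are not shown in this thread.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in for a single MIME type entry.
final class MimeEntry {
    private final String name;
    private final int level;   // fixed at construction, so no setLevel()

    MimeEntry(String name, int level) {
        this.name = name;
        this.level = level;
    }

    String getName() { return name; }

    int getLevel() { return level; }
}

// Hypothetical stand-in for the registry that manages the entries.
final class MimeRegistry {
    private final Map<String, MimeEntry> byName;

    private MimeRegistry(Map<String, MimeEntry> entries) {
        // Defensive copy plus read-only view: nothing can be added later.
        this.byName = Collections.unmodifiableMap(new LinkedHashMap<>(entries));
    }

    MimeEntry forName(String name) { return byName.get(name); }

    Map<String, MimeEntry> asMap() { return byName; }

    // All mutation happens here, before the registry is shared.
    static final class Builder {
        private final Map<String, MimeEntry> entries = new LinkedHashMap<>();

        Builder add(String name, int level) {
            entries.put(name, new MimeEntry(name, level));
            return this;
        }

        MimeRegistry build() { return new MimeRegistry(entries); }
    }
}
```

Once build() has returned, the registry can be shared between threads without
locking, since neither the map nor its entries can change.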

Thanks,
- Keith



Jukka Zitting wrote:
> 
> Hi,
> 
> On 9/25/07, kbennett <kb...@bbsinc.biz> wrote:
>> This means that every time a parse method that uses a default configuration
>> is used, the default configuration's XML will be reparsed.  This may not be
>> a big deal for apps that only occasionally do this, but for an app whose
>> mission is to parse documents, it seems kind of wasteful, especially when it
>> can be remedied with a small number of simple lines of code.  Certainly I
>> can get the default configuration once, hold onto it, and then call the
>> parse methods that take it, but it seems odd to me that I would have to do
>> that.  I realize it's a minor issue, though.
> 
> I would argue that that's (reusing the configuration instance) the
> preferred mode of operation. Currently I wouldn't do that due to the
> mutability of Content instances, but as we get to the point of having
> stateless Parser instances, I'd even advocate instantiating the full
> set of configured parsers when your application starts and reusing
> this configuration for any number of documents.
> 
> BR,
> 
> Jukka Zitting
> 
> 



Re: Providing a Default Tika Configuration

Posted by Jukka Zitting <ju...@gmail.com>.
Hi,

On 9/25/07, kbennett <kb...@bbsinc.biz> wrote:
> This means that every time a parse method that uses a default configuration
> is used, the default configuration's XML will be reparsed.  This may not be
> a big deal for apps that only occasionally do this, but for an app whose
> mission is to parse documents, it seems kind of wasteful, especially when it
> can be remedied with a small number of simple lines of code.  Certainly I
> can get the default configuration once, hold onto it, and then call the
> parse methods that take it, but it seems odd to me that I would have to do
> that.  I realize it's a minor issue, though.

I would argue that that's (reusing the configuration instance) the
preferred mode of operation. Currently I wouldn't do that due to the
mutability of Content instances, but as we get to the point of having
stateless Parser instances, I'd even advocate instantiating the full
set of configured parsers when your application starts and reusing
this configuration for any number of documents.
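
A small sketch of that mode of operation, assuming stateless parsers.  The
names TextExtractor, ParserSet and ReuseDemo are invented for illustration
and do not correspond to the current Parser or TikaConfig classes.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

// Hypothetical stateless parser: everything it needs arrives as arguments,
// and nothing is remembered between calls.
interface TextExtractor {
    String extract(String document);
}

// Hypothetical stand-in for a fully instantiated configuration.
final class ParserSet {
    private final Map<String, TextExtractor> byType;

    ParserSet(Map<String, TextExtractor> byType) {
        this.byType = Collections.unmodifiableMap(new HashMap<>(byType));
    }

    String parse(String mimeType, String document) {
        TextExtractor extractor = byType.get(mimeType);
        if (extractor == null) {
            throw new IllegalArgumentException("No parser for " + mimeType);
        }
        return extractor.extract(document);
    }
}

final class ReuseDemo {
    // Built once when the class is loaded, then shared by all callers.
    static final ParserSet PARSERS = createParsers();

    private ReuseDemo() {}

    private static ParserSet createParsers() {
        Map<String, TextExtractor> parsers = new HashMap<>();
        parsers.put("text/plain", String::trim);
        return new ParserSet(parsers);
    }

    public static void main(String[] args) {
        // The same instance serves any number of documents.
        for (String doc : new String[] {" first ", " second "}) {
            System.out.println(PARSERS.parse("text/plain", doc));
        }
    }
}
```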

BR,

Jukka Zitting

Re: Providing a Default Tika Configuration

Posted by kbennett <kb...@bbsinc.biz>.
This means that every time a parse method that uses a default configuration
is used, the default configuration's XML will be reparsed.  This may not be
a big deal for apps that only occasionally do this, but for an app whose
mission is to parse documents, it seems kind of wasteful, especially when it
can be remedied with a small number of simple lines of code.  Certainly I
can get the default configuration once, hold onto it, and then call the
parse methods that take it, but it seems odd to me that I would have to do
that.  I realize it's a minor issue, though.

- Keith



Bertrand Delacretaz wrote:
> 
> On 9/25/07, Jukka Zitting <ju...@gmail.com> wrote:
> 
>> ...How about something like this:
>>
>>     public static TikaConfig getDefaultConfig()
>>             throws IOException, JDOMException {
>>         return new TikaConfig(
>>                 TikaConfig.class.getResourceAsStream("tika-config.xml"));
>>     }...
> 
> Agreed, this is good enough.
> 
> -Bertrand
> 
> 



Re: Providing a Default Tika Configuration

Posted by Bertrand Delacretaz <bd...@apache.org>.
On 9/25/07, Jukka Zitting <ju...@gmail.com> wrote:

> ...How about something like this:
>
>     public static TikaConfig getDefaultConfig()
>             throws IOException, JDOMException {
>         return new TikaConfig(
>                 TikaConfig.class.getResourceAsStream("tika-config.xml"));
>     }...

Agreed, this is good enough.

-Bertrand

Re: Providing a Default Tika Configuration

Posted by Jukka Zitting <ju...@gmail.com>.
Hi,

On 9/25/07, Bertrand Delacretaz <bd...@apache.org> wrote:
> On 9/25/07, kbennett <kb...@bbsinc.biz> wrote:
> > ...We could also make one default configuration per thread using
> > threadlocal to avoid the overhead of synchronization, but I've heard that
> > synchronization is not as slow as it used to be, so that might be overkill....
>
> Yes, compared to what happens next (parsing of possibly large files),
> one synchronized block won't matter much!

On the other hand, compared to that cost I'd just as soon recreate the
full config instance on each call. Using a static variable,
synchronization, etc. sounds too complex to me.

How about something like this:

    public static TikaConfig getDefaultConfig()
            throws IOException, JDOMException {
        return new TikaConfig(
                TikaConfig.class.getResourceAsStream("tika-config.xml"));
    }
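
One detail of the snippet above is worth spelling out: a resource name
without a leading slash is resolved relative to the package of the class it
is called on.  Since the diff elsewhere in this thread places TikaConfig in
org/apache/tika/config, the file would be looked up there, not at the
/org/apache/tika/ location proposed at the start of the thread.  The sketch
below shows the two lookup styles using a JDK class, so that it runs without
Tika; ResourceNameDemo is a made-up name.

```java
import java.net.URL;

final class ResourceNameDemo {
    private ResourceNameDemo() {}

    // No leading slash: resolved relative to the package of the anchor class.
    static URL relative(Class<?> anchor, String name) {
        return anchor.getResource(name);
    }

    // Leading slash: resolved from the root of the classpath.
    static URL absolute(Class<?> anchor, String path) {
        return anchor.getResource("/" + path);
    }

    public static void main(String[] args) {
        // String lives in java.lang, so both lookups find the same file.
        System.out.println(relative(String.class, "String.class"));
        System.out.println(absolute(String.class, "java/lang/String.class"));
        // Relative to java/lang there is no util/List.class: prints null.
        System.out.println(relative(String.class, "util/List.class"));
    }
}
```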

BR,

Jukka Zitting

Re: Providing a Default Tika Configuration

Posted by Bertrand Delacretaz <bd...@apache.org>.
On 9/25/07, kbennett <kb...@bbsinc.biz> wrote:

> ...We could also make one default configuration per thread using
> threadlocal to avoid the overhead of synchronization, but I've heard that
> synchronization is not as slow as it used to be, so that might be overkill....

Yes, compared to what happens next (parsing of possibly large files),
one synchronized block won't matter much!

-Bertrand

Re: Providing a Default Tika Configuration

Posted by kbennett <kb...@bbsinc.biz>.
Bertrand -

Thanks for responding.  Regarding the doubling of the methods, normally I
wouldn't suggest it, but these new methods may be the only methods some
programmers ever use in Tika.  I was thinking that their use will be so
common that the additional verbosity would be justified.  Also, in my use
cases, I don't anticipate *ever* overriding the default configuration.  My
two cents.

Good point about synchronizing the lazy instantiation.  I hadn't thought
about that.  We could also make one default configuration per thread using
threadlocal to avoid the overhead of synchronization, but I've heard that
synchronization is not as slow as it used to be, so that might be overkill.  
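
For comparison, a per-thread default could be sketched as follows.
PerThreadDefault and PerThreadDemo are hypothetical names, and a plain Object
stands in for the configuration.  The cost of this approach is that the
loader runs once in every thread that asks for the default.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

// Hypothetical holder for one lazily created value per thread.
final class PerThreadDefault<T> {
    private final ThreadLocal<T> holder;

    PerThreadDefault(Supplier<T> loader) {
        // Each thread runs the loader on its first get(); no locking needed.
        this.holder = ThreadLocal.withInitial(loader);
    }

    T get() { return holder.get(); }
}

final class PerThreadDemo {
    private PerThreadDemo() {}

    // Counts how often the loader runs when two threads each ask twice.
    static int loadsForTwoThreads() {
        AtomicInteger loads = new AtomicInteger();
        PerThreadDefault<Object> config = new PerThreadDefault<>(() -> {
            loads.incrementAndGet();   // stands in for parsing the XML
            return new Object();
        });
        Runnable work = () -> {
            config.get();
            config.get();   // second call reuses this thread's instance
        };
        Thread a = new Thread(work);
        Thread b = new Thread(work);
        a.start();
        b.start();
        try {
            a.join();
            b.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return loads.get();
    }

    public static void main(String[] args) {
        System.out.println(loadsForTwoThreads()); // 2: one load per thread
    }
}
```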

- Keith



Bertrand Delacretaz wrote:
> 
> On 9/25/07, kbennett <kb...@bbsinc.biz> wrote:
> 
>> On further thought, a static method returning the default configuration might
>> be better than a no-arg constructor....
> 
> Sounds good to me.
> 
> I'm not sure about doubling the number of methods which use configs
> though, having to use config==null to get the default sounds
> reasonable. But that's not terribly important, if you find places
> where an additional method is useful I have no problem with that.
> 
> +        if (defaultConfig == null) {
> +            URL url = new URL(DEFAULT_CONFIG_URL);
> +            defaultConfig = new TikaConfig(url);
> +        }
> 
> I'd just make this synchronized(TikaConfig.class).
> 
> -Bertrand
> 
> 



Re: Providing a Default Tika Configuration

Posted by Bertrand Delacretaz <bd...@apache.org>.
On 9/25/07, kbennett <kb...@bbsinc.biz> wrote:

> On further thought, a static method returning the default configuration might
> be better than a no-arg constructor....

Sounds good to me.

I'm not sure about doubling the number of methods which use configs
though, having to use config==null to get the default sounds
reasonable. But that's not terribly important, if you find places
where an additional method is useful I have no problem with that.

+        if (defaultConfig == null) {
+            URL url = new URL(DEFAULT_CONFIG_URL);
+            defaultConfig = new TikaConfig(url);
+        }

I'd just make this synchronized(TikaConfig.class).
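
Spelled out, the synchronized variant could look roughly like this.
SharedDefault is a hypothetical stand-in for TikaConfig, and load() is a
placeholder for reading and parsing the XML; the counter exists only so the
behaviour can be observed.

```java
import java.util.concurrent.atomic.AtomicInteger;

final class SharedDefault {
    // Number of times the expensive load has run (for demonstration only).
    static final AtomicInteger LOADS = new AtomicInteger();

    private static Object defaultConfig;   // guarded by SharedDefault.class

    private SharedDefault() {}

    static Object getDefault() {
        // The check and the assignment happen under one lock, so two threads
        // cannot both see null and both create an instance.
        synchronized (SharedDefault.class) {
            if (defaultConfig == null) {
                defaultConfig = load();
            }
            return defaultConfig;
        }
    }

    // Placeholder for the expensive step.
    private static Object load() {
        LOADS.incrementAndGet();
        return new Object();
    }

    public static void main(String[] args) {
        System.out.println(getDefault() == getDefault()); // true
        System.out.println(LOADS.get());                  // 1
    }
}
```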

-Bertrand

Re: Providing a Default Tika Configuration

Posted by kbennett <kb...@bbsinc.biz>.
On further thought, a static method returning the default configuration might
be better than a no-arg constructor.  That way, the single instance could be
shared.  So, assuming the tika-config.xml file were moved to
org/apache/tika, the following few lines could be added to TikaConfig.java.
Then the methods that use this shared instance could be added to ParseUtils.

Index: src/main/java/org/apache/tika/config/TikaConfig.java
===================================================================
--- src/main/java/org/apache/tika/config/TikaConfig.java        (revision 578987)
+++ src/main/java/org/apache/tika/config/TikaConfig.java        (working copy)
@@ -45,6 +45,22 @@
     
     private static MimeUtils mimeTypeRepo;
 
+    private static final String DEFAULT_CONFIG_URL
+            = "/org/apache/tika/tika-config.xml";
+
+    private static TikaConfig defaultConfig;
+
+
+    public static final TikaConfig getDefaultConfig()
+            throws IOException, JDOMException {
+
+        if (defaultConfig == null) {
+            URL url = new URL(DEFAULT_CONFIG_URL);
+            defaultConfig = new TikaConfig(url);
+        }
+        return defaultConfig;
+    }
+
     public TikaConfig(String file) throws JDOMException, IOException {
         this(new File(file));
     }
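
One caveat with the patch as posted: new URL(String) expects an absolute URL
with a protocol, so a bare classpath path such as
/org/apache/tika/tika-config.xml is rejected with a MalformedURLException.
A possible alternative is Class.getResource(), which returns a URL for a
classpath resource.  The sketch below only illustrates the difference;
ClasspathUrlCheck and its methods are hypothetical names.

```java
import java.net.MalformedURLException;
import java.net.URL;

final class ClasspathUrlCheck {
    private ClasspathUrlCheck() {}

    // True if the URL(String) constructor accepts the given string.
    @SuppressWarnings("deprecation") // URL(String) is deprecated in recent JDKs
    static boolean parsesAsUrl(String spec) {
        try {
            new URL(spec);
            return true;
        } catch (MalformedURLException e) {
            return false;
        }
    }

    // Classpath lookup; returns null when the resource is not present.
    static URL fromClasspath(String absolutePath) {
        return ClasspathUrlCheck.class.getResource(absolutePath);
    }

    public static void main(String[] args) {
        // No protocol, so the constructor rejects it: prints false.
        System.out.println(parsesAsUrl("/org/apache/tika/tika-config.xml"));
        // A file: URL has a protocol: prints true.
        System.out.println(parsesAsUrl("file:/tmp/tika-config.xml"));
    }
}
```

Under that assumption, the lookup line in the patch might instead read
"URL url = TikaConfig.class.getResource(DEFAULT_CONFIG_URL);", keeping the
TikaConfig(URL) constructor call as it is.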


- Keith



Re: Providing a Default Tika Configuration

Posted by kbennett <kb...@bbsinc.biz>.
Ok, here's a concrete path that could be taken:

1) Move tika-config.xml from the default package to org.apache.tika.

2) Create a no-args constructor on TikaConfig that uses a single shared
instance of the default configuration object created with the above
tika-config.xml.  (These objects are immutable, so this should be threadsafe
(right?).)

3) Add ParseUtils methods that do not require a TikaConfig object.  This
would require doubling the number of methods, but I'm willing to write them
if you're willing to include them.
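
The doubling in step 3 could be kept cheap by having each new method
delegate to its existing counterpart.  A sketch, with invented names
(DemoConfig, ParseHelper, getText) standing in for TikaConfig and the
ParseUtils methods, whose real signatures are not reproduced here:

```java
// Hypothetical stand-in for the configuration type.
final class DemoConfig {
    final String name;

    DemoConfig(String name) { this.name = name; }
}

// Hypothetical stand-in for a utility class with config-taking methods.
final class ParseHelper {
    // Single shared default; safe to share as long as it is immutable.
    private static final DemoConfig DEFAULT = new DemoConfig("default");

    private ParseHelper() {}

    // Existing style of method: the caller supplies the configuration.
    // A null configuration is treated as "use the default".
    static String getText(String document, DemoConfig config) {
        DemoConfig effective = (config != null) ? config : DEFAULT;
        return effective.name + ":" + document;
    }

    // Added convenience overload: one line of delegation per method.
    static String getText(String document) {
        return getText(document, null);
    }

    public static void main(String[] args) {
        System.out.println(getText("doc"));                           // default:doc
        System.out.println(getText("doc", new DemoConfig("custom"))); // custom:doc
    }
}
```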

If you like, I can do this and submit a patch.

- Keith
