You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@tika.apache.org by "Jan Høydahl (JIRA)" <ji...@apache.org> on 2010/08/21 13:51:16 UTC

[jira] Created: (TIKA-490) Support for adding language profiles dynamically

Support for adding language profiles dynamically
------------------------------------------------

                 Key: TIKA-490
                 URL: https://issues.apache.org/jira/browse/TIKA-490
             Project: Tika
          Issue Type: Improvement
          Components: languageidentifier
    Affects Versions: 0.7
            Reporter: Jan Høydahl
             Fix For: 0.8


Currently the Tika LanguageIdentifier loads language profiles thorugh a hardcoded static block in the java code.

It would be better to make this configurable, so you could add your own languages without recompiling.

Suggested approach:
Remove the static code block loading all languages. Instead look for a tika.languageidentification.properties file on classpath.
Now the user can simply make his/her own (additional) language profile files, put them on the classpath together with a properties file and off you go!

Also, once you make it configurable, there might be an issue of having the profiles as static members, as you will force the same behaviour for the whole VM. A static Map of Maps could solve this.


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (TIKA-490) Support for adding language profiles dynamically

Posted by "Jan Høydahl (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-490?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12901176#action_12901176 ] 

Jan Høydahl commented on TIKA-490:
----------------------------------

(Adding same comment to this issue as I sent to mailing list)

My rationale for the override part is as follows:

The default properties file will be embedded within tika-xx.jar. I assume most people are not keen to unpack and repack JARs to make a config change. We COULD put a similar named properties file at another location, but then the user needs to make sure that location is EARLIER in classpath than the JAR file. In the case of e.g. Solr (Tomcat, Jetty..) it is not obvious how to ensure this, and to avoid any confusion about class-loader peculiarities, it's more straight-forward to look for an override file.

Take Solr as an example. The user would then put the override properties file along with his new language profiles in a folder $SOLR_HOME/lib/org/apache/tika/language/


> Support for adding language profiles dynamically
> ------------------------------------------------
>
>                 Key: TIKA-490
>                 URL: https://issues.apache.org/jira/browse/TIKA-490
>             Project: Tika
>          Issue Type: Improvement
>          Components: languageidentifier
>    Affects Versions: 0.7
>            Reporter: Jan Høydahl
>            Assignee: Chris A. Mattmann
>             Fix For: 0.8
>
>         Attachments: TIKA-490.patch, TIKA-490.patch
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> Currently the Tika LanguageIdentifier loads language profiles thorugh a hardcoded static block in the java code.
> It would be better to make this configurable, so you could add your own languages without recompiling.
> Suggested approach:
> Remove the static code block loading all languages. Instead look for a tika.languageidentification.properties file on classpath.
> Now the user can simply make his/her own (additional) language profile files, put them on the classpath together with a properties file and off you go!
> Also, once you make it configurable, there might be an issue of having the profiles as static members, as you will force the same behaviour for the whole VM. A static Map of Maps could solve this.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (TIKA-490) Support for adding language profiles dynamically

Posted by "Jan Høydahl (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-490?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12901180#action_12901180 ] 

Jan Høydahl commented on TIKA-490:
----------------------------------

Either solution is OK by me.

But from my experience classpath problems in application servers can cause a headache, and therefore I tried to be as explicit as possible.

It would of course work to put a tika.language.properties file on classpath as well, if you know it's earlier in classpath than the tika jar.

Or we could name the jar-internal properties file tika.language.default.properties, and let people override through tika.language.properties :)

I'll leave the decision up to you. Remove the override if you feel strong about it.

> Support for adding language profiles dynamically
> ------------------------------------------------
>
>                 Key: TIKA-490
>                 URL: https://issues.apache.org/jira/browse/TIKA-490
>             Project: Tika
>          Issue Type: Improvement
>          Components: languageidentifier
>    Affects Versions: 0.7
>            Reporter: Jan Høydahl
>            Assignee: Chris A. Mattmann
>             Fix For: 0.8
>
>         Attachments: TIKA-490.Mattmann.082210.patch.txt, TIKA-490.patch, TIKA-490.patch
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> Currently the Tika LanguageIdentifier loads language profiles thorugh a hardcoded static block in the java code.
> It would be better to make this configurable, so you could add your own languages without recompiling.
> Suggested approach:
> Remove the static code block loading all languages. Instead look for a tika.languageidentification.properties file on classpath.
> Now the user can simply make his/her own (additional) language profile files, put them on the classpath together with a properties file and off you go!
> Also, once you make it configurable, there might be an issue of having the profiles as static members, as you will force the same behaviour for the whole VM. A static Map of Maps could solve this.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (TIKA-490) Support for adding language profiles dynamically

Posted by "Jan Høydahl (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/TIKA-490?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jan Høydahl updated TIKA-490:
-----------------------------

    Attachment: TIKA-490.janhoy.092010.patch

Updated patch which adds ability to load your own map of LanguageProfile-s programmatically. I chose to do that not via a constructor, but rather a new signature for for initProfiles(), which will replace the profiles map with the one provided.

The public API after this patch will be:
    // Existing
    public LanguageIdentifier(LanguageProfile profile)
    public LanguageIdentifier(String content)
    public String getLanguage()
    public boolean isReasonablyCertain()
    public String toString()

    // New from patch
    public static void initProfiles()
    public static void initProfiles(Map<String, LanguageProfile> profilesMap)
    public static void clearProfiles()
    public static void addProfile(String language, LanguageProfile profile)
    public static Set<String> getSupportedLanguages()
    public static boolean hasErrors()
    public static String getErrors()

None of the original signatures are changed.
You can now initialize language profiles in these ways:
A) Loading tika.language.properties and profiles from jar file (default, equivalent to v0.7)
B) Loading tika.language.override.properties from classpath and profiles from classpath (nice for integrators)
C) Dynamic from client code, using addProfile and/or initProfiles(Map<String, LanguageProfile>)

PS: I removed the getDistance() method again, since it's exposing internal value. Already available through toString() for those who need it. Also removed the "force" parameter to initProfiles() as it's not needed - we always want to init from scratch.

> Support for adding language profiles dynamically
> ------------------------------------------------
>
>                 Key: TIKA-490
>                 URL: https://issues.apache.org/jira/browse/TIKA-490
>             Project: Tika
>          Issue Type: Improvement
>          Components: languageidentifier
>    Affects Versions: 0.7
>            Reporter: Jan Høydahl
>            Assignee: Chris A. Mattmann
>             Fix For: 0.8
>
>         Attachments: TIKA-490.janhoy.082310.patch, TIKA-490.janhoy.082310.patch, TIKA-490.janhoy.092010.patch, TIKA-490.Mattmann.082210.2.patch.txt, TIKA-490.Mattmann.082210.patch.txt, TIKA-490.patch, TIKA-490.patch
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> Currently the Tika LanguageIdentifier loads language profiles thorugh a hardcoded static block in the java code.
> It would be better to make this configurable, so you could add your own languages without recompiling.
> Suggested approach:
> Remove the static code block loading all languages. Instead look for a tika.languageidentification.properties file on classpath.
> Now the user can simply make his/her own (additional) language profile files, put them on the classpath together with a properties file and off you go!
> Also, once you make it configurable, there might be an issue of having the profiles as static members, as you will force the same behaviour for the whole VM. A static Map of Maps could solve this.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (TIKA-490) Support for adding language profiles dynamically

Posted by "Chris A. Mattmann (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/TIKA-490?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris A. Mattmann updated TIKA-490:
-----------------------------------

    Attachment: TIKA-490.Mattmann.082210.2.patch.txt

- updated patch. I decided against throwing TikaExceptions, and just decided to throw an IOException if there was a problem loading the props file containing the profile default information. This makes the *real* exception the one that's exposed to the consumer of this API, which makes more sense. I don't like returning null in ProfilingWriter#getLanguage, but I couldn't think of anything more elegant at this time. So, I documented that there's a possibility that that method could return null now, so a caller should plan accordingly. 

If there are no objections, I'll commit this in the next 24 hrs.

> Support for adding language profiles dynamically
> ------------------------------------------------
>
>                 Key: TIKA-490
>                 URL: https://issues.apache.org/jira/browse/TIKA-490
>             Project: Tika
>          Issue Type: Improvement
>          Components: languageidentifier
>    Affects Versions: 0.7
>            Reporter: Jan Høydahl
>            Assignee: Chris A. Mattmann
>             Fix For: 0.8
>
>         Attachments: TIKA-490.Mattmann.082210.2.patch.txt, TIKA-490.Mattmann.082210.patch.txt, TIKA-490.patch, TIKA-490.patch
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> Currently the Tika LanguageIdentifier loads language profiles thorugh a hardcoded static block in the java code.
> It would be better to make this configurable, so you could add your own languages without recompiling.
> Suggested approach:
> Remove the static code block loading all languages. Instead look for a tika.languageidentification.properties file on classpath.
> Now the user can simply make his/her own (additional) language profile files, put them on the classpath together with a properties file and off you go!
> Also, once you make it configurable, there might be an issue of having the profiles as static members, as you will force the same behaviour for the whole VM. A static Map of Maps could solve this.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (TIKA-490) Support for adding language profiles dynamically

Posted by "Jan Høydahl (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/TIKA-490?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jan Høydahl updated TIKA-490:
-----------------------------

    Attachment: TIKA-490.patch

Here is my first patch (against current trunk 987074) which adds a properties file for languages list, and initializes profiles in constructor. I also made the new initProfiles() public so that it is possible to programmatically re-load the profiles if you have changed the property file, without restarting the JavaVM.

I also changed the URL for ISO639 to a more complete version, and extracted some strings into constants.

Also added a JDK logger to add some output. Did not use log4j or slf4j because those would require new dependencies and hurt backwards drop-in compatibility. We now log what languages get initalized and from which property file, which will aid in debugging.

All existing tests PASS. No API changes are done which breaks backward compatibility.

It would be great if someone could test it and commit it in time for 0.8, as I suspect to use it with Solr with some custom language profiles.

> Support for adding language profiles dynamically
> ------------------------------------------------
>
>                 Key: TIKA-490
>                 URL: https://issues.apache.org/jira/browse/TIKA-490
>             Project: Tika
>          Issue Type: Improvement
>          Components: languageidentifier
>    Affects Versions: 0.7
>            Reporter: Jan Høydahl
>             Fix For: 0.8
>
>         Attachments: TIKA-490.patch
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> Currently the Tika LanguageIdentifier loads language profiles thorugh a hardcoded static block in the java code.
> It would be better to make this configurable, so you could add your own languages without recompiling.
> Suggested approach:
> Remove the static code block loading all languages. Instead look for a tika.languageidentification.properties file on classpath.
> Now the user can simply make his/her own (additional) language profile files, put them on the classpath together with a properties file and off you go!
> Also, once you make it configurable, there might be an issue of having the profiles as static members, as you will force the same behaviour for the whole VM. A static Map of Maps could solve this.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (TIKA-490) Support for adding language profiles dynamically

Posted by "Jan Høydahl (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-490?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12901335#action_12901335 ] 

Jan Høydahl commented on TIKA-490:
----------------------------------

Jukka,

Your idea for programmatic adding of profiles is good for some integration scenarios.

My original intent with filing this issue however, was to support the case where an (end) user or integrator needs to add support for a new langue on the fly without programming.

For me as a Search engine Consultant I can then add say Sami Language support to Solr extracting request handler by placing one property file and one Language profile file on the classpath. With a constructor only this would require code change to Solr.

So we can add constructor but please don't remove property file.

> Support for adding language profiles dynamically
> ------------------------------------------------
>
>                 Key: TIKA-490
>                 URL: https://issues.apache.org/jira/browse/TIKA-490
>             Project: Tika
>          Issue Type: Improvement
>          Components: languageidentifier
>    Affects Versions: 0.7
>            Reporter: Jan Høydahl
>            Assignee: Chris A. Mattmann
>             Fix For: 0.8
>
>         Attachments: TIKA-490.janhoy.082310.patch, TIKA-490.janhoy.082310.patch, TIKA-490.Mattmann.082210.2.patch.txt, TIKA-490.Mattmann.082210.patch.txt, TIKA-490.patch, TIKA-490.patch
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> Currently the Tika LanguageIdentifier loads language profiles thorugh a hardcoded static block in the java code.
> It would be better to make this configurable, so you could add your own languages without recompiling.
> Suggested approach:
> Remove the static code block loading all languages. Instead look for a tika.languageidentification.properties file on classpath.
> Now the user can simply make his/her own (additional) language profile files, put them on the classpath together with a properties file and off you go!
> Also, once you make it configurable, there might be an issue of having the profiles as static members, as you will force the same behaviour for the whole VM. A static Map of Maps could solve this.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (TIKA-490) Support for adding language profiles dynamically

Posted by "Jan Høydahl (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/TIKA-490?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jan Høydahl updated TIKA-490:
-----------------------------

    Attachment: TIKA-490.janhoy.082310.patch

Been thinking a bit more, and attaching a (in my eyes) approved patch.

Those who override the config, they need better logging, else they're in the dark as to what happened.
Since we cannot use a log framework, I've added a way to detect and fetch errors through the API.

Also I've removed the throw of IOException in constructor. Since there is only a risk of that particular exception being thrown if the property file does not exist, in my opinion this will NEVER happen. So we should not force people to catch it. Also, it is better to keep backward API compatibility for all the existing applications out there. Users that override config will not either get the IOException since - if their tika.language.override.properties does not exist, tika will load the built-in one. And if the property file exists but is faulty, you'll not get IOException either. If there are errors in the properties file, you can test that through the new methods, and even ask for a list of languages that are successfully initialized.

Changes in the attached patch:
* public static boolean hasErrors()
* public static String getErrors()
* public static Set<String> getSupportedLanguages()
* Constructor no longer throws IOException
* Added better comments to public methods, including a warning for isReasonablyCertain() and short texts

Sample output from a test application:
Supported languages: [is, da, it, no, hu, th, de, el, fi, pt, pl, sv, fr, en, ru, et, es, nl]
Number of languages supported: 18
Has errors? true
    Language xx (Unknown) not initialized. Message: Failed trying to load language profile for language "xx". Error: null

> Support for adding language profiles dynamically
> ------------------------------------------------
>
>                 Key: TIKA-490
>                 URL: https://issues.apache.org/jira/browse/TIKA-490
>             Project: Tika
>          Issue Type: Improvement
>          Components: languageidentifier
>    Affects Versions: 0.7
>            Reporter: Jan Høydahl
>            Assignee: Chris A. Mattmann
>             Fix For: 0.8
>
>         Attachments: TIKA-490.janhoy.082310.patch, TIKA-490.Mattmann.082210.2.patch.txt, TIKA-490.Mattmann.082210.patch.txt, TIKA-490.patch, TIKA-490.patch
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> Currently the Tika LanguageIdentifier loads language profiles thorugh a hardcoded static block in the java code.
> It would be better to make this configurable, so you could add your own languages without recompiling.
> Suggested approach:
> Remove the static code block loading all languages. Instead look for a tika.languageidentification.properties file on classpath.
> Now the user can simply make his/her own (additional) language profile files, put them on the classpath together with a properties file and off you go!
> Also, once you make it configurable, there might be an issue of having the profiles as static members, as you will force the same behaviour for the whole VM. A static Map of Maps could solve this.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (TIKA-490) Support for adding language profiles dynamically

Posted by "Chris A. Mattmann (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-490?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12901243#action_12901243 ] 

Chris A. Mattmann commented on TIKA-490:
----------------------------------------

Hey Jukka:

+1, let me take a crack at addressing your issues. I think we can make this look like the way that the Parsers are dynamically loaded, including the API support for it.

Cheers,
Chris


> Support for adding language profiles dynamically
> ------------------------------------------------
>
>                 Key: TIKA-490
>                 URL: https://issues.apache.org/jira/browse/TIKA-490
>             Project: Tika
>          Issue Type: Improvement
>          Components: languageidentifier
>    Affects Versions: 0.7
>            Reporter: Jan Høydahl
>            Assignee: Chris A. Mattmann
>             Fix For: 0.8
>
>         Attachments: TIKA-490.janhoy.082310.patch, TIKA-490.janhoy.082310.patch, TIKA-490.Mattmann.082210.2.patch.txt, TIKA-490.Mattmann.082210.patch.txt, TIKA-490.patch, TIKA-490.patch
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> Currently the Tika LanguageIdentifier loads language profiles thorugh a hardcoded static block in the java code.
> It would be better to make this configurable, so you could add your own languages without recompiling.
> Suggested approach:
> Remove the static code block loading all languages. Instead look for a tika.languageidentification.properties file on classpath.
> Now the user can simply make his/her own (additional) language profile files, put them on the classpath together with a properties file and off you go!
> Also, once you make it configurable, there might be an issue of having the profiles as static members, as you will force the same behaviour for the whole VM. A static Map of Maps could solve this.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Resolved: (TIKA-490) Support for adding language profiles dynamically

Posted by "Chris A. Mattmann (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/TIKA-490?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris A. Mattmann resolved TIKA-490.
------------------------------------

    Resolution: Fixed

- I applied the latest patch from Jan in r1029556. Ultimately we should move to a service-provider like mechanism on this though the existing patch is quite close to that mechanism in the end. It would be nice to just have language profiles picked up dynamically on the classpath, but this is good enough for now. Thanks Jan!

> Support for adding language profiles dynamically
> ------------------------------------------------
>
>                 Key: TIKA-490
>                 URL: https://issues.apache.org/jira/browse/TIKA-490
>             Project: Tika
>          Issue Type: Improvement
>          Components: languageidentifier
>    Affects Versions: 0.7
>            Reporter: Jan Høydahl
>            Assignee: Chris A. Mattmann
>             Fix For: 0.8
>
>         Attachments: TIKA-490.janhoy.082310.patch, TIKA-490.janhoy.082310.patch, TIKA-490.janhoy.092010.patch, TIKA-490.janhoy.092010.patch, TIKA-490.Mattmann.082210.2.patch.txt, TIKA-490.Mattmann.082210.patch.txt, TIKA-490.patch, TIKA-490.patch
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> Currently the Tika LanguageIdentifier loads language profiles thorugh a hardcoded static block in the java code.
> It would be better to make this configurable, so you could add your own languages without recompiling.
> Suggested approach:
> Remove the static code block loading all languages. Instead look for a tika.languageidentification.properties file on classpath.
> Now the user can simply make his/her own (additional) language profile files, put them on the classpath together with a properties file and off you go!
> Also, once you make it configurable, there might be an issue of having the profiles as static members, as you will force the same behaviour for the whole VM. A static Map of Maps could solve this.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (TIKA-490) Support for adding language profiles dynamically

Posted by "Jan Høydahl (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/TIKA-490?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jan Høydahl updated TIKA-490:
-----------------------------

    Attachment: TIKA-490.janhoy.082310.patch

Yepp, I saw that.

This patch removes the try's from test.
I also removed the call to initProfiles() from constructor, as we now call it statically, and we no longer throw IOException in constructor, so we don't need it there for that purpose.

I also took the freedom of adding a simple getter
public double getDistance()
I know that it is not directly related to this issue, so feel free to discard it and I will create a new ticket fir it.
But given its low risk and high gain... at least until TIKA-369 is implemented.
it would let the consumer decide what is certain enough for his use case. by e.g. computing normalized distance based on text length.
An alternative is to mark it as deprecated and semi-internal, as the numerical value will probably change with TIKA-369.

> Support for adding language profiles dynamically
> ------------------------------------------------
>
>                 Key: TIKA-490
>                 URL: https://issues.apache.org/jira/browse/TIKA-490
>             Project: Tika
>          Issue Type: Improvement
>          Components: languageidentifier
>    Affects Versions: 0.7
>            Reporter: Jan Høydahl
>            Assignee: Chris A. Mattmann
>             Fix For: 0.8
>
>         Attachments: TIKA-490.janhoy.082310.patch, TIKA-490.janhoy.082310.patch, TIKA-490.Mattmann.082210.2.patch.txt, TIKA-490.Mattmann.082210.patch.txt, TIKA-490.patch, TIKA-490.patch
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> Currently the Tika LanguageIdentifier loads language profiles thorugh a hardcoded static block in the java code.
> It would be better to make this configurable, so you could add your own languages without recompiling.
> Suggested approach:
> Remove the static code block loading all languages. Instead look for a tika.languageidentification.properties file on classpath.
> Now the user can simply make his/her own (additional) language profile files, put them on the classpath together with a properties file and off you go!
> Also, once you make it configurable, there might be an issue of having the profiles as static members, as you will force the same behaviour for the whole VM. A static Map of Maps could solve this.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (TIKA-490) Support for adding language profiles dynamically

Posted by "Jan Høydahl (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/TIKA-490?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jan Høydahl updated TIKA-490:
-----------------------------

    Attachment: TIKA-490.janhoy.092010.patch

Replacing with correct patch

> Support for adding language profiles dynamically
> ------------------------------------------------
>
>                 Key: TIKA-490
>                 URL: https://issues.apache.org/jira/browse/TIKA-490
>             Project: Tika
>          Issue Type: Improvement
>          Components: languageidentifier
>    Affects Versions: 0.7
>            Reporter: Jan Høydahl
>            Assignee: Chris A. Mattmann
>             Fix For: 0.8
>
>         Attachments: TIKA-490.janhoy.082310.patch, TIKA-490.janhoy.082310.patch, TIKA-490.janhoy.092010.patch, TIKA-490.janhoy.092010.patch, TIKA-490.Mattmann.082210.2.patch.txt, TIKA-490.Mattmann.082210.patch.txt, TIKA-490.patch, TIKA-490.patch
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> Currently the Tika LanguageIdentifier loads language profiles thorugh a hardcoded static block in the java code.
> It would be better to make this configurable, so you could add your own languages without recompiling.
> Suggested approach:
> Remove the static code block loading all languages. Instead look for a tika.languageidentification.properties file on classpath.
> Now the user can simply make his/her own (additional) language profile files, put them on the classpath together with a properties file and off you go!
> Also, once you make it configurable, there might be an issue of having the profiles as static members, as you will force the same behaviour for the whole VM. A static Map of Maps could solve this.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (TIKA-490) Support for adding language profiles dynamically

Posted by "Chris A. Mattmann (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/TIKA-490?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris A. Mattmann updated TIKA-490:
-----------------------------------

    Attachment: TIKA-490.Mattmann.082210.patch.txt

- see my updated patch. I've replaced all logging in the patch with TikaExceptions and proper handling. We don't want to have any logging in tika-core. If logging needs to happen it can happen downstream in the clients of the functionality provided by core. I also decided I'm fine with the override file notion (thanks for the explanation, Jan!). If there are no objections I'd like to commit this in the next 24 hours.

> Support for adding language profiles dynamically
> ------------------------------------------------
>
>                 Key: TIKA-490
>                 URL: https://issues.apache.org/jira/browse/TIKA-490
>             Project: Tika
>          Issue Type: Improvement
>          Components: languageidentifier
>    Affects Versions: 0.7
>            Reporter: Jan Høydahl
>            Assignee: Chris A. Mattmann
>             Fix For: 0.8
>
>         Attachments: TIKA-490.Mattmann.082210.patch.txt, TIKA-490.patch, TIKA-490.patch
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> Currently the Tika LanguageIdentifier loads language profiles thorugh a hardcoded static block in the java code.
> It would be better to make this configurable, so you could add your own languages without recompiling.
> Suggested approach:
> Remove the static code block loading all languages. Instead look for a tika.languageidentification.properties file on classpath.
> Now the user can simply make his/her own (additional) language profile files, put them on the classpath together with a properties file and off you go!
> Also, once you make it configurable, there might be an issue of having the profiles as static members, as you will force the same behaviour for the whole VM. A static Map of Maps could solve this.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (TIKA-490) Support for adding language profiles dynamically

Posted by "Jukka Zitting (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-490?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12901231#action_12901231 ] 

Jukka Zitting commented on TIKA-490:
------------------------------------

I wonder why we'd need these extra properties files in the first place. As an alternative, how about if we just added an optional Map<String, LanguageProfile> argument to the LanguageIdentifier constructors, so applications that want to add new languages or customize the default profiles can easily do so. Such an approach would be much more flexible and less error-prone (no need for extra logging) than using properties files, and an application that needs runtime configuration of language profiles can still implement its own properties files for this purpose.

> Support for adding language profiles dynamically
> ------------------------------------------------
>
>                 Key: TIKA-490
>                 URL: https://issues.apache.org/jira/browse/TIKA-490
>             Project: Tika
>          Issue Type: Improvement
>          Components: languageidentifier
>    Affects Versions: 0.7
>            Reporter: Jan Høydahl
>            Assignee: Chris A. Mattmann
>             Fix For: 0.8
>
>         Attachments: TIKA-490.janhoy.082310.patch, TIKA-490.janhoy.082310.patch, TIKA-490.Mattmann.082210.2.patch.txt, TIKA-490.Mattmann.082210.patch.txt, TIKA-490.patch, TIKA-490.patch
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> Currently the Tika LanguageIdentifier loads language profiles thorugh a hardcoded static block in the java code.
> It would be better to make this configurable, so you could add your own languages without recompiling.
> Suggested approach:
> Remove the static code block loading all languages. Instead look for a tika.languageidentification.properties file on classpath.
> Now the user can simply make his/her own (additional) language profile files, put them on the classpath together with a properties file and off you go!
> Also, once you make it configurable, there might be an issue of having the profiles as static members, as you will force the same behaviour for the whole VM. A static Map of Maps could solve this.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (TIKA-490) Support for adding language profiles dynamically

Posted by "Chris A. Mattmann (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-490?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12917733#action_12917733 ] 

Chris A. Mattmann commented on TIKA-490:
----------------------------------------

Hi Jan:

I've been a bit bogged down lately but plan on looking at this later in the week and trying to resolve it then.

Cheers,
Chris


> Support for adding language profiles dynamically
> ------------------------------------------------
>
>                 Key: TIKA-490
>                 URL: https://issues.apache.org/jira/browse/TIKA-490
>             Project: Tika
>          Issue Type: Improvement
>          Components: languageidentifier
>    Affects Versions: 0.7
>            Reporter: Jan Høydahl
>            Assignee: Chris A. Mattmann
>             Fix For: 0.8
>
>         Attachments: TIKA-490.janhoy.082310.patch, TIKA-490.janhoy.082310.patch, TIKA-490.janhoy.092010.patch, TIKA-490.janhoy.092010.patch, TIKA-490.Mattmann.082210.2.patch.txt, TIKA-490.Mattmann.082210.patch.txt, TIKA-490.patch, TIKA-490.patch
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> Currently the Tika LanguageIdentifier loads language profiles thorugh a hardcoded static block in the java code.
> It would be better to make this configurable, so you could add your own languages without recompiling.
> Suggested approach:
> Remove the static code block loading all languages. Instead look for a tika.languageidentification.properties file on classpath.
> Now the user can simply make his/her own (additional) language profile files, put them on the classpath together with a properties file and off you go!
> Also, once you make it configurable, there might be an issue of having the profiles as static members, as you will force the same behaviour for the whole VM. A static Map of Maps could solve this.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (TIKA-490) Support for adding language profiles dynamically

Posted by "Chris A. Mattmann (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/TIKA-490?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris A. Mattmann updated TIKA-490:
-----------------------------------

    Comment: was deleted

(was: Comment from Jan on the ML:

{quote}
My rationale for the override part is as follows:

The default properties file will be embedded within tika-xx.jar
I assume most people are not keen to unpack and repack JARs to make a config change.
We COULD put a similar named properties file at another location, but then the user
needs to make sure that location is EARLIER in classpath than the JAR file.
In the case of e.g. Solr (Tomcat, Jetty..) it is not obvious how to ensure this,
and to avoid any confusion about class-loader peculiarities, it's more straight-forward
to look for an override file.

Take the Solr example. The user would then put the properties file along with
his new language profiles in a folder $SOLR_HOME/lib/org/apache/tika/language/

{quote})

> Support for adding language profiles dynamically
> ------------------------------------------------
>
>                 Key: TIKA-490
>                 URL: https://issues.apache.org/jira/browse/TIKA-490
>             Project: Tika
>          Issue Type: Improvement
>          Components: languageidentifier
>    Affects Versions: 0.7
>            Reporter: Jan Høydahl
>            Assignee: Chris A. Mattmann
>             Fix For: 0.8
>
>         Attachments: TIKA-490.patch, TIKA-490.patch
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> Currently the Tika LanguageIdentifier loads language profiles thorugh a hardcoded static block in the java code.
> It would be better to make this configurable, so you could add your own languages without recompiling.
> Suggested approach:
> Remove the static code block loading all languages. Instead look for a tika.languageidentification.properties file on classpath.
> Now the user can simply make his/her own (additional) language profile files, put them on the classpath together with a properties file and off you go!
> Also, once you make it configurable, there might be an issue of having the profiles as static members, as you will force the same behaviour for the whole VM. A static Map of Maps could solve this.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (TIKA-490) Support for adding language profiles dynamically

Posted by "Jan Høydahl (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-490?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12917718#action_12917718 ] 

Jan Høydahl commented on TIKA-490:
----------------------------------

Chris, Jukka,

Are we approaching a solution on this one?

> Support for adding language profiles dynamically
> ------------------------------------------------
>
>                 Key: TIKA-490
>                 URL: https://issues.apache.org/jira/browse/TIKA-490
>             Project: Tika
>          Issue Type: Improvement
>          Components: languageidentifier
>    Affects Versions: 0.7
>            Reporter: Jan Høydahl
>            Assignee: Chris A. Mattmann
>             Fix For: 0.8
>
>         Attachments: TIKA-490.janhoy.082310.patch, TIKA-490.janhoy.082310.patch, TIKA-490.janhoy.092010.patch, TIKA-490.janhoy.092010.patch, TIKA-490.Mattmann.082210.2.patch.txt, TIKA-490.Mattmann.082210.patch.txt, TIKA-490.patch, TIKA-490.patch
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> Currently the Tika LanguageIdentifier loads language profiles thorugh a hardcoded static block in the java code.
> It would be better to make this configurable, so you could add your own languages without recompiling.
> Suggested approach:
> Remove the static code block loading all languages. Instead look for a tika.languageidentification.properties file on classpath.
> Now the user can simply make his/her own (additional) language profile files, put them on the classpath together with a properties file and off you go!
> Also, once you make it configurable, there might be an issue of having the profiles as static members, as you will force the same behaviour for the whole VM. A static Map of Maps could solve this.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (TIKA-490) Support for adding language profiles dynamically

Posted by "Chris A. Mattmann (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-490?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12901220#action_12901220 ] 

Chris A. Mattmann commented on TIKA-490:
----------------------------------------

Hi Jan:

I like your approach! Definitely a nice set of improvements. I noticed though in the patch there are still places (the unit tests) where we check for Exceptions -- this isn't needed anymore since your updates obviate any exception being thrown. No worries, I'll address this when I commit your patch. Looks good to me. I plan to commit this sometime tonight or tomorrow.

Cheers,
Chris


> Support for adding language profiles dynamically
> ------------------------------------------------
>
>                 Key: TIKA-490
>                 URL: https://issues.apache.org/jira/browse/TIKA-490
>             Project: Tika
>          Issue Type: Improvement
>          Components: languageidentifier
>    Affects Versions: 0.7
>            Reporter: Jan Høydahl
>            Assignee: Chris A. Mattmann
>             Fix For: 0.8
>
>         Attachments: TIKA-490.janhoy.082310.patch, TIKA-490.Mattmann.082210.2.patch.txt, TIKA-490.Mattmann.082210.patch.txt, TIKA-490.patch, TIKA-490.patch
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> Currently the Tika LanguageIdentifier loads language profiles thorugh a hardcoded static block in the java code.
> It would be better to make this configurable, so you could add your own languages without recompiling.
> Suggested approach:
> Remove the static code block loading all languages. Instead look for a tika.languageidentification.properties file on classpath.
> Now the user can simply make his/her own (additional) language profile files, put them on the classpath together with a properties file and off you go!
> Also, once you make it configurable, there might be an issue of having the profiles as static members, as you will force the same behaviour for the whole VM. A static Map of Maps could solve this.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (TIKA-490) Support for adding language profiles dynamically

Posted by "Chris A. Mattmann (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-490?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12901177#action_12901177 ] 

Chris A. Mattmann commented on TIKA-490:
----------------------------------------

Comment from Jan on the ML:

{quote}
My rationale for the override part is as follows:

The default properties file will be embedded within tika-xx.jar
I assume most people are not keen to unpack and repack JARs to make a config change.
We COULD put a similar named properties file at another location, but then the user
needs to make sure that location is EARLIER in classpath than the JAR file.
In the case of e.g. Solr (Tomcat, Jetty..) it is not obvious how to ensure this,
and to avoid any confusion about class-loader peculiarities, it's more straight-forward
to look for an override file.

Take the Solr example. The user would then put the properties file along with
his new language profiles in a folder $SOLR_HOME/lib/org/apache/tika/language/

{quote}

> Support for adding language profiles dynamically
> ------------------------------------------------
>
>                 Key: TIKA-490
>                 URL: https://issues.apache.org/jira/browse/TIKA-490
>             Project: Tika
>          Issue Type: Improvement
>          Components: languageidentifier
>    Affects Versions: 0.7
>            Reporter: Jan Høydahl
>            Assignee: Chris A. Mattmann
>             Fix For: 0.8
>
>         Attachments: TIKA-490.patch, TIKA-490.patch
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> Currently the Tika LanguageIdentifier loads language profiles thorugh a hardcoded static block in the java code.
> It would be better to make this configurable, so you could add your own languages without recompiling.
> Suggested approach:
> Remove the static code block loading all languages. Instead look for a tika.languageidentification.properties file on classpath.
> Now the user can simply make his/her own (additional) language profile files, put them on the classpath together with a properties file and off you go!
> Also, once you make it configurable, there might be an issue of having the profiles as static members, as you will force the same behaviour for the whole VM. A static Map of Maps could solve this.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (TIKA-490) Support for adding language profiles dynamically

Posted by "Chris A. Mattmann (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-490?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12901178#action_12901178 ] 

Chris A. Mattmann commented on TIKA-490:
----------------------------------------

Interesting. What why couldn't we just name the file the same thing, then? Would this be putting it up as a gamble to the Classloader?

> Support for adding language profiles dynamically
> ------------------------------------------------
>
>                 Key: TIKA-490
>                 URL: https://issues.apache.org/jira/browse/TIKA-490
>             Project: Tika
>          Issue Type: Improvement
>          Components: languageidentifier
>    Affects Versions: 0.7
>            Reporter: Jan Høydahl
>            Assignee: Chris A. Mattmann
>             Fix For: 0.8
>
>         Attachments: TIKA-490.patch, TIKA-490.patch
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> Currently the Tika LanguageIdentifier loads language profiles thorugh a hardcoded static block in the java code.
> It would be better to make this configurable, so you could add your own languages without recompiling.
> Suggested approach:
> Remove the static code block loading all languages. Instead look for a tika.languageidentification.properties file on classpath.
> Now the user can simply make his/her own (additional) language profile files, put them on the classpath together with a properties file and off you go!
> Also, once you make it configurable, there might be an issue of having the profiles as static members, as you will force the same behaviour for the whole VM. A static Map of Maps could solve this.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Assigned: (TIKA-490) Support for adding language profiles dynamically

Posted by "Chris A. Mattmann (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/TIKA-490?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris A. Mattmann reassigned TIKA-490:
--------------------------------------

    Assignee: Chris A. Mattmann

> Support for adding language profiles dynamically
> ------------------------------------------------
>
>                 Key: TIKA-490
>                 URL: https://issues.apache.org/jira/browse/TIKA-490
>             Project: Tika
>          Issue Type: Improvement
>          Components: languageidentifier
>    Affects Versions: 0.7
>            Reporter: Jan Høydahl
>            Assignee: Chris A. Mattmann
>             Fix For: 0.8
>
>         Attachments: TIKA-490.patch, TIKA-490.patch
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> Currently the Tika LanguageIdentifier loads language profiles thorugh a hardcoded static block in the java code.
> It would be better to make this configurable, so you could add your own languages without recompiling.
> Suggested approach:
> Remove the static code block loading all languages. Instead look for a tika.languageidentification.properties file on classpath.
> Now the user can simply make his/her own (additional) language profile files, put them on the classpath together with a properties file and off you go!
> Also, once you make it configurable, there might be an issue of having the profiles as static members, as you will force the same behaviour for the whole VM. A static Map of Maps could solve this.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


Re: [jira] Commented: (TIKA-490) Support for adding language profiles dynamically

Posted by "Mattmann, Chris A (388J)" <ch...@jpl.nasa.gov>.
Hi Jan,

Interesting. What why couldn't we just name the file the same thing, then?
Would this be putting it up as a gamble to the Classloader?

Cheers,
Chris


On 8/22/10 8:40 AM, "Jan Høydahl / Cominvent" <ja...@cominvent.com> wrote:

> Hi,
> 
> My rationale for the override part is as follows:
> 
> The default properties file will be embedded within tika-xx.jar
> I assume most people are not keen to unpack and repack JARs to make a config
> change.
> We COULD put a similar named properties file at another location, but then the
> user
> needs to make sure that location is EARLIER in classpath than the JAR file.
> In the case of e.g. Solr (Tomcat, Jetty..) it is not obvious how to ensure
> this,
> and to avoid any confusion about class-loader peculiarities, it's more
> straight-forward
> to look for an override file.
> 
> Take the Solr example. The user would then put the properties file along with
> his new language profiles in a folder $SOLR_HOME/lib/org/apache/tika/language/
> 
> --
> Jan Høydahl, search solution architect
> Cominvent AS - www.cominvent.com
> Training in Europe - www.solrtraining.com
> 
> On 22. aug. 2010, at 16.40, Chris A. Mattmann (JIRA) wrote:
> 
>> 
>>    [ 
>> https://issues.apache.org/jira/browse/TIKA-490?page=com.atlassian.jira.plugin
>> .system.issuetabpanels:comment-tabpanel&focusedCommentId=12901170#action_1290
>> 1170 ]
>> 
>> Chris A. Mattmann commented on TIKA-490:
>> ----------------------------------------
>> 
>> Hi Jan,
>> 
>> I don't get the point of the override properties part. What does it buy us?
>> The way you've set it up, it also loads from the classpath just like the
>> language identifier properties proper file, so, it shouldn't be any more
>> arduous to just mod that file if necessary (since they both are classpath
>> loaded).
>> 
>> Let me know what you think. I've reviewed the rest of the patch it looks good
>> and I'm ready to commit it, sans the override part.
>> 
>> Cheers,
>> Chris
>> 
>> 
>>> Support for adding language profiles dynamically
>>> ------------------------------------------------
>>> 
>>>                Key: TIKA-490
>>>                URL: https://issues.apache.org/jira/browse/TIKA-490
>>>            Project: Tika
>>>         Issue Type: Improvement
>>>         Components: languageidentifier
>>>   Affects Versions: 0.7
>>>           Reporter: Jan Høydahl
>>>           Assignee: Chris A. Mattmann
>>>            Fix For: 0.8
>>> 
>>>        Attachments: TIKA-490.patch, TIKA-490.patch
>>> 
>>>  Original Estimate: 24h
>>> Remaining Estimate: 24h
>>> 
>>> Currently the Tika LanguageIdentifier loads language profiles thorugh a
>>> hardcoded static block in the java code.
>>> It would be better to make this configurable, so you could add your own
>>> languages without recompiling.
>>> Suggested approach:
>>> Remove the static code block loading all languages. Instead look for a
>>> tika.languageidentification.properties file on classpath.
>>> Now the user can simply make his/her own (additional) language profile
>>> files, put them on the classpath together with a properties file and off you
>>> go!
>>> Also, once you make it configurable, there might be an issue of having the
>>> profiles as static members, as you will force the same behaviour for the
>>> whole VM. A static Map of Maps could solve this.
>> 
>> --
>> This message is automatically generated by JIRA.
>> -
>> You can reply to this email to add a comment to the issue online.
>> 
> 
> 


++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: Chris.Mattmann@jpl.nasa.gov
WWW:   http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++



Re: [jira] Commented: (TIKA-490) Support for adding language profiles dynamically

Posted by Jan Høydahl / Cominvent <ja...@cominvent.com>.
Hi,

My rationale for the override part is as follows:

The default properties file will be embedded within tika-xx.jar
I assume most people are not keen to unpack and repack JARs to make a config change.
We COULD put a similar named properties file at another location, but then the user
needs to make sure that location is EARLIER in classpath than the JAR file.
In the case of e.g. Solr (Tomcat, Jetty..) it is not obvious how to ensure this,
and to avoid any confusion about class-loader peculiarities, it's more straight-forward
to look for an override file.

Take the Solr example. The user would then put the properties file along with
his new language profiles in a folder $SOLR_HOME/lib/org/apache/tika/language/

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com
Training in Europe - www.solrtraining.com

On 22. aug. 2010, at 16.40, Chris A. Mattmann (JIRA) wrote:

> 
>    [ https://issues.apache.org/jira/browse/TIKA-490?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12901170#action_12901170 ] 
> 
> Chris A. Mattmann commented on TIKA-490:
> ----------------------------------------
> 
> Hi Jan,
> 
> I don't get the point of the override properties part. What does it buy us? The way you've set it up, it also loads from the classpath just like the language identifier properties proper file, so, it shouldn't be any more arduous to just mod that file if necessary (since they both are classpath loaded). 
> 
> Let me know what you think. I've reviewed the rest of the patch it looks good and I'm ready to commit it, sans the override part.
> 
> Cheers,
> Chris
> 
> 
>> Support for adding language profiles dynamically
>> ------------------------------------------------
>> 
>>                Key: TIKA-490
>>                URL: https://issues.apache.org/jira/browse/TIKA-490
>>            Project: Tika
>>         Issue Type: Improvement
>>         Components: languageidentifier
>>   Affects Versions: 0.7
>>           Reporter: Jan Høydahl
>>           Assignee: Chris A. Mattmann
>>            Fix For: 0.8
>> 
>>        Attachments: TIKA-490.patch, TIKA-490.patch
>> 
>>  Original Estimate: 24h
>> Remaining Estimate: 24h
>> 
>> Currently the Tika LanguageIdentifier loads language profiles thorugh a hardcoded static block in the java code.
>> It would be better to make this configurable, so you could add your own languages without recompiling.
>> Suggested approach:
>> Remove the static code block loading all languages. Instead look for a tika.languageidentification.properties file on classpath.
>> Now the user can simply make his/her own (additional) language profile files, put them on the classpath together with a properties file and off you go!
>> Also, once you make it configurable, there might be an issue of having the profiles as static members, as you will force the same behaviour for the whole VM. A static Map of Maps could solve this.
> 
> -- 
> This message is automatically generated by JIRA.
> -
> You can reply to this email to add a comment to the issue online.
> 


[jira] Commented: (TIKA-490) Support for adding language profiles dynamically

Posted by "Chris A. Mattmann (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-490?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12901170#action_12901170 ] 

Chris A. Mattmann commented on TIKA-490:
----------------------------------------

Hi Jan,

I don't get the point of the override properties part. What does it buy us? The way you've set it up, it also loads from the classpath just like the language identifier properties proper file, so, it shouldn't be any more arduous to just mod that file if necessary (since they both are classpath loaded). 

Let me know what you think. I've reviewed the rest of the patch it looks good and I'm ready to commit it, sans the override part.

Cheers,
Chris


> Support for adding language profiles dynamically
> ------------------------------------------------
>
>                 Key: TIKA-490
>                 URL: https://issues.apache.org/jira/browse/TIKA-490
>             Project: Tika
>          Issue Type: Improvement
>          Components: languageidentifier
>    Affects Versions: 0.7
>            Reporter: Jan Høydahl
>            Assignee: Chris A. Mattmann
>             Fix For: 0.8
>
>         Attachments: TIKA-490.patch, TIKA-490.patch
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> Currently the Tika LanguageIdentifier loads language profiles thorugh a hardcoded static block in the java code.
> It would be better to make this configurable, so you could add your own languages without recompiling.
> Suggested approach:
> Remove the static code block loading all languages. Instead look for a tika.languageidentification.properties file on classpath.
> Now the user can simply make his/her own (additional) language profile files, put them on the classpath together with a properties file and off you go!
> Also, once you make it configurable, there might be an issue of having the profiles as static members, as you will force the same behaviour for the whole VM. A static Map of Maps could solve this.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (TIKA-490) Support for adding language profiles dynamically

Posted by "Jan Høydahl (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/TIKA-490?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jan Høydahl updated TIKA-490:
-----------------------------

    Attachment: TIKA-490.patch

Changed initProfiles() to a static method, and also made the Properties object static. Re-ran tests.

> Support for adding language profiles dynamically
> ------------------------------------------------
>
>                 Key: TIKA-490
>                 URL: https://issues.apache.org/jira/browse/TIKA-490
>             Project: Tika
>          Issue Type: Improvement
>          Components: languageidentifier
>    Affects Versions: 0.7
>            Reporter: Jan Høydahl
>             Fix For: 0.8
>
>         Attachments: TIKA-490.patch, TIKA-490.patch
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> Currently the Tika LanguageIdentifier loads language profiles thorugh a hardcoded static block in the java code.
> It would be better to make this configurable, so you could add your own languages without recompiling.
> Suggested approach:
> Remove the static code block loading all languages. Instead look for a tika.languageidentification.properties file on classpath.
> Now the user can simply make his/her own (additional) language profile files, put them on the classpath together with a properties file and off you go!
> Also, once you make it configurable, there might be an issue of having the profiles as static members, as you will force the same behaviour for the whole VM. A static Map of Maps could solve this.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.