Posted to dev@tika.apache.org by "Hudson (Jira)" <ji...@apache.org> on 2021/04/09 19:13:00 UTC

[jira] [Commented] (TIKA-3343) Move Tika's legacy lang id to its own submodule for Tika 2.0

    [ https://issues.apache.org/jira/browse/TIKA-3343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17318217#comment-17318217 ] 

Hudson commented on TIKA-3343:
------------------------------

UNSTABLE: Integrated in Jenkins build Tika » tika-main-jdk8 #190 (See [https://ci-builds.apache.org/job/Tika/job/tika-main-jdk8/190/])
TIKA-3343 -- move Tika's legacy lang detector to its own submodule in tika-langdetect (tallison: [https://github.com/apache/tika/commit/40690250f9b20703b7a92da17d1f42e6108e9109])
* (delete) tika-core/src/main/resources/org/apache/tika/language/en.ngp
* (delete) tika-core/src/main/resources/org/apache/tika/language/ru.ngp
* (delete) tika-core/src/test/resources/org/apache/tika/language/el.test
* (delete) tika-core/src/test/resources/org/apache/tika/language/en.test
* (delete) tika-core/src/main/java/org/apache/tika/language/ProfilingHandler.java
* (add) tika-langdetect/tika-langdetect-tika/src/test/resources/org/apache/tika/langdetect/tika/de.test
* (add) tika-langdetect/tika-langdetect-tika/src/test/resources/org/apache/tika/langdetect/tika/lt.test
* (delete) tika-core/src/test/resources/org/apache/tika/language/lt.test
* (add) tika-langdetect/tika-langdetect-tika/src/test/resources/org/apache/tika/langdetect/tika/et.test
* (edit) tika-langdetect/pom.xml
* (add) tika-langdetect/tika-langdetect-tika/src/test/java/org/apache/tika/langdetect/tika/LanguageIdentifierTest.java
* (delete) tika-core/src/main/resources/org/apache/tika/language/lt.ngp
* (add) tika-langdetect/tika-langdetect-tika/src/test/resources/org/apache/tika/langdetect/tika/it.test
* (add) tika-langdetect/tika-langdetect-tika/src/test/resources/org/apache/tika/langdetect/tika/sv.test
* (delete) tika-core/src/main/resources/org/apache/tika/language/da.ngp
* (delete) tika-core/src/main/resources/org/apache/tika/language/et.ngp
* (delete) tika-core/src/main/resources/org/apache/tika/language/gl.ngp
* (add) tika-langdetect/tika-langdetect-tika/src/test/resources/org/apache/tika/langdetect/tika/langbuilder/welsh_corpus.txt
* (add) tika-langdetect/tika-langdetect-tika/src/test/resources/org/apache/tika/langdetect/tika/en.test
* (delete) tika-core/src/main/resources/org/apache/tika/language/sv.ngp
* (delete) tika-core/src/test/resources/org/apache/tika/language/et.test
* (delete) tika-langdetect/overview.html
* (delete) tika-core/src/main/java/org/apache/tika/language/LanguageProfilerBuilder.java
* (delete) tika-core/src/test/resources/org/apache/tika/language/da.test
* (add) tika-langdetect/tika-langdetect-tika/src/main/java/org/apache/tika/langdetect/tika/TikaLanguageDetector.java
* (delete) tika-core/src/test/java/org/apache/tika/language/ProfilingWriterTest.java
* (delete) tika-core/src/main/resources/org/apache/tika/language/th.ngp
* (delete) tika-core/src/main/resources/org/apache/tika/language/hu.ngp
* (delete) tika-core/src/main/resources/org/apache/tika/language/nl.ngp
* (delete) tika-core/src/main/resources/org/apache/tika/language/is.ngp
* (delete) tika-core/src/main/java/org/apache/tika/language/LanguageProfile.java
* (delete) tika-core/src/main/resources/org/apache/tika/language/pt.ngp
* (delete) tika-core/src/main/resources/org/apache/tika/language/fa.ngp
* (delete) tika-core/src/test/java/org/apache/tika/language/LanguageProfileTest.java
* (add) tika-langdetect/tika-langdetect-tika/src/main/java/org/apache/tika/langdetect/tika/LanguageProfilerBuilder.java
* (add) tika-langdetect/tika-langdetect-tika/src/test/resources/org/apache/tika/langdetect/tika/nl.test
* (add) tika-langdetect/tika-langdetect-tika/src/test/java/org/apache/tika/langdetect/tika/ProfilingWriterTest.java
* (add) tika-langdetect/tika-langdetect-tika/src/test/resources/org/apache/tika/langdetect/tika/el.test
* (delete) tika-core/src/main/resources/org/apache/tika/language/be.ngp
* (delete) tika-core/src/test/java/org/apache/tika/language/LanguageIdentifierTest.java
* (delete) tika-core/src/main/resources/org/apache/tika/language/fr.ngp
* (add) tika-langdetect/tika-langdetect-tika/src/test/java/org/apache/tika/langdetect/tika/LanguageProfileTest.java
* (add) tika-langdetect/tika-langdetect-tika/src/main/resources/META-INF/services/org.apache.tika.language.detect.LanguageDetector
* (delete) tika-core/src/test/resources/org/apache/tika/language/it.test
* (delete) tika-core/src/test/resources/org/apache/tika/language/nl.test
* (delete) tika-core/src/test/resources/org/apache/tika/language/pt.test
* (delete) tika-core/src/main/resources/org/apache/tika/language/de.ngp
* (delete) tika-core/src/main/resources/org/apache/tika/language/uk.ngp
* (delete) tika-core/src/main/resources/org/apache/tika/language/ca.ngp
* (delete) tika-core/src/main/resources/org/apache/tika/language/no.ngp
* (delete) tika-core/src/main/resources/org/apache/tika/language/sl.ngp
* (add) tika-langdetect/tika-langdetect-tika/src/main/java/org/apache/tika/langdetect/tika/LanguageProfile.java
* (delete) tika-core/src/main/resources/org/apache/tika/language/eo.ngp
* (add) tika-langdetect/tika-langdetect-tika/src/test/resources/org/apache/tika/langdetect/tika/da.test
* (delete) tika-core/src/main/resources/org/apache/tika/language/pl.ngp
* (delete) tika-core/src/test/resources/org/apache/tika/language/fr.test
* (delete) tika-core/src/test/resources/org/apache/tika/language/langbuilder/welsh_corpus.txt
* (delete) tika-core/src/main/resources/org/apache/tika/language/el.ngp
* (add) tika-langdetect/tika-langdetect-tika/pom.xml
* (delete) tika-core/src/main/resources/org/apache/tika/language/it.ngp
* (add) tika-langdetect/tika-langdetect-tika/src/main/java/org/apache/tika/langdetect/tika/ProfilingWriter.java
* (delete) tika-core/src/test/java/org/apache/tika/language/LanguageProfilerBuilderTest.java
* (delete) tika-core/src/test/resources/org/apache/tika/language/de.test
* (add) tika-langdetect/tika-langdetect-tika/src/test/java/org/apache/tika/langdetect/tika/ProfilingHandler.java
* (delete) tika-core/src/main/resources/org/apache/tika/language/ro.ngp
* (add) tika-langdetect/tika-langdetect-tika/src/test/resources/org/apache/tika/langdetect/tika/pt.test
* (delete) tika-core/src/main/java/org/apache/tika/language/ProfilingWriter.java
* (add) tika-langdetect/tika-langdetect-tika/src/test/java/org/apache/tika/langdetect/tika/LanguageProfilerBuilderTest.java
* (delete) tika-core/src/main/resources/org/apache/tika/language/sk.ngp
* (delete) tika-core/src/main/resources/org/apache/tika/language/tika.language.properties
* (add) tika-langdetect/tika-langdetect-tika/src/test/resources/org/apache/tika/langdetect/tika/fi.test
* (add) tika-langdetect/tika-langdetect-tika/src/test/resources/org/apache/tika/langdetect/tika/fr.test
* (delete) tika-core/src/main/resources/org/apache/tika/language/es.ngp
* (add) tika-langdetect/tika-langdetect-tika/src/test/resources/org/apache/tika/langdetect/tika/es.test
* (delete) tika-core/src/main/resources/org/apache/tika/language/fi.ngp
* (delete) tika-core/src/main/java/org/apache/tika/language/LanguageIdentifier.java
* (add) tika-langdetect/tika-langdetect-tika/src/main/java/org/apache/tika/langdetect/tika/LanguageIdentifier.java
* (delete) tika-core/src/test/resources/org/apache/tika/language/fi.test
* (delete) tika-core/src/test/resources/org/apache/tika/language/sv.test
* (delete) tika-core/src/test/resources/org/apache/tika/language/es.test


> Move Tika's legacy lang id to its own submodule for Tika 2.0
> ------------------------------------------------------------
>
>                 Key: TIKA-3343
>                 URL: https://issues.apache.org/jira/browse/TIKA-3343
>             Project: Tika
>          Issue Type: Task
>            Reporter: Tim Allison
>            Priority: Major
>             Fix For: 2.0.0
>
>
> In the back of my mind, this was an agreed-upon change for 2.x. I can't find documentation, though, so I'm opening this issue to discuss it.
> My memory is that we agreed to outsource language identification to other tools and remove our own language identifier for 2.x. If my memory is wrong, or if there's a good reason to keep our language detection algorithm and data, let's discuss.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)