Posted to dev@tika.apache.org by "Andrei Dobrescu (Jira)" <ji...@apache.org> on 2021/05/11 15:05:00 UTC

[jira] [Comment Edited] (TIKA-3392) Apache Tika V1.26 doesn't work on Android anymore. Issue with org.xml dependencies.

    [ https://issues.apache.org/jira/browse/TIKA-3392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17342626#comment-17342626 ] 

Andrei Dobrescu edited comment on TIKA-3392 at 5/11/21, 3:04 PM:
-----------------------------------------------------------------

I did a bit of research before posting this issue. Here is what I found:
 - Every Android app bundles its own dependencies. If app A1 imports library L1 at version V1 and app A2 imports L1 at version V2, that is fine, because each APK is self-contained.
 - The exception is the set of classes from the Android SDK. The SDK is the only system-level library, common to all apps. It is tightly bound to the Android OS version (so there is one version of the SDK for each OS version). It contains Java SE classes and Android-specific classes, such as the UI toolkit. The problem is that, when Android was developed, some genius at Google thought it was a good idea to put the JSON.org, Apache HTTP client, org.xml.* and org.xmlpull.* libraries into the SDK. [You can find the documentation of the SDK here|https://developer.android.com/reference/packages]

As you can see, the SDK contains an implementation of org.xml.sax. I can import the latest Apache Xerces, but org.xml.* classes will always resolve to the ones from the SDK. The SDK classes don't support "secure-processing", and because of that the Tika library crashes.
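
To check this on a given platform, a small probe along these lines could be used. It is only a sketch: SecureProcessingProbe and supportsSecureProcessing are hypothetical names of mine, not part of Tika or the Android SDK. It asks whatever SAXParserFactory the platform resolves to whether it accepts the secure-processing feature.

```java
import javax.xml.XMLConstants;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.SAXNotRecognizedException;
import org.xml.sax.SAXNotSupportedException;

// Hypothetical helper for illustration; not part of Tika or Android.
public class SecureProcessingProbe {

    /** Returns true if the factory accepts FEATURE_SECURE_PROCESSING, false if it rejects it. */
    public static boolean supportsSecureProcessing(SAXParserFactory factory) {
        try {
            // Same call that fails in the stack trace of this issue.
            factory.setFeature(XMLConstants.FEATURE_SECURE_PROCESSING, true);
            return true;
        } catch (ParserConfigurationException
                | SAXNotRecognizedException
                | SAXNotSupportedException e) {
            // The platform parser does not know or does not allow the feature.
            return false;
        }
    }

    public static void main(String[] args) {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        System.out.println(factory.getClass().getName()
                + " supports secure-processing: "
                + supportsSecureProcessing(factory));
    }
}
```

Printing the factory class name also shows which implementation was actually picked up. In the trace of this issue it is org.apache.harmony.xml.parsers.SAXParserFactoryImpl.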

I can think of 3 solutions to this problem:
 - Google could update or remove the org.xml.* classes from the SDK. This surely won't happen.
 - I can stop using Tika and switch to another MIME type detector, such as the Linux file command: [like this|https://stackoverflow.com/a/2227201/11536597]. I could compile the [source code|http://www.darwinsys.com/file/] to target Android, then bundle the native library.
 - Tika could stop using the secure-processing XML feature. Why is it even needed? Is it important? Can the library work without it? It crashes at MimeTypesReader.java:429, in the newSAXParser method, on the call factory.setFeature(XMLConstants.FEATURE_SECURE_PROCESSING, true);
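
A softer variant of the third option would be to keep requesting the feature but not fail when the platform parser rejects it. The following is a rough sketch of that idea and not Tika's actual code: LenientSaxParsers and trySetFeature are hypothetical names, and it assumes that running without secure-processing is acceptable for the XML being parsed. The feature asks the parser to apply its security limits, so skipping it weakens protection against hostile XML.

```java
import javax.xml.XMLConstants;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.SAXException;
import org.xml.sax.SAXNotRecognizedException;
import org.xml.sax.SAXNotSupportedException;

// Hypothetical sketch; names are illustrative and not taken from Tika.
public class LenientSaxParsers {

    /** Tries to set a feature; returns false instead of throwing if the factory rejects it. */
    static boolean trySetFeature(SAXParserFactory factory, String feature, boolean value) {
        try {
            factory.setFeature(feature, value);
            return true;
        } catch (ParserConfigurationException
                | SAXNotRecognizedException
                | SAXNotSupportedException e) {
            return false;
        }
    }

    /** Creates a SAX parser, enabling secure-processing only where the platform supports it. */
    public static SAXParser newSAXParser() throws ParserConfigurationException, SAXException {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        boolean secure = trySetFeature(factory, XMLConstants.FEATURE_SECURE_PROCESSING, true);
        if (!secure) {
            // Assumption: the caller accepts reduced hardening on this platform.
            System.err.println("secure-processing not supported by "
                    + factory.getClass().getName() + ", continuing without it");
        }
        return factory.newSAXParser();
    }
}
```

With something like this, initialization would no longer abort on parsers that don't recognize the feature, while platforms that do support it would behave as before.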


was (Author: andob):
I did a bit of research before posting this issue. Here is what I found:
- Every Android app bundles its own dependencies. If app A1 imports library L1 at version V1 and app A2 imports L1 at version V2, that is fine, because each APK is self-contained.
- The exception is the set of classes from the Android SDK. The SDK is the only system-level library, common to all apps. It contains Java SE classes and Android-specific classes, such as the UI toolkit. The problem is that, when Android was developed, some genius at Google thought it was a good idea to put the JSON.org, Apache HTTP client, org.xml.* and org.xmlpull.* libraries into the SDK. [You can find the documentation of the SDK here|https://developer.android.com/reference/packages]


As you can see, the SDK contains an implementation of org.xml.sax. I can import the latest Apache Xerces, but org.xml.* classes will always resolve to the ones from the SDK. The SDK classes don't support "secure-processing", and because of that the Tika library crashes.

I can think of 3 solutions to this problem:
- Google could update or remove the org.xml.* classes from the SDK. This surely won't happen.
- I can stop using Tika and switch to another MIME type detector, such as the Linux file command: [like this|https://stackoverflow.com/a/2227201/11536597]. I could compile the [source code|http://www.darwinsys.com/file/] to target Android, then bundle the native library.
- Tika could stop using the secure-processing XML feature. Why is it even needed? Is it important? Can the library work without it? It crashes at MimeTypesReader.java:429, in the newSAXParser method, on the call factory.setFeature(XMLConstants.FEATURE_SECURE_PROCESSING, true);

> Apache Tika V1.26 doesn't work on Android anymore. Issue with org.xml dependencies.
> -----------------------------------------------------------------------------------
>
>                 Key: TIKA-3392
>                 URL: https://issues.apache.org/jira/browse/TIKA-3392
>             Project: Tika
>          Issue Type: Bug
>          Components: core
>    Affects Versions: 1.26
>         Environment: Android 11
>            Reporter: Andrei Dobrescu
>            Priority: Major
>              Labels: android
>         Attachments: image-2021-05-11-17-53-58-291.png
>
>
> I use Apache Tika on Android to detect the MIME type of various files.
> Apache Tika V1.10 works fine on Android:
> {code:java}
> implementation 'org.apache.tika:tika-core:1.10'
> {code}
> {code:java}
> val mimeType = file.inputStream().buffered().use { inputStream ->
>     AutoDetectParser().detector.detect(inputStream, Metadata()).toString()
> }
> {code}
> However, Tika V1.26 will crash when trying to detect the mime type:
> {code:java}
> implementation 'org.apache.tika:tika-core:1.26'
> {code}
> {noformat}
> java.lang.ExceptionInInitializerError
>     at org.apache.tika.mime.MimeTypesFactory.create(MimeTypesFactory.java:69)
>     at org.apache.tika.mime.MimeTypesFactory.create(MimeTypesFactory.java:100)
>     at org.apache.tika.mime.MimeTypesFactory.create(MimeTypesFactory.java:189)
>     at org.apache.tika.mime.MimeTypes.getDefaultMimeTypes(MimeTypes.java:604)
>     at org.apache.tika.config.TikaConfig.getDefaultMimeTypes(TikaConfig.java:83)
>     at org.apache.tika.config.TikaConfig.<init>(TikaConfig.java:257)
>     at org.apache.tika.config.TikaConfig.getDefaultConfig(TikaConfig.java:422)
>     at org.apache.tika.parser.AutoDetectParser.<init>(AutoDetectParser.java:55)
> {noformat}
> {noformat}
> CAUSE:
> java.lang.RuntimeException: problem initializing SAXParser pool
>         at org.apache.tika.mime.MimeTypesReader.<clinit>(MimeTypesReader.java:119)
>         at org.apache.tika.mime.MimeTypesFactory.create(MimeTypesFactory.java:69)
>         at org.apache.tika.mime.MimeTypesFactory.create(MimeTypesFactory.java:100)
>         at org.apache.tika.mime.MimeTypesFactory.create(MimeTypesFactory.java:189)
>         at org.apache.tika.mime.MimeTypes.getDefaultMimeTypes(MimeTypes.java:604)
>         at org.apache.tika.config.TikaConfig.getDefaultMimeTypes(TikaConfig.java:83)
>         at org.apache.tika.config.TikaConfig.<init>(TikaConfig.java:257)
>         at org.apache.tika.config.TikaConfig.getDefaultConfig(TikaConfig.java:422)
>         at org.apache.tika.parser.AutoDetectParser.<init>(AutoDetectParser.java:55)
> {noformat}
> {noformat}
> CAUSE OF CAUSE:
>  org.apache.tika.exception.TikaException: problem creating SAX parser factory
>      at org.apache.tika.mime.MimeTypesReader.newSAXParser(MimeTypesReader.java:433)
>      at org.apache.tika.mime.MimeTypesReader.setPoolSize(MimeTypesReader.java:417)
>      at org.apache.tika.mime.MimeTypesReader.<clinit>(MimeTypesReader.java:117)
>      at org.apache.tika.mime.MimeTypesFactory.create(MimeTypesFactory.java:69)
>      at org.apache.tika.mime.MimeTypesFactory.create(MimeTypesFactory.java:100)
>      at org.apache.tika.mime.MimeTypesFactory.create(MimeTypesFactory.java:189)
>      at org.apache.tika.mime.MimeTypes.getDefaultMimeTypes(MimeTypes.java:604)
>      at org.apache.tika.config.TikaConfig.getDefaultMimeTypes(TikaConfig.java:83)
>      at org.apache.tika.config.TikaConfig.<init>(TikaConfig.java:257)
>      at org.apache.tika.config.TikaConfig.getDefaultConfig(TikaConfig.java:422)
>      at org.apache.tika.parser.AutoDetectParser.<init>(AutoDetectParser.java:55)
> {noformat}
> {noformat}
> CAUSE OF CAUSE OF CAUSE:
> org.xml.sax.SAXNotRecognizedException: http://javax.xml.XMLConstants/feature/secure-processing
>      at org.apache.harmony.xml.parsers.SAXParserFactoryImpl.setFeature(SAXParserFactoryImpl.java:93)
>      at org.apache.tika.mime.MimeTypesReader.newSAXParser(MimeTypesReader.java:429)
>      at org.apache.tika.mime.MimeTypesReader.setPoolSize(MimeTypesReader.java:417)
>      at org.apache.tika.mime.MimeTypesReader.<clinit>(MimeTypesReader.java:117)
>      at org.apache.tika.mime.MimeTypesFactory.create(MimeTypesFactory.java:69)
>      at org.apache.tika.mime.MimeTypesFactory.create(MimeTypesFactory.java:100)
>      at org.apache.tika.mime.MimeTypesFactory.create(MimeTypesFactory.java:189)
>      at org.apache.tika.mime.MimeTypes.getDefaultMimeTypes(MimeTypes.java:604)
>      at org.apache.tika.config.TikaConfig.getDefaultMimeTypes(TikaConfig.java:83)
>      at org.apache.tika.config.TikaConfig.<init>(TikaConfig.java:257)
>      at org.apache.tika.config.TikaConfig.getDefaultConfig(TikaConfig.java:422)
>      at org.apache.tika.parser.AutoDetectParser.<init>(AutoDetectParser.java:55)
> {noformat}
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)