Posted to dev@tika.apache.org by "Aravinth (Jira)" <ji...@apache.org> on 2022/01/18 09:41:00 UTC

[jira] [Comment Edited] (TIKA-3650) Removal of duplicate classes from Xerces in tika-app jar

    [ https://issues.apache.org/jira/browse/TIKA-3650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17477712#comment-17477712 ] 

Aravinth edited comment on TIKA-3650 at 1/18/22, 9:40 AM:
----------------------------------------------------------

The earlier discussion on this issue - [link|https://www.mail-archive.com/user@tika.apache.org/msg03076.html]


was (Author: imaravin):
The earlier discussion on this issue - [link link|https://www.mail-archive.com/user@tika.apache.org/msg03076.html]

> Removal of duplicate classes from Xerces in tika-app jar
> --------------------------------------------------------
>
>                 Key: TIKA-3650
>                 URL: https://issues.apache.org/jira/browse/TIKA-3650
>             Project: Tika
>          Issue Type: Improvement
>         Environment: Java 8 
>            Reporter: Aravinth
>            Priority: Major
>
> The javax.xml.parsers.DocumentBuilderFactory class is present in both rt.jar from the JDK and tika-app.jar.
> We use a child-first classloader to isolate the tika-app jar from the rest of the classpath when parsing files, so the child-first classloader loads the abstract DocumentBuilderFactory class from the tika-app jar.
> If tika-app.jar did not contain the DocumentBuilderFactory class, it would be loaded from rt.jar.
> Inside the service loader, a check validates that the implementation class is assignable to the factory class. That check fails for us because the factory class is loaded from the tika-app jar.
> {{public static DocumentBuilderFactory newInstance() {}}
> {{ return FactoryFinder.find(}}
> {{ /* The default property name according to the JAXP spec */}}
> {{ DocumentBuilderFactory.class, // "javax.xml.parsers.DocumentBuilderFactory"}}
> {{ /* The fallback implementation class name */}}
> {{ "com.sun.org.apache.xerces.internal.jaxp.DocumentBuilderFactoryImpl");}}
> {{}}}
>  
> DocumentBuilderFactory.class: this class literal is resolved by the classloader that defined the code containing it, not by the thread context classloader. In the JDK's copy of newInstance() it therefore always refers to the class object from the default (bootstrap) classloader.
> During Tika parsing, however, the active classloader is the child-first classloader rather than the default one, and it loads both the factory class and its implementation from the tika-app jar.
> Because DocumentBuilderFactory.class comes from the default classloader while the implementation class org.apache.xerces.jaxp.DocumentBuilderFactoryImpl (together with its own copy of the factory class) is defined by the child-first classloader,
> the two are not assignable to each other.
> In the normal scenario (I assume most of us use a parent-first classloader), javax.xml.parsers.DocumentBuilderFactory is always loaded from rt.jar, which Java 8 ships. The copy of javax.xml.parsers.DocumentBuilderFactory inside the tika-app jar is therefore redundant.
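
The assignability failure described above can be reproduced without Tika or Xerces. The following is a minimal sketch, not the reporter's actual setup: ChildFirstDemo, Factory, FactoryImpl and ChildFirstClassLoader are hypothetical stand-ins (Factory plays the role of javax.xml.parsers.DocumentBuilderFactory, FactoryImpl the role of the Xerces implementation). It assumes the classes are compiled with javac and available as .class files on the classpath, since the child loader re-reads their bytes from there.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

class ChildFirstDemo {

    /** Hypothetical stand-in for javax.xml.parsers.DocumentBuilderFactory. */
    public abstract static class Factory { }

    /** Hypothetical stand-in for a Xerces-style implementation class. */
    public static class FactoryImpl extends Factory { }

    /** Minimal child-first loader: defines matching classes itself before asking the parent. */
    static class ChildFirstClassLoader extends ClassLoader {
        private final String prefix;

        ChildFirstClassLoader(ClassLoader parent, String prefix) {
            super(parent);
            this.prefix = prefix;
        }

        @Override
        protected Class<?> loadClass(String name, boolean resolve) throws ClassNotFoundException {
            synchronized (getClassLoadingLock(name)) {
                Class<?> c = findLoadedClass(name);
                if (c == null && name.startsWith(prefix)) {
                    c = defineFromParentBytes(name); // child first
                }
                if (c == null) {
                    return super.loadClass(name, resolve); // fall back to the parent
                }
                if (resolve) {
                    resolveClass(c);
                }
                return c;
            }
        }

        /** Reads the class file through the parent and defines a second copy in this loader. */
        private Class<?> defineFromParentBytes(String name) throws ClassNotFoundException {
            String path = name.replace('.', '/') + ".class";
            try (InputStream in = getParent().getResourceAsStream(path)) {
                if (in == null) {
                    return null;
                }
                ByteArrayOutputStream out = new ByteArrayOutputStream();
                byte[] buf = new byte[4096];
                int n;
                while ((n = in.read(buf)) != -1) {
                    out.write(buf, 0, n);
                }
                byte[] bytes = out.toByteArray();
                return defineClass(name, bytes, 0, bytes.length);
            } catch (IOException e) {
                throw new ClassNotFoundException(name, e);
            }
        }
    }

    /** Loads FactoryImpl (and, transitively, its own copy of Factory) child-first. */
    static Class<?> loadIsolatedImpl() {
        ClassLoader parent = ChildFirstDemo.class.getClassLoader();
        ClassLoader child =
                new ChildFirstClassLoader(parent, ChildFirstDemo.class.getName() + "$");
        try {
            return child.loadClass(FactoryImpl.class.getName());
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException("class bytes not found on the classpath", e);
        }
    }

    public static void main(String[] args) {
        Class<?> isolatedImpl = loadIsolatedImpl();
        // The class literal below is resolved by the loader of ChildFirstDemo (the parent).
        System.out.println("assignable to parent's Factory: "
                + Factory.class.isAssignableFrom(isolatedImpl));
        System.out.println("superclass has the same name:   "
                + isolatedImpl.getSuperclass().getName().equals(Factory.class.getName()));
    }
}
```

Under those assumptions the first check should report false and the second true: the isolated implementation extends a class with the same name as Factory, but that class was defined by a different loader, so the JVM treats it as a different type. This is the same situation as an implementation from tika-app.jar being checked against the DocumentBuilderFactory from rt.jar.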



--
This message was sent by Atlassian Jira
(v8.20.1#820001)