Posted to dev@any23.apache.org by "Lewis John McGibbney (Created) (JIRA)" <ji...@apache.org> on 2012/01/06 15:49:39 UTC

[jira] [Created] (ANY23-26) Upgrade dependency to Apache Tika 1.0

Upgrade dependency to Apache Tika 1.0
-------------------------------------

                 Key: ANY23-26
                 URL: https://issues.apache.org/jira/browse/ANY23-26
             Project: Apache Any23
          Issue Type: Improvement
            Reporter: Lewis John McGibbney


Upgrading to Apache Tika will hopefully provide a wealth of benefits for the project. This issue should act as an umbrella issue to track these changes. It would be great to delegate as much as possible to Tika if deemed suitable to enhance functionality and to reduce our dependencies on external projects.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Updated] (ANY23-26) Upgrade dependency to Apache Tika 1.0

Posted by "Lewis John McGibbney (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/ANY23-26?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Lewis John McGibbney updated ANY23-26:
--------------------------------------

    Affects Version/s: 0.7.0
        Fix Version/s: 0.8.0
    
> Upgrade dependency to Apache Tika 1.0
> -------------------------------------
>
>                 Key: ANY23-26
>                 URL: https://issues.apache.org/jira/browse/ANY23-26
>             Project: Apache Any23
>          Issue Type: Improvement
>    Affects Versions: 0.7.0
>            Reporter: Lewis John McGibbney
>             Fix For: 0.8.0
>
>
> Upgrading to Apache Tika will hopefully provide a wealth of benefits for the project. This issue should act as an umbrella issue to track these changes. It would be great to delegate as much as possible to Tika if deemed suitable to enhance functionality and to reduce our dependencies on external projects.


        

[jira] [Issue Comment Edited] (ANY23-26) Upgrade dependency to Apache Tika 1.1

Posted by "Lewis John McGibbney (Issue Comment Edited) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/ANY23-26?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13254789#comment-13254789 ] 

Lewis John McGibbney edited comment on ANY23-26 at 4/16/12 4:23 PM:
--------------------------------------------------------------------

Initial WIP. This breaks HCardExtractorTest#testImgSrcDataUrl and #testObjectDataDataUri. 

I've attached my failing tests, along with the two HTML documents which the tests currently fail on. They both seem to be failing on either AbstractExtractorTestCase#assertExtract or HCardExtractorTest#assertDefaultVCard... 

For reference, we only use Tika core and parsers in the following two classes:

./core/src/main/java/org/apache/any23/mime/TikaMIMETypeDetector.java
./core/src/main/java/org/apache/any23/encoding/TikaEncodingDetector.java
                
      was (Author: lewismc):
    Initial WIP. This breaks HCardExtractorTest#testImgSrcDataUrl and #testObjectDataDataUri. 

I've attached my failing tests, along with the two HTML documents which the tests currently fail on. They both seem to be failing on either AbstractExtractorTestCase#assertExtract or HCardExtractorTest#assertDefaultVCard... 

For reference, we only use Tika core and parsers in the following two classes:

./core/src/main/java/org/apache/any23/mime/TikaMIMETypeDetector.java:import org.apache.tika.mime.MimeTypes;
./core/src/main/java/org/apache/any23/encoding/TikaEncodingDetector.java:import org.apache.tika.parser.txt.CharsetDetector;  
                  
> Upgrade dependency to Apache Tika 1.1
> -------------------------------------
>
>                 Key: ANY23-26
>                 URL: https://issues.apache.org/jira/browse/ANY23-26
>             Project: Apache Any23
>          Issue Type: Improvement
>    Affects Versions: 0.7.0
>            Reporter: Lewis John McGibbney
>             Fix For: 0.8.0
>
>         Attachments: 14-img-src-data-url.html, 19-object-data-data-uri.html, ANY23-26.patch, org.apache.any23.extractor.html.HCardExtractorTest.txt
>
>
> Upgrading to Apache Tika will hopefully provide a wealth of benefits for the project. This issue should act as an umbrella issue to track these changes. It would be great to delegate as much as possible to Tika if deemed suitable to enhance functionality and to reduce our dependencies on external projects.


        

[jira] [Commented] (ANY23-26) Upgrade dependency to Apache Tika 1.1

Posted by "Peter Ansell (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/ANY23-26?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13433871#comment-13433871 ] 

Peter Ansell commented on ANY23-26:
-----------------------------------

The Xerces DOM parser seems to be corrupted by the data: URI in the test document for testObjectDataDataUri after the upgrade to Tika-1.2, but not before it, even though the Xerces version did not change (2.9.1 both before and after). It may have something to do with the JDOM and DOM4J dependency changes underneath, but I am not sure how to proceed with debugging that.

Before the upgrade the full DOM is visible in the debugger with a breakpoint in TagSoupParser.getDOM(), but after the upgrade the BODY element has a first and only child node of type text with unrecognisable characters.
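
The check described above (looking at what ends up under BODY) can also be scripted rather than done in the debugger. The following is only a minimal sketch: it uses the JDK's built-in JAXP parser, not Any23's TagSoupParser, and the class name, method name and sample document are all hypothetical. It shows the shape of the assertion, not a reproduction of the failure.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

// Hypothetical helper, not part of Any23: summarises the direct children of BODY.
final class BodyChildInspector {

    /** Parses well-formed XHTML and lists the node kinds found directly under body. */
    static String describeBodyChildren(String xhtml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xhtml.getBytes(StandardCharsets.UTF_8)));
            NodeList bodies = doc.getElementsByTagName("body");
            if (bodies.getLength() == 0) {
                return "no body element";
            }
            StringBuilder summary = new StringBuilder();
            for (Node child = bodies.item(0).getFirstChild(); child != null; child = child.getNextSibling()) {
                if (summary.length() > 0) {
                    summary.append(", ");
                }
                if (child.getNodeType() == Node.ELEMENT_NODE) {
                    summary.append("element:").append(child.getNodeName());
                } else if (child.getNodeType() == Node.TEXT_NODE) {
                    summary.append("text");
                } else {
                    summary.append("other");
                }
            }
            return summary.toString();
        } catch (Exception e) {
            // Wrap checked parser exceptions so callers do not need a throws clause.
            throw new IllegalStateException("Could not parse document", e);
        }
    }

    public static void main(String[] args) {
        // Invented sample input with a data: URI, standing in for the real test document.
        String xhtml = "<html><body><object data=\"data:text/plain;base64,SGVsbG8=\"/></body></html>";
        System.out.println(describeBodyChildren(xhtml)); // prints "element:object"
    }
}
```

A healthy parse of the sample reports an element child; the symptom described above would show up as a single "text" entry instead.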
                
> Upgrade dependency to Apache Tika 1.1
> -------------------------------------
>
>                 Key: ANY23-26
>                 URL: https://issues.apache.org/jira/browse/ANY23-26
>             Project: Apache Any23
>          Issue Type: Improvement
>    Affects Versions: 0.7.0
>            Reporter: Lewis John McGibbney
>             Fix For: 0.8.0
>
>         Attachments: 14-img-src-data-url.html, 19-object-data-data-uri.html, ANY23-26.patch, org.apache.any23.extractor.html.HCardExtractorTest.txt, tika-1.2-dependency-tree-compare.txt
>
>
> Upgrading to Apache Tika will hopefully provide a wealth of benefits for the project. This issue should act as an umbrella issue to track these changes. It would be great to delegate as much as possible to Tika if deemed suitable to enhance functionality and to reduce our dependencies on external projects.


        

[jira] [Comment Edited] (ANY23-26) Upgrade dependency to Apache Tika 1.1

Posted by "Peter Ansell (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/ANY23-26?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13433853#comment-13433853 ] 

Peter Ansell edited comment on ANY23-26 at 8/14/12 2:34 PM:
------------------------------------------------------------

I am going to do some refactoring to replace the current Extractor factory patterns that rely on reflection with java.util.ServiceLoader equivalents before coming back to this issue.
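
For anyone unfamiliar with the pattern, a java.util.ServiceLoader lookup roughly takes the following form. This is a sketch under assumptions: the interface, provider and class names are hypothetical and are not Any23's actual extractor factory types.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.ServiceLoader;

// Hypothetical service interface; Any23's real factory types may look different.
interface ExampleExtractorFactory {
    String getExtractorName();
}

// Hypothetical provider. In a real deployment this would be a public class in its
// own file with a public no-argument constructor, and its fully qualified name
// would be listed in a META-INF/services/<interface name> file on the class path.
final class ExampleHCardExtractorFactory implements ExampleExtractorFactory {
    @Override
    public String getExtractorName() {
        return "example-hcard";
    }
}

final class ExtractorDiscovery {

    /** Returns the names of all providers that ServiceLoader can find on the class path. */
    static List<String> discoverExtractorNames() {
        List<String> names = new ArrayList<>();
        // Providers are instantiated lazily as the loader is iterated.
        for (ExampleExtractorFactory factory : ServiceLoader.load(ExampleExtractorFactory.class)) {
            names.add(factory.getExtractorName());
        }
        return names;
    }

    public static void main(String[] args) {
        // With no META-INF/services registration present, the list is empty.
        System.out.println(discoverExtractorNames());
    }
}
```

The appeal over reflection-based factories is that registration becomes declarative: adding an extractor means adding a line to a services file rather than touching lookup code.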
                
      was (Author: p_ansell):
    The point at which the two similar tests, 18 and 19 are different is that in:

                  
> Upgrade dependency to Apache Tika 1.1
> -------------------------------------
>
>                 Key: ANY23-26
>                 URL: https://issues.apache.org/jira/browse/ANY23-26
>             Project: Apache Any23
>          Issue Type: Improvement
>    Affects Versions: 0.7.0
>            Reporter: Lewis John McGibbney
>             Fix For: 0.8.0
>
>         Attachments: 14-img-src-data-url.html, 19-object-data-data-uri.html, ANY23-26.patch, org.apache.any23.extractor.html.HCardExtractorTest.txt, tika-1.2-dependency-tree-compare.txt
>
>
> Upgrading to Apache Tika will hopefully provide a wealth of benefits for the project. This issue should act as an umbrella issue to track these changes. It would be great to delegate as much as possible to Tika if deemed suitable to enhance functionality and to reduce our dependencies on external projects.


        

[jira] [Commented] (ANY23-26) Upgrade dependency to Apache Tika 1.1

Posted by "Peter Ansell (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/ANY23-26?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13433765#comment-13433765 ] 

Peter Ansell commented on ANY23-26:
-----------------------------------

This may be a Crawler4J bug, as it has Tika as a dependency. Still investigating what the exact path is though.
                
> Upgrade dependency to Apache Tika 1.1
> -------------------------------------
>
>                 Key: ANY23-26
>                 URL: https://issues.apache.org/jira/browse/ANY23-26
>             Project: Apache Any23
>          Issue Type: Improvement
>    Affects Versions: 0.7.0
>            Reporter: Lewis John McGibbney
>             Fix For: 0.8.0
>
>         Attachments: 14-img-src-data-url.html, 19-object-data-data-uri.html, ANY23-26.patch, org.apache.any23.extractor.html.HCardExtractorTest.txt
>
>
> Upgrading to Apache Tika will hopefully provide a wealth of benefits for the project. This issue should act as an umbrella issue to track these changes. It would be great to delegate as much as possible to Tika if deemed suitable to enhance functionality and to reduce our dependencies on external projects.


        

[jira] [Updated] (ANY23-26) Upgrade dependency to Apache Tika 1.1

Posted by "Peter Ansell (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/ANY23-26?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Peter Ansell updated ANY23-26:
------------------------------

    Attachment: tika-1.2-dependency-tree-compare.txt

Attaching a diff of the maven dependency trees before and after the version bump from tika-0.6 to tika-1.2.

For reference, the commands against my GitHub repository are as follows:

#~/gitrepos/any23$ git checkout tika-12
#~/gitrepos/any23$ mvn dependency:tree > after-tika-1.2.txt
#~/gitrepos/any23$ git checkout trunk 
Previous HEAD position was c58f165... ANY23-26 : bump to tika-1.2
Switched to branch 'trunk'
#~/gitrepos/any23$ mvn dependency:tree > before-tika-1.2.txt
#~/gitrepos/any23$ diff -u before-tika-1.2.txt after-tika-1.2.txt > tika-1.2-dependency-tree-compare.txt
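
If the unified diff is too noisy to read, the two dependency:tree outputs can also be reduced to "which coordinates are new". The following is a minimal sketch, assuming the usual "[INFO] +- group:artifact:type:version:scope" line layout; the class and method names are hypothetical and the coordinates in main are invented placeholders.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper: lists coordinates present after a version bump but not before.
final class DependencyTreeDiff {

    /** Strips the "[INFO]" prefix and tree-drawing characters from one output line. */
    static String coordinate(String line) {
        return line.replaceFirst("^\\[INFO\\]\\s*", "")
                   .replaceFirst("^[|+\\\\\\-\\s]+", "")
                   .trim();
    }

    /** Returns coordinates that appear in 'after' but not in 'before', in first-seen order. */
    static List<String> addedCoordinates(List<String> before, List<String> after) {
        Set<String> known = new LinkedHashSet<>();
        for (String line : before) {
            known.add(coordinate(line));
        }
        List<String> added = new ArrayList<>();
        for (String line : after) {
            String candidate = coordinate(line);
            if (!candidate.isEmpty() && !known.contains(candidate) && !added.contains(candidate)) {
                added.add(candidate);
            }
        }
        return added;
    }

    public static void main(String[] args) {
        // Invented sample lines, not taken from the attached file.
        List<String> before = Arrays.asList("[INFO] +- org.example:lib-a:jar:1.0:compile");
        List<String> after = Arrays.asList(
                "[INFO] +- org.example:lib-a:jar:1.0:compile",
                "[INFO] |  \\- org.example:lib-b:jar:2.0:compile");
        System.out.println(addedCoordinates(before, after)); // prints [org.example:lib-b:jar:2.0:compile]
    }
}
```

Because the version is part of the coordinate string, a version change also shows up as an added entry, which is usually what you want when hunting for a conflict.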

                
> Upgrade dependency to Apache Tika 1.1
> -------------------------------------
>
>                 Key: ANY23-26
>                 URL: https://issues.apache.org/jira/browse/ANY23-26
>             Project: Apache Any23
>          Issue Type: Improvement
>    Affects Versions: 0.7.0
>            Reporter: Lewis John McGibbney
>             Fix For: 0.8.0
>
>         Attachments: 14-img-src-data-url.html, 19-object-data-data-uri.html, ANY23-26.patch, org.apache.any23.extractor.html.HCardExtractorTest.txt, tika-1.2-dependency-tree-compare.txt
>
>
> Upgrading to Apache Tika will hopefully provide a wealth of benefits for the project. This issue should act as an umbrella issue to track these changes. It would be great to delegate as much as possible to Tika if deemed suitable to enhance functionality and to reduce our dependencies on external projects.


        

[jira] [Commented] (ANY23-26) Upgrade dependency to Apache Tika 1.1

Posted by "Peter Ansell (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/ANY23-26?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13433795#comment-13433795 ] 

Peter Ansell commented on ANY23-26:
-----------------------------------

It cannot be a Crawler4J bug actually, as Crawler4J is only used in the basic crawler plugin and the bug appears above that level in the core module tests.

Another candidate is that Tika-1.2 introduces a dependency on JDOM, while we are also using DOM4J, which may produce a conflict. Investigating that now.
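
One way to narrow down a suspected class path conflict like this is to ask the JVM which jar a given class is actually loaded from. This is a minimal sketch and the helper class is hypothetical; the class names probed in main are only examples of what one might check and may not be the ones involved here.

```java
import java.security.CodeSource;

// Hypothetical debugging helper: reports where a class is loaded from.
final class ClassOriginFinder {

    /** Returns the jar or directory a class comes from, or a short note if unavailable. */
    static String locate(String className) {
        try {
            // Load without running static initialisers; we only want the origin.
            Class<?> type = Class.forName(className, false, ClassOriginFinder.class.getClassLoader());
            CodeSource source = type.getProtectionDomain().getCodeSource();
            if (source == null || source.getLocation() == null) {
                // Classes from the JDK itself typically have no code source.
                return "JDK or bootstrap class path";
            }
            return source.getLocation().toString();
        } catch (ClassNotFoundException e) {
            return "not on class path";
        }
    }

    public static void main(String[] args) {
        String[] candidates = {
            "org.jdom.Document",
            "org.dom4j.Document",
            "org.apache.xerces.parsers.DOMParser"
        };
        for (String name : candidates) {
            System.out.println(name + " -> " + locate(name));
        }
    }
}
```

Running it once on trunk and once on the upgrade branch, from within the test class path, would show whether the parser classes resolve to a different jar after the bump.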
                
> Upgrade dependency to Apache Tika 1.1
> -------------------------------------
>
>                 Key: ANY23-26
>                 URL: https://issues.apache.org/jira/browse/ANY23-26
>             Project: Apache Any23
>          Issue Type: Improvement
>    Affects Versions: 0.7.0
>            Reporter: Lewis John McGibbney
>             Fix For: 0.8.0
>
>         Attachments: 14-img-src-data-url.html, 19-object-data-data-uri.html, ANY23-26.patch, org.apache.any23.extractor.html.HCardExtractorTest.txt
>
>
> Upgrading to Apache Tika will hopefully provide a wealth of benefits for the project. This issue should act as an umbrella issue to track these changes. It would be great to delegate as much as possible to Tika if deemed suitable to enhance functionality and to reduce our dependencies on external projects.


        

[jira] [Commented] (ANY23-26) Upgrade dependency to Apache Tika 1.1

Posted by "Peter Ansell (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/ANY23-26?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13432363#comment-13432363 ] 

Peter Ansell commented on ANY23-26:
-----------------------------------

Only one of the tests Lewis referred to above, testObjectDataDataUri, is now failing on my branch for this issue:

https://github.com/ansell/any23/compare/ansell:trunk...ansell:tika-12
                
> Upgrade dependency to Apache Tika 1.1
> -------------------------------------
>
>                 Key: ANY23-26
>                 URL: https://issues.apache.org/jira/browse/ANY23-26
>             Project: Apache Any23
>          Issue Type: Improvement
>    Affects Versions: 0.7.0
>            Reporter: Lewis John McGibbney
>             Fix For: 0.8.0
>
>         Attachments: 14-img-src-data-url.html, 19-object-data-data-uri.html, ANY23-26.patch, org.apache.any23.extractor.html.HCardExtractorTest.txt
>
>
> Upgrading to Apache Tika will hopefully provide a wealth of benefits for the project. This issue should act as an umbrella issue to track these changes. It would be great to delegate as much as possible to Tika if deemed suitable to enhance functionality and to reduce our dependencies on external projects.


        

[jira] [Updated] (ANY23-26) Upgrade dependency to Apache Tika 1.1

Posted by "Lewis John McGibbney (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/ANY23-26?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Lewis John McGibbney updated ANY23-26:
--------------------------------------

    Attachment: 19-object-data-data-uri.html
                14-img-src-data-url.html
                org.apache.any23.extractor.html.HCardExtractorTest.txt
                ANY23-26.patch

Initial WIP. This breaks HCardExtractorTest#testImgSrcDataUrl and #testObjectDataDataUri. 

I've attached my failing tests, along with the two HTML documents which the tests currently fail on. They both seem to be failing on either AbstractExtractorTestCase#assertExtract or HCardExtractorTest#assertDefaultVCard... 

For reference, we only use Tika core and parsers in the following two classes:

./core/src/main/java/org/apache/any23/mime/TikaMIMETypeDetector.java:import org.apache.tika.mime.MimeTypes;
./core/src/main/java/org/apache/any23/encoding/TikaEncodingDetector.java:import org.apache.tika.parser.txt.CharsetDetector;  
                
> Upgrade dependency to Apache Tika 1.1
> -------------------------------------
>
>                 Key: ANY23-26
>                 URL: https://issues.apache.org/jira/browse/ANY23-26
>             Project: Apache Any23
>          Issue Type: Improvement
>    Affects Versions: 0.7.0
>            Reporter: Lewis John McGibbney
>             Fix For: 0.8.0
>
>         Attachments: 14-img-src-data-url.html, 19-object-data-data-uri.html, ANY23-26.patch, org.apache.any23.extractor.html.HCardExtractorTest.txt
>
>
> Upgrading to Apache Tika will hopefully provide a wealth of benefits for the project. This issue should act as an umbrella issue to track these changes. It would be great to delegate as much as possible to Tika if deemed suitable to enhance functionality and to reduce our dependencies on external projects.


        

[jira] [Updated] (ANY23-26) Upgrade dependency to Apache Tika 1.1

Posted by "Lewis John McGibbney (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/ANY23-26?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Lewis John McGibbney updated ANY23-26:
--------------------------------------

    Summary: Upgrade dependency to Apache Tika 1.1  (was: Upgrade dependency to Apache Tika 1.0)
    
> Upgrade dependency to Apache Tika 1.1
> -------------------------------------
>
>                 Key: ANY23-26
>                 URL: https://issues.apache.org/jira/browse/ANY23-26
>             Project: Apache Any23
>          Issue Type: Improvement
>    Affects Versions: 0.7.0
>            Reporter: Lewis John McGibbney
>             Fix For: 0.8.0
>
>
> Upgrading to Apache Tika will hopefully provide a wealth of benefits for the project. This issue should act as an umbrella issue to track these changes. It would be great to delegate as much as possible to Tika if deemed suitable to enhance functionality and to reduce our dependencies on external projects.


        

[jira] [Commented] (ANY23-26) Upgrade dependency to Apache Tika 1.1

Posted by "Peter Ansell (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/ANY23-26?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13433853#comment-13433853 ] 

Peter Ansell commented on ANY23-26:
-----------------------------------

The point at which the two similar tests, 18 and 19 are different is that in:

                
> Upgrade dependency to Apache Tika 1.1
> -------------------------------------
>
>                 Key: ANY23-26
>                 URL: https://issues.apache.org/jira/browse/ANY23-26
>             Project: Apache Any23
>          Issue Type: Improvement
>    Affects Versions: 0.7.0
>            Reporter: Lewis John McGibbney
>             Fix For: 0.8.0
>
>         Attachments: 14-img-src-data-url.html, 19-object-data-data-uri.html, ANY23-26.patch, org.apache.any23.extractor.html.HCardExtractorTest.txt, tika-1.2-dependency-tree-compare.txt
>
>
> Upgrading to Apache Tika will hopefully provide a wealth of benefits for the project. This issue should act as an umbrella issue to track these changes. It would be great to delegate as much as possible to Tika if deemed suitable to enhance functionality and to reduce our dependencies on external projects.
