Posted to dev@lucene.apache.org by "Grant Ingersoll (JIRA)" <ji...@apache.org> on 2010/05/04 00:22:57 UTC

[jira] Created: (SOLR-1902) Tika no longer properly extracts content in Solr

Tika no longer properly extracts content in Solr
------------------------------------------------

                 Key: SOLR-1902
                 URL: https://issues.apache.org/jira/browse/SOLR-1902
             Project: Solr
          Issue Type: Bug
          Components: contrib - Solr Cell (Tika extraction)
            Reporter: Grant Ingersoll


See http://www.lucidimagination.com/search/document/2ca3fe953038a54f/problem_with_pdf_upgrading_cell#22360c8261801f24

It appears that since the upgrade to Tika 0.7, Tika is now selecting an EmptyParser when uploading docs, which then outputs an empty XHTML representation.  Still, it's strange that the tests pass.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

RE: [jira] Commented: (SOLR-1902) Tika no longer properly extracts content in Solr

Posted by ka...@nokia.com.

I saw the same thing yesterday when I tried to fold current trunk jars into a Solr build from March.  I'm not sure what the problem was, but once I used the example solrconfig.xml from the current build, I didn't see it anymore.

Karl

[jira] Commented: (SOLR-1902) Tika no longer properly extracts content in Solr

Posted by "Erik Hatcher (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-1902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12866114#action_12866114 ] 

Erik Hatcher commented on SOLR-1902:
------------------------------------

Is there a test case that could have caught this issue that we can add?



[jira] Assigned: (SOLR-1902) Tika no longer properly extracts content in Solr

Posted by "Grant Ingersoll (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/SOLR-1902?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Grant Ingersoll reassigned SOLR-1902:
-------------------------------------

    Assignee: Grant Ingersoll



[jira] Commented: (SOLR-1902) Tika no longer properly extracts content in Solr

Posted by "Grant Ingersoll (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-1902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12866340#action_12866340 ] 

Grant Ingersoll commented on SOLR-1902:
---------------------------------------

I suppose one could set up a test that fires up Jetty to do it.



[jira] Reopened: (SOLR-1902) Tika no longer properly extracts content in Solr

Posted by "Hoss Man (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/SOLR-1902?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hoss Man reopened SOLR-1902:
----------------------------


Reopening based on mailing list discussion indicating that the problem is still in trunk.



[jira] Commented: (SOLR-1902) Tika no longer properly extracts content in Solr

Posted by "Grant Ingersoll (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-1902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12884295#action_12884295 ] 

Grant Ingersoll commented on SOLR-1902:
---------------------------------------

I'm not seeing this.  I just tried trunk and it works for me.  Brad, can you produce a test case?  What happens if you run with extractOnly?  Does it return the content?  I tried both that and indexing using trunk and the example per the wiki docs and it all appears to work for me.
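
For reference, the extractOnly check asked about here can be run from the command line roughly as follows. This is a minimal sketch, not something posted in the thread: it assumes the example server at http://localhost:8983/solr with the extraction handler mapped to /update/extract (as in the example solrconfig.xml), and the function names and the form field name "myfile" are made up for illustration.

```shell
# Build the extract-only URL.
# $1 = optional Solr base URL (the default is an assumption: the example server).
extract_only_url() {
  printf '%s/update/extract?extractOnly=true\n' "${1:-http://localhost:8983/solr}"
}

# Post one file as a multipart upload and print Solr's response.
# $1 = file to upload, $2 = optional Solr base URL.
extract_only() {
  curl -sS -F "myfile=@$1" "$(extract_only_url "${2:-}")"
}
```

Running something like `extract_only some.pdf` should print the extracted XHTML. A response whose body carries no text would be consistent with the EmptyParser symptom described in this issue.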



[jira] Resolved: (SOLR-1902) Tika no longer properly extracts content in Solr

Posted by "Grant Ingersoll (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/SOLR-1902?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Grant Ingersoll resolved SOLR-1902.
-----------------------------------

    Fix Version/s: 1.4.2
                   3.1
       Resolution: Fixed

Trunk, branch-1.4 (i.e. 1.4.2) and branch-3.x should all be on the same version of Tika at this point.



[jira] Commented: (SOLR-1902) Tika no longer properly extracts content in Solr

Posted by "Tommaso Teofili (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-1902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12892957#action_12892957 ] 

Tommaso Teofili commented on SOLR-1902:
---------------------------------------

Hi all, I had the same issue David has, so I applied the patch (modifying the files one by one) to a fresh Solr 1.4.1 checkout, and most of my PDFs are now indexed with their text extracted (using the "example" Solr instance).
Within the apache-solr-1.4.1 release I replaced all the files inside apache-solr-1.4.1/dist with the ones generated (inside the dist directory) by running 'ant dist' on the patched 1.4.1 source code. I also replaced the release war inside example/webapps with the generated (patched) war; this last step was mandatory to avoid the NoSuchMethodError David reported. Then I ran 'java -jar start.jar' from the example dir and everything worked.
Note that I used the latest version of pdfbox, jempbox and fontbox (1.2.1).
I can attach the patch to the 1.4.1 code I used.



[jira] Updated: (SOLR-1902) Tika no longer properly extracts content in Solr

Posted by "Tommaso Teofili (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/SOLR-1902?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tommaso Teofili updated SOLR-1902:
----------------------------------

    Attachment: SOLR1902_patch_to_141.txt



[jira] Commented: (SOLR-1902) Tika no longer properly extracts content in Solr

Posted by "David Thibault (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-1902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12892757#action_12892757 ] 

David Thibault commented on SOLR-1902:
--------------------------------------

I just tried this patch, and the part that patches ExtractingRequestHandler does not work when applied to the ExtractingRequestHandler from Solr 1.4.1.  If it's a 1.4.0-specific patch, maybe it should say so.  I was able to read the patch and change the code manually, though.  I have not yet tried the resulting compiled classes to see if they fix my issue.



[jira] Commented: (SOLR-1902) Tika no longer properly extracts content in Solr

Posted by "David Thibault (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-1902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12892813#action_12892813 ] 

David Thibault commented on SOLR-1902:
--------------------------------------

OK, I just ran 'ant clean dist' with these patches applied.  When I use curl to post a file to Solr, it gives me this error:
SEVERE: java.lang.NoSuchMethodError: org.apache.solr.core.SolrResourceLoader.getClassLoader()Ljava/lang/ClassLoader;
        at org.apache.solr.handler.extraction.ExtractingRequestHandler.inform(ExtractingRequestHandler.java:93)
        at org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.getWrappedHandler(RequestHandlers.java:244)
        at org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.handleRequest(RequestHandlers.java:231)
        at org.apache.solr.core.SolrCore.execute(SolrCore.java:1316)
        at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:338)
        at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:241)
        at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
        at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
        at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:233)
        at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:191)
        at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:127)
        at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102)
        at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
        at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:298)
        at org.apache.coyote.http11.Http11AprProcessor.process(Http11AprProcessor.java:859)
        at org.apache.coyote.http11.Http11AprProtocol$Http11ConnectionHandler.process(Http11AprProtocol.java:579)
        at org.apache.tomcat.util.net.AprEndpoint$Worker.run(AprEndpoint.java:1555)
        at java.lang.Thread.run(Thread.java:619)

I'm not sure why, because if I look at the patched Java source for SolrResourceLoader the getClassLoader() method is there.  Any thoughts?
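
One way to narrow this down is to check whether the SolrResourceLoader class that the container actually loads declares the method, since the patched source and the deployed war can differ (other comments in this thread found that the unpatched 1.4.1 war produced this error). The sketch below is only an illustration under stated assumptions, not part of any patch: declares_method is a hypothetical helper, and the jar path in the usage comment is a guess at where the core classes sit inside the unpacked war.

```shell
# Hypothetical helper: reads a javap class listing on stdin and succeeds
# if the listing declares a method with the given name.
# $1 = method name, for example getClassLoader
declares_method() {
  grep -q "[[:space:]]$1("
}

# Usage sketch (the jar path is an assumption; adjust it to the deployed war):
#   javap -classpath WEB-INF/lib/apache-solr-core-1.4.1.jar \
#       org.apache.solr.core.SolrResourceLoader \
#     | declares_method getClassLoader \
#     && echo "method present" \
#     || echo "method missing: an older class is on the classpath"
```

If the method is missing from the deployed jar while present in the patched source, the container is most likely still serving classes from an older build.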




[jira] Commented: (SOLR-1902) Tika no longer properly extracts content in Solr

Posted by "Brad Greenlee (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-1902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12875725#action_12875725 ] 

Brad Greenlee commented on SOLR-1902:
-------------------------------------

I am still seeing this issue. It works if I downgrade Tika to 0.6, but neither the 0.8-SNAPSHOT included in the current Solr trunk nor a snapshot from the Tika trunk works for me. I'm running Java 1.6.0_20 on OS X 10.6.3. I posted about the issue to the solr-user mailing list: http://lucene.472066.n3.nabble.com/TikaEntityProcessor-not-working-td856965.html



[jira] Commented: (SOLR-1902) Tika no longer properly extracts content in Solr

Posted by "David Thibault (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-1902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12893328#action_12893328 ] 

David Thibault commented on SOLR-1902:
--------------------------------------

OK, I tried Tommaso's patch and it worked great.  With the solr.war that came with the 1.4.1 distribution, it still gave the NoSuchMethodError, but with the patched and newly recompiled apache-solr-1.4.2-dev.war it worked.  I thought I had tried that before with the other patches and it didn't work.  In any case, it seems to be working with the following:
SOLR_DIST=the folder where the Solr 1.4.1 distribution was unzipped.
SOLR_HOME=the folder where Tomcat or Jetty will look to load Solr.

1) fresh copy of solr 1.4.1 distribution unzipped to SOLR_DIST

2) update SOLR_DIST/contrib/extraction/lib with the following:
   jempbox-1.2.1.jar
   fontbox-1.2.1.jar
   pdfbox-1.2.1.jar
   tika-core-0.8-SNAPSHOT.jar
   tika-parsers-0.8-SNAPSHOT.jar
  (and remove old tika and pdfbox-related jars)

3) patch with Tommaso's patch above in the SOLR_DIST directory:
patch -p0 < SOLR1902_patch_to_141.txt

4) still in SOLR_DIST, run "ant dist"

5) copy SOLR_DIST/dist/*.jar to SOLR_HOME/lib
6) copy SOLR_DIST/dist/solrj-lib to SOLR_HOME/lib/solrj-lib
7) copy SOLR_DIST/dist/apache-solr-1.4.2-dev.war to SOLR_HOME/
8) remove SOLR_HOME/contrib/extraction/lib/*.jar
9) copy SOLR_DIST/contrib/extraction/lib/*.jar to SOLR_HOME/contrib/extraction/lib/
10) (for me in tomcat) add CATALINA_HOME/conf/Catalina/localhost/solr.xml with the following content (substitute the actual directory for <SOLR_HOME> as that is not correct syntax):
<?xml version="1.0" encoding="utf-8"?>
<Context docBase="<SOLR_HOME>\apache-solr-1.4.2-dev.war" debug="0" crossContext="true">
  <Environment name="solr/home" type="java.lang.String" value="<SOLR_HOME>" override="true"/>
</Context>
11) restart tomcat.
12) upload content via curl.
13) jump for joy when it doesn't crash on me again...=)
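
The copy steps (5 through 9) in the list above can be sketched as a small shell function. This is only an illustration under the same assumptions as the list: the function name deploy_patched_solr is made up, its two arguments stand for SOLR_DIST and SOLR_HOME, and the patch and "ant dist" steps are expected to have been run already.

```shell
# Copy the rebuilt artifacts from the patched distribution into the Solr home.
# $1 = SOLR_DIST (patched 1.4.1 tree after "ant dist"), $2 = SOLR_HOME
deploy_patched_solr() {
  mkdir -p "$2/lib/solrj-lib" "$2/contrib/extraction/lib" || return 1
  cp "$1"/dist/*.jar "$2/lib/" || return 1                        # step 5
  cp -R "$1/dist/solrj-lib/." "$2/lib/solrj-lib/" || return 1     # step 6
  cp "$1/dist/apache-solr-1.4.2-dev.war" "$2/" || return 1        # step 7
  rm -f "$2"/contrib/extraction/lib/*.jar                         # step 8
  cp "$1"/contrib/extraction/lib/*.jar \
     "$2/contrib/extraction/lib/" || return 1                     # step 9
}
```

Step 8 runs before step 9 so that old Tika and PDFBox jars do not stay on the classpath next to the new ones.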

Best,
Dave 



[jira] Commented: (SOLR-1902) Tika no longer properly extracts content in Solr

Posted by "Grant Ingersoll (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-1902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12863548#action_12863548 ] 

Grant Ingersoll commented on SOLR-1902:
---------------------------------------

Further debugging shows that Tika did not load any parsers on startup; that difference is why the tests still pass.



[jira] Commented: (SOLR-1902) Tika no longer properly extracts content in Solr

Posted by "Grant Ingersoll (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-1902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12895672#action_12895672 ] 

Grant Ingersoll commented on SOLR-1902:
---------------------------------------

OK, so it seems that the comments on this were all about 1.4 and 1.4.1, which was never upgraded, and I believe trunk is working.  I'm going to mark this as a Fix Version for 1.4.2 as well and put up a patch for that based on the patch here.



[jira] Resolved: (SOLR-1902) Tika no longer properly extracts content in Solr

Posted by "Grant Ingersoll (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/SOLR-1902?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Grant Ingersoll resolved SOLR-1902.
-----------------------------------

    Resolution: Fixed

Upgraded to Tika 0.8-SNAPSHOT and added class loading capabilities.



[jira] Updated: (SOLR-1902) Tika no longer properly extracts content in Solr

Posted by "Hoss Man (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/SOLR-1902?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hoss Man updated SOLR-1902:
---------------------------

    Fix Version/s: 4.0


Correcting Fix Version based on CHANGES.txt, see this thread for more details...

http://mail-archives.apache.org/mod_mbox/lucene-dev/201005.mbox/%3Calpine.DEB.1.10.1005251052040.24672@radix.cryptio.net%3E

