Posted to dev@lucene.apache.org by "Martijn van Groningen (JIRA)" <ji...@apache.org> on 2011/01/03 23:22:46 UTC

[jira] Updated: (SOLR-2116) TikaEntityProcessor does not find parser by default

     [ https://issues.apache.org/jira/browse/SOLR-2116?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Martijn van Groningen updated SOLR-2116:
----------------------------------------

    Attachment: SOLR-2116.patch

I've encountered the same issue on my Solr setup. After some digging I found the cause: the TikaEntityProcessor is simply not loading classes from the lib directory.

When no Tika config is specified in data-config.xml, the TikaEntityProcessor loads the TikaConfig as follows:
{code}
....
String tikaConfigFile = context.getResolvedEntityAttribute("tikaConfig");
if (tikaConfigFile == null) {
  tikaConfig = TikaConfig.getDefaultConfig();
} else {
....
{code}

The problem with loading the TikaConfig this way is that it doesn't use the classloader from the SolrResourceLoader and therefore doesn't load any jars from the Solr lib directory. The attached patch resolves the issue that no content is parsed by Tika. I simply use the TikaConfig constructor that takes a ClassLoader as its argument, and I retrieve that classloader from the SolrCore.
{code}
...
String tikaConfigFile = context.getResolvedEntityAttribute("tikaConfig");
if (tikaConfigFile == null) {
  ClassLoader classLoader = context.getSolrCore().getResourceLoader().getClassLoader();
  tikaConfig = new TikaConfig(classLoader);
} else {
...
{code}
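
To illustrate why the choice of classloader matters, here is a minimal sketch that uses only the JDK and no Solr or Tika code. It assumes a temporary directory standing in for the Solr lib directory and a made-up resource name (demo-parsers.properties) standing in for whatever Tika looks up through its classloader; the class and method names are hypothetical as well. The default classloader cannot see anything in that directory, while a URLClassLoader that includes the directory can.

```java
import java.io.IOException;
import java.net.URL;
import java.net.URLClassLoader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class LibDirClassLoaderDemo {

    // Hypothetical resource name; it only stands in for files shipped in the lib directory.
    static final String RESOURCE = "demo-parsers.properties";

    /** Creates a temp directory that plays the role of the Solr lib directory. */
    static Path createLibDir() throws IOException {
        Path libDir = Files.createTempDirectory("solr-lib-demo");
        Files.write(libDir.resolve(RESOURCE),
                "pdf=DemoPdfParser\n".getBytes(StandardCharsets.UTF_8));
        return libDir;
    }

    /** Builds a classloader that adds the lib directory on top of the given parent. */
    static URLClassLoader openLibLoader(Path libDir, ClassLoader parent) throws IOException {
        return new URLClassLoader(new URL[] { libDir.toUri().toURL() }, parent);
    }

    /** Returns true if the given classloader can locate the demo resource. */
    static boolean visibleFrom(ClassLoader loader) {
        return loader.getResource(RESOURCE) != null;
    }

    public static void main(String[] args) throws IOException {
        Path libDir = createLibDir();
        ClassLoader defaultLoader = LibDirClassLoaderDemo.class.getClassLoader();
        try (URLClassLoader libLoader = openLibLoader(libDir, defaultLoader)) {
            System.out.println("default loader sees resource: " + visibleFrom(defaultLoader));
            System.out.println("lib loader sees resource:     " + visibleFrom(libLoader));
        }
    }
}
```

The same reasoning applies to the patch: a lookup done through the default classloader misses whatever lives only in the lib directory, so the classloader that knows about that directory has to be passed in explicitly.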

I haven't added a test that demonstrates this bug, since it only occurs when the Tika libs (and their dependencies) are in the Solr lib directory, and I don't know how to replicate this situation in the Solr build. The TestTikaEntityProcessor class doesn't have this problem, since all classes are on the normal classpath when the build is running.

> TikaEntityProcessor does not find parser by default
> ---------------------------------------------------
>
>                 Key: SOLR-2116
>                 URL: https://issues.apache.org/jira/browse/SOLR-2116
>             Project: Solr
>          Issue Type: Bug
>          Components: contrib - DataImportHandler, contrib - Solr Cell (Tika extraction)
>    Affects Versions: 3.1, 4.0
>            Reporter: Lance Norskog
>         Attachments: pdflist-data-config.xml, pdflist.xml, SOLR-2116.patch
>
>
> The TikaEntityProcessor does not find the correct document parser by default.
> This is in a two-level DIH config file. I have attached pdflist-data-config.xml and pdflist.xml, the XML file that supplies the file list. To test this, you will need the current 3.x branch or 4.0 trunk.
> # Set up a Tika-enabled Solr 
> # copy any PDF file to /tmp/testfile.pdf
> # copy the pdflist-data-config.xml to your solr/conf
> # and add this snippet to your solrconfig.xml
> {code:xml}
> <requestHandler name="/pdflist"
>     class="org.apache.solr.handler.dataimport.DataImportHandler">
>   <lst name="defaults">
>     <str name="config">pdflist-data-config.xml</str>
>   </lst>
> </requestHandler>
> {code}
> [http://localhost:8983/solr/pdflist?command=full-import] will make one document with the id and text fields populated. If you remove this line:
> {code}
>  parser="org.apache.tika.parser.pdf.PDFParser"
> {code}
> from the TikaEntityProcessor entity, the parser will not be found and you will get a document with the "id" field and nothing else.

-- 
This message is automatically generated by JIRA.