
Posted to solr-user@lucene.apache.org by "Anatharaman, Srinatha (Contractor)" <Sr...@comcast.com> on 2017/02/06 22:45:19 UTC

DataImportHandler - Unable to load Tika Config Processing Document # 1

Hi,

I am getting the error below while trying to index using the DataImportHandler.

The data-config file is included below. ZooKeeper is not able to read "tikaConfig.xml" in the following statement:

  processor="TikaEntityProcessor" tikaConfig="tikaConfig.xml"

Please help me resolve this issue.

ion: java.lang.RuntimeException: org.apache.solr.handler.dataimport.DataImportHandlerException: Unable to load Tika Config Processing Document # 1
        at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:270)
        at org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:416)
        at org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:480)
        at org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:461)
Caused by: java.lang.RuntimeException: org.apache.solr.handler.dataimport.DataImportHandlerException: Unable to load Tika Config Processing Document # 1
        at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:416)
        at org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:329)
        at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:232)
        ... 3 more
Caused by: org.apache.solr.handler.dataimport.DataImportHandlerException: Unable to load Tika Config Processing Document # 1
        at org.apache.solr.handler.dataimport.DataImportHandlerException.wrapAndThrow(DataImportHandlerException.java:69)
        at org.apache.solr.handler.dataimport.TikaEntityProcessor.firstInit(TikaEntityProcessor.java:96)
        at org.apache.solr.handler.dataimport.EntityProcessorBase.init(EntityProcessorBase.java:60)
        at org.apache.solr.handler.dataimport.TikaEntityProcessor.init(TikaEntityProcessor.java:76)
        at org.apache.solr.handler.dataimport.EntityProcessorWrapper.init(EntityProcessorWrapper.java:75)
        at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:433)
        at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:514)
        at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:414)
        ... 5 more
Caused by: org.apache.solr.common.cloud.ZooKeeperException: ZkSolrResourceLoader does not support getConfigDir() - likely, what you are trying to do is not supported in ZooKeeper mode
        at org.apache.solr.cloud.ZkSolrResourceLoader.getConfigDir(ZkSolrResourceLoader.java:149)
        at org.apache.solr.handler.dataimport.TikaEntityProcessor.firstInit(TikaEntityProcessor.java:91)
        ... 11 more


<?xml version="1.0" encoding="UTF-8"?>
<dataConfig>
    <dataSource name="bin" type="BinFileDataSource" />
    <document>
        <entity name="f" dataSource="fileSource" rootEntity="false"
                processor="FileListEntityProcessor"
                baseDir="/app/home/source/"
                fileName=".*\.(com|txt|docx)"
                onError="skip"
                recursive="true">
            <field column="fileAbsolutePath" name="path" />
            <field column="fileSize" name="size" />
            <field column="fileLastModified" name="lastModified" />
            <field column="link" name="link" />

            <entity name="documentImport" dataSource="bin"
                    processor="TikaEntityProcessor" tikaConfig="tikaConfig.xml"
                    url="${f.fileAbsolutePath}"
                    format="text">
                <field column="file" name="fileName" />
                <field column="content" name="content" />
                <field column="Author" name="author" meta="true" />
                <field column="title" name="title" meta="true" />
                <field column="text" name="text" />
            </entity>
        </entity>
    </document>
</dataConfig>

RE: DataImportHandler - Unable to load Tika Config Processing Document # 1

Posted by "Anatharaman, Srinatha (Contractor)" <Sr...@comcast.com>.

Shawn,

Thank you, I will follow Erick's steps.
By the way, I am also trying to ingest using Flume. Flume uses Morphlines along with Tika.
Will the Flume SolrSink have the same issue?

Currently my SolrSink does not ingest the data, and I do not see any errors in my logs.
I am seeing a lot of issues with Solr.

Could you please suggest what the issue with my Flume SolrSink could be?

I have attached another email I sent about the SolrSink issue.

Regards,
~Sri

-----Original Message-----
From: Shawn Heisey [mailto:apache@elyograg.org] 
Sent: Wednesday, February 08, 2017 2:21 PM
To: solr-user@lucene.apache.org
Subject: Re: DataImportHandler - Unable to load Tika Config Processing Document # 1

On 2/8/2017 9:08 AM, Anatharaman, Srinatha (Contractor) wrote:
> Thank you for your reply
> Other archive message you mentioned is posted by me only I am new to 
> Solr, When you say process outside Solr program. What exactly I should do?
>
> I am having lots of text document which I need to index, what should I apply to these document before loading it to Solr?

Did you not see Erick's reply, where he provided the following link, and said that the program shown there was a decent guide to writing your own program to handle Tika processing?

https://lucidworks.com/2012/02/14/indexing-with-solrj/

The blog post includes code that talks to a database, which would be fairly easy to remove/change.  Some knowledge of how to write Java programs is required.  Tika is a Java API, so writing the program in Java is a prerequisite.

The entire point of this idea is to take the Tika processing out of the Solr server(s).  If Tika runs within Solr, it can cause Solr to hang or crash.  The authors of Tika try as hard as they can to make sure it works well, but the software is dealing with proprietary data formats that are not publicly documented.  Sometimes one of those documents can cause Tika to explode.  Crashes in client code won't break your application, and it is likely easier to recover from a crash at that level.

Thanks,
Shawn

Re: DataImportHandler - Unable to load Tika Config Processing Document # 1

Posted by Shawn Heisey <ap...@elyograg.org>.

On 2/8/2017 9:08 AM, Anatharaman, Srinatha (Contractor) wrote:
> Thank you for your reply
> Other archive message you mentioned is posted by me only
> I am new to Solr, When you say process outside Solr program. What exactly I should do?
>
> I am having lots of text document which I need to index, what should I apply to these document before loading it to Solr?

Did you not see Erick's reply, where he provided the following link, and
said that the program shown there was a decent guide to writing your own
program to handle Tika processing?

https://lucidworks.com/2012/02/14/indexing-with-solrj/

The blog post includes code that talks to a database, which would be
fairly easy to remove/change.  Some knowledge of how to write Java
programs is required.  Tika is a Java API, so writing the program in
Java is a prerequisite.

The entire point of this idea is to take the Tika processing out of the
Solr server(s).  If Tika runs within Solr, it can cause Solr to hang or
crash.  The authors of Tika try as hard as they can to make sure it
works well, but the software is dealing with proprietary data formats
that are not publicly documented.  Sometimes one of those documents can
cause Tika to explode.  Crashes in client code won't break your
application, and it is likely easier to recover from a crash at that level.
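
As a rough illustration of the idea above, here is a minimal sketch of a standalone client program. It is not the code from the blog post, which uses SolrJ and Tika. To stay dependency-free it uses only the JDK (Java 11 or later), reads plain-text files in a placeholder extractText() method where a real program would call Tika, and posts JSON to Solr's update handler over HTTP. The class name, the command-line arguments, and the use of the file path as the "id" value are assumptions made for the example; "path" and "content" are the field names from the data-config.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class ExternalIndexer {

    // Placeholder extraction step. Assumption: plain-text UTF-8 input.
    // A real program would call Tika here to handle RTF, DOCX, and so on.
    static String extractText(Path file) throws IOException {
        return Files.readString(file);
    }

    // Escapes a value for use inside a JSON string literal.
    static String jsonEscape(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (c) {
                case '"':  sb.append("\\\""); break;
                case '\\': sb.append("\\\\"); break;
                case '\n': sb.append("\\n"); break;
                case '\r': sb.append("\\r"); break;
                case '\t': sb.append("\\t"); break;
                default:
                    if (c < 0x20) {
                        sb.append(String.format("\\u%04x", (int) c));
                    } else {
                        sb.append(c);
                    }
            }
        }
        return sb.toString();
    }

    // Builds one JSON document. "path" and "content" come from the data-config;
    // using the path as "id" is an assumption about the schema's unique key.
    static String toSolrDoc(String path, String content) {
        return "{\"id\":\"" + jsonEscape(path) + "\","
             + "\"path\":\"" + jsonEscape(path) + "\","
             + "\"content\":\"" + jsonEscape(content) + "\"}";
    }

    public static void main(String[] args) throws Exception {
        if (args.length < 2) {
            System.err.println("usage: ExternalIndexer <baseDir> <solrUpdateUrl>");
            return;
        }
        Path baseDir = Path.of(args[0]);
        String updateUrl = args[1];

        // Collect every regular file under the base directory.
        List<Path> files;
        try (Stream<Path> walk = Files.walk(baseDir)) {
            files = walk.filter(Files::isRegularFile).collect(Collectors.toList());
        }

        List<String> docs = new ArrayList<>();
        for (Path file : files) {
            try {
                docs.add(toSolrDoc(file.toAbsolutePath().toString(), extractText(file)));
            } catch (IOException e) {
                // A file that cannot be read is skipped; Solr is not involved at this point.
                System.err.println("Skipping " + file + ": " + e.getMessage());
            }
        }

        // Send all documents to the update handler as one JSON array.
        HttpRequest request = HttpRequest.newBuilder(URI.create(updateUrl))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString("[" + String.join(",", docs) + "]"))
                .build();
        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println("Solr responded with HTTP " + response.statusCode());
    }
}
```

It could be run with something like "java ExternalIndexer.java /app/home/source/ http://localhost:8983/solr/mycollection/update?commit=true", where the host and collection name are hypothetical. Because extraction happens in this process, a document that fails to read or parse only affects the client.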

Thanks,
Shawn

RE: DataImportHandler - Unable to load Tika Config Processing Document # 1

Posted by "Anatharaman, Srinatha (Contractor)" <Sr...@comcast.com>.

Shawn,

Thank you for your reply.
The other archived message you mentioned was also posted by me.
I am new to Solr. When you say to process outside of Solr, what exactly should I do?

I have a lot of text documents that I need to index. What should I apply to these documents before loading them into Solr?

Regards,
~Sri

-----Original Message-----
From: Shawn Heisey [mailto:apache@elyograg.org] 
Sent: Wednesday, February 08, 2017 9:46 AM
To: solr-user@lucene.apache.org
Subject: Re: DataImportHandler - Unable to load Tika Config Processing Document # 1

On 2/6/2017 3:45 PM, Anatharaman, Srinatha (Contractor) wrote:
> I am having below error while trying to index using dataImporthandler
>
> Data-Config file is mentioned below. zookeeper is not able to read 
> "tikaConfig.xml" on below statement
>
>   processor="TikaEntityProcessor" tikaConfig="tikaConfig.xml"
>
> Please help me to resolve this issue
>
> ion: java.lang.RuntimeException: 
> org.apache.solr.handler.dataimport.DataImportHandlerException: Unable 
> to load Tika Config Processing Document # 1
<snip>
> Caused by: org.apache.solr.common.cloud.ZooKeeperException: ZkSolrResourceLoader does not support getConfigDir() - likely, what you are trying to do is not supported in ZooKeeper mode
>         at org.apache.solr.cloud.ZkSolrResourceLoader.getConfigDir(ZkSolrResourceLoader.java:149)
>         at org.apache.solr.handler.dataimport.TikaEntityProcessor.firstInit(TikaEntityProcessor.java:91)
>         ... 11 more

This sounds to me like there's something making TikaEntityProcessor incompatible with running in SolrCloud mode.  The way that this processor loads its config appears to NOT work when the config comes from zookeeper, which it always will when you're running SolrCloud.

I don't know if this is expected or not, or whether it will be considered a bug.

It is *strongly* recommended to *not* use the Tika that's embedded within Solr, but instead to do the processing outside of Solr in a program of your own and index the results.  Tika is very touchy software that sometimes hangs or crashes as it processes rich-text documents.  If that happens to the embedded Tika, then Solr itself will also be affected.

Doing Tika processing outside of Solr is more important with SolrCloud, because all replicas will need to independently index the data in cloud mode.  Here's an archive of a message from this list about pretty much the exact same problem:

https://www.mail-archive.com/solr-user@lucene.apache.org/msg127924.html

Note that this message was sent only a week ago.

Thanks,
Shawn

RE: DataImportHandler - Unable to load Tika Config Processing Document # 1

Posted by "Anatharaman, Srinatha (Contractor)" <Sr...@comcast.com>.

My requirement is that when a Solr search finds the string, it has to return the entire text document (emails in RTF format). If I process the documents outside of Solr, how do I achieve this?
When you say process outside, what processing do I apply to the RTF documents? The search results also have to return the original document.

I was able to do this successfully with a standalone Solr core.
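
For reference, one common way to meet this kind of requirement is to index the extracted text for searching and keep the original file's location in a stored field, so each hit can be mapped back to the original document. The sketch below only builds the query URL. It assumes that "content" and "path" (the field names from the data-config) are stored fields; the class name and the collection URL are hypothetical.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

public class SearchUrl {

    // Builds a /select URL that searches the "content" field for a phrase and
    // asks Solr to return only the stored "path" and "content" fields.
    static String build(String collectionUrl, String phrase) {
        // Escape backslashes and quotes so the phrase stays one quoted term.
        String escaped = phrase.replace("\\", "\\\\").replace("\"", "\\\"");
        String q = "content:\"" + escaped + "\"";
        return collectionUrl + "/select"
                + "?q=" + URLEncoder.encode(q, StandardCharsets.UTF_8)
                + "&fl=" + URLEncoder.encode("path,content", StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Hypothetical host and collection name.
        System.out.println(build("http://localhost:8983/solr/mycollection", "quarterly report"));
    }
}
```

The application can then open the file named in "path" to return the original RTF document, whichever program produced the indexed text.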



-----Original Message-----
From: Allison, Timothy B. [mailto:tallison@mitre.org] 
Sent: Wednesday, February 08, 2017 1:56 PM
To: solr-user@lucene.apache.org
Subject: RE: DataImportHandler - Unable to load Tika Config Processing Document # 1

> It is *strongly* recommended to *not* use the Tika that's embedded within Solr, but instead to do the processing outside of Solr in a program of your own and index the results.

+1 

http://mail-archives.apache.org/mod_mbox/lucene-solr-user/201601.mbox/%3CBY2PR09MB11210EDFCFA297528940B07C7F30%40BY2PR09MB112.namprd09.prod.outlook.com%3E

RE: DataImportHandler - Unable to load Tika Config Processing Document # 1

Posted by "Allison, Timothy B." <ta...@mitre.org>.

> It is *strongly* recommended to *not* use the Tika that's embedded within Solr, but instead to do the processing outside of Solr in a program of your own and index the results.

+1 

http://mail-archives.apache.org/mod_mbox/lucene-solr-user/201601.mbox/%3CBY2PR09MB11210EDFCFA297528940B07C7F30%40BY2PR09MB112.namprd09.prod.outlook.com%3E

Re: DataImportHandler - Unable to load Tika Config Processing Document # 1

Posted by Shawn Heisey <ap...@elyograg.org>.

On 2/6/2017 3:45 PM, Anatharaman, Srinatha (Contractor) wrote:
> I am having below error while trying to index using dataImporthandler
>
> Data-Config file is mentioned below. zookeeper is not able to read "tikaConfig.xml" on below statement
>
>   processor="TikaEntityProcessor" tikaConfig="tikaConfig.xml"
>
> Please help me to resolve this issue
>
> ion: java.lang.RuntimeException: org.apache.solr.handler.dataimport.DataImportHandlerException: Unable to load Tika Config Processing Document # 1
<snip>
> Caused by: org.apache.solr.common.cloud.ZooKeeperException: ZkSolrResourceLoader does not support getConfigDir() - likely, what you are trying to do is not supported in ZooKeeper mode
>         at org.apache.solr.cloud.ZkSolrResourceLoader.getConfigDir(ZkSolrResourceLoader.java:149)
>         at org.apache.solr.handler.dataimport.TikaEntityProcessor.firstInit(TikaEntityProcessor.java:91)
>         ... 11 more

This sounds to me like there's something making TikaEntityProcessor
incompatible with running in SolrCloud mode.  The way that this
processor loads its config appears to NOT work when the config comes
from zookeeper, which it always will when you're running SolrCloud.

I don't know if this is expected or not, or whether it will be
considered a bug.

It is *strongly* recommended to *not* use the Tika that's embedded
within Solr, but instead to do the processing outside of Solr in a
program of your own and index the results.  Tika is very touchy software
that sometimes hangs or crashes as it processes rich-text documents.  If
that happens to the embedded Tika, then Solr itself will also be affected.
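
To make the isolation argument concrete, here is a minimal sketch of how a client program might guard each extraction with a timeout, so that one problem document is skipped instead of stalling the whole run. The class and method names are hypothetical, and the Callable stands in for whatever extraction call the program makes (for example a Tika parse). A thread-level timeout like this does not protect against everything, such as running out of memory; running the extraction in a separate process would be the more robust variant.

```java
import java.util.Optional;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class GuardedExtraction {

    // Runs one extraction on a worker thread and gives up after timeoutMillis.
    // Returns the extracted text, or empty if the extraction failed or timed out.
    static Optional<String> extractWithTimeout(Callable<String> extraction, long timeoutMillis) {
        // Daemon thread: a hung extraction cannot keep the JVM from exiting.
        ExecutorService pool = Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r, "extraction-worker");
            t.setDaemon(true);
            return t;
        });
        Future<String> future = pool.submit(extraction);
        try {
            return Optional.ofNullable(future.get(timeoutMillis, TimeUnit.MILLISECONDS));
        } catch (TimeoutException e) {
            future.cancel(true); // interrupt the worker and move on
            return Optional.empty();
        } catch (ExecutionException e) {
            return Optional.empty(); // the extraction itself threw an exception
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return Optional.empty();
        } finally {
            pool.shutdownNow();
        }
    }

    public static void main(String[] args) {
        System.out.println(extractWithTimeout(() -> "extracted text", 1000));
        System.out.println(extractWithTimeout(() -> {
            Thread.sleep(5000); // simulates a document that makes the parser hang
            return "never returned";
        }, 100));
    }
}
```

In both failure cases the caller simply gets an empty result and can log the file name and continue with the next document, while Solr keeps serving queries.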

Doing Tika processing outside of Solr is more important with SolrCloud,
because all replicas will need to independently index the data in cloud
mode.  Here's an archive of a message from this list about pretty much
the exact same problem:

https://www.mail-archive.com/solr-user@lucene.apache.org/msg127924.html

Note that this message was sent only a week ago.

Thanks,
Shawn