Posted to user@manifoldcf.apache.org by Tamizh Kumaran Thamizharasan <tt...@worldbankgroup.org> on 2017/06/13 07:20:03 UTC

ManifoldCF documentum indexing issue

Hi,

ManifoldCF 2.7.1 is running in the multi-process ZooKeeper model, integrated with PostgreSQL 9.3. The intended setup is to crawl Documentum content and push it to a Solr 5.3.2 output. The crawler-ui app is installed on Tomcat, and the startup script points to the ManifoldCF properties.xml during server startup. ManifoldCF, the bundled ZooKeeper, and Tomcat run on the same host (Red Hat Enterprise Linux Server release 6.9, Santiago); the database runs on a Windows box.
ZooKeeper is integrated with the database through properties.xml and properties-global.xml.
ZooKeeper and the Documentum-related processes (registry and server) are up, and the two agents (start-agents.sh and start-agents-2.sh) are started, which spawn multiple threads to index the Documentum content into Solr through ManifoldCF.

The current connection limits configured in ManifoldCF are as follows.
Solr output max connections: 25
Documentum repository max connections: 25
Properties.xml:
<property name="org.apache.manifoldcf.database.maxhandles" value="50"/>
<property name="org.apache.manifoldcf.crawler.threads" value="25"/>
Total Documentum document count: 0.5 million

After the job is started, it indexes some 20,000+ documents and then terminates with the below error on the ManifoldCF job.
Error: Repeated service interruptions - failure processing document: Error from server at http://localhost:8983/solr/documentum_manifoldcf_stg: String index out of range: -188

Please find the attached ManifoldCF error log and agent log.

Please let me know your observations on the cause of the issue and on the thread configuration used for crawling. Please share your thoughts.

Regards,
Tamizh Kumaran


Re: ManifoldCF documentum indexing issue

Posted by Karl Wright <da...@gmail.com>.
I have committed a stop-gap solution to the MCF Solr connector, but the
real problem is in Apache HttpComponents/HttpClient.  I've gotten
permission to suggest a fix for that project as well.

Karl


On Thu, Jun 22, 2017 at 4:27 AM, Tamizh Kumaran Thamizharasan <
tthamizharasan@worldbankgroup.org> wrote:

> Thanks Karl.
>
>
>
> After installing the patch, filenames with double quotes and backslashes
> are getting indexed to Solr, and the issue is resolved.
>
>
>
> Regards,
>
> Tamizh Kumaran Thamizharasan
>
>
>
> *From:* Karl Wright [mailto:daddywri@gmail.com]
> *Sent:* Wednesday, June 21, 2017 5:07 PM
>
> *To:* user@manifoldcf.apache.org
> *Cc:* Sharnel Merdeck Pereira; Sundarapandian Arumaidurai Vethasigamani
> *Subject:* Re: ManifoldCF documentum indexing issue
>
>
>
> I've attached a tentative patch to the ticket CONNECTORS-1434.  Please
> confirm whether or not the patch works for you before I commit it to trunk.
>
>
>
> Karl
>
>
>
>
>
> On Wed, Jun 21, 2017 at 6:49 AM, Tamizh Kumaran Thamizharasan <
> tthamizharasan@worldbankgroup.org> wrote:
>
> Thanks Karl.
>
>
>
> Please find below the steps to recreate the issue on the file system
> repository.
>
>
>
> Output connector : Solr
>
> Repository : File system
>
> File name in repository : “dummy” file “name.pdf
>
>
>
> Additional Solr parameter : expandMacros=false
>
>
>
> On starting the job with the above configuration, we are getting a
> “missing content stream” error.
>
> Please find the attached file for complete log trace.
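For anyone reproducing the steps above: on a POSIX filesystem (such as the RHEL crawl host in this thread), a file name may legally contain spaces and double quotes, so the repro fixture can be created programmatically. This is an illustrative sketch; the temporary-directory path and straight (rather than curly) quotes are assumptions.

```python
from pathlib import Path
import os
import tempfile

# Create a throwaway "repository" directory and a file whose name contains
# double quotes and spaces, matching the repro filename in the thread.
repo = Path(tempfile.mkdtemp())
name = '"dummy" file "name.pdf'
(repo / name).write_bytes(b"%PDF-1.4 dummy content")

# The file exists under its literal, quote-containing name.
assert name in os.listdir(repo)
```

Pointing a File System repository connection at such a directory should then exercise the same filename-handling path.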
>
>
>
> Regards,
>
> Tamizh Kumaran Thamizharasan
>
>
>
> *From:* Karl Wright [mailto:daddywri@gmail.com]
> *Sent:* Wednesday, June 21, 2017 3:35 PM
>
>
> *To:* user@manifoldcf.apache.org
> *Cc:* Sharnel Merdeck Pereira; Sundarapandian Arumaidurai Vethasigamani
> *Subject:* Re: ManifoldCF documentum indexing issue
>
>
>
> I've created a ticket, CONNECTORS-1434, to look at the file name issues.
>
>
>
> Karl
>
>
>
>
>
> On Wed, Jun 21, 2017 at 5:44 AM, Karl Wright <da...@gmail.com> wrote:
>
> There is no good way to handle a case where Solr doesn't like the file
> name.  About the only thing that could be done would be to encode the
> filename using something like URL encoding.  This might have some effects
> on existing users, but more importantly, we really would need to know what
> characters were legal before adopting that solution.
>
>
>
> I am not entirely sure how the file name is transmitted to Solr when using
> multipart forms, but how that is done is critical to know what to do.
>
>
>
> Karl
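Karl's idea above (encoding the filename with something like URL encoding before transmission) can be sketched as follows. This is an illustrative Python sketch, not the connector's actual Java code, and `encode_filename` is a hypothetical helper; the key property is that the transformation is reversible.

```python
from urllib.parse import quote, unquote

def encode_filename(name: str) -> str:
    # Percent-encode everything outside the always-safe characters so that
    # quotes, spaces, and backslashes cannot confuse the receiving parser.
    return quote(name, safe="")

original = '"dummy" file "name.pdf'
encoded = encode_filename(original)

# Round-trips cleanly, so an indexer could recover the original name.
assert unquote(encoded) == original
```

As Karl notes, the open question would be which characters the receiving end actually treats as legal, since double-encoding or partial encoding could affect existing users.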
>
>
>
>
>
> On Wed, Jun 21, 2017 at 4:55 AM, Tamizh Kumaran Thamizharasan <
> tthamizharasan@worldbankgroup.org> wrote:
>
> Hi Karl,
>
>
>
> Thanks for the update!!!
>
>
>
> As per the response from Solr team, expandMacros=false is added to the
> output connector as additional parameter.
>
> After adding expandMacros=false, the indexing job completes, but with a
> “Missing content stream” error for a few of the documents, and those are
> not indexed into Solr.
>
>
>
> As per our analysis, the PDF file names we are trying to index from
> Documentum contain whitespace and special characters such as double
> quotes.
> This makes the files unreadable, and the “missing content stream” error
> is thrown.
>
>
>
> If there is any workaround to overcome this issue, kindly share it with
> us.
>
>
>
> Regards,
>
> Tamizh Kumaran Thamizharasan
>
>
>
> *From:* Karl Wright [mailto:daddywri@gmail.com]
> *Sent:* Wednesday, June 14, 2017 7:20 PM
>
>
> *To:* user@manifoldcf.apache.org
> *Cc:* Sharnel Merdeck Pereira; Sundarapandian Arumaidurai Vethasigamani
> *Subject:* Re: ManifoldCF documentum indexing issue
>
>
>
> Here's the response:
>
>
>
> >>>>>>
>
> Karl -
>
> There’s expandMacros=false, as covered here:
> https://cwiki.apache.org/confluence/display/solr/Parameter+Substitution
>
> But… what exactly is being sent to Solr?    Is there some kind of “${…”
> being sent as a parameter?   Just curious what’s getting you into this in
> the first place.   But disabling probably is your most desired solution.
>
>         Erik
>
> <<<<<<
>
>
>
> Karl
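Erik's suggested workaround amounts to adding expandMacros=false as a request parameter on update requests. A minimal sketch of what such a request URL looks like; this is illustrative only (in MCF it is configured through the Solr output connection's additional-parameters UI rather than built by hand), and the literal.id value here is a placeholder.

```python
from urllib.parse import urlencode

# Collection name taken from the thread; literal.id is a placeholder
# document id as used by Solr's extracting handler.
base = "http://localhost:8983/solr/documentum_manifoldcf_stg/update/extract"
params = {"expandMacros": "false", "literal.id": "doc-1"}
url = base + "?" + urlencode(params)

# Macro expansion is disabled for this request.
assert "expandMacros=false" in url
```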
>
>
>
>
>
> On Wed, Jun 14, 2017 at 9:36 AM, Karl Wright <da...@gmail.com> wrote:
>
> Here's the question I posted:
>
>
>
> >>>>>>
>
> Hi all,
>
>
>
> I've got a ManifoldCF user who is posting content to Solr using the MCF
> Solr output connector.  This connector uses SolrJ under the covers -- a
> fairly recent version -- but also has overridden some classes to ensure
> that multipart form posts will be used for most content.
>
>
>
> The problem is that, for a specific document, the user is getting an
> ArrayIndexOutOfBounds exception in Solr, as follows:
>
>
>
> >>>>>>
>
> 2017-06-14T08:25:16,546 - ERROR [qtp862890654-69725:SolrException@148] - {collection=c:documentum_manifoldcf_stg, core=x:documentum_manifoldcf_stg_shard1_replica1, node_name=n:**********:8983_solr, replica=r:core_node1, shard=s:shard1} - java.lang.StringIndexOutOfBoundsException: String index out of range: -296
>
>         at java.lang.String.substring(String.java:1911)
>         at org.apache.solr.request.macro.MacroExpander._expand(MacroExpander.java:143)
>         at org.apache.solr.request.macro.MacroExpander.expand(MacroExpander.java:93)
>         at org.apache.solr.request.macro.MacroExpander.expand(MacroExpander.java:59)
>         at org.apache.solr.request.macro.MacroExpander.expand(MacroExpander.java:45)
>         at org.apache.solr.request.json.RequestUtil.processParams(RequestUtil.java:157)
>         at org.apache.solr.util.SolrPluginUtils.setDefaults(SolrPluginUtils.java:172)
>         at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:152)
>         at org.apache.solr.core.SolrCore.execute(SolrCore.java:2102)
>         at org.apache.solr.servlet.HttpSolrCall.execute(HttpSolrCall.java:654)
>         at org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:460)
>         at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:257)
>         at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:208)
>         at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1652)
>         at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:585)
>         at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:143)
>         at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:577)
>         at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:223)
>         at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1127)
>         at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:515)
>         at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:185)
>         at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1061)
>         at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
>         at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:215)
>         at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:110)
>         at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:97)
>         at org.eclipse.jetty.server.Server.handle(Server.java:499)
>         at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:310)
>         at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:257)
>         at org.eclipse.jetty.io.AbstractConnection$2.run(AbstractConnection.java:540)
>         at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:635)
>         at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:555)
>         at java.lang.Thread.run(Thread.java:745)
>
> <<<<<<
>
>
>
> It looks worrisome to me that there's now possibly some kind of "macro
> expansion" that is being triggered within parameters being sent to Solr.
> Can anyone tell me either how to (a) disable this feature, or (b) how the
> MCF Solr output connector should escape parameters being posted so that
> Solr does not attempt any macro expansion?  If the latter, I also need to
> know when this feature appeared, since obviously whether or not to do the
> escaping will depend on the precise version of the Solr instance involved.
>
>
>
> I'm also quite concerned that considerations of backwards compatibility
> may have been lost at some point with Solr, since heretofore I could count
> on older versions of SolrJ working with newer versions of Solr.  Please
> clarify what the current policy is....
>
>
>
>
>
> Thanks,
>
> Karl
>
> <<<<<<
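The negative index in the trace above is consistent with an unterminated "${" sequence in a request parameter value. The following sketch is my reconstruction of the failure mode, not Solr's actual MacroExpander code: if the scan for the closing "}" can land before the opening "${" (or never find one), the computed substring length goes negative. Java's String.substring throws StringIndexOutOfBoundsException for that; Python slicing hides it, so the check is explicit.

```python
def naive_expand(value: str) -> str:
    # Simplified reconstruction of a macro expander: find "${...}" and cut
    # out the macro name.  If there is no matching "}", find() returns -1
    # and the computed end index precedes the start index.
    start = value.find("${")
    if start < 0:
        return value                 # nothing to expand
    end = value.find("}")            # bug: may be -1, or before start
    return value[start + 2:end]      # Java's substring would throw here

# A parameter value (e.g. a filename) containing a bare "${" triggers it.
value = "file name with ${ but no closing brace"
start, end = value.find("${") + 2, value.find("}")
assert end - start < 0  # this negative difference is what Solr reports
```

On this reconstruction, any document metadata containing "${" would reproduce the 500 error, which matches the per-document failures reported in the thread.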
>
>
>
>
>
>
>
> On Wed, Jun 14, 2017 at 9:35 AM, Karl Wright <da...@gmail.com> wrote:
>
> I posted the pertinent question to the solr dev list.  Let's see what they
> say.
>
>
>
> Thanks,
>
> Karl
>
>
>
>
>
> On Wed, Jun 14, 2017 at 9:04 AM, Karl Wright <da...@gmail.com> wrote:
>
> Hi,
>
>
>
> The exception in the solr.log should be reported as a Solr bug.  It is not
> emanating from the Tika extractor (Solr Cell), but is in Solr itself.
>
>
>
> I wish there was an easy fix for this.  The problem is *not* an empty
> stream; it's that Solr is attempting to do something with it that it
> shouldn't.  MCF just gets back a 500 error from Solr, and we can't recover
> from that.
>
> >>>>>>
>
> https://**********/webtop/component/drl?versionLabel=CURRENT&objectId=091e8486805142f5 (500)
>
> <<<<<<
>
>
>
> Karl
>
>
>
>
>
>
>
>
>
> On Wed, Jun 14, 2017 at 8:29 AM, Tamizh Kumaran Thamizharasan <
> tthamizharasan@worldbankgroup.org> wrote:
>
> Hi Karl,
>
>
>
> After configuring Solr to ignore Tika errors by adding the Tika
> transformer in the job, the below behavior is observed.
>
>
>
> 1)      ManifoldCF fetches content from Documentum that has a null body
> and tries to push it to the output connector (Solr).
>
> 2)      Solr cannot accept null as a value and throws a “Missing content
> stream” error.
>
> 3)      Each agent thread in ManifoldCF is internally held up with
> different r_object_ids that have no body content and keeps trying to push
> the content to Solr after each failure, but Solr cannot accept the content
> and throws the same error.
>
> 4)      Over time, the ManifoldCF job stops with the error thrown by
> Solr.
>
>
>
> Please let us know if there is any configuration change that can help us
> resolve this issue.
>
>
>
> Please find the attached ManifoldCF error log, Solr error log, and agent log.
>
>
>
> Regards,
>
> Tamizh Kumaran.
>
>
>
> *From:* Karl Wright [mailto:daddywri@gmail.com]
> *Sent:* Tuesday, June 13, 2017 2:23 PM
> *To:* user@manifoldcf.apache.org
> *Cc:* Sharnel Merdeck Pereira; Sundarapandian Arumaidurai Vethasigamani
> *Subject:* Re: ManifoldCF documentum indexing issue
>
>
>
> Hi Tamizh,
>
>
>
> The reported error is 'Error from server at
> http://localhost:8983/solr/documentum_manifoldcf_stg: String index out of
> range: -188'.  The message seemingly indicates that the error was
> *received* from the Solr server for one specific document.  ManifoldCF
> does not recognize the error as innocuous and therefore retries for a
> while until it eventually gives up and halts the job.  However, I cannot
> find that exact text anywhere in the Solr output connector code, so I
> wonder if you transcribed it correctly?
>
> There should also be the following:
>
> (1) A record of the attempts in the manifoldcf.log file, with a MCF stack
> trace attached to each one;
>
> (2) Simple history records for that document that are of the type
> INGESTDOCUMENT.
>
> (3) Solr log entries that have a Solr stack trace.
>
>
>
> The last one is the one that would be the most helpful.  It is possible
> that you are seeing a problem in Solr Cell (Tika) that is manifesting
> itself in this way.  You can (and should) configure your Solr to ignore
> Tika errors.
>
>
>
> Thanks,
>
> Karl
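Karl's advice to configure Solr to ignore Tika errors is typically done in solrconfig.xml on the extracting handler. A sketch follows; the parameter name and handler class should be verified against the Solr 5.3.2 instance in use.

```xml
<!-- solrconfig.xml: ask Solr Cell to swallow Tika parse errors instead of
     failing the whole update (illustrative; verify for your Solr version) -->
<requestHandler name="/update/extract"
                class="solr.extraction.ExtractingRequestHandler">
  <lst name="defaults">
    <str name="ignoreTikaException">true</str>
  </lst>
</requestHandler>
```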
>
>
>
>
>
>
>
>
>
>

RE: ManifoldCF documentum indexing issue

Posted by Tamizh Kumaran Thamizharasan <tt...@worldbankgroup.org>.
Thanks Karl.

After installing the patch, filename with double quotes and backslashes were getting indexed to Solr and the issue is resolved.

Regards,
Tamizh Kumaran Thamizharasan

From: Karl Wright [mailto:daddywri@gmail.com]
Sent: Wednesday, June 21, 2017 5:07 PM
To: user@manifoldcf.apache.org
Cc: Sharnel Merdeck Pereira; Sundarapandian Arumaidurai Vethasigamani
Subject: Re: ManifoldCF documentum indexing issue

I've attached a tentative patch to the ticket CONNECTORS-1434.  Please confirm whether or not the patch works for you before I commit it to trunk.

Karl


On Wed, Jun 21, 2017 at 6:49 AM, Tamizh Kumaran Thamizharasan <tt...@worldbankgroup.org>> wrote:
Thanks Karl.

Please find the below steps to recreate the issue on file system repository.

Output connector : Solr
Repository : File system
File name in repository : “dummy” file “name.pdf

Additional Solr parameter : expandMacros=false

On starting the job with above configuration, we are getting “missing content stream” .
Please find the attached file for complete log trace.

Regards,
Tamizh Kumaran Thamizharasan

From: Karl Wright [mailto:daddywri@gmail.com<ma...@gmail.com>]
Sent: Wednesday, June 21, 2017 3:35 PM

To: user@manifoldcf.apache.org<ma...@manifoldcf.apache.org>
Cc: Sharnel Merdeck Pereira; Sundarapandian Arumaidurai Vethasigamani
Subject: Re: ManifoldCF documentum indexing issue

I've created a ticket, CONNECTORS-1434, to look at the file name issues.

Karl


On Wed, Jun 21, 2017 at 5:44 AM, Karl Wright <da...@gmail.com>> wrote:
There is no good way to handle a case where Solr doesn't like the file name.  About the only thing that could be done would be to encode the filename using something like URL encoding.  This might have some effects on existing users, but more importantly, we really would need to know what characters were legal before adopting that solution.

I am not entirely sure how the file name is transmitted to Solr when using multipart forms, but how that is done is critical to know what to do.

Karl


On Wed, Jun 21, 2017 at 4:55 AM, Tamizh Kumaran Thamizharasan <tt...@worldbankgroup.org>> wrote:
Hi Karl,

Thanks for the update!!!

As per the response from Solr team, expandMacros=false is added to the output connector as additional parameter.
After adding  expandMacros=false, the indexing job is getting completed with “Missing content stream” error for few of the documents and those are not indexed into Solr.

As per our analysis, the pdf document’s file name we are trying to index from documentum  contains whitespace and special characters like double quotes.
Which makes the file non readable and missing content stream error is thrown.

If there is any work around to overcome this issue, kindly share it with us.

Regards,
Tamizh Kumaran Thamizharasan

From: Karl Wright [mailto:daddywri@gmail.com<ma...@gmail.com>]
Sent: Wednesday, June 14, 2017 7:20 PM

To: user@manifoldcf.apache.org<ma...@manifoldcf.apache.org>
Cc: Sharnel Merdeck Pereira; Sundarapandian Arumaidurai Vethasigamani
Subject: Re: ManifoldCF documentum indexing issue

Here's the response:

>>>>>>
Karl -

There’s expandMacros=false, as covered here: https://cwiki.apache.org/confluence/display/solr/Parameter+Substitution

But… what exactly is being sent to Solr?    Is there some kind of “${…” being sent as a parameter?   Just curious what’s getting you into this in the first place.   But disabling probably is your most desired solution.

        Erik
<<<<<<

Karl


On Wed, Jun 14, 2017 at 9:36 AM, Karl Wright <da...@gmail.com>> wrote:
Here's the question I posted:

>>>>>>
Hi all,

I've got a ManifoldCF user who is posting content to Solr using the MCF Solr output connector.  This connector uses SolrJ under the covers -- a fairly recent version -- but also has overridden some classes to insure that multipart form posts will be used for most content.

The problem is that, for a specific document, the user is getting an ArrayIndexOutOfBounds exception in Solr, as follows:

>>>>>>
2017-06-14T08:25:16,546 - ERROR [qtp862890654-69725:SolrException@148] - {collection=c:documentum_manifoldcf_stg, core=x:documentum_manifoldcf_stg_shard1_replica1, node_name=n:**********:8983_solr, replica=r:core_node1, shard=s:shard1} - java.lang.StringIndexOutOfBoundsException: String index out of range: -296
        at java.lang.String.substring(String.java:1911)
        at org.apache.solr.request.macro.MacroExpander._expand(MacroExpander.java:143)
        at org.apache.solr.request.macro.MacroExpander.expand(MacroExpander.java:93)
        at org.apache.solr.request.macro.MacroExpander.expand(MacroExpander.java:59)
        at org.apache.solr.request.macro.MacroExpander.expand(MacroExpander.java:45)
        at org.apache.solr.request.json.RequestUtil.processParams(RequestUtil.java:157)
        at org.apache.solr.util.SolrPluginUtils.setDefaults(SolrPluginUtils.java:172)
        at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:152)
        at org.apache.solr.core.SolrCore.execute(SolrCore.java:2102)
        at org.apache.solr.servlet.HttpSolrCall.execute(HttpSolrCall.java:654)
        at org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:460)
        at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:257)
        at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:208)
        at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1652)
        at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:585)
        at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:143)
        at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:577)
        at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:223)
        at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1127)
        at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:515)
        at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:185)
        at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1061)
        at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
        at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:215)
        at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:110)
        at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:97)
        at org.eclipse.jetty.server.Server.handle(Server.java:499)
        at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:310)
        at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:257)
        at org.eclipse.jetty.io<http://org.eclipse.jetty.io>.AbstractConnection$2.run(AbstractConnection.java:540)
        at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:635)
        at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:555)
        at java.lang.Thread.run(Thread.java:745)
<<<<<<

It looks worrisome to me that there's now possibly some kind of "macro expansion" that is being triggered within parameters being sent to Solr.  Can anyone tell me either how to (a) disable this feature, or (b) how the MCF Solr output connector should escape parameters being posted so that Solr does not attempt any macro expansion?  If the latter, I also need to know when this feature appeared, since obviously whether or not to do the escaping will depend on the precise version of the Solr instance involved.

I'm also quite concerned that considerations of backwards compatibility may have been lost at some point with Solr, since heretofore I could count on older versions of SolrJ working with newer versions of Solr.  Please clarify what the current policy is....


Thanks,
Karl
<<<<<<



On Wed, Jun 14, 2017 at 9:35 AM, Karl Wright <da...@gmail.com>> wrote:
I posted the pertinent question to the solr dev list.  Let's see what they say.

Thanks,
Karl


On Wed, Jun 14, 2017 at 9:04 AM, Karl Wright <da...@gmail.com>> wrote:
Hi,

The exception in the solr.log should be reported as a Solr bug.  It is not emanating from the Tika extractor (Solr Cell), but is in Solr itself.

I wish there was an easy fix for this.  The problem is *not* an empty stream; it's that Solr is attempting to do something with it that it shouldn't.  MCF just gets back a 500 error from Solr, and we can't recover from that.

>>>>>>
https://**********/webtop/component/drl?versionLabel=CURRENT&objectId=091e8486805142f5 (500)
<<<<<<

Karl




On Wed, Jun 14, 2017 at 8:29 AM, Tamizh Kumaran Thamizharasan <tt...@worldbankgroup.org>> wrote:
Hi Karl,

After configuring Solr to ignore Tika errors by adding Tika transformer in the job, below behavior is observed.


1)      ManifoldCF fetches the content from documentum, which contains null content and tries to push it to the output connector(Solr).

2)      Solr couldn’t accept the null as a value and throwing “Missing content stream” error.

3)      Each agent thread In ManifoldCF internally held-up with different r_object_id’s that don’t have body content and keeps trying to push the content to Solr  after each failure, but Solr couldn’t accept the content and throws the same error.

4)      Over the time, the manifold job stops with the error thrown by Solr

Please let know if there is any configuration change which can help us resolve this issue.

Please find the attached manifoldCF error log,Solr error log and agent log.

Regards,
Tamizh Kumaran.

From: Karl Wright [mailto:daddywri@gmail.com<ma...@gmail.com>]
Sent: Tuesday, June 13, 2017 2:23 PM
To: user@manifoldcf.apache.org<ma...@manifoldcf.apache.org>
Cc: Sharnel Merdeck Pereira; Sundarapandian Arumaidurai Vethasigamani
Subject: Re: ManifoldCF documentum indexing issue

Hi Tamizh,

The reported error is 'Error from server at http://localhost:8983/solr/documentum_manifoldcf_stg: String index out of range: -188'.  The message seemingly indicates that the error was *received* from the solr server for one specific document.  ManifoldCF does not recognize the error as being innocuous and therefore it will retry for a while until it eventually gives up and halts the job.  However, I cannot find that exact text anywhere in the Solr output connector code, so I wonder if you transcribed it correctly?

There should also be the following:
(1) A record of the attempts in the manifoldcf.log file, with a MCF stack trace attached to each one;
(2) Simple history records for that document that are of the type INGESTDOCUMENT.
(3) Solr log entries that have a Solr stack trace.

The last one is the one that would be the most helpful.  It is possible that you are seeing a problem in Solr Cell (Tika) that is manifesting itself in this way.  You can (and should) configure your Solr to ignore Tika errors.

Thanks,
Karl




On Tue, Jun 13, 2017 at 3:20 AM, Tamizh Kumaran Thamizharasan <tt...@worldbankgroup.org>> wrote:
Hi,

The Manifoldcf 2.7.1 is running in the multiprocess zk model and integrated with PostgreSQL 9.3. The expected setup is to crawl the Documentum contents and pushed on to the output SOLR 5.3.2. The crawler-ui app is installed on the tomcat and startup script is pointed with the MF properties.xml during server startup. Manifold along with the bundled ZK, tomcat are running on the same host with OS as  Red Hat Enterprise Linux Server release 6.9 (Santiago). The DB is running on a windows box.
The ZK is integrated with the DB through the properties.xml and properties-global.xml
The ZK, the documentum related processes(registry and server) are up and the  two agents (start-agents.sh and start-agents-2.sh) are started  which produce multiple threads to index the documemtum contents into SOLR through ManifoldCF.

The Current no of the connections configured on the MF are as below.
SOLR Output max connection : 25
Document repository  Max Connections: 25
Properties.xml:
<property name="org.apache.manifoldcf.database.maxhandles" value="50"/>
<property name="org.apache.manifoldcf.cr<http://org.apache.manifoldcf.cr>awler.threads" value="25"/>
Total documentum document count : 0.5 million

After the Job is started, it indexed some 20000+ documents and gets terminated with the below error on the Manifold JOB.
Error: Repeated service interruptions - failure processing document: Error from server at http://localhost:8983/solr/documentum_manifoldcf_stg: String index out of range: -188

Please find the attached manifoldCF error log and agent log.

Please let me know the observations on the cause of the issue and the configuration on the threads used  for crawling. Please share your thoughts.

Regards,
Tamizh Kumaran










Re: ManifoldCF documentum indexing issue

Posted by Karl Wright <da...@gmail.com>.
I've attached a tentative patch to the ticket CONNECTORS-1434.  Please
confirm whether or not the patch works for you before I commit it to trunk.

Karl


On Wed, Jun 21, 2017 at 6:49 AM, Tamizh Kumaran Thamizharasan <
tthamizharasan@worldbankgroup.org> wrote:

> Thanks Karl.
>
>
>
> Please find the below steps to recreate the issue on file system
> repository.
>
>
>
> Output connector : Solr
>
> Repository : File system
>
> File name in repository : “dummy” file “name.pdf
>
>
>
> Additional Solr parameter : expandMacros=false
>
>
>
> On starting the job with above configuration, we are getting “missing
> content stream” .
>
> Please find the attached file for complete log trace.
>
>
>
> Regards,
>
> Tamizh Kumaran Thamizharasan
>
>
>
> *From:* Karl Wright [mailto:daddywri@gmail.com]
> *Sent:* Wednesday, June 21, 2017 3:35 PM
>
> *To:* user@manifoldcf.apache.org
> *Cc:* Sharnel Merdeck Pereira; Sundarapandian Arumaidurai Vethasigamani
> *Subject:* Re: ManifoldCF documentum indexing issue
>
>
>
> I've created a ticket, CONNECTORS-1434, to look at the file name issues.
>
>
>
> Karl
>
>
>
>
>
> On Wed, Jun 21, 2017 at 5:44 AM, Karl Wright <da...@gmail.com> wrote:
>
> There is no good way to handle a case where Solr doesn't like the file
> name.  About the only thing that could be done would be to encode the
> filename using something like URL encoding.  This might have some effects
> on existing users, but more importantly, we really would need to know what
> characters were legal before adopting that solution.
>
>
>
> I am not entirely sure how the file name is transmitted to Solr when using
> multipart forms, but how that is done is critical to know what to do.
>
>
>
> Karl
>
>
>
>
>
> On Wed, Jun 21, 2017 at 4:55 AM, Tamizh Kumaran Thamizharasan <
> tthamizharasan@worldbankgroup.org> wrote:
>
> Hi Karl,
>
>
>
> Thanks for the update!!!
>
>
>
> As per the response from Solr team, expandMacros=false is added to the
> output connector as additional parameter.
>
> After adding  expandMacros=false, the indexing job is getting completed
> with “Missing content stream” error for few of the documents and those are
> not indexed into Solr.
>
>
>
> As per our analysis, the pdf document’s file name we are trying to index
> from documentum  contains whitespace and special characters like double
> quotes.
>
> Which makes the file non readable and missing content stream error is
> thrown.
>
>
>
> If there is any work around to overcome this issue, kindly share it with
> us.
>
>
>
> Regards,
>
> Tamizh Kumaran Thamizharasan
>
>
>
> *From:* Karl Wright [mailto:daddywri@gmail.com]
> *Sent:* Wednesday, June 14, 2017 7:20 PM
>
>
> *To:* user@manifoldcf.apache.org
> *Cc:* Sharnel Merdeck Pereira; Sundarapandian Arumaidurai Vethasigamani
> *Subject:* Re: ManifoldCF documentum indexing issue
>
>
>
> Here's the response:
>
>
>
> >>>>>>
>
> Karl -
>
> There’s expandMacros=false, as covered here:
> https://cwiki.apache.org/confluence/display/solr/Parameter+Substitution
>
> But… what exactly is being sent to Solr?    Is there some kind of “${…”
> being sent as a parameter?   Just curious what’s getting you into this in
> the first place.   But disabling probably is your most desired solution.
>
>         Erik
>
> <<<<<<
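As a sketch of applying Erik's suggestion (the request path and the `literal.id` value below are illustrative; in ManifoldCF the parameter would be set in the Solr output connection configuration rather than built by hand), `expandMacros=false` is simply an extra request parameter, so Solr skips `${...}` substitution on the other parameter values:

```python
from urllib.parse import urlencode

# Hypothetical update/extract request. The collection name comes from the
# thread; the document id and parameter set are made up for illustration.
params = {
    "literal.id": "091e8486805142f5",
    "expandMacros": "false",  # stop Solr from treating "${..." in values as macros
}
url = ("http://localhost:8983/solr/documentum_manifoldcf_stg/update/extract?"
       + urlencode(params))
print(url)
```

With this flag present, a parameter value that happens to contain `${` is passed through literally instead of triggering the MacroExpander code path seen in the stack trace above.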
>
>
>
> Karl
>
>
>
>
>
> On Wed, Jun 14, 2017 at 9:36 AM, Karl Wright <da...@gmail.com> wrote:
>
> Here's the question I posted:
>
>
>
> >>>>>>
>
> Hi all,
>
>
>
> I've got a ManifoldCF user who is posting content to Solr using the MCF
> Solr output connector.  This connector uses SolrJ under the covers -- a
> fairly recent version -- but also has overridden some classes to ensure
> that multipart form posts will be used for most content.
>
>
>
> The problem is that, for a specific document, the user is getting an
> ArrayIndexOutOfBounds exception in Solr, as follows:
>
>
>
> >>>>>>
>
> 2017-06-14T08:25:16,546 - ERROR [qtp862890654-69725:SolrException@148] -
> {collection=c:documentum_manifoldcf_stg, core=x:documentum_manifoldcf_stg_shard1_replica1,
> node_name=n:**********:8983_solr, replica=r:core_node1, shard=s:shard1} -
> java.lang.StringIndexOutOfBoundsException: String index out of range: -296
>         at java.lang.String.substring(String.java:1911)
>         at org.apache.solr.request.macro.MacroExpander._expand(MacroExpander.java:143)
>         at org.apache.solr.request.macro.MacroExpander.expand(MacroExpander.java:93)
>         at org.apache.solr.request.macro.MacroExpander.expand(MacroExpander.java:59)
>         at org.apache.solr.request.macro.MacroExpander.expand(MacroExpander.java:45)
>         at org.apache.solr.request.json.RequestUtil.processParams(RequestUtil.java:157)
>         at org.apache.solr.util.SolrPluginUtils.setDefaults(SolrPluginUtils.java:172)
>         at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:152)
>         at org.apache.solr.core.SolrCore.execute(SolrCore.java:2102)
>         at org.apache.solr.servlet.HttpSolrCall.execute(HttpSolrCall.java:654)
>         at org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:460)
>         at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:257)
>         at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:208)
>         at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1652)
>         at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:585)
>         at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:143)
>         at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:577)
>         at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:223)
>         at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1127)
>         at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:515)
>         at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:185)
>         at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1061)
>         at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
>         at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:215)
>         at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:110)
>         at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:97)
>         at org.eclipse.jetty.server.Server.handle(Server.java:499)
>         at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:310)
>         at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:257)
>         at org.eclipse.jetty.io.AbstractConnection$2.run(AbstractConnection.java:540)
>         at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:635)
>         at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:555)
>         at java.lang.Thread.run(Thread.java:745)
>
> <<<<<<
>
>
>
> It looks worrisome to me that there's now possibly some kind of "macro
> expansion" that is being triggered within parameters being sent to Solr.
> Can anyone tell me either how to (a) disable this feature, or (b) how the
> MCF Solr output connector should escape parameters being posted so that
> Solr does not attempt any macro expansion?  If the latter, I also need to
> know when this feature appeared, since obviously whether or not to do the
> escaping will depend on the precise version of the Solr instance involved.
>
>
>
> I'm also quite concerned that considerations of backwards compatibility
> may have been lost at some point with Solr, since heretofore I could count
> on older versions of SolrJ working with newer versions of Solr.  Please
> clarify what the current policy is....
>
>
>
>
>
> Thanks,
>
> Karl
>
> <<<<<<
>
>
>
>
>
>
>
> On Wed, Jun 14, 2017 at 9:35 AM, Karl Wright <da...@gmail.com> wrote:
>
> I posted the pertinent question to the solr dev list.  Let's see what they
> say.
>
>
>
> Thanks,
>
> Karl
>
>
>
>
>
> On Wed, Jun 14, 2017 at 9:04 AM, Karl Wright <da...@gmail.com> wrote:
>
> Hi,
>
>
>
> The exception in the solr.log should be reported as a Solr bug.  It is not
> emanating from the Tika extractor (Solr Cell), but is in Solr itself.
>
>
>
> I wish there was an easy fix for this.  The problem is *not* an empty
> stream; it's that Solr is attempting to do something with it that it
> shouldn't.  MCF just gets back a 500 error from Solr, and we can't recover
> from that.
>
> >>>>>>
>
> https://**********/webtop/component/drl?versionLabel=CURRENT&objectId=091e8486805142f5 (500)
>
> <<<<<<
>
>
>
> Karl
>
>
>
>
>
>
>
>
>
> On Wed, Jun 14, 2017 at 8:29 AM, Tamizh Kumaran Thamizharasan <
> tthamizharasan@worldbankgroup.org> wrote:
>
> Hi Karl,
>
>
>
> After configuring Solr to ignore Tika errors by adding Tika transformer in
> the job, below behavior is observed.
>
>
>
> 1)      ManifoldCF fetches content from Documentum that has a null body
> and tries to push it to the output connector (Solr).
>
> 2)      Solr cannot accept null as a value and throws a “Missing
> content stream” error.
>
> 3)      Each agent thread in ManifoldCF is internally held up with
> different r_object_ids that have no body content and keeps trying to push
> the content to Solr after each failure, but Solr cannot accept the content
> and throws the same error.
>
> 4)      Over time, the ManifoldCF job stops with the error thrown by
> Solr.
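The empty-body case in the steps above can be guarded against up front; a minimal sketch (illustrative standalone code, not actual ManifoldCF connector logic, where such a check would live in the pipeline or connector):

```python
from typing import Optional

def should_index(body: Optional[bytes]) -> bool:
    """Return True only for documents with a non-empty body.

    Posting a document with no content stream is what produces Solr's
    "Missing content stream" error, so empty documents are skipped
    instead of being retried forever.
    """
    return body is not None and len(body) > 0

# Illustrative batch keyed by r_object_id-like ids: only "a" has real content.
docs = {"a": b"%PDF-1.4 ...", "b": None, "c": b""}
indexable = [doc_id for doc_id, body in docs.items() if should_index(body)]
print(indexable)  # ['a']
```

Skipping such documents (or indexing metadata only) avoids the retry loop described in step 3, since the failing push is never attempted.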
>
>
>
> Please let us know if there is any configuration change that can help us
> resolve this issue.
>
>
>
> Please find the attached ManifoldCF error log, Solr error log, and agent log.
>
>
>
> Regards,
>
> Tamizh Kumaran.
>
>
>
> *From:* Karl Wright [mailto:daddywri@gmail.com]
> *Sent:* Tuesday, June 13, 2017 2:23 PM
> *To:* user@manifoldcf.apache.org
> *Cc:* Sharnel Merdeck Pereira; Sundarapandian Arumaidurai Vethasigamani
> *Subject:* Re: ManifoldCF documentum indexing issue
>
>
>
> Hi Tamizh,
>
>
>
> The reported error is 'Error from server at
> http://localhost:8983/solr/documentum_manifoldcf_stg: String index out of
> range: -188'.  The message
> seemingly indicates that the error was *received* from the solr server for
> one specific document.  ManifoldCF does not recognize the error as being
> innocuous and therefore it will retry for a while until it eventually gives
> up and halts the job.  However, I cannot find that exact text anywhere in
> the Solr output connector code, so I wonder if you transcribed it correctly?
>
> There should also be the following:
>
> (1) A record of the attempts in the manifoldcf.log file, with a MCF stack
> trace attached to each one;
>
> (2) Simple history records for that document that are of the type
> INGESTDOCUMENT.
>
> (3) Solr log entries that have a Solr stack trace.
>
>
>
> The last one is the one that would be the most helpful.  It is possible
> that you are seeing a problem in Solr Cell (Tika) that is manifesting
> itself in this way.  You can (and should) configure your Solr to ignore
> Tika errors.
>
>
>
> Thanks,
>
> Karl
>
>
>
>
>
>
>
>
>
> On Tue, Jun 13, 2017 at 3:20 AM, Tamizh Kumaran Thamizharasan <
> tthamizharasan@worldbankgroup.org> wrote:
>
> Hi,
>
>
>
> ManifoldCF 2.7.1 is running in the multiprocess ZooKeeper model and is
> integrated with PostgreSQL 9.3. The intended setup is to crawl Documentum
> content and push it to the Solr 5.3.2 output. The crawler-ui app is
> installed on Tomcat, and the Tomcat startup script points at the ManifoldCF
> properties.xml. ManifoldCF, the bundled ZooKeeper, and Tomcat run on the
> same host, on Red Hat Enterprise Linux Server release 6.9 (Santiago). The
> database runs on a Windows box.
>
> ZooKeeper is integrated with the database through properties.xml and
> properties-global.xml.
>
> ZooKeeper and the Documentum-related processes (registry and server) are
> up, and the two agents (start-agents.sh and start-agents-2.sh) are started,
> each of which produces multiple threads to index the Documentum content
> into Solr through ManifoldCF.
>
>
>
> The current connection settings configured in ManifoldCF are as follows.
>
> Solr output max connections: 25
>
> Documentum repository max connections: 25
>
> Properties.xml:
>
> <property name="org.apache.manifoldcf.database.maxhandles" value="50"/>
>
> <property name="org.apache.manifoldcf.crawler.threads" value="25"/>
>
> Total documentum document count : 0.5 million
>
>
>
> After the job is started, it indexes some 20,000+ documents and then gets
> terminated with the below error on the ManifoldCF job:
>
> Error: Repeated service interruptions - failure processing document: Error
> from server at http://localhost:8983/solr/documentum_manifoldcf_stg:
> String index out of range: -188
>
>
>
> Please find the attached ManifoldCF error log and agent log.
>
>
>
> Please let me know your observations on the cause of the issue and on the
> thread configuration used for crawling. Please share your thoughts.
>
>
>
> Regards,
>
> Tamizh Kumaran
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>

RE: ManifoldCF documentum indexing issue

Posted by Tamizh Kumaran Thamizharasan <tt...@worldbankgroup.org>.
Thanks Karl.

Please find below the steps to recreate the issue with a file system repository.

Output connector : Solr
Repository : File system
File name in repository : “dummy” file “name.pdf

Additional Solr parameter : expandMacros=false

On starting the job with the above configuration, we get the “Missing content stream” error.
Please find the attached file for complete log trace.
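For illustration of why that file name is problematic (a sketch of the failure mode, not the actual SolrJ code path): if a multipart upload naively interpolates the raw file name into the Content-Disposition header, the embedded double quotes terminate the quoted filename token early and the receiver sees an empty name plus stray tokens:

```python
filename = '"dummy" file "name.pdf'  # the repro file name from above

# Naive header construction, interpolating the raw name:
header = 'Content-Disposition: form-data; name="content"; filename="%s"' % filename

# A parser that reads up to the first unescaped quote gets an empty
# filename instead of the intended one.
parsed_filename = header.split('filename="', 1)[1].split('"', 1)[0]
print(repr(parsed_filename))  # '' -- the embedded quote closed the token
```

A fix along the lines discussed earlier in the thread would encode the name (for example with percent-encoding) before it is placed in the header.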

Regards,
Tamizh Kumaran Thamizharasan


Re: ManifoldCF documentum indexing issue

Posted by Karl Wright <da...@gmail.com>.
I've created a ticket, CONNECTORS-1434, to look at the file name issues.

Karl
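As a sketch of the URL-encoding approach floated earlier in the thread (an assumption about a possible direction, not the actual CONNECTORS-1434 fix), percent-encoding the file name before transmission removes the characters that break the multipart header, and the encoding is reversible:

```python
from urllib.parse import quote, unquote

raw_name = '"dummy" file "name.pdf'  # the problematic repro file name

# Percent-encode everything outside the unreserved set: quotes become %22
# and spaces become %20, leaving a name with no header-breaking characters.
safe_name = quote(raw_name, safe="")
print(safe_name)  # %22dummy%22%20file%20%22name.pdf

assert unquote(safe_name) == raw_name  # decoding recovers the original name
```

The trade-off Karl notes still applies: existing users would see encoded names in their index unless the consumer decodes them again.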


On Wed, Jun 21, 2017 at 5:44 AM, Karl Wright <da...@gmail.com> wrote:

> There is no good way to handle a case where Solr doesn't like the file
> name.  About the only thing that could be done would be to encode the
> filename using something like URL encoding.  This might have some effects
> on existing users, but more importantly, we really would need to know what
> characters were legal before adopting that solution.
>
> I am not entirely sure how the file name is transmitted to Solr when using
> multipart forms, but how that is done is critical to know what to do.
>
> Karl
>
>
> On Wed, Jun 21, 2017 at 4:55 AM, Tamizh Kumaran Thamizharasan <
> tthamizharasan@worldbankgroup.org> wrote:
>
>> Hi Karl,
>>
>>
>>
>> Thanks for the update!!!
>>
>>
>>
>> As per the response from Solr team, expandMacros=false is added to the
>> output connector as additional parameter.
>>
>> After adding  expandMacros=false, the indexing job is getting completed
>> with “Missing content stream” error for few of the documents and those are
>> not indexed into Solr.
>>
>>
>>
>> As per our analysis, the pdf document’s file name we are trying to index
>> from documentum  contains whitespace and special characters like double
>> quotes.
>>
>> Which makes the file non readable and missing content stream error is
>> thrown.
>>
>>
>>
>> If there is any work around to overcome this issue, kindly share it with
>> us.
>>
>>
>>
>> Regards,
>>
>> Tamizh Kumaran Thamizharasan
>>
>>
>>
>> *From:* Karl Wright [mailto:daddywri@gmail.com]
>> *Sent:* Wednesday, June 14, 2017 7:20 PM
>>
>> *To:* user@manifoldcf.apache.org
>> *Cc:* Sharnel Merdeck Pereira; Sundarapandian Arumaidurai Vethasigamani
>> *Subject:* Re: ManifoldCF documentum indexing issue
>>
>>
>>
>> Here's the response:
>>
>>
>>
>> >>>>>>
>>
>> Karl -
>>
>> There’s expandMacros=false, as covered here: https://cwiki.apache.org
>> /confluence/display/solr/Parameter+Substitution
>>
>> But… what exactly is being sent to Solr?    Is there some kind of “${…”
>> being sent as a parameter?   Just curious what’s getting you into this in
>> the first place.   But disabling probably is your most desired solution.
>>
>>         Erik
>>
>> <<<<<<
>>
>>
>>
>> Karl
>>
>>
>>
>>
>>
>> On Wed, Jun 14, 2017 at 9:36 AM, Karl Wright <da...@gmail.com> wrote:
>>
>> Here's the question I posted:
>>
>>
>>
>> >>>>>>
>>
>> Hi all,
>>
>>
>>
>> I've got a ManifoldCF user who is posting content to Solr using the MCF
>> Solr output connector.  This connector uses SolrJ under the covers -- a
>> fairly recent version -- but also has overridden some classes to insure
>> that multipart form posts will be used for most content.
>>
>>
>>
>> The problem is that, for a specific document, the user is getting an
>> ArrayIndexOutOfBounds exception in Solr, as follows:
>>
>>
>>
>> >>>>>>
>>
>> 2017-06-14T08:25:16,546 - ERROR [qtp862890654-69725:SolrException@148] -
>> {collection=c:documentum_manifoldcf_stg, core=x:documentum_manifoldcf_stg_shard1_replica1,
>> node_name=n:**********:8983_solr, replica=r:core_node1, shard=s:shard1}
>> - java.lang.StringIndexOutOfBoundsException: String index out of range:
>> -296
>>
>>         at java.lang.String.substring(String.java:1911)
>>
>>         at org.apache.solr.request.macro.MacroExpander._expand(MacroExp
>> ander.java:143)
>>
>>         at org.apache.solr.request.macro.MacroExpander.expand(MacroExpa
>> nder.java:93)
>>
>>         at org.apache.solr.request.macro.MacroExpander.expand(MacroExpa
>> nder.java:59)
>>
>>         at org.apache.solr.request.macro.MacroExpander.expand(MacroExpa
>> nder.java:45)
>>
>>         at org.apache.solr.request.json.RequestUtil.processParams(Reque
>> stUtil.java:157)
>>
>>         at org.apache.solr.util.SolrPluginUtils.setDefaults(SolrPluginU
>> tils.java:172)
>>
>>         at org.apache.solr.handler.RequestHandlerBase.handleRequest(Req
>> uestHandlerBase.java:152)
>>
>>         at org.apache.solr.core.SolrCore.execute(SolrCore.java:2102)
>>
>>         at org.apache.solr.servlet.HttpSolrCall.execute(HttpSolrCall.
>> java:654)
>>
>>         at org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:
>> 460)
>>
>>         at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDisp
>> atchFilter.java:257)
>>
>>         at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDisp
>> atchFilter.java:208)
>>
>>         at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilte
>> r(ServletHandler.java:1652)
>>
>>         at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHan
>> dler.java:585)
>>
>>         at org.eclipse.jetty.server.handler.ScopedHandler.handle(Scoped
>> Handler.java:143)
>>
>>         at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHa
>> ndler.java:577)
>>
>>         at org.eclipse.jetty.server.session.SessionHandler.doHandle(
>> SessionHandler.java:223)
>>
>>         at org.eclipse.jetty.server.handler.ContextHandler.doHandle(
>> ContextHandler.java:1127)
>>
>>         at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHand
>> ler.java:515)
>>
>>         at org.eclipse.jetty.server.session.SessionHandler.doScope(
>> SessionHandler.java:185)
>>
>>         at org.eclipse.jetty.server.handler.ContextHandler.doScope(
>> ContextHandler.java:1061)
>>
>>         at org.eclipse.jetty.server.handler.ScopedHandler.handle(Scoped
>> Handler.java:141)
>>
>>         at org.eclipse.jetty.server.handler.ContextHandlerCollection.ha
>> ndle(ContextHandlerCollection.java:215)
>>
>>         at org.eclipse.jetty.server.handler.HandlerCollection.handle(
>> HandlerCollection.java:110)
>>
>>         at org.eclipse.jetty.server.handler.HandlerWrapper.handle(Handl
>> erWrapper.java:97)
>>
>>         at org.eclipse.jetty.server.Server.handle(Server.java:499)
>>
>>         at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.
>> java:310)
>>
>>         at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConne
>> ction.java:257)
>>
>>         at org.eclipse.jetty.io.AbstractConnection$2.run(AbstractConnec
>> tion.java:540)
>>
>>         at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(Queued
>> ThreadPool.java:635)
>>
>>         at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedT
>> hreadPool.java:555)
>>
>>         at java.lang.Thread.run(Thread.java:745)
>>
>> <<<<<<
>>
>>
>>
>> It looks worrisome to me that there's now possibly some kind of "macro
>> expansion" that is being triggered within parameters being sent to Solr.
>> Can anyone tell me either how to (a) disable this feature, or (b) how the
>> MCF Solr output connector should escape parameters being posted so that
>> Solr does not attempt any macro expansion?  If the latter, I also need to
>> know when this feature appeared, since obviously whether or not to do the
>> escaping will depend on the precise version of the Solr instance involved.
>>
>>
>>
>> I'm also quite concerned that considerations of backwards compatibility
>> may have been lost at some point with Solr, since heretofore I could count
>> on older versions of SolrJ working with newer versions of Solr.  Please
>> clarify what the current policy is....
>>
>>
>>
>>
>>
>> Thanks,
>>
>> Karl
>>
>> <<<<<<
>>
>>
>>
>>
>>
>>
>>
>
>

Re: ManifoldCF documentum indexing issue

Posted by Karl Wright <da...@gmail.com>.
There is no good way to handle a case where Solr doesn't like the file
name.  About the only thing that could be done would be to encode the
filename using something like URL encoding.  This might have some effects
on existing users, but more importantly, we really would need to know what
characters were legal before adopting that solution.

I am not entirely sure how the file name is transmitted to Solr when using
multipart forms, but knowing how that is done is critical to deciding what to do.

Karl
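To make the suggestion concrete, here is a minimal sketch of URL-encoding a file name before it goes into a request. The `sanitizeFileName` helper is hypothetical, for illustration only; it is not part of the MCF Solr connector, and whether Solr would accept (or double-decode) such names is exactly the open question above.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

public class FileNameEncoding {

    // Hypothetical helper (not part of the MCF connector): URL-encode a
    // file name so whitespace and quotes cannot leak into HTTP headers.
    static String sanitizeFileName(String name) {
        try {
            return URLEncoder.encode(name, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is guaranteed to be supported by every JVM
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Spaces become '+', double quotes become %22
        System.out.println(sanitizeFileName("annual \"draft\" report.pdf"));
    }
}
```

Note that this encoding would change the document identifiers existing users see in their index, which is the backwards-compatibility effect mentioned above.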


On Wed, Jun 21, 2017 at 4:55 AM, Tamizh Kumaran Thamizharasan <
tthamizharasan@worldbankgroup.org> wrote:

> Hi Karl,
>
>
>
> Thanks for the update!!!
>
>
>
> As per the response from the Solr team, expandMacros=false has been added
> to the output connector as an additional parameter.
>
> After adding expandMacros=false, the indexing job now completes, but a
> “Missing content stream” error is thrown for a few of the documents, and
> those documents are not indexed into Solr.
>
>
>
> As per our analysis, the file names of the PDF documents we are trying to
> index from Documentum contain whitespace and special characters such as
> double quotes, which makes the files unreadable, and the “Missing content
> stream” error is thrown.
>
>
>
> If there is any workaround to overcome this issue, kindly share it with
> us.
>
>
>
> Regards,
>
> Tamizh Kumaran Thamizharasan
>
>
>

RE: ManifoldCF documentum indexing issue

Posted by Tamizh Kumaran Thamizharasan <tt...@worldbankgroup.org>.
Hi Karl,

Thanks for the update!!!

As per the response from the Solr team, expandMacros=false has been added to the output connector as an additional parameter.
After adding expandMacros=false, the indexing job now completes, but a “Missing content stream” error is thrown for a few of the documents, and those documents are not indexed into Solr.
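For reference, assuming the additional parameter is passed straight through to Solr's update request, the wire-level effect of the workaround is just one more query parameter on every update call. The handler path and literal.id parameter below are illustrative assumptions, not the exact request ManifoldCF builds via SolrJ:

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

public class UpdateUrl {

    // Sketch of an extract-handler URL with the expandMacros=false
    // workaround appended, so Solr skips macro expansion of parameters.
    static String updateUrl(String base, String id) {
        try {
            return base + "/update/extract?literal.id="
                    + URLEncoder.encode(id, "UTF-8")
                    + "&expandMacros=false";
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(updateUrl(
                "http://localhost:8983/solr/documentum_manifoldcf_stg",
                "091e8486805142f5"));
    }
}
```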

As per our analysis, the file names of the PDF documents we are trying to index from Documentum contain whitespace and special characters such as double quotes,
which makes the files unreadable, and the “Missing content stream” error is thrown.
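A raw double quote is disruptive here because, in a multipart upload, the file name travels inside a quoted string in the part's Content-Disposition header, so an unescaped quote ends the value early. A sketch of that header layout follows; the backslash-escaping helper is an assumption for illustration, not what SolrJ is known to do:

```java
public class ContentDisposition {

    // Backslash-escape quotes and backslashes so the name survives inside
    // a quoted-string header value (illustrative escaping, per the
    // quoted-string convention of MIME headers).
    static String quote(String name) {
        return name.replace("\\", "\\\\").replace("\"", "\\\"");
    }

    static String header(String fileName) {
        return "Content-Disposition: form-data; name=\"file\"; filename=\""
                + quote(fileName) + "\"";
    }

    public static void main(String[] args) {
        // Without the escaping, the quote inside the name would terminate
        // the filename=... value early and the part could be misparsed.
        System.out.println(header("annual \"draft\" report.pdf"));
    }
}
```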

If there is any workaround to overcome this issue, kindly share it with us.

Regards,
Tamizh Kumaran Thamizharasan

From: Karl Wright [mailto:daddywri@gmail.com]
Sent: Wednesday, June 14, 2017 7:20 PM
To: user@manifoldcf.apache.org
Cc: Sharnel Merdeck Pereira; Sundarapandian Arumaidurai Vethasigamani
Subject: Re: ManifoldCF documentum indexing issue

Here's the response:

>>>>>>
Karl -

There’s expandMacros=false, as covered here: https://cwiki.apache.org/confluence/display/solr/Parameter+Substitution

But… what exactly is being sent to Solr?    Is there some kind of “${…” being sent as a parameter?   Just curious what’s getting you into this in the first place.   But disabling probably is your most desired solution.

        Erik
<<<<<<

Karl



Re: ManifoldCF documentum indexing issue

Posted by Karl Wright <da...@gmail.com>.
Here's the response:

>>>>>>
Karl -

There’s expandMacros=false, as covered here: https://cwiki.apache.org/
confluence/display/solr/Parameter+Substitution

But… what exactly is being sent to Solr?    Is there some kind of “${…”
being sent as a parameter?   Just curious what’s getting you into this in
the first place.   But disabling probably is your most desired solution.

        Erik
<<<<<<

Karl


On Wed, Jun 14, 2017 at 9:36 AM, Karl Wright <da...@gmail.com> wrote:

> Here's the question I posted:
>
> >>>>>>
> Hi all,
>
> I've got a ManifoldCF user who is posting content to Solr using the MCF
> Solr output connector.  This connector uses SolrJ under the covers -- a
> fairly recent version -- but also has overridden some classes to insure
> that multipart form posts will be used for most content.
>
> The problem is that, for a specific document, the user is getting an
> ArrayIndexOutOfBounds exception in Solr, as follows:
>
> >>>>>>
> 2017-06-14T08:25:16,546 - ERROR [qtp862890654-69725:SolrException@148] -
> {collection=c:documentum_manifoldcf_stg, core=x:documentum_manifoldcf_stg_shard1_replica1,
> node_name=n:**********:8983_solr, replica=r:core_node1, shard=s:shard1} -
> java.lang.StringIndexOutOfBoundsException: String index out of range: -296
>         at java.lang.String.substring(String.java:1911)
>         at org.apache.solr.request.macro.MacroExpander._expand(MacroExp
> ander.java:143)
>         at org.apache.solr.request.macro.MacroExpander.expand(MacroExpa
> nder.java:93)
>         at org.apache.solr.request.macro.MacroExpander.expand(MacroExpa
> nder.java:59)
>         at org.apache.solr.request.macro.MacroExpander.expand(MacroExpa
> nder.java:45)
>         at org.apache.solr.request.json.RequestUtil.processParams(Reque
> stUtil.java:157)
>         at org.apache.solr.util.SolrPluginUtils.setDefaults(SolrPluginU
> tils.java:172)
>         at org.apache.solr.handler.RequestHandlerBase.handleRequest(Req
> uestHandlerBase.java:152)
>         at org.apache.solr.core.SolrCore.execute(SolrCore.java:2102)
>         at org.apache.solr.servlet.HttpSolrCall.execute(HttpSolrCall.
> java:654)
>         at org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:
> 460)
>         at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDisp
> atchFilter.java:257)
>         at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDisp
> atchFilter.java:208)
>         at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilte
> r(ServletHandler.java:1652)
>         at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHan
> dler.java:585)
>         at org.eclipse.jetty.server.handler.ScopedHandler.handle(Scoped
> Handler.java:143)
>         at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHa
> ndler.java:577)
>         at org.eclipse.jetty.server.session.SessionHandler.doHandle(
> SessionHandler.java:223)
>         at org.eclipse.jetty.server.handler.ContextHandler.doHandle(
> ContextHandler.java:1127)
>         at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHand
> ler.java:515)
>         at org.eclipse.jetty.server.session.SessionHandler.doScope(
> SessionHandler.java:185)
>         at org.eclipse.jetty.server.handler.ContextHandler.doScope(
> ContextHandler.java:1061)
>         at org.eclipse.jetty.server.handler.ScopedHandler.handle(Scoped
> Handler.java:141)
>         at org.eclipse.jetty.server.handler.ContextHandlerCollection.ha
> ndle(ContextHandlerCollection.java:215)
>         at org.eclipse.jetty.server.handler.HandlerCollection.handle(
> HandlerCollection.java:110)
>         at org.eclipse.jetty.server.handler.HandlerWrapper.handle(Handl
> erWrapper.java:97)
>         at org.eclipse.jetty.server.Server.handle(Server.java:499)
>         at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.
> java:310)
>         at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConne
> ction.java:257)
>         at org.eclipse.jetty.io.AbstractConnection$2.run(AbstractConnec
> tion.java:540)
>         at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(Queued
> ThreadPool.java:635)
>         at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedT
> hreadPool.java:555)
>         at java.lang.Thread.run(Thread.java:745)
> <<<<<<
>
> It looks worrisome to me that there's now possibly some kind of "macro
> expansion" that is being triggered within parameters being sent to Solr.
> Can anyone tell me either how to (a) disable this feature, or (b) how the
> MCF Solr output connector should escape parameters being posted so that
> Solr does not attempt any macro expansion?  If the latter, I also need to
> know when this feature appeared, since obviously whether or not to do the
> escaping will depend on the precise version of the Solr instance involved.
>
> I'm also quite concerned that considerations of backwards compatibility
> may have been lost at some point with Solr, since heretofore I could count
> on older versions of SolrJ working with newer versions of Solr.  Please
> clarify what the current policy is....
>
>
> Thanks,
> Karl
> <<<<<<
>
>
>
> On Wed, Jun 14, 2017 at 9:35 AM, Karl Wright <da...@gmail.com> wrote:
>
>> I posted the pertinent question to the solr dev list.  Let's see what
>> they say.
>>
>> Thanks,
>> Karl
>>
>>
>> On Wed, Jun 14, 2017 at 9:04 AM, Karl Wright <da...@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> The exception in the solr.log should be reported as a Solr bug.  It is
>>> not emanating from the Tika extractor (Solr Cell), but is in Solr itself.
>>>
>>> I wish there was an easy fix for this.  The problem is *not* an empty
>>> stream; it's that Solr is attempting to do something with it that it
>>> shouldn't.  MCF just gets back a 500 error from Solr, and we can't recover
>>> from that.
>>>
>>> >>>>>>
>>> https://**********/webtop/component/drl?versionLabel=CURRENT&objectId=091e8486805142f5
>>> (500)
>>> <<<<<<
>>>
>>> Karl
>>>
>>>
>>>
>>>
>>> On Wed, Jun 14, 2017 at 8:29 AM, Tamizh Kumaran Thamizharasan <
>>> tthamizharasan@worldbankgroup.org> wrote:
>>>
>>>> Hi Karl,
>>>>
>>>>
>>>>
>>>> After configuring Solr to ignore Tika errors by adding Tika transformer
>>>> in the job, below behavior is observed.
>>>>
>>>>
>>>>
>>>> 1)      ManifoldCF fetches the content from documentum, which contains
>>>> null content and tries to push it to the output connector(Solr).
>>>>
>>>> 2)      Solr couldn’t accept the null as a value and throwing “Missing
>>>> content stream” error.
>>>>
>>>> 3)      Each agent thread In ManifoldCF internally held-up with
>>>> different r_object_id’s that don’t have body content and keeps trying to
>>>> push the content to Solr  after each failure, but Solr couldn’t accept the
>>>> content and throws the same error.
>>>>
>>>> 4)      Over the time, the manifold job stops with the error thrown by
>>>> Solr
>>>>
>>>>
>>>>
>>>> Please let know if there is any configuration change which can help us
>>>> resolve this issue.
>>>>
>>>>
>>>>
>>>> Please find the attached manifoldCF error log,Solr error log and agent
>>>> log.
>>>>
>>>>
>>>>
>>>> Regards,
>>>>
>>>> Tamizh Kumaran.
>>>>
>>>>
>>>>
>>>> *From:* Karl Wright [mailto:daddywri@gmail.com]
>>>> *Sent:* Tuesday, June 13, 2017 2:23 PM
>>>> *To:* user@manifoldcf.apache.org
>>>> *Cc:* Sharnel Merdeck Pereira; Sundarapandian Arumaidurai Vethasigamani
>>>> *Subject:* Re: ManifoldCF documentum indexing issue
>>>>
>>>>
>>>>
>>>> Hi Tamizh,
>>>>
>>>>
>>>>
>>>> The reported error is 'Error from server at http://localhost:8983/solr/
>>>> documentum_manifoldcf_stg: String index out of range: -188'.  The
>>>> message seemingly indicates that the error was *received* from the solr
>>>> server for one specific document.  ManifoldCF does not recognize the error
>>>> as being innocuous and therefore it will retry for a while until it
>>>> eventually gives up and halts the job.  However, I cannot find that exact
>>>> text anywhere in the Solr output connector code, so I wonder if you
>>>> transcribed it correctly?
>>>>
>>>> There should also be the following:
>>>>
>>>> (1) A record of the attempts in the manifoldcf.log file, with a MCF
>>>> stack trace attached to each one;
>>>>
>>>> (2) Simple history records for that document that are of the type
>>>> INGESTDOCUMENT.
>>>>
>>>> (3) Solr log entries that have a Solr stack trace.
>>>>
>>>>
>>>>
>>>> The last one is the one that would be the most helpful.  It is possible
>>>> that you are seeing a problem in Solr Cell (Tika) that is manifesting
>>>> itself in this way.  You can (and should) configure your Solr to ignore
>>>> Tika errors.
>>>>
>>>>
>>>>
>>>> Thanks,
>>>>
>>>> Karl
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> On Tue, Jun 13, 2017 at 3:20 AM, Tamizh Kumaran Thamizharasan <
>>>> tthamizharasan@worldbankgroup.org> wrote:
>>>>
>>>> Hi,
>>>>
>>>>
>>>>
>>>> The Manifoldcf 2.7.1 is running in the multiprocess zk model and
>>>> integrated with PostgreSQL 9.3. The expected setup is to crawl the
>>>> Documentum contents and pushed on to the output SOLR 5.3.2. The crawler-ui
>>>> app is installed on the tomcat and startup script is pointed with the MF
>>>> properties.xml during server startup. Manifold along with the bundled ZK,
>>>> tomcat are running on the same host with OS as  Red Hat Enterprise Linux
>>>> Server release 6.9 (Santiago). The DB is running on a windows box.
>>>>
>>>> The ZK is integrated with the DB through the properties.xml and
>>>> properties-global.xml
>>>>
>>>> The ZK, the documentum related processes(registry and server) are up
>>>> and the  two agents (start-agents.sh and start-agents-2.sh) are started
>>>> which produce multiple threads to index the documemtum contents into SOLR
>>>> through ManifoldCF.
>>>>
>>>>
>>>>
>>>> The Current no of the connections configured on the MF are as below.
>>>>
>>>> SOLR Output max connection : 25
>>>>
>>>> Document repository  Max Connections: 25
>>>>
>>>> Properties.xml:
>>>>
>>>> <property name="org.apache.manifoldcf.database.maxhandles" value="50"/>
>>>>
>>>> <property name="org.apache.manifoldcf.crawler.threads" value="25"/>
>>>>
>>>> Total documentum document count : 0.5 million
>>>>
>>>>
>>>>
>>>> After the Job is started, it indexed some 20000+ documents and gets
>>>> terminated with the below error on the Manifold JOB.
>>>>
>>>> Error: Repeated service interruptions - failure processing document:
>>>> Error from server at http://localhost:8983/solr/doc
>>>> umentum_manifoldcf_stg: String index out of range: -188
>>>>
>>>>
>>>>
>>>> Please find the attached manifoldCF error log and agent log.
>>>>
>>>>
>>>>
>>>> Please let me know the observations on the cause of the issue and the
>>>> configuration on the threads used  for crawling. Please share your thoughts.
>>>>
>>>>
>>>>
>>>> Regards,
>>>>
>>>> Tamizh Kumaran
>>>>
>>>>
>>>>
>>>>
>>>>
>>>
>>>
>>
>

Re: ManifoldCF documentum indexing issue

Posted by Karl Wright <da...@gmail.com>.
Here's the question I posted:

>>>>>>
Hi all,

I've got a ManifoldCF user who is posting content to Solr using the MCF
Solr output connector.  This connector uses SolrJ under the covers -- a
fairly recent version -- but also has overridden some classes to insure
that multipart form posts will be used for most content.

The problem is that, for a specific document, the user is getting an
ArrayIndexOutOfBounds exception in Solr, as follows:

>>>>>>
2017-06-14T08:25:16,546 - ERROR [qtp862890654-69725:SolrException@148] - {collection=c:documentum_manifoldcf_stg, core=x:documentum_manifoldcf_stg_shard1_replica1, node_name=n:**********:8983_solr, replica=r:core_node1, shard=s:shard1} - java.lang.StringIndexOutOfBoundsException: String index out of range: -296
        at java.lang.String.substring(String.java:1911)
        at org.apache.solr.request.macro.MacroExpander._expand(MacroExpander.java:143)
        at org.apache.solr.request.macro.MacroExpander.expand(MacroExpander.java:93)
        at org.apache.solr.request.macro.MacroExpander.expand(MacroExpander.java:59)
        at org.apache.solr.request.macro.MacroExpander.expand(MacroExpander.java:45)
        at org.apache.solr.request.json.RequestUtil.processParams(RequestUtil.java:157)
        at org.apache.solr.util.SolrPluginUtils.setDefaults(SolrPluginUtils.java:172)
        at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:152)
        at org.apache.solr.core.SolrCore.execute(SolrCore.java:2102)
        at org.apache.solr.servlet.HttpSolrCall.execute(HttpSolrCall.java:654)
        at org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:460)
        at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:257)
        at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:208)
        at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1652)
        at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:585)
        at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:143)
        at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:577)
        at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:223)
        at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1127)
        at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:515)
        at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:185)
        at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1061)
        at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
        at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:215)
        at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:110)
        at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:97)
        at org.eclipse.jetty.server.Server.handle(Server.java:499)
        at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:310)
        at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:257)
        at org.eclipse.jetty.io.AbstractConnection$2.run(AbstractConnection.java:540)
        at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:635)
        at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:555)
        at java.lang.Thread.run(Thread.java:745)
<<<<<<

It looks worrisome to me that there's now possibly some kind of "macro
expansion" that is being triggered within parameters being sent to Solr.
Can anyone tell me either how to (a) disable this feature, or (b) how the
MCF Solr output connector should escape parameters being posted so that
Solr does not attempt any macro expansion?  If the latter, I also need to
know when this feature appeared, since obviously whether or not to do the
escaping will depend on the precise version of the Solr instance involved.

I'm also quite concerned that considerations of backwards compatibility may
have been lost at some point with Solr, since heretofore I could count on
older versions of SolrJ working with newer versions of Solr.  Please
clarify what the current policy is....


Thanks,
Karl
<<<<<<
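The negative substring index points at index arithmetic inside the macro expander going wrong, plausibly on an unterminated or malformed ${...} reference in a parameter value. A minimal toy sketch of that failure mode (this is not Solr's MacroExpander; the function below is purely illustrative and shows the defensive handling a naive implementation lacks):

```python
def expand(value, params):
    """Toy expander loosely modeled on Solr's ${name:default} macro syntax.

    Illustrative only: a naive implementation that assumes "}" always
    follows "${" would slice with end == -1 (find() failing), which is
    exactly the kind of negative index that surfaces in Java as
    StringIndexOutOfBoundsException.  Here we treat that case as literal
    text instead.
    """
    out = []
    i = 0
    while True:
        start = value.find("${", i)
        if start < 0:
            out.append(value[i:])          # no more macros; copy the tail
            return "".join(out)
        end = value.find("}", start)
        if end < 0:
            out.append(value[i:])          # unterminated "${": keep literal
            return "".join(out)
        name, _, default = value[start + 2 : end].partition(":")
        out.append(value[i:start])         # literal text before the macro
        out.append(params.get(name, default))
        i = end + 1

print(expand("q=${term:*}&rows=10", {"term": "solr"}))  # q=solr&rows=10
print(expand("body contains ${unterminated", {}))       # kept as literal
```

If document content or metadata is being passed where Solr performs this expansion, any value containing a bare "${" could trip it.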



On Wed, Jun 14, 2017 at 9:35 AM, Karl Wright <da...@gmail.com> wrote:

> I posted the pertinent question to the solr dev list.  Let's see what they
> say.
>
> Thanks,
> Karl
>
>
> On Wed, Jun 14, 2017 at 9:04 AM, Karl Wright <da...@gmail.com> wrote:
>
>> Hi,
>>
>> The exception in the solr.log should be reported as a Solr bug.  It is
>> not emanating from the Tika extractor (Solr Cell), but is in Solr itself.
>>
>> I wish there was an easy fix for this.  The problem is *not* an empty
>> stream; it's that Solr is attempting to do something with it that it
>> shouldn't.  MCF just gets back a 500 error from Solr, and we can't recover
>> from that.
>>
>> >>>>>>
>> https://**********/webtop/component/drl?versionLabel=CURRENT&objectId=091e8486805142f5
>> (500)
>> <<<<<<
>>
>> Karl
>>
>>
>>
>>
>> On Wed, Jun 14, 2017 at 8:29 AM, Tamizh Kumaran Thamizharasan <
>> tthamizharasan@worldbankgroup.org> wrote:
>>
>>> Hi Karl,
>>>
>>>
>>>
>>> After configuring Solr to ignore Tika errors by adding Tika transformer
>>> in the job, below behavior is observed.
>>>
>>>
>>>
>>> 1)      ManifoldCF fetches the content from documentum, which contains
>>> null content and tries to push it to the output connector(Solr).
>>>
>>> 2)      Solr couldn’t accept the null as a value and throwing “Missing
>>> content stream” error.
>>>
>>> 3)      Each agent thread In ManifoldCF internally held-up with
>>> different r_object_id’s that don’t have body content and keeps trying to
>>> push the content to Solr  after each failure, but Solr couldn’t accept the
>>> content and throws the same error.
>>>
>>> 4)      Over the time, the manifold job stops with the error thrown by
>>> Solr
>>>
>>>
>>>
>>> Please let know if there is any configuration change which can help us
>>> resolve this issue.
>>>
>>>
>>>
>>> Please find the attached manifoldCF error log,Solr error log and agent
>>> log.
>>>
>>>
>>>
>>> Regards,
>>>
>>> Tamizh Kumaran.
>>>
>>>
>>>
>>> *From:* Karl Wright [mailto:daddywri@gmail.com]
>>> *Sent:* Tuesday, June 13, 2017 2:23 PM
>>> *To:* user@manifoldcf.apache.org
>>> *Cc:* Sharnel Merdeck Pereira; Sundarapandian Arumaidurai Vethasigamani
>>> *Subject:* Re: ManifoldCF documentum indexing issue
>>>
>>>
>>>
>>> Hi Tamizh,
>>>
>>>
>>>
>>> The reported error is 'Error from server at http://localhost:8983/solr/
>>> documentum_manifoldcf_stg: String index out of range: -188'.  The
>>> message seemingly indicates that the error was *received* from the solr
>>> server for one specific document.  ManifoldCF does not recognize the error
>>> as being innocuous and therefore it will retry for a while until it
>>> eventually gives up and halts the job.  However, I cannot find that exact
>>> text anywhere in the Solr output connector code, so I wonder if you
>>> transcribed it correctly?
>>>
>>> There should also be the following:
>>>
>>> (1) A record of the attempts in the manifoldcf.log file, with a MCF
>>> stack trace attached to each one;
>>>
>>> (2) Simple history records for that document that are of the type
>>> INGESTDOCUMENT.
>>>
>>> (3) Solr log entries that have a Solr stack trace.
>>>
>>>
>>>
>>> The last one is the one that would be the most helpful.  It is possible
>>> that you are seeing a problem in Solr Cell (Tika) that is manifesting
>>> itself in this way.  You can (and should) configure your Solr to ignore
>>> Tika errors.
>>>
>>>
>>>
>>> Thanks,
>>>
>>> Karl
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>> On Tue, Jun 13, 2017 at 3:20 AM, Tamizh Kumaran Thamizharasan <
>>> tthamizharasan@worldbankgroup.org> wrote:
>>>
>>> Hi,
>>>
>>>
>>>
>>> The Manifoldcf 2.7.1 is running in the multiprocess zk model and
>>> integrated with PostgreSQL 9.3. The expected setup is to crawl the
>>> Documentum contents and pushed on to the output SOLR 5.3.2. The crawler-ui
>>> app is installed on the tomcat and startup script is pointed with the MF
>>> properties.xml during server startup. Manifold along with the bundled ZK,
>>> tomcat are running on the same host with OS as  Red Hat Enterprise Linux
>>> Server release 6.9 (Santiago). The DB is running on a windows box.
>>>
>>> The ZK is integrated with the DB through the properties.xml and
>>> properties-global.xml
>>>
>>> The ZK, the documentum related processes(registry and server) are up and
>>> the  two agents (start-agents.sh and start-agents-2.sh) are started  which
>>> produce multiple threads to index the documemtum contents into SOLR through
>>> ManifoldCF.
>>>
>>>
>>>
>>> The Current no of the connections configured on the MF are as below.
>>>
>>> SOLR Output max connection : 25
>>>
>>> Document repository  Max Connections: 25
>>>
>>> Properties.xml:
>>>
>>> <property name="org.apache.manifoldcf.database.maxhandles" value="50"/>
>>>
>>> <property name="org.apache.manifoldcf.crawler.threads" value="25"/>
>>>
>>> Total documentum document count : 0.5 million
>>>
>>>
>>>
>>> After the Job is started, it indexed some 20000+ documents and gets
>>> terminated with the below error on the Manifold JOB.
>>>
>>> Error: Repeated service interruptions - failure processing document:
>>> Error from server at http://localhost:8983/solr/doc
>>> umentum_manifoldcf_stg: String index out of range: -188
>>>
>>>
>>>
>>> Please find the attached manifoldCF error log and agent log.
>>>
>>>
>>>
>>> Please let me know the observations on the cause of the issue and the
>>> configuration on the threads used  for crawling. Please share your thoughts.
>>>
>>>
>>>
>>> Regards,
>>>
>>> Tamizh Kumaran
>>>
>>>
>>>
>>>
>>>
>>
>>
>

Re: ManifoldCF documentum indexing issue

Posted by Karl Wright <da...@gmail.com>.
I posted the pertinent question to the solr dev list.  Let's see what they
say.

Thanks,
Karl


On Wed, Jun 14, 2017 at 9:04 AM, Karl Wright <da...@gmail.com> wrote:

> Hi,
>
> The exception in the solr.log should be reported as a Solr bug.  It is not
> emanating from the Tika extractor (Solr Cell), but is in Solr itself.
>
> I wish there was an easy fix for this.  The problem is *not* an empty
> stream; it's that Solr is attempting to do something with it that it
> shouldn't.  MCF just gets back a 500 error from Solr, and we can't recover
> from that.
>
> >>>>>>
> https://**********/webtop/component/drl?versionLabel=CURRENT&objectId=091e8486805142f5
> (500)
> <<<<<<
>
> Karl
>
>
>
>
> On Wed, Jun 14, 2017 at 8:29 AM, Tamizh Kumaran Thamizharasan <
> tthamizharasan@worldbankgroup.org> wrote:
>
>> Hi Karl,
>>
>>
>>
>> After configuring Solr to ignore Tika errors by adding Tika transformer
>> in the job, below behavior is observed.
>>
>>
>>
>> 1)      ManifoldCF fetches the content from documentum, which contains
>> null content and tries to push it to the output connector(Solr).
>>
>> 2)      Solr couldn’t accept the null as a value and throwing “Missing
>> content stream” error.
>>
>> 3)      Each agent thread In ManifoldCF internally held-up with
>> different r_object_id’s that don’t have body content and keeps trying to
>> push the content to Solr  after each failure, but Solr couldn’t accept the
>> content and throws the same error.
>>
>> 4)      Over the time, the manifold job stops with the error thrown by
>> Solr
>>
>>
>>
>> Please let know if there is any configuration change which can help us
>> resolve this issue.
>>
>>
>>
>> Please find the attached manifoldCF error log,Solr error log and agent
>> log.
>>
>>
>>
>> Regards,
>>
>> Tamizh Kumaran.
>>
>>
>>
>> *From:* Karl Wright [mailto:daddywri@gmail.com]
>> *Sent:* Tuesday, June 13, 2017 2:23 PM
>> *To:* user@manifoldcf.apache.org
>> *Cc:* Sharnel Merdeck Pereira; Sundarapandian Arumaidurai Vethasigamani
>> *Subject:* Re: ManifoldCF documentum indexing issue
>>
>>
>>
>> Hi Tamizh,
>>
>>
>>
>> The reported error is 'Error from server at http://localhost:8983/solr/
>> documentum_manifoldcf_stg: String index out of range: -188'.  The
>> message seemingly indicates that the error was *received* from the solr
>> server for one specific document.  ManifoldCF does not recognize the error
>> as being innocuous and therefore it will retry for a while until it
>> eventually gives up and halts the job.  However, I cannot find that exact
>> text anywhere in the Solr output connector code, so I wonder if you
>> transcribed it correctly?
>>
>> There should also be the following:
>>
>> (1) A record of the attempts in the manifoldcf.log file, with a MCF stack
>> trace attached to each one;
>>
>> (2) Simple history records for that document that are of the type
>> INGESTDOCUMENT.
>>
>> (3) Solr log entries that have a Solr stack trace.
>>
>>
>>
>> The last one is the one that would be the most helpful.  It is possible
>> that you are seeing a problem in Solr Cell (Tika) that is manifesting
>> itself in this way.  You can (and should) configure your Solr to ignore
>> Tika errors.
>>
>>
>>
>> Thanks,
>>
>> Karl
>>
>>
>>
>>
>>
>>
>>
>>
>>
>> On Tue, Jun 13, 2017 at 3:20 AM, Tamizh Kumaran Thamizharasan <
>> tthamizharasan@worldbankgroup.org> wrote:
>>
>> Hi,
>>
>>
>>
>> The Manifoldcf 2.7.1 is running in the multiprocess zk model and
>> integrated with PostgreSQL 9.3. The expected setup is to crawl the
>> Documentum contents and pushed on to the output SOLR 5.3.2. The crawler-ui
>> app is installed on the tomcat and startup script is pointed with the MF
>> properties.xml during server startup. Manifold along with the bundled ZK,
>> tomcat are running on the same host with OS as  Red Hat Enterprise Linux
>> Server release 6.9 (Santiago). The DB is running on a windows box.
>>
>> The ZK is integrated with the DB through the properties.xml and
>> properties-global.xml
>>
>> The ZK, the documentum related processes(registry and server) are up and
>> the  two agents (start-agents.sh and start-agents-2.sh) are started  which
>> produce multiple threads to index the documemtum contents into SOLR through
>> ManifoldCF.
>>
>>
>>
>> The Current no of the connections configured on the MF are as below.
>>
>> SOLR Output max connection : 25
>>
>> Document repository  Max Connections: 25
>>
>> Properties.xml:
>>
>> <property name="org.apache.manifoldcf.database.maxhandles" value="50"/>
>>
>> <property name="org.apache.manifoldcf.crawler.threads" value="25"/>
>>
>> Total documentum document count : 0.5 million
>>
>>
>>
>> After the Job is started, it indexed some 20000+ documents and gets
>> terminated with the below error on the Manifold JOB.
>>
>> Error: Repeated service interruptions - failure processing document:
>> Error from server at http://localhost:8983/solr/documentum_manifoldcf_stg:
>> String index out of range: -188
>>
>>
>>
>> Please find the attached manifoldCF error log and agent log.
>>
>>
>>
>> Please let me know the observations on the cause of the issue and the
>> configuration on the threads used  for crawling. Please share your thoughts.
>>
>>
>>
>> Regards,
>>
>> Tamizh Kumaran
>>
>>
>>
>>
>>
>
>

Re: ManifoldCF documentum indexing issue

Posted by Karl Wright <da...@gmail.com>.
Hi,

The exception in the solr.log should be reported as a Solr bug.  It is not
emanating from the Tika extractor (Solr Cell), but is in Solr itself.

I wish there was an easy fix for this.  The problem is *not* an empty
stream; it's that Solr is attempting to do something with it that it
shouldn't.  MCF just gets back a 500 error from Solr, and we can't recover
from that.

>>>>>>
https://**********/webtop/component/drl?versionLabel=CURRENT&objectId=091e8486805142f5
(500)
<<<<<<

Karl




On Wed, Jun 14, 2017 at 8:29 AM, Tamizh Kumaran Thamizharasan <
tthamizharasan@worldbankgroup.org> wrote:

> Hi Karl,
>
>
>
> After configuring Solr to ignore Tika errors by adding Tika transformer in
> the job, below behavior is observed.
>
>
>
> 1)      ManifoldCF fetches the content from documentum, which contains
> null content and tries to push it to the output connector(Solr).
>
> 2)      Solr couldn’t accept the null as a value and throwing “Missing
> content stream” error.
>
> 3)      Each agent thread In ManifoldCF internally held-up with different
> r_object_id’s that don’t have body content and keeps trying to push the
> content to Solr  after each failure, but Solr couldn’t accept the content
> and throws the same error.
>
> 4)      Over the time, the manifold job stops with the error thrown by
> Solr
>
>
>
> Please let know if there is any configuration change which can help us
> resolve this issue.
>
>
>
> Please find the attached manifoldCF error log,Solr error log and agent log.
>
>
>
> Regards,
>
> Tamizh Kumaran.
>
>
>
> *From:* Karl Wright [mailto:daddywri@gmail.com]
> *Sent:* Tuesday, June 13, 2017 2:23 PM
> *To:* user@manifoldcf.apache.org
> *Cc:* Sharnel Merdeck Pereira; Sundarapandian Arumaidurai Vethasigamani
> *Subject:* Re: ManifoldCF documentum indexing issue
>
>
>
> Hi Tamizh,
>
>
>
> The reported error is 'Error from server at http://localhost:8983/solr/
> documentum_manifoldcf_stg: String index out of range: -188'.  The message
> seemingly indicates that the error was *received* from the solr server for
> one specific document.  ManifoldCF does not recognize the error as being
> innocuous and therefore it will retry for a while until it eventually gives
> up and halts the job.  However, I cannot find that exact text anywhere in
> the Solr output connector code, so I wonder if you transcribed it correctly?
>
> There should also be the following:
>
> (1) A record of the attempts in the manifoldcf.log file, with a MCF stack
> trace attached to each one;
>
> (2) Simple history records for that document that are of the type
> INGESTDOCUMENT.
>
> (3) Solr log entries that have a Solr stack trace.
>
>
>
> The last one is the one that would be the most helpful.  It is possible
> that you are seeing a problem in Solr Cell (Tika) that is manifesting
> itself in this way.  You can (and should) configure your Solr to ignore
> Tika errors.
>
>
>
> Thanks,
>
> Karl
>
>
>
>
>
>
>
>
>
> On Tue, Jun 13, 2017 at 3:20 AM, Tamizh Kumaran Thamizharasan <
> tthamizharasan@worldbankgroup.org> wrote:
>
> Hi,
>
>
>
> The Manifoldcf 2.7.1 is running in the multiprocess zk model and
> integrated with PostgreSQL 9.3. The expected setup is to crawl the
> Documentum contents and pushed on to the output SOLR 5.3.2. The crawler-ui
> app is installed on the tomcat and startup script is pointed with the MF
> properties.xml during server startup. Manifold along with the bundled ZK,
> tomcat are running on the same host with OS as  Red Hat Enterprise Linux
> Server release 6.9 (Santiago). The DB is running on a windows box.
>
> The ZK is integrated with the DB through the properties.xml and
> properties-global.xml
>
> The ZK, the documentum related processes(registry and server) are up and
> the  two agents (start-agents.sh and start-agents-2.sh) are started  which
> produce multiple threads to index the documemtum contents into SOLR through
> ManifoldCF.
>
>
>
> The Current no of the connections configured on the MF are as below.
>
> SOLR Output max connection : 25
>
> Document repository  Max Connections: 25
>
> Properties.xml:
>
> <property name="org.apache.manifoldcf.database.maxhandles" value="50"/>
>
> <property name="org.apache.manifoldcf.crawler.threads" value="25"/>
>
> Total documentum document count : 0.5 million
>
>
>
> After the Job is started, it indexed some 20000+ documents and gets
> terminated with the below error on the Manifold JOB.
>
> Error: Repeated service interruptions - failure processing document: Error
> from server at http://localhost:8983/solr/documentum_manifoldcf_stg:
> String index out of range: -188
>
>
>
> Please find the attached manifoldCF error log and agent log.
>
>
>
> Please let me know the observations on the cause of the issue and the
> configuration on the threads used  for crawling. Please share your thoughts.
>
>
>
> Regards,
>
> Tamizh Kumaran
>
>
>
>
>

RE: ManifoldCF documentum indexing issue

Posted by Tamizh Kumaran Thamizharasan <tt...@worldbankgroup.org>.
Hi Karl,

After configuring the setup to ignore Tika errors by adding the Tika transformer to the job, the following behavior is observed.


1)      ManifoldCF fetches a document from Documentum whose body content is null and tries to push it to the output connector (Solr).

2)      Solr cannot accept null as a value and throws a “Missing content stream” error.

3)      Each agent thread in ManifoldCF is internally held up with a different r_object_id that has no body content, and keeps trying to push the content to Solr after each failure, but Solr cannot accept the content and throws the same error.

4)      Over time, the ManifoldCF job stops with the error thrown by Solr.

Please let us know if there is any configuration change that can help us resolve this issue.

Please find attached the ManifoldCF error log, Solr error log, and agent log.

Regards,
Tamizh Kumaran.

From: Karl Wright [mailto:daddywri@gmail.com]
Sent: Tuesday, June 13, 2017 2:23 PM
To: user@manifoldcf.apache.org
Cc: Sharnel Merdeck Pereira; Sundarapandian Arumaidurai Vethasigamani
Subject: Re: ManifoldCF documentum indexing issue

Hi Tamizh,

The reported error is 'Error from server at http://localhost:8983/solr/documentum_manifoldcf_stg: String index out of range: -188'.  The message seemingly indicates that the error was *received* from the solr server for one specific document.  ManifoldCF does not recognize the error as being innocuous and therefore it will retry for a while until it eventually gives up and halts the job.  However, I cannot find that exact text anywhere in the Solr output connector code, so I wonder if you transcribed it correctly?

There should also be the following:
(1) A record of the attempts in the manifoldcf.log file, with a MCF stack trace attached to each one;
(2) Simple history records for that document that are of the type INGESTDOCUMENT.
(3) Solr log entries that have a Solr stack trace.

The last one is the one that would be the most helpful.  It is possible that you are seeing a problem in Solr Cell (Tika) that is manifesting itself in this way.  You can (and should) configure your Solr to ignore Tika errors.

Thanks,
Karl




On Tue, Jun 13, 2017 at 3:20 AM, Tamizh Kumaran Thamizharasan <tt...@worldbankgroup.org>> wrote:
Hi,

ManifoldCF 2.7.1 is running in the multi-process ZooKeeper model, integrated with PostgreSQL 9.3. The expected setup is to crawl the Documentum content and push it to the Solr 5.3.2 output. The crawler-ui app is installed on Tomcat, and the startup script points to the ManifoldCF properties.xml during server startup. ManifoldCF, along with the bundled ZooKeeper and Tomcat, runs on the same host under Red Hat Enterprise Linux Server release 6.9 (Santiago). The database runs on a Windows box.
ZooKeeper is integrated with the database through properties.xml and properties-global.xml.
ZooKeeper and the Documentum-related processes (registry and server) are up, and the two agents (start-agents.sh and start-agents-2.sh) are started, each of which spawns multiple threads to index the Documentum content into Solr through ManifoldCF.

The current number of connections configured in ManifoldCF is as below.
Solr output max connections: 25
Document repository max connections: 25
properties.xml:
<property name="org.apache.manifoldcf.database.maxhandles" value="50"/>
<property name="org.apache.manifoldcf.crawler.threads" value="25"/>
Total Documentum document count: 0.5 million
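For context on sizing those settings, a sketch of the related properties.xml entries (per the ManifoldCF properties documentation these values apply per agents process, so two agents processes each get their own worker-thread and database-handle pools; the delete/expire/cleanup values below are illustrative additions, not taken from this setup):

```xml
<!-- Sketch: besides the worker threads, each agents process also runs
     delete, expire, and cleanup thread pools that draw on the same
     per-process database handle pool, so maxhandles should stay
     comfortably above the sum of all thread counts. -->
<property name="org.apache.manifoldcf.database.maxhandles" value="50"/>
<property name="org.apache.manifoldcf.crawler.threads" value="25"/>
<property name="org.apache.manifoldcf.crawler.deletethreads" value="10"/>
<property name="org.apache.manifoldcf.crawler.expirethreads" value="10"/>
<property name="org.apache.manifoldcf.crawler.cleanupthreads" value="10"/>
```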

After the job is started, it indexes some 20,000+ documents and then terminates with the below error on the ManifoldCF job.
Error: Repeated service interruptions - failure processing document: Error from server at http://localhost:8983/solr/documentum_manifoldcf_stg: String index out of range: -188

Please find the attached manifoldCF error log and agent log.

Please let me know your observations on the cause of the issue and on the thread configuration used for crawling. Please share your thoughts.

Regards,
Tamizh Kumaran



Re: ManifoldCF documentum indexing issue

Posted by Karl Wright <da...@gmail.com>.
Hi Tamizh,

The reported error is 'Error from server at http://localhost:8983/solr/
documentum_manifoldcf_stg: String index out of range: -188'.  The message
seemingly indicates that the error was *received* from the solr server for
one specific document.  ManifoldCF does not recognize the error as being
innocuous and therefore it will retry for a while until it eventually gives
up and halts the job.  However, I cannot find that exact text anywhere in
the Solr output connector code, so I wonder if you transcribed it correctly?

There should also be the following:
(1) A record of the attempts in the manifoldcf.log file, with an MCF stack trace attached to each one;
(2) Simple history records for that document that are of the type
INGESTDOCUMENT.
(3) Solr log entries that have a Solr stack trace.

The last one is the one that would be the most helpful.  It is possible
that you are seeing a problem in Solr Cell (Tika) that is manifesting
itself in this way.  You can (and should) configure your Solr to ignore
Tika errors.
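
For reference, Solr Cell's Tika-error handling is controlled by the ignoreTikaException flag on the extracting request handler; a sketch for solrconfig.xml (the handler defaults and field mapping shown here are illustrative and depend on your schema):

```xml
<!-- Sketch for solrconfig.xml: the /update/extract handler used for
     Solr Cell extraction.  ignoreTikaException=true makes Solr index a
     document's metadata even when Tika fails to parse the body. -->
<requestHandler name="/update/extract"
                class="solr.extraction.ExtractingRequestHandler">
  <lst name="defaults">
    <str name="lowernames">true</str>
    <str name="ignoreTikaException">true</str>
    <str name="fmap.content">content</str>
  </lst>
</requestHandler>
```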

Thanks,
Karl




On Tue, Jun 13, 2017 at 3:20 AM, Tamizh Kumaran Thamizharasan <
tthamizharasan@worldbankgroup.org> wrote:

> Hi,
>
>
>
> The Manifoldcf 2.7.1 is running in the multiprocess zk model and
> integrated with PostgreSQL 9.3. The expected setup is to crawl the
> Documentum contents and pushed on to the output SOLR 5.3.2. The crawler-ui
> app is installed on the tomcat and startup script is pointed with the MF
> properties.xml during server startup. Manifold along with the bundled ZK,
> tomcat are running on the same host with OS as  Red Hat Enterprise Linux
> Server release 6.9 (Santiago). The DB is running on a windows box.
>
> The ZK is integrated with the DB through the properties.xml and
> properties-global.xml
>
> The ZK, the documentum related processes(registry and server) are up and
> the  two agents (start-agents.sh and start-agents-2.sh) are started  which
> produce multiple threads to index the documemtum contents into SOLR through
> ManifoldCF.
>
>
>
> The Current no of the connections configured on the MF are as below.
>
> SOLR Output max connection : 25
>
> Document repository  Max Connections: 25
>
> Properties.xml:
>
> <property name="org.apache.manifoldcf.database.maxhandles" value="50"/>
>
> <property name="org.apache.manifoldcf.crawler.threads" value="25"/>
>
> Total documentum document count : 0.5 million
>
>
>
> After the Job is started, it indexed some 20000+ documents and gets
> terminated with the below error on the Manifold JOB.
>
> Error: Repeated service interruptions - failure processing document: Error
> from server at http://localhost:8983/solr/documentum_manifoldcf_stg:
> String index out of range: -188
>
>
>
> Please find the attached manifoldCF error log and agent log.
>
>
>
> Please let me know the observations on the cause of the issue and the
> configuration on the threads used  for crawling. Please share your thoughts.
>
>
>
> Regards,
>
> Tamizh Kumaran
>
>
>