Posted to solr-user@lucene.apache.org by Zheng Lin Edwin Yeo <ed...@gmail.com> on 2015/03/18 08:22:51 UTC

Unable to index rich-text documents in Solr Cloud

Hi everyone,

I'm having some issues with indexing rich-text documents in Solr
Cloud. When I try to index a PDF or Word document, I get the following
error:


org.apache.solr.common.SolrException: Bad Request



request: http://192.168.2.2:8984/solr/logmill/update?update.distrib=TOLEADER&distrib.from=http%3A%2F%2F192.168.2.2%3A8983%2Fsolr%2Flogmill%2F&wt=javabin&version=2
	at org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrClient$Runner.run(ConcurrentUpdateSolrClient.java:241)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
	at java.lang.Thread.run(Unknown Source)


I'm able to index .xml and .csv files in Solr Cloud with the same configuration.

I have set up Solr Cloud using the default ZooKeeper in Solr 5.0.0, and
I have 2 shards with the following details:
Shard1: 192.168.2.2:8983
Shard2: 192.168.2.2:8984

Prior to this, I was already able to index rich-text documents without
Solr Cloud, and I'm using the same solrconfig.xml and schema.xml,
so my ExtractingRequestHandler is already defined.

Are there other settings required in order to index rich-text documents
in Solr Cloud?


Regards,
Edwin

Re: Unable to index rich-text documents in Solr Cloud

Posted by Zheng Lin Edwin Yeo <ed...@gmail.com>.
Hi Shawn,

Yes, I'm using the /update/extract handler. I'm not sure about the
shards.qt parameter either.

Regards,
Edwin


On 19 March 2015 at 13:18, Shawn Heisey <ap...@elyograg.org> wrote:

> On 3/18/2015 1:22 AM, Zheng Lin Edwin Yeo wrote:
> > I'm having some issues with indexing rich-text documents from the Solr
> > Cloud. When I tried to index a pdf or word document, I get the following
> > error:
> >
> >
> > org.apache.solr.common.SolrException: Bad Request
> >
> >
> >
> > request:
> http://192.168.2.2:8984/solr/logmill/update?update.distrib=TOLEADER&distrib.from=http%3A%2F%2F192.168.2.2%3A8983%2Fsolr%2Flogmill%2F&wt=javabin&version=2
>
> This request appears to be one of the requests that SolrCloud makes
> between its different nodes, but it is using the /update handler.  I
> assume that when you sent the request, you sent it to the
> /update/extract handler because it's a rich text document?  The /update
> handler can't do rich text documents, it's only for documents in json,
> xml, csv, javabin, etc that are formatted in specific ways.
>
> One thing I'm wondering is whether the Extracting handler requires a
> shards.qt parameter, also set to /update/extract, to work right with
> SolrCloud.  I have never used that handler myself, so I've got no idea
> what is required to make it work right.
>
> Thanks,
> Shawn
>
>

Re: Unable to index rich-text documents in Solr Cloud

Posted by Shawn Heisey <ap...@elyograg.org>.
On 3/18/2015 1:22 AM, Zheng Lin Edwin Yeo wrote:
> I'm having some issues with indexing rich-text documents from the Solr
> Cloud. When I tried to index a pdf or word document, I get the following
> error:
> 
> 
> org.apache.solr.common.SolrException: Bad Request
> 
> 
> 
> request: http://192.168.2.2:8984/solr/logmill/update?update.distrib=TOLEADER&distrib.from=http%3A%2F%2F192.168.2.2%3A8983%2Fsolr%2Flogmill%2F&wt=javabin&version=2

This request appears to be one of the requests that SolrCloud makes
between its different nodes, but it is using the /update handler.  I
assume that when you sent the request, you sent it to the
/update/extract handler because it's a rich text document?  The /update
handler can't do rich text documents; it's only for documents in JSON,
XML, CSV, javabin, etc. that are formatted in specific ways.
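
For reference, a minimal SolrJ sketch of sending a PDF straight to the
/update/extract handler might look like the following (the base URL,
collection name and file name are taken from elsewhere in this thread
purely for illustration):

    import java.io.File;

    import org.apache.solr.client.solrj.SolrClient;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.client.solrj.request.AbstractUpdateRequest;
    import org.apache.solr.client.solrj.request.ContentStreamUpdateRequest;

    public class ExtractExample {
      public static void main(String[] args) throws Exception {
        // Point the client at any node that hosts the collection.
        SolrClient client = new HttpSolrClient("http://192.168.2.2:8983/solr/logmill");

        // Target the extracting handler, not the plain /update handler.
        ContentStreamUpdateRequest req = new ContentStreamUpdateRequest("/update/extract");
        req.addFile(new File("solr-word.pdf"), "application/pdf");
        req.setParam("literal.id", "solr-word.pdf");  // unique key for the document
        req.setAction(AbstractUpdateRequest.ACTION.COMMIT, true, true);

        client.request(req);
        client.close();
      }
    }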

One thing I'm wondering is whether the Extracting handler requires a
shards.qt parameter, also set to /update/extract, to work right with
SolrCloud.  I have never used that handler myself, so I've got no idea
what is required to make it work right.

Thanks,
Shawn


Re: Unable to index rich-text documents in Solr Cloud

Posted by Zheng Lin Edwin Yeo <ed...@gmail.com>.
These are the logs that I got from solr.log. I can't seem to figure out
what's wrong with them. Does anyone know?



ERROR - 2015-03-18 15:06:51.019;
org.apache.solr.update.StreamingSolrClients$1; error
org.apache.solr.common.SolrException: Bad Request



request:
http://192.168.2.2:8984/solr/logmill/update?update.distrib=TOLEADER&distrib.from=http%3A%2F%2F192.168.23.72%3A8983%2Fsolr%2Flogmill%2F&wt=javabin&version=2
at
org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrClient$Runner.run(ConcurrentUpdateSolrClient.java:241)
at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
at java.lang.Thread.run(Unknown Source)
INFO  - 2015-03-18 15:06:51.019;
org.apache.solr.update.processor.LogUpdateProcessor; [logmill] webapp=/solr
path=/update/extract params={literal.id
=C:\Users\edwin\solr-5.0.0\example\exampledocs\solr-word.pdf&resource.name=C:\Users\edwin\solr-5.0.0\example\exampledocs\solr-word.pdf}
{add=[C:\Users\edwin\solr-5.0.0\example\exampledocs\solr-word.pdf]} 0 1252
INFO  - 2015-03-18 15:06:51.029;
org.apache.solr.update.DirectUpdateHandler2; start
commit{,optimize=false,openSearcher=true,waitSearcher=true,expungeDeletes=false,softCommit=false,prepareCommit=false}
INFO  - 2015-03-18 15:06:51.029;
org.apache.solr.update.DirectUpdateHandler2; No uncommitted changes.
Skipping IW.commit.
INFO  - 2015-03-18 15:06:51.029; org.apache.solr.core.SolrCore;
SolrIndexSearcher has not changed - not re-opening:
org.apache.solr.search.SolrIndexSearcher
INFO  - 2015-03-18 15:06:51.039;
org.apache.solr.update.DirectUpdateHandler2; end_commit_flush
INFO  - 2015-03-18 15:06:51.039;
org.apache.solr.update.processor.LogUpdateProcessor; [logmill] webapp=/solr
path=/update params={waitSearcher=true&distrib.from=
http://192.168.2.2:8983/solr/logmill/&update.distrib=FROMLEADER&openSearcher=true&commit=true&wt=javabin&expungeDeletes=false&commit_end_point=true&version=2&softCommit=false}
{commit=} 0 10
INFO  - 2015-03-18 15:06:51.039;
org.apache.solr.update.processor.LogUpdateProcessor; [logmill] webapp=/solr
path=/update params={commit=true} {commit=} 0 10



Regards,
Edwin



On 19 March 2015 at 10:56, Damien Kamerman <da...@gmail.com> wrote:

> I suggest you check your solr logs for more info as to the cause.
>
> On 19 March 2015 at 12:58, Zheng Lin Edwin Yeo <ed...@gmail.com>
> wrote:
>
> > Hi Erick,
> >
> > No, the PDF file is a testing file which only contains 1 sentence.
> >
> > I've managed to get it to work by removing startup="lazy" in
> > the ExtractingRequestHandler and added the following lines:
> >       <str name="uprefix">ignored_</str>
> >       <str name="captureAttr">true</str>
> >       <str name="fmap.a">links</str>
> >       <str name="fmap.div">ignored_</str>
> >
> > Does the presence of startup="lazy" affect the function of
> > ExtractingRequestHandler , or is it one of the str name values?
> >
> > Regards,
> > Edwin
> >
> >
> > On 18 March 2015 at 23:19, Erick Erickson <er...@gmail.com>
> wrote:
> >
> > > Shot in the dark, but is the PDF file significantly larger than the
> > > others? Perhaps your simply exceeding the packet limits for the
> > > servlet container?
> > >
> > > Best,
> > > Erick
> > >
> > > On Wed, Mar 18, 2015 at 12:22 AM, Zheng Lin Edwin Yeo
> > > <ed...@gmail.com> wrote:
> > > > Hi everyone,
> > > >
> > > > I'm having some issues with indexing rich-text documents from the
> Solr
> > > > Cloud. When I tried to index a pdf or word document, I get the
> > following
> > > > error:
> > > >
> > > >
> > > > org.apache.solr.common.SolrException: Bad Request
> > > >
> > > >
> > > >
> > > > request:
> > >
> >
> http://192.168.2.2:8984/solr/logmill/update?update.distrib=TOLEADER&distrib.from=http%3A%2F%2F192.168.2.2%3A8983%2Fsolr%2Flogmill%2F&wt=javabin&version=2
> > > >         at
> > >
> >
> org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrClient$Runner.run(ConcurrentUpdateSolrClient.java:241)
> > > >         at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown
> > > Source)
> > > >         at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown
> > > Source)
> > > >         at java.lang.Thread.run(Unknown Source)
> > > >
> > > >
> > > > I'm able to index .xml and .csv files in Solr Cloud with the same
> > > configuration.
> > > >
> > > > I have setup Solr Cloud using the default zookeeper in Solr 5.0.0,
> and
> > > > I have 2 shards with the following details:
> > > > Shard1: 192.168.2.2:8983
> > > > Shard2: 192.168.2.2:8984
> > > >
> > > > Prior to this, I'm already able to index rich-text documents without
> > > > the Solr Cloud, and I'm using the same solrconfig.xml and schema.xml,
> > > > so my ExtractRequestHandler is already defined.
> > > >
> > > > Is there other settings required in order to index rich-text
> documents
> > > > in Solr Cloud?
> > > >
> > > >
> > > > Regards,
> > > > Edwin
> > >
> >
>
>
>
> --
> Damien Kamerman
>

Re: Unable to index rich-text documents in Solr Cloud

Posted by Zheng Lin Edwin Yeo <ed...@gmail.com>.
Hi Charlee,

I've followed the setup from the Solr In Action book, and assigned port 8983
to shard1 and port 8984 to shard2. Will that cause any issues?

Regards,
Edwin

On 19 March 2015 at 13:02, Charlee Chitsuk <ch...@gmail.com> wrote:

> The http://192.168.2.2:8984/solr/
> <
> http://192.168.2.2:8984/solr/logmill/update?update.distrib=TOLEADER&distrib.from=http%3A%2F%2F192.168.23.72%3A8983%2Fsolr%2Flogmill%2F&wt=javabin&version=2
> >
> ,
> the port number 8984 may be an HTTPS. The HTTP port should be 8983.
>
> Hope this help.
>
> --
>    Best Regards,
>
>    Charlee Chitsuk
>
> =======================
> Application Security Product Group
> *Summit Computer Co., Ltd.* <http://www.summitthai.com/>
> E-Mail: charlee@summitthai.com
> Tel: +66-2-238-0895 to 9 ext. 164
> Fax: +66-2-236-7392
> =======================
> *@ Your Success is Our Pride*
> ------------------------------------------
>
> 2015-03-19 11:49 GMT+07:00 Damien Kamerman <da...@gmail.com>:
>
> > It sounds like https://issues.apache.org/jira/browse/SOLR-5551
> > Have you checked the solr.log for all nodes?
> >
> > On 19 March 2015 at 14:43, Zheng Lin Edwin Yeo <ed...@gmail.com>
> > wrote:
> >
> > > This is the logs that I got from solr.log. I can't seems to figure out
> > > what's wrong with it. Does anyone knows?
> > >
> > >
> > >
> > > ERROR - 2015-03-18 15:06:51.019;
> > > org.apache.solr.update.StreamingSolrClients$1; error
> > > org.apache.solr.common.SolrException: Bad Request
> > >
> > >
> > >
> > > request:
> > >
> > >
> >
> http://192.168.2.2:8984/solr/logmill/update?update.distrib=TOLEADER&distrib.from=http%3A%2F%2F192.168.2.2%3A8983%2Fsolr%2Flogmill%2F&wt=javabin&version=2
> > > <
> > >
> >
> http://192.168.2.2:8984/solr/logmill/update?update.distrib=TOLEADER&distrib.from=http%3A%2F%2F192.168.23.72%3A8983%2Fsolr%2Flogmill%2F&wt=javabin&version=2
> > > >
> > > at
> > >
> > >
> >
> org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrClient$Runner.run(ConcurrentUpdateSolrClient.java:241)
> > > at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
> > > at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
> > > at java.lang.Thread.run(Unknown Source)
> > > INFO  - 2015-03-18 15:06:51.019;
> > > org.apache.solr.update.processor.LogUpdateProcessor; [logmill]
> > webapp=/solr
> > > path=/update/extract params={literal.id
> > > =C:\Users\edwin\solr-5.0.0\example\exampledocs\solr-word.pdf&
> > resource.name
> > > =C:\Users\edwin\solr-5.0.0\example\exampledocs\solr-word.pdf}
> > > {add=[C:\Users\edwin\solr-5.0.0\example\exampledocs\solr-word.pdf]} 0
> > 1252
> > > INFO  - 2015-03-18 15:06:51.029;
> > > org.apache.solr.update.DirectUpdateHandler2; start
> > >
> > >
> >
> commit{,optimize=false,openSearcher=true,waitSearcher=true,expungeDeletes=false,softCommit=false,prepareCommit=false}
> > > INFO  - 2015-03-18 15:06:51.029;
> > > org.apache.solr.update.DirectUpdateHandler2; No uncommitted changes.
> > > Skipping IW.commit.
> > > INFO  - 2015-03-18 15:06:51.029; org.apache.solr.core.SolrCore;
> > > SolrIndexSearcher has not changed - not re-opening:
> > > org.apache.solr.search.SolrIndexSearcher
> > > INFO  - 2015-03-18 15:06:51.039;
> > > org.apache.solr.update.DirectUpdateHandler2; end_commit_flush
> > > INFO  - 2015-03-18 15:06:51.039;
> > > org.apache.solr.update.processor.LogUpdateProcessor; [logmill]
> > webapp=/solr
> > > path=/update params={waitSearcher=true&distrib.from=
> > >
> > >
> >
> http://192.168.2.2:8983/solr/logmill/&update.distrib=FROMLEADER&openSearcher=true&commit=true&wt=javabin&expungeDeletes=false&commit_end_point=true&version=2&softCommit=false
> > > }
> > > {commit=} 0 10
> > > INFO  - 2015-03-18 15:06:51.039;
> > > org.apache.solr.update.processor.LogUpdateProcessor; [logmill]
> > webapp=/solr
> > > path=/update params={commit=true} {commit=} 0 10
> > >
> > >
> > >
> > > Regards,
> > > Edwin
> > >
> > >
> > > On 19 March 2015 at 10:56, Damien Kamerman <da...@gmail.com> wrote:
> > >
> > > > I suggest you check your solr logs for more info as to the cause.
> > > >
> > > > On 19 March 2015 at 12:58, Zheng Lin Edwin Yeo <edwinyeozl@gmail.com
> >
> > > > wrote:
> > > >
> > > > > Hi Erick,
> > > > >
> > > > > No, the PDF file is a testing file which only contains 1 sentence.
> > > > >
> > > > > I've managed to get it to work by removing startup="lazy" in
> > > > > the ExtractingRequestHandler and added the following lines:
> > > > >       <str name="uprefix">ignored_</str>
> > > > >       <str name="captureAttr">true</str>
> > > > >       <str name="fmap.a">links</str>
> > > > >       <str name="fmap.div">ignored_</str>
> > > > >
> > > > > Does the presence of startup="lazy" affect the function of
> > > > > ExtractingRequestHandler , or is it one of the str name values?
> > > > >
> > > > > Regards,
> > > > > Edwin
> > > > >
> > > > >
> > > > > On 18 March 2015 at 23:19, Erick Erickson <erickerickson@gmail.com
> >
> > > > wrote:
> > > > >
> > > > > > Shot in the dark, but is the PDF file significantly larger than
> the
> > > > > > others? Perhaps your simply exceeding the packet limits for the
> > > > > > servlet container?
> > > > > >
> > > > > > Best,
> > > > > > Erick
> > > > > >
> > > > > > On Wed, Mar 18, 2015 at 12:22 AM, Zheng Lin Edwin Yeo
> > > > > > <ed...@gmail.com> wrote:
> > > > > > > Hi everyone,
> > > > > > >
> > > > > > > I'm having some issues with indexing rich-text documents from
> the
> > > > Solr
> > > > > > > Cloud. When I tried to index a pdf or word document, I get the
> > > > > following
> > > > > > > error:
> > > > > > >
> > > > > > >
> > > > > > > org.apache.solr.common.SolrException: Bad Request
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > request:
> > > > > >
> > > > >
> > > >
> > >
> >
> http://192.168.2.2:8984/solr/logmill/update?update.distrib=TOLEADER&distrib.from=http%3A%2F%2F192.168.2.2%3A8983%2Fsolr%2Flogmill%2F&wt=javabin&version=2
> > > > > > >         at
> > > > > >
> > > > >
> > > >
> > >
> >
> org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrClient$Runner.run(ConcurrentUpdateSolrClient.java:241)
> > > > > > >         at
> > > java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown
> > > > > > Source)
> > > > > > >         at
> > > java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown
> > > > > > Source)
> > > > > > >         at java.lang.Thread.run(Unknown Source)
> > > > > > >
> > > > > > >
> > > > > > > I'm able to index .xml and .csv files in Solr Cloud with the
> same
> > > > > > configuration.
> > > > > > >
> > > > > > > I have setup Solr Cloud using the default zookeeper in Solr
> > 5.0.0,
> > > > and
> > > > > > > I have 2 shards with the following details:
> > > > > > > Shard1: 192.168.2.2:8983
> > > > > > > Shard2: 192.168.2.2:8984
> > > > > > >
> > > > > > > Prior to this, I'm already able to index rich-text documents
> > > without
> > > > > > > the Solr Cloud, and I'm using the same solrconfig.xml and
> > > schema.xml,
> > > > > > > so my ExtractRequestHandler is already defined.
> > > > > > >
> > > > > > > Is there other settings required in order to index rich-text
> > > > documents
> > > > > > > in Solr Cloud?
> > > > > > >
> > > > > > >
> > > > > > > Regards,
> > > > > > > Edwin
> > > > > >
> > > > >
> > > >
> > > >
> > > >
> > > > --
> > > > Damien Kamerman
> > > >
> > >
> >
> >
> >
> > --
> > Damien Kamerman
> >
>

Re: Unable to index rich-text documents in Solr Cloud

Posted by Charlee Chitsuk <ch...@gmail.com>.
In the request URL
http://192.168.2.2:8984/solr/logmill/update?update.distrib=TOLEADER&distrib.from=http%3A%2F%2F192.168.23.72%3A8983%2Fsolr%2Flogmill%2F&wt=javabin&version=2
the port number 8984 may be an HTTPS port. The HTTP port should be 8983.

Hope this helps.

--
   Best Regards,

   Charlee Chitsuk

=======================
Application Security Product Group
*Summit Computer Co., Ltd.* <http://www.summitthai.com/>
E-Mail: charlee@summitthai.com
Tel: +66-2-238-0895 to 9 ext. 164
Fax: +66-2-236-7392
=======================
*@ Your Success is Our Pride*
------------------------------------------

2015-03-19 11:49 GMT+07:00 Damien Kamerman <da...@gmail.com>:

> It sounds like https://issues.apache.org/jira/browse/SOLR-5551
> Have you checked the solr.log for all nodes?
>
> On 19 March 2015 at 14:43, Zheng Lin Edwin Yeo <ed...@gmail.com>
> wrote:
>
> > This is the logs that I got from solr.log. I can't seems to figure out
> > what's wrong with it. Does anyone knows?
> >
> >
> >
> > ERROR - 2015-03-18 15:06:51.019;
> > org.apache.solr.update.StreamingSolrClients$1; error
> > org.apache.solr.common.SolrException: Bad Request
> >
> >
> >
> > request:
> >
> >
> http://192.168.2.2:8984/solr/logmill/update?update.distrib=TOLEADER&distrib.from=http%3A%2F%2F192.168.2.2%3A8983%2Fsolr%2Flogmill%2F&wt=javabin&version=2
> > <
> >
> http://192.168.2.2:8984/solr/logmill/update?update.distrib=TOLEADER&distrib.from=http%3A%2F%2F192.168.23.72%3A8983%2Fsolr%2Flogmill%2F&wt=javabin&version=2
> > >
> > at
> >
> >
> org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrClient$Runner.run(ConcurrentUpdateSolrClient.java:241)
> > at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
> > at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
> > at java.lang.Thread.run(Unknown Source)
> > INFO  - 2015-03-18 15:06:51.019;
> > org.apache.solr.update.processor.LogUpdateProcessor; [logmill]
> webapp=/solr
> > path=/update/extract params={literal.id
> > =C:\Users\edwin\solr-5.0.0\example\exampledocs\solr-word.pdf&
> resource.name
> > =C:\Users\edwin\solr-5.0.0\example\exampledocs\solr-word.pdf}
> > {add=[C:\Users\edwin\solr-5.0.0\example\exampledocs\solr-word.pdf]} 0
> 1252
> > INFO  - 2015-03-18 15:06:51.029;
> > org.apache.solr.update.DirectUpdateHandler2; start
> >
> >
> commit{,optimize=false,openSearcher=true,waitSearcher=true,expungeDeletes=false,softCommit=false,prepareCommit=false}
> > INFO  - 2015-03-18 15:06:51.029;
> > org.apache.solr.update.DirectUpdateHandler2; No uncommitted changes.
> > Skipping IW.commit.
> > INFO  - 2015-03-18 15:06:51.029; org.apache.solr.core.SolrCore;
> > SolrIndexSearcher has not changed - not re-opening:
> > org.apache.solr.search.SolrIndexSearcher
> > INFO  - 2015-03-18 15:06:51.039;
> > org.apache.solr.update.DirectUpdateHandler2; end_commit_flush
> > INFO  - 2015-03-18 15:06:51.039;
> > org.apache.solr.update.processor.LogUpdateProcessor; [logmill]
> webapp=/solr
> > path=/update params={waitSearcher=true&distrib.from=
> >
> >
> http://192.168.2.2:8983/solr/logmill/&update.distrib=FROMLEADER&openSearcher=true&commit=true&wt=javabin&expungeDeletes=false&commit_end_point=true&version=2&softCommit=false
> > }
> > {commit=} 0 10
> > INFO  - 2015-03-18 15:06:51.039;
> > org.apache.solr.update.processor.LogUpdateProcessor; [logmill]
> webapp=/solr
> > path=/update params={commit=true} {commit=} 0 10
> >
> >
> >
> > Regards,
> > Edwin
> >
> >
> > On 19 March 2015 at 10:56, Damien Kamerman <da...@gmail.com> wrote:
> >
> > > I suggest you check your solr logs for more info as to the cause.
> > >
> > > On 19 March 2015 at 12:58, Zheng Lin Edwin Yeo <ed...@gmail.com>
> > > wrote:
> > >
> > > > Hi Erick,
> > > >
> > > > No, the PDF file is a testing file which only contains 1 sentence.
> > > >
> > > > I've managed to get it to work by removing startup="lazy" in
> > > > the ExtractingRequestHandler and added the following lines:
> > > >       <str name="uprefix">ignored_</str>
> > > >       <str name="captureAttr">true</str>
> > > >       <str name="fmap.a">links</str>
> > > >       <str name="fmap.div">ignored_</str>
> > > >
> > > > Does the presence of startup="lazy" affect the function of
> > > > ExtractingRequestHandler , or is it one of the str name values?
> > > >
> > > > Regards,
> > > > Edwin
> > > >
> > > >
> > > > On 18 March 2015 at 23:19, Erick Erickson <er...@gmail.com>
> > > wrote:
> > > >
> > > > > Shot in the dark, but is the PDF file significantly larger than the
> > > > > others? Perhaps your simply exceeding the packet limits for the
> > > > > servlet container?
> > > > >
> > > > > Best,
> > > > > Erick
> > > > >
> > > > > On Wed, Mar 18, 2015 at 12:22 AM, Zheng Lin Edwin Yeo
> > > > > <ed...@gmail.com> wrote:
> > > > > > Hi everyone,
> > > > > >
> > > > > > I'm having some issues with indexing rich-text documents from the
> > > Solr
> > > > > > Cloud. When I tried to index a pdf or word document, I get the
> > > > following
> > > > > > error:
> > > > > >
> > > > > >
> > > > > > org.apache.solr.common.SolrException: Bad Request
> > > > > >
> > > > > >
> > > > > >
> > > > > > request:
> > > > >
> > > >
> > >
> >
> http://192.168.2.2:8984/solr/logmill/update?update.distrib=TOLEADER&distrib.from=http%3A%2F%2F192.168.2.2%3A8983%2Fsolr%2Flogmill%2F&wt=javabin&version=2
> > > > > >         at
> > > > >
> > > >
> > >
> >
> org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrClient$Runner.run(ConcurrentUpdateSolrClient.java:241)
> > > > > >         at
> > java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown
> > > > > Source)
> > > > > >         at
> > java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown
> > > > > Source)
> > > > > >         at java.lang.Thread.run(Unknown Source)
> > > > > >
> > > > > >
> > > > > > I'm able to index .xml and .csv files in Solr Cloud with the same
> > > > > configuration.
> > > > > >
> > > > > > I have setup Solr Cloud using the default zookeeper in Solr
> 5.0.0,
> > > and
> > > > > > I have 2 shards with the following details:
> > > > > > Shard1: 192.168.2.2:8983
> > > > > > Shard2: 192.168.2.2:8984
> > > > > >
> > > > > > Prior to this, I'm already able to index rich-text documents
> > without
> > > > > > the Solr Cloud, and I'm using the same solrconfig.xml and
> > schema.xml,
> > > > > > so my ExtractRequestHandler is already defined.
> > > > > >
> > > > > > Is there other settings required in order to index rich-text
> > > documents
> > > > > > in Solr Cloud?
> > > > > >
> > > > > >
> > > > > > Regards,
> > > > > > Edwin
> > > > >
> > > >
> > >
> > >
> > >
> > > --
> > > Damien Kamerman
> > >
> >
>
>
>
> --
> Damien Kamerman
>

Re: Unable to index rich-text documents in Solr Cloud

Posted by Zheng Lin Edwin Yeo <ed...@gmail.com>.
Oh ya. The previous log was from shard1. This log is from shard2.

INFO  - 2015-03-18 15:06:51.019;
org.apache.solr.update.processor.LogUpdateProcessor; [logmill] webapp=/solr
path=/update params={distrib.from=
http://192.168.2.2:8983/solr/logmill/&update.distrib=TOLEADER&wt=javabin&version=2}
{} 0 20
ERROR - 2015-03-18 15:06:51.019; org.apache.solr.common.SolrException;
org.apache.solr.common.SolrException: ERROR:
[doc=C:\Users\edwin\solr-5.0.0\example\exampledocs\solr-word.pdf] unknown
field 'meta_save_date'
at
org.apache.solr.update.DocumentBuilder.toDocument(DocumentBuilder.java:183)
at
org.apache.solr.update.AddUpdateCommand.getLuceneDocument(AddUpdateCommand.java:78)
at
org.apache.solr.update.DirectUpdateHandler2.addDoc0(DirectUpdateHandler2.java:240)
at
org.apache.solr.update.DirectUpdateHandler2.addDoc(DirectUpdateHandler2.java:166)
at
org.apache.solr.update.processor.RunUpdateProcessor.processAdd(RunUpdateProcessorFactory.java:69)
at
org.apache.solr.update.processor.UpdateRequestProcessor.processAdd(UpdateRequestProcessor.java:51)
at
org.apache.solr.update.processor.DistributedUpdateProcessor.doLocalAdd(DistributedUpdateProcessor.java:931)
at
org.apache.solr.update.processor.DistributedUpdateProcessor.versionAdd(DistributedUpdateProcessor.java:1085)
at
org.apache.solr.update.processor.DistributedUpdateProcessor.processAdd(DistributedUpdateProcessor.java:697)
at
org.apache.solr.update.processor.LogUpdateProcessor.processAdd(LogUpdateProcessorFactory.java:104)
at
org.apache.solr.handler.loader.JavabinLoader$1.update(JavabinLoader.java:96)
at
org.apache.solr.client.solrj.request.JavaBinUpdateRequestCodec$1.readOuterMostDocIterator(JavaBinUpdateRequestCodec.java:166)
at
org.apache.solr.client.solrj.request.JavaBinUpdateRequestCodec$1.readIterator(JavaBinUpdateRequestCodec.java:136)
at org.apache.solr.common.util.JavaBinCodec.readVal(JavaBinCodec.java:225)
at
org.apache.solr.client.solrj.request.JavaBinUpdateRequestCodec$1.readNamedList(JavaBinUpdateRequestCodec.java:121)
at org.apache.solr.common.util.JavaBinCodec.readVal(JavaBinCodec.java:190)
at org.apache.solr.common.util.JavaBinCodec.unmarshal(JavaBinCodec.java:116)
at
org.apache.solr.client.solrj.request.JavaBinUpdateRequestCodec.unmarshal(JavaBinUpdateRequestCodec.java:173)
at
org.apache.solr.handler.loader.JavabinLoader.parseAndLoadDocs(JavabinLoader.java:106)
at org.apache.solr.handler.loader.JavabinLoader.load(JavabinLoader.java:58)
at
org.apache.solr.handler.UpdateRequestHandler$1.load(UpdateRequestHandler.java:103)
at
org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)
at
org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:144)
at org.apache.solr.core.SolrCore.execute(SolrCore.java:2006)
at
org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:777)
at
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:413)
at
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:204)
at
org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1419)
at
org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:455)
at
org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:137)
at
org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:557)
at
org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:231)
at
org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1075)
at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:384)
at
org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:193)
at
org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1009)
at
org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:135)
at
org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:255)
at
org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:154)
at
org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:116)
at org.eclipse.jetty.server.Server.handle(Server.java:368)
at
org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:489)
at
org.eclipse.jetty.server.BlockingHttpConnection.handleRequest(BlockingHttpConnection.java:53)
at
org.eclipse.jetty.server.AbstractHttpConnection.content(AbstractHttpConnection.java:953)
at
org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.content(AbstractHttpConnection.java:1014)
at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:953)
at org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:240)
at
org.eclipse.jetty.server.BlockingHttpConnection.handle(BlockingHttpConnection.java:72)
at
org.eclipse.jetty.server.bio.SocketConnector$ConnectorEndPoint.run(SocketConnector.java:264)
at
org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:608)
at
org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:543)
at java.lang.Thread.run(Unknown Source)

INFO  - 2015-03-18 15:06:51.039;
org.apache.solr.update.DirectUpdateHandler2; start
commit{,optimize=false,openSearcher=true,waitSearcher=true,expungeDeletes=false,softCommit=false,prepareCommit=false}
INFO  - 2015-03-18 15:06:51.039;
org.apache.solr.update.DirectUpdateHandler2; No uncommitted changes.
Skipping IW.commit.
INFO  - 2015-03-18 15:06:51.039; org.apache.solr.core.SolrCore;
SolrIndexSearcher has not changed - not re-opening:
org.apache.solr.search.SolrIndexSearcher
INFO  - 2015-03-18 15:06:51.039;
org.apache.solr.update.DirectUpdateHandler2; end_commit_flush
INFO  - 2015-03-18 15:06:51.039;
org.apache.solr.update.processor.LogUpdateProcessor; [logmill] webapp=/solr
path=/update params={waitSearcher=true&distrib.from=
http://192.168.2.7:8983/solr/logmill/&update.distrib=FROMLEADER&openSearcher=true&commit=true&wt=javabin&expungeDeletes=false&commit_end_point=true&version=2&softCommit=false}
{commit=} 0 0
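
The "unknown field 'meta_save_date'" error at the top of this log looks
like the real cause of the Bad Request reported on the other node: the
extracting handler is producing Tika metadata fields that the schema does
not define. One way to absorb such fields, and what the uprefix=ignored_
mapping relies on, is a catch-all dynamic field in schema.xml; a minimal
sketch, assuming the "ignored" field type from the stock example schema:

    <!-- schema.xml: swallow extracted metadata fields the schema does not define -->
    <fieldType name="ignored" class="solr.StrField" indexed="false" stored="false" multiValued="true"/>
    <dynamicField name="ignored_*" type="ignored" multiValued="true"/>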


Regards,
Edwin

On 19 March 2015 at 12:49, Damien Kamerman <da...@gmail.com> wrote:

> It sounds like https://issues.apache.org/jira/browse/SOLR-5551
> Have you checked the solr.log for all nodes?
>
> On 19 March 2015 at 14:43, Zheng Lin Edwin Yeo <ed...@gmail.com>
> wrote:
>
> > This is the logs that I got from solr.log. I can't seems to figure out
> > what's wrong with it. Does anyone knows?
> >
> >
> >
> > ERROR - 2015-03-18 15:06:51.019;
> > org.apache.solr.update.StreamingSolrClients$1; error
> > org.apache.solr.common.SolrException: Bad Request
> >
> >
> >
> > request:
> >
> >
> http://192.168.2.2:8984/solr/logmill/update?update.distrib=TOLEADER&distrib.from=http%3A%2F%2F192.168.2.2%3A8983%2Fsolr%2Flogmill%2F&wt=javabin&version=2
> > <
> >
> http://192.168.2.2:8984/solr/logmill/update?update.distrib=TOLEADER&distrib.from=http%3A%2F%2F192.168.23.72%3A8983%2Fsolr%2Flogmill%2F&wt=javabin&version=2
> > >
> > at
> >
> >
> org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrClient$Runner.run(ConcurrentUpdateSolrClient.java:241)
> > at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
> > at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
> > at java.lang.Thread.run(Unknown Source)
> > INFO  - 2015-03-18 15:06:51.019;
> > org.apache.solr.update.processor.LogUpdateProcessor; [logmill]
> webapp=/solr
> > path=/update/extract params={literal.id
> > =C:\Users\edwin\solr-5.0.0\example\exampledocs\solr-word.pdf&
> resource.name
> > =C:\Users\edwin\solr-5.0.0\example\exampledocs\solr-word.pdf}
> > {add=[C:\Users\edwin\solr-5.0.0\example\exampledocs\solr-word.pdf]} 0
> 1252
> > INFO  - 2015-03-18 15:06:51.029;
> > org.apache.solr.update.DirectUpdateHandler2; start
> >
> >
> commit{,optimize=false,openSearcher=true,waitSearcher=true,expungeDeletes=false,softCommit=false,prepareCommit=false}
> > INFO  - 2015-03-18 15:06:51.029;
> > org.apache.solr.update.DirectUpdateHandler2; No uncommitted changes.
> > Skipping IW.commit.
> > INFO  - 2015-03-18 15:06:51.029; org.apache.solr.core.SolrCore;
> > SolrIndexSearcher has not changed - not re-opening:
> > org.apache.solr.search.SolrIndexSearcher
> > INFO  - 2015-03-18 15:06:51.039;
> > org.apache.solr.update.DirectUpdateHandler2; end_commit_flush
> > INFO  - 2015-03-18 15:06:51.039;
> > org.apache.solr.update.processor.LogUpdateProcessor; [logmill]
> webapp=/solr
> > path=/update params={waitSearcher=true&distrib.from=
> >
> >
> http://192.168.2.2:8983/solr/logmill/&update.distrib=FROMLEADER&openSearcher=true&commit=true&wt=javabin&expungeDeletes=false&commit_end_point=true&version=2&softCommit=false
> > }
> > {commit=} 0 10
> > INFO  - 2015-03-18 15:06:51.039;
> > org.apache.solr.update.processor.LogUpdateProcessor; [logmill]
> webapp=/solr
> > path=/update params={commit=true} {commit=} 0 10
> >
> >
> >
> > Regards,
> > Edwin
> >
> >
> > On 19 March 2015 at 10:56, Damien Kamerman <da...@gmail.com> wrote:
> >
> > > I suggest you check your solr logs for more info as to the cause.
> > >
> > > On 19 March 2015 at 12:58, Zheng Lin Edwin Yeo <ed...@gmail.com>
> > > wrote:
> > >
> > > > Hi Erick,
> > > >
> > > > No, the PDF file is a testing file which only contains 1 sentence.
> > > >
> > > > I've managed to get it to work by removing startup="lazy" in
> > > > the ExtractingRequestHandler and added the following lines:
> > > >       <str name="uprefix">ignored_</str>
> > > >       <str name="captureAttr">true</str>
> > > >       <str name="fmap.a">links</str>
> > > >       <str name="fmap.div">ignored_</str>
> > > >
> > > > Does the presence of startup="lazy" affect the function of
> > > > ExtractingRequestHandler , or is it one of the str name values?
> > > >
> > > > Regards,
> > > > Edwin
> > > >
> > > >
> > > > On 18 March 2015 at 23:19, Erick Erickson <er...@gmail.com>
> > > wrote:
> > > >
> > > > > Shot in the dark, but is the PDF file significantly larger than the
> > > > > others? Perhaps your simply exceeding the packet limits for the
> > > > > servlet container?
> > > > >
> > > > > Best,
> > > > > Erick
> > > > >
> > > > > On Wed, Mar 18, 2015 at 12:22 AM, Zheng Lin Edwin Yeo
> > > > > <ed...@gmail.com> wrote:
> > > > > > Hi everyone,
> > > > > >
> > > > > > I'm having some issues with indexing rich-text documents from the
> > > Solr
> > > > > > Cloud. When I tried to index a pdf or word document, I get the
> > > > following
> > > > > > error:
> > > > > >
> > > > > >
> > > > > > org.apache.solr.common.SolrException: Bad Request
> > > > > >
> > > > > >
> > > > > >
> > > > > > request:
> > > > >
> > > >
> > >
> >
> http://192.168.2.2:8984/solr/logmill/update?update.distrib=TOLEADER&distrib.from=http%3A%2F%2F192.168.2.2%3A8983%2Fsolr%2Flogmill%2F&wt=javabin&version=2
> > > > > >         at
> > > > >
> > > >
> > >
> >
> org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrClient$Runner.run(ConcurrentUpdateSolrClient.java:241)
> > > > > >         at
> > java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown
> > > > > Source)
> > > > > >         at
> > java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown
> > > > > Source)
> > > > > >         at java.lang.Thread.run(Unknown Source)
> > > > > >
> > > > > >
> > > > > > I'm able to index .xml and .csv files in Solr Cloud with the same
> > > > > configuration.
> > > > > >
> > > > > > I have setup Solr Cloud using the default zookeeper in Solr
> 5.0.0,
> > > and
> > > > > > I have 2 shards with the following details:
> > > > > > Shard1: 192.168.2.2:8983
> > > > > > Shard2: 192.168.2.2:8984
> > > > > >
> > > > > > Prior to this, I'm already able to index rich-text documents
> > without
> > > > > > the Solr Cloud, and I'm using the same solrconfig.xml and
> > schema.xml,
> > > > > > so my ExtractRequestHandler is already defined.
> > > > > >
> > > > > > Is there other settings required in order to index rich-text
> > > documents
> > > > > > in Solr Cloud?
> > > > > >
> > > > > >
> > > > > > Regards,
> > > > > > Edwin
> > > > >
> > > >
> > >
> > >
> > >
> > > --
> > > Damien Kamerman
> > >
> >
>
>
>
> --
> Damien Kamerman
>

Re: Unable to index rich-text documents in Solr Cloud

Posted by Damien Kamerman <da...@gmail.com>.
It sounds like https://issues.apache.org/jira/browse/SOLR-5551
Have you checked the solr.log for all nodes?

On 19 March 2015 at 14:43, Zheng Lin Edwin Yeo <ed...@gmail.com> wrote:

> This is the logs that I got from solr.log. I can't seems to figure out
> what's wrong with it. Does anyone knows?
>
>
>
> ERROR - 2015-03-18 15:06:51.019;
> org.apache.solr.update.StreamingSolrClients$1; error
> org.apache.solr.common.SolrException: Bad Request
>
>
>
> request:
>
> http://192.168.2.2:8984/solr/logmill/update?update.distrib=TOLEADER&distrib.from=http%3A%2F%2F192.168.2.2%3A8983%2Fsolr%2Flogmill%2F&wt=javabin&version=2
> <
> http://192.168.2.2:8984/solr/logmill/update?update.distrib=TOLEADER&distrib.from=http%3A%2F%2F192.168.23.72%3A8983%2Fsolr%2Flogmill%2F&wt=javabin&version=2
> >
> at
>
> org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrClient$Runner.run(ConcurrentUpdateSolrClient.java:241)
> at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
> at java.lang.Thread.run(Unknown Source)
> INFO  - 2015-03-18 15:06:51.019;
> org.apache.solr.update.processor.LogUpdateProcessor; [logmill] webapp=/solr
> path=/update/extract params={literal.id
> =C:\Users\edwin\solr-5.0.0\example\exampledocs\solr-word.pdf&resource.name
> =C:\Users\edwin\solr-5.0.0\example\exampledocs\solr-word.pdf}
> {add=[C:\Users\edwin\solr-5.0.0\example\exampledocs\solr-word.pdf]} 0 1252
> INFO  - 2015-03-18 15:06:51.029;
> org.apache.solr.update.DirectUpdateHandler2; start
>
> commit{,optimize=false,openSearcher=true,waitSearcher=true,expungeDeletes=false,softCommit=false,prepareCommit=false}
> INFO  - 2015-03-18 15:06:51.029;
> org.apache.solr.update.DirectUpdateHandler2; No uncommitted changes.
> Skipping IW.commit.
> INFO  - 2015-03-18 15:06:51.029; org.apache.solr.core.SolrCore;
> SolrIndexSearcher has not changed - not re-opening:
> org.apache.solr.search.SolrIndexSearcher
> INFO  - 2015-03-18 15:06:51.039;
> org.apache.solr.update.DirectUpdateHandler2; end_commit_flush
> INFO  - 2015-03-18 15:06:51.039;
> org.apache.solr.update.processor.LogUpdateProcessor; [logmill] webapp=/solr
> path=/update params={waitSearcher=true&distrib.from=
>
> http://192.168.2.2:8983/solr/logmill/&update.distrib=FROMLEADER&openSearcher=true&commit=true&wt=javabin&expungeDeletes=false&commit_end_point=true&version=2&softCommit=false
> }
> {commit=} 0 10
> INFO  - 2015-03-18 15:06:51.039;
> org.apache.solr.update.processor.LogUpdateProcessor; [logmill] webapp=/solr
> path=/update params={commit=true} {commit=} 0 10
>
>
>
> Regards,
> Edwin
>
>
> On 19 March 2015 at 10:56, Damien Kamerman <da...@gmail.com> wrote:
>
> > I suggest you check your solr logs for more info as to the cause.
> >
> > On 19 March 2015 at 12:58, Zheng Lin Edwin Yeo <ed...@gmail.com>
> > wrote:
> >
> > > Hi Erick,
> > >
> > > No, the PDF file is a testing file which only contains 1 sentence.
> > >
> > > I've managed to get it to work by removing startup="lazy" in
> > > the ExtractingRequestHandler and added the following lines:
> > >       <str name="uprefix">ignored_</str>
> > >       <str name="captureAttr">true</str>
> > >       <str name="fmap.a">links</str>
> > >       <str name="fmap.div">ignored_</str>
> > >
> > > Does the presence of startup="lazy" affect the function of
> > > ExtractingRequestHandler , or is it one of the str name values?
> > >
> > > Regards,
> > > Edwin
> > >
> > >
> > > On 18 March 2015 at 23:19, Erick Erickson <er...@gmail.com>
> > wrote:
> > >
> > > > Shot in the dark, but is the PDF file significantly larger than the
> > > > others? Perhaps your simply exceeding the packet limits for the
> > > > servlet container?
> > > >
> > > > Best,
> > > > Erick
> > > >
> > > > On Wed, Mar 18, 2015 at 12:22 AM, Zheng Lin Edwin Yeo
> > > > <ed...@gmail.com> wrote:
> > > > > Hi everyone,
> > > > >
> > > > > I'm having some issues with indexing rich-text documents from the
> > Solr
> > > > > Cloud. When I tried to index a pdf or word document, I get the
> > > following
> > > > > error:
> > > > >
> > > > >
> > > > > org.apache.solr.common.SolrException: Bad Request
> > > > >
> > > > >
> > > > >
> > > > > request:
> > > >
> > >
> >
> http://192.168.2.2:8984/solr/logmill/update?update.distrib=TOLEADER&distrib.from=http%3A%2F%2F192.168.2.2%3A8983%2Fsolr%2Flogmill%2F&wt=javabin&version=2
> > > > >         at
> > > >
> > >
> >
> org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrClient$Runner.run(ConcurrentUpdateSolrClient.java:241)
> > > > >         at
> java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown
> > > > Source)
> > > > >         at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown
> > > > Source)
> > > > >         at java.lang.Thread.run(Unknown Source)
> > > > >
> > > > >
> > > > > I'm able to index .xml and .csv files in Solr Cloud with the same
> > > > configuration.
> > > > >
> > > > > I have setup Solr Cloud using the default zookeeper in Solr 5.0.0,
> > and
> > > > > I have 2 shards with the following details:
> > > > > Shard1: 192.168.2.2:8983
> > > > > Shard2: 192.168.2.2:8984
> > > > >
> > > > > Prior to this, I'm already able to index rich-text documents
> without
> > > > > the Solr Cloud, and I'm using the same solrconfig.xml and
> schema.xml,
> > > > > so my ExtractRequestHandler is already defined.
> > > > >
> > > > > Is there other settings required in order to index rich-text
> > documents
> > > > > in Solr Cloud?
> > > > >
> > > > >
> > > > > Regards,
> > > > > Edwin
> > > >
> > >
> >
> >
> >
> > --
> > Damien Kamerman
> >
>



-- 
Damien Kamerman

Re: Unable to index rich-text documents in Solr Cloud

Posted by Zheng Lin Edwin Yeo <ed...@gmail.com>.
These are the logs that I got from solr.log. I can't seem to figure out
what's wrong with them. Does anyone know?



ERROR - 2015-03-18 15:06:51.019;
org.apache.solr.update.StreamingSolrClients$1; error
org.apache.solr.common.SolrException: Bad Request



request:
http://192.168.2.2:8984/solr/logmill/update?update.distrib=TOLEADER&distrib.from=http%3A%2F%2F192.168.2.2%3A8983%2Fsolr%2Flogmill%2F&wt=javabin&version=2
<http://192.168.2.2:8984/solr/logmill/update?update.distrib=TOLEADER&distrib.from=http%3A%2F%2F192.168.23.72%3A8983%2Fsolr%2Flogmill%2F&wt=javabin&version=2>
at
org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrClient$Runner.run(ConcurrentUpdateSolrClient.java:241)
at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
at java.lang.Thread.run(Unknown Source)
INFO  - 2015-03-18 15:06:51.019;
org.apache.solr.update.processor.LogUpdateProcessor; [logmill] webapp=/solr
path=/update/extract params={literal.id
=C:\Users\edwin\solr-5.0.0\example\exampledocs\solr-word.pdf&resource.name=C:\Users\edwin\solr-5.0.0\example\exampledocs\solr-word.pdf}
{add=[C:\Users\edwin\solr-5.0.0\example\exampledocs\solr-word.pdf]} 0 1252
INFO  - 2015-03-18 15:06:51.029;
org.apache.solr.update.DirectUpdateHandler2; start
commit{,optimize=false,openSearcher=true,waitSearcher=true,expungeDeletes=false,softCommit=false,prepareCommit=false}
INFO  - 2015-03-18 15:06:51.029;
org.apache.solr.update.DirectUpdateHandler2; No uncommitted changes.
Skipping IW.commit.
INFO  - 2015-03-18 15:06:51.029; org.apache.solr.core.SolrCore;
SolrIndexSearcher has not changed - not re-opening:
org.apache.solr.search.SolrIndexSearcher
INFO  - 2015-03-18 15:06:51.039;
org.apache.solr.update.DirectUpdateHandler2; end_commit_flush
INFO  - 2015-03-18 15:06:51.039;
org.apache.solr.update.processor.LogUpdateProcessor; [logmill] webapp=/solr
path=/update params={waitSearcher=true&distrib.from=
http://192.168.2.2:8983/solr/logmill/&update.distrib=FROMLEADER&openSearcher=true&commit=true&wt=javabin&expungeDeletes=false&commit_end_point=true&version=2&softCommit=false}
{commit=} 0 10
INFO  - 2015-03-18 15:06:51.039;
org.apache.solr.update.processor.LogUpdateProcessor; [logmill] webapp=/solr
path=/update params={commit=true} {commit=} 0 10



Regards,
Edwin


On 19 March 2015 at 10:56, Damien Kamerman <da...@gmail.com> wrote:

> I suggest you check your solr logs for more info as to the cause.
>
> On 19 March 2015 at 12:58, Zheng Lin Edwin Yeo <ed...@gmail.com>
> wrote:
>
> > Hi Erick,
> >
> > No, the PDF file is a testing file which only contains 1 sentence.
> >
> > I've managed to get it to work by removing startup="lazy" in
> > the ExtractingRequestHandler and added the following lines:
> >       <str name="uprefix">ignored_</str>
> >       <str name="captureAttr">true</str>
> >       <str name="fmap.a">links</str>
> >       <str name="fmap.div">ignored_</str>
> >
> > Does the presence of startup="lazy" affect the function of
> > ExtractingRequestHandler , or is it one of the str name values?
> >
> > Regards,
> > Edwin
> >
> >
> > On 18 March 2015 at 23:19, Erick Erickson <er...@gmail.com>
> wrote:
> >
> > > Shot in the dark, but is the PDF file significantly larger than the
> > > others? Perhaps your simply exceeding the packet limits for the
> > > servlet container?
> > >
> > > Best,
> > > Erick
> > >
> > > On Wed, Mar 18, 2015 at 12:22 AM, Zheng Lin Edwin Yeo
> > > <ed...@gmail.com> wrote:
> > > > Hi everyone,
> > > >
> > > > I'm having some issues with indexing rich-text documents from the
> Solr
> > > > Cloud. When I tried to index a pdf or word document, I get the
> > following
> > > > error:
> > > >
> > > >
> > > > org.apache.solr.common.SolrException: Bad Request
> > > >
> > > >
> > > >
> > > > request:
> > >
> >
> http://192.168.2.2:8984/solr/logmill/update?update.distrib=TOLEADER&distrib.from=http%3A%2F%2F192.168.2.2%3A8983%2Fsolr%2Flogmill%2F&wt=javabin&version=2
> > > >         at
> > >
> >
> org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrClient$Runner.run(ConcurrentUpdateSolrClient.java:241)
> > > >         at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown
> > > Source)
> > > >         at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown
> > > Source)
> > > >         at java.lang.Thread.run(Unknown Source)
> > > >
> > > >
> > > > I'm able to index .xml and .csv files in Solr Cloud with the same
> > > configuration.
> > > >
> > > > I have setup Solr Cloud using the default zookeeper in Solr 5.0.0,
> and
> > > > I have 2 shards with the following details:
> > > > Shard1: 192.168.2.2:8983
> > > > Shard2: 192.168.2.2:8984
> > > >
> > > > Prior to this, I'm already able to index rich-text documents without
> > > > the Solr Cloud, and I'm using the same solrconfig.xml and schema.xml,
> > > > so my ExtractRequestHandler is already defined.
> > > >
> > > > Is there other settings required in order to index rich-text
> documents
> > > > in Solr Cloud?
> > > >
> > > >
> > > > Regards,
> > > > Edwin
> > >
> >
>
>
>
> --
> Damien Kamerman
>

Re: Unable to index rich-text documents in Solr Cloud

Posted by Damien Kamerman <da...@gmail.com>.
I suggest you check your solr logs for more info as to the cause.

On 19 March 2015 at 12:58, Zheng Lin Edwin Yeo <ed...@gmail.com> wrote:

> Hi Erick,
>
> No, the PDF file is a testing file which only contains 1 sentence.
>
> I've managed to get it to work by removing startup="lazy" in
> the ExtractingRequestHandler and added the following lines:
>       <str name="uprefix">ignored_</str>
>       <str name="captureAttr">true</str>
>       <str name="fmap.a">links</str>
>       <str name="fmap.div">ignored_</str>
>
> Does the presence of startup="lazy" affect the function of
> ExtractingRequestHandler , or is it one of the str name values?
>
> Regards,
> Edwin
>
>
> On 18 March 2015 at 23:19, Erick Erickson <er...@gmail.com> wrote:
>
> > Shot in the dark, but is the PDF file significantly larger than the
> > others? Perhaps your simply exceeding the packet limits for the
> > servlet container?
> >
> > Best,
> > Erick
> >
> > On Wed, Mar 18, 2015 at 12:22 AM, Zheng Lin Edwin Yeo
> > <ed...@gmail.com> wrote:
> > > Hi everyone,
> > >
> > > I'm having some issues with indexing rich-text documents from the Solr
> > > Cloud. When I tried to index a pdf or word document, I get the
> following
> > > error:
> > >
> > >
> > > org.apache.solr.common.SolrException: Bad Request
> > >
> > >
> > >
> > > request:
> >
> http://192.168.2.2:8984/solr/logmill/update?update.distrib=TOLEADER&distrib.from=http%3A%2F%2F192.168.2.2%3A8983%2Fsolr%2Flogmill%2F&wt=javabin&version=2
> > >         at
> >
> org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrClient$Runner.run(ConcurrentUpdateSolrClient.java:241)
> > >         at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown
> > Source)
> > >         at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown
> > Source)
> > >         at java.lang.Thread.run(Unknown Source)
> > >
> > >
> > > I'm able to index .xml and .csv files in Solr Cloud with the same
> > configuration.
> > >
> > > I have setup Solr Cloud using the default zookeeper in Solr 5.0.0, and
> > > I have 2 shards with the following details:
> > > Shard1: 192.168.2.2:8983
> > > Shard2: 192.168.2.2:8984
> > >
> > > Prior to this, I'm already able to index rich-text documents without
> > > the Solr Cloud, and I'm using the same solrconfig.xml and schema.xml,
> > > so my ExtractRequestHandler is already defined.
> > >
> > > Is there other settings required in order to index rich-text documents
> > > in Solr Cloud?
> > >
> > >
> > > Regards,
> > > Edwin
> >
>



-- 
Damien Kamerman

Re: Unable to index rich-text documents in Solr Cloud

Posted by Zheng Lin Edwin Yeo <ed...@gmail.com>.
Hi Erick,

No, the PDF file is a test file which only contains one sentence.

I've managed to get it to work by removing startup="lazy" from
the ExtractingRequestHandler definition and adding the following lines:
      <str name="uprefix">ignored_</str>
      <str name="captureAttr">true</str>
      <str name="fmap.a">links</str>
      <str name="fmap.div">ignored_</str>

Does the presence of startup="lazy" affect the function of
the ExtractingRequestHandler, or is it one of the str name values?
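
For context, those lines sit in the defaults section of the /update/extract
handler in solrconfig.xml; a minimal sketch of the whole definition, assuming
the stock lib paths for the extraction contrib (adjust them to your layout):

    <lib dir="${solr.install.dir:../../..}/contrib/extraction/lib" regex=".*\.jar" />
    <lib dir="${solr.install.dir:../../..}/dist/" regex="solr-cell-\d.*\.jar" />

    <requestHandler name="/update/extract"
                    class="solr.extraction.ExtractingRequestHandler"
                    startup="lazy">
      <lst name="defaults">
        <str name="uprefix">ignored_</str>
        <str name="captureAttr">true</str>
        <str name="fmap.a">links</str>
        <str name="fmap.div">ignored_</str>
      </lst>
    </requestHandler>

startup="lazy" only defers loading of the handler and its Tika libraries
until the first request, so most likely it is the uprefix mapping, rather
than removing the lazy flag, that made the difference here.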

Regards,
Edwin


On 18 March 2015 at 23:19, Erick Erickson <er...@gmail.com> wrote:

> Shot in the dark, but is the PDF file significantly larger than the
> others? Perhaps your simply exceeding the packet limits for the
> servlet container?
>
> Best,
> Erick
>
> On Wed, Mar 18, 2015 at 12:22 AM, Zheng Lin Edwin Yeo
> <ed...@gmail.com> wrote:
> > Hi everyone,
> >
> > I'm having some issues with indexing rich-text documents from the Solr
> > Cloud. When I tried to index a pdf or word document, I get the following
> > error:
> >
> >
> > org.apache.solr.common.SolrException: Bad Request
> >
> >
> >
> > request:
> http://192.168.2.2:8984/solr/logmill/update?update.distrib=TOLEADER&distrib.from=http%3A%2F%2F192.168.2.2%3A8983%2Fsolr%2Flogmill%2F&wt=javabin&version=2
> >         at
> org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrClient$Runner.run(ConcurrentUpdateSolrClient.java:241)
> >         at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown
> Source)
> >         at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown
> Source)
> >         at java.lang.Thread.run(Unknown Source)
> >
> >
> > I'm able to index .xml and .csv files in Solr Cloud with the same
> configuration.
> >
> > I have setup Solr Cloud using the default zookeeper in Solr 5.0.0, and
> > I have 2 shards with the following details:
> > Shard1: 192.168.2.2:8983
> > Shard2: 192.168.2.2:8984
> >
> > Prior to this, I'm already able to index rich-text documents without
> > the Solr Cloud, and I'm using the same solrconfig.xml and schema.xml,
> > so my ExtractRequestHandler is already defined.
> >
> > Is there other settings required in order to index rich-text documents
> > in Solr Cloud?
> >
> >
> > Regards,
> > Edwin
>

Re: Unable to index rich-text documents in Solr Cloud

Posted by Erick Erickson <er...@gmail.com>.
Shot in the dark, but is the PDF file significantly larger than the
others? Perhaps you're simply exceeding the packet limits for the
servlet container?
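
If size were the problem, the multipart upload limit Solr itself enforces
lives in the requestDispatcher section of solrconfig.xml; the stock Solr 5
example settings look roughly like this:

    <requestDispatcher handleSelect="false">
      <!-- upload size limits applied to update requests -->
      <requestParsers enableRemoteStreaming="true"
                      multipartUploadLimitInKB="2048000"
                      formdataUploadLimitInKB="2048"
                      addHttpRequestToContext="false"/>
    </requestDispatcher>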

Best,
Erick

On Wed, Mar 18, 2015 at 12:22 AM, Zheng Lin Edwin Yeo
<ed...@gmail.com> wrote:
> Hi everyone,
>
> I'm having some issues with indexing rich-text documents from the Solr
> Cloud. When I tried to index a pdf or word document, I get the following
> error:
>
>
> org.apache.solr.common.SolrException: Bad Request
>
>
>
> request: http://192.168.2.2:8984/solr/logmill/update?update.distrib=TOLEADER&distrib.from=http%3A%2F%2F192.168.2.2%3A8983%2Fsolr%2Flogmill%2F&wt=javabin&version=2
>         at org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrClient$Runner.run(ConcurrentUpdateSolrClient.java:241)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
>         at java.lang.Thread.run(Unknown Source)
>
>
> I'm able to index .xml and .csv files in Solr Cloud with the same configuration.
>
> I have setup Solr Cloud using the default zookeeper in Solr 5.0.0, and
> I have 2 shards with the following details:
> Shard1: 192.168.2.2:8983
> Shard2: 192.168.2.2:8984
>
> Prior to this, I'm already able to index rich-text documents without
> the Solr Cloud, and I'm using the same solrconfig.xml and schema.xml,
> so my ExtractRequestHandler is already defined.
>
> Is there other settings required in order to index rich-text documents
> in Solr Cloud?
>
>
> Regards,
> Edwin