Posted to solr-user@lucene.apache.org by Fergus McMenemie <fe...@twig.me.uk> on 2009/04/06 16:16:02 UTC

Re: Using ExtractingRequestHandler to index a large PDF ~solved

Hmmm,

Not sure how this all hangs together. But editing my solrconfig.xml as follows
sorted the problem:-

    <requestParsers enableRemoteStreaming="false" multipartUploadLimitInKB="2048" />
to 

    <requestParsers enableRemoteStreaming="false" multipartUploadLimitInKB="20048" />
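For what it's worth, the numbers in the exception line up exactly with that setting; multipartUploadLimitInKB is evidently counted in KB of 1024 bytes. A quick sanity check (sizes taken from the log quoted below):

```python
# Sizes from the SizeLimitExceededException in the quoted log.
old_limit_bytes = 2048 * 1024    # 2097152, the "configured maximum" in the exception
new_limit_bytes = 20048 * 1024   # the raised limit, roughly 19.6 MB
pdf_size = 4585774               # the rejected request's size

print(pdf_size > old_limit_bytes)   # True: the PDF blew the old 2 MB limit
print(pdf_size <= new_limit_bytes)  # True: it fits comfortably under the new one
```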

Also, in my initial report of the issue I was misled by the log messages. The mention
of "oceania.pdf" refers to a previous successful Tika extract. There is no mention
in the logs of the filename that was rejected, nor any information that would help
me identify it!

Regards Fergus.

>Sorry if this is a FAQ; I suspect it could be. But how do I work around the following:-
>
>INFO: [] webapp=/apache-solr-1.4-dev path=/update/extract params={ext.def.fl=text&ext.literal.id=factbook/reference_maps/pdf/oceania.pdf} status=0 QTime=318 
>Apr 2, 2009 11:17:46 AM org.apache.solr.common.SolrException log
>SEVERE: org.apache.commons.fileupload.FileUploadBase$SizeLimitExceededException: the request was rejected because its size (4585774) exceeds the configured maximum (2097152)
>	at org.apache.commons.fileupload.FileUploadBase$FileItemIteratorImpl.<init>(FileUploadBase.java:914)
>	at org.apache.commons.fileupload.FileUploadBase.getItemIterator(FileUploadBase.java:331)
>	at org.apache.commons.fileupload.FileUploadBase.parseRequest(FileUploadBase.java:349)
>	at org.apache.commons.fileupload.servlet.ServletFileUpload.parseRequest(ServletFileUpload.java:126)
>	at org.apache.solr.servlet.MultipartRequestParser.parseParamsAndFillStreams(SolrRequestParsers.java:343)
>	at org.apache.solr.servlet.StandardRequestParser.parseParamsAndFillStreams(SolrRequestParsers.java:396)
>	at org.apache.solr.servlet.SolrRequestParsers.parse(SolrRequestParsers.java:114)
>	at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:217)
>	at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:202)
>	at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:173)
>	at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:213)
>	at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:178)
>	at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:126)
>	at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:105)
>
>Although the PDF is big, it contains very little text; it is a map. 
>
>   "java -jar solr/lib/tika-0.3.jar -g" appears to have no bother with it.
>
>Fergus...

-- 

===============================================================
Fergus McMenemie               Email:fergus@twig.me.uk
Techmore Ltd                   Phone:(UK) 07721 376021

Unix/Mac/Intranets             Analyst Programmer
===============================================================

Re: Using ExtractingRequestHandler to index a large PDF ~solved

Posted by Fergus McMenemie <fe...@twig.me.uk>.
>On Apr 6, 2009, at 10:16 AM, Fergus McMenemie wrote:
>
>> Hmmm,
>>
>> Not sure how this all hangs together. But editing my solrconfig.xml as follows
>> sorted the problem:-
>>
>>    <requestParsers enableRemoteStreaming="false" multipartUploadLimitInKB="2048" />
>> to
>>
>>    <requestParsers enableRemoteStreaming="false" multipartUploadLimitInKB="20048" />
>>
>
>We should document this on the wiki or in the config, if it isn't  
>already.

As best I could tell it is not documented. I stumbled across
the idea of changing multipartUploadLimitInKB after reviewing
http://wiki.apache.org/solr/UpdateRichDocuments. But this led
me to wonder whether streaming files from a local disk is also
available to the solr-cell feature via enableRemoteStreaming.
With 20:20 hindsight I see that
http://wiki.apache.org/solr/SolrConfigXml does briefly refer
to "file upload size".
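If I read the wiki right, setting enableRemoteStreaming="true" would let you pass stream.file (or stream.url), so that Solr reads the file from disk itself rather than receiving a multipart upload, which should sidestep multipartUploadLimitInKB altogether. A sketch of what such a request might look like (host, port and file path are made up, and I have not actually tried this against solr-cell):

```python
from urllib.parse import urlencode

# Hypothetical request; requires enableRemoteStreaming="true" in solrconfig.xml.
params = {
    "ext.literal.id": "factbook/reference_maps/pdf/oceania.pdf",
    "ext.def.fl": "text",
    # Solr opens this path itself instead of parsing a multipart body.
    "stream.file": "/data/factbook/reference_maps/pdf/oceania.pdf",
}
url = "http://localhost:8983/solr/update/extract?" + urlencode(params)
print(url)
```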

I feel that the requestDispatcher section of solrconfig.xml
needs a more complete description. I get the impression it
acts as a filter on *any* URL sent to Solr? What does it do?
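For reference, the element in question sits inside the requestDispatcher section of solrconfig.xml; in the 1.4 example config it looks roughly like this (quoting from memory, so treat it as a sketch rather than gospel):

```xml
<requestDispatcher handleSelect="true">
  <!-- As I understand it, these parsers apply to any request Solr dispatches,
       not just /update/extract; the upload limit below is in KB. -->
  <requestParsers enableRemoteStreaming="false" multipartUploadLimitInKB="2048" />
</requestDispatcher>
```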

I will mark up the wiki when this is clarified.


>
>> Also, in my initial report of the issue I was misled by the log messages. The mention
>> of "oceania.pdf" refers to a previous successful Tika extract. There is no mention
>> in the logs of the filename that was rejected, nor any information that would help
>> me identify it!
>
>We should fix this so it at least spits out a meaningful message.  Can  
>you open a JIRA?
>

OK, SOLR-1113 raised.

>>
>> Regards Fergus.
>>
>>> Sorry if this is a FAQ; I suspect it could be. But how do I work around the following:-
>>>
>>> INFO: [] webapp=/apache-solr-1.4-dev path=/update/extract params={ext.def.fl=text&ext.literal.id=factbook/reference_maps/pdf/oceania.pdf} status=0 QTime=318
>>> Apr 2, 2009 11:17:46 AM org.apache.solr.common.SolrException log
>>> SEVERE: org.apache.commons.fileupload.FileUploadBase$SizeLimitExceededException: the request was rejected because its size (4585774) exceeds the configured maximum (2097152)
>>> 	at org.apache.commons.fileupload.FileUploadBase$FileItemIteratorImpl.<init>(FileUploadBase.java:914)
>>> 	at org.apache.commons.fileupload.FileUploadBase.getItemIterator(FileUploadBase.java:331)
>>> 	at org.apache.commons.fileupload.FileUploadBase.parseRequest(FileUploadBase.java:349)
>>> 	at org.apache.commons.fileupload.servlet.ServletFileUpload.parseRequest(ServletFileUpload.java:126)
>>> 	at org.apache.solr.servlet.MultipartRequestParser.parseParamsAndFillStreams(SolrRequestParsers.java:343)
>>> 	at org.apache.solr.servlet.StandardRequestParser.parseParamsAndFillStreams(SolrRequestParsers.java:396)
>>> 	at org.apache.solr.servlet.SolrRequestParsers.parse(SolrRequestParsers.java:114)
>>> 	at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:217)
>>> 	at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:202)
>>> 	at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:173)
>>> 	at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:213)
>>> 	at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:178)
>>> 	at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:126)
>>> 	at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:105)
>>>
>>> Although the PDF is big, it contains very little text; it is a map.
>>>
>>> "java -jar solr/lib/tika-0.3.jar -g" appears to have no bother with it.
>>>
>>> Fergus...
>


Re: Using ExtractingRequestHandler to index a large PDF ~solved

Posted by Grant Ingersoll <gs...@apache.org>.
On Apr 6, 2009, at 10:16 AM, Fergus McMenemie wrote:

> Hmmm,
>
> Not sure how this all hangs together. But editing my solrconfig.xml as follows
> sorted the problem:-
>
>    <requestParsers enableRemoteStreaming="false" multipartUploadLimitInKB="2048" />
> to
>
>    <requestParsers enableRemoteStreaming="false" multipartUploadLimitInKB="20048" />
>

We should document this on the wiki or in the config, if it isn't  
already.

> Also, in my initial report of the issue I was misled by the log messages. The mention
> of "oceania.pdf" refers to a previous successful Tika extract. There is no mention
> in the logs of the filename that was rejected, nor any information that would help
> me identify it!

We should fix this so it at least spits out a meaningful message.  Can  
you open a JIRA?

>
>
> Regards Fergus.
>
>> Sorry if this is a FAQ; I suspect it could be. But how do I work around the following:-
>>
>> INFO: [] webapp=/apache-solr-1.4-dev path=/update/extract params={ext.def.fl=text&ext.literal.id=factbook/reference_maps/pdf/oceania.pdf} status=0 QTime=318
>> Apr 2, 2009 11:17:46 AM org.apache.solr.common.SolrException log
>> SEVERE: org.apache.commons.fileupload.FileUploadBase$SizeLimitExceededException: the request was rejected because its size (4585774) exceeds the configured maximum (2097152)
>> 	at org.apache.commons.fileupload.FileUploadBase$FileItemIteratorImpl.<init>(FileUploadBase.java:914)
>> 	at org.apache.commons.fileupload.FileUploadBase.getItemIterator(FileUploadBase.java:331)
>> 	at org.apache.commons.fileupload.FileUploadBase.parseRequest(FileUploadBase.java:349)
>> 	at org.apache.commons.fileupload.servlet.ServletFileUpload.parseRequest(ServletFileUpload.java:126)
>> 	at org.apache.solr.servlet.MultipartRequestParser.parseParamsAndFillStreams(SolrRequestParsers.java:343)
>> 	at org.apache.solr.servlet.StandardRequestParser.parseParamsAndFillStreams(SolrRequestParsers.java:396)
>> 	at org.apache.solr.servlet.SolrRequestParsers.parse(SolrRequestParsers.java:114)
>> 	at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:217)
>> 	at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:202)
>> 	at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:173)
>> 	at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:213)
>> 	at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:178)
>> 	at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:126)
>> 	at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:105)
>>
>> Although the PDF is big, it contains very little text; it is a map.
>>
>> "java -jar solr/lib/tika-0.3.jar -g" appears to have no bother with it.
>>
>> Fergus...

--------------------------
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids)  
using Solr/Lucene:
http://www.lucidimagination.com/search