Posted to solr-user@lucene.apache.org by Brandon Waterloo <Br...@matrix.msu.edu> on 2011/04/04 20:00:53 UTC

Problems indexing very large set of documents

 Hey everybody,

I've been running into some issues indexing a very large set of documents.  There are about 4000 PDF files, ranging in size from 10KB to 160MB.  Obviously this is a big task for Solr.  I have a PHP script that iterates over the directory and uses PHP cURL to post each file to Solr for indexing.  For now, commit is set to false to speed up the indexing, and I'm assuming that Solr should be auto-committing as necessary.  I'm using the default solrconfig.xml file included in apache-solr-1.4.1\example\solr\conf.  Once all the documents have been processed, the PHP script sends Solr a commit.
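For reference, the core of such a loop might look like the sketch below.  Only the /solr/update/extract path, the literal.id parameter, and commit=false come from the actual requests logged later in this thread; the host, port, source directory, and the "myfile" upload field name are assumptions.

    <?php
    // Minimal sketch of the indexing loop described above; not the poster's
    // actual script.  Host, port, paths, and field name are assumptions.
    $solr = 'http://localhost:8983/solr';
    foreach (glob('/path/to/pdfs/*.pdf') as $path) {
        $id  = basename($path, '.pdf');
        $url = $solr . '/update/extract?literal.id=' . urlencode($id) . '&commit=false';
        $ch  = curl_init($url);
        curl_setopt($ch, CURLOPT_POST, true);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        // The '@' upload syntax is the PHP 5.x cURL convention contemporary
        // with Solr 1.4.1.
        curl_setopt($ch, CURLOPT_POSTFIELDS, array('myfile' => '@' . $path));
        curl_exec($ch);
        curl_close($ch);
    }
    // One explicit commit once every file has been posted.
    $ch = curl_init($solr . '/update?commit=true');
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_exec($ch);
    curl_close($ch);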

The main problem is that after a few thousand documents (around 2000 last time I tried), nearly every document begins causing Java exceptions in Solr:

Apr 4, 2011 1:18:01 PM org.apache.solr.common.SolrException log
SEVERE: org.apache.solr.common.SolrException: org.apache.tika.exception.TikaException: TIKA-198: Illegal IOException from org.apache.tika.parser.pdf.PDFParser@11d329d
        at org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:211)
        at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:54)
        at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
        at org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.handleRequest(RequestHandlers.java:233)
        at org.apache.solr.core.SolrCore.execute(SolrCore.java:1316)
        at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:338)
        at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:241)
        at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1089)
        at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:365)
        at org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
        at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:181)
        at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:712)
        at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:405)
        at org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:211)
        at org.mortbay.jetty.handler.HandlerCollection.handle(HandlerCollection.java:114)
        at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:139)
        at org.mortbay.jetty.Server.handle(Server.java:285)
        at org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:502)
        at org.mortbay.jetty.HttpConnection$RequestHandler.content(HttpConnection.java:835)
        at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:641)
        at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:202)
        at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:378)
        at org.mortbay.jetty.bio.SocketConnector$Connection.run(SocketConnector.java:226)
        at org.mortbay.thread.BoundedThreadPool$PoolThread.run(BoundedThreadPool.java:442)
Caused by: org.apache.tika.exception.TikaException: TIKA-198: Illegal IOException from org.apache.tika.parser.pdf.PDFParser@11d329d
        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:125)
        at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:105)
        at org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:190)
        ... 23 more
Caused by: java.io.IOException: expected='endobj' firstReadAttempt='' secondReadAttempt='' org.pdfbox.io.PushBackInputStream@b19bfc
        at org.pdfbox.pdfparser.PDFParser.parseObject(PDFParser.java:502)
        at org.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:176)
        at org.pdfbox.pdmodel.PDDocument.load(PDDocument.java:707)
        at org.pdfbox.pdmodel.PDDocument.load(PDDocument.java:691)
        at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:40)
        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:119)
        ... 25 more

As far as I know there's nothing special about these documents so I'm wondering if it's not properly autocommitting.  What would be appropriate settings in solrconfig.xml for this particular application?  I'd like it to autocommit as soon as it needs to but no more often than that for the sake of efficiency.  Obviously it takes long enough to index 4000 documents and there's no reason to make it take longer.  Thanks for your help!
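For reference, autocommit in that solrconfig.xml is controlled by the <autoCommit> block inside <updateHandler>, which ships commented out in the 1.4.1 example config, so nothing autocommits until it is enabled (as Hoss confirms later in this thread).  An illustration with placeholder values, not a recommendation:

    <updateHandler class="solr.DirectUpdateHandler2">
      <!-- Placeholder values: commit after 10000 buffered documents or
           after 60 seconds, whichever comes first. -->
      <autoCommit>
        <maxDocs>10000</maxDocs>
        <maxTime>60000</maxTime>
      </autoCommit>
    </updateHandler>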

~Brandon Waterloo

Re: Problems indexing very large set of documents

Posted by Anuj Kumar <an...@gmail.com>.
Hi Brandon,

Sorry, I can't make out much here. The exception is a Tika error that
points to a PDF parsing issue; that's all I can tell.
Maybe someone else on this mailing list can help.

Sorry.

- Anuj

On Tue, Apr 5, 2011 at 6:35 PM, Brandon Waterloo <
Brandon.Waterloo@matrix.msu.edu> wrote:

> It wasn't just a single file, it was dozens of files all having problems
> toward the end just before I killed the process.
>
> IPADDR -  -  [04/04/2011:17:17:03 +0000] "POST /solr/update/extract?literal.id=32-130-AFB-84&commit=false HTTP/1.1" 500 4558
> IPADDR -  -  [04/04/2011:17:17:05 +0000] "POST /solr/update/extract?literal.id=32-130-AFC-84&commit=false HTTP/1.1" 500 4558
> IPADDR -  -  [04/04/2011:17:17:09 +0000] "POST /solr/update/extract?literal.id=32-130-AFD-84&commit=false HTTP/1.1" 500 4557
> IPADDR -  -  [04/04/2011:17:17:14 +0000] "POST /solr/update/extract?literal.id=32-130-AFE-84&commit=false HTTP/1.1" 500 4558
> IPADDR -  -  [04/04/2011:17:17:21 +0000] "POST /solr/update/extract?literal.id=32-130-AFF-84&commit=false HTTP/1.1" 500 4558
> IPADDR -  -  [04/04/2011:17:17:21 +0000] "POST /solr/update/extract?literal.id=32-130-B00-84&commit=false HTTP/1.1" 500 4557
>
> That is by no means all the errors, that is just a sample of a few.  You
> can see they all threw HTTP 500 errors.  What is strange is, nearly every
> file succeeded before about the 2200-files-mark, and nearly every file after
> that failed.
>
>
> ~Brandon Waterloo
>
> ________________________________
> From: Anuj Kumar [anujsays@gmail.com]
> Sent: Monday, April 04, 2011 2:48 PM
> To: solr-user@lucene.apache.org
> Cc: Brandon Waterloo
> Subject: Re: Problems indexing very large set of documents
>
> In the log messages, are you able to locate the file at which it fails?
> From the details, it looks like Tika is unable to parse one of your PDF
> files. We need to hunt that one out.
>
> Regards,
> Anuj
>
> On Mon, Apr 4, 2011 at 11:57 PM, Brandon Waterloo <
> Brandon.Waterloo@matrix.msu.edu> wrote:
> Looks like I'm using Tika 0.4:
> apache-solr-1.4.1/contrib/extraction/lib/tika-core-0.4.jar
> .../tika-parsers-0.4.jar
>
> ~Brandon Waterloo
>
> ________________________________________
> From: Anuj Kumar [anujsays@gmail.com]
> Sent: Monday, April 04, 2011 2:12 PM
> To: solr-user@lucene.apache.org
> Cc: Brandon Waterloo
> Subject: Re: Problems indexing very large set of documents
>
> This is related to Apache TIKA. Which version are you using?
> Please see this thread for more details-
> http://lucene.472066.n3.nabble.com/PDF-parser-exception-td644885.html
>
> Hope it helps.
>
> Regards,
> Anuj
>
> On Mon, Apr 4, 2011 at 11:30 PM, Brandon Waterloo <
> Brandon.Waterloo@matrix.msu.edu> wrote:
>
> >  Hey everybody,
> >
> > I've been running into some issues indexing a very large set of
> documents.
> >  There's about 4000 PDF files, ranging in size from 160MB to 10KB.
> >  Obviously this is a big task for Solr.  I have a PHP script that
> iterates
> > over the directory and uses PHP cURL to query Solr to index the files.
>  For
> > now, commit is set to false to speed up the indexing, and I'm assuming
> that
> > Solr should be auto-committing as necessary.  I'm using the default
> > solrconfig.xml file included in apache-solr-1.4.1\example\solr\conf.
>  Once
> > all the documents have been finished the PHP script queries Solr to
> commit.
> >
> > The main problem is that after a few thousand documents (around 2000 last
> > time I tried), nearly every document begins causing Java exceptions in
> Solr:
> >
> > Apr 4, 2011 1:18:01 PM org.apache.solr.common.SolrException log
> > SEVERE: org.apache.solr.common.SolrException:
> > org.apache.tika.exception.TikaException: TIKA-198: Illegal IOException
> from
> > org.apache.tika.parser.pdf.PDFParser@11d329d
> >        at
> >
> org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:211)
> >        at
> >
> org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:54)
> >        at
> >
> org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
> >        at
> >
> org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.handleRequest(RequestHandlers.java:233)
> >        at org.apache.solr.core.SolrCore.execute(SolrCore.java:1316)
> >        at
> >
> org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:338)
> >        at
> >
> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:241)
> >        at
> >
> org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1089)
> >        at
> > org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:365)
> >        at
> >
> org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
> >        at
> > org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:181)
> >        at
> > org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:712)
> >        at
> > org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:405)
> >        at
> >
> org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:211)
> >        at
> >
> org.mortbay.jetty.handler.HandlerCollection.handle(HandlerCollection.java:114)
> >        at
> > org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:139)
> >        at org.mortbay.jetty.Server.handle(Server.java:285)
> >        at
> > org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:502)
> >        at
> >
> org.mortbay.jetty.HttpConnection$RequestHandler.content(HttpConnection.java:835)
> >        at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:641)
> >        at
> org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:202)
> >        at
> org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:378)
> >        at
> >
> org.mortbay.jetty.bio.SocketConnector$Connection.run(SocketConnector.java:226)
> >        at
> >
> org.mortbay.thread.BoundedThreadPool$PoolThread.run(BoundedThreadPool.java:442)
> > Caused by: org.apache.tika.exception.TikaException: TIKA-198: Illegal
> > IOException from org.apache.tika.parser.pdf.PDFParser@11d329d
> >        at
> > org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:125)
> >        at
> > org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:105)
> >        at
> >
> org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:190)
> >        ... 23 more
> > Caused by: java.io.IOException: expected='endobj' firstReadAttempt=''
> > secondReadAttempt='' org.pdfbox.io.PushBackInputStream@b19bfc
> >        at org.pdfbox.pdfparser.PDFParser.parseObject(PDFParser.java:502)
> >        at org.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:176)
> >        at org.pdfbox.pdmodel.PDDocument.load(PDDocument.java:707)
> >        at org.pdfbox.pdmodel.PDDocument.load(PDDocument.java:691)
> >        at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:40)
> >        at
> > org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:119)
> >        ... 25 more
> >
> > As far as I know there's nothing special about these documents so I'm
> > wondering if it's not properly autocommitting.  What would be appropriate
> > settings in solrconfig.xml for this particular application?  I'd like it
> to
> > autocommit as soon as it needs to but no more often than that for the
> sake
> > of efficiency.  Obviously it takes long enough to index 4000 documents
> and
> > there's no reason to make it take longer.  Thanks for your help!
> >
> > ~Brandon Waterloo
> >
>
>

Re: Problems indexing very large set of documents

Posted by Ezequiel Calderara <ez...@gmail.com>.
Ohh sorry... didn't realize that they already sent you that link :P

On Fri, Apr 8, 2011 at 12:35 PM, Ezequiel Calderara <ez...@gmail.com> wrote:

> Maybe those files are created with a different Adobe Format version...
>
> See this:
> http://lucene.472066.n3.nabble.com/PDF-parser-exception-td644885.html
>
> On Fri, Apr 8, 2011 at 12:14 PM, Brandon Waterloo <
> Brandon.Waterloo@matrix.msu.edu> wrote:
>
>> A second test has revealed that it is something to do with the contents,
>> and not the literal filenames, of the second set of files.  I renamed one of
>> the second-format files and tested it and Solr still failed.  However, the
>> problem still only applies to those files of the second naming format.
>> ________________________________________
>> From: Brandon Waterloo [Brandon.Waterloo@matrix.msu.edu]
>> Sent: Friday, April 08, 2011 10:40 AM
>> To: solr-user@lucene.apache.org
>> Subject: RE: Problems indexing very large set of documents
>>
>> I had some time to do some research into the problems.  From what I can
>> tell, it appears Solr is tripping up over the filename.  These are strictly
>> examples, but, Solr handles this filename fine:
>>
>> 32-130-A0-84-african_activist_archive-a0a6s3-b_12419.pdf
>>
>> However, it fails with either a parsing error or an EOF exception on this
>> filename:
>>
>> 32-130-A08-84-al.sff.document.nusa197102.pdf
>>
>> The only significant difference is that the second filename contains
>> multiple periods.  As there are about 1700 files whose filenames are similar
>> to the second format it is simply not possible to change their filenames.
>>  In addition they are being used by other applications.
>>
>> Is there something I can change in Solr configs to fix this issue or am I
>> simply SOL until the Solr dev team can work on this? (assuming I put in a
>> ticket)
>>
>> Thanks again everyone,
>>
>> ~Brandon Waterloo
>>
>>
>> ________________________________________
>> From: Chris Hostetter [hossman_lucene@fucit.org]
>> Sent: Tuesday, April 05, 2011 3:03 PM
>> To: solr-user@lucene.apache.org
>> Subject: RE: Problems indexing very large set of documents
>>
>> : It wasn't just a single file, it was dozens of files all having problems
>> : toward the end just before I killed the process.
>>        ...
>> : That is by no means all the errors, that is just a sample of a few.
>> : You can see they all threw HTTP 500 errors.  What is strange is, nearly
>> : every file succeeded before about the 2200-files-mark, and nearly every
>> : file after that failed.
>>
>> ...the root question is: do those files *only* fail if you have already
>> indexed ~2200 files, or do they fail if you start up your server and index
>> them first?
>>
>> there may be a resource issue (if it only happens after indexing 2200) or
>> it may just be a problem with a large number of your PDFs that your
>> iteration code just happens to get to at that point.
>>
>> If it's the former, then there may be something buggy about how Solr is
>> using Tika to cause the problem -- if it's the latter, then it's a straight
>> Tika parsing issue.
>>
>> : > now, commit is set to false to speed up the indexing, and I'm assuming that
>> : > Solr should be auto-committing as necessary.  I'm using the default
>> : > solrconfig.xml file included in apache-solr-1.4.1\example\solr\conf.  Once
>>
>> Solr does no autocommitting by default; you need to check your
>> solrconfig.xml
>>
>>
>> -Hoss
>>
>
>
>
> --
> ______
> Ezequiel.
>
> Http://www.ironicnet.com
>



-- 
______
Ezequiel.

Http://www.ironicnet.com

RE: Problems indexing very large set of documents

Posted by Brandon Waterloo <Br...@matrix.msu.edu>.
I found a simpler command-line method to update the PDF files.  On some documents it works perfectly: the result is a pixel-for-pixel match and none of the OCR text is lost (these PDFs are all newspaper articles that have been passed through OCR).  However, on other documents the result is considerably blurrier and some of the OCR text is lost.
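The thread does not name the tool, but as one hypothetical example, Ghostscript can do this kind of batch rewrite; a sketch driven from PHP to match the rest of the thread's tooling:

    <?php
    // Hypothetical batch rewrite with Ghostscript (an assumption -- the
    // actual command-line tool used here is not named in the thread).
    foreach (glob('/path/to/pdfs/*.pdf') as $src) {
        $dst = preg_replace('/\.pdf$/', '.v16.pdf', $src);
        shell_exec('gs -o ' . escapeshellarg($dst)
                 . ' -sDEVICE=pdfwrite -dCompatibilityLevel=1.6 '
                 . escapeshellarg($src));
    }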

We've decided to skip any documents that Tika cannot index for now.

As Lance stated, it's not specifically the version that causes the problem, but rather quirks introduced by different PDF writers; a few tests have confirmed this, so we can't use the version to decide which files to skip.  I'm examining the XML responses from the queries, and I cannot figure out how to tell from the XML response whether or not a document was successfully indexed.  The status value seems to be 0 regardless of whether indexing was successful or not.

So my question is, how can I tell from the response whether or not indexing was actually successful?
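One answer suggested by the access-log excerpts earlier in the thread: failed extractions came back as HTTP 500, so the HTTP status code, rather than the status value in the XML body, is worth checking.  A minimal sketch, reusing the assumed URL and upload field from above:

    <?php
    // Sketch: treat a non-200 HTTP status from /update/extract as a failure.
    $ch = curl_init($url);   // extract-handler URL as in the indexing loop
    curl_setopt($ch, CURLOPT_POST, true);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_POSTFIELDS, array('myfile' => '@' . $path));
    curl_exec($ch);
    $status = curl_getinfo($ch, CURLINFO_HTTP_CODE);
    curl_close($ch);
    if ($status != 200) {
        error_log("indexing failed (HTTP $status): $path");
    }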

~Brandon Waterloo

________________________________________
From: Lance Norskog [goksron@gmail.com]
Sent: Sunday, April 10, 2011 5:22 PM
To: solr-user@lucene.apache.org
Subject: Re: Problems indexing very large set of documents

There is a library called iText. It parses and writes PDFs very very
well, and a simple program will let you do a batch conversion.  PDFs
are made by a wide range of programs, not just Adobe code. Many of
these do weird things and make small mistakes that Tika does not know
to handle. In other words there is "dirty PDF" just like "dirty HTML".

A percentage of PDFs will fail and that's life. One site that gets
press releases from zillions of sites (and thus a wide range of PDF
generators) has a 15% failure rate with Tika.

Lance

On Fri, Apr 8, 2011 at 9:44 AM, Brandon Waterloo
<Br...@matrix.msu.edu> wrote:
> I think I've finally found the problem.  The files that work are PDF version 1.6.  The files that do NOT work are PDF version 1.4.  I'll look into updating all the old documents to PDF 1.6.
>
> Thanks everyone!
>
> ~Brandon Waterloo
> ________________________________
> From: Ezequiel Calderara [ezechico@gmail.com]
> Sent: Friday, April 08, 2011 11:35 AM
> To: solr-user@lucene.apache.org
> Cc: Brandon Waterloo
> Subject: Re: Problems indexing very large set of documents
>
> Maybe those files are created with a different Adobe Format version...
>
> See this: http://lucene.472066.n3.nabble.com/PDF-parser-exception-td644885.html
>
> On Fri, Apr 8, 2011 at 12:14 PM, Brandon Waterloo <Br...@matrix.msu.edu> wrote:
> A second test has revealed that it is something to do with the contents, and not the literal filenames, of the second set of files.  I renamed one of the second-format files and tested it and Solr still failed.  However, the problem still only applies to those files of the second naming format.
> ________________________________________
> From: Brandon Waterloo [Brandon.Waterloo@matrix.msu.edu]
> Sent: Friday, April 08, 2011 10:40 AM
> To: solr-user@lucene.apache.org
> Subject: RE: Problems indexing very large set of documents
>
> I had some time to do some research into the problems.  From what I can tell, it appears Solr is tripping up over the filename.  These are strictly examples, but, Solr handles this filename fine:
>
> 32-130-A0-84-african_activist_archive-a0a6s3-b_12419.pdf
>
> However, it fails with either a parsing error or an EOF exception on this filename:
>
> 32-130-A08-84-al.sff.document.nusa197102.pdf
>
> The only significant difference is that the second filename contains multiple periods.  As there are about 1700 files whose filenames are similar to the second format it is simply not possible to change their filenames.  In addition they are being used by other applications.
>
> Is there something I can change in Solr configs to fix this issue or am I simply SOL until the Solr dev team can work on this? (assuming I put in a ticket)
>
> Thanks again everyone,
>
> ~Brandon Waterloo
>
>
> ________________________________________
> From: Chris Hostetter [hossman_lucene@fucit.org]
> Sent: Tuesday, April 05, 2011 3:03 PM
> To: solr-user@lucene.apache.org
> Subject: RE: Problems indexing very large set of documents
>
> : It wasn't just a single file, it was dozens of files all having problems
> : toward the end just before I killed the process.
>       ...
> : That is by no means all the errors, that is just a sample of a few.
> : You can see they all threw HTTP 500 errors.  What is strange is, nearly
> : every file succeeded before about the 2200-files-mark, and nearly every
> : file after that failed.
>
> ...the root question is: do those files *only* fail if you have already
> indexed ~2200 files, or do they fail if you start up your server and index
> them first?
>
> there may be a resource issue (if it only happens after indexing 2200) or
> it may just be a problem with a large number of your PDFs that your
> iteration code just happens to get to at that point.
>
> If it's the former, then there may be something buggy about how Solr is
> using Tika to cause the problem -- if it's the latter, then it's a straight
> Tika parsing issue.
>
> : > now, commit is set to false to speed up the indexing, and I'm assuming that
> : > Solr should be auto-committing as necessary.  I'm using the default
> : > solrconfig.xml file included in apache-solr-1.4.1\example\solr\conf.  Once
>
> Solr does no autocommitting by default; you need to check your
> solrconfig.xml
>
>
> -Hoss
>
>
>
> --
> ______
> Ezequiel.
>
> Http://www.ironicnet.com
>



--
Lance Norskog
goksron@gmail.com

Re: Problems indexing very large set of documents

Posted by Lance Norskog <go...@gmail.com>.
There is a library called iText. It parses and writes PDFs very very
well, and a simple program will let you do a batch conversion.  PDFs
are made by a wide range of programs, not just Adobe code. Many of
these do weird things and make small mistakes that Tika does not know
to handle. In other words there is "dirty PDF" just like "dirty HTML".

A percentage of PDFs will fail and that's life. One site that gets
press releases from zillions of sites (and thus a wide range of PDF
generators) has a 15% failure rate with Tika.

Lance

On Fri, Apr 8, 2011 at 9:44 AM, Brandon Waterloo
<Br...@matrix.msu.edu> wrote:
> I think I've finally found the problem.  The files that work are PDF version 1.6.  The files that do NOT work are PDF version 1.4.  I'll look into updating all the old documents to PDF 1.6.
>
> Thanks everyone!
>
> ~Brandon Waterloo
> ________________________________
> From: Ezequiel Calderara [ezechico@gmail.com]
> Sent: Friday, April 08, 2011 11:35 AM
> To: solr-user@lucene.apache.org
> Cc: Brandon Waterloo
> Subject: Re: Problems indexing very large set of documents
>
> Maybe those files are created with a different Adobe Format version...
>
> See this: http://lucene.472066.n3.nabble.com/PDF-parser-exception-td644885.html
>
> On Fri, Apr 8, 2011 at 12:14 PM, Brandon Waterloo <Br...@matrix.msu.edu> wrote:
> A second test has revealed that it is something to do with the contents, and not the literal filenames, of the second set of files.  I renamed one of the second-format files and tested it and Solr still failed.  However, the problem still only applies to those files of the second naming format.
> ________________________________________
> From: Brandon Waterloo [Brandon.Waterloo@matrix.msu.edu]
> Sent: Friday, April 08, 2011 10:40 AM
> To: solr-user@lucene.apache.org
> Subject: RE: Problems indexing very large set of documents
>
> I had some time to do some research into the problems.  From what I can tell, it appears Solr is tripping up over the filename.  These are strictly examples, but, Solr handles this filename fine:
>
> 32-130-A0-84-african_activist_archive-a0a6s3-b_12419.pdf
>
> However, it fails with either a parsing error or an EOF exception on this filename:
>
> 32-130-A08-84-al.sff.document.nusa197102.pdf
>
> The only significant difference is that the second filename contains multiple periods.  As there are about 1700 files whose filenames are similar to the second format it is simply not possible to change their filenames.  In addition they are being used by other applications.
>
> Is there something I can change in Solr configs to fix this issue or am I simply SOL until the Solr dev team can work on this? (assuming I put in a ticket)
>
> Thanks again everyone,
>
> ~Brandon Waterloo
>
>
> ________________________________________
> From: Chris Hostetter [hossman_lucene@fucit.org]
> Sent: Tuesday, April 05, 2011 3:03 PM
> To: solr-user@lucene.apache.org
> Subject: RE: Problems indexing very large set of documents
>
> : It wasn't just a single file, it was dozens of files all having problems
> : toward the end just before I killed the process.
>       ...
> : That is by no means all the errors, that is just a sample of a few.
> : You can see they all threw HTTP 500 errors.  What is strange is, nearly
> : every file succeeded before about the 2200-files-mark, and nearly every
> : file after that failed.
>
> ...the root question is: do those files *only* fail if you have already
> indexed ~2200 files, or do they fail if you start up your server and index
> them first?
>
> there may be a resource issue (if it only happens after indexing 2200) or
> it may just be a problem with a large number of your PDFs that your
> iteration code just happens to get to at that point.
>
> If it's the former, then there may be something buggy about how Solr is
> using Tika to cause the problem -- if it's the latter, then it's a straight
> Tika parsing issue.
>
> : > now, commit is set to false to speed up the indexing, and I'm assuming that
> : > Solr should be auto-committing as necessary.  I'm using the default
> : > solrconfig.xml file included in apache-solr-1.4.1\example\solr\conf.  Once
>
> Solr does no autocommitting by default; you need to check your
> solrconfig.xml
>
>
> -Hoss
>
>
>
> --
> ______
> Ezequiel.
>
> Http://www.ironicnet.com
>



-- 
Lance Norskog
goksron@gmail.com

RE: Problems indexing very large set of documents

Posted by Brandon Waterloo <Br...@matrix.msu.edu>.
I think I've finally found the problem.  The files that work are PDF version 1.6.  The files that do NOT work are PDF version 1.4.  I'll look into updating all the old documents to PDF 1.6.
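A PDF's declared version sits in the first bytes of the file (e.g. "%PDF-1.4"), so sorting the collection by version is straightforward; a minimal PHP sketch:

    <?php
    // Sketch: read the "%PDF-1.x" header to report a file's declared version.
    function pdf_version($path) {
        $header = file_get_contents($path, false, null, 0, 8);
        return preg_match('/^%PDF-(\d\.\d)/', $header, $m) ? $m[1] : null;
    }
    echo pdf_version('32-130-A08-84-al.sff.document.nusa197102.pdf');  // e.g. 1.4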

Thanks everyone!

~Brandon Waterloo
________________________________
From: Ezequiel Calderara [ezechico@gmail.com]
Sent: Friday, April 08, 2011 11:35 AM
To: solr-user@lucene.apache.org
Cc: Brandon Waterloo
Subject: Re: Problems indexing very large set of documents

Maybe those files are created with a different Adobe Format version...

See this: http://lucene.472066.n3.nabble.com/PDF-parser-exception-td644885.html

On Fri, Apr 8, 2011 at 12:14 PM, Brandon Waterloo <Br...@matrix.msu.edu> wrote:
A second test has revealed that it is something to do with the contents, and not the literal filenames, of the second set of files.  I renamed one of the second-format files and tested it and Solr still failed.  However, the problem still only applies to those files of the second naming format.
________________________________________
From: Brandon Waterloo [Brandon.Waterloo@matrix.msu.edu]
Sent: Friday, April 08, 2011 10:40 AM
To: solr-user@lucene.apache.org
Subject: RE: Problems indexing very large set of documents

I had some time to do some research into the problems.  From what I can tell, it appears Solr is tripping up over the filename.  These are strictly examples, but, Solr handles this filename fine:

32-130-A0-84-african_activist_archive-a0a6s3-b_12419.pdf

However, it fails with either a parsing error or an EOF exception on this filename:

32-130-A08-84-al.sff.document.nusa197102.pdf

The only significant difference is that the second filename contains multiple periods.  As there are about 1700 files whose filenames are similar to the second format it is simply not possible to change their filenames.  In addition they are being used by other applications.

Is there something I can change in Solr configs to fix this issue or am I simply SOL until the Solr dev team can work on this? (assuming I put in a ticket)

Thanks again everyone,

~Brandon Waterloo


________________________________________
From: Chris Hostetter [hossman_lucene@fucit.org]
Sent: Tuesday, April 05, 2011 3:03 PM
To: solr-user@lucene.apache.org
Subject: RE: Problems indexing very large set of documents

: It wasn't just a single file, it was dozens of files all having problems
: toward the end just before I killed the process.
       ...
: That is by no means all the errors, that is just a sample of a few.
: You can see they all threw HTTP 500 errors.  What is strange is, nearly
: every file succeeded before about the 2200-files-mark, and nearly every
: file after that failed.

...the root question is: do those files *only* fail if you have already
indexed ~2200 files, or do they fail if you start up your server and index
them first?

there may be a resource issue (if it only happens after indexing 2200) or
it may just be a problem with a large number of your PDFs that your
iteration code just happens to get to at that point.

If it's the former, then there may be something buggy about how Solr is
using Tika to cause the problem -- if it's the latter, then it's a straight
Tika parsing issue.

: > now, commit is set to false to speed up the indexing, and I'm assuming that
: > Solr should be auto-committing as necessary.  I'm using the default
: > solrconfig.xml file included in apache-solr-1.4.1\example\solr\conf.  Once

Solr does no autocommitting by default; you need to check your
solrconfig.xml


-Hoss



--
______
Ezequiel.

Http://www.ironicnet.com

Re: Problems indexing very large set of documents

Posted by Ezequiel Calderara <ez...@gmail.com>.
Maybe those files are created with a different Adobe Format version...

See this:
http://lucene.472066.n3.nabble.com/PDF-parser-exception-td644885.html

On Fri, Apr 8, 2011 at 12:14 PM, Brandon Waterloo <
Brandon.Waterloo@matrix.msu.edu> wrote:

> A second test has revealed that it is something to do with the contents,
> and not the literal filenames, of the second set of files.  I renamed one of
> the second-format files and tested it and Solr still failed.  However, the
> problem still only applies to those files of the second naming format.
> ________________________________________
> From: Brandon Waterloo [Brandon.Waterloo@matrix.msu.edu]
> Sent: Friday, April 08, 2011 10:40 AM
> To: solr-user@lucene.apache.org
> Subject: RE: Problems indexing very large set of documents
>
> I had some time to do some research into the problems.  From what I can
> tell, it appears Solr is tripping up over the filename.  These are strictly
> examples, but, Solr handles this filename fine:
>
> 32-130-A0-84-african_activist_archive-a0a6s3-b_12419.pdf
>
> However, it fails with either a parsing error or an EOF exception on this
> filename:
>
> 32-130-A08-84-al.sff.document.nusa197102.pdf
>
> The only significant difference is that the second filename contains
> multiple periods.  As there are about 1700 files whose filenames are similar
> to the second format it is simply not possible to change their filenames.
>  In addition they are being used by other applications.
>
> Is there something I can change in Solr configs to fix this issue or am I
> simply SOL until the Solr dev team can work on this? (assuming I put in a
> ticket)
>
> Thanks again everyone,
>
> ~Brandon Waterloo
>
>
> ________________________________________
> From: Chris Hostetter [hossman_lucene@fucit.org]
> Sent: Tuesday, April 05, 2011 3:03 PM
> To: solr-user@lucene.apache.org
> Subject: RE: Problems indexing very large set of documents
>
> : It wasn't just a single file, it was dozens of files all having problems
> : toward the end just before I killed the process.
>        ...
> : That is by no means all the errors, that is just a sample of a few.
> : You can see they all threw HTTP 500 errors.  What is strange is, nearly
> : every file succeeded before about the 2200-files-mark, and nearly every
> : file after that failed.
>
> ...the root question is: do those files *only* fail if you have already
> indexed ~2200 files, or do they fail if you start up your server and index
> them first?
>
> there may be a resource issue (if it only happens after indexing 2200) or
> it may just be a problem with a large number of your PDFs that your
> iteration code just happens to get to at that point.
>
> If it's the former, then there may be something buggy about how Solr is
> using Tika to cause the problem -- if it's the latter, then it's a straight
> Tika parsing issue.
>
> : > now, commit is set to false to speed up the indexing, and I'm assuming that
> : > Solr should be auto-committing as necessary.  I'm using the default
> : > solrconfig.xml file included in apache-solr-1.4.1\example\solr\conf.  Once
>
> Solr does no autocommitting by default; you need to check your
> solrconfig.xml
>
>
> -Hoss
>



-- 
______
Ezequiel.

Http://www.ironicnet.com

RE: Problems indexing very large set of documents

Posted by Brandon Waterloo <Br...@matrix.msu.edu>.
A second test has revealed that it is something to do with the contents, and not the literal filenames, of the second set of files.  I renamed one of the second-format files and tested it and Solr still failed.  However, the problem still only applies to those files of the second naming format.
________________________________________
From: Brandon Waterloo [Brandon.Waterloo@matrix.msu.edu]
Sent: Friday, April 08, 2011 10:40 AM
To: solr-user@lucene.apache.org
Subject: RE: Problems indexing very large set of documents

I had some time to do some research into the problems.  From what I can tell, it appears Solr is tripping up over the filename.  These are strictly examples, but, Solr handles this filename fine:

32-130-A0-84-african_activist_archive-a0a6s3-b_12419.pdf

However, it fails with either a parsing error or an EOF exception on this filename:

32-130-A08-84-al.sff.document.nusa197102.pdf

The only significant difference is that the second filename contains multiple periods.  As there are about 1700 files whose filenames are similar to the second format it is simply not possible to change their filenames.  In addition they are being used by other applications.

Is there something I can change in Solr configs to fix this issue or am I simply SOL until the Solr dev team can work on this? (assuming I put in a ticket)

Thanks again everyone,

~Brandon Waterloo


________________________________________
From: Chris Hostetter [hossman_lucene@fucit.org]
Sent: Tuesday, April 05, 2011 3:03 PM
To: solr-user@lucene.apache.org
Subject: RE: Problems indexing very large set of documents

: It wasn't just a single file, it was dozens of files all having problems
: toward the end just before I killed the process.
        ...
: That is by no means all the errors, that is just a sample of a few.
: You can see they all threw HTTP 500 errors.  What is strange is, nearly
: every file succeeded before about the 2200-files-mark, and nearly every
: file after that failed.

...the root question is: do those files *only* fail if you have already
indexed ~2200 files, or do they fail if you start up your server and index
them first?

there may be a resource issue (if it only happens after indexing 2200) or
it may just be a problem with a large number of your PDFs that your
iteration code just happens to get to at that point.

If it's the former, then there may be something buggy about how Solr is
using Tika to cause the problem -- if it's the latter, then it's a straight
Tika parsing issue.

: > now, commit is set to false to speed up the indexing, and I'm assuming that
: > Solr should be auto-committing as necessary.  I'm using the default
: > solrconfig.xml file included in apache-solr-1.4.1\example\solr\conf.  Once

Solr does no autocommitting by default; you need to check your
solrconfig.xml


-Hoss

RE: Problems indexing very large set of documents

Posted by Brandon Waterloo <Br...@matrix.msu.edu>.
I had some time to do some research into the problems.  From what I can tell, it appears Solr is tripping up over the filename.  These are strictly examples, but Solr handles this filename fine:

32-130-A0-84-african_activist_archive-a0a6s3-b_12419.pdf

However, it fails with either a parsing error or an EOF exception on this filename:

32-130-A08-84-al.sff.document.nusa197102.pdf

The only significant difference is that the second filename contains multiple periods.  As there are about 1700 files whose filenames are similar to the second format, it is simply not possible to change their filenames.  In addition, they are being used by other applications.

Is there something I can change in Solr configs to fix this issue or am I simply SOL until the Solr dev team can work on this? (assuming I put in a ticket)

Thanks again everyone,

~Brandon Waterloo


________________________________________
From: Chris Hostetter [hossman_lucene@fucit.org]
Sent: Tuesday, April 05, 2011 3:03 PM
To: solr-user@lucene.apache.org
Subject: RE: Problems indexing very large set of documents

: It wasn't just a single file, it was dozens of files all having problems
: toward the end just before I killed the process.
        ...
: That is by no means all the errors, that is just a sample of a few.
: You can see they all threw HTTP 500 errors.  What is strange is, nearly
: every file succeeded before about the 2200-files-mark, and nearly every
: file after that failed.

...the root question is: do those files *only* fail if you have already
indexed ~2200 files, or do they fail if you start up your server and index
them first?

there may be a resource issue (if it only happens after indexing 2200) or
it may just be a problem with a large number of your PDFs that your
iteration code just happens to get to at that point.

If it's the former, then there may be something buggy about how Solr is
using Tika to cause the problem -- if it's the latter, then it's a straight
Tika parsing issue.

: > now, commit is set to false to speed up the indexing, and I'm assuming that
: > Solr should be auto-committing as necessary.  I'm using the default
: > solrconfig.xml file included in apache-solr-1.4.1\example\solr\conf.  Once

Solr does no autocommitting by default; you need to check your
solrconfig.xml


-Hoss

RE: Problems indexing very large set of documents

Posted by Chris Hostetter <ho...@fucit.org>.
: It wasn't just a single file, it was dozens of files all having problems 
: toward the end just before I killed the process.
	...
: That is by no means all the errors, that is just a sample of a few.  
: You can see they all threw HTTP 500 errors.  What is strange is, nearly 
: every file succeeded before about the 2200-files-mark, and nearly every 
: file after that failed.

...the root question is: do those files *only* fail if you have already
indexed ~2200 files, or do they fail if you start up your server and index
them first?

there may be a resource issue (if it only happens after indexing 2200) or
it may just be a problem with a large number of your PDFs that your
iteration code just happens to get to at that point.

If it's the former, then there may be something buggy about how Solr is
using Tika to cause the problem -- if it's the latter, then it's a straight
Tika parsing issue.

: > now, commit is set to false to speed up the indexing, and I'm assuming that
: > Solr should be auto-committing as necessary.  I'm using the default
: > solrconfig.xml file included in apache-solr-1.4.1\example\solr\conf.  Once

Solr does no autocommitting by default; you need to check your
solrconfig.xml


-Hoss
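One way to run that isolation test is to restart Solr with a clean index and post a single previously-failing file first.  A minimal sketch using the same PHP cURL approach as the original script (URL and field name are the same assumptions as before):

    <?php
    // Sketch: index one previously-failing PDF against a freshly started Solr
    // to separate a resource/state problem from a per-file Tika problem.
    $path = '32-130-A08-84-al.sff.document.nusa197102.pdf';  // known-failing file
    $url  = 'http://localhost:8983/solr/update/extract?literal.id=test-single&commit=true';
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_POST, true);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_POSTFIELDS, array('myfile' => '@' . $path));
    curl_exec($ch);
    $status = curl_getinfo($ch, CURLINFO_HTTP_CODE);
    curl_close($ch);
    echo $status == 200 ? "indexed OK on a fresh index\n"
                        : "failed (HTTP $status) even on a fresh index\n";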

RE: Problems indexing very large set of documents

Posted by Brandon Waterloo <Br...@matrix.msu.edu>.
It wasn't just a single file; it was dozens of files all having problems toward the end just before I killed the process.

IPADDR -  -  [04/04/2011:17:17:03 +0000] "POST /solr/update/extract?literal.id=32-130-AFB-84&commit=false HTTP/1.1" 500 4558
IPADDR -  -  [04/04/2011:17:17:05 +0000] "POST /solr/update/extract?literal.id=32-130-AFC-84&commit=false HTTP/1.1" 500 4558
IPADDR -  -  [04/04/2011:17:17:09 +0000] "POST /solr/update/extract?literal.id=32-130-AFD-84&commit=false HTTP/1.1" 500 4557
IPADDR -  -  [04/04/2011:17:17:14 +0000] "POST /solr/update/extract?literal.id=32-130-AFE-84&commit=false HTTP/1.1" 500 4558
IPADDR -  -  [04/04/2011:17:17:21 +0000] "POST /solr/update/extract?literal.id=32-130-AFF-84&commit=false HTTP/1.1" 500 4558
IPADDR -  -  [04/04/2011:17:17:21 +0000] "POST /solr/update/extract?literal.id=32-130-B00-84&commit=false HTTP/1.1" 500 4557

That is by no means all the errors; that is just a sample of a few.  You can see they all threw HTTP 500 errors.  What is strange is that nearly every file succeeded before about the 2200-file mark, and nearly every file after that failed.


~Brandon Waterloo

________________________________
From: Anuj Kumar [anujsays@gmail.com]
Sent: Monday, April 04, 2011 2:48 PM
To: solr-user@lucene.apache.org
Cc: Brandon Waterloo
Subject: Re: Problems indexing very large set of documents

In the log messages, are you able to locate the file at which it fails? From the details, it looks like Tika is unable to parse one of your PDF files. We need to hunt that one out.

Regards,
Anuj

On Mon, Apr 4, 2011 at 11:57 PM, Brandon Waterloo <Br...@matrix.msu.edu> wrote:
Looks like I'm using Tika 0.4:
apache-solr-1.4.1/contrib/extraction/lib/tika-core-0.4.jar
.../tika-parsers-0.4.jar

~Brandon Waterloo

________________________________________
From: Anuj Kumar [anujsays@gmail.com]
Sent: Monday, April 04, 2011 2:12 PM
To: solr-user@lucene.apache.org
Cc: Brandon Waterloo
Subject: Re: Problems indexing very large set of documents

This is related to Apache TIKA. Which version are you using?
Please see this thread for more details-
http://lucene.472066.n3.nabble.com/PDF-parser-exception-td644885.html

Hope it helps.

Regards,
Anuj

On Mon, Apr 4, 2011 at 11:30 PM, Brandon Waterloo <
Brandon.Waterloo@matrix.msu.edu> wrote:

>  Hey everybody,
>
> I've been running into some issues indexing a very large set of documents.
>  There's about 4000 PDF files, ranging in size from 160MB to 10KB.
>  Obviously this is a big task for Solr.  I have a PHP script that iterates
> over the directory and uses PHP cURL to query Solr to index the files.  For
> now, commit is set to false to speed up the indexing, and I'm assuming that
> Solr should be auto-committing as necessary.  I'm using the default
> solrconfig.xml file included in apache-solr-1.4.1\example\solr\conf.  Once
> all the documents have been finished the PHP script queries Solr to commit.
>
> The main problem is that after a few thousand documents (around 2000 last
> time I tried), nearly every document begins causing Java exceptions in Solr:
>
> Apr 4, 2011 1:18:01 PM org.apache.solr.common.SolrException log
> SEVERE: org.apache.solr.common.SolrException:
> org.apache.tika.exception.TikaException: TIKA-198: Illegal IOException from
> org.apache.tika.parser.pdf.PDFParser@11d329d
>        at
> org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:211)
>        at
> org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:54)
>        at
> org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
>        at
> org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.handleRequest(RequestHandlers.java:233)
>        at org.apache.solr.core.SolrCore.execute(SolrCore.java:1316)
>        at
> org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:338)
>        at
> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:241)
>        at
> org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1089)
>        at
> org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:365)
>        at
> org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
>        at
> org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:181)
>        at
> org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:712)
>        at
> org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:405)
>        at
> org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:211)
>        at
> org.mortbay.jetty.handler.HandlerCollection.handle(HandlerCollection.java:114)
>        at
> org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:139)
>        at org.mortbay.jetty.Server.handle(Server.java:285)
>        at
> org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:502)
>        at
> org.mortbay.jetty.HttpConnection$RequestHandler.content(HttpConnection.java:835)
>        at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:641)
>        at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:202)
>        at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:378)
>        at
> org.mortbay.jetty.bio.SocketConnector$Connection.run(SocketConnector.java:226)
>        at
> org.mortbay.thread.BoundedThreadPool$PoolThread.run(BoundedThreadPool.java:442)
> Caused by: org.apache.tika.exception.TikaException: TIKA-198: Illegal
> IOException from org.apache.tika.parser.pdf.PDFParser@11d329d
>        at
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:125)
>        at
> org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:105)
>        at
> org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:190)
>        ... 23 more
> Caused by: java.io.IOException: expected='endobj' firstReadAttempt=''
> secondReadAttempt='' org.pdfbox.io.PushBackInputStream@b19bfc
>        at org.pdfbox.pdfparser.PDFParser.parseObject(PDFParser.java:502)
>        at org.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:176)
>        at org.pdfbox.pdmodel.PDDocument.load(PDDocument.java:707)
>        at org.pdfbox.pdmodel.PDDocument.load(PDDocument.java:691)
>        at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:40)
>        at
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:119)
>        ... 25 more
>
> As far as I know there's nothing special about these documents so I'm
> wondering if it's not properly autocommitting.  What would be appropriate
> settings in solrconfig.xml for this particular application?  I'd like it to
> autocommit as soon as it needs to but no more often than that for the sake
> of efficiency.  Obviously it takes long enough to index 4000 documents and
> there's no reason to make it take longer.  Thanks for your help!
>
> ~Brandon Waterloo
>


Re: Problems indexing very large set of documents

Posted by Anuj Kumar <an...@gmail.com>.
In the log messages, are you able to locate the file at which it fails? From
the details, it looks like Tika is unable to parse one of your PDF files. We
need to hunt that one out.

Regards,
Anuj

On Mon, Apr 4, 2011 at 11:57 PM, Brandon Waterloo <
Brandon.Waterloo@matrix.msu.edu> wrote:

> Looks like I'm using Tika 0.4:
> apache-solr-1.4.1/contrib/extraction/lib/tika-core-0.4.jar
> .../tika-parsers-0.4.jar
>
> ~Brandon Waterloo
>
> ________________________________________
> From: Anuj Kumar [anujsays@gmail.com]
> Sent: Monday, April 04, 2011 2:12 PM
> To: solr-user@lucene.apache.org
> Cc: Brandon Waterloo
> Subject: Re: Problems indexing very large set of documents
>
> This is related to Apache TIKA. Which version are you using?
> Please see this thread for more details-
> http://lucene.472066.n3.nabble.com/PDF-parser-exception-td644885.html
>
> <http://lucene.472066.n3.nabble.com/PDF-parser-exception-td644885.html
> >Hope
> it helps.
>
> Regards,
> Anuj
>
> On Mon, Apr 4, 2011 at 11:30 PM, Brandon Waterloo <
> Brandon.Waterloo@matrix.msu.edu> wrote:
>
> >  Hey everybody,
> >
> > I've been running into some issues indexing a very large set of
> documents.
> >  There's about 4000 PDF files, ranging in size from 160MB to 10KB.
> >  Obviously this is a big task for Solr.  I have a PHP script that
> iterates
> > over the directory and uses PHP cURL to query Solr to index the files.
>  For
> > now, commit is set to false to speed up the indexing, and I'm assuming
> that
> > Solr should be auto-committing as necessary.  I'm using the default
> > solrconfig.xml file included in apache-solr-1.4.1\example\solr\conf.
>  Once
> > all the documents have been finished the PHP script queries Solr to
> commit.
> >
> > The main problem is that after a few thousand documents (around 2000 last
> > time I tried), nearly every document begins causing Java exceptions in
> Solr:
> >
> > Apr 4, 2011 1:18:01 PM org.apache.solr.common.SolrException log
> > SEVERE: org.apache.solr.common.SolrException:
> > org.apache.tika.exception.TikaException: TIKA-198: Illegal IOException
> from
> > org.apache.tika.parser.pdf.PDFParser@11d329d
> >        at
> >
> org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:211)
> >        at
> >
> org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:54)
> >        at
> >
> org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
> >        at
> >
> org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.handleRequest(RequestHandlers.java:233)
> >        at org.apache.solr.core.SolrCore.execute(SolrCore.java:1316)
> >        at
> >
> org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:338)
> >        at
> >
> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:241)
> >        at
> >
> org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1089)
> >        at
> > org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:365)
> >        at
> >
> org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
> >        at
> > org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:181)
> >        at
> > org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:712)
> >        at
> > org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:405)
> >        at
> >
> org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:211)
> >        at
> >
> org.mortbay.jetty.handler.HandlerCollection.handle(HandlerCollection.java:114)
> >        at
> > org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:139)
> >        at org.mortbay.jetty.Server.handle(Server.java:285)
> >        at
> > org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:502)
> >        at
> >
> org.mortbay.jetty.HttpConnection$RequestHandler.content(HttpConnection.java:835)
> >        at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:641)
> >        at
> org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:202)
> >        at
> org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:378)
> >        at
> >
> org.mortbay.jetty.bio.SocketConnector$Connection.run(SocketConnector.java:226)
> >        at
> >
> org.mortbay.thread.BoundedThreadPool$PoolThread.run(BoundedThreadPool.java:442)
> > Caused by: org.apache.tika.exception.TikaException: TIKA-198: Illegal
> > IOException from org.apache.tika.parser.pdf.PDFParser@11d329d
> >        at
> > org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:125)
> >        at
> > org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:105)
> >        at
> >
> org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:190)
> >        ... 23 more
> > Caused by: java.io.IOException: expected='endobj' firstReadAttempt=''
> > secondReadAttempt='' org.pdfbox.io.PushBackInputStream@b19bfc
> >        at org.pdfbox.pdfparser.PDFParser.parseObject(PDFParser.java:502)
> >        at org.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:176)
> >        at org.pdfbox.pdmodel.PDDocument.load(PDDocument.java:707)
> >        at org.pdfbox.pdmodel.PDDocument.load(PDDocument.java:691)
> >        at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:40)
> >        at
> > org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:119)
> >        ... 25 more
> >
> > As far as I know there's nothing special about these documents so I'm
> > wondering if it's not properly autocommitting.  What would be appropriate
> > settings in solrconfig.xml for this particular application?  I'd like it
> to
> > autocommit as soon as it needs to but no more often than that for the
> sake
> > of efficiency.  Obviously it takes long enough to index 4000 documents
> and
> > there's no reason to make it take longer.  Thanks for your help!
> >
> > ~Brandon Waterloo
> >
>

RE: Problems indexing very large set of documents

Posted by Brandon Waterloo <Br...@matrix.msu.edu>.
Looks like I'm using Tika 0.4:
apache-solr-1.4.1/contrib/extraction/lib/tika-core-0.4.jar
.../tika-parsers-0.4.jar

~Brandon Waterloo

________________________________________
From: Anuj Kumar [anujsays@gmail.com]
Sent: Monday, April 04, 2011 2:12 PM
To: solr-user@lucene.apache.org
Cc: Brandon Waterloo
Subject: Re: Problems indexing very large set of documents

This is related to Apache TIKA. Which version are you using?
Please see this thread for more details-
http://lucene.472066.n3.nabble.com/PDF-parser-exception-td644885.html

Hope it helps.

Regards,
Anuj

