Posted to solr-user@lucene.apache.org by Brandon Waterloo <Br...@matrix.msu.edu> on 2011/04/04 20:00:53 UTC

Problems indexing very large set of documents

 Hey everybody,

I've been running into some issues indexing a very large set of documents.  There are about 4000 PDF files, ranging in size from 10KB to 160MB.  Obviously this is a big task for Solr.  I have a PHP script that iterates over the directory and uses PHP cURL to post each file to Solr for indexing.  For now, commit is set to false to speed up the indexing, and I'm assuming that Solr should be auto-committing as necessary.  I'm using the default solrconfig.xml file included in apache-solr-1.4.1\example\solr\conf.  Once all the documents have been processed, the PHP script sends Solr a commit.
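For reference, the core of such a loop might look like the sketch below.  Only the /solr/update/extract path, the literal.id parameter, and commit=false come from the actual requests logged later in this thread; the host, port, source directory, and the "myfile" upload field name are assumptions.

    <?php
    // Minimal sketch of the indexing loop described above; not the poster's
    // actual script.  Host, port, paths, and field name are assumptions.
    $solr = 'http://localhost:8983/solr';
    foreach (glob('/path/to/pdfs/*.pdf') as $path) {
        $id  = basename($path, '.pdf');
        $url = $solr . '/update/extract?literal.id=' . urlencode($id) . '&commit=false';
        $ch  = curl_init($url);
        curl_setopt($ch, CURLOPT_POST, true);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        // The '@' upload syntax is the PHP 5.x cURL convention contemporary
        // with Solr 1.4.1.
        curl_setopt($ch, CURLOPT_POSTFIELDS, array('myfile' => '@' . $path));
        curl_exec($ch);
        curl_close($ch);
    }
    // One explicit commit once every file has been posted.
    $ch = curl_init($solr . '/update?commit=true');
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_exec($ch);
    curl_close($ch);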

The main problem is that after a few thousand documents (around 2000 last time I tried), nearly every document begins causing Java exceptions in Solr:

Apr 4, 2011 1:18:01 PM org.apache.solr.common.SolrException log
SEVERE: org.apache.solr.common.SolrException: org.apache.tika.exception.TikaException: TIKA-198: Illegal IOException from org.apache.tika.parser.pdf.PDFParser@11d329d
        at org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:211)
        at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:54)
        at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
        at org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.handleRequest(RequestHandlers.java:233)
        at org.apache.solr.core.SolrCore.execute(SolrCore.java:1316)
        at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:338)
        at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:241)
        at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1089)
        at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:365)
        at org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
        at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:181)
        at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:712)
        at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:405)
        at org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:211)
        at org.mortbay.jetty.handler.HandlerCollection.handle(HandlerCollection.java:114)
        at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:139)
        at org.mortbay.jetty.Server.handle(Server.java:285)
        at org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:502)
        at org.mortbay.jetty.HttpConnection$RequestHandler.content(HttpConnection.java:835)
        at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:641)
        at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:202)
        at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:378)
        at org.mortbay.jetty.bio.SocketConnector$Connection.run(SocketConnector.java:226)
        at org.mortbay.thread.BoundedThreadPool$PoolThread.run(BoundedThreadPool.java:442)
Caused by: org.apache.tika.exception.TikaException: TIKA-198: Illegal IOException from org.apache.tika.parser.pdf.PDFParser@11d329d
        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:125)
        at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:105)
        at org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:190)
        ... 23 more
Caused by: java.io.IOException: expected='endobj' firstReadAttempt='' secondReadAttempt='' org.pdfbox.io.PushBackInputStream@b19bfc
        at org.pdfbox.pdfparser.PDFParser.parseObject(PDFParser.java:502)
        at org.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:176)
        at org.pdfbox.pdmodel.PDDocument.load(PDDocument.java:707)
        at org.pdfbox.pdmodel.PDDocument.load(PDDocument.java:691)
        at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:40)
        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:119)
        ... 25 more

As far as I know there's nothing special about these documents so I'm wondering if it's not properly autocommitting.  What would be appropriate settings in solrconfig.xml for this particular application?  I'd like it to autocommit as soon as it needs to but no more often than that for the sake of efficiency.  Obviously it takes long enough to index 4000 documents and there's no reason to make it take longer.  Thanks for your help!
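For reference, autocommit in that solrconfig.xml is controlled by the <autoCommit> block inside <updateHandler>, which ships commented out in the 1.4.1 example config, so nothing autocommits until it is enabled (as Hoss confirms later in this thread).  An illustration with placeholder values, not a recommendation:

    <updateHandler class="solr.DirectUpdateHandler2">
      <!-- Placeholder values: commit after 10000 buffered documents or
           after 60 seconds, whichever comes first. -->
      <autoCommit>
        <maxDocs>10000</maxDocs>
        <maxTime>60000</maxTime>
      </autoCommit>
    </updateHandler>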

~Brandon Waterloo

Re: Problems indexing very large set of documents

Posted by Anuj Kumar <an...@gmail.com>.
Hi Brandon,

Sorry, I can't make out much here. The exception is a Tika error that
points to a PDF parsing issue; that's all I can tell.
Maybe someone else on this mailing list can help.

Sorry.

- Anuj

On Tue, Apr 5, 2011 at 6:35 PM, Brandon Waterloo <
Brandon.Waterloo@matrix.msu.edu> wrote:

> It wasn't just a single file, it was dozens of files all having problems
> toward the end just before I killed the process.
>
> IPADDR -  -  [04/04/2011:17:17:03 +0000] "POST /solr/update/extract?literal.id=32-130-AFB-84&commit=false HTTP/1.1" 500 4558
> IPADDR -  -  [04/04/2011:17:17:05 +0000] "POST /solr/update/extract?literal.id=32-130-AFC-84&commit=false HTTP/1.1" 500 4558
> IPADDR -  -  [04/04/2011:17:17:09 +0000] "POST /solr/update/extract?literal.id=32-130-AFD-84&commit=false HTTP/1.1" 500 4557
> IPADDR -  -  [04/04/2011:17:17:14 +0000] "POST /solr/update/extract?literal.id=32-130-AFE-84&commit=false HTTP/1.1" 500 4558
> IPADDR -  -  [04/04/2011:17:17:21 +0000] "POST /solr/update/extract?literal.id=32-130-AFF-84&commit=false HTTP/1.1" 500 4558
> IPADDR -  -  [04/04/2011:17:17:21 +0000] "POST /solr/update/extract?literal.id=32-130-B00-84&commit=false HTTP/1.1" 500 4557
>
> That is by no means all the errors, that is just a sample of a few.  You
> can see they all threw HTTP 500 errors.  What is strange is, nearly every
> file succeeded before about the 2200-files-mark, and nearly every file after
> that failed.
>
>
> ~Brandon Waterloo
>
> ________________________________
> From: Anuj Kumar [anujsays@gmail.com]
> Sent: Monday, April 04, 2011 2:48 PM
> To: solr-user@lucene.apache.org
> Cc: Brandon Waterloo
> Subject: Re: Problems indexing very large set of documents
>
> In the log messages, are you able to locate the file at which it fails?
> From the details, it looks like Tika is unable to parse one of your PDF
> files. We need to hunt that one out.
>
> Regards,
> Anuj
>
> On Mon, Apr 4, 2011 at 11:57 PM, Brandon Waterloo <
> Brandon.Waterloo@matrix.msu.edu> wrote:
> Looks like I'm using Tika 0.4:
> apache-solr-1.4.1/contrib/extraction/lib/tika-core-0.4.jar
> .../tika-parsers-0.4.jar
>
> ~Brandon Waterloo
>
> ________________________________________
> From: Anuj Kumar [anujsays@gmail.com]
> Sent: Monday, April 04, 2011 2:12 PM
> To: solr-user@lucene.apache.org
> Cc: Brandon Waterloo
> Subject: Re: Problems indexing very large set of documents
>
> This is related to Apache TIKA. Which version are you using?
> Please see this thread for more details-
> http://lucene.472066.n3.nabble.com/PDF-parser-exception-td644885.html
>
> Hope it helps.
>
> Regards,
> Anuj
>
> On Mon, Apr 4, 2011 at 11:30 PM, Brandon Waterloo <
> Brandon.Waterloo@matrix.msu.edu> wrote:
>
> >  Hey everybody,
> >
> > I've been running into some issues indexing a very large set of
> documents.
> >  There's about 4000 PDF files, ranging in size from 160MB to 10KB.
> >  Obviously this is a big task for Solr.  I have a PHP script that
> iterates
> > over the directory and uses PHP cURL to query Solr to index the files.
>  For
> > now, commit is set to false to speed up the indexing, and I'm assuming
> that
> > Solr should be auto-committing as necessary.  I'm using the default
> > solrconfig.xml file included in apache-solr-1.4.1\example\solr\conf.
>  Once
> > all the documents have been finished the PHP script queries Solr to
> commit.
> >
> > The main problem is that after a few thousand documents (around 2000 last
> > time I tried), nearly every document begins causing Java exceptions in
> Solr:
> >
> > Apr 4, 2011 1:18:01 PM org.apache.solr.common.SolrException log
> > SEVERE: org.apache.solr.common.SolrException:
> > org.apache.tika.exception.TikaException: TIKA-198: Illegal IOException
> from
> > org.apache.tika.parser.pdf.PDFParser@11d329d
> >        at
> >
> org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:211)
> >        at
> >
> org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:54)
> >        at
> >
> org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
> >        at
> >
> org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.handleRequest(RequestHandlers.java:233)
> >        at org.apache.solr.core.SolrCore.execute(SolrCore.java:1316)
> >        at
> >
> org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:338)
> >        at
> >
> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:241)
> >        at
> >
> org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1089)
> >        at
> > org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:365)
> >        at
> >
> org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
> >        at
> > org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:181)
> >        at
> > org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:712)
> >        at
> > org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:405)
> >        at
> >
> org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:211)
> >        at
> >
> org.mortbay.jetty.handler.HandlerCollection.handle(HandlerCollection.java:114)
> >        at
> > org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:139)
> >        at org.mortbay.jetty.Server.handle(Server.java:285)
> >        at
> > org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:502)
> >        at
> >
> org.mortbay.jetty.HttpConnection$RequestHandler.content(HttpConnection.java:835)
> >        at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:641)
> >        at
> org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:202)
> >        at
> org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:378)
> >        at
> >
> org.mortbay.jetty.bio.SocketConnector$Connection.run(SocketConnector.java:226)
> >        at
> >
> org.mortbay.thread.BoundedThreadPool$PoolThread.run(BoundedThreadPool.java:442)
> > Caused by: org.apache.tika.exception.TikaException: TIKA-198: Illegal
> > IOException from org.apache.tika.parser.pdf.PDFParser@11d329d
> >        at
> > org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:125)
> >        at
> > org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:105)
> >        at
> >
> org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:190)
> >        ... 23 more
> > Caused by: java.io.IOException: expected='endobj' firstReadAttempt=''
> > secondReadAttempt='' org.pdfbox.io.PushBackInputStream@b19bfc
> >        at org.pdfbox.pdfparser.PDFParser.parseObject(PDFParser.java:502)
> >        at org.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:176)
> >        at org.pdfbox.pdmodel.PDDocument.load(PDDocument.java:707)
> >        at org.pdfbox.pdmodel.PDDocument.load(PDDocument.java:691)
> >        at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:40)
> >        at
> > org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:119)
> >        ... 25 more
> >
> > As far as I know there's nothing special about these documents so I'm
> > wondering if it's not properly autocommitting.  What would be appropriate
> > settings in solrconfig.xml for this particular application?  I'd like it
> to
> > autocommit as soon as it needs to but no more often than that for the
> sake
> > of efficiency.  Obviously it takes long enough to index 4000 documents
> and
> > there's no reason to make it take longer.  Thanks for your help!
> >
> > ~Brandon Waterloo
> >
>
>

Re: Problems indexing very large set of documents

Posted by Ezequiel Calderara <ez...@gmail.com>.
Ohh sorry... didn't realize that they already sent you that link :P

On Fri, Apr 8, 2011 at 12:35 PM, Ezequiel Calderara <ez...@gmail.com> wrote:

> Maybe those files are created with a different Adobe Format version...
>
> See this:
> http://lucene.472066.n3.nabble.com/PDF-parser-exception-td644885.html
>
> On Fri, Apr 8, 2011 at 12:14 PM, Brandon Waterloo <
> Brandon.Waterloo@matrix.msu.edu> wrote:
>
>> A second test has revealed that it is something to do with the contents,
>> and not the literal filenames, of the second set of files.  I renamed one of
>> the second-format files and tested it and Solr still failed.  However, the
>> problem still only applies to those files of the second naming format.
>> ________________________________________
>> From: Brandon Waterloo [Brandon.Waterloo@matrix.msu.edu]
>> Sent: Friday, April 08, 2011 10:40 AM
>> To: solr-user@lucene.apache.org
>> Subject: RE: Problems indexing very large set of documents
>>
>> I had some time to do some research into the problems.  From what I can
>> tell, it appears Solr is tripping up over the filename.  These are strictly
>> examples, but, Solr handles this filename fine:
>>
>> 32-130-A0-84-african_activist_archive-a0a6s3-b_12419.pdf
>>
>> However, it fails with either a parsing error or an EOF exception on this
>> filename:
>>
>> 32-130-A08-84-al.sff.document.nusa197102.pdf
>>
>> The only significant difference is that the second filename contains
>> multiple periods.  As there are about 1700 files whose filenames are similar
>> to the second format it is simply not possible to change their filenames.
>>  In addition they are being used by other applications.
>>
>> Is there something I can change in Solr configs to fix this issue or am I
>> simply SOL until the Solr dev team can work on this? (assuming I put in a
>> ticket)
>>
>> Thanks again everyone,
>>
>> ~Brandon Waterloo
>>
>>
>> ________________________________________
>> From: Chris Hostetter [hossman_lucene@fucit.org]
>> Sent: Tuesday, April 05, 2011 3:03 PM
>> To: solr-user@lucene.apache.org
>> Subject: RE: Problems indexing very large set of documents
>>
>> : It wasn't just a single file, it was dozens of files all having problems
>> : toward the end just before I killed the process.
>>        ...
>> : That is by no means all the errors, that is just a sample of a few.
>> : You can see they all threw HTTP 500 errors.  What is strange is, nearly
>> : every file succeeded before about the 2200-files-mark, and nearly every
>> : file after that failed.
>>
>> ...the root question is: do those files *only* fail if you have already
>> indexed ~2200 files, or do they fail if you start up your server and index
>> them first?
>>
>> there may be a resource issue (if it only happens after indexing 2200) or
>> it may just be a problem with a large number of your PDFs that your
>> iteration code just happens to get to at that point.
>>
>> If it's the former, then there may be something buggy about how Solr is
>> using Tika to cause the problem -- if it's the latter, then it's a straight
>> Tika parsing issue.
>>
>> : > now, commit is set to false to speed up the indexing, and I'm assuming that
>> : > Solr should be auto-committing as necessary.  I'm using the default
>> : > solrconfig.xml file included in apache-solr-1.4.1\example\solr\conf.  Once
>>
>> Solr does no autocommitting by default; you need to check your
>> solrconfig.xml
>>
>>
>> -Hoss
>>
>
>
>
> --
> ______
> Ezequiel.
>
> Http://www.ironicnet.com
>



-- 
______
Ezequiel.

Http://www.ironicnet.com

RE: Problems indexing very large set of documents

Posted by Brandon Waterloo <Br...@matrix.msu.edu>.
I found a simpler command-line method to update the PDF files.  On some documents it works perfectly: the result is a pixel-for-pixel match and none of the OCR text is lost (these PDFs are all newspaper articles that have been passed through OCR).  However, on other documents the result is considerably blurrier and some of the OCR text is lost.
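The thread does not name the tool, but as one hypothetical example, Ghostscript can do this kind of batch rewrite; a sketch driven from PHP to match the rest of the thread's tooling:

    <?php
    // Hypothetical batch rewrite with Ghostscript (an assumption -- the
    // actual command-line tool used here is not named in the thread).
    foreach (glob('/path/to/pdfs/*.pdf') as $src) {
        $dst = preg_replace('/\.pdf$/', '.v16.pdf', $src);
        shell_exec('gs -o ' . escapeshellarg($dst)
                 . ' -sDEVICE=pdfwrite -dCompatibilityLevel=1.6 '
                 . escapeshellarg($src));
    }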

We've decided to skip any documents that Tika cannot index for now.

As Lance stated, it's not specifically the version that causes the problem, but rather quirks introduced by different PDF writers; a few tests have confirmed this, so we can't use the version to decide which files to skip.  I'm examining the XML responses from the queries, and I cannot figure out how to tell from the XML response whether or not a document was successfully indexed.  The status value seems to be 0 regardless of whether indexing was successful or not.

So my question is, how can I tell from the response whether or not indexing was actually successful?
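One answer suggested by the access-log excerpts earlier in the thread: failed extractions came back as HTTP 500, so the HTTP status code, rather than the status value in the XML body, is worth checking.  A minimal sketch, reusing the assumed URL and upload field from above:

    <?php
    // Sketch: treat a non-200 HTTP status from /update/extract as a failure.
    $ch = curl_init($url);   // extract-handler URL as in the indexing loop
    curl_setopt($ch, CURLOPT_POST, true);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_POSTFIELDS, array('myfile' => '@' . $path));
    curl_exec($ch);
    $status = curl_getinfo($ch, CURLINFO_HTTP_CODE);
    curl_close($ch);
    if ($status != 200) {
        error_log("indexing failed (HTTP $status): $path");
    }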

~Brandon Waterloo

________________________________________
From: Lance Norskog [goksron@gmail.com]
Sent: Sunday, April 10, 2011 5:22 PM
To: solr-user@lucene.apache.org
Subject: Re: Problems indexing very large set of documents

There is a library called iText. It parses and writes PDFs very very
well, and a simple program will let you do a batch conversion.  PDFs
are made by a wide range of programs, not just Adobe code. Many of
these do weird things and make small mistakes that Tika does not know
to handle. In other words there is "dirty PDF" just like "dirty HTML".

A percentage of PDFs will fail and that's life. One site that gets
press releases from zillions of sites (and thus a wide range of PDF
generators) has a 15% failure rate with Tika.

Lance

On Fri, Apr 8, 2011 at 9:44 AM, Brandon Waterloo
<Br...@matrix.msu.edu> wrote:
> I think I've finally found the problem.  The files that work are PDF version 1.6.  The files that do NOT work are PDF version 1.4.  I'll look into updating all the old documents to PDF 1.6.
>
> Thanks everyone!
>
> ~Brandon Waterloo
> ________________________________
> From: Ezequiel Calderara [ezechico@gmail.com]
> Sent: Friday, April 08, 2011 11:35 AM
> To: solr-user@lucene.apache.org
> Cc: Brandon Waterloo
> Subject: Re: Problems indexing very large set of documents
>
> Maybe those files are created with a different Adobe Format version...
>
> See this: http://lucene.472066.n3.nabble.com/PDF-parser-exception-td644885.html
>
> On Fri, Apr 8, 2011 at 12:14 PM, Brandon Waterloo <Br...@matrix.msu.edu> wrote:
> A second test has revealed that it is something to do with the contents, and not the literal filenames, of the second set of files.  I renamed one of the second-format files and tested it and Solr still failed.  However, the problem still only applies to those files of the second naming format.
> ________________________________________
> From: Brandon Waterloo [Brandon.Waterloo@matrix.msu.edu]
> Sent: Friday, April 08, 2011 10:40 AM
> To: solr-user@lucene.apache.org
> Subject: RE: Problems indexing very large set of documents
>
> I had some time to do some research into the problems.  From what I can tell, it appears Solr is tripping up over the filename.  These are strictly examples, but, Solr handles this filename fine:
>
> 32-130-A0-84-african_activist_archive-a0a6s3-b_12419.pdf
>
> However, it fails with either a parsing error or an EOF exception on this filename:
>
> 32-130-A08-84-al.sff.document.nusa197102.pdf
>
> The only significant difference is that the second filename contains multiple periods.  As there are about 1700 files whose filenames are similar to the second format it is simply not possible to change their filenames.  In addition they are being used by other applications.
>
> Is there something I can change in Solr configs to fix this issue or am I simply SOL until the Solr dev team can work on this? (assuming I put in a ticket)
>
> Thanks again everyone,
>
> ~Brandon Waterloo
>
>
> ________________________________________
> From: Chris Hostetter [hossman_lucene@fucit.org]
> Sent: Tuesday, April 05, 2011 3:03 PM
> To: solr-user@lucene.apache.org
> Subject: RE: Problems indexing very large set of documents
>
> : It wasn't just a single file, it was dozens of files all having problems
> : toward the end just before I killed the process.
>       ...
> : That is by no means all the errors, that is just a sample of a few.
> : You can see they all threw HTTP 500 errors.  What is strange is, nearly
> : every file succeeded before about the 2200-files-mark, and nearly every
> : file after that failed.
>
> ...the root question is: do those files *only* fail if you have already
> indexed ~2200 files, or do they fail if you start up your server and index
> them first?
>
> there may be a resource issue (if it only happens after indexing 2200) or
> it may just be a problem with a large number of your PDFs that your
> iteration code just happens to get to at that point.
>
> If it's the former, then there may be something buggy about how Solr is
> using Tika to cause the problem -- if it's the latter, then it's a straight
> Tika parsing issue.
>
> : > now, commit is set to false to speed up the indexing, and I'm assuming that
> : > Solr should be auto-committing as necessary.  I'm using the default
> : > solrconfig.xml file included in apache-solr-1.4.1\example\solr\conf.  Once
>
> Solr does no autocommitting by default; you need to check your
> solrconfig.xml
>
>
> -Hoss
>
>
>
> --
> ______
> Ezequiel.
>
> Http://www.ironicnet.com
>



--
Lance Norskog
goksron@gmail.com

Re: Problems indexing very large set of documents

Posted by Lance Norskog <go...@gmail.com>.
There is a library called iText. It parses and writes PDFs very very
well, and a simple program will let you do a batch conversion.  PDFs
are made by a wide range of programs, not just Adobe code. Many of
these do weird things and make small mistakes that Tika does not know
to handle. In other words there is "dirty PDF" just like "dirty HTML".

A percentage of PDFs will fail and that's life. One site that gets
press releases from zillions of sites (and thus a wide range of PDF
generators) has a 15% failure rate with Tika.

Lance

On Fri, Apr 8, 2011 at 9:44 AM, Brandon Waterloo
<Br...@matrix.msu.edu> wrote:
> I think I've finally found the problem.  The files that work are PDF version 1.6.  The files that do NOT work are PDF version 1.4.  I'll look into updating all the old documents to PDF 1.6.
>
> Thanks everyone!
>
> ~Brandon Waterloo
> ________________________________
> From: Ezequiel Calderara [ezechico@gmail.com]
> Sent: Friday, April 08, 2011 11:35 AM
> To: solr-user@lucene.apache.org
> Cc: Brandon Waterloo
> Subject: Re: Problems indexing very large set of documents
>
> Maybe those files are created with a different Adobe Format version...
>
> See this: http://lucene.472066.n3.nabble.com/PDF-parser-exception-td644885.html
>
> On Fri, Apr 8, 2011 at 12:14 PM, Brandon Waterloo <Br...@matrix.msu.edu> wrote:
> A second test has revealed that it is something to do with the contents, and not the literal filenames, of the second set of files.  I renamed one of the second-format files and tested it and Solr still failed.  However, the problem still only applies to those files of the second naming format.
> ________________________________________
> From: Brandon Waterloo [Brandon.Waterloo@matrix.msu.edu]
> Sent: Friday, April 08, 2011 10:40 AM
> To: solr-user@lucene.apache.org
> Subject: RE: Problems indexing very large set of documents
>
> I had some time to do some research into the problems.  From what I can tell, it appears Solr is tripping up over the filename.  These are strictly examples, but, Solr handles this filename fine:
>
> 32-130-A0-84-african_activist_archive-a0a6s3-b_12419.pdf
>
> However, it fails with either a parsing error or an EOF exception on this filename:
>
> 32-130-A08-84-al.sff.document.nusa197102.pdf
>
> The only significant difference is that the second filename contains multiple periods.  As there are about 1700 files whose filenames are similar to the second format it is simply not possible to change their filenames.  In addition they are being used by other applications.
>
> Is there something I can change in Solr configs to fix this issue or am I simply SOL until the Solr dev team can work on this? (assuming I put in a ticket)
>
> Thanks again everyone,
>
> ~Brandon Waterloo
>
>
> ________________________________________
> From: Chris Hostetter [hossman_lucene@fucit.org]
> Sent: Tuesday, April 05, 2011 3:03 PM
> To: solr-user@lucene.apache.org
> Subject: RE: Problems indexing very large set of documents
>
> : It wasn't just a single file, it was dozens of files all having problems
> : toward the end just before I killed the process.
>       ...
> : That is by no means all the errors, that is just a sample of a few.
> : You can see they all threw HTTP 500 errors.  What is strange is, nearly
> : every file succeeded before about the 2200-files-mark, and nearly every
> : file after that failed.
>
> ...the root question is: do those files *only* fail if you have already
> indexed ~2200 files, or do they fail if you start up your server and index
> them first?
>
> there may be a resource issue (if it only happens after indexing 2200) or
> it may just be a problem with a large number of your PDFs that your
> iteration code just happens to get to at that point.
>
> If it's the former, then there may be something buggy about how Solr is
> using Tika to cause the problem -- if it's the latter, then it's a straight
> Tika parsing issue.
>
> : > now, commit is set to false to speed up the indexing, and I'm assuming that
> : > Solr should be auto-committing as necessary.  I'm using the default
> : > solrconfig.xml file included in apache-solr-1.4.1\example\solr\conf.  Once
>
> Solr does no autocommitting by default; you need to check your
> solrconfig.xml
>
>
> -Hoss
>
>
>
> --
> ______
> Ezequiel.
>
> Http://www.ironicnet.com
>



-- 
Lance Norskog
goksron@gmail.com

RE: Problems indexing very large set of documents

Posted by Brandon Waterloo <Br...@matrix.msu.edu>.
I think I've finally found the problem.  The files that work are PDF version 1.6.  The files that do NOT work are PDF version 1.4.  I'll look into updating all the old documents to PDF 1.6.
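A PDF's declared version sits in the first bytes of the file (e.g. "%PDF-1.4"), so sorting the collection by version is straightforward; a minimal PHP sketch:

    <?php
    // Sketch: read the "%PDF-1.x" header to report a file's declared version.
    function pdf_version($path) {
        $header = file_get_contents($path, false, null, 0, 8);
        return preg_match('/^%PDF-(\d\.\d)/', $header, $m) ? $m[1] : null;
    }
    echo pdf_version('32-130-A08-84-al.sff.document.nusa197102.pdf');  // e.g. 1.4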

Thanks everyone!

~Brandon Waterloo
________________________________
From: Ezequiel Calderara [ezechico@gmail.com]
Sent: Friday, April 08, 2011 11:35 AM
To: solr-user@lucene.apache.org
Cc: Brandon Waterloo
Subject: Re: Problems indexing very large set of documents

Maybe those files are created with a different Adobe Format version...

See this: http://lucene.472066.n3.nabble.com/PDF-parser-exception-td644885.html

On Fri, Apr 8, 2011 at 12:14 PM, Brandon Waterloo <Br...@matrix.msu.edu> wrote:
A second test has revealed that it is something to do with the contents, and not the literal filenames, of the second set of files.  I renamed one of the second-format files and tested it and Solr still failed.  However, the problem still only applies to those files of the second naming format.
________________________________________
From: Brandon Waterloo [Brandon.Waterloo@matrix.msu.edu]
Sent: Friday, April 08, 2011 10:40 AM
To: solr-user@lucene.apache.org
Subject: RE: Problems indexing very large set of documents

I had some time to do some research into the problems.  From what I can tell, it appears Solr is tripping up over the filename.  These are strictly examples, but, Solr handles this filename fine:

32-130-A0-84-african_activist_archive-a0a6s3-b_12419.pdf

However, it fails with either a parsing error or an EOF exception on this filename:

32-130-A08-84-al.sff.document.nusa197102.pdf

The only significant difference is that the second filename contains multiple periods.  As there are about 1700 files whose filenames are similar to the second format it is simply not possible to change their filenames.  In addition they are being used by other applications.

Is there something I can change in Solr configs to fix this issue or am I simply SOL until the Solr dev team can work on this? (assuming I put in a ticket)

Thanks again everyone,

~Brandon Waterloo


________________________________________
From: Chris Hostetter [hossman_lucene@fucit.org]
Sent: Tuesday, April 05, 2011 3:03 PM
To: solr-user@lucene.apache.org
Subject: RE: Problems indexing very large set of documents

: It wasn't just a single file, it was dozens of files all having problems
: toward the end just before I killed the process.
       ...
: That is by no means all the errors, that is just a sample of a few.
: You can see they all threw HTTP 500 errors.  What is strange is, nearly
: every file succeeded before about the 2200-files-mark, and nearly every
: file after that failed.

...the root question is: do those files *only* fail if you have already
indexed ~2200 files, or do they fail if you start up your server and index
them first?

there may be a resource issue (if it only happens after indexing 2200) or
it may just be a problem with a large number of your PDFs that your
iteration code just happens to get to at that point.

If it's the former, then there may be something buggy about how Solr is
using Tika to cause the problem -- if it's the latter, then it's a straight
Tika parsing issue.

: > now, commit is set to false to speed up the indexing, and I'm assuming that
: > Solr should be auto-committing as necessary.  I'm using the default
: > solrconfig.xml file included in apache-solr-1.4.1\example\solr\conf.  Once

Solr does no autocommitting by default; you need to check your
solrconfig.xml


-Hoss



--
______
Ezequiel.

Http://www.ironicnet.com

Re: Problems indexing very large set of documents

Posted by Ezequiel Calderara <ez...@gmail.com>.
Maybe those files are created with a different Adobe Format version...

See this:
http://lucene.472066.n3.nabble.com/PDF-parser-exception-td644885.html

On Fri, Apr 8, 2011 at 12:14 PM, Brandon Waterloo <
Brandon.Waterloo@matrix.msu.edu> wrote:

> A second test has revealed that it is something to do with the contents,
> and not the literal filenames, of the second set of files.  I renamed one of
> the second-format files and tested it and Solr still failed.  However, the
> problem still only applies to those files of the second naming format.
> ________________________________________
> From: Brandon Waterloo [Brandon.Waterloo@matrix.msu.edu]
> Sent: Friday, April 08, 2011 10:40 AM
> To: solr-user@lucene.apache.org
> Subject: RE: Problems indexing very large set of documents
>
> I had some time to do some research into the problems.  From what I can
> tell, it appears Solr is tripping up over the filename.  These are strictly
> examples, but, Solr handles this filename fine:
>
> 32-130-A0-84-african_activist_archive-a0a6s3-b_12419.pdf
>
> However, it fails with either a parsing error or an EOF exception on this
> filename:
>
> 32-130-A08-84-al.sff.document.nusa197102.pdf
>
> The only significant difference is that the second filename contains
> multiple periods.  As there are about 1700 files whose filenames are similar
> to the second format it is simply not possible to change their filenames.
>  In addition they are being used by other applications.
>
> Is there something I can change in Solr configs to fix this issue or am I
> simply SOL until the Solr dev team can work on this? (assuming I put in a
> ticket)
>
> Thanks again everyone,
>
> ~Brandon Waterloo
>
>
> ________________________________________
> From: Chris Hostetter [hossman_lucene@fucit.org]
> Sent: Tuesday, April 05, 2011 3:03 PM
> To: solr-user@lucene.apache.org
> Subject: RE: Problems indexing very large set of documents
>
> : It wasn't just a single file, it was dozens of files all having problems
> : toward the end just before I killed the process.
>        ...
> : That is by no means all the errors, that is just a sample of a few.
> : You can see they all threw HTTP 500 errors.  What is strange is, nearly
> : every file succeeded before about the 2200-files-mark, and nearly every
> : file after that failed.
>
> ...the root question is: do those files *only* fail if you have already
> indexed ~2200 files, or do they fail if you start up your server and index
> them first?
>
> there may be a resource issue (if it only happens after indexing 2200) or
> it may just be a problem with a large number of your PDFs that your
> iteration code just happens to get to at that point.
>
> If it's the former, then there may be something buggy about how Solr is
> using Tika to cause the problem -- if it's the latter, then it's a straight
> Tika parsing issue.
>
> : > now, commit is set to false to speed up the indexing, and I'm assuming that
> : > Solr should be auto-committing as necessary.  I'm using the default
> : > solrconfig.xml file included in apache-solr-1.4.1\example\solr\conf.  Once
>
> Solr does no autocommitting by default; you need to check your
> solrconfig.xml
>
>
> -Hoss
>



-- 
______
Ezequiel.

Http://www.ironicnet.com

RE: Problems indexing very large set of documents

Posted by Brandon Waterloo <Br...@matrix.msu.edu>.
A second test has revealed that it is something to do with the contents, and not the literal filenames, of the second set of files.  I renamed one of the second-format files and tested it and Solr still failed.  However, the problem still only applies to those files of the second naming format.
________________________________________
From: Brandon Waterloo [Brandon.Waterloo@matrix.msu.edu]
Sent: Friday, April 08, 2011 10:40 AM
To: solr-user@lucene.apache.org
Subject: RE: Problems indexing very large set of documents

I had some time to do some research into the problems.  From what I can tell, it appears Solr is tripping up over the filename.  These are strictly examples, but, Solr handles this filename fine:

32-130-A0-84-african_activist_archive-a0a6s3-b_12419.pdf

However, it fails with either a parsing error or an EOF exception on this filename:

32-130-A08-84-al.sff.document.nusa197102.pdf

The only significant difference is that the second filename contains multiple periods.  As there are about 1700 files whose filenames are similar to the second format it is simply not possible to change their filenames.  In addition they are being used by other applications.

Is there something I can change in Solr configs to fix this issue or am I simply SOL until the Solr dev team can work on this? (assuming I put in a ticket)

Thanks again everyone,

~Brandon Waterloo


________________________________________
From: Chris Hostetter [hossman_lucene@fucit.org]
Sent: Tuesday, April 05, 2011 3:03 PM
To: solr-user@lucene.apache.org
Subject: RE: Problems indexing very large set of documents

: It wasn't just a single file, it was dozens of files all having problems
: toward the end just before I killed the process.
        ...
: That is by no means all the errors, that is just a sample of a few.
: You can see they all threw HTTP 500 errors.  What is strange is, nearly
: every file succeeded before about the 2200-files-mark, and nearly every
: file after that failed.

...the root question is: do those files *only* fail if you have already
indexed ~2200 files, or do they fail if you start up your server and index
them first?

there may be a resource issue (if it only happens after indexing 2200) or
it may just be a problem with a large number of your PDFs that your
iteration code just happens to get to at that point.

If it's the former, then there may be something buggy about how Solr is
using Tika to cause the problem -- if it's the latter, then it's a straight
Tika parsing issue.

: > now, commit is set to false to speed up the indexing, and I'm assuming that
: > Solr should be auto-committing as necessary.  I'm using the default
: > solrconfig.xml file included in apache-solr-1.4.1\example\solr\conf.  Once

Solr does no autocommitting by default; you need to check your
solrconfig.xml


-Hoss

RE: Problems indexing very large set of documents

Posted by Brandon Waterloo <Br...@matrix.msu.edu>.
I had some time to do some research into the problems.  From what I can tell, it appears Solr is tripping up over the filename.  These are strictly examples, but Solr handles this filename fine:

32-130-A0-84-african_activist_archive-a0a6s3-b_12419.pdf

However, it fails with either a parsing error or an EOF exception on this filename:

32-130-A08-84-al.sff.document.nusa197102.pdf

The only significant difference is that the second filename contains multiple periods.  As there are about 1700 files whose filenames are similar to the second format, it is simply not possible to change their filenames.  In addition, they are being used by other applications.

Is there something I can change in Solr configs to fix this issue or am I simply SOL until the Solr dev team can work on this? (assuming I put in a ticket)

Thanks again everyone,

~Brandon Waterloo


________________________________________
From: Chris Hostetter [hossman_lucene@fucit.org]
Sent: Tuesday, April 05, 2011 3:03 PM
To: solr-user@lucene.apache.org
Subject: RE: Problems indexing very large set of documents

: It wasn't just a single file, it was dozens of files all having problems
: toward the end just before I killed the process.
        ...
: That is by no means all the errors, that is just a sample of a few.
: You can see they all threw HTTP 500 errors.  What is strange is, nearly
: every file succeeded before about the 2200-files-mark, and nearly every
: file after that failed.

...the root question is: do those files *only* fail if you have already
indexed ~2200 files, or do they fail if you start up your server and index
them first?

there may be a resource issue (if it only happens after indexing 2200) or
it may just be a problem with a large number of your PDFs that your
iteration code just happens to get to at that point.

If it's the former, then there may be something buggy about how Solr is
using Tika to cause the problem -- if it's the latter, then it's a straight
Tika parsing issue.

: > now, commit is set to false to speed up the indexing, and I'm assuming that
: > Solr should be auto-committing as necessary.  I'm using the default
: > solrconfig.xml file included in apache-solr-1.4.1\example\solr\conf.  Once

Solr does no autocommitting by default; you need to check your
solrconfig.xml


-Hoss

RE: Problems indexing very large set of documents

Posted by Chris Hostetter <ho...@fucit.org>.
: It wasn't just a single file, it was dozens of files all having problems 
: toward the end just before I killed the process.
	...
: That is by no means all the errors, that is just a sample of a few.  
: You can see they all threw HTTP 500 errors.  What is strange is, nearly 
: every file succeeded before about the 2200-files-mark, and nearly every 
: file after that failed.

...the root question is: do those files *only* fail if you have already
indexed ~2200 files, or do they fail if you start up your server and index
them first?

there may be a resource issue (if it only happens after indexing 2200) or
it may just be a problem with a large number of your PDFs that your
iteration code just happens to get to at that point.

If it's the former, then there may be something buggy about how Solr is
using Tika to cause the problem -- if it's the latter, then it's a straight
Tika parsing issue.

: > now, commit is set to false to speed up the indexing, and I'm assuming that
: > Solr should be auto-committing as necessary.  I'm using the default
: > solrconfig.xml file included in apache-solr-1.4.1\example\solr\conf.  Once

Solr does no autocommitting by default; you need to check your
solrconfig.xml


-Hoss
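One way to run that isolation test is to restart Solr with a clean index and post a single previously-failing file first.  A minimal sketch using the same PHP cURL approach as the original script (URL and field name are the same assumptions as before):

    <?php
    // Sketch: index one previously-failing PDF against a freshly started Solr
    // to separate a resource/state problem from a per-file Tika problem.
    $path = '32-130-A08-84-al.sff.document.nusa197102.pdf';  // known-failing file
    $url  = 'http://localhost:8983/solr/update/extract?literal.id=test-single&commit=true';
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_POST, true);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_POSTFIELDS, array('myfile' => '@' . $path));
    curl_exec($ch);
    $status = curl_getinfo($ch, CURLINFO_HTTP_CODE);
    curl_close($ch);
    echo $status == 200 ? "indexed OK on a fresh index\n"
                        : "failed (HTTP $status) even on a fresh index\n";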

RE: Problems indexing very large set of documents

Posted by Brandon Waterloo <Br...@matrix.msu.edu>.
It wasn't just a single file; it was dozens of files all having problems toward the end just before I killed the process.

IPADDR -  -  [04/04/2011:17:17:03 +0000] "POST /solr/update/extract?literal.id=32-130-AFB-84&commit=false HTTP/1.1" 500 4558
IPADDR -  -  [04/04/2011:17:17:05 +0000] "POST /solr/update/extract?literal.id=32-130-AFC-84&commit=false HTTP/1.1" 500 4558
IPADDR -  -  [04/04/2011:17:17:09 +0000] "POST /solr/update/extract?literal.id=32-130-AFD-84&commit=false HTTP/1.1" 500 4557
IPADDR -  -  [04/04/2011:17:17:14 +0000] "POST /solr/update/extract?literal.id=32-130-AFE-84&commit=false HTTP/1.1" 500 4558
IPADDR -  -  [04/04/2011:17:17:21 +0000] "POST /solr/update/extract?literal.id=32-130-AFF-84&commit=false HTTP/1.1" 500 4558
IPADDR -  -  [04/04/2011:17:17:21 +0000] "POST /solr/update/extract?literal.id=32-130-B00-84&commit=false HTTP/1.1" 500 4557

That is by no means all the errors; that is just a sample of a few.  You can see they all threw HTTP 500 errors.  What is strange is that nearly every file succeeded before about the 2200-file mark, and nearly every file after that failed.


~Brandon Waterloo

________________________________
From: Anuj Kumar [anujsays@gmail.com]
Sent: Monday, April 04, 2011 2:48 PM
To: solr-user@lucene.apache.org
Cc: Brandon Waterloo
Subject: Re: Problems indexing very large set of documents

In the log messages, are you able to locate the file at which it fails? From the details, it looks like Tika is unable to parse one of your PDF files. We need to hunt that one out.

Regards,
Anuj

On Mon, Apr 4, 2011 at 11:57 PM, Brandon Waterloo <Br...@matrix.msu.edu> wrote:
Looks like I'm using Tika 0.4:
apache-solr-1.4.1/contrib/extraction/lib/tika-core-0.4.jar
.../tika-parsers-0.4.jar

~Brandon Waterloo

________________________________________
From: Anuj Kumar [anujsays@gmail.com]
Sent: Monday, April 04, 2011 2:12 PM
To: solr-user@lucene.apache.org
Cc: Brandon Waterloo
Subject: Re: Problems indexing very large set of documents

This is related to Apache TIKA. Which version are you using?
Please see this thread for more details-
http://lucene.472066.n3.nabble.com/PDF-parser-exception-td644885.html

Hope it helps.

Regards,
Anuj

On Mon, Apr 4, 2011 at 11:30 PM, Brandon Waterloo <
Brandon.Waterloo@matrix.msu.edu> wrote:

>  Hey everybody,
>
> I've been running into some issues indexing a very large set of documents.
>  There's about 4000 PDF files, ranging in size from 160MB to 10KB.
>  Obviously this is a big task for Solr.  I have a PHP script that iterates
> over the directory and uses PHP cURL to query Solr to index the files.  For
> now, commit is set to false to speed up the indexing, and I'm assuming that
> Solr should be auto-committing as necessary.  I'm using the default
> solrconfig.xml file included in apache-solr-1.4.1\example\solr\conf.  Once
> all the documents have been finished the PHP script queries Solr to commit.
>
> The main problem is that after a few thousand documents (around 2000 last
> time I tried), nearly every document begins causing Java exceptions in Solr:
>
> Apr 4, 2011 1:18:01 PM org.apache.solr.common.SolrException log
> SEVERE: org.apache.solr.common.SolrException:
> org.apache.tika.exception.TikaException: TIKA-198: Illegal IOException from
> org.apache.tika.parser.pdf.PDFParser@11d329d
>        at
> org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:211)
>        at
> org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:54)
>        at
> org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
>        at
> org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.handleRequest(RequestHandlers.java:233)
>        at org.apache.solr.core.SolrCore.execute(SolrCore.java:1316)
>        at
> org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:338)
>        at
> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:241)
>        at
> org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1089)
>        at
> org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:365)
>        at
> org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
>        at
> org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:181)
>        at
> org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:712)
>        at
> org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:405)
>        at
> org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:211)
>        at
> org.mortbay.jetty.handler.HandlerCollection.handle(HandlerCollection.java:114)
>        at
> org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:139)
>        at org.mortbay.jetty.Server.handle(Server.java:285)
>        at
> org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:502)
>        at
> org.mortbay.jetty.HttpConnection$RequestHandler.content(HttpConnection.java:835)
>        at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:641)
>        at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:202)
>        at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:378)
>        at
> org.mortbay.jetty.bio.SocketConnector$Connection.run(SocketConnector.java:226)
>        at
> org.mortbay.thread.BoundedThreadPool$PoolThread.run(BoundedThreadPool.java:442)
> Caused by: org.apache.tika.exception.TikaException: TIKA-198: Illegal
> IOException from org.apache.tika.parser.pdf.PDFParser@11d329d
>        at
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:125)
>        at
> org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:105)
>        at
> org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:190)
>        ... 23 more
> Caused by: java.io.IOException: expected='endobj' firstReadAttempt=''
> secondReadAttempt='' org.pdfbox.io.PushBackInputStream@b19bfc
>        at org.pdfbox.pdfparser.PDFParser.parseObject(PDFParser.java:502)
>        at org.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:176)
>        at org.pdfbox.pdmodel.PDDocument.load(PDDocument.java:707)
>        at org.pdfbox.pdmodel.PDDocument.load(PDDocument.java:691)
>        at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:40)
>        at
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:119)
>        ... 25 more
>
> As far as I know there's nothing special about these documents so I'm
> wondering if it's not properly autocommitting.  What would be appropriate
> settings in solrconfig.xml for this particular application?  I'd like it to
> autocommit as soon as it needs to but no more often than that for the sake
> of efficiency.  Obviously it takes long enough to index 4000 documents and
> there's no reason to make it take longer.  Thanks for your help!
>
> ~Brandon Waterloo
>


Re: Problems indexing very large set of documents

Posted by Anuj Kumar <an...@gmail.com>.
In the log messages, are you able to locate the file at which it fails? From
the details, it looks like Tika is unable to parse one of your PDF files. We
need to hunt that one out.

Regards,
Anuj

On Mon, Apr 4, 2011 at 11:57 PM, Brandon Waterloo <
Brandon.Waterloo@matrix.msu.edu> wrote:

> Looks like I'm using Tika 0.4:
> apache-solr-1.4.1/contrib/extraction/lib/tika-core-0.4.jar
> .../tika-parsers-0.4.jar
>
> ~Brandon Waterloo
>
> ________________________________________
> From: Anuj Kumar [anujsays@gmail.com]
> Sent: Monday, April 04, 2011 2:12 PM
> To: solr-user@lucene.apache.org
> Cc: Brandon Waterloo
> Subject: Re: Problems indexing very large set of documents
>
> This is related to Apache TIKA. Which version are you using?
> Please see this thread for more details-
> http://lucene.472066.n3.nabble.com/PDF-parser-exception-td644885.html
>
> <http://lucene.472066.n3.nabble.com/PDF-parser-exception-td644885.html
> >Hope
> it helps.
>
> Regards,
> Anuj
>
> On Mon, Apr 4, 2011 at 11:30 PM, Brandon Waterloo <
> Brandon.Waterloo@matrix.msu.edu> wrote:
>
> >  Hey everybody,
> >
> > I've been running into some issues indexing a very large set of
> documents.
> >  There's about 4000 PDF files, ranging in size from 160MB to 10KB.
> >  Obviously this is a big task for Solr.  I have a PHP script that
> iterates
> > over the directory and uses PHP cURL to query Solr to index the files.
>  For
> > now, commit is set to false to speed up the indexing, and I'm assuming
> that
> > Solr should be auto-committing as necessary.  I'm using the default
> > solrconfig.xml file included in apache-solr-1.4.1\example\solr\conf.
>  Once
> > all the documents have been finished the PHP script queries Solr to
> commit.
> >
> > The main problem is that after a few thousand documents (around 2000 last
> > time I tried), nearly every document begins causing Java exceptions in
> Solr:
> >
> > Apr 4, 2011 1:18:01 PM org.apache.solr.common.SolrException log
> > SEVERE: org.apache.solr.common.SolrException:
> > org.apache.tika.exception.TikaException: TIKA-198: Illegal IOException
> from
> > org.apache.tika.parser.pdf.PDFParser@11d329d
> >        at
> >
> org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:211)
> >        at
> >
> org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:54)
> >        at
> >
> org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
> >        at
> >
> org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.handleRequest(RequestHandlers.java:233)
> >        at org.apache.solr.core.SolrCore.execute(SolrCore.java:1316)
> >        at
> >
> org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:338)
> >        at
> >
> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:241)
> >        at
> >
> org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1089)
> >        at
> > org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:365)
> >        at
> >
> org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
> >        at
> > org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:181)
> >        at
> > org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:712)
> >        at
> > org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:405)
> >        at
> >
> org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:211)
> >        at
> >
> org.mortbay.jetty.handler.HandlerCollection.handle(HandlerCollection.java:114)
> >        at
> > org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:139)
> >        at org.mortbay.jetty.Server.handle(Server.java:285)
> >        at
> > org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:502)
> >        at
> >
> org.mortbay.jetty.HttpConnection$RequestHandler.content(HttpConnection.java:835)
> >        at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:641)
> >        at
> org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:202)
> >        at
> org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:378)
> >        at
> >
> org.mortbay.jetty.bio.SocketConnector$Connection.run(SocketConnector.java:226)
> >        at
> >
> org.mortbay.thread.BoundedThreadPool$PoolThread.run(BoundedThreadPool.java:442)
> > Caused by: org.apache.tika.exception.TikaException: TIKA-198: Illegal
> > IOException from org.apache.tika.parser.pdf.PDFParser@11d329d
> >        at
> > org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:125)
> >        at
> > org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:105)
> >        at
> >
> org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:190)
> >        ... 23 more
> > Caused by: java.io.IOException: expected='endobj' firstReadAttempt=''
> > secondReadAttempt='' org.pdfbox.io.PushBackInputStream@b19bfc
> >        at org.pdfbox.pdfparser.PDFParser.parseObject(PDFParser.java:502)
> >        at org.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:176)
> >        at org.pdfbox.pdmodel.PDDocument.load(PDDocument.java:707)
> >        at org.pdfbox.pdmodel.PDDocument.load(PDDocument.java:691)
> >        at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:40)
> >        at
> > org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:119)
> >        ... 25 more
> >
> > As far as I know there's nothing special about these documents so I'm
> > wondering if it's not properly autocommitting.  What would be appropriate
> > settings in solrconfig.xml for this particular application?  I'd like it
> to
> > autocommit as soon as it needs to but no more often than that for the
> sake
> > of efficiency.  Obviously it takes long enough to index 4000 documents
> and
> > there's no reason to make it take longer.  Thanks for your help!
> >
> > ~Brandon Waterloo
> >
>

RE: Problems indexing very large set of documents

Posted by Brandon Waterloo <Br...@matrix.msu.edu>.
Looks like I'm using Tika 0.4:
apache-solr-1.4.1/contrib/extraction/lib/tika-core-0.4.jar
.../tika-parsers-0.4.jar

~Brandon Waterloo

________________________________________
From: Anuj Kumar [anujsays@gmail.com]
Sent: Monday, April 04, 2011 2:12 PM
To: solr-user@lucene.apache.org
Cc: Brandon Waterloo
Subject: Re: Problems indexing very large set of documents

This is related to Apache TIKA. Which version are you using?
Please see this thread for more details-
http://lucene.472066.n3.nabble.com/PDF-parser-exception-td644885.html

Hope it helps.

Regards,
Anuj

