Posted to solr-user@lucene.apache.org by Thomas Joiner <th...@gmail.com> on 2010/08/26 20:36:01 UTC

Re: how to deal with virtual collection in solr?

I don't know about the shards, etc.

However, I recently encountered that exception while indexing PDFs as well.
I resolved it by upgrading to a nightly build of Solr. (You can find them at
https://hudson.apache.org/hudson/view/Solr/job/Solr-trunk/.)

The problem is that the version of Tika that Solr 1.4.1 uses is very old,
and it relies on an old version of PDFBox to do its parsing. (You might be
able to fix the problem just by replacing the Tika jars; however, I don't
know if there have been any API changes, so I can't really recommend that.)

We didn't upgrade to trunk for that functionality, but it was nice that it
started working. (The PDFs we'll be indexing won't be of later versions, but
a test file was.)

On Thu, Aug 26, 2010 at 1:27 PM, Ma, Xiaohui (NIH/NLM/LHC) [C] <
xiaohui@mail.nlm.nih.gov> wrote:

> Thanks so much for your help, Jan Høydahl!
>
> I made multiple cores (aa public, aa private, bb public and bb private). I
> know how to query them individually. Please tell me if I can query them in
> combination through the shards parameter now. I tried appending
> &shards=aapub,bbpub to the query string. Unfortunately it didn't work.
>
> Actually, all of the content is the same. I don't have a "collection" field
> in the XML files. Please tell me how I can set a "collection" field in the
> schema and simply select a collection through a filter.
>
> I used curl to index PDF files. I use Solr 1.4.1. I got the following error
> when indexing PDFs of version 1.5 and 1.6.
>
> *************************************
> <html>
> <head>
> <meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1"/>
> <title>Error 500 </title>
> </head>
> <body><h2>HTTP ERROR: 500</h2><pre>org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.pdf.PDFParser@134ae32
>
> org.apache.solr.common.SolrException: org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.pdf.PDFParser@134ae32
>        at org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:211)
>        at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:54)
>        at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
>        at org.apache.solr.core.SolrCore.execute(SolrCore.java:1316)
>        at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:338)
>        at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:241)
>        at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1089)
>        at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:365)
>        at org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
>        at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:181)
>        at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:712)
>        at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:405)
>        at org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:211)
>        at org.mortbay.jetty.handler.HandlerCollection.handle(HandlerCollection.java:114)
>        at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:139)
>        at org.mortbay.jetty.Server.handle(Server.java:285)
>        at org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:502)
>        at org.mortbay.jetty.HttpConnection$RequestHandler.content(HttpConnection.java:835)
>        at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:641)
>        at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:202)
>        at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:378)
>        at org.mortbay.jetty.bio.SocketConnector$Connection.run(SocketConnector.java:226)
>        at org.mortbay.thread.BoundedThreadPool$PoolThread.run(BoundedThreadPool.java:442)
> Caused by: org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.pdf.PDFParser@134ae32
>        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:121)
>        at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:105)
>        at org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:190)
>        ... 22 more
> Caused by: java.lang.NullPointerException
>        at org.pdfbox.pdmodel.PDPageNode.getAllKids(PDPageNode.java:194)
>        at org.pdfbox.pdmodel.PDPageNode.getAllKids(PDPageNode.java:182)
>        at org.pdfbox.pdmodel.PDDocumentCatalog.getAllPages(PDDocumentCatalog.java:226)
>        at org.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:216)
>        at org.pdfbox.util.PDFTextStripper.getText(PDFTextStripper.java:149)
>        at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:53)
>        at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:51)
>        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:119)
>        ... 24 more
> </pre>
> <p>RequestURI=/solr/lhcpdf/update/extract</p><p><i><small><a href="http://jetty.mortbay.org/">Powered by Jetty://</a></small></i></p><br/>
> <br/>
> ***************************************
>
>
> -----Original Message-----
> From: Jan Høydahl / Cominvent [mailto:jan.asf@cominvent.com]
> Sent: Wednesday, August 25, 2010 4:34 PM
> To: solr-user@lucene.apache.org
> Subject: Re: how to deal with virtual collection in solr?
>
> > 1. Currently we use Verity and have more than 20 collections. Each
> > collection has an index for public items and an index for private items. So
> > there are virtual collections which point to each collection and a virtual
> > collection which points to all. For example, we have AA and BB collections.
> >
> > AA virtual collection --> (AA index for public items and AA index for
> > private items).
> > BB virtual collection --> (BB index for public items and BB index for
> > private items).
> > All virtual collection --> (AA index for public items and AA index for
> > private items, BB index for public items and BB index for private items).
> >
> > Would you please tell me what I should do for this if I use Solr?
>
> There are multiple ways to solve this, depending on the nature of your
> collections. If they have somewhat different schemas, a natural choice would
> be to make multiple cores: AA-private, AA-public, BB-private, BB-public. You
> can then query them individually or in combination through the shards
> parameter. From the next Solr version you can use virtual collections for
> the shards parameter, e.g. &shards=AA,BB etc. (See
> http://wiki.apache.org/solr/SolrCloud#Distributed_Requests)
>
> If all your content is (roughly) the same kind of data, you could also
> solve your virtual collection issue through a "collection" field in your
> schema, and simply select a collection through filters: &fq=collection:AA.
> You could even write a Search Component which translates a &collection=
> parameter in the request into the correct filters if you want to hide this
> implementation from the front ends.
>
> > 2. Our project has files in different formats that I need to index, for
> > example XML files, PDF files and text files. Is it possible for Solr to
> > return search results from all of them?
>
> Sure. PDF and text files can be indexed through the
> ExtractingRequestHandler. XML can be indexed through the
> XmlUpdateRequestHandler or the DataImportHandler. Solr uses Apache Tika
> internally to extract text from PDFs and other rich document formats.
>
> > 3. I got an error when I index PDF files which are version 1.5 or 1.6.
> > Would you please tell me if there is a patch to fix it?
>
> How did you try to index these PDFs? What version of Solr are you using?
> Exactly what error message did you get?
>
> --
> Jan Høydahl, search solution architect
> Cominvent AS - www.cominvent.com
> Training in Europe - www.solrtraining.com
>
>
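
For readers following the single-index suggestion quoted above, here is a minimal sketch of how the filter-based "virtual collection" query could be put together. It only builds the request URL and does not contact Solr. The base URL, the helper name collection_filter_url, and the field name "collection" are assumptions for illustration; the OR form for the "All" virtual collection is ordinary Lucene query syntax, not something the thread spells out.

```python
from urllib.parse import urlencode


def collection_filter_url(base_url, query, collections):
    """Build a /select URL restricted to one or more 'collection' values.

    base_url and the 'collection' field name are illustrative assumptions;
    the schema must actually define such a field.
    """
    if not collections:
        raise ValueError("at least one collection name is required")
    if len(collections) == 1:
        # Single collection: fq=collection:AA
        fq = "collection:" + collections[0]
    else:
        # Several collections (e.g. the "All" virtual collection):
        # fq=collection:(AA OR BB)
        fq = "collection:(" + " OR ".join(collections) + ")"
    return base_url + "/select?" + urlencode({"q": query, "fq": fq})


if __name__ == "__main__":
    print(collection_filter_url("http://localhost:8983/solr", "foobar", ["AA"]))
    print(collection_filter_url("http://localhost:8983/solr", "foobar", ["AA", "BB"]))
```

Keeping the collection restriction in fq rather than in q means the same user query can be reused unchanged across virtual collections.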

Re: how to deal with virtual collection in solr?

Posted by Lance Norskog <go...@gmail.com>.
For XML files that are not in the Solr document upload format, you
would use the DataImportHandler.

http://wiki.apache.org/solr/DataImportHandler#XPathEntityProcessor

Look for the Wikipedia example. It shows how to read XML files from
disk. You give XPath expressions for different items in the XML.

On Fri, Aug 27, 2010 at 6:04 AM, Ma, Xiaohui (NIH/NLM/LHC) [C]
<xi...@mail.nlm.nih.gov> wrote:
> Thank you, Jan Høydahl.
>
> I used http://localhost:8983/solr/select?&shards=localhost:8983/solr/aaprivate,localhost:8983/solr/aapublic/ and got the error "Missing solr core name in path". I have aapublic and aaprivate cores. I also got an error when I used http://localhost:8983/solr/aapublic/select?&shards=localhost:8983/solr/aaprivate,localhost:8983/solr/aapublic/: a "java.lang.NullPointerException".
>
> My collections are XML files. Please let me know if I can use the approach you suggested:
> curl "http://localhost:8983/solr/update/extract?literal.collection=aaprivate&literal.id=doc1&commit=true" -F "file=@myfile.xml"
>
> Thanks so much as always!
> Xiaohui
>
>
> -----Original Message-----
> From: Jan Høydahl / Cominvent [mailto:jan.asf@cominvent.com]
> Sent: Friday, August 27, 2010 7:42 AM
> To: solr-user@lucene.apache.org
> Subject: Re: how to deal with virtual collection in solr?
>
> Hi,
>
> Version 1.4.1 does not support the SolrCloud style sharding. In 1.4.1, please use this style:
> &shards=localhost:8983/solr/aaprivate,localhost:8983/solr/aapublic/
>
>
> However, since the schema is the same, I'd opt for one index with a "collection" field as the filter.
>
> You can add that field to your schema, and then inject it as metadata on the ExtractingRequestHandler call:
>
> curl "http://localhost:8983/solr/update/extract?literal.collection=aaprivate&literal.id=doc1&commit=true" -F "file=@myfile.pdf"
>
> --
> Jan Høydahl, search solution architect
> Cominvent AS - www.cominvent.com
> Training in Europe - www.solrtraining.com
>
> On 26. aug. 2010, at 20.41, Ma, Xiaohui (NIH/NLM/LHC) [C] wrote:
>
>> Thanks so much for your help! I will try it.
>>
>>
>
>



-- 
Lance Norskog
goksron@gmail.com
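
For reference, the "Solr document upload format" mentioned above is a plain add/doc/field structure. The following is a minimal sketch, not a replacement for the DataImportHandler route, that builds such a message from already-extracted values. The helper name to_solr_add_xml and the field names id, collection and title are assumptions for illustration; the fields have to exist in your schema.

```python
import xml.etree.ElementTree as ET


def to_solr_add_xml(docs):
    """Serialize a list of dicts as a Solr <add> update message.

    Each dict becomes one <doc>; each key/value pair becomes one
    <field name="...">value</field>. Field names are hypothetical.
    """
    add = ET.Element("add")
    for doc in docs:
        doc_el = ET.SubElement(add, "doc")
        for name, value in doc.items():
            field = ET.SubElement(doc_el, "field", {"name": name})
            # ElementTree escapes &, < and > when serializing.
            field.text = str(value)
    return ET.tostring(add, encoding="unicode")


if __name__ == "__main__":
    print(to_solr_add_xml([
        {"id": "doc1", "collection": "aaprivate", "title": "A & B"},
    ]))
```

The resulting string would be posted to the core's /update handler, which is a different endpoint from the /update/extract handler used for PDFs.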

RE: how to deal with virtual collection in solr?

Posted by "Ma, Xiaohui (NIH/NLM/LHC) [C]" <xi...@mail.nlm.nih.gov>.
Thanks so much for your help, Jan Høydahl.

Have a great weekend!
Xiaohui

-----Original Message-----
From: Jan Høydahl / Cominvent [mailto:jan.asf@cominvent.com]
Sent: Friday, September 03, 2010 3:46 AM
To: solr-user@lucene.apache.org
Subject: Re: how to deal with virtual collection in solr?

You did not supply an actual query. Try adding a q=foobar parameter. Also, you don't need the & before shards, since it directly follows the ?.
--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com
Training in Europe - www.solrtraining.com

On 1. sep. 2010, at 20.14, Ma, Xiaohui (NIH/NLM/LHC) [C] wrote:

> Thank you, Jan. Unfortunately I got the following exception when I used http://localhost:8983/solr/aapublic/select?&shards=localhost:8983/solr/aaprivate,localhost:8983/solr/aapublic/ .
>
> *********************************
> Aug 31, 2010 4:54:42 PM org.apache.solr.common.SolrException log
> SEVERE: java.lang.NullPointerException
>        at java.io.StringReader.<init>(StringReader.java:33)
>        at org.apache.lucene.queryParser.QueryParser.parse(QueryParser.java:197)
>        at org.apache.solr.search.LuceneQParser.parse(LuceneQParserPlugin.java:78)
>        at org.apache.solr.search.QParser.getQuery(QParser.java:131)
>        at org.apache.solr.handler.component.QueryComponent.prepare(QueryComponent.java:89)
>        at org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:174)
>        at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
>        at org.apache.solr.core.SolrCore.execute(SolrCore.java:1316)
>        at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:338)
>        at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:241)
>        at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1089)
>        at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:365)
>        at org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
>        at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:181)
>        at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:712)
>        at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:405)
>        at org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:211)
>        at org.mortbay.jetty.handler.HandlerCollection.handle(HandlerCollection.java:114)
>        at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:139)
>        at org.mortbay.jetty.Server.handle(Server.java:285)
>        at org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:502)
>        at org.mortbay.jetty.HttpConnection$RequestHandler.headerComplete(HttpConnection.java:821)
>        at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:513)
>        at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:208)
>        at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:378)
>        at org.mortbay.jetty.bio.SocketConnector$Connection.run(SocketConnector.java:226)
>        at org.mortbay.thread.BoundedThreadPool$PoolThread.run(BoundedThreadPool.java:442)
> *********************************
>
> -----Original Message-----
> From: Jan Høydahl / Cominvent [mailto:jan.asf@cominvent.com]
> Sent: Tuesday, August 31, 2010 2:15 PM
> To: solr-user@lucene.apache.org
> Subject: Re: how to deal with virtual collection in solr?
>
> Hi,
>
> If you have multiple cores defined in your solr.xml, you need to issue your queries to one of the cores. In the request below, the core name seems to be missing. Try instead:
>
>       http://localhost:8983/solr/aapublic/select?&shards=localhost:8983/solr/aaprivate,localhost:8983/solr/aapublic/
>
> And as Lance pointed out, make sure your XML files conform to the Solr XML format (http://wiki.apache.org/solr/UpdateXmlMessages).
>
> --
> Jan Høydahl, search solution architect
> Cominvent AS - www.cominvent.com
> Training in Europe - www.solrtraining.com
>
> On 27. aug. 2010, at 15.04, Ma, Xiaohui (NIH/NLM/LHC) [C] wrote:
>
>> Thank you, Jan Høydahl.
>>
>> I used http://localhost:8983/solr/select?&shards=localhost:8983/solr/aaprivate,localhost:8983/solr/aapublic/. I got a error "Missing solr core name in path". I have aapublic and aaprivate cores. I also got a error if I used http://localhost:8983/solr/aapublic/select?&shards=localhost:8983/solr/aaprivate,localhost:8983/solr/aapublic/. I got a null exception "java.lang.NullPointerException".
>>
>> My collections are xml files. Please let me if I can use the following way you suggested.
>> curl "http://localhost:8983/solr/update/extract?literal.collection=aaprivate&literal.id=doc1&commit=true" -F "file=@myfile.xml"
>>
>> Thanks so much as always!
>> Xiaohui
>>
>>
>> -----Original Message-----
>> From: Jan Høydahl / Cominvent [mailto:jan.asf@cominvent.com]
>> Sent: Friday, August 27, 2010 7:42 AM
>> To: solr-user@lucene.apache.org
>> Subject: Re: how to deal with virtual collection in solr?
>>
>> Hi,
>>
>> Version 1.4.1 does not support the SolrCloud style sharding. In 1.4.1, please use this style:
>> &shards=localhost:8983/solr/aaprivate,localhost:8983/solr/aapublic/
>>
>>
>> However, since schema is the same, I'd opt for one index with a "collections" field as the filter.
>>
>> You can add that field to your schema, and then inject it as metadata on the ExtractingRequestHandler call:
>>
>> curl "http://localhost:8983/solr/update/extract?literal.collection=aaprivate&literal.id=doc1&commit=true" -F "file=@myfile.pdf"
>>
>> --
>> Jan Høydahl, search solution architect
>> Cominvent AS - www.cominvent.com
>> Training in Europe - www.solrtraining.com
>>
>> On 26. aug. 2010, at 20.41, Ma, Xiaohui (NIH/NLM/LHC) [C] wrote:
>>
>>> Thanks so much for your help! I will try it.
>>>
>>>
>>> -----Original Message-----
>>> From: Thomas Joiner [mailto:thomas.b.joiner@gmail.com]
>>> Sent: Thursday, August 26, 2010 2:36 PM
>>> To: solr-user@lucene.apache.org
>>> Subject: Re: how to deal with virtual collection in solr?
>>>
>>> I don't know about the shards, etc.
>>>
>>> However I recently encountered that exception while indexing pdfs as well.
>>> The way that I resolved it was to upgrade to a nightly build of Solr. (You
>>> can find them https://hudson.apache.org/hudson/view/Solr/job/Solr-trunk/).
>>>
>>> The problem is that the version of Tika that 1.4.1 using is a very old
>>> version of Tika, which uses a old version of PDFBox to do its parsing.  (You
>>> might be able to fix the problem just by replacing the Tika jars...however I
>>> don't know if there have been any API changes so I can't really suggest
>>> that.)
>>>
>>> We didn't upgrade to trunk in order for that functionality, but it was nice
>>> that it started working. (The PDFs we'll be indexing won't be of later
>>> versions, but a test file was).
>>>
>>> On Thu, Aug 26, 2010 at 1:27 PM, Ma, Xiaohui (NIH/NLM/LHC) [C] <
>>> xiaohui@mail.nlm.nih.gov> wrote:
>>>
>>>> Thanks so much for your help, Jan Høydahl!
>>>>
>>>> I made multiple cores (aa public, aa private, bb public and bb private). I
>>>> knew how to query them individually. Please tell me if I can do a
>>>> combinations through shards parameter now. If yes, I tried to append
>>>> &shards=aapub,bbpub after query string. Unfortunately it didn't work.
>>>>
>>>> Actually all of content is the same. I don't have "collection" field in xml
>>>> files. Please tell me how I can set a "collection" field in schema and
>>>> simply search collection through filter.
>>>>
>>>> I used curl to index pdf files. I use Solr 1.4.1. I got the following error
>>>> when I index pdf with version 1.5 and 1.6.
>>>>
>>>> *************************************
>>>> <html>
>>>> <head>
>>>> <meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1"/>
>>>> <title>Error 500 </title>
>>>> </head>
>>>> <body><h2>HTTP ERROR: 500</h2><pre>org.apache.tika.exception.TikaException:
>>>> Unexpected RuntimeException from
>>>> org.apache.tika.parser.pdf.PDFParser@134ae32
>>>>
>>>> org.apache.solr.common.SolrException:
>>>> org.apache.tika.exception.TikaException: Unexpected RuntimeException from
>>>> org.apache.tika.parser.pdf.PDFParser@134ae32
>>>>     at
>>>> org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:211)
>>>>     at
>>>> org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:54)
>>>>     at
>>>> org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
>>>>     at org.apache.solr.core.SolrCore.execute(SolrCore.java:1316)
>>>>     at
>>>> org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:338)
>>>>     at
>>>> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:241)
>>>>     at
>>>> org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1089)
>>>>     at
>>>> org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:365)
>>>>     at
>>>> org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
>>>>     at
>>>> org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:181)
>>>>     at
>>>> org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:712)
>>>>     at
>>>> org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:405)
>>>>     at
>>>> org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:211)
>>>>     at
>>>> org.mortbay.jetty.handler.HandlerCollection.handle(HandlerCollection.java:114)
>>>>     at
>>>> org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:139)
>>>>     at org.mortbay.jetty.Server.handle(Server.java:285)
>>>>     at
>>>> org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:502)
>>>>     at
>>>> org.mortbay.jetty.HttpConnection$RequestHandler.content(HttpConnection.java:835)
>>>>     at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:641)
>>>>     at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:202)
>>>>     at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:378)
>>>>     at
>>>> org.mortbay.jetty.bio.SocketConnector$Connection.run(SocketConnector.java:226)
>>>>     at
>>>> org.mortbay.thread.BoundedThreadPool$PoolThread.run(BoundedThreadPool.java:442)
>>>> Caused by: org.apache.tika.exception.TikaException: Unexpected
>>>> RuntimeException from org.apache.tika.parser.pdf.PDFParser@134ae32
>>>>     at
>>>> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:121)
>>>>     at
>>>> org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:105)
>>>>     at
>>>> org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:190)
>>>>     ... 22 more
>>>> Caused by: java.lang.NullPointerException
>>>>     at org.pdfbox.pdmodel.PDPageNode.getAllKids(PDPageNode.java:194)
>>>>     at org.pdfbox.pdmodel.PDPageNode.getAllKids(PDPageNode.java:182)
>>>>     at
>>>> org.pdfbox.pdmodel.PDDocumentCatalog.getAllPages(PDDocumentCatalog.java:226)
>>>>     at
>>>> org.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:216)
>>>>     at org.pdfbox.util.PDFTextStripper.getText(PDFTextStripper.java:149)
>>>>     at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:53)
>>>>     at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:51)
>>>>     at
>>>> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:119)
>>>>     ... 24 more
>>>> </pre>
>>>> <p>RequestURI=/solr/lhcpdf/update/extract</p><p><i><small><a href="
>>>> http://jetty.mortbay.org/">Powered by Jetty://</a></small></i></p><br/>
>>>> <br/>
>>>> ***************************************
>>>>
>>>>
>>>> -----Original Message-----
>>>> From: Jan Høydahl / Cominvent [mailto:jan.asf@cominvent.com]
>>>> Sent: Wednesday, August 25, 2010 4:34 PM
>>>> To: solr-user@lucene.apache.org
>>>> Subject: Re: how to deal with virtual collection in solr?
>>>>
>>>>> 1. Currently we use Verity and have more than 20 collections, each
>>>> collection has a index for public items and a index for private items. So
>>>> there are virtual collections which point to each collection and a virtual
>>>> collection which points to all. For example, we have AA and BB collections.
>>>>>
>>>>> AA virtual collection --> (AA index for public items and AA index for
>>>> private items).
>>>>> BB virtual collection --> (BB index for public items and BB index for
>>>> private items).
>>>>> All virtual collection --> (AA index for public items and AA index for
>>>> private items, BB index for public items and BB index for private items).
>>>>>
>>>>> Would you please tell me what I should do for this if I use Solr?
>>>>
>>>> There are multiple ways to solve this, depending on the nature of your
>>>> collections. If they have somewhat different schemas, a natural choice would
>>>> be to make multiple cores: AA-private, AA-public, BB-private, BB-public. Now
>>>> you can query them individually or in combinations through the shards
>>>> parameter. From the next Solr version you can use virtual collections for the
>>>> shard parameter, e.g. &shards=AA,BB etc. (See
>>>> http://wiki.apache.org/solr/SolrCloud#Distributed_Requests)
>>>>
>>>> If all your content is (roughly) the same kind of data, you could also
>>>> solve your virtual collection issue through a "collection" field in your
>>>> schema, and simply select collection through filters: &fq=collection:AA. You
>>>> could even write a Search Component which translates a &collection=
>>>> parameter in the request into the correct filters if you want to hide this
>>>> implementation from the front ends.
>>>>
>>>>> 2. Our project has files in different formats that I need to index. For
>>>> example, xml files, pdf files and text files. Is it possible for Solr to
>>>> return a search result from all?
>>>>
>>>> Sure. PDF and text files can be indexed through the
>>>> ExtractingRequestHandler. XML can be indexed from XMLUpdateHandler or
>>>> DataImportHandler. Solr uses Apache Tika internally to extract text from
>>>> PDFs and other rich document formats.
>>>>
>>>>>
>>>>> 3. I got an error when I indexed pdf files which are version 1.5 or 1.6.
>>>> Would you please tell me if there is a patch to fix it?
>>>>
>>>> How did you try to index these PDFs? What version of Solr are you using?
>>>> Exactly what error message did you get?
>>>>
>>>> --
>>>> Jan Høydahl, search solution architect
>>>> Cominvent AS - www.cominvent.com
>>>> Training in Europe - www.solrtraining.com
>>>>
>>>>
>>
>


Re: how to deal with virtual collection in solr?

Posted by Jan Høydahl / Cominvent <ja...@cominvent.com>.
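You did not supply an actual query, which is most likely what triggers the NullPointerException in the query parser. Try adding a q parameter, e.g. q=foobar. Also, you don't need the & before shards, since it comes directly after the ?.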
--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com
Training in Europe - www.solrtraining.com
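
For reference, a minimal sketch of the corrected request, assuming the aapublic and aaprivate cores from this thread are running on localhost:8983. The search term foobar is only a placeholder, and the variable names are made up for illustration.

```shell
#!/bin/sh
# Core that receives the request (core names taken from this thread).
BASE="http://localhost:8983/solr/aapublic/select"

# Shards to search; in Solr 1.4.1 each entry is host:port/path, without "http://".
SHARDS="localhost:8983/solr/aaprivate,localhost:8983/solr/aapublic"

# Placeholder search term; without a q parameter the query parser has nothing to parse.
QUERY="foobar"

# The first parameter follows the "?" directly; "&" only separates later parameters.
URL="${BASE}?q=${QUERY}&shards=${SHARDS}"

echo "$URL"
# To send the request against a running Solr instance: curl "$URL"
```

The script only builds and prints the URL, so it can be checked without a running server; the curl call in the last comment sends it.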

On 1. sep. 2010, at 20.14, Ma, Xiaohui (NIH/NLM/LHC) [C] wrote:

> Thank you, Jan. Unfortunately I got following exception when I use http://localhost:8983/solr/aapublic/select?&shards=localhost:8983/solr/aaprivate,localhost:8983/solr/aapublic/ . 
> 
> *********************************
> Aug 31, 2010 4:54:42 PM org.apache.solr.common.SolrException log
> SEVERE: java.lang.NullPointerException
>        at java.io.StringReader.<init>(StringReader.java:33)
>        at org.apache.lucene.queryParser.QueryParser.parse(QueryParser.java:197)
>        at org.apache.solr.search.LuceneQParser.parse(LuceneQParserPlugin.java:78)
>        at org.apache.solr.search.QParser.getQuery(QParser.java:131)
>        at org.apache.solr.handler.component.QueryComponent.prepare(QueryComponent.java:89)
>        at org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:174)
>        at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
>        at org.apache.solr.core.SolrCore.execute(SolrCore.java:1316)
>        at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:338)
>        at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:241)
>        at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1089)
>        at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:365)
>        at org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
>        at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:181)
>        at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:712)
>        at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:405)
>        at org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:211)
>        at org.mortbay.jetty.handler.HandlerCollection.handle(HandlerCollection.java:114)
>        at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:139)
>        at org.mortbay.jetty.Server.handle(Server.java:285)
>        at org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:502)
>        at org.mortbay.jetty.HttpConnection$RequestHandler.headerComplete(HttpConnection.java:821)
>        at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:513)
>        at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:208)
>        at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:378)
>        at org.mortbay.jetty.bio.SocketConnector$Connection.run(SocketConnector.java:226)
>        at org.mortbay.thread.BoundedThreadPool$PoolThread.run(BoundedThreadPool.java:442)
> *********************************
> 


RE: how to deal with virtual collection in solr?

Posted by "Ma, Xiaohui (NIH/NLM/LHC) [C]" <xi...@mail.nlm.nih.gov>.
Thank you, Jan. Unfortunately, I got the following exception when I used http://localhost:8983/solr/aapublic/select?&shards=localhost:8983/solr/aaprivate,localhost:8983/solr/aapublic/ .

*********************************
Aug 31, 2010 4:54:42 PM org.apache.solr.common.SolrException log
SEVERE: java.lang.NullPointerException
        at java.io.StringReader.<init>(StringReader.java:33)
        at org.apache.lucene.queryParser.QueryParser.parse(QueryParser.java:197)
        at org.apache.solr.search.LuceneQParser.parse(LuceneQParserPlugin.java:78)
        at org.apache.solr.search.QParser.getQuery(QParser.java:131)
        at org.apache.solr.handler.component.QueryComponent.prepare(QueryComponent.java:89)
        at org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:174)
        at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
        at org.apache.solr.core.SolrCore.execute(SolrCore.java:1316)
        at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:338)
        at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:241)
        at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1089)
        at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:365)
        at org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
        at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:181)
        at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:712)
        at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:405)
        at org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:211)
        at org.mortbay.jetty.handler.HandlerCollection.handle(HandlerCollection.java:114)
        at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:139)
        at org.mortbay.jetty.Server.handle(Server.java:285)
        at org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:502)
        at org.mortbay.jetty.HttpConnection$RequestHandler.headerComplete(HttpConnection.java:821)
        at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:513)
        at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:208)
        at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:378)
        at org.mortbay.jetty.bio.SocketConnector$Connection.run(SocketConnector.java:226)
        at org.mortbay.thread.BoundedThreadPool$PoolThread.run(BoundedThreadPool.java:442)
*********************************

-----Original Message-----
From: Jan Høydahl / Cominvent [mailto:jan.asf@cominvent.com] 
Sent: Tuesday, August 31, 2010 2:15 PM
To: solr-user@lucene.apache.org
Subject: Re: how to deal with virtual collection in solr?

Hi,

If you have multiple cores defined in your solr.xml you need to issue your queries to one of the cores. Below it seems as if you are lacking core name. Try instead:

	http://localhost:8983/solr/aapublic/select?&shards=localhost:8983/solr/aaprivate,localhost:8983/solr/aapublic/

And as Lance pointed out, make sure your XML files conform to the Solr XML format (http://wiki.apache.org/solr/UpdateXmlMessages).

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com
Training in Europe - www.solrtraining.com

On 27. aug. 2010, at 15.04, Ma, Xiaohui (NIH/NLM/LHC) [C] wrote:

> Thank you, Jan Høydahl. 
> 
> I used http://localhost:8983/solr/select?&shards=localhost:8983/solr/aaprivate,localhost:8983/solr/aapublic/ and got the error "Missing solr core name in path". I have aapublic and aaprivate cores. I also got an error when I used http://localhost:8983/solr/aapublic/select?&shards=localhost:8983/solr/aaprivate,localhost:8983/solr/aapublic/ : a "java.lang.NullPointerException".
> 
> My collections are xml files. Please let me know if I can use the following approach, which you suggested.
> curl "http://localhost:8983/solr/update/extract?literal.collection=aaprivate&literal.id=doc1&commit=true" -F "file=@myfile.xml"
> 
> Thanks so much as always!
> Xiaohui 
> 
> 
> -----Original Message-----
> From: Jan Høydahl / Cominvent [mailto:jan.asf@cominvent.com] 
> Sent: Friday, August 27, 2010 7:42 AM
> To: solr-user@lucene.apache.org
> Subject: Re: how to deal with virtual collection in solr?
> 
> Hi,
> 
> Version 1.4.1 does not support the SolrCloud style sharding. In 1.4.1, please use this style:
> &shards=localhost:8983/solr/aaprivate,localhost:8983/solr/aapublic/
> 
> 
> However, since the schema is the same, I'd opt for one index with a "collection" field as the filter.
> 
> You can add that field to your schema, and then inject it as metadata on the ExtractingRequestHandler call:
> 
> curl "http://localhost:8983/solr/update/extract?literal.collection=aaprivate&literal.id=doc1&commit=true" -F "file=@myfile.pdf"
> 
> --
> Jan Høydahl, search solution architect
> Cominvent AS - www.cominvent.com
> Training in Europe - www.solrtraining.com
> 


Re: how to deal with virtual collection in solr?

Posted by Jan Høydahl / Cominvent <ja...@cominvent.com>.
Hi,

If you have multiple cores defined in your solr.xml, you need to issue your queries to one of the cores. In your request below, the core name seems to be missing. Try this instead:

	http://localhost:8983/solr/aapublic/select?&shards=localhost:8983/solr/aaprivate,localhost:8983/solr/aapublic/

And as Lance pointed out, make sure your XML files conform to the Solr XML format (http://wiki.apache.org/solr/UpdateXmlMessages).

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com
Training in Europe - www.solrtraining.com
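
As an illustration of the Solr XML format mentioned above, here is a minimal sketch that writes one document in that format and prints the curl command for posting it to a single core. It assumes the aaprivate core from this thread and a schema that defines the "id" and "collection" fields; the file name, variable names and field values are made up for the example.

```shell
#!/bin/sh
# Write a small document in Solr's XML update format.
# The fields "id" and "collection" are assumed to exist in schema.xml;
# the values are illustrative only.
DOC_FILE="${TMPDIR:-/tmp}/aaprivate-doc1.xml"
cat > "$DOC_FILE" <<'EOF'
<add>
  <doc>
    <field name="id">doc1</field>
    <field name="collection">aaprivate</field>
  </doc>
</add>
EOF

# With multiple cores, the core name is part of the update path as well.
UPDATE_URL="http://localhost:8983/solr/aaprivate/update?commit=true"

# Print the command instead of running it, so the script works without a server.
echo "curl \"$UPDATE_URL\" -H 'Content-Type: text/xml' --data-binary @\"$DOC_FILE\""
```

Files that are not already in this <add><doc>...</doc></add> shape would need to be converted first, or sent through a different handler.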

On 27. aug. 2010, at 15.04, Ma, Xiaohui (NIH/NLM/LHC) [C] wrote:

> Thank you, Jan Høydahl. 
> 
> I used http://localhost:8983/solr/select?&shards=localhost:8983/solr/aaprivate,localhost:8983/solr/aapublic/ and got the error "Missing solr core name in path". I have aapublic and aaprivate cores. I also got an error when I used http://localhost:8983/solr/aapublic/select?&shards=localhost:8983/solr/aaprivate,localhost:8983/solr/aapublic/ : a "java.lang.NullPointerException".
> 
> My collections are xml files. Please let me know if I can use the following approach, which you suggested.
> curl "http://localhost:8983/solr/update/extract?literal.collection=aaprivate&literal.id=doc1&commit=true" -F "file=@myfile.xml"
> 
> Thanks so much as always!
> Xiaohui 
> 
> 


RE: how to deal with virtual collection in solr?

Posted by "Ma, Xiaohui (NIH/NLM/LHC) [C]" <xi...@mail.nlm.nih.gov>.
Thank you, Jan Høydahl. 

I used http://localhost:8983/solr/select?&shards=localhost:8983/solr/aaprivate,localhost:8983/solr/aapublic/ and got the error "Missing solr core name in path". I have aapublic and aaprivate cores. I also got an error when I used http://localhost:8983/solr/aapublic/select?&shards=localhost:8983/solr/aaprivate,localhost:8983/solr/aapublic/. This time it was a "java.lang.NullPointerException".

My collections are XML files. Please let me know if I can use the approach you suggested, like this:
curl "http://localhost:8983/solr/update/extract?literal.collection=aaprivate&literal.id=doc1&commit=true" -F "file=@myfile.xml"

Thanks so much as always!
Xiaohui 


-----Original Message-----
From: Jan Høydahl / Cominvent [mailto:jan.asf@cominvent.com] 
Sent: Friday, August 27, 2010 7:42 AM
To: solr-user@lucene.apache.org
Subject: Re: how to deal with virtual collection in solr?

Hi,

Version 1.4.1 does not support the SolrCloud style sharding. In 1.4.1, please use this style:
&shards=localhost:8983/solr/aaprivate,localhost:8983/solr/aapublic/


However, since schema is the same, I'd opt for one index with a "collections" field as the filter.

You can add that field to your schema, and then inject it as metadata on the ExtractingRequestHandler call:

curl "http://localhost:8983/solr/update/extract?literal.collection=aaprivate&literal.id=doc1&commit=true" -F "file=@myfile.pdf"

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com
Training in Europe - www.solrtraining.com



RE: how to deal with virtual collection in solr?

Posted by "Ma, Xiaohui (NIH/NLM/LHC) [C]" <xi...@mail.nlm.nih.gov>.
Thanks so much, I really appreciate your help!
Have a great weekend!
Xiaohui 

-----Original Message-----
From: Jan Høydahl / Cominvent [mailto:jan.asf@cominvent.com] 
Sent: Friday, August 27, 2010 7:42 AM
To: solr-user@lucene.apache.org
Subject: Re: how to deal with virtual collection in solr?

Hi,

Version 1.4.1 does not support the SolrCloud style sharding. In 1.4.1, please use this style:
&shards=localhost:8983/solr/aaprivate,localhost:8983/solr/aapublic/


However, since schema is the same, I'd opt for one index with a "collections" field as the filter.

You can add that field to your schema, and then inject it as metadata on the ExtractingRequestHandler call:

curl "http://localhost:8983/solr/update/extract?literal.collection=aaprivate&literal.id=doc1&commit=true" -F "file=@myfile.pdf"

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com
Training in Europe - www.solrtraining.com



Re: how to deal with virtual collection in solr?

Posted by Jan Høydahl / Cominvent <ja...@cominvent.com>.
Hi,

Version 1.4.1 does not support SolrCloud-style sharding. In 1.4.1, please use this style:
&shards=localhost:8983/solr/aaprivate,localhost:8983/solr/aapublic/


However, since the schema is the same, I'd opt for one index with a "collection" field used as the filter.

You can add that field to your schema, and then inject it as metadata on the ExtractingRequestHandler call:

curl "http://localhost:8983/solr/update/extract?literal.collection=aaprivate&literal.id=doc1&commit=true" -F "file=@myfile.pdf"
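As a rough sketch of the single-index approach (Python, not part of the original exchange), the two URLs involved could be built as below: one that tags a document with its collection at index time, and one that restricts a search to that collection with a filter query of the form fq=collection:AA. The helper names build_extract_url and build_filtered_query are hypothetical, and the q=*:* value is an assumption. The "collection" field is assumed to have been added to the schema already.

```python
from urllib.parse import urlencode


def build_extract_url(base, collection, doc_id):
    """URL for the extract handler, injecting the collection as a literal field."""
    params = urlencode({
        "literal.collection": collection,  # stored in the "collection" field
        "literal.id": doc_id,              # unique key of the document
        "commit": "true",
    })
    return f"{base}/update/extract?{params}"


def build_filtered_query(base, query, collection):
    """Select URL that limits results to one collection through fq."""
    # Keep ':' and '*' readable instead of percent-encoding them.
    params = urlencode({"q": query, "fq": f"collection:{collection}"}, safe=":*")
    return f"{base}/select?{params}"


base = "http://localhost:8983/solr"
extract_url = build_extract_url(base, "aaprivate", "doc1")
select_url = build_filtered_query(base, "*:*", "aaprivate")
print(extract_url)
print(select_url)
```

The first URL matches the one used in the curl command above; the file itself would still be posted with -F "file=@myfile.pdf".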

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com
Training in Europe - www.solrtraining.com

On 26. aug. 2010, at 20.41, Ma, Xiaohui (NIH/NLM/LHC) [C] wrote:

> Thanks so much for your help! I will try it.
> 
> 
> -----Original Message-----
> From: Thomas Joiner [mailto:thomas.b.joiner@gmail.com] 
> Sent: Thursday, August 26, 2010 2:36 PM
> To: solr-user@lucene.apache.org
> Subject: Re: how to deal with virtual collection in solr?
> 
> I don't know about the shards, etc.
> 
> However I recently encountered that exception while indexing pdfs as well.
> The way that I resolved it was to upgrade to a nightly build of Solr. (You
> can find them https://hudson.apache.org/hudson/view/Solr/job/Solr-trunk/).
> 
> The problem is that the version of Tika that 1.4.1 using is a very old
> version of Tika, which uses a old version of PDFBox to do its parsing.  (You
> might be able to fix the problem just by replacing the Tika jars...however I
> don't know if there have been any API changes so I can't really suggest
> that.)
> 
> We didn't upgrade to trunk in order for that functionality, but it was nice
> that it started working. (The PDFs we'll be indexing won't be of later
> versions, but a test file was).
> 
> On Thu, Aug 26, 2010 at 1:27 PM, Ma, Xiaohui (NIH/NLM/LHC) [C] <
> xiaohui@mail.nlm.nih.gov> wrote:
> 
>> Thanks so much for your help, Jan Høydahl!
>> 
>> I made multiple cores (aa public, aa private, bb public and bb private). I
>> knew how to query them individually. Please tell me if I can do a
>> combinations through shards parameter now. If yes, I tried to append
>> &shards=aapub,bbpub after query string. Unfortunately it didn't work.
>> 
>> Actually all of content is the same. I don't have "collection" field in xml
>> files. Please tell me how I can set a "collection" field in schema and
>> simply search collection through filter.
>> 
>> I used curl to index pdf files. I use Solr 1.4.1. I got the following error
>> when I index pdf with version 1.5 and 1.6.
>> 
>> *************************************
>> <html>
>> <head>
>> <meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1"/>
>> <title>Error 500 </title>
>> </head>
>> <body><h2>HTTP ERROR: 500</h2><pre>org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.pdf.PDFParser@134ae32
>> 
>> org.apache.solr.common.SolrException: org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.pdf.PDFParser@134ae32
>>       at org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:211)
>>       at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:54)
>>       at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
>>       at org.apache.solr.core.SolrCore.execute(SolrCore.java:1316)
>>       at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:338)
>>       at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:241)
>>       at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1089)
>>       at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:365)
>>       at org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
>>       at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:181)
>>       at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:712)
>>       at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:405)
>>       at org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:211)
>>       at org.mortbay.jetty.handler.HandlerCollection.handle(HandlerCollection.java:114)
>>       at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:139)
>>       at org.mortbay.jetty.Server.handle(Server.java:285)
>>       at org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:502)
>>       at org.mortbay.jetty.HttpConnection$RequestHandler.content(HttpConnection.java:835)
>>       at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:641)
>>       at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:202)
>>       at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:378)
>>       at org.mortbay.jetty.bio.SocketConnector$Connection.run(SocketConnector.java:226)
>>       at org.mortbay.thread.BoundedThreadPool$PoolThread.run(BoundedThreadPool.java:442)
>> Caused by: org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.pdf.PDFParser@134ae32
>>       at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:121)
>>       at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:105)
>>       at org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:190)
>>       ... 22 more
>> Caused by: java.lang.NullPointerException
>>       at org.pdfbox.pdmodel.PDPageNode.getAllKids(PDPageNode.java:194)
>>       at org.pdfbox.pdmodel.PDPageNode.getAllKids(PDPageNode.java:182)
>>       at org.pdfbox.pdmodel.PDDocumentCatalog.getAllPages(PDDocumentCatalog.java:226)
>>       at org.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:216)
>>       at org.pdfbox.util.PDFTextStripper.getText(PDFTextStripper.java:149)
>>       at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:53)
>>       at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:51)
>>       at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:119)
>>       ... 24 more
>> </pre>
>> <p>RequestURI=/solr/lhcpdf/update/extract</p><p><i><small><a href="http://jetty.mortbay.org/">Powered by Jetty://</a></small></i></p><br/>
>> <br/>
>> ***************************************
>> 
>> 
>> -----Original Message-----
>> From: Jan Høydahl / Cominvent [mailto:jan.asf@cominvent.com]
>> Sent: Wednesday, August 25, 2010 4:34 PM
>> To: solr-user@lucene.apache.org
>> Subject: Re: how to deal with virtual collection in solr?
>> 
>>> 1. Currently we use Verity and have more than 20 collections. Each
>>> collection has an index for public items and an index for private items.
>>> So there are virtual collections which point to each collection, and a
>>> virtual collection which points to all of them. For example, we have AA
>>> and BB collections.
>>> 
>>> AA virtual collection --> (AA index for public items and AA index for
>>> private items).
>>> BB virtual collection --> (BB index for public items and BB index for
>>> private items).
>>> All virtual collection --> (AA index for public items and AA index for
>>> private items, BB index for public items and BB index for private items).
>>> 
>>> Would you please tell me what I should do for this if I use Solr?
>> 
>> There are multiple ways to solve this, depending on the nature of your
>> collections. If they have somewhat different schemas, a natural choice would
>> be to make multiple cores: AA-private, AA-public, BB-private, BB-public. You
>> can then query them individually or in combination through the shards
>> parameter. From the next Solr version you can use virtual collections in the
>> shards parameter, e.g. &shards=AA,BB etc. (See
>> http://wiki.apache.org/solr/SolrCloud#Distributed_Requests)
>> 
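A note on the shards syntax, as a minimal sketch only: in Solr 1.4.x each
entry in the shards parameter is a full host:port/path address (without the
http:// prefix) rather than a bare core name, which may be why
&shards=aapub,bbpub did not work. The host, port, core names and the helper
name below are assumptions for illustration, not taken from anyone's actual
setup.

```python
from urllib.parse import urlencode


def distributed_query_url(core_url, shard_addresses, query):
    """Build a /select URL that fans a query out over several cores.

    core_url        -- URL of the core that receives the request (assumed)
    shard_addresses -- host:port/path of every core to search, no scheme
    """
    params = {"q": query, "shards": ",".join(shard_addresses)}
    # Keep ':', '/' and ',' unescaped so the shard list stays readable.
    return core_url.rstrip("/") + "/select?" + urlencode(params, safe=":/,")


# Hypothetical addresses for the "aa public" and "bb public" cores.
public_shards = ["localhost:8983/solr/aapub", "localhost:8983/solr/bbpub"]
url = distributed_query_url("http://localhost:8983/solr/aapub", public_shards, "heart")
print(url)
```

The core that receives the request merges the results from every listed
shard, so the cores should share a compatible schema and unique key.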
>> If all your content is (roughly) the same kind of data, you could also
>> solve your virtual collection issue through a "collection" field in your
>> schema, and simply select a collection through filters: &fq=collection:AA.
>> You could even write a Search Component which translates a &collection=
>> parameter in the request into the correct filters if you want to hide this
>> implementation from the front ends.
>> 
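The filter approach could be sketched roughly as follows. This assumes the
schema has an indexed string field named "collection" and that every document
carries AA or BB in it; the mapping table and function names are hypothetical
and only illustrate how a virtual collection name might be turned into an fq
parameter on the client side.

```python
from urllib.parse import urlencode

# Hypothetical mapping from a virtual collection to the real collections it covers.
VIRTUAL_COLLECTIONS = {
    "AA": ["AA"],
    "BB": ["BB"],
    "All": ["AA", "BB"],
}


def filter_query(virtual_collection):
    """Return the fq value that restricts a search to one virtual collection."""
    members = VIRTUAL_COLLECTIONS[virtual_collection]
    return "collection:(" + " OR ".join(members) + ")"


def select_params(query, virtual_collection):
    """Return the URL-encoded query string for a /select request."""
    return urlencode({"q": query, "fq": filter_query(virtual_collection)})


print(filter_query("All"))
print(select_params("heart", "AA"))
```

A public/private split could be handled the same way with a second field and
a second fq, since Solr applies multiple fq parameters together.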
>>> 2. Our project has files in different formats that I need to index, for
>>> example XML files, PDF files and text files. Is it possible for Solr to
>>> return search results from all of them?
>> 
>> Sure. PDF and text files can be indexed through the
>> ExtractingRequestHandler. XML can be indexed through the
>> XmlUpdateRequestHandler or the DataImportHandler. Solr uses Apache Tika
>> internally to extract text from PDFs and other rich document formats.
>> 
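For the PDFs, one way to get a "collection" value onto each document is the
literal.<fieldname> request parameter of the ExtractingRequestHandler. The
sketch below only builds the URL to which the file would be posted (with curl
or any HTTP client). The core name lhcpdf comes from the error page in this
thread; the host, port, document id and field names are assumptions.

```python
from urllib.parse import urlencode


def extract_url(core_url, doc_id, collection, commit=True):
    """Build an /update/extract URL that tags the document with literal fields."""
    params = {
        "literal.id": doc_id,              # unique key of the new document
        "literal.collection": collection,  # assumed "collection" schema field
    }
    if commit:
        params["commit"] = "true"
    return core_url.rstrip("/") + "/update/extract?" + urlencode(params)


url = extract_url("http://localhost:8983/solr/lhcpdf", "doc1", "AA")
print(url)
```

The PDF itself is sent as the body of the POST request to this URL, and the
extracted text ends up in the same index as the XML documents.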
>>> 
>>> 3. I got an error when I indexed PDF files which are version 1.5 or 1.6.
>>> Would you please tell me if there is a patch to fix it?
>> 
>> How did you try to index these PDFs? What version of Solr are you using?
>> Exactly what error message did you get?
>> 
>> --
>> Jan Høydahl, search solution architect
>> Cominvent AS - www.cominvent.com
>> Training in Europe - www.solrtraining.com
>> 
>> 


RE: how to deal with virtual collection in solr?

Posted by "Ma, Xiaohui (NIH/NLM/LHC) [C]" <xi...@mail.nlm.nih.gov>.
Thanks so much for your help! I will try it.

