Posted to solr-user@lucene.apache.org by Rok Rejc <ro...@gmail.com> on 2011/04/04 09:06:02 UTC
Distributed search component.
Hi all,
I am trying to create a distributed search component in Solr, which is quite
difficult (at least for me, because I am new to Solr and Java). Anyway, I
have looked into the Solr source (FacetComponent, TermsComponent...) and created
my own search component (it extends SearchComponent), but I still have two
questions (for now):
1.) In the prepare method I have the following code:
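    // Fragment from prepare(ResponseBuilder rb); 'params' presumably holds the request parameters.
    String shards = params.get(ShardParams.SHARDS);
    if (shards != null) {
        // Split the comma-separated shard list and mark the request as distributed.
        List<String> lst = StrUtils.splitSmart(shards, ",", true);
        rb.shards = lst.toArray(new String[lst.size()]);
        rb.isDistrib = true;
    }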
If I remove the "rb.isDistrib = true;" line, the distributed methods are not
called. But to set isDistrib, my code must be in the
"org.apache.solr.handler.component" package (because the field is not visible
from outside that package). Is this the correct procedure/behaviour/design?
2.) The functions (process, distributedProcess, handleResponses...) are all
called properly. I can read the partial responses in handleResponses, but I
don't know how to build the "final" response. I see that, for example,
TermsComponent has a helper in the ResponseBuilder which collects all the
terms. Is this the only way (editing the ResponseBuilder source), or can I
achieve that without editing Solr's source?
Many thanks,
Rok
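Editorial sketch for question 2: the merging step itself does not need a helper inside ResponseBuilder. The plain-Java class below is a hypothetical illustration (ShardMergeSketch and its method names are invented here and are not part of Solr); it only shows how per-shard term counts could be accumulated and then emitted once. In a real component the accumulator would have to be kept per request, because a SearchComponent instance is shared across requests; storing it in the request's context map and writing the result from finishStage is one possible arrangement, which is an assumption to check against your Solr version.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical accumulator for per-shard term counts; not a Solr class. */
final class ShardMergeSketch {
    // Running totals across all shard responses seen so far, sorted by term.
    private final Map<String, Long> totals = new TreeMap<>();

    /** Adds one shard's partial counts (the role handleResponses would play). */
    void addShardCounts(Map<String, Long> shardCounts) {
        for (Map.Entry<String, Long> e : shardCounts.entrySet()) {
            totals.merge(e.getKey(), e.getValue(), Long::sum);
        }
    }

    /** Returns a copy of the merged counts, to be written to the final response. */
    Map<String, Long> finalCounts() {
        return new TreeMap<>(totals);
    }

    public static void main(String[] args) {
        Map<String, Long> shard1 = new HashMap<>();
        shard1.put("solr", 3L);
        shard1.put("tika", 1L);
        Map<String, Long> shard2 = new HashMap<>();
        shard2.put("solr", 2L);
        shard2.put("lucene", 4L);

        ShardMergeSketch merger = new ShardMergeSketch();
        merger.addShardCounts(shard1);
        merger.addShardCounts(shard2);
        System.out.println(merger.finalCounts()); // prints {lucene=4, solr=5, tika=1}
    }
}
```

Keeping the accumulator in your own class means the Solr source stays untouched; only the place where the object is stored between calls depends on Solr.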
Re: Text Only Extraction Using Solr and Tika
Posted by Emyr James <em...@sussex.ac.uk>.
Hi,
I'm not really sure how these can help with my problem. Can you give a
bit more info on this?
I think what I'm after is a fairly common request:
http://lucene.472066.n3.nabble.com/Controlling-Tika-s-metadata-td2378677.html
http://lucene.472066.n3.nabble.com/Select-tika-output-for-extract-only-td499059.html#a499062
Did the change that Yonik Seeley mentions to allow more control over the
output ever make it into 1.4?
Regards,
Emyr
On 05/05/11 15:01, Anuj Kumar wrote:
> Hi Emyr,
>
> You can try the XPath based approach and see if that works. Also, see if
> dynamic fields can help you for the meta data fields.
>
> References-
> http://wiki.apache.org/solr/SchemaXml#Dynamic_fields
> http://wiki.apache.org/solr/ExtractingRequestHandler#Input_Parameters
> http://wiki.apache.org/solr/TikaExtractOnlyExampleOutput
>
> Regards,
> Anuj
Re: Text Only Extraction Using Solr and Tika
Posted by Anuj Kumar <an...@gmail.com>.
Hi Emyr,
You can try the XPath-based approach and see if that works. Also, see if
dynamic fields can help you with the metadata fields.
References-
http://wiki.apache.org/solr/SchemaXml#Dynamic_fields
http://wiki.apache.org/solr/ExtractingRequestHandler#Input_Parameters
http://wiki.apache.org/solr/TikaExtractOnlyExampleOutput
Regards,
Anuj
On Thu, May 5, 2011 at 7:28 PM, Emyr James <em...@sussex.ac.uk> wrote:
> Thanks for the suggestion but there surely must be a better way than that
> to do it ?
> I don't want to post the whole file up, get it extracted on the server,
> send the extracted text back to the client then send it all back up to the
> server again as plain text.
Re: Text Only Extraction Using Solr and Tika
Posted by Emyr James <em...@sussex.ac.uk>.
Thanks for the suggestion, but surely there must be a better way to do it
than that?
I don't want to post the whole file up, get it extracted on the server,
send the extracted text back to the client, then send it all back up to
the server again as plain text.
On 05/05/11 14:55, Jay Luker wrote:
> Hi Emyr,
>
> You could try using the "extractOnly=true" parameter [1]. Of course,
> you'll need to repost the extracted text manually.
>
> --jay
>
> [1] http://wiki.apache.org/solr/ExtractingRequestHandler#Extract_Only
Re: Text Only Extraction Using Solr and Tika
Posted by Jay Luker <lb...@reallywow.com>.
Hi Emyr,
You could try using the "extractOnly=true" parameter [1]. Of course,
you'll need to repost the extracted text manually.
--jay
[1] http://wiki.apache.org/solr/ExtractingRequestHandler#Extract_Only
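Editorial sketch of the request Jay describes: the Java below only assembles an "extractOnly=true" URL with properly encoded parameters. The host, port and path are taken from the URL in Emyr's original message; the class name ExtractOnlyUrl and its build method are hypothetical helpers invented for illustration. It does not send the request, and the second step (reposting the extracted text) is not shown.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper that builds a query URL; not part of Solr or SolrJ. */
final class ExtractOnlyUrl {
    /** Appends the parameters to the base URL, URL-encoding each key and value. */
    static String build(String base, Map<String, String> params) {
        StringBuilder sb = new StringBuilder(base);
        char sep = '?';
        for (Map.Entry<String, String> e : params.entrySet()) {
            sb.append(sep).append(encode(e.getKey())).append('=').append(encode(e.getValue()));
            sep = '&';
        }
        return sb.toString();
    }

    private static String encode(String s) {
        try {
            return URLEncoder.encode(s, "UTF-8");
        } catch (UnsupportedEncodingException ex) {
            // UTF-8 is required to be supported by every Java runtime.
            throw new IllegalStateException(ex);
        }
    }

    public static void main(String[] args) {
        // LinkedHashMap keeps the parameters in insertion order.
        Map<String, String> params = new LinkedHashMap<>();
        params.put("extractOnly", "true");
        params.put("stream.contentType", "application/vnd.ms-powerpoint");
        System.out.println(build("http://localhost:8080/solr/update/extract", params));
    }
}
```

The document itself would still have to be sent as the body of that request, and the returned text posted again in a second update, which is the round trip Emyr objects to.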
Re: Text Only Extraction Using Solr and Tika
Posted by "Ramirez, Paul M (388J)" <pa...@jpl.nasa.gov>.
Hey Emyr,
Looking at your stack trace below, my guess is that you have two conflicting Apache POI jars on your classpath. The NoSuchMethodError is indicative of that: the class loader is likely loading some other version of the DirectoryNode class that doesn't have the iterator() method.
> java.lang.NoSuchMethodError:
> org.apache.poi.poifs.filesystem.DirectoryNode.iterator()Ljava/util/Iterator;
Thanks,
Paul Ramirez
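Editorial sketch for checking Paul's guess: the small Java program below prints the location a class was loaded from, using only standard JDK calls. The class name WhichJar is made up for this example. It is only meaningful when run with the same classpath as the Solr webapp (for instance from code deployed inside it), which is an assumption about your setup; the default class name is the one from the stack trace.

```java
import java.security.CodeSource;

/** Hypothetical diagnostic: reports which jar or directory a class came from. */
final class WhichJar {
    /** Returns the code source location of the class, or a note if none is reported. */
    static String locate(Class<?> c) {
        CodeSource src = c.getProtectionDomain().getCodeSource();
        if (src == null || src.getLocation() == null) {
            return "(no code source reported, e.g. a JDK bootstrap class)";
        }
        return src.getLocation().toString();
    }

    public static void main(String[] args) throws ClassNotFoundException {
        // Pass a fully qualified class name, or fall back to the class from the stack trace.
        String name = args.length > 0 ? args[0] : "org.apache.poi.poifs.filesystem.DirectoryNode";
        System.out.println(name + " <- " + locate(Class.forName(name)));
    }
}
```

If the reported jar is not the POI version that your Tika build expects, that would support the conflicting-jar explanation.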
On May 5, 2011, at 6:36 AM, Emyr James wrote:
> Hi All,
>
> I have solr and tika installed and am happily extracting and indexing
> various files.
> Unfortunately on some word documents it blows up since it tries to
> auto-generate a 'title' field but my title field in the schema is single
> valued.
>
> Here is my config for the extract handler...
>
> <requestHandler name="/update/extract"
> class="org.apache.solr.handler.extraction.ExtractingRequestHandler">
> <lst name="defaults">
> <str name="uprefix">ignored_</str>
> </lst>
> </requestHandler>
>
> Is there a config option to make it only extract text, or ideally to
> allow me to specify which metadata fields to accept ?
>
> E.g. I'd like to use any author metadata it finds but to not use any
> title metadata it finds as I want title to be single valued and set
> explicitly using a literal.title in the post request.
>
> I did look around for some docs but all i can find are very basic
> examples. there's no comprehensive configuration documentation out there
> as far as I can tell.
>
>
> ALSO...
>
> I get some other bad responses coming back such as...
>
> <html><head><title>Apache Tomcat/6.0.28 - Error
> report</title><style><!--H1
> {font-family:Tahoma,Arial,sans-serif;color:white;background-color:#525D76;font-size:22px;}
> H2 {font-family:Tahoma,Arial,sans-serif;color:white;background-color:#
> 525D76;font-size:16px;} H3
> {font-family:Tahoma,Arial,sans-serif;color:white;background-color:#525D76;font-size:14px;}
> BODY
> {font-family:Tahoma,Arial,sans-serif;color:black;background-color:white;} B
> {font-family:Tahoma,Arial,sans-serif;c
> olor:white;background-color:#525D76;} P
> {font-family:Tahoma,Arial,sans-serif;background:white;color:black;font-size:12px;}A
> {color : black;}A.name {color : black;}HR {color : #525D76;}--></style>
> </head><body><h1>HTTP Status 500 - org.ap
> ache.poi.poifs.filesystem.DirectoryNode.iterator()Ljava/util/Iterator;
>
> java.lang.NoSuchMethodError:
> org.apache.poi.poifs.filesystem.DirectoryNode.iterator()Ljava/util/Iterator;
> at
> org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:168)
> at
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:197)
> at
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:197)
> at
> org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:135)
> at
> org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:148)
> at
> org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:190)
> at
> org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:54)
> at
> org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
> at
> org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.handleRequest(RequestHandlers.java:233)
> at org.apache.solr.core.SolrCore.execute(SolrCore.java:1316)
> at
> org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:338)
> at
> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:241)
> at
> org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
> at
> org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
> at
> org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:233)
> at
> org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:191)
> at
> org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:127)
> at
> org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102)
> at
> org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
> at
> org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:298)
> at
> org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:857)
> at
> org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.process(Http11Protocol.java:588)
> at
> org.apache.tomcat.util.net.JIoEndpoint$Worker.run(JIoEndpoint.java:489)
> at java.lang.Thread.run(Thread.java:636)
> </h1><HR size="1" noshade="noshade"><p><b>type</b> Status
> report</p><p><b>message</b>
> <u>org.apache.poi.poifs.filesystem.DirectoryNode.iterator()Ljava/util/Iterator;
>
> For the above my url was...
>
> http://localhost:8080/solr/update/extract?literal.id=3922&defaultField=content&fmap.content=content&uprefix=ignored_&stream.contentType=application%2Fvnd.ms-powerpoint&commit=true&literal.title=Reactor+cycle+141&literal.not
> es=&literal.tag=UCN_production&literal.author=Maurits+van+der+Grinten
>
> I guess there's something special I need to be able to process power
> point files ? Maybe I need to get the latest apache POI ? Any
> suggestions welcome...
>
>
> Regards,
>
> Emyr
Text Only Extraction Using Solr and Tika
Posted by Emyr James <em...@sussex.ac.uk>.
Hi All,
I have Solr and Tika installed and am happily extracting and indexing
various files.
Unfortunately, on some Word documents it blows up, since it tries to
auto-generate a 'title' field but my title field in the schema is
single-valued.
Here is my config for the extract handler...
<requestHandler name="/update/extract"
                class="org.apache.solr.handler.extraction.ExtractingRequestHandler">
  <lst name="defaults">
    <str name="uprefix">ignored_</str>
  </lst>
</requestHandler>
Is there a config option to make it extract only text, or ideally to
allow me to specify which metadata fields to accept?
E.g. I'd like to use any author metadata it finds but not use any
title metadata it finds, as I want title to be single-valued and set
explicitly using a literal.title in the post request.
I did look around for some docs, but all I can find are very basic
examples. There's no comprehensive configuration documentation out there
as far as I can tell.
ALSO...
I get some other bad responses coming back such as...
<html><head><title>Apache Tomcat/6.0.28 - Error report</title><style><!--H1
{font-family:Tahoma,Arial,sans-serif;color:white;background-color:#525D76;font-size:22px;}
H2 {font-family:Tahoma,Arial,sans-serif;color:white;background-color:#525D76;font-size:16px;}
H3 {font-family:Tahoma,Arial,sans-serif;color:white;background-color:#525D76;font-size:14px;}
BODY {font-family:Tahoma,Arial,sans-serif;color:black;background-color:white;}
B {font-family:Tahoma,Arial,sans-serif;color:white;background-color:#525D76;}
P {font-family:Tahoma,Arial,sans-serif;background:white;color:black;font-size:12px;}
A {color : black;}A.name {color : black;}HR {color : #525D76;}--></style>
</head><body><h1>HTTP Status 500 - org.apache.poi.poifs.filesystem.DirectoryNode.iterator()Ljava/util/Iterator;
java.lang.NoSuchMethodError: org.apache.poi.poifs.filesystem.DirectoryNode.iterator()Ljava/util/Iterator;
at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:168)
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:197)
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:197)
at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:135)
at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:148)
at org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:190)
at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:54)
at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
at org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.handleRequest(RequestHandlers.java:233)
at org.apache.solr.core.SolrCore.execute(SolrCore.java:1316)
at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:338)
at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:241)
at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:233)
at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:191)
at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:127)
at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102)
at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:298)
at org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:857)
at org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.process(Http11Protocol.java:588)
at org.apache.tomcat.util.net.JIoEndpoint$Worker.run(JIoEndpoint.java:489)
at java.lang.Thread.run(Thread.java:636)
</h1><HR size="1" noshade="noshade"><p><b>type</b> Status
report</p><p><b>message</b>
<u>org.apache.poi.poifs.filesystem.DirectoryNode.iterator()Ljava/util/Iterator;
For the above my url was...
http://localhost:8080/solr/update/extract?literal.id=3922&defaultField=content&fmap.content=content&uprefix=ignored_&stream.contentType=application%2Fvnd.ms-powerpoint&commit=true&literal.title=Reactor+cycle+141&literal.notes=&literal.tag=UCN_production&literal.author=Maurits+van+der+Grinten
I guess there's something special I need in order to process PowerPoint
files? Maybe I need to get the latest Apache POI? Any suggestions
welcome...
Regards,
Emyr
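A NoSuchMethodError like the one in the trace above usually means the class found at runtime comes from a different jar version than the one the caller was compiled against. The following is only a rough diagnostic sketch, not anything taken from Solr or Tika: the class name PoiClasspathCheck and the helper describe are made up for illustration. It reports which jar a class was loaded from and whether it exposes a public no-argument method with the given name. It is only meaningful if it runs with the same jars on the classpath as the Solr webapp (for example the contents of its lib directory), which is an assumption about your setup.

```java
import java.lang.reflect.Method;
import java.security.CodeSource;

public class PoiClasspathCheck {

    /**
     * Reports where className was loaded from and whether it has a public
     * no-argument method called methodName (hypothetical helper).
     */
    public static String describe(String className, String methodName) {
        try {
            Class<?> cls = Class.forName(className);
            // The code source is null for classes loaded by the bootstrap loader.
            CodeSource src = cls.getProtectionDomain().getCodeSource();
            String location = (src == null)
                    ? "bootstrap/unknown"
                    : String.valueOf(src.getLocation());
            boolean found = false;
            for (Method m : cls.getMethods()) {
                if (m.getName().equals(methodName)
                        && m.getParameterTypes().length == 0) {
                    found = true;
                }
            }
            return className + " loaded from " + location + "; "
                    + methodName + "() " + (found ? "present" : "MISSING");
        } catch (ClassNotFoundException e) {
            return className + " not found on classpath";
        }
    }

    public static void main(String[] args) {
        // The class and method named in the NoSuchMethodError above.
        System.out.println(describe(
                "org.apache.poi.poifs.filesystem.DirectoryNode", "iterator"));
    }
}
```

If the output names an unexpected jar, or more than one POI jar is on the classpath, that would point to a version mismatch between Tika and POI rather than to anything PowerPoint-specific.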
Re: Distributed search component.
Posted by Rok Rejc <ro...@gmail.com>.
I am still fighting (after a month of doing other things) with the first
part of the problem. Any ideas?
Many thanks,
Rok
On Mon, Apr 4, 2011 at 9:06 AM, Rok Rejc <ro...@gmail.com> wrote:
> Hi all,
>
> I am trying to create a distributed search component in solr which is quite
> difficult (at least for me, because I am new in solr and java). Anyway I
> have looked into solr source (FacetComponent, TermsComponent...) and created
> my own search component (it extends SearchComponent) but I still have two
> questions (for now):
>
> 1.) In the prepare method I have the following code:
>
> String shards = params.get(ShardParams.SHARDS);
> if (shards != null) {
> List<String> lst = StrUtils.splitSmart(shards, ",", true);
> rb.shards = lst.toArray(new String[lst.size()]);
> rb.isDistrib = true;
> }
>
> If I remove "rb.isDistrib = true;" line the distributed methods are not
> called. But to set the isDistrib my code must be in the
> "org.apache.solr.handler.component" package (because it is not visible from
> the outside). Is this correct procedure/behaviour/design?
>
> 2.) Functions (process, distributedProcess, handleResponses...) are all
> called properly. I can read partial responses in the handleResponses but I
> don't know how to build "final" response. I see that for example
> TermsComponent has a helper in the ResponseBuilder which collects all the
> terms. Is this the only way (to edit the ResponseBuilder source), or can I
> achieve that without editing Solr's source?
>
> Many thanks,
>
> Rok
>
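On the second question quoted above (building a "final" response from the partial ones), the merge step itself does not need anything from ResponseBuilder. Below is a minimal sketch in plain Java, assuming each shard's partial response has already been reduced to a map of term to count. ShardMergeSketch, mergeInto and mergeAll are hypothetical names, and no Solr classes are used, so this is a stand-in for the idea rather than Solr's own API.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class ShardMergeSketch {

    /** Adds one shard's term counts into the running totals. */
    public static void mergeInto(Map<String, Long> totals,
                                 Map<String, Long> shardCounts) {
        for (Map.Entry<String, Long> e : shardCounts.entrySet()) {
            Long current = totals.get(e.getKey());
            totals.put(e.getKey(),
                    current == null ? e.getValue() : current + e.getValue());
        }
    }

    /** Merges the counts of all shards, keeping first-seen term order. */
    public static Map<String, Long> mergeAll(List<Map<String, Long>> shards) {
        Map<String, Long> totals = new LinkedHashMap<String, Long>();
        for (Map<String, Long> shard : shards) {
            mergeInto(totals, shard);
        }
        return totals;
    }

    public static void main(String[] args) {
        // Two made-up shard results, purely for illustration.
        Map<String, Long> shardA = new LinkedHashMap<String, Long>();
        shardA.put("solr", 2L);
        shardA.put("tika", 1L);
        Map<String, Long> shardB = new LinkedHashMap<String, Long>();
        shardB.put("solr", 3L);
        shardB.put("lucene", 4L);

        System.out.println(mergeAll(Arrays.asList(shardA, shardB)));
    }
}
```

In a component, something like mergeInto could be called from handleResponses for each partial response, with the totals added to the response once all shards have answered (SearchComponent has a finishStage hook that looks suited to this). Where to keep the accumulator between those calls is left open here. Since a component instance is presumably shared across requests, per-request state seems safer than a field on the component, but that is an assumption to check against the Solr source.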