Posted to solr-user@lucene.apache.org by Rok Rejc <ro...@gmail.com> on 2011/04/04 09:06:02 UTC

Distributed search component.

Hi all,

I am trying to create a distributed search component in Solr, which is quite
difficult (at least for me, because I am new to Solr and Java). I have
looked into the Solr source (FacetComponent, TermsComponent...) and created
my own search component (it extends SearchComponent), but I still have two
questions (for now):

1.) In the prepare method I have the following code:

        String shards = params.get(ShardParams.SHARDS);
        if (shards != null) {
            List<String> lst = StrUtils.splitSmart(shards, ",", true);
            rb.shards = lst.toArray(new String[lst.size()]);
            rb.isDistrib = true;
        }

If I remove the "rb.isDistrib = true;" line, the distributed methods are not
called. But to set isDistrib, my code must be in the
"org.apache.solr.handler.component" package (because the field is not visible
from outside it). Is this the correct procedure/behaviour/design?

2.) The methods (process, distributedProcess, handleResponses...) are all
called properly. I can read the partial responses in handleResponses, but I
don't know how to build the "final" response. I see that, for example,
TermsComponent has a helper in the ResponseBuilder which collects all the
terms. Is this the only way (to edit the ResponseBuilder source), or can I
achieve that without editing Solr's source?

Many thanks,

Rok
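
For question 2, one way to picture the merge step is sketched below. This is plain Java with no Solr classes, so it is an illustration of the idea rather than Solr's actual mechanism: the per-shard maps stand in for whatever handleResponses reads out of each shard response, and the names ShardMergeSketch, mergeShard and mergeAll are made up for the example. Where a component keeps this accumulator between calls, and how it writes the totals into the final response, depends on the Solr version and is not shown.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class ShardMergeSketch {
    /** Adds one shard's term counts into the running totals. */
    static void mergeShard(Map<String, Long> totals, Map<String, Long> shardCounts) {
        for (Map.Entry<String, Long> e : shardCounts.entrySet()) {
            // Sum counts for terms seen on more than one shard.
            totals.merge(e.getKey(), e.getValue(), Long::sum);
        }
    }

    /** Merges all partial (per-shard) results into one sorted map. */
    static Map<String, Long> mergeAll(List<Map<String, Long>> shardResponses) {
        Map<String, Long> totals = new TreeMap<>();
        for (Map<String, Long> shard : shardResponses) {
            mergeShard(totals, shard);
        }
        return totals;
    }

    public static void main(String[] args) {
        // Hypothetical partial results from two shards.
        Map<String, Long> shard1 = new TreeMap<>();
        shard1.put("apple", 2L);
        shard1.put("pear", 1L);
        Map<String, Long> shard2 = new TreeMap<>();
        shard2.put("apple", 3L);

        System.out.println(mergeAll(Arrays.asList(shard1, shard2)));
    }
}
```

Keeping the accumulator in a class of your own, rather than in a field added to ResponseBuilder, is the design choice the sketch is meant to suggest.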

Re: Text Only Extraction Using Solr and Tika

Posted by Emyr James <em...@sussex.ac.uk>.

Hi,
I'm not really sure how these can help with my problem. Can you give a
bit more info on this?

I think what I'm after is a fairly common request:

http://lucene.472066.n3.nabble.com/Controlling-Tika-s-metadata-td2378677.html
http://lucene.472066.n3.nabble.com/Select-tika-output-for-extract-only-td499059.html#a499062

Did the change that Yonik Seeley mentions, to allow more control over the
output, ever make it into 1.4?

Regards,
Emyr


Re: Text Only Extraction Using Solr and Tika

Posted by Anuj Kumar <an...@gmail.com>.

Hi Emyr,

You can try the XPath based approach and see if that works. Also, see if
dynamic fields can help you for the meta data fields.

References-
http://wiki.apache.org/solr/SchemaXml#Dynamic_fields
http://wiki.apache.org/solr/ExtractingRequestHandler#Input_Parameters
http://wiki.apache.org/solr/TikaExtractOnlyExampleOutput

Regards,
Anuj


Re: Text Only Extraction Using Solr and Tika

Posted by Emyr James <em...@sussex.ac.uk>.

Thanks for the suggestion, but surely there must be a better way to do it
than that?
I don't want to post the whole file up, get it extracted on the server,
send the extracted text back to the client, and then send it all back up to
the server again as plain text.


Re: Text Only Extraction Using Solr and Tika

Posted by Jay Luker <lb...@reallywow.com>.

Hi Emyr,

You could try using the "extractOnly=true" parameter [1]. Of course,
you'll need to repost the extracted text manually.

--jay

[1] http://wiki.apache.org/solr/ExtractingRequestHandler#Extract_Only



Re: Text Only Extraction Using Solr and Tika

Posted by "Ramirez, Paul M (388J)" <pa...@jpl.nasa.gov>.

Hey Emyr,

Looking at your stack trace below, my guess is that you have two conflicting Apache POI jars in your classpath. The odd stack trace is indicative of that, as the class loader is likely loading some other version of the DirectoryNode class that doesn't have the iterator method.

> java.lang.NoSuchMethodError: 
> org.apache.poi.poifs.filesystem.DirectoryNode.iterator()Ljava/util/Iterator;
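
One way to check this kind of conflict, as a rough sketch: ask the JVM which location a class was loaded from and whether it has the method in question. The class name ClassOrigin and its two helpers are made up for this example; it only uses standard reflection calls. It is only meaningful when run with the same classpath as the Solr webapp (how to arrange that is left open here), since a standalone run will simply report that the POI class is not found.

```java
import java.security.CodeSource;

class ClassOrigin {
    /** Returns where the named class was loaded from, or a note if that is unknown. */
    static String locate(String className) {
        try {
            Class<?> c = Class.forName(className);
            CodeSource src = c.getProtectionDomain().getCodeSource();
            // Classes loaded by the bootstrap loader typically have no code source.
            return (src == null || src.getLocation() == null)
                    ? "(no code source, e.g. a JDK class)"
                    : src.getLocation().toString();
        } catch (ClassNotFoundException e) {
            return "(class not found on the classpath)";
        }
    }

    /** True if the class has a public no-argument method with the given name. */
    static boolean hasNoArgMethod(String className, String methodName) {
        try {
            Class.forName(className).getMethod(methodName);
            return true;
        } catch (ClassNotFoundException | NoSuchMethodException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Defaults to the class named in the stack trace.
        String name = args.length > 0 ? args[0]
                : "org.apache.poi.poifs.filesystem.DirectoryNode";
        System.out.println(name + " loaded from: " + locate(name));
        System.out.println("has iterator(): " + hasNoArgMethod(name, "iterator"));
    }
}
```

If the reported location is an older POI jar than the one Tika expects, removing the stale jar from the classpath would be the thing to try.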

Thanks,
Paul Ramirez


On May 5, 2011, at 6:36 AM, Emyr James wrote:

> Hi All,
> 
> I have solr and tika installed and am happily extracting and indexing 
> various files.
> Unfortunately on some word documents it blows up since it tries to 
> auto-generate a 'title' field but my title field in the schema is single 
> valued.
> 
> Here is my config for the extract handler...
> 
> <requestHandler name="/update/extract" 
> class="org.apache.solr.handler.extraction.ExtractingRequestHandler">
> <lst name="defaults">
> <str name="uprefix">ignored_</str>
> </lst>
> </requestHandler>
> 
> Is there a config option to make it only extract text, or ideally to 
> allow me to specify which metadata fields to accept ?
> 
> E.g. I'd like to use any author metadata it finds but to not use any 
> title metadata it finds as I want title to be single valued and set 
> explicitly using a literal.title in the post request.
> 
> I did look around for some docs but all i can find are very basic 
> examples. there's no comprehensive configuration documentation out there 
> as far as I can tell.
> 
> 
> ALSO...
> 
> I get some other bad responses coming back such as...
> 
> <html><head><title>Apache Tomcat/6.0.28 - Error 
> report</title><style><!--H1 
> {font-family:Tahoma,Arial,sans-serif;color:white;background-color:#525D76;font-size:22px;} 
> H2 {font-family:Tahoma,Arial,sans-serif;color:white;background-color:#
> 525D76;font-size:16px;} H3 
> {font-family:Tahoma,Arial,sans-serif;color:white;background-color:#525D76;font-size:14px;} 
> BODY 
> {font-family:Tahoma,Arial,sans-serif;color:black;background-color:white;} B 
> {font-family:Tahoma,Arial,sans-serif;c
> olor:white;background-color:#525D76;} P 
> {font-family:Tahoma,Arial,sans-serif;background:white;color:black;font-size:12px;}A 
> {color : black;}A.name {color : black;}HR {color : #525D76;}--></style> 
> </head><body><h1>HTTP Status 500 - org.ap
> ache.poi.poifs.filesystem.DirectoryNode.iterator()Ljava/util/Iterator;
> 
> java.lang.NoSuchMethodError: 
> org.apache.poi.poifs.filesystem.DirectoryNode.iterator()Ljava/util/Iterator;
>     at 
> org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:168)
>     at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:197)
>     at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:197)
>     at 
> org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:135)
>     at 
> org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:148)
>     at 
> org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:190)
>     at 
> org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:54)
>     at 
> org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
>     at 
> org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.handleRequest(RequestHandlers.java:233)
>     at org.apache.solr.core.SolrCore.execute(SolrCore.java:1316)
>     at 
> org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:338)
>     at 
> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:241)
>     at 
> org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
>     at 
> org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
>     at 
> org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:233)
>     at 
> org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:191)
>     at 
> org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:127)
>     at 
> org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102)
>     at 
> org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
>     at 
> org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:298)
>     at 
> org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:857)
>     at 
> org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.process(Http11Protocol.java:588)
>     at 
> org.apache.tomcat.util.net.JIoEndpoint$Worker.run(JIoEndpoint.java:489)
>     at java.lang.Thread.run(Thread.java:636)
> </h1><HR size="1" noshade="noshade"><p><b>type</b> Status 
> report</p><p><b>message</b> 
> <u>org.apache.poi.poifs.filesystem.DirectoryNode.iterator()Ljava/util/Iterator;
> 
> For the above my url was...
> 
>  http://localhost:8080/solr/update/extract?literal.id=3922&defaultField=content&fmap.content=content&uprefix=ignored_&stream.contentType=application%2Fvnd.ms-powerpoint&commit=true&literal.title=Reactor+cycle+141&literal.not
> es=&literal.tag=UCN_production&literal.author=Maurits+van+der+Grinten
> 
> I guess there's something special I need to be able to process power 
> point files ? Maybe I need to get the latest apache POI ? Any 
> suggestions welcome...
> 
> 
> Regards,
> 
> Emyr
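
The extract URL quoted above can also be assembled programmatically, so that values such as the content type and the title are encoded consistently. The sketch below uses only java.net.URLEncoder; the class name ExtractUrlSketch and its build method are made up for illustration, the parameter names and values are the ones from the quoted URL, and "literal.notes" assumes the line break in the quoted URL fell inside that name.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.LinkedHashMap;
import java.util.Map;

class ExtractUrlSketch {
    /** Appends the parameters to the base URL as an encoded query string. */
    static String build(String base, Map<String, String> params) {
        StringBuilder sb = new StringBuilder(base);
        char sep = '?';
        for (Map.Entry<String, String> e : params.entrySet()) {
            sb.append(sep).append(encode(e.getKey())).append('=').append(encode(e.getValue()));
            sep = '&';
        }
        return sb.toString();
    }

    /** Form-encodes a value: spaces become '+', '/' becomes %2F. */
    static String encode(String s) {
        try {
            return URLEncoder.encode(s, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always supported", e);
        }
    }

    public static void main(String[] args) {
        // LinkedHashMap keeps the parameters in insertion order.
        Map<String, String> params = new LinkedHashMap<>();
        params.put("literal.id", "3922");
        params.put("defaultField", "content");
        params.put("fmap.content", "content");
        params.put("uprefix", "ignored_");
        params.put("stream.contentType", "application/vnd.ms-powerpoint");
        params.put("commit", "true");
        params.put("literal.title", "Reactor cycle 141");
        params.put("literal.notes", "");
        params.put("literal.tag", "UCN_production");
        params.put("literal.author", "Maurits van der Grinten");

        System.out.println(build("http://localhost:8080/solr/update/extract", params));
    }
}
```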

Text Only Extraction Using Solr and Tika

Posted by Emyr James <em...@sussex.ac.uk>.

Hi All,

I have Solr and Tika installed and am happily extracting and indexing
various files.
Unfortunately, on some Word documents it blows up, since it tries to
auto-generate a 'title' field but my title field in the schema is single
valued.

Here is my config for the extract handler...

<requestHandler name="/update/extract"
                class="org.apache.solr.handler.extraction.ExtractingRequestHandler">
  <lst name="defaults">
    <str name="uprefix">ignored_</str>
  </lst>
</requestHandler>

Is there a config option to make it extract only the text or, ideally, 
to let me specify which metadata fields to accept?

E.g. I'd like to use any author metadata it finds but ignore any title 
metadata, because I want title to be single-valued and set explicitly 
via literal.title in the POST request.

I did look around for docs, but all I can find are very basic 
examples. There's no comprehensive configuration documentation out there 
as far as I can tell.


ALSO...

I get some other bad responses coming back such as...

<html><head><title>Apache Tomcat/6.0.28 - Error 
report</title><style><!--H1 
{font-family:Tahoma,Arial,sans-serif;color:white;background-color:#525D76;font-size:22px;} 
H2 
{font-family:Tahoma,Arial,sans-serif;color:white;background-color:#525D76;font-size:16px;} 
H3 
{font-family:Tahoma,Arial,sans-serif;color:white;background-color:#525D76;font-size:14px;} 
BODY 
{font-family:Tahoma,Arial,sans-serif;color:black;background-color:white;} 
B 
{font-family:Tahoma,Arial,sans-serif;color:white;background-color:#525D76;} 
P 
{font-family:Tahoma,Arial,sans-serif;background:white;color:black;font-size:12px;}A 
{color : black;}A.name {color : black;}HR {color : #525D76;}--></style> 
</head><body><h1>HTTP Status 500 - 
org.apache.poi.poifs.filesystem.DirectoryNode.iterator()Ljava/util/Iterator;

java.lang.NoSuchMethodError: 
org.apache.poi.poifs.filesystem.DirectoryNode.iterator()Ljava/util/Iterator;
     at 
org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:168)
     at 
org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:197)
     at 
org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:197)
     at 
org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:135)
     at 
org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:148)
     at 
org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:190)
     at 
org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:54)
     at 
org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
     at 
org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.handleRequest(RequestHandlers.java:233)
     at org.apache.solr.core.SolrCore.execute(SolrCore.java:1316)
     at 
org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:338)
     at 
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:241)
     at 
org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
     at 
org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
     at 
org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:233)
     at 
org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:191)
     at 
org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:127)
     at 
org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102)
     at 
org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
     at 
org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:298)
     at 
org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:857)
     at 
org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.process(Http11Protocol.java:588)
     at 
org.apache.tomcat.util.net.JIoEndpoint$Worker.run(JIoEndpoint.java:489)
     at java.lang.Thread.run(Thread.java:636)
</h1><HR size="1" noshade="noshade"><p><b>type</b> Status 
report</p><p><b>message</b> 
<u>org.apache.poi.poifs.filesystem.DirectoryNode.iterator()Ljava/util/Iterator;

For the above my url was...

  http://localhost:8080/solr/update/extract?literal.id=3922&defaultField=content&fmap.content=content&uprefix=ignored_&stream.contentType=application%2Fvnd.ms-powerpoint&commit=true&literal.title=Reactor+cycle+141&literal.notes=&literal.tag=UCN_production&literal.author=Maurits+van+der+Grinten

I guess I need something extra to be able to process PowerPoint 
files? Maybe I need to get the latest Apache POI? Any 
suggestions welcome...
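
A NoSuchMethodError like the one above generally means that code was 
compiled against a different version of a library than the one found at 
runtime, so a POI jar that does not match what Tika expects is a 
plausible cause. Below is a minimal sketch for checking this, assuming 
it is run with the same classpath as the Solr webapp. It reports where a 
class was loaded from and whether it has a public no-arg method with the 
expected return type. ClasspathProbe and its method names are 
hypothetical and are not part of Solr, Tika, or POI.

```java
import java.lang.reflect.Method;
import java.security.CodeSource;
import java.util.Iterator;

/** Hypothetical diagnostic helper; not part of Solr, Tika, or POI. */
final class ClasspathProbe {

    /** Returns the jar or directory a class was loaded from, if known. */
    static String locationOf(Class<?> cls) {
        CodeSource src = cls.getProtectionDomain().getCodeSource();
        // Classes loaded by the bootstrap loader usually have no code source.
        return src == null ? "(no code source)" : String.valueOf(src.getLocation());
    }

    /**
     * True if the class exposes a public no-arg method with exactly this
     * name and return type. The JVM links methods by name, parameters, and
     * return type, which is why the return type is checked as well.
     */
    static boolean hasNoArgMethod(Class<?> cls, String name, Class<?> returnType) {
        for (Method m : cls.getMethods()) {
            if (m.getName().equals(name)
                    && m.getParameterCount() == 0
                    && m.getReturnType().equals(returnType)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) throws ClassNotFoundException {
        // Default class name is taken from the stack trace in this thread.
        String className = args.length > 0
                ? args[0]
                : "org.apache.poi.poifs.filesystem.DirectoryNode";
        Class<?> cls = Class.forName(className);
        System.out.println(className + " loaded from " + locationOf(cls));
        System.out.println("has Iterator iterator(): "
                + hasNoArgMethod(cls, "iterator", Iterator.class));
    }
}
```

If the reported jar is an older POI than the one the bundled Tika was 
built against, or the method check prints false, a version mismatch 
becomes the likely explanation. This only narrows things down; it does 
not say which POI version is the right one.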


Regards,

Emyr

Re: Distributed search component.

Posted by Rok Rejc <ro...@gmail.com>.

I am still fighting (after a month of doing other things) with the first
part of the problem. Any ideas?

Many thanks,
Rok

On Mon, Apr 4, 2011 at 9:06 AM, Rok Rejc <ro...@gmail.com> wrote:

> Hi all,
>
> I am trying to create a distributed search component in Solr, which is quite
> difficult (at least for me, because I am new to Solr and Java). Anyway, I
> have looked into the Solr source (FacetComponent, TermsComponent...) and created
> my own search component (it extends SearchComponent), but I still have two
> questions (for now):
>
> 1.) In the prepare method I have the following code:
>
>         String shards = params.get(ShardParams.SHARDS);
>         if (shards != null) {
>             List<String> lst = StrUtils.splitSmart(shards, ",", true);
>             rb.shards = lst.toArray(new String[lst.size()]);
>             rb.isDistrib = true;
>         }
>
> If I remove the "rb.isDistrib = true;" line, the distributed methods are not
> called. But to set isDistrib my code must be in the
> "org.apache.solr.handler.component" package (because the field is not visible
> from outside it). Is this the correct procedure/behaviour/design?
>
> 2.) The methods (process, distributedProcess, handleResponses...) are all
> called properly. I can read the partial responses in handleResponses, but I
> don't know how to build the "final" response. I see that, for example,
> TermsComponent has a helper in ResponseBuilder which collects all the
> terms. Is this the only way (editing the ResponseBuilder source), or can I
> achieve that without editing Solr's source?
>
> Many thanks,
>
> Rok
>
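
The merge step asked about in question 2 can be sketched independently 
of Solr. The example below is only an illustration of the pattern: 
accumulate each shard's partial result as it arrives (the 
handleResponses step) and emit the combined result once at the end. 
PartialCountMerger and its methods are hypothetical names and use no 
Solr classes. Where such an object would live in a real component (for 
instance somewhere request-scoped rather than in ResponseBuilder) is an 
assumption that has not been checked against the Solr version discussed 
here.

```java
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical per-request merge state; uses no Solr classes. */
final class PartialCountMerger {

    // Sorted map so the final output order is deterministic.
    private final Map<String, Long> totals = new TreeMap<>();

    /** Called once per shard response: adds that shard's counts to the totals. */
    void addShardResponse(Map<String, Long> shardCounts) {
        for (Map.Entry<String, Long> e : shardCounts.entrySet()) {
            totals.merge(e.getKey(), e.getValue(), Long::sum);
        }
    }

    /** Called once after all shards have replied: returns a copy of the totals. */
    Map<String, Long> finalResponse() {
        return new TreeMap<>(totals);
    }

    public static void main(String[] args) {
        PartialCountMerger merger = new PartialCountMerger();
        merger.addShardResponse(Map.of("solr", 3L, "tika", 1L)); // shard 1
        merger.addShardResponse(Map.of("solr", 2L, "poi", 4L));  // shard 2
        System.out.println(merger.finalResponse()); // {poi=4, solr=5, tika=1}
    }
}
```

Keeping the accumulator in a class owned by the component, rather than 
as a field on a shared builder, is the design choice that would avoid 
editing Solr's own source, provided the component has somewhere to store 
it for the duration of the request.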