Posted to solr-user@lucene.apache.org by wonder <a-...@rambler.ru> on 2013/10/17 12:05:22 UTC

A few questions about solr and tika

Hello everyone! Please tell me how and where to set Tika options in 
Solr. Where is the Tika config? I want to know how I can eliminate 
response attributes I don't need (such as links or images). I am also 
interested in how I can extract and index only the metadata for several file formats.

Re: A few questions about solr and tika

Posted by pr...@policija.si.
Everything about Tika extraction is covered in the links I sent earlier. Basically, 
what you need is the following:

1) a requestHandler for Tika in solrconfig.xml
2) keep all the fields in schema.xml that are needed for Tika (they are 
marked in the example schema.xml) and set those you don't need to 
indexed=false and stored=false
3) if you want to limit the fields returned in the query response, use the query 
parameter 'fl'.
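
Step 3 can be sketched as follows. This is a minimal Python sketch that only builds the query URL and sends nothing. The host, port and core name are taken from the curl commands posted elsewhere on this list by the same user, and the field names id and fileName are assumptions based on the literal.* parameters used there.

```python
from urllib.parse import urlencode

# Assumed base URL: host, port and core name as in the curl commands in this thread.
SOLR_SELECT = "http://localhost:8085/solr/myCollection/select"

def build_select_url(query, fields):
    """Return a /select URL that asks Solr to return only the given fields."""
    params = {
        "q": query,
        "fl": ",".join(fields),  # 'fl' limits the fields returned for each document
        "wt": "json",
    }
    return SOLR_SELECT + "?" + urlencode(params)

# Hypothetical field names; use the ones defined in your own schema.xml.
url = build_select_url("*:*", ["id", "fileName"])
print(url)
```

Note that 'fl' only trims the response. Whether a field is indexed or stored is still decided by schema.xml, as in step 2.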

Primoz




From:   wonder <a-...@rambler.ru>
To:     solr-user@lucene.apache.org
Date:   17.10.2013 14:44
Subject:        Re: A few questions about solr and tika






Re: A few questions about solr and tika

Posted by wonder <a-...@rambler.ru>.
Thanks for the answer. If I don't want to store or index certain fields, I do this:
<field name="links" type="string" indexed="false" stored="false" multiValued="true"/><!-- remove unneeded Tika fields -->
<field name="link" type="string" indexed="false" stored="false" multiValued="true"/><!-- remove unneeded Tika fields -->
<field name="img" type="string" indexed="false" stored="false" multiValued="true"/><!-- remove unneeded Tika fields -->
<field name="iframe" type="string" indexed="false" stored="false" multiValued="true"/><!-- remove unneeded Tika fields -->
<field name="area" type="string" indexed="false" stored="false" multiValued="true"/><!-- remove unneeded Tika fields -->
<field name="map" type="string" indexed="false" stored="false" multiValued="true"/><!-- remove unneeded Tika fields -->
<field name="pragma" type="string" indexed="false" stored="false" multiValued="true"/><!-- remove unneeded Tika fields -->
<field name="expires" type="string" indexed="false" stored="false" multiValued="true"/><!-- remove unneeded Tika fields -->
<field name="keywords" type="string" indexed="false" stored="false" multiValued="true"/><!-- remove unneeded Tika fields -->
<field name="stream_source_info" type="string" indexed="false" stored="false" multiValued="true"/><!-- remove unneeded Tika fields -->

My other questions are still open.
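
As an alternative to declaring each unwanted field by hand, the extract handler can route fields the schema does not know to an ignored dynamic field, and the same mechanism may help with indexing only metadata. The sketch below is a rough illustration, not a tested setup: it assumes schema.xml still contains the ignored_* dynamic field from the stock Solr example schema, and it reuses the host, port and core name from the curl commands posted on this list. It only builds the request URL.

```python
from urllib.parse import urlencode

# Assumed endpoint: host, port and core name as in the curl commands in this thread.
SOLR_EXTRACT = "http://localhost:8085/solr/myCollection/update/extract"

def build_extract_url(doc_id, keep_body=False):
    """Build an /update/extract URL that discards fields the schema does not declare."""
    params = [
        ("literal.id", doc_id),
        ("lowernames", "true"),   # normalise Tika field names (lowercase, underscores)
        ("uprefix", "ignored_"),  # fields unknown to the schema get this prefix
        ("commit", "true"),
    ]
    if not keep_body:
        # Route the extracted body text to an ignored field as well, so that
        # only the metadata fields declared in the schema are kept.
        # Assumes an ignored_* dynamic field exists in schema.xml.
        params.append(("fmap.content", "ignored_content"))
    return SOLR_EXTRACT + "?" + urlencode(params)

print(build_extract_url("doc1"))
```

With this approach only the fields that should be kept need an explicit definition in schema.xml. The same parameters can also be set as defaults on the requestHandler in solrconfig.xml instead of on every request.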


17.10.2013 14:26, primoz.skale@policija.si wrote:
> Why don't you check these:
>
> - Content extraction with Apache Tika (
> http://www.youtube.com/watch?v=ifgFjAeTOws)
> - ExtractingRequestHandler (
> http://wiki.apache.org/solr/ExtractingRequestHandler)
> - Uploading Data with Solr Cell using Apache Tika (
> https://cwiki.apache.org/confluence/display/solr/Uploading+Data+with+Solr+Cell+using+Apache+Tika
> )
>
> Primož


Re: A few questions about solr and tika

Posted by pr...@policija.si.
Why don't you check these:

- Content extraction with Apache Tika (
http://www.youtube.com/watch?v=ifgFjAeTOws)
- ExtractingRequestHandler (
http://wiki.apache.org/solr/ExtractingRequestHandler)
- Uploading Data with Solr Cell using Apache Tika (
https://cwiki.apache.org/confluence/display/solr/Uploading+Data+with+Solr+Cell+using+Apache+Tika
)

Primož





Re: Solr errors

Posted by Roland Everaert <re...@gmail.com>.
I have just found this JIRA report, which could explain your problem:

https://issues.apache.org/jira/browse/SOLR-2416


Regards,

Roland.
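
If the archive contents really are not indexed by the extract handler, one possible workaround is to unpack the archive on the client and submit each file as its own document. The following is a minimal Python sketch of the unpacking part only. The helper name and the id scheme are hypothetical, and nothing here is taken from the JIRA report.

```python
import io
import zipfile

def iter_zip_documents(zip_bytes, id_prefix):
    """Yield (doc_id, file_name, data) for every regular file in a zip archive."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue  # directories have no content to extract
            doc_id = f"{id_prefix}/{info.filename}"  # hypothetical id scheme
            yield doc_id, info.filename, archive.read(info)

# Build a small archive in memory to demonstrate the helper.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as demo:
    demo.writestr("a.txt", "first file")
    demo.writestr("docs/b.txt", "second file")

for doc_id, name, data in iter_zip_documents(buffer.getvalue(), "archive1"):
    print(doc_id, name, len(data))
```

Each yielded file could then be posted to /update/extract in a separate request, in the same way the curl commands in this thread post a single file.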



On Thu, Oct 17, 2013 at 3:30 PM, wonder <a-...@rambler.ru> wrote:

> Thanks for the answer. Yes, Tika extracts the archive, but the content is not
> indexed. Here is the Solr response:
> ...
> "content": [ " 9118_xmessengereu_v18ximpda.jar dimonvideo.ru.txt " ],
> ...
> None of these files are in the index.
> Any ideas?
> 17.10.2013 17:20, Roland Everaert wrote:
>
>> Even though I haven't tested it myself, you can use Tika: it can extract
>> documents from zip archives and index them, but of course it depends on the
>> file type in the archive.
>>
>
>
>

Re: Solr errors

Posted by wonder <a-...@rambler.ru>.
Thanks for the answer. Yes, Tika extracts the archive, but the content is not 
indexed. Here is the Solr response:
...
"content": [ " 9118_xmessengereu_v18ximpda.jar dimonvideo.ru.txt " ],
...
None of these files are in the index.
Any ideas?
17.10.2013 17:20, Roland Everaert wrote:
> Even though I haven't tested it myself, you can use Tika: it can extract
> documents from zip archives and index them, but of course it depends on the
> file type in the archive.


Re: Solr errors

Posted by Roland Everaert <re...@gmail.com>.
Even though I haven't tested it myself, you can use Tika: it can extract
documents from zip archives and index them, but of course it depends on the
file type in the archive.

Regards,


Roland.


On Thu, Oct 17, 2013 at 2:36 PM, wonder <a-...@rambler.ru> wrote:

> Does anybody know how to index files in zip archives?
>
>

Re: Solr errors

Posted by wonder <a-...@rambler.ru>.
Does anybody know how to index files in zip archives?


Solr errors

Posted by wonder <a-...@rambler.ru>.
Hello everyone! Please tell me why Solr freezes when I add this file:
http://yadi.sk/d/dy-RtcHXB7KZU
No response comes back from the server.
curl "http://localhost:8085/solr/myCollection/update/extract?literal.id=doc1&literal.fileName=as&uprefix=attr_&&commit=true" \
  -F "myfile=@/media/PENDRIVE/Out/www-http/159/8696_6_5_5535.mp3"

Second question:
When I add this file
http://yadi.sk/d/OpLW2JTTB7Ms4
Solr returns:
wonder@wonder:~$ curl "http://localhost:8085/solr/myCollection/update/extract?literal.id=doc1&literal.fileName=as&uprefix=attr_&&commit=true" \
  -F "myfile=@/media/PENDRIVE/Out/www-http/152/8696_6_5_5528.jpeg"

<?xml version="1.0" encoding="UTF-8"?>
<response>
<lst name="error"><str name="msg">java.lang.NoClassDefFoundError: com/adobe/xmp/XMPException</str><str name="trace">java.lang.RuntimeException: java.lang.NoClassDefFoundError: com/adobe/xmp/XMPException
     at org.apache.solr.servlet.SolrDispatchFilter.sendError(SolrDispatchFilter.java:673)
     at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:383)
     at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:158)
     at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1489)
     at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:517)
     at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:138)
     at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:540)
     at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:213)
     at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1097)
     at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:446)
     at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:175)
     at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1031)
     at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:136)
     at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:200)
     at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:109)
     at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:97)
     at org.eclipse.jetty.rewrite.handler.RewriteHandler.handle(RewriteHandler.java:317)
     at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:97)
     at org.eclipse.jetty.server.Server.handle(Server.java:445)
     at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:269)
     at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:229)
     at org.eclipse.jetty.io.AbstractConnection$ReadCallback.run(AbstractConnection.java:358)
     at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:601)
     at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:532)
     at java.lang.Thread.run(Thread.java:722)
Caused by: java.lang.NoClassDefFoundError: com/adobe/xmp/XMPException
     at com.drew.imaging.jpeg.JpegMetadataReader.extractMetadataFromJpegSegmentReader(JpegMetadataReader.java:112)
     at com.drew.imaging.jpeg.JpegMetadataReader.readMetadata(JpegMetadataReader.java:71)
     at org.apache.tika.parser.image.ImageMetadataExtractor.parseJpeg(ImageMetadataExtractor.java:91)
     at org.apache.tika.parser.jpeg.JpegParser.parse(JpegParser.java:56)
     at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
     at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
     at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
     at org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:219)
     at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)
     at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
     at org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.handleRequest(RequestHandlers.java:241)
     at org.apache.solr.core.SolrCore.execute(SolrCore.java:1904)
     at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:659)
     at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:362)
     ... 23 more
Caused by: java.lang.ClassNotFoundException: com.adobe.xmp.XMPException
     at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
     at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
     at java.security.AccessController.doPrivileged(Native Method)
     at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
     at java.lang.ClassLoader.loadClass(ClassLoader.java:423)
     at java.net.FactoryURLClassLoader.loadClass(URLClassLoader.java:789)
     at java.lang.ClassLoader.loadClass(ClassLoader.java:356)
     ... 37 more
</str><int name="code">500</int></lst>
</response>
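
The trace above ends in a ClassNotFoundException for com.adobe.xmp.XMPException, which suggests that no jar on Solr's classpath provides that class. It normally ships in Adobe's xmpcore jar, a dependency of the metadata-extractor library (the com.drew packages in the trace). The Python sketch below is one way to check which jars, if any, contain the class. The directory name is only a guess at where the extraction libraries live and should be replaced with the lib directories your solrconfig.xml actually loads.

```python
import os
import zipfile

def find_jars_with_class(lib_dir, class_name):
    """Return the jar files under lib_dir that contain the given class."""
    entry = class_name.replace(".", "/") + ".class"
    matches = []
    for root, _dirs, files in os.walk(lib_dir):
        for file_name in sorted(files):
            if not file_name.endswith(".jar"):
                continue
            path = os.path.join(root, file_name)
            try:
                with zipfile.ZipFile(path) as jar:
                    if entry in jar.namelist():
                        matches.append(path)
            except zipfile.BadZipFile:
                continue  # skip corrupt or non-zip files
    return matches

if __name__ == "__main__":
    # Hypothetical location; adjust to your installation.
    lib_dir = "contrib/extraction/lib"
    found = find_jars_with_class(lib_dir, "com.adobe.xmp.XMPException")
    print(found or "no jar provides com.adobe.xmp.XMPException")
```

If the list comes back empty, adding a matching xmpcore jar next to the other extraction libraries and restarting Solr is the usual remedy for this kind of NoClassDefFoundError.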