Posted to solr-user@lucene.apache.org by Bisonti Mario <Ma...@vimar.com> on 2018/10/11 15:06:16 UTC
Tika and Solr : rejected document due to mime type restrictions
Hello.
I start the Tika server from the command line:
java -jar /opt/tika/tika-server-1.19.1.jar
I configured a connector to Solr with ManifoldCF.
When I start ingesting PDF and .xls documents, I see the following in the Tika server output:
INFO Setting the server's publish address to be http://localhost:9998/
INFO Logging initialized @1053ms to org.eclipse.jetty.util.log.Slf4jLog
INFO jetty-9.4.z-SNAPSHOT; built: 2018-06-05T18:24:03.829Z; git: d5fc0523cfa96bfebfbda19606cad384d772f04c; jvm 10.0.2+13-Ubuntu-1ubuntu0.18.04.2
INFO Started ServerConnector@f74e835{HTTP/1.1,[http/1.1]}{localhost:9998}
INFO Started @1134ms
WARN Empty contextPath
INFO Started o.e.j.s.h.ContextHandler@68d6972f{/,null,AVAILABLE}
INFO Started Apache Tika server at http://localhost:9998/
INFO meta (application/pdf)
INFO meta (application/vnd.openxmlformats-officedocument.spreadsheetml.sheet)
WARN Using fallback font 'LiberationSans' for 'TimesNewRomanPS-BoldMT'
WARN Using fallback font 'LiberationSans' for 'Arial-Black'
WARN Using fallback font 'LiberationSans' for 'TimesNewRomanPSMT'
WARN Using fallback font 'LiberationSans' for 'Arial-BoldMT'
WARN Using fallback font 'LiberationSans' for 'ArialMT'
WARN Using fallback font 'LiberationSans' for 'CourierNewPSMT'
WARN Using fallback font 'LiberationSans' for 'TimesNewRomanPS-ItalicMT'
INFO tika (application/pdf)
WARN Using fallback font 'LiberationSans' for 'TimesNewRomanPS-BoldMT'
WARN Using fallback font 'LiberationSans' for 'Arial-Black'
WARN Using fallback font 'LiberationSans' for 'TimesNewRomanPSMT'
WARN Using fallback font 'LiberationSans' for 'Arial-BoldMT'
WARN Using fallback font 'LiberationSans' for 'ArialMT'
WARN Using fallback font 'LiberationSans' for 'CourierNewPSMT'
WARN Using fallback font 'LiberationSans' for 'TimesNewRomanPS-ItalicMT'
INFO tika (application/vnd.openxmlformats-officedocument.spreadsheetml.sheet)
So it seems that the Tika server processes the documents, but the Solr server doesn't ingest them.
I get these errors:
Solr connector rejected document due to mime type restrictions: (application/vnd.openxmlformats-officedocument.spreadsheetml.sheet)
Solr connector rejected document due to mime type restrictions: (application/pdf)
I understood that Tika converts all documents to text so that they can be indexed in Solr. Or are there restrictions on the MIME types that Tika Server handles?
Thanks a lot
Mario
Re: Tika and Solr : rejected document due to mime type restrictions
Posted by Shawn Heisey <ap...@elyograg.org>.
On 10/11/2018 9:06 AM, Bisonti Mario wrote:
> I start the Tika server from the command line:
> java -jar /opt/tika/tika-server-1.19.1.jar
>
> I configured a connector to Solr with ManifoldCF.
>
> When I start ingesting PDF and .xls documents, I see the following in the Tika server output:
<snip>
> So it seems that the Tika server processes the documents, but the Solr server doesn't ingest them.
>
> I get these errors:
> Solr connector rejected document due to mime type restrictions: (application/vnd.openxmlformats-officedocument.spreadsheetml.sheet)
> Solr connector rejected document due to mime type restrictions: (application/pdf)
Those errors are not coming from Solr. Do you see any errors in
solr.log? If you do, then we can help you with those.
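One way to check solr.log for errors is sketched below. This is only a minimal illustration: the log path is passed on the command line because its location depends on the installation, the helper names (find_error_lines, scan_log) are made up for this example, and it assumes the severity level appears as a separate whitespace-delimited word on each log line.

```python
import sys
from typing import Iterable, List, Sequence


def find_error_lines(lines: Iterable[str],
                     levels: Sequence[str] = ("ERROR", "WARN")) -> List[str]:
    """Return the lines that contain one of the severity levels as a whole word."""
    hits = []
    for line in lines:
        words = line.split()
        if any(level in words for level in levels):
            hits.append(line.rstrip("\n"))
    return hits


def scan_log(path: str) -> List[str]:
    """Read a log file (path supplied by the caller) and return matching lines."""
    with open(path, encoding="utf-8", errors="replace") as handle:
        return find_error_lines(handle)


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage (the path is an assumption, adjust to your install):
    #   python scan_log.py /path/to/solr.log
    for hit in scan_log(sys.argv[1]):
        print(hit)
```

If this prints nothing for the time window of the ingest, that supports the idea that the rejected documents never reached Solr.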
Since ManifoldCF calls its components connectors, I am betting the
errors are being generated by ManifoldCF, and that for those documents,
nothing has actually been sent to Solr, so you won't see errors in the
solr.log for those files. ManifoldCF is a separate project within
Apache, which has its own support infrastructure.
https://manifoldcf.apache.org/en_US/mail.html
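To illustrate why a connector-side filter would leave nothing in solr.log, here is a purely hypothetical sketch of an allow-list check that runs before any request is made. It is not ManifoldCF's code; the function names and the allow-list contents are assumptions for illustration, and whether the ManifoldCF Solr connector is configured this way is something to confirm on the ManifoldCF list.

```python
from typing import Iterable


def is_accepted(mime_type: str, allowed: Iterable[str]) -> bool:
    """Check a MIME type against an allow-list, ignoring case and parameters."""
    # Drop parameters such as "; charset=utf-8" before comparing.
    base = mime_type.split(";", 1)[0].strip().lower()
    return base in {entry.strip().lower() for entry in allowed}


def handle_document(mime_type: str, allowed: Iterable[str]) -> str:
    """Return what a filtering connector might do with a document (hypothetical)."""
    if is_accepted(mime_type, allowed):
        return "sent"
    # Rejected here, so no request is made and the target server logs nothing.
    return "rejected"


if __name__ == "__main__":
    # Hypothetical allow-list that only permits plain text.
    allow_list = ["text/plain"]
    for mime in ("application/pdf", "text/plain; charset=utf-8"):
        print(mime, "->", handle_document(mime, allow_list))
```

With an allow-list like the hypothetical one above, application/pdf would be rejected before anything is sent, which matches the symptom of errors on the connector side and a clean solr.log.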
Thanks,
Shawn