Posted to issues@solr.apache.org by "Gus Heck (Jira)" <ji...@apache.org> on 2022/07/12 20:20:00 UTC
[jira] [Commented] (SOLR-16288) Error indexing files(html, pdf) using SOLR Cell Tika

    [ https://issues.apache.org/jira/browse/SOLR-16288?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17566000#comment-17566000 ] 

Gus Heck commented on SOLR-16288:
---------------------------------

This is not the appropriate forum for obtaining help. This is a bug tracker, meant to be used when you are certain of something (a bug, a feature request, or a patch that you want to contribute). When you are {*}uncertain{*} and need help, please use the mailing list [users@solr.apache.org|mailto:users@solr.apache.org] (instructions for joining are available at [https://solr.apache.org/community.html]).

Briefly, I will comment that SolrCell is really only appropriate for small-scale use and for testing. Large indexes will want to do their text extraction with Tika (which is what SolrCell uses) outside of Solr, to avoid excessive load on the search machine while it is serving queries. (You may already be aware of this, since it is stated in the docs at [https://solr.apache.org/guide/8_11/uploading-data-with-solr-cell-using-apache-tika.html#solr-cell-performance-implications].)

As to your specific problem, I'm not sure, but I suspect your non-standard _uniqueid field (evidently defined as the uniqueKey in your schema, based on the error message) needs to be specified as a literal in the request.
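
To illustrate the suspicion about the literal, here is a minimal sketch of how the request URL might be built. It is untested against the reporter's setup. The host, port, and collection are copied from the report, while doc1, example.pdf, and the SEND switch are placeholders chosen for illustration. It assumes that Solr Cell's literal.<fieldname> parameter is what populates _uniqueid. It also joins the second parameter with '&', since the reported command uses a second '?', which would make "doc1?commit=true" the literal value.

```shell
#!/bin/sh
# Endpoint taken from the report; adjust host, port, and collection as needed.
BASE_URL="https://localhost:8984/solr/XP0_Slavik_web_index/update/extract"

# literal.<fieldname>=<value> sets a field on the extracted document.
# Assumption: the schema's uniqueKey is _uniqueid, as the error message suggests.
# Parameters after the first are separated with '&', not a second '?'.
URL="${BASE_URL}?literal._uniqueid=doc1&commit=true"

# Print the URL so it can be checked before anything is sent.
echo "$URL"

# SEND is a hypothetical switch for this sketch: set SEND=1 to upload the file.
if [ "${SEND:-0}" = "1" ]; then
  curl "$URL" -F "myfile=@example.pdf"
fi
```

The sketch uses POSIX shell variables; on Windows cmd the same URL can be passed to curl directly inside double quotes.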

> Error indexing files(html, pdf) using SOLR Cell Tika
> ----------------------------------------------------
>
>                 Key: SOLR-16288
>                 URL: https://issues.apache.org/jira/browse/SOLR-16288
>             Project: Solr
>          Issue Type: Bug
>      Security Level: Public(Default Security Level. Issues are Public) 
>            Reporter: Nicolas
>            Priority: Major
>
> Hi - I am trying to index files such as HTML and PDF. I get the following error related to the unique id, which is defined in the curl command. The unique id is set with the literal.id parameter.
> Can you please help? I have read all the documentation for Solr Cell and Tika, and I am following the steps as described.
> Here is what I enter in cmd:
> C:\>{*}curl "https://localhost:8984/solr/XP0_Slavik_web_index/update/extract?literal.id=doc1?commit=true" -F "myfile=@example.pdf"{*}
> {
>   "responseHeader":{
>     "status":400,
>     "QTime":55},
>   "error":{
>     "metadata":[
>       "error-class","org.apache.solr.common.SolrException",
>       "root-error-class","org.apache.solr.common.SolrException"],
>     "msg":"{*}Document is missing mandatory uniqueKey field: _uniqueid{*}",
>     "code":400}}


