Posted to solr-user@lucene.apache.org by kostali hassan <me...@gmail.com> on 2015/12/01 19:03:36 UTC

Fwd: Indexing rich data (msword and pdf) in apache solr-5.3.1

I started working with Solr 5.x by extracting it to D:\solr and starting
the Solr server with:

D:\solr\solr-5.3.1\bin>solr start

Then I created a core in standalone mode:

D:\solr\solr-5.3.1\bin>solr create -c mycore

I need to index files from the file system (Word and PDF). The schema does
not have a field “name” for the document, so I added this field through the
Schema API using curl:

curl -X POST -H 'Content-type:application/json' --data-binary '{
  "add-field":{
     "name":"name",
     "type":"text_general",
     "stored":true,
     "indexed":true }
}' http://localhost:8983/solr/mycore/schema
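
For illustration only, the same add-field request can be sketched in Python.
This is a minimal sketch, assuming the core URL from the curl command above
and a running Solr with a managed schema; the helper names
build_add_field_payload and add_field are hypothetical and not part of Solr.

```python
import json
import urllib.request

# Assumed value, taken from the curl command in this post.
SCHEMA_URL = "http://localhost:8983/solr/mycore/schema"


def build_add_field_payload(name, field_type="text_general", stored=True, indexed=True):
    """Return the JSON body of a Schema API add-field command as bytes."""
    command = {
        "add-field": {
            "name": name,
            "type": field_type,
            "stored": stored,
            "indexed": indexed,
        }
    }
    return json.dumps(command).encode("utf-8")


def add_field(name, url=SCHEMA_URL):
    """POST the add-field command; this needs a live Solr server to succeed."""
    request = urllib.request.Request(
        url,
        data=build_add_field_payload(name),
        headers={"Content-type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))
```

Building the body with json.dumps avoids quoting mistakes such as typographic
quotes inside the JSON, which Solr would reject. Calling add_field("name")
is only meaningful while the server is running.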



Then I re-indexed all documents with the SimplePostTool on Windows:

D:\solr\solr-5.3.1>java -classpath example\exampledocs\post.jar -Dauto=yes
-Dc=mycore -Ddata=files -Drecursive=yes org.apache.solr.util.SimplePostTool
D:\Lucene\document



But even though the field “name” was added successfully, it is empty. The
field “title” holds the name only for PDF documents, not for MS Word
documents (.doc and .docx).



Then I chose to index with the techproducts example, because it does not
use the Schema API, so I can modify schema.xml directly:



D:\solr\solr-5.3.1>solr -e techproducts



Techproducts returns the name of every indexed .xml file.



Then I created a new core, called demo, based on the Solr home
example/techproducts/solr, using the schema.xml (which contains the field
“name”) and the solrconfig.xml from techproducts.

When I indexed all the documents, the field “name” exists but is still
empty for every indexed document.



My question is: how can I get just the name of each document (MS Word and
PDF), not the full path as in the field “id” or the field “ressource_name”?
Do I have to create a new field type, or is there another way?
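
To make the desired result concrete, here is a rough Python sketch of
deriving the bare file name from a path and passing it to Solr as a literal
field value. It assumes the core exposes the ExtractingRequestHandler at
/update/extract and that the field “name” exists; the helper names and the
file report.docx are hypothetical. The sketch only builds the request URL;
the file content would still have to be posted as the request body.

```python
import ntpath
from urllib.parse import urlencode

# Assumed endpoint: the extract handler of the core created in this post.
EXTRACT_URL = "http://localhost:8983/solr/mycore/update/extract"


def file_name_from_path(path):
    """Return only the last path component, i.e. the bare file name."""
    # ntpath accepts both "\" and "/" separators on any operating system.
    return ntpath.basename(path)


def extract_url_for(path, base_url=EXTRACT_URL):
    """Build an extract URL that sets id and name through literal.* parameters."""
    params = {
        "literal.id": path,                          # full path, as in the field "id"
        "literal.name": file_name_from_path(path),   # file name only
        "commit": "true",
    }
    return base_url + "?" + urlencode(params)


# Hypothetical document under the directory indexed in this post.
print(extract_url_for(r"D:\Lucene\document\report.docx"))
```

With this approach the name is computed on the client side before indexing,
so no new field type is needed; whether that fits depends on how the files
are posted.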



Sorry for my basic English.

Thank you.