Posted to solr-user@lucene.apache.org by kostali hassan <me...@gmail.com> on 2015/12/01 19:03:36 UTC
Fwd: Indexing rich data (msword and pdf) in apache solr-5.3.1
I started working with Solr 5.x by extracting Solr to D:\solr and starting
the server with:
D:\solr\solr-5.3.1\bin>solr start
Then I created a core in standalone mode:
D:\solr\solr-5.3.1\bin>solr create -c mycore
I need to index files from the file system (Word and PDF). The schema has no
"name" field for the document, so I added one through the Schema API using curl:
curl -X POST -H 'Content-type:application/json' --data-binary '{
"add-field":{
"name":"name",
"type":"text_general",
"stored":true,
"indexed":true }
}' http://localhost:8983/solr/mycore/schema
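For comparison, the same add-field command can be sketched in Python, where the JSON body is generated by the `json` module, so every key is quoted with plain ASCII double quotes. This is only a sketch: the URL and core name are taken from the curl command above, and `build_add_field_request` and `post_add_field` are hypothetical helper names, not part of Solr.

```python
import json
import urllib.request

# Taken from the curl command above: local Solr on port 8983, core "mycore".
SCHEMA_URL = "http://localhost:8983/solr/mycore/schema"


def build_add_field_request(name, field_type="text_general"):
    """Return the JSON body (as bytes) for a Schema API add-field command."""
    command = {
        "add-field": {
            "name": name,
            "type": field_type,
            "stored": True,
            "indexed": True,
        }
    }
    return json.dumps(command).encode("utf-8")


def post_add_field(body, url=SCHEMA_URL):
    """POST the command to Solr. Requires a running server; not called here."""
    request = urllib.request.Request(
        url, data=body, headers={"Content-type": "application/json"}
    )
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8")


# Build and show the request body without contacting the server.
body = build_add_field_request("name")
print(body.decode("utf-8"))
```

Calling `post_add_field(body)` would send the request; it is left out of the sketch so the block runs without a Solr instance.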
Then I re-indexed all the documents with SimplePostTool on Windows:
D:\solr\solr-5.3.1>java -classpath example\exampledocs\post.jar -Dauto=yes
-Dc=mycore -Ddata=files -Drecursive=yes org.apache.solr.util.SimplePostTool
D:\Lucene\document
But even though the "name" field was added successfully, it is empty. The
"title" field holds the name only for PDF documents, not for MS Word
documents (.doc and .docx).
Then I chose to index with the techproducts example, because it does not use
the Schema API, so I can modify the schema myself:
D:\solr\solr-5.3.1>solr -e techproducts
Techproducts returns the name for all of the indexed .xml files.
Then I created a new core, called demo, based on the Solr home
example/techproducts/solr, using the schema.xml (which contains the "name"
field) and solrconfig.xml from techproducts.
When I indexed all the documents, the "name" field existed but was still
empty for every indexed document.
My question is: how can I get just the name of each document (MS Word and
PDF), not the full path as in the "id" or "resource_name" field? Do I have to
create a new field type, or is there another way?
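To make the goal concrete, here is a minimal Python sketch of the value wanted in "name": only the last component of the path that ends up in "id". It shows the transformation done on the client side and is not a Solr feature; `file_name_from_path` is a hypothetical helper, and the sample paths are invented for illustration.

```python
from pathlib import PureWindowsPath


def file_name_from_path(path):
    """Return only the file name (last path component) of a Windows path."""
    # PureWindowsPath accepts both backslashes and forward slashes.
    return PureWindowsPath(path).name


# Hypothetical sample paths under the directory used in the post.
samples = [
    r"D:\Lucene\document\report.docx",
    "D:/Lucene/document/cv.pdf",
]

for path in samples:
    # "id" keeps the full path; "name" is the short file name being asked for.
    doc = {"id": path, "name": file_name_from_path(path)}
    print(doc)
```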
Sorry for my basic English.
Thank you.