Posted to solr-user@lucene.apache.org by anarchos78 <ri...@hotmail.com> on 2012/08/07 22:28:38 UTC

Solr search – Tika extracted text from PDF not return highlighting snippet

Greetings friends,
I have successfully indexed PDFs (using Tika) and plain text (fetched from a
database) in one single collection. Now I am trying to implement
highlighting. When querying Solr I use the following URL:
http://localhost:8090/solr/ktimatologio/select/?q=BlahBlah&start=0&rows=120&indent=on&hl=true&wt=json
The request itself works: the response has the original (not highlighted)
content under “docs” and the highlighted snippets under “highlighting”.
However, I noticed that the documents extracted by Tika have no
“highlighting” snippet. That kind of response causes me a lot of trouble
(zero-length rows). Is there any workaround? I have already tried copyField
(at index time), but the highlighting comes back empty ({“highlighting”:{}}).
I really need help on this.

With honor,

Tom

Greece 




--
View this message in context: http://lucene.472066.n3.nabble.com/Solr-search-Tika-extracted-text-from-PDF-not-return-highlighting-snippet-tp3999647.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: Solr search – Tika extracted text from PDF not return highlighting snippet

Posted by tuxdna <tu...@gmail.com>.
I am replying to this post because I am facing a very similar issue.

I am indexing the documents stored in a blob field of a MySQL database. I
have described the whole setup in the following blog post:

http://tuxdna.wordpress.com/2013/02/04/indexing-the-documents-stored-in-a-database-using-apache-solr-and-apache-tika/


Basically, the blob content is fetched from the database, then parsed
by Tika and converted into text. All the fields in the database table get
indexed properly except the blob field (the one processed by Tika). It
does not show up in the Solr schema browser, and there are no terms for the
text field.

I tried some permutations and combinations of the fields (in
db-data-config.xml and schema.xml) and got it working. I now have two fields,
"text" and "text2", where "text" is indexed + stored and "text2" is
neither. However, if I remove "text2" from the configuration, I am back to the
same problem, i.e. the field doesn't get indexed.

I don't understand why the above workaround works. Can anyone give me
pointers on where to explore further to understand this behaviour? Is it
solvable using copyField?

NOTE: I have described the configuration files and setup in the link above.

Thanks in advance! :)

/tuxdna





Re: Solr search – Tika extracted text from PDF not return highlighting snippet

Posted by Lance Norskog <go...@gmail.com>.
There are two different sets of readers for binary and character-mode
data, and I don't remember which is which. You may be reading the PDF
binary blob as a character blob.
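
For reference, here is a minimal sketch of the binary path through DIH. It assumes the BLOB column is named bin_con, as in the configuration posted in this thread. The entity names (doc_bin, tika), the table name docs, the database name and credentials, and the data source name binaryField are all hypothetical. To the best of my knowledge the two readers are FieldStreamDataSource (binary fields such as BLOBs) and FieldReaderDataSource (character fields such as CLOBs or strings), and TikaEntityProcessor needs the binary one.

```xml
<dataConfig>
  <!-- JDBC source; convertType="false" mirrors the setting used in this thread -->
  <dataSource name="db" type="JdbcDataSource"
              driver="com.mysql.jdbc.Driver"
              url="jdbc:mysql://127.0.0.1:3306/exampledb"
              user="solr" password="changeme"
              convertType="false"/>
  <!-- Binary reader: hands the BLOB bytes of a parent-entity column to Tika -->
  <dataSource name="binaryField" type="FieldStreamDataSource"/>

  <document>
    <!-- Outer entity selects the raw BLOB column (hypothetical table "docs") -->
    <entity name="doc_bin" dataSource="db"
            query="select id, bin_con from docs where type = 'bin'">
      <field column="id" name="solr_id"/>
      <!-- dataField must be outerEntityName.columnName -->
      <entity name="tika" dataSource="binaryField"
              processor="TikaEntityProcessor"
              dataField="doc_bin.bin_con" format="text">
        <!-- "text" is the column TikaEntityProcessor emits for the extracted body -->
        <field column="text" name="content"/>
      </entity>
    </entity>
  </document>
</dataConfig>
```

If the column were a CLOB or plain string instead, the inner data source would presumably be a FieldReaderDataSource, and Tika would not be needed at all.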

On Wed, Aug 22, 2012 at 1:34 AM, anarchos78
<ri...@hotmail.com> wrote:
> Thanks for your reply,
> I have tried many things (copyField etc.) with no success. Note that the
> "pdfs" are stored as BLOBs in a MySQL database. I am trying to use DIH in order
> to fetch the binaries from the DB. Is it possible?
> Thanks!



-- 
Lance Norskog
goksron@gmail.com

Re: Solr search – Tika extracted text from PDF not return highlighting snippet

Posted by anarchos78 <ri...@hotmail.com>.
Thanks for your reply,
I have tried many things (copyField etc.) with no success. Note that the
"pdfs" are stored as BLOBs in a MySQL database. I am trying to use DIH in order
to fetch the binaries from the DB. Is it possible?
Thanks!




Re: Solr search – Tika extracted text from PDF not return highlighting snippet

Posted by Lance Norskog <go...@gmail.com>.
There is no copyField in the schema.  You have to store the parsed
text in a field which is stored! Highlighting works on stored fields.
There is no "text" field in the schema. I don't know how the DIH
automatically creates it.
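
One way to rule out the field-selection side of this is to name the highlighted field explicitly instead of relying on the default search field. The fragment below is only a sketch. It assumes the stored field to highlight is content, as in the schema posted in this thread. The hl.maxAnalyzedChars value is an arbitrary illustration, raised because text extracted from PDFs can be far longer than the default 51200 characters the highlighter analyzes.

```xml
<requestHandler name="/select" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="echoParams">explicit</str>
    <!-- Turn highlighting on and name the stored field to highlight -->
    <str name="hl">true</str>
    <str name="hl.fl">content</str>
    <!-- Illustrative value only: analyze more of each document than the default -->
    <int name="hl.maxAnalyzedChars">1000000</int>
    <str name="hl.fragsize">800</str>
  </lst>
</requestHandler>
```

The same parameters can be passed on the URL (for example &hl=true&hl.fl=content) to test the effect before changing solrconfig.xml.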

On Tue, Aug 21, 2012 at 2:10 PM, anarchos78
<ri...@hotmail.com> wrote:
> Any help? Anyone?



-- 
Lance Norskog
goksron@gmail.com

Re: Solr search – Tika extracted text from PDF not return highlighting snippet

Posted by anarchos78 <ri...@hotmail.com>.
Any help? Anyone?




Re: Solr search – Tika extracted text from PDF not return highlighting snippet

Posted by anarchos78 <ri...@hotmail.com>.
Thank you for the reply. I have tried many things with no results, so I am
providing the following:

*The Solr response (notice the empty object at the end of "highlighting"; that
is the problem):*
{
  "responseHeader":{"status":0,"QTime":0,"params":{
      "indent":"on",
      "q":"Bloh",
      "wt":"json",
      "hl":"on",
      "rows":"3"}},
  "response":{"numFound":3,"start":0,"docs":[
      {
        "ida":"2",
        "search_tag":"N.3889/2010",
        "solr_id":"2_n_3889_2010",
        "last_modified":"2012-07-31 10:43:31.0",
        "model":"n_3889_2010",
        "title":"Άρθρο 14",
        "type":"text",
        "grid_title":"Άρθρο 14/ΦΕΚ Α΄ 182/14.10.2010",
        "content":[
          "Blah Bloh Bleh"]},
      {
        "ida":"22",
        "search_tag":"Δικαστικές Αποφάσεις",
        "solr_id":"22_apofaseis_dikastikes",
        "last_modified":"2012-07-18 00:42:56.0",
        "model":"apofaseis_dikastikes",
        "title":"37/2009 ΔΕΦ ΑΘ (ΑΝΑΣΤ)",
        "type":"text",
        "grid_title":"37/2009 ΔΕΦ ΑΘ (ΑΝΑΣΤ)",
        "content":[
          "Lola lolo Bloh lili"]},
      {
        "ida":"45",
        "search_tag":"Δικαστικές Αποφάσεις",
        "solr_id":"45_apofaseis_dikastikes",
        "last_modified":"2012-07-18 00:43:40.0",
        "model":"apofaseis_dikastikes",
        "title":"126/2009 ΔΕΦ ΑΘ (ΑΝΑΣΤ)",
        "type":"bin",
        "url":"resources/pdf/1.pdf",
        "grid_title":"126/2009 ΔΕΦ ΑΘ (ΑΝΑΣΤ)",
        "content":[
          "Bloh Abc Cbn Cnn"]}]
  },
  "highlighting":{
    "2_n_3889_2010":{
      "content":["Blah <em>Bloh</em> Bleh"]},
    "22_apofaseis_dikastikes":{
      "content":["Lola lolo <em>Bloh</em> lili"]},
    "45_apofaseis_dikastikes":{}}}


*The schema.xml :*

<?xml version="1.0" encoding="UTF-8" ?>

<schema name="ktimatologio" version="1.5">

  <types>
  
    <fieldType name="string" class="solr.StrField" sortMissingLast="true" />

    
    <fieldType name="boolean" class="solr.BoolField"
sortMissingLast="true"/>
    
    <fieldtype name="binary" class="solr.BinaryField"/>
	
    <fieldType name="int" class="solr.TrieIntField" precisionStep="0"
positionIncrementGap="0"/>
    <fieldType name="float" class="solr.TrieFloatField" precisionStep="0"
positionIncrementGap="0"/>
    <fieldType name="long" class="solr.TrieLongField" precisionStep="0"
positionIncrementGap="0"/>
    <fieldType name="double" class="solr.TrieDoubleField" precisionStep="0"
positionIncrementGap="0"/>
	
    <fieldType name="tint" class="solr.TrieIntField" precisionStep="8"
positionIncrementGap="0"/>
    <fieldType name="tfloat" class="solr.TrieFloatField" precisionStep="8"
positionIncrementGap="0"/>
    <fieldType name="tlong" class="solr.TrieLongField" precisionStep="8"
positionIncrementGap="0"/>
    <fieldType name="tdouble" class="solr.TrieDoubleField" precisionStep="8"
positionIncrementGap="0"/>
	
    <fieldType name="date" class="solr.TrieDateField" precisionStep="0"
positionIncrementGap="0"/>

    
    <fieldType name="tdate" class="solr.TrieDateField" precisionStep="6"
positionIncrementGap="0"/>

    <fieldType name="pint" class="solr.IntField"/>
    <fieldType name="plong" class="solr.LongField"/>
    <fieldType name="pfloat" class="solr.FloatField"/>
    <fieldType name="pdouble" class="solr.DoubleField"/>
    <fieldType name="pdate" class="solr.DateField" sortMissingLast="true"/>
	
    <fieldType name="sint" class="solr.SortableIntField"
sortMissingLast="true" omitNorms="true"/>
    <fieldType name="slong" class="solr.SortableLongField"
sortMissingLast="true" omitNorms="true"/>
    <fieldType name="sfloat" class="solr.SortableFloatField"
sortMissingLast="true" omitNorms="true"/>
    <fieldType name="sdouble" class="solr.SortableDoubleField"
sortMissingLast="true" omitNorms="true"/>
	
    <fieldType name="random" class="solr.RandomSortField" indexed="true" />
	
		
	<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
            <analyzer type="index">
                <charFilter class="solr.HTMLStripCharFilterFactory"/>
                <tokenizer class="solr.StandardTokenizerFactory"/>
                <filter class="solr.StandardFilterFactory"/> 
				       
				<filter class="solr.LowerCaseFilterFactory"/>				
                <filter class="solr.StopFilterFactory" ignoreCase="true"
words="lang/stopwords_el.txt" enablePositionIncrements="true"/>		
				<filter class="solr.GreekLowerCaseFilterFactory"/>
				<filter class="solr.GreekStemFilterFactory"/>
				
				<filter class="solr.HunspellStemFilterFactory"
dictionary="dictionaries/el_GR.dic" affix="dictionaries/el_GR.aff"
ignoreCase="true" />
            </analyzer>
            <analyzer type="query">
                <charFilter class="solr.HTMLStripCharFilterFactory"/>
                <tokenizer class="solr.StandardTokenizerFactory"/>
                <filter class="solr.StandardFilterFactory"/>
                <filter class="solr.LowerCaseFilterFactory"/> 
				       
				<filter class="solr.LowerCaseFilterFactory"/>				
                <filter class="solr.StopFilterFactory" ignoreCase="true"
words="lang/stopwords_el.txt" enablePositionIncrements="true"/>		
				<filter class="solr.GreekLowerCaseFilterFactory"/>
				<filter class="solr.GreekStemFilterFactory"/>
				
				<filter class="solr.HunspellStemFilterFactory"
dictionary="dictionaries/el_GR.dic" affix="dictionaries/el_GR.aff"
ignoreCase="true" />
            </analyzer>
        </fieldType>
		
    <fieldType name="text_ktimatologio" class="solr.TextField"
positionIncrementGap="100">
      <analyzer type="index">
	    
	  			
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true"
words="lang/stopwords_en.txt" enablePositionIncrements="true"/>
        <filter class="solr.LowerCaseFilterFactory"/>
	    <filter class="solr.EnglishPossessiveFilterFactory"/>
		<filter class="solr.StopFilterFactory" ignoreCase="true"
words="lang/stopwords_el.txt" enablePositionIncrements="true"/>
	    <filter class="solr.GreekLowerCaseFilterFactory"/>
	    <filter class="solr.GreekStemFilterFactory"/>		
        <filter class="solr.KeywordMarkerFilterFactory"
protected="protwords.txt"/>
        <filter class="solr.PorterStemFilterFactory"/>
      </analyzer>
	  	  
      <analyzer type="query">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
ignoreCase="true" expand="true"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true"
words="lang/stopwords_en.txt" enablePositionIncrements="true"/>		
		<filter class="solr.StopFilterFactory" ignoreCase="true"
words="lang/stopwords_el.txt" enablePositionIncrements="true"/>		
		<filter class="solr.GreekLowerCaseFilterFactory"/>
        <filter class="solr.GreekStemFilterFactory"/>		
		
        <filter class="solr.LowerCaseFilterFactory"/>
	    <filter class="solr.EnglishPossessiveFilterFactory"/>
        <filter class="solr.KeywordMarkerFilterFactory"
protected="protwords.txt"/>
        <filter class="solr.PorterStemFilterFactory"/>
      </analyzer>
    </fieldType>
    <fieldtype name="ignored" stored="false" indexed="false"
multiValued="true" class="solr.StrField" />
    <fieldType name="point" class="solr.PointType" dimension="2"
subFieldSuffix="_d"/>
    <fieldType name="location" class="solr.LatLonType"
subFieldSuffix="_coordinate"/>
    <fieldtype name="geohash" class="solr.GeoHashField"/>
    <fieldType name="currency" class="solr.CurrencyField" precisionStep="8"
defaultCurrency="USD" currencyConfig="currency.xml" />
 </types>



 <fields>
  <field  name="ida" type="string" indexed="true" stored="true"
multiValued="false"/>
  <field  name="solr_id" type="string" indexed="true" stored="true"
multiValued="false"/> 
  <field  name="title" type="text_ktimatologio" indexed="true"
stored="true"/>
  <field  name="grid_title" type="text_ktimatologio" indexed="true"
stored="true"/>
  <field  name="model" type="string" indexed="true" stored="true"
multiValued="false"/>
  <field  name="type" type="string" indexed="true" stored="true"/>
  <field  name="url" type="string" indexed="true" stored="true"/>
  <field  name="last_modified" type="string" indexed="true" stored="true"/>
  <field  name="search_tag" type="string" indexed="true" stored="true"/>
  <field  name="contentbin" type="text" indexed="true" stored="true"
multiValued="true"/>
  <field  name="content" type="text_ktimatologio" indexed="true"
stored="true" multiValued="true"/>    
 </fields>
 
 <uniqueKey>solr_id</uniqueKey>
 <defaultSearchField>content</defaultSearchField>
 <solrQueryParser defaultOperator="OR"/>


</schema>



*The data-config.xml (part of it): *

<?xml version="1.0" encoding="utf-8"?>

<dataConfig>
   
  <dataSource type="JdbcDataSource"
		  autoCommit="true" batchSize="-1"
		  convertType="false"
		  driver="com.mysql.jdbc.Driver"
		  url="jdbc:mysql://127.0.0.1:3306/ktimatologio"
		  user="root" 
		  password="******"
		  name="db"/>
		  
		 <dataSource name="fieldReader" type="FieldStreamDataSource" />		
                  
			  
  <document>  
  
  
  <entity name="aitiologikes_ektheseis"
  	dataSource="db" 
  	transformer="HTMLStripTransformer" 
  	query="select id, title, title AS grid_title, model, type, url,
last_modified, CONCAT_WS('_',id,model) AS solr_id, search_tag, CONCAT(
body,' ',title)  AS content from aitiologikes_ektheseis where type = 'text'"
	deltaImportQuery="select id, title, title AS grid_title, model, type, url,
last_modified, CONCAT_WS('_',id,model) AS solr_id, search_tag, CONCAT(
body,' ',title)  AS content from aitiologikes_ektheseis where type = 'text'
and id='${dataimporter.delta.id}'"
	deltaQuery="select id, title, title AS grid_title, model, type, url,
last_modified, CONCAT_WS('_',id,model) AS solr_id, search_tag, CONCAT(
body,' ',title)  AS content from aitiologikes_ektheseis where type = 'text'
and last_modified &gt; '${dataimporter.last_index_time}'">
		<field column="id" name="ida" />		
		<field column="solr_id" name="solr_id" />
		<field column="title" name="title" stripHTML="true" />
		<field column="grid_title" name="grid_title" stripHTML="true" />
		<field column="model" name="model" stripHTML="true" />
		<field column="type" name="type" stripHTML="true" />
		<field column="url" name="url" stripHTML="true" />
		<field column="last_modified" name="last_modified" stripHTML="true"  />
		<field column="search_tag" name="search_tag" stripHTML="true" />
		<field column="content" name="content" stripHTML="true" />
    </entity>
	
    <entity name="aitiologikes_ektheseis_bin"
	  query="select id, title, title AS grid_title, model, type, url,
last_modified, CONCAT_WS('_',id,model) AS solr_id, search_tag, bin_con AS
text from aitiologikes_ektheseis where type = 'bin'" 
	  deltaImportQuery="select id, title, title AS grid_title, model, type,
url, last_modified, CONCAT_WS('_',id,model) AS solr_id, search_tag, bin_con
AS text from aitiologikes_ektheseis where type = 'bin' and
id='${dataimporter.delta.id}'"
	  deltaQuery="select id, title, title AS grid_title, model, type, url,
last_modified, CONCAT_WS('_',id,model) AS solr_id, search_tag, bin_con AS
text from aitiologikes_ektheseis where type = 'bin' and last_modified &gt;
'${dataimporter.last_index_time}'"
	  transformer="TemplateTransformer"
	  dataSource="db">
	  		
		  <field column="id" name="ida" />		
		<field column="solr_id" name="solr_id" />
		  <field column="title" name="title" stripHTML="true" />
		  <field column="grid_title" name="grid_title" stripHTML="true" />
		  <field column="model" name="model" stripHTML="true" />
		  <field column="type" name="type" stripHTML="true" />
		  <field column="url" name="url" stripHTML="true" />
		  <field column="last_modified" name="last_modified" stripHTML="true"  />
		  <field column="search_tag" name="search_tag" stripHTML="true" />
		  
		<entity dataSource="fieldReader" processor="TikaEntityProcessor"
dataField="aitiologikes_ektheseis_bin.text" format="text">  
		  <field column="text" name="content" stripHTML="true" />
		</entity>
		
	</entity>
	
	
		
  <entity name="ak"
  	dataSource="db" 
  	transformer="HTMLStripTransformer" 
  	query="select id, title, CONCAT_WS('/',title,fek,date) AS grid_title,
model, type, url, last_modified, CONCAT_WS('_',id,model) AS solr_id,
search_tag, CONCAT_WS(' ',description,body,title,fek,date) AS content from
ak where type = 'text'"
	deltaImportQuery="select id, title, CONCAT_WS('/',title,fek,date) AS
grid_title, model, type, url, last_modified, CONCAT_WS('_',id,model) AS
solr_id, search_tag, CONCAT_WS(' ',description,body,title,fek,date) AS
content from ak where type = 'text' and id='${dataimporter.delta.id}'"
	deltaQuery="select id, title, CONCAT_WS('/',title,fek,date) AS grid_title,
model, type, url, last_modified, CONCAT_WS('_',id,model) AS solr_id,
search_tag, CONCAT_WS(' ',description,body,title,fek,date) AS content from
ak where type = 'text' and last_modified &gt;
'${dataimporter.last_index_time}'">
		<field column="id" name="ida" />		
		<field column="solr_id" name="solr_id" />
		<field column="title" name="title" stripHTML="true" />
		<field column="grid_title" name="grid_title" stripHTML="true" />
		<field column="model" name="model" stripHTML="true" />
		<field column="type" name="type" stripHTML="true" />
		<field column="url" name="url" stripHTML="true" />
		<field column="last_modified" name="last_modified" stripHTML="true"  />
		<field column="search_tag" name="search_tag" stripHTML="true" />
		<field column="content" name="content" stripHTML="true" />
    </entity>
	
	<entity name="ak_bin"
	  query="select id, title, CONCAT_WS('/',title,fek,date) AS grid_title,
model, type, url, last_modified, CONCAT_WS('_',id,model) AS solr_id,
search_tag, bin_con AS text from ak where type = 'bin'" 
	  deltaImportQuery="select id, title, CONCAT_WS('/',title,fek,date) AS
grid_title, model, type, url, last_modified, CONCAT_WS('_',id,model) AS
solr_id, search_tag, bin_con AS text from ak where type = 'bin' and
id='${dataimporter.delta.id}'"
	  deltaQuery="select id, title, CONCAT_WS('/',title,fek,date) AS
grid_title, model, type, url, last_modified, CONCAT_WS('_',id,model) AS
solr_id, search_tag, bin_con AS text from ak where type = 'bin' and
last_modified &gt; '${dataimporter.last_index_time}'"
	  transformer="TemplateTransformer"
	  dataSource="db">
	  		
		  <field column="id" name="ida" />		
		<field column="solr_id" name="solr_id" />
		  <field column="title" name="title" stripHTML="true" />
		  <field column="grid_title" name="grid_title" stripHTML="true" />
		  <field column="model" name="model" stripHTML="true" />
		  <field column="type" name="type" stripHTML="true" />
		  <field column="url" name="url" stripHTML="true" />
		  <field column="last_modified" name="last_modified" stripHTML="true"  />
		  <field column="search_tag" name="search_tag" stripHTML="true" />
		  
		<entity dataSource="fieldReader" processor="TikaEntityProcessor"
dataField="ak_bin.text" format="text">  
		  <field column="text" name="content" stripHTML="true" />
		</entity>
		
	</entity>
	

 	
  <entity name="pd_541_1978"
  	dataSource="db" 
  	transformer="HTMLStripTransformer" 
  	query="select id, title, CONCAT_WS('/',title,fek,date) AS grid_title,
model, type, url, last_modified, CONCAT_WS('_',id,model) AS solr_id,
search_tag, CONCAT_WS(' ',description,body,title,fek,date) AS content from
pd_541_1978 where type = 'text'"
	deltaImportQuery="select id, title, CONCAT_WS('/',title,fek,date) AS
grid_title, model, type, url, last_modified, CONCAT_WS('_',id,model) AS
solr_id, search_tag, CONCAT_WS(' ',description,body,title,fek,date) AS
content from pd_541_1978 where type = 'text' and
id='${dataimporter.delta.id}'"
	deltaQuery="select id, title, CONCAT_WS('/',title,fek,date) AS grid_title,
model, type, url, last_modified, CONCAT_WS('_',id,model) AS solr_id,
search_tag, CONCAT_WS(' ',description,body,title,fek,date) AS content from
pd_541_1978 where type = 'text' and last_modified &gt;
'${dataimporter.last_index_time}'">
		<field column="id" name="ida" />		
		<field column="solr_id" name="solr_id" />
		<field column="title" name="title" stripHTML="true" />
		<field column="fek" name="title" stripHTML="true" />
		<field column="grid_title" name="grid_title" stripHTML="true" />
		<field column="type" name="type" stripHTML="true" />
		<field column="url" name="url" stripHTML="true" />
		<field column="last_modified" name="last_modified" stripHTML="true"  />
		<field column="search_tag" name="search_tag" stripHTML="true" />
		<field column="content" name="content" stripHTML="true" />
    </entity>
	
	<entity name="pd_541_1978_bin"
	  query="select id, title, CONCAT_WS('/',title,fek,date) AS grid_title,
model, type, url, last_modified, CONCAT_WS('_',id,model) AS solr_id,
search_tag, bin_con AS text from pd_541_1978 where type = 'bin'" 
	  deltaImportQuery="select id, title, CONCAT_WS('/',title,fek,date) AS
grid_title, model, type, url, last_modified, CONCAT_WS('_',id,model) AS
solr_id, search_tag, bin_con AS text from pd_541_1978 where type = 'bin' and
id='${dataimporter.delta.id}'"
	  deltaQuery="select id, title, CONCAT_WS('/',title,fek,date) AS
grid_title, model, type, url, last_modified, CONCAT_WS('_',id,model) AS
solr_id, search_tag, bin_con AS text from pd_541_1978 where type = 'bin' and
last_modified &gt; '${dataimporter.last_index_time}'"
	  transformer="TemplateTransformer"
	  dataSource="db">
	  		
		  <field column="id" name="ida" />		
		  <field column="solr_id" name="solr_id" />
		  <field column="title" name="title" stripHTML="true" />
		  <field column="fek" name="title" stripHTML="true" />
		  <field column="grid_title" name="grid_title" stripHTML="true" />
		  <field column="type" name="type" stripHTML="true" />
		  <field column="url" name="url" stripHTML="true" />
		  <field column="last_modified" name="last_modified" stripHTML="true"  />
		  <field column="search_tag" name="search_tag" stripHTML="true" />
		  
		<entity dataSource="fieldReader" processor="TikaEntityProcessor"
dataField="pd_541_1978_bin.text" format="text">  
		  <field column="text" name="content" stripHTML="true" />
		</entity>
		
	</entity>
	
	
  </document>	
  
   
</dataConfig>


*In the solrconfig.xml (the SearchHandler, ExtractingRequestHandler and
HighlightComponent):*

<requestHandler name="/select" class="solr.SearchHandler">
    
     <lst name="defaults">
       <str name="echoParams">explicit</str>
       <int name="rows">120</int>
	   
	   
	   <str name="hl.fragsize">800</str>
	   
     </lst>
	 
  </requestHandler>
  
 <requestHandler name="/update/extract" 
                  startup="lazy"
                  class="solr.extraction.ExtractingRequestHandler" >
    <lst name="defaults">
      
      <str name="fmap.content">text</str>
      <str name="lowernames">true</str>
      <str name="uprefix">ignored_</str>
	  
	  
      <str name="fmap.Last-Modified">last_modified</str>

      
      <str name="captureAttr">true</str>
      <str name="fmap.a">links</str>
      <str name="fmap.div">ignored_</str>
    </lst>
  </requestHandler>
  
  
  
  <searchComponent class="solr.HighlightComponent" name="highlight">
    <highlighting>
      
      
      <fragmenter name="gap" 
                  default="true"
                  class="solr.highlight.GapFragmenter">
        <lst name="defaults">
          
        </lst>
      </fragmenter>

      
      <fragmenter name="regex" 
                  class="solr.highlight.RegexFragmenter">
        <lst name="defaults">
          
          <int name="hl.fragsize">70</int>
          
          <float name="hl.regex.slop">0.5</float>
          
          <str name="hl.regex.pattern">[-\w ,/\n\&quot;&apos;]{20,200}</str>
        </lst>
      </fragmenter>

      
      <formatter name="html" 
                 default="true"
                 class="solr.highlight.HtmlFormatter">
        <lst name="defaults">
          
		  
		  <str name="hl.simple.pre">&lt;solrhighlight&gt;</str>
          <str name="hl.simple.post">&lt;/solrhighlight&gt;</str>
		  
        </lst>
      </formatter>

      
      <encoder name="html" 
               class="solr.highlight.HtmlEncoder" />

      
      <fragListBuilder name="simple" 
                       default="true"
                       class="solr.highlight.SimpleFragListBuilder"/>

      
      <fragListBuilder name="single" 
                       class="solr.highlight.SingleFragListBuilder"/>

      
      <fragmentsBuilder name="default" 
                        default="true"
                        class="solr.highlight.ScoreOrderFragmentsBuilder">
        
      </fragmentsBuilder>  

      
      <fragmentsBuilder name="colored" 
                        class="solr.highlight.ScoreOrderFragmentsBuilder">
        <lst name="defaults">
          <str name="hl.tag.pre"></str>
          <str name="hl.tag.post"></str>
        </lst>
      </fragmentsBuilder>
      
      <boundaryScanner name="default" 
                       default="true"
                       class="solr.highlight.SimpleBoundaryScanner">
        <lst name="defaults">
          <str name="hl.bs.maxScan">10</str>
          <str name="hl.bs.chars">.,!? &#9;&#10;&#13;</str>
        </lst>
      </boundaryScanner>
      
      <boundaryScanner name="breakIterator" 
                       class="solr.highlight.BreakIteratorBoundaryScanner">
        <lst name="defaults">
         
          <str name="hl.bs.type">WORD</str>
          
          <str name="hl.bs.language">en</str>
          <str name="hl.bs.country">US</str>
        </lst>
      </boundaryScanner>
    </highlighting>
  </searchComponent>
  
  




Re: Solr search – Tika extracted text from PDF not return highlighting snippet

Posted by Jack Krupansky <ja...@basetechnology.com>.
The out-of-the-box example for SolrCell/Tika redirects the Tika "content" to 
the "text" field, which is not stored and so cannot be highlighted: the Tika 
content is indexed but not retrievable/highlightable.

What field are you highlighting for your database text?

You should direct your Tika "content" to a stored field, and then copy it to 
"text" for indexing and to whatever field you are highlighting.
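
In schema.xml terms, that advice might look like the following sketch. The field name tika_content is hypothetical, and the "text" field is declared here only because the schema posted in this thread does not define one. The last line shows the matching SolrCell mapping, which applies only when documents are sent through /update/extract.

```xml
<!-- schema.xml: inside <fields> -->
<!-- Stored, so it can be returned and highlighted (hypothetical name) -->
<field name="tika_content" type="text" indexed="true" stored="true" multiValued="true"/>
<!-- Catch-all search field: indexed only -->
<field name="text" type="text" indexed="true" stored="false" multiValued="true"/>

<!-- schema.xml: after </fields> -->
<copyField source="tika_content" dest="text"/>

<!-- solrconfig.xml: in the /update/extract defaults, send Tika's "content" to the stored field -->
<str name="fmap.content">tika_content</str>
```

A query would then request highlighting on the stored field, for example with hl=true&hl.fl=tika_content.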

-- Jack Krupansky
