Posted to solr-user@lucene.apache.org by Luis <re...@gmail.com> on 2013/03/14 20:30:25 UTC

Solr indexing binary files

Hi, I am new to Solr and I am extracting metadata from binary files through
URLs stored in my database.  I would like to know what fields are available
for indexing from PDFs (the ones that would be specified as column="").
For example, how would I extract something like file size, format or file
type?

I would also like to know how to create customized fields in Solr.  How
are the metadata and text content mapped into the Solr schema?  Would I have
to declare that in solrconfig.xml or do some more tweaking somewhere
else?  If someone has a code snippet they could show me, it would be greatly
appreciated.

Thank you in advance.




--
View this message in context: http://lucene.472066.n3.nabble.com/Solr-indexing-binary-files-tp4047470.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: Solr indexing binary files

Posted by Luis <re...@gmail.com>.
Hi Gora,

Yes, my urlpath points to a URL like that.  I do not understand why
uncommenting the catch-all dynamic field ("*") does not work for me.




Re: Solr indexing binary files

Posted by Gora Mohanty <go...@mimirtech.com>.
On 16 March 2013 00:30, Luis <re...@gmail.com> wrote:
> Sorry, Gora.  It is ${fileSourcePaths.urlpath} actually.

Most likely, there is some issue with the selected urlpath
not pointing to a proper http or file source. E.g., urlpath
could be something like http://example.com/myfile.pdf .
Please check that ${fileSourcePaths.urlpath} points to a
proper resource.

> *My complete schema.xml is this:*
[...]

This looks fine.

Regards,
Gora

Re: Solr indexing binary files

Posted by Luis <re...@gmail.com>.
Sorry, Gora.  It is ${fileSourcePaths.urlpath} actually.

*My complete schema.xml is this:*

<?xml version="1.0" encoding="UTF-8" ?>




<schema name="db" version="1.1">
  

  <types>
  
   <fieldType name="text_general" class="solr.TextField"
positionIncrementGap="100" />
    

    
    <fieldType name="string" class="solr.StrField" sortMissingLast="true"
omitNorms="true"/>

    
    <fieldType name="boolean" class="solr.BoolField" sortMissingLast="true"
omitNorms="true"/>

        


    
    <fieldType name="integer" class="solr.IntField" omitNorms="true"/>
    <fieldType name="long" class="solr.LongField" omitNorms="true"/>
    <fieldType name="float" class="solr.FloatField" omitNorms="true"/>
    <fieldType name="double" class="solr.DoubleField" omitNorms="true"/>


    
    <fieldType name="sint" class="solr.SortableIntField"
sortMissingLast="true" omitNorms="true"/>
    <fieldType name="slong" class="solr.SortableLongField"
sortMissingLast="true" omitNorms="true"/>
    <fieldType name="sfloat" class="solr.SortableFloatField"
sortMissingLast="true" omitNorms="true"/>
    <fieldType name="sdouble" class="solr.SortableDoubleField"
sortMissingLast="true" omitNorms="true"/>
	

    
    <fieldType name="date" class="solr.DateField" sortMissingLast="true"
omitNorms="true"/>


    
    <fieldType name="random" class="solr.RandomSortField" indexed="true" />

    

    

    
    <fieldType name="text_ws" class="solr.TextField"
positionIncrementGap="100">
      <analyzer>
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      </analyzer>
    </fieldType>

    
    <fieldType name="text" class="solr.TextField"
positionIncrementGap="100">
      <analyzer type="index">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        
        <filter class="solr.StopFilterFactory" ignoreCase="true"
words="stopwords.txt"/>
        <filter class="solr.WordDelimiterFilterFactory"
generateWordParts="1" generateNumberParts="1" catenateWords="1"
catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.KeywordMarkerFilterFactory"
protected="protwords.txt"/>
        <filter class="solr.PorterStemFilterFactory"/>
        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
ignoreCase="true" expand="true"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true"
words="stopwords.txt"/>
        <filter class="solr.WordDelimiterFilterFactory"
generateWordParts="1" generateNumberParts="1" catenateWords="0"
catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.KeywordMarkerFilterFactory"
protected="protwords.txt"/>
        <filter class="solr.PorterStemFilterFactory"/>
        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
      </analyzer>
    </fieldType>


    
    <fieldType name="textTight" class="solr.TextField"
positionIncrementGap="100" >
      <analyzer>
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
ignoreCase="true" expand="false"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true"
words="stopwords.txt"/>
        <filter class="solr.WordDelimiterFilterFactory"
generateWordParts="0" generateNumberParts="0" catenateWords="1"
catenateNumbers="1" catenateAll="0"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.KeywordMarkerFilterFactory"
protected="protwords.txt"/>
        <filter class="solr.EnglishMinimalStemFilterFactory"/>
        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
      </analyzer>
    </fieldType>

    
    <fieldType name="alphaOnlySort" class="solr.TextField"
sortMissingLast="true" omitNorms="true">
      <analyzer>
        
        <tokenizer class="solr.KeywordTokenizerFactory"/>
        
        <filter class="solr.LowerCaseFilterFactory" />
        
        <filter class="solr.TrimFilterFactory" />
        
        <filter class="solr.PatternReplaceFilterFactory"
                pattern="([^a-z])" replacement="" replace="all"
        />
      </analyzer>
    </fieldType>

     
    <fieldtype name="ignored" stored="false" indexed="false"
class="solr.StrField" /> 

 </types>


 <fields>
   

   <field name="id" type="string" indexed="true" stored="true"
required="true" multiValued="false" /> 
   <field name="sku" type="textTight" indexed="true" stored="true"
omitNorms="true"/>
   <field name="name" type="text" indexed="true" stored="true"/>
   <field name="nameSort" type="string" indexed="true" stored="false"/>
   <field name="alphaNameSort" type="alphaOnlySort" indexed="true"
stored="false"/>
   <field name="manu" type="text" indexed="true" stored="true"
omitNorms="true"/>
   <field name="cat" type="text_ws" indexed="true" stored="true"
multiValued="true" omitNorms="true" termVectors="true" />
   <field name="features" type="text" indexed="true" stored="true"
multiValued="true"/>
   <field name="includes" type="text" indexed="true" stored="true"/>

   <field name="weight" type="sfloat" indexed="true" stored="true"/>
   <field name="price"  type="sfloat" indexed="true" stored="true"/>
   
   
   <field name="fileDir" type="text" indexed="true" stored="true" />
	<field name="file" type="text" indexed="true" stored="true" />
    <field name="initials" type="string" indexed="true" stored="true" />
  
   <field name="company" type="text" indexed="true" stored="true" />
   <field name="file_size" type="long" indexed="true" stored="true" />
  
   
   
   
   <field name="title" type="text" indexed="true" stored="true"
multiValued="true"/>
   <field name="subject" type="text_ws" indexed="true" stored="true"/>
   <field name="description" type="text_ws" indexed="true" stored="true" />
   <field name="comments" type="text_ws" indexed="true" stored="true"/>
    <field name="resour" type="text_ws" indexed="true" stored="true"/>
	<field name="creator" type="text_ws" indexed="true" stored="true"/>
   <field name="keywords" type="text_ws" indexed="true" stored="true"/>
   <field name="category" type="text_ws" indexed="true" stored="true"/>
   <field name="resourcename" type="text_ws" indexed="true" stored="true"/>
   <field name="url" type="text_ws" indexed="true" stored="true"/>
   <field name="content_type" type="string" indexed="true" stored="true"
multiValued="true"/>
   <field name="last_modified" type="date" indexed="true" stored="true"/>
   <field name="creation_date" type="date" indexed="true" stored="true"/>
   <field name="links" type="string" indexed="true" stored="true"
multiValued="true"/>
   
    
   <field name="author" type="string" indexed="true" stored="true" />
   
   <field name="fileName" type="string" indexed="true" stored="true" />
   <field name="type" type="string" indexed="true" stored="true"
multiValued="true" />
   <field name="mime" type="string" indexed="true" stored="true" />
   
  <field name="summary" type="text_ws" indexed="true" stored="true" />
   <field name="date_published" type="string" indexed="true" stored="true"
multiValued="false"/>
   
   

   
   <field name="content" type="text_ws" indexed="false" stored="true"
multiValued="true"/>
   <field name="guid" type="text_ws" indexed="true" stored="true"
multiValued="true"/>

   

   
   
   
   
   
   
   
   
   
   
   
   
   <field name="popularity" type="sint" indexed="true" stored="true"
default="0"/>
   <field name="inStock" type="boolean" indexed="true" stored="true"/>

   
   <field name="word" type="string" indexed="true" stored="true"/>

   
   
   <field name="text" type="text" indexed="true" stored="true"
multiValued="true"/>

   
   <field name="manu_exact" type="string" indexed="true" stored="false"/>

   
   <field name="timestamp" type="date" indexed="true" stored="true"
default="NOW" multiValued="false"/>
   

   
   <dynamicField name="*_i"  type="sint"    indexed="true"  stored="true"/>
   <dynamicField name="*_s"  type="string"  indexed="true"  stored="true"/>
   <dynamicField name="*_l"  type="slong"   indexed="true"  stored="true"/>
   <dynamicField name="*_t"  type="text"    indexed="true"  stored="true"/>
   <dynamicField name="*_b"  type="boolean" indexed="true"  stored="true"/>
   <dynamicField name="*_f"  type="sfloat"  indexed="true"  stored="true"/>
   <dynamicField name="*_d"  type="sdouble" indexed="true"  stored="true"/>
   <dynamicField name="*_dt" type="date"    indexed="true"  stored="true"/>
<dynamicField name="metadata_*" type="text" indexed="true" stored="true"
multiValued="false"/>
   <dynamicField name="random*" type="random" />
<dynamicField name="attr_*" type="text_general" indexed="true" stored="true"
multiValued="false"/>
    
   
   
 </fields>

 
 <uniqueKey>id</uniqueKey>

 
 <defaultSearchField>text</defaultSearchField>

 
 <solrQueryParser defaultOperator="OR"/>

  
   <copyField source="id" dest="sku"/>

   <copyField source="cat" dest="text"/>
   <copyField source="name" dest="text"/>
   <copyField source="name" dest="nameSort"/>
   <copyField source="name" dest="alphaNameSort"/>
   <copyField source="manu" dest="text"/>
   <copyField source="features" dest="text"/>
   <copyField source="includes" dest="text"/>

   <copyField source="manu" dest="manu_exact"/>

 
 

</schema>





Re: Solr indexing binary files

Posted by Gora Mohanty <go...@mimirtech.com>.
On 15 March 2013 20:16, Luis <re...@gmail.com> wrote:
>
> Hi Gora, thank you for your reply.  I am not using any commands, I just go
> on the Solr dashboard, db > Dataimport and execute a full-import.

In that case, you are not using the ExtractingRequestHandler, but
using the DataImportHandler, even though you have both handlers
defined.

>
> *My schema.xml looks like this:*
[...]

This cannot be the complete schema.xml, but in any case,
the issue probably does not lie there.

> *My db-data-config.xml looks like this:*
>
> <dataConfig>
>         <dataSource type="JdbcDataSource" driver="com.mysql.jdbc.Driver"
>                      url="jdbc:mysql://localhost:3306/opspedia"
>                      user="username" batchSize="-1" name="mysql" />
>         <dataSource type="BinURLDataSource" name="bin"/>
>
>         <document>
>
>                 <entity onError="skip" name="fileSourcePaths"
> rootEntity="true"
> dataSource="mysql" query="select ID, urlpath from myposts"
>                 deltaImportQuery="SELECT * FROM myposts WHERE id =
> '${dataimporter.delta.id}'"
>                   deltaQuery="SELECT id FROM myposts WHERE last_modified >
> '${dataimporter.last_index_time}'">
>
>                         <entity name="tika-test"
> processor="TikaEntityProcessor" fileName=".*"
> recursive="true" url="${fileSourcePaths.guid}" format="text"
> dataSource="bin" >

Your query on the root entity, fileSourcePaths, only selects ID
and urlpath, but the url attribute in the nested TikaEntityProcessor
refers to ${fileSourcePaths.guid} which has never been selected.
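
A minimal sketch of how the nested entity could reference a column that the
root query actually selects. It assumes that urlpath holds a complete http://
or file:// URL, and it moves the ID mapping to the root entity, which is a
suggestion rather than something confirmed in this thread. The "text" column
is the extracted body that TikaEntityProcessor exposes.

```xml
<!-- Sketch only, not a tested configuration. -->
<entity name="fileSourcePaths" dataSource="mysql" rootEntity="true"
        onError="skip" query="select ID, urlpath from myposts">
  <!-- Assumption: map the SQL ID column to the uniqueKey here, since ID
       comes from the SQL query and not from Tika. -->
  <field column="ID" name="id" />

  <!-- url now refers to urlpath, which the query above selects. -->
  <entity name="tika-test" processor="TikaEntityProcessor"
          dataSource="bin" url="${fileSourcePaths.urlpath}" format="text">
    <field column="Author" name="author" meta="true" />
    <field column="title" name="title" meta="true" />
    <field column="text" name="text" />
  </entity>
</entity>
```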

Regards,
Gora

Re: Solr indexing binary files

Posted by Luis <re...@gmail.com>.
Hi Gora, thank you for your reply.  I am not using any commands, I just go on
the Solr dashboard, db > Dataimport and execute a full-import.

*My schema.xml looks like this:*

<field name="id" type="string" indexed="true" stored="true" required="true"
multiValued="false" /> 
   <field name="sku" type="textTight" indexed="true" stored="true"
omitNorms="true"/>
   <field name="name" type="text" indexed="true" stored="true"/>
   <field name="nameSort" type="string" indexed="true" stored="false"/>
   <field name="alphaNameSort" type="alphaOnlySort" indexed="true"
stored="false"/>
   <field name="manu" type="text" indexed="true" stored="true"
omitNorms="true"/>
   <field name="cat" type="text_ws" indexed="true" stored="true"
multiValued="true" omitNorms="true" termVectors="true" />
   <field name="features" type="text" indexed="true" stored="true"
multiValued="true"/>
   <field name="includes" type="text" indexed="true" stored="true"/>
   <field name="weight" type="sfloat" indexed="true" stored="true"/>
   <field name="price"  type="sfloat" indexed="true" stored="true"/>
   <field name="fileDir" type="text" indexed="true" stored="true" />
	<field name="file" type="text" indexed="true" stored="true" />
    <field name="initials" type="string" indexed="true" stored="true" />
   <field name="company" type="text" indexed="true" stored="true" />
   <field name="file_size" type="long" indexed="true" stored="true" />
  
   
   
   
   <field name="title" type="text" indexed="true" stored="true"
multiValued="true"/>
   <field name="subject" type="text_ws" indexed="true" stored="true"/>
   <field name="description" type="text_ws" indexed="true" stored="true" />
   <field name="comments" type="text_ws" indexed="true" stored="true"/>
    <field name="resour" type="text_ws" indexed="true" stored="true"/>
	<field name="creator" type="text_ws" indexed="true" stored="true"/>
   <field name="keywords" type="text_ws" indexed="true" stored="true"/>
   <field name="category" type="text_ws" indexed="true" stored="true"/>
   <field name="resourcename" type="text_ws" indexed="true" stored="true"/>
   <field name="url" type="text_ws" indexed="true" stored="true"/>
   <field name="content_type" type="string" indexed="true" stored="true"
multiValued="true"/>
   <field name="last_modified" type="date" indexed="true" stored="true"/>
   <field name="creation_date" type="date" indexed="true" stored="true"/>
   <field name="links" type="string" indexed="true" stored="true"
multiValued="true"/>  
   <field name="author" type="string" indexed="true" stored="true" />
   <field name="fileName" type="string" indexed="true" stored="true" />
   <field name="type" type="string" indexed="true" stored="true"
multiValued="true" />
   <field name="mime" type="string" indexed="true" stored="true" />
  <field name="summary" type="text_ws" indexed="true" stored="true" />
   <field name="date_published" type="string" indexed="true" stored="true"
multiValued="false"/>


<dynamicField name="*_i"  type="sint"    indexed="true"  stored="true"/>
   <dynamicField name="*_s"  type="string"  indexed="true"  stored="true"/>
   <dynamicField name="*_l"  type="slong"   indexed="true"  stored="true"/>
   <dynamicField name="*_t"  type="text"    indexed="true"  stored="true"/>
   <dynamicField name="*_b"  type="boolean" indexed="true"  stored="true"/>
   <dynamicField name="*_f"  type="sfloat"  indexed="true"  stored="true"/>
   <dynamicField name="*_d"  type="sdouble" indexed="true"  stored="true"/>
   <dynamicField name="*_dt" type="date"    indexed="true"  stored="true"/>
<dynamicField name="metadata_*" type="text" indexed="true" stored="true"
multiValued="false"/>
   <dynamicField name="random*" type="random" />
<dynamicField name="attr_*" type="text_general" indexed="true" stored="true"
multiValued="false"/>
    
  * <dynamicField name="*" type="text_general" multiValued="true" />*

*My db-data-config.xml looks like this:*

<dataConfig>
  <dataSource type="JdbcDataSource" driver="com.mysql.jdbc.Driver"
              url="jdbc:mysql://localhost:3306/opspedia"
              user="username" batchSize="-1" name="mysql" />
  <dataSource type="BinURLDataSource" name="bin" />

  <document>
    <entity onError="skip" name="fileSourcePaths" rootEntity="true"
            dataSource="mysql"
            query="select ID, urlpath from myposts"
            deltaImportQuery="SELECT * FROM myposts WHERE id = '${dataimporter.delta.id}'"
            deltaQuery="SELECT id FROM myposts WHERE last_modified > '${dataimporter.last_index_time}'">

      <entity name="tika-test" processor="TikaEntityProcessor" fileName=".*"
              recursive="true" url="${fileSourcePaths.guid}" format="text"
              dataSource="bin">
        <field column="ID" name="id" />
        <field column="Author" name="author" meta="true" />
        <field column="Creation-Date" name="date_published" meta="true" />
        <field column="modified" name="last_modified" meta="true" />
        <field column="title" name="title" meta="true" />
        <field column="file_size" name="file_size" meta="true" />
      </entity>
    </entity>
  </document>
</dataConfig>

*In my solrconfig.xml I have this:*

<requestHandler name="/dataimport"
                class="org.apache.solr.handler.dataimport.DataImportHandler">
  <lst name="defaults">
    <str name="config">db-data-config.xml</str>
  </lst>
</requestHandler>

<requestHandler name="/update/extract"
                startup="lazy"
                class="solr.extraction.ExtractingRequestHandler">
  <lst name="defaults">
    <str name="lowernames">true</str>
    <str name="uprefix">metadata_</str>
    <str name="map.Last-Modified">last_modified</str>
    <str name="fmap.content">text</str>
    <str name="fmap.Size">size</str>
    <str name="fmap.Initials">initials</str>
    <str name="fmap.application-name">name</str>
    <str name="fmap.Subject">subject</str>
    <str name="Company">company</str>
    <str name="fmap.Title">title</str>
    <str name="fmap.Comments">comments</str>
    <str name="Words">words</str>
    <str name="Last-Modified-By">last_modified_by</str>
    <str name="captureAttr">true</str>
  </lst>
</requestHandler>

Thank you for your help!





Re: Solr indexing binary files

Posted by Gora Mohanty <go...@mimirtech.com>.
On 15 March 2013 19:28, Luis <re...@gmail.com> wrote:
> Hi Jack, thanks a lot for your reply.  I did that <dynamicField name="*"
> type="text" multiValued="true" />.  However, when I run Solr it gives me a
> bunch of errors.  It actually displays the content of my files on my command
> line and shows some logs like this:
>
> org.apache.solr.common.SolrException: Document is missing mandatory
> uniqueKey field: id
[...]
> I do have a uniqueKey though.  Any ideas what the problem might be?

Please share your schema.xml, and details on the exact
command used to index the PDFs. It is possible that you
are not supplying the literal.id=XXX param that is
needed to provide a uniqueKey for the document. Please
see the "Getting Started with the Solr Example" section at
http://wiki.apache.org/solr/ExtractingRequestHandler

Regards,
Gora

Re: Solr indexing binary files

Posted by Luis <re...@gmail.com>.
Hi Jack, thanks a lot for your reply.  I did that <dynamicField name="*"
type="text" multiValued="true" />.  However, when I run Solr it gives me a
bunch of errors.  It actually displays the content of my files on my command
line and shows some logs like this:

org.apache.solr.common.SolrException: Document is missing mandatory uniqueKey field: id
        at org.apache.solr.update.AddUpdateCommand.getIndexedId(AddUpdateCommand.java:88)
        at org.apache.solr.update.processor.DistributedUpdateProcessor.versionAdd(DistributedUpdateProcessor.java:468)
        at org.apache.solr.update.processor.DistributedUpdateProcessor.processAdd(DistributedUpdateProcessor.java:350)
        at org.apache.solr.update.processor.LogUpdateProcessor.processAdd(LogUpdateProcessorFactory.java:100)
        at org.apache.solr.handler.dataimport.SolrWriter.upload(SolrWriter.java:70)
        at org.apache.solr.handler.dataimport.DataImportHandler$1.upload(DataImportHandler.java:234)
        at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:500)
        at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:404)
        at org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:319)
        at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:227)
        at org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:422)
        at org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:487)
        at org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:468)
15-Mar-2013 9:56:29 AM org.apache.solr.handler.dataimport.DocBuilder execute

I do have a uniqueKey though.  Any ideas what the problem might be?






Re: Solr indexing binary files

Posted by Jack Krupansky <ja...@basetechnology.com>.
Take a look at Solr Cell:

http://wiki.apache.org/solr/ExtractingRequestHandler

Include a dynamicField with a "*" pattern and you will see the wide variety 
of metadata that is available for PDF and other rich document formats.
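
As a rough sketch, such a catch-all declaration in schema.xml could look like
the following. The text_general type is an assumption here: it must be a
field type that your own schema defines, and any analyzed text type would do
for exploration.

```xml
<!-- Sketch only: place inside the <fields> section of schema.xml.
     Explicitly declared fields still take precedence; any other field
     name falls through to this pattern. -->
<dynamicField name="*" type="text_general" indexed="true" stored="true"
              multiValued="true" />
```

With stored="true", the captured metadata shows up in query results, which
makes it easy to see which names Tika produced before adding explicit fields
for the ones you want to keep.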

-- Jack Krupansky
