Posted to solr-user@lucene.apache.org by KennyN <Ke...@sparta.com> on 2009/03/10 17:32:40 UTC

Solr configuration with Text files

I am trying to figure out how to configure Solr. I have worked with the
example and have been reading over the wiki, but I am having some difficulty
figuring out how to set up this simple scenario:

Index a large number of text files (they aren't the CSV files Solr can ingest
directly) that are named using an id. I have the schema.xml set up like this:
   <field name="id"    type="string" indexed="true" stored="true" required="true"/>
   <field name="subname" type="string" indexed="true" stored="true"/>
   <field name="content" type="text"   indexed="true" stored="false"/>

I want to be able to define the id and subname fields in an xml file like
this:
<add>
<doc>
  <field name="id">0000110121212000232300002323</field>
  <field name="subname">0000110121212000232300002323_SYSTEM_OUT_0000.data</field>
</doc>
</add>

But I want the 'content' field for the entry to be filled in with the
contents of one of these id-named files. I was looking into setting up a
DataImportHandler, but it looks like it is targeted at database
use (except for the PlainTextEntityProcessor, which isn't available in the
latest release [1.3]). What is the best way to go about doing this in Solr?
-- 
View this message in context: http://www.nabble.com/Solr-configuration-with-Text-files-tp22438201p22438201.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Solr configuration with Text files

Posted by KennyN <Ke...@sparta.com>.
This functionality is possible 'out of the box', right? Or am I going to need
to code up something that reads in the id named files and generates the xml
file?


Re: Solr configuration with Text files

Posted by Pravin Paratey <pr...@gmail.com>.
AFAIK, you're going to have to code something up. Remember to wrap the file
contents in CDATA sections in your XML.
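
For illustration, here is a minimal Python sketch of such a generator. It
assumes every file in a directory is named by its id and that a mapping from
id to subname is available; the function name build_add_message and the
subnames argument are made up for this example. Instead of CDATA sections it
relies on ElementTree to escape the text, which also copes with content that
contains "]]>".

```python
import os
import xml.etree.ElementTree as ET


def build_add_message(docs_dir, subnames):
    """Build a Solr <add> message with one <doc> per file in docs_dir.

    docs_dir -- directory whose files are named by their id (assumption)
    subnames -- dict mapping an id to its subname value (hypothetical input)
    """
    add = ET.Element("add")
    for file_id in sorted(os.listdir(docs_dir)):
        path = os.path.join(docs_dir, file_id)
        if not os.path.isfile(path):
            continue  # skip sub-directories
        # Undecodable bytes are replaced rather than aborting the whole run.
        with open(path, encoding="utf-8", errors="replace") as handle:
            content = handle.read()

        doc = ET.SubElement(add, "doc")
        values = [("id", file_id), ("content", content)]
        if file_id in subnames:
            values.insert(1, ("subname", subnames[file_id]))
        for name, value in values:
            field = ET.SubElement(doc, "field", name=name)
            field.text = value  # escaped by ElementTree on serialization
    return ET.tostring(add, encoding="unicode")
```

The returned string can be written to a file and posted to Solr's /update
handler the same way as the example XML documents. For a very large number of
files it would be better to write the documents out in batches rather than
build one huge message in memory.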

On Tue, Mar 10, 2009 at 11:31 PM, KennyN <Ke...@sparta.com> wrote:

>
> This functionality is possible 'out of the box', right? Or am I going to
> need
> to code up something that reads in the id named files and generates the xml
> file?

Re: Solr configuration with Text files

Posted by Shalin Shekhar Mangar <sh...@gmail.com>.
On Wed, Mar 11, 2009 at 8:32 PM, KennyN <Ke...@sparta.com> wrote:

>
> Thanks for the responses guys!
>
> I looked around the wiki for an example of using DataImportHandler to
> iterate over a list of files and read the content into a field and didn't
> find anything. I agree it would be useful!
>

With 1.3 you still have the FileListEntityProcessor and the
FileDataSource/XPathEntityProcessor, with which you can iterate over the XML
files you described above. You'll need to write a Transformer that takes
the value of the 'id' field, reads the contents of the file with that name,
and stores them in the 'content' field.

With 1.4 trunk you can remove the need for the custom Transformer and use
FileDataSource/PlainTextEntityProcessor instead.

-- 
Regards,
Shalin Shekhar Mangar.

Re: Solr configuration with Text files

Posted by KennyN <Ke...@sparta.com>.
Thanks for the responses guys! 

I looked around the wiki for an example of using DataImportHandler to
iterate over a list of files and read the content into a field and didn't
find anything. I agree it would be useful!



Erik Hatcher wrote:
> 
> Using Solr Cell (ExtractingRequestHandler) which is now built into  
> trunk, and thus an eventual Solr 1.4 release, indexing a directory of  
> text (or even Word, PDF, etc) files is mostly 'out of the box'.
> 
> It still requires scripting an iteration over all files and sending
> them.  Here's an example of doing that scripting using Ant and the
> ant-contrib <for> and <post> tasks:
> 
>    <target name="index-docs" description="Index documents">
>      <for param="filename">
>        <fileset dir="${docs.dir}"/>
>        <sequential>
>          <echo>Processing @{filename}</echo>
> 
>          <post to="${solr.url}/update/extract" verbose="false" failonerror="true">
>            <prop name="stream.file" value="@{filename}"/>
>            <prop name="ext.resource.name" value="@{filename}"/>
>            <prop name="ext.idx.attr" value="false"/>
>            <prop name="ext.ignore.und.fl" value="true"/>
> 
>            <prop name="ext.literal.id" value="@{filename}"/>
>            <prop name="ext.def.fl" value="text"/>
>            <prop name="ext.map.title" value="title"/>
>            <prop name="wt" value="ruby"/>
>          </post>
>        </sequential>
>      </for>
>    </target>
> 
> And it should also be possible, and perhaps slightly easier and more
> built-in, to do the entire iteration using DataImportHandler's ability to
> iterate over a list of files and read their contents into a field.
> [an example of this on the wiki would be handy, or a pointer to it if
> it doesn't already exist]
> 
> 	Erik
> 
> 
> On Mar 10, 2009, at 2:01 PM, KennyN wrote:
> 
>>
>> This functionality is possible 'out of the box', right? Or am I  
>> going to need
>> to code up something that reads in the id named files and generates  
>> the xml
>> file?


Re: Solr configuration with Text files

Posted by Erik Hatcher <er...@ehatchersolutions.com>.
Using Solr Cell (ExtractingRequestHandler) which is now built into  
trunk, and thus an eventual Solr 1.4 release, indexing a directory of  
text (or even Word, PDF, etc) files is mostly 'out of the box'.

It still requires scripting an iteration over all files and sending
them.  Here's an example of doing that scripting using Ant and the
ant-contrib <for> and <post> tasks:

   <target name="index-docs" description="Index documents">
     <for param="filename">
       <fileset dir="${docs.dir}"/>
       <sequential>
         <echo>Processing @{filename}</echo>

         <post to="${solr.url}/update/extract" verbose="false" failonerror="true">
           <prop name="stream.file" value="@{filename}"/>
           <prop name="ext.resource.name" value="@{filename}"/>
           <prop name="ext.idx.attr" value="false"/>
           <prop name="ext.ignore.und.fl" value="true"/>

           <prop name="ext.literal.id" value="@{filename}"/>
           <prop name="ext.def.fl" value="text"/>
           <prop name="ext.map.title" value="title"/>
           <prop name="wt" value="ruby"/>
         </post>
       </sequential>
     </for>
   </target>
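
For anyone not using Ant, the same loop can be sketched in Python. This is
only an illustration: the function names extract_url and index_docs are made
up here, the parameter names are copied from the Ant target above and may
differ in other Solr versions, and it assumes Solr can read the files through
stream.file (remote streaming enabled and the paths visible to the server).

```python
import os
from urllib.parse import urlencode
from urllib.request import urlopen


def extract_url(solr_url, filename):
    """Return the /update/extract URL for one file.

    The parameters mirror the <prop> elements of the Ant target above.
    """
    params = [
        ("stream.file", filename),
        ("ext.resource.name", filename),
        ("ext.idx.attr", "false"),
        ("ext.ignore.und.fl", "true"),
        ("ext.literal.id", filename),
        ("ext.def.fl", "text"),
        ("ext.map.title", "title"),
        ("wt", "ruby"),
    ]
    return solr_url.rstrip("/") + "/update/extract?" + urlencode(params)


def index_docs(solr_url, docs_dir):
    """Send every file under docs_dir to Solr, one request per file."""
    for root, _dirs, files in os.walk(docs_dir):
        for name in sorted(files):
            filename = os.path.abspath(os.path.join(root, name))
            print("Processing", filename)
            # urlopen raises HTTPError on an error response, which stops the
            # run much like failonerror="true" does in the Ant version.
            with urlopen(extract_url(solr_url, filename)) as response:
                response.read()
```

Calling index_docs with the Solr base URL and the documents directory
corresponds to running the index-docs target with solr.url and docs.dir set.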

And it should also be possible, and perhaps slightly easier and more
built-in, to do the entire iteration using DataImportHandler's ability to
iterate over a list of files and read their contents into a field.
[an example of this on the wiki would be handy, or a pointer to it if
it doesn't already exist]

	Erik


On Mar 10, 2009, at 2:01 PM, KennyN wrote:

>
> This functionality is possible 'out of the box', right? Or am I  
> going to need
> to code up something that reads in the id named files and generates  
> the xml
> file?

