Posted to solr-user@lucene.apache.org by Bing Hua <bh...@cornell.edu> on 2012/08/31 00:28:44 UTC

Send plain text file to solr for indexing

Hello,

I used to use SolrCell, which has built-in Tika support to handle both
extraction and indexing of raw documents. I now have another text extraction
provider that converts each raw document to a plain-text .txt file, so I want
Solr to bypass the extraction phase. Is there a way to send the plain .txt
file to Solr and simply index it as a full-text field, without doing
extraction on that file?

Thanks,
Bing



--
View this message in context: http://lucene.472066.n3.nabble.com/Send-plain-text-file-to-solr-for-indexing-tp4004515.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: Send plain text file to solr for indexing

Posted by Jack Krupansky <ja...@basetechnology.com>.
There's no need to read the entire text file into memory. I mean, each 
line/entry stands on its own and can be sent to Solr to be indexed by 
itself.
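
To illustrate the point, here is a minimal Python sketch that reads a file lazily and groups its entries into small batches, so the whole file is never held in memory. It assumes one entry per line, which may not match a one-document-per-file setup. The field names `link` and `text` are borrowed from the schema posted elsewhere in this thread; the function names and the `#lineno` id scheme are hypothetical.

```python
import json
from itertools import islice


def iter_docs(path, id_prefix, id_field="link", text_field="text"):
    """Yield one Solr document per non-empty line, reading the file lazily."""
    with open(path, encoding="utf-8") as fh:
        for lineno, line in enumerate(fh, start=1):
            line = line.rstrip("\r\n")
            if line.strip():
                # Hypothetical id scheme: "<prefix>#<line number>"
                yield {id_field: f"{id_prefix}#{lineno}", text_field: line}


def batched(docs, size):
    """Group an iterator of documents into lists of at most `size` items."""
    it = iter(docs)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


def to_payload(batch):
    """Serialize one batch as a JSON array of documents (UTF-8 bytes)."""
    return json.dumps(batch).encode("utf-8")
```

Each payload could then be posted to Solr's update handler on its own; only one batch is in memory at a time.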

-- Jack Krupansky

-----Original Message----- 
From: Bing Hua
Sent: Friday, August 31, 2012 11:56 AM
To: solr-user@lucene.apache.org
Subject: Re: Send plain text file to solr for indexing

Thanks, Mr. Yagami. I'll look into that.

Jack, the latter two options both require reading the entire text file
into memory, right?

Bing





Re: Send plain text file to solr for indexing

Posted by Ahmet Arslan <io...@yahoo.com>.
> Thanks, Mr. Yagami. I'll look into that.

Hi Bing,

You can use this data-config.xml to index txt files on disk. First, add these fields to schema.xml:

<field name="link" type="string" indexed="true" stored="true" multiValued="false" required="true"/>
<field name="fileLastModified" type="string" indexed="true" stored="true"/>
<field name="text" type="text" indexed="true" stored="true"/>

<uniqueKey>link</uniqueKey>

Then use this as data-config.xml:

<dataConfig>
    <dataSource type="FileDataSource" encoding="UTF-8" name="fds"/>
    <document>
        <entity name="f" processor="FileListEntityProcessor" fileName=".*txt" baseDir="/Volumes/data/Documents" recursive="true" rootEntity="false" dataSource="null">
            <!-- The implicit fields generated by the FileListEntityProcessor are fileDir, file, fileAbsolutePath, fileSize, fileLastModified -->
            <field column="fileLastModified" name="fileLastModified"/>
            <field column="fileAbsolutePath" name="link"/>
            <entity processor="PlainTextEntityProcessor" name="x" url="${f.fileAbsolutePath}" dataSource="fds" rootEntity="true">
                <!-- Copies the text to a field called 'text' in Solr -->
                <field column="plainText" name="text"/>
            </entity>
        </entity>
    </document>
</dataConfig>
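
Once a Data Import Handler is registered in solrconfig.xml with this config, an import is normally started with an HTTP request. The Python sketch below builds such a request; it assumes the handler is mapped to `/dataimport` on a local Solr at `http://localhost:8983/solr`. Neither the path nor the host comes from this thread, so adjust both to your setup. The function names are hypothetical.

```python
from urllib.parse import urlencode
from urllib.request import urlopen


def dih_url(base="http://localhost:8983/solr", handler="/dataimport",
            command="full-import", clean=False):
    """Build a Data Import Handler request URL (base and handler are assumptions)."""
    params = {
        "command": command,           # e.g. full-import or status
        "clean": str(clean).lower(),  # false keeps existing documents in the index
        "commit": "true",             # commit when the import finishes
    }
    return f"{base}{handler}?{urlencode(params)}"


def run_import(url):
    """Send the request and return Solr's raw response body as text."""
    with urlopen(url) as resp:
        return resp.read().decode("utf-8")
```

Calling `run_import(dih_url())` would start the import; `dih_url(command="status")` gives a URL for checking progress.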

Re: Send plain text file to solr for indexing

Posted by Bing Hua <bh...@cornell.edu>.
Thanks, Mr. Yagami. I'll look into that.

Jack, the latter two options both require reading the entire text file
into memory, right?

Bing




Re: Send plain text file to solr for indexing

Posted by Jack Krupansky <ja...@basetechnology.com>.
Data Import Handler can also be used to ingest plain text files.

Or you can use SolrJ and write your own code to process the text files 
yourself and add their content to the desired field.

Or write a script in Python or some other scripting language to form a 
SolrXML/JSON/CSV wrapper around your plain text.
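
As a rough illustration of the scripting option, the Python sketch below wraps one plain-text file in a JSON update body, so the text is sent as-is with no extraction step. The field names `link` and `text` follow the schema posted elsewhere in this thread. The update URL is an assumption and differs between Solr versions (older releases expose JSON updates at `/update/json`), and the function names are hypothetical.

```python
import json
import urllib.request
from pathlib import Path


def build_update_payload(path, id_field="link", text_field="text"):
    """Wrap one plain-text file in a JSON array holding a single Solr document."""
    p = Path(path)
    doc = {
        id_field: str(p),                            # file path doubles as the unique key
        text_field: p.read_text(encoding="utf-8"),   # file content, unmodified
    }
    return json.dumps([doc]).encode("utf-8")


def post_to_solr(payload,
                 update_url="http://localhost:8983/solr/update?commit=true"):
    """POST the payload to Solr; update_url is an assumed local endpoint."""
    req = urllib.request.Request(
        update_url,
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return resp.status
```

Note that `build_update_payload` loads one file fully into memory, which is usually acceptable for a single document's text; `json.dumps` takes care of escaping, so characters such as `<` and `&` need no special handling.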

-- Jack Krupansky

-----Original Message----- 
From: Bing Hua
Sent: Friday, August 31, 2012 10:37 AM
To: solr-user@lucene.apache.org
Subject: Re: Send plain text file to solr for indexing

So in order to use SolrCell I'll have to add a number of dependent libraries,
which is one of the things I'm trying to avoid. The second thing is, SolrCell
still parses the plain text files, and I don't want it to make any changes to
my exported files.

Any ideas?
Bing





Re: Send plain text file to solr for indexing

Posted by Ahmet Arslan <io...@yahoo.com>.
> So in order to use SolrCell I'll have to add a number of dependent
> libraries, which is one of the things I'm trying to avoid. The second
> thing is, SolrCell still parses the plain text files, and I don't want
> it to make any changes to my exported files.


You can index plain text files without writing any Java code:
http://wiki.apache.org/solr/DataImportHandler#PlainTextEntityProcessor

Re: Send plain text file to solr for indexing

Posted by Bing Hua <bh...@cornell.edu>.
So in order to use SolrCell I'll have to add a number of dependent libraries,
which is one of the things I'm trying to avoid. The second thing is, SolrCell
still parses the plain text files, and I don't want it to make any changes to
my exported files.

Any ideas?
Bing




Re: Send plain text file to solr for indexing

Posted by Jack Krupansky <ja...@basetechnology.com>.
SolrCell can be used to index raw text files as well.

-- Jack Krupansky

-----Original Message----- 
From: Bing Hua
Sent: Thursday, August 30, 2012 6:28 PM
To: solr-user@lucene.apache.org
Subject: Send plain text file to solr for indexing

Hello,

I used to use SolrCell, which has built-in Tika support to handle both
extraction and indexing of raw documents. I now have another text extraction
provider that converts each raw document to a plain-text .txt file, so I want
Solr to bypass the extraction phase. Is there a way to send the plain .txt
file to Solr and simply index it as a full-text field, without doing
extraction on that file?

Thanks,
Bing


