Posted to solr-user@lucene.apache.org by Mark N <ni...@gmail.com> on 2010/01/05 10:05:46 UTC

Indexing large text documents

SolrInputDocument doc1 = new SolrInputDocument();
 doc1.addField( "Fulltext", strContent);

strContent is a string variable which contains the contents of a text file.
(assume that the text file is located at c:\files\abc.txt)

In my case abc.txt (the text file) could be very large, ~2 GB, so it is not
always possible to read it into a string variable before indexing.
Can anyone suggest a better approach for indexing these huge text files?



-- 
Nipen Mark

Re: Indexing large text documents

Posted by Grant Ingersoll <gs...@apache.org>.
I haven't tried it, but you might be able to use either (and this is just me thinking aloud):
DataImportHandler with the FileEntityProcessor
Remote Streaming - (you might have to write out Solr XML or do something else)
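
For the DataImportHandler route, a data-config.xml along these lines might work. This is an untested sketch from memory: Solr 1.4's DataImportHandler ships a FileListEntityProcessor for walking a directory and a PlainTextEntityProcessor for reading a file's text into a single field, but the exact processor and attribute names here are assumptions to verify against the Solr wiki.

```xml
<!-- Sketch only: walk c:\files and read each .txt file's content
     into the Fulltext field. Processor and attribute names are
     assumptions; check the DataImportHandler wiki before use. -->
<dataConfig>
  <dataSource name="fds" type="FileDataSource" encoding="UTF-8"/>
  <document>
    <entity name="files" processor="FileListEntityProcessor"
            baseDir="c:/files" fileName=".*\.txt" rootEntity="false">
      <entity name="text" processor="PlainTextEntityProcessor"
              url="${files.fileAbsolutePath}" dataSource="fds">
        <field column="plainText" name="Fulltext"/>
      </entity>
    </entity>
  </document>
</dataConfig>
```

Note that, as far as I can tell, this still materializes each file's full text on the Solr side, so it sidesteps the client-side string but not server memory use.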

-Grant


On Jan 5, 2010, at 4:05 AM, Mark N wrote:

> SolrInputDocument doc1 = new SolrInputDocument();
> doc1.addField( "Fulltext", strContent);
> 
> strContent is a string variable which  contains  contents of  text file.
> ( assume that text file is located in c:\files\abc.txt )
> 
> In my case abc.text  ( text files ) could be very huge ~ 2 GB so it is not
> always possible to read and store them into string variables
> before indexing . Can anyone suggest what should be better approach to index
> these huge text files ?
> 
> 
> 
> -- 
> Nipen Mark

--------------------------
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem using Solr/Lucene: http://www.lucidimagination.com/search


Re: Indexing large text documents

Posted by Glen Newton <gl...@gmail.com>.
(In Lucene) I break the document into smaller pieces, then add each
piece to the Document field in a loop. This seems to work better, but
it will affect analysis details such as term offsets.
This should work in your example.
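
Glen's chunking idea can be sketched roughly as below, using only java.io so the 2 GB file is never held in memory at once. The 64 KB chunk size is an illustrative assumption, and collecting chunks into a list here stands in for the real indexing step, where each chunk would instead go to doc.addField("Fulltext", chunk) on the SolrInputDocument.

```java
import java.io.*;
import java.util.*;

public class ChunkedIndexer {
    static final int CHUNK_SIZE = 64 * 1024; // illustrative chunk size

    // Read the file in fixed-size character chunks so the whole
    // file never has to fit in a single String.
    static List<String> readChunks(File file) throws IOException {
        List<String> chunks = new ArrayList<String>();
        Reader reader = new BufferedReader(new FileReader(file));
        try {
            char[] buf = new char[CHUNK_SIZE];
            int n;
            while ((n = reader.read(buf)) != -1) {
                // In the real indexer each piece would be added to the
                // document here, e.g. doc.addField("Fulltext", chunk);
                chunks.add(new String(buf, 0, n));
            }
        } finally {
            reader.close();
        }
        return chunks;
    }
}
```

As Glen notes, splitting one logical field across multiple values will change term positions/offsets relative to indexing the text as one value.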

In Lucene, you can also add the field using a Reader to the file in question:
http://lucene.apache.org/java/2_2_0/api/org/apache/lucene/document/Field.html#Field%28java.lang.String,%20java.io.Reader%29

I haven't looked at the source, so I am not sure if it handles very
large files in a scalable fashion...

-Glen
http://zzzoot.blogspot.com/

2010/1/5 Mark N <ni...@gmail.com>:
> SolrInputDocument doc1 = new SolrInputDocument();
>  doc1.addField( "Fulltext", strContent);
>
> strContent is a string variable which  contains  contents of  text file.
> ( assume that text file is located in c:\files\abc.txt )
>
> In my case abc.text  ( text files ) could be very huge ~ 2 GB so it is not
> always possible to read and store them into string variables
> before indexing . Can anyone suggest what should be better approach to index
> these huge text files ?
>
>
>
> --
> Nipen Mark
>


