Posted to solr-user@lucene.apache.org by Mark N <ni...@gmail.com> on 2010/01/05 10:05:46 UTC
Indexing large text documents
SolrInputDocument doc1 = new SolrInputDocument();
doc1.addField( "Fulltext", strContent);
strContent is a string variable that holds the contents of a text file
(assume the text file is located at c:\files\abc.txt).

In my case abc.txt (the text files) can be very large, ~2 GB, so it is not
always possible to read them into a string variable before indexing.
Can anyone suggest a better approach to indexing these huge text files?
--
Nipen Mark
Re: Indexing large text documents
Posted by Grant Ingersoll <gs...@apache.org>.
I haven't tried it, but you might be able to use either of the following (and this is just me thinking aloud):

- DataImportHandler with the FileEntityProcessor
- Remote streaming (you might have to write out Solr XML or do something else)
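For the DataImportHandler route, a data-config.xml along these lines might work. This is an untested sketch: in Solr 1.4 the processors are actually named FileListEntityProcessor and PlainTextEntityProcessor, and the baseDir, file pattern, and field names here are assumptions matched to the original question.

```xml
<dataConfig>
  <dataSource type="FileDataSource" encoding="UTF-8"/>
  <document>
    <!-- FileListEntityProcessor lists files on disk; rootEntity="false"
         so each file (not the listing) becomes a document -->
    <entity name="files" processor="FileListEntityProcessor"
            baseDir="c:/files" fileName=".*\.txt" rootEntity="false">
      <!-- PlainTextEntityProcessor reads a file's content into the
           "plainText" column, mapped here onto the Fulltext field -->
      <entity name="text" processor="PlainTextEntityProcessor"
              url="${files.fileAbsolutePath}" format="text">
        <field column="plainText" name="Fulltext"/>
      </entity>
    </entity>
  </document>
</dataConfig>
```

Note that this only moves the file reading from your client to the Solr server; whether DIH streams the content or buffers the whole file in memory would need checking against its source. For the remote-streaming route, enabling enableRemoteStreaming in solrconfig.xml and passing stream.file=... to the update handler similarly avoids pushing the content over the wire from the client.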
-Grant
On Jan 5, 2010, at 4:05 AM, Mark N wrote:
> [...]
--------------------------
Grant Ingersoll
http://www.lucidimagination.com/
Search the Lucene ecosystem using Solr/Lucene: http://www.lucidimagination.com/search
Re: Indexing large text documents
Posted by Glen Newton <gl...@gmail.com>.
(In Lucene) I break the document into smaller pieces, then add each
piece to the Document's field in a loop. This seems to work better, but
it will interfere with analysis features such as term offsets.
This approach should work in your case.
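The chunk-in-a-loop idea can be sketched roughly like this. It is a minimal sketch, not tested against a 2 GB file; the chunk size and the "Fulltext" field name in the comments are assumptions, and the Lucene calls are shown only as comments so the reading part stays self-contained.

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

public class ChunkedIndexer {

    // Read a large text file in fixed-size chunks so the whole
    // content never has to live in one giant String at once.
    public static List<String> readChunks(String path, int chunkSize) throws IOException {
        List<String> chunks = new ArrayList<String>();
        BufferedReader reader = new BufferedReader(new FileReader(path));
        try {
            char[] buf = new char[chunkSize];
            int read;
            while ((read = reader.read(buf, 0, chunkSize)) != -1) {
                chunks.add(new String(buf, 0, read));
            }
        } finally {
            reader.close();
        }
        return chunks;
    }

    public static void main(String[] args) throws IOException {
        // Each chunk would then be added to the same field in a loop,
        // e.g. in Lucene (untested sketch, field name assumed):
        //   for (String chunk : readChunks("c:/files/abc.txt", 1 << 20)) {
        //       doc.add(new Field("Fulltext", chunk, Field.Store.NO, Field.Index.ANALYZED));
        //   }
        for (String chunk : readChunks(args[0], 1 << 20)) {
            System.out.println(chunk.length());
        }
    }
}
```

A fully streaming variant would add chunks as they are read instead of collecting them in a list, at the cost of keeping the Document open while reading.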
In Lucene, you can also add the field using a Reader to the file in question:
http://lucene.apache.org/java/2_2_0/api/org/apache/lucene/document/Field.html#Field%28java.lang.String,%20java.io.Reader%29
I haven't looked at the source, so I am not sure if it handles very
large files in a scalable fashion...
-Glen
http://zzzoot.blogspot.com/
2010/1/5 Mark N <ni...@gmail.com>:
> [...]