Posted to solr-user@lucene.apache.org by Anand Kumar Prabhakar <an...@gmail.com> on 2009/07/07 14:41:26 UTC

Loading Data into Solr without HTTP

I'm loading a CSV file into Solr. Because the CSV file contains a huge amount
of data, it takes a very long time to load and sometimes results in an
OutOfMemoryException. Is there any way to read the data from the CSV file and
load it into the Solr database without using "/update/csv" or "/dataimport"?
-- 
View this message in context: http://www.nabble.com/Loading-Data-into-Solr-without-HTTP-tp24372564p24372564.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Loading Data into Solr without HTTP

Posted by Yonik Seeley <yo...@lucidimagination.com>.
Also, make sure you don't have any autocommit rules enabled in solrconfig.xml.
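
One rough way to check for autocommit rules is to parse solrconfig.xml and
list any active <autoCommit> settings. The sketch below assumes the settings
appear as child elements of <autoCommit> (for example maxDocs and maxTime, as
in the example solrconfig.xml). The function name and the sample config are
hypothetical and only there for illustration.

```python
import xml.etree.ElementTree as ET


def find_autocommit_rules(solrconfig_xml):
    """Return active <autoCommit> settings as a dict (empty if none).

    Looks for <autoCommit> anywhere in the document. Commented-out
    blocks are ignored because the default parser drops XML comments.
    """
    root = ET.fromstring(solrconfig_xml)
    rules = {}
    for auto in root.iter("autoCommit"):
        for child in auto:
            rules[child.tag] = (child.text or "").strip()
    return rules


# Hypothetical config, used only to exercise the function.
SAMPLE = """<config>
  <updateHandler class="solr.DirectUpdateHandler2">
    <autoCommit>
      <maxDocs>10000</maxDocs>
      <maxTime>1000</maxTime>
    </autoCommit>
  </updateHandler>
</config>"""

if __name__ == "__main__":
    print(find_autocommit_rules(SAMPLE))
```

An empty result means no autocommit rule is active in the file that was
checked.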

How many documents are in the 400MB CSV file, and how long does it
take to index now?

-Yonik
http://www.lucidimagination.com
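
To answer the two questions above (document count and indexing time), a small
script along these lines may help. It is only a sketch: it assumes the first
line of the CSV file is a header with the field names and that the file is
UTF-8 encoded, and both function names are made up for this example.

```python
import csv


def count_records(csv_path, has_header=True):
    """Count CSV records; a quoted value spanning several lines counts once."""
    with open(csv_path, newline="", encoding="utf-8") as f:
        rows = sum(1 for _ in csv.reader(f))
    # Assumption: the first row holds the field names, not a document.
    return max(rows - 1, 0) if has_header else rows


def docs_per_second(num_docs, elapsed_seconds):
    """Indexing throughput for a load that took elapsed_seconds."""
    return num_docs / elapsed_seconds
```

For example, docs_per_second(count_records("bigfile.csv"), 30 * 60) gives the
rate for a load that took half an hour.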



On Tue, Jul 7, 2009 at 10:03 AM, Anand Kumar
Prabhakar<an...@gmail.com> wrote:
>
> Hi Yonik,
>
> Currently our Schema has very few fields and we don't have any copy fields
> also. Please find the below Schema.xml we are using:
>
> [...]

Re: Loading Data into Solr without HTTP

Posted by Anand Kumar Prabhakar <an...@gmail.com>.
Hi Yonik,

Currently our schema has very few fields, and we don't have any copyFields
either. Please find below the schema.xml we are using:

<?xml version="1.0" encoding="UTF-8" ?>
<schema name="cmps" version="1.1">
  <!-- attribute "name" is the name of this schema and is only used for display purposes.
       Applications should change this to reflect the nature of the search collection.
       version="1.1" is Solr's version number for the schema syntax and semantics.  It should
       not normally be changed by applications.
       1.0: multiValued attribute did not exist, all fields are multiValued by nature
       1.1: multiValued attribute introduced, false by default -->
  <types>
    
  
    <fieldType name="string" class="solr.StrField" sortMissingLast="true" omitNorms="true"/>
    <fieldType name="boolean" class="solr.BoolField" sortMissingLast="true" omitNorms="true"/>

    <fieldType name="integer" class="solr.IntField" omitNorms="true"/>
    <fieldType name="long" class="solr.LongField" omitNorms="true"/>
    <fieldType name="float" class="solr.FloatField" omitNorms="true"/>
    <fieldType name="double" class="solr.DoubleField" omitNorms="true"/>

    <fieldType name="sint" class="solr.SortableIntField" sortMissingLast="true" omitNorms="true"/>
    <fieldType name="slong" class="solr.SortableLongField" sortMissingLast="true" omitNorms="true"/>
    <fieldType name="sfloat" class="solr.SortableFloatField" sortMissingLast="true" omitNorms="true"/>
    <fieldType name="sdouble" class="solr.SortableDoubleField" sortMissingLast="true" omitNorms="true"/>

    <fieldType name="date" class="solr.DateField" sortMissingLast="true" omitNorms="true"/>

    <fieldType name="random" class="solr.RandomSortField" indexed="true" />
   
  
   
    <fieldType name="text_ws" class="solr.TextField" positionIncrementGap="100">
      <analyzer>
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      </analyzer>
    </fieldType>

    <fieldType name="text" class="solr.TextField" positionIncrementGap="100">
      <analyzer type="index">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.StopFilterFactory"
                ignoreCase="true"
                words="stopwords.txt"
                enablePositionIncrements="true"
                />
        <filter class="solr.WordDelimiterFilterFactory"
                generateWordParts="1" generateNumberParts="1" catenateWords="1"
                catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>
        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
                ignoreCase="true" expand="true"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
        <filter class="solr.WordDelimiterFilterFactory"
                generateWordParts="1" generateNumberParts="1" catenateWords="0"
                catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>
        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
      </analyzer>
    </fieldType>
  
    <fieldType name="textTight" class="solr.TextField" positionIncrementGap="100">
      <analyzer>
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
                ignoreCase="true" expand="false"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
        <filter class="solr.WordDelimiterFilterFactory"
                generateWordParts="0" generateNumberParts="0" catenateWords="1"
                catenateNumbers="1" catenateAll="0"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>
        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
      </analyzer>
    </fieldType>

    <fieldType name="textSpell" class="solr.TextField" positionIncrementGap="100">
      <analyzer>
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
      </analyzer>
    </fieldType>
   
    <fieldType name="alphaNumericKeyword" class="solr.TextField"
               sortMissingLast="true" omitNorms="true">
      <analyzer>
        <tokenizer class="solr.KeywordTokenizerFactory"/>
      </analyzer>
    </fieldType>

    <fieldType name="ignored" stored="false" indexed="false" class="solr.StrField" />
    <fieldType name="phNo" class="solr.TextField"
               positionIncrementGap="100" sortMissingLast="true" omitNorms="true">
      <analyzer>
        <tokenizer class="solr.KeywordTokenizerFactory"/>
      </analyzer>
    </fieldType>
    <fieldType name="textStA" class="solr.TextField"
               positionIncrementGap="100" sortMissingLast="true" omitNorms="true">
      <analyzer>
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.StandardFilterFactory"/>
      </analyzer>
    </fieldType>
  </types>
  <fields>
    <field name="cugKey" type="textStA" indexed="true" stored="true"/>
    <field name="bacKey" type="textStA" indexed="true" stored="true"/>
    <field name="assetKey" type="phNo" indexed="true" stored="true"/>
    <field name="contactKey" type="phNo" indexed="true" stored="true"/>
    <field name="sourceSystem" type="textStA" indexed="true" stored="true"/>
    <field name="parentIdFieldName" type="alphaNumericKeyword" indexed="true" stored="true"/>
    <field name="parentIdFieldValue" type="alphaNumericKeyword" indexed="true" stored="true"/>
    <field name="idFieldName" type="alphaNumericKeyword" indexed="true" stored="true"/>
  </fields>

  <defaultSearchField>cugKey</defaultSearchField>

  <solrQueryParser defaultOperator="OR"/>
 
</schema>


Yonik Seeley-2 wrote:
> 
> On Tue, Jul 7, 2009 at 9:14 AM, Anand Kumar
> Prabhakar<an...@gmail.com> wrote:
>> I want to know is there any method to do
>> it much faster, we have overcome the OutOfMemoryException by increasing
>> heap
>> space.
> 
> Optimize your schema - eliminate all unnecessary copyFields and
> default values.  The current example schema is not good for
> performance benchmarking.
> 
> -Yonik
> http://www.lucidimagination.com
> 
> 



Re: Loading Data into Solr without HTTP

Posted by Yonik Seeley <yo...@lucidimagination.com>.
On Tue, Jul 7, 2009 at 9:14 AM, Anand Kumar
Prabhakar<an...@gmail.com> wrote:
> I want to know is there any method to do
> it much faster, we have overcome the OutOfMemoryException by increasing heap
> space.

Optimize your schema - eliminate all unnecessary copyFields and
default values.  The current example schema is not good for
performance benchmarking.

-Yonik
http://www.lucidimagination.com
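
As a quick way to see what could be trimmed, the sketch below lists the
copyField rules and the fields that declare a default value in a schema.xml.
It assumes the usual attribute names (source/dest on copyField, default on
field). The function name and the sample schema are hypothetical; the
"timestamp" and "text" fields are illustrative and not part of the schema
posted in this thread.

```python
import xml.etree.ElementTree as ET


def schema_overhead(schema_xml):
    """Return (copyField pairs, {field name: default value}) for a schema."""
    root = ET.fromstring(schema_xml)
    copy_fields = [
        (cf.get("source"), cf.get("dest")) for cf in root.iter("copyField")
    ]
    defaults = {
        f.get("name"): f.get("default")
        for f in root.iter("field")
        if f.get("default") is not None
    }
    return copy_fields, defaults


# Hypothetical schema fragment, used only to exercise the function.
SAMPLE_SCHEMA = """<schema name="example" version="1.1">
  <fields>
    <field name="cugKey" type="string" indexed="true" stored="true"/>
    <field name="text" type="text" indexed="true" stored="false"/>
    <field name="timestamp" type="date" indexed="true" stored="true" default="NOW"/>
  </fields>
  <copyField source="cugKey" dest="text"/>
</schema>"""

if __name__ == "__main__":
    print(schema_overhead(SAMPLE_SCHEMA))
```

Two empty results mean the schema has no copyFields or default values left to
remove.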

Re: Loading Data into Solr without HTTP

Posted by Anand Kumar Prabhakar <an...@gmail.com>.
Thank you for the reply, Yonik. I have already tried with smaller CSV files.
Currently we are trying to load a CSV file of 400 MB, but this is taking too
much time (more than half an hour). I want to know whether there is any method
to do it much faster. We have overcome the OutOfMemoryException by increasing
the heap space.

Please suggest.



Yonik Seeley-2 wrote:
> 
> On Tue, Jul 7, 2009 at 8:41 AM, Anand Kumar
> Prabhakar<an...@gmail.com> wrote:
>> Is there any way so that we can read the data from the
>> CSV file and load it into the Solr database without using "/update/csv"
> 
> That *is* the right way to load a CSV file into Solr.
> How many records are in the CSV file, and how much heap are you giving the
> JVM?
> Try a small CSV file first to make sure that it's being parsed
> correctly... for example, do a
> 
> head -1000 bigfile.csv > smallfile.csv
> 
> Now upload that and inspect the documents by querying Solr to ensure
> that everything imported as expected.
> 
> -Yonik
> http://www.lucidimagination.com
> 
> 



Re: Loading Data into Solr without HTTP

Posted by Yonik Seeley <yo...@lucidimagination.com>.
On Tue, Jul 7, 2009 at 8:41 AM, Anand Kumar
Prabhakar<an...@gmail.com> wrote:
> Is there any way so that we can read the data from the
> CSV file and load it into the Solr database without using "/update/csv"

That *is* the right way to load a CSV file into Solr.
How many records are in the CSV file, and how much heap are you giving the JVM?
Try a small CSV file first to make sure that it's being parsed
correctly... for example, do a

head -1000 bigfile.csv > smallfile.csv

Now upload that and inspect the documents by querying Solr to ensure
that everything imported as expected.

-Yonik
http://www.lucidimagination.com
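
The small-file test suggested above could be scripted roughly as follows. This
is a sketch under assumptions: the Solr base URL (http://localhost:8983/solr)
and both function names are placeholders, and the request is only built here,
not sent, so nothing touches a server until urlopen is called.

```python
import itertools
import urllib.request


def write_sample(src_path, dst_path, max_lines=1000):
    """Copy the first max_lines lines, like `head -1000 bigfile.csv > smallfile.csv`."""
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        lines = list(itertools.islice(src, max_lines))
        dst.writelines(lines)
    return len(lines)


def build_csv_update_request(csv_path, solr_url="http://localhost:8983/solr"):
    """Build a POST to /update/csv that commits once the file is loaded.

    Reads the whole file into memory, so it is meant for the small
    sample file, not for the full 400 MB file.
    """
    with open(csv_path, "rb") as f:
        body = f.read()
    return urllib.request.Request(
        solr_url + "/update/csv?commit=true",  # assumed base URL
        data=body,
        headers={"Content-Type": "text/plain; charset=utf-8"},
    )
```

Usage would be something like write_sample("bigfile.csv", "smallfile.csv")
followed by urllib.request.urlopen(build_csv_update_request("smallfile.csv")),
after which the imported documents can be inspected by querying Solr.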