Posted to solr-user@lucene.apache.org by mitra <mi...@ornext.com> on 2012/11/05 06:41:41 UTC

Splitting index created from a csv using solr

Hello all

I have a CSV file of size 10 GB which I have to index using Solr.

My question is: how do I index the CSV in such a way that I get two
separate indexes, where one index covers the first half of the CSV and
the other covers the second half?


Also, coming to index settings, what would be the optimal values of the
auto-commit maxDocs and maxTime settings for the 10 GB CSV file? It has
around 28 million records.




Re: Splitting index created from a csv using solr

Posted by Gora Mohanty <go...@mimirtech.com>.
On 6 November 2012 10:52, mitra <mi...@ornext.com> wrote:
>
> Thanks for the reply, Gora.
>
> I just wanted to know whether Solr could do it by itself; from your answer
> I can see that it is not possible.


Yes, that is right: this is not a common use case.

> So what do you think is the best way to split it? I mean, should I use Luke
> to split the index, or should I split the CSV and index it?
[...]

Walter already covered that: it is better to split the CSV.

What OS are you using? I am not familiar with
Windows, but I am sure there are tools that do
the equivalent of split. You would have better
luck asking about that elsewhere.

Regards,
Gora

Re: Splitting index created from a csv using solr

Posted by mitra <mi...@ornext.com>.
Thanks for the reply, Gora.

I just wanted to know whether Solr could do it by itself; from your answer
I can see that it is not possible.

So what do you think is the best way to split it? I mean, should I use Luke
to split the index, or should I split the CSV and index it?

@Walter

Thank you, sir; I don't have a Unix environment, though.




Re: Splitting index created from a csv using solr

Posted by Walter Underwood <wu...@wunderwood.org>.
I would use the Unix "split" command. You can give it a line count.

% split -l 14000000 myfile.csv

You can use "wc -l" to count the lines.
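
One caveat: if the CSV has a header row, plain split leaves it only in the
first chunk. A minimal sketch that keeps the header in both halves — the
names header.csv, body_, half1.csv, and half2.csv are hypothetical, and
split names its chunks body_aa and body_ab by default:

% head -n 1 myfile.csv > header.csv
% tail -n +2 myfile.csv | split -l 14000000 - body_
% cat header.csv body_aa > half1.csv
% cat header.csv body_ab > half2.csv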

wunder

On Nov 4, 2012, at 10:23 PM, Gora Mohanty wrote:

> On 5 November 2012 11:11, mitra <mi...@ornext.com> wrote:
> 
>> Hello all
>> 
>> I have a CSV file of size 10 GB which I have to index using Solr.
>> 
>> My question is: how do I index the CSV in such a way that I get two
>> separate indexes, where one index covers the first half of the CSV and
>> the other covers the second half?
>> 
> 
> I do not think that there is any automatic way to do that in Solr.
> Could you not split the CSV file yourself, and index different
> halves of it to different Solr indices?
> 
> 
>> 
>> 
>> Also, coming to index settings, what would be the optimal values of the
>> auto-commit maxDocs and maxTime settings for the 10 GB CSV file? It has
>> around 28 million records.
>> 
> 
> That would depend on various local factors, like how much RAM
> you have to give to Solr, network speed, etc. The best way would
> be to experiment with these settings. Usually, your goal should be
> to minimise auto-commits, so you can try setting these numbers
> to high values. You could also disable auto-commit altogether, and
> do manual commits.
> 
> Given your data size, I think that the indexing should be quite fast
> on reasonable hardware.
> 
> Regards,
> Gora





Re: Splitting index created from a csv using solr

Posted by Gora Mohanty <go...@mimirtech.com>.
On 5 November 2012 11:11, mitra <mi...@ornext.com> wrote:

> Hello all
>
> I have a CSV file of size 10 GB which I have to index using Solr.
>
> My question is: how do I index the CSV in such a way that I get two
> separate indexes, where one index covers the first half of the CSV and
> the other covers the second half?
>

I do not think that there is any automatic way to do that in Solr.
Could you not split the CSV file yourself, and index different
halves of it to different Solr indices?
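
For example, a minimal sketch of that approach, assuming a multi-core Solr
at the default http://localhost:8983, two hypothetical cores named core0
and core1, the CSV handler at its default /update/csv path, and the two
halves half1.csv and half2.csv from the split sketch above:

% curl 'http://localhost:8983/solr/core0/update/csv?commit=true' \
    -H 'Content-type: text/csv; charset=utf-8' --data-binary @half1.csv
% curl 'http://localhost:8983/solr/core1/update/csv?commit=true' \
    -H 'Content-type: text/csv; charset=utf-8' --data-binary @half2.csv

Each core then holds an independent index over its half of the data.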


>
>
> Also, coming to index settings, what would be the optimal values of the
> auto-commit maxDocs and maxTime settings for the 10 GB CSV file? It has
> around 28 million records.
>

That would depend on various local factors, like how much RAM
you have to give to Solr, network speed, etc. The best way would
be to experiment with these settings. Usually, your goal should be
to minimise auto-commits, so you can try setting these numbers
to high values. You could also disable auto-commit altogether, and
do manual commits.
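
For instance, a minimal sketch of the manual-commit route, reusing the
hypothetical core0 and half1.csv from above: comment out the <autoCommit>
section in solrconfig.xml, stream the data in (no commit happens by
default), and issue one explicit commit at the end:

% curl 'http://localhost:8983/solr/core0/update/csv' \
    -H 'Content-type: text/csv; charset=utf-8' --data-binary @half1.csv
% curl 'http://localhost:8983/solr/core0/update?commit=true'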

Given your data size, I think that the indexing should be quite fast
on reasonable hardware.

Regards,
Gora