Posted to solr-user@lucene.apache.org by abhishek tiwari <te...@gmail.com> on 2015/03/13 09:37:19 UTC

data import

Solr indexing is taking too much time.

What should I do to reduce the time? I am working on Solr 4.0.

Re: data import

Posted by Mikhail Khludnev <mk...@griddynamics.com>.
Take a profile with VisualVM or a similar tool.

On Fri, Mar 13, 2015 at 11:37 AM, abhishek tiwari <te...@gmail.com>
wrote:

> Solr indexing is taking too much time.
>
> What should I do to reduce the time? I am working on Solr 4.0.
>



-- 
Sincerely yours
Mikhail Khludnev
Principal Engineer,
Grid Dynamics

<http://www.griddynamics.com>
<mk...@griddynamics.com>

Re: data import

Posted by Shawn Heisey <ap...@elyograg.org>.
On 3/19/2015 10:36 PM, Midas A wrote:
> Thanks for replying. I need clarity on the following points:
> a) Will setting stored="false" in the schema for a few fields improve indexing time?

Maybe, maybe not.  If Solr is I/O bound, then it probably would help ...
but usually I/O on the Solr index directory is not the bottleneck.
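
For reference, the change being asked about is a one-attribute edit in schema.xml. A minimal sketch, using one field from the posted schema; which fields can safely drop their stored values is an assumption that depends on what the application needs returned in results, since a field with stored="false" can still be searched but cannot be retrieved:

```xml
<!-- Sketch only: search_words stays searchable (indexed="true")
     but its original value is no longer kept for retrieval. -->
<field name="search_words" type="text_general" indexed="true" stored="false" />
```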

> b) Does the soft commit and hard commit configuration depend on the hardware?

You need to make your autoCommit and autoSoftCommit intervals as long as
you can stand.  I use autoCommit with a five minute / 25000 document
config, and I don't use autoSoftCommit.  My indexing application sends
explicit soft commits, and those are at least a full minute apart,
sometimes longer.
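
A rough sketch of what a five minute / 25000 document autoCommit might look like in the updateHandler section of solrconfig.xml. The openSearcher value is an assumption (it is not stated above); it is set to false here so that hard commits do not open a new searcher and visibility is left to the explicit soft commits:

```xml
<!-- Sketch only: hard commit after 5 minutes or 25000 documents,
     whichever comes first. -->
<autoCommit>
  <maxTime>300000</maxTime>           <!-- milliseconds -->
  <maxDocs>25000</maxDocs>
  <openSearcher>false</openSearcher>  <!-- assumption, not from the reply -->
</autoCommit>
<!-- No autoSoftCommit block: soft commits are sent by the indexing
     application itself. -->
```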

> c) Should I configure mergeFactor and ramBufferSizeMB? How should
> I decide on these values?

The default mergeFactor is 10.  A higher mergeFactor will result in
faster indexing, but queries on the resulting index will be a little bit
slower, unless you optimize after your indexing is complete.  The
default ramBufferSizeMB setting in recent versions is 100, and community
experience has shown that increasing this value doesn't normally make
much difference unless you have enormous documents where each one is a
few megabytes.
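
For orientation, both settings live in the indexConfig section of solrconfig.xml in Solr 4.x. The sketch below simply spells out the default values mentioned above; it is a starting point for experimentation, not a recommendation:

```xml
<indexConfig>
  <!-- Higher values favour indexing speed over query speed. -->
  <mergeFactor>10</mergeFactor>
  <!-- Rarely worth raising unless individual documents are very large. -->
  <ramBufferSizeMB>100</ramBufferSizeMB>
</indexConfig>
```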

> We are doing a full index and it takes around 4.5 hours (20 million documents).

I would call that a pretty good rate.  One of my single dataimporter
configs will index about 17 million docs into a Solr core in 4.5 to 5
hours from MySQL.  By doing several of these in parallel (into separate
shards) on two machines at once, I can re-index my entire 100 million
document database in about 4.5 to 5 hours.

Thanks,
Shawn


Re: data import

Posted by Midas A <te...@gmail.com>.
Hi Shawn,

Thanks for replying. I need clarity on the following points:
a) Will setting stored="false" in the schema for a few fields improve indexing time?
b) Does the soft commit and hard commit configuration depend on the hardware?
c) Should I configure mergeFactor and ramBufferSizeMB? How should
I decide on these values?


We are doing a full index and it takes around 4.5 hours (20 million documents).

Regards,
MA

On Fri, Mar 20, 2015 at 1:57 AM, Shawn Heisey <ap...@elyograg.org> wrote:

> On 3/19/2015 11:47 AM, abhishek tiwari wrote:
> > <autoSoftCommit> <maxTime>500</maxTime> </autoSoftCommit>
>
> You're doing soft commits as often as twice a second.  You have
> configured 500 milliseconds here.  This might have something to do with
> your slow indexing speed.  A soft commit is less expensive than a full
> hard commit, but soft commits are *NOT* free, and they aren't even cheap.
>
> I doubt that you *need* your documents to be visible within half a
> second of indexing them ... and there's a good chance that even with
> this config they won't be visible that soon, because each commit is
> probably going to take longer than half a second to complete.  With a
> 500 millisecond autoSoftCommit configuration, your server may be doing
> commit operations close to 100% of the time while indexing is happening.
>
>
> http://lucidworks.com/blog/understanding-transaction-logs-softcommit-and-commit-in-sorlcloud/
>
> Also, the dataimport handler is single threaded, so if you are only
> using one handler definition in solrconfig.xml, there is no parallel
> indexing.  You'll need to write your own multi-threaded indexing program
> if you want parallel indexing.
>
> Thanks,
> Shawn
>
>

Re: data import

Posted by Shawn Heisey <ap...@elyograg.org>.
On 3/19/2015 11:47 AM, abhishek tiwari wrote:
> <autoSoftCommit> <maxTime>500</maxTime> </autoSoftCommit>

You're doing soft commits as often as twice a second.  You have
configured 500 milliseconds here.  This might have something to do with
your slow indexing speed.  A soft commit is less expensive than a full
hard commit, but soft commits are *NOT* free, and they aren't even cheap.

I doubt that you *need* your documents to be visible within half a
second of indexing them ... and there's a good chance that even with
this config they won't be visible that soon, because each commit is
probably going to take longer than half a second to complete.  With a
500 millisecond autoSoftCommit configuration, your server may be doing
commit operations close to 100% of the time while indexing is happening.

http://lucidworks.com/blog/understanding-transaction-logs-softcommit-and-commit-in-sorlcloud/
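
As an illustration of a less aggressive setting, the sketch below raises the soft commit interval from 500 milliseconds to one minute. The 60000 value is an assumption chosen for the example; the right interval is the longest delay before new documents become visible that the application can tolerate:

```xml
<!-- Sketch only: soft commit at most once a minute instead of twice a second. -->
<autoSoftCommit>
  <maxTime>60000</maxTime>  <!-- milliseconds; illustrative value -->
</autoSoftCommit>
```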

Also, the dataimport handler is single threaded, so if you are only
using one handler definition in solrconfig.xml, there is no parallel
indexing.  You'll need to write your own multi-threaded indexing program
if you want parallel indexing.
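
One way to read the point about handler definitions is that several dataimport handlers can be declared side by side and started separately, each importing its own slice of the data. A sketch under that assumption; the handler names and config file names are hypothetical, and each DIH config file would need a query that selects a disjoint subset of the source rows:

```xml
<!-- Sketch only: two independent DIH handlers in solrconfig.xml.
     Names and file names are made up for illustration. -->
<requestHandler name="/dataimport-a"
                class="org.apache.solr.handler.dataimport.DataImportHandler">
  <lst name="defaults">
    <str name="config">dih-config-a.xml</str>
  </lst>
</requestHandler>

<requestHandler name="/dataimport-b"
                class="org.apache.solr.handler.dataimport.DataImportHandler">
  <lst name="defaults">
    <str name="config">dih-config-b.xml</str>
  </lst>
</requestHandler>
```

Each handler would then be triggered with its own full-import request. Whether this helps depends on where the bottleneck is, so it is worth measuring before and after.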

Thanks,
Shawn


Re: data import

Posted by abhishek tiwari <te...@gmail.com>.
Hi,

- architecture: master (1) - slave (3)
solrconfig:

<autoSoftCommit> <maxTime>500</maxTime> </autoSoftCommit>

<autoCommit> <maxTime>15000</maxTime> <openSearcher>false</openSearcher> </autoCommit>

schema :
<field name="id" type="string" indexed="true" stored="true" required="true" multiValued="false" />
<field name="meta_keywords" type="text_general" indexed="true" stored="true" />
<field name="search_words" type="text_general" indexed="true" stored="true" />
<field name="manufacturer_reference_number" type="text_general" indexed="true" stored="true" />
<field name="is_wholesale_product" type="boolean" indexed="true" stored="true" />
<field name="page_title" type="text_general" indexed="true" stored="true" />
<field name="image_url" type="text_general" indexed="false" stored="true" />
<field name="brand_id" type="tint" indexed="true" stored="true" />
<field name="brand_url" type="text_general" indexed="false" stored="true" />
<field name="brand" type="variantFacet" indexed="true" stored="true" />
<field name="show_brand" type="variantFacet" indexed="true" stored="true" />
<field name="brandname" type="text_general" indexed="true" stored="true" />
<field name="metacategory" type="metaCatFacet" indexed="true" stored="true" />
<field name="category" type="catFacet" indexed="true" stored="true" />
<field name="show_category" type="variantFacet" indexed="true" stored="true" />
<field name="metacategory_id" type="tint" indexed="true" stored="true" />
<field name="merchant" type="text_path_new" indexed="true" stored="true" />
<field name="merchant_rating" type="tfloat" indexed="true" stored="true" />
<field name="product_rating" type="tfloat" indexed="true" stored="true" />
<field name="bestsellers" type="tint" indexed="true" stored="true" />
<field name="product_id" type="tint" indexed="true" stored="true" />
<field name="company_id" type="tint" indexed="true" stored="true" />
<field name="list_price" type="tfloat" indexed="true" stored="true" />
<field name="selling_price" type="tfloat" indexed="true" stored="true" />
<field name="third_price" type="tfloat" indexed="true" stored="true" />
<field name="discount_percentage" type="tfloat" indexed="true" stored="true" />
<field name="free_shipping" type="text_general" indexed="true" stored="true" />
<field name="is_trm" type="text_general" indexed="true" stored="true" />
<field name="company" type="text_general" indexed="true" stored="true" />
<field name="sort_1" type="tint" indexed="true" stored="true" />
<field name="sort_2" type="tint" indexed="true" stored="true" />
<field name="product_shipping_estimation" type="text_general" indexed="true" stored="true" />
<field name="total_price" type="text_general" indexed="true" stored="true" />
<field name="shipping_freight" type="tint" indexed="true" stored="true" />
<field name="amount" type="tint" indexed="true" stored="true" />
<field name="product_sales_amount" type="tint" indexed="true" stored="true" />
<field name="view_count" type="tint" indexed="true" stored="false" />
<field name="deal_index" type="tint" indexed="true" stored="false" />
<field name="feature_index" type="tint" indexed="false" stored="false" />
<field name="is_cod" type="text_general" indexed="true" stored="true" />
<field name="transfer_price" type="tfloat" indexed="true" stored="true" />
<field name="promotion_id" type="tint" indexed="true" stored="true" />
<field name="price_see_inside" type="boolean" indexed="true" stored="true" />
<field name="deal_inside_badge" type="boolean" indexed="true" stored="true" />
<field name="special_offer_badge" type="boolean" indexed="true" stored="true" />
<field name="special_offer_text" type="text_general" indexed="true" stored="true" />
<field name="freebee_inside" type="boolean" indexed="true" stored="true" />
<field name="timestamp" type="tint" indexed="true" stored="true" />
<field name="newarrivals" type="tint" indexed="true" stored="true" />
<field name="hotdeals" type="tint" indexed="true" stored="true" />
<field name="featured" type="tint" indexed="true" stored="true" />
<field name="boost_index" type="tint" indexed="true" stored="true" />
<field name="sort_index" type="tint" indexed="true" stored="true" />
<field name="product_amount_available" type="tint" indexed="true" stored="true" />
<field name="suggest_brand" type="text_general" indexed="true" stored="true" />
<field name="status" type="text_general" indexed="true" stored="true" />
<field name="maincategory_status" type="text_general" indexed="true" stored="true" />
<field name="category_status" type="text_general" indexed="true" stored="true" />
<field name="company_status" type="text_general" indexed="true" stored="true" />
<field name="metacategory_status" type="text_general" indexed="true" stored="true" />
<field name="show_metacategory" type="variantFacet" indexed="true" stored="true" />
<field name="product" type="text_general" indexed="true" stored="true" />
<field name="label" type="alphaOnlySort" indexed="true" stored="true" />
<field name="company_name" type="text_general" indexed="true" stored="true" />
<field name="category_ids" type="text_path" indexed="true" stored="true" />
<field name="seo_name" type="text_general" indexed="true" stored="true" />
<field name="category_id" type="tint" indexed="true" stored="true" />
<field name="id_path" type="text_path" indexed="true" stored="true" />
<field name="variant_id" type="tint" indexed="true" stored="true" />
<field name="feature_id" type="tint" indexed="true" stored="true" />
<field name="products" type="tint" indexed="true" stored="true" />
<field name="range_id" type="tint" indexed="true" stored="true" />
<field name="range_name" type="text_general" indexed="true" stored="true" />
<field name="feature_type" type="text_general" indexed="true" stored="true" />
<field name="filter_id" type="tint" indexed="true" stored="true" />
<field name="text" type="text_general" indexed="true" stored="false" multiValued="true" />
<field name="_version_" type="long" indexed="true" stored="true" />
<field name="by_mega_pixel" type="text_path_new" indexed="true" stored="true" multiValued="true" />
<field name="by_facility" type="text_path_new" indexed="true" stored="true" multiValued="true" />
<field name="by_publisher" type="text_path_new" indexed="true" stored="true" multiValued="true" />
<field name="by_suitable_for" type="text_path_new" indexed="true" stored="true" multiValued="true" />
<field name="by_compatibility" type="text_path_new" indexed="true" stored="true" multiValued="true" />
<field name="by_memory_capacity" type="text_path_new" indexed="true" stored="true" multiValued="true" />
<field name="by_function" type="text_path_new" indexed="true" stored="true" multiValued="true" />
<field name="saree_colour" type="text_path_new" indexed="true" stored="true" multiValued="true" />
<field name="by_spray_type" type="text_path_new" indexed="true" stored="true" multiValued="true" />
<field name="by_capacity" type="text_path_new" indexed="true" stored="true" multiValued="true" />
<field name="by_device" type="text_path_new" indexed="true" stored="true" multiValued="true" />
<field name="by_color_and_finish" type="text_path_new" indexed="true" stored="true" multiValued="true" />
<field name="by_processor" type="text_path_new" indexed="true" stored="true" multiValued="true" />
<field name="by_optical_zoom" type="text_path_new" indexed="true" stored="true" multiValued="true" />
<field name="by_drive_supported" type="text_path_new" indexed="true" stored="true" multiValued="true" />
<field name="by_headphone_type" type="text_path_new" indexed="true" stored="true" multiValued="true" />
<field name="wired_or_wireless" type="text_path_new" indexed="true" stored="true" multiValued="true" />
<field name="by_weight_in_kg" type="text_path_new" indexed="true" stored="true" multiValued="true" />
<field name="by_energy_rating" type="text_path_new" indexed="true" stored="true" multiValued="true" />
<field name="by_material" type="text_path_new" indexed="true" stored="true" multiValued="true" />
<field name="ideal_for" type="text_path_new" indexed="true" stored="true" multiValued="true" />
<field name="by_system_memory" type="text_path_new" indexed="true" stored="true" multiValued="true" />
<field name="by_number_of_pc" type="text_path_new" indexed="true" stored="true" multiValued="true" />
<field name="by_function_type" type="text_path_new" indexed="true" stored="true" multiValued="true" />
<field name="by_display_size" type="text_path_new" indexed="true" stored="true" multiValued="true" />
<field name="by_subscription_validity" type="text_path_new" indexed="true" stored="true" multiValued="true" />
<field name="by_loading_type" type="text_path_new" indexed="true" stored="true" multiValued="true" />
<field name="by_screen_size" type="text_path_new" indexed="true" stored="true" multiValued="true" />
<field name="by_platform" type="text_path_new" indexed="true" stored="true" multiValued="true" />
<field name="by_compatible_laptop_size" type="text_path_new" indexed="true" stored="true" multiValued="true" />
<field name="by_genre" type="text_path_new" indexed="true" stored="true" multiValued="true" />
<field name="by_service_count" type="text_path_new" indexed="true" stored="true" multiValued="true" />
<field name="by_age" type="text_path_new" indexed="true" stored="true" multiValued="true" />
<field name="by_shape" type="text_path_new" indexed="true" stored="true" multiValued="true" />
<field name="by_ink_color" type="text_path_new" indexed="true" stored="true" multiValued="true" />
<field name="by_motherboard_supported" type="text_path_new" indexed="true" stored="true" multiValued="true" />
<field name="by_interface" type="text_path_new" indexed="true" stored="true" multiValued="true" />
<field name="by_placement" type="text_path_new" indexed="true" stored="true" multiValued="true" />
<field name="by_color" type="text_path_new" indexed="true" stored="true" multiValued="true" />
<field name="by_star_rating" type="text_path_new" indexed="true" stored="true" multiValued="true" />
<field name="by_purifier_type" type="text_path_new" indexed="true" stored="true" multiValued="true" />
<field name="by_strap_material" type="text_path_new" indexed="true" stored="true" multiValued="true" />
<field name="by_design" type="text_path_new" indexed="true" stored="true" multiValued="true" />
<field name="by_purity" type="text_path_new" indexed="true" stored="true" multiValued="true" />
<field name="by_pegi_rating" type="text_path_new" indexed="true" stored="true" multiValued="true" />
<field name="by_character" type="text_path_new" indexed="true" stored="true" multiValued="true" />
<field name="by_weight_in_gm" type="text_path_new" indexed="true" stored="true" multiValued="true" />
<field name="by_primary_camera" type="text_path_new" indexed="true" stored="true" multiValued="true" />
<field name="by_operating_system" type="text_path_new" indexed="true" stored="true" multiValued="true" />
<field name="by_type" type="text_path_new" indexed="true" stored="true" multiValued="true" />
<field name="by_neck_lines" type="text_path_new" indexed="true" stored="true" multiValued="true" />
<field name="by_sleeves" type="text_path_new" indexed="true" stored="true" multiValued="true" />
<field name="by_door_type" type="text_path_new" indexed="true" stored="true" multiValued="true" />
<field name="by_strap_color" type="text_path_new" indexed="true" stored="true" multiValued="true" />
<field name="by_occasion" type="text_path_new" indexed="true" stored="true" multiValued="true" />
<field name="by_dial_shape" type="text_path_new" indexed="true" stored="true" multiValued="true" />
<field name="by_playing_mode" type="text_path_new" indexed="true" stored="true" multiValued="true" />
<field name="by_class" type="text_path_new" indexed="true" stored="true" multiValued="true" />
<field name="by_features" type="text_path_new" indexed="true" stored="true" multiValued="true" />
<field name="by_purification_method" type="text_path_new" indexed="true" stored="true" multiValued="true" />
<field name="by_dial_color" type="text_path_new" indexed="true" stored="true" multiValued="true" />
<field name="by_hard_drive_capacity" type="text_path_new" indexed="true" stored="true" multiValued="true" />
<field name="by_chocolates_type" type="text_path_new" indexed="true" stored="true" multiValued="true" />
<field name="by_graphic_memory" type="text_path_new" indexed="true" stored="true" multiValued="true" />
<field name="by_hard_drive" type="text_path_new" indexed="true" stored="true" multiValued="true" />
<field name="by_language" type="text_path_new" indexed="true" stored="true" multiValued="true" />
<field name="by_binding" type="text_path_new" indexed="true" stored="true" multiValued="true" />
<field name="by_headset_type" type="text_path_new" indexed="true" stored="true" multiValued="true" />
<field name="by_wired_or_wireless" type="text_path_new" indexed="true" stored="true" multiValued="true" />
<field name="suitable_for" type="text_path_new" indexed="true" stored="true" multiValued="true" />
<field name="speaker_configuration" type="text_path_new" indexed="true" stored="true" multiValued="true" />
<field name="touchscreen" type="text_path_new" indexed="true" stored="true" multiValued="true" />
<field name="laptop_type" type="text_path_new" indexed="true" stored="true" multiValued="true" />
<field name="by_logo" type="text_path_new" indexed="true" stored="true" multiValued="true" />
<field name="bags_material" type="text_path_new" indexed="true" stored="true" multiValued="true" />
<field name="burner_type" type="text_path_new" indexed="true" stored="true" multiValued="true" />
<field name="hd" type="text_path_new" indexed="true" stored="true" multiValued="true" />
<field name="built_in_microphone" type="text_path_new" indexed="true" stored="true" multiValued="true" />
<field name="webcam_mega_pixels" type="text_path_new" indexed="true" stored="true" multiValued="true" />
<field name="by_occassion" type="text_path_new" indexed="true" stored="true" multiValued="true" />
<field name="hub_and_card_reader_type" type="text_path_new" indexed="true" stored="true" multiValued="true" />
<field name="rams_capacity" type="text_path_new" indexed="true" stored="true" multiValued="true" />
<field name="tv_tuner_placement" type="text_path_new" indexed="true" stored="true" multiValued="true" />
<field name="graphic_card_memory" type="text_path_new" indexed="true" stored="true" multiValued="true" />
<field name="with_recording" type="text_path_new" indexed="true" stored="true" multiValued="true" />
<field name="landline_type" type="text_path_new" indexed="true" stored="true" multiValued="true" />
<field name="price" type="tfloat" indexed="true" stored="true" />
<field name="sort_price" type="tfloat" indexed="true" stored="true" />
<field name="popularity" type="tint" indexed="true" stored="true" />
<field name="metacategory_autosuggest" type="string_autosuggest" indexed="true" stored="true" />
<field name="product_autosuggest" type="string_autosuggest" indexed="true" stored="true" />
<field name="brand_autosuggest" type="string_autosuggest" indexed="true" stored="true" />
<field name="metacategory_keyword" type="text_keyword" indexed="true" stored="true" />
<field name="product_keyword" type="text_keyword" indexed="true" stored="true" />
<field name="brand_keyword" type="text_keyword" indexed="true" stored="true" />
</fields>
<uniqueKey>id</uniqueKey>
<copyField source="metacategory" dest="metacategory_keyword" />
<copyField source="product" dest="product_keyword" />
<copyField source="brand" dest="brand_keyword" />
<copyField source="product" dest="label" />
<copyField source="list_price" dest="text" />
<copyField source="product" dest="text" />
<copyField source="category" dest="text" />
<copyField source="metacategory" dest="text" />
<copyField source="brand" dest="text" />
<copyField source="seo_name" dest="text" />
<copyField source="amount" dest="product_sales_amount" />
<copyField source="timestamp" dest="newarrivals" />
<copyField source="deal_index" dest="hotdeals" />
<copyField source="feature_index" dest="featured" />

On Fri, Mar 13, 2015 at 2:25 PM, Antonio Jesús Sánchez Padial <
antonio.sanchez@inia.es> wrote:

> Maybe you should add some info about:
>
> - your architecture, number of servers, etc.
> - your schema.xml
> - and the data (amount, type, ...) you are indexing
>
> Best.
>
> On 13/03/2015 at 9:37, abhishek tiwari wrote:
>
>> Solr indexing is taking too much time.
>>
>> What should I do to reduce the time? I am working on Solr 4.0.
>>
>>
> --
> Antonio Jesús Sánchez Padial
> Head of the Biometrics Service
> antonio.sanchez@inia.es
> Tel: +34 91 347 6831
> Instituto Nacional de Investigación y Tecnología Agraria y Alimentaria
> Ctra.m de La Coruña, km.7
> 28040 Madrid
>
>

Re: data import

Posted by Antonio Jesús Sánchez Padial <an...@inia.es>.
Maybe you should add some info about:

- your architecture, number of servers, etc.
- your schema.xml
- and the data (amount, type, ...) you are indexing

Best.

On 13/03/2015 at 9:37, abhishek tiwari wrote:
> Solr indexing is taking too much time.
>
> What should I do to reduce the time? I am working on Solr 4.0.
>

-- 
Antonio Jesús Sánchez Padial
Head of the Biometrics Service
antonio.sanchez@inia.es
Tel: +34 91 347 6831
Instituto Nacional de Investigación y Tecnología Agraria y Alimentaria
Ctra.m de La Coruña, km.7
28040 Madrid