Posted to solr-user@lucene.apache.org by "Purohit, Sumit" <Su...@pnnl.gov> on 2015/03/24 00:41:17 UTC

Difference in indexing using config file vs client i.e SolrJ

Hi All,

I have recently started working with Solr and I have a trivial question to ask, as I could not find a suitable answer.

A document's indexes can be defined in a config file (such as schema.xml) or on the fly using a Solr client such as SolrJ.

1. What is the difference between the indexes created by the two approaches?
2. Is there any major performance gain from using a predefined index instead of using SolrJ?
3. Does Solr persist these indexes differently, and does that have any impact on query efficiency?

Thanks
Sumit Purohit

RE: Difference in indexing using config file vs client i.e SolrJ

Posted by "Purohit, Sumit" <Su...@pnnl.gov>.
Thanks Erick for the helpful explanations.

thanks
sumit 
________________________________________
From: Erick Erickson [erickerickson@gmail.com]
Sent: Monday, March 23, 2015 4:58 PM
To: solr-user@lucene.apache.org
Subject: Re: Difference in indexing using config file vs client i.e SolrJ


Re: Difference in indexing using config file vs client i.e SolrJ

Posted by Erick Erickson <er...@gmail.com>.
1> Either none or lots, depending ;). You're talking about "schemaless"
mode here, I think. Schemaless mode guesses what the field type should
be based on the document and adds a field definition to the schema.
Pre-defined schemas require you to make that decision up front.

In terms of what the underlying index looks like at the lower
Lucene level, it's identical whether a field is defined in schema.xml
or created dynamically. From that perspective, there's no difference.

However, whether the chosen field definitions best represent the
problem you're trying to solve is another issue altogether.
Schemaless simply cannot apply the same kind of domain-specific
interpretation that a human can, not to mention construct analysis
chains for the tokens that reflect the characteristics specific to
that domain.
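
To make the "guessing" point concrete, here is a toy sketch in Python.
It is NOT Solr's actual schemaless logic; the rules, the function names
and the sample document are all made up for illustration. It only shows
how value-based guessing can pick a type that a human who knows the
domain would not choose.

```python
import re

# Toy stand-in for value-based type guessing; NOT Solr's real rules.
def guess_field_type(value: str) -> str:
    """Guess a field type from a single raw string value."""
    if value.lower() in ("true", "false"):
        return "boolean"
    if re.fullmatch(r"[+-]?\d+", value):
        return "long"
    if re.fullmatch(r"[+-]?\d+\.\d+", value):
        return "double"
    if re.fullmatch(r"\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}Z", value):
        return "date"
    return "text"

def guess_schema(doc: dict) -> dict:
    """Map every field name in the document to its guessed type."""
    return {name: guess_field_type(value) for name, value in doc.items()}

# Hypothetical document; field names and values are invented.
doc = {
    "title": "Intro to Solr",
    "zip_code": "02134",
    "published": "2015-03-23T16:58:00Z",
}
guessed = guess_schema(doc)
for name, field_type in guessed.items():
    print(name, "->", field_type)
```

In this sketch zip_code is guessed as a numeric type because the value
is all digits, whereas someone who knows the data would most likely
declare it as a plain string so the leading zero is preserved.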

2> There have been some anecdotal reports that schemaless copying
everything into a _text field impacts performance, but this is
configurable.
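
As a rough illustration of why a catch-all copy can cost something,
here is a toy inverted index in Python. It is a simplified model, not
how Lucene stores anything, and the documents and the build_index
helper are hypothetical. It only counts how many postings exist with
and without copying every token into a catch-all field.

```python
from collections import defaultdict

def build_index(docs, catch_all=None):
    """Build a toy inverted index: (field, token) -> set of doc ids.

    If catch_all is given, every token is also indexed under that field.
    """
    index = defaultdict(set)
    for doc_id, doc in enumerate(docs):
        for field, text in doc.items():
            for token in text.lower().split():
                index[(field, token)].add(doc_id)
                if catch_all is not None:
                    index[(catch_all, token)].add(doc_id)
    return index

def posting_count(index):
    """Total number of (field, token, doc) entries in the index."""
    return sum(len(doc_ids) for doc_ids in index.values())

# Hypothetical two-document corpus.
docs = [
    {"title": "solr indexing basics", "body": "fields are defined in the schema"},
    {"title": "schemaless mode", "body": "fields are guessed from the data"},
]
without_copy = posting_count(build_index(docs))
with_copy = posting_count(build_index(docs, catch_all="_text"))
print(without_copy, with_copy)
```

In this toy model the catch-all field adds roughly as many postings
again as the original fields hold. How much that matters in a real
index depends on the corpus and the analysis chain, so treat it as a
sketch of the mechanism and not as a measurement.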

3> Again, the underlying structure of the index at the Lucene level is
the same. What's NOT the same is whether schemaless mode makes the
right decisions. Almost invariably a human can do better, since
you're armed with knowledge of what's important and what's not.

Here's my take: schemaless mode is a great way to get started with
minimal effort on your part. But pretty soon the problem domain
requires that you take control of the schema and hand-craft
schema.xml. For some problem spaces, schemaless may be "good enough";
you have to evaluate your corpus and your problem space.

Best,
Erick
