Posted to solr-user@lucene.apache.org by Mike Murphy <mm...@gmail.com> on 2015/03/22 16:35:09 UTC

schemaless slow indexing

I'm trying out schemaless in solr 5.0, but the indexing seems quite a
bit slower than it did in the past on 4.10.  Any pointers?

--Mike

Re: schemaless slow indexing

Posted by Steve Rowe <sa...@gmail.com>.

> On Mar 23, 2015, at 11:09 AM, Yonik Seeley <ys...@gmail.com> wrote:
> 
> On Mon, Mar 23, 2015 at 1:54 PM, Alexandre Rafalovitch
> <ar...@gmail.com> wrote:
>> I looked at SOLR-7290, but I think the discussion should stay on the
>> mailing list for at least one more iteration.
>> 
>> My understanding is that the reason copyField exists is so that a search
>> actually works out of the box. Without knowing the field names, one
>> cannot say what to search.
> 
> Some points:
> - Schemaless is often just to make it easier to get started.
> - If one assumes a lack of knowledge of field names, that's an issue
> for non-schemaless too.
> - Full-text search is only one use-case that people use Solr for...
> there's lots of sorting/faceting/analytics use cases.

Under SOLR-6779, Erik Hatcher changed the data_driven_schema_configs's auto-guessed default field type from text_general to strings in order to support features other than full-text search:

<https://svn.apache.org/viewvc/lucene/dev/trunk/solr/server/solr/configsets/data_driven_schema_configs/conf/solrconfig.xml?r1=1648456&r2=1648455&pathrev=1648456>

It’s for exactly this reason (as Alex pointed out) that the catch-all field makes sense: there is no other full-text field available.

Yonik, can you suggest a path that supports both these possibilities?  Because having zero fields with full text search in the default Solr configuration seems like a really bad idea to me.

Steve

Re: schemaless slow indexing

Posted by Steve Rowe <sa...@gmail.com>.

> On Mar 23, 2015, at 11:51 AM, Alexandre Rafalovitch <ar...@gmail.com> wrote:
> For example, I am not even sure if we can create a copyField
> definition via REST API yet.

<https://cwiki.apache.org/confluence/display/solr/Schema+API#SchemaAPI-AddaNewCopyFieldRule>
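
A minimal sketch of what such a request body might look like, assuming the add-copy-field command described on the Schema API page linked above. The field names "title" and "catchall_text" are hypothetical, and actually sending the body (an HTTP POST to the collection's schema endpoint) is left out so the snippet runs without a Solr server:

```python
import json


def build_add_copy_field_command(source, dest):
    """Return the JSON body for a Schema API add-copy-field command."""
    command = {"add-copy-field": {"source": source, "dest": [dest]}}
    return json.dumps(command)


# Hypothetical field names, chosen only for illustration.
body = build_add_copy_field_command("title", "catchall_text")
print(body)
```

Both fields would need to exist in the schema already; whether a wildcard source is accepted through this route is something to check against the linked documentation rather than assume.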

Re: schemaless slow indexing

Posted by Alexandre Rafalovitch <ar...@gmail.com>.

Yonik, those are all facts, which I do not disagree with at all.

But there are also consequences when you bring the rest of the facts
and the assumptions and documented workflows into play. My comment was
trying to address the situation on that level.

I am all for improving performance. I am just saying that the
copyField did not seem to be an oversight. So, if we just kill it,
something else will suffer. So, killing it may need a corresponding
re-balancing in ??? (documentation?).

For example, I am not even sure if we can create a copyField
definition via REST API yet. Without that, and without global
copyField, what is our default search? And if schemaless makes it
easier to get started, that must cover making it easy to actually
search too, I would guess!

I am not sure if this makes sense?

Regards,
   Alex.
----
Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter:
http://www.solr-start.com/

On 23 March 2015 at 14:09, Yonik Seeley <ys...@gmail.com> wrote:
> On Mon, Mar 23, 2015 at 1:54 PM, Alexandre Rafalovitch
> <ar...@gmail.com> wrote:
>> I looked at SOLR-7290, but I think the discussion should stay on the
>> mailing list for at least one more iteration.
>>
>> My understanding is that the reason copyField exists is so that a search
>> actually works out of the box. Without knowing the field names, one
>> cannot say what to search.
>
> Some points:
> - Schemaless is often just to make it easier to get started.
> - If one assumes a lack of knowledge of field names, that's an issue
> for non-schemaless too.
> - Full-text search is only one use-case that people use Solr for...
> there's lots of sorting/faceting/analytics use cases.
> - Bad performance by default is.... bad.  People tend to do benchmarks
> and make sweeping conclusions based on those.
>
>
> -Yonik

Re: schemaless slow indexing

Posted by Yonik Seeley <ys...@gmail.com>.

On Mon, Mar 23, 2015 at 1:54 PM, Alexandre Rafalovitch
<ar...@gmail.com> wrote:
> I looked at SOLR-7290, but I think the discussion should stay on the
> mailing list for at least one more iteration.
>
> My understanding is that the reason copyField exists is so that a search
> actually works out of the box. Without knowing the field names, one
> cannot say what to search.

Some points:
- Schemaless is often just to make it easier to get started.
- If one assumes a lack of knowledge of field names, that's an issue
for non-schemaless too.
- Full-text search is only one use-case that people use Solr for...
there's lots of sorting/faceting/analytics use cases.
- Bad performance by default is.... bad.  People tend to do benchmarks
and make sweeping conclusions based on those.


-Yonik

Re: schemaless slow indexing

Posted by Alexandre Rafalovitch <ar...@gmail.com>.

I looked at SOLR-7290, but I think the discussion should stay on the
mailing list for at least one more iteration.

My understanding is that the reason copyField exists is so that a search
actually works out of the box. Without knowing the field names, one
cannot say what to search. So copying everything to a general field and
searching that is a classic strategy, though usually not with a
*match all* wildcard. But for schemaless, *match all* is all we get, as
we don't even have prefix/suffix strategies to rely on.

So, saying *remove* without offering an alternative way to achieve
easy search is not - to me - a terribly useful contribution for a
default setup.

Regards,
    Alex.
P.s. As to the field renaming, I have no opinion. It would be nice if
somebody checked the consistency now that a couple more special names
were introduced with smart JSON parsing.

----
Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter:
http://www.solr-start.com/


On 22 March 2015 at 20:32, Erick Erickson <er...@gmail.com> wrote:
> I think you mean https://issues.apache.org/jira/browse/SOLR-7290?
>
> Erick
>
> On Sun, Mar 22, 2015 at 2:30 PM, Mike Murphy <mm...@gmail.com> wrote:
>> That's it!
>> I hand edited the file that says you are not supposed to edit it and
>> removed that copyField.
>> Indexing performance is now back to expected levels.
>>
>> I created an issue for this, https://issues.apache.org/jira/browse/SOLR-7284
>>
>> --Mike
>>
>> On Sun, Mar 22, 2015 at 3:29 PM, Yonik Seeley <ys...@gmail.com> wrote:
>>> I took a quick look at the stock schemaless configs... unfortunately
>>> they contain a performance trap.
>>> There's a copyField by default that copies *all* fields to a catch-all
>>> field called "_text".
>>>
>>> IMO, that's not a great default.  Double the index size (well, the
>>> "index" portion of it at least... not stored fields), and slower
>>> indexing performance.
>>>
>>> The other unfortunate thing is the name.  Nowhere else in solr (that
>>> I know of) do we have a single underscore field name.  _text looks
>>> more like a dynamicField pattern.  Our other fields with underscores
>>> look like _version_ and _root_.  If we're going to start a new naming
>>> convention (or expand the naming conventions) we need to have some
>>> consistency and logic behind it.
>>>
>>> -Yonik
>>>
>>> On Sun, Mar 22, 2015 at 12:32 PM, Mike Murphy <mm...@gmail.com> wrote:
>>>> I start up solr schemaless and index a bunch of data, and it takes a
>>>> lot longer to finish indexing.
>>>> No configuration changes, just straight schemaless.
>>>>
>>>> --Mike
>>>>
>>>> On Sun, Mar 22, 2015 at 12:27 PM, Erick Erickson
>>>> <er...@gmail.com> wrote:
>>>>> Please review: http://wiki.apache.org/solr/UsingMailingLists
>>>>>
>>>>> You haven't quantified the slowdown. Or given any details on how
>>>>> you're measuring the "slowdown". Or how you've configured your setups
>>>>> in 4.10 and 5.0. Or... As Hossman would say "details matter".
>>>>>
>>>>> Best,
>>>>> Erick
>>>>>
>>>>> On Sun, Mar 22, 2015 at 8:35 AM, Mike Murphy <mm...@gmail.com> wrote:
>>>>>> I'm trying out schemaless in solr 5.0, but the indexing seems quite a
>>>>>> bit slower than it did in the past on 4.10.  Any pointers?
>>>>>>
>>>>>> --Mike

Re: schemaless slow indexing

Posted by Erick Erickson <er...@gmail.com>.

I think you mean https://issues.apache.org/jira/browse/SOLR-7290?

Erick

On Sun, Mar 22, 2015 at 2:30 PM, Mike Murphy <mm...@gmail.com> wrote:
> That's it!
> I hand edited the file that says you are not supposed to edit it and
> removed that copyField.
> Indexing performance is now back to expected levels.
>
> I created an issue for this, https://issues.apache.org/jira/browse/SOLR-7284
>
> --Mike
>
> On Sun, Mar 22, 2015 at 3:29 PM, Yonik Seeley <ys...@gmail.com> wrote:
>> I took a quick look at the stock schemaless configs... unfortunately
>> they contain a performance trap.
>> There's a copyField by default that copies *all* fields to a catch-all
>> field called "_text".
>>
>> IMO, that's not a great default.  Double the index size (well, the
>> "index" portion of it at least... not stored fields), and slower
>> indexing performance.
>>
>> The other unfortunate thing is the name.  Nowhere else in solr (that
>> I know of) do we have a single underscore field name.  _text looks
>> more like a dynamicField pattern.  Our other fields with underscores
>> look like _version_ and _root_.  If we're going to start a new naming
>> convention (or expand the naming conventions) we need to have some
>> consistency and logic behind it.
>>
>> -Yonik
>>
>> On Sun, Mar 22, 2015 at 12:32 PM, Mike Murphy <mm...@gmail.com> wrote:
>>> I start up solr schemaless and index a bunch of data, and it takes a
>>> lot longer to finish indexing.
>>> No configuration changes, just straight schemaless.
>>>
>>> --Mike
>>>
>>> On Sun, Mar 22, 2015 at 12:27 PM, Erick Erickson
>>> <er...@gmail.com> wrote:
>>>> Please review: http://wiki.apache.org/solr/UsingMailingLists
>>>>
>>>> You haven't quantified the slowdown. Or given any details on how
>>>> you're measuring the "slowdown". Or how you've configured your setups
>>>> in 4.10 and 5.0. Or... As Hossman would say "details matter".
>>>>
>>>> Best,
>>>> Erick
>>>>
>>>> On Sun, Mar 22, 2015 at 8:35 AM, Mike Murphy <mm...@gmail.com> wrote:
>>>>> I'm trying out schemaless in solr 5.0, but the indexing seems quite a
>>>>> bit slower than it did in the past on 4.10.  Any pointers?
>>>>>
>>>>> --Mike

Re: schemaless slow indexing

Posted by Mike Murphy <mm...@gmail.com>.

That's it!
I hand edited the file that says you are not supposed to edit it and
removed that copyField.
Indexing performance is now back to expected levels.
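
For anyone wanting to script the same change, here is a rough sketch, assuming the rule in question looks like a copyField element with source="*" and dest="_text". The sample schema below is a made-up, abridged stand-in, not the real stock file, and the helper name is hypothetical:

```python
import xml.etree.ElementTree as ET

# Abridged, hypothetical schema used only to exercise the function.
SAMPLE_SCHEMA = """<schema name="example" version="1.5">
  <field name="id" type="string" indexed="true" stored="true"/>
  <field name="_text" type="text_general" indexed="true" stored="false" multiValued="true"/>
  <copyField source="*" dest="_text"/>
</schema>"""


def remove_catch_all_copy_fields(schema_xml, dest="_text"):
    """Drop copyField rules that copy every field ("*") into dest.

    Returns the modified XML as a string and the number of rules removed.
    """
    root = ET.fromstring(schema_xml)
    removed = 0
    # Iterate over a copy of the list so removal does not disturb iteration.
    for rule in list(root.findall("copyField")):
        if rule.get("source") == "*" and rule.get("dest") == dest:
            root.remove(rule)
            removed += 1
    return ET.tostring(root, encoding="unicode"), removed


cleaned, removed = remove_catch_all_copy_fields(SAMPLE_SCHEMA)
print(removed)
```

Note that round-tripping through ElementTree drops XML comments, so this is an illustration of the idea rather than something to run blindly over a live config.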

I created an issue for this, https://issues.apache.org/jira/browse/SOLR-7284

--Mike

On Sun, Mar 22, 2015 at 3:29 PM, Yonik Seeley <ys...@gmail.com> wrote:
> I took a quick look at the stock schemaless configs... unfortunately
> they contain a performance trap.
> There's a copyField by default that copies *all* fields to a catch-all
> field called "_text".
>
> IMO, that's not a great default.  Double the index size (well, the
> "index" portion of it at least... not stored fields), and slower
> indexing performance.
>
> The other unfortunate thing is the name.  Nowhere else in solr (that
> I know of) do we have a single underscore field name.  _text looks
> more like a dynamicField pattern.  Our other fields with underscores
> look like _version_ and _root_.  If we're going to start a new naming
> convention (or expand the naming conventions) we need to have some
> consistency and logic behind it.
>
> -Yonik
>
> On Sun, Mar 22, 2015 at 12:32 PM, Mike Murphy <mm...@gmail.com> wrote:
>> I start up solr schemaless and index a bunch of data, and it takes a
>> lot longer to finish indexing.
>> No configuration changes, just straight schemaless.
>>
>> --Mike
>>
>> On Sun, Mar 22, 2015 at 12:27 PM, Erick Erickson
>> <er...@gmail.com> wrote:
>>> Please review: http://wiki.apache.org/solr/UsingMailingLists
>>>
>>> You haven't quantified the slowdown. Or given any details on how
>>> you're measuring the "slowdown". Or how you've configured your setups
>>> in 4.10 and 5.0. Or... As Hossman would say "details matter".
>>>
>>> Best,
>>> Erick
>>>
>>> On Sun, Mar 22, 2015 at 8:35 AM, Mike Murphy <mm...@gmail.com> wrote:
>>>> I'm trying out schemaless in solr 5.0, but the indexing seems quite a
>>>> bit slower than it did in the past on 4.10.  Any pointers?
>>>>
>>>> --Mike

Re: schemaless slow indexing

Posted by Yonik Seeley <ys...@gmail.com>.

I took a quick look at the stock schemaless configs... unfortunately
they contain a performance trap.
There's a copyField by default that copies *all* fields to a catch-all
field called "_text".

IMO, that's not a great default.  Double the index size (well, the
"index" portion of it at least... not stored fields), and slower
indexing performance.

The other unfortunate thing is the name.  Nowhere else in solr (that
I know of) do we have a single underscore field name.  _text looks
more like a dynamicField pattern.  Our other fields with underscores
look like _version_ and _root_.  If we're going to start a new naming
convention (or expand the naming conventions) we need to have some
consistency and logic behind it.

-Yonik

On Sun, Mar 22, 2015 at 12:32 PM, Mike Murphy <mm...@gmail.com> wrote:
> I start up solr schemaless and index a bunch of data, and it takes a
> lot longer to finish indexing.
> No configuration changes, just straight schemaless.
>
> --Mike
>
> On Sun, Mar 22, 2015 at 12:27 PM, Erick Erickson
> <er...@gmail.com> wrote:
>> Please review: http://wiki.apache.org/solr/UsingMailingLists
>>
>> You haven't quantified the slowdown. Or given any details on how
>> you're measuring the "slowdown". Or how you've configured your setups
>> in 4.10 and 5.0. Or... As Hossman would say "details matter".
>>
>> Best,
>> Erick
>>
>> On Sun, Mar 22, 2015 at 8:35 AM, Mike Murphy <mm...@gmail.com> wrote:
>>> I'm trying out schemaless in solr 5.0, but the indexing seems quite a
>>> bit slower than it did in the past on 4.10.  Any pointers?
>>>
>>> --Mike

Re: schemaless slow indexing

Posted by Alexandre Rafalovitch <ar...@gmail.com>.

Same data and same version of Solr, with the only difference being
schema vs. schemaless? How much longer: 10%, 2x, 20x?

Schemaless mode has a much more complex UpdateRequestProcessor chain;
that's partially what makes it schemaless. But I hesitate to point
fingers at that without any real details.

Notice I am still asking the same questions as Erick!
----
Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter:
http://www.solr-start.com/


On 22 March 2015 at 12:32, Mike Murphy <mm...@gmail.com> wrote:
> I start up solr schemaless and index a bunch of data, and it takes a
> lot longer to finish indexing.
> No configuration changes, just straight schemaless.
>
> --Mike
>
> On Sun, Mar 22, 2015 at 12:27 PM, Erick Erickson
> <er...@gmail.com> wrote:
>> Please review: http://wiki.apache.org/solr/UsingMailingLists
>>
>> You haven't quantified the slowdown. Or given any details on how
>> you're measuring the "slowdown". Or how you've configured your setups
>> in 4.10 and 5.0. Or... As Hossman would say "details matter".
>>
>> Best,
>> Erick
>>
>> On Sun, Mar 22, 2015 at 8:35 AM, Mike Murphy <mm...@gmail.com> wrote:
>>> I'm trying out schemaless in solr 5.0, but the indexing seems quite a
>>> bit slower than it did in the past on 4.10.  Any pointers?
>>>
>>> --Mike

Re: schemaless slow indexing

Posted by Mike Murphy <mm...@gmail.com>.

I start up solr schemaless and index a bunch of data, and it takes a
lot longer to finish indexing.
No configuration changes, just straight schemaless.

--Mike

On Sun, Mar 22, 2015 at 12:27 PM, Erick Erickson
<er...@gmail.com> wrote:
> Please review: http://wiki.apache.org/solr/UsingMailingLists
>
> You haven't quantified the slowdown. Or given any details on how
> you're measuring the "slowdown". Or how you've configured your setups
> in 4.10 and 5.0. Or... As Hossman would say "details matter".
>
> Best,
> Erick
>
> On Sun, Mar 22, 2015 at 8:35 AM, Mike Murphy <mm...@gmail.com> wrote:
>> I'm trying out schemaless in solr 5.0, but the indexing seems quite a
>> bit slower than it did in the past on 4.10.  Any pointers?
>>
>> --Mike

Re: schemaless slow indexing

Posted by Erick Erickson <er...@gmail.com>.

Please review: http://wiki.apache.org/solr/UsingMailingLists

You haven't quantified the slowdown. Or given any details on how
you're measuring the "slowdown". Or how you've configured your setups
in 4.10 and 5.0. Or... As Hossman would say "details matter".
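
One possible way to put a number on it is sketched below, under the assumption that documents are sent in batches by some indexing function. Here fake_index_batch is a hypothetical stand-in that only sleeps; it would be replaced by whatever client call actually sends documents to Solr, and the same run repeated against each setup being compared:

```python
import time


def measure_throughput(index_batch, batches):
    """Time index_batch over every batch; return (docs, seconds, docs_per_second)."""
    total_docs = 0
    started = time.perf_counter()
    for batch in batches:
        index_batch(batch)
        total_docs += len(batch)
    elapsed = time.perf_counter() - started
    rate = total_docs / elapsed if elapsed > 0 else float("inf")
    return total_docs, elapsed, rate


def fake_index_batch(batch):
    """Hypothetical stand-in for a real indexing call; it only sleeps."""
    time.sleep(0.001)


# Five batches of 100 small synthetic documents each.
batches = [
    [{"id": str(i)} for i in range(offset, offset + 100)]
    for offset in range(0, 500, 100)
]
docs, seconds, rate = measure_throughput(fake_index_batch, batches)
print(f"{docs} docs in {seconds:.3f}s ({rate:.0f} docs/s)")
```

Reporting docs per second for the same data on both versions, along with the configs used, would answer most of the questions above.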

Best,
Erick

On Sun, Mar 22, 2015 at 8:35 AM, Mike Murphy <mm...@gmail.com> wrote:
> I'm trying out schemaless in solr 5.0, but the indexing seems quite a
> bit slower than it did in the past on 4.10.  Any pointers?
>
> --Mike