Posted to solr-dev@lucene.apache.org by "Stu Hood (JIRA)" <ji...@apache.org> on 2007/12/19 19:01:09 UTC

[jira] Created: (SOLR-440) Should use longs for internal DateField storage

Should use longs for internal DateField storage
-----------------------------------------------

                 Key: SOLR-440
                 URL: https://issues.apache.org/jira/browse/SOLR-440
             Project: Solr
          Issue Type: Improvement
          Components: search
    Affects Versions: 1.2, 1.3
            Reporter: Stu Hood


The current DateField implementation uses formatted Strings internally to store dates.

As a user creating a schema, I assumed that using the DateField type would be more efficient than using an integer field to store seconds. In fact, the DateField type is currently significantly less efficient: ~20 extra bytes are required per value, and String sorting requires that all values fit in memory.

As soon as sorting on long fields has been implemented (SOLR-324), I'd suggest that the date implementation be switched to use long values internally, representing milliseconds since the epoch in UTC. Unfortunately, this will cause index incompatibilities, so the schema version will need to be bumped.
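
To make the size comparison concrete, here is a minimal Java sketch (not Solr code) that converts an ISO-8601 UTC timestamp to milliseconds since the epoch and back, and prints the raw size of each representation. The class and method names are hypothetical, and the byte counts cover only the raw values, not any index or String object overhead.

```java
import java.nio.charset.StandardCharsets;
import java.time.Instant;

public class DateStorageSketch {
    // Parse an ISO-8601 UTC timestamp into milliseconds since the epoch.
    static long toEpochMillis(String isoUtc) {
        return Instant.parse(isoUtc).toEpochMilli();
    }

    // Render epoch milliseconds back into an ISO-8601 UTC timestamp.
    static String toIsoUtc(long epochMillis) {
        return Instant.ofEpochMilli(epochMillis).toString();
    }

    public static void main(String[] args) {
        String iso = "2007-12-19T19:01:09Z";
        long millis = toEpochMillis(iso);

        // Raw payload sizes only: the formatted string versus a single long.
        int stringBytes = iso.getBytes(StandardCharsets.UTF_8).length;
        System.out.println(iso + " -> " + millis);
        System.out.println("string bytes: " + stringBytes + ", long bytes: " + Long.BYTES);
        System.out.println("round trip: " + toIsoUtc(millis));
    }
}
```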

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (SOLR-440) Should use longs for internal DateField storage

Posted by "Hoss Man (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/SOLR-440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12555663#action_12555663 ] 

Hoss Man commented on SOLR-440:
-------------------------------

depending on what the final implementation looks like, i could easily be convinced to use a new class like this in the example configs, and maybe even go so far as to deprecate the existing DateField ... but i can't imagine being convinced to completely replace DateField with it at any point ... a RAM performance improvement in sorting doesn't seem worth breaking back compatibility and forcing people to reindex ... not when it would be just as easy to add a new (subclass) FieldType.

consider the people who currently use DateField for range queries but never for sorting -- they'll be pretty upset if we tell them they have to reindex and they won't see any noticeable benefit from doing so.



[jira] Commented: (SOLR-440) Should use longs for internal DateField storage

Posted by "Stu Hood (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/SOLR-440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12555679#action_12555679 ] 

Stu Hood commented on SOLR-440:
-------------------------------

Is there no way to make DateField backwards compatible? Perhaps with 'magic' values, or by looking at the byte length of the field before attempting to parse it as a long or string?

I'm probably misunderstanding how significant a difference in memory usage it would be, but I kind of feel like I was suckered into using DateField. I have a field (the timestamp field in the dist solrconfig.xml that defaults to NOW) that I currently cannot sort on, because it runs instances out of memory trying to do a string sort.

Thanks for considering it!






[jira] Commented: (SOLR-440) Should use longs for internal DateField storage

Posted by "Stu Hood (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/SOLR-440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12555493#action_12555493 ] 

Stu Hood commented on SOLR-440:
-------------------------------

Thanks a lot for the explanation, I think I get it now.


That sounds like a good plan. Perhaps it could replace the DateField in Solr 2.0.


-----Original Message-----
From: "Hoss Man (JIRA)" <ji...@apache.org>
Sent: Wednesday, January 2, 2008 7:20pm
To: stuhood@webmail.us
Subject: [jira] Commented: (SOLR-440) Should use longs for internal DateField storage


    [ https://issues.apache.org/jira/browse/SOLR-440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12555450#action_12555450 ] 

Hoss Man commented on SOLR-440:
-------------------------------

From an email reply i'm just now looking at...

{noformat}
Date: Sat, 22 Dec 2007 22:59:29 -0500 (EST)
From: Stu Hood 

If sorting was working, why couldn't range queries be supported?
{noformat}

RangeQueries (or RangeFilters more specifically) require that a single in-order walk of the TermEnum from "low" to "high" include all values in the index that are truly in the range, and none that aren't .... Sorting works differently: the first time a field is sorted on, all the values are walked and an "inverted-inverted index" (as i like to call it) FieldCache is built mapping docIds to values.

Although .... assuming the FieldCache support for Longs supports a "LongParser" the same way the Int and Float support does (i just checked, and it does) then a new "SortableLongDateField" could be created which uses the same tricks as SortableLongField to "index" values that sort lexicographically and builds a FieldCache using a long[] ... but knows how to parse/return ISO formatted dates (and do DateMath).






[jira] Commented: (SOLR-440) Should use longs for internal DateField storage

Posted by "Hoss Man (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/SOLR-440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12554153 ] 

Hoss Man commented on SOLR-440:
-------------------------------

using longs for millis since epoch in addition to "long" sorting might be more efficient if all you care about is strict date sorting, but "range queries" wouldn't work in that case.

that plus index back compatibility are reason enough to leave String as the internal representation of DateField ... but there is no reason we can't add a new FieldType that uses a long as the internal representation.

before a lot of work is invested in this issue however, it might be worthwhile to revisit and consider some of the previous discussions about potential changes/improvements to Solr date handling...

http://www.nabble.com/dates---times-to10417533.html#a10417533





[jira] Commented: (SOLR-440) Should use longs for internal DateField storage

Posted by "Hoss Man (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/SOLR-440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12555450#action_12555450 ] 

Hoss Man commented on SOLR-440:
-------------------------------

From an email reply i'm just now looking at...

{noformat}
Date: Sat, 22 Dec 2007 22:59:29 -0500 (EST)
From: Stu Hood 

If sorting was working, why couldn't range queries be supported?
{noformat}

RangeQueries (or RangeFilters more specifically) require that a single in-order walk of the TermEnum from "low" to "high" include all values in the index that are truly in the range, and none that aren't .... Sorting works differently: the first time a field is sorted on, all the values are walked and an "inverted-inverted index" (as i like to call it) FieldCache is built mapping docIds to values.
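
The range-walk requirement can be illustrated with a small Java sketch. It is not Lucene code: a TreeSet<String> stands in for the sorted term dictionary, and the hypothetical rangeWalk method stands in for walking the TermEnum from "low" to "high". The sample values are made up. ISO-formatted dates come out right because their string order matches chronological order; plain decimal strings do not.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.TreeSet;

public class TermRangeSketch {
    // Mimic a range filter: one in-order walk over the sorted terms
    // from low to high, both ends inclusive.
    static List<String> rangeWalk(TreeSet<String> terms, String low, String high) {
        return new ArrayList<>(terms.subSet(low, true, high, true));
    }

    public static void main(String[] args) {
        // ISO-formatted dates: lexicographic order equals chronological order.
        TreeSet<String> isoTerms = new TreeSet<>(Arrays.asList(
                "2007-12-19T19:01:09Z",
                "2007-12-22T22:59:29Z",
                "2008-01-02T19:20:00Z"));
        System.out.println(rangeWalk(isoTerms, "2007-12-20T00:00:00Z", "2008-01-01T00:00:00Z"));

        // Numbers written as plain decimal strings: lexicographic order
        // differs from numeric order, so the same walk returns wrong terms.
        TreeSet<String> decimalTerms = new TreeSet<>(Arrays.asList("5", "50", "100", "1000"));
        System.out.println(rangeWalk(decimalTerms, "10", "60"));
    }
}
```

With these sample values the ISO walk returns only the December 22 term, while the decimal walk from "10" to "60" returns 100, 1000, 5 and 50, even though only 50 lies in that range numerically.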

Although .... assuming the FieldCache support for Longs supports a "LongParser" the same way the Int and Float support does (i just checked, and it does) then a new "SortableLongDateField" could be created which uses the same tricks as SortableLongField to "index" values that sort lexicographically and builds a FieldCache using a long[] ... but knows how to parse/return ISO formatted dates (and do DateMath).
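
As a rough illustration of the kind of trick involved, the Java sketch below encodes a long so that string order matches numeric order, by flipping the sign bit and writing fixed-width hex. This is an assumed encoding chosen for readability, not the actual format used by SortableLongField, and the class and method names are hypothetical. It accepts and returns ISO-formatted dates at the edges while keeping a long in the middle.

```java
import java.time.Instant;

public class SortableDateSketch {
    // Encode a signed long as 16 hex digits whose lexicographic order
    // matches the numeric order of the original values.
    // Flipping the sign bit makes negative values sort before positive ones.
    static String encode(long value) {
        return String.format("%016x", value ^ Long.MIN_VALUE);
    }

    // Reverse of encode: recover the original signed long from an indexed term.
    static long decode(String term) {
        return Long.parseUnsignedLong(term, 16) ^ Long.MIN_VALUE;
    }

    // External ISO-8601 date -> sortable indexed term.
    static String toIndexed(String isoUtc) {
        return encode(Instant.parse(isoUtc).toEpochMilli());
    }

    // Sortable indexed term -> external ISO-8601 date.
    static String toExternal(String term) {
        return Instant.ofEpochMilli(decode(term)).toString();
    }

    public static void main(String[] args) {
        String before = toIndexed("1969-12-31T23:59:59Z"); // negative millis
        String after = toIndexed("2007-12-19T19:01:09Z");
        System.out.println(before + " sorts before " + after + ": "
                + (before.compareTo(after) < 0));
        System.out.println(toExternal(after));
    }
}
```

Because every term has the same width and the same ordering as the underlying long, a term walk from "low" to "high" sees values in date order, and decode could play the role of the parser that fills a long[] for sorting.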





[jira] Issue Comment Edited: (SOLR-440) Should use longs for internal DateField storage

Posted by "Stu Hood (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/SOLR-440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12555493#action_12555493 ] 

stuhood edited comment on SOLR-440 at 1/3/08 12:47 PM:
--------------------------------------------------------

Thanks a lot for the explanation, I think I get it now.

That sounds like a good plan. Perhaps it could replace the DateField in Solr 2.0.
