Posted to java-user@lucene.apache.org by Jason Toy <ja...@gmail.com> on 2012/02/23 07:25:00 UTC

date issues

I have a Solr instance with about 400m docs. For text searches it is perfectly fine. When I run searches that count the number of times a word appeared in the doc set for every day of a month, Solr usually crashes with out-of-memory errors.
I calculate this by running ~30 queries, one for each day, to get the count for that day.
Is there a better way I could do this?
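
For illustration, the per-day approach described above might look roughly like the sketch below. It only builds the ~30 range clauses and does not talk to Solr. The field name "created_at" and the class and method names are hypothetical, since the post does not give the actual field name. The upper bound is written at second resolution, matching the timestamp format shown in this post.

```java
import java.time.LocalDate;
import java.time.YearMonth;
import java.util.ArrayList;
import java.util.List;

final class DailyRangeQueries {
    // Builds one inclusive Solr range clause per calendar day of the month.
    // The field name is a parameter because the thread does not name it.
    static List<String> forMonth(String field, YearMonth month) {
        List<String> clauses = new ArrayList<>();
        for (int day = 1; day <= month.lengthOfMonth(); day++) {
            LocalDate d = month.atDay(day);
            // Covers 00:00:00 through 23:59:59 UTC of that day, both ends inclusive.
            clauses.add(field + ":[" + d + "T00:00:00Z TO " + d + "T23:59:59Z]");
        }
        return clauses;
    }

    public static void main(String[] args) {
        // "created_at" is a made-up field name used only for this example.
        for (String clause : forMonth("created_at", YearMonth.of(2012, 2))) {
            System.out.println(clause);
        }
    }
}
```

Each clause would then be combined with the word being counted and sent as its own query, which is the one-query-per-day pattern in question.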

Currently the date fields are stored as:
<fieldType name="date" class="solr.TrieDateField" omitNorms="true" precisionStep="0" positionIncrementGap="0"/>

and the timestamps are stored in the format of:
2012-02-22T21:11:14Z

We have no need to store anything beyond the date. Would simply changing the time portion to zeros make things faster? For example:
2012-02-22T00:00:00Z

I thought that to optimize this, there would be an actual date type that doesn't store the time component, but looking through the Solr docs, I don't see anything specifically for a date as opposed to a timestamp. Would it be faster for me to store dates in an sint format? What is the optimal format I should use? If the answer is to continue using TrieDateField, is it not a waste to store the hours/minutes/seconds even if they are not being used?

Is there anything else I can do to make this more efficient?

I have looked around on the mailing list and on Google and am not sure what to use. I would appreciate any pointers. Thanks.

Jason
---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: date issues

Posted by Danil ŢORIN <to...@gmail.com>.
Ranges on String are painfully slow.

Format them as YYYYMMDD and store as class="solr.TrieIntField"
precisionStep="8" omitNorms="true" positionIncrementGap="0"
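
As a rough sketch of the conversion this implies on the indexing side, assuming the source values are ISO-8601 UTC timestamps like the ones in the original post, something along these lines would produce the YYYYMMDD integer. The class and method names are made up for the example and are not part of Solr or Lucene.

```java
import java.time.Instant;
import java.time.LocalDate;
import java.time.ZoneOffset;

final class DayKey {
    // Converts an ISO-8601 UTC timestamp into an int of the form YYYYMMDD.
    static int fromTimestamp(String isoUtc) {
        LocalDate d = Instant.parse(isoUtc).atZone(ZoneOffset.UTC).toLocalDate();
        return d.getYear() * 10_000 + d.getMonthValue() * 100 + d.getDayOfMonth();
    }

    public static void main(String[] args) {
        // The sample timestamp is the one quoted in the original post.
        System.out.println(fromTimestamp("2012-02-22T21:11:14Z"));
    }
}
```

Because YYYYMMDD integers sort in calendar order, a numeric range over them selects a contiguous span of days.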



Re: date issues

Posted by findbestopensource <fi...@gmail.com>.
Yes. If you store it as a String, you should still be able to do range
searches. I am not sure which is better, storing as String or as Integer.

 Regards
 Aditya
 www.findbestopensource.com



Re: date issues

Posted by Jason Toy <ja...@gmail.com>.
Can I still do range searches on a string? It seems like it would be more efficient to store as an integer.


Re: date issues

Posted by findbestopensource <fi...@gmail.com>.
Hi,

You could consider storing the date field as a String in "YYYYMMDD" format.
This will save space and it will perform better.

Regards
Aditya
www.findbestopensource.com



Re: date issues

Posted by Erick Erickson <er...@gmail.com>.
1> Don't use sint; it's being deprecated, and it'll take up more space
than a TrieDate.
2> Precision: sure, use the coarsest time resolution you can. Normalizing
everything to day would be a good thing.

You won't get any space savings by storing to day resolution; it's
just a long under the covers. But depending on how you're doing your
query, you may see much lower memory usage, since some searches are
sensitive to the number of *unique* terms in a field and you'll reduce
that number.

But without some idea of the queries you're running it's hard to say whether
this will help.
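
To make the unique-terms point concrete, here is a small sketch (not Solr code, and the class and method names are invented for the example) that normalizes timestamps to day resolution and compares the number of distinct values before and after. The sample timestamps other than the first are made up.

```java
import java.time.Instant;
import java.time.temporal.ChronoUnit;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class DayTruncation {
    // Zeroes out the time portion of an ISO-8601 UTC timestamp.
    static String toDay(String isoUtc) {
        return Instant.parse(isoUtc).truncatedTo(ChronoUnit.DAYS).toString();
    }

    // Counts distinct values after normalizing each one to day resolution.
    static int distinctDays(List<String> timestamps) {
        Set<String> days = new HashSet<>();
        for (String ts : timestamps) {
            days.add(toDay(ts));
        }
        return days.size();
    }

    public static void main(String[] args) {
        List<String> raw = Arrays.asList(
                "2012-02-22T21:11:14Z",
                "2012-02-22T08:00:01Z",
                "2012-02-23T00:15:30Z");
        System.out.println("distinct raw values: " + new HashSet<>(raw).size());
        System.out.println("distinct day values: " + distinctDays(raw));
    }
}
```

With second-resolution values, nearly every document can carry its own distinct value; at day resolution a month of data has at most 31.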

Best
Erick
