Posted to solr-user@lucene.apache.org by Marcus Herou <ma...@tailsweep.com> on 2009/04/25 12:54:02 UTC

Date faceting - howto improve performance

Hi.

One of our faceting use-cases:
We are creating trend graphs of how many blog posts contain a certain
term, grouped by day/week/year etc. with the nice DateMathParser
functions.

The performance degrades really fast and consumes a lot of memory, which
forces an OOM from time to time.
We think it is due to the fact that the cardinality of the field publishedDate
in our index is huge, almost equal to the number of documents in the index.

We need to address that...

Some questions:

1. Can a date field have a date format other than the default of
yyyy-MM-dd'T'HH:mm:ssZ?

2. We are thinking of adding a field to the index which has the format
yyyy-MM-dd to reduce the cardinality. If that field can't be a date, it
could perhaps be a string, but the question then is whether faceting can
still be used?

3. Since we already have such a huge index, is there a way to add a
field afterwards and apply it to all documents without actually reindexing
the whole shebang?

4. If the field cannot be a string, can we just leave out the
hour/minute/second information to reduce the cardinality and improve
performance? Example: 2009-01-01T00:00:00Z

5. I am afraid that we need to reindex everything to get this to work
(which negates Q3). We currently have 8 shards; what would be the most
efficient way to reindex the whole shebang? Dump the entire database to disk
(sigh), create many XML file splits and use curl on them in a
random/hash(numServers) manner?


Kindly

//Marcus







-- 
Marcus Herou CTO and co-founder Tailsweep AB
+46702561312
marcus.herou@tailsweep.com
http://www.tailsweep.com/
http://blogg.tailsweep.com/

Re: Date faceting - howto improve performance

Posted by Marcus Herou <ma...@tailsweep.com>.
Thanks, I should have grepped the source of course (like I always seem to
end up doing, haha)


/M

On Wed, Apr 29, 2009 at 10:13 PM, Shalin Shekhar Mangar <
shalinmangar@gmail.com> wrote:

> Some basic documentation is in the example schema.xml. Ask away if you have
> specific questions.
>
> On Thu, Apr 30, 2009 at 1:00 AM, Marcus Herou <marcus.herou@tailsweep.com
> >wrote:
>
> > Aha!
> >
> > Hmm , googling wont help me I see. any hints of usages ?
> >
> > /M
> >
> >
> > On Tue, Apr 28, 2009 at 12:29 AM, Shalin Shekhar Mangar <
> > shalinmangar@gmail.com> wrote:
> >
> > > Sorry, I'm late in this thread.
> > >
> > > Did you try using Trie fields (new in 1.4)? The regular date faceting
> > won't
> > > work out-of-the-box for trie fields I think. But you could use
> > facet.query
> > > to achieve the same effect. On my simple benchmarks I've found trie
> > fields
> > > to give a huge improvement in range searches.
> > >
> > > On Sat, Apr 25, 2009 at 4:24 PM, Marcus Herou <
> > marcus.herou@tailsweep.com
> > > >wrote:
> > >
> > > > Hi.
> > > >
> > > > One of our faceting use-cases:
> > > > We are creating trend graphs of how many blog posts that contains a
> > > certain
> > > > term and groups it by day/week/year etc. with the nice DateMathParser
> > > > functions.
> > > >
> > > > The performance degrades really fast and consumes a lot of memory
> which
> > > > forces OOM from time to time
> > > > We think it is due the fact that the cardinality of the field
> > > publishedDate
> > > > in our index is huge, almost equal to the nr of documents in the
> index.
> > > >
> > > > We need to address that...
> > > >
> > > > Some questions:
> > > >
> > > > 1. Can a datefield have other date-formats than the default of
> > yyyy-MM-dd
> > > > HH:mm:ssZ ?
> > > >
> > > > 2. We are thinking of adding a field to the index which have the
> format
> > > > yyyy-MM-dd to reduce the cardinality, if that field can't be a date,
> it
> > > > could perhaps be a string, but the question then is if faceting can
> be
> > > used
> > > > ?
> > > >
> > > > 3. Since we now already have such a huge index, is there a way to add
> a
> > > > field afterwards and apply it to all documents without actually
> > > reindexing
> > > > the whole shebang ?
> > > >
> > > > 4. If the field cannot be a string can we just leave out the
> > > > hour/minute/second information and to reduce the cardinality and
> > improve
> > > > performance ? Example: 2009-01-01 00:00:00Z
> > > >
> > > > 5. I am afraid that we need to reindex everything to get this to work
> > > > (negates Q3). We have 8 shards as of current, what would the most
> > > efficient
> > > > way be to reindexing the whole shebang ? Dump the entire database to
> > disk
> > > > (sigh), create many xml file splits and use curl in a
> > > > random/hash(numServers) manner on them ?
> > > >
> > > >
> > > > Kindly
> > > >
> > > > //Marcus
> > > >
> > > >
> > > >
> > > >
> > > >
> > > >
> > > >
> > > > --
> > > > Marcus Herou CTO and co-founder Tailsweep AB
> > > > +46702561312
> > > > marcus.herou@tailsweep.com
> > > > http://www.tailsweep.com/
> > > > http://blogg.tailsweep.com/
> > > >
> > >
> > >
> > >
> > > --
> > > Regards,
> > > Shalin Shekhar Mangar.
> > >
> >
> >
> >
> > --
> > Marcus Herou CTO and co-founder Tailsweep AB
> > +46702561312
> > marcus.herou@tailsweep.com
> > http://www.tailsweep.com/
> > http://blogg.tailsweep.com/
> >
>
>
>
> --
> Regards,
> Shalin Shekhar Mangar.
>



-- 
Marcus Herou CTO and co-founder Tailsweep AB
+46702561312
marcus.herou@tailsweep.com
http://www.tailsweep.com/
http://blogg.tailsweep.com/

Re: Date faceting - howto improve performance

Posted by Shalin Shekhar Mangar <sh...@gmail.com>.
Some basic documentation is in the example schema.xml. Ask away if you have
specific questions.
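
In case it helps, the trie date type in the 1.4 example schema looks roughly
like this (the exact attribute values may differ in your copy, and
publishedDate is just the field name used in this thread):

  <fieldType name="tdate" class="solr.TrieDateField" omitNorms="true"
             precisionStep="6" positionIncrementGap="0"/>
  <field name="publishedDate" type="tdate" indexed="true" stored="true"/>

A larger precisionStep indexes fewer extra terms per value at the cost of
slower range queries; precisionStep="0" disables the extra precision levels
altogether.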

On Thu, Apr 30, 2009 at 1:00 AM, Marcus Herou <ma...@tailsweep.com>wrote:

> Aha!
>
> Hmm , googling wont help me I see. any hints of usages ?
>
> /M
>
>
> On Tue, Apr 28, 2009 at 12:29 AM, Shalin Shekhar Mangar <
> shalinmangar@gmail.com> wrote:
>
> > Sorry, I'm late in this thread.
> >
> > Did you try using Trie fields (new in 1.4)? The regular date faceting
> won't
> > work out-of-the-box for trie fields I think. But you could use
> facet.query
> > to achieve the same effect. On my simple benchmarks I've found trie
> fields
> > to give a huge improvement in range searches.
> >
> > On Sat, Apr 25, 2009 at 4:24 PM, Marcus Herou <
> marcus.herou@tailsweep.com
> > >wrote:
> >
> > > Hi.
> > >
> > > One of our faceting use-cases:
> > > We are creating trend graphs of how many blog posts that contains a
> > certain
> > > term and groups it by day/week/year etc. with the nice DateMathParser
> > > functions.
> > >
> > > The performance degrades really fast and consumes a lot of memory which
> > > forces OOM from time to time
> > > We think it is due the fact that the cardinality of the field
> > publishedDate
> > > in our index is huge, almost equal to the nr of documents in the index.
> > >
> > > We need to address that...
> > >
> > > Some questions:
> > >
> > > 1. Can a datefield have other date-formats than the default of
> yyyy-MM-dd
> > > HH:mm:ssZ ?
> > >
> > > 2. We are thinking of adding a field to the index which have the format
> > > yyyy-MM-dd to reduce the cardinality, if that field can't be a date, it
> > > could perhaps be a string, but the question then is if faceting can be
> > used
> > > ?
> > >
> > > 3. Since we now already have such a huge index, is there a way to add a
> > > field afterwards and apply it to all documents without actually
> > reindexing
> > > the whole shebang ?
> > >
> > > 4. If the field cannot be a string can we just leave out the
> > > hour/minute/second information and to reduce the cardinality and
> improve
> > > performance ? Example: 2009-01-01 00:00:00Z
> > >
> > > 5. I am afraid that we need to reindex everything to get this to work
> > > (negates Q3). We have 8 shards as of current, what would the most
> > efficient
> > > way be to reindexing the whole shebang ? Dump the entire database to
> disk
> > > (sigh), create many xml file splits and use curl in a
> > > random/hash(numServers) manner on them ?
> > >
> > >
> > > Kindly
> > >
> > > //Marcus
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > > --
> > > Marcus Herou CTO and co-founder Tailsweep AB
> > > +46702561312
> > > marcus.herou@tailsweep.com
> > > http://www.tailsweep.com/
> > > http://blogg.tailsweep.com/
> > >
> >
> >
> >
> > --
> > Regards,
> > Shalin Shekhar Mangar.
> >
>
>
>
> --
> Marcus Herou CTO and co-founder Tailsweep AB
> +46702561312
> marcus.herou@tailsweep.com
> http://www.tailsweep.com/
> http://blogg.tailsweep.com/
>



-- 
Regards,
Shalin Shekhar Mangar.

Re: Date faceting - howto improve performance

Posted by Marcus Herou <ma...@tailsweep.com>.
Aha!

Hmm, googling won't help me here, I see. Any hints on usage?

/M


On Tue, Apr 28, 2009 at 12:29 AM, Shalin Shekhar Mangar <
shalinmangar@gmail.com> wrote:

> Sorry, I'm late in this thread.
>
> Did you try using Trie fields (new in 1.4)? The regular date faceting won't
> work out-of-the-box for trie fields I think. But you could use facet.query
> to achieve the same effect. On my simple benchmarks I've found trie fields
> to give a huge improvement in range searches.
>
> On Sat, Apr 25, 2009 at 4:24 PM, Marcus Herou <marcus.herou@tailsweep.com
> >wrote:
>
> > Hi.
> >
> > One of our faceting use-cases:
> > We are creating trend graphs of how many blog posts that contains a
> certain
> > term and groups it by day/week/year etc. with the nice DateMathParser
> > functions.
> >
> > The performance degrades really fast and consumes a lot of memory which
> > forces OOM from time to time
> > We think it is due the fact that the cardinality of the field
> publishedDate
> > in our index is huge, almost equal to the nr of documents in the index.
> >
> > We need to address that...
> >
> > Some questions:
> >
> > 1. Can a datefield have other date-formats than the default of yyyy-MM-dd
> > HH:mm:ssZ ?
> >
> > 2. We are thinking of adding a field to the index which have the format
> > yyyy-MM-dd to reduce the cardinality, if that field can't be a date, it
> > could perhaps be a string, but the question then is if faceting can be
> used
> > ?
> >
> > 3. Since we now already have such a huge index, is there a way to add a
> > field afterwards and apply it to all documents without actually
> reindexing
> > the whole shebang ?
> >
> > 4. If the field cannot be a string can we just leave out the
> > hour/minute/second information and to reduce the cardinality and improve
> > performance ? Example: 2009-01-01 00:00:00Z
> >
> > 5. I am afraid that we need to reindex everything to get this to work
> > (negates Q3). We have 8 shards as of current, what would the most
> efficient
> > way be to reindexing the whole shebang ? Dump the entire database to disk
> > (sigh), create many xml file splits and use curl in a
> > random/hash(numServers) manner on them ?
> >
> >
> > Kindly
> >
> > //Marcus
> >
> >
> >
> >
> >
> >
> >
> > --
> > Marcus Herou CTO and co-founder Tailsweep AB
> > +46702561312
> > marcus.herou@tailsweep.com
> > http://www.tailsweep.com/
> > http://blogg.tailsweep.com/
> >
>
>
>
> --
> Regards,
> Shalin Shekhar Mangar.
>



-- 
Marcus Herou CTO and co-founder Tailsweep AB
+46702561312
marcus.herou@tailsweep.com
http://www.tailsweep.com/
http://blogg.tailsweep.com/

Re: Date faceting - howto improve performance

Posted by Shalin Shekhar Mangar <sh...@gmail.com>.
Sorry, I'm late in this thread.

Did you try using trie fields (new in 1.4)? I think the regular date faceting
won't work out-of-the-box for trie fields, but you could use facet.query
to achieve the same effect. In my simple benchmarks I've found trie fields
to give a huge improvement in range searches.
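
As a rough sketch (assuming publishedDate is indexed as a trie date and
someterm stands in for whatever you are searching on), per-day counts via
facet.query would look something like:

  q=someterm&facet=true
   &facet.query=publishedDate:[2009-04-20T00:00:00Z TO 2009-04-21T00:00:00Z]
   &facet.query=publishedDate:[2009-04-21T00:00:00Z TO 2009-04-22T00:00:00Z]
   &facet.query=publishedDate:[2009-04-22T00:00:00Z TO 2009-04-23T00:00:00Z]

i.e. one facet.query per bucket, with the counts coming back under
facet_queries in the response.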

On Sat, Apr 25, 2009 at 4:24 PM, Marcus Herou <ma...@tailsweep.com>wrote:

> Hi.
>
> One of our faceting use-cases:
> We are creating trend graphs of how many blog posts that contains a certain
> term and groups it by day/week/year etc. with the nice DateMathParser
> functions.
>
> The performance degrades really fast and consumes a lot of memory which
> forces OOM from time to time
> We think it is due the fact that the cardinality of the field publishedDate
> in our index is huge, almost equal to the nr of documents in the index.
>
> We need to address that...
>
> Some questions:
>
> 1. Can a datefield have other date-formats than the default of yyyy-MM-dd
> HH:mm:ssZ ?
>
> 2. We are thinking of adding a field to the index which have the format
> yyyy-MM-dd to reduce the cardinality, if that field can't be a date, it
> could perhaps be a string, but the question then is if faceting can be used
> ?
>
> 3. Since we now already have such a huge index, is there a way to add a
> field afterwards and apply it to all documents without actually reindexing
> the whole shebang ?
>
> 4. If the field cannot be a string can we just leave out the
> hour/minute/second information and to reduce the cardinality and improve
> performance ? Example: 2009-01-01 00:00:00Z
>
> 5. I am afraid that we need to reindex everything to get this to work
> (negates Q3). We have 8 shards as of current, what would the most efficient
> way be to reindexing the whole shebang ? Dump the entire database to disk
> (sigh), create many xml file splits and use curl in a
> random/hash(numServers) manner on them ?
>
>
> Kindly
>
> //Marcus
>
>
>
>
>
>
>
> --
> Marcus Herou CTO and co-founder Tailsweep AB
> +46702561312
> marcus.herou@tailsweep.com
> http://www.tailsweep.com/
> http://blogg.tailsweep.com/
>



-- 
Regards,
Shalin Shekhar Mangar.

RE: Date faceting - howto improve performance

Posted by "Smiley, David W." <ds...@mitre.org>.
Hi Marcus.

You must supply dates in the format you are using now -- ISO-8601 with the trailing Z to indicate UTC (no time-zone offset).  To reduce cardinality to the day level instead of the second level you are currently indexing at, the date you supply can include DateMathParser operations.  So if you supply 2009-04-01T20:15:01Z/DAY, this will do what you expect.  Of course you then lose the ability to search at a granularity finer than a day.  And the date you get back (i.e. the stored value) is the rounded date, not the date prior to rounding.
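
As a minimal sketch (the id value and field name are just placeholders), indexing the day-rounded date and then faceting on it per day could look like:

  <add>
    <doc>
      <field name="id">12345</field>
      <field name="publishedDate">2009-04-01T20:15:01Z/DAY</field>
    </doc>
  </add>

  ...&facet=true&facet.date=publishedDate
     &facet.date.start=2009-01-01T00:00:00Z&facet.date.end=2009-05-01T00:00:00Z
     &facet.date.gap=%2B1DAY

With every value rounded to midnight, each day contributes a single distinct term, which is what keeps the facet counts cheap.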

Yes, you will certainly need to re-index.  Since you have architected your indexing strategy, only you know how to go about doing that.  By now I'm sure you are aware that you cannot update individual fields.  By the way, if your current strategy involves periodic updates, then you could take the approach of simply waiting until all your data eventually gets re-indexed.  There's no harm in some of the dates being rounded and some not -- it's just that until most of them are rounded, you will keep your current problem of sporadic OOMs.

~ David
________________________________________
From: Marcus Herou [marcus.herou@tailsweep.com]
Sent: Saturday, April 25, 2009 6:54 AM
To: solr-user@lucene.apache.org
Subject: Date faceting - howto improve performance

Hi.

One of our faceting use-cases:
We are creating trend graphs of how many blog posts that contains a certain
term and groups it by day/week/year etc. with the nice DateMathParser
functions.

The performance degrades really fast and consumes a lot of memory which
forces OOM from time to time
We think it is due the fact that the cardinality of the field publishedDate
in our index is huge, almost equal to the nr of documents in the index.

We need to address that...

Some questions:

1. Can a datefield have other date-formats than the default of yyyy-MM-dd
HH:mm:ssZ ?

2. We are thinking of adding a field to the index which have the format
yyyy-MM-dd to reduce the cardinality, if that field can't be a date, it
could perhaps be a string, but the question then is if faceting can be used
?

3. Since we now already have such a huge index, is there a way to add a
field afterwards and apply it to all documents without actually reindexing
the whole shebang ?

4. If the field cannot be a string can we just leave out the
hour/minute/second information and to reduce the cardinality and improve
performance ? Example: 2009-01-01 00:00:00Z

5. I am afraid that we need to reindex everything to get this to work
(negates Q3). We have 8 shards as of current, what would the most efficient
way be to reindexing the whole shebang ? Dump the entire database to disk
(sigh), create many xml file splits and use curl in a
random/hash(numServers) manner on them ?


Kindly

//Marcus







--
Marcus Herou CTO and co-founder Tailsweep AB
+46702561312
marcus.herou@tailsweep.com
http://www.tailsweep.com/
http://blogg.tailsweep.com/

Re: Date faceting - howto improve performance

Posted by Marcus Herou <ma...@tailsweep.com>.
Yes that's exactly what I meant.

I think adding "new" fields to a separate index and using ParallelReader at
query time would be something to investigate at the Solr level.
I can spend some time creating a patch for this if you think it is a
good idea and if you think it would be merged into the repo, haha.
It is not very mainstream, but I think everyone with more than a million
docs curses a lot over having to stop the entire service for a couple
of days just to add a field :)

We have 60M rows now and a 50,000 MB index (shit, that is roughly 800 bytes
per doc, man that is too much), so we are getting into a state where
reindexing is starting to become impossible...

Keep up the fantastic work

//Marcus



On Mon, Apr 27, 2009 at 5:09 PM, Ning Li <ni...@gmail.com> wrote:

> You mean doc A and doc B will become one doc after adding index 2 to
> index 1? I don't think this is currently supported either at Lucene
> level or at Solr level. If index 1 has m docs and index 2 has n docs,
> index 1 will have m+n docs after adding index 2 to index 1. Documents
> themselves are not modified by index merge.
>
> Cheers,
> Ning
>
>
> On Sat, Apr 25, 2009 at 4:03 PM, Marcus Herou
> <ma...@tailsweep.com> wrote:
> > Hmm looking in the code for the IndexMerger in Solr
> > (org.apache.solr.update.DirectUpdateHandler(2)
> >
> > See that the IndexWriter.addIndexesNoOptimize(dirs) is used (union of
> > indexes) ?
> >
> > And the test class
> org.apache.solr.client.solrj.MergeIndexesExampleTestBase
> > suggests:
> > add doc A to index1 with id=AAA,name=core1
> > add doc B to index2 with id=BBB,name=core2
> > merge the two indexes into one index which then contains both docs.
> > The resulting index will have 2 docs.
> >
> > Great but in my case I think it should work more like this.
> >
> > add doc A to index1 with id=X,title=blog entry title,description=blog
> entry
> > description
> > add doc B to index2 with id=X,score=1.2
> > somehow add index2 to index1 so id=XX has score=1.2 when searching in
> index1
> > The resulting index should have 1 doc.
> >
> > So this is not really what I want right ?
> >
> > Sorry for being a smart-ass...
> >
> > Kindly
> >
> > //Marcus
> >
> >
> >
> >
> >
> > On Sat, Apr 25, 2009 at 5:10 PM, Marcus Herou <
> marcus.herou@tailsweep.com>wrote:
> >
> >> Guys!
> >>
> >> Thanks for these insights, I think we will head for Lucene level merging
> >> strategy (two or more indexes).
> >> When merging I guess the second index need to have the same doc ids
> >> somehow. This is an internal id in Lucene, not that easy to get hold of
> >> right ?
> >>
> >> So you are saying the the solr: ExternalFileField + FunctionQuery stuff
> >> would not work very well performance wise or what do you mean ?
> >>
> >> I sure like bleeding edge :)
> >>
> >> Cheers dudes
> >>
> >> //Marcus
> >>
> >>
> >>
> >>
> >>
> >> On Sat, Apr 25, 2009 at 3:46 PM, Otis Gospodnetic <
> >> otis_gospodnetic@yahoo.com> wrote:
> >>
> >>>
> >>> I should emphasize that the PR trick I mentioned is something you'd do
> at
> >>> the Lucene level, outside Solr, and then you'd just slip the modified
> index
> >>> back into Solr.
> >>> Of, if you like the bleeding edge, perhaps you can make use of Ning
> Li's
> >>> Solr index merging functionality (patch in JIRA).
> >>>
> >>>
> >>> Otis --
> >>> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
> >>>
> >>>
> >>>
> >>> ----- Original Message ----
> >>> > From: Otis Gospodnetic <ot...@yahoo.com>
> >>> > To: solr-user@lucene.apache.org
> >>> > Sent: Saturday, April 25, 2009 9:41:45 AM
> >>> > Subject: Re: Date faceting - howto improve performance
> >>> >
> >>> >
> >>> > Yes, you could simply round the date, no need for a non-date type
> field.
> >>> > Yes, you can add a field after the fact by making use of
> ParallelReader
> >>> and
> >>> > merging (I don't recall the details, search the ML for ParallelReader
> >>> and
> >>> > Andrzej), I remember he once provided the working recipe.
> >>> >
> >>> >
> >>> > Otis --
> >>> > Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
> >>> >
> >>> >
> >>> >
> >>> > ----- Original Message ----
> >>> > > From: Marcus Herou
> >>> > > To: solr-user@lucene.apache.org
> >>> > > Sent: Saturday, April 25, 2009 6:54:02 AM
> >>> > > Subject: Date faceting - howto improve performance
> >>> > >
> >>> > > Hi.
> >>> > >
> >>> > > One of our faceting use-cases:
> >>> > > We are creating trend graphs of how many blog posts that contains a
> >>> certain
> >>> > > term and groups it by day/week/year etc. with the nice
> DateMathParser
> >>> > > functions.
> >>> > >
> >>> > > The performance degrades really fast and consumes a lot of memory
> >>> which
> >>> > > forces OOM from time to time
> >>> > > We think it is due the fact that the cardinality of the field
> >>> publishedDate
> >>> > > in our index is huge, almost equal to the nr of documents in the
> >>> index.
> >>> > >
> >>> > > We need to address that...
> >>> > >
> >>> > > Some questions:
> >>> > >
> >>> > > 1. Can a datefield have other date-formats than the default of
> >>> yyyy-MM-dd
> >>> > > HH:mm:ssZ ?
> >>> > >
> >>> > > 2. We are thinking of adding a field to the index which have the
> >>> format
> >>> > > yyyy-MM-dd to reduce the cardinality, if that field can't be a
> date,
> >>> it
> >>> > > could perhaps be a string, but the question then is if faceting can
> be
> >>> used
> >>> > > ?
> >>> > >
> >>> > > 3. Since we now already have such a huge index, is there a way to
> add
> >>> a
> >>> > > field afterwards and apply it to all documents without actually
> >>> reindexing
> >>> > > the whole shebang ?
> >>> > >
> >>> > > 4. If the field cannot be a string can we just leave out the
> >>> > > hour/minute/second information and to reduce the cardinality and
> >>> improve
> >>> > > performance ? Example: 2009-01-01 00:00:00Z
> >>> > >
> >>> > > 5. I am afraid that we need to reindex everything to get this to
> work
> >>> > > (negates Q3). We have 8 shards as of current, what would the most
> >>> efficient
> >>> > > way be to reindexing the whole shebang ? Dump the entire database
> to
> >>> disk
> >>> > > (sigh), create many xml file splits and use curl in a
> >>> > > random/hash(numServers) manner on them ?
> >>> > >
> >>> > >
> >>> > > Kindly
> >>> > >
> >>> > > //Marcus
> >>> > >
> >>> > >
> >>> > >
> >>> > >
> >>> > >
> >>> > >
> >>> > >
> >>> > > --
> >>> > > Marcus Herou CTO and co-founder Tailsweep AB
> >>> > > +46702561312
> >>> > > marcus.herou@tailsweep.com
> >>> > > http://www.tailsweep.com/
> >>> > > http://blogg.tailsweep.com/
> >>>
> >>>
> >>
> >>
> >> --
> >> Marcus Herou CTO and co-founder Tailsweep AB
> >> +46702561312
> >> marcus.herou@tailsweep.com
> >> http://www.tailsweep.com/
> >> http://blogg.tailsweep.com/
> >>
> >
> >
> >
> > --
> > Marcus Herou CTO and co-founder Tailsweep AB
> > +46702561312
> > marcus.herou@tailsweep.com
> > http://www.tailsweep.com/
> > http://blogg.tailsweep.com/
> >
>



-- 
Marcus Herou CTO and co-founder Tailsweep AB
+46702561312
marcus.herou@tailsweep.com
http://www.tailsweep.com/
http://blogg.tailsweep.com/

Re: Date faceting - howto improve performance

Posted by Ning Li <ni...@gmail.com>.
You mean doc A and doc B will become one doc after adding index 2 to
index 1? I don't think this is currently supported, either at the Lucene
level or at the Solr level. If index 1 has m docs and index 2 has n docs,
index 1 will have m+n docs after adding index 2 to index 1. Documents
themselves are not modified by an index merge.

Cheers,
Ning


On Sat, Apr 25, 2009 at 4:03 PM, Marcus Herou
<ma...@tailsweep.com> wrote:
> Hmm looking in the code for the IndexMerger in Solr
> (org.apache.solr.update.DirectUpdateHandler(2)
>
> See that the IndexWriter.addIndexesNoOptimize(dirs) is used (union of
> indexes) ?
>
> And the test class org.apache.solr.client.solrj.MergeIndexesExampleTestBase
> suggests:
> add doc A to index1 with id=AAA,name=core1
> add doc B to index2 with id=BBB,name=core2
> merge the two indexes into one index which then contains both docs.
> The resulting index will have 2 docs.
>
> Great but in my case I think it should work more like this.
>
> add doc A to index1 with id=X,title=blog entry title,description=blog entry
> description
> add doc B to index2 with id=X,score=1.2
> somehow add index2 to index1 so id=XX has score=1.2 when searching in index1
> The resulting index should have 1 doc.
>
> So this is not really what I want right ?
>
> Sorry for being a smart-ass...
>
> Kindly
>
> //Marcus
>
>
>
>
>
> On Sat, Apr 25, 2009 at 5:10 PM, Marcus Herou <ma...@tailsweep.com>wrote:
>
>> Guys!
>>
>> Thanks for these insights, I think we will head for Lucene level merging
>> strategy (two or more indexes).
>> When merging I guess the second index need to have the same doc ids
>> somehow. This is an internal id in Lucene, not that easy to get hold of
>> right ?
>>
>> So you are saying the the solr: ExternalFileField + FunctionQuery stuff
>> would not work very well performance wise or what do you mean ?
>>
>> I sure like bleeding edge :)
>>
>> Cheers dudes
>>
>> //Marcus
>>
>>
>>
>>
>>
>> On Sat, Apr 25, 2009 at 3:46 PM, Otis Gospodnetic <
>> otis_gospodnetic@yahoo.com> wrote:
>>
>>>
>>> I should emphasize that the PR trick I mentioned is something you'd do at
>>> the Lucene level, outside Solr, and then you'd just slip the modified index
>>> back into Solr.
>>> Of, if you like the bleeding edge, perhaps you can make use of Ning Li's
>>> Solr index merging functionality (patch in JIRA).
>>>
>>>
>>> Otis --
>>> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
>>>
>>>
>>>
>>> ----- Original Message ----
>>> > From: Otis Gospodnetic <ot...@yahoo.com>
>>> > To: solr-user@lucene.apache.org
>>> > Sent: Saturday, April 25, 2009 9:41:45 AM
>>> > Subject: Re: Date faceting - howto improve performance
>>> >
>>> >
>>> > Yes, you could simply round the date, no need for a non-date type field.
>>> > Yes, you can add a field after the fact by making use of ParallelReader
>>> and
>>> > merging (I don't recall the details, search the ML for ParallelReader
>>> and
>>> > Andrzej), I remember he once provided the working recipe.
>>> >
>>> >
>>> > Otis --
>>> > Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
>>> >
>>> >
>>> >
>>> > ----- Original Message ----
>>> > > From: Marcus Herou
>>> > > To: solr-user@lucene.apache.org
>>> > > Sent: Saturday, April 25, 2009 6:54:02 AM
>>> > > Subject: Date faceting - howto improve performance
>>> > >
>>> > > Hi.
>>> > >
>>> > > One of our faceting use-cases:
>>> > > We are creating trend graphs of how many blog posts that contains a
>>> certain
>>> > > term and groups it by day/week/year etc. with the nice DateMathParser
>>> > > functions.
>>> > >
>>> > > The performance degrades really fast and consumes a lot of memory
>>> which
>>> > > forces OOM from time to time
>>> > > We think it is due the fact that the cardinality of the field
>>> publishedDate
>>> > > in our index is huge, almost equal to the nr of documents in the
>>> index.
>>> > >
>>> > > We need to address that...
>>> > >
>>> > > Some questions:
>>> > >
>>> > > 1. Can a datefield have other date-formats than the default of
>>> yyyy-MM-dd
>>> > > HH:mm:ssZ ?
>>> > >
>>> > > 2. We are thinking of adding a field to the index which have the
>>> format
>>> > > yyyy-MM-dd to reduce the cardinality, if that field can't be a date,
>>> it
>>> > > could perhaps be a string, but the question then is if faceting can be
>>> used
>>> > > ?
>>> > >
>>> > > 3. Since we now already have such a huge index, is there a way to add
>>> a
>>> > > field afterwards and apply it to all documents without actually
>>> reindexing
>>> > > the whole shebang ?
>>> > >
>>> > > 4. If the field cannot be a string can we just leave out the
>>> > > hour/minute/second information and to reduce the cardinality and
>>> improve
>>> > > performance ? Example: 2009-01-01 00:00:00Z
>>> > >
>>> > > 5. I am afraid that we need to reindex everything to get this to work
>>> > > (negates Q3). We have 8 shards as of current, what would the most
>>> efficient
>>> > > way be to reindexing the whole shebang ? Dump the entire database to
>>> disk
>>> > > (sigh), create many xml file splits and use curl in a
>>> > > random/hash(numServers) manner on them ?
>>> > >
>>> > >
>>> > > Kindly
>>> > >
>>> > > //Marcus
>>> > >
>>> > >
>>> > >
>>> > >
>>> > >
>>> > >
>>> > >
>>> > > --
>>> > > Marcus Herou CTO and co-founder Tailsweep AB
>>> > > +46702561312
>>> > > marcus.herou@tailsweep.com
>>> > > http://www.tailsweep.com/
>>> > > http://blogg.tailsweep.com/
>>>
>>>
>>
>>
>> --
>> Marcus Herou CTO and co-founder Tailsweep AB
>> +46702561312
>> marcus.herou@tailsweep.com
>> http://www.tailsweep.com/
>> http://blogg.tailsweep.com/
>>
>
>
>
> --
> Marcus Herou CTO and co-founder Tailsweep AB
> +46702561312
> marcus.herou@tailsweep.com
> http://www.tailsweep.com/
> http://blogg.tailsweep.com/
>

Re: Date faceting - howto improve performance

Posted by Marcus Herou <ma...@tailsweep.com>.
Hmm, looking at the code for the index merger in Solr
(org.apache.solr.update.DirectUpdateHandler2),

I see that IndexWriter.addIndexesNoOptimize(dirs) is used (a union of the
indexes)?

And the test class org.apache.solr.client.solrj.MergeIndexesExampleTestBase
suggests:
add doc A to index1 with id=AAA,name=core1
add doc B to index2 with id=BBB,name=core2
merge the two indexes into one index which then contains both docs.
The resulting index will have 2 docs.

Great, but in my case I think it should work more like this:

add doc A to index1 with id=X,title=blog entry title,description=blog entry
description
add doc B to index2 with id=X,score=1.2
somehow add index2 to index1 so that id=X has score=1.2 when searching in
index1
The resulting index should have 1 doc.

So this is not really what I want, right?

Sorry for being a smart-ass...

Kindly

//Marcus





On Sat, Apr 25, 2009 at 5:10 PM, Marcus Herou <ma...@tailsweep.com>wrote:

> Guys!
>
> Thanks for these insights, I think we will head for Lucene level merging
> strategy (two or more indexes).
> When merging I guess the second index need to have the same doc ids
> somehow. This is an internal id in Lucene, not that easy to get hold of
> right ?
>
> So you are saying the the solr: ExternalFileField + FunctionQuery stuff
> would not work very well performance wise or what do you mean ?
>
> I sure like bleeding edge :)
>
> Cheers dudes
>
> //Marcus
>
>
>
>
>
> On Sat, Apr 25, 2009 at 3:46 PM, Otis Gospodnetic <
> otis_gospodnetic@yahoo.com> wrote:
>
>>
>> I should emphasize that the PR trick I mentioned is something you'd do at
>> the Lucene level, outside Solr, and then you'd just slip the modified index
>> back into Solr.
>> Of, if you like the bleeding edge, perhaps you can make use of Ning Li's
>> Solr index merging functionality (patch in JIRA).
>>
>>
>> Otis --
>> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
>>
>>
>>
>> ----- Original Message ----
>> > From: Otis Gospodnetic <ot...@yahoo.com>
>> > To: solr-user@lucene.apache.org
>> > Sent: Saturday, April 25, 2009 9:41:45 AM
>> > Subject: Re: Date faceting - howto improve performance
>> >
>> >
>> > Yes, you could simply round the date, no need for a non-date type field.
>> > Yes, you can add a field after the fact by making use of ParallelReader
>> and
>> > merging (I don't recall the details, search the ML for ParallelReader
>> and
>> > Andrzej), I remember he once provided the working recipe.
>> >
>> >
>> > Otis --
>> > Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
>> >
>> >
>> >
>> > ----- Original Message ----
>> > > From: Marcus Herou
>> > > To: solr-user@lucene.apache.org
>> > > Sent: Saturday, April 25, 2009 6:54:02 AM
>> > > Subject: Date faceting - howto improve performance
>> > >
>> > > Hi.
>> > >
>> > > One of our faceting use-cases:
>> > > We are creating trend graphs of how many blog posts that contains a
>> certain
>> > > term and groups it by day/week/year etc. with the nice DateMathParser
>> > > functions.
>> > >
>> > > The performance degrades really fast and consumes a lot of memory
>> which
>> > > forces OOM from time to time
>> > > We think it is due the fact that the cardinality of the field
>> publishedDate
>> > > in our index is huge, almost equal to the nr of documents in the
>> index.
>> > >
>> > > We need to address that...
>> > >
>> > > Some questions:
>> > >
>> > > 1. Can a datefield have other date-formats than the default of
>> yyyy-MM-dd
>> > > HH:mm:ssZ ?
>> > >
>> > > 2. We are thinking of adding a field to the index which have the
>> format
>> > > yyyy-MM-dd to reduce the cardinality, if that field can't be a date,
>> it
>> > > could perhaps be a string, but the question then is if faceting can be
>> used
>> > > ?
>> > >
>> > > 3. Since we now already have such a huge index, is there a way to add
>> a
>> > > field afterwards and apply it to all documents without actually
>> reindexing
>> > > the whole shebang ?
>> > >
>> > > 4. If the field cannot be a string can we just leave out the
>> > > hour/minute/second information and to reduce the cardinality and
>> improve
>> > > performance ? Example: 2009-01-01 00:00:00Z
>> > >
>> > > 5. I am afraid that we need to reindex everything to get this to work
>> > > (negates Q3). We have 8 shards as of current, what would the most
>> efficient
>> > > way be to reindexing the whole shebang ? Dump the entire database to
>> disk
>> > > (sigh), create many xml file splits and use curl in a
>> > > random/hash(numServers) manner on them ?
>> > >
>> > >
>> > > Kindly
>> > >
>> > > //Marcus
>> > >
>> > >
>> > >
>> > >
>> > >
>> > >
>> > >
>> > > --
>> > > Marcus Herou CTO and co-founder Tailsweep AB
>> > > +46702561312
>> > > marcus.herou@tailsweep.com
>> > > http://www.tailsweep.com/
>> > > http://blogg.tailsweep.com/
>>
>>
>
>
> --
> Marcus Herou CTO and co-founder Tailsweep AB
> +46702561312
> marcus.herou@tailsweep.com
> http://www.tailsweep.com/
> http://blogg.tailsweep.com/
>



-- 
Marcus Herou CTO and co-founder Tailsweep AB
+46702561312
marcus.herou@tailsweep.com
http://www.tailsweep.com/
http://blogg.tailsweep.com/

Re: Date faceting - howto improve performance

Posted by Marcus Herou <ma...@tailsweep.com>.
Oh, and the indexing strategy is just random assignment across the shards.

What I asked about was a "best practice" for achieving the highest MB/sec
while indexing. My feeling is that the Java API would be less efficient than
something more raw like curl, but that is just my hunch.
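
For concreteness, the curl approach would be something along these lines per
shard (host, port and file names are made up):

  curl 'http://shard3:8983/solr/update' -H 'Content-Type: text/xml' \
       --data-binary @posts-split-0042.xml
  curl 'http://shard3:8983/solr/update' -H 'Content-Type: text/xml' \
       --data-binary '<commit/>'

with the target shard picked by hash(id) % numServers.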

/M


On Sat, Apr 25, 2009 at 5:10 PM, Marcus Herou <ma...@tailsweep.com>wrote:

> Guys!
>
> Thanks for these insights, I think we will head for Lucene level merging
> strategy (two or more indexes).
> When merging I guess the second index need to have the same doc ids
> somehow. This is an internal id in Lucene, not that easy to get hold of
> right ?
>
> So you are saying the the solr: ExternalFileField + FunctionQuery stuff
> would not work very well performance wise or what do you mean ?
>
> I sure like bleeding edge :)
>
> Cheers dudes
>
> //Marcus
>
>
>
>
>
> On Sat, Apr 25, 2009 at 3:46 PM, Otis Gospodnetic <
> otis_gospodnetic@yahoo.com> wrote:
>
>>
>> I should emphasize that the PR trick I mentioned is something you'd do at
>> the Lucene level, outside Solr, and then you'd just slip the modified index
>> back into Solr.
>> Of, if you like the bleeding edge, perhaps you can make use of Ning Li's
>> Solr index merging functionality (patch in JIRA).
>>
>>
>> Otis --
>> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
>>
>>
>>
>> ----- Original Message ----
>> > From: Otis Gospodnetic <ot...@yahoo.com>
>> > To: solr-user@lucene.apache.org
>> > Sent: Saturday, April 25, 2009 9:41:45 AM
>> > Subject: Re: Date faceting - howto improve performance
>> >
>> >
>> > Yes, you could simply round the date, no need for a non-date type field.
>> > Yes, you can add a field after the fact by making use of ParallelReader
>> and
>> > merging (I don't recall the details, search the ML for ParallelReader
>> and
>> > Andrzej), I remember he once provided the working recipe.
>> >
>> >
>> > Otis --
>> > Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
>> >
>> >
>> >
>> > ----- Original Message ----
>> > > From: Marcus Herou
>> > > To: solr-user@lucene.apache.org
>> > > Sent: Saturday, April 25, 2009 6:54:02 AM
>> > > Subject: Date faceting - howto improve performance
>> > >
>> > > Hi.
>> > >
>> > > One of our faceting use-cases:
>> > > We are creating trend graphs of how many blog posts that contains a
>> certain
>> > > term and groups it by day/week/year etc. with the nice DateMathParser
>> > > functions.
>> > >
>> > > The performance degrades really fast and consumes a lot of memory
>> which
>> > > forces OOM from time to time
>> > > We think it is due the fact that the cardinality of the field
>> publishedDate
>> > > in our index is huge, almost equal to the nr of documents in the
>> index.
>> > >
>> > > We need to address that...
>> > >
>> > > Some questions:
>> > >
>> > > 1. Can a datefield have other date-formats than the default of
>> yyyy-MM-dd
>> > > HH:mm:ssZ ?
>> > >
>> > > 2. We are thinking of adding a field to the index which have the
>> format
>> > > yyyy-MM-dd to reduce the cardinality, if that field can't be a date,
>> it
>> > > could perhaps be a string, but the question then is if faceting can be
>> used
>> > > ?
>> > >
>> > > 3. Since we now already have such a huge index, is there a way to add
>> a
>> > > field afterwards and apply it to all documents without actually
>> reindexing
>> > > the whole shebang ?
>> > >
>> > > 4. If the field cannot be a string can we just leave out the
>> > > hour/minute/second information and to reduce the cardinality and
>> improve
>> > > performance ? Example: 2009-01-01 00:00:00Z
>> > >
>> > > 5. I am afraid that we need to reindex everything to get this to work
>> > > (negates Q3). We have 8 shards as of current, what would the most
>> efficient
>> > > way be to reindexing the whole shebang ? Dump the entire database to
>> disk
>> > > (sigh), create many xml file splits and use curl in a
>> > > random/hash(numServers) manner on them ?
>> > >
>> > >
>> > > Kindly
>> > >
>> > > //Marcus
>> > >
>> > >
>> > >
>> > >
>> > >
>> > >
>> > >
>> > > --
>> > > Marcus Herou CTO and co-founder Tailsweep AB
>> > > +46702561312
>> > > marcus.herou@tailsweep.com
>> > > http://www.tailsweep.com/
>> > > http://blogg.tailsweep.com/
>>
>>
>
>
> --
> Marcus Herou CTO and co-founder Tailsweep AB
> +46702561312
> marcus.herou@tailsweep.com
> http://www.tailsweep.com/
> http://blogg.tailsweep.com/
>



-- 
Marcus Herou CTO and co-founder Tailsweep AB
+46702561312
marcus.herou@tailsweep.com
http://www.tailsweep.com/
http://blogg.tailsweep.com/

Re: Date faceting - howto improve performance

Posted by Marcus Herou <ma...@tailsweep.com>.
Guys!

Thanks for these insights. I think we will head for a Lucene-level merging
strategy (two or more indexes).
When merging, I guess the second index needs to have the same doc ids somehow.
That is an internal id in Lucene, not that easy to get hold of, right?

So are you saying that the Solr ExternalFileField + FunctionQuery stuff
would not work very well performance-wise, or what do you mean?
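
(For reference, the ExternalFileField setup being referred to would be wired
up roughly like this, with made-up names; valType has to point at a
float-based field type in the schema:

  <fieldType name="extScore" class="solr.ExternalFileField" keyField="id"
             defVal="0" valType="pfloat"/>
  <field name="rank" type="extScore"/>

plus a file named external_rank in the index data directory with one id=value
line per document, e.g. X=1.2, which is then read at query time through a
FunctionQuery rather than being stored in the index itself.)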

I sure like bleeding edge :)

Cheers dudes

//Marcus




On Sat, Apr 25, 2009 at 3:46 PM, Otis Gospodnetic <
otis_gospodnetic@yahoo.com> wrote:

>
> I should emphasize that the PR trick I mentioned is something you'd do at
> the Lucene level, outside Solr, and then you'd just slip the modified index
> back into Solr.
> Of, if you like the bleeding edge, perhaps you can make use of Ning Li's
> Solr index merging functionality (patch in JIRA).
>
>
> Otis --
> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
>
>
>
> ----- Original Message ----
> > From: Otis Gospodnetic <ot...@yahoo.com>
> > To: solr-user@lucene.apache.org
> > Sent: Saturday, April 25, 2009 9:41:45 AM
> > Subject: Re: Date faceting - howto improve performance
> >
> >
> > Yes, you could simply round the date, no need for a non-date type field.
> > Yes, you can add a field after the fact by making use of ParallelReader
> and
> > merging (I don't recall the details, search the ML for ParallelReader and
> > Andrzej), I remember he once provided the working recipe.
> >
> >
> > Otis --
> > Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
> >
> >
> >
> > ----- Original Message ----
> > > From: Marcus Herou
> > > To: solr-user@lucene.apache.org
> > > Sent: Saturday, April 25, 2009 6:54:02 AM
> > > Subject: Date faceting - howto improve performance
> > >
> > > Hi.
> > >
> > > One of our faceting use-cases:
> > > We are creating trend graphs of how many blog posts that contains a
> certain
> > > term and groups it by day/week/year etc. with the nice DateMathParser
> > > functions.
> > >
> > > The performance degrades really fast and consumes a lot of memory which
> > > forces OOM from time to time
> > > We think it is due the fact that the cardinality of the field
> publishedDate
> > > in our index is huge, almost equal to the nr of documents in the index.
> > >
> > > We need to address that...
> > >
> > > Some questions:
> > >
> > > 1. Can a datefield have other date-formats than the default of
> yyyy-MM-dd
> > > HH:mm:ssZ ?
> > >
> > > 2. We are thinking of adding a field to the index which have the format
> > > yyyy-MM-dd to reduce the cardinality, if that field can't be a date, it
> > > could perhaps be a string, but the question then is if faceting can be
> used
> > > ?
> > >
> > > 3. Since we now already have such a huge index, is there a way to add a
> > > field afterwards and apply it to all documents without actually
> reindexing
> > > the whole shebang ?
> > >
> > > 4. If the field cannot be a string can we just leave out the
> > > hour/minute/second information and to reduce the cardinality and
> improve
> > > performance ? Example: 2009-01-01 00:00:00Z
> > >
> > > 5. I am afraid that we need to reindex everything to get this to work
> > > (negates Q3). We have 8 shards as of current, what would the most
> efficient
> > > way be to reindexing the whole shebang ? Dump the entire database to
> disk
> > > (sigh), create many xml file splits and use curl in a
> > > random/hash(numServers) manner on them ?
> > >
> > >
> > > Kindly
> > >
> > > //Marcus
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > > --
> > > Marcus Herou CTO and co-founder Tailsweep AB
> > > +46702561312
> > > marcus.herou@tailsweep.com
> > > http://www.tailsweep.com/
> > > http://blogg.tailsweep.com/
>
>


-- 
Marcus Herou CTO and co-founder Tailsweep AB
+46702561312
marcus.herou@tailsweep.com
http://www.tailsweep.com/
http://blogg.tailsweep.com/

Re: Date faceting - howto improve performance

Posted by Otis Gospodnetic <ot...@yahoo.com>.
I should emphasize that the ParallelReader trick I mentioned is something you'd do at the Lucene level, outside Solr, and then you'd just slip the modified index back into Solr.
Or, if you like the bleeding edge, perhaps you can make use of Ning Li's Solr index merging functionality (patch in JIRA).


Otis --
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch



----- Original Message ----
> From: Otis Gospodnetic <ot...@yahoo.com>
> To: solr-user@lucene.apache.org
> Sent: Saturday, April 25, 2009 9:41:45 AM
> Subject: Re: Date faceting - howto improve performance
> 
> 
> Yes, you could simply round the date, no need for a non-date type field.
> Yes, you can add a field after the fact by making use of ParallelReader and 
> merging (I don't recall the details, search the ML for ParallelReader and 
> Andrzej), I remember he once provided the working recipe.
> 
> 
> Otis --
> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
> 
> 
> 
> ----- Original Message ----
> > From: Marcus Herou 
> > To: solr-user@lucene.apache.org
> > Sent: Saturday, April 25, 2009 6:54:02 AM
> > Subject: Date faceting - howto improve performance
> > 
> > Hi.
> > 
> > One of our faceting use-cases:
> > We are creating trend graphs of how many blog posts that contains a certain
> > term and groups it by day/week/year etc. with the nice DateMathParser
> > functions.
> > 
> > The performance degrades really fast and consumes a lot of memory which
> > forces OOM from time to time
> > We think it is due the fact that the cardinality of the field publishedDate
> > in our index is huge, almost equal to the nr of documents in the index.
> > 
> > We need to address that...
> > 
> > Some questions:
> > 
> > 1. Can a datefield have other date-formats than the default of yyyy-MM-dd
> > HH:mm:ssZ ?
> > 
> > 2. We are thinking of adding a field to the index which have the format
> > yyyy-MM-dd to reduce the cardinality, if that field can't be a date, it
> > could perhaps be a string, but the question then is if faceting can be used
> > ?
> > 
> > 3. Since we now already have such a huge index, is there a way to add a
> > field afterwards and apply it to all documents without actually reindexing
> > the whole shebang ?
> > 
> > 4. If the field cannot be a string can we just leave out the
> > hour/minute/second information and to reduce the cardinality and improve
> > performance ? Example: 2009-01-01 00:00:00Z
> > 
> > 5. I am afraid that we need to reindex everything to get this to work
> > (negates Q3). We have 8 shards as of current, what would the most efficient
> > way be to reindexing the whole shebang ? Dump the entire database to disk
> > (sigh), create many xml file splits and use curl in a
> > random/hash(numServers) manner on them ?
> > 
> > 
> > Kindly
> > 
> > //Marcus
> > 
> > 
> > 
> > 
> > 
> > 
> > 
> > -- 
> > Marcus Herou CTO and co-founder Tailsweep AB
> > +46702561312
> > marcus.herou@tailsweep.com
> > http://www.tailsweep.com/
> > http://blogg.tailsweep.com/


Re: Date faceting - howto improve performance

Posted by Otis Gospodnetic <ot...@yahoo.com>.
Yes, you could simply round the date; there is no need for a non-date type field.
Yes, you can add a field after the fact by making use of ParallelReader and merging (I don't recall the details; search the ML for ParallelReader and Andrzej), I remember he once provided a working recipe.


Otis --
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch



----- Original Message ----
> From: Marcus Herou <ma...@tailsweep.com>
> To: solr-user@lucene.apache.org
> Sent: Saturday, April 25, 2009 6:54:02 AM
> Subject: Date faceting - howto improve performance
> 
> Hi.
> 
> One of our faceting use-cases:
> We are creating trend graphs of how many blog posts that contains a certain
> term and groups it by day/week/year etc. with the nice DateMathParser
> functions.
> 
> The performance degrades really fast and consumes a lot of memory which
> forces OOM from time to time
> We think it is due the fact that the cardinality of the field publishedDate
> in our index is huge, almost equal to the nr of documents in the index.
> 
> We need to address that...
> 
> Some questions:
> 
> 1. Can a datefield have other date-formats than the default of yyyy-MM-dd
> HH:mm:ssZ ?
> 
> 2. We are thinking of adding a field to the index which have the format
> yyyy-MM-dd to reduce the cardinality, if that field can't be a date, it
> could perhaps be a string, but the question then is if faceting can be used
> ?
> 
> 3. Since we now already have such a huge index, is there a way to add a
> field afterwards and apply it to all documents without actually reindexing
> the whole shebang ?
> 
> 4. If the field cannot be a string can we just leave out the
> hour/minute/second information and to reduce the cardinality and improve
> performance ? Example: 2009-01-01 00:00:00Z
> 
> 5. I am afraid that we need to reindex everything to get this to work
> (negates Q3). We have 8 shards as of current, what would the most efficient
> way be to reindexing the whole shebang ? Dump the entire database to disk
> (sigh), create many xml file splits and use curl in a
> random/hash(numServers) manner on them ?
> 
> 
> Kindly
> 
> //Marcus
> 
> 
> 
> 
> 
> 
> 
> -- 
> Marcus Herou CTO and co-founder Tailsweep AB
> +46702561312
> marcus.herou@tailsweep.com
> http://www.tailsweep.com/
> http://blogg.tailsweep.com/