Posted to solr-user@lucene.apache.org by David Larochelle <dl...@cyber.law.harvard.edu> on 2013/08/01 00:44:17 UTC

Re: SolrCloud and Joins

Thanks Walter,

Existing media sets will rarely change, but new media sets will be
added relatively frequently. (There is a many-to-many relationship
between media sets and media sources.) Given the size of the data,
even a new media set that covers only 1% of the collection would
include 6 million documents.

Our data is stored in a PostgreSQL database and imported using the
DataImportHandler; a full import takes around 3 days. In the
single-shard case, the nice thing about using joins is that the
media-set-to-source mapping data can be updated by an hourly cron job
while the sentence data is updated using a delta query.
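
For concreteness, the delta setup I have in mind looks roughly like
this (a sketch only; the table and column names, including the
last_updated timestamp column, are illustrative stand-ins for our
actual schema):

<dataConfig>
  <dataSource driver="org.postgresql.Driver"
              url="jdbc:postgresql://localhost/mediacloud"
              user="solr" password="secret"/>
  <document>
    <entity name="sentence" pk="sentence_id"
            query="SELECT sentence_id, stories_id, media_id, sentence
                   FROM story_sentences"
            deltaQuery="SELECT sentence_id FROM story_sentences
                        WHERE last_updated &gt;
                              '${dataimporter.last_index_time}'"
            deltaImportQuery="SELECT sentence_id, stories_id, media_id,
                                     sentence
                              FROM story_sentences
                              WHERE sentence_id =
                                    '${dataimporter.delta.sentence_id}'"/>
  </document>
</dataConfig>

The cron job would then hit the handler with command=delta-import.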

The obvious alternative to joins is to add media_sets_id to the
sentence data as a multi-valued field. We'll benchmark this. But my
concern is that the full import will take even longer and that there
will be no easy way to automatically update each affected document
when a new media set is created. (I could write a separate one-off
query for DataImportHandler each time a new media set is added, but
this requires a lot of manual interaction.)
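
One option I haven't tried is Solr's atomic updates (available since
4.0), which can append a value to a multiValued field without
resending the whole document; my understanding is that this requires
all fields to be stored. A rough SolrJ sketch, assuming a sentence_id
uniqueKey as Walter suggests below:

import java.util.Collections;

import org.apache.solr.client.solrj.impl.CloudSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class MediaSetUpdater {
    // Append a new media_sets_id to every affected sentence document.
    // The sentence IDs would come from a query on the set's sources,
    // e.g. q=media_id:(1 2 3)&fl=sentence_id
    static void addMediaSet(CloudSolrServer solr, int mediaSetsId,
                            Iterable<String> sentenceIds) throws Exception {
        for (String sentenceId : sentenceIds) {
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("sentence_id", sentenceId); // uniqueKey (assumed)
            // the {"add": value} map appends to the multiValued field
            doc.addField("media_sets_id",
                         Collections.singletonMap("add", mediaSetsId));
            solr.add(doc);
        }
        solr.commit();
    }
}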

Does SolrCloud really not have a simple way to specify which shard a
document goes to? I'm considering randomly generating document ID
prefixes and then taking their MurmurHash values to determine which
shards they correspond to. I could then explicitly send documents to
a particular shard by specifying a document ID prefix. However, this
seems like a hackish approach. Is there a better way?
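
For concreteness, here is a sketch of the prefix-probing idea, using
the murmur3 implementation that ships with Solr. Assumptions: the
compositeId router takes the high 16 bits of the route hash from
murmurhash3_x86_32 of the ID prefix with seed 0 (as in Solr 4.x), and
the target shard range below is made up; the real ranges live in
clusterstate.json.

import org.apache.solr.common.util.Hash;

public class PrefixFinder {
    public static void main(String[] args) {
        // Illustrative target: shard1 of a 2-shard collection, whose
        // hash range "80000000-ffffffff" is [Integer.MIN_VALUE, -1]
        // when read as signed ints (Solr compares ranges as signed).
        int targetMin = 0x80000000;
        int targetMax = 0xffffffff;

        for (int i = 0; ; i++) {
            String prefix = "p" + i;
            int h = Hash.murmurhash3_x86_32(prefix, 0, prefix.length(), 0);
            int routeHash = h & 0xffff0000; // prefix fills the high 16 bits
            if (routeHash >= targetMin && routeHash <= targetMax) {
                System.out.println("IDs like \"" + prefix
                        + "!some-doc-id\" should route to the target shard");
                break;
            }
        }
    }
}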



On Mon, Jul 29, 2013 at 12:45 PM, Walter Underwood <wu...@wunderwood.org> wrote:

> A join may seem clean, but it will be slow and (currently) doesn't work in
> a cluster.
>
> You find all the sentences in a media set by searching for that set id and
> requesting only the sentence_id (yes, you need that). Then you reindex
> them. With small documents like this, it is probably fairly fast.
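>
> For example, something like this (using the field names from this
> thread):
>
> http://localhost:8983/solr/select?q=media_set_id:1&fl=sentence_id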
>
> If you can't estimate how often the media sets will change or the size of
> the changes, then you aren't ready to choose a design.
>
> wunder
>
> On Jul 29, 2013, at 8:41 AM, David Larochelle wrote:
>
> > We'd like to be able to easily update the media-set-to-source
> > mapping. I'm concerned that if we store the media_sets_id in the
> > sentence documents, it will be very difficult to add additional
> > media-set-to-source mappings. I imagine that adding a new media set
> > would either require reimporting all 600 million documents or
> > writing complicated application logic to find out which sentences
> > to update. Hence joins seem like a cleaner solution.
> >
> > --
> >
> > David
> >
> >
> > On Mon, Jul 29, 2013 at 11:22 AM, Walter Underwood <wunder@wunderwood.org> wrote:
> >
> >> Denormalize. Add media_set_id to each sentence document. Done.
> >>
> >> wunder
> >>
> >> On Jul 29, 2013, at 7:58 AM, David Larochelle wrote:
> >>
> >>> I'm setting up SolrCloud with around 600 million documents. The basic
> >>> structure of each document is:
> >>>
> >>> stories_id: integer, media_id: integer, sentence: text_en
> >>>
> >>> We have a number of stories from different media and we treat
> >>> each sentence as a separate document because we need to run
> >>> sentence-level analytics.
> >>>
> >>> We also have a concept of groups, or sets, of sources. We've
> >>> imported this media-source-to-media-set mapping into Solr using
> >>> the following structure:
> >>>
> >>> media_id_inner: integer, media_sets_id: integer
> >>>
> >>> For the single-node case, we're able to filter our sources by
> >>> media_sets_id using a join query like the following:
> >>>
> >>> http://localhost:8983/solr/select?q={!join+from=media_id_inner+to=media_id}media_sets_id:1
> >>>
> >>> However, this does not work correctly with SolrCloud. The problem
> >>> is that the join query is performed separately on each of the
> >>> shards, and no shard has the complete media-set-to-source mapping
> >>> data, so SolrCloud returns incomplete results.
> >>>
> >>> Since the complete media-set-to-source mapping data is
> >>> comparatively small (~50,000 rows), I would like to replicate it
> >>> on every shard, so that the results of the individual join queries
> >>> on separate shards would be equivalent to performing the same
> >>> query on a single-shard system.
> >>>
> >>> However, I can't figure out how to replicate documents across
> >>> separate shards. The compositeId router has the ability to
> >>> colocate documents based on a prefix in the document ID, but this
> >>> isn't what I need. What I would like is some way either to have
> >>> the media-set-to-source data replicated on every shard or to be
> >>> able to explicitly upload this data to the individual shards. (For
> >>> the rest of the data I like the compositeId autorouting.)
> >>>
> >>> Any suggestions?
> >>>
> >>> --
> >>>
> >>> Thanks,
> >>>
> >>>
> >>> David
> >>
> >> --
> >> Walter Underwood
> >> wunder@wunderwood.org
> >>
>
> --
> Walter Underwood
> wunder@wunderwood.org
>