Posted to solr-user@lucene.apache.org by Susheel Kumar <su...@gmail.com> on 2015/12/22 17:25:58 UTC

Schema/Index design for disparate data sources (Federated / Google like search)

Hello,

I am going through a few use cases where we have multiple disparate data
sources which, in general, don't have many fields in common. I was thinking
of designing a different schema/index/collection for each of them, querying
each separately, and returning separate result sets to the client.

I have seen one implementation where all the different fields from these
disparate data sources were put together in a single schema/index/collection
so that everything could be searched easily through a catch-all field, but it
ended up with 200+ fields, including copy fields. The problem I see with this
design is that ingestion will be slower (and harder to scale), since many of
the fields for one data source will not be applicable when ingesting another
data source. Basically everything is being dumped into one huge
schema/index/collection.
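Roughly, that schema was wired up like this (I'm inventing field names here
just to illustrate, this is not the actual schema):

  <field name="description"  type="text_general" indexed="true" stored="true"/>
  <field name="product_name" type="text_general" indexed="true" stored="true"/>
  <field name="page_title"   type="text_general" indexed="true" stored="true"/>
  <!-- ...200+ more fields, most of them empty for any given source... -->
  <field name="catch_all"    type="text_general" indexed="true" stored="false" multiValued="true"/>

  <copyField source="description"  dest="catch_all"/>
  <copyField source="product_name" dest="catch_all"/>
  <copyField source="page_title"   dest="catch_all"/>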

Having seen that, I am wondering how we can design this better in another
implementation where we have the requirement to search across disparate
sources (each having 10-15 searchable fields and 10-15 stored fields) with
only one common field, such as description, in each of the data sources.
Most of the time users will search on description, and the rest of the time
on a combination of other fields. This is similar to a Google-like search,
where you search for "coffee" and it searches various data sources
(websites, maps, images, places, etc.)

My thought is to make a separate index for each search scenario. For
example, for the single search box we index description, the other key
fields that can be searched together, and the data source type into one
index/schema, so that we don't build one huge index/schema and can use a
catch-all field for search.
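A minimal sketch of what I have in mind for that slim shared schema (field
names again just illustrative):

  <field name="id"          type="string"       indexed="true" stored="true" required="true"/>
  <field name="source_type" type="string"       indexed="true" stored="true"/>
  <field name="description" type="text_general" indexed="true" stored="true"/>
  <field name="title"       type="text_general" indexed="true" stored="true"/>
  <field name="catch_all"   type="text_general" indexed="true" stored="false" multiValued="true"/>

  <copyField source="description" dest="catch_all"/>
  <copyField source="title"       dest="catch_all"/>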

And for the other, advanced (field-specific) search scenario, we create a
separate index/schema for each data source.
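To make it concrete, the queries would look something like this (collection
and field names assumed just for illustration, URL-encoding omitted):

  # single search box: the slim shared collection, facetable by source type
  /solr/unified/select?q=catch_all:coffee&facet=true&facet.field=source_type

  # advanced search: field-specific query against that source's own collection
  /solr/products/select?q=product_name:coffee AND price:[10 TO 50]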

Any suggestions/guidelines on how we can better address this in terms of
responsiveness and scaling? Each data source may have 50-100+ million
documents.

Thanks,
Susheel

Re: Schema/Index design for disparate data sources (Federated / Google like search)

Posted by Susheel Kumar <su...@gmail.com>.
Thanks, Jack, for the various points. A question: when you have hundreds of
fields from different sources and also a lot of copyField instructions (for
facets, sorting, the catch-all field, etc.), don't you take some performance
hit during ingestion, since many of the copy instructions would be evaluated
but do nothing because their source fields have no data? Do you agree?

Assuming keyword search is required across the different data sources, with
results presented from each data source as the user types (instant search /
autocomplete) in a single search box, and very field-specific advanced
search is required in the advanced search option, how would you suggest
designing the index/schema?

Let me know if I am missing any other info you would need to share your
thoughts.


Re: Schema/Index design for disparate data sources (Federated / Google like search)

Posted by Jack Krupansky <ja...@gmail.com>.
Step one is to refine and more clearly state the requirements. Sure,
sometimes (most of the time?) the end user really doesn't know exactly what
they expect or want other than "Gee, I want to search for everything, isn't
that obvious??!!", but that simply means that an analyst is needed to
intervene before you leap to implementation. An analyst is someone who
knows how to interview all relevant parties (not just the approving
manager) to understand their true needs. I mean, who knows, maybe all they
really need is basic keyword search. Or... maybe they actually need a
full-blown data warehouse with precise access to each specific field of
each data source. Without knowing how refined user queries need to get,
there is little to go on here.

My other advice is to be careful not to overthink the problem, imagining
that some complex solution is needed when the end users really only need to
do super basic queries. In general, managers are very poor when it comes to
analysis and requirements specification.

Do they need to do date searches on a variety of date fields?

Do they need to do numeric or range queries on specific numeric fields?

Do they need to do any exact match queries on raw character fields (as
opposed to tokenized text)?

Do they have fields like product names or numbers in addition to free-form
text?

Do they need to distinguish or weight titles from detailed descriptions?

You could have catchall fields for categories of field types like titles,
bodies, authors/names, locations, dates, numeric values. But... who
knows... this may be more than what an average user really needs.
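
Roughly like this, say (field names purely for illustration):

  <field name="titles_all"    type="text_general" indexed="true" stored="false" multiValued="true"/>
  <field name="names_all"     type="text_general" indexed="true" stored="false" multiValued="true"/>
  <field name="locations_all" type="text_general" indexed="true" stored="false" multiValued="true"/>

  <copyField source="page_title"   dest="titles_all"/>
  <copyField source="product_name" dest="titles_all"/>
  <copyField source="author"       dest="names_all"/>
  <copyField source="city"         dest="locations_all"/>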

As far as the concern about fields from different sources that are not
used, Lucene only stores and indexes fields which have values, so no
storage or performance is consumed when you have a lot of fields which are
not present for a particular data source.
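
For example, documents with completely different field sets can go into the
same collection, and the absent fields cost nothing (collection and field
names invented here):

  curl 'http://localhost:8983/solr/unified/update?commit=true' \
       -H 'Content-Type: application/json' -d '[
    {"id":"w1","source_type":"website","description":"coffee blog","page_title":"Best beans"},
    {"id":"p1","source_type":"product","description":"coffee maker","price":49.99}
  ]'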

-- Jack Krupansky
