Posted to solr-user@lucene.apache.org by Michael Lackhoff <mi...@lackhoff.de> on 2008/01/17 08:11:35 UTC

Some sort of join in SOLR?

Hello,

I have two sources of data for the same "things" to search. It is book 
data in a library. First there is the usual bibliographic data (author, 
title...) and then I have scanned and OCRed table of contents data about 
the same books. Both are updated independently.
Now I don't know how to best index and search this data.
- One option would be to save the data in different records. That would
   make updates easy because I don't have to worry about the fields
   from the other source. But searching would be more difficult: I have
   to do an additional search for every hit in the "contents" data to
   get the bibliographic data.
- The other option would be to save everything in one record but then
   updates would be difficult. Before I can update a record I must first
   look if there is any data from the other source, merge it into the
   record and only then update it. This option sounds very time consuming
   for a complete reindex.

The best solution would be some sort of join: Have two records in the 
index but always give both in the result no matter where the hit was.
Any ideas on how to best organize this kind of data?

-Michael


Re: Some sort of join in SOLR?

Posted by Michael Lackhoff <mi...@lackhoff.de>.
On 17.01.2008 23:48 Chris Hostetter wrote:

> assuming these are simple delimited files, something like the unix "join" 
> command can do this for you ... then your indexing code can just process 
> one file linearly.  (if they aren't simple delimited files, you can 
> preprocess them to strip out the excess markup and make them simple 
> delimited files ... depending on what these look like, you might not even 
> need much custom indexing code at all .. "join" and the CSV update 
> request handler might solve all your needs)

Thanks for the hint, I hadn't heard of the unix "join" command before but 
will have a look.

-Michael


Re: Some sort of join in SOLR?

Posted by Chris Hostetter <ho...@fucit.org>.
: I don't have an index to check. Both sources come in huge text files, one of
: them daily, the other irregular. One has the ID, the other has a different ID
: that must be mapped first to the ID of the first source. So there is no easy
: way of saying: "Give me the record for this ID from the other set of records".
: It is all buried in plain text files.

assuming these are simple delimited files, something like the unix "join" 
command can do this for you ... then your indexing code can just process 
one file linearly.  (if they aren't simple delimited files, you can 
preprocess them to strip out the excess markup and make them simple 
delimited files ... depending on what these look like, you might not even 
need much custom indexing code at all .. "join" and the CSV update 
request handler might solve all your needs)
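A small Python sketch of what that "join" step could look like before handing the merged file to the CSV update request handler. The file layouts (tab-delimited, ID in the first column) and the field names are assumptions for illustration, not anything known about the actual data:

```python
# Merge two tab-delimited files on a shared ID, much like the unix
# "join" command, producing one CSV file suitable for bulk loading.
import csv

def load_toc(path):
    """Read the table-of-contents file into an id -> contents dict."""
    toc = {}
    with open(path, encoding="utf-8") as f:
        for row in csv.reader(f, delimiter="\t"):
            if len(row) >= 2:
                toc[row[0]] = row[1]
    return toc

def merge(bib_path, toc_path, out_path):
    """Write one merged CSV row per bibliographic record; records
    without scanned contents simply get an empty contents column."""
    toc = load_toc(toc_path)
    with open(bib_path, encoding="utf-8") as bib, \
         open(out_path, "w", newline="", encoding="utf-8") as out:
        writer = csv.writer(out)
        writer.writerow(["id", "author", "title", "contents"])
        for row in csv.reader(bib, delimiter="\t"):
            writer.writerow([row[0], row[1], row[2], toc.get(row[0], "")])
```

Unlike the real `join` command this keeps every bibliographic record whether or not a matching contents record exists, which is probably what you want for a full reindex.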


-Hoss


Re: Some sort of join in SOLR?

Posted by Erick Erickson <er...@gmail.com>.
Good Luck! You're right, there *is* a lot to
learn. I get both excited and frustrated by
new stuff, which is much better than having
my skill set comprise only, say, being
able to work with old "C" code.....

Before getting fancy at all, I'd find out what the total
size of my index will be (index and store data as a
single document). If it's anything less than 1G, don't
even consider complexifying things, just index and
store and go.

If it's > 1G, I'd *still* do the simple thing first but be
mentally prepared to think about more complex stuff
if you have performance issues.......

Best
Erick

On Jan 17, 2008 1:57 PM, Michael Lackhoff <mi...@lackhoff.de> wrote:

> On 17.01.2008 18:32 Erick Erickson wrote:
>
> > There's some cost here, and I don't know how this
> > all plays with the sizes of your indexes. It may be
> > totally impractical.
> >
> > Anyway, back to work.
>
> I think I will have to play with the different possibilities and see
> what fits best to my situation. There will be many things to learn (I am
> a newbie to SOLR, Lucene and Java) until everything plays nicely
> together.
> As you say, back to work...
>
> Thanks
> -Michael
>
>

Re: Some sort of join in SOLR?

Posted by Michael Lackhoff <mi...@lackhoff.de>.
On 17.01.2008 18:32 Erick Erickson wrote:

> There's some cost here, and I don't know how this
> all plays with the sizes of your indexes. It may be
> totally impractical.
> 
> Anyway, back to work.

I think I will have to play with the different possibilities and see 
what fits best to my situation. There will be many things to learn (I am 
a newbie to SOLR, Lucene and Java) until everything plays nicely together.
As you say, back to work...

Thanks
-Michael


Re: Some sort of join in SOLR?

Posted by Erick Erickson <er...@gmail.com>.
See below:

On Jan 17, 2008 11:42 AM, Michael Lackhoff <mi...@lackhoff.de> wrote:

> On 17.01.2008 16:53 Erick Erickson wrote:
>
> > I would *strongly* encourage you to store them together
> > as one document. There's no real method of doing
> > DB like joins in the underlying Lucene search engine.
>
> Thanks, that was also my preference.
>
> > But that's generic advice. The question I have for you is
> > "What's the big deal about coordinating the sources?"
> > That is, you have to have something that allows you to
> > make a 1:1 correspondence between your data sources
> > or you couldn't relate them in the first place. Is it really
> > that onerous to check?
>
> I don't have an index to check. Both sources come in huge text files,
> one of them daily, the other irregular. One has the ID, the other has a
> different ID that must be mapped first to the ID of the first source. So
> there is no easy way of saying: "Give me the record for this ID from the
> other set of records". It is all buried in plain text files.
>

I didn't explain this well, it's really what you say below.
You already do this.


>
> > If it is, why not build an index and search it when you
> > want to know?
>
> That is what I will do now: Build a SQLite database with just two
> columns: ID and contents with an index on the ID. Then when I rebuild
> the SOLR index by processing the other data I will look up in the SQLite
> DB whether there is a corresponding record from the other source.
> My hope was that I could avoid this intermediate database.
>

I don't see a good way of avoiding this. It's important to keep clearly
in mind that Lucene doesn't do DB things, and trying to force it
to is usually a bad idea. Except sometimes <G>.

If you wanted to, you could use a Lucene index in place of your
SQLite DB and search on your ID (or use TermEnum/TermDocs
to find it). But the only argument for doing it this way is that if
you do NOT need the SQLite DB in the first place, you could have
one less tool. And it wouldn't even have to be a separate index
since there's no requirement that all documents in Lucene
have the same fields. You could store your meta-data in one
or more documents with fields orthogonal to your "real" data.

True, this trades off complexity of understanding the various
parts of the index against tracking two Lucene indexes, or
a Lucene index and a DB. I can't really argue convincingly
for one or the other approach, except I like as much as
possible to be self-contained.....


>
> > You haven't described enough of your problem
> > space for me to render any opinion of whether
> > this is premature optimization or not, but it
> > sure smells like it from a distance <G>...
>
> I don't think it was premature optimization. It was just the attempt to
> keep the nightly rebuild of the index as easy as possible and to avoid
> unnecessary complexity. But if it is necessary I will go this way.
>

Well, avoiding complexity is good <G>.

There's another thing to consider if (and only if) your data is
stored (as opposed to indexed). Let's say you use the one
document approach. If you both index AND store your non-meta
data, your update process for your meta data could be:
1> find the old document and store away all the
    non-meta data.
2> delete the old document
3> construct a new document with the new meta-data and the
    data from <1>.
4> re-index the document.

This won't work if you only index (but not store) the non-meta
data. It'll depend on how much data you have and how big the
data set is, which I sure don't know. If you choose to do this,
be aware that you probably want to lazy-load the non-meta
data or loading the document may get expensive.
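The four steps above can be sketched in a few lines. A real implementation would talk to Solr; here a plain dict stands in for the index so that only the merge logic is shown (an assumption for illustration, not Solr or Lucene API code):

```python
# In-memory sketch of the single-document update flow: fetch the old
# stored document, keep its non-meta fields, and re-add it with the
# new meta data merged in.

META_FIELDS = {"contents"}  # fields owned by the independently updated source

def update_meta(index, doc_id, new_meta):
    """Replace the meta fields of a stored document, keeping the rest."""
    old = index.pop(doc_id, {})                           # steps 1 + 2
    kept = {k: v for k, v in old.items() if k not in META_FIELDS}
    index[doc_id] = {**kept, **new_meta}                  # steps 3 + 4
```

The same function also handles the case where no old document exists yet, which matters because the two sources arrive on independent schedules.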

I suppose you could also consider a variant on the two-index
model.

Index (but don't store) the non-meta data in your primary
index. This reduces the size significantly.

Store (but don't index) the non-meta data in your "update"
index along with an indexed ID.

Updates then become
1> look up the non-meta data from your update index.
2> construct the new document by combining things.
3> delete the doc from the primary index.
4> add the new doc to your primary index.
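An in-memory sketch of that two-index variant, with plain dicts standing in for the two Lucene indexes (an assumption for illustration): the "update" store holds the stored-only non-meta data keyed by ID, and the primary index holds the searchable documents.

```python
# Two-index update flow: pull the stored non-meta data from the
# update store, combine with new meta data, and replace the doc
# in the primary index.

def apply_update(primary, update_store, doc_id, new_meta):
    """Rebuild a primary-index document from stored data + new meta."""
    non_meta = update_store.get(doc_id, {})   # step 1: look up stored data
    doc = {**non_meta, **new_meta}            # step 2: combine
    primary.pop(doc_id, None)                 # step 3: delete old doc
    primary[doc_id] = doc                     # step 4: add new doc
```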

There's some cost here, and I don't know how this
all plays with the sizes of your indexes. It may be
totally impractical.

Anyway, back to work.

Erick


> -Michael
>

Re: Some sort of join in SOLR?

Posted by Michael Lackhoff <mi...@lackhoff.de>.
On 17.01.2008 16:53 Erick Erickson wrote:

> I would *strongly* encourage you to store them together
> as one document. There's no real method of doing
> DB like joins in the underlying Lucene search engine.

Thanks, that was also my preference.

> But that's generic advice. The question I have for you is
> "What's the big deal about coordinating the sources?"
> That is, you have to have something that allows you to
> make a 1:1 correspondence between your data sources
> or you couldn't relate them in the first place. Is it really
> that onerous to check?

I don't have an index to check. Both sources come in huge text files, 
one of them daily, the other irregular. One has the ID, the other has a 
different ID that must be mapped first to the ID of the first source. So 
there is no easy way of saying: "Give me the record for this ID from the 
other set of records". It is all buried in plain text files.

> If it is, why not build an index and search it when you
> want to know?

That is what I will do now: Build a SQLite database with just two 
columns: ID and contents with an index on the ID. Then when I rebuild 
the SOLR index by processing the other data I will look up in the SQLite 
DB whether there is a corresponding record from the other source.
My hope was that I could avoid this intermediate database.
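For what it's worth, the intermediate lookup table described above is only a few lines with Python's built-in sqlite3 module. Table and column names here are assumptions for illustration; the PRIMARY KEY on id gives the index on the ID column for free:

```python
# A two-column SQLite lookup table mapping record ID to OCRed
# contents, to be consulted while rebuilding the Solr index.
import sqlite3

def build_contents_db(path, pairs):
    """Create or refresh the lookup table from (id, contents) pairs."""
    con = sqlite3.connect(path)
    con.execute("CREATE TABLE IF NOT EXISTS contents "
                "(id TEXT PRIMARY KEY, contents TEXT)")
    con.executemany("INSERT OR REPLACE INTO contents VALUES (?, ?)", pairs)
    con.commit()
    return con

def lookup_contents(con, rec_id):
    """Return the scanned contents for a record, or None if absent."""
    row = con.execute("SELECT contents FROM contents WHERE id = ?",
                      (rec_id,)).fetchone()
    return row[0] if row else None
```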

> You haven't described enough of your problem
> space for me to render any opinion of whether
> this is premature optimization or not, but it
> sure smells like it from a distance <G>...

I don't think it was premature optimization. It was just the attempt to 
keep the nightly rebuild of the index as easy as possible and to avoid 
unnecessary complexity. But if it is necessary I will go this way.

-Michael

Re: Some sort of join in SOLR?

Posted by Erick Erickson <er...@gmail.com>.
I would *strongly* encourage you to store them together
as one document. There's no real method of doing
DB like joins in the underlying Lucene search engine.

But that's generic advice. The question I have for you is
"What's the big deal about coordinating the sources?"
That is, you have to have something that allows you to
make a 1:1 correspondence between your data sources
or you couldn't relate them in the first place. Is it really
that onerous to check?

If it is, why not build an index and search it when you
want to know?

Surrounding this question is "How often do you
really update data?" If it's once an hour, I submit
that you don't care how difficult it is to find out
whether there's corresponding data in the other
data set. If it's once a second, that may be a
different story.

You haven't described enough of your problem
space for me to render any opinion of whether
this is premature optimization or not, but it
sure smells like it from a distance <G>...

Best
Erick

On Jan 17, 2008 2:11 AM, Michael Lackhoff <mi...@lackhoff.de> wrote:

> Hello,
>
> I have two sources of data for the same "things" to search. It is book
> data in a library. First there is the usual bibliographic data (author,
> title...) and then I have scanned and OCRed table of contents data about
> the same books. Both are updated independently.
> Now I don't know how to best index and search this data.
> - One option would be to save the data in different records. That would
>   make updates easy because I don't have to worry about the fields
>   from the other source. But searching would be more difficult: I have
>   to do an additional search for every hit in the "contents" data to
>   get the bibliographic data.
> - The other option would be to save everything in one record but then
>   updates would be difficult. Before I can update a record I must first
>   look if there is any data from the other source, merge it into the
>   record and only then update it. This option sounds very time consuming
>   for a complete reindex.
>
> The best solution would be some sort of join: Have two records in the
> index but always give both in the result no matter where the hit was.
> Any ideas on how to best organize this kind of data?
>
> -Michael
>
>