Posted to solr-user@lucene.apache.org by Travis Low <tl...@4centurion.com> on 2011/07/20 19:28:32 UTC

Schema design/data import

Greetings.  I am struggling to design a schema and a data import/update
strategy for some semi-complicated data.  I would appreciate any input.

What we have is a bunch of database records that may or may not have files
attached.  Sometimes no files, sometimes 50.

The requirement is to index the database records AND the documents, and the
search results would be just links to the database records.

I'd love to crawl the site with Nutch and be done with it, but we have a
complicated search form with various codes and attributes for the database
records, so we need a detailed schema that will loosely correspond to boxes
on the search form.  I don't think we could easily do that if we just crawl
the site.  But with a detailed schema, I'm having trouble understanding how
we could import and index from the database, and also index the related
files, and have the same schema being populated, especially with the number
of related documents being variable (maybe index them all to one field?).
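For concreteness, the kind of flattening I'm imagining is one Solr document per database record, with the extracted text of each attached file going into a single multiValued field, so a record with zero or fifty attachments fits the same schema. A rough Python sketch (the field names record_id, title_t, url_s, attachment_txt are made up for illustration, not from any real schema):

```python
import json

def build_solr_doc(record, attachment_texts):
    """Flatten one database record plus the extracted text of its
    attachments (0..N of them) into a single Solr document.
    Field names here are illustrative -- whatever the schema ends
    up defining would go in their place."""
    return {
        "id": f"record-{record['id']}",   # unique key per DB record
        "title_t": record["title"],
        "url_s": record["url"],           # the link shown in search results
        # multiValued field: one entry per attached file
        "attachment_txt": attachment_texts,
    }

# A record with two attachments and a record with none both
# produce documents against the same schema.
doc = build_solr_doc(
    {"id": 42, "title": "Contract review", "url": "/records/42"},
    ["text of file one", "text of file two"],
)
print(json.dumps(doc, indent=2))
```

The point being that the variable number of files disappears into the length of one multiValued list, rather than needing a variable number of fields.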

We have a lot of flexibility on how we can build this, so I'm open to any
suggestions or pointers for further reading.  I've spent a fair amount of
time on the wiki but I didn't see anything that seemed directly relevant.

An additional difficulty, that I am willing to overlook for the first cut,
is that some of these files are zipped, and some of the zip files may
contain other zip files, to maybe 3 or 4 levels deep.
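To illustrate what I mean by nested: the unzipping part on its own seems tractable with a depth-limited recursive walk, something like this sketch (pure stdlib, builds a zip-inside-a-zip in memory just to demonstrate):

```python
import io
import zipfile

def extract_files(data: bytes, name: str, depth: int = 0, max_depth: int = 4):
    """Yield (name, bytes) for every non-zip member, recursing into
    nested zip files up to max_depth levels deep."""
    if name.lower().endswith(".zip") and depth <= max_depth:
        with zipfile.ZipFile(io.BytesIO(data)) as zf:
            for info in zf.infolist():
                if info.is_dir():
                    continue
                yield from extract_files(zf.read(info), info.filename,
                                         depth + 1, max_depth)
    else:
        yield name, data

# Build outer.zip containing b.txt and inner.zip (which contains a.txt).
inner = io.BytesIO()
with zipfile.ZipFile(inner, "w") as zf:
    zf.writestr("a.txt", "hello")
outer = io.BytesIO()
with zipfile.ZipFile(outer, "w") as zf:
    zf.writestr("inner.zip", inner.getvalue())
    zf.writestr("b.txt", "world")

files = dict(extract_files(outer.getvalue(), "outer.zip"))
print(sorted(files))  # ['a.txt', 'b.txt']
```

The extracted bytes would then still need text extraction before indexing, but at least the nesting itself is bounded.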

Help, please?

cheers,

Travis



-- 

Travis Low, Director of Development
<tl...@4centurion.com>
Centurion Research Solutions, LLC
14048 ParkEast Circle • Suite 100 • Chantilly, VA 20151
703-956-6276 • 703-378-4474 (fax)
http://www.centurionresearch.com


Re: Schema design/data import

Posted by Stefan Matheis <ma...@googlemail.com>.
Hey Travis,

after reading your mail and thinking about it a bit, I'm not sure I would
go with Nutch. Nutch is (from my understanding) more of a crawler, meant
to crawl external/unknown sites.

But if I've got this right, you have complete knowledge of your data and
could tell Solr exactly what to index, and how the pieces relate to each
other.

So I guess the more relevant question is: how would you, or your users,
use Solr to search your records? Will they search the content of the
attached documents/files? Or is it more of a structured search with
filters? And what do they expect to see in the end? Something like
DocumentCloud (https://www.documentcloud.org/public/search/palin)?

If I've got something wrong, please tell me :)

Regards
Stefan

Am 20.07.2011 19:28, schrieb Travis Low:
> Greetings.  I am struggling to design a schema and a data import/update
> strategy for some semi-complicated data.  I would appreciate any input.
>
> What we have is a bunch of database records that may or may not have files
> attached.  Sometimes no files, sometimes 50.
>
> The requirement is to index the database records AND the documents, and the
> search results would be just links to the database records.
>
> I'd love to crawl the site with Nutch and be done with it, but we have a
> complicated search form with various codes and attributes for the database
> records, so we need a detailed schema that will loosely correspond to boxes
> on the search form.  I don't think we could easily do that if we just crawl
> the site.  But with a detailed schema, I'm having trouble understanding how
> we could import and index from the database, and also index the related
> files, and have the same schema being populated, especially with the number
> of related documents being variable (maybe index them all to one field?).
>
> We have a lot of flexibility on how we can build this, so I'm open to any
> suggestions or pointers for further reading.  I've spent a fair amount of
> time on the wiki but I didn't see anything that seemed directly relevant.
>
> An additional difficulty, that I am willing to overlook for the first cut,
> is that some of these files are zipped, and some of the zip files may
> contain other zip files, to maybe 3 or 4 levels deep.
>
> Help, please?
>
> cheers,
>
> Travis