Posted to solr-user@lucene.apache.org by tr...@dawnstar.com on 2011/07/20 21:27:05 UTC

Schema Design/Data Import

[Apologies if this is a duplicate -- I have sent several messages from my work email and they just vanish, so I subscribed with my personal email]
 
Greetings.  I am struggling to design a schema and a data import/update strategy for some semi-complicated data.  I would appreciate any input.

What we have is a bunch of database records that may or may not have files attached.  Sometimes no files, sometimes 50.

The requirement is to index the database records AND the documents, and the search results would be just links to the database records.

I'd love to crawl the site with Nutch and be done with it, but we have a complicated search form with various codes and attributes for the database records, so we need a detailed schema that will loosely correspond to boxes on the search form.  I don't think we could easily do that if we just crawl the site.  But with a detailed schema, I'm having trouble understanding how we could import and index from the database, and also index the related files, and have the same schema being populated, especially with the number of related documents being variable (maybe index them all to one field?).

We have a lot of flexibility on how we can build this, so I'm open to any suggestions or pointers for further reading.  I've spent a fair amount of time on the wiki but I didn't see anything that seemed directly relevant.

An additional difficulty, that I am willing to overlook for the first cut, is that some of these files are zipped, and some of the zip files may contain other zip files, to maybe 3 or 4 levels deep.

Help, please?
 
cheers,

Travis

Re: Schema Design/Data Import

Posted by Stefan Matheis <ma...@googlemail.com>.
On 25.07.2011 16:58, Erick Erickson wrote:
> Well, the attachment_1, attachment_2 idea would make queries awkward to
> form (you'd need 100 clauses if a record had 100 attachments). Dynamic
> fields have the same problem.

Oh, yes .. correct .. I overlooked that part :/ sorry.

Re: Schema Design/Data Import

Posted by Erick Erickson <er...@gmail.com>.
Well, the attachment_1, attachment_2 idea would make queries awkward to
form (you'd need 100 clauses if a record had 100 attachments). Dynamic
fields have the same problem.

You could certainly index them all into one big field: just make it
multivalued and do a SolrInputDocument.addField("bigtextfield", docContents)
for each attachment. Watch out for the maxFieldLength parameter in
solrconfig.xml; you'll want to bump that way up.
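
For what it's worth, a rough sketch of that in SolrJ (untested; the field
names "attachments_text", "id" and "title" are invented, and you'd need a
matching multivalued text field in schema.xml):

import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

import java.util.Arrays;
import java.util.List;

public class RecordIndexer {
    public static void main(String[] args) throws Exception {
        // SolrJ 3.x client; later versions renamed this class.
        SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");

        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "record-42");            // DB primary key
        doc.addField("title", "some record title"); // plus your other form fields

        // One value per attachment; Solr indexes them all under the same
        // field, so a single query clause searches every attachment.
        for (String text : extractAllAttachmentTexts()) {
            doc.addField("attachments_text", text);
        }

        server.add(doc);
        server.commit();
    }

    private static List<String> extractAllAttachmentTexts() {
        // Placeholder: in practice, run Tika over each attached file.
        return Arrays.asList("text of attachment 1", "text of attachment 2");
    }
}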

You could also index a separate document for each attachment, then
perhaps use the grouping/field collapsing feature to gather them all
together, depending upon your requirements.

I'd either put them all in one field or use a separate Solr document for each
row/attachment pair as a first approach...
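
If you go the separate-document route, it might look something like this
(again untested, with invented field names; result grouping needs Solr 3.3
or later):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrInputDocument;

public class PerAttachmentIndexer {
    public static void main(String[] args) throws Exception {
        SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");

        // One Solr document per attachment, each carrying its parent
        // record's key so hits can be traced back to the DB record.
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "record-42/attachment-3"); // unique per attachment
        doc.addField("record_id", "record-42");       // parent DB record
        doc.addField("body", "extracted attachment text");
        server.add(doc);
        server.commit();

        // At query time, collapse the hits back to one group per record.
        SolrQuery q = new SolrQuery("body:widget");
        q.set("group", true);
        q.set("group.field", "record_id");
        QueryResponse rsp = server.query(q);
        System.out.println(rsp.getResponse());
    }
}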

Hope that helps
Erick


Re: Schema Design/Data Import

Posted by Stefan Matheis <ma...@googlemail.com>.
Travis,

that sounds like a perfect use case for dynamic fields .. attachment_*
and there you go. works for no attachments, as well as one, three or 50.

for the user interface, you could iterate over them and show them as a
list - or something else that would fit your need.

also, maybe, you would have attachment_name_* and attachment_body_*?
otherwise the information about which (file) name relates to which body
would be lost .. at least on the Solr level.
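
roughly like this, maybe (untested sketch, and it assumes dynamicField
declarations for attachment_name_* and attachment_body_* in your
schema.xml):

import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class DynamicFieldExample {
    public static void main(String[] args) throws Exception {
        SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");

        // {file name, extracted text} pairs, however many there happen to be
        String[][] attachments = {
            {"report.pdf", "text of the report"},
            {"notes.doc", "text of the notes"},
        };

        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "record-42");
        for (int i = 0; i < attachments.length; i++) {
            doc.addField("attachment_name_" + i, attachments[i][0]);
            doc.addField("attachment_body_" + i, attachments[i][1]);
        }
        server.add(doc);
        server.commit();
    }
}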

Regards
Stefan


Re: Schema Design/Data Import

Posted by Travis Low <tl...@4centurion.com>.
Thanks so much Erick (and Stefan).  Yes, I did some reading on SolrJ and
Tika and you are spot-on.  We will write our own importer using SolrJ and
then we can grab the DB records and parse any attachments along the way.

Now it comes down to a schema design question.  The issue I'm struggling
with is what kind of field or fields to use for the attachments.  The reason
for the difficulty is that the documents we're most interested in are the DB
records, not the attachments, and there could be 0 or 3 or 50 attachments
for a single DB record.  Should we:

(1) Just add fields called "attachment_0", "attachment_1", ... ,
"attachment_100" to the schema?
(2) Somehow index all attachments to a single field? (Is this even
possible?)
(3) Use dynamic fields?
(4) None of the above?

The idea is that if there is a hit in one of the attachments, then we need
to show a link to the DB record.  It would be nice to show a link to the
document as well, but that's less important.

cheers,

Travis





--
Travis Low, Director of Development
<tl...@4centurion.com>
Centurion Research Solutions, LLC
14048 ParkEast Circle • Suite 100 • Chantilly, VA 20151
703-956-6276 • 703-378-4474 (fax)
http://www.centurionresearch.com


Re: Schema Design/Data Import

Posted by Erick Erickson <er...@gmail.com>.
I'd seriously consider going with SolrJ as your indexing strategy; it
allows you to do anything you need to do in Java code. You can call the
Tika library yourself on the files pointed to by your rows as you see
fit, indexing them as you choose, perhaps one Solr doc per attachment,
perhaps one per row, whatever.
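
As a starting point, extraction with Tika might look roughly like this
(untested sketch; registering the parser in the ParseContext makes Tika
recurse into container formats like zip, which may help with your
nested-zip wrinkle):

import java.io.File;
import java.io.FileInputStream;
import java.io.InputStream;

import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.sax.BodyContentHandler;

public class TikaExtractor {
    public static String extractText(File f) throws Exception {
        AutoDetectParser parser = new AutoDetectParser();
        BodyContentHandler handler = new BodyContentHandler(-1); // -1 = no write limit
        Metadata metadata = new Metadata();

        // Letting the context know about the parser makes Tika descend
        // into embedded documents (zips inside zips, attachments, etc.).
        ParseContext context = new ParseContext();
        context.set(Parser.class, parser);

        InputStream in = new FileInputStream(f);
        try {
            parser.parse(in, handler, metadata, context);
        } finally {
            in.close();
        }
        return handler.toString(); // plain text, ready for a Solr field
    }
}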

Best
Erick
