Posted to xindice-users@xml.apache.org by Dan Barron <db...@mail.acponline.org> on 2002/06/06 21:19:16 UTC

Advice on DB Design Wanted

I have an idea and I was just wondering what you all thought about it. Here's the deal:

We are going to use Xindice to store XML data for scientific journal citations. The simplest idea is to just dump them all in one collection and use XPath to find what we need. But most times, they would be searched by journal name and volume.

So what I'm thinking is that if I create a subcollection for each journal, and then a collection for each volume under that, there would only be a few dozen articles in each collection. And since you search by first getting a collection and then querying it, I'm guessing this would be much faster and could effectively eliminate the need for indexers on journal name and volume. And presumably I could still search the entire set when necessary using the base collection.

So I'm thinking that searching the /db/citations/JAMA/132 collection of a few dozen documents would be way faster than searching /db/citations, where altogether there would be hundreds of thousands of documents.

Does this make any sense? Will it be faster? Am I missing any obvious problems with this approach? Any ideas would be appreciated.

dan


____________________________________________________________________
Daniel W. Barron
Senior Systems Analyst/Application Developer
American College of Physicians-American Society of Internal Medicine
Tel: (215) 351-2617     Tel: (800) 523-1546 x2617
Fax: (215) 351-2644    E-mail: dbarron@mail.acponline.org
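
To make the two layouts in the question concrete, here is a minimal sketch in Python. It is only a toy model: a list and a dict stand in for Xindice collections, and the sample citations, `search_flat` and `search_partitioned` are hypothetical names invented for illustration, not part of any Xindice API. It shows why a lookup by journal and volume touches far fewer documents when each volume has its own sub-collection.

```python
from collections import defaultdict

# Hypothetical sample data; a real store would hold XML documents.
citations = [
    {"journal": "JAMA", "volume": "132", "title": "Article A"},
    {"journal": "JAMA", "volume": "132", "title": "Article B"},
    {"journal": "JAMA", "volume": "133", "title": "Article C"},
    {"journal": "Annals", "volume": "7", "title": "Article D"},
]

# Layout 1: one flat collection. Every query scans every document.
flat = list(citations)

def search_flat(journal, volume):
    return [c["title"] for c in flat
            if c["journal"] == journal and c["volume"] == volume]

# Layout 2: one sub-collection per journal/volume, addressed by path.
partitioned = defaultdict(list)
for c in citations:
    partitioned[f"/db/citations/{c['journal']}/{c['volume']}"].append(c)

def search_partitioned(journal, volume):
    # Only the documents in the addressed sub-collection are touched.
    path = f"/db/citations/{journal}/{volume}"
    return [c["title"] for c in partitioned.get(path, [])]

print(search_flat("JAMA", "132"))
print(search_partitioned("JAMA", "132"))
```

Both functions return the same titles. The difference is how many documents each one has to look at, which is what the partitioned layout trades against the ability to query everything in one place.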



Re: Advice on DB Design Wanted

Posted by "Mark J. Stang" <ma...@earthlink.net>.
I don't think so. Each search is based on a single collection. I don't
believe that sub-collections are considered as part of the parent. So,
my guess, and it is only a guess, is that it won't search them.

Mark

Nutan Kaul wrote:

> Hi,
> If my collection has many sub-collections, can they be included in the
> same search?
> Thanks,
>
> Nutan

--
Mark J Stang
Architect
Cybershop Systems
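
If a search really is limited to one collection at a time, as suggested above, a search of everything under a base collection has to be emulated by querying each sub-collection and merging the results. The sketch below illustrates that idea in Python using only the standard library. The `collections` dict, `query_collection` and `query_all` are hypothetical stand-ins, not Xindice calls, and `ElementTree.findall` supports only a small subset of XPath.

```python
import xml.etree.ElementTree as ET

# Hypothetical data: collection path -> XML documents stored in it.
collections = {
    "/db/citations/JAMA/132": [
        "<citation><author>Smith</author><title>Article A</title></citation>",
        "<citation><author>Jones</author><title>Article B</title></citation>",
    ],
    "/db/citations/JAMA/133": [
        "<citation><author>Smith</author><title>Article C</title></citation>",
    ],
}

def query_collection(path, xpath):
    """Run one query against a single collection and return matching text."""
    hits = []
    for doc in collections[path]:
        root = ET.fromstring(doc)
        hits.extend(el.text for el in root.findall(xpath))
    return hits

def query_all(base, xpath):
    """Emulate a search under `base` by querying each sub-collection in turn."""
    results = []
    for path in sorted(collections):
        if path == base or path.startswith(base + "/"):
            results.extend(query_collection(path, xpath))
    return results

print(query_all("/db/citations", "./title"))
```

The cost of this approach grows with the number of sub-collections, so with one collection per journal volume a "search everything" request becomes many small queries rather than one.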


Re: Advice on DB Design Wanted

Posted by Nutan Kaul <nk...@mail.arc.nasa.gov>.
Hi,
If my collection has many sub-collections, can they be included in the
same search?
Thanks,

Nutan

----- Original Message -----
From: "Mark J. Stang" <ma...@earthlink.net>
To: <xi...@xml.apache.org>
Sent: Thursday, June 06, 2002 1:15 PM
Subject: Re: Advice on DB Design Wanted


> Dan,
> From what I have done using XPath to search collections, you can't search
> more than one collection at a time. Which would mean that if you needed to
> search all, then you would have to run the search against each individual
> collection. People have built collections with hundreds of thousands of
> documents and the search time is fast if you index them.
>
> So, what kind of search do you want to do? XPath provides a "contains"
> function, but as I understand it, that will search every tag and you can't
> index them. If you can pick out tags and index those then your search will
> be under a second. Tom Bradford is working on a full text search, but that
> is not ready yet.
>
> So it appears that you have several choices.
>
> 1) Sub-collections - as long as the number of documents and content is small,
> fast searches. But no way to search all collections at once.
>
> 2) One collection, use "contains", fast as long as your data doesn't get too
> big too quick. Wait for Tom Bradford to come up with Full Text Search
> capability.
>
> 3) One collection, pick out specific tags, index those, fast for all index
> searches, slower for "contains".
>
> 4) Use a third-party tool (Apache has one) to index your documents as you
> add/edit/delete them. Use it to find your document and then do a direct
> access. Fast, a little more complicated, but you can implement it now.
> Also, when Tom is done, you can discard it and transfer the maintenance to
> Xindice.
>
> HTH,
>
> Mark
>
> --
> Mark J Stang
> Architect
> Cybershop Systems
>
>
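
Option 3 in the quoted list (one collection, with indexes on specific tags) can be pictured as a lookup table from tag values to document keys. The following Python sketch is an illustration of that idea only, under the assumption that each citation carries `journal` and `volume` elements. The element names, document keys, `build_index` and `lookup` are all hypothetical, and the sketch does not show how Xindice's own indexers are configured or implemented.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical documents, keyed the way a collection might key them.
documents = {
    "doc1": "<citation><journal>JAMA</journal><volume>132</volume></citation>",
    "doc2": "<citation><journal>JAMA</journal><volume>132</volume></citation>",
    "doc3": "<citation><journal>Annals</journal><volume>7</volume></citation>",
}

INDEXED_TAGS = ("journal", "volume")

def build_index(docs, tags):
    """Map (tag, value) -> set of document keys containing that value."""
    index = defaultdict(set)
    for key, xml_text in docs.items():
        root = ET.fromstring(xml_text)
        for tag in tags:
            for el in root.iter(tag):
                index[(tag, el.text)].add(key)
    return index

def lookup(index, **criteria):
    """Return keys matching every tag=value pair, using only the index."""
    sets = [index.get((tag, value), set()) for tag, value in criteria.items()]
    return sorted(set.intersection(*sets)) if sets else []

index = build_index(documents, INDEXED_TAGS)
print(lookup(index, journal="JAMA", volume="132"))
```

A lookup on indexed tags never parses or scans the documents themselves, which is why it stays fast in a single large collection, while a substring match such as "contains" still has to read the text it is searching.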


Re: Advice on DB Design Wanted

Posted by "Mark J. Stang" <ma...@earthlink.net>.
Dan,