You are viewing a plain text version of this content. The canonical link for it is here.
Posted to java-user@lucene.apache.org by Mohammad Norouzi <mn...@gmail.com> on 2007/03/31 09:30:21 UTC

how to index a large database

Hi all
I am going to index our database. one approach is to join them and then
index the fields. but the information are very large say more than 3
millions. so the Sql Server fails to select them.

I want to know if anyone has such this experience to indexing huge
information of database using lucene.

can anyone give me some advice?

-- 
Regards,
Mohammad

Re: how to index a large database

Posted by Chris Lu <ch...@gmail.com>.
Mohammad,

This is the main idea, but things can get quite complicated.

In addition, do you need to do incremental indexing?
Do you need to delete duplicates?
How would you manage deleted documents in the database?
Will taking down the server while indexing affect you?
...

You are welcome to take a look at DBSight.
Here is a demo that you can create a scalable lucene database search
in 3 minutes.

http://wiki.dbsight.com/index.php?title=Create_Lucene_Database_Search_in_3_minutes

-- 
Chris Lu
-------------------------
Instant Scalable Full-Text Search On Any Database/Application
site: http://www.dbsight.net
demo: http://search.dbsight.com
Lucene Database Search in 3 minutes:
http://wiki.dbsight.com/index.php?title=Create_Lucene_Database_Search_in_3_minutes

On 3/31/07, Erick Erickson <er...@gmail.com> wrote:
> Well, don't do it that way <G>.....
>
> I'm assuming that you have some SQL statement like "for each
> entry in table 1, find all the related info", and what you're
> failing to retrieve is the result.
>
> So, try something like creating a SQL statement that selects
> the ID for table 1 and write it to a file. At the end of this, you'll
> have a list of all the IDs from table 1. Then, read those in one
> at a time and execute your query for use in indexing on one
> row at a time. In other words, iterate over each row rather
> than try to execute it in one massive statement.
>
> Erick
>
> On 3/31/07, Mohammad Norouzi <mn...@gmail.com> wrote:
> >
> > Hi all
> > I am going to index our database. one approach is to join them and then
> > index the fields. but the information are very large say more than 3
> > millions. so the Sql Server fails to select them.
> >
> > I want to know if anyone has such this experience to indexing huge
> > information of database using lucene.
> >
> > can anyone give me some advice?
> >
> > --
> > Regards,
> > Mohammad
> >
>

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: how to index a large database

Posted by Chris Lu <ch...@gmail.com>.
BTW: DBSight uses the same approach that Erick describes.
Like Phil said, this will incur multiple joins for each document.
But this works well and efficient for incremental indexing.
And this approach does not need to change any database schema.

-- 
Chris Lu
-------------------------
Instant Scalable Full-Text Search On Any Database/Application
site: http://www.dbsight.net
demo: http://search.dbsight.com
Lucene Database Search in 3 minutes:
http://wiki.dbsight.com/index.php?title=Create_Lucene_Database_Search_in_3_minutes

On 3/31/07, Phil Myers <jl...@yahoo.com> wrote:
> I've used the approach that Erick describes, and it
> works well. Another approach is to create a single new
> table in your database that holds all of the data you
> want to index. This allows you to copy the various
> fields from other tables using separate SQL statements
> before you index, rather than relying on one massive
> join during the indexing process. Having everything in
> a single table helps avoid the overhead associated
> with making multiple calls from your indexing process
> for each row.
>
> This approach may not be practical, of course, if you
> don't have enough disk space to keep multiple copies
> of your data.
>
> Phil
> --- Erick Erickson <er...@gmail.com> wrote:
> > So, try something like creating a SQL statement that
> > selects
> > the ID for table 1 and write it to a file. At the
> > end of this, you'll
> > have a list of all the IDs from table 1. Then, read
> > those in one
> > at a time and execute your query for use in indexing
> > on one
> > row at a time. In other words, iterate over each row
> > rather
> > than try to execute it in one massive statement.
> >
> > Erick
> >
> > On 3/31/07, Mohammad Norouzi <mn...@gmail.com>
> > wrote:
> > >
> > > Hi all
> > > I am going to index our database. one approach is
> > to join them and then
> > > index the fields. but the information are very
> > large say more than 3
> > > millions. so the Sql Server fails to select them.
> > >
> > > I want to know if anyone has such this experience
> > to indexing huge
> > > information of database using lucene.
> > >
> > > can anyone give me some advice?
> > >
> > > --
> > > Regards,
> > > Mohammad
> > >
> >
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: how to index a large database

Posted by Phil Myers <jl...@yahoo.com>.
I've used the approach that Erick describes, and it
works well. Another approach is to create a single new
table in your database that holds all of the data you
want to index. This allows you to copy the various
fields from other tables using separate SQL statements
before you index, rather than relying on one massive
join during the indexing process. Having everything in
a single table helps avoid the overhead associated
with making multiple calls from your indexing process
for each row.

This approach may not be practical, of course, if you
don't have enough disk space to keep multiple copies
of your data.

Phil
--- Erick Erickson <er...@gmail.com> wrote: 
> So, try something like creating a SQL statement that
> selects
> the ID for table 1 and write it to a file. At the
> end of this, you'll
> have a list of all the IDs from table 1. Then, read
> those in one
> at a time and execute your query for use in indexing
> on one
> row at a time. In other words, iterate over each row
> rather
> than try to execute it in one massive statement.
> 
> Erick
> 
> On 3/31/07, Mohammad Norouzi <mn...@gmail.com>
> wrote:
> >
> > Hi all
> > I am going to index our database. one approach is
> to join them and then
> > index the fields. but the information are very
> large say more than 3
> > millions. so the Sql Server fails to select them.
> >
> > I want to know if anyone has such this experience
> to indexing huge
> > information of database using lucene.
> >
> > can anyone give me some advice?
> >
> > --
> > Regards,
> > Mohammad
> >
> 


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: how to index a large database

Posted by Erick Erickson <er...@gmail.com>.
Well, don't do it that way <G>.....

I'm assuming that you have some SQL statement like "for each
entry in table 1, find all the related info", and what you're
failing to retrieve is the result.

So, try something like creating a SQL statement that selects
the ID for table 1 and write it to a file. At the end of this, you'll
have a list of all the IDs from table 1. Then, read those in one
at a time and execute your query for use in indexing on one
row at a time. In other words, iterate over each row rather
than try to execute it in one massive statement.

Erick

On 3/31/07, Mohammad Norouzi <mn...@gmail.com> wrote:
>
> Hi all
> I am going to index our database. one approach is to join them and then
> index the fields. but the information are very large say more than 3
> millions. so the Sql Server fails to select them.
>
> I want to know if anyone has such this experience to indexing huge
> information of database using lucene.
>
> can anyone give me some advice?
>
> --
> Regards,
> Mohammad
>