Posted to general@lucene.apache.org by uday kumar maddigatla <uk...@mach.com> on 2009/05/11 08:20:42 UTC

How can I use Lucene to index a database?

Hi,

We are new to Lucene and want to use it to index a database.

We don't know where to start. We have already managed to index documents,
but we don't know how to index a database with Lucene. We searched Google
and found one question about this answered by people on this list.

Please tell us how to index the database. A series of steps (where to change
or update) would help us.


Richard Marr wrote:
> 
>> Thank you for your reply.
>> I'm just afraid of choosing the wrong option for the project and finding
>> it harder to change later.
> 
> I'm making the same kinds of decisions at the moment. My plan is to
> use an RDBMS to store the "master" copy of our data, because that
> technology is a ubiquitous skill set, and we know its capabilities
> almost instinctively.
> 
> For example, we can hire a DBA off-the-shelf to deal with database
> replication without having to explain the details of our other systems
> to them, but doing the same thing with Lucene would require them to
> have experience with it.
> 
> This way we can save our scarce brain-power for more interesting problems 
> :)
> 
> Rich
> 
> 

-- 
View this message in context: http://www.nabble.com/Lucene-Index-file-vs.-database-tp19724177p23477688.html
Sent from the Lucene - General mailing list archive at Nabble.com.

Re: what if my database data contains other language (like danish, german).

Posted by Chris Collins <ch...@yahoo.com>.

Thanks Otis, I will take a look.

Best

C
On May 17, 2009, at 7:05 PM, Otis Gospodnetic wrote:

>
> Chris,
>
> I don't have the issue number here, but look in Lucene's JIRA and  
> search for... ah, here:
>
>  https://issues.apache.org/jira/browse/LUCENE-1166
>
>
> And for Chinese:
>
>  https://issues.apache.org/jira/browse/LUCENE-1629
>
> If you happen to be using Solr:
>
>  http://www.sematext.com/product-multilingual-analyzer.html
>
> Otis
> --
> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

Re: what if my database data contains other language (like danish, german).

Posted by Otis Gospodnetic <ot...@yahoo.com>.

Chris,

I don't have the issue number here, but look in Lucene's JIRA and search for... ah, here:

  https://issues.apache.org/jira/browse/LUCENE-1166


And for Chinese:

  https://issues.apache.org/jira/browse/LUCENE-1629

If you happen to be using Solr:

  http://www.sematext.com/product-multilingual-analyzer.html

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch



----- Original Message ----
> From: Chris Collins <ch...@yahoo.com>
> To: general@lucene.apache.org
> Sent: Monday, May 11, 2009 11:28:06 AM
> Subject: Re: what if my database data contains other language (like danish,  german).
> 
> Is anyone aware of either of the two things:
> 
> 1) ability to plugin an external source for DF, this would allow you to 
> circumvent the problem you mentioned below.  (Of course you would have to 
> compute a df set for each language you care to have meaningful weights for).
> 2) any open source segmenters, primarily for german, but also for CJK at a 
> longshot :-}
> 
> Thanks
> 
> C

Re: what if my database data contains other language (like danish, german).

Posted by Ted Dunning <te...@gmail.com>.

On Mon, May 11, 2009 at 8:28 AM, Chris Collins <ch...@yahoo.com> wrote:

> Is anyone aware of either of the two things:
>
> 1) ability to plugin an external source for DF, this would allow you to
> circumvent the problem you mentioned below.  (Of course you would have to
> compute a df set for each language you care to have meaningful weights for).

See
http://lucene.apache.org/java/2_4_0/api/org/apache/lucene/search/Query.html#weight(org.apache.lucene.search.Searcher)

The typical idiom is to extend Searcher with a specialized structure that
knows the term frequencies that you want it to know.

This is what Katta does to propagate cluster-global term frequencies to
shard-specific searches.  Presumably Solr does likewise.
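
The idea of feeding document frequencies in from outside the index can be
sketched without any Lucene code. The Python below is only an illustration
under stated assumptions, not the Lucene or Katta API: `GlobalStats`,
`merge_shard_stats` and `score` are hypothetical names, the shard numbers are
made up, and the smoothed formula 1 + ln(N / (df + 1)) is just one common way
to write IDF.

```python
import math
from collections import Counter


class GlobalStats:
    """Hypothetical holder for corpus statistics computed outside the index."""

    def __init__(self, num_docs, doc_freq):
        self.num_docs = num_docs          # total documents across all shards
        self.doc_freq = dict(doc_freq)    # term -> number of documents containing it

    def idf(self, term):
        # Smoothed inverse document frequency; unseen terms are treated as df = 0.
        df = self.doc_freq.get(term, 0)
        return 1.0 + math.log(self.num_docs / (df + 1))


def merge_shard_stats(shards):
    """Combine per-shard (num_docs, doc_freq) pairs into global statistics."""
    total_docs = 0
    total_df = Counter()
    for num_docs, doc_freq in shards:
        total_docs += num_docs
        total_df.update(doc_freq)         # adds the per-term counts together
    return GlobalStats(total_docs, total_df)


def score(query_terms, doc_terms, stats):
    """Sum of tf * idf over the query terms, using the supplied statistics."""
    tf = Counter(doc_terms)
    return sum(tf[t] * stats.idf(t) for t in query_terms)


# Made-up shard statistics, for illustration only.
shard_a = (1000, {"lucene": 10, "index": 400})
shard_b = (9000, {"lucene": 90, "index": 5000})
stats = merge_shard_stats([shard_a, shard_b])

# Every shard scores with the same global numbers, so scores stay comparable.
print(stats.num_docs, stats.doc_freq["lucene"])
print(score(["lucene", "index"], ["lucene", "index", "index"], stats))
```

The same shape would work for per-language weights: keep one statistics
object per language and pick it according to the language of the query.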

> 2) any open source segmenters, primarily for german, but also for CJK at a
> longshot :-}

Lucene has a rudimentary German stemmer which may be sufficient.  Real lemma
identification in German can be difficult because of the large number of
morphological variants and word compounding.  For text retrieval, however,
compounding is your friend and very simple stemmers typically suffice.

For CJK, the approach that I favor lately is this one:

      http://technology.chtsai.org/mmseg/

Basically, it is a longest dictionary match method with the addition that it
picks the next token that is part of the longest match for the next three
tokens.  This gets rid of the garden path problems that greedy algorithms
without look-ahead have.  It depends a bit on the assumption that long words
in the dictionary have higher frequency than would be expected if the
possible components occur independently.  This means that picking the longer
match in the dictionary is equivalent to doing a more subtle statistical
test.  (See here for more details on the stats involved in bigram detection:
http://tdunning.blogspot.com/2008/03/surprise-and-coincidence.html).
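
The look-ahead idea described above can be sketched in a few lines of Python.
This is a simplified illustration rather than the MMSEG implementation linked
above: it ranks three-word chunks by total length and then by average word
length only, the function names are hypothetical, and a toy Latin-letter
dictionary stands in for a real CJK lexicon.

```python
def words_at(text, pos, dictionary, max_len):
    """Dictionary words starting at pos; a single character always qualifies."""
    found = [text[pos:pos + 1]]
    for end in range(pos + 2, min(len(text), pos + max_len) + 1):
        if text[pos:end] in dictionary:
            found.append(text[pos:end])
    return found


def chunks_at(text, pos, dictionary, max_len, depth=3):
    """All runs of up to `depth` consecutive words starting at pos."""
    if depth == 0 or pos >= len(text):
        return [[]]
    result = []
    for word in words_at(text, pos, dictionary, max_len):
        for rest in chunks_at(text, pos + len(word), dictionary, max_len, depth - 1):
            result.append([word] + rest)
    return result


def chunk_key(chunk):
    """Rank by total chunk length, then by average word length."""
    total = sum(len(w) for w in chunk)
    return (total, total / len(chunk))


def segment(text, dictionary):
    """Emit the first word of the best three-word chunk, then repeat."""
    max_len = max((len(w) for w in dictionary), default=1)
    tokens, pos = [], 0
    while pos < len(text):
        best = max(chunks_at(text, pos, dictionary, max_len), key=chunk_key)
        tokens.append(best[0])
        pos += len(best[0])
    return tokens


def greedy(text, dictionary):
    """Plain longest-match segmentation without look-ahead, for comparison."""
    max_len = max((len(w) for w in dictionary), default=1)
    tokens, pos = [], 0
    while pos < len(text):
        word = max(words_at(text, pos, dictionary, max_len), key=len)
        tokens.append(word)
        pos += len(word)
    return tokens


toy_dictionary = {"ab", "abc", "cdef"}   # made-up entries
print(greedy("abcdef", toy_dictionary))   # ['abc', 'd', 'e', 'f']
print(segment("abcdef", toy_dictionary))  # ['ab', 'cdef']
```

The greedy version is led down the garden path by "abc" and then has to fall
back to single characters, while the look-ahead version sees that "ab" leaves
room for the longer word "cdef".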

-- 
Ted Dunning, CTO
DeepDyve

Re: what if my database data contains other language (like danish, german).

Posted by Chris Collins <ch...@yahoo.com>.

Is anyone aware of either of these two things:

1) The ability to plug in an external source for DF. This would allow you
to circumvent the problem you mentioned below. (Of course, you would
have to compute a DF set for each language you want meaningful
weights for.)
2) Any open-source segmenters, primarily for German, but also for CJK
at a long shot :-}

Thanks

C

On May 11, 2009, at 8:13 AM, Ted Dunning wrote:

> Yes.  Lucene can handle that.  You have to select which stemmer to use.
> You may have to improve the German and Danish stemmers a little bit.
>
> You may also have some issues with the fact that if Danish is 5% of your
> corpus, then words that occur in 100% of your Danish documents will tend
> to have too high weights since they only occur in 5% of your documents.
> Any term that occurs in more than 20% of a sub-corpus should generally be
> discarded from your query.  This can be difficult in multi-lingual
> situations.
>
> For a first pass, I would ignore this issue, however.

Re: what if my database data contains other language (like danish, german).

Posted by Ted Dunning <te...@gmail.com>.

Yes.  Lucene can handle that.  You have to select which stemmer to use.  You
may have to improve the German and Danish stemmers a little bit.

You may also have some issues with the fact that if Danish is 5% of your
corpus, then words that occur in 100% of your Danish documents will tend to
have too high weights since they only occur in 5% of your documents.  Any
term that occurs in more than 20% of a sub-corpus should generally be
discarded from your query.  This can be difficult in multi-lingual
situations.

For a first pass, I would ignore this issue, however.
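
To make the weighting problem concrete, here is a rough numeric sketch in
Python. It assumes the plain idf = ln(N / df) form and entirely made-up
numbers: a corpus of 10,000 documents, 500 of them Danish, with the Danish
word "og" ("and") in every Danish document. The function names are
hypothetical and this is not Lucene code.

```python
import math


def idf(num_docs, doc_freq):
    """Plain inverse document frequency, ln(N / df)."""
    return math.log(num_docs / doc_freq)


def keep_query_terms(terms, sub_corpus_df, sub_corpus_size, max_fraction=0.20):
    """Drop terms found in more than max_fraction of the language sub-corpus."""
    return [t for t in terms
            if sub_corpus_df.get(t, 0) / sub_corpus_size <= max_fraction]


total_docs = 10_000
danish_docs = 500                                # 5% of the corpus
danish_df = {"og": 500, "søgemaskine": 12}       # made-up document frequencies

# Measured against the whole corpus, "og" looks like a fairly rare term.
print(round(idf(total_docs, danish_df["og"]), 2))    # ln(20), about 3.0
# Measured against the Danish sub-corpus, it carries no information at all.
print(round(idf(danish_docs, danish_df["og"]), 2))   # ln(1) = 0.0
# Applying the 20% rule per language removes it from the query.
print(keep_query_terms(["og", "søgemaskine"], danish_df, danish_docs))
```

The difficulty in multi-lingual situations is that the filter needs the
sub-corpus counts, which means knowing the language of each document and of
the query.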

On Mon, May 11, 2009 at 4:07 AM, uday kumar maddigatla <uk...@mach.com> wrote:

> what if my database data contains other language (like danish, german).
>
> Is Lucene will handle that .
>
> If yes How?
>

-- 
Ted Dunning, CTO
DeepDyve

111 West Evelyn Ave. Ste. 202
Sunnyvale, CA 94086
www.deepdyve.com
858-414-0013 (m)
408-773-0220 (fax)

what if my database data contains other language (like danish, german).

Posted by uday kumar maddigatla <uk...@mach.com>.

Hi,

Are you talking about Solr here?

I am limited to using Lucene only (no third-party application such as Solr,
which is built on top of Lucene).

I wrote a sample application that reads the data from one table in the
database and indexes it.
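
For readers following along, an application of this kind can be sketched
roughly as below. This is not the poster's code and not Lucene: it is a
stdlib-only Python illustration in which the `products` table, its columns
and all function names are hypothetical, and a toy inverted index stands in
for the real one. In Lucene, each row would typically become a `Document`
whose columns are added as `Field`s and written with an `IndexWriter`.

```python
import re
import sqlite3
from collections import defaultdict


def tokenize(text):
    # Case-fold and split on non-word characters. The regex word class is
    # Unicode-aware in Python 3, so letters such as "ø" and "ü" are kept.
    return re.findall(r"\w+", text.casefold())


def build_index(conn):
    """Map each row to a document and record which documents contain each term."""
    postings = defaultdict(set)
    for doc_id, title, body in conn.execute("SELECT id, title, body FROM products"):
        for term in tokenize(f"{title} {body}"):
            postings[term].add(doc_id)
    return postings


def search(postings, query):
    """Return the ids of documents that contain every query term."""
    term_sets = [postings.get(t, set()) for t in tokenize(query)]
    return sorted(set.intersection(*term_sets)) if term_sets else []


# Hypothetical table with one Danish and one German row.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE products (id INTEGER PRIMARY KEY, title TEXT, body TEXT)")
conn.executemany("INSERT INTO products VALUES (?, ?, ?)", [
    (1, "Søgemaskine", "En hurtig søgemaskine til databaser"),
    (2, "Suchmaschine", "Eine schnelle Suchmaschine für Datenbanken"),
])

postings = build_index(conn)
print(search(postings, "søgemaskine"))            # [1]
print(search(postings, "schnelle suchmaschine"))  # [2]
```

Keeping the primary key with each document is what later makes it possible to
update or delete the matching index entry when the row changes.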

Now I am able to index the tables in the database. But the problem is this:

What if my database data contains other languages (such as Danish or German)?

Will Lucene handle that?

If yes, how?



Andrew McCombe wrote:
> 
> Hi
> 
> You should take a look at the DataImport Handler and use the examples
> on that page.  I've recently learned this myself and this has been the
> #1 source of help.
> 
> http://wiki.apache.org/solr/DataImportHandler
> 
> There are two steps to set up when using this.  The first one is the
> initial import of your data.  This is called the 'full-import'.  Once
> this has been done you then use a delta-import to import any new or
> updated data into Solr.
> 
> One thing to remember is to rebuild any spellchecker indexes after the
> delta import.
> 
> Hope this Helps
> Andrew

-- 
View this message in context: http://www.nabble.com/Lucene-Index-file-vs.-database-tp19724177p23481214.html
Sent from the Lucene - General mailing list archive at Nabble.com.

Re: How can I use Lucene to index a database?

Posted by Andrew McCombe <eu...@gmail.com>.

Hi

You should take a look at the DataImportHandler wiki page (linked below)
and use the examples there.  I learned this myself recently, and that page
has been my #1 source of help.

http://wiki.apache.org/solr/DataImportHandler

There are two steps to set up when using this.  The first one is the
initial import of your data.  This is called the 'full-import'.  Once
this has been done you then use a delta-import to import any new or
updated data into Solr.
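
The two steps can be pictured with a small stand-alone Python sketch. It is
not DataImportHandler configuration and does not talk to Solr: the `articles`
table, the `updated_at` column and the function names are all hypothetical,
and a plain dictionary stands in for the index.

```python
import sqlite3


def full_import(conn, index):
    """Rebuild the index from every row in the table."""
    index.clear()
    for row_id, body in conn.execute("SELECT id, body FROM articles"):
        index[row_id] = body


def delta_import(conn, index, last_run):
    """Re-index only the rows changed after the previous import."""
    changed = conn.execute(
        "SELECT id, body FROM articles WHERE updated_at > ?", (last_run,)
    ).fetchall()
    for row_id, body in changed:
        index[row_id] = body
    return len(changed)


# Hypothetical table; updated_at is a simple integer timestamp.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE articles (id INTEGER PRIMARY KEY, body TEXT, updated_at INTEGER)"
)
conn.executemany("INSERT INTO articles VALUES (?, ?, ?)",
                 [(1, "first article", 100), (2, "second article", 100)])

index = {}
full_import(conn, index)          # step 1: initial import of everything
last_run = 100

conn.execute(
    "UPDATE articles SET body = 'second article, revised', updated_at = 150 WHERE id = 2"
)
conn.execute("INSERT INTO articles VALUES (3, 'third article', 160)")

changed = delta_import(conn, index, last_run)   # step 2: only new or updated rows
print(changed, index[2])
```

A timestamp comparison like this cannot see deleted rows, so a real setup
needs a separate way to track deletions.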

One thing to remember is to rebuild any spellchecker indexes after the
delta import.

Hope this helps
Andrew


Re: How can I use Lucene to index a database?

Posted by Shashi Kant <sh...@gmail.com>.

Take a look at LuSql:

http://lab.cisti-icist.nrc-cnrc.gc.ca/cistilabswiki/index.php/LuSql


