
Posted to java-user@lucene.apache.org by "Kumar, Santosh" <sa...@sap.com> on 2017/12/21 11:48:50 UTC

Lucene with Database

Hi,
I’m currently working on a project which has the following scenario:


  1.  I have entities in a DB on which I would like to prevent duplicates by the same name or a near match; for example, SalesOrder, SlsOrd, SalesOrd, etc. are all considered the same. For this, I would like to use fuzzy search and return only the entities that satisfy a matching criterion (say, entities with a match >= 60%).
  2.  How do I approach this use case? Should I create one index (IndexWriter with RAMDirectory?) for the entire application and keep it updated in the background, as a separate microservice, whenever an entity is created, updated, or removed? I need real-time updates and can’t wait for bulk updates on the index.
  3.  I can then use the index created above as a lookup when a user tries to create a new entity, and generate an error or warning message.
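
The ">= 60% match" criterion in point 1 can be made concrete with a small sketch. This is plain Java, not Lucene API: it assumes "match" means a normalized Levenshtein similarity, compared case-insensitively, which is only one possible reading of the requirement. The class and method names (NameSimilarity, similarityPercent, isLikelyDuplicate) are hypothetical. Note that Lucene's own FuzzyQuery allows at most two edits per term, while SlsOrd and SalesOrder are four edits apart, so a percentage threshold like this would probably have to be applied outside of a plain FuzzyQuery.

```java
import java.util.Locale;

// Hypothetical helper, not part of Lucene: percentage similarity between two names.
final class NameSimilarity {
    private NameSimilarity() {}

    /** Classic Levenshtein distance (insert, delete, substitute) using two DP rows. */
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    /** Whole-number percentage: 100 * (longer - distance) / longer, ignoring case. */
    static int similarityPercent(String a, String b) {
        String x = a.toLowerCase(Locale.ROOT);
        String y = b.toLowerCase(Locale.ROOT);
        int longer = Math.max(x.length(), y.length());
        if (longer == 0) {
            return 100; // two empty names are identical
        }
        return 100 * (longer - editDistance(x, y)) / longer;
    }

    /** Assumed duplicate rule: similarity at or above the threshold. */
    static boolean isLikelyDuplicate(String candidate, String existing, int thresholdPercent) {
        return similarityPercent(candidate, existing) >= thresholdPercent;
    }
}
```

With these definitions, similarityPercent("SlsOrd", "SalesOrder") is 60 and similarityPercent("SalesOrd", "SalesOrder") is 80, so both would be flagged at a 60% threshold.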

If the 2nd point above is fine, is there any general guideline or example that I can follow for creating a global index for the application? Also, is there any guideline for using Lucene with a database?

Appreciate your help!!!

Thank you and Regards,
Santosh

Re: Lucene with Database

Posted by Evert Wagenaar <ev...@gmail.com>.

Lucene just makes an RDBMS faster.


Re: Lucene with Database

Posted by Parit Bansal <Pa...@sib.swiss>.

Hi Santosh,

We have a similar Lucene-plus-database combination here at www.uniprot.org.
We keep a Lucene index over our datasets for searching, and for the database
we use a simple serialized memory-mapped file ("a database" in some sense).
The Lucene index and the database are linked through another memory-mapped
file that maps docIds to offsets in the database file. This database layer
used to be Berkeley DB, but we replaced it with the serialized memory-mapped
file because Berkeley DB added unnecessary overhead. This has proved to be
very fast for our use case, which has indexes of about 100 million entries
(in some cases a lot more). The approach works well for us because our
index and database are write-once.
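
The layout described above (a data file plus a second memory-mapped file that maps docIds to offsets) could look roughly like the following sketch. It is not the UniProt code: the record format (a 4-byte length followed by UTF-8 bytes), the assumption that docIds are dense and start at 0, the limit to data files under 2 GB, and all names (OffsetMappedStore, build, buildInTempFiles, get) are assumptions made for illustration, using only java.nio.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.List;

// Hypothetical write-once store: docId -> offset file -> data file.
final class OffsetMappedStore {
    private final MappedByteBuffer offsets; // slot i holds the 8-byte data offset of docId i
    private final MappedByteBuffer data;    // each record: 4-byte length, then UTF-8 bytes

    private OffsetMappedStore(MappedByteBuffer offsets, MappedByteBuffer data) {
        this.offsets = offsets;
        this.data = data;
    }

    /** Writes both files once (record i gets docId i), then maps them read-only. */
    static OffsetMappedStore build(Path dataFile, Path offsetFile, List<String> records) {
        try (FileChannel dataCh = FileChannel.open(dataFile, StandardOpenOption.CREATE,
                     StandardOpenOption.WRITE, StandardOpenOption.TRUNCATE_EXISTING);
             FileChannel offCh = FileChannel.open(offsetFile, StandardOpenOption.CREATE,
                     StandardOpenOption.WRITE, StandardOpenOption.TRUNCATE_EXISTING)) {
            long position = 0;
            for (String record : records) {
                byte[] bytes = record.getBytes(StandardCharsets.UTF_8);
                ByteBuffer rec = ByteBuffer.allocate(Integer.BYTES + bytes.length);
                rec.putInt(bytes.length).put(bytes);
                rec.flip();
                ByteBuffer off = ByteBuffer.allocate(Long.BYTES);
                off.putLong(position);
                off.flip();
                while (rec.hasRemaining()) dataCh.write(rec);
                while (off.hasRemaining()) offCh.write(off);
                position += Integer.BYTES + bytes.length;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        try (FileChannel dataCh = FileChannel.open(dataFile, StandardOpenOption.READ);
             FileChannel offCh = FileChannel.open(offsetFile, StandardOpenOption.READ)) {
            // A mapping stays valid after the channel that created it is closed.
            return new OffsetMappedStore(
                    offCh.map(FileChannel.MapMode.READ_ONLY, 0, offCh.size()),
                    dataCh.map(FileChannel.MapMode.READ_ONLY, 0, dataCh.size()));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Convenience for trying the sketch: builds the store in fresh temp files. */
    static OffsetMappedStore buildInTempFiles(List<String> records) {
        try {
            return build(Files.createTempFile("records", ".dat"),
                         Files.createTempFile("offsets", ".idx"), records);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** docId -> offset -> record, with no scan over the data file. */
    String get(int docId) {
        int start = (int) offsets.getLong(docId * Long.BYTES); // sketch: data file under 2 GB
        int length = data.getInt(start);
        byte[] bytes = new byte[length];
        ByteBuffer view = data.duplicate(); // independent position, same mapped bytes
        view.position(start + Integer.BYTES);
        view.get(bytes);
        return new String(bytes, StandardCharsets.UTF_8);
    }
}
```

In a setup like the one described, the docId returned by a Lucene search hit would be passed to get(docId) to fetch the full record. Because the files are written once and then only read, no locking is needed in this sketch.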

Hope this helps.

- Best
Parit Bansal



---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: Lucene with Database

Posted by "Kumar, Santosh" <sa...@sap.com>.

Basically, I need indexing only for fuzzy search on entities. So, I’m thinking of creating an index from the DB tables (for the search term) and storing it on the server (Cloud Foundry; I have yet to figure out how to achieve this). Then, whenever a user creates, updates, or deletes an entity, I would like to update the index in real time as well. This is mandatory, because it is what prevents duplicate entities based on fuzzy search (for example, slsOrd, SalesOrder, etc. are considered the same).
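
The create/update/delete flow described here can be sketched without Lucene as an in-memory map keyed by entity id that is updated in the same call path as the DB write. Everything in this block is an illustrative assumption: the names (EntityNameIndex, upsert, remove, findSimilar), the normalized Levenshtein percentage used as the match score, and the linear scan, which would not scale to large entity counts. In Lucene itself, the analogous write operations would be IndexWriter.updateDocument and IndexWriter.deleteDocuments keyed on an id term.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical in-memory stand-in for a search index over entity names.
final class EntityNameIndex {
    // entity id -> lower-cased name; writes are visible to later lookups immediately
    private final Map<String, String> namesById = new ConcurrentHashMap<>();

    /** Called on create and on update: replaces any previous name for this id. */
    void upsert(String id, String name) {
        namesById.put(id, name.toLowerCase(Locale.ROOT));
    }

    /** Called on delete. */
    void remove(String id) {
        namesById.remove(id);
    }

    /** Returns ids of entities whose name is at least thresholdPercent similar. */
    List<String> findSimilar(String name, int thresholdPercent) {
        String query = name.toLowerCase(Locale.ROOT);
        List<String> hits = new ArrayList<>();
        for (Map.Entry<String, String> entry : namesById.entrySet()) {
            if (similarityPercent(query, entry.getValue()) >= thresholdPercent) {
                hits.add(entry.getKey());
            }
        }
        return hits;
    }

    private static int similarityPercent(String a, String b) {
        int longer = Math.max(a.length(), b.length());
        return longer == 0 ? 100 : 100 * (longer - editDistance(a, b)) / longer;
    }

    // Levenshtein distance with two rolling rows.
    private static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }
}
```

Before saving a new entity, the application would call findSimilar with the proposed name and show a warning or error if the result is non-empty.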

Thank you for pointing me to Solr; I will give it a try as well.

On 28/12/17, 1:22 PM, "Riccardo Tasso" <ri...@gmail.com> wrote:

    2017-12-28 6:35 GMT+01:00 Kumar, Santosh <sa...@sap.com>:
    >
    > While looking up for examples of fuzzy search with Lucene, I came across
    > examples that demonstrate Lucene with file system predominantly, so was
    > wondering if there are any samples on ‘How to use Lucene with DB’ or if the
    > Java logic remains same for Filesystem or DB (really sorry I am new to
    > Lucene). Any differences or things to consider when the data source are
    > different?

    If we are speaking of indexing documents from db or from filesystem, it is
    the same thing.
    If you are thinking about a database for storing lucene data structure,
    instead of filesystem which is the default option, I will discourage you.
    The filesystem storage is the one officially supported.

    Since it's your first time with lucene, have you considered something like
    Solr or Elasticsearch, which offers you more functionalities without the
    need of implementing them?

    Riccardo

Re: Lucene with Database

Posted by Riccardo Tasso <ri...@gmail.com>.

2017-12-28 6:35 GMT+01:00 Kumar, Santosh <sa...@sap.com>:
>
> While looking up for examples of fuzzy search with Lucene, I came across
> examples that demonstrate Lucene with file system predominantly, so was
> wondering if there are any samples on ‘How to use Lucene with DB’ or if the
> Java logic remains same for Filesystem or DB (really sorry I am new to
> Lucene). Any differences or things to consider when the data source are
> different?


If we are talking about indexing documents that come from a DB versus from
the filesystem, it is the same thing.
If you are thinking about using a database to store Lucene's data structures,
instead of the filesystem, which is the default option, I would discourage it.
Filesystem storage is the one officially supported.
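
One way to picture the first point (the indexing logic does not change with the data source) is the sketch below. It is plain Java with hypothetical names (SourceAgnosticIndexing, Doc, DocSource, fromRows, fromLines, index). The "rows" stand in for the result of a database query, the "lines" for the contents of a tab-separated file, and index() stands in for the step that would hand each document to the search library.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical sketch: the same indexing step fed from two different kinds of source.
final class SourceAgnosticIndexing {
    private SourceAgnosticIndexing() {}

    /** Minimal document: an id and the text to index. */
    static final class Doc {
        final String id;
        final String text;

        Doc(String id, String text) {
            this.id = id;
            this.text = text;
        }
    }

    /** Anything that can push documents to a consumer. */
    interface DocSource {
        void forEachDoc(Consumer<Doc> sink);
    }

    /** Stand-in for DB rows of the form {id, name}. */
    static DocSource fromRows(List<String[]> rows) {
        return sink -> rows.forEach(row -> sink.accept(new Doc(row[0], row[1])));
    }

    /** Stand-in for file lines of the form "id<TAB>name". */
    static DocSource fromLines(List<String> lines) {
        return sink -> {
            for (String line : lines) {
                String[] parts = line.split("\t", 2);
                sink.accept(new Doc(parts[0], parts[1]));
            }
        };
    }

    /** The indexing step: identical for every source. Here it only collects "id=text". */
    static List<String> index(DocSource source) {
        List<String> indexed = new ArrayList<>();
        source.forEachDoc(doc -> indexed.add(doc.id + "=" + doc.text));
        return indexed;
    }
}
```

Only the adapter (fromRows or fromLines) knows where the data came from; index() receives the same Doc objects either way.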

Since it's your first time with Lucene, have you considered something like
Solr or Elasticsearch, which offer more functionality without your
having to implement it yourself?

Riccardo

Re: Lucene with Database

Posted by "Kumar, Santosh" <sa...@sap.com>.

Hi Trejkaz, Evert, Riccardo,

Thank you for your inputs. We have an application that we plan to migrate to Cloud Foundry, and we have yet to decide on a database; the contenders are PostgreSQL, MySQL, HANA DB, and MongoDB. In the current setup we use HANA DB, which already has a fuzzy search query. But when we migrate to Cloud Foundry we might use a different database, so to keep fuzzy search DB-agnostic I think it would be better to have it in the Java layer rather than in the DB layer.

While looking for examples of fuzzy search with Lucene, I mostly came across examples that demonstrate Lucene with the file system, so I was wondering whether there are any samples on ‘How to use Lucene with a DB’, or whether the Java logic remains the same for a filesystem or a DB (really sorry, I am new to Lucene). Are there any differences or things to consider when the data sources are different?

Thank you and Regards,
Santosh

On 28/12/17, 4:01 AM, "Trejkaz" <tr...@trypticon.org> wrote:

    On Thu, Dec 28, 2017 at 1:07 AM, Riccardo Tasso
    <ri...@gmail.com> wrote:
    > Hi,
    > I am not aware of any lucene integration with rdbms

    Derby has a plugin of some sort. I haven't tried it so I have no idea
    what it actually does, but it looks like it adds table functions which
    you could join to other queries.

    https://db.apache.org/derby/docs/10.13/tools/rtoolsoptlucene.html

    TX


Re: Lucene with Database

Posted by Trejkaz <tr...@trypticon.org>.

On Thu, Dec 28, 2017 at 1:07 AM, Riccardo Tasso
<ri...@gmail.com> wrote:
> Hi,
> I am not aware of any lucene integration with rdbms

Derby has a plugin of some sort. I haven't tried it so I have no idea
what it actually does, but it looks like it adds table functions which
you could join to other queries.

https://db.apache.org/derby/docs/10.13/tools/rtoolsoptlucene.html

TX


Re: Lucene with Database

Posted by Riccardo Tasso <ri...@gmail.com>.

Hi,
I am not aware of any Lucene integration with an RDBMS, but I don't think it
would be very useful. What do you mean by "guideline for using Lucene
with Database"?

Sometimes integrating a database with Lucene does make sense (for example,
Neo4J and OrientDB use Lucene as one of their indexing engines), but it
depends on the specific vendor. Depending on your database, please check
whether it already has some kind of fuzzy search (for example, recent
versions of Postgres have fuzzy search:
https://www.postgresql.org/docs/9.1/static/fuzzystrmatch.html).

If you decide to use Lucene for your use case, I don't think RAMDirectory
would be the best choice, since it's intended for testing purposes only.
FSDirectory should be efficient enough for your (and many other) use
cases. The second point seems OK to me; if entity types are not very
dynamic, you can get good performance by creating an index for each one.
