Posted to java-user@lucene.apache.org by Rajesh parab <ra...@yahoo.com> on 2008/04/11 00:16:23 UTC

Lucene index on relational data

Hi,

We are using Lucene 2.0 to index data stored inside
relational database. Like any relational database, our
database has quite a few one-to-one and one-to-many
relationships. For example, let's say an Object A has a
one-to-many relationship with Object X and Object Y.
As we need to de-normalize relational data as
key-value pairs before storing it inside Lucene index,
we have de-normalized these relationships (Object X
and Object Y) while building an index on Object A.

We have a large number of such object relationships and most
of the time, the related objects are modified more
frequently than the base objects. For example, in our
above case, objects X and Y are updated in the system
very frequently, whereas Object A is not updated that
often. Still, we will need to update Object A entries
inside the index, every time its related objects X
and/or Y are modified.

To avoid the above situation, we were thinking of
having 2 separate indexes: the first index will only
index data of base objects (Object A in above example)
and second index will contain data about its
relationship objects (Object X and Y above), which are
updated more frequently. This way, the more frequent
updates to Object X and Y will only impact second
index that stores relationship information and reduce
the cost to re-index object A. However, I don't think
MultiSearcher will be helpful if we want to search for
data which spans across both indexes (e.g. some fields
of Object A in first index and some fields of Object X
or Y in second index).

Is there any option in Lucene to handle such a
scenario? Can we search across multiple indexes which
have some relationships between them and search for
fields that span across these indexes?

Regards,
Rajesh

__________________________________________________
Do You Yahoo!?
Tired of spam?  Yahoo! Mail has the best spam protection around 
http://mail.yahoo.com 

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: Lucene index on relational data

Posted by Rajesh parab <ra...@yahoo.com>.

Thanks Mathieu,

On your comments on partitioning of data -

<<Mathieu>>
Yes. You can index unfolded data, which takes a lot of
space, or use two queries on two indexes. The first builds
a Filter for the second, just like with the previous
JDBC example. You can even cache the filter, like Solr
does with its faceted search.

<<Rajesh>>
I am looking for a way to use a single query that runs
across two indexes (static and dynamic index), where the
search query has fields from both of these indexes.

Rajesh


Re: Lucene index on relational data

Posted by Mathieu Lecarme <ma...@garambrogne.net>.

On 11 Apr 2008, at 19:29, Rajesh parab wrote:
> Thanks for these pointers Mathieu.
>
> We looked at Compass earlier, but the main issue
> with a database index is DB vendor support for BLOB
> locators. I understand that Oracle provides this
> support for fetching partial data from a BLOB, but I guess
> similar support is not available in SQL Server and
> DB2. Our application currently supports all 3 of these
> databases.
You misunderstood something. Compass can use a JDBC index, but it's only
an option; a classical file index is available too. Other specific index
types are GigaSpace and Terracotta, for clustered environments.

> Secondly, I am reading that search performance degrades
> drastically with a database index.
You can build a Filter from a JDBC query to mix it with a Lucene search.
If your JDBC query uses too many joins, it will be slow, so your Lucene
search, which waits for its Filter, will be slow too. Building a Filter
from a Set of ids is not slow.

> Will it be possible to partition data into a main index
> and a relationship index using file system Lucene indexes
> and search across these indexes?
Yes. You can index unfolded data, which takes a lot of space, or use two
queries on two indexes. The first builds a Filter for the second, just like
with the previous JDBC example.
You can even cache the filter, like Solr does with its faceted search.

M.
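
The two-query approach described in this reply can be sketched roughly as follows. This is a minimal illustration in plain Java collections, not Lucene code: each "index" is just a list of field maps, and the field names "id" and "aId" are hypothetical. In a real setup the key set from the first query would presumably be wrapped in a Lucene Filter for the second search; that wiring is left out here.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

final class TwoStepJoin {
    // Step 1: query the dynamic index (Object X / Y rows) and collect the
    // foreign keys of the matching rows. Each row maps field name to value.
    static Set<String> matchingParentKeys(List<Map<String, String>> dynamicIndex,
                                          String field, String value) {
        Set<String> keys = new HashSet<>();
        for (Map<String, String> row : dynamicIndex) {
            if (value.equals(row.get(field))) {
                keys.add(row.get("aId")); // hypothetical foreign-key field
            }
        }
        return keys;
    }

    // Step 2: query the static index (Object A rows), keeping only rows whose
    // primary key passed the filter built in step 1.
    static List<String> search(List<Map<String, String>> staticIndex,
                               String field, String value, Set<String> allowedKeys) {
        List<String> hits = new ArrayList<>();
        for (Map<String, String> row : staticIndex) {
            String id = row.get("id"); // hypothetical primary-key field
            if (allowedKeys.contains(id) && value.equals(row.get(field))) {
                hits.add(id);
            }
        }
        return hits;
    }
}
```

The key set is the only thing shared between the two steps, so it can be cached and reused while the dynamic index is unchanged, which is the caching idea mentioned above.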


Re: Lucene index on relational data

Posted by Rajesh parab <ra...@yahoo.com>.

Thanks for these pointers Mathieu.

We looked at Compass earlier, but the main issue
with a database index is DB vendor support for BLOB
locators. I understand that Oracle provides this
support for fetching partial data from a BLOB, but I guess
similar support is not available in SQL Server and
DB2. Our application currently supports all 3 of these
databases.

Secondly, I am reading that search performance degrades
drastically with a database index.

Will it be possible to partition data into a main index
and a relationship index using file system Lucene indexes
and search across these indexes?

Regards,
Rajesh


Re: Lucene index on relational data

Posted by Mathieu Lecarme <ma...@garambrogne.net>.

Have a look at Compass 2.0M3
http://www.kimchy.org/searchable-cascading-mapping/

Your multiple-index approach will be nice for write-heavy workloads. With a
typical read/write ratio, Compass will be much easier.

M.


Re: Lucene index on relational data

Posted by Rajesh parab <ra...@yahoo.com>.

Thanks Karl. I think your solution would be useful in
case we would like to partition the index into two
indexes and use ParallelReader to query both indexes
simultaneously. 

If this solution is not going to be included in
future Lucene releases, what other options do we have to
update just one of the two indexes and keep doc ids in
sync so that we can use ParallelReader?

Regards,
Rajesh


Re: Lucene index on relational data

Posted by Karl Wettin <ka...@gmail.com>.

Rajesh parab wrote:
> How do we specify the primary key or doc id so that
> a newly added document will use the same doc id? Do you
> have any sample code that makes use of this patch?

Sorry, there is only the test case in the patch.

> 
> Secondly, there was a comment saying it is a proof of
> concept and not a real project. Is anyone using this
> patch on their production environments? Will this fix
> get rolled into latest Lucene release?

I very much doubt this patch would ever be rolled in. It is just
something I did to see if it was possible to solve this in some way without
making major changes to the core architecture.

It works though. Feel free to report back in the issue with any results 
you get in case you try it out.


     karl


Re: Lucene index on relational data

Posted by Rajesh parab <ra...@yahoo.com>.

Thanks Karl.

How do we specify the primary key or doc id so that
a newly added document will use the same doc id? Do you
have any sample code that makes use of this patch?

Secondly, there was a comment saying it is a proof of
concept and not a real project. Is anyone using this
patch in their production environments? Will this fix
get rolled into the latest Lucene release?

Regards,
Rajesh


Re: Lucene index on relational data

Posted by Karl Wettin <ka...@gmail.com>.

Rajesh parab wrote:

>  https://issues.apache.org/jira/browse/LUCENE-879
> <<Rajesh>>
> As per the hack you mentioned inside JIRA, if some of
> the documents are deleted and re-inserted into
> secondary index, the other documents inside the index
> do not change their doc id. However, the newly added
> documents will have different doc ids and hence, we
> will have to sync them with primary index doc ids. Is
> my understanding correct? If this is the case, then we
> will have to update both the indexes every time
> something inside secondary index changes.

From the JIRA comments on the second patch in there:

This new patch allows a consumer to, based on a primary key, delete a
document and add a new document with the same document number as the
deleted one. The events will occur on merging.


     karl


Re: Lucene index on relational data

Posted by Rajesh parab <ra...@yahoo.com>.

Hi Everyone,

Any help around this topic will be very useful. Is
anyone partitioning the data into 2 or more indexes
and using ParallelReader to search these indexes? If
yes, how do you handle updates to the indexes and make
sure the doc ids for all indexes are in the same order?

Regards,
Rajesh


Re: Lucene index on relational data

Posted by Rajesh parab <ra...@yahoo.com>.

Hi Mathieu,

I can definitely store the foreign key inside the
dynamic index. However if I understand correctly, for
ParallelReader to work properly, doc ids for all
documents in both primary and secondary (dynamic)
index should be in same order.

How can we achieve this if there are frequent changes to
the dynamic index? The doc ids will keep on changing
as we delete and re-insert records in dynamic index.
As Karl pointed out, there is a hack available in JIRA
that can take care of this doc id update issue, but it
is not an official patch and not tested for
performance.

How are people updating their indexes when used in
conjunction with ParallelReader? I think ParallelReader
will work well for data partitioned between 2 indexes
(static and dynamic). However, I am not finding any
better approach to just update the dynamic index.

Regards,
Rajesh
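
The doc id concern raised in this message can be illustrated with a toy model. The sketch below is plain Java, not Lucene code, and makes a simplifying assumption: the doc id is just the list position, and an update (delete plus re-add) compacts the old slot away immediately, which corresponds roughly to the state of an index after its segments have been merged. The class and method names are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;

final class DocIdDrift {
    // Toy model of an append-only index: the doc id is the list position.
    // An update is delete + re-add, so the new version lands at the end.
    static List<String> update(List<String> index, String key) {
        List<String> next = new ArrayList<>(index);
        next.remove(key); // delete the old version; later docs shift down
        next.add(key);    // the re-added document gets the highest doc id
        return next;
    }

    // Positions where the two indexes no longer hold the same logical document.
    static List<Integer> misaligned(List<String> staticIdx, List<String> dynamicIdx) {
        List<Integer> bad = new ArrayList<>();
        int n = Math.min(staticIdx.size(), dynamicIdx.size());
        for (int i = 0; i < n; i++) {
            if (!staticIdx.get(i).equals(dynamicIdx.get(i))) {
                bad.add(i);
            }
        }
        return bad;
    }
}
```

Under this model, updating one document near the front of the dynamic index misaligns every position after it, which is why a position-based pairing of the two indexes needs either matching updates on both sides or some way to reuse the old doc id.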


Re: Lucene index on relational data

Posted by Mathieu Lecarme <ma...@garambrogne.net>.

> Regarding data and its relationships - the use case I
> am trying to solve is to partition my data into 2
> indexes, a primary index that will contain the majority
> of the data and is fairly static. The secondary
> index will have related information for the same data
> set in the primary index, and this related information
> inside the secondary index will change very frequently.
>
> The number of documents in each index will run into the millions
> and hence, re-building the index in memory will not work
> :-(

When you've got too many documents, you can shard them.
If you're lucky enough, you can split your data into autonomous partitions,
but you can always query multiple indexes in one search.
You can even split with a modulo or something like that.

With your two-index pattern, why don't you store a foreign key in
your dynamic index?

M.
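
A minimal sketch of the "split with a modulo" idea from this reply, assuming each document has a numeric primary key; the class name, method names, and the choice of key are illustrative assumptions, not anything prescribed in the thread.

```java
import java.util.ArrayList;
import java.util.List;

final class ModuloShard {
    // Picks the shard (index) a document belongs to from its primary key.
    // floorMod keeps the result in [0, shardCount) even for negative keys.
    static int shardFor(long primaryKey, int shardCount) {
        return (int) Math.floorMod(primaryKey, (long) shardCount);
    }

    // Groups a batch of keys by shard, e.g. to route updates to the right index.
    static List<List<Long>> partition(List<Long> keys, int shardCount) {
        List<List<Long>> shards = new ArrayList<>();
        for (int i = 0; i < shardCount; i++) {
            shards.add(new ArrayList<>());
        }
        for (long key : keys) {
            shards.get(shardFor(key, shardCount)).add(key);
        }
        return shards;
    }
}
```

Because the shard is a pure function of the key, an update to one document only touches one of the smaller indexes, while a search still has to consult all of them.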


Re: Lucene index on relational data

Posted by Rajesh parab <ra...@yahoo.com>.

<<Karl>>
How much data do you have? I have a hard time
understanding the relationship between your objects and
what sort of normalized data you add to the documents.
If you are lucky it is just a single field or a few fields
that need to be updated and you can manage to keep it
in RAM and rebuild the whole thing every time something
happens or on some schedule.
<<Rajesh>>
Regarding data and its relationships - the use case I
am trying to solve is to partition my data into 2
indexes, a primary index that will contain the majority
of the data and is fairly static. The secondary
index will have related information for the same data
set in the primary index, and this related information
inside the secondary index will change very frequently.

The number of documents in each index will run into the millions
and hence, re-building the index in memory will not work
:-(


<<Karl>>
There are some hacks in the JIRA that allow you to
replace a document at a certain position at index
optimization time. You might want to update a number
of documents every time you do that.
 https://issues.apache.org/jira/browse/LUCENE-879
<<Rajesh>>
As per the hack you mentioned inside JIRA, if some of
the documents are deleted and re-inserted into the
secondary index, the other documents inside the index
do not change their doc id. However, the newly added
documents will have different doc ids and hence, we
will have to sync them with primary index doc ids. Is
my understanding correct? If this is the case, then we
will have to update both the indexes every time
something inside the secondary index changes.



Re: Lucene index on relational data

Posted by Karl Wettin <ka...@gmail.com>.

How much data do you have? I have a hard time understanding the
relationship between your objects and what sort of normalized data you
add to the documents.

If you are lucky, it is just a single field or a few fields that need to
be updated, and you can manage to keep it in RAM and rebuild the whole
thing every time something happens or on some schedule.

There are some hacks in the JIRA that allow you to replace a document
at a certain position at index optimization time. You might want to
update a number of documents every time you do that.

https://issues.apache.org/jira/browse/LUCENE-879

Rajesh parab wrote:
> Thanks for details Karl.
> 
> I was looking for something like it. However, I have a
> question around the warning mentioned in javadoc of
> parallelReader. 
> 
> It says -
> It is up to you to make sure all indexes are created
> and modified the same way. For example, if you add
> documents to one index, you need to add the same
> documents in the same order to the other indexes.
> Failure to do so will result in undefined behavior.
> 
> 
> So now, if I want to update one of the index document
> from my dynamic index, I will have to delete the
> document and insert it again as Lucene does not allow
> updating the document. Correct? If this is the case,
> re-insert of document in dynamic index will change the
> order of the index with static index, which is not
> modified. How should we take care of this situation?
> Am I missing something here?
> 
> Regards,
> Rajesh

Re: Lucene index on relational data

Posted by Rajesh parab <ra...@yahoo.com>.

While going over the forum, I found one more thread
where Otis asked a similar question about the
synchronization of doc ids between 2 indexes.

http://www.gossamer-threads.com/lists/lucene/java-user/50227?search_string=parallelreader;#50227

Otis,
Have you found the answer to your question?

Regards,
Rajesh

--- Rajesh parab <ra...@yahoo.com> wrote:

> Thanks for details Karl.
>
> I was looking for something like it. However, I have a
> question around the warning mentioned in javadoc of
> parallelReader.
>
> It says -
> It is up to you to make sure all indexes are created
> and modified the same way. For example, if you add
> documents to one index, you need to add the same
> documents in the same order to the other indexes.
> Failure to do so will result in undefined behavior.
>
>
> So now, if I want to update one of the index document
> from my dynamic index, I will have to delete the
> document and insert it again as Lucene does not allow
> updating the document. Correct? If this is the case,
> re-insert of document in dynamic index will change the
> order of the index with static index, which is not
> modified. How should we take care of this situation?
> Am I missing something here?
>
> Regards,
> Rajesh

Re: Lucene index on relational data

Posted by Rajesh parab <ra...@yahoo.com>.

Thanks for the details, Karl.

I was looking for something like this. However, I have a
question about the warning in the javadoc of
ParallelReader.

It says -
It is up to you to make sure all indexes are created
and modified the same way. For example, if you add
documents to one index, you need to add the same
documents in the same order to the other indexes.
Failure to do so will result in undefined behavior.


So now, if I want to update one of the documents
in my dynamic index, I will have to delete the
document and insert it again, as Lucene does not allow
updating a document in place. Correct? If this is the
case, re-inserting the document into the dynamic index
will put its order out of step with the static index,
which is not modified. How should we take care of this
situation? Am I missing something here?
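
The concern can be illustrated with a toy model in plain Java (no Lucene
calls). The list position stands in for the document number, and the model
assumes that the slot of a deleted entry is closed up and the re-added entry
lands at the end. DocOrderDriftSketch and its methods are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model: the list position stands in for a document number.
final class DocOrderDriftSketch {
    // Models "delete then re-add": the entry leaves its slot and is appended at the end.
    static void update(List<String> index, String key) {
        index.remove(key);
        index.add(key);
    }

    // Returns the first position where the two indexes hold different keys, or -1 if aligned.
    static int firstMismatch(List<String> staticIndex, List<String> dynamicIndex) {
        int n = Math.min(staticIndex.size(), dynamicIndex.size());
        for (int i = 0; i < n; i++) {
            if (!staticIndex.get(i).equals(dynamicIndex.get(i))) {
                return i;
            }
        }
        return staticIndex.size() == dynamicIndex.size() ? -1 : n;
    }
}
```

Starting from two indexes that both hold A1, A2, A3, a single update of A2
in the dynamic one leaves it as A1, A3, A2, so position 1 now pairs the
static fields of A2 with the dynamic fields of A3.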

Regards,
Rajesh

--- Karl Wettin <ka...@gmail.com> wrote:

> Hi Rajesh,
>
> I think you are looking for ParallelReader.
>
> <http://lucene.apache.org/java/2_0_0/api/org/apache/lucene/index/ParallelReader.html>
>
> public class ParallelReader
> extends IndexReader
>
> An IndexReader which reads multiple, parallel indexes. Each index added
> must have the same number of documents, but typically each contains
> different fields. Each document contains the union of the fields of all
> documents with the same document number. When searching, matches for a
> query term are from the first index added that has the field.
>
> This is useful, e.g., with collections that have large fields which
> change rarely and small fields that change more frequently. The smaller
> fields may be re-indexed in a new index and both indexes may be searched
> together.
>
> Warning: It is up to you to make sure all indexes are created and
> modified the same way. For example, if you add documents to one index,
> you need to add the same documents in the same order to the other
> indexes. Failure to do so will result in undefined behavior.
>
>      karl

Re: Lucene index on relational data

Posted by Karl Wettin <ka...@gmail.com>.

Hi Rajesh,

I think you are looking for ParallelReader.

<http://lucene.apache.org/java/2_0_0/api/org/apache/lucene/index/ParallelReader.html>

public class ParallelReader
extends IndexReader

An IndexReader which reads multiple, parallel indexes. Each index added 
must have the same number of documents, but typically each contains 
different fields. Each document contains the union of the fields of all 
documents with the same document number. When searching, matches for a 
query term are from the first index added that has the field.

This is useful, e.g., with collections that have large fields which 
change rarely and small fields that change more frequently. The smaller 
fields may be re-indexed in a new index and both indexes may be searched 
together.

Warning: It is up to you to make sure all indexes are created and 
modified the same way. For example, if you add documents to one index, 
you need to add the same documents in the same order to the other 
indexes. Failure to do so will result in undefined behavior.
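
To make the "union of the fields of all documents with the same document
number" idea concrete, here is a toy model in plain Java. It does not call
Lucene; ParallelUnionSketch and its methods are hypothetical, and letting
the first index added win on a duplicate field name is a simplification
that only echoes the javadoc above.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy model of parallel indexes: entry n of every index describes the same document.
final class ParallelUnionSketch {
    private final List<List<Map<String, String>>> indexes = new ArrayList<>();

    // Every index added must hold the same number of documents as the first one.
    void add(List<Map<String, String>> index) {
        if (!indexes.isEmpty() && indexes.get(0).size() != index.size()) {
            throw new IllegalArgumentException(
                    "all indexes must have the same number of documents");
        }
        indexes.add(index);
    }

    // Document n is the union of the fields stored at position n in each index.
    // If a field name appears in several indexes, the first index added wins.
    Map<String, String> document(int n) {
        Map<String, String> merged = new LinkedHashMap<>();
        for (List<Map<String, String>> index : indexes) {
            for (Map.Entry<String, String> field : index.get(n).entrySet()) {
                merged.putIfAbsent(field.getKey(), field.getValue());
            }
        }
        return merged;
    }
}
```

The model only works while position n means the same document in every
index, which is exactly the constraint the warning describes.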



     karl

Rajesh parab wrote:
> Hi,
> 
> We are using Lucene 2.0 to index data stored inside
> relational database. Like any relational database, our
> database has quite a few one-to-one and one-to-many
> relationships. For example, let’s say an Object A has
> one-to-many relationship with Object X and Object Y.
> As we need to de-normalize relational data as
> key-value pairs before storing it inside Lucene index,
> we have de-normalized these relationships (Object X
> and Object Y) while building an index on Object A.
> 
> We have large no of such object relationships and most
> of the times, the related objects are modified more
> frequently than the base objects. For example, in our
> above case, objects X and Y are updated in the system
> very frequently, whereas Object A is not updated that
> often. Still, we will need to update Object A entries
> inside the index, every time its related objects X
> and/or Y are modified.
> 
> To avoid the above situation, we were thinking of
> having 2 separate indexes – first index will only
> index data of base objects (Object A in above example)
> and second index will contain data about its
> relationship objects (Object X and Y above), which are
> updated more frequently. This way, the more frequent
> updates to Object X and Y will only impact second
> index that stores relationship information and reduce
> the cost to re-index object A. However, I don’t think,
> MultiSearcher will be helpful if we want to search for
> data which spans across both indexes (e.g. some fields
> of Object A in first index and some fields of Object X
> or Y in second index).
> 
> Do we have any option in Lucene to handle such
> scenario? Can we search across multiple indexes which
> have some relationships between them and search for
> fields that span across these indexes?
> 
> Regards,
> Rajesh