Posted to java-user@lucene.apache.org by Daniel Shane <sh...@LEXUM.UMontreal.CA> on 2009/08/13 16:33:45 UTC

Is there a way to check for field "uniqueness" when indexing?

Hi all!

I'm currently running a big lucene index and one of my main concerns is 
the integrity of the data entered. A few things come to mind, like 
enforcing that certain fields be non-blank, forcing certain formats etc...

All these validations are easy to do with lucene, since I can validate 
the document before it is indexed or when it is retrieved.

The thing I have a hard time with, however, is field uniqueness.

Let's say I have a field and I really want it to be unique. I can't seem 
to find out how to do it during the indexing phase, since nothing 
that is added to the index is readable by an index reader until the 
index is closed.

Add to that the fact that items can be deleted from the index during 
indexing, and the only way I have to check uniqueness is to walk every 
value of the unique field using a TermEnum and check its docFreq.

This has a major disadvantage that I cannot inform people who are using 
the library of the unique conflit when it happens, only when the index 
is closed.
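
For reference, the after-the-fact check described here amounts to counting how many documents carry each value of the field and flagging any count above one. The following is a minimal plain-Java sketch of that idea only. It does not use the Lucene API: the class name DuplicateScan and its list-of-values input are assumptions standing in for a walk over the field's terms and their docFreq.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical helper: finds field values that occur in more than one document. */
class DuplicateScan {
    /** Returns each duplicated value mapped to the number of times it occurs. */
    static Map<String, Integer> duplicates(List<String> fieldValues) {
        Map<String, Integer> freq = new TreeMap<>();
        for (String value : fieldValues) {
            freq.merge(value, 1, Integer::sum); // plays the role of docFreq
        }
        freq.values().removeIf(count -> count < 2); // keep only the conflicts
        return freq;
    }

    public static void main(String[] args) {
        List<String> ids = Arrays.asList("doc-1", "doc-2", "doc-1");
        System.out.println(duplicates(ids));
    }
}
```

As the message says, a scan like this can only report conflicts after the fact, not at the moment the offending document is added.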

Does anyone have an idea on how I could check an index that is in the 
process of being indexed (things added, things deleted) for the uniqueness 
of a given field *at the time I index a document*?

Daniel Shane


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: Is there a way to check for field "uniqueness" when indexing?

Posted by Yonik Seeley <yo...@lucidimagination.com>.
On Wed, Aug 26, 2009 at 12:47 PM, Daniel Shane<sh...@lexum.umontreal.ca> wrote:
> Hmm... there is something I don't get.
>
> When you open up an index writer, you batch up adds and deletes. Now if you
> create a signature for the document, as long as you only add, it works, but what
> happens if you delete stuff from the index using a query as well as adding?
>
> Does Solr remember the deletions as well?

It used to - but now it delegates all that to IndexWriter as well (and
lucene buffers them instead).
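
To make the buffering point concrete, here is a toy plain-Java model, not Lucene's implementation. It assumes adds and deletes are queued in arrival order and only become visible at commit. The class BufferedWriterModel and its String-based methods are hypothetical simplifications; the real IndexWriter.updateDocument takes a Term and a Document, and delete-by-query is not modeled here.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy model of a writer that buffers updates and deletes until commit. */
class BufferedWriterModel {
    private final Map<String, String> committed = new HashMap<>();
    // Each buffered op is {key, body}; a null body means "delete this key".
    private final List<String[]> buffer = new ArrayList<>();

    /** Replace-by-key: any earlier document with the same key is superseded. */
    void updateDocument(String key, String body) {
        buffer.add(new String[] {key, body});
    }

    void deleteDocuments(String key) {
        buffer.add(new String[] {key, null});
    }

    /** Applies buffered operations in arrival order, then clears the buffer. */
    void commit() {
        for (String[] op : buffer) {
            if (op[1] == null) {
                committed.remove(op[0]);
            } else {
                committed.put(op[0], op[1]);
            }
        }
        buffer.clear();
    }

    /** What a reader opened on the committed state would see. */
    String search(String key) {
        return committed.get(key);
    }

    int numDocs() {
        return committed.size();
    }
}
```

Because operations are replayed in order, a delete followed by an add of the same key leaves exactly one document. Note that in this model an update silently overwrites the earlier document; it never reports a conflict to the caller.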

-Yonik
http://www.lucidimagination.com




Re: Is there a way to check for field "uniqueness" when indexing?

Posted by Daniel Shane <sh...@LEXUM.UMontreal.CA>.
Hmm... there is something I don't get.

When you open up an index writer, you batch up adds and deletes. Now if 
you create a signature for the document, as long as you only add, it works, 
but what happens if you delete stuff from the index using a query as 
well as adding?

Does Solr remember the deletions as well?

Daniel Shane



Re: Is there a way to check for field "uniqueness" when indexing?

Posted by Yonik Seeley <yo...@lucidimagination.com>.
On Fri, Aug 21, 2009 at 12:49 AM, Chris
Hostetter<ho...@fucit.org> wrote:
>
> : But in that case, I assume Solr does a commit per document added.
>
> not at all ... it computes a signature and then uses that as a unique key.
> IndexWriter.updateDocument does all the hard work.

Right - Solr used to do that hard work, but we handed that over to
Lucene when that capability was added.  It involves batching either
way (but letting Lucene handle it at a lower level is "better" since
it can prevent inconsistencies from crashes).

-Yonik
http://www.lucidimagination.com



Re: Is there a way to check for field "uniqueness" when indexing?

Posted by Chris Hostetter <ho...@fucit.org>.
: But in that case, I assume Solr does a commit per document added.

not at all ... it computes a signature and then uses that as a unique key.  
IndexWriter.updateDocument does all the hard work.
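
As a rough illustration of the "compute a signature, use it as the unique key" idea, the sketch below hashes a document's chosen field values with MD5 using only the JDK. It is an assumption-laden stand-in, not Solr's code: the class name DocSignature, the choice of MD5, and the zero-byte separator between fields are all illustrative.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

/** Hypothetical helper: derives a unique-key string from selected field values. */
class DocSignature {
    static String of(String... fieldValues) {
        try {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            for (String value : fieldValues) {
                md5.update(value.getBytes(StandardCharsets.UTF_8));
                md5.update((byte) 0); // separator so ("ab","c") differs from ("a","bc")
            }
            StringBuilder hex = new StringBuilder();
            for (byte b : md5.digest()) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(of("Some title", "Some body text"));
    }
}
```

The resulting string would then be stored in the key field and passed as the term to an update-by-key call, so that two documents with the same signature replace one another instead of coexisting.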


-Hoss




Re: Is there a way to check for field "uniqueness" when indexing?

Posted by Daniel Shane <sh...@LEXUM.UMontreal.CA>.
But in that case, I assume Solr does a commit per document added.

Let's say I wanted to index a collection of 1 million pages. Would it 
take much longer if I committed at each insertion rather than committing at 
the end?

Daniel Shane
 


Re: Is there a way to check for field "uniqueness" when indexing?

Posted by Grant Ingersoll <gs...@apache.org>.

On Aug 13, 2009, at 10:33 AM, Daniel Shane wrote:
>
> Does anyone have an idea on how I could check an index that is in  
> the process of being indexed (things added, things deleted) for the  
> uniqueness of a given field *at the time I index a document*?


Solr has de-duplication built-in at indexing time: http://wiki.apache.org/solr/Deduplication

--------------------------
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids)  
using Solr/Lucene:
http://www.lucidimagination.com/search




Re: Is there a way to check for field "uniqueness" when indexing?

Posted by Shai Erera <se...@gmail.com>.
In 2.9 there will be - IndexWriter#getReader().

BTW, note that even if someone deletes, your reader may not see this delete.
If you use IndexWriter to delete docs, the open reader won't see those
deletes. So you may still have a problem.

I don't know how much stuff users can index, and how often you plan to
commit. But I'd like to think humans can't generate that much traffic in a
short period of time (say every couple of minutes). If they do, you might
want to commit anyway, so they'll be able to search on their newly added
data?

Anyway, I'll give it some more thought.

Shai


Re: Is there a way to check for field "uniqueness" when indexing?

Posted by Daniel Shane <sh...@LEXUM.UMontreal.CA>.
Users can index a lot of stuff, so I'd rather not keep things in 
memory for too long.

Even if I keep a set of things added, how do I know if something has 
been removed via a delete? It seems rather difficult to keep this set of 
added documents in sync with the index reader on the index (before it 
has been written to).

What I'd like is to have access to the stuff the index writer has 
written but not yet committed. Is there something that can access that data?

Daniel Shane



Re: Is there a way to check for field "uniqueness" when indexing?

Posted by Shai Erera <se...@gmail.com>.
How many documents do you index between reader refreshes? If it's not
too many, I'd keep a Set of those terms and check every incoming document
against the set and then the reader.

Note that the set keeps only the terms of those documents your reader
doesn't see. You should clear() it after you've refreshed your reader.

In 2.9, IndexWriter will expose a getReader(), so you might be able to use
it, by checking both its reader and the on-disk reader.

If it's possible, I think I'd prefer the first approach.
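
A minimal sketch of the first approach, in plain Java and under stated assumptions: the class PendingKeyGuard is hypothetical, and the Predicate passed to it stands in for "does the currently open reader contain this term?". It also tracks deletes by key made since the last refresh, which is one possible way to handle deletions; deletes by arbitrary query are not covered.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.function.Predicate;

/** Tracks unique-field values changed since the reader was last refreshed. */
class PendingKeyGuard {
    private final Set<String> pendingAdds = new HashSet<>();
    private final Set<String> pendingDeletes = new HashSet<>();
    // Assumed stand-in for a term lookup against the open reader.
    private final Predicate<String> readerContains;

    PendingKeyGuard(Predicate<String> readerContains) {
        this.readerContains = readerContains;
    }

    /** Returns true if the key is free to index, false if it would be a duplicate. */
    synchronized boolean tryAdd(String key) {
        if (pendingAdds.contains(key)) {
            return false; // added since the last refresh
        }
        if (readerContains.test(key) && !pendingDeletes.contains(key)) {
            return false; // already in the index and not deleted since
        }
        pendingAdds.add(key);
        return true;
    }

    /** Records a delete by key so a later add of the same key is allowed. */
    synchronized void delete(String key) {
        pendingAdds.remove(key);
        pendingDeletes.add(key);
    }

    /** Call after the reader has been reopened; it now sees everything. */
    synchronized void readerRefreshed() {
        pendingAdds.clear();
        pendingDeletes.clear();
    }
}
```

Memory use is bounded by the number of keys touched between refreshes, so refreshing more often keeps both sets small.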

Shai


Re: Is there a way to check for field "uniqueness" when indexing?

Posted by Jason Rutherglen <ja...@gmail.com>.
Daniel,

You may want to look at SOLR-1375, which enables ID checking
using a BloomFilter (with a specified error rate for false
positives). Otherwise, for what you're trying to do, you'd need
to create a hash map?
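
For readers unfamiliar with the technique, here is a small self-contained Bloom filter in plain Java, sized from an expected key count and a target false-positive rate. It is an illustrative sketch, not the SOLR-1375 code: the class name KeyBloomFilter and the two hash functions (String.hashCode plus an FNV-1a variant, combined by double hashing) are assumptions.

```java
import java.util.BitSet;

/** Minimal Bloom filter for "have I probably seen this key?" checks. */
class KeyBloomFilter {
    private final BitSet bits;
    private final int numBits;
    private final int numHashes;

    /** Sizes the filter for expectedKeys entries at roughly the given false-positive rate. */
    KeyBloomFilter(int expectedKeys, double falsePositiveRate) {
        double ln2 = Math.log(2);
        // Standard sizing: m = -n ln(p) / (ln 2)^2, k = (m / n) ln 2
        this.numBits = Math.max(1,
            (int) Math.ceil(-expectedKeys * Math.log(falsePositiveRate) / (ln2 * ln2)));
        this.numHashes = Math.max(1,
            (int) Math.round((double) numBits / expectedKeys * ln2));
        this.bits = new BitSet(numBits);
    }

    void add(String key) {
        int h1 = key.hashCode();
        int h2 = fnv1a(key);
        for (int i = 0; i < numHashes; i++) {
            bits.set(Math.floorMod(h1 + i * h2, numBits));
        }
    }

    /** False means definitely never added; true means possibly added. */
    boolean mightContain(String key) {
        int h1 = key.hashCode();
        int h2 = fnv1a(key);
        for (int i = 0; i < numHashes; i++) {
            if (!bits.get(Math.floorMod(h1 + i * h2, numBits))) {
                return false;
            }
        }
        return true;
    }

    /** Second hash for double hashing; forced odd so the step is never zero. */
    private static int fnv1a(String s) {
        int h = 0x811c9dc5;
        for (int i = 0; i < s.length(); i++) {
            h ^= s.charAt(i);
            h *= 0x01000193;
        }
        return h | 1;
    }
}
```

A "false" answer is definitive, so most new keys can skip the index lookup entirely; a "true" answer still needs a real check against the index. A plain Bloom filter like this one cannot forget a key, so deletions would require rebuilding it or using a different structure.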

-J
