Posted to java-user@lucene.apache.org by Erik Hatcher <er...@ehatchersolutions.com> on 2007/03/01 11:47:21 UTC

Re: document field updates

On Feb 28, 2007, at 8:59 AM, Steven Parkes wrote:

> 	Are unindexed fields stored separately from the main inverted
> index?
> 	If so then, one could implement the field value change as a
> delete and
> 	re-add of just that value?
>
> The short answer is that won't work. Field values are stored in a
> different data structure than the postings lists but docids are
> consistent across all contents of a segment. Deleting something and
> readding it is going to put it into a different segment which is going
> to keep this from working. (Not to mention that you want the postings
> lists updated if you want it to be searchable ...)
>
> 	Are you aware of some implementation of Lucene that solves this
> need
> 	well with a second index for 'tags' complete with multi-index
> boolean
> 	queries?
>
> I'm pretty sure this has been done, I'm just not 100% sure where. Does
> Nutch index link text?

Nutch does do this sort of thing, but I'm not quite sure how.  It  
isn't doing any operations to the Lucene index beyond what plain ol'  
Lucene does.

> I don't know if Solr has anything like this but
> if I remember correctly, Collex has tags but as far as I can tell,  
> it's
> not been open sourced (yet?)

Collex is quite open source, it's just ugly source :)  We're the
'patacriticism' project at SourceForge, under the "collex" directory  
in Subversion.

Collex implements tagging by implementing JOIN cross-references  
between user/tag documents and regular object documents.  Its
scalability is not going to be good at bigger numbers in its current  
architecture, but it works quite well for our 60k or so objects at  
the moment.

	Erik

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: document field updates

Posted by Erik Hatcher <er...@ehatchersolutions.com>.

On Mar 1, 2007, at 1:35 PM, Neal Richter wrote:
>> Collex is quite open source, it's just ugly source :)  We're the
>> 'patacriticism' project at SourceForge, under the "collex" directory
>> in Subversion.
>>
>> Collex implements tagging by implementing JOIN cross-references
>> between user/tag documents and regular object documents.  Its
>> scalability is not going to be good at bigger numbers in its current
>> architecture, but it works quite well for our 60k or so objects at
>> the moment.
>
> Have you implemented any code that enforces a Boolean query across
> these two indexes?

Actually, it's a single index, with a "type" field that separates the
two different types of documents (archive objects, or collectable
objects).

A pointer to this code is here:
<http://patacriticism.svn.sourceforge.net/viewvc/patacriticism/collex/trunk/src/solr/org/nines/CollectableCache.java?view=markup>
It's a hack that leverages some of Solr's facilities (but not nearly
enough!).
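
(Illustration only, not the Collex code: a minimal sketch of the
general idea of a join inside one index, where a "type" field tells
tag documents apart from object documents. The class TagJoinSketch,
the field names other than "type", and the in-memory list standing in
for the index are all assumptions made for this example; no Lucene or
Solr API is used.)

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical in-memory stand-in for a single index holding two document types.
final class TagJoinSketch {
    // Every document is a flat field map; the "type" field tells the kinds apart.
    private final List<Map<String, String>> docs = new ArrayList<>();

    void addObject(String id, String title) {
        docs.add(Map.of("type", "object", "id", id, "title", title));
    }

    void addTag(String user, String tag, String objectId) {
        docs.add(Map.of("type", "tag", "user", user, "tag", tag, "objectId", objectId));
    }

    // Pass 1: collect the object ids referenced by matching tag documents.
    // Pass 2: return the titles of object documents whose id is in that set.
    List<String> titlesTagged(String tag) {
        Set<String> ids = new LinkedHashSet<>();
        for (Map<String, String> d : docs) {
            if ("tag".equals(d.get("type")) && tag.equals(d.get("tag"))) {
                ids.add(d.get("objectId"));
            }
        }
        List<String> titles = new ArrayList<>();
        for (Map<String, String> d : docs) {
            if ("object".equals(d.get("type")) && ids.contains(d.get("id"))) {
                titles.add(d.get("title"));
            }
        }
        return titles;
    }
}
```

Each query in this sketch walks the tag documents first and the object
documents second, which gives some feel for why a join of this kind
may get expensive as the number of documents grows.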

	Erik

Re: document field updates

Posted by Neal Richter <nr...@gmail.com>.

> Collex is quite open source, it's just ugly source :)  We're the
> 'patacriticism' project at SourceForge, under the "collex" directory
> in Subversion.
>
> Collex implements tagging by implementing JOIN cross-references
> between user/tag documents and regular object documents.  Its
> scalability is not going to be good at bigger numbers in its current
> architecture, but it works quite well for our 60k or so objects at
> the moment.

Have you implemented any code that enforces a Boolean query across
these two indexes?
Has anyone implemented a BooleanQuery class that operates across a set
of Fields that may live in different Indexes?

Re: document field updates

Posted by Andrzej Bialecki <ab...@getopt.org>.

Erik Hatcher wrote:
>>
>> I'm pretty sure this has been done, I'm just not 100% sure where. Does
>> Nutch index link text?
>
> Nutch does do this sort of thing, but I'm not quite sure how.  It 
> isn't doing any operations to the Lucene index beyond what plain ol' 
> Lucene does.
>

Nutch maintains a set of separate DBs (using Hadoop 
MapFile/SequenceFile), where inlinks are stored (together with their 
anchor text). During indexing this data is pulled in from the DBs piece 
by piece using the URLs as "primary keys".
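
(Illustration only, not Nutch code: a minimal sketch of pulling anchor
text into a document at indexing time, keyed by URL. The class
AnchorJoinSketch, its method names, the field names, and the HashMap
standing in for the link DB are assumptions made for this example; no
Hadoop or Lucene API is used.)

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class AnchorJoinSketch {
    // Stand-in for the separate link DB: target URL -> anchor texts of its inlinks.
    private final Map<String, List<String>> inlinks = new HashMap<>();

    void addInlink(String targetUrl, String anchorText) {
        inlinks.computeIfAbsent(targetUrl, k -> new ArrayList<>()).add(anchorText);
    }

    // Build the field map for one page at indexing time, using the URL as the
    // lookup key into the link DB. Pages without inlinks get an empty anchor field.
    Map<String, String> buildDocument(String url, String content) {
        Map<String, String> doc = new HashMap<>();
        doc.put("url", url);
        doc.put("content", content);
        doc.put("anchor", String.join(" ", inlinks.getOrDefault(url, List.of())));
        return doc;
    }
}
```

The point of the sketch is that the index itself is never patched with
link data after the fact; the anchor text is joined in while the
document is being built.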

Nutch doesn't update _any_ data structures in place. All "update"
operations involve creating new data files and optionally deleting old
data files. This also includes indexes: new indexes are created from
newly updated pages, and then only individual Lucene documents are
deleted from older indexes to get rid of duplicates. After a while,
really old indexes are removed completely, because their content is
likely to be present in one of the newer indexes.
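
(Illustration only, not Nutch code: a minimal sketch of the
"never update in place" scheme, where an update adds a new index and
only marks duplicates as deleted in older ones. The class
GenerationalIndexSketch and its method names are assumptions made for
this example, and plain maps stand in for Lucene indexes.)

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

final class GenerationalIndexSketch {
    // One "index": documents are written once; later only deletion marks are added.
    private static final class Index {
        final Map<String, String> docsByUrl;
        final Set<String> deleted = new HashSet<>();
        Index(Map<String, String> docs) { this.docsByUrl = new HashMap<>(docs); }
    }

    private final List<Index> indexes = new ArrayList<>(); // oldest first

    // An "update" builds a new index and marks duplicates in older ones as deleted.
    void addIndex(Map<String, String> freshPages) {
        for (Index old : indexes) {
            for (String url : freshPages.keySet()) {
                if (old.docsByUrl.containsKey(url)) old.deleted.add(url);
            }
        }
        indexes.add(new Index(freshPages));
    }

    // Drop the oldest index wholesale, on the assumption its pages were re-crawled.
    void dropOldest() {
        if (!indexes.isEmpty()) indexes.remove(0);
    }

    // Look a page up across all indexes, skipping documents marked as deleted.
    String lookup(String url) {
        for (Index idx : indexes) {
            if (idx.docsByUrl.containsKey(url) && !idx.deleted.contains(url)) {
                return idx.docsByUrl.get(url);
            }
        }
        return null;
    }

    // Number of documents that are still visible to searches.
    int liveCount() {
        int n = 0;
        for (Index idx : indexes) n += idx.docsByUrl.size() - idx.deleted.size();
        return n;
    }
}
```

Note that dropOldest() in this sketch can lose a page that was never
re-crawled, which matches the "likely to be present" caveat above.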

-- 
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com