
Posted to solr-user@lucene.apache.org by Thomas Scheffler <th...@uni-jena.de> on 2013/11/27 09:13:14 UTC

weak documents

Hi,

I am relatively new to SOLR and I am looking for a neat way to implement 
weak documents with SOLR.

Whenever a document is updated or deleted, all its dependent documents 
should be removed from the index. In other words, they exist only as long as 
the document they refer to exists in the specific version it had when they 
were indexed. On "update" they will be indexed after their master 
document.

I would like to have some kind of "dependsOn" field that carries the 
uniqueKey value of the master document.

Can this be done efficiently with SOLR?

I need this technique because on update and on delete I don't know how 
many dependent documents exist in the SOLR index. Especially for batch 
index processes, I need a more efficient way than querying before every 
update or delete.

kind regards,

Thomas

Re: weak documents

Posted by Walter Underwood <wu...@wunderwood.org>.

Right. Delete by query "id:foo OR dependsOn:foo".  --wunder
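
A minimal sketch of that delete-by-query in Python, assuming Solr's JSON update format. The "dependsOn" field name comes from the original question; the helper names and the update URL are hypothetical and need adjusting to your core.

```python
import json
import urllib.request


def quote_term(value):
    """Wrap a value in double quotes, escaping backslashes and quotes."""
    return '"' + value.replace("\\", "\\\\").replace('"', '\\"') + '"'


def delete_with_dependents(master_id):
    """Build a JSON delete-by-query body matching the master and its dependents."""
    term = quote_term(master_id)
    return json.dumps({"delete": {"query": "id:%s OR dependsOn:%s" % (term, term)}})


def post_update(body, update_url="http://localhost:8983/solr/collection1/update?commit=true"):
    """Send a body to the update handler (the URL is an assumption, not from the thread)."""
    request = urllib.request.Request(
        update_url,
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return response.read()


# Only builds and prints the request body; call post_update(...) to actually send it.
print(delete_with_dependents("foo"))
```

Quoting the id as a phrase keeps ids with spaces or query syntax characters from being parsed as operators.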

On Nov 27, 2013, at 6:23 AM, "Jack Krupansky" <ja...@basetechnology.com> wrote:

> Just bite the bullet and do the query at your application level. I mean, Solr/Lucene would have to do the same amount of work internally anyway. If the perceived performance overhead is too great, get beefier hardware.
> 
> -- Jack Krupansky
> 

--
Walter Underwood
wunder@wunderwood.org

Re: weak documents

Posted by Jack Krupansky <ja...@basetechnology.com>.

Just bite the bullet and do the query at your application level. I mean, 
Solr/Lucene would have to do the same amount of work internally anyway. If 
the perceived performance overhead is too great, get beefier hardware.

-- Jack Krupansky
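
Doing the lookup at the application level could look roughly like the sketch below: query for the ids of the dependents first, then delete the master and the dependents by id. It assumes a "dependsOn" field as in the original question and Solr's JSON response and update formats; the select URL, the row limit, and the function names are hypothetical. The HTTP calls themselves are left out, so a canned response stands in for the query result.

```python
import json
import urllib.parse


def dependents_query_url(master_id, select_url="http://localhost:8983/solr/collection1/select"):
    """Build a select URL that returns only the ids of documents depending on master_id."""
    escaped = master_id.replace("\\", "\\\\").replace('"', '\\"')
    params = {
        "q": 'dependsOn:"%s"' % escaped,
        "fl": "id",       # only the uniqueKey is needed
        "rows": "1000",   # arbitrary cap; larger sets would need paging
        "wt": "json",
    }
    return select_url + "?" + urllib.parse.urlencode(params)


def ids_from_response(raw_json):
    """Extract the id values from a JSON select response."""
    return [doc["id"] for doc in json.loads(raw_json)["response"]["docs"]]


def delete_by_ids(ids):
    """Build a JSON delete body for an explicit list of ids."""
    return json.dumps({"delete": ids})


# Canned response standing in for what the select URL would return.
sample = '{"response": {"numFound": 2, "start": 0, "docs": [{"id": "foo_fr"}, {"id": "foo_de"}]}}'
ids = ids_from_response(sample)
print(dependents_query_url("foo"))
print(delete_by_ids(["foo"] + ids))
```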

Re: weak documents

Posted by Thomas Scheffler <th...@uni-jena.de>.

On 27.11.2013 09:58, Paul Libbrecht wrote:
> Thomas,
>
> our experience with Curriki.org is that evaluating what I call the
> "related documents" is a procedure that needs access to the complete
> content and thus is run at the DB level and not the Solr level.
>
> For example, if a user changes a part of his name, we need to reindex
> all of his resources. Sure we could try to run a solr query for this,
> and maybe add index fields for it, but we felt it better to run this
> on the index-trigger side, the thing in our (XWiki) wiki which
> listens to changes and requests the reindexing of a few documents
> (including deletions).
>
> For the maintenance operation, the same issue has appeared. So, if
> the indexer or listener or solr has been down for a few minutes or
> hours, we'd need to reindex not only all changed documents but all
> changed documents and their related documents.
>
> If you are able to work through your solution that would be
> solr-only,  to write down all depends-on at index time, it means you
> would index-update all "inverse related" documents every time that
> changes. For the relation above (documents of a user), it means the
> user documents need reindexing every time a new document is added. I
> wonder if this makes a scale difference.

I think both use-cases differ a bit. At index time of my master document 
I have all the information about its dependent documents ready. So instead of 
committing one document I commit, let's say, four.

In your case you have to query to get all documents of a user first.

Here is a more detailed use-case. I have metadata in 1 to n languages to 
describe a document (e.g. journal article).

I commit a master document in a specified default language to SOLR and 
one document for every language I have metadata for. If a user adds or 
removes metadata (e.g. abstract in French) there is one document more or 
one document less in SOLR. So their number changes and I do not want stale 
data to be kept in the index.

A similar use case: I have article documents with authors. I create 
"author" documents for every article. If someone adds or removes an 
author I need to track that change. These "dumb" author documents are 
used for an alphabetical person index and hold a unique field that is 
used to group them, but these documents exist only as long as their 
master documents do.

My two use-cases are quite similar, so I would like to have this "weak" 
documents functionality somehow.

SOLR knows that if a document is added with id=foo, it has to replace a 
document that matches id:"foo". If I can change this behavior to 
dependsOn:"foo" I am done. :-D
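
Solr's replace-on-add only looks at the uniqueKey, so one way to approximate the behavior described above is to pair every re-index of a master with a delete-by-query on "dependsOn". The sketch below is one possible shape, assuming Solr's JSON update format; the function name and the "lang" and "title" fields are hypothetical illustrations, not part of any real schema from this thread.

```python
import json


def build_reindex_bodies(master, dependents, link_field="dependsOn"):
    """Return (delete_body, add_body) for replacing a master and all of its weak documents."""
    master_id = master["id"]
    escaped = master_id.replace("\\", "\\\\").replace('"', '\\"')
    # Remove the old master and whatever dependents are currently in the index.
    delete_body = json.dumps(
        {"delete": {"query": 'id:"%s" OR %s:"%s"' % (escaped, link_field, escaped)}}
    )
    # Each dependent gets a copy with the link field pointing at the master.
    linked = [dict(doc, **{link_field: master_id}) for doc in dependents]
    add_body = json.dumps([master] + linked)
    return delete_body, add_body


# Hypothetical article with metadata in a default language plus two translations.
master = {"id": "article1", "lang": "de", "title": "Beispiel"}
translations = [
    {"id": "article1_en", "lang": "en", "title": "Example"},
    {"id": "article1_fr", "lang": "fr", "title": "Exemple"},
]
delete_body, add_body = build_reindex_bodies(master, translations)
print(delete_body)  # send first
print(add_body)     # send second
```

The delete has to reach Solr before the add, so that dependents removed since the last indexing run (for example a dropped French abstract) do not survive. No prior query is needed, because the number of existing dependents does not have to be known.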

regards

Thomas

Re: weak documents

Posted by Paul Libbrecht <pa...@hoplahup.net>.

Thomas,

our experience with Curriki.org is that evaluating what I call the "related documents" is a procedure that needs access to the complete content and thus is run at the DB level and not the Solr level.

For example, if a user changes a part of his name, we need to reindex all of his resources. Sure we could try to run a solr query for this, and maybe add index fields for it, but we felt it better to run this on the index-trigger side, the thing in our (XWiki) wiki which listens to changes and requests the reindexing of a few documents (including deletions).

For the maintenance operation, the same issue has appeared.
So, if the indexer or listener or solr has been down for a few minutes or hours, we'd need to reindex not only all changed documents but all changed documents and their related documents.

If you are able to work through your solution that would be solr-only, to write down all depends-on at index time, it means you would index-update all "inverse related" documents every time that changes. For the relation above (documents of a user), it means the user documents need reindexing every time a new document is added. I wonder if this makes a scale difference.

Paul


Re: weak documents

Posted by Upayavira <uv...@odoko.co.uk>.

Just a guess, I haven't investigated them fully yet, but I wonder if
block joins could serve you here, as they involve creating docs in a
parent child relationship.

Or, you could easily fake it:

<delete>
  <id>abcd</id>
  <query>parent:abcd</query>
</delete>

Not sure if that syntax is completely right, but using that sort of
thing would get you there, for deletes, I think.

There isn't yet an update by query (batch update) feature, one that
would be very useful.

Upayavira
