Posted to solr-dev@lucene.apache.org by Chris Hostetter <ho...@fucit.org> on 2007/02/01 00:39:02 UTC

Re: loading many documents by ID

: Oh, and there have been numerous people interested in "updateable"
: documents, so it would be nice if that part was in the update handler.

We'd have to make it very clear that this only works if all fields are
STORED.



-Hoss


Re: loading many documents by ID

Posted by Yonik Seeley <yo...@apache.org>.
On 1/31/07, Erik Hatcher <er...@ehatchersolutions.com> wrote:
>
> On Jan 31, 2007, at 6:39 PM, Chris Hostetter wrote:
> > : Oh, and there have been numerous people interested in "updateable"
> > : documents, so it would be nice if that part was in the update
> > handler.
> >
> > We'd have to make it very clear that this only works if all fields are
> > STORED.
>
> That is perfectly reasonable, for sure.  And I would support an
> "update" feature issuing an exception if it detected this case.
>
> There is an important caveat to all fields being stored though... if
> an update was sending in updated fields for all the non-stored
> fields, and only stored fields were being copied internally, all
> would be fine too.

I think there might be two useful types of updates:
1) overwrite original field
2) add an additional value for a multi-valued field (useful for tagging?)


> I think eventually we could have this sort of feature internally copy
> the terms for non-stored fields somehow, but maybe that would only
> come along once Lucene supported something to facilitate this more?

Not unless you store more info (a lot more info).
We should also be able to copy unstored fields with term vectors stored.

ParallelReader might also hold some promise (putting a field to be
updated in a separate index).  The problem is that the Lucene ids need
to be kept in sync... I don't know how to do that w/o reindexing.

-Yonik

Re: loading many documents by ID

Posted by Erik Hatcher <er...@ehatchersolutions.com>.
On Jan 31, 2007, at 6:39 PM, Chris Hostetter wrote:
> : Oh, and there have been numerous people interested in "updateable"
> : documents, so it would be nice if that part was in the update  
> handler.
>
> We'd have to make it very clear that this only works if all fields are
> STORED.

That is perfectly reasonable, for sure.  And I would support an  
"update" feature issuing an exception if it detected this case.

There is an important caveat to all fields being stored though... if  
an update was sending in updated fields for all the non-stored  
fields, and only stored fields were being copied internally, all  
would be fine too.

I think eventually we could have this sort of feature internally copy  
the terms for non-stored fields somehow, but maybe that would only  
come along once Lucene supported something to facilitate this more?

	Erik



Re: loading many documents by ID

Posted by Ryan McKinley <ry...@gmail.com>.
> >
> > How about: Iterable<SolrDocument>
>
> Maybe... but that might not be the easiest for request handlers to
> use... they would then need to spin up a different thread and use a
> pull model (provide a new doc on demand) rather than push (call
> addDocument()).
>

With Iterable, you don't need to start a thread to implement a
'streaming' parser.  You can use an anonymous inner class that waits
until next() is called before reading the next row/line/document, etc.
In effect this lets the RequestHandler set up all the common
configurations and then lets the UpdateHandler ask for a document one
at a time.
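
A minimal sketch of the pull-style approach described above, assuming a
line-oriented input where each line stands in for one document.  The
class name LineDocuments and the use of String in place of SolrDocument
are hypothetical, chosen only for illustration; nothing is read from the
input until the consumer asks for the next element.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Iterator;
import java.util.NoSuchElementException;

// Hypothetical helper: exposes a Reader as a lazily evaluated Iterable.
// Each line stands in for one document (a real handler would build a SolrDocument).
class LineDocuments {
  static Iterable<String> stream(Reader in) {
    final BufferedReader reader = new BufferedReader(in);
    // Single-use Iterable: the underlying reader can only be consumed once.
    return new Iterable<String>() {
      @Override
      public Iterator<String> iterator() {
        return new Iterator<String>() {
          private String next; // look-ahead buffer, filled on demand

          @Override
          public boolean hasNext() {
            if (next == null) {
              try {
                next = reader.readLine(); // reads only when the consumer asks
              } catch (IOException e) {
                throw new UncheckedIOException(e);
              }
            }
            return next != null;
          }

          @Override
          public String next() {
            if (!hasNext()) {
              throw new NoSuchElementException();
            }
            String doc = next;
            next = null;
            return doc;
          }
        };
      }
    };
  }

  public static void main(String[] args) {
    // The loop stands in for the UpdateHandler pulling one document at a time.
    for (String doc : stream(new StringReader("doc1\ndoc2\ndoc3"))) {
      System.out.println("adding " + doc);
    }
  }
}
```

No extra thread is involved: the parsing code runs inside hasNext(),
on the consumer's thread, each time another document is requested.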

What I like about this is that the code that loops through each row of
my SQL updater does not need to know *anything* about the
UpdateHandler.  I would rather not call updater.addDoc( cmd ) within
the while( rs.next() )  loop.  This makes it much cleaner and easier
to test.

If writing a 'streaming' Iterable is more trouble than someone wants
to go through, they can easily return a Collection<SolrDocument> or an
array with a single element.


> When I'm coding, the design tends to morph a lot.
>

mine too!


> I think we need to figure out what type of update semantics we want
> w.r.t. adding multiple documents, and all the other misc autocommit
> params.
>

Right now, what I am working with is an 'update' command to which you
can pass a mode for each field.  If no modes are specified (or they
are all OVERWRITE) it behaves exactly as we have now (SQL REPLACE).
If any field uses something other than OVERWRITE, it behaves like an
SQL INSERT ... ON DUPLICATE KEY UPDATE.

Re: loading many documents by ID

Posted by Yonik Seeley <yo...@apache.org>.
On 2/1/07, Ryan McKinley <ry...@gmail.com> wrote:
> >
> > Not sure... depends on how update handlers will use it...
>
> by update handler, you mean UpdateRequestHandler(s)? or UpdateHandler?

Both.

> > One thing we might not want to get rid of though is streaming
> > (constructing and adding a document, then discarding it).  People are
> > starting to add a lot of documents in a single XML request, and this
> > will be much larger for CSV/SQL.
> >
>
> So you are uncomfortable with the Collection because you would have to
> load all the documents before indexing them.  If this was many, it
> could be a problem...
>
> If UpdateHandler is going to take care of stuff like autocommit and
> modifying documents, it seems best to have that apply to all the
> documents you are going to modify as a unit.  For example, say I have
> a SQL updater that will modify 100,000 documents incrementing field
> 'count_*' and replacing 'fl_*'.  If the DocumentCommand only applies
> to a single document, it would have to match each field as it went
> along rather than once when it starts.
>
> How about: Iterable<SolrDocument>

Maybe... but that might not be the easiest for request handlers to
use... they would then need to spin up a different thread and use a
pull model (provide a new doc on demand) rather than push (call
addDocument()).

 I'm really just thinking a little out loud... just first impressions
- don't read too much into it.
When I'm coding, the design tends to morph a lot.

I think we need to figure out what type of update semantics we want
w.r.t. adding multiple documents, and all the other misc autocommit
params.

-Yonik

Re: loading many documents by ID

Posted by Ryan McKinley <ry...@gmail.com>.
>
> Not sure... depends on how update handlers will use it...

by update handler, you mean UpdateRequestHandler(s)? or UpdateHandler?

> One thing we might not want to get rid of though is streaming
> (constructing and adding a document, then discarding it).  People are
> starting to add a lot of documents in a single XML request, and this
> will be much larger for CSV/SQL.
>

So you are uncomfortable with the Collection because you would have to
load all the documents before indexing them.  If this was many, it
could be a problem...

If UpdateHandler is going to take care of stuff like autocommit and
modifying documents, it seems best to have that apply to all the
documents you are going to modify as a unit.  For example, say I have
a SQL updater that will modify 100,000 documents incrementing field
'count_*' and replacing 'fl_*'.  If the DocumentCommand only applies
to a single document, it would have to match each field as it went
along rather than once when it starts.

How about: Iterable<SolrDocument>

this way, an UpdateRequestHandler can start the UpdateHandler running
while it streams each document from XML/CSV/SQL

ryan

Re: loading many documents by ID

Posted by Yonik Seeley <yo...@apache.org>.
On 2/1/07, Ryan McKinley <ry...@gmail.com> wrote:
> I am (was?) using DISTINCT to say, only add the unique fields.  As
> implemented, it keeps a Collection<String> for each field name.  If
> the 'mode' is 'DISTINCT' the collection is Set<String>, otherwise
> List<String>

Ah, OK... that does seem useful.

> How would you feel about an interface like this:

Not sure... depends on how update handlers will use it...
One thing we might not want to get rid of though is streaming
(constructing and adding a document, then discarding it).  People are
starting to add a lot of documents in a single XML request, and this
will be much larger for CSV/SQL.

For that reason, I'm not sure how often the "Collection" part will be utilized.

I like it OK on the conceptual level though.

-Yonik

> public class IndexDocumentsCommand
> {
>   public enum MODE {
>     APPEND,    // add the fields to existing fields
>     OVERWRITE, // overwrite existing fields
>     INCREMENT, // increment existing field
>     DISTINCT   // same as APPEND, but make sure there are distinct values
>   };
>
>   // optional id in "internal" indexed form... if it is needed and not supplied,
>   // it will be obtained from the doc.
>   public String indexedId;
>
>   public Collection<SolrDocument> docs;
>   public boolean allowDups;
>   public boolean overwrite;
>   // What to do for each field.  We should support *
>   public SimpleOrderedMap<MODE> modifyFieldMode;
>   // make sure these documents are committed within this much time
>   public int commitMaxTime = -1;
> }

Re: loading many documents by ID

Posted by Ryan McKinley <ry...@gmail.com>.
>
> > REPLACE_DOCUMENT
> > REPLACE_FIELDS
> > REPLACE_DISTINCT_FIELDS
> > ADD_FIELDS
> > ADD_DISTINCT_FIELDS
>
> What does "distinct" mean in this context?
>

I am (was?) using DISTINCT to say, only add the unique fields.  As
implemented, it keeps a Collection<String> for each field name.  If
the 'mode' is 'DISTINCT' the collection is Set<String>, otherwise
List<String>
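
A small sketch of that choice, assuming a hypothetical factory method
forMode(); LinkedHashSet is also an assumption made here so that the
distinct values keep their insertion order.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.LinkedHashSet;

class FieldValueCollections {
  // Hypothetical factory: DISTINCT fields get a Set, everything else a List.
  // LinkedHashSet is an assumption of this sketch, used to preserve insertion order.
  static Collection<String> forMode(boolean distinct) {
    if (distinct) {
      return new LinkedHashSet<String>();
    }
    return new ArrayList<String>();
  }

  public static void main(String[] args) {
    Collection<String> tags = forMode(true);
    tags.add("solr");
    tags.add("lucene");
    tags.add("solr"); // duplicate is silently dropped by the Set
    System.out.println(tags); // prints [solr, lucene]
  }
}
```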


> There is a lot of processing going on inside Document Builder.
> Once you get to the UpdateCommand, you have already lost some
> information (copyFields have executed, some things have been converted
> to index form, etc).
>

I noticed that!  It made sense when I was implementing this in a
RequestHandler, but it gets a little wonky inside the UpdateHandler -
as you said, copyFields already executed.

I think the best thing is to make a new command that does not directly
take a lucene document as its input.  perhaps:

http://svn.lapnap.net/solr/solrj/src/org/apache/solr/client/solrj/SolrDocument.java
http://svn.lapnap.net/solr/solrj/src/org/apache/solr/client/solrj/impl/SimpleSolrDoc.java

Then the UpdateHandler would open the DocumentBuilder and merge the
existing document with the passed-in document using whatever method
is specified.


> I would think one would also want to specify things per field.
>
> - append this value to this field
> - increment the value of this field
> - append this value to this field
> - overwrite this field
>

How would you feel about an interface like this:


public class IndexDocumentsCommand
{
  public enum MODE {
    APPEND,    // add the fields to existing fields
    OVERWRITE, // overwrite existing fields
    INCREMENT, // increment existing field
    DISTINCT   // same as APPEND, but make sure there are distinct values
  };

  // optional id in "internal" indexed form... if it is needed and not supplied,
  // it will be obtained from the doc.
  public String indexedId;

  public Collection<SolrDocument> docs;
  public boolean allowDups;
  public boolean overwrite;
  // What to do for each field.  We should support *
  public SimpleOrderedMap<MODE> modifyFieldMode;
  // make sure these documents are committed within this much time
  public int commitMaxTime = -1;
}


ryan

Re: loading many documents by ID

Posted by Yonik Seeley <yo...@apache.org>.
On 2/1/07, Ryan McKinley <ry...@gmail.com> wrote:
> I have something working that adds a 'mode' to AddUpdateCommand.  The
> modes I need are:

Feel free to suggest replacements for the UpdateCommand classes if
things become cumbersome.

> REPLACE_DOCUMENT
> REPLACE_FIELDS
> REPLACE_DISTINCT_FIELDS
> ADD_FIELDS
> ADD_DISTINCT_FIELDS

What does "distinct" mean in this context?

There is a lot of processing going on inside Document Builder.
Once you get to the UpdateCommand, you have already lost some
information (copyFields have executed, some things have been converted
to index form, etc).

I would think one would also want to specify things per field.

- append this value to this field
- increment the value of this field
- append this value to this field
- overwrite this field

CSV/SQL handlers could define these per-field for multiple docs
(column) for a request.

XML could define per-field instance if we want, or we might want to
restrict per field (column) for a single request.

-Yonik

Re: loading many documents by ID

Posted by Yonik Seeley <yo...@apache.org>.
On 2/3/07, Chris Hostetter <ho...@fucit.org> wrote:
> the schema creator should still have some say in what kinds of things are
> allowed/disallowed though -- the person doing the "update" may not fully
> understand the underlying model.

I think the two concerns should be separated:
1) updateable docs implementation
2) constraint checking

IMO, it's unnecessary to link these features, and requiring (2) will
just delay (1).
(2) should also cover things unrelated to updateable docs, such as
mandatory fields (say someone changes the schema and can't provide a
default, and they want the clients to fail until they change).

-Yonik

Re: loading many documents by ID

Posted by Chris Hostetter <ho...@fucit.org>.
: I agree.  I started down that path, and it gets pretty ugly.  I
: stopped.  I have opted for a syntax that 'updates' all stored fields,
: but lets you say explicitly what to do for each field.  If there is a
: stored field you want to skip, you can specify that in the command rather
: than in the schema.

the schema creator should still have some say in what kinds of things are
allowed/disallowed though -- the person doing the "update" may not fully
understand the underlying model.

: > another simple approach would be to make "updatability" a property of the
: > schema, that can contain a few different values...

: This is an interesting idea, but (if i'm understanding your suggestion
: correctly) it seems like TOO big of change from the existing schema.

the schema.xml format wouldn't change much .. just a new attribute on the
<schema> tag ... the existing example schema would either be labeled
"loose"  or "none" and we could provide another example of "strict" ... or
we would label it "strict" and remove the references to indexed/stored and
only mention them in comments describing other things you can do if you
don't require the ability to mutate documents.

: think throwing an error if there are no stored fields is reasonable
: and only updating stored fields is simple enough logic I don't think
: we need to over complicate it.

throwing an error if there are no stored fields in the schema, or no
stored fields in the existing document, or no stored fields in mutate
request?

what if the document just doesn't have any stored fields because the first
time it was added, the stored fields weren't known yet?

what if the document does have stored fields, but it also has an indexed
but not stored field, and the person doing the update doesn't realize
that and doesn't send a replacement value for that field?

: > another approach i don't really have fully fleshed out in my head would be
: > to introduce a concept of "fieldsets" ... an update that
: > sets/appends/increments a field in a fieldset which does not provide a
:
: I may be working on this, but not sure if it is what you are saying.  I have:

no, i was thinking of it as a new bit of syntax in the schema ... after
defining all of your <field>s you have some <fieldset>s and any time you
update a doc, and mutate a field (either overwrite, append,
increment, whatever) which is in some <fieldset>s then you have to also
provide a new value for any non-stored field also listed in those
fieldset.

in a simple schema, you'd only need one <fieldset> and it would list every
field (we'd probably even want a simple syntactic alias for that) but in
more complex schemas where you want Solr to provide some sanity checking
on your docs, but you frequently have different "types" of docs in your
schema with different sets of common overlapping fields - the <fieldset>s
are your way of telling Solr when to complain.
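
a rough sketch of the fieldset rule, with the caveat that everything in
it is hypothetical: the class FieldsetCheck, the method missingFields(),
and the idea of passing fieldsets around as plain sets of field names
are assumptions for illustration, not an existing Solr API.

```java
import java.util.Collection;
import java.util.Collections;
import java.util.Set;
import java.util.TreeSet;

class FieldsetCheck {
  // Returns the non-stored fields the update should have supplied but did not.
  // fieldsets: groups of field names declared together (hypothetical schema syntax)
  // nonStored: fields that are indexed but not stored, so they cannot be copied
  // provided:  fields the update request supplies values for
  static Set<String> missingFields(Collection<Set<String>> fieldsets,
                                   Set<String> nonStored,
                                   Set<String> provided) {
    Set<String> missing = new TreeSet<String>();
    for (Set<String> fieldset : fieldsets) {
      if (Collections.disjoint(fieldset, provided)) {
        continue; // the update does not touch this fieldset
      }
      for (String field : fieldset) {
        if (nonStored.contains(field) && !provided.contains(field)) {
          missing.add(field);
        }
      }
    }
    return missing;
  }
}
```

an update would be rejected whenever the returned set is non-empty; a
fieldset the update never touches contributes nothing to it.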

:   public enum FieldMODE {
:     APPEND,    // add the fields to existing fields
:     OVERWRITE, // overwrite existing fields
:     INCREMENT, // increment existing field.  Must be a number!
:     DISTINCT,  // same as APPEND, but make sure there are distinct values
:     IGNORE     // ignore the previous value -- don't copy it

as i understand it, these are options specified by the client triggering
the "mutate doc" command right? ... they totally make sense, but they
don't really address what Solr should do if the command doesn't mention a
field which is in the schema.

the use case i'm thinking about is an existing solr index with lots of
clients from different parts of a company adding/mutating documents, and
then the schema needs to change.  the Schema Owner should have some way of
saying what happens if one of those clients attempts to mutate a document
and doesn't provide a replacement value for an indexed/unstored field --
but there's no easy/fast way for the UpdateHandler to realize that a given
document has indexed values for that field -- hence either some simple
broad rules the schema owner can put in about the schema as a whole, or
sets of fields the schema owner can define: (if they try to mutate x, y,
or z, then they better be providing a, b and c because they are all used
together)

: default mode.  I have not tried to tackle dynamic fields yet...  it
: seems a bit more complicated!

yeah .. that's what i'm worried about with the fieldset idea too.

It's one of the reasons why it might be a good idea to just say:

  * if you want to be able to mutate docs, and you want to be guaranteed it
will always work, then every indexed field must be stored.

  * if you want to be able to mutate docs, and you can't feasibly store
every indexed field; then add this one line to your schema.xml and Solr
will trust that the clients sending mutate requests know what they are
doing.

  * if you don't trust your clients to know what they are doing when
mutating documents, add this one line to your schema and Solr will reject
any attempt to mutate a document (only wholesale document replacement will
be allowed)



-Hoss


Re: loading many documents by ID

Posted by Ryan McKinley <ry...@gmail.com>.
>
> 1) regardless of the verb (updatable/modifiable) i'm not sure that it
> makes sense to annotate in the schema the fields that should be copied on
> update, and not label the fields that must be "set" on update (ie: the
> fields that cannot be copied)

I agree.  I started down that path, and it gets pretty ugly.  I
stopped.  I have opted for a syntax that 'updates' all stored fields,
but lets you say explicitly what to do for each field.  If there is a
stored field you want to skip, you can specify that in the command rather
than in the schema.



> another simple approach would be to make "updatability" a property of the
> schema, that can contain a few different values...
>  "strict" - indexed and stored are no longer valid field(type)
>             attributes -- all fields are indexed and stored. all fields
>             are copied on "update" unless the update command includes
>             instructions to replace, append or increment the field value
>   "loose" - indexed/stored still exist, any attempt to "update" an
>             existing document is legal, all stored fields are copied
>             on update unless the update command includes instructions
>             to replace, append or increment the field value.
>    "none" - any attempt to update will fail.
>

This is an interesting idea, but (if i'm understanding your suggestion
correctly) it seems like TOO big of a change from the existing schema.

The more I think about the 'error' behavior, I am convinced we just
need solid, easily explainable logic for what happens and why.  I
think throwing an error if there are no stored fields is reasonable
and only updating stored fields is simple enough logic I don't think
we need to over complicate it.


> another approach i don't really have fully fleshed out in my head would be
> to introduce a concept of "fieldsets" ... an update that
> sets/appends/increments a field in a fieldset which does not provide a

I may be working on this, but not sure if it is what you are saying.  I have:

public class IndexDocumentCommand
{
  public enum FieldMODE {
    APPEND,    // add the fields to existing fields
    OVERWRITE, // overwrite existing fields
    INCREMENT, // increment existing field.  Must be a number!
    DISTINCT,  // same as APPEND, but make sure there are distinct values
    IGNORE     // ignore the previous value -- don't copy it
  };

  public Iterable<SolrDocument> docs;
  public Map<String,FieldMODE> fieldMode; // What to do for each field.
  public int commitMaxTime = -1;
}

If fieldMode is null or they are all OVERWRITE, the addDoc command
behaves as it always has.  Otherwise, it first extracts the existing
stored values (unless the fieldMode is IGNORE) then applies the new
document's values on top of the old ones.
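
A minimal sketch of that per-field merge, under stated assumptions: the
class FieldMerge and its merge() method are hypothetical, values are
kept as plain strings, INCREMENT is assumed to act on a single integer
value, and OVERWRITE is assumed to keep the stored value when the new
document says nothing about the field (IGNORE drops it instead).

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashSet;
import java.util.List;

class FieldMerge {
  enum FieldMode { APPEND, OVERWRITE, INCREMENT, DISTINCT, IGNORE }

  // Hypothetical merge of one field: 'stored' holds the existing stored values,
  // 'incoming' the values supplied by the new document.
  static List<String> merge(FieldMode mode, List<String> stored, List<String> incoming) {
    switch (mode) {
      case APPEND: {
        List<String> all = new ArrayList<String>(stored);
        all.addAll(incoming);
        return all;
      }
      case DISTINCT: {
        // Same as APPEND, but duplicates are dropped (first occurrence wins).
        LinkedHashSet<String> unique = new LinkedHashSet<String>(stored);
        unique.addAll(incoming);
        return new ArrayList<String>(unique);
      }
      case INCREMENT: {
        if (incoming.isEmpty()) {
          return new ArrayList<String>(stored); // nothing to add
        }
        // Assumption: single-valued integer field; a missing stored value counts as 0.
        long base = stored.isEmpty() ? 0L : Long.parseLong(stored.get(0));
        long delta = Long.parseLong(incoming.get(0));
        return Collections.singletonList(Long.toString(base + delta));
      }
      case IGNORE:
        // The previous value is never copied forward.
        return new ArrayList<String>(incoming);
      case OVERWRITE:
      default:
        // Assumption: with no new value, the stored value is copied unchanged.
        return new ArrayList<String>(incoming.isEmpty() ? stored : incoming);
    }
  }
}
```

For example, merging stored values [a, b] with new values [b, c] gives
[a, b, b, c] under APPEND and [a, b, c] under DISTINCT.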

Currently I am only handling wildcard substitution for "*" - the
default mode.  I have not tried to tackle dynamic fields yet...  it
seems a bit more complicated!

Re: loading many documents by ID

Posted by Chris Hostetter <ho...@fucit.org>.
: 1.  Set the "updateable" fields explicitly in the schema.
: <field name="name" type="text" updateable="true" indexed="true" stored="true"/>
:
: * throw an exception at startup if an updateable field is not stored.
: If somewhere down the road we figure out how to efficiently handle
: unstored fields, we can remove this error.
: * when 'updating', only copy the fields marked 'updateable'
: * If someone sends an 'update' request and there are no fields marked
: updateable, return an error

i have two concerns:

1) regardless of the verb (updatable/modifiable) i'm not sure that it
makes sense to annotate in the schema the fields that should be copied on
update, and not label the fields that must be "set" on update (ie: the
fields that cannot be copied)

2) Solr makes it very easy to support different "classes" of documents
that use different subsets of the fields in the schema -- some of which
may overlap.  if we assume that it's okay to allow an "update" of a
document because there's at least one field in the schema that is stored,
we won't catch cases where that one field isn't used for that "type" of
document.

a simple way to go that wouldn't catch all user mistakes, but could be
confident it never errored incorrectly would be to assume that any doc can
be "updated" as long as it has at least one stored field -- that's the
simplest possible use case after all, that i want to modify a doc in place,
replacing all of the indexed but unstored values with new values, and i only
want the stored fields to be copied over again unchanged.

another simple approach would be to make "updatability" a property of the
schema, that can contain a few different values...
 "strict" - indexed and stored are no longer valid field(type)
            attributes -- all fields are indexed and stored. all fields
            are copied on "update" unless the update command includes
            instructions to replace, append or increment the field value
  "loose" - indexed/stored still exist, any attempt to "update" an
            existing document is legal, all stored fields are copied
            on update unless the update command includes instructions
            to replace, append or increment the field value.
   "none" - any attempt to update will fail.

...novice users who want updatability should use strict, more experienced
users who want updatability but smaller index sizes and understand the
issues with fields that are indexed but unstored can use loose.
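
a rough sketch of how the three values might be enforced.  all names
here (UpdatabilityPolicy, checkField, checkUpdateAllowed) are
hypothetical, and "strict" is approximated by rejecting any field that
is not both indexed and stored, rather than by removing the attributes
from the schema syntax.

```java
class UpdatabilityPolicy {
  // The three schema-level values proposed above.
  enum Updatability { STRICT, LOOSE, NONE }

  // Schema load time. Approximates "strict" by rejecting any field that is
  // not both indexed and stored (an assumption of this sketch).
  static void checkField(Updatability level, String name, boolean indexed, boolean stored) {
    if (level == Updatability.STRICT && !(indexed && stored)) {
      throw new IllegalStateException(
          "strict updatability requires field '" + name + "' to be indexed and stored");
    }
  }

  // Update time. Only "none" rejects the request outright.
  static void checkUpdateAllowed(Updatability level) {
    if (level == Updatability.NONE) {
      throw new UnsupportedOperationException(
          "schema does not allow documents to be updated in place");
    }
  }
}
```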

another approach i don't really have fully fleshed out in my head would be
to introduce a concept of "fieldsets" ... an update that
sets/appends/increments a field in a fieldset which does not provide a
value for any unstored fields in that fieldset could trigger an error ...
that would help with the different 'classes' of documents, but i'm not
sure if it could really work with dynamicFields.



-Hoss


Re: loading many documents by ID

Posted by Walter Underwood <wu...@netflix.com>.
On 2/1/07 10:55 AM, "Ryan McKinley" <ry...@gmail.com> wrote:
> 
> Is there a better word than 'update'? It seems there is already enough
> confusion between UpdateHandlers, "Update Plugins",
> UpdateRequestHandler etc.

Try "modify". Solr uses "update" to include "add".

wunder



Re: loading many documents by ID

Posted by Ryan McKinley <ry...@gmail.com>.
What I think I'm seeing is two validation options:

1.  Set the "updateable" fields explicitly in the schema.
<field name="name" type="text" updateable="true" indexed="true" stored="true"/>

* throw an exception at startup if an updateable field is not stored.
If somewhere down the road we figure out how to efficiently handle
unstored fields, we can remove this error.
* when 'updating', only copy the fields marked 'updateable'
* If someone sends an 'update' request and there are no fields marked
updateable, return an error

2. Assume all stored fields that are not copied to are 'updateable'
* return an error if someone sends an 'update' request and there are
no stored fields

I vote for option #1 -- although most configurations that want to
'update' fields will probably mark all stored fields as 'updateable',
it seems valuable to make the schema designer explicitly specify what
will happen on an 'update'
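
A minimal sketch of the option #1 checks, assuming a hypothetical
FieldDef class standing in for a parsed <field> element; none of these
names exist in Solr, and the 'updateable' attribute is only the
proposal above.

```java
import java.util.ArrayList;
import java.util.List;

class UpdateableFields {
  // Hypothetical stand-in for a schema field definition.
  static final class FieldDef {
    final String name;
    final boolean stored;
    final boolean updateable;

    FieldDef(String name, boolean stored, boolean updateable) {
      this.name = name;
      this.stored = stored;
      this.updateable = updateable;
    }
  }

  // Startup check: an updateable field must also be stored.
  static void validateAtStartup(List<FieldDef> fields) {
    for (FieldDef f : fields) {
      if (f.updateable && !f.stored) {
        throw new IllegalStateException("updateable field '" + f.name + "' must be stored");
      }
    }
  }

  // Update-time check: returns the names of the fields to copy from the
  // existing document, or fails if the schema marks none as updateable.
  static List<String> fieldsToCopy(List<FieldDef> fields) {
    List<String> names = new ArrayList<String>();
    for (FieldDef f : fields) {
      if (f.updateable) {
        names.add(f.name);
      }
    }
    if (names.isEmpty()) {
      throw new IllegalStateException("no fields are marked updateable");
    }
    return names;
  }
}
```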

- - - - - - -

Is there a better word than 'update'? It seems there is already enough
confusion between UpdateHandlers, "Update Plugins",
UpdateRequestHandler etc.

In this case "update" makes sense as it is the SQL equivalent.

- - - - - - -

I have something working that adds a 'mode' to AddUpdateCommand.  The
modes I need are:

REPLACE_DOCUMENT
REPLACE_FIELDS
REPLACE_DISTINCT_FIELDS
ADD_FIELDS
ADD_DISTINCT_FIELDS


ryan

Re: loading many documents by ID

Posted by Erik Hatcher <er...@ehatchersolutions.com>.
On Feb 1, 2007, at 12:05 AM, Ryan McKinley wrote:
>> >
>> > We'd have to make it very clear that this only works if all  
>> fields are
>> > STORED.
>>
>> Isn't there some way to do this automatically instead of relying
>> on documentation? We might need to add something, maybe a
>> "required" attribute on fields, but a runtime error would be
>> much, much better than a page on the wiki.
>>
>
> what about copyField?
>
> With copyField, it is reasonable to have fields that are not stored
> and are generated from the other stored fields.  (this is what my
> setup looks like)

I would think copyFields would be exempt from the STORED mandate, and  
only the <field> definitions would matter for an update restriction.

	Erik


Re: loading many documents by ID

Posted by Walter Underwood <wu...@netflix.com>.
On 1/31/07 9:05 PM, "Ryan McKinley" <ry...@gmail.com> wrote:
>>> 
>>> We'd have to make it very clear that this only works if all fields are
>>> STORED.
>> 
>> Isn't there some way to do this automatically instead of relying
>> on documentation? We might need to add something, maybe a
>> "required" attribute on fields, but a runtime error would be
>> much, much better than a page on the wiki.
> 
> what about copyField?
> 
> With copyField, it is reasonable to have fields that are not stored
> and are generated from the other stored fields.  (this is what my
> setup looks like).

Mine, too. That is why I suggested explicit declarations in the
schema to say which fields are required.

wunder


Re: loading many documents by ID

Posted by Ryan McKinley <ry...@gmail.com>.
> >
> > We'd have to make it very clear that this only works if all fields are
> > STORED.
>
> Isn't there some way to do this automatically instead of relying
> on documentation? We might need to add something, maybe a
> "required" attribute on fields, but a runtime error would be
> much, much better than a page on the wiki.
>

what about copyField?

With copyField, it is reasonable to have fields that are not stored
and are generated from the other stored fields.  (this is what my
setup looks like)

Re: loading many documents by ID

Posted by Walter Underwood <wu...@netflix.com>.
On 1/31/07 3:39 PM, "Chris Hostetter" <ho...@fucit.org> wrote:
> 
> : Oh, and there have been numerous people interested in "updateable"
> : documents, so it would be nice if that part was in the update handler.
> 
> We'd have to make it very clear that this only works if all fields are
> STORED.

Isn't there some way to do this automatically instead of relying
on documentation? We might need to add something, maybe a
"required" attribute on fields, but a runtime error would be
much, much better than a page on the wiki.

wunder