Posted to dev@manifoldcf.apache.org by Matteo Grolla <m....@sourcesense.com> on 2014/06/13 18:52:40 UTC

processing document addition and deletion in order

Hi
	I'm going to develop a ManifoldCF connector, and one requirement is that it should be able to handle document insertions and deletions in order (details follow).
I've already built such a crawler as a standalone application, and the design was conceptually this:

instead of a Document Queue I have a CommandQueue:
	commands can be delete (specifying the docId) or add (specifying the doc to be added)
when a worker thread takes a delete, no other worker is allowed to pick other commands from the queue until the delete has been committed
	

Ex. suppose I have the following chunk of the CommandQueue:

add{doc1}, delete{doc1}, add{doc1}

I need to avoid the situation where commands are processed in this order: add{doc1}, add{doc1}, delete{doc1}
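
A minimal Java sketch of that standalone design (all class and method names are hypothetical, not ManifoldCF code): adds run in parallel, while a delete acts as a barrier, so no worker picks further commands from the queue until the delete has been committed.

import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.locks.ReentrantReadWriteLock;

/** Hypothetical sketch of the standalone crawler's CommandQueue. */
public class CommandQueue {

  /** A queued command: either add{docId} or delete{docId}. */
  public static final class Command {
    final boolean isDelete;
    final String docId;
    public Command(boolean isDelete, String docId) {
      this.isDelete = isDelete;
      this.docId = docId;
    }
  }

  private final BlockingQueue<Command> queue = new LinkedBlockingQueue<Command>();
  // Serializes taking commands off the queue.
  private final Object takeLock = new Object();
  // Adds share the read lock; a delete takes the write lock to drain in-flight adds.
  private final ReentrantReadWriteLock barrier = new ReentrantReadWriteLock();

  public void submit(Command c) throws InterruptedException {
    queue.put(c);
  }

  /** Each worker thread runs this loop. */
  public void workerLoop() throws InterruptedException {
    while (true) {
      Command c;
      synchronized (takeLock) {
        c = queue.take();
        if (c.isDelete) {
          // Hold takeLock so no other worker can pick further commands;
          // wait for the in-flight adds to finish, then delete and commit.
          barrier.writeLock().lock();
          try {
            deleteAndCommit(c.docId);
          } finally {
            barrier.writeLock().unlock();
          }
          continue;
        }
        // Acquire the shared lock before releasing takeLock, so a later
        // delete cannot overtake this add.
        barrier.readLock().lock();
      }
      try {
        addDocument(c.docId);        // adds proceed in parallel
      } finally {
        barrier.readLock().unlock();
      }
    }
  }

  private void addDocument(String docId) { /* index the document, e.g. post it to Solr */ }
  private void deleteAndCommit(String docId) { /* delete by id and commit */ }
}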


I think the EventSequence mechanism could help me implement this synchronization in Manifold:
when seeding the identifiers I could embed the command in the identifier.
Ex.
	instead of stuffing the identifier "hd-samsung-500GB"
	I could stuff "add hd-samsung-500GB"

The question is: am I running into huge trouble trying to implement this requirement, or not?

-- 
Matteo Grolla
Sourcesense - making sense of Open Source
http://www.sourcesense.com


Re: processing document addition and deletion in order

Posted by Matteo Grolla <m....@sourcesense.com>.
You perfectly described the situation.
If I could get sets of xml files where each set represents a snapshot of the source system state, then my crawler would fit the ManifoldCF design much better.
I'll see if that's possible. Concurrency can certainly be exploited better this way.
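
A rough sketch of that preprocessing idea (hypothetical code, not ManifoldCF API): replay the command stream in order into the current repository state, then write that state out as the set of xml files for ManifoldCF to crawl. With add{doc1}, delete{doc1}, add{doc1} the resulting snapshot contains doc1, as required.

import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch: replay commands in order into a snapshot of the source state. */
public final class SnapshotBuilder {

  public static final class Command {
    final boolean isDelete;
    final String docId;
    final String xmlContent;   // null for deletes
    public Command(boolean isDelete, String docId, String xmlContent) {
      this.isDelete = isDelete;
      this.docId = docId;
      this.xmlContent = xmlContent;
    }
  }

  /** Applies the commands in order; the resulting map is the snapshot. */
  public static Map<String, String> replay(List<Command> commands) {
    Map<String, String> snapshot = new LinkedHashMap<String, String>();
    for (Command c : commands) {
      if (c.isDelete) {
        snapshot.remove(c.docId);              // delete{docId}
      } else {
        snapshot.put(c.docId, c.xmlContent);   // add{docId}; the last add wins
      }
    }
    return snapshot;   // e.g. write one xml file per entry for the crawler to pick up
  }
}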

-- 
Matteo Grolla
Sourcesense - making sense of Open Source
http://www.sourcesense.com

On 13 Jun 2014, at 19:21, Karl Wright wrote:



Re: processing document addition and deletion in order

Posted by Karl Wright <da...@gmail.com>.
I see; so you are not crawling a repository but instead a sequence of
commands, and you don't know what the actual state of the "repository" is
until all the commands are processed.

ManifoldCF is not really designed to crawl sequentially-ordered commands.
If you can process the commands in sequence first into a "repository" of
your own construction, then ManifoldCF would be well-suited to picking
documents out of there.  I'm trying to think of a good way to do this
without actually doing that preprocessing step, but at the moment I'm
coming up with nothing useful.

Karl



On Fri, Jun 13, 2014 at 1:14 PM, Matteo Grolla <m....@sourcesense.com>
wrote:


Re: processing document addition and deletion in order

Posted by Matteo Grolla <m....@sourcesense.com>.
Hi Karl
	the reason is that if I read the commands in this order from the files

	add{doc1}, delete{doc1}, add{doc1}

	after the crawl I should find doc1 in Solr
	but if I process them in this order

	add{doc1}, add{doc1}, delete{doc1}

	there won't be doc1 in Solr after the crawl

The concern about sequential performance is right, but my use cases typically involve few deletions and lots of adds

	suppose I have

	add{doc1}, add{doc2}, add{doc3}, delete{doc1}, add{doc1}


	I could process	
	add{doc1}, add{doc2}, add{doc3} in parallel
	then delete{doc1}
	then proceed in parallel until the next delete
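
A rough sketch of that batching (hypothetical code): split the command stream at every delete, so each batch of adds can be processed in parallel and each delete runs on its own.

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch: deletes act as barriers between parallel batches of adds. */
public final class CommandBatcher {

  /**
   * For add{doc1}, add{doc2}, add{doc3}, delete{doc1}, add{doc1} this returns
   * [[add{doc1}, add{doc2}, add{doc3}], [delete{doc1}], [add{doc1}]].
   */
  public static List<List<String>> batch(List<String> commands) {
    List<List<String>> batches = new ArrayList<List<String>>();
    List<String> currentAdds = new ArrayList<String>();
    for (String command : commands) {
      if (command.startsWith("delete")) {
        if (!currentAdds.isEmpty()) {
          batches.add(currentAdds);              // adds so far: run in parallel
          currentAdds = new ArrayList<String>();
        }
        batches.add(Arrays.asList(command));     // the delete runs alone
      } else {
        currentAdds.add(command);
      }
    }
    if (!currentAdds.isEmpty()) {
      batches.add(currentAdds);                  // trailing adds
    }
    return batches;
  }
}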


-- 
Matteo Grolla
Sourcesense - making sense of Open Source
http://www.sourcesense.com

On 13 Jun 2014, at 19:06, Karl Wright wrote:



Re: processing document addition and deletion in order

Posted by Karl Wright <da...@gmail.com>.
One other point: if the reason that you would be trying to order things is
because you'd want to process the xml document before processing its
children, you don't need to worry about that at all either, because the
framework takes care of that automatically.  All you need to do is handle
the case where the carrydown data is not present.

Thanks,
Karl



On Fri, Jun 13, 2014 at 1:03 PM, Karl Wright <da...@gmail.com> wrote:


Re: processing document addition and deletion in order

Posted by Karl Wright <da...@gmail.com>.
Hi Matteo,

The prerequisite event logic is the only way to order document processing
in ManifoldCF.  The javadoc for the event methods is probably the best
reference to use.  I can't say from your description how it would map, but
here's the description in question:

/** This interface abstracts from the activities that use and govern events.
*
* The purpose of this model is to allow a connector to:
* (a) insure that documents whose prerequisites have not been met do not
get processed until those prerequisites are completed
* (b) guarantee that only one thread at a time deal with sequencing of
documents
*
* The way it works is as follows.  We define the notion of an "event",
which is described by a simple string (and thus can be global,
* local to a connection, or local to a job, whichever is appropriate).  An
event is managed solely by the connector that knows about it.
* Effectively it can be in either of two states: "completed", or
"pending".  The only time the framework ever changes an event state is when
* the crawler is restarted, at which point all pending events are marked
"completed".
*
* Documents, when they are added to the processing queue, specify the set
of events on which they will block.  If an event is in the "pending" state,
* no documents that block on that event will be processed at that time.  Of
course, it is possible that a document could be handed to processing just
before
* an event entered the "pending" state - in which case it is the
responsibility of the connector itself to avoid any problems or conflicts.
This can
* usually be handled by proper handling of event signalling.  More on that
later.
*
* The presumed underlying model of flow inside the connector's processing
method is as follows:
* (1) The connector examines the document in question, and decides whether
it can be processed successfully or not, based on what it knows about
sequencing
* (2) If the connector determines that the document can properly be
processed, it does so, and that's it.
* (3) If the connector finds a sequencing-related problem, it:
*     (a) Begins an appropriate event sequence.
*     (b) If the framework indicates that this event is already in the
"pending" state, then some other thread is already handling the event, and
the connector
*          should abort processing of the current document.
*     (c) If the framework successfully begins the event sequence, then the
connector code knows unequivocably that it is the only thread processing
the event.
*         It should take whatever action it needs to - which might be
requesting special documents, for instance.  [Note well: At this time,
there is no way
*         to guarantee that special documents added to the queue are in
fact properly synchronized by this mechanism, so I recommend avoiding this
practice,
*         and instead handling any special document sequences without
involving the queue.]
*     (d) If the connector CANNOT successfully take the action it needs to
to push the sequence along, it MUST set the event back to the "completed"
state.
*         Otherwise, the event will remain in the "pending" state until the
next time the crawler is restarted.
*     (e) If the current document cannot yet be processed, its processing
should be aborted.
* (4) When the connector determines that the event's conditions have been
met, or when it determines that an event sequence is no longer viable and
has been
*     aborted, it must set the event status to "completed".
*
* In summary, a connector may perform the following event-related actions:
* (a) Set an event into the "pending" state
* (b) Set an event into the "completed" state
* (c) Add a document to the queue with a specified set of prerequisite
events attached
* (d) Request that the current document be requeued for later processing
(i.e. abort processing of a document due to sequencing reasons)
*
*/
public interface IEventActivity extends INamingActivity
{
  public static final String _rcsid = "@(#)$Id: IEventActivity.java 988245
2010-08-23 18:39:35Z kwright $";

  /** Begin an event sequence.
  * This method should be called by a connector when a sequencing event
should enter the "pending" state.  If the event is already in that state,
  * this method will return false, otherwise true.  The connector has the
responsibility of appropriately managing sequencing given the response
  * status.
  *@param eventName is the event name.
  *@return false if the event is already in the "pending" state.
  */
  public boolean beginEventSequence(String eventName)
    throws ManifoldCFException;

  /** Complete an event sequence.
  * This method should be called to signal that an event is no longer in
the "pending" state.  This can mean that the prerequisite processing is
  * completed, but it can also mean that prerequisite processing was
aborted or cannot be completed.
  * Note well: This method should not be called unless the connector is
CERTAIN that an event is in progress, and that the current thread has
  * the sole right to complete it.  Otherwise, race conditions can develop
which would be difficult to diagnose.
  *@param eventName is the event name.
  */
  public void completeEventSequence(String eventName)
    throws ManifoldCFException;

  /** Abort processing a document (for sequencing reasons).
  * This method should be called in order to cause the specified document
to be requeued for later processing.  While this is similar in some respects
  * to the semantics of a ServiceInterruption, it is applicable to only one
document at a time, and also does not specify any delay period, since it is
  * presumed that the reason for the requeue is because of sequencing
issues synchronized around an underlying event.
  *@param localIdentifier is the document identifier to requeue
  */
  public void retryDocumentProcessing(String localIdentifier)
    throws ManifoldCFException;


}
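
To make the flow above concrete, here is a rough connector-side sketch of steps (1)-(4). It is a fragment that would live inside a repository connector class; the event name and the helper methods are hypothetical, and only the three IEventActivity calls are the ones documented above.

  // Hypothetical connector-side pattern for the event-sequence flow.
  public void processDocumentWithSequencing(IEventActivity activities,
    String documentIdentifier)
    throws ManifoldCFException
  {
    // (1)-(2): if there is no sequencing problem, just process the document.
    if (!needsSequencing(documentIdentifier)) {
      ingest(documentIdentifier);
      return;
    }

    // (3a): begin the event sequence; an event is named by a simple string.
    String eventName = "mycompany:delete-barrier";   // hypothetical name

    if (!activities.beginEventSequence(eventName)) {
      // (3b): some other thread owns the event; requeue this document for later.
      activities.retryDocumentProcessing(documentIdentifier);
      return;
    }

    // (3c): this thread alone is driving the event sequence.
    boolean pushed = false;
    try {
      performBarrierWork(documentIdentifier);   // hypothetical, e.g. the delete + commit
      pushed = true;
    } finally {
      if (!pushed) {
        // (3d): we could not push the sequence along, so release the event
        // rather than leaving it "pending" until the next crawler restart.
        activities.completeEventSequence(eventName);
      }
    }

    // (4): the event's conditions have been met; mark it "completed".
    activities.completeEventSequence(eventName);
  }

  // Hypothetical helpers standing in for connector-specific logic.
  private boolean needsSequencing(String documentIdentifier) { return false; }
  private void ingest(String documentIdentifier) throws ManifoldCFException {}
  private void performBarrierWork(String documentIdentifier) throws ManifoldCFException {}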


As you can see, these constraints are significant and can cause
single-threaded behavior, so unless you've got a real requirement for
ordering, it's better not to do it.

Furthermore, the question of deletions is really not germane, because
ManifoldCF does not in fact order deletions at all.  They are done either
as a side-effect of document processing (when a document is discovered to
not be there anymore), or at the end of a job (when orphaned documents are
removed).  They are also deleted when the job that owns them is deleted.

Karl



On Fri, Jun 13, 2014 at 12:52 PM, Matteo Grolla <m....@sourcesense.com>
wrote:
