Posted to dev@manifoldcf.apache.org by Mark Lugert <ml...@yahoo.com> on 2013/05/16 22:19:32 UTC

CMIS Connector

Hi Karl,
 
I fixed the CMIS connector to include paging, so it now handles more than the 100 or so documents returned on the first page (the default in Alfresco).  I'll put that in a Jira ticket soon, once I've done more performance testing on it to make sure it's where it needs to be.
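
For readers following along, here is a minimal sketch of the kind of paging loop described above. It is not the connector's actual code: the class PagedFetchSketch, the fetchPage function and the in-memory fakeRepository are all hypothetical stand-ins, modeled loosely on CMIS-style maxItems / skipCount paging. Each identifier is handed to a sink as soon as its page arrives, so no more than one page is held at a time.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.BiFunction;
import java.util.function.Consumer;

// Hypothetical sketch: none of these names come from the CMIS connector.
final class PagedFetchSketch {

    /**
     * Walks a result set page by page. fetchPage.apply(skipCount, maxItems)
     * stands in for one repository query. Returns the number of identifiers
     * passed to the sink.
     */
    static int forEachDocument(BiFunction<Integer, Integer, List<String>> fetchPage,
                               int maxItems,
                               Consumer<String> sink) {
        if (maxItems <= 0) {
            throw new IllegalArgumentException("maxItems must be positive");
        }
        int skipCount = 0;
        while (true) {
            List<String> page = fetchPage.apply(skipCount, maxItems);
            for (String id : page) {
                sink.accept(id);
            }
            skipCount += page.size();
            if (page.size() < maxItems) {
                break; // a short (or empty) page means nothing is left
            }
        }
        return skipCount;
    }

    /** Stand-in repository that serves slices of an in-memory list. */
    static BiFunction<Integer, Integer, List<String>> fakeRepository(List<String> docs) {
        return (skip, max) -> {
            int from = Math.min(skip, docs.size());
            int to = Math.min(from + max, docs.size());
            return new ArrayList<>(docs.subList(from, to));
        };
    }

    public static void main(String[] args) {
        List<String> docs = new ArrayList<>();
        for (int i = 0; i < 250; i++) {
            docs.add("doc-" + i);
        }
        int seen = forEachDocument(fakeRepository(docs), 100, id -> { });
        System.out.println("documents visited: " + seen);
    }
}
```

Skip-count paging assumes the result order stays stable between requests; if documents are added or removed mid-crawl, pages can shift.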
 
However, I noticed that the seeding has to finish before the connector starts processing the documents.
 
Is there a reason for this?  It seems like you should be able to start processing as soon as the first identifier has been put in.
 
I ask because on my small test system I have 5k docs and it ran fully through those before starting to ingest.  What happens if it's a million docs?  I don't understand the reason for waiting, and if there isn't one, can we do something to avoid the wait?
 
thanks,
mark

Re: CMIS Connector

Posted by Karl Wright <da...@gmail.com>.

If you are talking about how to prevent seeding queries from returning way
too many results at once, there are a number of strategies.

First, as long as the results are streamed, it's actually not a problem.
The ISeedingActivity methods that add seeds to the database expect them to
come in one at a time.  They get batched internally into groups of 100 and
are sent to the database as they come in.
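
As an illustration only, the batching idea can be sketched like this. This is not ManifoldCF's internal code: SeedBatcherSketch, addSeed and flush are hypothetical names, and the writer callback stands in for a single database write.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical sketch of "accept seeds one at a time, write them in groups".
final class SeedBatcherSketch {
    private final int batchSize;
    private final Consumer<List<String>> writer; // stands in for one database write
    private final List<String> buffer = new ArrayList<>();

    SeedBatcherSketch(int batchSize, Consumer<List<String>> writer) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        this.batchSize = batchSize;
        this.writer = writer;
    }

    /** Called once per seed; writes out a batch whenever the buffer fills up. */
    void addSeed(String documentIdentifier) {
        buffer.add(documentIdentifier);
        if (buffer.size() >= batchSize) {
            flush();
        }
    }

    /** Writes whatever is buffered; call once more when seeding ends. */
    void flush() {
        if (buffer.isEmpty()) {
            return;
        }
        writer.accept(new ArrayList<>(buffer)); // copy, so the writer may keep the list
        buffer.clear();
    }

    public static void main(String[] args) {
        SeedBatcherSketch batcher = new SeedBatcherSketch(
            100, batch -> System.out.println("writing " + batch.size() + " seeds"));
        for (int i = 0; i < 250; i++) {
            batcher.addSeed("doc-" + i);
        }
        batcher.flush(); // the final partial batch
    }
}
```

With this shape the caller never holds more than one batch in memory, no matter how many seeds the query streams back.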

Second, if you can't guarantee in your connector that millions of documents
won't all hit memory at once, then you do have to "batch" them in some way.
In the Wiki connector, we did something like this - we asked for N documents
ordered alphabetically, and then asked for the next N starting at the last
document from the previous N.  Your mileage may well vary though.
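
A rough sketch of that "next N after the last one seen" pattern follows. It is not the Wiki connector's code: KeysetSeedingSketch, nextPage and seedAll are hypothetical, and a sorted in-memory set stands in for the repository. One assumption to note: this version starts strictly after the last name seen, so nothing is emitted twice.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.NavigableSet;
import java.util.TreeSet;
import java.util.function.Consumer;

// Hypothetical sketch; the sorted set plays the role of the remote repository.
final class KeysetSeedingSketch {

    /** Stand-in for "up to n names after `after`, in alphabetical order". */
    static List<String> nextPage(NavigableSet<String> repo, String after, int n) {
        NavigableSet<String> tail = (after == null) ? repo : repo.tailSet(after, false);
        List<String> page = new ArrayList<>();
        for (String name : tail) {
            if (page.size() == n) {
                break;
            }
            page.add(name);
        }
        return page;
    }

    /** Pulls pages of n until one comes back empty; returns how many names were seen. */
    static int seedAll(NavigableSet<String> repo, int n, Consumer<String> sink) {
        if (n <= 0) {
            throw new IllegalArgumentException("n must be positive");
        }
        String last = null;
        int count = 0;
        while (true) {
            List<String> page = nextPage(repo, last, n);
            if (page.isEmpty()) {
                break;
            }
            for (String name : page) {
                sink.accept(name);
                count++;
            }
            last = page.get(page.size() - 1); // resume after the last name seen
        }
        return count;
    }

    public static void main(String[] args) {
        NavigableSet<String> repo =
            new TreeSet<>(Arrays.asList("delta", "alpha", "echo", "bravo", "charlie"));
        seedAll(repo, 2, System.out::println);
    }
}
```

Unlike skip-count paging, resuming from the last key is not thrown off when documents are added or removed earlier in the ordering.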

Dates also seem like a reasonable way to slice, provided you have pretty
fine granularity.
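
To make the date idea concrete, here is a minimal sketch that divides the span between the oldest and latest modification times into equal windows, one seeding query per window. Everything in it is hypothetical (DateBucketSketch, Window, split), and it assumes millisecond granularity on the timestamps.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: splits [oldest, latest] into contiguous time windows.
final class DateBucketSketch {

    /** Half-open window: start inclusive, end exclusive. */
    static final class Window {
        final Instant start;
        final Instant end;

        Window(Instant start, Instant end) {
            this.start = start;
            this.end = end;
        }
    }

    static List<Window> split(Instant oldest, Instant latest, int buckets) {
        if (buckets <= 0) {
            throw new IllegalArgumentException("buckets must be positive");
        }
        if (latest.isBefore(oldest)) {
            throw new IllegalArgumentException("latest is before oldest");
        }
        // +1 ms so a document modified exactly at `latest` lands in the last window.
        long spanMillis = Duration.between(oldest, latest).toMillis() + 1;
        List<Window> windows = new ArrayList<>();
        for (int i = 0; i < buckets; i++) {
            long from = spanMillis * i / buckets;
            long to = spanMillis * (i + 1) / buckets;
            if (to > from) { // skip empty windows when the span is shorter than the bucket count
                windows.add(new Window(oldest.plusMillis(from), oldest.plusMillis(to)));
            }
        }
        return windows;
    }

    public static void main(String[] args) {
        Instant oldest = Instant.parse("2013-01-01T00:00:00Z");
        Instant latest = Instant.parse("2013-05-16T00:00:00Z");
        for (Window w : split(oldest, latest, 4)) {
            System.out.println(w.start + " .. " + w.end);
        }
    }
}
```

The windows are equal in time, not in document count, so a burst of modifications on one day can still produce one very large bucket.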

Karl




Re: CMIS Connector

Posted by Mark Lugert <ml...@yahoo.com>.

That makes sense.

Seems like seeding works best when it isn't just listing a bunch of docs, but working in buckets instead.

Any idea how to bucket things when doing queries?  Thinking maybe by date?  If we know the latest and oldest modified dates, we can take that gap and divide it by the bucket count?

Thoughts?

thanks,
mark


Re: CMIS Connector

Posted by Karl Wright <da...@gmail.com>.

There are two stages to the initial startup of a job.  The first stage is
getting the existing documents in the job to a proper state.  The second is
to do the first seeding pass.

For a job that is not continuous - i.e., there IS only one seeding pass - if
the seeding pass is interrupted in any way, then it must be retried in toto.
In other words, we have to guarantee that the seeding pass actually
finishes - a job is in a fundamentally different state if that doesn't
happen.  This would make the job state transition diagram pretty hairy if
we were to separate the initialization state from the seeding state.  I've
looked at it before and given up after some time at it.

Nevertheless, if you want to open a ticket, please go ahead, and maybe
someday I'll think of a good, robust way to do it without doubling the
complexity of the system.

Karl
