Posted to user@manifoldcf.apache.org by hank williams <ha...@gmail.com> on 2015/03/19 18:44:59 UTC

Hopefully a few simple questions about ManifoldCF and SharePoint

I am embarking on an effort for which ManifoldCF may be an appropriate
tool. I am a total noob, having just discovered this project, and I have a
few questions that I am hoping someone can answer so that I can begin to gain
some confidence about the way things work. Basically, I am trying to make
sure I understand, at a top level, how ManifoldCF works.

Our project involves a database that has a private, secure user space for
each user. Our database is built on Lucene and indexes every object in the
database. Each user presumably has some number of SharePoint sites that
they have access to. We want to index each SharePoint object (file or
SharePoint page) as we find it, for each user. The user then ends up with
an index of just the objects that they have permissions for. But to do
that we need to crawl, for each user, all of the SharePoint sites that they
have access to. Permissions to each SharePoint site are managed by Kerberos.

So the questions are:

a. Can I, with ManifoldCF, take a list of SharePoint sites and a list of users
with the relevant Kerberos authentication tokens or keys (I am just
learning about Kerberos), and get back a list of indexable objects/URIs
(HTML, .docx, .pptx, etc.)?

b. Is this the right way to think about it?

c. If so, is there any example code or documentation that would explain how
I do this?

d. Does ManifoldCF provide any information to help indicate whether a
given object has changed, or is that something we need to figure out by
manually comparing the old and new documents in our code?
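
Regarding question (d), a common alternative to comparing old and new
documents in full is to keep a short version string per URI and compare only
those between crawls. My understanding is that ManifoldCF repository
connectors work along similar lines, but the sketch below is not ManifoldCF
code: the function names, the URIs, and the choice of a SHA-256 digest plus a
last-modified stamp are all assumptions made for illustration.

```python
import hashlib


def fingerprint(content: bytes, modified: str) -> str:
    """Build a version string from a last-modified stamp and a content digest."""
    digest = hashlib.sha256(content).hexdigest()
    return f"{modified}:{digest}"


def changed_documents(previous: dict, current: dict) -> dict:
    """Classify URIs as added, changed, or deleted by comparing version strings.

    Both arguments map a document URI to the version string recorded for it.
    """
    return {
        "added": sorted(u for u in current if u not in previous),
        "changed": sorted(
            u for u in current if u in previous and current[u] != previous[u]
        ),
        "deleted": sorted(u for u in previous if u not in current),
    }


# Hypothetical crawl results for one user, before and after a re-crawl.
last_crawl = {
    "https://sharepoint.example/site/a.docx": fingerprint(b"draft 1", "2015-03-18"),
    "https://sharepoint.example/site/b.pptx": fingerprint(b"slides", "2015-03-18"),
}
this_crawl = {
    "https://sharepoint.example/site/a.docx": fingerprint(b"draft 2", "2015-03-19"),
    "https://sharepoint.example/site/c.html": fingerprint(b"<html/>", "2015-03-19"),
}
print(changed_documents(last_crawl, this_crawl))
```

Only the version strings need to be stored between crawls, so the per-user
index would be touched only for URIs that land in one of the three lists.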

Re: Hopefully a few simple questions about ManifoldCF and SharePoint

Posted by hank williams <ha...@gmail.com>.

Thanks Karl.

We're not indexing into Solr. It's our own technology. It sounds to me like
what we are really looking for is experience with "SharePoint-from-Java" and
with writing web apps that talk to SharePoint.

Well, I'll just keep looking.

Best,
Hank

On Mon, Mar 23, 2015 at 10:10 AM, Karl Wright <da...@gmail.com> wrote:

> Hi Hank,
>
> I can't really recommend any consulting firms specifically skilled with
> using bits and pieces of ManifoldCF to build a whole new solution.  If you
> are indexing into Solr, maybe you can contact a Solr consulting firm, e.g.
> LucidImagination etc.  You *could* try a firm like Zaizi (based in London),
> but I can't be sure they'd find the job amenable either.
>
> Karl
>

Re: Hopefully a few simple questions about ManifoldCF and SharePoint

Posted by Karl Wright <da...@gmail.com>.

Hi Hank,

I can't really recommend any consulting firms specifically skilled with
using bits and pieces of ManifoldCF to build a whole new solution.  If you
are indexing into Solr, maybe you can contact a Solr consulting firm, e.g.
LucidImagination etc.  You *could* try a firm like Zaizi (based in London),
but I can't be sure they'd find the job amenable either.

Karl

On Mon, Mar 23, 2015 at 9:43 AM, hank williams <ha...@gmail.com> wrote:

> Karl,
>
> At this point it seems like perhaps ManifoldCF may not be the right tool.
>
> I think the best solution is to have our server log into SharePoint using
> Kerberos or OAuth, and to provide our engine links to the content available
> to the logged-in user. This is, in essence, a single-user crawl of a
> SharePoint site, I guess (we are not interested in other data sources). From
> what I gather based on your responses, ManifoldCF wouldn't help much here,
> but this does not seem like an extraordinarily complicated task (at least
> from the perspective of someone who's never played with any of this stuff!).
>
> So my question is, is my assumption that it's not "an extraordinarily
> complicated task" correct, and if not, are there folks in the ManifoldCF
> community (or other communities) that you know of who might be available as
> consultants to create that module?
>
> Best,
> Hank
>

Re: Hopefully a few simple questions about ManifoldCF and SharePoint

Posted by hank williams <ha...@gmail.com>.

Karl,

At this point it seems like perhaps ManifoldCF may not be the right tool.

I think the best solution is to have our server log into SharePoint using
Kerberos or OAuth, and to provide our engine links to the content available
to the logged-in user. This is, in essence, a single-user crawl of a
SharePoint site, I guess (we are not interested in other data sources). From
what I gather based on your responses, ManifoldCF wouldn't help much here,
but this does not seem like an extraordinarily complicated task (at least
from the perspective of someone who's never played with any of this stuff!).

So my question is, is my assumption that it's not "an extraordinarily
complicated task" correct, and if not, are there folks in the ManifoldCF
community (or other communities) that you know of who might be available as
consultants to create that module?

Best,
Hank

On Thu, Mar 19, 2015 at 3:43 PM, Karl Wright <da...@gmail.com> wrote:

> "If output connectors have access to the access tokens then I am
> presuming a custom output connector could look and say, "oh this document
> is accessible to these specific people", but is that a reasonable
> assumption?"
>
> The problem is that you don't know what is in those access tokens.  If you
> knew beyond question that the only thing you'd ever index was stuff that
> (for instance) came from SharePoint, maybe you could make it work.  But if
> you add other connection types, then you'd have to modify your output
> connector for each one.
>
> The other thing you should think about is that usually access tokens
> correspond to *groups* of users rather than individual users.  There is no
> obvious mapping then that you can use to turn that into a list of
> corresponding users.  I believe that when the SharePoint connector is
> configured for "Active Directory" authorization, it maps to individual
> SIDs, but as you might expect the list of SIDs for a given document can be
> quite large, which is why we went to the SharePoint/Native authorization
> model as our default.
>
> Karl
>
>
> On Thu, Mar 19, 2015 at 2:43 PM, hank williams <ha...@gmail.com> wrote:
>
>> This is *super* helpful. I think perhaps I am seeing how to handle this.
>>
>> Regarding #2, since our database is proprietary, there would be no
>> existing output connection type so in any case we would need to create our
>> own.
>>
>> But #1 is clearly an issue. My first thought is that the answer would be
>> to just read everything (not limited by permissions) and then to use a
>> custom output connector to "place" copies in the right accounts. If output
>> connectors have access to the access tokens then I am presuming a custom
>> output connector could look and say, "oh this document is accessible to
>> these specific people", but is that a reasonable assumption?
>>
>>
>> On Thu, Mar 19, 2015 at 2:26 PM, Karl Wright <da...@gmail.com> wrote:
>>
>>> "So my question is, notwithstanding that this is not the "typical" way
>>> ManifoldCF works, can we use it in the way that I am describing. Is it
>>> malleable enough to work or is it designed to do something so different
>>> from what we need that it would be useless. I guess the key question is
>>> really, can we tell ManifoldCF to limit results to those visible to a
>>> specific user and would there be any performance or other unexpected
>>> downsides to doing that."
>>>
>>> Hi Hank,
>>>
>>> There is nothing specific about the ManifoldCF *framework* that prevents
>>> you from doing what you suggest.  But there are problems, as follows:
>>>
>>> (1) Most out-of-the-box repository connection types, including the
>>> SharePoint type, do not give you any ability to limit crawls to a specific
>>> user.  Instead, because they are intended to support a very different
>>> security model, they fetch a document's access tokens, which are described
>>> by the book chapter I pointed you to.
>>> (2) If you modified the SharePoint repository connection type in the
>>> manner you suggest, you would still need to create a custom output
>>> connection type to drop the content into your per-user database instances.
>>> The alternative would be to use an appropriate out-of-the-box output
>>> connection type, if there is one, and have N jobs for N users.
>>>
>>> Hope that answers your question.
>>>
>>> Karl
>>>
>>>
>>>
>>> On Thu, Mar 19, 2015 at 2:15 PM, hank williams <ha...@gmail.com>
>>> wrote:
>>>
>>>> Thanks Karl.
>>>>
>>>> I will most certainly be reading the document you linked to in great
>>>> detail. It looks like stuff I need to know.
>>>>
>>>> That said, we have a given technology that we have developed and that
>>>> we will be using. It creates a separate index for each user. The technology
>>>> has vastly greater utility than just for SharePoint, and it's been in
>>>> development for about six years. (In fact, this SharePoint thing is a
>>>> recent add-on request.)
>>>>
>>>> So my question is, notwithstanding that this is not the "typical" way
>>>> ManifoldCF works, can we use it in the way that I am describing. Is it
>>>> malleable enough to work or is it designed to do something so different
>>>> from what we need that it would be useless. I guess the key question is
>>>> really, can we tell ManifoldCF to limit results to those visible to a
>>>> specific user and would there be any performance or other unexpected
>>>> downsides to doing that.
>>>>
>>>> Hank
>>>>
>>>>
>>>> On Thu, Mar 19, 2015 at 1:53 PM, Karl Wright <da...@gmail.com>
>>>> wrote:
>>>>
>>>>> Hi Hank,
>>>>>
>>>>> "Our project involves a database that has a private secure user space
>>>>> for each user. Our database is built on Lucene and indexes every object in
>>>>> the database. Each user presumably has some number of SharePoint sites that
>>>>> they have access to. We want to index each sharepoint object (file or
>>>>> sharepoint page) as we find it, for each user. The user then ends up with
>>>>> an index of just the objects that they have permissions for. But to do
>>>>> that we need to, for each user, crawl all of the sharepoint sites that they
>>>>> have access to. Permissions to each sharepoint site are managed by
>>>>> Kerberos."
>>>>>
>>>>> This is not the typical ManifoldCF model.  In the typical case, there
>>>>> is ONE lucene search engine (not N), and any searches that take place apply
>>>>> security restrictions internally based on the user's security information,
>>>>> as obtained from the ManifoldCF authority service, which is in turn
>>>>> querying SharePoint.
>>>>>
>>>>> You can read more about the standard authorization setup here:
>>>>>
>>>>>
>>>>> https://github.com/DaddyWri/manifoldcfinaction/tree/master/pdfs/MCFiA%20CH%2004.pdf
>>>>>
>>>>> Karl
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On Thu, Mar 19, 2015 at 1:44 PM, hank williams <ha...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> I am embarking on an effort for which ManifoldCF may be an
>>>>>> appropriate tool. I am a total noob, having just discovered this project
>>>>>> and have a few questions that I am hoping someone can answer so that I can
>>>>>> begin to gain some confidence about the way things work. Basically I am
>>>>>> trying to make sure I understand, at a top level, how ManifoldCF works.
>>>>>>
>>>>>> Our project involves a database that has a private secure user space
>>>>>> for each user. Our database is built on Lucene and indexes every object in
>>>>>> the database. Each user presumably has some number of SharePoint sites that
>>>>>> they have access to. We want to index each sharepoint object (file or
>>>>>> sharepoint page) as we find it, for each user. The user then ends up with
>>>>>> an index of just the objects that they have permissions for. But to do
>>>>>> that we need to, for each user, crawl all of the sharepoint sites that they
>>>>>> have access to. Permissions to each sharepoint site are managed by
>>>>>> Kerberos.
>>>>>>
>>>>>> So the questions are:
>>>>>>
>>>>>> a. Can I, with ManifoldCF, take a list of sharepoint sites and a list of
>>>>>> users with the relevant Kerberos authentication tokens or keys (just
>>>>>> learning about Kerberos), and get back a list of indexable objects/URIs
>>>>>> (HTML, .docx, .pptx, etc.)?
>>>>>>
>>>>>> b. Is this the right way to think about it?
>>>>>>
>>>>>> c. If so, is there any example code or documentation that would
>>>>>> explain how I do this?
>>>>>>
>>>>>> d. Does ManifoldCF provide any information to help indicate whether
>>>>>> a given object has changed, or is that something we need to figure out by
>>>>>> manually comparing the old and new documents in our code?
>>>>>>
>>>>>
>>>>>
>>>>
>>>
>>
>

Re: Hopefully a few simple questions about ManifoldCF and SharePoint

Posted by Karl Wright <da...@gmail.com>.

"If output connectors have access to the access tokens then I am presuming
a custom output connector could look and say, "oh this document is
accessible to these specific people", but is that a reasonable assumption?"

The problem is that you don't know what is in those access tokens.  If you
knew beyond question that the only thing you'd ever index was stuff that
(for instance) came from SharePoint, maybe you could make it work.  But if
you add other connection types, then you'd have to modify your output
connector for each one.

The other thing you should think about is that usually access tokens
correspond to *groups* of users rather than individual users.  There is no
obvious mapping then that you can use to turn that into a list of
corresponding users.  I believe that when the SharePoint connector is
configured for "Active Directory" authorization, it maps to individual
SIDs, but as you might expect the list of SIDs for a given document can be
quite large, which is why we went to the SharePoint/Native authorization
model as our default.
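
To make the group-versus-user point concrete, here is a minimal sketch of
what a custom output connector would have to do to "place" documents in
per-user accounts. It is not ManifoldCF code: the function name, the token
strings, the URIs, and the group-membership table are all hypothetical, and
the sketch assumes allow tokens only, ignoring deny tokens entirely.

```python
from collections import defaultdict


def expand_tokens_to_users(doc_tokens, group_members):
    """Turn per-document access tokens into per-user document lists.

    doc_tokens maps a document URI to its list of allow tokens.
    group_members maps a token to the users it stands for.
    Returns (per_user, unresolved): per_user maps each user to a sorted list
    of URIs; unresolved lists tokens with no known membership.
    """
    per_user = defaultdict(set)
    unresolved = set()
    for doc, tokens in doc_tokens.items():
        for token in tokens:
            members = group_members.get(token)
            if members is None:
                # A token from a connection type we do not know how to decode.
                unresolved.add(token)
                continue
            for user in members:
                per_user[user].add(doc)
    return {u: sorted(d) for u, d in per_user.items()}, sorted(unresolved)


# Hypothetical tokens: two resolvable groups and one opaque token.
doc_tokens = {
    "https://sharepoint.example/site/a.docx": ["GROUP-1"],
    "https://sharepoint.example/site/b.pptx": ["GROUP-1", "GROUP-2", "OPAQUE-9"],
}
group_members = {
    "GROUP-1": ["alice", "bob"],
    "GROUP-2": ["carol"],
}
per_user, unresolved = expand_tokens_to_users(doc_tokens, group_members)
print(per_user)
print(unresolved)
```

The group_members table is the part that has no obvious source: it has to be
built and kept current separately for every connection type, and any token
that lands in the unresolved list cannot be placed in anyone's account.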

Karl



Re: A hopfully a few simple question about ManifoldCF and SharePoint

Posted by hank williams <ha...@gmail.com>.

This is *super* helpful. I think perhaps I am seeing how to handle this.

Regarding #2, since our database is proprietary, there would be no existing
output connection type, so in any case we would need to create our own.

But #1 is clearly an issue. My first thought is that the answer would be to
just read everything (not limited by permissions) and then to use a custom
output connector to "place" copies in the right accounts. If output
connectors have access to the access tokens then I am presuming a custom
output connector could look and say, "oh this document is accessible to
these specific people", but is that a reasonable assumption?



Re: A hopfully a few simple question about ManifoldCF and SharePoint

Posted by Karl Wright <da...@gmail.com>.

"So my question is, notwithstanding that this is not the "typical" way
ManifoldCF works, can we use it in the way that I am describing? Is it
malleable enough to work, or is it designed to do something so different
from what we need that it would be useless? I guess the key question is
really, can we tell ManifoldCF to limit results to those visible to a
specific user, and would there be any performance or other unexpected
downsides to doing that?"

Hi Hank,

There is nothing specific about the ManifoldCF *framework* that prevents
you from doing what you suggest.  But there are problems, as follows:

(1) Most out-of-the-box repository connection types, including the
SharePoint type, do not give you any ability to limit crawls to a specific
user.  Instead, because they are intended to support a very different
security model, they fetch a document's access tokens, which are described
by the book chapter I pointed you to.
(2) If you modified the SharePoint repository connection type in the manner
you suggest, you would still need to create a custom output connection type
to drop the content into your per-user database instances.  The alternative
would be to use an appropriate out-of-the-box output connection type, if
there is one, and have N jobs for N users.

Hope that answers your question.

Karl




Re: A hopfully a few simple question about ManifoldCF and SharePoint

Posted by hank williams <ha...@gmail.com>.

Thanks Karl.

I will most certainly be reading the document you linked to in great
detail. It looks like stuff I need to know.

That said, we have a given technology that we have developed and that we
will be using. It creates a separate index for each user. The technology
has vastly greater utility than just for SharePoint, and it has been in
development for about six years. (In fact, this SharePoint thing is a
recent add-on request.)

So my question is, notwithstanding that this is not the "typical" way
ManifoldCF works, can we use it in the way that I am describing? Is it
malleable enough to work, or is it designed to do something so different
from what we need that it would be useless? I guess the key question is
really, can we tell ManifoldCF to limit results to those visible to a
specific user, and would there be any performance or other unexpected
downsides to doing that?

Hank



Re: A hopfully a few simple question about ManifoldCF and SharePoint

Posted by Karl Wright <da...@gmail.com>.

Hi Hank,

"Our project involves a database that has a private secure user space for
each user. Our database is built on Lucene and indexes every object in the
database. Each user presumably has some number of SharePoint sites that
they have access to. We want to index each SharePoint object (file or
SharePoint page) as we find it, for each user. The user then ends up with
an index of just the objects that they have permissions for. But to do
that we need to, for each user, crawl all of the SharePoint sites that they
have access to. Permissions to each SharePoint site are managed by
Kerberos."

This is not the typical ManifoldCF model.  In the typical case, there is
ONE Lucene search engine (not N), and any searches that take place apply
security restrictions internally based on the user's security information,
as obtained from the ManifoldCF authority service, which is in turn
querying SharePoint.
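
As a rough illustration of that single-index model, the following Python sketch (a toy model, not ManifoldCF or Lucene code) filters search hits against the tokens a user holds. The IndexedDoc type, the visible_docs function, and the split into allow and deny tokens are assumptions made for the example. The user's tokens are assumed to have already been fetched from an authority service.

```python
# Hypothetical illustration only: a toy model of search-time security trimming.
from dataclasses import dataclass, field
from typing import FrozenSet, Iterable, List


@dataclass(frozen=True)
class IndexedDoc:
    """A document in the single shared index, stored with its access tokens."""
    uri: str
    allow_tokens: FrozenSet[str] = field(default_factory=frozenset)
    deny_tokens: FrozenSet[str] = field(default_factory=frozenset)


def visible_docs(docs: Iterable[IndexedDoc],
                 user_tokens: Iterable[str]) -> List[IndexedDoc]:
    """Keep documents where the user holds an allow token and no deny token."""
    held = set(user_tokens)
    return [
        d for d in docs
        if (d.allow_tokens & held) and not (d.deny_tokens & held)
    ]


if __name__ == "__main__":
    index = [
        IndexedDoc("http://sp/site/a.docx", frozenset({"g1"})),
        IndexedDoc("http://sp/site/b.pptx", frozenset({"g2"})),
    ]
    for doc in visible_docs(index, ["g1"]):
        print(doc.uri)
```

The point of the design is that each document is indexed once, and per-user visibility is decided at query time rather than by keeping N copies.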

You can read more about the standard authorization setup here:

https://github.com/DaddyWri/manifoldcfinaction/tree/master/pdfs/MCFiA%20CH%2004.pdf

Karl
