Posted to user@manifoldcf.apache.org by Jan Høydahl <ja...@cominvent.com> on 2011/05/24 14:57:22 UTC

Re-sending docs to output connector

Hi,

Is there an easy way to separate fetching from ingestion?
I'd like to first run a crawl for several days, and then feed it to my Solr output as fast as possible.
Also, after schema changes in Solr, there is a need to re-feed all docs.

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com


Re: Re-sending docs to output connector

Posted by Karl Wright <da...@gmail.com>.
More thoughts:

Including this functionality as a general feature of ManifoldCF would
allow one to use ManifoldCF as a repository of content in its own
right.  In this model, the data would probably be keyed by the output
connection name, and if integrated at this level it would in theory
work with any output connection.  The UI modifications would be
modest: additional buttons on the output connection view page to
re-feed documents to the connection rather than recrawl them.
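
As a concrete illustration, the keying might look something like the
sketch below.  This is hypothetical code, not an existing ManifoldCF
class, and the separator choice assumes connection names cannot
contain "|":

// Hypothetical sketch -- not existing ManifoldCF code.
public class CachedDocumentKey
{
  public final String outputConnectionName;
  public final String documentURI;

  public CachedDocumentKey(String outputConnectionName, String documentURI)
  {
    this.outputConnectionName = outputConnectionName;
    this.documentURI = documentURI;
  }

  /** Composite key: all documents for one output connection share a
      common prefix, so a re-feed can scan just that connection's slice. */
  public String toKeyString()
  {
    return outputConnectionName + "|" + documentURI;
  }
}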

Advantages: Would leverage multiple output connectors transparently,
and would support the "refeed everything to Solr" model.  Guaranteed
commit on the part of a target search engine would no longer be a
requirement.

Downsides: First, lots of storage would be required that probably can't
live in PostgreSQL, complicating the deployment model.  Second,
depending on the details of implementation, there may not be feedback
available at crawl time from the output connection about the
acceptability of a document for indexing.  Third, for many repository
connectors the benefit of reading from the file system might well be
zero.  Fourth, the entire process of keeping the target repository
managed properly is a manual one, and thus prone to errors.

Karl


Re: Re-sending docs to output connector

Posted by Karl Wright <da...@gmail.com>.
"On a "refeed from cache" request, send all objects to Solr - this
should probably be per Job, not per output connector"

This is where your proposal gets into trouble, I think.  There is no
infrastructure mechanism in ManifoldCF to do either of these things at
this time.  Connections are not aware of what jobs are using them, and
there is no way to send a signal to a connector to tell it to refeed,
nor is there a button in the crawler UI for it.  You're basically
proposing significant infrastructure changes in ManifoldCF to support
a missing feature in Solr, it seems to me.

Also, I'm pretty sure we want to try to solve the "guaranteed
delivery" problem using the same mechanism, whatever it turns out to
be.  The problems are almost identical and the overhead of having two
independent solutions for the same issue is very high.  So let us try
to make this work for both cases.

Karl


Re: Re-sending docs to output connector

Posted by Jan Høydahl <ja...@cominvent.com>.
Hi,

Definitely, Solr also needs some sort of guaranteed delivery mechanism, but it's probably not the same thing as this cache; I imagine something more like a message queue or callback mechanism. But that's a separate discussion altogether :)

So if we don't shoot for a 100% solution, but instead try to solve the need to re-feed a bunch of documents from MCF really quickly after a schema change or other processing change on the output (which may be any output, really), then we have a simpler case:

Not a standalone server, but a lightweight library (jar) which knows how to talk to a persistent object store (CouchDB), supporting simple put(), get(), delete() operations as well as querying for objects within a time range, etc. An output connector that wishes to support caching could then inject calls to this library at every point where it talks to Solr:
* On add: put() the object into the cache along with a timestamp for sequencing, then send the doc directly to Solr
* On delete: delete the document from the cache, add a "delete" meta object with a timestamp (the "transaction log" feature), then delete from Solr
* On a "refeed from cache" request, send all objects to Solr - this should probably be per Job, not per output connector
* A "refeed from cache since timestamp X" request would be useful after Solr downtime. The command would use the cache as a transaction log.

The cache will always be a mirror of what the output (Solr) SHOULD look like; thus it would also be possible to support a "consistency check" feature, in which we compare all IDs in the cache with all IDs in Solr and, if they differ, get back in sync.

Doing this as a lightweight library would then provide a tool for programmers of other clients.
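
To make the shape of that library concrete, here is a rough sketch of
what its surface could look like.  All names here are invented, the
two helper classes are minimal stand-ins, and the CouchDB plumbing
behind it is omitted entirely:

import java.io.IOException;
import java.util.Iterator;
import java.util.Map;

// Hypothetical sketch of the cache/transaction-log library's API.
public interface DocumentCache
{
  /** Store or replace a document, stamped with a timestamp for sequencing. */
  void put(String id, byte[] content, Map<String,String> metadata,
    long timestamp) throws IOException;

  /** Fetch a cached document by id, or null if absent. */
  CachedDocument get(String id) throws IOException;

  /** Remove the document, recording a "delete" marker with a timestamp so
      the deletion appears in the transaction-log view below. */
  void delete(String id, long timestamp) throws IOException;

  /** All add/delete operations with timestamps in [from, to), in order --
      what a "refeed since timestamp X" command would iterate over. */
  Iterator<CacheOperation> operationsBetween(long from, long to)
    throws IOException;

  /** All current document ids, for the "consistency check" feature. */
  Iterator<String> allIds() throws IOException;
}

// Minimal stand-ins so the sketch is self-contained:
class CachedDocument
{
  byte[] content;
  Map<String,String> metadata;
  long timestamp;
}

class CacheOperation
{
  enum Kind { ADD, DELETE }
  Kind kind;
  String id;
  long timestamp;
}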

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com


Re: Re-sending docs to output connector

Posted by Karl Wright <da...@gmail.com>.
I've been thinking about this further.

First, it seems clear to me that both Solr AND ManifoldCF would need
access to the document cache.  If the cache lives under ManifoldCF, I
cannot see a good way towards a Solr integration that works the way
I'd hope it would.  Furthermore, the cache is not needed by many (or
even most) ManifoldCF targets, so adding this as a general feature of
ManifoldCF doesn't make sense to me.

On the other hand, while Solr can certainly use this facility, I can
well imagine other situations where it would be very useful as well.
So I am now leaning towards having a wholly separate "service" which
functions as both a cache and a transaction log.  A ManifoldCF output
connector would communicate with the service, and Solr also would -
or, rather, some automatic Solr-specific "push" process would query
for changes within a specified time range and push them into Solr.
Other such processes would be possible too.  The list of moving parts
would therefore be:

- a configuration file containing details on how to communicate with Solr
- a stand-alone web application which accepts documents and metadata
via HTTP, and can also respond to HTTP transaction log queries and
commands
- a number of command classes (processes) which provide a means of
pushing the transaction log contents into Solr, using the HTTP API
mentioned above.
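
A minimal sketch of that last moving part might look like the
following.  Everything here is an assumption except Solr's standard
XML update handler: the update URL would come from the config file,
and in the real widget the XML payload would be built from
transaction-log entries fetched from the service rather than
hard-coded:

import java.io.IOException;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;

public class PushToSolr
{
  public static void main(String[] args) throws IOException
  {
    // Assumed values; in practice read from the configuration file and
    // from a transaction-log query against the cache service.
    String solrUpdateUrl = "http://localhost:8983/solr/update";
    String updateXml = "<add><doc>"
        + "<field name=\"id\">doc-1</field>"
        + "<field name=\"title\">Example</field>"
        + "</doc></add>";

    // POST the update message to Solr's XML update handler.
    HttpURLConnection conn =
        (HttpURLConnection)new URL(solrUpdateUrl).openConnection();
    conn.setRequestMethod("POST");
    conn.setDoOutput(true);
    conn.setRequestProperty("Content-Type", "text/xml; charset=utf-8");
    OutputStream out = conn.getOutputStream();
    try
    {
      out.write(updateXml.getBytes("UTF-8"));
    }
    finally
    {
      out.close();
    }
    if (conn.getResponseCode() != 200)
      throw new IOException("Solr update failed: HTTP " + conn.getResponseCode());
  }
}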

I'd be interested in working on the development of such a widget, but
I probably wouldn't have the serious time necessary to do much until
July 1, given my current schedule.  Anybody else interested in
collaborating?  Other thoughts?

Karl


Re: Re-sending docs to output connector

Posted by Karl Wright <da...@gmail.com>.
The one requirement you may have overlooked is that Solr be able to
take advantage of the item cache automatically if it happens to be
restarted in the middle of an indexing pass.  If you think about it,
you will realize that this cannot be done externally to Solr, unless
Solr learns how to "pull" documents from the item cache and keep track
somehow of the last item/operation it successfully committed.  That's
why I proposed putting the whole cache under Solr's auspices.
Deletions also would need to be enumerated in the "cache", so it would
not really be a cache but more like a transaction log.  But I agree
that the right place for such a transaction log is effectively between
MCF and Solr.

Obviously the cache would also need to be disk-based, or once again
guaranteed delivery would not be possible.  Compression might be
useful, as would be checkpoints in case the data got large.  This is
very database-like, so CouchDB might be a reasonable way to do it,
especially if this code is considered to be part of Solr.  If part of
ManifoldCF, we should try to see if PostgreSQL would suffice, since it
will likely be already installed and ready to go.
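
For illustration, a log entry in such a scheme might carry something
like the fields below.  This is a hypothetical sketch; the field names
and the checkpoint idea are assumptions, not a spec:

// Hypothetical transaction-log entry, recording adds and deletes so a
// consumer (e.g. a Solr-side puller) can resume from the last operation
// it successfully committed.
public class LogEntry
{
  public enum Op { ADD, DELETE }

  public final long sequence;     // monotonically increasing log position
  public final Op op;             // what happened
  public final String documentId; // which document
  public final byte[] content;    // document bytes for ADD; null for DELETE

  public LogEntry(long sequence, Op op, String documentId, byte[] content)
  {
    this.sequence = sequence;
    this.op = op;
    this.documentId = documentId;
    this.content = content;
  }
}

// The consumer would persist a checkpoint -- the highest sequence number
// it has committed -- and on restart re-read the log from checkpoint + 1.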

Karl


Re: Re-sending docs to output connector

Posted by Jan Høydahl <ja...@cominvent.com>.
The "Refetch all ingested documents" works, but with Web crawling the problem is that it will take almost as long as a new crawl to re-feed.

The solutions could be:
A) Add a stand-alone cache in front of Solr
B) Add a caching proxy in front of MCF - will allow speedy re-crawl (but clunky to administer)
C) Extend MCF with an optional item cache. This could allow a "refeed from cache" button somewhere...

The cache in C could be realized externally to MCF, e.g. as a CouchDB cluster. To enable it, you'd add the CouchDB access info to properties.xml.

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com


Re: Re-sending docs to output connector

Posted by Karl Wright <da...@gmail.com>.
ManifoldCF is designed to deal with the problem of repeated or
continuous crawling, doing only what is needed on subsequent crawls.
It is thus a true incremental crawler.  But in order for this to work
for you, you need to let ManifoldCF do its job of keeping track of
what documents (and what document versions) have been handed to the
output connection.  For the situation where you change something in
Solr, the ManifoldCF solution is the "Refetch all ingested documents"
button in the Crawler UI, found on the view page for the output
connection.  Clicking that button will cause ManifoldCF to re-index
all documents - but it will also require ManifoldCF to recrawl them,
because ManifoldCF does not keep copies of the documents it crawls
anywhere.

If you need to avoid recrawling at all costs when you change Solr
configurations, you may well need to put some sort of software of your
own devising between ManifoldCF and Solr.  You basically want to
develop a content repository which ManifoldCF outputs to, and which
can be scanned to send documents to your Solr instance.  I actually
proposed this design for a Solr "guaranteed delivery" mechanism,
because until Solr commits a document it can still be lost if the Solr
instance is shut down.  Clearly something like this is needed, and it
would likely solve your problem too.  The main issue, though, is that
it would need to be integrated with Solr itself, because you'd really
want it to pick up where it left off if Solr is cycled, etc.  In my
opinion this functionality really can't live inside ManifoldCF, for
that reason.

Karl
