Posted to dev@manifoldcf.apache.org by Mark Bennett <mb...@ideaeng.com> on 2012/01/09 15:29:32 UTC

Revisiting: Should Manifold include Pipelines

We've been hoping to do some work this year to embed pipeline processing
into MCF, such as UIMA or OpenPipeline or XPump.

But reading through some recent posts there was a discussion about leaving
this sort of thing to the Solr pipeline, and it suddenly dawned on me that
maybe not everybody was on board with the idea of moving this into MCF.

So, before we spin our wheels, I wanted to explain some reasons why this would
be a GOOD thing to do, and get some reactions:


1: Not everybody is using Solr / or using exclusively Solr.

Lucene and Solr are great, of course, but open source isn't about walled
gardens.  Most companies have multiple search engines.

And, even if you just wanted to use Lucene (and not Solr), then the Solr
pipeline is not very attractive.

As an example, the Google appliance gets lots of press for Enterprise
search.  And it's got enough traction that its connector format is
starting to be used by other companies.  BUT, at least in the past,
Google's document processing wasn't very pipeline friendly.  They had calls
you could make, but there were issues.

Wouldn't it be cool if Manifold could be used to feed Google appliances?  I
realize some open source folks might not care, but it would suddenly make
MCF interesting to a lot more developers.

Or look at FAST ESP (which was bought by Microsoft).  FAST ESP had a rich
tradition of pipeline goodness, but once Microsoft acquired them, that
pipeline technology is being re-cast in a very Microsoft centric stack.
That's fine if you're a Microsoft shop; you might like it even better than
before.  But if your company prefers Linux, you might be looking for
something else.


2: Not every information application is about search

Classically there's been a need to go from one database to another.  But in
more recent times there's been a need to go from Databases into Content
Management Systems, or from one CMS to another, or to convert one corpus of
documents into another.

Sure there was ETL technology (Extract, Transform, Load), but that tended
to be around structured data.

More generally there's the class of going between structured and
unstructured data, and vice versa.  The latter, going from unstructured
back to structured, is where Entity Extraction comes into play, and where I
had thought MCF could really shine.

There's a somewhat subtle point here as well.  There's the format of
individual documents or files, such as HTML, PDF, RSS or MS Word, but also
the type of repository it resides in (filesystem, database, CMS, web
services, etc.).  I was drawn to MCF for the connections, but a document
pipeline would let it work on the document formats as well.


3: Even spidering to feed a search engine can benefit from "early binding"
and "extended state"

A slight aside: generic web page spidering doesn't often need fancy
processing.  What I'm about to talk about might at first seem like "edge
cases".  BUT, almost by definition, many of us are not brought into a
project unless it's well outside the mainstream use case.  So many
programmers find themselves working almost fulltime on rather unusual
projects.  Open source is quite attractive because it provides a wealth of
tools to choose from.

"Early Binding" for Spiders:

Generally it's the need to deeply parse a document before instructing the
spider what action to take next.

Let me give one simple example, but trust me there are many more!

Suppose you have Web pages (or PDF files!) filled with part numbers.  And
you have a REST API that, presented with a part number, will give more
details.

But you need to parse out the part numbers in order to create the URLs that
you need to spider to fetch next.
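
To make that concrete, here's a minimal sketch in Java of the kind of
"early binding" step I mean.  (The part-number pattern and the REST
endpoint are made up for illustration; this isn't any existing MCF code.)

import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Parse part numbers out of fetched text and turn them into the
 *  follow-on URLs the spider should fetch next. */
public class PartNumberLinkExtractor {
  // Hypothetical part-number format, e.g. "PN-12345678"
  private static final Pattern PART_NUMBER = Pattern.compile("PN-\\d{8}");
  private static final String DETAILS_API = "https://parts.example.com/api/details/";

  public static List<String> extractFollowOnUrls(String documentText) {
    List<String> urls = new ArrayList<String>();
    Matcher m = PART_NUMBER.matcher(documentText);
    while (m.find()) {
      // Each part number becomes a URL to enqueue for the next hop.
      urls.add(DETAILS_API + m.group());
    }
    return urls;
  }
}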

Many other applications of this involve helping the spider decide what type
of document it has, or what quality of data it's getting.  You might decide
to tell the spider to drill down deeper, or conversely, give up and work on
higher value targets.

I could imagine a workaround where Manifold passes documents to Solr, and
then Solr's pipeline later resubmits URLs back into MCF, but it's a lot
more direct to just make these determinations immediately.  In a few
cases it WOULD be nice to have Solr's full-text index, so maybe it'd be
nice to have both options.  Commercial software companies would want to
make the decision for you, they'd choose one way or the other, but this
ain't their garden.  ;-)


"extended state" for Spiders:

This is where you need the context of 2 or 3 pages back in your traversed
path in order to make full use of the current page.

Here's an example from a few years back:

Steps:
1: Start with a list of concert venue web sites.
2: Foreach venue, lookup the upcoming events, including dates, bands and
ticketing links.
3: Foreach band, go to this other site and lookup their albums.
4: Foreach album, lookup each song.
5: Foreach song, go to a third site to get the lyrics.

Now users can search for songs including the text in the lyrics.
When a match is found, also show them upcoming performances near them, and
maybe even let them click to buy tickets.

You can see that the unit of retrieval is particular songs, in steps 4 and
5.  But we want data that we parsed from several steps back.

Even in the case of song lyrics, where the page will have the band's name,
it might not have the album title (and a song could have been on several
albums, of course).  So even for things you'd expect to be able to parse,
you often already had that info during a previous step.

I realize MCF probably doesn't include this type of state trail now.  But I
was thinking it'd at least be easier to build something on top of MCF than
going way out to Solr and then back into Manifold.
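
To show roughly what I mean by a "state trail" (purely illustrative types;
this is not MCF's carrydown mechanism), each enqueued URL could carry
forward a little map of facts gathered on earlier hops:

import java.util.HashMap;
import java.util.Map;

/** Illustrative only: context gathered several hops back (venue, band,
 *  album) travels with each enqueued URL, so the final song/lyrics page
 *  can be indexed with all of it. */
public class CrawlRequest {
  public final String url;
  public final Map<String, String> inheritedContext;

  public CrawlRequest(String url, Map<String, String> parentContext) {
    this.url = url;
    // Copy so each branch of the crawl keeps its own context.
    this.inheritedContext = new HashMap<String, String>(parentContext);
  }

  /** Derive a child request, adding one more fact learned on this page. */
  public CrawlRequest child(String childUrl, String key, String value) {
    CrawlRequest next = new CrawlRequest(childUrl, this.inheritedContext);
    next.inheritedContext.put(key, value);
    return next;
  }
}

// Usage sketch:
//   CrawlRequest venue = new CrawlRequest(venueUrl, new HashMap<String, String>());
//   CrawlRequest band  = venue.child(bandPageUrl, "venue", "The Fillmore");
//   CrawlRequest album = band.child(albumPageUrl, "band", "Some Band");
//   // ...by the time the lyrics page is fetched, venue/band/album are all present.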

In the past I think folks would have used Perl or Python to handcraft these
types of projects.  But that doesn't scale very well, and you still need
persistence for long running jobs, AND it doesn't encourage code reuse.


So, Manifold could really benefit from pipelines!

I have a lot of technical thoughts about how this might be achieved, and a
bunch of related thoughts.  But if pipelines are really unwelcome, I don't
want to force it.


One final thought:

The main search vendors seem to be abandoning high end, precision
spidering.  There's a tendency now to see all the world as "Internet", and
the data behind firewalls as just "a smaller Internet (intranet)".

This is fine for 80-90% of common use cases.

But that last 5-10% of atypical projects are HUGELY under-served at this
time.  They often have expensive problems that simply won't go away.

Sure, true open source folks may or may not care about "markets" or
"expensive problems".

BUT these are also INTERESTING problems!  If you're bored with "appliances"
and the latest downloadable free search bar, then trust me, these edge
cases will NOT bore you!!!

And I've given up on the current crop of Tier 1 search vendors for solving
anything "interesting".  They are so distracted with so many other
things.... and selling a hot 80% solution to a hungry market is fine with
them anyway.

--
Mark Bennett / New Idea Engineering, Inc. / mbennett@ideaeng.com
Direct: 408-733-0387 / Main: 866-IDEA-ENG / Cell: 408-829-6513

Re: Revisiting: Should Manifold include Pipelines

Posted by Karl Wright <da...@gmail.com>.
Hi Mark,




>
> I'm not sure if this question is revisiting the motivation for preferring
> this in MCF, or a technical question about how to package metadata for
> different engines that might want it in a different format.
>

I'm looking not so much for justification, but for enough context as
to how to structure the code.  Based on what I've heard, it probably
makes the most sense to provide a service available for both
repository connectors and output connectors to use in massaging
content.  The configuration needed for the service would therefore be
managed by the repository connector or output connector that requires
the pipeline's services.


> For the latter, how to pass metadata to engines, that's interesting.  One
> almost universal way is to add metadata tags to the header portion of an HTML
> file.  There are some other microformats that some engines understand.
> Could we just assume, for now, that additional metadata will be jammed
> into the HTML header, perhaps with an "x-" prefix for the name (a convention
> some folks like)?
>

I would presume that a Java coder who writes the output connector that
knows how to connect to the given search engine would tackle this
problem in the appropriate way.  I don't think it's a pipeline
question.

>
> Including Tika would be useful for connectors that need to look at binary
> doc files to do their parsing.  Even if the pipeline then discards Tika's
> output when it's done, it's still a worthwhile expense *if* it meets the
> project objective.
>
> As an example, the current MCF system looks for links in HTML.  But
> hyperlinks can also appear in Word, Excel and PDF files.  Tika could, in
> theory, convert those docs so that they can also be scanned for links, and
> then later discard that converted file.
>

Sure, that's why I'd make the pipeline available to every
connector.  The Java code for the connector would be modified, if
appropriate, to use the pipeline where helpful.

>
> Given the dismal state of open tools, I'd be excited to just see 1:1
> "pipeline" functionality be made widely available.
>
> I'm regretting, to some extent, bringing in the more complex Pipeline logic
> as it may have partially derailed the conversation.  I'm one of the authors
> of the old XPump tool, which was able to do very fancy things, but suffered
> from other issues.
>
> But better to have something now than nothing.  And I'll ponder the more
> complex scenarios some more.
>

I'll talk about this more further down.

>
>>
>> So, my question to you is, what would the main use case(s) be for a
>> "pipeline" in your view?
>>
>
> I've given a couple examples above, of 1:1 transforms.  I *KNOW* this is of
> interest to some folks, but it sounds like I've failed to convince you.
> I'd ask you to take it on faith, but you don't know me very well, so that'd
> be asking a lot.
>

The goal of the question was to confirm that you thought the value of
having a "pipeline" was high enough, vs. building a "Pipeline", as
we've defined it.  I wanted to be sure there was no communication
issue and that we understood one another before anybody went off and
started writing code.

>
> A final question for you Karl, since we've both invested some time in
> discussing something that would normally be very complex to others.  What
> open source tools would YOU suggest I look at, for a new home for uber
> pipeline processing?  I think you understand some of the logical
> functionality I want to model.
>
> Some other wish list items:
> * Leverage MCF connectors
> * A web UI framework for monitoring
>
> I'd say up front that I've considered Nutch, but I don't think it's a good
> fit for other reasons.
>
> I'm still looking around at UIMA.  I keep finding the justification for
> UIMA, how awesome it is, but less on the technical side.  I'm not sure it
> models a data flow design that well.
>
> The other area I looked at was some of the Eclipse process graph stuff,
> "Business Process Management" I think.
>
>
> There's a TON of open source projects.
>

I can't claim to know all the open-source projects out
there.  But I'm unaware of one that really focuses on "Pipeline"
building from the perspective of crawling.

On the other hand, it seems pretty clear to me how one would go about
converting ManifoldCF to a "Pipeline" project.  What you'd get would
be a tool with UI components where you'd either glue the components
together with code, or use an "amalgamation" UI to generate the
necessary data flow.  There may already be tools in this space I don't
know of, but before you'd get to that point you'd want to have all the
technical underpinnings worked out.

The "Pipeline" services you'd want to provide would include functions
that each connector currently performs, but broken out as I'd
described in one of my earlier posts.  The document queue, which is
managed by the ManifoldCF framework right now, would need to be
redesigned since the entire notion of what a job is would require
redesign in a "Pipeline" world.

In order to develop such a thing, I'd be tempted to say "fork the
project", except for one interesting detail: if it were done
carefully, I think today's connectors would not need to change in any
significant way in order to be used by both today's framework and
tomorrow's "Pipeline".  I'm therefore pondering whether there can be a
"two frameworks, one set of connectors" kind of model.  Going with the
"Pipeline" framework exclusively would, I think, make it very hard on
the casual user of ManifoldCF.

In closing, therefore, feel free to think about the "Pipeline"
question further, and contribute "pipeline" code if you have the time.

Thanks for an interesting discussion!
Karl

Re: Revisiting: Should Manifold include Pipelines

Posted by Mark Bennett <mb...@ideaeng.com>.
Hi Karl,

On Wed, Jan 11, 2012 at 4:21 AM, Karl Wright <da...@gmail.com> wrote:

> Hi Mark,
>
> I think I'd describe this simplified proposal as "pipeline" (vs.
> "Pipeline".  Your original description was the latter.)  This proposal
> is simpler but does not have the ability to amalgamate content from
> multiple connectors, correct?


Yes.


>  As long as it is just modifying the
> content and metadata (as described by RepositoryDocument), it's not
> hard to develop a generic idea of a content processing pipeline, e.g.
> Tika.
>

Yay!


>
> There's a question in my mind as to where it belongs.  If its purpose
> is to make up for missing code in particular search engines, then I'd
> argue it should be a service available to output connector coders, who
> can then choose how much configurability makes sense from the point of
> view of their target system.


I'm not sure if this question is revisiting the motivation for preferring
this in MCF, or a technical question about how to package metadata for
different engines that might want it in a different format.

For the former, I'd briefly rehash my answer earlier in the thread.
Pipelines are not in every search engine, and many organizations deal with
multiple search engines, so having a more standard home for that logic would
be awesome!

For the latter, how to pass metadata to engines, that's interesting.  One
almost universal way is to add metadata tags to the header portion of an HTML
file.  There are some other microformats that some engines understand.
Could we just assume, for now, that additional metadata will be jammed
into the HTML header, perhaps with an "x-" prefix for the name (a convention
some folks like)?
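
A tiny sketch of what I mean (the field names are invented, and the "x-"
prefix is just the convention suggested above):

import java.util.Map;

/** Jam extra metadata into the <head> of an HTML document as
 *  "x-"-prefixed meta tags.  Illustrative only. */
public class MetaTagInjector {
  public static String injectMetadata(String html, Map<String, String> metadata) {
    StringBuilder tags = new StringBuilder();
    for (Map.Entry<String, String> e : metadata.entrySet()) {
      tags.append("<meta name=\"x-").append(e.getKey())
          .append("\" content=\"").append(e.getValue()).append("\"/>\n");
    }
    // Naive insertion right after <head>; a real version would need proper
    // HTML parsing and attribute escaping.
    return html.replaceFirst("(?i)<head>", "<head>\n" + tags);
  }
}

// e.g. injectMetadata(pageHtml, Collections.singletonMap("department", "support"));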


>  For instance, since Tika is already part
> of Solr, there would seem little benefit in adding a Tika pipeline
> upstream of Solr as well, but maybe a Google Appliance connector would
> want it and therefore expose it.


Including Tika would be useful for connectors that need to look at binary
doc files to do their parsing.  Even if the pipeline then discards Tika's
output when it's done, it's still a worthwhile expense *if* it meets the
project objective.

As an example, the current MCF system looks for links in HTML.  But
hyperlinks can also appear in Word, Excel and PDF files.  Tika could, in
theory, convert those docs so that they can also be scanned for links, and
then later discard that converted file.

Another attractive pipeline would have Tika convert non-HTML binary files
into some primitive form of HTML and then perhaps discard the original
binary.  So binary content would be transformed into HTML or TXT in the MCF
stage, INSTEAD of having a search engine (or other system) do it.  It's
still a 1:1 transformation.
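
For what it's worth, the Tika call involved is small.  A minimal sketch
(standard Tika API; the regex link scan at the end is just illustrative):

import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.sax.ToHTMLContentHandler;

/** Convert a binary document (Word, Excel, PDF...) to HTML with Tika,
 *  then scan the HTML for hyperlinks the crawler could follow. */
public class TikaLinkScanner {
  private static final Pattern HREF = Pattern.compile("href=\"(https?://[^\"]+)\"");

  public static List<String> extractLinks(InputStream binaryDoc) throws Exception {
    ToHTMLContentHandler handler = new ToHTMLContentHandler();
    // AutoDetectParser picks the right parser based on the detected type.
    new AutoDetectParser().parse(binaryDoc, handler, new Metadata(), new ParseContext());
    String html = handler.toString();

    List<String> links = new ArrayList<String>();
    Matcher m = HREF.matcher(html);
    while (m.find()) {
      links.add(m.group(1));
    }
    return links;  // the converted HTML itself can be kept or discarded
  }
}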


>  If the pipeline's purpose is to
> include arbitrary business logic, on the other hand, then I think what
> you'd really need is a Pipeline and not a pipeline, if you see what I
> mean.
>

Given the dismal state of open tools, I'd be excited just to see 1:1
"pipeline" functionality made widely available.

I'm regretting, to some extent, bringing in the more complex Pipeline logic
as it may have partially derailed the conversation.  I'm one of the authors
of the old XPump tool, which was able to do very fancy things, but suffered
from other issues.

But better to have something now than nothing.  And I'll ponder the more
complex scenarios some more.


>
> So, my question to you is, what would the main use case(s) be for a
> "pipeline" in your view?
>

I've given a couple examples above, of 1:1 transforms.  I *KNOW* this is of
interest to some folks, but it sounds like I've failed to convince you.
I'd ask you to take it on faith, but you don't know me very well, so that'd
be asking a lot.

As an academic exercise, suppose it were given that this was a good thing
to do; what would be the easiest way to do it?

A final question for you, Karl, since we've both invested some time in
discussing something that would normally be very complex to explain to others.  What
open source tools would YOU suggest I look at, for a new home for uber
pipeline processing?  I think you understand some of the logical
functionality I want to model.

Some other wish list items:
* Leverage MCF connectors
* A web UI framework for monitoring

I'd say up front that I've considered Nutch, but I don't think it's a good
fit for other reasons.

I'm still looking around at UIMA.  I keep finding the justification for
UIMA, how awesome it is, but less on the technical side.  I'm not sure it
models a data flow design that well.

The other area I looked at was some of the Eclipse process graph stuff,
"Business Process Management" I think.


There's a TON of open source projects.


>
> Karl
>
> On Wed, Jan 11, 2012 at 6:31 AM, Mark Bennett <mb...@ideaeng.com>
> wrote:
> > Hi Karl,
> >
> > Still pondering our last discussion.  Wondering if I got things off
> track.
> >
> > As a start, what if I backtracked a bit, to this:
> >
> > What's the easiest way to do this:
> > * A connector that tweaks metadata from a single source.
> > * Sits between any existing MCF datasource connector and the main MCF
> engine
> >
> > Before:
> >
> > CMS/DB -> Existing MCF connector -> MCF core -> output
> >
> > After:
> >
> > CMS/DB -> Existing MCF connector -> Metadata tweaker -> MCF core ->
> output
> >
> >
> > Assume the metadata changes don't have any impact on security, or that no
> > security is being used (public data)
>

Re: Revisiting: Should Manifold include Pipelines

Posted by Karl Wright <da...@gmail.com>.
Hi Mark,

I think I'd describe this simplified proposal as "pipeline" (vs.
"Pipeline".  Your original description was the latter.)  This proposal
is simpler but does not have the ability to amalgamate content from
multiple connectors, correct?  As long as it is just modifying the
content and metadata (as described by RepositoryDocument), it's not
hard to develop a generic idea of a content processing pipeline, e.g.
Tika.

There's a question in my mind as to where it belongs.  If its purpose
is to make up for missing code in particular search engines, then I'd
argue it should be a service available to output connector coders, who
can then choose how much configurability makes sense from the point of
view of their target system.  For instance, since Tika is already part
of Solr, there would seem little benefit in adding a Tika pipeline
upstream of Solr as well, but maybe a Google Appliance connector would
want it and therefore expose it.  If the pipeline's purpose is to
include arbitrary business logic, on the other hand, then I think what
you'd really need is a Pipeline and not a pipeline, if you see what I
mean.

So, my question to you is, what would the main use case(s) be for a
"pipeline" in your view?

Karl

On Wed, Jan 11, 2012 at 6:31 AM, Mark Bennett <mb...@ideaeng.com> wrote:
> Hi Karl,
>
> Still pondering our last discussion.  Wondering if I got things off track.
>
> As a start, what if I backtracked a bit, to this:
>
> What's the easiest way to do this:
> * A connector that tweaks metadata from a single source.
> * Sits between any existing MCF datasource connector and the main MCF engine
>
> Before:
>
> CMS/DB -> Existing MCF connector -> MCF core -> output
>
> After:
>
> CMS/DB -> Existing MCF connector -> Metadata tweaker -> MCF core -> output
>
>
> Assume the metadata changes don't have any impact on security, or that no
> security is being used (public data)

Re: Revisiting: Should Manifold include Pipelines

Posted by Mark Bennett <mb...@ideaeng.com>.
Hi Karl,

Still pondering our last discussion.  Wondering if I got things off track.

As a start, what if I backtracked a bit, to this:

What's the easiest way to do this:
* A connector that tweaks metadata from a single source.
* Sits between any existing MCF datasource connector and the main MCF engine

Before:

CMS/DB -> Existing MCF connector -> MCF core -> output

After:

CMS/DB -> Existing MCF connector -> Metadata tweaker -> MCF core -> output


Assume the metadata changes don't have any impact on security, or that no
security is being used (public data).
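
To sketch what that "metadata tweaker" might look like as code (the
interface and class names here are hypothetical, not existing MCF types):

import java.util.Map;

/** Hypothetical 1:1 stage sitting between an existing repository
 *  connector and the MCF core: it sees each document's metadata on the
 *  way through and may adjust it or drop the document. */
interface MetadataTweaker {
  /** Modify the metadata in place; return false to drop the document. */
  boolean tweak(String documentUri, Map<String, String> metadata);
}

/** Example stage: normalize the title field and tag the source system. */
class TitleNormalizer implements MetadataTweaker {
  public boolean tweak(String documentUri, Map<String, String> metadata) {
    String title = metadata.get("title");              // field name is made up
    if (title == null || title.trim().isEmpty()) {
      return false;                                    // drop untitled documents
    }
    metadata.put("title", title.trim());
    metadata.put("source-system", "legacy-cms");       // made-up field name
    return true;
  }
}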

Re: Revisiting: Should Manifold include Pipelines

Posted by Mark Bennett <mb...@ideaeng.com>.
Hi Karl,

I wanted to acknowledge and thank you for your 2 emails.

I need to think a bit.  I *do* have answers to some of your concerns, and
hopefully reasonable-sounding ones at that.

Also, maybe I should take another look at Nutch - BUT Manifold's Web UI is
so much further along, and more in line with the type of admin "view" of
what's going on, that I had given up on Nutch for a bit.  I have some other
thoughts about Nutch but won't go into them here.

Also, to be clear, I in no way meant to even imply you had any other
motives for having materials in the book.  You've demonstrated, time and
again, that you sincerely want to share MCF, and info about it, with the
whole world!

Mark

--
Mark Bennett / New Idea Engineering, Inc. / mbennett@ideaeng.com
Direct: 408-733-0387 / Main: 866-IDEA-ENG / Cell: 408-829-6513


On Tue, Jan 10, 2012 at 12:27 AM, Karl Wright <da...@gmail.com> wrote:

> you wanted a connection to be a pipeline component rather than what it
> is today.
>

Re: Revisiting: Should Manifold include Pipelines

Posted by Karl Wright <da...@gmail.com>.
As an exercise in understanding, it might be helpful to consider how
exactly a document specification in today's ManifoldCF would morph if
you wanted a connection to be a pipeline component rather than what it
is today.

Right now, the document specification for a job is an XML doc of a
form that only the underlying connector understands, which specifies
the following kinds of information:

- What documents to include in the crawl (which is meaningful only in
the context of an existing underlying connection);
- What parts of those documents to index (e.g. what metadata is included).

The information is used in several places during the crawl:

- At the time seeding is done (the initial documents)
- When a decision is being made to include a document on the queue
- Before a document is going to be fetched
- In order to set up the document for indexing

The repository connector allows you to edit the document specification
in the Crawler UI.  This is done by the repository connector
contributing tabs to the job.

Now, in order for a pipeline to work, most of the activities of the
connector will need to be broken out into separate pipeline tasks.
For instance, "seeding" would be a different task from "filtering"
which would be different from "enqueuing" which would be different
from "obtaining security info".  I would expect that each pipeline
step would have its own UI, so if you were using Connection X to seed,
then you would want to specify what documents to seed in the UI for
that step, in a manner consistent with the underlying connection.

So the connector would need to break up its document specification
into multiple pieces, e.g. a "seeding document specification" with a
seeding document specification UI.  There would be a corresponding
specification and UI for "connector document filtering" and for
"connector document enqueuing".  I suspect there would be a lot of
duplication and overlap too, which would be hard to avoid.
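
To picture the split (all of the names below are hypothetical; nothing
like this exists in ManifoldCF today, it's just a sketch of the breakdown
I described):

/** Hypothetical per-stage specifications that today's single document
 *  specification might be broken into.  Each would come with its own
 *  connector-contributed UI tab, as the single specification does now. */
interface SeedingSpecification {
  /** The connector-specific description of which documents to seed. */
  String toXml();
}

interface FilteringSpecification {
  /** Decide whether a discovered document belongs in the crawl at all. */
  boolean shouldInclude(String documentIdentifier);
}

interface EnqueuingSpecification {
  /** Decide which discovered references get placed on the document queue. */
  boolean shouldEnqueue(String parentIdentifier, String childIdentifier);
}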

The end result of this exercise would be something that would allow
more flexibility, at the expense of ease of use.

Karl

On Tue, Jan 10, 2012 at 2:49 AM, Karl Wright <da...@gmail.com> wrote:
> Hi Mark,
>
> Please see below.
>
> On Mon, Jan 9, 2012 at 9:53 PM, Mark Bennett <mb...@ideaeng.com> wrote:
>> Hi Karl,
>>
>> Thanks for the reply, most comments inline.
>>
>> General comments:
>>
>> I was wondering if you've used a custom pipeline like FAST ESP or
>> Ultraseek's old "patches.py", and if there were any that you liked or
>> disliked?  In more recent times the OpenPipeline effort has been a bit
>> nascent, I think in part because it lacks some of connectors.  Coming from
>> my background I'm probably a bit biased to thinking of problems in terms of
>> a pipeline, and it's also a frequent discussion with some of our more
>> challenging clients.
>>
>> Generally speaking we define the virtual document to be the basic unit of
>> retrieval, and it doesn't really matter whether it starts life as a Web
>> Page or PDF or Outlook node.  Most "documents" have a create / modified
>> date, some type of title, and a few other semi-common meta data fields.
>> They do vary by source, but there's mapping techniques.
>>
>> Having more connector services, or even just more examples, is certainly a
>> step in the right direction.
>>
>> But leaving it at writing custom monolithic connectors has a few
>> disadvantages:
>> - Not as modular, so discourages code reuse
>> - Maintains 100% coding, vs. some mix of configure vs. code
>> - Keeps the bar at rather advanced Java programmers, vs. opening up to
>> folks that feel more comfortable with "scripting" (of a sort, not
>> suggesting a full language)
>> - I think folks tend to share more when using "configurable" systems,
>> though I have no proof.  I might just be the larger number of people.
>> - Sort of the "blank canvas syndrome" as each person tries to grasp all the
>> nuances; granted one I'm suggesting merely presents a smaller blank canvas,
>> but maybe with crayons and connect the dots, vs. oil paints.
>>
>
> It sounds to me like what you are proposing is a reorganization of the
> architecture of ManifoldCF so that documents that are fetched by
> repository connectors are only obliquely related to documents indexed
> through an output connector.  You are proposing that an indexed
> document be possibly assembled from multiple connector sources, but
> with arbitrary manipulation of the document content along the way.  Is
> this correct?
>
> If so, how would you handle document security?  Each repository
> connection today specifies the security context for the documents it
> fetches.  It also knows about relationships between those documents
> that come from the same connector, and about document versioning for
> documents fetched from that source.  How does this translate into a
> pipelined world in your view?  Is the security of the final indexed
> document the intersection of the security for all the sources of the
> indexed document?  Is the version of the indexed document the
> concatenation of the versions of all the input documents?
>
>
>>> > Generally it's the need to deeply parse a document before instructing the
>>> > spider what next action to take next.
>>> >
>>>
>>> We've looked at this as primarily a connector-specific activity.  For
>>> example, you wouldn't want to do such a thing from within documents
>>> fetched via JCIFs.  The main use case I can see is in extracting links
>>> from web content.
>>>
>>
>> There are so many more things to be extracted in the world.... and things
>> that a spider can use.
>>
>> I don't understand the comment about JCIFs.  Presumably there's still the
>> concept of a unit of retrieval, some "document" or "page", with some type
>> of title and URL?
>>
>
> My example was meant to be instructive, not all-inclusive.  Documents
> that are fetched from a Windows shared filesystem do not in general
> point at other documents within the same Windows shared filesystem.
> There is no point in that case in parsing those documents looking for
> links; it just slows the whole system down.  The only kinds of
> universal links you might find in a document from any arbitrary source
> will likely be web urls, is all that I'm saying.
>
>>
>> We might be talking past each other here, and maybe we're already agreeing.
>>
>> So I'm a developer and I need a fancy connector that pulls from multiple
>> sources.
>>
>> But then I notice that ManifoldCF already has connectors for all 3 of my
>> sources.
>>
>> So yes, I'd need to write some custom code.  But I want to "plugin"
>> existing manaifold connectors, but route their output as input to my
>> connector.
>>
>> Or more likely, I'll be pullilng "primary" records form on of the existing
>> manaifold connectors, and will then make requests to 1 or 2 other MCF
>> connectors to fill-in additional details.
>>
>> Maybe I can do this now?  Maybe it's so trivial to you that it didn't even
>> seem like a question???
>>
>
> No, this is not trivial now.  Code-driven assembly of access to
> multiple connectors and amalgamation into documents would, like I
> said, require very significant changes to the ManifoldCF architecture.
>  For example, your question overlooks several major features of the
> way ManifoldCF works at the moment:
>
> - Each repository connector supplies its own means of specifying what
> documents should be included in a job, and the means of editing that
> specification via the UI.
> - Each repository connector knows how to handle document versioning so
> that the framework can support incremental crawling properly.
> - Each repository connector knows how to generate security information
> for the documents it fetches.
>
> If you adopt a code-driven assembly pipeline, it's hard to see how all
> of this would come together.  And yet it must, unless what you are
> really looking for is Nutch but with repository connector components.
> By having the criteria for what documents to include in a crawl be
> part of some pipeline code, you take it out of the non-programmer's
> hands.  There may be ways of respecifying what a connector is that
> dodge this problem but I certainly don't know what that might look
> like yet.
>
>
>>
>> I have the book, I'll go check those sections.
>>
>> Is there some chance of porting some of this info to the Wiki?  I've
>> noticed references to the book in a couple emails, which is fine, I
>> reference my book every now and then as too.  But for open source info it'd
>> be kind of a bummer to force people to shell out $ for a book.  Not sure
>> what type of a deal you have with the publisher.  Or maybe somebody else
>> would have to create the equivalent info from scratch in wiki?
>>
>
> The other place it is found is in the javadoc for the IProcessActivity
> interface.  You can see how it is used by looking at the RSS
> connector.
>
> I'm not trying to make people spend money on a book but I honestly
> don't have the time to write the same thing two or three times.
> Please feel free to come up with contributions to the ManifoldCF
> documentation that's based on the book content.  You obviously can't
> cut-and-paste, but you can digest the material if you think it would
> be helpful to others.
>
>>>
>>> (1) Providing a content-extraction and modification pipeline, for
>>> those output connectors that are targeting systems that cannot do
>>> content extraction on their own.
>>> (2) Providing framework-level services that allow "connectors" to be
>>> readily constructed along a pipeline model.
>>>
>>
>> A good start, let me add to it:
>>
>> (3) Easily use other Connectors as both inputs and outputs to a custom
>> connector.
>>
>> I'm not sure whether it's better to have such hybrids mimic a datasource
>> connector, an output connector, or maybe slot in as a security connector?
>>
>> Conceptually the security connectors are both "input" and "output", so
>> presumably would be easier to chain?  But it'd be a bid odd to hang off of
>> the "security" side of things for a process that just tweaks metadata.
>
> This is where we run into trouble.  From my perspective, the security
> of a document involves stuff that can only come from a repository,
> combined with security information that comes from one (or more)
> authorities. It makes absolutely no sense to have a "document security
> pipeline", because there is not enforcement of security within
> ManifoldCF itself; that happens elsewhere.
>
>> Also, I don't know if security connectors have access to all of the data
>> and the ability to modify metadata / document content.
>>
>> (4) Inclusion of Tika for filtering, which is often needed.
>>
>> (5) Ability for a custom connector to inject additional URLs into the
>> various queues
>>
>> (6) Some type of "accountability" for work that has been submitted.  So a
>> record comes in on connector A, I then generate requests to connectors B
>> and C, and I'd like to be efficiently called back when those other tasks
>> are completed or have failed.
>>
>
> I think it is now clear that you are thinking of ManifoldCF as Nutch
> with connectors, where you primarily code up your crawl in Java and
> fire it off that way.  But that's not the problem that ManifoldCF set
> out to address.  I'm not averse to making ManifoldCF change
> architecturally to support this kind of thing, but I don't think we
> should forget ManifoldCF's primary mission along the way.
>
>>
>> Being able to configure custom pipelines would be even better, but not a
>> deal breaker.  Obviously most Manifold users are Java coders at the moment,
>> so re-usablity could come at a later time.
>>
>
> Actually, most ManifoldCF users are *not* java coders - that's where
> our ideas fundamentally differ.  The whole reason there's a crawler UI
> in the first place is so someone who doesn't want to code can set up
> crawls and run them.
>
> I believe I have a clearer idea of what you are looking for.  Please
> correct me if you disagree.  I'll ponder some of the architectural
> questions and see if I can arrive at a proposal that meets most of the
> goals.
>
>
> Karl

Re: Revisiting: Should Manifold include Pipelines

Posted by Karl Wright <da...@gmail.com>.
Hi Mark,

Please see below.

On Mon, Jan 9, 2012 at 9:53 PM, Mark Bennett <mb...@ideaeng.com> wrote:
> Hi Karl,
>
> Thanks for the reply, most comments inline.
>
> General comments:
>
> I was wondering if you've used a custom pipeline like FAST ESP or
> Ultraseek's old "patches.py", and if there were any that you liked or
> disliked?  In more recent times the OpenPipeline effort has been a bit
> nascent, I think in part because it lacks some of the connectors.  Coming from
> my background I'm probably a bit biased to thinking of problems in terms of
> a pipeline, and it's also a frequent discussion with some of our more
> challenging clients.
>
> Generally speaking we define the virtual document to be the basic unit of
> retrieval, and it doesn't really matter whether it starts life as a Web
> Page or PDF or Outlook node.  Most "documents" have a create / modified
> date, some type of title, and a few other semi-common meta data fields.
> They do vary by source, but there's mapping techniques.
>
> Having more connector services, or even just more examples, is certainly a
> step in the right direction.
>
> But leaving it at writing custom monolithic connectors has a few
> disadvantages:
> - Not as modular, so discourages code reuse
> - Maintains 100% coding, vs. some mix of configure vs. code
> - Keeps the bar at rather advanced Java programmers, vs. opening up to
> folks that feel more comfortable with "scripting" (of a sort, not
> suggesting a full language)
> - I think folks tend to share more when using "configurable" systems,
> though I have no proof.  It might just be the larger number of people.
> - Sort of the "blank canvas syndrome" as each person tries to grasp all the
> nuances; granted, the one I'm suggesting merely presents a smaller blank canvas,
> but maybe with crayons and connect the dots, vs. oil paints.
>

It sounds to me like what you are proposing is a reorganization of the
architecture of ManifoldCF so that documents that are fetched by
repository connectors are only obliquely related to documents indexed
through an output connector.  You are proposing that an indexed
document be possibly assembled from multiple connector sources, but
with arbitrary manipulation of the document content along the way.  Is
this correct?

If so, how would you handle document security?  Each repository
connection today specifies the security context for the documents it
fetches.  It also knows about relationships between those documents
that come from the same connector, and about document versioning for
documents fetched from that source.  How does this translate into a
pipelined world in your view?  Is the security of the final indexed
document the intersection of the security for all the sources of the
indexed document?  Is the version of the indexed document the
concatenation of the versions of all the input documents?


>> > Generally it's the need to deeply parse a document before instructing the
>> > spider what next action to take next.
>> >
>>
>> We've looked at this as primarily a connector-specific activity.  For
>> example, you wouldn't want to do such a thing from within documents
>> fetched via JCIFs.  The main use case I can see is in extracting links
>> from web content.
>>
>
> There are so many more things to be extracted in the world.... and things
> that a spider can use.
>
> I don't understand the comment about JCIFs.  Presumably there's still the
> concept of a unit of retrieval, some "document" or "page", with some type
> of title and URL?
>

My example was meant to be instructive, not all-inclusive.  Documents
that are fetched from a Windows shared filesystem do not in general
point at other documents within the same Windows shared filesystem.
There is no point in that case in parsing those documents looking for
links; it just slows the whole system down.  The only kinds of
universal links you might find in a document from any arbitrary source
will likely be web urls, is all that I'm saying.

>
> We might be talking past each other here, and maybe we're already agreeing.
>
> So I'm a developer and I need a fancy connector that pulls from multiple
> sources.
>
> But then I notice that ManifoldCF already has connectors for all 3 of my
> sources.
>
> So yes, I'd need to write some custom code.  But I want to "plug in"
> existing Manifold connectors, and route their output as input to my
> connector.
>
> Or more likely, I'll be pulling "primary" records from one of the existing
> Manifold connectors, and will then make requests to 1 or 2 other MCF
> connectors to fill in additional details.
>
> Maybe I can do this now?  Maybe it's so trivial to you that it didn't even
> seem like a question???
>

No, this is not trivial now.  Code-driven assembly of access to
multiple connectors and amalgamation into documents would, like I
said, require very significant changes to the ManifoldCF architecture.
 For example, your question overlooks several major features of the
way ManifoldCF works at the moment:

- Each repository connector supplies its own means of specifying what
documents should be included in a job, and the means of editing that
specification via the UI.
- Each repository connector knows how to handle document versioning so
that the framework can support incremental crawling properly.
- Each repository connector knows how to generate security information
for the documents it fetches.

If you adopt a code-driven assembly pipeline, it's hard to see how all
of this would come together.  And yet it must, unless what you are
really looking for is Nutch but with repository connector components.
By having the criteria for what documents to include in a crawl be
part of some pipeline code, you take it out of the non-programmer's
hands.  There may be ways of respecifying what a connector is that
dodge this problem but I certainly don't know what that might look
like yet.


>
> I have the book, I'll go check those sections.
>
> Is there some chance of porting some of this info to the Wiki?  I've
> noticed references to the book in a couple emails, which is fine, I
> reference my book every now and then too.  But for open source info it'd
> be kind of a bummer to force people to shell out $ for a book.  Not sure
> what type of a deal you have with the publisher.  Or maybe somebody else
> would have to create the equivalent info from scratch in wiki?
>

The other place it is found is in the javadoc for the IProcessActivity
interface.  You can see how it is used by looking at the RSS
connector.

I'm not trying to make people spend money on a book but I honestly
don't have the time to write the same thing two or three times.
Please feel free to come up with contributions to the ManifoldCF
documentation that are based on the book content.  You obviously can't
cut-and-paste, but you can digest the material if you think it would
be helpful to others.

>>
>> (1) Providing a content-extraction and modification pipeline, for
>> those output connectors that are targeting systems that cannot do
>> content extraction on their own.
>> (2) Providing framework-level services that allow "connectors" to be
>> readily constructed along a pipeline model.
>>
>
> A good start, let me add to it:
>
> (3) Easily use other Connectors as both inputs and outputs to a custom
> connector.
>
> I'm not sure whether it's better to have such hybrids mimic a datasource
> connector, an output connector, or maybe slot in as a security connector?
>
> Conceptually the security connectors are both "input" and "output", so
> presumably would be easier to chain?  But it'd be a bit odd to hang off of
> the "security" side of things for a process that just tweaks metadata.

This is where we run into trouble.  From my perspective, the security
of a document involves stuff that can only come from a repository,
combined with security information that comes from one (or more)
authorities. It makes absolutely no sense to have a "document security
pipeline", because there is not enforcement of security within
ManifoldCF itself; that happens elsewhere.

> Also, I don't know if security connectors have access to all of the data
> and the ability to modify metadata / document content.
>
> (4) Inclusion of Tika for filtering, which is often needed.
>
> (5) Ability for a custom connector to inject additional URLs into the
> various queues
>
> (6) Some type of "accountability" for work that has been submitted.  So a
> record comes in on connector A, I then generate requests to connectors B
> and C, and I'd like to be efficiently called back when those other tasks
> are completed or have failed.
>

I think it is now clear that you are thinking of ManifoldCF as Nutch
with connectors, where you primarily code up your crawl in Java and
fire it off that way.  But that's not the problem that ManifoldCF set
out to address.  I'm not averse to making ManifoldCF change
architecturally to support this kind of thing, but I don't think we
should forget ManifoldCF's primary mission along the way.

>
> Being able to configure custom pipelines would be even better, but not a
> deal breaker.  Obviously most Manifold users are Java coders at the moment,
> so re-usability could come at a later time.
>

Actually, most ManifoldCF users are *not* java coders - that's where
our ideas fundamentally differ.  The whole reason there's a crawler UI
in the first place is so someone who doesn't want to code can set up
crawls and run them.

I believe I have a clearer idea of what you are looking for.  Please
correct me if you disagree.  I'll ponder some of the architectural
questions and see if I can arrive at a proposal that meets most of the
goals.


Karl

Re: Revisiting: Should Manifold include Pipelines

Posted by Mark Bennett <mb...@ideaeng.com>.
Hi Karl,

Thanks for the reply, most comments inline.

General comments:

I was wondering if you've used a custom pipeline like FAST ESP or
Ultraseek's old "patches.py", and if there were any that you liked or
disliked?  In more recent times the OpenPipeline effort has been a bit
nascent, I think in part because it lacks some of the connectors.  Coming from
my background I'm probably a bit biased to thinking of problems in terms of
a pipeline, and it's also a frequent discussion with some of our more
challenging clients.

Generally speaking we define the virtual document to be the basic unit of
retrieval, and it doesn't really matter whether it starts life as a Web
Page or PDF or Outlook node.  Most "documents" have a create / modified
date, some type of title, and a few other semi-common meta data fields.
They do vary by source, but there are mapping techniques.

Having more connector services, or even just more examples, is certainly a
step in the right direction.

But leaving it at writing custom monolithic connectors has a few
disadvantages:
- Not as modular, so discourages code reuse
- Maintains 100% coding, vs. some mix of configure vs. code
- Keeps the bar at rather advanced Java programmers, vs. opening up to
folks that feel more comfortable with "scripting" (of a sort, not
suggesting a full language)
- I think folks tend to share more when using "configurable" systems,
though I have no proof.  It might just be the larger number of people.
- Sort of the "blank canvas syndrome" as each person tries to grasp all the
nuances; granted, the one I'm suggesting merely presents a smaller blank
canvas, but maybe with crayons and connect-the-dots, vs. oil paints.

On to specific comments....

On Mon, Jan 9, 2012 at 6:55 AM, Karl Wright <da...@gmail.com> wrote:

> Hi Mark,
>
> I have some initial impressions; please read below.
>
> On Mon, Jan 9, 2012 at 9:29 AM, Mark Bennett <mb...@ideaeng.com> wrote:
> > We've been hoping to do some work this year to embed pipeline processing
> > into MCF, such as UIMA or OpenPipeline or XPump.
> >
> > But reading through some recent posts there was a discussion about
> leaving
> > this sort of thing to the Solr pipeline, and it suddenly dawned on me
> that
> > maybe not everybody was on board with the idea of moving this into MCF.
> >
>
> Having a pipeline for content extraction is a pretty standard thing
> for a search engine to have.  Having said that, I agree there are
> times when this is not enough.
>

But every engine has a different pipeline, and they're not always
comparable.

And virtually every large company has multiple search engines.  So
re-implementing business logic over and over is expensive and buggy.  And
there's also the question of basic connector and filters availability and
licensing.

And some vendors are fussy about their IP so code is rarely shared online.

And having a standard open source pipeline, that actually gets some use,
benefits from many more users.


>
> > So, before we spin our wheels, I wanted explain some reasons why this
> would
> > be a GOOD thing to do, and get some reactions:
> >
> >
> > 1: Not everybody is using Solr / or using exclusively Solr.
> >
> > Lucene and Solr a great of course, but open source isn't about walled
> > gardens.  Most companies have multiple search engines.
> >
> > And, even if you just wanted to use Lucene (and not Solr), then the Solr
> > pipeline is not very attractive.
> >
> > As an example, the Google appliance gets lots of press for Enterprise
> > search.  And it's got enough traction that their format of connector is
> > starting to be used by other companies.  BUT, at least in the past,
> > Google's document processing wasn't very pipeline friendly.  They had
> calls
> > you could make, but there were issues.
> >
> > Wouldn't it be cool if Manifold could be used to feed Google appliances?
>  I
> > realize some open source folks might not care, but it would suddenly make
> > MCF interesting to a lot more developers.
> >
> > Or look at FAST ESP (which was bought by Microsoft).  FAST ESP had a rich
> > tradition of pipeline goodness, but once Microsoft acquired them, that
> > pipeline technology is being re-cast in a very Microsoft centric stack.
> > That's fine if you're a Microsoft shop, you might like it even better
> than
> > before, but if you're company prefers Linux, you might be looking for
> > something else.
> >
> >
> > 2: Not every information application is about search
> >
> > Classically there's been a need to go from one database to another.  But
> in
> > more recent times there's been a need to go from Databases into Content
> > Management Systems, or from one CMS to another, or to convert one corpus
> of
> > documents into another.
> >
> > Sure there was ETL technology (Extract, Transform, Load), but that tended
> > to be around structured data.
> >
> > More generally there's the class of going between structured and
> > unstructured data, and vice versa.  The latter, going from unstructured
> > back to structured, is where Entity Extraction comes into play, and
> where I
> > had thought MCF could really shine.
> >
> > There's a somewhat subtle point here as well.  There's the format of
> > individual documents or files, such as HTML, PDF, RSS or MS Word, but
> also
> > the type of repository it resides in (filesystem, database, CMS, web
> > services, etc)  I was drawn to MCF for the connections, but a document
> > pipeline would let it also work on the document formats as well.
> >
> >
> > 3: Even spidering to feed a search engine can benefit from "early
> binding"
> > and "extended state"
> >
> > A slight aside: generic web page spidering doesn't often need fancy
> > processing.  What I'm about to talk about might at first seem like "edge
> > cases".  BUT, almost by definition, many of us are not brought into a
> > project unless it's well outside the mainstream use case.  So many
> > programmers find themselves working almost fulltime on rather unusual
> > projects.  Open source is quite attractive because it provides a wealth
> of
> > tools to choose from.
> >
> > "Early Binding" for Spiders:
> >
> > Generally it's the need to deeply parse a document before instructing the
> > spider what next action to take next.
> >
>
> We've looked at this as primarily a connector-specific activity.  For
> example, you wouldn't want to do such a thing from within documents
> fetched via JCIFs.  The main use case I can see is in extracting links
> from web content.
>

There are so many more things to be extracted in the world.... and things
that a spider can use.

I don't understand the comment about JCIFs.  Presumably there's still the
concept of a unit of retrieval, some "document" or "page", with some type
of title and URL?


>
> > Let me give one simple example, but trust me there are many more!
> >
> > Suppose you have Web pages (or PDF files!) filled with part numbers.  And
> > you have a REST API that, presented with a part number, will give more
> > details.
> >
> > But you need to parse out the part numbers in order to create the URLs
> that
> > you need to spider to fetch next.
> >
> > Many other applications of this involve helping the spider decide what
> type
> > of document it has, or what quality of data it's getting.  You might
> decide
> > to tell the spider to drill down deeper, or conversely, give up and work
> on
> > higher value targets.
> >
>
> What you've described is a case by which ManifoldCF obtains content
> references from one source, and indexes them from another.  Today, in
> order to pull that kind of thing off with ManifoldCF, you need to
> write a custom connector to do it.  That's not so unreasonable; it
> involves a lot of domain-specific pieces - e.g. how to obtain the
> PDFs, how to build the URLs, etc.  A similar situation existed for
> crawling Wiki's; there was a custom API which basically worked with
> HTTP requests.  A generic web crawler could have been used but because
> of the very specific requirements for understanding the API it was
> best modeled as a new connector.
>
> I think the rough breakdown of what component of the ManifoldCF system
> is responsible for what remains correct.  Making it easier to
> construct a custom connector in this way, by using building blocks
> that ManifoldCF would make available for all connectors, makes sense
> to some degree.
>

We might be talking past each other here, and maybe we're already agreeing.

So I'm a developer and I need a fancy connector that pulls from multiple
sources.

But then I notice that ManifoldCF already has connectors for all 3 of my
sources.

So yes, I'd need to write some custom code.  But I want to "plug in"
existing Manifold connectors, and route their output as input to my
connector.

Or more likely, I'll be pulling "primary" records from one of the existing
Manifold connectors, and will then make requests to 1 or 2 other MCF
connectors to fill in additional details.

Maybe I can do this now?  Maybe it's so trivial to you that it didn't even
seem like a question???


>
> > I could imagine a workaround where Manifold passes documents to Solr, and
> > then Solr's pipeline later resubmits URLs back into MCF, but it's a lot
> > more direct to just make these determinations more immediately.  In a few
> > cases it WOULD be nice to have Solr's fullword index, so maybe in it'd be
> > nice to have both options.  Commercial software companies would want to
> > make the decision for you, they'd choose one way or the other, but this
> > aint their garden.  ;-)
> >
> >
> > "extended state" for Spiders:
> >
> > This is where you need the context of 2 or 3 pages back in your traversed
> > path in order to make full use of the current page.
> >
>
> ManifoldCF framework uses the concept of "Carrydown" data for this
> purpose.  This is covered in Chapters 6 and 7 of "ManifoldCF in
> Action".
>

I have the book, I'll go check those sections.

Is there some chance of porting some of this info to the Wiki?  I've
noticed references to the book in a couple emails, which is fine, I
reference my book every now and then too.  But for open source info it'd
be kind of a bummer to force people to shell out $ for a book.  Not sure
what type of a deal you have with the publisher.  Or maybe somebody else
would have to create the equivalent info from scratch in wiki?


>
> > Here's an example from a few years back:
> >
> > Steps:
> > 1: Start with a list of concert venue web sites.
> > 2: Foreach venue, lookup the upcoming events, including dates, bands and
> > ticketing links.
> > 3: Foreach band, go to this other site and lookup their albums.
> > 4: Foreach album, lookup each song.
> > 5: Foreach song, go to a third site to get the lyrics.
> >
>
> Yeah, and this is why we have a connector framework.  The actual
> content here is an amalgam of many different pages.  Each individual
> "document" you'd index into your search engine contains content that
> comes from many sources, and it has to be the connector's
> responsibility to pull all that together.
>

Awesome.


>
> > Now users can search for songs including the text in the lyrics.
> > When a match is found, also show them upcoming performances near them,
> and
> > maybe even let them click to buy tickets.
> >
> > You can see that the unit of retrieval is particular songs, in steps 4
> and
> > 5.  But we want data that we parsed from several steps back.
> >
> > Even in the case of song lyrics, where they will have the band's name,
> they
> > might not have the Album title.  (and a song could have been on several
> > albums of course)  So even things you'd expect to be able to parse,
> you've
> > often already had that info during a previous step.
> >
> > I realize MCF probably doesn't include this type of state trail now.
>  But I
> > was thinking it'd at least be easier to build something on top of MCF
> than
> > going way out to Solr and then back into Manifold.
> >
> > In the past I think folks would have used Perl or Python to handcraft
> these
> > types of projects.  But that doesn't scale very well, and you still need
> > persistence for long running jobs, AND it doesn't encourage code reuse.
> >
> >
> > So, Manifold could really benefit from pipelines!
> >
> > I have a lot of technical thoughts about how this might be achieved, and
> a
> > bunch related thoughts.  But if pipelines are really unwelcome, I don't
> > want to force it.
> >
> >
> > One final thought:
> >
> > The main search vendors seem to be abandoning high end, precision
> > spidering.  There's a tendency now to see all the world as "Internet",
> and
> > the data behind firewalls as just "a smaller Internet (intranet)"
> >
> > This is fine for 80-90% of common use cases.
> >
> > But that last 5-10% of atypical projects are HUGELY under-served at this
> > time.  They often have expensive problems that simply won't go away.
> >
> > Sure, true open source folks may or may not care about "markets" or
> > "expensive problems".
> >
> > BUT these are also INTERESTING problems!  If you're bored with
> "appliances"
> > and the latest downloadable free search bar, then trust me, these edge
> > cases will NOT bore you!!!
> >
> > And I've given up on the current crop of Tier 1 search vendors for
> solving
> > anything "interesting".  They are so distracted with so many other
> > things.... and selling a hot 80% solution to a hungry market is fine with
> > them anyway.
> >
> > --
> > Mark Bennett / New Idea Engineering, Inc. / mbennett@ideaeng.com
> > Direct: 408-733-0387 / Main: 866-IDEA-ENG / Cell: 408-829-6513
>
> I'd be interested in hearing your broad-brush idea of a proposal.  It
> seems to me that there are several wholly-independent situations you
> are describing, which don't offhand seem like they'd be served by a
> common architectural component.  These are:
>
> (1) Providing a content-extraction and modification pipeline, for
> those output connectors that are targeting systems that cannot do
> content extraction on their own.
> (2) Providing framework-level services that allow "connectors" to be
> readily constructed along a pipeline model.
>

A good start, let me add to it:

(3) Easily use other Connectors as both inputs and outputs to a custom
connector.

I'm not sure whether it's better to have such hybrids mimic a datasource
connector, an output connector, or maybe slot in as a security connector?

Conceptually the security connectors are both "input" and "output", so
presumably they would be easier to chain?  But it'd be a bit odd to hang off of
the "security" side of things for a process that just tweaks metadata.
Also, I don't know if security connectors have access to all of the data
and the ability to modify metadata / document content.

(4) Inclusion of Tika for filtering, which is often needed.

(5) Ability for a custom connector to inject additional URLs into the
various queues.

(6) Some type of "accountability" for work that has been submitted.  So a
record comes in on connector A, I then generate requests to connectors B
and C, and I'd like to be efficiently called back when those other tasks
are completed or have failed.  (See the sketch after these points.)

And if I've submitted 1,000 URLs to spider, I want to know when either all
of them are done, or if 997 are done, I want to know which 3 didn't make it
and why.

Bulk spiders don't care about this type of thing.  If they miss 10,000
Justin Bieber pages that's fine, there's 20,000,000 more out there.
Precision spiders are at the opposite end of the spectrum, they care about
every single record.
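
To make (6) concrete, the kind of bookkeeping I have in mind is roughly this --
again a toy sketch, not a proposal for the actual API:

    import java.util.*;
    import java.util.concurrent.*;

    // Toy "accountability" tracker: I hand out work items (e.g. URLs submitted to other
    // connectors) and want to know, per batch, which ones finished, which failed, and why.
    class BatchTracker {
      enum Status { PENDING, DONE, FAILED }
      private final Map<String, Status> status = new ConcurrentHashMap<>();
      private final Map<String, String> failureReason = new ConcurrentHashMap<>();

      void submitted(String id)             { status.put(id, Status.PENDING); }
      void completed(String id)             { status.put(id, Status.DONE); }
      void failed(String id, String reason) { status.put(id, Status.FAILED); failureReason.put(id, reason); }

      boolean allAccountedFor() {
        return !status.containsValue(Status.PENDING);
      }

      // The "997 of 1,000 are done -- which 3 didn't make it, and why?" report.
      Map<String, String> failures() {
        Map<String, String> out = new LinkedHashMap<>();
        for (Map.Entry<String, Status> e : status.entrySet())
          if (e.getValue() == Status.FAILED)
            out.put(e.getKey(), failureReason.getOrDefault(e.getKey(), "unknown"));
        return out;
      }
    }

Then when B and C report back, the job that fed them can decide whether to
retry, give up, or mark the batch complete.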

And finally, a few more examples, since I think they're helping the
discussion:

New example 1:

Suppose I have a system that looks for citations to other works, for
example medical research that references other studies, legal case law
precedents, or patents from multiple countries.  These citations could
appear in HTML, MS Word, text files or PDF documents, so I can't just rely
on hyperlinks, and I'd certainly need document filters.  Then, for each
citation, I have canonical sources, which are also selected by rules.  For
example, for Japanese patents, go "here".  And further suppose that there
is some nominal cost to fetching external links, so I don't want to just
spider entire data sources.  And suppose I also don't want to re-download
something I've already paid for.  And suppose further that, for some types
of data, I have a cheap/free data source that *might* have all the info I
need, but if it doesn't, I'll fall back to one of the premium sources.

Sure, somebody could write a giant monolithic connector in Java, but this
really seems like the type of system that would be:
1: Likely to have frequent changes to business rules, so recompiling a Java
app seems a bit onerous.
2: Would be interesting to a number of different organizations BUT with
slightly different business rules, so again easier to reconfigure than to
recode.
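
For instance, the business rules in that example feel like data rather than
code.  Something along these lines -- all the patterns, URLs and field names
below are made up purely for illustration:

    import java.util.*;

    // Illustrative rules table for routing extracted citations to a canonical source.
    // In practice this would live in a config file so it can change without recompiling.
    class CitationRule {
      final String citationPattern;   // regex matched against the extracted citation
      final String sourceUrlTemplate; // where to fetch the canonical record
      final boolean premium;          // costs money, so try cheap/free sources first

      CitationRule(String citationPattern, String sourceUrlTemplate, boolean premium) {
        this.citationPattern = citationPattern;
        this.sourceUrlTemplate = sourceUrlTemplate;
        this.premium = premium;
      }

      static final List<CitationRule> RULES = Arrays.asList(
        new CitationRule("JP\\d{7}",      "https://example.org/jp-patents/%s", true),   // Japanese patents -> premium source
        new CitationRule("US\\d{7,8}",    "https://example.org/us-patents/%s", false),  // free source; fall back to premium if missing
        new CitationRule("PMID:\\s*\\d+", "https://example.org/pubmed/%s",     false)
      );
    }

The generic machinery (fetch, fall back to a premium source, don't re-download
what's already been paid for) is the part that stays the same across
organizations.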

New example 2:
Read in a set of RSS and Atom feeds that cover similar topics.  Attempt to
identify posts that are talking about the same topic.  Output a
consolidated RSS feed that represents the consolidated stories, and then the
stories that appear NOT to be in any other RSS feed.

Obviously there's a lot of statistics or rules or NLP involved, and it would
be an imperfect system.  And many blog posts mention multiple items, and
various posts can talk about the same subject, but at different times and in
different ways, so not really duplicates of each other.  There are LOTS of
imperfections in this spec.  BUT writing all this from scratch would be a
huge hassle.
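
Just to show how little "NLP" you could get away with for a first cut, here's a
crude sketch that groups items by word overlap on their titles (all names
invented; a real system would do much better):

    import java.util.*;

    // Crude sketch of grouping feed items that "look like" the same story, using
    // Jaccard word overlap on normalized titles.  Deliberately imperfect.
    class StoryGrouper {
      static Set<String> tokens(String title) {
        return new HashSet<>(Arrays.asList(
            title.toLowerCase().replaceAll("[^a-z0-9 ]", " ").trim().split("\\s+")));
      }

      static double similarity(String a, String b) {
        Set<String> common = new HashSet<>(tokens(a));
        common.retainAll(tokens(b));
        Set<String> all = new HashSet<>(tokens(a));
        all.addAll(tokens(b));
        return all.isEmpty() ? 0.0 : (double) common.size() / all.size();
      }

      // Greedy clustering: each new title joins the first existing group it resembles.
      static List<List<String>> group(List<String> titles, double threshold) {
        List<List<String>> groups = new ArrayList<>();
        for (String t : titles) {
          List<String> home = null;
          for (List<String> g : groups)
            if (similarity(g.get(0), t) >= threshold) { home = g; break; }
          if (home == null) { home = new ArrayList<>(); groups.add(home); }
          home.add(t);
        }
        return groups;
      }
    }

With a threshold somewhere around 0.5 you'd emit one consolidated entry per
group, plus the singleton groups as the "nobody else has this" stories.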

New example 3:
Read in FAQ pages from various sources.  Break the FAQs apart into
separate documents, such that each holds a single Question/Answer combo.
Store that document in some new repository, and calculate a new URL.
Back-reference the original source, including a #anchor if you can.  For
extra credit, detect actual changes in individual Q/A pairs (vs. reindexing
the entire FAQ page's content just because 1 character changed).
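
A sketch of what I mean, assuming (big assumption) a FAQ page where each
question is an <h3 id="..."> heading and the answer runs until the next <h3>;
the per-pair hash is what would let you skip unchanged Q/A items:

    import java.nio.charset.StandardCharsets;
    import java.security.MessageDigest;
    import java.util.*;
    import java.util.regex.*;

    // Sketch of splitting one FAQ page into per-Q/A documents.
    class FaqSplitter {
      static final Pattern QA = Pattern.compile(
          "<h3[^>]*id=\"([^\"]+)\"[^>]*>(.*?)</h3>(.*?)(?=<h3|$)",
          Pattern.DOTALL | Pattern.CASE_INSENSITIVE);

      static List<Map<String,String>> split(String pageUrl, String html) throws Exception {
        MessageDigest sha = MessageDigest.getInstance("SHA-256");
        List<Map<String,String>> docs = new ArrayList<>();
        Matcher m = QA.matcher(html);
        while (m.find()) {
          String anchor = m.group(1), question = m.group(2), answer = m.group(3);
          Map<String,String> doc = new HashMap<>();
          doc.put("url", pageUrl + "#" + anchor);          // back-reference with anchor
          doc.put("question", question.trim());
          doc.put("answer", answer.trim());
          doc.put("hash", Base64.getEncoder().encodeToString(
              sha.digest((question + "\u0000" + answer).getBytes(StandardCharsets.UTF_8))));
          docs.add(doc);
        }
        return docs;
      }
    }

Each returned map would become its own "document" handed to the output
connector, with the hash compared against the previous crawl to detect real
changes.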

New example 4: (slight repeat from last email, but more detailed)
Migrate content from a legacy CMS to a new super CMS.  But the legacy CMS doesn't
have a connector, only a web interface.  So items from the old CMS system
will be wrapped with navigational "cruft" that needs to be removed.  Also,
the same content can appear in multiple places, for example in a listing
for "February 2009", and also on its own canonical page.  Further, assume
that some rules require looking at the URL, BUT it has various session
things that need to be ignored or normalized.  Also run the content past
some business rules to assign it to 1 or more categories.
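
The URL normalization piece of that might look roughly like this -- the session
parameter names are just common examples, not an exhaustive list:

    import java.net.URI;
    import java.net.URISyntaxException;
    import java.util.*;

    // Sketch of the URL normalization step: drop session junk so two fetches of the
    // same item can be recognized as the same thing.
    class UrlNormalizer {
      static final Set<String> SESSION_PARAMS =
          new HashSet<>(Arrays.asList("jsessionid", "phpsessid", "sessionid", "sid"));

      static String normalize(String url) throws URISyntaxException {
        URI u = new URI(url);
        // Strip ";jsessionid=..." style path suffixes
        String path = u.getPath() == null ? "" : u.getPath().replaceAll(";jsessionid=[^/?#]*", "");
        // Keep only non-session query parameters, in a stable order
        List<String> kept = new ArrayList<>();
        if (u.getQuery() != null)
          for (String pair : u.getQuery().split("&")) {
            String name = pair.split("=", 2)[0].toLowerCase();
            if (!SESSION_PARAMS.contains(name)) kept.add(pair);
          }
        Collections.sort(kept);
        String query = kept.isEmpty() ? null : String.join("&", kept);
        return new URI(u.getScheme(), u.getAuthority(), path, query, null).toString();
      }
    }

That only handles the session junk, of course; recognizing the "February 2009"
listing copy versus the canonical page still needs a content hash or a
site-specific rule on top.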

All of these things really benefit from connectors!  But they also need
some non-traditional spider logic.

And repeating, being able to do these things in a monolithic Manifold
process would be better than nothing.

Being able to configure custom pipelines would be even better, but not a
deal breaker.  Obviously most Manifold users are Java coders at the moment,
so re-usability could come at a later time.


>
> Karl
>

As always Karl I appreciate your comments and efforts!

Re: Revisiting: Should Manifold include Pipelines

Posted by Karl Wright <da...@gmail.com>.
Hi Mark,

I have some initial impressions; please read below.

On Mon, Jan 9, 2012 at 9:29 AM, Mark Bennett <mb...@ideaeng.com> wrote:
> We've been hoping to do some work this year to embed pipeline processing
> into MCF, such as UIMA or OpenPipeline or XPump.
>
> But reading through some recent posts there was a discussion about leaving
> this sort of thing to the Solr pipeline, and it suddenly dawned on me that
> maybe not everybody was on board with the idea of moving this into MCF.
>

Having a pipeline for content extraction is a pretty standard thing
for a search engine to have.  Having said that, I agree there are
times when this is not enough.

> So, before we spin our wheels, I wanted explain some reasons why this would
> be a GOOD thing to do, and get some reactions:
>
>
> 1: Not everybody is using Solr / or using exclusively Solr.
>
> Lucene and Solr a great of course, but open source isn't about walled
> gardens.  Most companies have multiple search engines.
>
> And, even if you just wanted to use Lucene (and not Solr), then the Solr
> pipeline is not very attractive.
>
> As an example, the Google appliance gets lots of press for Enterprise
> search.  And it's got enough traction that their format of connector is
> starting to be used by other companies.  BUT, at least in the past,
> Google's document processing wasn't very pipeline friendly.  They had calls
> you could make, but there were issues.
>
> Wouldn't it be cool if Manifold could be used to feed Google appliances?  I
> realize some open source folks might not care, but it would suddenly make
> MCF interesting to a lot more developers.
>
> Or look at FAST ESP (which was bought by Microsoft).  FAST ESP had a rich
> tradition of pipeline goodness, but once Microsoft acquired them, that
> pipeline technology is being re-cast in a very Microsoft centric stack.
> That's fine if you're a Microsoft shop, you might like it even better than
> before, but if you're company prefers Linux, you might be looking for
> something else.
>
>
> 2: Not every information application is about search
>
> Classically there's been a need to go from one database to another.  But in
> more recent times there's been a need to go from Databases into Content
> Management Systems, or from one CMS to another, or to convert one corpus of
> documents into another.
>
> Sure there was ETL technology (Extract, Transform, Load), but that tended
> to be around structured data.
>
> More generally there's the class of going between structured and
> unstructured data, and vice versa.  The latter, going from unstructured
> back to structured, is where Entity Extraction comes into play, and where I
> had thought MCF could really shine.
>
> There's a somewhat subtle point here as well.  There's the format of
> individual documents or files, such as HTML, PDF, RSS or MS Word, but also
> the type of repository it resides in (filesystem, database, CMS, web
> services, etc)  I was drawn to MCF for the connections, but a document
> pipeline would let it also work on the document formats as well.
>
>
> 3: Even spidering to feed a search engine can benefit from "early binding"
> and "extended state"
>
> A slight aside: generic web page spidering doesn't often need fancy
> processing.  What I'm about to talk about might at first seem like "edge
> cases".  BUT, almost by definition, many of us are not brought into a
> project unless it's well outside the mainstream use case.  So many
> programmers find themselves working almost fulltime on rather unusual
> projects.  Open source is quite attractive because it provides a wealth of
> tools to choose from.
>
> "Early Binding" for Spiders:
>
> Generally it's the need to deeply parse a document before instructing the
> spider what next action to take next.
>

We've looked at this as primarily a connector-specific activity.  For
example, you wouldn't want to do such a thing from within documents
fetched via JCIFS.  The main use case I can see is in extracting links
from web content.

> Let me give one simple example, but trust me there are many more!
>
> Suppose you have Web pages (or PDF files!) filled with part numbers.  And
> you have a REST API that, presented with a part number, will give more
> details.
>
> But you need to parse out the part numbers in order to create the URLs that
> you need to spider to fetch next.
>
> Many other applications of this involve helping the spider decide what type
> of document it has, or what quality of data it's getting.  You might decide
> to tell the spider to drill down deeper, or conversely, give up and work on
> higher value targets.
>

What you've described is a case by which ManifoldCF obtains content
references from one source, and indexes them from another.  Today, in
order to pull that kind of thing off with ManifoldCF, you need to
write a custom connector to do it.  That's not so unreasonable; it
involves a lot of domain-specific pieces - e.g. how to obtain the
PDFs, how to build the URLs, etc.  A similar situation existed for
crawling Wikis; there was a custom API which basically worked with
HTTP requests.  A generic web crawler could have been used, but because
of the very specific requirements for understanding the API, it was
best modeled as a new connector.

I think the rough breakdown of what component of the ManifoldCF system
is responsible for what remains correct.  Making it easier to
construct a custom connector in this way, by using building blocks
that ManifoldCF would make available for all connectors, makes sense
to some degree.

> I could imagine a workaround where Manifold passes documents to Solr, and
> then Solr's pipeline later resubmits URLs back into MCF, but it's a lot
> more direct to just make these determinations more immediately.  In a few
> cases it WOULD be nice to have Solr's fullword index, so maybe in it'd be
> nice to have both options.  Commercial software companies would want to
> make the decision for you, they'd choose one way or the other, but this
> aint their garden.  ;-)
>
>
> "extended state" for Spiders:
>
> This is where you need the context of 2 or 3 pages back in your traversed
> path in order to make full use of the current page.
>

ManifoldCF framework uses the concept of "Carrydown" data for this
purpose.  This is covered in Chapters 6 and 7 of "ManifoldCF in
Action".

> Here's an example from a few years back:
>
> Steps:
> 1: Start with a list of concert venue web sites.
> 2: Foreach venue, lookup the upcoming events, including dates, bands and
> ticketing links.
> 3: Foreach band, go to this other site and lookup their albums.
> 4: Foreach album, lookup each song.
> 5: Foreach song, go to a third site to get the lyrics.
>

Yeah, and this is why we have a connector framework.  The actual
content here is an amalgam of many different pages.  Each individual
"document" you'd index into your search engine contains content that
comes from many sources, and it has to be the connector's
responsibility to pull all that together.

> Now users can search for songs including the text in the lyrics.
> When a match is found, also show them upcoming performances near them, and
> maybe even let them click to buy tickets.
>
> You can see that the unit of retrieval is particular songs, in steps 4 and
> 5.  But we want data that we parsed from several steps back.
>
> Even in the case of song lyrics, where they will have the band's name, they
> might not have the Album title.  (and a song could have been on several
> albums of course)  So even things you'd expect to be able to parse, you've
> often already had that info during a previous step.
>
> I realize MCF probably doesn't include this type of state trail now.  But I
> was thinking it'd at least be easier to build something on top of MCF than
> going way out to Solr and then back into Manifold.
>
> In the past I think folks would have used Perl or Python to handcraft these
> types of projects.  But that doesn't scale very well, and you still need
> persistence for long running jobs, AND it doesn't encourage code reuse.
>
>
> So, Manifold could really benefit from pipelines!
>
> I have a lot of technical thoughts about how this might be achieved, and a
> bunch related thoughts.  But if pipelines are really unwelcome, I don't
> want to force it.
>
>
> One final thought:
>
> The main search vendors seem to be abandoning high end, precision
> spidering.  There's a tendency now to see all the world as "Internet", and
> the data behind firewalls as just "a smaller Internet (intranet)"
>
> This is fine for 80-90% of common use cases.
>
> But that last 5-10% of atypical projects are HUGELY under-served at this
> time.  They often have expensive problems that simply won't go away.
>
> Sure, true open source folks may or may not care about "markets" or
> "expensive problems".
>
> BUT these are also INTERESTING problems!  If you're bored with "appliances"
> and the latest downloadable free search bar, then trust me, these edge
> cases will NOT bore you!!!
>
> And I've given up on the current crop of Tier 1 search vendors for solving
> anything "interesting".  They are so distracted with so many other
> things.... and selling a hot 80% solution to a hungry market is fine with
> them anyway.
>
> --
> Mark Bennett / New Idea Engineering, Inc. / mbennett@ideaeng.com
> Direct: 408-733-0387 / Main: 866-IDEA-ENG / Cell: 408-829-6513

I'd be interested in hearing your broad-brush idea of a proposal.  It
seems to me that there are several wholly-independent situations you
are describing, which don't offhand seem like they'd be served by a
common architectural component.  These are:

(1) Providing a content-extraction and modification pipeline, for
those output connectors that are targeting systems that cannot do
content extraction on their own.
(2) Providing framework-level services that allow "connectors" to be
readily constructed along a pipeline model.

Karl