Posted to general@incubator.apache.org by Grant Ingersoll <gs...@apache.org> on 2010/01/08 14:51:57 UTC

[VOTE] Incubate Lucene Connector Framework

Hi,

Given the lack of response on the proposal, I'll assume lazy consensus and call a vote.

On behalf of the Lucene PMC, I'd like to propose incubation for a new Lucene
subproject called the Lucene Connector Framework (LCF). I think we have all the
necessary bits in place for the proposal to go forward.

Proposal: http://wiki.apache.org/incubator/LuceneConnectorFrameworkProposal

[] +1. Accept LCF into the Incubator.
[] 0.  Don't care.
[] -1. Do not accept (and why.)

Here's my +1.

Thanks, Grant Ingersoll



------ Wiki Text Copied Below -----

Lucene Connector Framework

Abstract

Many, many search engines, as well as other applications, have a need to connect
with content repositories (SharePoint, CMS, Documentum, etc.) in a standard
manner. The Lucene Connector Framework (LCF) is a project aimed at building out
these connectors in open source under the Apache brand.

Proposal

The goal of LCF is to create a viable Lucene subproject aimed at delivering a
best of breed connector framework under the Apache Lucene name. As a framework,
the project will not only provide a way to connect to individual repositories,
but also a mechanism for plugging in new connectors or custom connectors in a
straightforward manner.

A connector framework is vital for search engines and other tools that need to
access data located in corporate repositories. By abstracting the problem into a
framework, applications can code to a set of well-defined interfaces instead of
having to use a different interface for each connector.
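
As a rough illustration of what coding to one set of interfaces might look
like, here is a minimal Java sketch. All names in it (RepositoryConnector,
listDocumentIds, fetch, InMemoryConnector, Indexer) are hypothetical and are
not taken from the LCF code base; the in-memory map merely stands in for a
real repository.

```java
import java.util.*;

// Hypothetical interface: none of these names come from the LCF source.
interface RepositoryConnector {
    /** Identifiers of all documents currently visible in the repository. */
    List<String> listDocumentIds();

    /** Content of one document, or null if it no longer exists. */
    String fetch(String documentId);
}

// Toy connector backed by a sorted map, standing in for a real repository.
class InMemoryConnector implements RepositoryConnector {
    private final Map<String, String> docs;

    InMemoryConnector(Map<String, String> docs) {
        this.docs = new TreeMap<>(docs);
    }

    public List<String> listDocumentIds() {
        return new ArrayList<>(docs.keySet());
    }

    public String fetch(String documentId) {
        return docs.get(documentId);
    }
}

// The indexing side sees only the interface, never a concrete repository API.
class Indexer {
    static Map<String, String> indexAll(RepositoryConnector connector) {
        Map<String, String> index = new LinkedHashMap<>();
        for (String id : connector.listDocumentIds()) {
            String content = connector.fetch(id);
            if (content != null) {
                index.put(id, content);
            }
        }
        return index;
    }
}
```

Under these assumptions, supporting a new repository would mean writing
another RepositoryConnector implementation while Indexer stays unchanged.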

Connector Framework is an extendible incremental crawler, which uses a database
to manage configuration and crawl history, and provides reasonably high
performance in accessing content in multiple repositories for the main purpose
of search engine indexing. Connector Framework also establishes a
repository-specific security model which can be used to limit search user access
to repository content based on a user's identity. Connector Framework also
includes existing connectors and authorities for:

• File system
• Windows shares
• JDBC-supported databases
• RSS feeds
• General websites
• LiveLink [from OpenText]
• Documentum [from EMC]
• SharePoint [from Microsoft]
• Meridio [from Meridio]
• Memex [from Memex]
• FileNet [from IBM]

Key design points for Connector Framework are as follows:

• Extendability - you can add new connectors for new repositories, and new
authorities for specific repository security models
• Incrementality - the ability to process only what changed between crawls, in
a repository-specific manner
• Restartability - using a database with ACID properties to ensure that crawls
are safe against process interruption or machine shutdown
• Security - establishing a model of security tokens that allows a search
engine to enforce a repository's security model
• Limited footprint - ability to operate reliably within a fixed amount of
process memory, regardless of configuration
• Performance - management of connector-specific resources to maximize overall
throughput
• Transparency - ability to generate reports on the activity of all crawls and
repository connections
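
Two of these design points, incrementality and security tokens, can be
sketched in a few lines of Java. This is only an illustration under stated
assumptions, not LCF's actual algorithm: it assumes each document has a
version string recorded at crawl time, and that a document is visible when the
user holds at least one of its "allow" tokens and none of its "deny" tokens.
The class and method names (IncrementalCrawl, TokenFilter, changed, deleted,
canSee) are hypothetical.

```java
import java.util.*;

// Hypothetical sketch of incremental crawling; not LCF's actual algorithm.
class IncrementalCrawl {
    /** Ids whose version string is new or differs from the previous crawl. */
    static Set<String> changed(Map<String, String> previous,
                               Map<String, String> current) {
        Set<String> result = new TreeSet<>();
        for (Map.Entry<String, String> e : current.entrySet()) {
            if (!Objects.equals(e.getValue(), previous.get(e.getKey()))) {
                result.add(e.getKey());
            }
        }
        return result;
    }

    /** Ids recorded by the previous crawl that are gone from the repository. */
    static Set<String> deleted(Map<String, String> previous,
                               Map<String, String> current) {
        Set<String> result = new TreeSet<>(previous.keySet());
        result.removeAll(current.keySet());
        return result;
    }
}

// Hypothetical token check a search engine could apply at query time.
class TokenFilter {
    /** Assumed rule: any matching deny token hides the document. */
    static boolean canSee(Set<String> userTokens,
                          Set<String> allowTokens,
                          Set<String> denyTokens) {
        if (!Collections.disjoint(userTokens, denyTokens)) {
            return false;
        }
        return !Collections.disjoint(userTokens, allowTokens);
    }
}
```

In this sketch only the ids returned by changed() would be re-fetched and
re-indexed, and the ids returned by deleted() would be removed from the index.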

Background

MetaCarta originally approached Grant Ingersoll from the Lucene PMC about
donating their existing connector framework to the Lucene PMC. After some
discussion about accepting it as a software grant, the PMC decided it would be
best to incubate the project first.

Rationale

The Connector Framework fills an often significant gap in the Lucene experience,
namely, how to get content locked away in a content repository into
Lucene/Solr/Nutch/Mahout/Tika. Naturally, many other tools (search engines and
others) will also have this same problem. A Connector Framework would also be
useful for someone wishing to migrate between content repositories, too.

Current Status

Connector Framework has been under development and in use in the field for close
to five years, deployed on a MetaCarta search appliance. Almost all development
of the project has been done by Karl Wright ( kwri...@metacarta.com ). Some
individual connectors were developed initially by contractors hired by
MetaCarta, Inc., but maintenance and further development are currently handled
by the MetaCarta team.

Development of Connector Framework can therefore be viewed as core framework
development, plus development of individual connectors. Core framework
development is currently not a terribly collaborative process, as there are no
maintainers of the core functionality other than Mr. Wright. Development of new
connectors has been done in the past in a much more collaborative way by
supplying a developer with a "development kit", and then integrating the
resulting connector (with whatever changes might have been necessary) into the
source tree.

Reasonable efforts have been made to maintain the generality of the code base
during the time that MetaCarta has owned it. Nevertheless, certain
MetaCarta-specific changes have been made which may require review and
modification. The following areas probably need to be addressed in the code
before graduation can occur:

• Branding. The UI brands it as a MetaCarta project.

• Package names. Package names would have to be changed.

• How Connector Framework handles document delivery needs to be generalized, at
least for a single, configurable target output connector, and perhaps for
multiple, independently-configurable targets. Simple example output connectors
need to be written. Work in this direction is currently underway at MetaCarta
and may or may not be complete at the time of the code handover.

• Connector Framework-specific dependent package modifications need to be
addressed somehow. For instance, the following projects that Connector Framework
depends upon have been modified, but the modifications have not been accepted
upstream: commons-httpclient NTLMv2 and NTLM2 support [RSS, Web, SharePoint,
Meridio, and Livelink connectors]; commons-httpclient custom HTTPS protocol
factory support [Web, SharePoint, Meridio, and Livelink connectors]; xerces
ability to handle non-legal RSS feeds [RSS and Web connectors]

• MetaCarta-specific features, like document templates, are explicitly handled
by the UI and the infrastructure. These features should be generalized so that
they are controlled by the choice of output connector.

• Some specific hooks, namely support for configuration change notification,
and for database maintenance notification, may need to be made more generic.

• Share Connector has a "fingerprinting" feature, which prefilters documents
based on a document type it surmises using a document inspection technique. This
feature is only viable at the moment for very basic document types. It should
either be removed, or generalized significantly to be much more flexible.

• Documentation needs to be fleshed out, including javadoc and overall usage
documents.

• Tests need to be written and/or ported from MetaCarta's test suite.

Longer term, the project will likely grow into a more distributed crawler, where
multiple machines might well be involved in coordinated crawling activity.

Meritocracy

Building the community using a meritocratic approach is very important to the
success of LCF. We know many, many people in the search space (and otherwise)
have either written their own connectors or are in need of connectors. Thus, we
expect a meritocratic community will lead to widespread participation.

Community

Our hope is that our existing code, features and capabilities will attract a
large community of both developers and users. We also believe that other
organizations will find this project interesting and relevant, and contribute
resources.

The user community of LCF would be similar to that of the other Lucene projects,
and in many cases they would overlap.

Core Developers

See the initial committer list below.

Alignment

We expect LCF will align quite well with the existing Lucene community and will
also provide significant value to other ASF and non-ASF projects as well as many
companies and individuals looking to access their content repositories in a
programmatic fashion.

Known Risks

Orphaned Products

The Connector Framework is an important piece of any search engine, including
MetaCarta's, as it provides the primary mechanism for getting content out of a
repository and into the search engine's index. Thus, we don't expect it will be
orphaned anytime soon. Once the project is established and the code is
available, we expect to attract not only other search companies, but others with
similar needs.

Inexperience with Open Source

Grant Ingersoll, Ryan McKinley and Simon Willnauer provide the majority of the
experience with Open Source at the ASF, but all of the initial committers are
familiar with Open Source and have contributed to other open source projects.

Homogeneous Developers

Most of the current committers are members of either the MetaCarta or Lucid
Imagination developer team, but several are not. Additionally, we are
actively recruiting other developers.

Reliance on Salaried Developers

We have a variety of committers represented. Some are being paid to work on the
project and some are not.

Cryptography

Connector Framework itself has no real cryptography component, although it does
currently obfuscate passwords it saves to the database or to a configuration
file using a proprietary algorithm. The algorithm is present simply to avoid
using cleartext and is not secure in any sense other than by obscurity.

Various connectors, such as Share Connector, Web Connector, RSS Connector,
SharePoint Connector, LiveLink Connector, and Meridio Connector make use of
cryptographic principles via secondary libraries. Specifically, these connectors
support NTLM, NTLMv2, and NTLM2 Session authentication via commons-httpclient
and jCIFS. The changes to commons-httpclient necessary to support these
varieties of Windows protocols have not yet been accepted upstream by the Apache
httpclient project.

It is unknown at this time exactly to what degree the Oracle JDBC driver, the
jTDS JDBC driver, or the PostgreSQL JDBC driver uses cryptography. Also, the
FileNet API classes, the Memex API classes, the OpenText LAPI classes, and the
Documentum DFC classes all may or may not use cryptography.

Legal Concerns

Some of the connectors in the existing framework require paid licenses to use.
We will need to evaluate each connector to see what can be appropriately
included. For those connectors that require a paid license, we will need to
determine a plan for including the wrapper code without the underlying bindings
in a legal manner. We expect we can provide the wrapper code without the binding
and that the code will thus only be compilable by someone who has access to the
binding. (This is what Google has done for their individual connectors). Longer
term, we expect to demonstrate to the companies with proprietary connectors why
it is more valuable for them to open up their specific connector pieces to give
broader access to people looking to leverage their content in the repository.

Trademark

The project is being rebranded from a MetaCarta internal name to the Lucene
Connector Framework, which will be an ASF mark.

Relationships with Other Apache Products

We expect almost all of the Apache Lucene ecosystem will benefit from having a
standard way of connecting to content repositories. Additionally, users of UIMA
should also benefit. We also see an especially tight connection with Tika, as
much of the content in these types of repositories consists of "rich" document
types which will then need their content extracted.

An Excessive Fascination with the Apache Brand

All of us are familiar with the value that Apache brings to a project in
building out a community. We also are all significant users of Apache Lucene and
related tools (Solr, Nutch, Mahout, Tika) and expect a close relationship with
those projects will help significantly grow the LCF community.

Documentation

MetaCarta has end-user documentation for Lucene Connector Framework, which might
function as the core of the open-source end-user documentation. The
documentation is in LaTeX form, and thus usable sources can readily be
extracted. Any ownership issues for the documentation as it stands still need to
be researched.

The existing javadoc of the code, while fairly extensive, needs review and
perhaps augmentation to ensure it meets the needs of an ASF project. Significant
attention was paid to maintaining its accuracy during MetaCarta's ownership of
the code base.

Initial Source

All initial sources will be coming from MetaCarta, Inc., with the goal of
folding in changes from others shortly thereafter.

Source and Intellectual Property Submission Plan

Code IP grants need to be made from MetaCarta, Inc. In addition, several
connectors (notably Documentum, LiveLink, Memex, and FileNet) rely directly on
client APIs in order to be compiled. Another connector (JDBC) relies on the
existence of the Oracle JDBC Driver in the classpath in order to enable crawls
against Oracle databases.
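
One common way to handle such an optional classpath dependency at runtime is
to probe for the driver class before enabling the feature. The Java sketch
below illustrates that general technique only; it is not taken from the LCF
code, and the names OptionalLibrary and isAvailable are hypothetical.

```java
// Hypothetical helper: reports whether an optional class can be loaded.
class OptionalLibrary {
    static boolean isAvailable(String className) {
        try {
            // "false" asks the JVM to load the class without initializing it.
            Class.forName(className, false,
                          OptionalLibrary.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            // Missing or unloadable library: treat the feature as disabled.
            return false;
        }
    }
}
```

A JDBC connector could, for example, call
OptionalLibrary.isAvailable("oracle.jdbc.OracleDriver") and offer Oracle crawls
only when it returns true, assuming that class name matches the driver version
in use.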

It is unlikely that EMC, OpenText, Memex, or IBM would grant
Apache-license-compatible use of these client libraries. Thus, the expectation
is that users of these connectors obtain the necessary client libraries from the
owners prior to building or using the corresponding connector. An alternative
would be to undertake a clean-room implementation of the client APIs, which may
well yield suitable results in some cases (LiveLink, Memex, FileNet), while
being out of reach in others (Documentum). Conditional compilation, for the
short term, is thus likely to be a necessity.

Other external dependencies, such as jCIFS for the Share Connector, are licensed
with LGPL, and thus may need to be treated in a manner similar to the closed
APIs even though they are open source. These include the PostgreSQL JDBC
driver, and jTDS.

The Lucene Connector Framework core and individual connectors are completely
separable, and many of the connectors require no third party licenses.
Therefore, there is significant utility for this project even in the absence of
any third-party software grants, or clean-room engineering.

The software grant will be faxed to the Apache Software Foundation if and when
the proposal herein described is accepted. MetaCarta patents are not infringed
by this grant. Also, MetaCarta trademarks are not included in this grant.

External Dependencies

The project dependencies, other than on other Apache projects, are as follows:

The ConnectorFramework core currently uses the Bitmechanic JDBC pool driver,
which is BSD licensed, and the Postgresql JDBC driver, which is also BSD
licensed.

The LiveLink Connector relies on LAPI, which is privately licensed by OpenText.
The Documentum Connector relies on DFC, which is privately licensed by EMC. The
Share Connector relies on jCIFS, which is LGPL. The Memex Connector relies on
privately licensed java libraries from Memex. The FileNet Connector relies on
privately licensed java libraries from IBM.

Required Resources

• Mailing lists
  • connectors-private (with moderated subscriptions)
  • connectors-user@
  • connectors-dev@
  • connectors-commit@
• Subversion directory
  • https://svn.apache.org/repos/asf/incubator/connectors
• Website
  • Confluence (CONNECTORS)
• Issue Tracking
  • JIRA (CONNECTORS)

Initial Committers

Names of initial committers with affiliation and current ASF status:

• Karl Wright (kwright at metacarta)
• Josiah Strandberg (jstrandberg at metacarta)
• Ken Baker (bakerkj at metacarta)
• Marc Meadows (mam at metacarta)
• Grant Ingersoll (gsingers@a.o, Lucid Imagination, ASF Member)
• Brian Pinkerton (brian.pinkerton at Lucid Imagination)
• Simon Willnauer (simonw at apache org, Committer on Lucene Java and Lucene
Open Relevance Project)
• Ryan McKinley (ryan at apache org, Committer on Lucene and Solr)
• Robert Muir (rmuir at apache org, Committer on Lucene and Open Relevance)
• Sami Siren (siren@a.o, Committer on Nutch and Tika)
• Otis Gospodnetic (otis@a.o, Committer on Lucene, Solr, Nutch, Mahout, and
Open Relevance Project)
• Shalin Shekhar Mangar (shalin@a.o, AOL, Committer on Apache Solr)
• Noble Paul (noble@a.o, AOL, Committer on Apache Solr)
• George Aroush (george at aroush.net, Committer on Lucene.Net)

Sponsors

Champion

• Grant Ingersoll

Nominated Mentors

• Grant Ingersoll • Jukka Zitting • Gianugo Rabellino

Sponsoring Entity

• Apache Lucene PMC: Message ID: AF7E...@gmail.com in private@lucene.a.o



Re: [VOTE] Incubate Lucene Connector Framework

Posted by "Mattmann, Chris A (388J)" <ch...@jpl.nasa.gov>.
[X] +1. Accept LCF into the Incubator.

(my non-binding vote)

Cheers,
Chris



On 1/8/10 11:31 AM, "Bertrand Delacretaz" <bd...@apache.org> wrote:

> [X] +1. Accept LCF into the Incubator.

Cool stuff!
-Bertrand

---------------------------------------------------------------------
To unsubscribe, e-mail: general-unsubscribe@incubator.apache.org
For additional commands, e-mail: general-help@incubator.apache.org




++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: Chris.Mattmann@jpl.nasa.gov
WWW:   http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++


Re: [VOTE] Incubate Lucene Connector Framework

Posted by Bertrand Delacretaz <bd...@apache.org>.
> [X] +1. Accept LCF into the Incubator.

Cool stuff!
-Bertrand



Re: [VOTE] Incubate Lucene Connector Framework

Posted by Niklas Gustavsson <ni...@protocol7.com>.
On Fri, Jan 8, 2010 at 2:51 PM, Grant Ingersoll <gs...@apache.org> wrote:
> On behalf of the Lucene PMC, I'd like to propose incubation for a new Lucene
> subproject called the Lucene Connector Framework (LCF). I think we have all the
> necessary bits in place for the proposal to go forward.

+1 (non-binding), sounds very useful!

/niklas



Re: [VOTE] Incubate Lucene Connector Framework

Posted by Tommaso Teofili <to...@gmail.com>.
my (non binding) +1

Cheers,
Tommaso

2010/1/9 Otis Gospodnetic <ot...@yahoo.com>

> +1
>  Otis
> --
> Sematext -- http://sematext.com/ -- Solr - Lucene - Nutch
>
>
>
> ----- Original Message ----
> > From: Grant Ingersoll <gs...@apache.org>
> > To: general@incubator.apache.org
> > Sent: Fri, January 8, 2010 8:51:57 AM
> > Subject: [VOTE] Incubate Lucene Connector Framework
> >
> > Hi,
> >
> > Given the lack of response on the proposal, I'll assume lazy consensus
> and call
> > a vote.
> >
> > On behalf of the Lucene PMC, I'd like to propose incubation for a new
> Lucene
> > subproject called the Lucene Connector Framework (LCF). I think we have
> all the
> > necessary bits in place for the proposal to go forward.
> >
> > Proposal:
> http://wiki.apache.org/incubator/LuceneConnectorFrameworkProposal
> >
> > [] +1. Accept LCF into the Incubator.
> > [] 0.  Don't care.
> > [] -1. Do not accept (and why.)
> >
> > Here's my +1.
> >
> > Thanks, Grant Ingersoll
> >
> >
> >
> > ------ Wiki Text Copied Below -----
> >
> > Lucene Connector Framework
> >
> > Abstract
> >
> > Many, many search engines, as well as other applications, have a need to
> connect
> > with content repositories (SharePoint, CMS, Documentum, etc.) in a
> standard
> > manner. The Lucene Connector Framework (LCF) is a project aimed at
> building out
> > these connectors in open source under the Apache brand.
> >
> > Proposal
> >
> > The goal of LCF is to create a viable Lucene subproject aimed at
> delivering a
> > best of breed connector framework under the Apache Lucene name. As a
> framework,
> > the project will not only provide a way to connect to individual
> repositories,
> > but also a mechanism for plugging in new connectors or custom connectors
> in a
> > straightforward manner.
> >
> > A connector framework is vital for search engines and other tools that
> need to
> > access data located in corporate repositories. By abstracting the problem
> into a
> > framework, applications can code to a set of well-defined interfaces
> instead of
> > having to use a different interface for each connector.
> >
> > Connector Framework is an extendible incremental crawler, which uses a
> database
> > to manage configuration and crawl history, and provides reasonably high
> > performance in accessing content in multiple repositories for the main
> purpose
> > of search engine indexing. Connector Framework also establishes a
> > repository-specific security model which can be used to limit search user
> access
> > to repository content based on a user's identity. Connector Framework
> also
> > includes existing connectors and authorities for:
> >
> > • File system • Windows shares • JDBC-supported databases • RSS feeds •
> General
> > websites • LiveLink [from OpenText]
> >
> > • Documentum [from EMC] • SharePoint [from Microsoft]
> >
> > • Meridio [from Meridio] • Memex [from Memex] • FileNet [from IBM]
> >
> > Key design points for Connector Framework are as follows:
> >
> > • Extendability - you can add new connectors for new repositories, and
> new
> > authorities for specific repository security models • Incrementality -
> the
> > ability to process only what changed between crawls, in
> > a repository-specific manner • Restartability - using a database with
> ACID
> > properties to insure that crawls
> > are safe against process interruption or machine shutdown • Security -
> > establishing a model of security tokens that allows a search
> > engine to enforce a repository's security model • Limited footprint -
> ability to
> > operate reliably within a fixed amount of
> > process memory, regardless of configuration • Performance - management of
> > connector-specific resources to maximize overall
> > thoughput • Transparency - ability to generate reports on the activity of
> all
> > crawls and
> > repository connections
> >
> > Background
> >
> > MetaCarta originally approached Grant Ingersoll from the Lucene PMC about
> > donating their existing connector framework to the Lucene PMC. After some
> > discussion about accepting it as a software grant, the PMC decided it
> would be
> > best to incubate the project first.
> >
> > Rationale
> >
> > The Connector Framework fills an often significant gap in the Lucene
> experience,
> > namely, how to get content locked away in a content repository into
> > Lucene/Solr/Nutch/Mahout/Tika. Naturally, many other tools (search
> engines and
> > others) will also have this same problem. A Connector Framework would
> also be
> > useful for someone wishing to migrate between content repositories, too.
> >
> > Current Status
> >
> > Connector Framework has been under development and in use in the field
> for close
> > to five years, deployed on a MetaCarta search appliance. Almost all
> development
> > of the project has been done by Karl Wright ( kwri...@metacarta.com ).
> Some
> > individual connectors were developed initially by contractors hired by
> > MetaCarta, Inc., but maintenance and further development is currently
> handled by
> > the MetaCarta team.
> >
> > Development of Connector Framework can therefore be viewed as core
> framework
> > development, plus development of individual connectors. Core framework
> > development is currently not a terribly collaborative process, as there
> are no
> > maintainers of the core functionality other than Mr. Wright. Development
> of new
> > connectors has been done in the past in a much more collaborative way by
> > supplying a developer with a "development kit", and then integrating the
> > resulting connector (with whatever changes might have been necessary)
> into the
> > source tree.
> >
> > Reasonable efforts have been made to maintain the generality of the code
> base
> > during the time that MetaCarta has owned it. Nevertheless, certain
> > MetaCarta-specific changes have been made which may require review and
> > modification. The following areas probably need to be addressed in the
> code
> > before graduation can occur:
> >
> > • Branding. The UI brands it as a MetaCarta project.
> >
> > • Package names. Package names would have to be changed. • How Connector
> > Framework handles document delivery needs to be generalized, at
> > least for a single, configurable target output connector, and perhaps for
> > multiple, independently-configurable targets. Simple example output
> connectors
> > need to be written. Work in this direction is currently underway at
> MetaCarta
> > and may or may not be complete at the time of the code handover.
> >
> > • Connector Framework-specific dependent package modifications need to be
> > addressed somehow. For instance, the following projects that Connector
> Framework
> > depends upon have been modified, but the modifications have not been
> accepted
> > upstream: commons-httpclient NTLMv2 and NTLM2 support [RSS, Web,
> SharePoint,
> > Meridio, and Livelink connectors]; commons-httpclient custom HTTPS
> protocol
> > factory support [Web, SharePoint, Meridio, and Livelink connectors];
> xerces
> > ability to handle non-legal RSS feeds [RSS and Web connectors]
> >
> > • MetaCarta-specific features, like document templates, are explicitly
> handled
> > by the UI and the infrastructure. These features should be generalized so
> that
> > they are controlled by the choice of output connector.
> >
> > • Some specific hooks, namely support for configuration change
> notification,
> > and for database maintenance notification, may need to be made more
> generic. •
> > Share Connector has a "fingerprinting" feature, which prefilters
> documents
> > based on a document type it surmises using a document inspection
> technique. This
> > feature is only viable at the moment for very basic document types. It
> should
> > either be removed, or generalized significantly to be much more flexible.
> •
> > Documentation needs to be fleshed out, including javadoc and overall
> usage
> > documents. • Tests need to be written and/or ported from MetaCarta's test
> suite.
> >
> > Longer term, the project will likely grow into a more distributed
> crawler, where
> > multiple machines might well be involved in coordinated crawling
> activity.
> >
> > Meritocracy
> >
> > Building the community using a meritocratic approach is very important to
> the
> > success of LCF. We know many, many people in the search space (and
> otherwise)
> > have either written their own connectors or are in need of connectors.
> Thus, we
> > expect a meritocratic community will lead to widespread participation.
> >
> > Community
> >
> > Our hope is that our existing code, features and capabilities will
> attract a
> > large community of both developers and users. We also believe that other
> > organizations will find this project interesting and relevant, and
> contribute
> > resources.
> >
> > The user community of LCF would be similar to that of the other Lucene
> projects,
> > and in many cases they would overlap.
> >
> > Core Developers
> >
> > See the initial committer list below.
> >
> > Alignment
> >
> > We expect LCF will align quite well with the existing Lucene community
> and will
> > also provide significant value to other ASF and non-ASF projects as well
> as many
> > companies and individuals looking to access their content repositories in
> a
> > programmatic fashion.
> >
> > Known Risks
> >
> > Orphaned Products
> >
> > The Connector Framework is an important piece of any search engine,
> including
> > MetaCarta's, as it provides the primary mechanism for getting content out
> of a
> > repository and into the search engine's index. Thus, we don't expect it
> will be
> > orphaned anytime soon. Once the project is established and the code is
> > available, we expect to attract not only other search companies, but
> others with
> > similar needs.
> >
> > Inexperience with Open Source
> >
> > Grant Ingersoll, Ryan McKinley and Simon Willnauer provide the majority
> of the
> > experience with Open Source at the ASF, but all of the initial committers
> are
> > familiar with Open Source and have contributed to other open source
> projects.
> >
> > Homogeneous Developers
> >
> > The current list of committers are mostly members of either the MetaCarta
> or
> > Lucid Imagination developer team, but several are not. Additionally, we
> are
> > actively recruiting other developers.
> >
> > Reliance on Salaried Developers
> >
> > We have a variety of committers represented. Some are being paid to work
> on the
> > project and some are not.
> >
> > Cryptography
> >
> > Connector Framework itself has no real cryptography component, although
> it does
> > currently obfuscate passwords it saves to the database or to a
> configuration
> > file using a proprietary algorithm. The algorithm is present simply to
> avoid
> > using cleartext and is not secure in any sense other than by obscurity.
> >
> > Various connectors, such as Share Connector, Web Connector, RSS
> Connector,
> > SharePoint Connector, LiveLink Connector, and Meridio Connector make use
> of
> > cryptographic principles via secondary libraries. Specifically, these
> connectors
> > support NTLM, NTLMv2, and NTLM2 Session authentication via
> commons-httpclient
> > and jCIFS. The changes to commons-httpclient necessary to support these
> > varieties of Windows protocols have not yet been accepted upstream by the
> Apache
> > httpclient project.
> >
> > It is unknown at this time exactly to what degree the Oracle JDBC driver,
> the
> > jtds JDBC driver, or the Postgresql JDBC driver uses cryptography. Also,
> the
> > FileNet API class, the Memex API classes, the OpenText LAPI api classes,
> and the
> > Documentum DFC classes all may or may not use cryptography.
> >
> > Legal Concerns
> >
> > Some of the connectors in the existing framework require paid licenses to
> use.
> > We will need to evaluate each connector to see what can be appropriately
> > included. For those connectors that require a paid license, we will need
> to
> > determine a plan for including the wrapper code without the underlying
> bindings
> > in a legal manner. We expect we can provide the wrapper code without the
> binding
> > and that the code will thus only be compilable by someone who has access
> to the
> > binding. (This is what Google has done for their individual connectors).
> Longer
> > term, we expect to demonstrate to the companies with proprietary
> connectors why
> > it is more valuable for them to open up their specific connector pieces
> to give
> > broader access to people looking to leverage their content in the
> repository.
> >
> > Trademark
> >
> > The project is being rebranded from a MetaCarta internal name to the
> Lucene
> > Connector Framework, which will be an ASF mark.
> >
> > Relationships with Other Apache Products
> >
> > We expect almost all of the Apache Lucene ecosystem will benefit from having a
> > standard way of connecting to content repositories. Additionally, users of UIMA
> > should also benefit. We also see an especially tight connection with Tika, as
> > much of the content in these types of repositories consists of "rich" document
> > types whose content will then need to be extracted.
> >
> > An Excessive Fascination with the Apache Brand
> >
> > All of us are familiar with the value that Apache brings to a project in
> > building out a community. We also are all significant users of Apache Lucene and
> > related tools (Solr, Nutch, Mahout, Tika) and expect a close relationship with
> > those projects will help significantly grow the LCF community.
> >
> > Documentation
> >
> > MetaCarta has end-user documentation for Lucene Connector Framework, which might
> > function as the core of the open-source end-user documentation. The documentation
> > is in LaTeX form, and thus usable sources can readily be extracted. Any ownership
> > issues for the documentation as it stands still need to be researched.
> >
> > The existing javadoc of the code, while fairly extensive, needs review and
> > perhaps augmentation to ensure it meets the needs of an ASF project. Significant
> > attention was paid to maintaining its accuracy during MetaCarta's ownership of
> > the code base.
> >
> > Initial Source
> >
> > All initial sources will be coming from MetaCarta, Inc., with the goal of
> > folding in changes from others shortly thereafter.
> >
> > Source and Intellectual Property Submission Plan
> >
> > Code IP grants need to be made from MetaCarta, Inc. But, in addition, several
> > connectors (notably Documentum, LiveLink, Memex, and FileNet) rely directly on
> > client APIs in order to be compiled. Another connector (JDBC) relies on the
> > existence of the Oracle JDBC Driver in the classpath in order to enable crawls
> > against Oracle databases.
> >
> > It is unlikely that EMC, OpenText, Memex, or IBM would grant
> > Apache-license-compatible use of these client libraries. Thus, the expectation
> > is that users of these connectors obtain the necessary client libraries from the
> > owners prior to building or using the corresponding connector. An alternative
> > would be to undertake a clean-room implementation of the client APIs, which may
> > well yield suitable results in some cases (LiveLink, Memex, FileNet), while
> > being out of reach in others (Documentum). Conditional compilation, for the
> > short term, is thus likely to be a necessity.
> >
> > Other external dependencies, such as jCIFS for the Share Connector, are licensed
> > with LGPL, and thus may need to be treated in a manner similar to the closed
> > APIs even though they are open source. These include the postgresql JDBC
> > driver, and JTDS.
> >
> > The Lucene Connector Framework core and individual connectors are completely
> > separable, and many of the connectors require no third-party licenses.
> > Therefore, there is significant utility for this project even in the absence of
> > any third-party software grants or clean-room engineering.
> >
> > The software grant will be faxed to the Apache Software Foundation if and when
> > the proposal herein described is accepted. MetaCarta patents are not infringed
> > by this grant. Also, MetaCarta trademarks are not included in this grant.
> >
> > External Dependencies
> >
> > The project dependencies, other than on other Apache projects, are as follows:
> >
> > The ConnectorFramework core currently uses the Bitmechanic JDBC pool driver,
> > which is BSD licensed, and the Postgresql JDBC driver, which is also BSD
> > licensed.
> >
> > The LiveLink Connector relies on LAPI, which is privately licensed by OpenText.
> > The Documentum Connector relies on DFC, which is privately licensed by EMC. The
> > Share Connector relies on jCIFS, which is LGPL. The Memex Connector relies on
> > privately licensed java libraries from Memex. The FileNet Connector relies on
> > privately licensed java libraries from IBM.
> >
> > Required Resources
> >
> > • Mailing lists
> >   • connectors-private (with moderated subscriptions)
> >   • connectors-user@
> >   • connectors-dev@
> >   • connectors-commit@
> > • Subversion directory
> >   • https://svn.apache.org/repos/asf/incubator/connectors
> > • Website
> >   • Confluence (CONNECTORS)
> > • Issue Tracking
> >   • JIRA (CONNECTORS)
> >
> > Initial Committers
> >
> > Names of initial committers with affiliation and current ASF status:
> >
> > • Karl Wright (kwright at metacarta)
> > • Josiah Strandberg (jstrandberg at metacarta)
> > • Ken Baker (bakerkj at metacarta)
> > • Marc Meadows (mam at metacarta)
> > • Grant Ingersoll (gsingers@a.o, Lucid Imagination, ASF Member)
> > • Brian Pinkerton (brian.pinkerton at Lucid Imagination)
> > • Simon Willnauer (simonw at apache org, Committer on Lucene Java and Lucene
> >   Open Relevance Project)
> > • Ryan McKinley (ryan at apache org, Committer on Lucene and Solr)
> > • Robert Muir (rmuir at apache org, Committer on Lucene and Open Relevance)
> > • Sami Siren (siren@a.o, Committer on Nutch and Tika)
> > • Otis Gospodnetic (otis@a.o, Committer on Lucene, Solr, Nutch, Mahout, and
> >   Open Relevance Project)
> > • Shalin Shekhar Mangar (shalin@a.o, AOL, Committer on Apache Solr)
> > • Noble Paul (noble@a.o, AOL, Committer on Apache Solr)
> > • George Aroush (george at aroush.net, Committer on Lucene.Net)
> >
> > Sponsors
> >
> > Champion
> >
> > • Grant Ingersoll
> >
> > Nominated Mentors
> >
> > • Grant Ingersoll
> > • Jukka Zitting
> > • Gianugo Rabellino
> >
> > Sponsoring Entity
> >
> > • Apache Lucene PMC: Message ID: AF7E...@gmail.com in private@lucene.a.o
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: general-unsubscribe@incubator.apache.org
> For additional commands, e-mail: general-help@incubator.apache.org
>
>

Re: [VOTE] Incubate Lucene Connector Framework

Posted by Otis Gospodnetic <ot...@yahoo.com>.
+1
 Otis
--
Sematext -- http://sematext.com/ -- Solr - Lucene - Nutch



---------------------------------------------------------------------
To unsubscribe, e-mail: general-unsubscribe@incubator.apache.org
For additional commands, e-mail: general-help@incubator.apache.org


Re: [VOTE] Incubate Lucene Connector Framework

Posted by Shalin Shekhar Mangar <sh...@gmail.com>.
On Fri, Jan 8, 2010 at 7:21 PM, Grant Ingersoll <gs...@apache.org> wrote:

>
> On behalf of the Lucene PMC, I'd like to propose incubation for a new Lucene
> subproject called the Lucene Connector Framework (LCF). I think we have all the
> necessary bits in place for the proposal to go forward.
>
> Proposal: http://wiki.apache.org/incubator/LuceneConnectorFrameworkProposal
>
>
[X] +1. Accept LCF into the Incubator.

-- 
Regards,
Shalin Shekhar Mangar.

Re: [VOTE] Incubate Lucene Connector Framework

Posted by Marcel Offermans <ma...@luminis.nl>.
+1. Accept LCF into the Incubator.
(non binding)

Sounds like a very useful project.

Greetings, Marcel



Re: [VOTE] Incubate Lucene Connector Framework

Posted by Gianugo Rabellino <gi...@gmail.com>.
On Fri, Jan 8, 2010 at 2:51 PM, Grant Ingersoll <gs...@apache.org> wrote:
> Hi,
>
> Given the lack of response on the proposal, I'll assume lazy consensus and call a vote.
>
> On behalf of the Lucene PMC, I'd like to propose incubation for a new Lucene
> subproject called the Lucene Connector Framework (LCF). I think we have all the
> necessary bits in place for the proposal to go forward.
>
> Proposal: http://wiki.apache.org/incubator/LuceneConnectorFrameworkProposal
>
> [] +1. Accept LCF into the Incubator.
> [] 0.  Don't care.
> [] -1. Do not accept (and why.)

+1

-- 
Gianugo


Re: [VOTE] Incubate Lucene Connector Framework

Posted by Luciano Resende <lu...@gmail.com>.
On Fri, Jan 8, 2010 at 8:32 AM, Jukka Zitting <ju...@gmail.com> wrote:
> Hi,
>
> On Fri, Jan 8, 2010 at 3:51 PM, Grant Ingersoll <gs...@apache.org> wrote:
>> On behalf of the Lucene PMC, I'd like to propose incubation for a new Lucene
>> subproject called the Lucene Connector Framework (LCF).
>
> [x] +1. Accept LCF into the Incubator.
>

+1


-- 
Luciano Resende
http://people.apache.org/~lresende
http://lresende.blogspot.com/


Re: [RESULT][VOTE] Incubate Lucene Connector Framework

Posted by Jukka Zitting <ju...@gmail.com>.
Hi,

On Mon, Jan 11, 2010 at 9:57 PM, Jukka Zitting <ju...@gmail.com> wrote:
> On Mon, Jan 11, 2010 at 7:33 PM, Grant Ingersoll <gs...@apache.org> wrote:
>> I believe this has passed.   Any pointers on next steps would be welcome, otherwise, I will go RTM.

To get started, I've now requested mailing lists for the project:

    https://issues.apache.org/jira/browse/INFRA-2430

I also created the Jira project and the Confluence space:

    https://issues.apache.org/jira/browse/CONNECTORS
    http://cwiki.apache.org/confluence/display/CONNECTORS

Both still need some tweaking. I've also granted you full admin rights to both.

Once the project mailing lists are up we can discuss the details of
how to import the existing code and any documentation and/or issue
tracking data that may come with the code.

BR,

Jukka Zitting


Re: [RESULT][VOTE] Incubate Lucene Connector Framework

Posted by Jukka Zitting <ju...@gmail.com>.
Hi,

On Mon, Jan 11, 2010 at 7:33 PM, Grant Ingersoll <gs...@apache.org> wrote:
> I believe this has passed.   Any pointers on next steps would be welcome, otherwise, I will go RTM.

Here are a few pointers on what we did with Tika:

    https://issues.apache.org/jira/browse/INFRA-1204
    https://issues.apache.org/jira/browse/TIKA-1
    https://issues.apache.org/jira/browse/TIKA-2
    https://issues.apache.org/jira/browse/TIKA-3

The biggest issue is probably getting all the ICLAs for the new
committer accounts and the software grant for importing the initial
codebase.

BR,

Jukka Zitting


Re: [RESULT][VOTE] Incubate Lucene Connector Framework

Posted by David Crossley <cr...@apache.org>.
Grant Ingersoll wrote:
> I believe this has passed.   Any pointers on next steps would be welcome, otherwise, I will go RTM.

Please do step one, which is to add it to the
ReportingSchedule and "Projects in Incubation" table.

Clutch can then pick it up and start assisting
to track what remains to be done.

Clutch also tries to list the steps and point to the
relevant Incubator docs:
http://incubator.apache.org/clutch.html#steps

-David


[RESULT][VOTE] Incubate Lucene Connector Framework

Posted by Grant Ingersoll <gs...@apache.org>.
I believe this has passed.   Any pointers on next steps would be welcome, otherwise, I will go RTM.

On Jan 8, 2010, at 11:32 AM, Jukka Zitting wrote:

> Hi,
> 
> On Fri, Jan 8, 2010 at 3:51 PM, Grant Ingersoll <gs...@apache.org> wrote:
>> On behalf of the Lucene PMC, I'd like to propose incubation for a new Lucene
>> subproject called the Lucene Connector Framework (LCF).
> 
> [x] +1. Accept LCF into the Incubator.
> 
> BR,
> 
> Jukka Zitting




Re: [VOTE] Incubate Lucene Connector Framework

Posted by Jukka Zitting <ju...@gmail.com>.
Hi,

On Fri, Jan 8, 2010 at 3:51 PM, Grant Ingersoll <gs...@apache.org> wrote:
> On behalf of the Lucene PMC, I'd like to propose incubation for a new Lucene
> subproject called the Lucene Connector Framework (LCF).

[x] +1. Accept LCF into the Incubator.

BR,

Jukka Zitting


Re: [VOTE] Incubate Lucene Connector Framework

Posted by "Alan D. Cabrera" <li...@toolazydogs.com>.
+1

Regards,
Alan

On Jan 8, 2010, at 5:51 AM, Grant Ingersoll wrote:

> Hi,
>
> Given the lack of response on the proposal, I'll assume lazy consensus and call a vote.
>
> On behalf of the Lucene PMC, I'd like to propose incubation for a new Lucene
> subproject called the Lucene Connector Framework (LCF). I think we have all the
> necessary bits in place for the proposal to go forward.
>
> Proposal: http://wiki.apache.org/incubator/LuceneConnectorFrameworkProposal
>
> [] +1. Accept LCF into the Incubator.
> [] 0.  Don't care.
> [] -1. Do not accept (and why.)
>
> Here's my +1.
>
> Thanks, Grant Ingersoll
>
>
>
> ------ Wiki Text Copied Below -----
>
> Lucene Connector Framework
>
> Abstract
>
> Many, many search engines, as well as other applications, have a need to connect
> with content repositories (SharePoint, CMS, Documentum, etc.) in a standard
> manner. The Lucene Connector Framework (LCF) is a project aimed at building out
> these connectors in open source under the Apache brand.
>
> Proposal
>
> The goal of LCF is to create a viable Lucene subproject aimed at delivering a
> best of breed connector framework under the Apache Lucene name. As a framework,
> the project will not only provide a way to connect to individual repositories,
> but also a mechanism for plugging in new connectors or custom connectors in a
> straightforward manner.
>
> A connector framework is vital for search engines and other tools that need to
> access data located in corporate repositories. By abstracting the problem into a
> framework, applications can code to a set of well-defined interfaces instead of
> having to use a different interface for each connector.
>
> Connector Framework is an extendible incremental crawler, which uses  
> a database
> to manage configuration and crawl history, and provides reasonably  
> high
> performance in accessing content in multiple repositories for the  
> main purpose
> of search engine indexing. Connector Framework also establishes a
> repository-specific security model which can be used to limit search  
> user access
> to repository content based on a user's identity. Connector  
> Framework also
> includes existing connectors and authorities for:
>
> • File system • Windows shares • JDBC-supported databases • RSS  
> feeds • General websites • LiveLink [from OpenText]
>
> • Documentum [from EMC] • SharePoint [from Microsoft]
>
> • Meridio [from Meridio] • Memex [from Memex] • FileNet [from IBM]
>
> Key design points for Connector Framework are as follows:
>
> • Extendability - you can add new connectors for new repositories,  
> and new
> authorities for specific repository security models • Incrementality  
> - the ability to process only what changed between crawls, in
> a repository-specific manner • Restartability - using a database  
> with ACID properties to insure that crawls
> are safe against process interruption or machine shutdown • Security  
> - establishing a model of security tokens that allows a search
> engine to enforce a repository's security model • Limited footprint  
> - ability to operate reliably within a fixed amount of
> process memory, regardless of configuration • Performance -  
> management of connector-specific resources to maximize overall
> thoughput • Transparency - ability to generate reports on the  
> activity of all crawls and
> repository connections
>
> Background
>
> MetaCarta originally approached Grant Ingersoll from the Lucene PMC  
> about
> donating their existing connector framework to the Lucene PMC. After  
> some
> discussion about accepting it as a software grant, the PMC decided  
> it would be
> best to incubate the project first.
>
> Rationale
>
> The Connector Framework fills an often significant gap in the Lucene  
> experience,
> namely, how to get content locked away in a content repository into
> Lucene/Solr/Nutch/Mahout/Tika. Naturally, many other tools (search  
> engines and
> others) will also have this same problem. A Connector Framework  
> would also be
> useful for someone wishing to migrate between content repositories,  
> too.
>
> Current Status
>
> Connector Framework has been under development and in use in the  
> field for close
> to five years, deployed on a MetaCarta search appliance. Almost all  
> development
> of the project has been done by Karl Wright  
> ( kwri...@metacarta.com ). Some
> individual connectors were developed initially by contractors hired by
> MetaCarta, Inc., but maintenance and further development is  
> currently handled by
> the MetaCarta team.
>
> Development of Connector Framework can therefore be viewed as core  
> framework
> development, plus development of individual connectors. Core framework
> development is currently not a terribly collaborative process, as  
> there are no
> maintainers of the core functionality other than Mr. Wright.  
> Development of new
> connectors has been done in the past in a much more collaborative  
> way by
> supplying a developer with a "development kit", and then integrating  
> the
> resulting connector (with whatever changes might have been  
> necessary) into the
> source tree.
>
> Reasonable efforts have been made to maintain the generality of the  
> code base
> during the time that MetaCarta has owned it. Nevertheless, certain
> MetaCarta-specific changes have been made which may require review and
> modification. The following areas probably need to be addressed in  
> the code
> before graduation can occur:
>
> • Branding. The UI brands it as a MetaCarta project.
>
> • Package names. Package names would have to be changed. • How  
> Connector Framework handles document delivery needs to be  
> generalized, at
> least for a single, configurable target output connector, and  
> perhaps for
> multiple, independently-configurable targets. Simple example output  
> connectors
> need to be written. Work in this direction is currently underway at  
> MetaCarta
> and may or may not be complete at the time of the code handover.
>
> • Connector Framework-specific dependent package modifications need  
> to be
> addressed somehow. For instance, the following projects that  
> Connector Framework
> depends upon have been modified, but the modifications have not been  
> accepted
> upstream: commons-httpclient NTLMv2 and NTLM2 support [RSS, Web,  
> SharePoint,
> Meridio, and Livelink connectors]; commons-httpclient custom HTTPS  
> protocol
> factory support [Web, SharePoint, Meridio, and Livelink connectors];  
> xerces
> ability to handle non-legal RSS feeds [RSS and Web connectors]
>
> • MetaCarta-specific features, like document templates, are  
> explicitly handled
> by the UI and the infrastructure. These features should be  
> generalized so that
> they are controlled by the choice of output connector.
>
> • Some specific hooks, namely support for configuration change  
> notification,
> and for database maintenance notification, may need to be made more  
> generic. • Share Connector has a "fingerprinting" feature, which  
> prefilters documents
> based on a document type it surmises using a document inspection  
> technique. This
> feature is only viable at the moment for very basic document types.  
> It should
> either be removed, or generalized significantly to be much more  
> flexible. • Documentation needs to be fleshed out, including javadoc  
> and overall usage
> documents. • Tests need to be written and/or ported from MetaCarta's  
> test suite.
>
> Longer term, the project will likely grow into a more distributed  
> crawler, where
> multiple machines might well be involved in coordinated crawling  
> activity.
>
> Meritocracy
>
> Building the community using a meritocratic approach is very  
> important to the
> success of LCF. We know many, many people in the search space (and  
> otherwise)
> have either written their own connectors or are in need of  
> connectors. Thus, we
> expect a meritocratic community will lead to widespread participation.
>
> Community
>
> Our hope is that our existing code, features and capabilities will  
> attract a
> large community of both developers and users. We also believe that  
> other
> organizations will find this project interesting and relevant, and  
> contribute
> resources.
>
> The user community of LCF would be similar to that of the other  
> Lucene projects,
> and in many cases they would overlap.
>
> Core Developers
>
> See the initial committer list below.
>
> Alignment
>
> We expect LCF will align quite well with the existing Lucene  
> community and will
> also provide significant value to other ASF and non-ASF projects as  
> well as many
> companies and individuals looking to access their content  
> repositories in a
> programmatic fashion.
>
> Known Risks
>
> Orphaned Products
>
> The Connector Framework is an important piece of any search engine,  
> including
> MetaCarta's, as it provides the primary mechanism for getting  
> content out of a
> repository and into the search engine's index. Thus, we don't expect  
> it will be
> orphaned anytime soon. Once the project is established and the code is
> available, we expect to attract not only other search companies, but  
> others with
> similar needs.
>
> Inexperience with Open Source
>
> Grant Ingersoll, Ryan McKinley and Simon Willnauer provide the  
> majority of the
> experience with Open Source at the ASF, but all of the initial  
> committers are
> familiar with Open Source and have contributed to other open source  
> projects.
>
> Homogeneous Developers
>
> The current list of committers are mostly members of either the  
> MetaCarta or
> Lucid Imagination developer team, but several are not. Additionally,  
> we are
> actively recruiting other developers.
>
> Reliance on Salaried Developers
>
> We have a variety of committers represented. Some are being paid to  
> work on the
> project and some are not.
>
> Cryptography
>
> Connector Framework itself has no real cryptography component,  
> although it does
> currently obfuscate passwords it saves to the database or to a  
> configuration
> file using a proprietary algorithm. The algorithm is present simply  
> to avoid
> using cleartext and is not secure in any sense other than by  
> obscurity.
>
> Various connectors, such as Share Connector, Web Connector, RSS  
> Connector,
> SharePoint Connector, LiveLink Connector, and Meridio Connector make  
> use of
> cryptographic principles via secondary libraries. Specifically,  
> these connectors
> support NTLM, NTLMv2, and NTLM2 Session authentication via commons- 
> httpclient
> and jCIFS. The changes to commons-httpclient necessary to support  
> these
> varieties of Windows protocols have not yet been accepted upstream  
> by the Apache
> httpclient project.
>
> It is unknown at this time exactly to what degree the Oracle JDBC  
> driver, the
> jtds JDBC driver, or the Postgresql JDBC driver uses cryptography.  
> Also, the
> FileNet API class, the Memex API classes, the OpenText LAPI api  
> classes, and the
> Documentum DFC classes all may or may not use cryptography.
>
> Legal Concerns
>
> Some of the connectors in the existing framework require paid  
> licenses to use.
> We will need to evaluate each connector to see what can be  
> appropriately
> included. For those connectors that require a paid license, we will  
> need to
> determine a plan for including the wrapper code without the  
> underlying bindings
> in a legal manner. We expect we can provide the wrapper code without  
> the binding
> and that the code will thus only be compilable by someone who has  
> access to the
> binding. (This is what Google has done for their individual  
> connectors). Longer
> term, we expect to demonstrate to the companies with proprietary  
> connectors why
> it is more valuable for them to open up their specific connector  
> pieces to give
> broader access to people looking to leverage their content in the  
> repository.
>
> Trademark
>
> The project is being rebranded from a MetaCarta internal name to the  
> Lucene
> Connector Framework, which will be an ASF mark.
>
> Relationships with Other Apache Products
>
> We expect almost all of the Apache Lucene ecosystem will benefit  
> from having a
> standard way of connecting to content repositories. Additionally,  
> users of UIMA
> should also benefit. We also see an especially tight connection with  
> Tika, as
> much of the content in these types of repositories are "rich"  
> document types
> which will then need their content extracted.
>
> An Excessive Fascination with the Apache Brand
>
> All of us are familiar with the value that Apache brings to a  
> project in
> building out a community. We also are all significant users of  
> Apache Lucene and
> related tools (Solr, Nutch, Mahout, Tika) and expect a close  
> relationship with
> those projects will help significantly grow the LCF community.
>
> Documentation
>
> MetaCarta has end-user documentation for Lucene Connector Framework,  
> which might
> function as the core the open-source end-user documentation. The  
> documentation
> is in LaTeX form, and thus usable sources can readily be extracted.  
> Research as
> to any ownership issues for the documentation as it stands still  
> needs to be
> examined.
>
> The existing java doc of the code, while fairly extensive, needs  
> review and
> perhaps augmentation to insure it meets the needs of an ASF project.  
> Significant
> attention to maintaining its accuracy was made during MetaCarta's  
> ownership of
> the code base.
>
> Initial Source
>
> All initial sources will be coming from MetaCarta, Inc., with the  
> goal of
> folding in changes from others shortly thereafter.
>
> Source and Intellectual Property Submission Plan
>
> Code IP grants need to be made from MetaCarta, Inc. But, in  
> addition, several
> connectors (notably Documentum, LiveLink, Memex, and FileNet) rely  
> directly on
> client API's in order to be compiled. Another connector (JDBC)  
> relies on the
> existence of the Oracle JDBC Driver in the classpath in order to  
> enable crawls
> against Oracle databases.
>
> It is unlikely that EMC, OpenText, Memex, or IBM would grant
> Apache-license-compatible use of these client libraries. Thus, the  
> expectation
> is that users of these connectors obtain the necessary client  
> libraries from the
> owners prior to building or using the corresponding connector. An  
> alternative
> would be to undertake a clean-room implementation of the client  
> API's, which may
> well yield suitable results in some cases (LiveLink, Memex,  
> FileNet), while
> being out of reach in others (Documentum). Conditional compilation,  
> for the
> short term, is thus likely to be a necessity.
>
> Other external dependencies, such as jCIFS for the Share Connector,  
> are licensed
> with LGPL, and thus may need to be treated in a manner similar to  
> the closed
> API's even though they are open source. These include the postgresql  
> JDBC
> driver, and JTDS.
>
> The Lucene Connector Framework core and individual connectors are  
> completely
> separable, and many of the connectors require no third party licenses.
> Therefore, there is significant utility for this project even in the  
> absence of
> any third-party software grants, or clean-room engineering.
>
> The software grant will be faxed to the Apache Software Foundation  
> if and when
> the proposal herein described is accepted. MetaCarta patents are not  
> infringed
> by this grant. Also, MetaCarta trademarks are not included in this  
> grant.
>
> External Dependencies
>
> The project dependencies, other than on other Apache projects, are  
> as follows:
>
> The ConnectorFramework core currently uses the Bitmechanic JDBC pool  
> driver,
> which is BSD licensed, and the Postgresql JDBC driver, which is also  
> BSD
> licensed.
>
> The LiveLink Connector relies on LAPI, which is privately licensed  
> by OpenText.
> The Documentum Connector relies on DFC, which is privately licensed  
> by EMC. The
> Share Connector relies on jCIFS, which is LGPL. The Memex Connector  
> relies on
> privately licensed java libraries from Memex. The FileNet Connector  
> relies on
> privately licensed java libraries from IBM.
>
> Required Resources
>
> • Mailing lists • connectors-private (with moderated subscriptions)  
> • connectors-user@ • connectors-dev@ • connectors-commit@ •  
> Subversion directory • https://svn.apache.org/repos/asf/incubator/connectors
>
> • Website • Confluence (CONNECTORS) • Issue Tracking • JIRA  
> (CONNECTORS)
>
> Initial Committers
>
> Names of initial committers with affiliation and current ASF status:
>
> • Karl Wright (kwright at metacarta) • Josiah Strandberg  
> (jstrandberg at metacarta) • Ken Baker (bakerkj at metacarta) • Marc  
> Meadows (mam at metacarta) • Grant Ingersoll ( gsingers@a.o Lucid  
> Imagination, ASF Member)
>
> • Brian Pinkerton (brian.pinkerton at Lucid Imagination) • Simon  
> Willnauer (simonw at apache org, Committer on Lucene Java and Lucene
> Open Relevance Project) • Ryan McKinley (ryan at apache org,  
> Committer on Lucene and Solr)
>
> • Robert Muir (rmuir at apache org, Committer on Lucene and Open  
> Relevance) • Sami Siren ( siren@a.o , Committer on Nutch and Tika)
>
> • Otis Gospodnetic ( otis@a.o , Committer on Lucene, Solr, Nutch,  
> Mahout, and
> Open Relevance Project)
>
> • Shalin Shekhar Mangar ( shalin@a.o , AOL, Committer on Apache Solr)
>
> • Noble Paul ( noble@a.o , AOL, Committer on Apache Solr)
>
> • George Aroush (george at aroush.net, Committer on Lucene.Net)
>
> Sponsors
>
> Champion
>
> • Grant Ingersoll
>
> Nominated Mentors
>
> • Grant Ingersoll • Jukka Zitting • Gianugo Rabellino
>
> Sponsoring Entity
>
> • Apache Lucene PMC: Message ID: AF7E...@gmail.com
> in private@lucene.a.o
>
>

