Posted to dev@lucene.apache.org by Rida Benjelloun <ri...@doculibre.com> on 2007/01/31 01:27:28 UTC

Lius into apache incubator

Hi,
I would like to add the Lius framework (http://sourceforge.net/projects/lius/)
to the Apache Incubator. Are there any volunteers to do this job and to
contribute to the development of this project?

Thanks.

Rida Benjelloun.

Re: [jira] Lius into apache incubator

Posted by Jukka Zitting <ju...@gmail.com>.
Hi,

On 3/1/07, Doug Cutting <cu...@apache.org> wrote:
> Jukka Zitting wrote:
> > PS. Will people mind if we use this list for fleshing out the details?
> > I've created a Google Group for Tika where we could also take the
> > discussion if that's preferred.
>
> I think the Incubator Wiki would be the best place for this.
>
> http://wiki.apache.org/incubator/?action=fullsearch&value=proposal&titlesearch=Titles
>
> Interested folks could subscribe to the proposal page.  You could
> announce the proposal page on several lists.  Will that work for you?

Sounds good. I uploaded the early draft to
http://wiki.apache.org/incubator/TikaProposal, I'll announce it in a
moment.

> Also, I can probably help as a mentor if needed.

Cool, thanks!

BR,

Jukka Zitting

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


Re: [jira] Lius into apache incubator

Posted by Rida Benjelloun <ri...@doculibre.com>.
Hi,
Thanks, Doug. I think your help as a mentor would be very much appreciated.
Regards.

On 3/1/07, Doug Cutting <cu...@apache.org> wrote:
>
> Jukka Zitting wrote:
> > PS. Will people mind if we use this list for fleshing out the details?
> > I've created a Google Group for Tika where we could also take the
> > discussion if that's preferred.
>
> I think the Incubator Wiki would be the best place for this.
>
>
> http://wiki.apache.org/incubator/?action=fullsearch&value=proposal&titlesearch=Titles
>
> Interested folks could subscribe to the proposal page.  You could
> announce the proposal page on several lists.  Will that work for you?
>
> Also, I can probably help as a mentor if needed.
>
> Doug

Re: [jira] Lius into apache incubator

Posted by Doug Cutting <cu...@apache.org>.
Jukka Zitting wrote:
> PS. Will people mind if we use this list for fleshing out the details?
> I've created a Google Group for Tika where we could also take the
> discussion if that's preferred.

I think the Incubator Wiki would be the best place for this.

http://wiki.apache.org/incubator/?action=fullsearch&value=proposal&titlesearch=Titles

Interested folks could subscribe to the proposal page.  You could 
announce the proposal page on several lists.  Will that work for you?

Also, I can probably help as a mentor if needed.

Doug



Re: [jira] Lius into apache incubator

Posted by Jukka Zitting <ju...@gmail.com>.
Hi,

On 3/1/07, Rida Benjelloun <ri...@doculibre.com> wrote:
> On 3/1/07, Jukka Zitting <ju...@gmail.com> wrote:
> > Would there be interest within the Lucene PMC in sponsoring a proposal
> > along such lines? I can volunteer to put together the proposal and act
> > as the champion and mentor of the project.
>
> -- >> We can put together the proposal and you can be the mentor of the
> project.

See below for a quick first draft (filled with TODOs).

PS. Will people mind if we use this list for fleshing out the details?
I've created a Google Group for Tika where we could also take the
discussion if that's preferred.

BR,

Jukka Zitting


Tika Proposal
=============

This is an early draft of a possible proposal for a Tika project
within the Apache Incubator. See
http://incubator.apache.org/guides/proposal.html for a description of
the proposal template.

Abstract
--------

Tika is a toolkit for detecting and extracting metadata and text
content from various documents using existing parser libraries.

Proposal
--------

The Tika content analysis toolkit will include features for detecting
the content types, character encodings, languages, and other
characteristics of existing documents and for extracting structured
text content from the documents.
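
As an illustration of the detection part only, here is a minimal Java
sketch that guesses a media type from the first bytes of a stream. It
is not taken from Tika or Lius: the class name MagicTypeSketch and the
method detect are hypothetical, and it assumes that checking two
well-known file signatures ("%PDF-" for PDF, "PK\3\4" for ZIP) is
enough for a demonstration.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration; not part of Tika or Lius.
class MagicTypeSketch {

    private static final byte[] PDF_MAGIC =
        "%PDF-".getBytes(StandardCharsets.US_ASCII);
    private static final byte[] ZIP_MAGIC = {'P', 'K', 3, 4};

    // Guesses a media type from the leading bytes of a document.
    // Falls back to application/octet-stream when nothing matches.
    static String detect(byte[] prefix) {
        if (startsWith(prefix, PDF_MAGIC)) {
            return "application/pdf";
        }
        if (startsWith(prefix, ZIP_MAGIC)) {
            return "application/zip";
        }
        return "application/octet-stream";
    }

    // True when data begins with all of the bytes in magic.
    private static boolean startsWith(byte[] data, byte[] magic) {
        if (data.length < magic.length) {
            return false;
        }
        for (int i = 0; i < magic.length; i++) {
            if (data[i] != magic[i]) {
                return false;
            }
        }
        return true;
    }
}
```

Signature checks like this are only a first guess: ZIP-based formats
share the same leading bytes, so a real toolkit would need further
inspection to tell them apart.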

The toolkit is targeted especially at search engines and other
content indexing and analysis tools, but will also be useful for other
applications that need to extract meaningful information from
documents that might be presented as nothing more than binary streams.

Instead of implementing its own document parsers, Tika will use
existing parser libraries like Jakarta POI and PDFBox.
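
One possible shape for such delegation is sketched below in Java. It
is an assumption made for illustration, not a design from this
proposal: the names TextExtractor, ExtractorRegistry and
PlainTextExtractor are hypothetical, and the only adapter shown
handles plain text rather than calling a real parser library.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

// Hypothetical adapter interface: one implementation per wrapped library.
interface TextExtractor {
    String extractText(InputStream in);
}

// Hypothetical registry that picks an adapter by media type.
class ExtractorRegistry {
    private final Map<String, TextExtractor> extractors = new HashMap<>();

    void register(String mediaType, TextExtractor extractor) {
        extractors.put(mediaType, extractor);
    }

    // Delegates to the adapter registered for the given media type.
    String extract(String mediaType, InputStream in) {
        TextExtractor extractor = extractors.get(mediaType);
        if (extractor == null) {
            throw new IllegalArgumentException("No extractor for " + mediaType);
        }
        return extractor.extractText(in);
    }
}

// Trivial adapter for UTF-8 plain text. An adapter for a binary format
// would call into an existing parser library here instead.
class PlainTextExtractor implements TextExtractor {
    @Override
    public String extractText(InputStream in) {
        try {
            return new String(in.readAllBytes(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

With this layout the toolkit only depends on the small adapter
interface, and each parser library (along with its license) stays
behind its own adapter.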

Background
----------

The need for tools that automatically analyze and index content is
increasing as ever more information becomes available.

TODO: Discuss the various related projects and the lack of a common
analysis toolkit. Note how many of the existing tools have grown as
ad-hoc solutions to specific needs, and are often tightly bound to a
specific application or a parser library.

Rationale
---------

TODO

Initial Goals
-------------

TODO

Current Status
--------------

TODO

Meritocracy
-----------

TODO

Community
---------

TODO

Core Developers
---------------

TODO

Alignment
---------

TODO

Known Risks
-----------

TODO: There has been on-and-off interest in something like this for
quite a while already. How can we make sure that the current increase
in interest doesn't fade away?

Orphaned products
-----------------

TODO: See the comment above

Inexperience with Open Source
-----------------------------

TODO: Many of the interested participants have an open source background.

Homogenous Developers
---------------------

TODO: There is no central company behind the proposal.

Reliance on Salaried Developers
-------------------------------

TODO: Some of us are salaried for this, others are not.

Relationships with Other Apache Products
----------------------------------------

TODO: Lucene, Nutch, Jackrabbit, Droids, ...

An Excessive Fascination with the Apache Brand
----------------------------------------------

TODO

Documentation
-------------

TODO

Initial Source
--------------

TODO: Tika, Lius, Nutch?, ...

Source and Intellectual Property Submission Plan
------------------------------------------------

TODO

External Dependencies
---------------------

TODO: Some of the potential parser libraries will be GPL-licensed or
otherwise troublesome for an ASF project. How to best handle such
cases?

Cryptography
------------

TODO: Some of the document formats involve encryption and features
like DRM. While Tika itself will probably not include any
cryptographic code, the parser dependencies will most likely include
such code.

Required Resources
------------------

Mailing lists

  * tika-dev@incubator.apache.org

Subversion Directory

  * https://svn.apache.org/repos/asf/incubator/tika

Issue Tracking

  * JIRA TIKA

Other Resources

  * none

Initial Committers
------------------

TODO

Affiliations
------------

TODO

Sponsors
--------

Champion

TODO (I can volunteer)

Nominated Mentors

TODO (Three mentors is the recommendation, I can volunteer as one)

Sponsoring Entity

TODO (Apache Lucene?)



Re: [jira] Lius into apache incubator

Posted by Rida Benjelloun <ri...@doculibre.com>.
Hi,
On 3/1/07, Jukka Zitting <ju...@gmail.com> wrote:
>
> Hi,
>
> On 3/1/07, Rida Benjelloun <ri...@doculibre.com> wrote:
> > Lius could be used as a starting point for the Tika project, if Tika
> > committers are interested in it. We can also, as Mark said, decouple
> > Lius's parser logic from its indexing logic.
>
> I'm very interested in doing that. Another very useful codebase, among
> others, would be the existing parser framework in the Nutch project.


-->> I agree


> > Taking the project into the Apache Incubator could also be interesting,
> > to get more people involved in it.
>
> Exactly. I'd like to avoid starting just yet another codebase, and
> focus more on bringing the best parts (both code and ideas) of the
> existing projects together. The community-building focus of the
> Incubator would likely help with that. Another aspect that would
> benefit from the Incubator scrutiny is the legal implications of
> pulling together multiple document parser libraries under various
> different licenses.
>
> Would there be interest within the Lucene PMC in sponsoring a proposal
> along such lines? I can volunteer to put together the proposal and act
> as the champion and mentor of the project.


-- >> We can put together the proposal and you can be the mentor of the
project.

> BR,
>
> Jukka Zitting


-- 
-----------------------------------------------------------
Rida Benjelloun, M.S.I., M.B.A.
President and CEO
DocuLibre inc.
Telephone: (418) 262-3222
Website: http://www.doculibre.com
Email: rida.benjelloun@doculibre.com
-----------------------------------------------------------

Re: [jira] Lius into apache incubator

Posted by Jukka Zitting <ju...@gmail.com>.
Hi,

On 3/1/07, Grant Ingersoll <gs...@apache.org> wrote:
> Is the Droids lab at all related to that parsing project in Nutch?

Partly, yes. I've been looking at Droids and so far I think its main
focus has been on the crawling part rather than on the analysis of
retrieved content. A generic content analysis toolkit would likely be
a great companion for Droids. In fact I was earlier contemplating
starting a related effort in Apache Labs (see
http://issues.apache.org/jira/browse/JCR-728), but there seems to be
enough demand for such functionality that a more full-fledged project
might be better.

> There seem to be several efforts that are related here that could
> probably make for a nice new project under Lucene, IMO.  They all
> seem to have to do with getting and preparing text for processing by
> some type of consumer of text.

Exactly. It would be great to see some consolidation of efforts.

BR,

Jukka Zitting



Re: [jira] Lius into apache incubator

Posted by Grant Ingersoll <gs...@apache.org>.
Is the Droids lab at all related to that parsing project in Nutch?
There seem to be several efforts that are related here that could
probably make for a nice new project under Lucene, IMO.  They all
seem to have to do with getting and preparing text for processing by
some type of consumer of text.

I sometimes wonder if the Analysis stuff in Lucene proper would  
benefit from moving out of core too, but I'm not sure what it would  
look like just yet and it is nice having it "optimized" for Lucene  
versus having to support other types of analysis phases.


Just my two cents,
Grant


On Mar 1, 2007, at 11:42 AM, Jukka Zitting wrote:

> Hi,
>
> On 3/1/07, Rida Benjelloun <ri...@doculibre.com> wrote:
>> Lius could be used as a starting point for the Tika project, if Tika
>> committers are interested in it. We can also, as Mark said, decouple
>> Lius's parser logic from its indexing logic.
>
> I'm very interested in doing that. Another very useful codebase, among
> others, would be the existing parser framework in the Nutch project.
>
>> Taking the project into the Apache Incubator could also be
>> interesting, to get more people involved in it.
>
> Exactly. I'd like to avoid starting just yet another codebase, and
> focus more on bringing the best parts (both code and ideas) of the
> existing projects together. The community-building focus of the
> Incubator would likely help with that. Another aspect that would
> benefit from the Incubator scrutiny is the legal implications of
> pulling together multiple document parser libraries under various
> different licenses.
>
> Would there be interest within the Lucene PMC in sponsoring a proposal
> along such lines? I can volunteer to put together the proposal and act
> as the champion and mentor of the project.
>
> BR,
>
> Jukka Zitting

--------------------------
Grant Ingersoll
Center for Natural Language Processing
http://www.cnlp.org

Read the Lucene Java FAQ at http://wiki.apache.org/jakarta-lucene/LuceneFAQ





Re: [jira] Lius into apache incubator

Posted by Jukka Zitting <ju...@gmail.com>.
Hi,

On 3/1/07, Rida Benjelloun <ri...@doculibre.com> wrote:
> Lius could be used as a starting point for the Tika project, if Tika
> committers are interested in it. We can also, as Mark said, decouple
> Lius's parser logic from its indexing logic.

I'm very interested in doing that. Another very useful codebase, among
others, would be the existing parser framework in the Nutch project.

> Taking the project into the Apache Incubator could also be interesting,
> to get more people involved in it.

Exactly. I'd like to avoid starting just yet another codebase, and
focus more on bringing the best parts (both code and ideas) of the
existing projects together. The community-building focus of the
Incubator would likely help with that. Another aspect that would
benefit from the Incubator scrutiny is the legal implications of
pulling together multiple document parser libraries under various
different licenses.

Would there be interest within the Lucene PMC in sponsoring a proposal
along such lines? I can volunteer to put together the proposal and act
as the champion and mentor of the project.

BR,

Jukka Zitting



Re: [jira] Lius into apache incubator

Posted by Rida Benjelloun <ri...@doculibre.com>.
Hi,
You could actually use Lius as a text extraction API: for each Indexer I
have implemented a method that returns the String content of the
document.
Lius could be used as a starting point for the Tika project, if Tika
committers are interested in it. We can also, as Mark said, decouple
Lius's parser logic from its indexing logic.
Taking the project into the Apache Incubator could also be interesting,
to get more people involved in it.

My goal is to combine our efforts to build a framework for text extraction.
Here is an example of text extraction with Lius:

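// Load the Lius configuration from the given path.
LiusConfig lc =
    LiusConfigBuilder.getSingletonInstance().getLiusConfig(liusConfigPathString);

// Get an indexer for the document, then read its text content.
Indexer indexer = IndexerFactory.getIndexer(documentToIndex, lc);
String text = indexer.getContent();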


On 3/1/07, Jukka Zitting <ju...@gmail.com> wrote:
>
>
> Hi,
>
> I am interested in a Lius/Tika project that could be used not only with
> Lucene. As mentioned by Mark, there are a number of related efforts, which
> leads me to believe that an application-independent content
> analysis/parsing tool would be very helpful for many users.
>
> I'd like to propose taking the project to the Apache Incubator to better
> attract interest also from outside Lucene.
>
> BR,
>
> Jukka Zitting
>
> --
> View this message in context:
> http://www.nabble.com/Lius-into-apache-incubator-tf3145937.html#a9247508
> Sent from the Lucene - Java Developer mailing list archive at Nabble.com.

Re: [jira] Lius into apache incubator

Posted by Jukka Zitting <ju...@gmail.com>.
Hi,

I am interested in a Lius/Tika project that could be used not only with
Lucene. As mentioned by Mark, there are a number of related efforts, which
leads me to believe that an application-independent content
analysis/parsing tool would be very helpful for many users.

I'd like to propose taking the project to the Apache Incubator to better
attract interest also from outside Lucene.

BR,

Jukka Zitting
