You are viewing a plain text version of this content. The canonical link for it is here.
Posted to java-user@lucene.apache.org by Peter Becker <pb...@dstc.edu.au> on 2003/09/02 13:51:50 UTC

ANN: Docco 0.2 / contribution offer

Hi all,

we finally finished the 0.2 release of our little personal document 
management tool based on Lucene:

  http://tockit.sourceforge.net/docco/index.html

This might be interesting for some readers of this list since its source 
contains some infrastructure for document handlers and index management. 
The document handlers are written with a very simple API, which just 
asks the implementation to fill a structure with the information 
retrieved from a URL. It is similar to the Ant task in the Lucene 
sandbox, but it separates the information collection and the actual 
indexing, i.e. all the decisions what should be stored and what shouldn't.

The program comes with implementations for plain text, HTML (based on 
Swing), XML (based on JAXP) and Open Office (using ZipStreams/SAX). We 
wrote plugins for POI, PDFbox and Multivalent. The latter is 
unfortunately a wild hack since Multivalent is the worst Java code I've 
seen. Literally. Bad C written in Java. The tool would be nice to use, 
but catching exceptions in little helper classes to do a System.exit is 
just insane. And that is just one of the problems -- we had to do some 
bad hacks to fix these issues. The other implementations should be fine, 
although they need some more testing.

The source (including all required libs) of the program is available via 
Sourceforge's CVS:

  http://sourceforge.net/cvs/?group_id=37081

The module in question is called "docco". A current snapshot of only the 
source is here:

  http://tockit.sourceforge.net/docco/source20030902.zip (~100kb)


The relevant packages are:

  org.tockit.docco.documenthandler: the documenthandler interface and 
implementations
  org.tockit.docco.filefilter: some code to pick document handlers via 
file extensions or regexps
  org.tockit.docco.index: the model/static bits of the index management
  org.tockit.docco.indexer: the dynamic aspects of the index management: 
runnable, framework for handlers

The index management is probably not optimal, I strongly suspect that an 
expert could tweak it. But the structure should be ok.

We would be happy to contribute this code to the Lucene sandbox if there 
is interest. Or to turn it into a project of its own, we don't think it 
should be hidden in our more specific program. It should be easy to 
merge it with the Ant task and we are happy to give a hand if wanted. 
Adding some documentation would be easy, too -- at the moment the code 
is still more for ourself, but it should be very readable by itself. We 
require JDK 1.4, but this can be reduced by moving some more document 
handlers into plugins.

Anyone interested in joining into maintaining this code? Any feedback is 
welcome.

Cheers,
   Peter


Re: Docco 0.2 / contribution offer

Posted by Peter Becker <pb...@dstc.edu.au>.
Hi Gregor,

Gregor Heinrich wrote:

>Hi Peter.
>
>Docco is a great tool which I have been using since you posted your first
>announcement (version 1.0, that is). Beside the things you mention in you
>
No, it must be 0.1 :-) It wasn't too bad for a first hack, though *g*

>mail I also generally think it's a great idea to using formal concept
>analysis with Lucene. I would be interested to explore the idea also for
>more structured data (maybe include fields and even hierarchies).
>
Sounds interesting. We will be busy with other things for a month or 
two, afterwards we will probably do some experiments with extending 
Docco towards LSA and similar technologies (main idea is guessing good 
keywords to refine the query -- which is a bit different if you are 
happy with overspecified queries). If you have some particular ideas we 
can discuss them on tockit-general@lists.sf.net. Combining the current 
Docco approach with conceptual scaling could be interesting -- if that 
is what you mean. It would be a bit off-topic here, so let's put this on 
our list.

The question for this list was more if someone is interested in turning 
the indexing and index managements bits into a separate project. I'd 
love to see it used by a broader audience and I think it would make 
getting started with Lucene a bit easier for some people -- at least if 
we add a little JavaDoc in form of package.htmls and class descriptions. 
And there are not that many classes, so that shouldn't be too much pain.

>
>Apart from this, if I had an idea of the time commitments connected, I would
>definitely consider to join.
>
Hey -- it is Open Source and voluntary. You commit as much time as you 
want ;-) Not that that couldn't be a problem -- at least I tend to try 
too many things at once.

If you are interested in more Docco-specific things, just post your 
ideas on our mailing list. That's usually a good start to get involved. 
And I am academic enough to always find the time to discuss things -- 
doing them is the hard bit :-) Once I understand what exactly you want 
to do I can probably give you a reasonable estimate of the effort involved.

Cheers,
  Peter


>Best,
>
>Gregor
>
>
>
>-----Original Message-----
>From: Peter Becker [mailto:pbecker@dstc.edu.au]
>Sent: Tuesday, September 02, 2003 1:52 PM
>To: Lucene Users List
>Subject: ANN: Docco 0.2 / contribution offer
>
>
>Hi all,
>
>we finally finished the 0.2 release of our little personal document
>management tool based on Lucene:
>
>  http://tockit.sourceforge.net/docco/index.html
>
>This might be interesting for some readers of this list since its source
>contains some infrastructure for document handlers and index management.
>The document handlers are written with a very simple API, which just
>asks the implementation to fill a structure with the information
>retrieved from a URL. It is similar to the Ant task in the Lucene
>sandbox, but it separates the information collection and the actual
>indexing, i.e. all the decisions what should be stored and what shouldn't.
>
>The program comes with implementations for plain text, HTML (based on
>Swing), XML (based on JAXP) and Open Office (using ZipStreams/SAX). We
>wrote plugins for POI, PDFbox and Multivalent. The latter is
>unfortunately a wild hack since Multivalent is the worst Java code I've
>seen. Literally. Bad C written in Java. The tool would be nice to use,
>but catching exceptions in little helper classes to do a System.exit is
>just insane. And that is just one of the problems -- we had to do some
>bad hacks to fix these issues. The other implementations should be fine,
>although they need some more testing.
>
>The source (including all required libs) of the program is available via
>Sourceforge's CVS:
>
>  http://sourceforge.net/cvs/?group_id=37081
>
>The module in question is called "docco". A current snapshot of only the
>source is here:
>
>  http://tockit.sourceforge.net/docco/source20030902.zip (~100kb)
>
>
>The relevant packages are:
>
>  org.tockit.docco.documenthandler: the documenthandler interface and
>implementations
>  org.tockit.docco.filefilter: some code to pick document handlers via
>file extensions or regexps
>  org.tockit.docco.index: the model/static bits of the index management
>  org.tockit.docco.indexer: the dynamic aspects of the index management:
>runnable, framework for handlers
>
>The index management is probably not optimal, I strongly suspect that an
>expert could tweak it. But the structure should be ok.
>
>We would be happy to contribute this code to the Lucene sandbox if there
>is interest. Or to turn it into a project of its own, we don't think it
>should be hidden in our more specific program. It should be easy to
>merge it with the Ant task and we are happy to give a hand if wanted.
>Adding some documentation would be easy, too -- at the moment the code
>is still more for ourself, but it should be very readable by itself. We
>require JDK 1.4, but this can be reduced by moving some more document
>handlers into plugins.
>
>Anyone interested in joining into maintaining this code? Any feedback is
>welcome.
>
>Cheers,
>   Peter
>
>
>---------------------------------------------------------------------
>To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
>For additional commands, e-mail: lucene-user-help@jakarta.apache.org
>
>
>
>---------------------------------------------------------------------
>To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
>For additional commands, e-mail: lucene-user-help@jakarta.apache.org
>  
>



RE: Docco 0.2 / contribution offer

Posted by Gregor Heinrich <Gr...@igd.fhg.de>.
Hi Peter.

Docco is a great tool which I have been using since you posted your first
announcement (version 1.0, that is). Beside the things you mention in you
mail I also generally think it's a great idea to using formal concept
analysis with Lucene. I would be interested to explore the idea also for
more structured data (maybe include fields and even hierarchies).

Apart from this, if I had an idea of the time commitments connected, I would
definitely consider to join.

Best,

Gregor



-----Original Message-----
From: Peter Becker [mailto:pbecker@dstc.edu.au]
Sent: Tuesday, September 02, 2003 1:52 PM
To: Lucene Users List
Subject: ANN: Docco 0.2 / contribution offer


Hi all,

we finally finished the 0.2 release of our little personal document
management tool based on Lucene:

  http://tockit.sourceforge.net/docco/index.html

This might be interesting for some readers of this list since its source
contains some infrastructure for document handlers and index management.
The document handlers are written with a very simple API, which just
asks the implementation to fill a structure with the information
retrieved from a URL. It is similar to the Ant task in the Lucene
sandbox, but it separates the information collection and the actual
indexing, i.e. all the decisions what should be stored and what shouldn't.

The program comes with implementations for plain text, HTML (based on
Swing), XML (based on JAXP) and Open Office (using ZipStreams/SAX). We
wrote plugins for POI, PDFbox and Multivalent. The latter is
unfortunately a wild hack since Multivalent is the worst Java code I've
seen. Literally. Bad C written in Java. The tool would be nice to use,
but catching exceptions in little helper classes to do a System.exit is
just insane. And that is just one of the problems -- we had to do some
bad hacks to fix these issues. The other implementations should be fine,
although they need some more testing.

The source (including all required libs) of the program is available via
Sourceforge's CVS:

  http://sourceforge.net/cvs/?group_id=37081

The module in question is called "docco". A current snapshot of only the
source is here:

  http://tockit.sourceforge.net/docco/source20030902.zip (~100kb)


The relevant packages are:

  org.tockit.docco.documenthandler: the documenthandler interface and
implementations
  org.tockit.docco.filefilter: some code to pick document handlers via
file extensions or regexps
  org.tockit.docco.index: the model/static bits of the index management
  org.tockit.docco.indexer: the dynamic aspects of the index management:
runnable, framework for handlers

The index management is probably not optimal, I strongly suspect that an
expert could tweak it. But the structure should be ok.

We would be happy to contribute this code to the Lucene sandbox if there
is interest. Or to turn it into a project of its own, we don't think it
should be hidden in our more specific program. It should be easy to
merge it with the Ant task and we are happy to give a hand if wanted.
Adding some documentation would be easy, too -- at the moment the code
is still more for ourself, but it should be very readable by itself. We
require JDK 1.4, but this can be reduced by moving some more document
handlers into plugins.

Anyone interested in joining into maintaining this code? Any feedback is
welcome.

Cheers,
   Peter


---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org



Powerpoint filter.

Posted by Gregor Heinrich <Gr...@igd.fhg.de>.
Hi all,

this is partly connected to Docco's new filters. Is there anyone interested
in a Powerpoint filter? This format seems to have been kept incredibly
obscure by out big friends in Redmond. I did only find a shareware tool
called CZ-PPT2TXT, but open source would i.m.h.o. be a bit more usable in
our community...

Cheers,

Gregor