
Posted to users@cocoon.apache.org by Andrew Franz <af...@optushome.com.au> on 2005/06/06 00:07:17 UTC

Extending DirectoryGenerator

I am thinking about a simple CMS (Content Management System) which would 
have the following features:
1. Ability to list MS-Office files along with their <SummaryInformation> 
attributes (this would use Jakarta POI) and to list "image" files 
(basically by cloning the functionality in ImageDirectoryGenerator), 
with room to extend to other commonly used document formats such as PDF.
2. The output of #1 would be used as input to create a Lucene Index.
3. The Lucene index would be used to search an Intranet by Author, 
Title, Subject, etc.

This would mean that content-creators in the organisation would 
categorise documents simply by updating <SummaryInformation> 
('Properties' in MS-Office applications) and then uploading the file 
(the current implementation requires them to update a database separate 
to the document itself). The Cocoon application would automatically 
categorise the document, either by using Lucene or from the 
SummaryInformation. Indexing would only apply to the header/meta info - 
full text indexing of content is not required.
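
To make the "header/meta info only" idea concrete, here is a rough sketch of a per-field index. It is a plain in-memory stand-in, not Lucene: the class HeaderIndex and its methods are hypothetical names made up for illustration, and a real build would hand the same field/value pairs to Lucene instead.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical in-memory stand-in for a header-only index (not the Lucene API). */
public class HeaderIndex {

    // field name -> lower-cased term -> names of documents whose header contains the term
    private final Map<String, Map<String, Set<String>>> postings = new HashMap<>();

    /** Indexes only the header fields (e.g. Author, Title, Subject); file content is never read. */
    public void add(String docName, Map<String, String> headers) {
        for (Map.Entry<String, String> h : headers.entrySet()) {
            String field = h.getKey().toLowerCase(Locale.ROOT);
            Map<String, Set<String>> terms = postings.computeIfAbsent(field, k -> new HashMap<>());
            // Crude tokenizer: splits on anything that is not an ASCII letter, digit or underscore.
            for (String term : h.getValue().toLowerCase(Locale.ROOT).split("\\W+")) {
                if (!term.isEmpty()) {
                    terms.computeIfAbsent(term, k -> new TreeSet<>()).add(docName);
                }
            }
        }
    }

    /** Returns the documents whose given field contains the term, or an empty set. */
    public Set<String> search(String field, String term) {
        return postings
                .getOrDefault(field.toLowerCase(Locale.ROOT), Collections.emptyMap())
                .getOrDefault(term.toLowerCase(Locale.ROOT), Collections.emptySet());
    }
}
```

A search is scoped to one field, so a name that appears in a Title does not match an Author query.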

The question (to experienced Cocoon developers) is what is the preferred 
method of implementation?

Option 1. Extend DirectoryGenerator similar to the way 
ImageDirectoryGenerator is implemented but adding new file types

Option 2. Use DirectoryGenerator 'as is' but augment it with a 
HeaderGenerator per file/mimetype and then aggregate results such that 
the output is similar to #1
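
As a rough illustration of Option 2 outside Cocoon, the dispatch could look something like the sketch below. HeaderAggregator and HeaderExtractor are hypothetical names, not Cocoon or POI classes. The per-type extractors (POI for MS-Office, something else for images or PDF) are assumed to be plugged in behind the interface, and file types are keyed by extension rather than real mimetype to keep it short.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.stream.Collectors;
import java.util.stream.Stream;

/** Hypothetical sketch: plain directory listing, augmented per file type, then aggregated. */
public class HeaderAggregator {

    /** Hypothetical interface: reads header/meta fields (Author, Title, ...) from one file. */
    public interface HeaderExtractor {
        Map<String, String> extract(Path file);
    }

    private final Map<String, HeaderExtractor> byExtension = new HashMap<>();

    /** Registers an extractor for a file extension such as "doc" (case-insensitive). */
    public void register(String extension, HeaderExtractor extractor) {
        byExtension.put(extension.toLowerCase(Locale.ROOT), extractor);
    }

    /** Builds one listing entry: the file name plus whatever the matching extractor returns. */
    public Map<String, String> describe(Path file) {
        Map<String, String> entry = new LinkedHashMap<>();
        String name = file.getFileName().toString();
        entry.put("name", name);
        int dot = name.lastIndexOf('.');
        String ext = dot < 0 ? "" : name.substring(dot + 1).toLowerCase(Locale.ROOT);
        HeaderExtractor extractor = byExtension.get(ext);
        if (extractor != null) {
            // Unknown types fall through and are listed by name only.
            entry.putAll(extractor.extract(file));
        }
        return entry;
    }

    /** Lists the regular files in a directory, sorted by path, one aggregated entry per file. */
    public List<Map<String, String>> list(Path dir) {
        try (Stream<Path> files = Files.list(dir)) {
            return files.filter(Files::isRegularFile)
                        .sorted()
                        .map(this::describe)
                        .collect(Collectors.toList());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Adding a new format is then a single register() call, with no change to the listing code. Whether this performs better than a single extended generator is exactly the open question.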

Option 3. Tell the users to 'SaveAs' MS-Office documents into an XML 
format and use XSLT to extract the summary information. For example, 
the Visio binary format (VSD) can be saved as VDX and the same 
information can be extracted via XSLT.

All of the above are feasible and independent of the user interface, so 
the question is more about performance.

Has anyone gone down this route? Are there any pitfalls I need to be 
aware of? For the experienced Cocoon developers, what is your gut-feel 
about which is the preferred option?

Replies much appreciated.

---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe@cocoon.apache.org
For additional commands, e-mail: users-help@cocoon.apache.org

Re: Extending DirectoryGenerator

Posted by Andrew Franz <af...@optushome.com.au>.

(reposted from the user mailing list)
