Posted to dev@cocoon.apache.org by Alfred Fuchs <em...@alfred-fuchs.de> on 2003/10/01 09:43:03 UTC

DirectoryGenerator

hi,

I hope this is the right place to post.


I made some changes to the
DirectoryGenerator (1)
and HTMLGenerator (4)
and introduced a PipelineDirectoryGenerator(2)
and FileInfoGenerator (3) for convenience.

My aim was to scan a directory for (HTML and XML) articles,
extract the titles from the files and show an overview page to the user.

I don't know whether this is interesting for you,
where to post the code/example for review,
or how to name the packages/classes.


best regards,
alfred



(1) ====================
I made some extensions to the DirectoryGenerator.
(All of them can be switched on and off, so the new generator behaves like the old one
when no optional parameters are given.)

- it is now possible to recurse into directories even when the pattern does not match the
   directory name.
- directories with no matching files can be excluded from the result set.
- matching directories can be added to the result set (<dir:file type="directory"..../>)
- the sort order can now be a whitespace-separated list, so you can also define
   the ordering on the second, third... level.
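
A minimal sketch of how a whitespace-separated, multi-level sort order could be
built by chaining comparators. The class name SortOrder and the key names
(name, size, lastmodified, directory) are assumptions made for illustration;
they are not taken from the actual patch.

```java
import java.io.File;
import java.util.Comparator;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical helper: turns a spec such as "directory name" into one comparator. */
class SortOrder {
    // Assumed key names; the real generator may use different ones.
    private static final Map<String, Comparator<File>> KEYS = new HashMap<>();
    static {
        KEYS.put("name", Comparator.comparing(File::getName));
        KEYS.put("size", Comparator.comparingLong(File::length));
        KEYS.put("lastmodified", Comparator.comparingLong(File::lastModified));
        // false sorts before true, so directories come first
        KEYS.put("directory", Comparator.comparing((File f) -> !f.isDirectory()));
    }

    /** Each further key only decides between entries the earlier keys consider equal. */
    static Comparator<File> parse(String spec) {
        Comparator<File> result = (a, b) -> 0; // no keys: leave the order unchanged
        for (String key : spec.trim().split("\\s+")) {
            if (key.isEmpty()) {
                continue;
            }
            Comparator<File> next = KEYS.get(key);
            if (next == null) {
                throw new IllegalArgumentException("unknown sort key: " + key);
            }
            result = result.thenComparing(next);
        }
        return result;
    }
}
```

With this sketch, SortOrder.parse("directory name") would list directories
first and order entries by name within each group.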



(2) ====================
I introduced a PipelineDirectoryGenerator:
the content of the <dir:file> tag is the result
of a matching pipeline:

<map:generate type="pipeline-directory" src=".">
	<map:parameter name="include" value="\.(xml|html|jpg|gif|png)$" />
	<map:parameter name="mimeTypePipeline" value="cocoon:/mime-type-directory" />
</map:generate>

<map:match pattern="mime-type-directory**">
	<map:match pattern="mime-type-directory/**.xml">
		[...]
	</map:match>
	<map:match pattern="mime-type-directory/**.html">
		[...]
	</map:match>
	[...]
</map:match>



(3) ====================
I introduced a FileInfoGenerator:
it just reuses some pieces of code extracted from the ImageDirectoryGenerator
plus some additional code.




(4) ====================
I patched the HTMLGenerator
(I always got a NullPointerException on XHTML documents with <?xml ....?> in the header.)

changed:
	streamer.stream(doc);

to:
	this.contentHandler.startDocument();
	streamer.stream( doc.getDocumentElement() );
	this.contentHandler.endDocument();				

(Would it not be better to move this down into the DOMStreamer?)
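
To illustrate the idea of letting the streamer itself emit the document
events, here is a small hand-written DOM-to-SAX walker. It is NOT Cocoon's
DOMStreamer; the class MiniDomStreamer is hypothetical, ignores namespaces,
and skips comments and processing instructions.

```java
import org.w3c.dom.Document;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;

/** Hypothetical sketch: always wraps the streamed element in start/endDocument. */
class MiniDomStreamer {
    private final ContentHandler handler;

    MiniDomStreamer(ContentHandler handler) {
        this.handler = handler;
    }

    /** Accepts a Document or an Element and emits one complete SAX document. */
    void stream(Node node) throws SAXException {
        Node root = (node instanceof Document)
                ? ((Document) node).getDocumentElement()
                : node;
        handler.startDocument();
        if (root != null) {
            walk(root);
        }
        handler.endDocument();
    }

    private void walk(Node node) throws SAXException {
        switch (node.getNodeType()) {
            case Node.ELEMENT_NODE: {
                AttributesImpl atts = new AttributesImpl();
                NamedNodeMap map = node.getAttributes();
                for (int i = 0; i < map.getLength(); i++) {
                    Node a = map.item(i);
                    atts.addAttribute("", a.getNodeName(), a.getNodeName(),
                            "CDATA", a.getNodeValue());
                }
                handler.startElement("", node.getNodeName(), node.getNodeName(), atts);
                for (Node c = node.getFirstChild(); c != null; c = c.getNextSibling()) {
                    walk(c);
                }
                handler.endElement("", node.getNodeName(), node.getNodeName());
                break;
            }
            case Node.TEXT_NODE:
            case Node.CDATA_SECTION_NODE: {
                char[] chars = node.getNodeValue().toCharArray();
                handler.characters(chars, 0, chars.length);
                break;
            }
            default:
                // comments, processing instructions etc. are skipped in this sketch
                break;
        }
    }
}
```

Because the document element is streamed instead of the Document node,
anything before the root element never reaches the handler, which is the
effect the patch above relies on.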



RE: DirectoryGenerator

Posted by Conal Tuohy <co...@paradise.net.nz>.
Alfred Fuchs wrote:

> in my example I extract the title of an HTML page in this way:
> if <title> exists and <title> is not empty, use it as the title.
> otherwise use the first <h1> etc...
> this logic is simple to do in an XSLT, but how to do this in
> a single XPath query?

string(/html/head/title[normalize-space()]|/html/body//h1[1])
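
The expression works because string() applied to a node-set returns the
string-value of the first node in document order: a non-empty <title> comes
before any <h1>, and an empty <title> is filtered out by the predicate. A
small sketch for trying it with the standard javax.xml.xpath API; the class
name TitleXPath is made up here, and the input is assumed to be well-formed
HTML without a namespace (with an XHTML namespace the paths would not match).

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

/** Hypothetical helper that evaluates the XPath expression from this thread. */
class TitleXPath {
    static final String TITLE_OR_H1 =
            "string(/html/head/title[normalize-space()]|/html/body//h1[1])";

    /** Returns the non-empty title, else the first h1, else the empty string. */
    static String titleOf(String html) throws XPathExpressionException {
        XPath xpath = XPathFactory.newInstance().newXPath();
        return xpath.evaluate(TITLE_OR_H1, new InputSource(new StringReader(html)));
    }
}
```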

Cheers

Con

Re: DirectoryGenerator

Posted by Alfred Fuchs <em...@alfred-fuchs.de>.
 > Why would you use a directory scanner separately?
because I already had a directory scanner
(with a callback handler; the default callback handler writes XML to stdout :-)
and the algorithm (based on java.io.File and java.io.FileFilter)
is completely independent of other packages.
the benefit of separating the algorithm from the "view" (output) is
that I can use the scanner in a command-line utility and
in the DirectoryGenerator as well.

> Besides, I think you'd be much better off using Source instead of File (see 
> o.a.c.g.TraversableGenerator in scratchpad).
You are right, I overlooked the scratchpad... I will adapt the generator.


> Is that because you have to generate (XML) metadata out of an image, 
 > as an example?
yes, and out of HTML, or e.g. out of an (XML) description file (not necessarily in the same directory)
with the same name as the file (e.g. an MS Word document) itself.


 > 2. extract information from the returned resource in a pluggable and
 > extensible way (could be XPath or some binary manipulation).
 >
 > Have I got the overall picture? If that's the case, I'd still go for a
yes, nearly.
in my example I extract the title of an HTML page in this way:
if <title> exists and <title> is not empty, use it as the title.
otherwise use the first <h1> etc...
this logic is simple to do in an XSLT, but how to do this in
a single XPath query?


 > I tend to like the multiple transformer best though,
 > since they're more in line with Cocoon's overall design.
but why then the XPathDirectoryGenerator?
<map:generate type="directory" src="..."/>
<map:transform src="Add-XInclude-Tags.xsl"/>
<map:transform type="xinclude"/>
<map:transform src="apply-xpath-to-included-files.xsl"/>
...
would do the same.
the "pipeline" approach seems very modular to me
and reuses components that are already there.

what's wrong with the "pipeline" approach?
<map:aggregate/> also uses multiple pipelines.


> <map:transformer name="infoextractor" src="...">
>     <renderer match="mime-type" pattern="text/html" class="..."/>
>     <renderer match="ext" pattern="gif|png|jpg" class="..."/>
> </map:transformer>
hmmm,
would
      <renderer match="mime-type" pattern="text/html" class="...XSLT...">
	<source src="some-xslt.xsl"/>
      </renderer>

be OK?

best regards,
alfred



Re: DirectoryGenerator

Posted by Gianugo Rabellino <gi...@apache.org>.
Alfred Fuchs wrote:
>>> [extension to] DirectoryGenerator [...]
>>
>> That's interesting. I guess, however, that the best solution 
>> would be patching the real DirectoryGenerator (in doing so please sync 
>> the *TraversableGenerator in scratchpad too).
> 
> OK, but where should I post the patch?

Bugzilla (http://nagoya.apache.org/bugzilla) is your friend :-)

> 
> another question: I separated the real directory-scanner from the
> generator code. in which package should I put these helper classes?

Either o.a.c.generation.helpers or o.a.c.components.something (if they 
are Avalon components, as I assume they should). But what is the need 
for that? Why would you use a directory scanner separately?

Besides, I think you'd be much better off using Source instead of File (see 
o.a.c.g.TraversableGenerator in scratchpad).

> =======================
> 
>>> [...] PipelineDirectoryGenerator [...]
>>
>> Delegating the inclusion to another pipeline looks pretty neat and 
>> definitely useful, but I don't feel like this should be the job of a 
>> monolithic generator. How about turning that part into a transformer 
>> who reads the DG/TG output and does exactly what you're planning?
> 
> I thought of it, but this approach leads to the current design (I think):
> a special directory-scanner for XML files,
> a special directory-scanner for HTML files,
> a special directory-scanner for Image files

I can't see why. Is that because you have to generate (XML) metadata out 
of an image, as an example? In that case, it might make sense to have 
some kind of pluggable "scanners" that can be hooked into the generator 
in the component configuration phase. But that might also be 
overcomplicated.

Let's start from scratch: what you actually want is

1. dig through a set of directories, possibly disregarding the inclusion 
pattern in the case of directories. This is just a simple patch to the 
current generator(s).

2. extract information from the returned resource in a pluggable and 
extensible way (could be XPath or some binary manipulation).

Have I got the overall picture? If that's the case, I'd still go for a 
set of small interchangeable components instead of the monolithic 
approach. Transformers still look best to me, with a possible variation: 
instead of chaining transformers you could configure a generic 
transformer with pluggable "renderers" which might respond to a specific 
MIME type or pattern. Such as:

<map:transformer name="infoextractor" src="...">
	<renderer match="mime-type" pattern="text/html" class="..."/>
	<renderer match="ext" pattern="gif|png|jpg" class="..."/>
</map:transformer>

Where renderer is something like

interface XmlRenderer extends Reparameterizable {
	Document render(Source src);
}

Might be quite a bit of overcomponentization. OTOH, such renderers could 
be used elsewhere too (even outside Cocoon). I tend to like the multiple 
transformer best though, since they're more in line with Cocoon's 
overall design.
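
A rough sketch of how a generic transformer might dispatch to such pluggable
renderers. Everything here is hypothetical: InfoRenderer is a simplified
stand-in for the XmlRenderer idea (a String replaces both Source and
Document), and only the match="ext" case is modelled.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

/** Hypothetical, simplified renderer: takes a URI, returns some XML as text. */
interface InfoRenderer {
    String render(String uri);
}

/** Hypothetical registry a generic "infoextractor" transformer could consult. */
class RendererRegistry {
    private static final class Entry {
        final Pattern extension;
        final InfoRenderer renderer;

        Entry(Pattern extension, InfoRenderer renderer) {
            this.extension = extension;
            this.renderer = renderer;
        }
    }

    private final List<Entry> entries = new ArrayList<>();

    /** Mirrors a configuration line like <renderer match="ext" pattern="gif|png|jpg" .../>. */
    void register(String extensionPattern, InfoRenderer renderer) {
        entries.add(new Entry(Pattern.compile(extensionPattern), renderer));
    }

    /** The first registered renderer whose pattern matches the file extension wins. */
    String render(String uri) {
        int dot = uri.lastIndexOf('.');
        String ext = dot < 0 ? "" : uri.substring(dot + 1);
        for (Entry e : entries) {
            if (e.extension.matcher(ext).matches()) {
                return e.renderer.render(uri);
            }
        }
        return null; // no renderer configured for this resource
    }
}
```

The registry itself knows nothing about images or HTML, so new resource
types would only need a new renderer and one more configuration line.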

Ciao,

-- 
Gianugo Rabellino
Pro-netics s.r.l. -  http://www.pro-netics.com
Orixo, the XML business alliance - http://www.orixo.com
     (Now blogging at: http://blogs.cocoondev.org/gianugo/)


Re: DirectoryGenerator

Posted by Alfred Fuchs <em...@alfred-fuchs.de>.
>> [extension to] DirectoryGenerator [...]
> That's interesting. I guess, however, that the best solution would 
> be patching the real DirectoryGenerator (in doing so please sync the 
> *TraversableGenerator in scratchpad too).
OK, but where should I post the patch?

another question: I separated the real directory-scanner from the
generator code. in which package should I put these helper classes?


=======================
>> [...] PipelineDirectoryGenerator [...]
> Delegating the inclusion to another pipeline looks pretty neat and 
> definitely useful, but I don't feel like this should be the job of a 
> monolithic generator. How about turning that part into a transformer who 
> reads the DG/TG output and does exactly what you're planning?
I thought of it, but this approach leads to the current design (I think):
a special directory-scanner for XML files,
a special directory-scanner for HTML files,
a special directory-scanner for image files
etc...
and then merging the output via aggregation and applying an advanced stylesheet...


or in addition to the XIncludeTransformer
an HTMLIncludeTransformer,
an ImagePropertyIncludeTransformer,
an XXXIncludeTransformer.

and the switching between the different file types must
also be done in the transformers.

the current design:

<map:pipeline>
     <map:generate type="pipeline-directory" src=".">
         <map:parameter name="include" value="\.(xml|html|jpg|gif|png)$" />
         <map:parameter name="contentPipeline" value="cocoon:/content-pipeline" />
     </map:generate>

     <map:serialize type="xml"/>
</map:pipeline>


<map:pipeline internal-only="true">
     <map:match pattern="content-pipeline**">

         <map:match pattern="content-pipeline/**other**.xml">
             <map:generate src="{1}other{2}.xml"/>
             <map:transform src="extractOtherTitles.xsl"/>
             <map:serialize type="xml"/>
         </map:match>

         <map:match pattern="content-pipeline/**.xml">
             <map:generate src="{1}.xml"/>
             <map:transform src="extractTitles.xsl"/>
             <map:serialize type="xml"/>
         </map:match>

         <map:match pattern="content-pipeline/**.html">
             <map:generate type="html" src="{1}.html"/>
             <map:transform src="extractTitles.xsl"/>
             <map:serialize type="xml"/>
         </map:match>

         <map:match pattern="content-pipeline/**">
             <map:generate type="fileinfo" src="{1}"/>
             <map:serialize type="xml"/>
         </map:match>
     </map:match>
</map:pipeline>



would then look like this (?)

<map:pipeline>
	<map:generate type="directory" src=".">
		<map:parameter name="include" value="\.(xml|html|jpg|gif|png)$" />

		<map:parameter name="mime-type-match:text/xml"       value="\.xml$" />
		<map:parameter name="mime-type-match:text/xml/other" value=".*other.*\.xml$" />
		<map:parameter name="mime-type-match:text/html"      value="\.html$" />
		<map:parameter name="mime-type-match:image"          value="\.(jpg|gif|png)$" />
	</map:generate>

	<map:transform src="prepareIncludeTransformers.xsl"/>

	<map:transform type="xinclude"/>
	<map:transform type="htmlinclude"/>
	<map:transform type="imageinfoinclude"/>

	<map:transform src="extractTitlesAndFileInfo.xsl"/>
	
	<map:serialize type="xml"/>
</map:pipeline>


the advantage of the second approach is that
everything is together in one pipeline.

the disadvantage is that for every generator we need an
equivalent transformer.


best regards,
alfred






Re: DirectoryGenerator

Posted by Gianugo Rabellino <gi...@apache.org>.
Alfred Fuchs wrote:
> 
> assume the directory-structure to scan is:
> <ROOTDIR>
>     htmldir
>         hallo.html
>     emptydir
>     nomatchingdir
>         hallo.txt
> 
> 
> the generator
> <map:generate type="directory" src="<ROOTDIR>">
>     <map:parameter name="depth" value="3" />
>     <map:parameter name="include" value="\.(xml|html)$" />
> </map:generate>
> 
> produces this:
> <dir:directory name="<ROOTDIR>" .../>
> 
> oops!
> because the include pattern is also applied to directories.

That's interesting. I guess, however, that the best solution would 
be patching the real DirectoryGenerator (in doing so please sync the 
*TraversableGenerator in scratchpad too).

> ========================
> the difference between the PipelineDirectoryGenerator and
> the XPathDirectoryGenerator:
> 
> the XPathDirectoryGenerator reads an XML file, applies an XPath query and
> adds the result as content to the <dir:file> tag.
> 
> if the PipelineDirectoryGenerator finds a matching file (e.g. 
> path/hello.xml)
> it tries to get content from a matching pipeline (in the example: 
> "cocoon:/content-pipeline/path/hello.xml")
> and adds the result as content to the <dir:file> tag.

Delegating the inclusion to another pipeline looks pretty neat and 
definitely useful, but I don't feel like this should be the job of a 
monolithic generator. How about turning that part into a transformer who 
reads the DG/TG output and does exactly what you're planning?

Ciao,

-- 
Gianugo Rabellino
Pro-netics s.r.l. -  http://www.pro-netics.com
Orixo, the XML business alliance - http://www.orixo.com
     (Now blogging at: http://blogs.cocoondev.org/gianugo/)


Re: DirectoryGenerator

Posted by Alfred Fuchs <em...@alfred-fuchs.de>.
Gianugo Rabellino wrote:

>> I made some changes to the
>> DirectoryGenerator (1)
>> and HTMLGenerator (4)
>> and introduced a PipelineDirectoryGenerator(2)
>> and FileInfoGenerator (3) for convenience.
>>
>> my aim was to scan a directory for (Html,XML-)articles,
>> extract the titles from the files and show an overview page to the user.
> 
> 
> Thanks for your willingness to share! What is the difference between your 
> implementation and the XPath(Directory|Traversable)Generator?

Difference between the DirectoryGenerator and my implementation:

assume the directory-structure to scan is:
<ROOTDIR>
     htmldir
         hallo.html
     emptydir
     nomatchingdir
         hallo.txt


the generator
<map:generate type="directory" src="<ROOTDIR>">
	<map:parameter name="depth" value="3" />
	<map:parameter name="include" value="\.(xml|html)$" />
</map:generate>

produces this:
<dir:directory name="<ROOTDIR>" .../>

oops!
because the include pattern is also applied to directories.


the extended directory generator with the optional parameter
recurseUnmatchingDirectories="true"

<map:generate type="directory-extended" src="<ROOTDIR>">
	<map:parameter name="depth" value="3" />
	<map:parameter name="include" value="\.(xml|html)$" />
	<map:parameter name="recurseUnmatchingDirectories" value="true" />
</map:generate>

gives you this:
<dir:directory name="<ROOTDIR>" ... >
	<dir:directory name="htmldir" ... >
		<dir:file name="hallo.html" ... />
	</dir:directory>
	<dir:directory name="emptydir" ...>
	</dir:directory>
	<dir:directory name="nomatchingdir" ...>
	</dir:directory>
</dir:directory>


and with parameter excludeEmptyDirectorys="true"
this:
<dir:directory name="<ROOTDIR>" ... >
	<dir:directory name="htmldir" ... >
		<dir:file name="hallo.html" ... />
	</dir:directory>
</dir:directory>
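
The three behaviours shown above can be reproduced with a small recursive scan
based on java.io.File. This is only a sketch under assumptions: the class
DirScanSketch and its result type are invented for illustration and are not
the code of the extended generator; the two boolean flags stand for the
recurseUnmatchingDirectories and excludeEmptyDirectorys parameters.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

/** Hypothetical sketch of the scanning logic, independent of Cocoon. */
class DirScanSketch {
    /** Scan result: a directory with its matching files and the subdirectories kept. */
    static final class Dir {
        final String name;
        final List<String> files = new ArrayList<>();
        final List<Dir> dirs = new ArrayList<>();

        Dir(String name) {
            this.name = name;
        }

        boolean isEmpty() {
            return files.isEmpty() && dirs.isEmpty();
        }
    }

    static Dir scan(File dir, Pattern include, int depth,
                    boolean recurseUnmatching, boolean excludeEmpty) {
        Dir result = new Dir(dir.getName());
        File[] children = dir.listFiles();
        if (children == null || depth <= 0) {
            return result;
        }
        Arrays.sort(children); // deterministic order for the example
        for (File child : children) {
            boolean matches = include.matcher(child.getName()).find();
            if (child.isDirectory()) {
                // old behaviour: the include pattern is applied to directories too
                if (!matches && !recurseUnmatching) {
                    continue;
                }
                Dir sub = scan(child, include, depth - 1, recurseUnmatching, excludeEmpty);
                if (!(excludeEmpty && sub.isEmpty())) {
                    result.dirs.add(sub);
                }
            } else if (matches) {
                result.files.add(child.getName());
            }
        }
        return result;
    }
}
```

With both flags set to false the sketch returns only the root entry, as in
the first output; with the first flag set all three subdirectories appear, and
with both set only htmldir remains.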

========================
the difference between the PipelineDirectoryGenerator and
the XPathDirectoryGenerator:

the XPathDirectoryGenerator reads an XML file, applies an XPath query and
adds the result as content to the <dir:file> tag.

if the PipelineDirectoryGenerator finds a matching file (e.g. path/hello.xml)
it tries to get content from a matching pipeline (in the example: "cocoon:/content-pipeline/path/hello.xml")
and adds the result as content to the <dir:file> tag.

example:
<map:pipeline>
	<map:generate type="pipeline-directory" src=".">
		<map:parameter name="extendedFeaturesAsDefault" value="true" />
		<map:parameter name="include" value="\.(xml|html|jpg|gif|png)$" />
		<map:parameter name="contentPipeline" value="cocoon:/content-pipeline" />
	</map:generate>
</map:pipeline>


<map:pipeline internal-only="true">
	<map:match pattern="content-pipeline**">
			
		<map:match pattern="content-pipeline/**.xml">
			<map:generate src="{1}.xml"/>
			<map:transform src="dirscan/xsl/dirscanExtractTitles.xsl"/>
			<map:serialize type="xml"/>
		</map:match>
				
		<map:match pattern="content-pipeline/**.html">
			<map:generate type="html" src="{1}.html"/>
			<map:transform src="dirscan/xsl/dirscanExtractTitles.xsl"/>
			<map:serialize type="xml"/>
		</map:match>
			
		<map:match pattern="content-pipeline/**">
			<map:generate type="fileinfo" src="{1}"/>
			<map:serialize type="xml"/>
		</map:match>
	</map:match>
</map:pipeline>

========================================

best regards,
alfred





Re: DirectoryGenerator

Posted by Gianugo Rabellino <gi...@apache.org>.
Alfred Fuchs wrote:
> hi,
> 
> hope, this is the right place to post.
> 
> 
> I made some changes to the
> DirectoryGenerator (1)
> and HTMLGenerator (4)
> and introduced a PipelineDirectoryGenerator(2)
> and FileInfoGenerator (3) for convenience.
> 
> my aim was to scan a directory for (Html,XML-)articles,
> extract the titles from the files and show an overview page to the user.

Thanks for your willingness to share! What is the difference between your 
implementation and the XPath(Directory|Traversable)Generator?

Ciao,

-- 
Gianugo Rabellino
Pro-netics s.r.l. -  http://www.pro-netics.com
Orixo, the XML business alliance - http://www.orixo.com
     (Now blogging at: http://blogs.cocoondev.org/gianugo/)