Posted to dev@cocoon.apache.org by Robert Graham <in...@gmail.com> on 2005/07/14 23:39:04 UTC
RefDoc Direction
Hi All,
This is mostly a question for the mentors on the project, but ideas
are welcome. I posted to dev mostly to have it backed up.
I've been through the prototype code and I've even taken care of a couple
of the TODOs (specifically the matching of multiple key-value pairs and
most of a refactoring of slop/xml-to-snippets.xsl). I'm not sure how to
handle selecting multiple codebases, but I'm interested in ideas. I
thought we should make a minor change to the doktor comment syntax by
adding commas between key-value pairs to make it easier to process.
Beyond that, I'm looking for a little more direction. I understand the
goal and the prototype as it is, but I'm a little confused as to the
next step.
Thanks,
Robert
Re: RefDoc Direction
Posted by Bertrand Delacretaz <bd...@apache.org>.
On 18 Jul 2005, at 16:49, Robert Graham wrote:
>> 1. Extract snippets from the various types of source files: XML, java,
>> text
>>
>
> I feel that this is mostly complete, but I'm open to new suggestions.
ok - what's in the prototype is probably good enough for now.
>
>> 2. Convert these snippets to an XML form that is easily indexable with
>> Lucene, generating Lucene "fields" for all important pieces of
>> information: snippet key, snippet type, title, etc.
>>
>
> This needs a little work. This represents the "single snippet" page
> you had in the refdoc prototype if I'm not mistaken and currently they
> don't contain enough information.
Right, we certainly need more fields.
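To make "more fields" concrete, here is a minimal Python sketch that turns one snippet into a flat XML document with one element per field. The element names (`snippet`, `field`), the extra `source` field, and the sample values are assumptions for illustration; the prototype's actual format is not shown in this thread.

```python
import xml.etree.ElementTree as ET

def snippet_to_xml(fields):
    """Build a flat <snippet> document with one <field> per entry (hypothetical layout)."""
    root = ET.Element("snippet")
    for name, value in fields.items():
        field = ET.SubElement(root, "field", {"name": name})
        field.text = value
    return ET.tostring(root, encoding="unicode")

example = snippet_to_xml({
    "key": "FileGenerator",                 # snippet key
    "type": "java",                         # snippet type
    "title": "Reading an XML file",         # made-up title
    "source": "src/FileGenerator.java",     # assumed extra field, made-up path
})
print(example)
```

Keeping every piece of information in its own named element is one way to let an indexer map elements to fields without parsing free text.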
>
>> 2b. Also generate "navigation documents" which Lucene will use to find
>> all snippets. This is shown in the prototype already.
>>
>
> This seems mostly done, though I wonder if some of the links generated
> will work as is for indexing. For example one set of the "a" tags has
> the href="[@id]" or something like href="snippet_31". Can the
> crawler/indexer sort that out?
href="snippet_31" looks Ok as a relative URL.
[@id] is probably an XSLT typo; it should be {@id} (an attribute value
template) to generate a dynamic link.
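For reference, this is roughly how a crawler would resolve such a relative href against the page it was found on. It is a small Python sketch using the standard `urljoin`; the base URL is a made-up placeholder, since the prototype's real URL is not given in this thread.

```python
from urllib.parse import urljoin

# Hypothetical URL of a navigation document (assumption, not from the prototype).
NAV_PAGE = "http://localhost:8888/refdoc/navigation/index.html"

def resolve(href, base=NAV_PAGE):
    """Resolve an href relative to the page that contains it."""
    return urljoin(base, href)

print(resolve("snippet_31"))
# -> http://localhost:8888/refdoc/navigation/snippet_31
```

A literal href="[@id]" would not be expanded by XSLT, so every such link would carry the same fixed text instead of a per-snippet target.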
>
>> 3. Crawl and index the generated XML documents with Lucene, at first
>> using the Lucene block out of the box, I assume. Some manual work (like
>> starting the index creation from a URL) is ok at this stage, we're
>> trying to demonstrate the full chain before implementing everything.
>>
>
> In the works. I might write some Java code for indexing and searching
> soon, but I'll keep it skeletal until I feel good about it.
ok
>
>> 4. Create the required Lucene queries to put together snippets coming
>> from different source files but having the same key (e.g. all
>> "FileGenerator" snippets). I might need to add @doktor stuff to
>> existing code and samples so that you can see better how this should
>> work.
>>
>
> Future work.
sure.
>
>> 5. Transform the results of these queries to an XML document in a
>> publication-neutral format, where one document contains all the info
>> and code excerpts provided by snippets having the same key.
>
> Should we also retain the ability for a user-based query that could
> dynamically publish a document on their query?...
Probably useful, you can maybe leave this open and we'll see which
queries are useful.
>
> ... was also
> more stuck on where to go from the TODOs at the time, but found a
> direction to keep moving in...
Cool, thanks for your work! According to the ongoing vote you should be
able to get access soon to commit your work, in the meantime if you
want to put a patch in bugzilla I'll take care of it.
-Bertrand
Re: RefDoc Direction
Posted by Robert Graham <in...@gmail.com>.
> 1. Extract snippets from the various types of source files: XML, java,
> text
>
I feel that this is mostly complete, but I'm open to new suggestions.
> 2. Convert these snippets to an XML form that is easily indexable with
> Lucene, generating Lucene "fields" for all important pieces of
> information: snippet key, snippet type, title, etc.
>
This needs a little work. This represents the "single snippet" page
you had in the refdoc prototype if I'm not mistaken and currently they
don't contain enough information.
> 2b. Also generate "navigation documents" which Lucene will use to find
> all snippets. This is shown in the prototype already.
>
This seems mostly done, though I wonder if some of the links generated
will work as is for indexing. For example one set of the "a" tags has
the href="[@id]" or something like href="snippet_31". Can the
crawler/indexer sort that out?
> 3. Crawl and index the generated XML documents with Lucene, at first
> using the Lucene block out of the box, I assume. Some manual work (like
> starting the index creation from a URL) is ok at this stage, we're
> trying to demonstrate the full chain before implementing everything.
>
In the works. I might write some Java code for indexing and searching
soon, but I'll keep it skeletal until I feel good about it.
> 4. Create the required Lucene queries to put together snippets coming
> from different source files but having the same key (e.g. all
> "FileGenerator" snippets). I might need to add @doktor stuff to
> existing code and samples so that you can see better how this should
> work.
>
Future work.
> 5. Transform the results of these queries to an XML document in a
> publication-neutral format, where one document contains all the info
> and code excerpts provided by snippets having the same key.
Should we also retain the ability for a user-based query that could
dynamically publish a document on their query?
That sounds about like what I have in my notes. Thanks for walking
through it. I came to many of those conclusions while working through
the prototype, but some of the details were still hazy. I was also
more stuck on where to go from the TODOs at the time, but found a
direction to keep moving in.
Thanks,
Robert
Re: RefDoc Direction
Posted by Bertrand Delacretaz <bd...@apache.org>.
Hi Robert,
On 14 Jul 2005, at 23:39, Robert Graham wrote:
> ...This is mostly a question for the mentors on the project, but ideas
> are welcome. I posted to dev mostly to have it backed up...
Note that posting to dev@ only is fine, we're monitoring it (but feel
free to ping me directly if questions stay unanswered for too long).
> ...I've been through the prototype code and I've even gotten a couple
> of the TODOs taken care of (specifically the matching multiple key-value
> pairs and most of a refactoring of slop/xml-to-snippets.xsl)...
Cool. I'll ask for your access to the whiteboard/refdoc subdirectory in
the next few days, so that you can commit this.
> I'm not
> sure how to accomplish the selecting of multiple codebases, but I'm
> interested in ideas..
This can come later, for example using an XML file which maps symbolic
codebase names to actual paths, and generating a virtual "top-level
directory" from this file.
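A minimal Python sketch of that idea follows. The mapping file layout (`codebases`, `codebase`, `name`, `path`) and the paths are assumptions made up for illustration; nothing in this thread fixes the format.

```python
import xml.etree.ElementTree as ET

# Hypothetical mapping file: element names, attribute names and paths are assumptions.
MAPPING_XML = """
<codebases>
  <codebase name="core" path="/home/user/cocoon/src/java"/>
  <codebase name="samples" path="/home/user/cocoon/src/webapp/samples"/>
</codebases>
"""

def load_codebases(xml_text):
    """Return {symbolic name: actual path} from the mapping document."""
    root = ET.fromstring(xml_text)
    return {cb.get("name"): cb.get("path") for cb in root.findall("codebase")}

def virtual_top_level(codebases):
    """List the entries of a virtual top-level directory, one per codebase."""
    return sorted(codebases)

codebases = load_codebases(MAPPING_XML)
print(virtual_top_level(codebases))   # symbolic names shown to the user
print(codebases["core"])              # actual path used for extraction
```

The symbolic names are what the rest of the chain would see; only this lookup needs to know the real locations.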
> . I thought we should make a minor change to the
> doktor comment syntax in adding commas between key-value pairs to make
> it more reasonable to process...
Good idea, go for it!
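To show why the commas help, here is a small Python sketch of parsing comma-separated key-value pairs. The exact doktor comment syntax is not shown in this thread, so the `key=value, key=value` form and the sample keys are assumptions.

```python
def parse_pairs(text):
    """Split 'k1=v1, k2=v2' into a dict; whitespace around keys and values is dropped."""
    pairs = {}
    for chunk in text.split(","):
        chunk = chunk.strip()
        if not chunk:
            continue  # tolerate a trailing comma
        key, sep, value = chunk.partition("=")
        if not sep:
            raise ValueError(f"missing '=' in {chunk!r}")
        pairs[key.strip()] = value.strip()
    return pairs

# Hypothetical comment payload; values may contain spaces because commas delimit the pairs.
print(parse_pairs("key=FileGenerator, type=example, title=Reading a file"))
```

With an explicit separator, a value such as "Reading a file" no longer has to be guessed from whitespace alone.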
> Past all that I'm looking for a little
> more direction. I understand the goal and the prototype as it is, but
> I'm a little confused as to the next step...
Sorry to have left you with very little info until now, the timing was
bad with me being offline last week. But I'll have more time to follow
up now (until about August 10th when I'm going to be offline for 10
days again).
The steps that I'm seeing are the following (some of them are done
already, at least partially):
1. Extract snippets from the various types of source files: XML, java,
text
2. Convert these snippets to an XML form that is easily indexable with
Lucene, generating Lucene "fields" for all important pieces of
information: snippet key, snippet type, title, etc.
2b. Also generate "navigation documents" which Lucene will use to find
all snippets. This is shown in the prototype already.
3. Crawl and index the generated XML documents with Lucene, at first
using the Lucene block out of the box, I assume. Some manual work (like
starting the index creation from a URL) is ok at this stage, we're
trying to demonstrate the full chain before implementing everything.
4. Create the required Lucene queries to put together snippets coming
from different source files but having the same key (e.g. all
"FileGenerator" snippets). I might need to add @doktor stuff to
existing code and samples so that you can see better how this should
work.
5. Transform the results of these queries to an XML document in a
publication-neutral format, where one document contains all the info
and code excerpts provided by snippets having the same key.
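
Steps 4 and 5 can be pictured with the following Python sketch. It is an in-memory stand-in, not Lucene: a plain dictionary lookup replaces the query, and the record fields, sample data and output element names (`document`, `snippet`) are all assumptions for illustration.

```python
from collections import defaultdict
import xml.etree.ElementTree as ET

# Made-up snippet records, as they might look after extraction (steps 1 and 2).
SNIPPETS = [
    {"key": "FileGenerator", "type": "java", "source": "FileGenerator.java", "text": "class doc"},
    {"key": "FileGenerator", "type": "xml", "source": "sitemap.xmap", "text": "usage sample"},
    {"key": "XSLTTransformer", "type": "java", "source": "XSLTTransformer.java", "text": "class doc"},
]

def group_by_key(snippets):
    """Stand-in for step 4: collect all snippets that share the same key."""
    grouped = defaultdict(list)
    for snippet in snippets:
        grouped[snippet["key"]].append(snippet)
    return dict(grouped)

def merged_document(key, snippets):
    """Stand-in for step 5: one XML document holding every snippet for a key."""
    root = ET.Element("document", {"key": key})
    for snippet in snippets:
        el = ET.SubElement(root, "snippet",
                           {"type": snippet["type"], "source": snippet["source"]})
        el.text = snippet["text"]
    return ET.tostring(root, encoding="unicode")

grouped = group_by_key(SNIPPETS)
print(merged_document("FileGenerator", grouped["FileGenerator"]))
```

In the real chain the grouping would come from an index query on the key field, and the merged document would then be handed to whatever publication format is chosen.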
Let me know if you need more info about this!
-Bertrand