Posted to dev@lenya.apache.org by Robert Goene <ro...@goene.nl> on 2005/06/12 21:13:24 UTC

lenya-search proposal

Hi,

Here is my new proposal. I have done some more research and have provided 
a bit more background information and a rudimentary timeline.

Hope you like the looks of it, because the clock is ticking!

Regards, Robert

Re: lenya-search proposal

Posted by Robert Goene <ro...@goene.nl>.
>>> I think this could be done more generally, such that a document is
>>> reindexed every time it changes, e.g. also after editing, and there
>>> could be one index for the authoring area and one for the live area.
>>
>> If the document changes, it will be reindexed. I don't really see the 
>> need for a separate index for every area.
> 
> Within the authoring area the content can be quite different. Also, there
> can be documents which don't exist within the live area. I think it
> definitely makes sense to have different indices, or rather to be able to
> search on "different versions" with regard to workflow status.
> 
> I don't think it will be much more work to implement; rather, keep the
> interface general enough and maybe just implement the live area if time is
> too limited for you. But I'd suggest that you rather drop some other
> features and focus on this.
> 

I don't think I have the time to change the proposal. The deadline is 
tomorrow. I don't think this is difficult to implement: add a field 
holding the area to each Lucene document and add a hidden field to the 
search form. It would also require indexing the document in the submit 
step. This should be fairly easy to add when I am working on it.
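
A minimal sketch of the area-field idea, written in Python as a toy
in-memory index rather than against the Lucene API. The names IndexedDoc
and TinyIndex and the document paths are made up for illustration. Each
entry carries its area, and the search takes the area as a parameter, the
way a hidden form field would pass it along.

```python
from dataclasses import dataclass


@dataclass
class IndexedDoc:
    doc_id: str   # e.g. the document's path within the publication
    area: str     # "authoring" or "live" (the extra field)
    fields: dict  # field name -> text


class TinyIndex:
    """Toy in-memory stand-in for a Lucene index; not the Lucene API."""

    def __init__(self):
        self._docs = {}

    def index(self, doc):
        # Keyed by (area, id), so the authoring and live versions coexist
        # and re-indexing after an edit simply overwrites the old entry.
        self._docs[(doc.area, doc.doc_id)] = doc

    def remove(self, area, doc_id):
        # What a deactivate step would do for the live area.
        self._docs.pop((area, doc_id), None)

    def search(self, term, area):
        # 'area' plays the role of the hidden field in the search form.
        term = term.lower()
        return sorted(
            doc.doc_id
            for (doc_area, _), doc in self._docs.items()
            if doc_area == area
            and any(term in text.lower() for text in doc.fields.values())
        )


idx = TinyIndex()
idx.index(IndexedDoc("/news/release", "authoring", {"title": "Lenya 1.4 release draft"}))
idx.index(IndexedDoc("/news/release", "live", {"title": "Lenya 1.4 release"}))
idx.index(IndexedDoc("/news/internal", "authoring", {"title": "Internal release notes"}))

print(idx.search("release", "live"))
print(idx.search("release", "authoring"))
```

With one shared index the area is just another field to filter on, so
separate physical indices per area are not strictly required for this.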

>>
>> Changing the fields would require a change of the 'obsolete' xml 
>> documents, but i think this is a rare case that should actually be 
>> avoided. Fields can be added or fields can become obsolete without a 
>> problem, but changing a field is something that is done rarely, if 
>> ever. Could you give me a scenario where this would be an urgent problem?
> 
> Let's say you have one title field, and then the schema is changed so
> that it has a maintitle and a subtitle; title will be gone, and the old
> title becomes the maintitle.
> 
I would say: keep the title and add a subtitle field :)
I think changing a field is not something one should do often, also 
because of the incremental nature of the index building: the index isn't 
rebuilt but updated when a document changes. A wholesale change of a 
field in the index just does not fit into this idea.
One could, of course, change all documents. This would also cause the 
documents to be reindexed. For this rare occasion, one could write a 
simple script to run through the documents that have to be changed.
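
Such a script could look roughly like the sketch below (Python, standard
library only). It assumes the marker form from the press release example,
<lenya:index>fieldname</lenya:index>, and a made-up namespace URI for the
lenya prefix; the function name is hypothetical as well. It works on a
string here, so looping over the files of a publication is left out.

```python
import xml.etree.ElementTree as ET

# Hypothetical namespace URI; the thread does not define one for lenya:index.
LENYA_NS = "http://example.org/lenya-index"
ET.register_namespace("lenya", LENYA_NS)


def rename_index_field(xml_source, old_name, new_name):
    """Rewrite every lenya:index marker naming old_name to new_name.

    Returns the new XML text and the number of markers changed.
    """
    root = ET.fromstring(xml_source)
    changed = 0
    for marker in root.iter(f"{{{LENYA_NS}}}index"):
        if (marker.text or "").strip() == old_name:
            marker.text = new_name
            changed += 1
    return ET.tostring(root, encoding="unicode"), changed


doc = (
    '<pr xmlns:lenya="http://example.org/lenya-index">'
    "<title><lenya:index>title</lenya:index>Lenya 1.4 released</title>"
    "<content><lenya:index>contents</lenya:index>Some text</content>"
    "</pr>"
)
new_doc, count = rename_index_field(doc, "title", "maintitle")
print(count)
print(new_doc)
```

Saving each rewritten document would then trigger the normal reindexing,
so the index itself never needs a separate migration step.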

I am not sure if I should add it to my proposal, due to time 
constraints. Do you think this is a major issue?

>> <pr>
>> <title>
>> <lenya:index>title</lenya:index>
>> Lenya 14 release preponed
>> </title>
>> <content>
>> <lenya:index>contents</lenya:index>
>> The release of Lenya 1.4, the Apache Content Management System, ladila
>> </content>
>> </pr>
> 
> how do you want to mark attributes?

Hmm. Good question. I don't see the need for this in my current 
documents, but it should be possible, of course.

The lenya:index element could have an attribute to select the indexable 
attribute.

Besides this, the lenya:index element should have an attribute that 
defines whether child elements are included. One could index only the 
text content of the current element, or also include the child nodes. The 
latter would be the case for the <body> element of an XHTML page.

<lenya:index fieldname="title" includechilds="false" indexattr="test"/>

When one wants to add some element to multiple Lucene fields, the 
lenya:index element should be added a number of times.
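
A rough sketch of how a parser could read such markers (Python, standard
library only, not the Lucene or Lenya API). The attribute names fieldname,
includechilds and indexattr are the ones suggested above; the namespace
URI, the function names and the sample document are assumptions for
illustration. Repeating the marker puts the same element into several
fields.

```python
import xml.etree.ElementTree as ET

# Hypothetical namespace URI for the lenya prefix.
LENYA_NS = "http://example.org/lenya-index"
MARKER = f"{{{LENYA_NS}}}index"


def _text(elem, include_children):
    """Text of elem, skipping markers and, optionally, child elements."""
    parts = [elem.text or ""]
    for child in elem:
        if child.tag != MARKER and include_children:
            parts.append(_text(child, True))
        parts.append(child.tail or "")
    return " ".join(" ".join(parts).split())  # normalise whitespace


def extract_fields(xml_source):
    """Map index field names to the list of values found in the document."""
    root = ET.fromstring(xml_source)
    fields = {}
    for elem in root.iter():
        for marker in elem.findall(MARKER):  # direct marker children only
            name = marker.get("fieldname")
            attr = marker.get("indexattr")
            if attr is not None:
                value = elem.get(attr, "")  # index an attribute value
            else:
                deep = marker.get("includechilds", "false") == "true"
                value = _text(elem, deep)
            fields.setdefault(name, []).append(value)
    return fields


SAMPLE = """
<pr xmlns:lenya="http://example.org/lenya-index">
  <title lang="en">
    <lenya:index fieldname="title" includechilds="false"/>
    <lenya:index fieldname="language" indexattr="lang"/>
    Lenya 1.4 <em>released</em> today
  </title>
  <content>
    <lenya:index fieldname="contents" includechilds="true"/>
    <lenya:index fieldname="all" includechilds="true"/>
    The release of <b>Lenya 1.4</b> is out
  </content>
</pr>
"""
fields = extract_fields(SAMPLE)
print(fields)
```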

Any other remarks?



>>> how do you want to treat these external links?
> 
> I was actually rather thinking about how you want to handle them 
> within the index, because they won't have the same fields as the ones 
> within Lenya. Do you want
> to create a separate index?

Ah, ok! I want to add them to the same index. Leaving some fields empty 
is no problem in Lucene. I could consider using the keywords of the 
document that contains the links, but this seems very nasty to me...

>>
>> As far as i can see, it contains all the output one can ask for from a 
>> Lucene query.
> 
> also paging?

Yep. Pretty nicely implemented!


> no problem, thanks very much for working on it. Please don't be afraid 
> of my comments (in case you are), but I just want to make sure that 
> various things are being
> considered.

Quite the contrary! It is a delight to know that people are actually 
reading it thoroughly. Thanks!


I would like to send the proposal at the end of this afternoon. Are 
there some major issues that should be changed or added?

Regards, Robert

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lenya.apache.org
For additional commands, e-mail: dev-help@lenya.apache.org


Re: lenya-search proposal

Posted by Michael Wechner <mi...@wyona.com>.
Robert Goene wrote:

> Hi,
>
> Thanks for actually reading it and giving a thorough reply!
>
>>>
>>
>> I think this could be done more generally, such that a document is
>> reindexed every time it changes, e.g. also after editing, and there
>> could be one index for the authoring area and one for the live area.
>
>
> If the document changes, it will be reindexed. I don't really see the 
> need for a separate index for every area.


Within the authoring area the content can be quite different. Also, there
can be documents which don't exist within the live area. I think it
definitely makes sense to have different indices, or rather to be able to
search on "different versions" with regard to workflow status.

I don't think it will be much more work to implement; rather, keep the
interface general enough and maybe just implement the live area if time is
too limited for you. But I'd suggest that you rather drop some other
features and focus on this.

>
>>
>> I don't think a document should require a schema, but I guess we get 
>> into a religious war here. But you can definitely not assume that 
>> everything is validated by RelaxNG, because Lenya would close itself 
>> off badly if it neglected schema languages like XSD and others ...
>>
>
> On the one hand you like the centralized definition of the index, as 
> you propose to add the indexing to the schema and on the other hand, 
> you like to keep the schema requirement as flexible as possible. I see 
> the dilemma and that's why i think my idea is a nice way to keep some 
> sort of flexibility on the schema side, but with a centralized 
> definition in the form of the samplefile.


Sorry, I didn't understand that you were talking about a sample file. One
thing to think about, though, is recurring elements and how to handle them.

>
> Changing the fields would require a change of the 'obsolete' xml 
> documents, but i think this is a rare case that should actually be 
> avoided. Fields can be added or fields can become obsolete without a 
> problem, but changing a field is something that is done rarely, if 
> ever. Could you give me a scenario where this would be an urgent problem?


Let's say you have one title field, and then the schema is changed so
that it has a maintitle and a subtitle; title will be gone, and the old
title becomes the maintitle.

>
>>
>
> Well, this is just a first shot. I will probably change it, but 
> something like this:
>
> <pr>
> <title>
> <lenya:index>title</lenya:index>
> Lenya 14 release preponed
> </title>
> <content>
> <lenya:index>contents</lenya:index>
> The release of Lenya 1.4, the Apache Content Management System, ladila
> </content>
> </pr>


how do you want to mark attributes?

>
>>>
>>
>> how do you want to treat these external links?
>
>
> I want to extract the links in the document parser and let Nutch fetch 
> them when the scheduled index process runs. I am not sure yet whether I 
> can feed them to Nutch directly or whether I should add them to a text 
> file that Nutch uses. I will give it another look.


I was actually rather thinking about how you want to handle them 
within the index, because they won't have the same fields as the ones 
within Lenya. Do you want
to create a separate index?

>
> As far as i can see, it contains all the output one can ask for from a 
> Lucene query.


also paging?

> The nice thing is: it is possible to spread the results over several 
> pages. The links to all pages are delivered with the output. It looks 
> pretty comprehensive to me.
>
> Again, thanks for the reply!


no problem, thanks very much for working on it. Please don't be afraid 
of my comments (in case you are), but I just want to make sure that 
various things are being
considered.

Thanks

Michi

>
> Regards, Robert
>


-- 
Michael Wechner
Wyona      -   Open Source Content Management   -    Apache Lenya
http://www.wyona.com                      http://lenya.apache.org
michael.wechner@wyona.com                        michi@apache.org




Re: lenya-search proposal

Posted by Robert Goene <ro...@goene.nl>.
Hi,

Thanks for actually reading it and giving a thorough reply!

>>
>> * Integrate the indexing process with the Lenya publishing usecases
>>
>>  * Index the document when published
>>
>>    When a document is published it should be added to the Lucene index 
>> immediately.
>>    This can be accomplished by extending the publish process, which is 
>> implemented
>>    as a Lenya 1.4 usecase.
>>
>>    
>> http://lenya.apache.org/apidocs/1.4/org/apache/lenya/defaultpub/cms/usecases/Publish.html 
>>
>>  
>>  * Remove the document from the index when deactivated
>>  
>>    Documents that are no longer a part of the 'Live' section of the 
>> Lenya publication
>>    (the public available website) should be immediately removed from 
>> the Lucene index.
>>    In a similar fashion as the publishing of a document, the 
>> deactivate usecase of
>>    Lenya 1.4 should be extended with a removal of the document of the 
>> Lucene index.
>>  
>>
> 
> I think this could be done more generally, such that a document is
> reindexed every time it changes, e.g. also after editing, and there
> could be one index for the authoring area and one for the live area.

If the document changes, it will be reindexed. I don't really see the 
need for a separate index for every area.
> 
> Even more general would be to search the documents in association with
> the workflow, but this would probably rather be for Lenya >1.4; I am
> mentioning it to point out where I think it would make sense to head.

> 
> 
>>    http://lenya.apache.org/apidocs/1.4/org/apache/lenya/defaultpub/cms/usecases/Deactivate.html
>>
>>   * Document parser
>>
>>     * lenya.index
>>
>>       Lucene comes packed with a standard xml and html parser to add
>>       documents to the index. This parser fetches the data out of the
>>       document and stores this data in different fields of the Lucene
>>       index. The documents that Lenya works with are extended XHTML
>>       documents that can be parsed with the standard html parser, but
>>       they would lack the possibility of indexing the metadata that
>>       comes with these Lenya documents.
>>
>>       As a replacement for the ConfigurableIndexer that creates
>>       indexes from a document based on a collection of xpath
>>       statements, I would like to propose an alternative way of
>>       configuring the indexed data. This replacement would consist of
>>       tags in the internal xml documents of Lenya. Every xml element
>>       that must be added to the index needs a special attribute,
>>       something like indexField="fieldName".
>>  
>>
> 
> I don't think the ConfigurableIndexer should be replaced, although I am
> not saying its implementation is great. One wants to keep the definition
> both centralized and attached individually (just as is the case for the
> workflow or validation schema of a document). Always the same problem ;-)
> 
> IIUC then every document would have to be tagged. What if a field is 
> changing?!
> 
> I am not saying your suggestion doesn't make sense for certain cases, but
> I wouldn't treat it as a replacement, but rather as an enhancement.
> 

I would agree with the term enhancement.

>>       One of the big advantages of this approach would be the
>>       availability of data that isn't visible to the outside world,
>>       but could be helpful for the search mechanism to determine the
>>       most relevant results. One could think of the metadata that
>>       isn't completely rendered to html, like the date of creation or
>>       the creator.
>>
>>       Besides this, it would be easier to add a new document type to
>>       Lenya when the indexing of the document can be specified in the
>>       sample document and the Relax NG schema.
>>  
>>
> 
> adding the indexing to the Schema would probably make life easier but is 
> basically
> the same as the current solution (one transformation from one to the 
> other).
> 
>>       Every document in Lenya has an accompanying RelaxNG schema that
>>       validates every edited document when it is saved.
>>
> 
> I don't think a document should require a schema, but I guess we get 
> into a religious war here. But you can definitely not assume that 
> everything is validated by RelaxNG, because Lenya would close itself 
> off badly if it neglected schema languages like XSD and others ...
> 

On the one hand you like the centralized definition of the index, as you 
propose to add the indexing to the schema, and on the other hand you 
like to keep the schema requirement as flexible as possible. I see the 
dilemma, and that's why I think my idea is a nice way to keep some sort 
of flexibility on the schema side, but with a centralized definition in 
the form of the sample file.

Changing the fields would require a change of the 'obsolete' xml 
documents, but I think this is a rare case that should actually be 
avoided. Fields can be added or fields can become obsolete without a 
problem, but changing a field is something that is done rarely, if ever. 
Could you give me a scenario where this would be an urgent problem?

>>       This schema should allow a document to have the index attribute
>>       assigned to a number of elements. These elements should be
>>       extended with the lenya.index attribute. This must be done for
>>       all elements that are allowed to be added to the Lucene index.
>>       This may sound like a lot of work, but it shouldn't be that hard.
>>       An XHTML document, for instance, only needs several metadata and
>>       the body elements to be specified.
>>
>>       The following Relax NG snippet should be added to all elements
>>       that can be indexed. The LenyaFieldName will contain the name of
>>       the Lucene index field.
>>
>>       <define name="lenya:index">
>>        <zeroOrMore>
>>         <element name="lenya.index">
>>          <text/>
>>         </element>
>>        </zeroOrMore>
>>       </define>
>>
>>       Notice the possibility to add more than one LenyaIndexFieldName.
>>       This makes it possible to add the same data to different fields,
>>       which can be useful when the user wants a general or more
>>       specific search: the data will be added to the more general
>>       field that is also used for other fields, and the specific field
>>       is queried when one is aware of the exact field that must be
>>       addressed.
>>
>>       The actual xml document must add the lenya.index elements to the
>>       elements that must be indexed. The actual field name is
>>       specified in the xml document and not the specification. This
>>       makes filling the index more flexible, without making it harder
>>       to have a common indexfield for all documents. Since all
>>       documents are created from a sample xml file, the default
>>       indexfields can be provided in this file. This way individual
>>       exceptions are still possible.
>>
>>       The LenyaIndex parser, as described above, must be applied to
>>       the most used document in Lenya: the XHTML document that is
>>       extended with Dublin Core metadata.
>>
>>       http://lucene.apache.org/java/docs/api/org/apache/lucene/document/Document.html
>>
> 
> 
> I don't fully understand your example. Can you make one which shows the 
> mapping to the Lucene document and a content example, e.g. a press release:
> 
> <pr>
> <title>...</title>
> <date>...</date>
> <content>...</content>
> </pr>
> 

Well, this is just a first shot. I will probably change it, but 
something like this:

<pr>
<title>
<lenya:index>title</lenya:index>
Lenya 14 release preponed
</title>
<content>
<lenya:index>contents</lenya:index>
The release of Lenya 1.4, the Apache Content Management System, ladila
</content>
</pr>


>>   * Document boost
>>
>>     By adding an extra field to the metadata of the documents called
>>     'Document Boost' it will be possible to use the boosting feature of
>>     Lucene to control the relevance of specific documents in the search
>>     results. A pulldown menu with a choosable digit to specify the
>>     boost level should be sufficient.
>>
>>     http://lucene.apache.org/java/docs/api/org/apache/lucene/document/Document.html#setBoost(float)
>>
>>   * Extract external links
>>
>>     The publish process should also extract all the external links -
>>     html and pdf - from the document and add them to the nutch
>>     crawler, so they can be fetched and indexed in the next Nutch run.
>>
>>     In a similar fashion, the external links should be removed from
>>     the Nutch fetch list and the Lucene index when deactivating a
>>     document.
>>  
>>
> 
> how do you want to treat these external links?

I want to extract the links in the document parser and let Nutch fetch 
them when the scheduled index process runs. I am not sure yet whether I 
can feed them to Nutch directly or whether I should add them to a text 
file that Nutch uses. I will give it another look.
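
The extraction step itself could be sketched as follows (Python, standard
library only). This is not Nutch or Lenya code: the class and function
names and the host lenya.example.com are made up, and "external" is
assumed here to mean an absolute http(s) link to a host other than the
site's own. The resulting list could be written one URL per line to
whatever text file the crawler is configured to read.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse


class LinkCollector(HTMLParser):
    """Collects the href of every <a> element in a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def external_links(xhtml, site_host):
    """Return absolute http(s) links to other hosts, without duplicates."""
    collector = LinkCollector()
    collector.feed(xhtml)
    result = []
    for href in collector.hrefs:
        parsed = urlparse(href)
        is_web = parsed.scheme in ("http", "https")
        if is_web and parsed.hostname != site_host and href not in result:
            result.append(href)
    return result


page = (
    '<p><a href="/news/index.html">news</a> '
    '<a href="http://lucene.apache.org/nutch/">Nutch</a> '
    '<a href="http://www.example.org/paper.pdf">paper</a> '
    '<a href="http://lenya.example.com/about.html">about</a> '
    '<a href="http://lucene.apache.org/nutch/">Nutch again</a></p>'
)
links = external_links(page, "lenya.example.com")
print("\n".join(links))
```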


>>  * Replace custom Lucene search generator with Cocoon Search generator
>>
>>   There is a very clean and easy alternative to this nasty xsp page and
>>   the xslt sheets that process its result: the Cocoon search generator.
>>   By using this generator instead of the clumsy search pipeline
>>   currently employed, it will be easier to debug or change the
>>   resultset for a specific publication. Besides this, it seems to me
>>   good practice to take advantage of Cocoon's facilities as much as
>>   possible.
>>
>>   http://cocoon.apache.org/2.1/userdocs/generators/search-generator.html
>>  
>>
> 
> how does the XML of the search-generator differ from the current Lenya
> implementation?

As far as I can see, it contains all the output one can ask for from a 
Lucene query. The nice thing is: it is possible to spread the results 
over several pages. The links to all pages are delivered with the output. 
It looks pretty comprehensive to me.
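
To illustrate what splitting a result list into pages with links to all
pages amounts to, here is a small sketch in Python. It is not the output
format of the Cocoon search generator; the function name, the dictionary
keys and the "?page=N" link form are assumptions for illustration only.

```python
import math


def paginate(hits, page, page_size=10):
    """Return one page of hits plus links to every page."""
    total_pages = max(1, math.ceil(len(hits) / page_size))
    page = min(max(page, 1), total_pages)  # clamp out-of-range requests
    start = (page - 1) * page_size
    return {
        "page": page,
        "total_pages": total_pages,
        "hits": hits[start:start + page_size],
        "page_links": [f"?page={n}" for n in range(1, total_pages + 1)],
    }


all_hits = [f"doc{i}" for i in range(1, 26)]  # 25 made-up hits
result = paginate(all_hits, page=3)
print(result["hits"])
print(result["page_links"])
```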

Again, thanks for the reply!

Regards, Robert



Re: lenya-search proposal

Posted by Michael Wechner <mi...@wyona.com>.
Robert Goene wrote:

> Hi,
>
> Hereby my new proposal. I have done some more research and have 
> provided a bit more background information and a rudimentary timeline.
>
> Hope you like the looks of it, because the clock is ticking!


Please see my comments below; they are mostly comments on 
implementation.

>
> Regards, Robert
>
>------------------------------------------------------------------------
>
>* Google Summer of Code proposal *
>
>Version: Third draft version
>Date: 12 june 2005
>Subject: Apache's lenya-search project
>Intended audience: Current maintainers and potential mentor(s)
>Author: Robert Goene, University of Amsterdam, The Netherlands
>
>* Project Overview
>
>  The Lenya-Search project is part of the Lenya Content Management System, as
>  hosted by the Apache Foundation. Heavily based on the XML publishing framework
>  Cocoon, Lenya combines an easy interface for the end-user with advanced
>  possibilities for the xml-aware developer. This makes Lenya both a good choice
>  for straight-forward and more complex websites.
>
>  The search facilities of Lenya are based on the Apache project Lucene. This
>  search engine takes care of the indexing of documents and processing of the
>  queries. 
>
>  The objective of the lenya-search project is the integration of Lenya 
>  and Lucene. The current integration is not as easy and flexible as it should 
>  be for a complete CMS. The indexing process, for instance, depends on a number
>  of home-made indexers that take care of adding all documents to Lucene. This
>  process must be started manually through an ant job. The indexers are not 
>  flexible enough and should be more focused on the documents that Lenya is
>  dealing with: xhtml documents with Dublin Core metadata. Besides this, custom
>  documents should easily be added to the CMS. Lenya should be able to handle
>  xml documents of all kinds in a more straightforward way. This proposal is
>  part of this more general goal.
>
> In other words: the search facilities should be further integrated in Lenya.
> The search possibilities are not trivial to use in a Lenya publication, and
> they obviously should be.
>
> The development will be based on the current trunk of the project: version 1.4.
> This major release contains a large number of architectural changes. A change like
> the one described here is appropriate to add to this new future release. The
> current stable version (1.2) will only be updated with crucial bugfixes. No 
> significant new features will be added.
>
> http://lenya.apache.org/1_4/index.html
>
>* Project description
>
>  The project will consist of a number of subprojects, which can be 
>  developed fairly isolated from each other. This section will give
>  a functional description and an overview of the techniques used for 
>  each individual subproject.
>
>* Integrate the indexing process with the Lenya publishing usecases
> 
>  * Index the document when published
>
>    When a document is published it should be added to the Lucene index immediately.
>    This can be accomplished by extending the publish process, which is implemented
>    as a Lenya 1.4 usecase.
>
>    http://lenya.apache.org/apidocs/1.4/org/apache/lenya/defaultpub/cms/usecases/Publish.html
>  
>  * Remove the document from the index when deactivated
>  
>    Documents that are no longer a part of the 'Live' section of the Lenya publication
>    (the publicly available website) should be immediately removed from the Lucene index.
>    In a similar fashion to the publishing of a document, the deactivate usecase of
>    Lenya 1.4 should be extended with a removal of the document from the Lucene index.
>  
>

I think this could be done more generally, such that a document is
reindexed every time it changes, e.g. also after editing, and there
could be one index for the authoring area and one for the live area.

Even more general would be to search the documents in association with
the workflow, but this would probably rather be for Lenya >1.4; I am
mentioning it to point out where I think it would make sense to head.


>    
>    http://lenya.apache.org/apidocs/1.4/org/apache/lenya/defaultpub/cms/usecases/Deactivate.html
>    
>   * Document parser    
>   
>     * lenya.index
>    
>       Lucene comes packed with a standard xml and html parser to add documents to the index. This 
>       parser fetches the data out of the document and stores this data in different fields of the
>       Lucene index. The documents that Lenya works with are extended XHTML documents that can be
>       parsed with the standard html parser, but they would lack the possibility of indexing the
>       metadata that comes with these Lenya documents.
>
>       As a replacement for the ConfigurableIndexer that creates indexes from a document based
>       on a collection of xpath statements, I would like to propose an alternative way of 
>       configuring the indexed data. This replacement would consist of tags in the internal xml
>       documents of Lenya. Every xml element that must be added to the index needs a special
>       attribute, something like indexField="fieldName".
>  
>

I don't think the ConfigurableIndexer should be replaced, although I am
not saying its implementation is great. One wants to keep the definition
both centralized and attached individually (just as is the case for the
workflow or validation schema of a document). Always the same problem ;-)

IIUC then every document would have to be tagged. What if a field is 
changing?!

I am not saying your suggestion doesn't make sense for certain cases, but
I wouldn't treat it as a replacement, but rather as an enhancement.

>       One of the big advantages of this approach would be the availability of data that isn't
>       visible to the outside world, but could be helpful for the search mechanism to determine
>       the most relevant results. One could think of the metadata that isn't completely rendered 
>       to html, like the date of creation or the creator.
>
>       Besides this, it would be easier to add a new document type to Lenya when the indexing of the 
>       document can be specified in the sample document and the Relax NG schema.
>  
>

adding the indexing to the Schema would probably make life easier but is 
basically
the same as the current solution (one transformation from one to the other).

>        
>       Every document in Lenya has an accompanying RelaxNG schema that validates every edited document when
>       it is saved.
>

I don't think a document should require a schema, but I guess we get 
into a religious war here. But you can definitely not assume that 
everything is validated by RelaxNG, because Lenya would close itself 
off badly if it neglected schema languages like XSD and others ...

>       This schema should allow a document to have the index attribute assigned to a number of
>       elements. These elements should be extended with the lenya.index attribute. This must be done for all
>       elements that are allowed to be added to the Lucene index. This may sound like a lot of work, but it 
>       shouldn't be that hard. An XHTML document, for instance, only needs several metadata and the body elements
>       to be specified.
>
>       The following Relax NG snippet should be added to all elements that can be indexed. The LenyaFieldName will
>       contain the name of the Lucene index field. 
>
>       <define name="lenya:index">
>        <zeroOrMore>
>         <element name="lenya.index">
>          <text/>
>         </element>
>        </zeroOrMore>
>       </define>
> 
>       Notice the possibility to add more than one LenyaIndexFieldName. This makes it possible to add the same data
>       to different fields, which can be useful when the user wants a general or more specific search: the data will 
>       be added to the more general field that is also used for other fields, and the specific field is queried when
>       one is aware of the exact field that must be addressed.
>    
>       The actual xml document must add the lenya.index elements to the elements that must be indexed. The actual field
>       name is specified in the xml document and not the specification. This makes filling the index more flexible, without
>       making it harder to have a common indexfield for all documents. Since all documents are created from a sample xml file,
>       the default indexfields can be provided in this file. This way individual exceptions are still possible.
>
>       The LenyaIndex parser, as described above, must be applied to the most used document in Lenya:
>       the XHTML document that is extended with Dublin Core metadata.
>     
>       http://lucene.apache.org/java/docs/api/org/apache/lucene/document/Document.html   
>

I don't fully understand your example. Can you make one which shows the 
mapping to the Lucene document and a content example, e.g. a press release:

<pr>
<title>...</title>
<date>...</date>
<content>...</content>
</pr>

>         
>        
>   * Document boost
>   
>     By adding an extra field to the metadata of the documents called 'Document
>     Boost' it will be possible to use the boosting feature of Lucene to control
>     the relevance of specific documents in the search results. A pulldown menu
>     with a choosable digit to specify the boost level should be sufficient.
>    
>     http://lucene.apache.org/java/docs/api/org/apache/lucene/document/Document.html#setBoost(float)
>
>   * Extract external links
>
>     The publish process should also extract all the external links - html and pdf - from the document and 
>     add them to the nutch crawler, so they can be fetched and indexed in the next Nutch run.
>
>     In a similar fashion, the external links should be removed from the Nutch fetch list and the Lucene
>     index when deactivating a document.
>  
>

how do you want to treat these external links?

>* Nutch integration for external crawling
>
>  It should be possible to add external pages to the Lucene index. For instance pages that are part
>  of the website, but are not controlled by Lenya or external pages that contain related content. The
>  crawling of these sites will not be a problem. Linking to external pages on one of the pages controlled
>  should be enough to crawl these pages and add them to the lucene index.
>  
>  * Schedule the nutch indexing task
>  
>    The external pages whose links have been extracted during the indexing of a document
>    are fetched and indexed by Nutch. These documents can be html or pdf ones, as Nutch is able to handle
>    these types.
>
>    The list of links to index will be crawled and indexed by Nutch and added to the Lucene index. This 
>    process will be a scheduled job that will run from time to time, which can be controlled from the 
>    Lenya Administrator interface.
>
>    http://lucene.apache.org/nutch/apidocs/net/nutch/fetcher/Fetcher.html
>    http://lenya.apache.org/apidocs/1.4/org/apache/lenya/cms/usecase/scheduling/UsecaseCronJob.html
>  
>* Create Usecase for searching the current publication
>
>  The current search pipeline is not a part of a specific publication, but is part of the 
>  general lenya configuration. By making it a usecase, it will be more convenient to address
>  the search facility from a html form and it will be easier to change the search needs
>  of a specific publication. Another reason to move to usecases is the fact that Lenya 1.4
>  makes standard use of these usecases.
>  
>  Solprovider already has implemented a feature like this. In my opinion, it looks pretty good,
>  but can be revised and simplified with the changes proposed in this document, especially the
>  replacement of the generator. 
>  
>  http://www.solprovider.com/lenya/search
>  http://lenya.apache.org/apidocs/1.4/org/apache/lenya/cms/search/usecases/Search.html
>  
>* Change the communication of Lenya with Lucene
>  
>  The communication of Lenya with the Lucene index is pretty nasty at the moment. The current
>  approach is the use of a custom xsp page that contains server-processed java code that 
>  communicates with the Lucene API. This code is neither very flexible nor easily extensible. 
>  
>

it is flexible, but I also don't like the XSP, and you're right, it's horrible
to change things

>  Making small changes to the result set can take a very long time to implement.
>
>  Different approaches to change this are possible: using the Cocoon LuceneQueryBean, which
>  makes all Lucene search features available to any Cocoon application, or using a
>  custom navigation component together with the standard Cocoon search generator.
>  The latter approach seems the most appropriate to me, because it fits the highly customizable nature
>  of Lenya, which only requires knowledge of XSLT. The LuceneQueryBean offers possibilities for both common
>  and advanced uses, but seems to lack the customization that only a navigation component based on an
>  XSLT sheet can offer.
>
>  http://lenya.apache.org/apidocs/1.4/org/apache/lenya/lucene/index/Index.html
>  
>  * Replace custom Lucene search generator with Cocoon Search generator
>
>   There is a very clean and easy alternative to this nasty XSP page and the
>   XSLT sheets that process its result: the Cocoon search generator.
>   By using this generator instead of the clumsy search pipeline currently
>   employed, it will be easier to debug or change the result set for a
>   specific publication. Besides this, it seems good practice to me
>   to take advantage of Cocoon's facilities as much as possible.
>
>   http://cocoon.apache.org/2.1/userdocs/generators/search-generator.html
>  
>

how does the XML of the search-generator differ from the current Lenya
implementation?

>  * Simplify the current search navigation component
>
>   Make the current search form more usable, visually attractive and easier to integrate into
>   a publication. Change the current navigation component - search.xsl - to be compatible with
>   the new interface and change its appearance.
>  
>* Related navigation component
>
>  Besides the results of an explicit user query, it could be interesting to add a navigation
>  component that searches the Lucene index for related pages. This could be done on the subject or
>  the description fields of the document. The results can be integrated in the document as a flexible
>  way of navigating through the publication.
>
>
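One possible way to find related pages is to turn the subject and description of the current document into a query string for the Lucene query parser, which accepts grouped terms per field in the form field:(term term). The sketch below is only an illustration under assumptions: the class RelatedQueryBuilder is hypothetical, the field names "subject" and "description" are taken from the text above and may differ from the fields in the real index, and the four-character minimum is a crude stand-in for a proper stop-word list.

```java
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Set;

/** Hypothetical helper: builds a Lucene query string from a document's metadata. */
class RelatedQueryBuilder {

    /** Lower-cases the text and keeps distinct ASCII alphanumeric words of four or more characters. */
    static Set<String> terms(String text) {
        Set<String> result = new LinkedHashSet<>();
        // Splitting on non-alphanumerics also removes query-syntax characters,
        // so no escaping is needed; non-ASCII letters are lost in this sketch.
        for (String word : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (word.length() >= 4) {
                result.add(word);
            }
        }
        return result;
    }

    /** Returns a query over both fields, or an empty string when no usable terms remain. */
    static String build(String subject, String description) {
        Set<String> all = terms(subject);
        all.addAll(terms(description));
        if (all.isEmpty()) {
            return "";
        }
        String group = "(" + String.join(" ", all) + ")";
        return "subject:" + group + " OR description:" + group;
    }
}
```

The resulting string could be handed to whatever search component ends up being used; the current document itself would have to be filtered out of the hits.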
>* Planning
>
>14 june 05:		Proposal deadline
>24 june 05: 		Acceptance or rejection of proposal
>06 july 05:		Index when publishing
>06 july 05:		Remove when deactivated
>14 july 05:		Document parser
>			 indexfields
>			 boost
>			 external links
>21 july 05:		Nutch integration
>28 july 05: 		Search usecase
>28 july 05:		SearchGenerator
>28 july 05:		Search navigation component
>28 july 05:		Related navigation component
>01 sept 05:		Pencils down
>
>* Future consideration
>
>These considerations are not formal requirements of this proposal, but sidetracks that could play
>a role in future developments. By writing them down, they are taken into account in the current
>proposal without being direct goals of the project as described above.
>
> * Add Lucene indexviewer *
>
>  To get an overview of the created index, it should be fairly simple to integrate the
>  index viewer Limo (http://limo.sourceforge.net/) into the administration mode of the
>  Lenya interface. The viewer is an easy tool to dig into the created index when the
>  search results differ from what you expected. This tool is indispensable when working
>  with the ConfigurableIndexer to get an overview of the created Lucene fields and their
>  content.
>
>  The tool is written as an Apache-licensed Java servlet, and the only information
>  it needs to function is the path to the Lucene index. The integration should therefore be
>  fairly easy.
>  
>

yes, this could go into the admin area

>* Jackrabbit and Lucene
>
>  The role of Jackrabbit seems to apply to more structured queries, such as XQuery makes possible. Unstructured
>  full-text searching, which human users will use most of the time, is the area of the Lucene engine.
>
>  When the Lenya API is changed to make use of all the features that Jackrabbit promises us, the document
>  parser as proposed above will have to be moved to the Lucene interface of Jackrabbit. Jackrabbit will then be
>  responsible for a job that, for the time being, will be executed by Lenya.
>
>  At this point in time, the Jackrabbit integration is only a future consideration, but it should be taken into
>  account when developing new features. The document parser will be developed with the Jackrabbit API in mind.
>
>  
>

yes, it makes sense to keep an eye on JCR/Jackrabbit

Michi

>------------------------------------------------------------------------
>
>---------------------------------------------------------------------
>To unsubscribe, e-mail: dev-unsubscribe@lenya.apache.org
>For additional commands, e-mail: dev-help@lenya.apache.org
>


-- 
Michael Wechner
Wyona Inc.  -   Open Source Content Management   -   Apache Lenya
http://www.wyona.com                      http://lenya.apache.org
michael.wechner@wyona.com                        michi@apache.org



Re: lenya-search proposal

Posted by Robert Goene <ro...@goene.nl>.
Gregor J. Rothfuss wrote:
> Robert Goene wrote:
> 
>> Nice to hear. I don't seem to be able to edit the page, as it is 
>> immutable. Or aren't you referring to the general apache wiki?
> 
> 
> you need to create an account first

Of course. Sorry.

http://wiki.apache.org/general/CurrentProposal


Re: lenya-search proposal

Posted by "Gregor J. Rothfuss" <gr...@apache.org>.
Robert Goene wrote:

> Nice to hear. I don't seem to be able to edit the page, as it is 
> immutable. Or aren't you referring to the general apache wiki?

you need to create an account first


Re: lenya-search proposal

Posted by Robert Goene <ro...@goene.nl>.
Gregor J. Rothfuss wrote:
> Robert Goene wrote:
> 
>> Hereby my new proposal. I have done some more research and have 
>> provided a bit more background information and a rudimentary timeline.
>>
>> Hope you like the looks of it, because the clock is ticking!
> 
> 
> i like it. please attach a copy to the SummerOfCode wiki page, by 
> creating a new page and linking to it.

Nice to hear. I don't seem to be able to edit the page, as it is 
immutable. Or aren't you referring to the general apache wiki?



Re: lenya-search proposal

Posted by "Gregor J. Rothfuss" <gr...@apache.org>.
Robert Goene wrote:

> Hereby my new proposal. I have done some more research and have provided 
> a bit more background information and a rudimentary timeline.
> 
> Hope you like the looks of it, because the clock is ticking!

i like it. please attach a copy to the SummerOfCode wiki page, by 
creating a new page and linking to it.
