Posted to dev@cocoon.apache.org by Colin Britton <cb...@metatomix.com> on 2002/01/20 02:04:56 UTC

Patches for search indexing

Attached are three changes for the indexing capability provided by Cocoon
and Lucene. Details below:

1) current code makes any URLs with parameters incorrect as it always adds
"?cocoon-view=" to the url, which invalidates any existing parameters.
Patch provided checks for a ? in the url and only adds one if none
exists. This applies to SimpleCocoonCrawlerImpl.java and in a similar way to
SimpleLuceneXMLIndexerImpl.java

2) Change a system.out to a logger entry in SimpleCocoonCrawlerImpl.java

3) added a logger entry when lucene optimizes to
SimpleLuceneCocoonIndexerImpl.java
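
For illustration only, a minimal Java sketch of the separator choice described
in item 1. It is not the code from the attached patch: the class and method
names (ViewQuery, appendView) are hypothetical, and it assumes the URL is
handled as a plain string with no fragment part. The separator is worked out
on every call, so a shared default such as "cocoon-view=links" is never
modified between requests.

```java
import java.net.MalformedURLException;
import java.net.URL;

final class ViewQuery {

    /**
     * Appends "cocoon-view=<view>" to the given URL string, using '?' when the
     * URL has no query string yet and '&' when it already has one.
     * Assumption: the URL carries no "#fragment" part.
     */
    static String appendView(String url, String view) {
        char separator = url.indexOf('?') == -1 ? '?' : '&';
        return url + separator + "cocoon-view=" + view;
    }

    public static void main(String[] args) throws MalformedURLException {
        // Hypothetical example URLs, one without and one with existing parameters.
        URL plain = new URL("http://localhost:8080/cocoon/index.html");
        URL withParams = new URL("http://localhost:8080/cocoon/search?lang=en");

        System.out.println(new URL(appendView(plain.toString(), "links")));
        System.out.println(new URL(appendView(withParams.toString(), "content")));
    }
}
```

The existing query string is kept as is, so a page that depends on its own
parameters still receives them along with the view parameter.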

rgds
CB



Index: crawler/SimpleCocoonCrawlerImpl.java
===================================================================
RCS file: /home/cvspublic/xml-cocoon2/src/java/org/apache/cocoon/components/crawler/SimpleCocoonCrawlerImpl.java,v
retrieving revision 1.1
diff -u -r1.1 SimpleCocoonCrawlerImpl.java
--- crawler/SimpleCocoonCrawlerImpl.java 3 Jan 2002 12:31:09 -0000 1.1
+++ crawler/SimpleCocoonCrawlerImpl.java 20 Jan 2002 00:49:41 -0000
@@ -58,7 +58,7 @@
* @since
*/
public final String LINK_CONTENT_TYPE_DEFAULT =
"application/x-cocoon-links";
-
+
/**
* Config element name specifying query-string appendend for requesting links
* of an URL.
@@ -77,7 +77,7 @@
*
* @since
*/
- public final static String LINK_VIEW_QUERY_DEFAULT = "?cocoon-view=links";
+ public final static String LINK_VIEW_QUERY_DEFAULT = "&cocoon-view=links";
/**
* Config element name specifying excluding regular expression pattern.
@@ -199,7 +199,7 @@
this.includeCrawlingURL.add(new RE(tokenized_pattern));
}
} catch (RESyntaxException rese) {
- getLogger().error("Cannot create includeing regular-expression for " +
+ getLogger().error("Cannot create includeing regular-expression for " +
pattern, rese);
}
}
@@ -217,7 +217,7 @@
this.excludeCrawlingURL.add(new RE(tokenized_pattern));
}
} catch (RESyntaxException rese) {
- getLogger().error("Cannot create excluding regular-expression for " +
+ getLogger().error("Cannot create excluding regular-expression for " +
pattern, rese);
}
}
@@ -416,6 +416,9 @@
// get links of url
try {
+ if (url.toString().indexOf("?")==-1){
+ linkViewQuery = "?" + linkViewQuery;
+ }
URL links_url = new URL(url, url.getPath() + linkViewQuery);
URLConnection links_url_connection = links_url.openConnection();
InputStream is = links_url_connection.getInputStream();

Index: search/SimpleLuceneCocoonIndexerImpl.java
===================================================================
RCS file: /home/cvspublic/xml-cocoon2/src/java/org/apache/cocoon/components/search/SimpleLuceneCocoonIndexerImpl.java,v
retrieving revision 1.1
diff -u -r1.1 SimpleLuceneCocoonIndexerImpl.java
--- search/SimpleLuceneCocoonIndexerImpl.java 3 Jan 2002 12:31:13 -0000 1.1
+++ search/SimpleLuceneCocoonIndexerImpl.java 20 Jan 2002 00:49:42 -0000
@@ -198,8 +198,10 @@
// skip urls using different host, or port than host,
// or port of base url
- System.out.println("Skipping carwling URL " + crawl_url.toString() +
+ if (getLogger().isDebugEnabled()) {
+ getLogger().debug("Skipping crawling URL " + crawl_url.toString() +
" as base_url is " + base_url.toString());
+ }
continue;
}
@@ -212,9 +214,15 @@
Document document = (Document) i.next();
writer.addDocument(document);
}
- }
- // optimize it
+
+ }
+ // optimize it
writer.optimize();
+ if (getLogger().isDebugEnabled()) {
+ getLogger().debug("Optimizing index" );
+ }
+
+
} catch (IOException ioe) {
throw new ProcessingException("IOException in index()", ioe);
} catch (ComponentException ce) {

Index: search/SimpleLuceneXMLIndexerImpl.java
===================================================================
RCS file: /home/cvspublic/xml-cocoon2/src/java/org/apache/cocoon/components/search/SimpleLuceneXMLIndexerImpl.java,v
retrieving revision 1.1
diff -u -r1.1 SimpleLuceneXMLIndexerImpl.java
--- search/SimpleLuceneXMLIndexerImpl.java 3 Jan 2002 12:31:13 -0000 1.1
+++ search/SimpleLuceneXMLIndexerImpl.java 20 Jan 2002 00:49:42 -0000
@@ -29,7 +29,6 @@
import org.apache.avalon.framework.configuration.ConfigurationException;
import org.apache.avalon.framework.logger.AbstractLoggable;
-import org.apache.avalon.framework.logger.AbstractLoggable;
import org.apache.avalon.framework.parameters.Parameters;
import org.apache.avalon.framework.thread.ThreadSafe;
import org.apache.cocoon.ProcessingException;
@@ -80,7 +79,7 @@
*
* @since
*/
- final String CONTENT_QUERY = "?cocoon-view=content";
+ final String CONTENT_QUERY = "&cocoon-view=content";
/**
* set of allowed content types
@@ -163,7 +162,14 @@
throws ProcessingException {
try {
- URL contentURL = new URL(url, url.getPath() + CONTENT_QUERY);
+
+ String contentQuery = CONTENT_QUERY;
+
+ if (url.toString().indexOf("?")==-1){
+ contentQuery = "?" + contentQuery;
+ }
+
+ URL contentURL = new URL(url, url.getPath() + contentQuery);
URLConnection contentURLConnection = contentURL.openConnection();
String contentType = contentURLConnection.getContentType();
if (contentType != null &&


---------------------------------------------------------------------
To unsubscribe, e-mail: cocoon-dev-unsubscribe@xml.apache.org
For additional commands, email: cocoon-dev-help@xml.apache.org


Re: staging/promotion (CVS for XML?)

Posted by Nicola Ken Barozzi <ba...@nicolaken.com>.
----- Original Message -----
From: "Bertrand Delacretaz" <bd...@codeconsult.ch>
To: <co...@xml.apache.org>; "Robert Koberg" <ro...@koberg.com>
Sent: Monday, January 21, 2002 4:53 PM
Subject: Re: staging/promotion (CVS for XML?)


> On Monday 21 January 2002 15:51, Robert Koberg wrote:
> > Is there some Open Source versioning software that handles XML? That is,
> > when there is a conflict it keeps the content piece well-formed and
> > identifies the conflicts. I have only had some limited experience with
> > cvs...
>
> Would be great for sure, AFAIK there is no such thing yet but I'd like to be
> proved wrong.

For comparing XML files, there's also Xdiff
(http://www.vmguys.com/vmtools/), available with Apache-like licence.

--
Nicola Ken Barozzi                 xml-cocoon@nicolaken.com
These are the days of miracle and wonder...
          ...so don't cry baby, don't cry...
                                                  Paul Simon






Re: staging/promotion (CVS for XML?)

Posted by Matt Sergeant <ma...@sergeant.org>.
On Mon, 21 Jan 2002, Bertrand Delacretaz wrote:

> > Is there some Open Source versioning software that handles XML? That is,
> > when there is a conflict it keeps the content piece well-formed and
> > identifies the conflicts. I have only had some limited experience with
> > cvs...
>
> Would be great for sure, AFAIK there is no such thing yet but I'd like to be
> proved wrong.

Kip Hampton wrote an article for XML.com about his Perl module
XML::SemanticDiff. It's just a simple SAX thing, so should be really easy
to port to Java.
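
As a rough illustration of how little code a SAX-based comparison needs, here
is a minimal Java sketch. It is not a port of XML::SemanticDiff and makes no
claim about that module's API or output: the class name SimpleXmlDiff, the
path notation (/doc[1]/p[2]) and the added/removed/changed labels are all
assumptions made for this example. It compares trimmed element text only,
ignores attributes and namespaces, and uses the SAX parser shipped with the
JDK.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

final class SimpleXmlDiff {

    /** Maps each element's positional path, e.g. /doc[1]/p[2], to its trimmed text. */
    static Map<String, String> fingerprint(String xml) {
        final Map<String, String> textByPath = new TreeMap<>();
        final Deque<String> openPaths = new ArrayDeque<>();
        final Deque<Map<String, Integer>> childCounts = new ArrayDeque<>();
        final Deque<StringBuilder> openText = new ArrayDeque<>();
        childCounts.push(new HashMap<>()); // counters for the document level

        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName, String qName, Attributes atts) {
                // Number siblings of the same name so repeated elements get distinct paths.
                int index = childCounts.peek().merge(qName, 1, Integer::sum);
                String parent = openPaths.isEmpty() ? "" : openPaths.peek();
                openPaths.push(parent + "/" + qName + "[" + index + "]");
                childCounts.push(new HashMap<>());
                openText.push(new StringBuilder());
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                openText.peek().append(ch, start, length);
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                childCounts.pop();
                textByPath.put(openPaths.pop(), openText.pop().toString().trim());
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("could not parse XML", e);
        }
        return textByPath;
    }

    /** Lists paths that were added, removed, or whose text changed. */
    static Set<String> diff(String oldXml, String newXml) {
        Map<String, String> before = fingerprint(oldXml);
        Map<String, String> after = fingerprint(newXml);
        Set<String> changes = new TreeSet<>();
        for (Map.Entry<String, String> entry : before.entrySet()) {
            String other = after.get(entry.getKey());
            if (other == null) {
                changes.add("removed " + entry.getKey());
            } else if (!other.equals(entry.getValue())) {
                changes.add("changed " + entry.getKey());
            }
        }
        for (String path : after.keySet()) {
            if (!before.containsKey(path)) {
                changes.add("added " + path);
            }
        }
        return changes;
    }

    public static void main(String[] args) {
        System.out.println(diff("<doc><p>one</p><p>two</p></doc>",
                                "<doc><p>one</p><p>2</p><q/></doc>"));
    }
}
```

Because both documents are parsed before anything is compared, a file that is
not well-formed is rejected up front instead of being merged line by line.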

-- 
<!-- Matt -->
<:->Get a smart net</:->




Re: staging/promotion (CVS for XML?)

Posted by Bertrand Delacretaz <bd...@codeconsult.ch>.
On Monday 21 January 2002 15:51, Robert Koberg wrote:
> . . .
> Is cvs the best thing for text-based version control?

I think yes - these days CVS tends to be the "category killer" for version 
control of text and binary files.

> Is there some Open Source versioning software that handles XML? That is,
> when there is a conflict it keeps the content piece well-formed and
> > identifies the conflicts. I have only had some limited experience with
> cvs...

Would be great for sure, AFAIK there is no such thing yet but I'd like to be 
proved wrong.

In the meantime, I'd go with CVS and prevent conflicts by requiring locking - 
I think CVS allows you to do that.

- Bertrand





Re: staging/promotion

Posted by Bertrand Delacretaz <bd...@codeconsult.dsvr.co.uk>.
On 22.01.2002 08:01:53, Robert Koberg <ro...@koberg.com> wrote:

>...
>So the cms marks/promotes something (through the version control system) as
>ready for a stage and the publ sys reads which stage the request is coming
>from and serves the appropriate version.

Yes - actually CVS uses user-definable "tags" to mark specific versions. 
The following scenario would be interesting to test:

1. during content creation, versions are marked with CVS tags

2. based on request + user attributes, cocoon selects a CVS TAG value

3. cocoon retrieves appropriate version from CVS based on tag.

Would make it real easy to stage/release content.
Is anyone doing something along these lines?
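
Steps 2 and 3 could be sketched roughly as follows in plain Java. Everything
named here is an assumption for illustration: the class StageTagSelector, the
tag names DEV, QA and LIVE, the example host names and the file path. Nothing
in it is Cocoon API. The sketch only builds the cvs command line (checkout
with -p to print the file and -r to select the tag) and does not run it.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

final class StageTagSelector {

    private final Map<String, String> tagByHost = new HashMap<>();
    private final String defaultTag;

    StageTagSelector(String defaultTag) {
        this.defaultTag = defaultTag;
    }

    /** Registers the tag to serve for one virtual host; returns this for chaining. */
    StageTagSelector map(String host, String tag) {
        tagByHost.put(host.toLowerCase(Locale.ROOT), tag);
        return this;
    }

    /** Step 2: pick a tag from the request's Host header (a ":port" suffix is ignored). */
    String tagFor(String hostHeader) {
        if (hostHeader == null) {
            return defaultTag;
        }
        String host = hostHeader.toLowerCase(Locale.ROOT);
        int colon = host.indexOf(':');
        if (colon != -1) {
            host = host.substring(0, colon);
        }
        return tagByHost.getOrDefault(host, defaultTag);
    }

    /** Step 3: command line that would print the tagged revision of one file to stdout. */
    static List<String> checkoutCommand(String tag, String path) {
        return Arrays.asList("cvs", "-Q", "checkout", "-p", "-r", tag, path);
    }

    public static void main(String[] args) {
        // Hypothetical hosts and tags; unknown hosts fall back to the released tag.
        StageTagSelector selector = new StageTagSelector("LIVE")
                .map("dev.cocoon.com", "DEV")
                .map("qa.cocoon.com", "QA");
        String tag = selector.tagFor("qa.cocoon.com:8080");
        System.out.println(checkoutCommand(tag, "site/index.xml"));
    }
}
```

Falling back to the released tag for unknown hosts is a deliberate choice in
this sketch: a misconfigured host name then shows live content rather than
work in progress.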

- Bertrand



Re: staging/promotion

Posted by Robert Koberg <ro...@koberg.com>.
----- Original Message -----
From: "Robert Koberg" <ro...@koberg.com>

> Is there some
> Open Source java-based, versioning software that you can recommend? Is cvs
> the best thing for text-based version control?

Is there some Open Source versioning software that handles XML? That is,
when there is a conflict it keeps the content piece well-formed and
identifies the conflicts. I have only had some limited experience with cvs...
:( and it trashes the latest version if there is a conflict.




Re: staging/promotion

Posted by Robert Koberg <ro...@koberg.com>.
Hey,
----- Original Message -----
From: "Bertrand Delacretaz" <bd...@codeconsult.ch>


> Hi Rob,
>
> On Monday 21 January 2002 14:56, Robert Koberg wrote:
> > Is something that has been published content (for example, the resulting
> > HTML page)? Content remains unaffected when publishing in my view. Content
> > should only be affected during development and bug fixing. During QA -->
> > Live the content should never change (but perhaps its location does, or
> > not). I think we are talking about semantics here, I just want to get on
> > the same page.
>
> I was thinking about XML content that is dynamically published (as HTML or
> something else), but not pre-published HTML if that is what you mean. In my
> view HTML content does not exist in files, only when it is dynamically
> generated (and then maybe you use wget to copy it to static files with no
> more processing of any kind).
>
> So, for me "content" is XML stored in a repository (database or filesystem)
> or generated from backend databases, and it is this content that I'd like to
> be able to filter securely based on request/user/site attributes.
>
>

Do you use some kind of versioning system to retrieve a
versioned-content-piece that is marked as ready for a certain stage? Or do
you move versions to different locations based on the stage? Is there some
Open Source java-based, versioning software that you can recommend? Is cvs
the best thing for text-based version control?

So the cms marks/promotes something (through the version control system) as
ready for a stage and the publ sys reads which stage the request is coming
from and serves the appropriate version.

this makes sense now, thanks,
-Rob




Re: staging/promotion

Posted by Bertrand Delacretaz <bd...@codeconsult.ch>.
Hi Rob,

On Monday 21 January 2002 14:56, Robert Koberg wrote:
> Is something that has been published content (for example, the resulting
> HTML page)? Content remains unaffected when publishing in my view. Content
> should only be affected during development and bug fixing. During QA -->
> Live the content should never change (but perhaps its location does, or
> not). I think we are talking about semantics here, I just want to get on
> the same page.

I was thinking about XML content that is dynamically published (as HTML or 
something else), but not pre-published HTML if that is what you mean. In my 
view HTML content does not exist in files, only when it is dynamically 
generated (and then maybe you use wget to copy it to static files with no 
more processing of any kind).

So, for me "content" is XML stored in a repository (database or filesystem) 
or generated from backend databases, and it is this content that I'd like to 
be able to filter securely based on request/user/site attributes.

Hope I'm being clear?
- Bertrand



Re: staging/promotion

Posted by Robert Koberg <ro...@koberg.com>.
Hi Bertrand!

----- Original Message -----
From: "Bertrand Delacretaz" <bd...@codeconsult.ch>


> On Sunday 20 January 2002 18:12, Robert Koberg wrote:
> > Should the publishing system or the content management system handle site
> > promotions (dev, qa, live)?
>
> I think the decision about what status to give to each piece of content
> belongs to the cms.
>
> But the decision about publishing or not belongs IMHO to the publishing
> system (assuming cms and publishing system share the same content database,
> which makes things much easier to manage).
>
> Good servlet-based transactional apps do that by having a single "checkpoint"
> through which all requests go, where you can put all the rules about who has
> access to what.
>
> In this case, we want to give access rights based on *content*, for example
> filtering out certain information or navigation options before publishing.
>
> If I had to do this in Cocoon, my first idea would be to "manually" put
> an additional "access control Transformer" right after the "data access" (i.e.
> Generator) and before the actual "publishing" pipelines (other Transformers).
>
> This is certainly doable by making sure each pipeline includes this
> AccessControlTransformer (xslt or something) right after the Generator stage,
> but maybe there is a better way? The idea would be to force all content
> through a filter based on the request AND the content being delivered.

Is something that has been published content (for example, the resulting
HTML page)? Content remains unaffected when publishing in my view. Content
should only be affected during development and bug fixing. During QA -->
Live the content should never change (but perhaps its location does, or
not). I think we are talking about semantics here, I just want to get on the
same page.

best,
-Rob






Re: staging/promotion

Posted by Bertrand Delacretaz <bd...@codeconsult.ch>.
On Sunday 20 January 2002 18:12, Robert Koberg wrote:
> Should the publishing system or the content management system handle site
> promotions (dev, qa, live)?

I think the decision about what status to give to each piece of content 
belongs to the cms.

But the decision about publishing or not belongs IMHO to the publishing 
system (assuming cms and publishing system share the same content database, 
which makes things much easier to manage).

Good servlet-based transactional apps do that by having a single "checkpoint" 
through which all requests go, where you can put all the rules about who has 
access to what. 

In this case, we want to give access rights based on *content*, for example 
filtering out certain information or navigation options before publishing.

If I had to do this in Cocoon, my first idea would be to "manually" put 
an additional "access control Transformer" right after the "data access" (i.e.
Generator) and before the actual "publishing" pipelines (other Transformers).

This is certainly doable by making sure each pipeline includes this
AccessControlTransformer (xslt or something) right after the Generator stage, 
but maybe there is a better way? The idea would be to force all content 
through a filter based on the request AND the content being delivered.
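
To make the filter idea concrete, here is a minimal sketch using only the SAX
and JAXP classes shipped with the JDK. It does not implement Cocoon's
Transformer interface. The class name AccessControlFilter and the "roles"
attribute (a whitespace-separated list of roles allowed to see an element) are
assumptions invented for this example. An element whose roles do not match is
dropped together with its whole subtree before anything downstream sees it.

```java
import java.io.StringReader;
import java.io.StringWriter;
import java.util.Collections;
import java.util.Set;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXSource;
import javax.xml.transform.stream.StreamResult;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.XMLFilterImpl;

final class AccessControlFilter extends XMLFilterImpl {

    private final Set<String> userRoles;
    private int skipDepth = 0; // greater than 0 while inside a suppressed subtree

    AccessControlFilter(XMLReader parent, Set<String> userRoles) {
        super(parent);
        this.userRoles = userRoles;
    }

    /** An element without a "roles" attribute (hypothetical) is visible to everyone. */
    private boolean allowed(Attributes atts) {
        String required = atts.getValue("roles");
        if (required == null) {
            return true;
        }
        for (String role : required.trim().split("\\s+")) {
            if (userRoles.contains(role)) {
                return true;
            }
        }
        return false;
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts)
            throws SAXException {
        if (skipDepth > 0 || !allowed(atts)) {
            skipDepth++; // swallow this element and everything nested in it
            return;
        }
        super.startElement(uri, localName, qName, atts);
    }

    @Override
    public void endElement(String uri, String localName, String qName) throws SAXException {
        if (skipDepth > 0) {
            skipDepth--;
            return;
        }
        super.endElement(uri, localName, qName);
    }

    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
        if (skipDepth == 0) {
            super.characters(ch, start, length);
        }
    }

    /** Parses xml, applies the filter, and serializes what is left. */
    static String filter(String xml, Set<String> userRoles) {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);
            XMLReader reader = factory.newSAXParser().getXMLReader();
            Transformer identity = TransformerFactory.newInstance().newTransformer();
            identity.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            identity.transform(
                    new SAXSource(new AccessControlFilter(reader, userRoles),
                                  new InputSource(new StringReader(xml))),
                    new StreamResult(out));
            return out.toString();
        } catch (ParserConfigurationException | SAXException | TransformerException e) {
            throw new IllegalStateException("filtering failed", e);
        }
    }

    public static void main(String[] args) {
        String page = "<page><p>public</p><p roles=\"editor\">draft</p></page>";
        System.out.println(filter(page, Collections.singleton("guest")));
    }
}
```

Placing such a filter directly after the data source means later stages never
receive the suppressed content, which is the point of the single checkpoint.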

-- 
 -- Bertrand Delacrétaz, www.codeconsult.ch
 -- web technologies consultant - OO, Java, XML, C++








Re: staging/promotion

Posted by Robert Koberg <ro...@koberg.com>.
----- Original Message -----
From: "Stefano Mazzocchi" <st...@apache.org>


> Robert Koberg wrote:
> >
> > Should the publishing system or the content management system handle site
> > promotions (dev, qa, live)?
>
> what do you exactly mean by 'promotions'? Please, elaborate further.

By 'site promotions' I mean the process a site goes through to get from an
idea to being live on the internet, blessed and ready to be used.

A site lives in different 'stages' along the way.  One stage is development
where things need to be fluid and changing. Next (in my setup) a site can be
promoted to quality assurance and after that has been completed (and bugs
have been fixed back on the dev site, promoted to QA and qa'ed again) the
site is promoted to live status. Live and QA will only differ in the power
of the machine being used to serve.

We usually create a virtual host for each stage so you get something like:

dev.cocoon.com
qa.cocoon.com
www.cocoon.com

I have been going back and forth in my head with where this type of thing
belongs; in the publishing sys or the cms. Currently (simply because of the
words used) it sounds to me that it should be in the publ sys.  You are not
really managing content when you stage/promote a site.  You are publishing
for your internal/external needs. Or maybe it should be in the cms - I was
hoping to get an idea where others thought it should go.

best,
-Rob




Re: staging/promotion

Posted by Michael Hartle <mh...@hartle-klug.com>.
Stefano Mazzocchi wrote:

>Robert Koberg wrote:
>
>>Should the publishing system or the content management system handle site
>>promotions (dev, qa, live)?
>>
>what do you exactly mean by 'promotions'? Please, elaborate further.
>
This resembles the web publishing concept Documentum is using with its 
Web Publisher.

Practically, this is like having a CVS, where content is published to 
certain URI spaces only when the content is appropriately tagged in 
order to ensure quality and guarantee that each content has taken some 
corporate-defined workflow before going public. Something that has just 
been written is promoted as "dev", available at a developer's location 
only; after QA people have checked it, lawyers agree and some manager 
thinks it is okay, after several stages, it is finally being promoted as 
"live", making it widely available.

Best regards,

Michael Hartle,
Hartle & Klug GbR




Re: staging/promotion

Posted by Stefano Mazzocchi <st...@apache.org>.
Robert Koberg wrote:
> 
> Should the publishing system or the content management system handle site
> promotions (dev, qa, live)?

what do you exactly mean by 'promotions'? Please, elaborate further.

-- 
Stefano Mazzocchi      One must still have chaos in oneself to be
                          able to give birth to a dancing star.
<st...@apache.org>                             Friedrich Nietzsche
--------------------------------------------------------------------





staging/promotion

Posted by Robert Koberg <ro...@koberg.com>.
Should the publishing system or the content management system handle site
promotions (dev, qa, live)?




RE: Patches for search indexing

Posted by Vadim Gritsenko <va...@verizon.net>.
Colin,

Optimized version of your patch is in CVS. Please check how it works :)

Vadim

> From: Colin Britton [mailto:cbritton@metatomix.com]
> 
> Here it is as an attachment.
> 
> thanks
> CB




Re: Patches for search indexing

Posted by Colin Britton <cb...@metatomix.com>.
Here it is as an attachment.

thanks
CB
----- Original Message ----- 
From: "Vadim Gritsenko" <va...@verizon.net>
To: <co...@xml.apache.org>
Cc: "'Colin Britton'" <cb...@metatomix.com>
Sent: Monday, January 21, 2002 8:59 PM
Subject: RE: Patches for search indexing


> It seems that patch is broken by mailer... 
> 
> Could you please re-send the patch as attachment?
> 
> Thanks,
> Vadim
> 
> > -----Original Message-----
> > From: Colin Britton [mailto:cbritton@metatomix.com]
> > Sent: Saturday, January 19, 2002 8:05 PM
> > To: cocoon-dev@xml.apache.org
> > Subject: Patches for search indexing
> > 
> > Attached are three changes for the indexing capability provided by Cocoon
> > and Lucene. Details below:
> > 
> > 1) current code makes any URLs with parameters incorrect as it always adds
> > "?cocoon-view=" to the url, which invalidates any existing parameters.
> > Patch provided checks for a ? in the url and only adds one if none
> > exists. This applies to SimpleCocoonCrawlerImpl.java and in a similar way to
> > SimpleLuceneXMLIndexerImpl.java
> > 
> > 2) Change a system.out to a logger entry in SimpleCocoonCrawlerImpl.java
> > 
> > 3) added a logger entry when lucene optimizes to
> > SimpleLuceneCocoonIndexerImpl.java
> > 
> > rgds
> > CB
> > 
> > 
> > 
> > Index: crawler/SimpleCocoonCrawlerImpl.java
> > ===================================================================
> > RCS file: /home/cvspublic/xml-cocoon2/src/java/org/apache/cocoon/components/crawler/SimpleCocoonCrawlerImpl.java,v
> 
> <snip/>
> 
> 

RE: Patches for search indexing

Posted by Vadim Gritsenko <va...@verizon.net>.
It seems that patch is broken by mailer... 

Could you please re-send the patch as attachment?

Thanks,
Vadim

> -----Original Message-----
> From: Colin Britton [mailto:cbritton@metatomix.com]
> Sent: Saturday, January 19, 2002 8:05 PM
> To: cocoon-dev@xml.apache.org
> Subject: Patches for search indexing
> 
> Attached are three changes for the indexing capability provided by Cocoon
> and Lucene. Details below:
> 
> 1) current code makes any URLs with parameters incorrect as it always adds
> "?cocoon-view=" to the url, which invalidates any existing parameters.
> Patch provided checks for a ? in the url and only adds one if none
> exists. This applies to SimpleCocoonCrawlerImpl.java and in a similar way to
> SimpleLuceneXMLIndexerImpl.java
> 
> 2) Change a system.out to a logger entry in SimpleCocoonCrawlerImpl.java
> 
> 3) added a logger entry when lucene optimizes to
> SimpleLuceneCocoonIndexerImpl.java
> 
> rgds
> CB
> 
> 
> 
> Index: crawler/SimpleCocoonCrawlerImpl.java
> ===================================================================
> RCS file: /home/cvspublic/xml-cocoon2/src/java/org/apache/cocoon/components/crawler/SimpleCocoonCrawlerImpl.java,v

<snip/>

