Posted to dev@nutch.apache.org by Doug Cutting <cu...@apache.org> on 2006/05/10 01:42:13 UTC

Re: svn commit: r405565 - in /lucene/nutch/trunk/src: java/org/apache/nutch/searcher/ test/org/apache/nutch/searcher/ web/jsp/

Thanks for making this change!

A few comments:

jerome@apache.org wrote:
> ==============================================================================
> --- lucene/nutch/trunk/src/java/org/apache/nutch/searcher/OpenSearchServlet.java (original)
> +++ lucene/nutch/trunk/src/java/org/apache/nutch/searcher/OpenSearchServlet.java Tue May  9 16:04:40 2006
[...]
> -        addNode(doc, item, "description", summaries[i]);
> +        addNode(doc, item, "description", summaries[i].toString());

This means there's no markup in the OpenSearch output?

Shouldn't there be?

> Modified: lucene/nutch/trunk/src/web/jsp/search.jsp
> URL: http://svn.apache.org/viewcvs/lucene/nutch/trunk/src/web/jsp/search.jsp?rev=405565&r1=405564&r2=405565&view=diff
> ==============================================================================
> +    
> +    // Build the summary
> +    StringBuffer sum = new StringBuffer();
> +    Fragment[] fragments = summaries[i].getFragments();
> +    for (int j=0; j<fragments.length; j++) {
> +      if (fragments[j].isHighlight()) {
> +        sum.append("<span class=\"highlight\">")
> +           .append(Entities.encode(fragments[j].getText()))
> +           .append("</span>");
> +      } else if (fragments[j].isEllipsis()) {
> +        sum.append("<span class=\"ellipsis\"> ... </span>");
> +      } else {
> +        sum.append(Entities.encode(fragments[j].getText()));
> +      }
> +    }
> +    String summary = sum.toString();

Perhaps this should be a method on Summary, to render it as html?
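
To make the suggestion concrete, here is a minimal sketch of what such a method could look like. The classes FragmentSketch and SummarySketch and the escape() helper are hypothetical stand-ins (they only mirror the getText(), isHighlight() and isEllipsis() calls visible in the diff above); this is not the actual Nutch Summary API, and escape() merely approximates what Entities.encode is assumed to do.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a summary fragment; not the real Nutch class.
class FragmentSketch {
  enum Kind { TEXT, HIGHLIGHT, ELLIPSIS }

  private final Kind kind;
  private final String text;

  FragmentSketch(Kind kind, String text) {
    this.kind = kind;
    this.text = text;
  }

  String getText() { return text; }
  boolean isHighlight() { return kind == Kind.HIGHLIGHT; }
  boolean isEllipsis() { return kind == Kind.ELLIPSIS; }
}

// Hypothetical stand-in for Summary, with the HTML rendering in one place.
class SummarySketch {
  private final List<FragmentSketch> fragments = new ArrayList<>();

  SummarySketch add(FragmentSketch fragment) {
    fragments.add(fragment);
    return this;
  }

  // Minimal HTML escaping (assumption: roughly what Entities.encode does).
  static String escape(String s) {
    StringBuilder out = new StringBuilder();
    for (int i = 0; i < s.length(); i++) {
      char c = s.charAt(i);
      switch (c) {
        case '&': out.append("&amp;"); break;
        case '<': out.append("&lt;"); break;
        case '>': out.append("&gt;"); break;
        case '"': out.append("&quot;"); break;
        default: out.append(c);
      }
    }
    return out.toString();
  }

  // Same rendering as the loop in search.jsp, moved onto the summary itself.
  String toHtml() {
    StringBuilder sb = new StringBuilder();
    for (FragmentSketch f : fragments) {
      if (f.isHighlight()) {
        sb.append("<span class=\"highlight\">")
          .append(escape(f.getText()))
          .append("</span>");
      } else if (f.isEllipsis()) {
        sb.append("<span class=\"ellipsis\"> ... </span>");
      } else {
        sb.append(escape(f.getText()));
      }
    }
    return sb.toString();
  }
}
```

With something like this, search.jsp would only need to call toHtml() on each summary instead of repeating the loop.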

Doug

Re: svn commit: r405565 - in /lucene/nutch/trunk/src: java/org/apache/nutch/searcher/ test/org/apache/nutch/searcher/ web/jsp/

Posted by Jérôme Charron <je...@gmail.com>.

> > (but if nutch-site.xml overrides the plugin.includes property and
> > doesn't include it, it will not be activated, like any other plugin)
> Yes, that's what I meant; I guess that's the default case for people
> hacking plugins.

Oh, yes Sami, I understand what you mean...
Sorry, I just forgot to mention this point on the list (so, plugin hackers,
you need to add one of the new summary plugins if you want to have
summaries displayed).
Sorry, I also forgot to add the summary plugins to the default webapp context
file (nutch.xml)... I will add this once svn write access is
available.
And sorry once more, because I also forgot to propagate the summary API
changes to the web2 module...

Regards

Jérôme

-- 
http://motrech.free.fr/
http://www.frutch.org/

Re: svn commit: r405565 - in /lucene/nutch/trunk/src: java/org/apache/nutch/searcher/ test/org/apache/nutch/searcher/ web/jsp/

Posted by Sami Siren <ss...@gmail.com>.

Jérôme Charron wrote:

> (but if nutch-site.xml overrides the plugin.includes property and
> doesn't include it, it will not be activated, like any other plugin)

Yes, that's what I meant; I guess that's the default case for people
hacking plugins.

--
 Sami Siren

Re: svn commit: r405565 - in /lucene/nutch/trunk/src: java/org/apache/nutch/searcher/ test/org/apache/nutch/searcher/ web/jsp/

Posted by Jérôme Charron <je...@gmail.com>.

> > Also a friendly hint to all plugin hackers, you need to enable
> > summary-basic in your existing nutch-site.xml to get things working.
> > Took me some time to realize this fact :)
> Sounds like we should enable it by default, no?

The summary-basic plugin is already enabled by default in nutch-default.xml
(but if nutch-site.xml overrides the plugin.includes property and doesn't
include it, it will not be activated, like any other plugin).
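
As a rough illustration of why an override can silently drop the plugin, the sketch below assumes the property value is a regular expression that a plugin id must fully match in order to be activated. The class name PluginIncludeSketch, the isActive() helper and both example values are hypothetical and shortened for illustration; they are not copied from nutch-default.xml or from the Nutch plugin code.

```java
import java.util.regex.Pattern;

// Hypothetical sketch; assumes plugin ids are matched against a regex value.
class PluginIncludeSketch {

  // A plugin counts as active when its id fully matches the include regex.
  static boolean isActive(String includeRegex, String pluginId) {
    return Pattern.compile(includeRegex).matcher(pluginId).matches();
  }

  public static void main(String[] args) {
    // Illustrative default value that lists summary-basic.
    String defaults =
        "protocol-http|parse-(text|html)|index-basic|query-(basic|site|url)|summary-basic";
    // Illustrative site override written before summary plugins existed.
    String siteOverride =
        "protocol-http|parse-(text|html)|index-basic|query-(basic|site|url)|my-plugin";

    System.out.println("defaults: " + isActive(defaults, "summary-basic"));
    System.out.println("override: " + isActive(siteOverride, "summary-basic"));
  }
}
```

Under that assumption, an existing nutch-site.xml that replaces the whole value has to list summary-basic itself, because the site value is used instead of the default rather than merged with it.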

Jérôme

Re: svn commit: r405565 - in /lucene/nutch/trunk/src: java/org/apache/nutch/searcher/ test/org/apache/nutch/searcher/ web/jsp/

Posted by Doug Cutting <cu...@apache.org>.

Sami Siren wrote:
> Also a friendly hint to all plugin hackers, you need to enable 
> summary-basic in your existing nutch-site.xml to get things working.
> Took me some time to realize this fact :)

Sounds like we should enable it by default, no?

Doug

Re: svn commit: r405565 - in /lucene/nutch/trunk/src: java/org/apache/nutch/searcher/ test/org/apache/nutch/searcher/ web/jsp/

Posted by Jérôme Charron <je...@gmail.com>.

> > Also a friendly hint to all plugin hackers, you need to enable
> > summary-basic in your existing nutch-site.xml to get things working.
> > Took me some time to realize this fact :)
> I think we should add this to nutch-default.xml,

Did I miss something?
summary-basic is activated in nutch-default.xml... no?


> if omitting this
> results in a non-working installation ...

In my tests, it only results in no summaries on the results pages...
Isn't that the case?

Jérôme

--
http://motrech.free.fr/
http://www.frutch.org/

Re: svn commit: r405565 - in /lucene/nutch/trunk/src: java/org/apache/nutch/searcher/ test/org/apache/nutch/searcher/ web/jsp/

Posted by Andrzej Bialecki <ab...@getopt.org>.

Sami Siren wrote:
>
>>
>> Doesn't this break any existing application that uses OpenSearch and 
>> displays summaries in a web browser?  This is an incompatible change 
>> which we should avoid.
>>
> Also a friendly hint to all plugin hackers, you need to enable 
> summary-basic in your existing nutch-site.xml to get things working.
> Took me some time to realize this fact :)

I think we should add this to nutch-default.xml, if omitting this 
results in a non-working installation ...

-- 
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com

Re: svn commit: r405565 - in /lucene/nutch/trunk/src: java/org/apache/nutch/searcher/ test/org/apache/nutch/searcher/ web/jsp/

Posted by Sami Siren <ss...@gmail.com>.

>
> Doesn't this break any existing application that uses OpenSearch and 
> displays summaries in a web browser?  This is an incompatible change 
> which we should avoid.
>
Also a friendly hint to all plugin hackers, you need to enable 
summary-basic in your existing nutch-site.xml to get things working.
Took me some time to realize this fact :)

>
> That sounds fine, but in the meantime, let's not reproduce the 
> html-specific code in lots of places.  We need it in both search.jsp 
> and in OpenSearchServlet.java.  So we should have it in a common 
> place.  A method on Summary seems like a good place.  If we 
> subsequently add a more general API then we could re-implement the 
> toHtml() method using that API, but I think a generic toHtml() method 
> will be useful for quite a while yet.
>
+1

--
 Sami Siren

Re: [Nutch-dev] Re: svn commit: r405565 - in /lucene/nutch/trunk/src: java/org/apache/nutch/searcher/ test/org/apache/nutch/searcher/ web/jsp/

Posted by Jérôme Charron <je...@gmail.com>.

> Bob Carpenter of alias-i had this to say when I brought up this very
> idea:
> http://article.gmane.org/gmane.comp.jakarta.lucene.devel/12599

Thanks for your response, Marvin.
But ultimately my question is: shouldn't the Nutch clustering use
fixed-size snippets instead of the configurable display size?

Jérôme

-- 
http://motrech.free.fr/
http://www.frutch.org/

Re: [Nutch-dev] Re: svn commit: r405565 - in /lucene/nutch/trunk/src: java/org/apache/nutch/searcher/ test/org/apache/nutch/searcher/ web/jsp/

Posted by Marvin Humphrey <ma...@rectangular.com>.

On May 11, 2006, at 3:36 AM, Jérôme Charron wrote:

> Actually, the clustering uses the summaries as input. I assume it would
> provide better results if it took the whole document content, no?
> I assume that clustering uses the summaries instead of the document content
> for performance reasons.
> But there is a (bad) side effect: since the size of the summaries is
> configurable, the clustering "quality" will vary depending on the configured
> summary size. I find this very confusing: when folks adjust
> this parameter it is only for front-end considerations (they want to display
> a long or a short summary), but certainly not for clustering reasons.
>
> What do you and others think about this?

Bob Carpenter of alias-i had this to say when I brought up this very  
idea:

http://article.gmane.org/gmane.comp.jakarta.lucene.devel/12599

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/

Re: svn commit: r405565 - in /lucene/nutch/trunk/src: java/org/apache/nutch/searcher/ test/org/apache/nutch/searcher/ web/jsp/

Posted by Dawid Weiss <da...@cs.put.poznan.pl>.

Yes, this should definitely be mentioned somewhere (in the documentation 
:) At least we left a trace on the mailing list, so it'll be possible to 
refer to it.

D.

Jérôme Charron wrote:
>> You're right -- changing anything with the input (snippets length,
>> number of documents etc) will alter the clusters. This is basically how
>> it works. If you want clustering in your search engine then, depending
>> on the type of data you serve, you'll have to experiment with the
>> settings a bit and see which give you satisfactory results. I don't
>> think there is any particular reason to provide different data to the
>> clusterer. Moreover, it'd complicate things quite badly.
> 
> Thanks Dawid for your response.
> In fact, I don't really want to change this; I just want to be sure that
> everybody is aware of it and to get some opinions.
> 
> Regards
> 
> Jérôme
>

Re: svn commit: r405565 - in /lucene/nutch/trunk/src: java/org/apache/nutch/searcher/ test/org/apache/nutch/searcher/ web/jsp/

Posted by Jérôme Charron <je...@gmail.com>.

> You're right -- changing anything with the input (snippets length,
> number of documents etc) will alter the clusters. This is basically how
> it works. If you want clustering in your search engine then, depending
> on the type of data you serve, you'll have to experiment with the
> settings a bit and see which give you satisfactory results. I don't
> think there is any particular reason to provide different data to the
> clusterer. Moreover, it'd complicate things quite badly.

Thanks Dawid for your response.
In fact, I don't really want to change this; I just want to be sure that
everybody is aware of it and to get some opinions.

Regards

Jérôme

-- 
http://motrech.free.fr/
http://www.frutch.org/

Re: svn commit: r405565 - in /lucene/nutch/trunk/src: java/org/apache/nutch/searcher/ test/org/apache/nutch/searcher/ web/jsp/

Posted by Dawid Weiss <da...@cs.put.poznan.pl>.

Hi Jerome,

> Yes Dawid, but it is already committed => the clustering now uses the plain
> text version returned by the toString() method.

Ugh, yes, sorry about that, it uses Summary.toStrings(summaries) to be 
specific and that uses toString internally.

> Actually, the clustering uses the summaries as input. I assume it would
> provide better results if it took the whole document content, no?
> I assume that clustering uses the summaries instead of the document content
> for performance reasons.

Not always. Or rather: depends what your goals are. Full document 
clustering will take longer (word segmentation, feature extraction etc), 
but since you have more data to work with, document similarity should be 
more accurate and hence clusters more sensible. In practice, however, 
similarity between documents and "cluster quality" is just a 
mathematical concept which is never shown to the user -- what the user 
sees is the representation of a cluster, which in case of full-document 
clustering is usually quite inconvenient to build and has a weak 
relationship with the actual mathematical model of clusters.

Contextual (keyword-in-context) snippets have a great advantage: they 
are shorter and carry the neighborhood of your query's terms. This very 
neighborhood (or rather: repetitive sequences of terms) can be used to 
first determine "clusters" of documents and then to describe them to the 
user. This is how most Web clustering algorithms work (excuse me if I 
explained it in a very imprecise way).
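
As a toy illustration of the idea (grouping snippets by the term sequences they share), here is a deliberately simplified sketch. It only collects two-word phrases that occur in at least minDocs snippets and records which snippets contain them; the class name PhraseGroupingSketch and the sharedPhrases() method are hypothetical, and this is not the algorithm used by the Nutch clustering plugin or by any particular clustering library.

```java
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Toy sketch: group snippets by the two-word phrases they have in common.
class PhraseGroupingSketch {

  // Returns phrase -> indices of snippets containing it, keeping only
  // phrases that appear in at least minDocs different snippets.
  static Map<String, Set<Integer>> sharedPhrases(List<String> snippets, int minDocs) {
    Map<String, Set<Integer>> byPhrase = new LinkedHashMap<>();
    for (int i = 0; i < snippets.size(); i++) {
      // Naive tokenization: lowercase, split on non-word characters.
      String[] words = snippets.get(i).toLowerCase().split("\\W+");
      for (int j = 0; j + 1 < words.length; j++) {
        if (words[j].isEmpty()) {
          continue; // leading separator produces one empty token
        }
        String phrase = words[j] + " " + words[j + 1];
        byPhrase.computeIfAbsent(phrase, k -> new LinkedHashSet<>()).add(i);
      }
    }
    // Drop phrases that are not shared by enough snippets.
    byPhrase.values().removeIf(docs -> docs.size() < minDocs);
    return byPhrase;
  }
}
```

Each surviving phrase acts both as a candidate group (the snippets that contain it) and as a human-readable label for that group, which is the property that makes short keyword-in-context snippets convenient input.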

> But there is a (bad) side effect: since the size of the summaries is
> configurable, the clustering "quality" will vary depending on the configured
> summary size. I find this very confusing: when folks adjust
> this parameter it is only for front-end considerations (they want to display
> a long or a short summary), but certainly not for clustering reasons.

You're right -- changing anything with the input (snippets length, 
number of documents etc) will alter the clusters. This is basically how 
it works. If you want clustering in your search engine then, depending 
on the type of data you serve, you'll have to experiment with the 
settings a bit and see which give you satisfactory results. I don't 
think there is any particular reason to provide different data to the 
clusterer. Moreover, it'd complicate things quite badly.

D.

Re: svn commit: r405565 - in /lucene/nutch/trunk/src: java/org/apache/nutch/searcher/ test/org/apache/nutch/searcher/ web/jsp/

Posted by Jérôme Charron <je...@gmail.com>.

> Add 3. Clustering would benefit from a plain text version.

Yes Dawid, but it is already committed => the clustering now uses the plain
text version returned by the toString() method.

Dawid, I have a question about clustering.
Actually, the clustering uses the summaries as input. I assume it would
provide better results if it took the whole document content, no?
I assume that clustering uses the summaries instead of the document content
for performance reasons.
But there is a (bad) side effect: since the size of the summaries is
configurable, the clustering "quality" will vary depending on the configured
summary size. I find this very confusing: when folks adjust
this parameter it is only for front-end considerations (they want to display
a long or a short summary), but certainly not for clustering reasons.

What do you and others think about this?

Jérôme

-- 
http://motrech.free.fr/
http://www.frutch.org/

Re: svn commit: r405565 - in /lucene/nutch/trunk/src: java/org/apache/nutch/searcher/ test/org/apache/nutch/searcher/ web/jsp/

Posted by Dawid Weiss <da...@cs.put.poznan.pl>.

> The reason is that they should not use the same HTML code :
> 1. OpenSearch should only use <b> around highlights
> 2. search.jsp should use some more complicated HTML code (<span ... >)

Add 3. Clustering would benefit from a plain text version.

D.

Re: svn commit: r405565 - in /lucene/nutch/trunk/src: java/org/apache/nutch/searcher/ test/org/apache/nutch/searcher/ web/jsp/

Posted by Doug Cutting <cu...@apache.org>.

Jérôme Charron wrote:
> Yes Doug, but in fact, the idea is to add the toString(Formatter) method in
> a common place (Summary).
> And add one specific Formatter implementation for OpenSearch and another
> one for search.jsp:
> The reason is that they should not use the same HTML code :
> 1. OpenSearch should only use <b> around highlights
> 2. search.jsp should use some more complicated HTML code (<span ... >)
> 
> In fact, I don't know if the "Formatter" solution is the right one, but
> toString() or toHtml() must be parameterized,
> since the two pieces of code that use this method should produce distinct
> outputs.

This all sounds fine, I'm just remarking that, at present, the 
OpenSearch output has changed incompatibly, which is a bad thing, and 
that I wish, until this is fully worked out, OpenSearch returned what it 
did before (markup, although perhaps exceeding what's advised).

Doug

Re: svn commit: r405565 - in /lucene/nutch/trunk/src: java/org/apache/nutch/searcher/ test/org/apache/nutch/searcher/ web/jsp/

Posted by Jérôme Charron <je...@gmail.com>.

> > String toString(Encoder, Formatter), as in Lucene's Highlighter, and
> > provide some basic implementations of Encoder and Formatter.
> That sounds fine, but in the meantime, let's not reproduce the
> html-specific code in lots of places.  We need it in both search.jsp and
> in OpenSearchServlet.java.  So we should have it in a common place.  A
> method on Summary seems like a good place.  If we subsequently add a
> more general API then we could re-implement the toHtml() method using
> that API, but I think a generic toHtml() method will be useful for quite
> a while yet.

Yes Doug, but in fact, the idea is to add the toString(Formatter) method in
a common place (Summary).
And add one specific Formatter implementation for OpenSearch and another one
for search.jsp :
The reason is that they should not use the same HTML code :
1. OpenSearch should only use <b> around highlights
2. search.jsp should use some more complicated HTML code (<span ... >)

In fact, I don't know if the "Formatter" solution is the right one, but
toString() or toHtml() must be parameterized,
since the two pieces of code that use this method should produce distinct
outputs.
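
A minimal sketch of that parameterization, under stated assumptions: Piece, PieceFormatter, OpenSearchFormatter, JspFormatter and FormattedSummary are hypothetical names invented for illustration, not the Nutch Summary API and not Lucene's Highlighter interfaces. It only shows how one rendering loop in a common place could serve both callers with different markup.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical fragment type: plain text, highlighted text, or an ellipsis.
class Piece {
  final String text;
  final boolean highlight;
  final boolean ellipsis;

  Piece(String text, boolean highlight, boolean ellipsis) {
    this.text = text;
    this.highlight = highlight;
    this.ellipsis = ellipsis;
  }
}

// Hypothetical formatter: decides the markup, not the traversal.
interface PieceFormatter {
  String highlight(String escapedText);
  String ellipsis();
}

// OpenSearch variant: only <b> around highlights.
class OpenSearchFormatter implements PieceFormatter {
  public String highlight(String escapedText) { return "<b>" + escapedText + "</b>"; }
  public String ellipsis() { return " ... "; }
}

// search.jsp variant: the richer <span class="..."> markup.
class JspFormatter implements PieceFormatter {
  public String highlight(String escapedText) {
    return "<span class=\"highlight\">" + escapedText + "</span>";
  }
  public String ellipsis() { return "<span class=\"ellipsis\"> ... </span>"; }
}

class FormattedSummary {
  private final List<Piece> pieces;

  FormattedSummary(Piece... pieces) {
    this.pieces = Arrays.asList(pieces);
  }

  // Minimal HTML escaping; '&' is replaced first to avoid double escaping.
  static String escape(String s) {
    return s.replace("&", "&amp;").replace("<", "&lt;")
            .replace(">", "&gt;").replace("\"", "&quot;");
  }

  // Single shared loop; the formatter supplies the caller-specific markup.
  String toString(PieceFormatter formatter) {
    StringBuilder sb = new StringBuilder();
    for (Piece p : pieces) {
      if (p.ellipsis) {
        sb.append(formatter.ellipsis());
      } else if (p.highlight) {
        sb.append(formatter.highlight(escape(p.text)));
      } else {
        sb.append(escape(p.text));
      }
    }
    return sb.toString();
  }
}
```

A fixed toHtml() could later be expressed as toString(new JspFormatter()), so the two approaches do not exclude each other.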

Jérôme

--
http://motrech.free.fr/
http://www.frutch.org/

Re: svn commit: r405565 - in /lucene/nutch/trunk/src: java/org/apache/nutch/searcher/ test/org/apache/nutch/searcher/ web/jsp/

Posted by Doug Cutting <cu...@apache.org>.

Jérôme Charron wrote:
>> This means there's no markup in the OpenSearch output?
> 
> 
> Yes, no markup for now.

Doesn't this break any existing application that uses OpenSearch and 
displays summaries in a web browser?  This is an incompatible change 
which we should avoid.

>> Shouldn't there be?
> 
> 
> The restriction on description field is : "Can contain simple escaped HTML
> markup, such as <b>, <i>, <a>, and <img> elements."
> So, yes, why not. We can add <b> around highlights.
> What do you and others think?

+1

>> Perhaps this should be a method on Summary, to render it as html?
> 
> 
> I had some hesitations about this while coding...
> In fact, as suggested in the issue's comments, I would like to add a
> generic method on Summary:
> String toString(Encoder, Formatter), as in Lucene's Highlighter, and
> provide some basic implementations of Encoder and Formatter.

That sounds fine, but in the meantime, let's not reproduce the 
html-specific code in lots of places.  We need it in both search.jsp and 
in OpenSearchServlet.java.  So we should have it in a common place.  A 
method on Summary seems like a good place.  If we subsequently add a 
more general API then we could re-implement the toHtml() method using 
that API, but I think a generic toHtml() method will be useful for quite 
a while yet.

Doug

Re: svn commit: r405565 - in /lucene/nutch/trunk/src: java/org/apache/nutch/searcher/ test/org/apache/nutch/searcher/ web/jsp/

Posted by Jérôme Charron <je...@gmail.com>.

> This means there's no markup in the OpenSearch output?

Yes, no markup for now.


> Shouldn't there be?

The restriction on description field is : "Can contain simple escaped HTML
markup, such as <b>, <i>, <a>, and <img> elements."
So, yes, why not. We can add <b> around highlights.
What do you and others think?


> Perhaps this should be a method on Summary, to render it as html?

I had some hesitations about this while coding...
In fact, as suggested in the issue's comments, I would like to add a generic
method on Summary:
String toString(Encoder, Formatter), as in Lucene's Highlighter, and
provide some basic implementations of Encoder and Formatter.

Jérôme

--
http://motrech.free.fr/
http://www.frutch.org/