Posted to commits@nutch.apache.org by si...@apache.org on 2006/12/09 23:27:07 UTC

svn commit: r485076 - in /lucene/nutch/trunk/src: java/org/apache/nutch/metadata/SpellCheckedMetadata.java test/org/apache/nutch/metadata/TestSpellCheckedMetadata.java

Author: siren
Date: Sat Dec  9 14:27:07 2006
New Revision: 485076

URL: http://svn.apache.org/viewvc?view=rev&rev=485076
Log:
Optimize SpellCheckedMetadata further by taking into account the fact that it is used only for http-headers.

I am starting to believe that spellchecking should just be a utility method used by HTTP protocol plugins.

Modified:
    lucene/nutch/trunk/src/java/org/apache/nutch/metadata/SpellCheckedMetadata.java
    lucene/nutch/trunk/src/test/org/apache/nutch/metadata/TestSpellCheckedMetadata.java

Modified: lucene/nutch/trunk/src/java/org/apache/nutch/metadata/SpellCheckedMetadata.java
URL: http://svn.apache.org/viewvc/lucene/nutch/trunk/src/java/org/apache/nutch/metadata/SpellCheckedMetadata.java?view=diff&rev=485076&r1=485075&r2=485076
==============================================================================
--- lucene/nutch/trunk/src/java/org/apache/nutch/metadata/SpellCheckedMetadata.java (original)
+++ lucene/nutch/trunk/src/java/org/apache/nutch/metadata/SpellCheckedMetadata.java Sat Dec  9 14:27:07 2006
@@ -25,10 +25,9 @@
 
 /**
  * A decorator to Metadata that adds spellchecking capabilities to property
- * names.
- *
- * All the static String fields declared by this class are used as reference
- * names for syntax correction on meta-data naming.
+ * names. Currently used spelling vocabulary contains just the httpheaders from
+ * {@link HttpHeaders} class.
+ * 
  */
 public class SpellCheckedMetadata extends Metadata {
 
@@ -49,18 +48,23 @@
    */
   private static String[] normalized = null;
 
-  // Uses self introspection to fill the metanames index and the
-  // metanames list.
   static {
-    for (Field field : SpellCheckedMetadata.class.getFields()) {
-      int mods = field.getModifiers();
-      if (Modifier.isFinal(mods) && Modifier.isPublic(mods)
-          && Modifier.isStatic(mods) && field.getType().equals(String.class)) {
-        try {
-          String val = (String) field.get(null);
-          NAMES_IDX.put(normalize(val), val);
-        } catch (Exception e) {
-          // Simply ignore...
+
+    // Uses following array to fill the metanames index and the
+    // metanames list.
+    Class[] spellthese = {HttpHeaders.class};
+
+    for (Class spellCheckedNames : spellthese) {
+      for (Field field : spellCheckedNames.getFields()) {
+        int mods = field.getModifiers();
+        if (Modifier.isFinal(mods) && Modifier.isPublic(mods)
+            && Modifier.isStatic(mods) && field.getType().equals(String.class)) {
+          try {
+            String val = (String) field.get(null);
+            NAMES_IDX.put(normalize(val), val);
+          } catch (Exception e) {
+            // Simply ignore...
+          }
         }
       }
     }
@@ -125,8 +129,7 @@
 
   @Override
   public void add(final String name, final String value) {
-    String normalized = getNormalizedName(name);
-    super.add(normalized, value);
+    super.add(getNormalizedName(name), value);
   }
 
   @Override

Modified: lucene/nutch/trunk/src/test/org/apache/nutch/metadata/TestSpellCheckedMetadata.java
URL: http://svn.apache.org/viewvc/lucene/nutch/trunk/src/test/org/apache/nutch/metadata/TestSpellCheckedMetadata.java?view=diff&rev=485076&r1=485075&r2=485076
==============================================================================
--- lucene/nutch/trunk/src/test/org/apache/nutch/metadata/TestSpellCheckedMetadata.java (original)
+++ lucene/nutch/trunk/src/test/org/apache/nutch/metadata/TestSpellCheckedMetadata.java Sat Dec  9 14:27:07 2006
@@ -36,6 +36,8 @@
  */
 public class TestSpellCheckedMetadata extends TestCase {
 
+  private static final int NUM_ITERATIONS = 10000;
+
   public TestSpellCheckedMetadata(String testName) {
     super(testName);
   }
@@ -63,7 +65,7 @@
     assertEquals("Content-Type", SpellCheckedMetadata
         .getNormalizedName("contntype"));
   }
-
+  
   /** Test for the <code>add(String, String)</code> method. */
   public void testAdd() {
     String[] values = null;
@@ -237,18 +239,35 @@
     assertEquals(0, result.size());
     meta.add("name-one", "value-1.1");
     result = writeRead(meta);
+    meta.add("Contenttype", "text/html");
     assertEquals(1, result.size());
     assertEquals(1, result.getValues("name-one").length);
     assertEquals("value-1.1", result.get("name-one"));
     meta.add("name-two", "value-2.1");
     meta.add("name-two", "value-2.2");
     result = writeRead(meta);
-    assertEquals(2, result.size());
+    assertEquals(3, result.size());
     assertEquals(1, result.getValues("name-one").length);
     assertEquals("value-1.1", result.getValues("name-one")[0]);
     assertEquals(2, result.getValues("name-two").length);
     assertEquals("value-2.1", result.getValues("name-two")[0]);
     assertEquals("value-2.2", result.getValues("name-two")[1]);
+    assertEquals("text/html", result.get(Metadata.CONTENT_TYPE));
+  }
+
+  /**
+   * IO Test method, usable only when you plan to do changes in metadata
+   * to measure relative performance impact.
+   */
+  public final void testHandlingSpeed() {
+    SpellCheckedMetadata result;
+    long start = System.currentTimeMillis();
+    for (int i = 0; i < NUM_ITERATIONS; i++) {
+      SpellCheckedMetadata scmd = constructSpellCheckedMetadata();
+      result = writeRead(scmd);
+    }
+    System.out.println(NUM_ITERATIONS + " spellchecked metadata I/O time:"
+        + (System.currentTimeMillis() - start) + "ms.");
   }
 
   private SpellCheckedMetadata writeRead(SpellCheckedMetadata meta) {
@@ -262,6 +281,24 @@
       fail(ioe.toString());
     }
     return readed;
+  }
+
+  /**
+   * Assembles a Spellchecked metadata Object.
+   */
+  public static final SpellCheckedMetadata constructSpellCheckedMetadata() {
+    SpellCheckedMetadata scmd = new SpellCheckedMetadata();
+    scmd.add("Content-type", "foo/bar");
+    scmd.add("Connection", "close");
+    scmd.add("Last-Modified", "Sat, 09 Dec 2006 15:09:57 GMT");
+    scmd.add("Server", "Foobar");
+    scmd.add("Date", "Sat, 09 Dec 2006 18:07:20 GMT");
+    scmd.add("Accept-Ranges", "bytes");
+    scmd.add("ETag", "\"1234567-89-01234567\"");
+    scmd.add("Content-Length", "123");
+    scmd.add(Nutch.SEGMENT_NAME_KEY, "segmentzzz");
+    scmd.add(Nutch.SIGNATURE_KEY, "123");
+    return scmd;
   }
 
 }



Re: svn commit: r485076 - in /lucene/nutch/trunk/src: java/org/apache/nutch/metadata/SpellCheckedMetadata.java test/org/apache/nutch/metadata/TestSpellCheckedMetadata.java

Posted by Chris Mattmann <ch...@jpl.nasa.gov>.
Hi Sami,


On 12/10/06 1:52 AM, "Sami Siren" <ss...@gmail.com> wrote:
> 
> Yes, I am OK with adding more words to SCMetadata if required. Do you
> have a concrete example of those FileHeaders you are planning to add?

Not yet. I was thinking about starting work on NUTCH-384, and I thought
that the fix for it might involve adding some more headers. However,
thinking about it more, those headers may be just CONTENT_TYPE or
something like that, so my point may be moot anyway.

> 
>>  What do you think about that? Alternatively we could just create a
>> ProtocolHeaders interface in org.apache.nutch.metadata that aggregates all
>> the met key fields from HttpHeaders, and it would be the place that the met
>> key fields for FileHeaders, etc. could go into.
> 
> You don't actually need to construct a hierarchy of interfaces for the
> constants, as I changed SCMetadata to initialize itself from an array of
> classes.

Ah, yes, I see that now. You're right.

> 
> The optimization I made is not that significant in the big picture,
> so if there are real objections to it, it can also be reverted.

Nah, +1 let's keep it. It just looked weird to me at first, but I get the
whole point of it now.

> 
> However, my original opinion hasn't really changed: we should probably
> move the spell checking feature to a static utility method so it can be
> used when needed (probably also with a customizable, context-optimizable
> dictionary). This way it could also be used in non-metadata contexts.

I agree with this. Want me to create a JIRA issue about it so we can
track/assign it?

Cheers,
  Chris

> 
> --
>  Sami Siren



Re: svn commit: r485076 - in /lucene/nutch/trunk/src: java/org/apache/nutch/metadata/SpellCheckedMetadata.java test/org/apache/nutch/metadata/TestSpellCheckedMetadata.java

Posted by Sami Siren <ss...@gmail.com>.
Chris Mattmann wrote:
>  Indeed, I see your point. I guess what I was advocating for was more of a
> ProtocolHeaders interface, that lives in org.apache.nutch.metadata. Then, we
> could update the code that you have below to use ProtocolHeaders.class
> rather than HttpHeaders.class. We would then make ProtocolHeaders extend
> HttpHeaders, so that it by default inherits all of the HttpHeaders, while
> still allowing more ProtocolHeader met keys (e.g., we could have an
> interface for FileHeaders, etc.).

Yes, I am OK with adding more words to SCMetadata if required. Do you
have a concrete example of those FileHeaders you are planning to add?

>  What do you think about that? Alternatively we could just create a
> ProtocolHeaders interface in org.apache.nutch.metadata that aggregates all
> the met key fields from HttpHeaders, and it would be the place that the met
> key fields for FileHeaders, etc. could go into.

You don't actually need to construct a hierarchy of interfaces for the
constants, as I changed SCMetadata to initialize itself from an array of
classes.
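
The array-of-classes initialization mentioned above can be sketched roughly
as follows. This is only an illustration, not the committed code: the
DemoHttpHeaders and DemoFileHeaders interfaces, the buildIndex helper and the
letters-only normalize rule are all assumptions made for the example.

```java
import java.lang.reflect.Field;
import java.lang.reflect.Modifier;
import java.util.HashMap;
import java.util.Map;

public class NameIndexSketch {

  /** Hypothetical stand-in for HttpHeaders; not the real Nutch interface. */
  public interface DemoHttpHeaders {
    String CONTENT_TYPE = "Content-Type";
    String LAST_MODIFIED = "Last-Modified";
  }

  /** Hypothetical second vocabulary; it does not extend the first one. */
  public interface DemoFileHeaders {
    String FILE_OWNER = "File-Owner";
  }

  /** Assumed normalization: keep letters only and lower-case them. */
  public static String normalize(String name) {
    StringBuilder sb = new StringBuilder();
    for (char c : name.toCharArray()) {
      if (Character.isLetter(c)) {
        sb.append(Character.toLowerCase(c));
      }
    }
    return sb.toString();
  }

  /** Indexes every public static final String constant of each class. */
  public static Map<String, String> buildIndex(Class<?>... vocabularies) {
    Map<String, String> index = new HashMap<String, String>();
    for (Class<?> vocabulary : vocabularies) {
      for (Field field : vocabulary.getFields()) {
        int mods = field.getModifiers();
        if (Modifier.isPublic(mods) && Modifier.isStatic(mods)
            && Modifier.isFinal(mods)
            && field.getType().equals(String.class)) {
          try {
            String val = (String) field.get(null);
            index.put(normalize(val), val);
          } catch (IllegalAccessException e) {
            // Not readable from here; leave it out of the vocabulary.
          }
        }
      }
    }
    return index;
  }

  public static void main(String[] args) {
    Map<String, String> index =
        buildIndex(DemoHttpHeaders.class, DemoFileHeaders.class);
    System.out.println(index.get("contenttype")); // Content-Type
    System.out.println(index.get("fileowner"));   // File-Owner
  }
}
```

Adding another vocabulary then means appending one more class to the
argument list, with no relationship required between the classes.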

The optimization I made is not that significant in the big picture,
so if there are real objections to it, it can also be reverted.

However, my original opinion hasn't really changed: we should probably
move the spell checking feature to a static utility method so it can be
used when needed (probably also with a customizable, context-optimizable
dictionary). This way it could also be used in non-metadata contexts.
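
One possible shape for such a utility is sketched below. Everything in it is
hypothetical: the SpellCheckSketch class, the correct method, its maxDistance
parameter and the use of plain Levenshtein edit distance are assumptions made
for illustration, not an existing Nutch API.

```java
public final class SpellCheckSketch {

  private SpellCheckSketch() {
    // Static utility; not meant to be instantiated.
  }

  /** Assumed normalization: keep letters only and lower-case them. */
  public static String normalize(String name) {
    StringBuilder sb = new StringBuilder();
    for (char c : name.toCharArray()) {
      if (Character.isLetter(c)) {
        sb.append(Character.toLowerCase(c));
      }
    }
    return sb.toString();
  }

  /** Levenshtein distance computed with two rolling rows. */
  public static int editDistance(String a, String b) {
    int[] prev = new int[b.length() + 1];
    int[] curr = new int[b.length() + 1];
    for (int j = 0; j <= b.length(); j++) {
      prev[j] = j;
    }
    for (int i = 1; i <= a.length(); i++) {
      curr[0] = i;
      for (int j = 1; j <= b.length(); j++) {
        int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
        curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
            prev[j - 1] + cost);
      }
      int[] tmp = prev;
      prev = curr;
      curr = tmp;
    }
    return prev[b.length()];
  }

  /**
   * Returns the closest dictionary entry within maxDistance edits of the
   * normalized name, or the name unchanged when nothing is close enough.
   */
  public static String correct(String name, String[] dictionary,
      int maxDistance) {
    String searched = normalize(name);
    String best = null;
    int bestDistance = maxDistance + 1;
    for (String candidate : dictionary) {
      int d = editDistance(searched, normalize(candidate));
      if (d < bestDistance) {
        bestDistance = d;
        best = candidate;
      }
    }
    return best != null ? best : name;
  }

  public static void main(String[] args) {
    String[] httpNames = {"Content-Type", "Content-Length"};
    System.out.println(correct("contntype", httpNames, 2)); // Content-Type
    System.out.println(correct("name-one", httpNames, 2));  // name-one
  }
}
```

Because the dictionary is passed in by the caller, each protocol plugin could
supply its own vocabulary, and the same method would work outside of metadata
handling.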

--
 Sami Siren

Re: svn commit: r485076 - in /lucene/nutch/trunk/src: java/org/apache/nutch/metadata/SpellCheckedMetadata.java test/org/apache/nutch/metadata/TestSpellCheckedMetadata.java

Posted by Chris Mattmann <ch...@jpl.nasa.gov>.
Hi Sami,

 Indeed, I see your point. I guess what I was advocating for was more of a
ProtocolHeaders interface, that lives in org.apache.nutch.metadata. Then, we
could update the code that you have below to use ProtocolHeaders.class
rather than HttpHeaders.class. We would then make ProtocolHeaders extend
HttpHeaders, so that it by default inherits all of the HttpHeaders, while
still allowing more ProtocolHeader met keys (e.g., we could have an
interface for FileHeaders, etc.).
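
A rough sketch of the inheritance idea, for illustration only: the three
interfaces and the publicStringConstants helper below are hypothetical
stand-ins, not Nutch classes. It relies on the documented behavior of
Class.getFields(), which for an interface also returns the public fields of
all its superinterfaces.

```java
import java.lang.reflect.Field;
import java.lang.reflect.Modifier;
import java.util.Set;
import java.util.TreeSet;

public class ProtocolHeadersSketch {

  /** Hypothetical stand-in for HttpHeaders. */
  public interface DemoHttpHeaders {
    String CONTENT_TYPE = "Content-Type";
  }

  /** Hypothetical file protocol keys. */
  public interface DemoFileHeaders {
    String FILE_OWNER = "File-Owner";
  }

  /** Aggregating interface: inherits the constants of both parents. */
  public interface DemoProtocolHeaders
      extends DemoHttpHeaders, DemoFileHeaders {
  }

  /** Collects the values of all public static String fields of a type. */
  public static Set<String> publicStringConstants(Class<?> type) {
    Set<String> values = new TreeSet<String>();
    for (Field field : type.getFields()) {
      if (Modifier.isStatic(field.getModifiers())
          && field.getType().equals(String.class)) {
        try {
          values.add((String) field.get(null));
        } catch (IllegalAccessException e) {
          // Not readable from here; skip it.
        }
      }
    }
    return values;
  }

  public static void main(String[] args) {
    // Prints [Content-Type, File-Owner]
    System.out.println(publicStringConstants(DemoProtocolHeaders.class));
  }
}
```

With this arrangement a single ProtocolHeaders.class reference would be
enough for the reflection loop to see the keys of every protocol.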

 What do you think about that? Alternatively we could just create a
ProtocolHeaders interface in org.apache.nutch.metadata that aggregates all
the met key fields from HttpHeaders, and it would be the place that the met
key fields for FileHeaders, etc. could go into.

Let me know what you think, and thanks!

Cheers,
  Chris



On 12/9/06 3:53 PM, "Sami Siren" <ss...@gmail.com> wrote:

> Chris Mattmann wrote:
>> Hi Sami,
>> 
>> On 12/9/06 2:27 PM, "siren@apache.org" <si...@apache.org> wrote:
>> 
>>> Author: siren
>>> Date: Sat Dec  9 14:27:07 2006
>>> New Revision: 485076
>>> 
>>> URL: http://svn.apache.org/viewvc?view=rev&rev=485076
>>> Log:
>>> Optimize SpellCheckedMetadata further by taking into account the fact that
>>> it is used only for http-headers.
>>> 
>>> I am starting to believe that spellchecking should just be a utility method
>>> used by http protocol plugins.
>> 
>> I think that right now I'm -1 for this change. I would make note of all the
>> comments on NUTCH-139, from which this code was born. In the end, I think
>> what we all realized was that the spell checking capabilities are necessary,
>> but not everywhere, as you point out. However, I don't think it's limited
>> entirely to HTTP headers (what you've currently changed the code to). I
>> think it should be implemented as a protocol layer service, also providing
>> spell checking support to other protocol plugins, like protocol-file, etc.,
> 
> In protocol-file all headers are artificial and generated in Nutch code,
> so if there's a spelling mistake there, we should fix the code
> generating the headers and not rely on spellchecking in the first place.
> 
>> where field headers run the risk of being misspelled as well. What's to stop
>> someone from implementing protocol-file++ that returns different file header
>> keys than that of protocol-file? Just b/c HTTP is the most pervasively used
>> plugin right now, I think it's convenient to assume that only HTTP protocol
>> field keys may need spell checking services.
> 
> If there's a real need for spell checking on other keys, one can just add
> more classes to the array; no big deal.
> 
> --
>  Sami Siren
> 



Re: svn commit: r485076 - in /lucene/nutch/trunk/src: java/org/apache/nutch/metadata/SpellCheckedMetadata.java test/org/apache/nutch/metadata/TestSpellCheckedMetadata.java

Posted by Sami Siren <ss...@gmail.com>.
Chris Mattmann wrote:
> Hi Sami,
> 
> On 12/9/06 2:27 PM, "siren@apache.org" <si...@apache.org> wrote:
> 
>> Author: siren
>> Date: Sat Dec  9 14:27:07 2006
>> New Revision: 485076
>>
>> URL: http://svn.apache.org/viewvc?view=rev&rev=485076
>> Log:
>> Optimize SpellCheckedMetadata further by taking into account the fact that it
>> is used only for http-headers.
>>
>> I am starting to believe that spellchecking should just be a utility method
>> used by http protocol plugins.
> 
> I think that right now I'm -1 for this change. I would make note of all the
> comments on NUTCH-139, from which this code was born. In the end, I think
> what we all realized was that the spell checking capabilities are necessary,
> but not everywhere, as you point out. However, I don't think it's limited
> entirely to HTTP headers (what you've currently changed the code to). I
> think it should be implemented as a protocol layer service, also providing
> spell checking support to other protocol plugins, like protocol-file, etc.,

In protocol-file all headers are artificial and generated in Nutch code,
so if there's a spelling mistake there, we should fix the code
generating the headers and not rely on spellchecking in the first place.

> where field headers run the risk of being misspelled as well. What's to stop
> someone from implementing protocol-file++ that returns different file header
> keys than that of protocol-file? Just b/c HTTP is the most pervasively used
> plugin right now, I think it's convenient to assume that only HTTP protocol
> field keys may need spell checking services.

If there's a real need for spell checking on other keys, one can just add
more classes to the array; no big deal.

--
 Sami Siren


Re: svn commit: r485076 - in /lucene/nutch/trunk/src: java/org/apache/nutch/metadata/SpellCheckedMetadata.java test/org/apache/nutch/metadata/TestSpellCheckedMetadata.java

Posted by Chris Mattmann <ch...@jpl.nasa.gov>.
Hi Sami,

On 12/9/06 2:27 PM, "siren@apache.org" <si...@apache.org> wrote:

> Author: siren
> Date: Sat Dec  9 14:27:07 2006
> New Revision: 485076
> 
> URL: http://svn.apache.org/viewvc?view=rev&rev=485076
> Log:
> Optimize SpellCheckedMetadata further by taking into account the fact that it
> is used only for http-headers.
> 
> I am starting to believe that spellchecking should just be a utility method
> used by http protocol plugins.

I think that right now I'm -1 for this change. I would make note of all the
comments on NUTCH-139, from which this code was born. In the end, I think
what we all realized was that the spell checking capabilities are necessary,
but not everywhere, as you point out. However, I don't think it's limited
entirely to HTTP headers (what you've currently changed the code to). I
think it should be implemented as a protocol layer service, also providing
spell checking support to other protocol plugins, like protocol-file, etc.,
where field headers run the risk of being misspelled as well. What's to stop
someone from implementing protocol-file++ that returns different file header
keys than that of protocol-file? Just b/c HTTP is the most pervasively used
plugin right now, I think it's convenient to assume that only HTTP protocol
field keys may need spell checking services.

Just my 2 cents...

Cheers,
  Chris