Posted to dev@nutch.apache.org by "Chris A. Mattmann (JIRA)" <ji...@apache.org> on 2005/12/14 05:02:45 UTC

[jira] Created: (NUTCH-139) Standard metadata property names in the ParseData metadata

Standard metadata property names in the ParseData metadata 
-----------------------------------------------------------

         Key: NUTCH-139
         URL: http://issues.apache.org/jira/browse/NUTCH-139
     Project: Nutch
        Type: Improvement
  Components: fetcher  
    Versions: 0.7.1, 0.7, 0.6, 0.7.2-dev, 0.8-dev    
 Environment: Power Mac OS X 10.4, Dual Processor G5 2.0 Ghz, 1.5 GB  RAM, although bug is independent of environment
    Reporter: Chris A. Mattmann
 Assigned to: Chris A. Mattmann 
     Fix For: 0.7.2-dev, 0.8-dev, 0.7.1, 0.7, 0.6


Currently, people are free to name their string-based properties anything they want, so names such as "Content-type", "content-TyPe", and "CONTENT_TYPE" can all carry the same meaning. I believe Stefan G. proposed a solution in which all property names are converted to lower case, but that really only fixes half the problem (recognizing that "CONTENT_TYPE" and "conTeNT_TyPE" and all the other case permutations are really the same). What about if I named it "Content     Type", or "ContentType"?
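
To illustrate why lower-casing alone covers only half of the problem, here is a minimal sketch. The class and method names (LowercaseKeyDemo, lowercaseOnly, distinctKeys) are hypothetical and not part of Nutch; the sketch only counts how many distinct keys survive lower-casing.

```java
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical demo class; not part of Nutch.
final class LowercaseKeyDemo {

    // Normalizes a property name by lower-casing only.
    static String lowercaseOnly(String name) {
        return name.toLowerCase(Locale.ROOT);
    }

    // Returns the distinct keys that remain after lower-casing every name.
    static Set<String> distinctKeys(String... names) {
        Set<String> keys = new LinkedHashSet<>();
        for (String name : names) {
            keys.add(lowercaseOnly(name));
        }
        return keys;
    }

    public static void main(String[] args) {
        // Case variants collapse to a single key.
        System.out.println(distinctKeys("CONTENT_TYPE", "conTeNT_TyPE"));
        // Spelling variants stay distinct even after lower-casing.
        System.out.println(distinctKeys("Content-type", "Content     Type", "ContentType"));
    }
}
```

The second call still yields three separate keys, which is the gap a shared set of named constants is meant to close.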

 I propose correcting this by creating a standard set of named Strings in the ParseData class that the protocol framework and the parsing framework could use to identify common properties such as "Content-type", "Creator", "Language", etc.

 The properties would be defined at the top of the ParseData class, something like:

 public class ParseData {

    // ... (other members elided)

    public static final String CONTENT_TYPE = "content-type";
    public static final String CREATOR = "creator";

    // ...

}


In this fashion, users would at least know the names of the standard properties that they can obtain from the ParseData, for example by calling ParseData.getMetadata().get(ParseData.CONTENT_TYPE) to get the content type, or ParseData.getMetadata().set(ParseData.CONTENT_TYPE, "text/xml") to set it. Of course, this wouldn't preclude users from doing what they are currently doing; it would just provide a standard way of obtaining some of the more common, critical metadata without poring over the code base to figure out what the properties are named.
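
As a rough sketch of the get/set calls described above: ParseDataSketch and SimpleMetadata are hypothetical stand-ins written only for this example, and the real ParseData and its metadata container may well look different.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for the proposal; the real ParseData class differs.
final class ParseDataSketch {
    // Standard property names, as proposed in the issue text.
    public static final String CONTENT_TYPE = "content-type";
    public static final String CREATOR = "creator";

    private final SimpleMetadata metadata = new SimpleMetadata();

    SimpleMetadata getMetadata() {
        return metadata;
    }
}

// Hypothetical minimal metadata container: one string value per name.
final class SimpleMetadata {
    private final Map<String, String> values = new HashMap<>();

    void set(String name, String value) {
        values.put(name, value);
    }

    // Returns the stored value, or null if the name was never set.
    String get(String name) {
        return values.get(name);
    }
}

final class ParseDataSketchDemo {
    public static void main(String[] args) {
        ParseDataSketch parseData = new ParseDataSketch();
        // Callers refer to the constant instead of guessing the spelling.
        parseData.getMetadata().set(ParseDataSketch.CONTENT_TYPE, "text/xml");
        System.out.println(parseData.getMetadata().get(ParseDataSketch.CONTENT_TYPE));
        // Ad hoc names remain possible, as the proposal intends.
        parseData.getMetadata().set("my-custom-key", "anything");
    }
}
```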

I'll contribute a patch near the end of this week, or the beginning of next week, that addresses this issue.


-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira

[jira] Commented: (NUTCH-139) Standard metadata property names in the ParseData metadata

Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.

    [ http://issues.apache.org/jira/browse/NUTCH-139?page=comments#action_12363996 ] 

Doug Cutting commented on NUTCH-139:
------------------------------------

I think this is all easily handled by naming, and that we don't need another map.

We keep using "title" and "content-type" as examples, when these are actually not problematic, since Nutch already has dedicated fields for them.  Can someone please provide some examples of where multiple values are actually needed, besides the need to accurately represent multi-valued HTTP and SMTP headers?  For title and content-type, the header in the metadata is what the parser and protocol found, respectively, and the fields in Parse and Content are what will be used.  What are some cases where a value needs to be "overridden" where using an X-nutch value as the authoritative value will not suffice?


> Standard metadata property names in the ParseData metadata
> ----------------------------------------------------------
>
>          Key: NUTCH-139
>          URL: http://issues.apache.org/jira/browse/NUTCH-139
>      Project: Nutch
>         Type: Improvement
>   Components: fetcher
>     Versions: 0.7.1, 0.7, 0.6, 0.7.2-dev, 0.8-dev
>  Environment: Power Mac OS X 10.4, Dual Processor G5 2.0 Ghz, 1.5 GB  RAM, although bug is independent of environment
>     Reporter: Chris A. Mattmann
>     Assignee: Chris A. Mattmann
>     Priority: Minor
>      Fix For: 0.7.2-dev, 0.8-dev, 0.7.1, 0.7, 0.6
>  Attachments: NUTCH-139.060105.patch, NUTCH-139.Mattmann.patch.txt, NUTCH-139.jc.review.patch.txt


[jira] Commented: (NUTCH-139) Standard metadata property names in the ParseData metadata

Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.

    [ http://issues.apache.org/jira/browse/NUTCH-139?page=comments#action_12361891 ] 

Doug Cutting commented on NUTCH-139:
------------------------------------

If we store protocol headers as metadata then we should store them as-is.  If they're incorrect, then we should store the correct value separately as an x-nutch metadata value.

We should never need to store content type or title in metadata, since these are fields of Content and Parse respectively.  The "Content-Type" in the metadata for an HTTP request should thus be the raw HTTP header, Content.getContentType() should be the content type we actually think this is, and there should be no x-nutch-content-type value.  Similarly, x-nutch-title should never be set, as parsers should set the Parse title field instead.

Does this sound right?
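
A minimal sketch of the convention described in this comment, assuming a hypothetical ContentSketch class (not the real Nutch Content class): the raw header is kept as-is in the metadata, while the resolved type lives in a dedicated field and no x-nutch-content-type key is written.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a fetched document; not the real Nutch Content class.
final class ContentSketch {
    // Raw protocol headers, stored exactly as received.
    private final Map<String, String> metadata = new HashMap<>();
    // The content type we actually believe the document has.
    private final String contentType;

    ContentSketch(String rawContentTypeHeader, String resolvedContentType) {
        // Keep the header as the server sent it, even if it is wrong.
        metadata.put("Content-Type", rawContentTypeHeader);
        // The authoritative value goes in a field, not in an x-nutch-* key.
        this.contentType = resolvedContentType;
    }

    String getContentType() {
        return contentType;
    }

    String getRawHeader(String name) {
        return metadata.get(name);
    }

    boolean hasMetadata(String name) {
        return metadata.containsKey(name);
    }
}
```

For example, a server might send "text/plain" for a document that detection later resolves to "application/pdf"; both facts remain available without either overwriting the other.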


> Standard metadata property names in the ParseData metadata
> ----------------------------------------------------------
>
>          Key: NUTCH-139
>          URL: http://issues.apache.org/jira/browse/NUTCH-139
>      Project: Nutch
>         Type: Improvement
>   Components: fetcher
>     Versions: 0.7.1, 0.7, 0.6, 0.7.2-dev, 0.8-dev
>  Environment: Power Mac OS X 10.4, Dual Processor G5 2.0 Ghz, 1.5 GB  RAM, although bug is independent of environment
>     Reporter: Chris A. Mattmann
>     Assignee: Chris A. Mattmann
>     Priority: Minor
>      Fix For: 0.7.2-dev, 0.8-dev, 0.7.1, 0.7, 0.6
>  Attachments: NUTCH-139.060105.patch, NUTCH-139.Mattmann.patch.txt, NUTCH-139.jc.review.patch.txt


[jira] Updated: (NUTCH-139) Standard metadata property names in the ParseData metadata

Posted by "Jerome Charron (JIRA)" <ji...@apache.org>.

     [ http://issues.apache.org/jira/browse/NUTCH-139?page=all ]

Jerome Charron updated NUTCH-139:
---------------------------------

    Attachment: NUTCH-139.jc.review.patch.txt

Here is a new patch from Chris. I reviewed and tested it.
From my point of view, all seems to be OK.
So if there are no objections, I will commit it during the day.

Regards
Jérôme

> Standard metadata property names in the ParseData metadata
> ----------------------------------------------------------
>
>          Key: NUTCH-139
>          URL: http://issues.apache.org/jira/browse/NUTCH-139
>      Project: Nutch
>         Type: Improvement
>   Components: fetcher
>     Versions: 0.7.1, 0.7, 0.6, 0.7.2-dev, 0.8-dev
>  Environment: Power Mac OS X 10.4, Dual Processor G5 2.0 Ghz, 1.5 GB  RAM, although bug is independent of environment
>     Reporter: Chris A. Mattmann
>     Assignee: Chris A. Mattmann
>     Priority: Minor
>      Fix For: 0.7.2-dev, 0.8-dev, 0.7.1, 0.7, 0.6
>  Attachments: NUTCH-139.Mattmann.patch.txt, NUTCH-139.jc.review.patch.txt


[jira] Commented: (NUTCH-139) Standard metadata property names in the ParseData metadata

Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.

    [ http://issues.apache.org/jira/browse/NUTCH-139?page=comments#action_12364242 ] 

Doug Cutting commented on NUTCH-139:
------------------------------------

I was confused about which was the latest version.  (I deleted the older versions.  Is there a way to simply mark them obsolete?)

So, if Metadata and MetadataNames are moved from util into a metadata package (as suggested by Andrzej) then I am +1.

I don't see why we need separate subclasses of Metadata for content and parses.  Separate instances, yes, and we already have these, no?

Sorry for my confusion.

> Standard metadata property names in the ParseData metadata
> ----------------------------------------------------------
>
>          Key: NUTCH-139
>          URL: http://issues.apache.org/jira/browse/NUTCH-139
>      Project: Nutch
>         Type: Improvement
>   Components: fetcher
>     Versions: 0.7.1, 0.7, 0.6, 0.7.2-dev, 0.8-dev
>  Environment: Power Mac OS X 10.4, Dual Processor G5 2.0 Ghz, 1.5 GB  RAM, although bug is independent of environment
>     Reporter: Chris A. Mattmann
>     Assignee: Chris A. Mattmann
>     Priority: Minor
>      Fix For: 0.7.2-dev, 0.8-dev, 0.7.1, 0.7, 0.6
>  Attachments: NUTCH-139.060105.patch


[jira] Commented: (NUTCH-139) Standard metadata property names in the ParseData metadata

Posted by "Andrzej Bialecki (JIRA)" <ji...@apache.org>.

    [ http://issues.apache.org/jira/browse/NUTCH-139?page=comments#action_12363942 ] 

Andrzej Bialecki  commented on NUTCH-139:
-----------------------------------------

Yes, this should work ok ... but it strikes me as unnecessarily complicated. After all, in most cases we will have single values and no overrides, so this solution complicates the most common cases...

At this point it's probably easier just to keep the original <key, val[]> in one Map, and potential overrides <key, val1[]> in another Map, and then provide a container/facade with appropriate methods to add/get/set whichever value is necessary.

E.g.:

import java.util.HashMap;

public class MetaData {
  private HashMap original = new HashMap();
  private HashMap actual = new HashMap();

  public void add(String key, String val) {
    // same as in ContentProperties now, uses the "original" map
    // ...
  }

  public void set(String key, String val) {
    // same as in ContentProperties now, uses the "original" map
    // ...
  }

  public void setFinal(String key, String val) {
    // as above, but uses the "actual" map
    // ...
  }

  // return the final value; if it's missing then return the original value
  public Object getFinal(String key) {
    Object res = actual.get(key);
    if (res == null) res = original.get(key);
    return res;
  }
  // ...
}

This seems to satisfy all the requirements, and with minimal overhead. If this is ok with you, please prepare a patch, and we should commit it - there are many other changes waiting in the queue that depend on this patch being applied ...

(BTW. I think it's conceptually the same as using the "X-nutch" to avoid name clashes, but from the point of view of correct OO programming it looks more "kosher" now... ;-) )
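
One way the two-map facade from this comment might be filled in is sketched below. The bodies of add, set, and setFinal are assumptions (the comment elides them), and the class name TwoMapMetadata and the getOriginal accessor are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical completion of the facade; the add/set bodies are assumptions.
final class TwoMapMetadata {
    private final Map<String, List<String>> original = new HashMap<>();
    private final Map<String, List<String>> actual = new HashMap<>();

    // Appends a value to the original (as-found) values for this key.
    void add(String key, String val) {
        original.computeIfAbsent(key, k -> new ArrayList<>()).add(val);
    }

    // Replaces all original values for this key with a single value.
    void set(String key, String val) {
        original.put(key, new ArrayList<>(Arrays.asList(val)));
    }

    // Records an override without touching the original values.
    void setFinal(String key, String val) {
        actual.put(key, new ArrayList<>(Arrays.asList(val)));
    }

    // Returns the override if present, otherwise the original values (or null).
    List<String> getFinal(String key) {
        List<String> res = actual.get(key);
        if (res == null) res = original.get(key);
        return res;
    }

    // Returns the original values regardless of any override (or null).
    List<String> getOriginal(String key) {
        return original.get(key);
    }
}
```

In the common case of a single value with no override, callers only ever touch add and getFinal, so the second map costs nothing beyond an empty lookup.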

> Standard metadata property names in the ParseData metadata
> ----------------------------------------------------------
>
>          Key: NUTCH-139
>          URL: http://issues.apache.org/jira/browse/NUTCH-139
>      Project: Nutch
>         Type: Improvement
>   Components: fetcher
>     Versions: 0.7.1, 0.7, 0.6, 0.7.2-dev, 0.8-dev
>  Environment: Power Mac OS X 10.4, Dual Processor G5 2.0 Ghz, 1.5 GB  RAM, although bug is independent of environment
>     Reporter: Chris A. Mattmann
>     Assignee: Chris A. Mattmann
>     Priority: Minor
>      Fix For: 0.7.2-dev, 0.8-dev, 0.7.1, 0.7, 0.6
>  Attachments: NUTCH-139.060105.patch, NUTCH-139.Mattmann.patch.txt, NUTCH-139.jc.review.patch.txt


[jira] Commented: (NUTCH-139) Standard metadata property names in the ParseData metadata

Posted by "Chris A. Mattmann (JIRA)" <ji...@apache.org>.

    [ http://issues.apache.org/jira/browse/NUTCH-139?page=comments#action_12361927 ] 

Chris A. Mattmann commented on NUTCH-139:
-----------------------------------------

Hi Doug,

  While it's true that content-length can be computed from the Content's data, wouldn't it also be nice to have it in the MetaData to save the time of recomputing it? Saving computation time is always a nice thing. Furthermore, even if the value isn't really trustworthy because of the possibility of truncation, the way that Jerome and I have implemented it allows for a certain level of "trustworthiness" depending upon which value you take from the multi-valued list returned when you request a named MetaData property. The values at the front of the list are less trustworthy, while the values at the end should be more trustworthy.

  I think that the issue you raise is an important one; however, it's more of a policy issue (i.e., one for developers who are utilizing the MetaData classes, etc.) than a limitation of the patch, no?
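
The ordering convention described above (earlier values less trustworthy, later values more so) could be sketched roughly as follows. OrderedMetadata and its method names are hypothetical and are not taken from the patch.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical multi-valued container; method names are illustrative only.
final class OrderedMetadata {
    private final Map<String, List<String>> values = new HashMap<>();

    // Appends a value; later additions are treated as more trustworthy.
    void add(String name, String value) {
        values.computeIfAbsent(name, k -> new ArrayList<>()).add(value);
    }

    // All values in insertion order, least trusted first.
    List<String> getValues(String name) {
        List<String> list = values.get(name);
        return list == null
                ? Collections.<String>emptyList()
                : Collections.unmodifiableList(list);
    }

    // The last value added, or null if the name is unknown.
    String getMostTrusted(String name) {
        List<String> list = values.get(name);
        return list == null ? null : list.get(list.size() - 1);
    }
}
```

A protocol plugin might first add the Content-Length header it received, and a later stage might add the number of bytes actually kept after truncation; getMostTrusted would then return the latter.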

Cheers,
  Chris


> Standard metadata property names in the ParseData metadata
> ----------------------------------------------------------
>
>          Key: NUTCH-139
>          URL: http://issues.apache.org/jira/browse/NUTCH-139
>      Project: Nutch
>         Type: Improvement
>   Components: fetcher
>     Versions: 0.7.1, 0.7, 0.6, 0.7.2-dev, 0.8-dev
>  Environment: Power Mac OS X 10.4, Dual Processor G5 2.0 Ghz, 1.5 GB  RAM, although bug is independent of environment
>     Reporter: Chris A. Mattmann
>     Assignee: Chris A. Mattmann
>     Priority: Minor
>      Fix For: 0.7.2-dev, 0.8-dev, 0.7.1, 0.7, 0.6
>  Attachments: NUTCH-139.060105.patch, NUTCH-139.Mattmann.patch.txt, NUTCH-139.jc.review.patch.txt


[jira] Commented: (NUTCH-139) Standard metadata property names in the ParseData metadata

Posted by "Jerome Charron (JIRA)" <ji...@apache.org>.

    [ http://issues.apache.org/jira/browse/NUTCH-139?page=comments#action_12365066 ] 

Jerome Charron commented on NUTCH-139:
--------------------------------------

Sorry for this very late response...
The idea behind separate subclasses of Metadata for content and parses is to enforce the semantic separation between content metadata and parse metadata:
ContentProperties only defines constants for content related metadata.
ParseProperties only defines constants for parse related metadata.
Does it make sense?
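
The separation described in this comment might look roughly like the sketch below. The class names carry a "Sketch" suffix to mark them as hypothetical, and the specific constants chosen for each class are assumptions made for illustration.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical base container shared by both kinds of metadata.
class MetadataSketch {
    private final Map<String, String> values = new HashMap<>();

    void set(String name, String value) {
        values.put(name, value);
    }

    String get(String name) {
        return values.get(name);
    }
}

// Only content-related (protocol-level) names are declared here.
// The constants chosen are assumptions for illustration.
final class ContentPropertiesSketch extends MetadataSketch {
    static final String CONTENT_TYPE = "Content-Type";
    static final String CONTENT_LENGTH = "Content-Length";
}

// Only parse-related names are declared here.
final class ParsePropertiesSketch extends MetadataSketch {
    static final String CREATOR = "creator";
    static final String LANGUAGE = "language";
}
```

With this layout, code that handles parse output has no content-level constants in scope through ParsePropertiesSketch, which is the semantic separation the comment argues for.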

> Standard metadata property names in the ParseData metadata
> ----------------------------------------------------------
>
>          Key: NUTCH-139
>          URL: http://issues.apache.org/jira/browse/NUTCH-139
>      Project: Nutch
>         Type: Improvement
>   Components: fetcher
>     Versions: 0.7.1, 0.7, 0.6, 0.7.2-dev, 0.8-dev
>  Environment: Power Mac OS X 10.4, Dual Processor G5 2.0 Ghz, 1.5 GB  RAM, although bug is independent of environment
>     Reporter: Chris A. Mattmann
>     Assignee: Chris A. Mattmann
>     Priority: Minor
>      Fix For: 0.7.2-dev, 0.8-dev, 0.7.1, 0.7, 0.6
>  Attachments: NUTCH-139.060105.patch


[jira] Commented: (NUTCH-139) Standard metadata property names in the ParseData metadata

Posted by "Andrzej Bialecki (JIRA)" <ji...@apache.org>.

    [ http://issues.apache.org/jira/browse/NUTCH-139?page=comments#action_12365623 ] 

Andrzej Bialecki  commented on NUTCH-139:
-----------------------------------------

I like this patch, the split of Metadata names into interfaces looks right. +1.

> Standard metadata property names in the ParseData metadata
> ----------------------------------------------------------
>
>          Key: NUTCH-139
>          URL: http://issues.apache.org/jira/browse/NUTCH-139
>      Project: Nutch
>         Type: Improvement
>   Components: fetcher
>     Versions: 0.7.1, 0.7, 0.6, 0.7.2-dev, 0.8-dev
>  Environment: Power Mac OS X 10.4, Dual Processor G5 2.0 Ghz, 1.5 GB  RAM, although bug is independent of environment
>     Reporter: Chris A. Mattmann
>     Assignee: Chris A. Mattmann
>     Priority: Minor
>      Fix For: 0.7.2-dev, 0.8-dev, 0.7.1, 0.7, 0.6
>  Attachments: NUTCH-139.060105.patch, NUTCH-139.060208.patch


[jira] Commented: (NUTCH-139) Standard metadata property names in the ParseData metadata

Posted by "Chris A. Mattmann (JIRA)" <ji...@apache.org>.

    [ http://issues.apache.org/jira/browse/NUTCH-139?page=comments#action_12360931 ] 

Chris A. Mattmann commented on NUTCH-139:
-----------------------------------------

Hmm, 

 Okay, I just finished reading the rest of the comments :-) Sorry, I just woke up out here in Los Angeles. I think I understand what you guys are getting at. X-nutch should be the "reliable" metadata that we create, i.e., the metadata whose input and output dataflow Nutch controls. The other names, such as "Content-type" and "Content-length", are written at the protocol layer, and Nutch doesn't control how they are written into the properties objects, so those names should be left alone, no? Is that the gist of it? So, you guys propose we would have something like:

//some protocol layer plugin
...
String contentType = getHeader("Content-type");

//CONTENT_TYPE = "X-nutch-content-type";
propertiesObject.put(CONTENT_TYPE, contentType);

?

If this is the case, then I would still point out that there are metadata names like "Content-type" whose reading would be good to standardize at the protocol level (shameless self plug of what I already did ;) ). Since they aren't prefixed with X-nutch, these other metadata names could live in some other class or something, but I think it's still important to standardize the code at the protocol level, if nothing else. So the above example would become:

//some protocol layer plugin
...
String contentType = getHeader(PROTO_LYR_READ_ONLY_CONTENT_TYPE); //you're such a non-standard property, aren't you

//CONTENT_TYPE = "X-nutch-content-type";
propertiesObject.put(CONTENT_TYPE, contentType);


So, I think it's a good compromise to standardize not only on what we read and write, but also on what we only read. Of course, I'm open to comments on this. :-)
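
A minimal sketch of that compromise, under stated assumptions: the class names ProtocolHeaderNames, NutchMetadataNames and ProtocolPluginSketch are hypothetical, a plain Map stands in for the properties object, and none of this is taken from the attached patches. It only shows one set of constants for names that are read at the protocol layer and another for the X-nutch names that Nutch writes.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical holder for header names that Nutch only reads at the protocol layer.
final class ProtocolHeaderNames {
    static final String CONTENT_TYPE = "Content-type";
    static final String CONTENT_LENGTH = "Content-length";

    private ProtocolHeaderNames() {}
}

// Hypothetical holder for the names that Nutch itself writes and later reads.
final class NutchMetadataNames {
    static final String CONTENT_TYPE = "X-nutch-content-type";

    private NutchMetadataNames() {}
}

final class ProtocolPluginSketch {
    // Reads the raw header by its standard name and stores it under the X-nutch name.
    static Map<String, String> toNutchProperties(Map<String, String> headers) {
        Map<String, String> properties = new HashMap<>();
        String contentType = headers.get(ProtocolHeaderNames.CONTENT_TYPE);
        if (contentType != null) {
            properties.put(NutchMetadataNames.CONTENT_TYPE, contentType);
        }
        return properties;
    }
}
```

With this layout, a typo in a header name becomes a compile error at the point of use rather than a silent null at runtime.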


> Standard metadata property names in the ParseData metadata
> ----------------------------------------------------------
>
>          Key: NUTCH-139
>          URL: http://issues.apache.org/jira/browse/NUTCH-139
>      Project: Nutch
>         Type: Improvement
>   Components: fetcher
>     Versions: 0.7.1, 0.7, 0.6, 0.7.2-dev, 0.8-dev
>  Environment: Power Mac OS X 10.4, Dual Processor G5 2.0 Ghz, 1.5 GB  RAM, although bug is independent of environment
>     Reporter: Chris A. Mattmann
>     Assignee: Chris A. Mattmann
>     Priority: Minor
>      Fix For: 0.7.2-dev, 0.8-dev, 0.7.1, 0.7, 0.6
>  Attachments: NUTCH-139.Mattmann.patch.txt, NUTCH-139.jc.review.patch.txt

[jira] Commented: (NUTCH-139) Standard metadata property names in the ParseData metadata

Posted by "Andrzej Bialecki (JIRA)" <ji...@apache.org>.

    [ http://issues.apache.org/jira/browse/NUTCH-139?page=comments#action_12360933 ] 

Andrzej Bialecki  commented on NUTCH-139:
-----------------------------------------

I like Jerome's proposal of using the new ContentProperties class; this could save a lot of work, especially this naming mess - and it seems to satisfy all requirements (having standard metadata names, preserving the original metadata, enabling Nutch to "shadow" the original values with the authoritative values).

A small comment, though: plugins that detect malformed or misnamed values should copy them to standard names. E.g. if a header name should be "Content-Type", and the real header name is "ContentType", then the plugin should copy this value to the proper name.
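
The copy step described above could look roughly like the following sketch. HeaderNameNormalizer and its matching rule (compare names after dropping case and every non-alphanumeric character) are assumptions made for illustration; they are not the behaviour of ContentProperties or of any existing plugin.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper: copies a misnamed Content-Type header to the standard name.
final class HeaderNameNormalizer {
    static final String CONTENT_TYPE = "Content-Type";

    // Reduces a name to lowercase letters and digits, so that "ContentType",
    // "content-type" and "CONTENT_TYPE" all compare equal.
    static String canonicalKey(String name) {
        return name.toLowerCase(Locale.ROOT).replaceAll("[^a-z0-9]", "");
    }

    // Returns a copy of the headers. If the standard name is absent but a
    // misnamed variant exists, its value is copied (not moved) to the
    // standard name, so the original entry is preserved.
    static Map<String, String> withStandardContentType(Map<String, String> headers) {
        Map<String, String> result = new LinkedHashMap<>(headers);
        if (result.containsKey(CONTENT_TYPE)) {
            return result;
        }
        String wanted = canonicalKey(CONTENT_TYPE);
        for (Map.Entry<String, String> entry : headers.entrySet()) {
            if (canonicalKey(entry.getKey()).equals(wanted)) {
                result.put(CONTENT_TYPE, entry.getValue());
                break;
            }
        }
        return result;
    }
}
```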

> Standard metadata property names in the ParseData metadata
> ----------------------------------------------------------
>
>          Key: NUTCH-139
>          URL: http://issues.apache.org/jira/browse/NUTCH-139
>      Project: Nutch
>         Type: Improvement
>   Components: fetcher
>     Versions: 0.7.1, 0.7, 0.6, 0.7.2-dev, 0.8-dev
>  Environment: Power Mac OS X 10.4, Dual Processor G5 2.0 Ghz, 1.5 GB  RAM, although bug is independent of environment
>     Reporter: Chris A. Mattmann
>     Assignee: Chris A. Mattmann
>     Priority: Minor
>      Fix For: 0.7.2-dev, 0.8-dev, 0.7.1, 0.7, 0.6
>  Attachments: NUTCH-139.Mattmann.patch.txt, NUTCH-139.jc.review.patch.txt

[jira] Commented: (NUTCH-139) Standard metadata property names in the ParseData metadata

Posted by "Chris A. Mattmann (JIRA)" <ji...@apache.org>.

    [ http://issues.apache.org/jira/browse/NUTCH-139?page=comments#action_12364116 ] 

Chris A. Mattmann commented on NUTCH-139:
-----------------------------------------

Just to add to Jerome's last comment, I think the key here is simplicity. As a software developer, and ultimately as an end user of Nutch, I identified the issue that there were several places where a developer has to remember the exact string used in a particular piece of code hidden under layers of OO abstraction, etc., just to get the value of a metadata property returned from the protocol layer. For example, did you know that in order to get the content encoding at the protocol level, you have to use the EXACT string "Content-Encoding", not "ContentEncoding" or "ContenT-ENcodING", etc., but "Content-Encoding"? There are numerous other examples at the protocol level, such as "Content-Length" and "Content-Type" (even though I'm sure Doug hates that example by now :-) ). The whole point is that if you go look at the protocol-level plugins, they all read these properties and in some cases write them to a metadata map. So why, as a writer of a protocol-layer plugin, should I have to worry about the exact format of the String to get the "Content-Encoding" from the protocol layer? Wouldn't it be nice to standardize on public static final Strings and then reference them, instead of replicating them across the protocol plugins?

So, instead of having within protocol-http/HttpResponse.java:

    String contentLengthString = (String)headers.get("Content-Length");

and then in protocol-file/FileResponse.java

    hdrs.put("Content-Length", new Long(size).toString());

wouldn't it be nice to have a public static final String CONTENT_LENGTH = "Content-Length", and then replace the hard-coded strings in the protocol plugin code? So the above becomes:

protocol-http/HttpResponse.java:

    String contentLengthString = (String)headers.get(CONTENT_LENGTH);

protocol-file/FileResponse.java

    hdrs.put(CONTENT_LENGTH,  new Long(size).toString());

Of course, that's just one layer of the issue. As we've all identified, these so-called "magic" strings exist at the parsing layers too. For example, in the rtf parser there are * 17 * of these magic strings, ranging from "Security" to "Last-Save-Date" to "Last-Printed". Of course it would be naive to put every single metadata string that is written to or read from a Map in the parsing and protocol layers of Nutch into a single monolithic metadata class, but in the end there are several standard metadata properties (* cough cough Dublin Core *) that deserve such first-class status, along with certain other commonly used metadata properties at each respective layer, protocol and parsing.

I believe that the purpose of this patch should be not only to provide an extensible Metadata class, but also to cover the simple stuff. And let's not turn this issue into 993939393 different things that need to be done. It should be phased into several capabilities, and the first phase would be providing a container of standard metadata names at the protocol and parsing layers, which Jerome and I are working towards. I guess what I'm trying to advocate is that we not bury this issue by adding a million things to it, making it so difficult to complete that it never gets completed and accepted. Let's just keep it focused and simple, because in the end, as a user of Nutch and as a software developer, I think it is very time-saving and helpful to have common Strings defined in one place, or a few places, rather than spread out across 20 or 30 classes, where you have to inspect each class to find out the exact way to read/write a String to make stuff work. That's all I'm saying.

> Standard metadata property names in the ParseData metadata
> ----------------------------------------------------------
>
>          Key: NUTCH-139
>          URL: http://issues.apache.org/jira/browse/NUTCH-139
>      Project: Nutch
>         Type: Improvement
>   Components: fetcher
>     Versions: 0.7.1, 0.7, 0.6, 0.7.2-dev, 0.8-dev
>  Environment: Power Mac OS X 10.4, Dual Processor G5 2.0 Ghz, 1.5 GB  RAM, although bug is independent of environment
>     Reporter: Chris A. Mattmann
>     Assignee: Chris A. Mattmann
>     Priority: Minor
>      Fix For: 0.7.2-dev, 0.8-dev, 0.7.1, 0.7, 0.6
>  Attachments: NUTCH-139.060105.patch, NUTCH-139.Mattmann.patch.txt, NUTCH-139.jc.review.patch.txt

[jira] Commented: (NUTCH-139) Standard metadata property names in the ParseData metadata

Posted by "Jerome Charron (JIRA)" <ji...@apache.org>.

    [ http://issues.apache.org/jira/browse/NUTCH-139?page=comments#action_12365103 ] 

Jerome Charron commented on NUTCH-139:
--------------------------------------

> except for the sake of purity of OO approach
Andrzej, as you have certainly noticed, that is my weakness...   ;-)
You know, I am still tempted to split the metadata constants into several interfaces (DublinCore, HttpHeaders, ...)
;-)
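
For readers wondering what such a split might look like, here is a rough sketch. Only the interface names DublinCore and HttpHeaders come from the comment above; the constants chosen, their string values, and the combining MetadataNames interface are assumptions for illustration.

```java
// Hypothetical: Dublin Core element names used by the parsing layer.
interface DublinCore {
    String TITLE = "title";
    String CREATOR = "creator";
    String LANGUAGE = "language";
}

// Hypothetical: HTTP header names used by the protocol layer.
interface HttpHeaders {
    String CONTENT_TYPE = "Content-Type";
    String CONTENT_LENGTH = "Content-Length";
    String CONTENT_ENCODING = "Content-Encoding";
}

// Hypothetical combined view, so callers that want every name
// can still reference a single type.
interface MetadataNames extends DublinCore, HttpHeaders {
}
```

Fields declared in a Java interface are implicitly public, static and final, so MetadataNames.CONTENT_LENGTH resolves to the constant inherited from HttpHeaders.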


> Standard metadata property names in the ParseData metadata
> ----------------------------------------------------------
>
>          Key: NUTCH-139
>          URL: http://issues.apache.org/jira/browse/NUTCH-139
>      Project: Nutch
>         Type: Improvement
>   Components: fetcher
>     Versions: 0.7.1, 0.7, 0.6, 0.7.2-dev, 0.8-dev
>  Environment: Power Mac OS X 10.4, Dual Processor G5 2.0 Ghz, 1.5 GB  RAM, although bug is independent of environment
>     Reporter: Chris A. Mattmann
>     Assignee: Chris A. Mattmann
>     Priority: Minor
>      Fix For: 0.7.2-dev, 0.8-dev, 0.7.1, 0.7, 0.6
>  Attachments: NUTCH-139.060105.patch

[jira] Commented: (NUTCH-139) Standard metadata property names in the ParseData metadata

Posted by "Jerome Charron (JIRA)" <ji...@apache.org>.

    [ http://issues.apache.org/jira/browse/NUTCH-139?page=comments#action_12360902 ] 

Jerome Charron commented on NUTCH-139:
--------------------------------------

Andrzej,

Thanks for taking the time to look at the patch.
In fact, Chris and I have had some discussion about this point
(that's why I didn't commit the patch directly; I already had some doubts about it).

I will check right now how to handle things in this way.


> Standard metadata property names in the ParseData metadata
> ----------------------------------------------------------
>
>          Key: NUTCH-139
>          URL: http://issues.apache.org/jira/browse/NUTCH-139
>      Project: Nutch
>         Type: Improvement
>   Components: fetcher
>     Versions: 0.7.1, 0.7, 0.6, 0.7.2-dev, 0.8-dev
>  Environment: Power Mac OS X 10.4, Dual Processor G5 2.0 Ghz, 1.5 GB  RAM, although bug is independent of environment
>     Reporter: Chris A. Mattmann
>     Assignee: Chris A. Mattmann
>     Priority: Minor
>      Fix For: 0.7.2-dev, 0.8-dev, 0.7.1, 0.7, 0.6
>  Attachments: NUTCH-139.Mattmann.patch.txt, NUTCH-139.jc.review.patch.txt

[jira] Commented: (NUTCH-139) Standard metadata property names in the ParseData metadata

Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.

    [ http://issues.apache.org/jira/browse/NUTCH-139?page=comments#action_12365619 ] 

Doug Cutting commented on NUTCH-139:
------------------------------------

+1  This looks great.  Thanks for all the hard work on this one!

> Standard metadata property names in the ParseData metadata
> ----------------------------------------------------------
>
>          Key: NUTCH-139
>          URL: http://issues.apache.org/jira/browse/NUTCH-139
>      Project: Nutch
>         Type: Improvement
>   Components: fetcher
>     Versions: 0.7.1, 0.7, 0.6, 0.7.2-dev, 0.8-dev
>  Environment: Power Mac OS X 10.4, Dual Processor G5 2.0 Ghz, 1.5 GB  RAM, although bug is independent of environment
>     Reporter: Chris A. Mattmann
>     Assignee: Chris A. Mattmann
>     Priority: Minor
>      Fix For: 0.7.2-dev, 0.8-dev, 0.7.1, 0.7, 0.6
>  Attachments: NUTCH-139.060105.patch, NUTCH-139.060208.patch

[jira] Commented: (NUTCH-139) Standard metadata property names in the ParseData metadata

Posted by "Andrzej Bialecki (JIRA)" <ji...@apache.org>.

    [ http://issues.apache.org/jira/browse/NUTCH-139?page=comments#action_12363394 ] 

Andrzej Bialecki  commented on NUTCH-139:
-----------------------------------------

Yes, I agree with the split into a generic MetaData container, and subclasses that define necessary constants for metadata names.

However, your proposal still doesn't address the key issue of having a set of "approved" or "final" values for metadata.

Example: in index-basic you need to index the title. It's not in the ContentProperties (wrong level) but in ParseProperties. A content parser may discover that the original title is empty or invalid. Still, this original value should be stored under the standard key "dc:title". But the parser knows best what to do with an invalid value (it is the ultimate authority), and it knows that the rest of Nutch really needs a meaningful value for the title, so it constructs one from the first line of the body text. However, with your proposed approach the parser cannot put it under the same key in ParseProperties (dc:title), because either it overwrites the original value, or it turns it into a simple multi-value property, as if the original metadata had multiple values.

That's why we insisted that the parser needs to use an "X-nutch-dc:title", so that it has a way to mark the final value that will remain in force for further processing. If you have a better way to achieve the same semantics, please explain.
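
The semantics asked for here can be sketched as follows. The class AuthoritativeValues, its method names, and the use of a plain Map are assumptions for illustration; only the "X-nutch-" prefix and the "dc:title" key come from the discussion.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper showing how a prefixed key can "shadow" an original value.
final class AuthoritativeValues {
    static final String PREFIX = "X-nutch-";
    static final String DC_TITLE = "dc:title";

    // Stores the parser's final value under the prefixed key, leaving the
    // original value under the plain key untouched.
    static void setFinal(Map<String, String> meta, String name, String value) {
        meta.put(PREFIX + name, value);
    }

    // Returns the final value if the parser set one, else the original value.
    static String getFinal(Map<String, String> meta, String name) {
        String value = meta.get(PREFIX + name);
        return value != null ? value : meta.get(name);
    }
}
```

A parser that finds an empty title would keep it under "dc:title" and call setFinal with the title built from the first line of the body; later stages read through getFinal and never need to know which case occurred.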

> Standard metadata property names in the ParseData metadata
> ----------------------------------------------------------
>
>          Key: NUTCH-139
>          URL: http://issues.apache.org/jira/browse/NUTCH-139
>      Project: Nutch
>         Type: Improvement
>   Components: fetcher
>     Versions: 0.7.1, 0.7, 0.6, 0.7.2-dev, 0.8-dev
>  Environment: Power Mac OS X 10.4, Dual Processor G5 2.0 Ghz, 1.5 GB  RAM, although bug is independent of environment
>     Reporter: Chris A. Mattmann
>     Assignee: Chris A. Mattmann
>     Priority: Minor
>      Fix For: 0.7.2-dev, 0.8-dev, 0.7.1, 0.7, 0.6
>  Attachments: NUTCH-139.060105.patch, NUTCH-139.Mattmann.patch.txt, NUTCH-139.jc.review.patch.txt

[jira] Commented: (NUTCH-139) Standard metadata property names in the ParseData metadata

Posted by "Jerome Charron (JIRA)" <ji...@apache.org>.

    [ http://issues.apache.org/jira/browse/NUTCH-139?page=comments#action_12362618 ] 

Jerome Charron commented on NUTCH-139:
--------------------------------------

Here is a new proposal for this issue.

org.apache.nutch.util.MetaData
  * becomes a utility class that is only a container of multi-valued, typo-tolerant String properties (using the same kind of API as JavaMail: the add / set methods mentioned by Doug, which are already implemented in the current patch).
  * There are no more metadata name constants in this class, since it becomes a generic object for storing String/String[] mappings


org.apache.nutch.protocol.ContentProperties
  * This class simply extends the MetaData class
  * It defines the content related constants (Content-Type, and so on)

org.apache.nutch.parse.ParseProperties
  * This class simply extends the MetaData class
  * It defines the parse related constants (Dublin Core constants)

org.apache.nutch.parse.ParseData
  * The constructor becomes ParseData(ParseStatus, String, Outlink[], ContentProperties)
  * This class holds two metadata sets : 
     1. ContentProperties for the original metadata set which came from protocol
     2. ParseProperties for the parse metadata set.

  * This class provides 3 ways to retrieve a metadata value:
    1. public ContentProperties getContentMeta();
    2. public ParseProperties getParseMeta();
    3. public MetaData getMetaData(); // Returns a mix of the two previous ones, where values in parse properties override those in content properties.

In all parsers implementations:
* Remove the copying of content metadata to parse metadata.

From my point of view the key benefits are:
  1. Provide a clear separation between content metadata and parse metadata.
  2. Metadata names are defined at the right places.
  3. Keeps the advantage of metadata name normalization and syntax correction
  4. An easy mapping between the content metadata names and parse metadata names (both can use the real name of the metadata, without adding an artificial X-Nutch prefix for parse metadata names)
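
The class layout proposed above can be sketched as below. This is an illustration under assumptions, not the code in the patch: the typo-tolerant name matching is left out, ParseDataSketch stands in for ParseData with a reduced constructor, and the putAll helper, the package-private visibility and the constant values are invented for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Generic multi-valued String container (name normalization omitted in this sketch).
class MetaData {
    private final Map<String, List<String>> values = new LinkedHashMap<>();

    // Appends a value, keeping any values already stored under the name.
    void add(String name, String value) {
        values.computeIfAbsent(name, k -> new ArrayList<>()).add(value);
    }

    // Replaces every value stored under the name with a single one.
    void set(String name, String value) {
        values.put(name, new ArrayList<>(Arrays.asList(value)));
    }

    // Returns the first value for the name, or null if there is none.
    String get(String name) {
        List<String> v = values.get(name);
        return (v == null || v.isEmpty()) ? null : v.get(0);
    }

    String[] getValues(String name) {
        List<String> v = values.get(name);
        return v == null ? new String[0] : v.toArray(new String[0]);
    }

    // Hypothetical helper: copies all entries of other, replacing values on name clashes.
    void putAll(MetaData other) {
        for (Map.Entry<String, List<String>> e : other.values.entrySet()) {
            values.put(e.getKey(), new ArrayList<>(e.getValue()));
        }
    }
}

class ContentProperties extends MetaData {
    static final String CONTENT_TYPE = "Content-Type";
}

class ParseProperties extends MetaData {
    static final String TITLE = "title"; // assumed Dublin Core style name
}

class ParseDataSketch {
    private final ContentProperties contentMeta;
    private final ParseProperties parseMeta = new ParseProperties();

    ParseDataSketch(ContentProperties contentMeta) {
        this.contentMeta = contentMeta;
    }

    ContentProperties getContentMeta() { return contentMeta; }

    ParseProperties getParseMeta() { return parseMeta; }

    // Merged view: parse values override content values with the same name.
    MetaData getMetaData() {
        MetaData merged = new MetaData();
        merged.putAll(contentMeta);
        merged.putAll(parseMeta);
        return merged;
    }
}
```

Because the merged view is built on demand, the original protocol values remain available through getContentMeta() even after a parser overrides them.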


Comments are welcome.

Jérôme

> Standard metadata property names in the ParseData metadata
> ----------------------------------------------------------
>
>          Key: NUTCH-139
>          URL: http://issues.apache.org/jira/browse/NUTCH-139
>      Project: Nutch
>         Type: Improvement
>   Components: fetcher
>     Versions: 0.7.1, 0.7, 0.6, 0.7.2-dev, 0.8-dev
>  Environment: Power Mac OS X 10.4, Dual Processor G5 2.0 Ghz, 1.5 GB  RAM, although bug is independent of environment
>     Reporter: Chris A. Mattmann
>     Assignee: Chris A. Mattmann
>     Priority: Minor
>      Fix For: 0.7.2-dev, 0.8-dev, 0.7.1, 0.7, 0.6
>  Attachments: NUTCH-139.060105.patch, NUTCH-139.Mattmann.patch.txt, NUTCH-139.jc.review.patch.txt

[jira] Updated: (NUTCH-139) Standard metadata property names in the ParseData metadata

Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.

     [ http://issues.apache.org/jira/browse/NUTCH-139?page=all ]

Doug Cutting updated NUTCH-139:
-------------------------------

    Attachment:     (was: NUTCH-139.Mattmann.patch.txt)

> Standard metadata property names in the ParseData metadata
> ----------------------------------------------------------
>
>          Key: NUTCH-139
>          URL: http://issues.apache.org/jira/browse/NUTCH-139
>      Project: Nutch
>         Type: Improvement
>   Components: fetcher
>     Versions: 0.7.1, 0.7, 0.6, 0.7.2-dev, 0.8-dev
>  Environment: Power Mac OS X 10.4, Dual Processor G5 2.0 Ghz, 1.5 GB  RAM, although bug is independent of environment
>     Reporter: Chris A. Mattmann
>     Assignee: Chris A. Mattmann
>     Priority: Minor
>      Fix For: 0.7.2-dev, 0.8-dev, 0.7.1, 0.7, 0.6
>  Attachments: NUTCH-139.060105.patch
>

[jira] Commented: (NUTCH-139) Standard metadata property names in the ParseData metadata

Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.

    [ http://issues.apache.org/jira/browse/NUTCH-139?page=comments#action_12362003 ] 

Doug Cutting commented on NUTCH-139:
------------------------------------

Also, since the primary use of multiple metadata values should be for protocols where multiple values are required, the method to add a value should be different from the method to set a value.  I commented on this before when multiple values were added: there should be separate add(String,String) and set(String,String) methods.  The former should be used, e.g., by HTTP when storing headers, and the latter should be used, e.g., when setting x-nutch values.
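
A minimal sketch of the add/set distinction described above. The class name MultiValuedMetadata and its internals are hypothetical and only illustrate the idea; this is not the actual Nutch metadata class.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical multi-valued metadata container (an assumption for illustration).
class MultiValuedMetadata {
    private final Map<String, List<String>> values = new LinkedHashMap<>();

    // add: append a value and keep any values already stored under the name,
    // e.g. for protocol headers that may legitimately repeat.
    void add(String name, String value) {
        values.computeIfAbsent(name, k -> new ArrayList<>()).add(value);
    }

    // set: replace whatever is stored under the name with a single value,
    // e.g. for x-nutch values that should have exactly one value.
    void set(String name, String value) {
        List<String> single = new ArrayList<>();
        single.add(value);
        values.put(name, single);
    }

    // get: the first value stored under the name, or null if there is none.
    String get(String name) {
        List<String> list = values.get(name);
        return (list == null || list.isEmpty()) ? null : list.get(0);
    }

    // getValues: a copy of all values stored under the name (possibly empty).
    List<String> getValues(String name) {
        List<String> list = values.get(name);
        return list == null ? new ArrayList<>() : new ArrayList<>(list);
    }
}
```

With this shape, two add("Set-Cookie", ...) calls keep both values, while two set("X-nutch-lang", ...) calls leave only the last one.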


> Standard metadata property names in the ParseData metadata
> ----------------------------------------------------------
>
>          Key: NUTCH-139
>          URL: http://issues.apache.org/jira/browse/NUTCH-139
>      Project: Nutch
>         Type: Improvement
>   Components: fetcher
>     Versions: 0.7.1, 0.7, 0.6, 0.7.2-dev, 0.8-dev
>  Environment: Power Mac OS X 10.4, Dual Processor G5 2.0 Ghz, 1.5 GB  RAM, although bug is independent of environment
>     Reporter: Chris A. Mattmann
>     Assignee: Chris A. Mattmann
>     Priority: Minor
>      Fix For: 0.7.2-dev, 0.8-dev, 0.7.1, 0.7, 0.6
>  Attachments: NUTCH-139.060105.patch, NUTCH-139.Mattmann.patch.txt, NUTCH-139.jc.review.patch.txt
>

[jira] Commented: (NUTCH-139) Standard metadata property names in the ParseData metadata

Posted by "Chris A. Mattmann (JIRA)" <ji...@apache.org>.

    [ http://issues.apache.org/jira/browse/NUTCH-139?page=comments#action_12363352 ] 

Chris A. Mattmann commented on NUTCH-139:
-----------------------------------------

Hi Jerome,

  
>org.apache.nutch.parse.ParseData
>  * The constructor becomes ParseData(ParseStatus, String, Outlink[], ContentProperties)
>  * This class holds two metadata sets :
>     1. ContentProperties for the original metadata set which came from protocol
>     2. MetaDataProperties for the parse metadata set. 

Wouldn't the constructor be:

ParseData(ParseStatus, String, Outlink[], ContentProperties, ParseProperties)

?

Other than that, this sounds wonderful! I will contact you off list to get started on implementing this, if everyone else involved agrees. Doug, Andrzej?



> Standard metadata property names in the ParseData metadata
> ----------------------------------------------------------
>
>          Key: NUTCH-139
>          URL: http://issues.apache.org/jira/browse/NUTCH-139
>      Project: Nutch
>         Type: Improvement
>   Components: fetcher
>     Versions: 0.7.1, 0.7, 0.6, 0.7.2-dev, 0.8-dev
>  Environment: Power Mac OS X 10.4, Dual Processor G5 2.0 Ghz, 1.5 GB  RAM, although bug is independent of environment
>     Reporter: Chris A. Mattmann
>     Assignee: Chris A. Mattmann
>     Priority: Minor
>      Fix For: 0.7.2-dev, 0.8-dev, 0.7.1, 0.7, 0.6
>  Attachments: NUTCH-139.060105.patch, NUTCH-139.Mattmann.patch.txt, NUTCH-139.jc.review.patch.txt
>

[jira] Commented: (NUTCH-139) Standard metadata property names in the ParseData metadata

Posted by "Andrzej Bialecki (JIRA)" <ji...@apache.org>.

    [ http://issues.apache.org/jira/browse/NUTCH-139?page=comments#action_12360901 ] 

Andrzej Bialecki  commented on NUTCH-139:
-----------------------------------------

I have an objection; in fact, I think the patches miss the main point of using prefixed property names.

In this patch only some of the property names, specifically those corresponding to the Dublin Core, are prefixed with PREFIX. Why? The original reason for introducing the prefix was this: as Nutch processes the raw data, it extracts certain metadata, either directly or using heuristics (like with LANG or content type). In order to distinguish these values from the original raw values, the metadata processed by Nutch was to be prefixed by "X-nutch-", and all other metadata that we don't use was to be left alone as it was.

So, e.g. the Content-Type metadata is sometimes wrong. Nutch checks this with e.g. the mime-type detection plugin, and it should put the final value of Content-Type in metadata - but under the name of "X-nutch-Content-Type", in order to avoid overwriting the original value (Chris's comment in MSWordParser.java reflects this doubt - that's the reason for prefixing).

Now, this convention is not followed in the patches. E.g. LANG is missing (should be PREFIX + "lang"). CharEncodingForConversion doesn't have a prefix either. Properties extracted in plugins (e.g. msword, zip, file, etc) are put under the standard, non-prefixed names, thus overwriting the original values.
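
A minimal sketch of the prefix convention described above, assuming plain string maps for the metadata. The class PrefixedNames and the method recordDetectedType are hypothetical names used only for illustration; the "X-nutch-" prefix and the "lang" name come from this comment.

```java
import java.util.Map;

// Hypothetical holder for prefixed names (an assumption, not the patch itself).
class PrefixedNames {
    static final String PREFIX = "X-nutch-";

    // Raw name as it arrives from the protocol headers.
    static final String ORIGINAL_CONTENT_TYPE = "Content-Type";

    // Names for values that Nutch extracted or corrected itself.
    static final String CONTENT_TYPE = PREFIX + "Content-Type";
    static final String LANG = PREFIX + "lang";

    // Store the detected type under the prefixed name so that the
    // original header value is left untouched.
    static void recordDetectedType(Map<String, String> metadata, String detected) {
        metadata.put(CONTENT_TYPE, detected);
    }
}
```

After recordDetectedType runs, the map holds both the server-supplied "Content-Type" and the corrected "X-nutch-Content-Type", so neither value overwrites the other.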

> Standard metadata property names in the ParseData metadata
> ----------------------------------------------------------
>
>          Key: NUTCH-139
>          URL: http://issues.apache.org/jira/browse/NUTCH-139
>      Project: Nutch
>         Type: Improvement
>   Components: fetcher
>     Versions: 0.7.1, 0.7, 0.6, 0.7.2-dev, 0.8-dev
>  Environment: Power Mac OS X 10.4, Dual Processor G5 2.0 Ghz, 1.5 GB  RAM, although bug is independent of environment
>     Reporter: Chris A. Mattmann
>     Assignee: Chris A. Mattmann
>     Priority: Minor
>      Fix For: 0.7.2-dev, 0.8-dev, 0.7.1, 0.7, 0.6
>  Attachments: NUTCH-139.Mattmann.patch.txt, NUTCH-139.jc.review.patch.txt
>

[jira] Commented: (NUTCH-139) Standard metadata property names in the ParseData metadata

Posted by "Jerome Charron (JIRA)" <ji...@apache.org>.

    [ http://issues.apache.org/jira/browse/NUTCH-139?page=comments#action_12360906 ] 

Jerome Charron commented on NUTCH-139:
--------------------------------------

Andrzej,

Here are more comments about my doubts, and how to handle metadata names.

If, for instance, a protocol plugin doesn't have any Content-Length information (no header, as in FTP), then it should compute the content length and add it in the X-nutch-content-length attribute.
But what do you suggest if a protocol has a Content-Length header (HTTP may provide one)?
My feeling is that both metadata values should be added:
1. One for the Content-Length header in the Content-Length attribute
2. One for the real Content-Length (computed) in the X-nutch-content-length attribute.

In other words and more generally:
* When adding a native protocol header, if an equivalent x-nutch attribute exists in MetadataNames, then it must be added too, with the same value or with a more precise value.
* If no header information is available, the protocol level should fill in as many x-nutch attributes as it can.

Do you agree with that?
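
A minimal sketch of the two rules proposed above, applied to the content-length case. The class ContentLengthProposal and the method describe are hypothetical; this only illustrates the proposal in this comment and is not code from any of the patches.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration of the proposal (an assumption, not Nutch code).
class ContentLengthProposal {
    static final String HEADER = "Content-Length";
    static final String COMPUTED = "X-nutch-content-length";

    // headerValue is the native protocol header, or null when the protocol
    // provides none (the FTP case mentioned above).
    static Map<String, String> describe(String headerValue, byte[] fetched) {
        Map<String, String> metadata = new HashMap<>();
        if (headerValue != null) {
            // Keep the native header under its original name.
            metadata.put(HEADER, headerValue);
        }
        // Always fill the x-nutch attribute with the computed length.
        metadata.put(COMPUTED, Integer.toString(fetched.length));
        return metadata;
    }
}
```

When the header is wrong or the content was cut short, the two entries differ, which is exactly the case the second attribute is meant to expose.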



> Standard metadata property names in the ParseData metadata
> ----------------------------------------------------------
>
>          Key: NUTCH-139
>          URL: http://issues.apache.org/jira/browse/NUTCH-139
>      Project: Nutch
>         Type: Improvement
>   Components: fetcher
>     Versions: 0.7.1, 0.7, 0.6, 0.7.2-dev, 0.8-dev
>  Environment: Power Mac OS X 10.4, Dual Processor G5 2.0 Ghz, 1.5 GB  RAM, although bug is independent of environment
>     Reporter: Chris A. Mattmann
>     Assignee: Chris A. Mattmann
>     Priority: Minor
>      Fix For: 0.7.2-dev, 0.8-dev, 0.7.1, 0.7, 0.6
>  Attachments: NUTCH-139.Mattmann.patch.txt, NUTCH-139.jc.review.patch.txt
>

[jira] Commented: (NUTCH-139) Standard metadata property names in the ParseData metadata

Posted by "Chris A. Mattmann (JIRA)" <ji...@apache.org>.

    [ http://issues.apache.org/jira/browse/NUTCH-139?page=comments#action_12360681 ] 

Chris A. Mattmann commented on NUTCH-139:
-----------------------------------------

Hi Doug, Jerome, 

>I'm confused as to why all of the constant names have "X_nutch" in them. I'd expect to see something like that in their string values, but their 
> names are already qualified by org.apache.nutch.ParseData, no? 

Err, whoops. My fault. I misinterpreted what Andrzej was saying in his comment:

"I agree, too. Perhaps we should use the names as they appear in the Dublin Core for those properties that are defined there - just prepended them with "X-nutch-" in order to avoid name-clashes with other properties (e.g. blindly copied from the protocol headers). " 

I'll fix this right quick.

>Also, it would be easier if these were all defined in an interface, something like MetadataNames. That way a class can "implement" that interface 
>and then simply use the short names in code, e.g. CONTENT_TYPE, AUTHOR, etc.

Yuppers, I agree on this one too. In fact, while I was making the patch, I was thinking in my head ("hey, this would probably be a good idea to have in its own interface class..."), but since no one objected to my initial proposition to the dev list to put them into ParseData, I just put them there. So, yeah, I'll fix this right quick as well.
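
A minimal sketch of the interface idea quoted above. The interface name MetadataNames comes from the quoted suggestion; the two constants and their values are taken from the issue description, and the class ExampleParser is hypothetical.

```java
import java.util.Map;

// Constants in an interface are implicitly public, static and final.
interface MetadataNames {
    String CONTENT_TYPE = "content-type";
    String CREATOR = "creator";
}

// Hypothetical class that implements the interface so it can use
// the short names without qualifying them.
class ExampleParser implements MetadataNames {
    String describe(Map<String, String> metadata) {
        return CONTENT_TYPE + "=" + metadata.get(CONTENT_TYPE);
    }
}
```

Code that does not implement the interface can still refer to the same names as MetadataNames.CONTENT_TYPE.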

Updated patch...on its way!


> Standard metadata property names in the ParseData metadata
> ----------------------------------------------------------
>
>          Key: NUTCH-139
>          URL: http://issues.apache.org/jira/browse/NUTCH-139
>      Project: Nutch
>         Type: Improvement
>   Components: fetcher
>     Versions: 0.7.1, 0.7, 0.6, 0.7.2-dev, 0.8-dev
>  Environment: Power Mac OS X 10.4, Dual Processor G5 2.0 Ghz, 1.5 GB  RAM, although bug is independent of environment
>     Reporter: Chris A. Mattmann
>     Assignee: Chris A. Mattmann
>     Priority: Minor
>      Fix For: 0.7.2-dev, 0.8-dev, 0.7.1, 0.7, 0.6
>  Attachments: NUTCH-139.Mattmann.patch.txt
>

[jira] Commented: (NUTCH-139) Standard metadata property names in the ParseData metadata

Posted by "Jerome Charron (JIRA)" <ji...@apache.org>.

    [ http://issues.apache.org/jira/browse/NUTCH-139?page=comments#action_12365095 ] 

Jerome Charron commented on NUTCH-139:
--------------------------------------

OK, Doug. Your point of view makes sense to me.
I hope I can provide a (final) patch next week.

> Standard metadata property names in the ParseData metadata
> ----------------------------------------------------------
>
>          Key: NUTCH-139
>          URL: http://issues.apache.org/jira/browse/NUTCH-139
>      Project: Nutch
>         Type: Improvement
>   Components: fetcher
>     Versions: 0.7.1, 0.7, 0.6, 0.7.2-dev, 0.8-dev
>  Environment: Power Mac OS X 10.4, Dual Processor G5 2.0 Ghz, 1.5 GB  RAM, although bug is independent of environment
>     Reporter: Chris A. Mattmann
>     Assignee: Chris A. Mattmann
>     Priority: Minor
>      Fix For: 0.7.2-dev, 0.8-dev, 0.7.1, 0.7, 0.6
>  Attachments: NUTCH-139.060105.patch
>

[jira] Commented: (NUTCH-139) Standard metadata property names in the ParseData metadata

Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.

    [ http://issues.apache.org/jira/browse/NUTCH-139?page=comments#action_12365089 ] 

Doug Cutting commented on NUTCH-139:
------------------------------------

Jerome: yes, it makes sense, but there's also metadata that's not tightly related to the protocol or the parser, e.g., the nutch segment that the page was fetched into and the score that's been assigned to the url.  I think we'd go crazy trying to divide the metadata up into categories, and that there's not much harm in stuffing it all in one bag.


> Standard metadata property names in the ParseData metadata
> ----------------------------------------------------------
>
>          Key: NUTCH-139
>          URL: http://issues.apache.org/jira/browse/NUTCH-139
>      Project: Nutch
>         Type: Improvement
>   Components: fetcher
>     Versions: 0.7.1, 0.7, 0.6, 0.7.2-dev, 0.8-dev
>  Environment: Power Mac OS X 10.4, Dual Processor G5 2.0 Ghz, 1.5 GB  RAM, although bug is independent of environment
>     Reporter: Chris A. Mattmann
>     Assignee: Chris A. Mattmann
>     Priority: Minor
>      Fix For: 0.7.2-dev, 0.8-dev, 0.7.1, 0.7, 0.6
>  Attachments: NUTCH-139.060105.patch
>

RE: [jira] Commented: (NUTCH-139) Standard metadata property names in the ParseData metadata

Posted by Chris Mattmann <ch...@baron.pagemewhen.com>.

Guys,

 My apologies for the spamming comments -- I tried to submit my comment
through JIRA one time and it kept giving me service unavailable. So I
resubmitted like 5 times, on the fifth time it finally went through -- but I
guess the other comments went through too. I'll try and remove them right
away.

 Sorry again.

Cheers,
  Chris


______________________________________________
Chris A. Mattmann
Chris.Mattmann@jpl.nasa.gov 
Staff Member
Modeling and Data Management Systems Section (387)
Data Management Systems and Technologies Group

_________________________________________________
Jet Propulsion Laboratory            Pasadena, CA
Office: 171-266B                        Mailstop:  171-246
_______________________________________________________

Disclaimer:  The opinions presented within are my own and do not reflect
those of either NASA, JPL, or the California Institute of Technology.


> -----Original Message-----
> From: Doug Cutting (JIRA) [mailto:jira@apache.org]
> Sent: Thursday, January 05, 2006 8:04 PM
> To: nutch-dev@incubator.apache.org
> Subject: [jira] Commented: (NUTCH-139) Standard metadata property names in
> the ParseData metadata
> 
>     [ http://issues.apache.org/jira/browse/NUTCH-
> 139?page=comments#action_12361922 ]
> 
> Doug Cutting commented on NUTCH-139:
> ------------------------------------
> 
> One more thing.  Content length should also not need to be stored in the
> metadata as an x-nutch value.  The content length is simply the length of
> the Content's data.  The protocol may have truncated the content, in which
> case perhaps we need an x-nutch-truncated-content metadata property or
> something, but we should not be overwriting the HTTP "Content-Length"
> header, nor should we trust that it reflects the length of the data
> actually fetched.
> 
> 
> > Standard metadata property names in the ParseData metadata
> > ----------------------------------------------------------
> >
> >          Key: NUTCH-139
> >          URL: http://issues.apache.org/jira/browse/NUTCH-139
> >      Project: Nutch
> >         Type: Improvement
> >   Components: fetcher
> >     Versions: 0.7.1, 0.7, 0.6, 0.7.2-dev, 0.8-dev
> >  Environment: Power Mac OS X 10.4, Dual Processor G5 2.0 Ghz, 1.5 GB
> RAM, although bug is independent of environment
> >     Reporter: Chris A. Mattmann
> >     Assignee: Chris A. Mattmann
> >     Priority: Minor
> >      Fix For: 0.7.2-dev, 0.8-dev, 0.7.1, 0.7, 0.6
> >  Attachments: NUTCH-139.060105.patch, NUTCH-139.Mattmann.patch.txt,
> NUTCH-139.jc.review.patch.txt
> >

Re: [jira] Commented: (NUTCH-139) Standard metadata property names in the ParseData metadata

Posted by Doug Cutting <cu...@nutch.org>.

Chris Mattmann wrote:
> I've tried removing the 5 copies of the comment, however I can't find a
> button on JIRA to remove comments. Maybe an administrator for Nutch can do
> it?

I removed the extra comments.  No problem.

Doug

RE: [jira] Commented: (NUTCH-139) Standard metadata property names in the ParseData metadata

Posted by Chris Mattmann <ch...@jpl.nasa.gov>.

Hi Folks,
 
  I've tried removing the 5 copies of the comment, however I can't find a
button on JIRA to remove comments. Maybe an administrator for Nutch can do
it? Anyways, the dang thing is running so slow right now, it may just have
to wait until the server stops returning the 503 service unavailable
messages. Sorry again...

Cheers,
  Chris


______________________________________________
Chris A. Mattmann
Chris.Mattmann@jpl.nasa.gov 
Staff Member
Modeling and Data Management Systems Section (387)
Data Management Systems and Technologies Group

_________________________________________________
Jet Propulsion Laboratory            Pasadena, CA
Office: 171-266B                        Mailstop:  171-246
_______________________________________________________

Disclaimer:  The opinions presented within are my own and do not reflect
those of either NASA, JPL, or the California Institute of Technology.


> -----Original Message-----
> From: chris.mattmann@jpl.nasa.gov [mailto:chris.mattmann@jpl.nasa.gov]
> Sent: Thursday, January 05, 2006 8:28 PM
> To: nutch-dev@lucene.apache.org
> Subject: RE: [jira] Commented: (NUTCH-139) Standard metadata property
> names in the ParseData metadata
> 
> Guys,
> 
>  My apologies for the spamming comments -- I tried to submit my comment
> through JIRA one time and it kept giving me service unavailable. So I
> resubmitted like 5 times, on the fifth time it finally went through -- but
> I
> guess the other comments went through too. I'll try and remove them right
> away.
> 
>  Sorry again.
> 
> Cheers,
>   Chris
> 
> 

RE: [jira] Commented: (NUTCH-139) Standard metadata property names in the ParseData metadata

Posted by ch...@jpl.nasa.gov.

Guys,

 My apologies for the spamming comments -- I tried to submit my comment
through JIRA one time and it kept giving me service unavailable. So I
resubmitted like 5 times, on the fifth time it finally went through -- but I
guess the other comments went through too. I'll try and remove them right
away.

 Sorry again.

Cheers,
  Chris


______________________________________________
Chris A. Mattmann
Chris.Mattmann@jpl.nasa.gov 
Staff Member
Modeling and Data Management Systems Section (387)
Data Management Systems and Technologies Group

_________________________________________________
Jet Propulsion Laboratory            Pasadena, CA
Office: 171-266B                        Mailstop:  171-246
_______________________________________________________

Disclaimer:  The opinions presented within are my own and do not reflect
those of either NASA, JPL, or the California Institute of Technology.


> -----Original Message-----
> From: Doug Cutting (JIRA) [mailto:jira@apache.org]
> Sent: Thursday, January 05, 2006 8:04 PM
> To: nutch-dev@incubator.apache.org
> Subject: [jira] Commented: (NUTCH-139) Standard metadata property names in
> the ParseData metadata
> 
>     [ http://issues.apache.org/jira/browse/NUTCH-
> 139?page=comments#action_12361922 ]
> 
> Doug Cutting commented on NUTCH-139:
> ------------------------------------
> 
> One more thing.  Content length should also not need to be stored in the
> metadata as an x-nutch value.  The content length is simply the length of
> the Content's data.  The protocol may have truncated the content, in which
> case perhaps we need an x-nutch-truncated-content metadata property or
> something, but we should not be overwriting the HTTP "Content-Length"
> header, nor should we trust that it reflects the length of the data
> actually fetched.
> 
> 

[jira] Commented: (NUTCH-139) Standard metadata property names in the ParseData metadata

Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.

    [ http://issues.apache.org/jira/browse/NUTCH-139?page=comments#action_12361922 ] 

Doug Cutting commented on NUTCH-139:
------------------------------------

One more thing.  Content length should also not need to be stored in the metadata as an x-nutch value.  The content length is simply the length of the Content's data.  The protocol may have truncated the content, in which case perhaps we need an x-nutch-truncated-content metadata property or something, but we should not be overwriting the HTTP "Content-Length" header, nor should we trust that it reflects the length of the data actually fetched.
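
A minimal Java sketch of this idea follows. It is an illustration only, not Nutch code: the class FetchedContent, its constructor, and its accessors are hypothetical, and the property name "x-nutch-truncated-content" is the one floated tentatively in the comment above. The length is taken from the bytes actually held, truncation is recorded as a separate property, and the HTTP "Content-Length" header is stored exactly as the server sent it.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a fetched document; not the actual Nutch Content class.
final class FetchedContent {
    // Property name suggested only tentatively in the comment; an assumption here.
    static final String TRUNCATED = "x-nutch-truncated-content";

    private final byte[] data;
    private final Map<String, String> metadata = new HashMap<>();

    FetchedContent(byte[] body, Map<String, String> httpHeaders, int maxLength) {
        // Keep the server's headers untouched, including "Content-Length".
        metadata.putAll(httpHeaders);
        boolean truncated = body.length > maxLength;
        this.data = truncated ? Arrays.copyOf(body, maxLength) : body.clone();
        // Record truncation separately instead of overwriting the header.
        metadata.put(TRUNCATED, Boolean.toString(truncated));
    }

    // The content length is simply the length of the data we hold.
    int length() {
        return data.length;
    }

    String getMetadata(String name) {
        return metadata.get(name);
    }

    public static void main(String[] args) {
        Map<String, String> headers = new HashMap<>();
        headers.put("Content-Length", "10");
        byte[] body = "0123456789".getBytes(StandardCharsets.US_ASCII);
        FetchedContent content = new FetchedContent(body, headers, 4);
        System.out.println("fetched length: " + content.length());
        System.out.println("header value:   " + content.getMetadata("Content-Length"));
        System.out.println("truncated:      " + content.getMetadata(TRUNCATED));
    }
}
```

With this layout a caller that needs the real size asks the content object, and a caller that wants to know what the server claimed reads the header; the two values are never conflated.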


> Standard metadata property names in the ParseData metadata
> ----------------------------------------------------------
>
>          Key: NUTCH-139
>          URL: http://issues.apache.org/jira/browse/NUTCH-139
>      Project: Nutch
>         Type: Improvement
>   Components: fetcher
>     Versions: 0.7.1, 0.7, 0.6, 0.7.2-dev, 0.8-dev
>  Environment: Power Mac OS X 10.4, Dual Processor G5 2.0 Ghz, 1.5 GB  RAM, although bug is independent of environment
>     Reporter: Chris A. Mattmann
>     Assignee: Chris A. Mattmann
>     Priority: Minor
>      Fix For: 0.7.2-dev, 0.8-dev, 0.7.1, 0.7, 0.6
>  Attachments: NUTCH-139.060105.patch, NUTCH-139.Mattmann.patch.txt, NUTCH-139.jc.review.patch.txt
>

[jira] Updated: (NUTCH-139) Standard metadata property names in the ParseData metadata

Posted by "Jerome Charron (JIRA)" <ji...@apache.org>.

     [ http://issues.apache.org/jira/browse/NUTCH-139?page=all ]

Jerome Charron updated NUTCH-139:
---------------------------------

    Attachment: NUTCH-139.060105.patch

Attached (NUTCH-139.060105.patch) is a new patch for this issue.
Thanks for reviewing it so that I can commit it as soon as possible.

Regards

Jérôme


[jira] Commented: (NUTCH-139) Standard metadata property names in the ParseData metadata

Posted by "Chris A. Mattmann (JIRA)" <ji...@apache.org>.

    [ http://issues.apache.org/jira/browse/NUTCH-139?page=comments#action_12361925 ] 

Chris A. Mattmann commented on NUTCH-139:
-----------------------------------------

Hi Doug,

  While it's true that content-length can be computed from the Content's data, wouldn't it also be nice to have it in the MetaData to save the time of recomputing it? Furthermore, even if the value isn't fully trustworthy because of the possibility of truncation, the way that Jerome and I have implemented it allows for a certain level of "trustworthiness", depending on which value of the multi-valued list you take when you request a named MetaData property. The values at the front of the list are less trustworthy, while the values at the end should be more trustworthy.

  I think that the issue you raise is an important one; however, it's more of a policy issue (i.e., how developers use the MetaData classes) than a limitation of the patch, no?

Cheers,
  Chris
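
The multi-valued lookup described in this comment can be sketched as follows. This is a minimal sketch, not the code from the patch: the class MultiValuedMetadata and the methods add, getValues, and getMostTrusted are hypothetical names, and it assumes, as the comment states, that later values in the list are the more trustworthy ones. Names are matched case-sensitively here to keep the example short.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical multi-valued metadata holder; not the class from the patch.
final class MultiValuedMetadata {
    private final Map<String, List<String>> values = new HashMap<>();

    // Append a value, keeping everything recorded earlier under the same name.
    void add(String name, String value) {
        values.computeIfAbsent(name, k -> new ArrayList<>()).add(value);
    }

    // All values in insertion order: earliest (least trusted) first.
    List<String> getValues(String name) {
        return Collections.unmodifiableList(
                values.getOrDefault(name, Collections.emptyList()));
    }

    // The last value added, assumed to be the most trustworthy; null if none.
    String getMostTrusted(String name) {
        List<String> all = values.get(name);
        return (all == null || all.isEmpty()) ? null : all.get(all.size() - 1);
    }

    public static void main(String[] args) {
        MultiValuedMetadata metadata = new MultiValuedMetadata();
        metadata.add("Content-Length", "10"); // value claimed by the server
        metadata.add("Content-Length", "4");  // value recorded after fetching
        System.out.println(metadata.getValues("Content-Length"));
        System.out.println(metadata.getMostTrusted("Content-Length"));
    }
}
```

Whether callers read the first value, the last value, or the whole list is the policy question raised above; the data structure itself does not decide it.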



[jira] Resolved: (NUTCH-139) Standard metadata property names in the ParseData metadata

Posted by "Jerome Charron (JIRA)" <ji...@apache.org>.

     [ http://issues.apache.org/jira/browse/NUTCH-139?page=all ]
     
Jerome Charron resolved NUTCH-139:
----------------------------------

    Fix Version:     (was: 0.7.2-dev)
                     (was: 0.7.1)
                     (was: 0.7)
                     (was: 0.6)
     Resolution: Fixed

Tested and committed, with some corrections in cached.jsp (a missed ContentProperties usage) and build.xml (added the commons-lang jar to the war lib):
http://svn.apache.org/viewcvs.cgi?rev=376089&view=rev

> Standard metadata property names in the ParseData metadata
> ----------------------------------------------------------
>
>          Key: NUTCH-139
>          URL: http://issues.apache.org/jira/browse/NUTCH-139
>      Project: Nutch
>         Type: Improvement
>   Components: fetcher
>     Versions: 0.7.1, 0.7, 0.6, 0.7.2-dev, 0.8-dev
>  Environment: Power Mac OS X 10.4, Dual Processor G5 2.0 Ghz, 1.5 GB  RAM, although bug is independent of environment
>     Reporter: Chris A. Mattmann
>     Assignee: Chris A. Mattmann
>     Priority: Minor
>      Fix For: 0.8-dev
>  Attachments: NUTCH-139.060105.patch, NUTCH-139.060208.patch
>

[jira] Commented: (NUTCH-139) Standard metadata property names in the ParseData metadata

Posted by "Andrzej Bialecki (JIRA)" <ji...@apache.org>.

    [ http://issues.apache.org/jira/browse/NUTCH-139?page=comments#action_12361889 ] 

Andrzej Bialecki  commented on NUTCH-139:
-----------------------------------------

Looks good to me, +1


[jira] Commented: (NUTCH-139) Standard metadata property names in the ParseData metadata

Posted by "Jerome Charron (JIRA)" <ji...@apache.org>.

    [ http://issues.apache.org/jira/browse/NUTCH-139?page=comments#action_12364218 ] 

Jerome Charron commented on NUTCH-139:
--------------------------------------

>  I think we're near agreement here.
I really hope ...   ;-)

> We should add an add() method to Metadata, and change set() to replace all values rather than add a new value.
I'm not sure we are looking at the same piece of code, since this is how the add() and set() methods work in the last attached patch (http://issues.apache.org/jira/secure/attachment/12321740/NUTCH-139.060105.patch)

> MetadataNames belongs in the protocol package, not util
+1 (but in my mind there is no more MetadataNames... only MetaData, ContentProperties and ParseProperties, no?)

> We should rename ContentProperties to Metadata
What about having a generic Metadata container extended by ContentProperties and ParseProperties?
(as described in a previous comment: http://issues.apache.org/jira/browse/NUTCH-139#action_12362618)
By having two separate maps (one for Content and one for Parse in ParseData), we easily handle the problem of original value / final value, and we avoid copying the Content metadata map to the Parse metadata map in every parser:

ContentProperties metadata = new ContentProperties();
metadata.putAll(content.getMetadata()); // copy through


> Standard metadata property names in the ParseData metadata
> ----------------------------------------------------------
>
>          Key: NUTCH-139
>          URL: http://issues.apache.org/jira/browse/NUTCH-139
>      Project: Nutch
>         Type: Improvement
>   Components: fetcher
>     Versions: 0.7.1, 0.7, 0.6, 0.7.2-dev, 0.8-dev
>  Environment: Power Mac OS X 10.4, Dual Processor G5 2.0 Ghz, 1.5 GB  RAM, although bug is independent of environment
>     Reporter: Chris A. Mattmann
>     Assignee: Chris A. Mattmann
>     Priority: Minor
>      Fix For: 0.7.2-dev, 0.8-dev, 0.7.1, 0.7, 0.6
>  Attachments: NUTCH-139.060105.patch, NUTCH-139.Mattmann.patch.txt, NUTCH-139.jc.review.patch.txt
>

[jira] Commented: (NUTCH-139) Standard metadata property names in the ParseData metadata

Posted by "Jerome Charron (JIRA)" <ji...@apache.org>.

    [ http://issues.apache.org/jira/browse/NUTCH-139?page=comments#action_12362013 ] 

Jerome Charron commented on NUTCH-139:
--------------------------------------

Doug,

The purpose of this patch is to provide some standard metadata names and to handle erroneous names, not to handle multi-valued headers (but you are right that multi-valued headers should be handled in a follow-up patch, since only the protocol layer knows how to extract the multiple values => afterwards, it is too late).
Concerning the current implementation of multiple values (the "history"), there is nothing new in this patch.
And since it already contains a correction to the patch provided by Andrzej for the Class Cast Exception (http://www.nabble.com/Class-Cast-exception-t865816.html#a2245383), I suggest committing it.



> Standard metadata property names in the ParseData metadata
> ----------------------------------------------------------
>
>          Key: NUTCH-139
>          URL: http://issues.apache.org/jira/browse/NUTCH-139
>      Project: Nutch
>         Type: Improvement
>   Components: fetcher
>     Versions: 0.7.1, 0.7, 0.6, 0.7.2-dev, 0.8-dev
>  Environment: Power Mac OS X 10.4, Dual Processor G5 2.0 Ghz, 1.5 GB  RAM, although bug is independent of environment
>     Reporter: Chris A. Mattmann
>     Assignee: Chris A. Mattmann
>     Priority: Minor
>      Fix For: 0.7.2-dev, 0.8-dev, 0.7.1, 0.7, 0.6
>  Attachments: NUTCH-139.060105.patch, NUTCH-139.Mattmann.patch.txt, NUTCH-139.jc.review.patch.txt
>

[jira] Commented: (NUTCH-139) Standard metadata property names in the ParseData metadata

Posted by "Jerome Charron (JIRA)" <ji...@apache.org>.

    [ http://issues.apache.org/jira/browse/NUTCH-139?page=comments#action_12362061 ] 

Jerome Charron commented on NUTCH-139:
--------------------------------------

I agree with your analysis Andrzej.
I suggested committing this patch because it is a response to this issue: standard metadata names + misspelled/erroneous names.
The "history" is not a new "feature" => ContentProperties is a kind of "history".
So after committing this patch, I (and others) could focus on other sub-issues:
1. In fact, after taking a closer look at it, I agree that there is no real need for a metadata history in Nutch.
2. What we need:
2.1 MetaData must be used to store multi-valued metadata, and not the current kind of history.
2.2 Only two "historical" values must be stored: the original one (protocol only) and some extra metadata (which may or may not be values derived from the original ones).

What I suggest is that the MetaData deals with two collections instead of one:
* One for original protocol values : headers
* Another one for other metadata
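
To make the two-collection idea concrete, here is a minimal sketch in modern Java. The class name TwoPartMetaData and all of its methods are hypothetical and are not taken from any attached patch; the sketch only shows one way original protocol values could be kept apart from later metadata.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: names are illustrative, not part of the Nutch API.
class TwoPartMetaData {
    // Original protocol values (headers); a header may legitimately repeat.
    private final Map<String, List<String>> headers = new HashMap<>();
    // Extra or derived metadata added later by parsers and other plugins.
    private final Map<String, List<String>> others = new HashMap<>();

    // Record a value exactly as received at the protocol level.
    void addHeader(String name, String value) {
        headers.computeIfAbsent(name, k -> new ArrayList<>()).add(value);
    }

    // Record a derived or additional value without touching the headers.
    void addOther(String name, String value) {
        others.computeIfAbsent(name, k -> new ArrayList<>()).add(value);
    }

    // All original values for the name, or an empty list if there are none.
    List<String> getHeaders(String name) {
        return headers.getOrDefault(name, Collections.emptyList());
    }

    // All non-protocol values for the name, or an empty list if there are none.
    List<String> getOthers(String name) {
        return others.getOrDefault(name, Collections.emptyList());
    }
}
```

With this shape, a parser that guesses a better content type would call addOther() and the protocol's own "Content-Type" header would stay readable through getHeaders().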

> Standard metadata property names in the ParseData metadata
> ----------------------------------------------------------
>
>          Key: NUTCH-139
>          URL: http://issues.apache.org/jira/browse/NUTCH-139
>      Project: Nutch
>         Type: Improvement
>   Components: fetcher
>     Versions: 0.7.1, 0.7, 0.6, 0.7.2-dev, 0.8-dev
>  Environment: Power Mac OS X 10.4, Dual Processor G5 2.0 Ghz, 1.5 GB  RAM, although bug is independent of environment
>     Reporter: Chris A. Mattmann
>     Assignee: Chris A. Mattmann
>     Priority: Minor
>      Fix For: 0.7.2-dev, 0.8-dev, 0.7.1, 0.7, 0.6
>  Attachments: NUTCH-139.060105.patch, NUTCH-139.Mattmann.patch.txt, NUTCH-139.jc.review.patch.txt
>

[jira] Commented: (NUTCH-139) Standard metadata property names in the ParseData metadata

Posted by "Andrzej Bialecki (JIRA)" <ji...@apache.org>.

    [ http://issues.apache.org/jira/browse/NUTCH-139?page=comments#action_12362049 ] 

Andrzej Bialecki  commented on NUTCH-139:
-----------------------------------------

I see three issues here:

* using standard metadata names and handling misspelled/erroneous ones: this patch provides this function, and IMO does it well.

* providing authoritative values for certain metadata, if the original ones are absent or not trustworthy. Well, this patch provides this functionality, but with a twist, i.e. we say that the earliest values are the ones coming from the protocol level, later ones come from parse or other plugins, and the latest value on top is the one that should be used for further processing. That's what is meant by "history".

* supporting multi-valued headers: I don't see how this patch could possibly support it, because it redefines the semantics of multiple values per key ("history" - see above). So, it doesn't support multiple values per key, and I don't see any easy way to add this support later without creating a messed-up mix of "historical" and "multi" values...

> Standard metadata property names in the ParseData metadata
> ----------------------------------------------------------
>
>          Key: NUTCH-139
>          URL: http://issues.apache.org/jira/browse/NUTCH-139
>      Project: Nutch
>         Type: Improvement
>   Components: fetcher
>     Versions: 0.7.1, 0.7, 0.6, 0.7.2-dev, 0.8-dev
>  Environment: Power Mac OS X 10.4, Dual Processor G5 2.0 Ghz, 1.5 GB  RAM, although bug is independent of environment
>     Reporter: Chris A. Mattmann
>     Assignee: Chris A. Mattmann
>     Priority: Minor
>      Fix For: 0.7.2-dev, 0.8-dev, 0.7.1, 0.7, 0.6
>  Attachments: NUTCH-139.060105.patch, NUTCH-139.Mattmann.patch.txt, NUTCH-139.jc.review.patch.txt
>

[jira] Updated: (NUTCH-139) Standard metadata property names in the ParseData metadata

Posted by "Chris A. Mattmann (JIRA)" <ji...@apache.org>.

     [ http://issues.apache.org/jira/browse/NUTCH-139?page=all ]

Chris A. Mattmann updated NUTCH-139:
------------------------------------

    Attachment: NUTCH-139.Mattmann.patch.txt

Okay Folks,

 Here's the patch file. Phew. Spent the whole day working on this today. Finally got all the unit tests passing :-) It was pretty difficult to figure out all the different places that the metadata properties were being used, but I think I nailed it. I'd be happy to hear from anybody that finds differently of course. So, yeah, please test this in your environments, and let me know what you guys think. I didn't test extensively, just made sure the unit tests passed.

  The biggest changes were to the protocol and parsing layer, but overall I also had to update the index-more plugin. This patch also includes several removals of unused imports in Nutch classes (good ol' Eclipse, I love it how it tells you that ;) ). Also, I fixed a bug that I just reported in the parse-rtf plugin where the ParseImpl was not being constructed correctly (or maybe I had an old version?). Anyways, check it out and let me know what you think. 

Thanks Guys!

Cheers,
  Chris


> Standard metadata property names in the ParseData metadata
> ----------------------------------------------------------
>
>          Key: NUTCH-139
>          URL: http://issues.apache.org/jira/browse/NUTCH-139
>      Project: Nutch
>         Type: Improvement
>   Components: fetcher
>     Versions: 0.7.1, 0.7, 0.6, 0.7.2-dev, 0.8-dev
>  Environment: Power Mac OS X 10.4, Dual Processor G5 2.0 Ghz, 1.5 GB  RAM, although bug is independent of environment
>     Reporter: Chris A. Mattmann
>     Assignee: Chris A. Mattmann
>     Priority: Minor
>      Fix For: 0.7.2-dev, 0.8-dev, 0.7.1, 0.7, 0.6
>  Attachments: NUTCH-139.Mattmann.patch.txt
>

[jira] Commented: (NUTCH-139) Standard metadata property names in the ParseData metadata

Posted by "Jerome Charron (JIRA)" <ji...@apache.org>.

    [ http://issues.apache.org/jira/browse/NUTCH-139?page=comments#action_12360659 ] 

Jerome Charron commented on NUTCH-139:
--------------------------------------

+1 to Doug's comments:

* Remove X_nutch from the constant names
* Add the "X-nutch-" prefix to the constant values
* Move the constant definitions to a MetadataNames interface.

I'll take it.
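
For illustration only, the three points above could end up looking roughly like this. The two constants are borrowed from the issue description; the PREFIX constant and the exact values are assumptions, and the real patch may define a different set.

```java
// Hypothetical sketch; the actual constants in the patch may differ.
interface MetadataNames {
    // Prefix carried by the values, not by the constant names.
    String PREFIX = "X-nutch-";

    String CONTENT_TYPE = PREFIX + "content-type";
    String CREATOR = PREFIX + "creator";
}
```

Callers would then write MetadataNames.CONTENT_TYPE rather than spelling the property name out by hand.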

> Standard metadata property names in the ParseData metadata
> ----------------------------------------------------------
>
>          Key: NUTCH-139
>          URL: http://issues.apache.org/jira/browse/NUTCH-139
>      Project: Nutch
>         Type: Improvement
>   Components: fetcher
>     Versions: 0.7.1, 0.7, 0.6, 0.7.2-dev, 0.8-dev
>  Environment: Power Mac OS X 10.4, Dual Processor G5 2.0 Ghz, 1.5 GB  RAM, although bug is independent of environment
>     Reporter: Chris A. Mattmann
>     Assignee: Chris A. Mattmann
>     Priority: Minor
>      Fix For: 0.7.2-dev, 0.8-dev, 0.7.1, 0.7, 0.6
>  Attachments: NUTCH-139.Mattmann.patch.txt
>

[jira] Commented: (NUTCH-139) Standard metadata property names in the ParseData metadata

Posted by "Jerome Charron (JIRA)" <ji...@apache.org>.

    [ http://issues.apache.org/jira/browse/NUTCH-139?page=comments#action_12360920 ] 

Jerome Charron commented on NUTCH-139:
--------------------------------------

And why not use the fact that the ContentProperties object can now handle multi-valued properties?
Each piece of code that wants to add some more reliable content about a property simply adds its own value to the property => the first value is the raw one (for instance from the protocol level), and the further you iterate over the values of the property, the more reliable the value you get (the last one should be the most reliable, and is generally the interesting one; if for other reasons the original value is needed, it is simply the first value).

Yes, you lose one piece of information with this solution: you cannot ensure that the first value of a multi-valued property is the one from the protocol level.
But it avoids searching for the same kind of information (the content-type for instance) under several property names ("Content-Type" for the protocol level and "X-Nutch-Content-Type" for other levels).

We can extend the multi-valued properties by passing a provider attribute when adding a property:
public void addProperty(key, value, provider).
The provider can be one of PROTOCOL, CONTENT, OTHER, ... for instance (to be defined)
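
A rough sketch of how such a provider attribute might work, written in modern Java. Only addProperty(key, value, provider) and the provider names PROTOCOL, CONTENT, OTHER come from the comment above; the class names and the two lookup methods are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Provider names as suggested above; the final list is still to be defined.
enum Provider { PROTOCOL, CONTENT, OTHER }

// Hypothetical holder pairing a value with the layer that supplied it.
class ProvidedValue {
    final String value;
    final Provider provider;

    ProvidedValue(String value, Provider provider) {
        this.value = value;
        this.provider = provider;
    }
}

// Hypothetical container; not the ContentProperties class from the patch.
class ProviderAwareProperties {
    private final Map<String, List<ProvidedValue>> props = new HashMap<>();

    // Append a value and remember who provided it.
    public void addProperty(String key, String value, Provider provider) {
        props.computeIfAbsent(key, k -> new ArrayList<>())
             .add(new ProvidedValue(value, provider));
    }

    // Last value added, treated here as the most reliable; null if absent.
    public String getLatest(String key) {
        List<ProvidedValue> values = props.get(key);
        return (values == null) ? null : values.get(values.size() - 1).value;
    }

    // First value added by the given provider; null if it never added one.
    public String getFirstFrom(String key, Provider provider) {
        List<ProvidedValue> values = props.get(key);
        if (values == null) {
            return null;
        }
        for (ProvidedValue v : values) {
            if (v.provider == provider) {
                return v.value;
            }
        }
        return null;
    }
}
```

Tagging each value this way would recover the piece of information lost above: getFirstFrom(key, Provider.PROTOCOL) returns the protocol-level value regardless of insertion order.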



> Standard metadata property names in the ParseData metadata
> ----------------------------------------------------------
>
>          Key: NUTCH-139
>          URL: http://issues.apache.org/jira/browse/NUTCH-139
>      Project: Nutch
>         Type: Improvement
>   Components: fetcher
>     Versions: 0.7.1, 0.7, 0.6, 0.7.2-dev, 0.8-dev
>  Environment: Power Mac OS X 10.4, Dual Processor G5 2.0 Ghz, 1.5 GB  RAM, although bug is independent of environment
>     Reporter: Chris A. Mattmann
>     Assignee: Chris A. Mattmann
>     Priority: Minor
>      Fix For: 0.7.2-dev, 0.8-dev, 0.7.1, 0.7, 0.6
>  Attachments: NUTCH-139.Mattmann.patch.txt, NUTCH-139.jc.review.patch.txt
>

[jira] Commented: (NUTCH-139) Standard metadata property names in the ParseData metadata

Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.

    [ http://issues.apache.org/jira/browse/NUTCH-139?page=comments#action_12362242 ] 

Doug Cutting commented on NUTCH-139:
------------------------------------

We can just use different names, rather than two metaData objects: X-nutch names for derived or other values that are usually protocol independent; and (possibly prefixed) names for protocol- or format-specific values.  The latter are sometimes multivalued, but the former are probably not.

The relevance to this patch is that this patch currently uses un-prefixed protocol-specific names to store derived, protocol-independent data, which is confusing.  This patch is meant to standardize property names.  Let's just standardize them once.  Protocol- and format-specific names should be defined in protocol- and format-specific files.  For example, if we want to define constants for http headers, they should probably go in the (new) lib-http plugin.

We also need to change ContentProperties to distinguish add(String,String) from set(String,String), and we may need to change some protocols to call add(String,String) instead of set(String,String).  I think that it makes sense to bundle that change in this patch too.
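
As a sketch of the add/set distinction (modern Java, with a hypothetical class name and accessor; this is not the ContentProperties code itself): add(String,String) appends to whatever is already stored under the name, while set(String,String) replaces all existing values with a single one.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of add() versus set() semantics.
class MultiValuedProperties {
    private final Map<String, List<String>> values = new HashMap<>();

    // add: keep existing values and append the new one.
    public void add(String name, String value) {
        values.computeIfAbsent(name, k -> new ArrayList<>()).add(value);
    }

    // set: discard any existing values and store only the new one.
    public void set(String name, String value) {
        List<String> single = new ArrayList<>();
        single.add(value);
        values.put(name, single);
    }

    // All values stored under the name, or an empty array if there are none.
    public String[] getValues(String name) {
        List<String> list = values.get(name);
        return (list == null) ? new String[0] : list.toArray(new String[0]);
    }
}
```

A protocol plugin reading a header that may repeat would call add() once per occurrence, while code that wants to record a single derived value would call set().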

> Standard metadata property names in the ParseData metadata
> ----------------------------------------------------------
>
>          Key: NUTCH-139
>          URL: http://issues.apache.org/jira/browse/NUTCH-139
>      Project: Nutch
>         Type: Improvement
>   Components: fetcher
>     Versions: 0.7.1, 0.7, 0.6, 0.7.2-dev, 0.8-dev
>  Environment: Power Mac OS X 10.4, Dual Processor G5 2.0 Ghz, 1.5 GB  RAM, although bug is independent of environment
>     Reporter: Chris A. Mattmann
>     Assignee: Chris A. Mattmann
>     Priority: Minor
>      Fix For: 0.7.2-dev, 0.8-dev, 0.7.1, 0.7, 0.6
>  Attachments: NUTCH-139.060105.patch, NUTCH-139.Mattmann.patch.txt, NUTCH-139.jc.review.patch.txt
>

[jira] Commented: (NUTCH-139) Standard metadata property names in the ParseData metadata

Posted by "Jerome Charron (JIRA)" <ji...@apache.org>.

    [ http://issues.apache.org/jira/browse/NUTCH-139?page=comments#action_12363834 ] 

Jerome Charron commented on NUTCH-139:
--------------------------------------

Andrzej,

I really don't like this "X-Nutch" naming convention. First, it's really protocol-level oriented, and it forces us to map "X-Nutch" values to the original ones (of course a utility method can easily provide this mapping). But I really think this solution is really clean (from my point of view).

We should perhaps define one more time what a MetaData value is.
I suggest defining a new class to represent a metadata value instead of using a simple String.
Thus, we can define a class that holds both the original and the final value.
The idea is that the only way to set the original value is to construct a new object (I will call this class MetaValue, but native English speakers are encouraged to propose a better name). When you then set the value of this metadata value, it never overrides the original one, only the final one.
Here is a short piece of code:

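import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class MetaValue {
    // Values given at construction time; never overwritten afterwards
    private final String[] original;
    // Values set later; stays null until setValue or addValue is called
    private List<String> actual = null;

    public MetaValue(String[] values) {
        // Constructor for multi value
        original = values;
    }

    public MetaValue(String value) {
        // Constructor for single value
        original = new String[] { value };
    }

    public void setValue(String[] values) {
        // Copies the values into a new, empty actual list
        actual = new ArrayList<String>(Arrays.asList(values));
    }

    public void addValue(String value) {
        // Appends this value to the list of actual values
        if (actual == null) {
            actual = new ArrayList<String>();
        }
        actual.add(value);
    }

    public String[] getOriginalValues() {
        return original;
    }

    public String[] getFinalValues() {
        // Null means no final value has been set yet
        return (actual == null) ? null : actual.toArray(new String[0]);
    }

    public String[] getValues() {
        // Return the final values if the list of values is not null,
        // otherwise return the original values
        return (actual != null) ? getFinalValues() : original;
    }
}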

With this approach we can keep the same value (MetaValue) with the same key.


> Standard metadata property names in the ParseData metadata
> ----------------------------------------------------------
>
>          Key: NUTCH-139
>          URL: http://issues.apache.org/jira/browse/NUTCH-139
>      Project: Nutch
>         Type: Improvement
>   Components: fetcher
>     Versions: 0.7.1, 0.7, 0.6, 0.7.2-dev, 0.8-dev
>  Environment: Power Mac OS X 10.4, Dual Processor G5 2.0 Ghz, 1.5 GB  RAM, although bug is independent of environment
>     Reporter: Chris A. Mattmann
>     Assignee: Chris A. Mattmann
>     Priority: Minor
>      Fix For: 0.7.2-dev, 0.8-dev, 0.7.1, 0.7, 0.6
>  Attachments: NUTCH-139.060105.patch, NUTCH-139.Mattmann.patch.txt, NUTCH-139.jc.review.patch.txt
>

[jira] Commented: (NUTCH-139) Standard metadata property names in the ParseData metadata

Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.

    [ http://issues.apache.org/jira/browse/NUTCH-139?page=comments#action_12360645 ] 

Doug Cutting commented on NUTCH-139:
------------------------------------

I'm confused as to why all of the constant names have "X_nutch" in them.  I'd expect to see something like that in their string values, but their names are already qualified by org.apache.nutch.ParseData, no?  Also, it would be easier if these were all defined in an interface, something like MetadataNames.  That way a class can "implement" that interface and then simply use the short names in code, e.g. CONTENT_TYPE, AUTHOR, etc.
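
A minimal sketch of the interface idea, for illustration only. The constant names CONTENT_TYPE and AUTHOR come from the comment above; the string values and the demo class are hypothetical and are not taken from any attached patch.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical constants interface; the string values are illustrative only.
interface MetadataNames {
    String CONTENT_TYPE = "content-type";
    String AUTHOR = "author";
}

// A class that "implements" the interface can use the short names directly,
// without qualifying them as MetadataNames.CONTENT_TYPE.
class MetadataNamesDemo implements MetadataNames {
    static Map<String, String> describe(String contentType, String author) {
        Map<String, String> metadata = new HashMap<>();
        metadata.put(CONTENT_TYPE, contentType);
        metadata.put(AUTHOR, author);
        return metadata;
    }
}
```

Code that does not implement the interface can still refer to the same keys through the qualified form, e.g. MetadataNames.CONTENT_TYPE.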

> Standard metadata property names in the ParseData metadata
> ----------------------------------------------------------
>
>          Key: NUTCH-139
>          URL: http://issues.apache.org/jira/browse/NUTCH-139
>      Project: Nutch
>         Type: Improvement
>   Components: fetcher
>     Versions: 0.7.1, 0.7, 0.6, 0.7.2-dev, 0.8-dev
>  Environment: Power Mac OS X 10.4, Dual Processor G5 2.0 Ghz, 1.5 GB  RAM, although bug is independent of environment
>     Reporter: Chris A. Mattmann
>     Assignee: Chris A. Mattmann
>     Priority: Minor
>      Fix For: 0.7.2-dev, 0.8-dev, 0.7.1, 0.7, 0.6
>  Attachments: NUTCH-139.Mattmann.patch.txt
>

[jira] Commented: (NUTCH-139) Standard metadata property names in the ParseData metadata

Posted by "Jerome Charron (JIRA)" <ji...@apache.org>.

    [ http://issues.apache.org/jira/browse/NUTCH-139?page=comments#action_12364112 ] 

Jerome Charron commented on NUTCH-139:
--------------------------------------

In fact, the more I look at this, the more I agree with Doug's last comment. There is no real need (for now) for such a complicated meta-info container.

I would like to summarize the key goals related to this issue:

1. Define some constants for protocol and content metadata names.
2. Provide some correction mechanisms for erroneous protocol header names.
3. Handle multi-valued properties (such as SMTP recipients, or tags attached to an HTML page, ...)
4. Provide an easy way to keep track of the protocol's original values even if they are overridden by parsers (I don't think there is a need for a concept of original value at the parser level. If a parser overrides a previously existing value set by another parser, then this new value must replace the existing one).

I really think that one of my comments (13/Jan/06 - http://issues.apache.org/jira/browse/NUTCH-139#action_12362618) covers all these cases.
In this proposal, the ParseData object keeps a reference to the protocol's original metadata map (ContentProperties), instead of copying the map into a new one.
The policy is then as follows:
* The ContentProperties is created at the protocol level and is never modified afterwards.
* The ParseProperties is created by the content parser and is the place to store any kind of metadata in all subsequent Nutch processes.
* Any metadata stored in ParseProperties can be overridden (the last one to speak has the last word).
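
The policy can be sketched roughly as follows. This is only an illustration under stated assumptions: ParseDataSketch and its method names are hypothetical, plain single-valued maps stand in for ContentProperties and ParseProperties, and the fallback in get() is an assumption rather than something stated in the proposal.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

// Hypothetical holder illustrating the policy; not the actual Nutch ParseData.
class ParseDataSketch {
    // Protocol-level metadata: referenced, never modified through this class.
    private final Map<String, String> contentProperties;
    // Parser-level metadata: any later writer may override earlier values.
    private final Map<String, String> parseProperties = new HashMap<>();

    ParseDataSketch(Map<String, String> protocolMetadata) {
        // Keep a read-only view of the protocol map rather than copying it.
        this.contentProperties = Collections.unmodifiableMap(protocolMetadata);
    }

    void setParseProperty(String name, String value) {
        parseProperties.put(name, value); // the last writer wins
    }

    String getOriginal(String name) {
        return contentProperties.get(name);
    }

    // Assumed lookup order: parser value if set, otherwise the protocol value.
    String get(String name) {
        String parsed = parseProperties.get(name);
        return parsed != null ? parsed : contentProperties.get(name);
    }
}
```

With this shape, a parser that corrects a content type never loses the value the protocol reported, since getOriginal() still returns it.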


> Standard metadata property names in the ParseData metadata
> ----------------------------------------------------------
>
>          Key: NUTCH-139
>          URL: http://issues.apache.org/jira/browse/NUTCH-139
>      Project: Nutch
>         Type: Improvement
>   Components: fetcher
>     Versions: 0.7.1, 0.7, 0.6, 0.7.2-dev, 0.8-dev
>  Environment: Power Mac OS X 10.4, Dual Processor G5 2.0 Ghz, 1.5 GB  RAM, although bug is independent of environment
>     Reporter: Chris A. Mattmann
>     Assignee: Chris A. Mattmann
>     Priority: Minor
>      Fix For: 0.7.2-dev, 0.8-dev, 0.7.1, 0.7, 0.6
>  Attachments: NUTCH-139.060105.patch, NUTCH-139.Mattmann.patch.txt, NUTCH-139.jc.review.patch.txt
>

[jira] Commented: (NUTCH-139) Standard metadata property names in the ParseData metadata

Posted by "Chris A. Mattmann (JIRA)" <ji...@apache.org>.

    [ http://issues.apache.org/jira/browse/NUTCH-139?page=comments#action_12361923 ] 

Chris A. Mattmann commented on NUTCH-139:
-----------------------------------------

Hi Doug,

  While it's true that content-length can be computed from the Content's data, wouldn't it also be nice to have it in the MetaData as well, to save the time of recomputing it? Saving computation time is always a nice thing. Furthermore, even if the value isn't really trustworthy because of the possibility of truncation, the way that Jerome and I have implemented it allows for a certain level of "trustworthiness" depending upon which value of the multi-valued list you take when you request a named MetaData property. The values at the front of the list are less trustworthy, while the values at the end should be more trustworthy. 
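
As a rough illustration of the ordering described above (not the code from the patch), a multi-valued container could append each new value and treat the last one as the most trusted. The class and method names below are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical multi-valued metadata container; names are illustrative only.
class MultiValuedMetadataSketch {
    private final Map<String, List<String>> values = new HashMap<>();

    // Each new value is appended, so later (more trusted) values sit at the end.
    void add(String name, String value) {
        values.computeIfAbsent(name, k -> new ArrayList<>()).add(value);
    }

    // Earliest value recorded, e.g. what the protocol reported; null if none.
    String getFirst(String name) {
        List<String> list = values.get(name);
        return list == null ? null : list.get(0);
    }

    // Latest value recorded, assumed here to be the most trustworthy; null if none.
    String getLast(String name) {
        List<String> list = values.get(name);
        return list == null ? null : list.get(list.size() - 1);
    }
}
```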

  I think that the issue you raise is an important one, however it's more of a policy issue (i.e., developers who are utilizing the MetaData classes, etc.) than a limitation of the patch, no?

Cheers,
  Chris


> Standard metadata property names in the ParseData metadata
> ----------------------------------------------------------
>
>          Key: NUTCH-139
>          URL: http://issues.apache.org/jira/browse/NUTCH-139
>      Project: Nutch
>         Type: Improvement
>   Components: fetcher
>     Versions: 0.7.1, 0.7, 0.6, 0.7.2-dev, 0.8-dev
>  Environment: Power Mac OS X 10.4, Dual Processor G5 2.0 Ghz, 1.5 GB  RAM, although bug is independent of environment
>     Reporter: Chris A. Mattmann
>     Assignee: Chris A. Mattmann
>     Priority: Minor
>      Fix For: 0.7.2-dev, 0.8-dev, 0.7.1, 0.7, 0.6
>  Attachments: NUTCH-139.060105.patch, NUTCH-139.Mattmann.patch.txt, NUTCH-139.jc.review.patch.txt
>

[jira] Updated: (NUTCH-139) Standard metadata property names in the ParseData metadata

Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.

     [ http://issues.apache.org/jira/browse/NUTCH-139?page=all ]

Doug Cutting updated NUTCH-139:
-------------------------------

    Attachment:     (was: NUTCH-139.jc.review.patch.txt)

> Standard metadata property names in the ParseData metadata
> ----------------------------------------------------------
>
>          Key: NUTCH-139
>          URL: http://issues.apache.org/jira/browse/NUTCH-139
>      Project: Nutch
>         Type: Improvement
>   Components: fetcher
>     Versions: 0.7.1, 0.7, 0.6, 0.7.2-dev, 0.8-dev
>  Environment: Power Mac OS X 10.4, Dual Processor G5 2.0 Ghz, 1.5 GB  RAM, although bug is independent of environment
>     Reporter: Chris A. Mattmann
>     Assignee: Chris A. Mattmann
>     Priority: Minor
>      Fix For: 0.7.2-dev, 0.8-dev, 0.7.1, 0.7, 0.6
>  Attachments: NUTCH-139.060105.patch
>

[jira] Updated: (NUTCH-139) Standard metadata property names in the ParseData metadata

Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.

     [ http://issues.apache.org/jira/browse/NUTCH-139?page=all ]

Doug Cutting updated NUTCH-139:
-------------------------------

    Comment: was deleted

> Standard metadata property names in the ParseData metadata
> ----------------------------------------------------------
>
>          Key: NUTCH-139
>          URL: http://issues.apache.org/jira/browse/NUTCH-139
>      Project: Nutch
>         Type: Improvement
>   Components: fetcher
>     Versions: 0.7.1, 0.7, 0.6, 0.7.2-dev, 0.8-dev
>  Environment: Power Mac OS X 10.4, Dual Processor G5 2.0 Ghz, 1.5 GB  RAM, although bug is independent of environment
>     Reporter: Chris A. Mattmann
>     Assignee: Chris A. Mattmann
>     Priority: Minor
>      Fix For: 0.7.2-dev, 0.8-dev, 0.7.1, 0.7, 0.6
>  Attachments: NUTCH-139.060105.patch, NUTCH-139.Mattmann.patch.txt, NUTCH-139.jc.review.patch.txt
>

[jira] Commented: (NUTCH-139) Standard metadata property names in the ParseData metadata

Posted by "Chris A. Mattmann (JIRA)" <ji...@apache.org>.

    [ http://issues.apache.org/jira/browse/NUTCH-139?page=comments#action_12360389 ] 

Chris A. Mattmann commented on NUTCH-139:
-----------------------------------------

According to Andrzej:

"I agree, too. Perhaps we should use the names as they appear in the  Dublin Core for those properties that are defined there - just prepended  them with "X-nutch-" in order to avoid name-clashes with other  properties (e.g. blindly copied from the protocol headers). "


I will follow this notation when I develop the code.

Thanks,
  Chris


> Standard metadata property names in the ParseData metadata
> ----------------------------------------------------------
>
>          Key: NUTCH-139
>          URL: http://issues.apache.org/jira/browse/NUTCH-139
>      Project: Nutch
>         Type: Improvement
>   Components: fetcher
>     Versions: 0.7.1, 0.7, 0.6, 0.7.2-dev, 0.8-dev
>  Environment: Power Mac OS X 10.4, Dual Processor G5 2.0 Ghz, 1.5 GB  RAM, although bug is independent of environment
>     Reporter: Chris A. Mattmann
>     Assignee: Chris A. Mattmann
>      Fix For: 0.7.2-dev, 0.8-dev, 0.7.1, 0.7, 0.6

>

[jira] Commented: (NUTCH-139) Standard metadata property names in the ParseData metadata

Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.

    [ http://issues.apache.org/jira/browse/NUTCH-139?page=comments#action_12361994 ] 

Doug Cutting commented on NUTCH-139:
------------------------------------

Jerome,

Some HTTP headers have multiple values.  Correctly reflecting that was, I thought, the primary motivation for adding multiple values, not recording historical values.

I still don't see a reason why the derived content type needs to be stored anywhere but in the contentType field of the Content.  And if a derived value ever needs to go into the metadata, it should always use an x-nutch key, so that it can be clearly distinguished from original values.

Chris,

The content length is not expensive to compute, it's simply the length of the content byte array.  Are there uses of content length where this is impractical?  If so, then perhaps we could, for performance, cache a protocol-independent, derived content length in an x-nutch header. 

Alternately, we could prefix all protocol headers with the protocol name, so that the HTTP "Content-Language" header could be stored as something like "http:Content-Language".  Then Nutch could avoid using the x-nutch prefix, and instead store the derived, protocol-independent value as simply "language".
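
A minimal sketch of this prefixing scheme, assuming a plain string map as the metadata store. The helper names are hypothetical; only the keys "http:Content-Language" and "language" come from the paragraph above.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helpers; only the key shapes follow the comment above.
class PrefixedMetadataSketch {
    // Store a raw protocol header under "<protocol>:<header>".
    static void putProtocolHeader(Map<String, String> metadata,
                                  String protocol, String header, String value) {
        metadata.put(protocol + ":" + header, value);
    }

    // Store a derived, protocol-independent value under its plain name.
    static void putDerived(Map<String, String> metadata, String name, String value) {
        metadata.put(name, value);
    }

    static Map<String, String> example() {
        Map<String, String> metadata = new LinkedHashMap<>();
        putProtocolHeader(metadata, "http", "Content-Language", "en-US");
        putDerived(metadata, "language", "en");
        return metadata;
    }
}
```

The two entries can never collide, because a derived name contains no protocol prefix.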

Yes, these are issues of policy, but this patch violates my ideas about the correct policy.  We should not confuse protocol-specific HTTP headers with protocol-independent derived values.  And multiple-values should be the exception, used in cases where multiple values are really sensible (like email "Received" headers) not to store the historic values.

> Standard metadata property names in the ParseData metadata
> ----------------------------------------------------------
>
>          Key: NUTCH-139
>          URL: http://issues.apache.org/jira/browse/NUTCH-139
>      Project: Nutch
>         Type: Improvement
>   Components: fetcher
>     Versions: 0.7.1, 0.7, 0.6, 0.7.2-dev, 0.8-dev
>  Environment: Power Mac OS X 10.4, Dual Processor G5 2.0 Ghz, 1.5 GB  RAM, although bug is independent of environment
>     Reporter: Chris A. Mattmann
>     Assignee: Chris A. Mattmann
>     Priority: Minor
>      Fix For: 0.7.2-dev, 0.8-dev, 0.7.1, 0.7, 0.6
>  Attachments: NUTCH-139.060105.patch, NUTCH-139.Mattmann.patch.txt, NUTCH-139.jc.review.patch.txt
>

[jira] Commented: (NUTCH-139) Standard metadata property names in the ParseData metadata

Posted by "Jerome Charron (JIRA)" <ji...@apache.org>.

    [ http://issues.apache.org/jira/browse/NUTCH-139?page=comments#action_12361041 ] 

Jerome Charron commented on NUTCH-139:
--------------------------------------

Ok, Chris and I will implement MetadataNames in this way.
Just a few comments:

I plan to move the MetadataNames to a class rather than an interface. Two reasons:

1.1 I don't like the design of implementing an interface in order to import some constants into a class: it produces javadoc in which many classes expose a lot of public constants without any real need to show these constants there.

1.2 I want to add a utility method in MetadataNames that tries to find the appropriate Nutch normalized metadata name from a string. It will be based on the Levenshtein Distance (available in commons-lang). More about Levenshtein Distance at http://www.merriampark.com/ld.htm
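
A rough sketch of what such a utility method could look like. To stay self-contained it computes the Levenshtein distance by hand instead of calling commons-lang; the class name, the list of known names, and the maxDistance threshold are all assumptions made for illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical normalizer; the known names and the threshold are illustrative.
class NameNormalizerSketch {
    static final List<String> KNOWN =
        Arrays.asList("content-type", "content-length", "creator", "language");

    // Classic dynamic-programming Levenshtein distance using two rows.
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j;
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
                                   prev[j - 1] + cost);
            }
            int[] tmp = prev; prev = curr; curr = tmp;
        }
        return prev[b.length()];
    }

    // Returns the closest known name, or null if nothing is within maxDistance.
    static String normalize(String name, int maxDistance) {
        String candidate = name.trim().toLowerCase(Locale.ROOT);
        String best = null;
        int bestDistance = Integer.MAX_VALUE;
        for (String known : KNOWN) {
            int d = levenshtein(candidate, known);
            if (d < bestDistance) { bestDistance = d; best = known; }
        }
        return bestDistance <= maxDistance ? best : null;
    }
}
```

Under these assumptions, variants such as "Content_Type" or "ContentType" are one edit away from "content-type" once lower-cased, so a small threshold is enough to map them, while unrelated names fall through as null.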




> Standard metadata property names in the ParseData metadata
> ----------------------------------------------------------
>
>          Key: NUTCH-139
>          URL: http://issues.apache.org/jira/browse/NUTCH-139
>      Project: Nutch
>         Type: Improvement
>   Components: fetcher
>     Versions: 0.7.1, 0.7, 0.6, 0.7.2-dev, 0.8-dev
>  Environment: Power Mac OS X 10.4, Dual Processor G5 2.0 Ghz, 1.5 GB  RAM, although bug is independent of environment
>     Reporter: Chris A. Mattmann
>     Assignee: Chris A. Mattmann
>     Priority: Minor
>      Fix For: 0.7.2-dev, 0.8-dev, 0.7.1, 0.7, 0.6
>  Attachments: NUTCH-139.Mattmann.patch.txt, NUTCH-139.jc.review.patch.txt
>

[jira] Updated: (NUTCH-139) Standard metadata property names in the ParseData metadata

Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.

     [ http://issues.apache.org/jira/browse/NUTCH-139?page=all ]

Doug Cutting updated NUTCH-139:
-------------------------------

    Comment: was deleted

> Standard metadata property names in the ParseData metadata
> ----------------------------------------------------------
>
>          Key: NUTCH-139
>          URL: http://issues.apache.org/jira/browse/NUTCH-139
>      Project: Nutch
>         Type: Improvement
>   Components: fetcher
>     Versions: 0.7.1, 0.7, 0.6, 0.7.2-dev, 0.8-dev
>  Environment: Power Mac OS X 10.4, Dual Processor G5 2.0 Ghz, 1.5 GB  RAM, although bug is independent of environment
>     Reporter: Chris A. Mattmann
>     Assignee: Chris A. Mattmann
>     Priority: Minor
>      Fix For: 0.7.2-dev, 0.8-dev, 0.7.1, 0.7, 0.6
>  Attachments: NUTCH-139.060105.patch, NUTCH-139.Mattmann.patch.txt, NUTCH-139.jc.review.patch.txt

[jira] Commented: (NUTCH-139) Standard metadata property names in the ParseData metadata

Posted by "Chris A. Mattmann (JIRA)" <ji...@apache.org>.

    [ http://issues.apache.org/jira/browse/NUTCH-139?page=comments#action_12361926 ] 

Chris A. Mattmann commented on NUTCH-139:
-----------------------------------------

Hi Doug,

  While it's true that content-length can be computed from the Content's data, wouldn't it also be nice to have it in the MetaData to save the cost of recomputing it? Furthermore, even if the value isn't really trustworthy because of the possibility of truncation, the way that Jerome and I have implemented it allows for a certain level of "trustworthiness", depending on which value of the multi-valued list you take when you request a named MetaData property. The values at the front of the list are less trustworthy, while the values at the end should be more trustworthy.

  I think that the issue you raise is an important one; however, it's more of a policy issue (i.e., one for the developers who use the MetaData classes, etc.) than a limitation of the patch, no?

Cheers,
  Chris



[jira] Commented: (NUTCH-139) Standard metadata property names in the ParseData metadata

Posted by "Andrzej Bialecki (JIRA)" <ji...@apache.org>.

    [ http://issues.apache.org/jira/browse/NUTCH-139?page=comments#action_12365098 ] 

Andrzej Bialecki  commented on NUTCH-139:
-----------------------------------------

FWIW, I agree with Doug on this - I don't see that subclasses would buy us much in terms of functionality, except for the sake of purity of OO approach. I think we can add subclasses later, when we really need specialization.


> Standard metadata property names in the ParseData metadata
> ----------------------------------------------------------
>
>          Key: NUTCH-139
>          URL: http://issues.apache.org/jira/browse/NUTCH-139
>      Project: Nutch
>         Type: Improvement
>   Components: fetcher
>     Versions: 0.7.1, 0.7, 0.6, 0.7.2-dev, 0.8-dev
>  Environment: Power Mac OS X 10.4, Dual Processor G5 2.0 Ghz, 1.5 GB  RAM, although bug is independent of environment
>     Reporter: Chris A. Mattmann
>     Assignee: Chris A. Mattmann
>     Priority: Minor
>      Fix For: 0.7.2-dev, 0.8-dev, 0.7.1, 0.7, 0.6
>  Attachments: NUTCH-139.060105.patch

[jira] Commented: (NUTCH-139) Standard metadata property names in the ParseData metadata

Posted by "Jerome Charron (JIRA)" <ji...@apache.org>.

    [ http://issues.apache.org/jira/browse/NUTCH-139?page=comments#action_12361900 ] 

Jerome Charron commented on NUTCH-139:
--------------------------------------

Doug,

This implementation is a multi-valued implementation:
1. The protocol headers are stored as-is.
2. Then correct values (guessed from parsers, mime-type or whatever) are added to the metadata.

Thus, finally, for instance for content-type we can have something like this in metadata:
key: Content-Type
values:
    * text/plain; charset= ....
    * text/plain
    * text/html

The values are ordered: the first one is the one from the protocol level, and the later values are the ones guessed by various pieces of code. The idea is that the first value is the raw value, and the further you iterate over the values, the more accurate they become (we hope).

For instance, in most of the code the last(CONTENT_TYPE) value will be used, but in Cache.java the first(CONTENT_TYPE) value will be used.
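
As a rough illustration of the ordering described above, a multi-valued store with first() and last() accessors might look like the sketch below. The class name OrderedValuesSketch and its method signatures are assumptions made for this example; they are not taken from the patch.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of an ordered, multi-valued metadata store. */
final class OrderedValuesSketch {
  private final Map<String, List<String>> values = new LinkedHashMap<>();

  /** Appends a value; later values are assumed to be more refined guesses. */
  void add(String name, String value) {
    values.computeIfAbsent(name, k -> new ArrayList<>()).add(value);
  }

  /** Raw, protocol-level value (the first one added), or null if absent. */
  String first(String name) {
    List<String> list = values.get(name);
    return list == null ? null : list.get(0);
  }

  /** Most refined value (the last one added), or null if absent. */
  String last(String name) {
    List<String> list = values.get(name);
    return list == null ? null : list.get(list.size() - 1);
  }
}
```

A caller that wants the header exactly as the server sent it would use first(), and a caller that wants the best available guess would use last().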



[jira] Updated: (NUTCH-139) Standard metadata property names in the ParseData metadata

Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.

     [ http://issues.apache.org/jira/browse/NUTCH-139?page=all ]

Doug Cutting updated NUTCH-139:
-------------------------------

    Comment: was deleted


[jira] Commented: (NUTCH-139) Standard metadata property names in the ParseData metadata

Posted by "Andrzej Bialecki (JIRA)" <ji...@apache.org>.

    [ http://issues.apache.org/jira/browse/NUTCH-139?page=comments#action_12361043 ] 

Andrzej Bialecki  commented on NUTCH-139:
-----------------------------------------

Regarding the move to a class with public static fields: I don't have any problem with that.

Regarding the Levenshtein distance: I think we can do even better, before we resort to such generic methods:

1) bring all property names to lowercase
2) remove any non-letters

Example: "Content-type" vs. "ContentType":

1) "content-type" vs. "contenttype".
2) "contenttype" vs. "contenttype" -> match

These two steps could be simply implemented in a custom Comparator for the ContentProperties.
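
A minimal sketch of such a Comparator, assuming property names are plain Strings. The class name LooseNameComparatorSketch and the canonical() helper are made up for this example and are not part of ContentProperties.

```java
import java.util.Comparator;

/** Hypothetical sketch: compares names after lowercasing and dropping non-letters. */
final class LooseNameComparatorSketch implements Comparator<String> {

  /** Applies both steps: lowercase each letter and drop every non-letter character. */
  static String canonical(String name) {
    StringBuilder sb = new StringBuilder(name.length());
    for (int i = 0; i < name.length(); i++) {
      char c = name.charAt(i);
      if (Character.isLetter(c)) {
        sb.append(Character.toLowerCase(c));
      }
    }
    return sb.toString();
  }

  @Override
  public int compare(String a, String b) {
    return canonical(a).compareTo(canonical(b));
  }
}
```

Passing this comparator to a java.util.TreeMap would make a value stored under "Content-type" retrievable under "ContentType" or "CONTENT_TYPE". One caveat of step 2 as written: digits are non-letters too, so names that differ only by a digit would collide.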


[jira] Commented: (NUTCH-139) Standard metadata property names in the ParseData metadata

Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.

    [ http://issues.apache.org/jira/browse/NUTCH-139?page=comments#action_12362249 ] 

Doug Cutting commented on NUTCH-139:
------------------------------------

Let me try to be more concrete.  I'd prefer that the X-nutch properties be removed from MetadataNames before this is committed, and moved to protocol- and parse-specific files.  Response.java would be a good place for things that are part of HTTP that are mimicked by other protocols.

If you prefer, this could be done subsequently.  So my vote for this patch is currently 0.

MetadataNames.java should probably not be in util, but rather in protocol, near ContentProperties.  In general, we should avoid putting things in util unless they're really generic.  Perhaps we need a new package for metadata?  And ContentProperties could be renamed MetadataProperties.  That change is out of the scope of this patch (there I go again!) but it's best to place new stuff like MetadataNames.java and the protocol-specific property names in the right place to start, rather than have to move them in a subsequent patch.  Then we don't have to change all the import statements again, etc.


Re: [jira] Commented: (NUTCH-139) Standard metadata property names in the ParseData metadata

Posted by Doug Cutting <cu...@nutch.org>.

Andrzej Bialecki wrote:
> Erhm.. please bear with me. I'd rather see these two classes in a 
> separate package altogether, org.apache.nutch.metadata. The reason is 
> that most likely these two classes will be used elsewhere too, not just 
> in the protocol and parse/fetch related context. I'm specifically 
> referring to the CrawlData.

+1

Doug

Re: [jira] Commented: (NUTCH-139) Standard metadata property names in the ParseData metadata

Posted by Andrzej Bialecki <ab...@getopt.org>.

Doug Cutting (JIRA) wrote:
>     [ http://issues.apache.org/jira/browse/NUTCH-139?page=comments#action_12364125 ] 
>
>   

My apologies for commenting here - JIRA produces broken HTML for me, I 
can't use it...

> Doug Cutting commented on NUTCH-139:
> ------------------------------------
>
> I think we're near agreement here.
>
> Here are the changes I think this patch still needs:
>
> MetadataNames belongs in the protocol package, not util.
>   

Erhm.. please bear with me. I'd rather see these two classes in a 
separate package altogether, org.apache.nutch.metadata. The reason is 
that most likely these two classes will be used elsewhere too, not just 
in the protocol and parse/fetch related context. I'm specifically 
referring to the CrawlData.

> We should rename ContentProperties to Metadata.
>   

+1.

> We should add an add() method to Metadata, and change set() to replace all values rather than add a new value.  Protocol code which creates properties from headers should then use add().
>   

+1

> We could commit after simply moving MetadataNames to protocol, and leave the changes to ContentProperties for another commit, but I'd prefer it all be done together.
>   

Either way is fine with me. Perhaps splitting this into two commits 
would make it easier to fix potential breakage...

-- 
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com

[jira] Commented: (NUTCH-139) Standard metadata property names in the ParseData metadata

Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.

    [ http://issues.apache.org/jira/browse/NUTCH-139?page=comments#action_12364125 ] 

Doug Cutting commented on NUTCH-139:
------------------------------------

I think we're near agreement here.

Here are the changes I think this patch still needs:

MetadataNames belongs in the protocol package, not util.

We should rename ContentProperties to Metadata.

We should add an add() method to Metadata, and change set() to replace all values rather than add a new value.  Protocol code which creates properties from headers should then use add().

We could commit after simply moving MetadataNames to protocol, and leave the changes to ContentProperties for another commit, but I'd prefer it all be done together.

Any objections to these changes?
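
For reference, the add()/set() split proposed above could look roughly like the following. This is only a sketch under assumptions: the class name MetadataSketch, the getValues() accessor, and the exact signatures are invented for the example and are not the eventual Metadata API.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of the proposed add()/set() semantics. */
final class MetadataSketch {
  private final Map<String, List<String>> values = new HashMap<>();

  /** Appends a value, keeping any existing ones (what protocol code would call). */
  void add(String name, String value) {
    values.computeIfAbsent(name, k -> new ArrayList<>()).add(value);
  }

  /** Replaces all existing values for the name with this single value. */
  void set(String name, String value) {
    List<String> single = new ArrayList<>();
    single.add(value);
    values.put(name, single);
  }

  /** Returns a read-only view of all values for the name, in insertion order. */
  List<String> getValues(String name) {
    List<String> list = values.get(name);
    return list == null
        ? Collections.<String>emptyList()
        : Collections.unmodifiableList(list);
  }
}
```

Under this reading, code that builds properties from protocol headers calls add() once per header line, and any later set() discards whatever had accumulated for that name.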




[jira] Updated: (NUTCH-139) Standard metadata property names in the ParseData metadata

Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.

     [ http://issues.apache.org/jira/browse/NUTCH-139?page=all ]

Doug Cutting updated NUTCH-139:
-------------------------------

    Comment: was deleted

> Standard metadata property names in the ParseData metadata
> ----------------------------------------------------------
>
>          Key: NUTCH-139
>          URL: http://issues.apache.org/jira/browse/NUTCH-139
>      Project: Nutch
>         Type: Improvement
>   Components: fetcher
>     Versions: 0.7.1, 0.7, 0.6, 0.7.2-dev, 0.8-dev
>  Environment: Power Mac OS X 10.4, Dual Processor G5 2.0 Ghz, 1.5 GB  RAM, although bug is independent of environment
>     Reporter: Chris A. Mattmann
>     Assignee: Chris A. Mattmann
>     Priority: Minor
>      Fix For: 0.7.2-dev, 0.8-dev, 0.7.1, 0.7, 0.6
>  Attachments: NUTCH-139.060105.patch, NUTCH-139.Mattmann.patch.txt, NUTCH-139.jc.review.patch.txt
>


[jira] Commented: (NUTCH-139) Standard metadata property names in the ParseData metadata

Posted by "Andrzej Bialecki (JIRA)" <ji...@apache.org>.

    [ http://issues.apache.org/jira/browse/NUTCH-139?page=comments#action_12360909 ] 

Andrzej Bialecki  commented on NUTCH-139:
-----------------------------------------

Yes, that was again the reason for prefixing - we want to keep as much of the original metadata as we can, to facilitate various processing in plugins that know how to handle unreliable data. But at the same time we need a place to keep the metadata that we believe in.

So, first of all the protocol plugins collect any metadata "as is" from the wire, and put the original values in properties. Then, those plugins that know what they are doing should put the reliable metadata under the "X-nutch-" namespace. If some metadata is required (e.g. content type), then the code that is responsible for handling this metadata should ensure that we have a reliable value under the "X-nutch-" namespace, even if the original value is missing.
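
A minimal sketch of the convention described above, assuming the metadata is held in a plain string map. The class TrustedMetadataSketch, the method putTrusted and the fallback value are hypothetical names introduced for illustration; they are not taken from any Nutch patch.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper for illustration only; not part of Nutch. */
final class TrustedMetadataSketch {
    // The "X-nutch-" namespace is the one named in the comment above.
    static final String NUTCH_PREFIX = "X-nutch-";
    // Assumed spelling of the wire header name.
    static final String CONTENT_TYPE = "Content-Type";

    /**
     * Stores a verified value under the prefixed name and leaves the raw
     * wire value untouched. If nothing was detected, the fallback is
     * stored, so a required property always has a trusted value.
     */
    static void putTrusted(Map<String, String> metadata, String name,
                           String detected, String fallback) {
        String value = (detected != null) ? detected : fallback;
        metadata.put(NUTCH_PREFIX + name, value);
    }

    public static void main(String[] args) {
        Map<String, String> metadata = new LinkedHashMap<>();
        // Raw header exactly as received from the server, possibly wrong.
        metadata.put(CONTENT_TYPE, "text/plain");
        // Value established by, e.g., MIME-type detection.
        putTrusted(metadata, CONTENT_TYPE, "application/msword",
                   "application/octet-stream");
        // Both the original and the trusted value are now available.
        System.out.println(metadata);
    }
}
```

The point of the design is that a plugin that distrusts the detected value can still read the original header, while code that needs a reliable value reads only the prefixed key.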

> Standard metadata property names in the ParseData metadata
> ----------------------------------------------------------
>
>          Key: NUTCH-139
>          URL: http://issues.apache.org/jira/browse/NUTCH-139
>      Project: Nutch
>         Type: Improvement
>   Components: fetcher
>     Versions: 0.7.1, 0.7, 0.6, 0.7.2-dev, 0.8-dev
>  Environment: Power Mac OS X 10.4, Dual Processor G5 2.0 Ghz, 1.5 GB  RAM, although bug is independent of environment
>     Reporter: Chris A. Mattmann
>     Assignee: Chris A. Mattmann
>     Priority: Minor
>      Fix For: 0.7.2-dev, 0.8-dev, 0.7.1, 0.7, 0.6
>  Attachments: NUTCH-139.Mattmann.patch.txt, NUTCH-139.jc.review.patch.txt
>


[jira] Commented: (NUTCH-139) Standard metadata property names in the ParseData metadata

Posted by "Chris A. Mattmann (JIRA)" <ji...@apache.org>.

    [ http://issues.apache.org/jira/browse/NUTCH-139?page=comments#action_12360929 ] 

Chris A. Mattmann commented on NUTCH-139:
-----------------------------------------

Hi Andrzej,

> I have an objection, in fact I think the patches miss the main point of using of prefixed property names.

D'oh!

> In this patch only some of the property names, specifically those corresponding to the Dublin Core, are prefixed with PREFIX. Why? 

Well, the reasoning went like this. I wanted the metadata property names to be reusable across the protocol-level code, the parser code, and pretty much anywhere else that uses what I would call "common" metadata properties in Nutch. Now, at the protocol level especially, there were bits and pieces of code like readHeaders("Content-type") or String someValue = getHeader("Content-length"), where the code was reading properties that had already been written to an object and that Nutch has no control over. In these cases, in order to make all the calls synonymous, e.g., replacing a call to readHeaders("Content-type") with readHeaders(CONTENT_TYPE), I couldn't use the "X-nutch-" prefix on the names, because I didn't write the values into those objects originally.

On the other hand, wherever I was able to add metadata properties that were under our control, at the protocol level, parsing level, etc. (in particular, all of the DC properties), we controlled how they were added to the properties object being passed around: both the input and where it was read. There we could use the "X-nutch-" prefix.

So, in my mind I saw two distinct types of standard metadata properties: those for which we control both the input and output data flow, and those for which we really only control the output data flow.

> The original reason for introducing the prefix was this: as Nutch processes the raw data, it extracts certain metadata, either directly or using heuristics (like with LANG or content type). In order to distinguish these values from the original raw values, the metadata
> processed by Nutch was to be prefixed by "X-nutch-", and all other metadata that we don't use was to be left alone as it was.

This was followed to a T, except for the case I mention above, which you pointed out. For example, what would have happened if I had set CONTENT_TYPE="X_nutch_content_type" and then called getHeaders(CONTENT_TYPE) at the protocol level? Since we never put that prefixed name into the headers properties object, the lookup would find nothing, and everywhere we read CONTENT_TYPE the value would be empty.

> So, e.g. the Content-Type metadata is sometimes wrong. Nutch checks this with e.g. the mime-type detection plugin, and it should 
> put the final value of Content-Type in metadata - but under the name of "X-nutch-Content-Type", in order to avoid overwriting the 
> original value (Chris's comment in MSWordParser.java reflects this doubt - that's the reason for prefixing).

Yup, exactly. Good job catching that comment!

> Now, this convention is not followed in the patches. E.g. LANG is missing (should be PREFIX + "lang"). 

Not sure I follow this one. In the patch, there's a line:

 public static final String LANGUAGE = NUTCH_PREFIX + "language";

?



> CharEncodingForConversion 
> doesn't have a prefix either. Properties extracted in plugins (e.g. msword, zip, file, etc) are put under the standard, non-prefixed 
> names, thus overwriting the original values.

This isn't really true at all. I didn't overwrite any of the original values. In fact, no values are really overwritten at all. There are only two cases really:

1. Places where I standardized how the names are read: you see these at the bottom of MetadataNames.java. These are properties where we don't really control how they got written into the properties object, or where I at least couldn't figure out how they got placed into the properties objects at their particular layers. In this case, I've omitted the NUTCH_PREFIX so that reading (and post-writing) of the properties works.

2. Places where I standardized how the names are read and written. These are at the top of MetadataNames.java. I could trace the entire data flow in and out of the properties objects at the respective layers for all of these properties, and that's why they have the "X-nutch-" prefix. Make sense?
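
To make the two cases concrete, here is a rough sketch of how such a constants class might be laid out. It is an illustrative stand-in, not the contents of MetadataNames.java from the patch: only NUTCH_PREFIX and the LANGUAGE line are mentioned in this thread, while the prefix value "X-nutch-", the "Content-Type" spelling, and the class name MetadataNamesSketch are assumptions.

```java
import java.util.HashMap;
import java.util.Map;

/** Illustrative stand-in for the constants class; the real patch may differ. */
final class MetadataNamesSketch {
    // Assumed prefix value, following the "X-nutch-" namespace in this thread.
    static final String NUTCH_PREFIX = "X-nutch-";

    // Case 2: Nutch both writes and reads this property, so it can be prefixed.
    static final String LANGUAGE = NUTCH_PREFIX + "language";

    // Case 1: written by the remote server, so the name has to match
    // what is actually on the wire and cannot carry the prefix.
    static final String CONTENT_TYPE = "Content-Type";

    public static void main(String[] args) {
        // Headers as a protocol plugin might receive them (simplified to a map).
        Map<String, String> headers = new HashMap<>();
        headers.put("Content-Type", "text/html");

        // The unprefixed constant finds the server-supplied value.
        System.out.println(headers.get(CONTENT_TYPE));

        // Nothing ever wrote a prefixed key into the headers, so this is null.
        System.out.println(headers.get(NUTCH_PREFIX + "Content-Type"));
    }
}
```

Note that a plain HashMap is case-sensitive, unlike HTTP header names, so this sketch says nothing about the case-normalization question raised in the issue description.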





> Standard metadata property names in the ParseData metadata
> ----------------------------------------------------------
>
>          Key: NUTCH-139
>          URL: http://issues.apache.org/jira/browse/NUTCH-139
>      Project: Nutch
>         Type: Improvement
>   Components: fetcher
>     Versions: 0.7.1, 0.7, 0.6, 0.7.2-dev, 0.8-dev
>  Environment: Power Mac OS X 10.4, Dual Processor G5 2.0 Ghz, 1.5 GB  RAM, although bug is independent of environment
>     Reporter: Chris A. Mattmann
>     Assignee: Chris A. Mattmann
>     Priority: Minor
>      Fix For: 0.7.2-dev, 0.8-dev, 0.7.1, 0.7, 0.6
>  Attachments: NUTCH-139.Mattmann.patch.txt, NUTCH-139.jc.review.patch.txt
>


[jira] Updated: (NUTCH-139) Standard metadata property names in the ParseData metadata

Posted by "Jerome Charron (JIRA)" <ji...@apache.org>.

     [ http://issues.apache.org/jira/browse/NUTCH-139?page=all ]

Jerome Charron updated NUTCH-139:
---------------------------------

    Attachment: NUTCH-139.060208.patch

A new patch which I hope is compliant with all our requirements (not tested yet on a wide fetch/index/query cycle)

> Standard metadata property names in the ParseData metadata
> ----------------------------------------------------------
>
>          Key: NUTCH-139
>          URL: http://issues.apache.org/jira/browse/NUTCH-139
>      Project: Nutch
>         Type: Improvement
>   Components: fetcher
>     Versions: 0.7.1, 0.7, 0.6, 0.7.2-dev, 0.8-dev
>  Environment: Power Mac OS X 10.4, Dual Processor G5 2.0 Ghz, 1.5 GB  RAM, although bug is independent of environment
>     Reporter: Chris A. Mattmann
>     Assignee: Chris A. Mattmann
>     Priority: Minor
>      Fix For: 0.7.2-dev, 0.8-dev, 0.7.1, 0.7, 0.6
>  Attachments: NUTCH-139.060105.patch, NUTCH-139.060208.patch
>


[jira] Updated: (NUTCH-139) Standard metadata property names in the ParseData metadata

Posted by "Chris A. Mattmann (JIRA)" <ji...@apache.org>.

     [ http://issues.apache.org/jira/browse/NUTCH-139?page=all ]

Chris A. Mattmann updated NUTCH-139:
------------------------------------

    Priority: Minor  (was: Major)

> Standard metadata property names in the ParseData metadata
> ----------------------------------------------------------
>
>          Key: NUTCH-139
>          URL: http://issues.apache.org/jira/browse/NUTCH-139
>      Project: Nutch
>         Type: Improvement
>   Components: fetcher
>     Versions: 0.7.1, 0.7, 0.6, 0.7.2-dev, 0.8-dev
>  Environment: Power Mac OS X 10.4, Dual Processor G5 2.0 Ghz, 1.5 GB  RAM, although bug is independent of environment
>     Reporter: Chris A. Mattmann
>     Assignee: Chris A. Mattmann
>     Priority: Minor
>      Fix For: 0.7.2-dev, 0.8-dev, 0.7.1, 0.7, 0.6

>


[jira] Commented: (NUTCH-139) Standard metadata property names in the ParseData metadata

Posted by "Jerome Charron (JIRA)" <ji...@apache.org>.

    [ http://issues.apache.org/jira/browse/NUTCH-139?page=comments#action_12361045 ] 

Jerome Charron commented on NUTCH-139:
--------------------------------------

Andrzej,

Do you read my mind?
Yes, of course, that's the way I want to do it: first check the most common cases (lowercase and keep only letters), then use the Levenshtein distance if needed (last chance).
Regards

Jérôme
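
The two-pass matching described in this comment could look roughly like the sketch below. It is written from the description only, not from any attached patch: the class PropertyNameMatcherSketch, the method names, and the maxDistance threshold are all hypothetical, and the edit distance is the textbook dynamic-programming version.

```java
import java.util.List;
import java.util.Locale;

/** Hypothetical matcher for illustration; not taken from the patch. */
final class PropertyNameMatcherSketch {

    /** Lowercases the name and keeps only the letters a-z. */
    static String normalize(String name) {
        return name.toLowerCase(Locale.ROOT).replaceAll("[^a-z]", "");
    }

    /** Textbook Levenshtein edit distance using two rolling rows. */
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // distance from the empty prefix of a
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = (a.charAt(i - 1) == b.charAt(j - 1)) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
                                   prev[j - 1] + cost);
            }
            int[] tmp = prev; // swap rows for the next iteration
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    /**
     * Returns the standard name matching raw, or null if none is close.
     * maxDistance is an assumed tuning knob for the last-chance pass.
     */
    static String resolve(String raw, List<String> standardNames, int maxDistance) {
        String key = normalize(raw);
        // Pass 1: cheap check, exact match after normalization.
        for (String std : standardNames) {
            if (normalize(std).equals(key)) {
                return std;
            }
        }
        // Pass 2 (last chance): closest name within maxDistance edits.
        String best = null;
        int bestDistance = maxDistance + 1;
        for (String std : standardNames) {
            int d = levenshtein(key, normalize(std));
            if (d < bestDistance) {
                bestDistance = d;
                best = std;
            }
        }
        return best;
    }
}
```

With a list containing "Content-Type", the first pass already handles variants such as "conTeNT_TyPE" and "Content     Type" from the issue description; the distance pass is only reached for misspellings.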

> Standard metadata property names in the ParseData metadata
> ----------------------------------------------------------
>
>          Key: NUTCH-139
>          URL: http://issues.apache.org/jira/browse/NUTCH-139
>      Project: Nutch
>         Type: Improvement
>   Components: fetcher
>     Versions: 0.7.1, 0.7, 0.6, 0.7.2-dev, 0.8-dev
>  Environment: Power Mac OS X 10.4, Dual Processor G5 2.0 Ghz, 1.5 GB  RAM, although bug is independent of environment
>     Reporter: Chris A. Mattmann
>     Assignee: Chris A. Mattmann
>     Priority: Minor
>      Fix For: 0.7.2-dev, 0.8-dev, 0.7.1, 0.7, 0.6
>  Attachments: NUTCH-139.Mattmann.patch.txt, NUTCH-139.jc.review.patch.txt
>
