You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@nutch.apache.org by "Sami Siren (JIRA)" <ji...@apache.org> on 2006/10/24 18:15:52 UTC

[jira] Closed: (NUTCH-139) Standard metadata property names in the ParseData metadata

     [ http://issues.apache.org/jira/browse/NUTCH-139?page=all ]

Sami Siren closed NUTCH-139.
----------------------------


> Standard metadata property names in the ParseData metadata
> ----------------------------------------------------------
>
>                 Key: NUTCH-139
>                 URL: http://issues.apache.org/jira/browse/NUTCH-139
>             Project: Nutch
>          Issue Type: Improvement
>          Components: fetcher
>    Affects Versions: 0.7.2, 0.7.1, 0.7, 0.6, 0.8
>         Environment: Power Mac OS X 10.4, Dual Processor G5 2.0 Ghz, 1.5 GB  RAM, although bug is independent of environment
>            Reporter: Chris A. Mattmann
>         Assigned To: Chris A. Mattmann
>            Priority: Minor
>             Fix For: 0.8
>
>         Attachments: NUTCH-139.060105.patch, NUTCH-139.060208.patch
>
>
> Currently, people are free to name their string-based properties anything that they want, such as having names of "Content-type", "content-TyPe", "CONTENT_TYPE" all having the same meaning. Stefan G. I believe proposed a solution in which all property names be converted to lower case, but in essence this really only fixes half the problem right (the case of identifying that "CONTENT_TYPE"
> and "conTeNT_TyPE" and all the permutations are really the same). What about
> if I named it "Content     Type", or "ContentType"?
>  I propose that a way to correct this would be to create a standard set of named Strings in the ParseData class that the protocol framework and the parsing framework could use to identify common properties such as "Content-type", "Creator", "Language", etc.
>  The properties would be defined at the top of the ParseData class, something like:
>  public class ParseData{
>    .....
>     public static final String CONTENT_TYPE = "content-type";
>     public static final String CREATOR = "creator";
>    ....
> }
> In this fashion, users could at least know what the name of the standard properties that they can obtain from the ParseData are, for example by making a call to ParseData.getMetadata().get(ParseData.CONTENT_TYPE) to get the content type or a call to ParseData.getMetadata().set(ParseData.CONTENT_TYPE, "text/xml"); Of course, this wouldn't preclude users from doing what they are currently doing, it would just provide a standard method of obtaining some of the more common, critical metadata without pouring over the code base to figure out what they are named.
> I'll contribute a patch near the end of the this week, or beg. of next week that addresses this issue.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira