Posted to dev@nutch.apache.org by "Stefan Groschupf (JIRA)" <ji...@apache.org> on 2006/01/31 01:17:32 UTC

[jira] Created: (NUTCH-192) meta data support for CrawlDatum

meta data support for CrawlDatum
--------------------------------

         Key: NUTCH-192
         URL: http://issues.apache.org/jira/browse/NUTCH-192
     Project: Nutch
        Type: Improvement
    Versions: 0.8-dev    
    Reporter: Stefan Groschupf
     Fix For: 0.8-dev


Supporting meta data in CrawlDatum would help realize a set of new Nutch features and would make a lot possible for smaller, specially focused search engines.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira


[jira] Updated: (NUTCH-192) meta data support for CrawlDatum

Posted by "Stefan Groschupf (JIRA)" <ji...@apache.org>.
     [ http://issues.apache.org/jira/browse/NUTCH-192?page=all ]

Stefan Groschupf updated NUTCH-192:
-----------------------------------

    Attachment: metadata010206.patch

As discussed...

> meta data support for CrawlDatum
> --------------------------------
>
>          Key: NUTCH-192
>          URL: http://issues.apache.org/jira/browse/NUTCH-192
>      Project: Nutch
>         Type: Improvement
>     Versions: 0.8-dev
>     Reporter: Stefan Groschupf
>      Fix For: 0.8-dev
>  Attachments: metadata010206.patch, metadata300106.patch, metadata310106.patch
>
> Supporting meta data in CrawlDatum would help to get a set of new nutch features realized and makes a lot possible to smaller special focused search engines.



[jira] Commented: (NUTCH-192) meta data support for CrawlDatum

Posted by "Andrzej Bialecki (JIRA)" <ji...@apache.org>.
    [ http://issues.apache.org/jira/browse/NUTCH-192?page=comments#action_12365648 ] 

Andrzej Bialecki  commented on NUTCH-192:
-----------------------------------------

Looks good to me, too. If there are no further objections, I can commit this latest patch, modulo some minor whitespace changes.

> meta data support for CrawlDatum
> --------------------------------
>
>          Key: NUTCH-192
>          URL: http://issues.apache.org/jira/browse/NUTCH-192
>      Project: Nutch
>         Type: Improvement
>     Versions: 0.8-dev
>     Reporter: Stefan Groschupf
>      Fix For: 0.8-dev
>  Attachments: metadata010206.patch, metadata060206.patch, metadata08_02_06_FULL.patch, metadata300106.patch, metadata310106.patch
>
> Supporting meta data in CrawlDatum would help to get a set of new nutch features realized and makes a lot possible to smaller special focused search engines.



[jira] Commented: (NUTCH-192) meta data support for CrawlDatum

Posted by "Andrzej Bialecki (JIRA)" <ji...@apache.org>.
    [ http://issues.apache.org/jira/browse/NUTCH-192?page=comments#action_12365413 ] 

Andrzej Bialecki  commented on NUTCH-192:
-----------------------------------------

I have a different opinion on this (I think MapWritable is a sufficiently general-purpose data structure that would be useful in Hadoop), but we can always move it later. I'd like to finalize this issue, though - are we happy with the MapWritable as it is in the last patch, so that we apply this patch and move on?  I vote +1.

> meta data support for CrawlDatum
> --------------------------------
>
>          Key: NUTCH-192
>          URL: http://issues.apache.org/jira/browse/NUTCH-192
>      Project: Nutch
>         Type: Improvement
>     Versions: 0.8-dev
>     Reporter: Stefan Groschupf
>      Fix For: 0.8-dev
>  Attachments: metadata010206.patch, metadata060206.patch, metadata300106.patch, metadata310106.patch
>
> Supporting meta data in CrawlDatum would help to get a set of new nutch features realized and makes a lot possible to smaller special focused search engines.



[jira] Commented: (NUTCH-192) meta data support for CrawlDatum

Posted by "Stefan Groschupf (JIRA)" <ji...@apache.org>.
    [ http://issues.apache.org/jira/browse/NUTCH-192?page=comments#action_12365364 ] 

Stefan Groschupf commented on NUTCH-192:
----------------------------------------

If this looks acceptable to you, I can port MapWritable to Hadoop...

> meta data support for CrawlDatum
> --------------------------------
>
>          Key: NUTCH-192
>          URL: http://issues.apache.org/jira/browse/NUTCH-192
>      Project: Nutch
>         Type: Improvement
>     Versions: 0.8-dev
>     Reporter: Stefan Groschupf
>      Fix For: 0.8-dev
>  Attachments: metadata010206.patch, metadata060206.patch, metadata300106.patch, metadata310106.patch
>
> Supporting meta data in CrawlDatum would help to get a set of new nutch features realized and makes a lot possible to smaller special focused search engines.



[jira] Commented: (NUTCH-192) meta data support for CrawlDatum

Posted by "Stefan Groschupf (JIRA)" <ji...@apache.org>.
    [ http://issues.apache.org/jira/browse/NUTCH-192?page=comments#action_12364795 ] 

Stefan Groschupf commented on NUTCH-192:
----------------------------------------

A perfect plan, I will do that and commit a new patch. :)
THANKS!

> meta data support for CrawlDatum
> --------------------------------
>
>          Key: NUTCH-192
>          URL: http://issues.apache.org/jira/browse/NUTCH-192
>      Project: Nutch
>         Type: Improvement
>     Versions: 0.8-dev
>     Reporter: Stefan Groschupf
>      Fix For: 0.8-dev
>  Attachments: metadata300106.patch, metadata310106.patch
>
> Supporting meta data in CrawlDatum would help to get a set of new nutch features realized and makes a lot possible to smaller special focused search engines.



[jira] Commented: (NUTCH-192) meta data support for CrawlDatum

Posted by "Stefan Groschupf (JIRA)" <ji...@apache.org>.
    [ http://issues.apache.org/jira/browse/NUTCH-192?page=comments#action_12364699 ] 

Stefan Groschupf commented on NUTCH-192:
----------------------------------------

* plus whatever it takes to put the class name->id mapping in the MapWritable header (the mapping table): let's assume 40 bytes. 

I do not write the mapping table to the output stream in any form; for now the ID is calculated from a hash of the class name.
I will change this so that the mapping is part of the class, where I will manually assign LongWritable id = (byte)1, UTF8 id = (byte)2, etc.

For example, writing a long (e.g. a timestamp) as UTF8 takes 15 bytes, while writing it as a LongWritable takes 8 bytes.
8 bytes plus 1 byte for the class type is 9 bytes, i.e. 60 % of the space required when using a String.

I guess the main misunderstanding is that I do not write the class-to-id map into the stream at any time.
Does that make sense?
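
The size arithmetic above can be checked with a small sketch. It uses java.io.DataOutputStream as a stand-in for the Writable classes: writeUTF emits a 2-byte length followed by the encoded characters, which matches the 15 bytes quoted for a 13-digit timestamp, and writeLong always emits 8 bytes. The class name, the method names, the sample timestamp and the type ID value 1 are assumptions made for illustration; this is not code from the patch.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical helper, not part of the patch: compares encoded sizes.
class TimestampSizeSketch {

  // Size of the timestamp written as a length-prefixed string.
  static int sizeAsString(long timestamp) {
    try {
      ByteArrayOutputStream bytes = new ByteArrayOutputStream();
      DataOutputStream out = new DataOutputStream(bytes);
      out.writeUTF(Long.toString(timestamp)); // 2-byte length + one byte per digit
      out.flush();
      return bytes.size();
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  // Size of the timestamp written as a one-byte type ID plus a raw long.
  static int sizeAsTypedLong(long timestamp) {
    try {
      ByteArrayOutputStream bytes = new ByteArrayOutputStream();
      DataOutputStream out = new DataOutputStream(bytes);
      out.writeByte(1);         // assumed type ID for a long value
      out.writeLong(timestamp); // always 8 bytes
      out.flush();
      return bytes.size();
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  public static void main(String[] args) {
    long timestamp = 1138666652000L; // a 13-digit millisecond timestamp
    System.out.println(sizeAsString(timestamp));    // 15
    System.out.println(sizeAsTypedLong(timestamp)); // 9
  }
}
```

With these assumptions the typed long needs 9 of the 15 bytes the string form needs, which is the 60 % figure mentioned above.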
 


> meta data support for CrawlDatum
> --------------------------------
>
>          Key: NUTCH-192
>          URL: http://issues.apache.org/jira/browse/NUTCH-192
>      Project: Nutch
>         Type: Improvement
>     Versions: 0.8-dev
>     Reporter: Stefan Groschupf
>      Fix For: 0.8-dev
>  Attachments: metadata300106.patch
>
> Supporting meta data in CrawlDatum would help to get a set of new nutch features realized and makes a lot possible to smaller special focused search engines.



[jira] Closed: (NUTCH-192) meta data support for CrawlDatum

Posted by "Andrzej Bialecki (JIRA)" <ji...@apache.org>.
     [ http://issues.apache.org/jira/browse/NUTCH-192?page=all ]
     
Andrzej Bialecki  closed NUTCH-192:
-----------------------------------

    Resolution: Fixed

Applied. Thank you!

> meta data support for CrawlDatum
> --------------------------------
>
>          Key: NUTCH-192
>          URL: http://issues.apache.org/jira/browse/NUTCH-192
>      Project: Nutch
>         Type: Improvement
>     Versions: 0.8-dev
>     Reporter: Stefan Groschupf
>      Fix For: 0.8-dev
>  Attachments: metadata010206.patch, metadata060206.patch, metadata08_02_06_FULL.patch, metadata300106.patch, metadata310106.patch
>
> Supporting meta data in CrawlDatum would help to get a set of new nutch features realized and makes a lot possible to smaller special focused search engines.



[jira] Updated: (NUTCH-192) meta data support for CrawlDatum

Posted by "Stefan Groschupf (JIRA)" <ji...@apache.org>.
     [ http://issues.apache.org/jira/browse/NUTCH-192?page=all ]

Stefan Groschupf updated NUTCH-192:
-----------------------------------

    Attachment: metadata060206.patch

Doug, did you mean something like this?
Writing 1 million maps (each with one tuple [int key, long value]) into a sequence file that uses an int key takes around 5400 ms on my box.
Writing 1 million int keys with UTF8 values into a sequence file took pretty much the same time.
However, reading UTF8 requires only 60 % of the time I need to read the map. This may be because UTF8 just reads a byte array and converts it to a string only when toString is called. If I call toString in my test, then reading UTF8 is slower than reading the map.
So another possible improvement could be to read just a byte array into the map and to parse this byte array only when the first get method is called.
This can save some time when processing CrawlDatum in situations where we do not need to access the meta data at all.
However, reading and writing 10 million maps with one key-value tuple each can be done in less than a minute on my desktop box.
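
The lazy-parsing idea mentioned above could be sketched roughly as follows. This is an illustration only, not code from the patch: the class LazyMetaDataSketch, its methods and the fixed [int key, long value] layout are assumptions, and a real stream would also need a length prefix so the reader knows how many raw bytes to keep.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: a map of int keys to long values that is parsed lazily.
class LazyMetaDataSketch {
  private byte[] raw;                // serialized entries, kept until first access
  private Map<Integer, Long> parsed; // null until get() is first called
  int parseCount = 0;                // counts how often the bytes were decoded

  // Serializes a map as: entry count, then (int key, long value) pairs.
  static byte[] encode(Map<Integer, Long> entries) {
    try {
      ByteArrayOutputStream bytes = new ByteArrayOutputStream();
      DataOutputStream out = new DataOutputStream(bytes);
      out.writeInt(entries.size());
      for (Map.Entry<Integer, Long> e : entries.entrySet()) {
        out.writeInt(e.getKey());
        out.writeLong(e.getValue());
      }
      out.flush();
      return bytes.toByteArray();
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  // Cheap "read": only remembers the bytes, no per-entry objects are created.
  void readRaw(byte[] data) {
    raw = data;
    parsed = null;
  }

  // The first lookup pays the parsing cost; later lookups reuse the result.
  Long get(int key) {
    if (raw == null) return null; // nothing has been read yet
    if (parsed == null) {
      parsed = decode(raw);
      parseCount++;
    }
    return parsed.get(key);
  }

  private static Map<Integer, Long> decode(byte[] data) {
    try {
      DataInputStream in = new DataInputStream(new ByteArrayInputStream(data));
      int size = in.readInt();
      Map<Integer, Long> result = new HashMap<>();
      for (int i = 0; i < size; i++) {
        int key = in.readInt();
        long value = in.readLong();
        result.put(key, value);
      }
      return result;
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }
}
```

A caller that never invokes get() never triggers decode(), which is the case the comment above wants to make cheap.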

> meta data support for CrawlDatum
> --------------------------------
>
>          Key: NUTCH-192
>          URL: http://issues.apache.org/jira/browse/NUTCH-192
>      Project: Nutch
>         Type: Improvement
>     Versions: 0.8-dev
>     Reporter: Stefan Groschupf
>      Fix For: 0.8-dev
>  Attachments: metadata010206.patch, metadata060206.patch, metadata300106.patch, metadata310106.patch
>
> Supporting meta data in CrawlDatum would help to get a set of new nutch features realized and makes a lot possible to smaller special focused search engines.



[jira] Updated: (NUTCH-192) meta data support for CrawlDatum

Posted by "Stefan Groschupf (JIRA)" <ji...@apache.org>.
     [ http://issues.apache.org/jira/browse/NUTCH-192?page=all ]

Stefan Groschupf updated NUTCH-192:
-----------------------------------

    Attachment: metadata300106.patch

Attached a first suggestion for a patch adding meta data support to CrawlDatum.
In general, I created a MapWritable and added it to the CrawlDatum. If no meta data is added to the CrawlDatum, only one more int is written to the output stream. The MapWritable works like a HashMap but requires Writables as keys and values. Besides the key and value sizes, it writes two additional ints into the stream to identify the classes of the key and the value. If we change WritableName a bit more, we can reduce that to two additional bytes for storing the classes (this would limit us, but I guess we will never have that many Writable object types. :-o). However, I started with a patch that changes as little as possible, and I'm sure there is room for improvement. So feedback and improvement suggestions are welcome.
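
To make the space cost described above concrete, here is a rough sketch of such a layout. It is not taken from the patch: the class MapLayoutSketch, the field order, the ID values and the fixed [int key, long value] entry type are all assumptions. It only illustrates that an empty map costs a single int and that switching the class identifiers from ints to bytes saves 6 bytes per entry.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical layout sketch; the IDs and the exact field order are assumptions.
class MapLayoutSketch {
  static final int KEY_CLASS_ID = 1;   // assumed ID for the key class
  static final int VALUE_CLASS_ID = 2; // assumed ID for the value class

  // Encodes (int key, long value) entries and returns the number of bytes written.
  static int encodedSize(int[] keys, long[] values, boolean byteIds) {
    try {
      ByteArrayOutputStream bytes = new ByteArrayOutputStream();
      DataOutputStream out = new DataOutputStream(bytes);
      out.writeInt(keys.length); // the only field written for an empty map
      for (int i = 0; i < keys.length; i++) {
        if (byteIds) {
          out.writeByte(KEY_CLASS_ID);   // 1 byte per class ID
          out.writeByte(VALUE_CLASS_ID);
        } else {
          out.writeInt(KEY_CLASS_ID);    // 4 bytes per class ID
          out.writeInt(VALUE_CLASS_ID);
        }
        out.writeInt(keys[i]);
        out.writeLong(values[i]);
      }
      out.flush();
      return bytes.size();
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  public static void main(String[] args) {
    System.out.println(encodedSize(new int[0], new long[0], false));        // 4
    System.out.println(encodedSize(new int[] {1}, new long[] {2L}, false)); // 24
    System.out.println(encodedSize(new int[] {1}, new long[] {2L}, true));  // 18
  }
}
```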



> meta data support for CrawlDatum
> --------------------------------
>
>          Key: NUTCH-192
>          URL: http://issues.apache.org/jira/browse/NUTCH-192
>      Project: Nutch
>         Type: Improvement
>     Versions: 0.8-dev
>     Reporter: Stefan Groschupf
>      Fix For: 0.8-dev
>  Attachments: metadata300106.patch
>
> Supporting meta data in CrawlDatum would help to get a set of new nutch features realized and makes a lot possible to smaller special focused search engines.



[jira] Commented: (NUTCH-192) meta data support for CrawlDatum

Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.
    [ http://issues.apache.org/jira/browse/NUTCH-192?page=comments#action_12364674 ] 

Doug Cutting commented on NUTCH-192:
------------------------------------

I agree that Writable is probably overkill, that strings should be sufficient.

A mapping dictionary would save a lot of space, even with strings.  This could be a useful optimization, but should be left until after the initial (less optimized) addition of metadata to CrawlDatum.

> meta data support for CrawlDatum
> --------------------------------
>
>          Key: NUTCH-192
>          URL: http://issues.apache.org/jira/browse/NUTCH-192
>      Project: Nutch
>         Type: Improvement
>     Versions: 0.8-dev
>     Reporter: Stefan Groschupf
>      Fix For: 0.8-dev
>  Attachments: metadata300106.patch
>
> Supporting meta data in CrawlDatum would help to get a set of new nutch features realized and makes a lot possible to smaller special focused search engines.



[jira] Commented: (NUTCH-192) meta data support for CrawlDatum

Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.
    [ http://issues.apache.org/jira/browse/NUTCH-192?page=comments#action_12365618 ] 

Doug Cutting commented on NUTCH-192:
------------------------------------

Since these mappings are not something that users should alter, I'm not sure they should be in the config file.  I added related mappings to static code in NutchConfigurable.  Every Nutch invocation should reference that class, so adding registrations there ensures they'll always be executed.  So, in any case, if they're loaded from a resource (config file or otherwise) the loading should probably happen in NutchConfigurable.  Putting it in a static block there means it isn't reloaded for each configuration, but, e.g., if plugins need to register new mappings, then perhaps we'll need to reload these resources each time a configuration is constructed.

> meta data support for CrawlDatum
> --------------------------------
>
>          Key: NUTCH-192
>          URL: http://issues.apache.org/jira/browse/NUTCH-192
>      Project: Nutch
>         Type: Improvement
>     Versions: 0.8-dev
>     Reporter: Stefan Groschupf
>      Fix For: 0.8-dev
>  Attachments: metadata010206.patch, metadata060206.patch, metadata300106.patch, metadata310106.patch
>
> Supporting meta data in CrawlDatum would help to get a set of new nutch features realized and makes a lot possible to smaller special focused search engines.



[jira] Commented: (NUTCH-192) meta data support for CrawlDatum

Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.
    [ http://issues.apache.org/jira/browse/NUTCH-192?page=comments#action_12365643 ] 

Doug Cutting commented on NUTCH-192:
------------------------------------

+1 This looks good to me. Thanks for your persistence.

> meta data support for CrawlDatum
> --------------------------------
>
>          Key: NUTCH-192
>          URL: http://issues.apache.org/jira/browse/NUTCH-192
>      Project: Nutch
>         Type: Improvement
>     Versions: 0.8-dev
>     Reporter: Stefan Groschupf
>      Fix For: 0.8-dev
>  Attachments: metadata010206.patch, metadata060206.patch, metadata08_02_06_FULL.patch, metadata300106.patch, metadata310106.patch
>
> Supporting meta data in CrawlDatum would help to get a set of new nutch features realized and makes a lot possible to smaller special focused search engines.



[jira] Updated: (NUTCH-192) meta data support for CrawlDatum

Posted by "Stefan Groschupf (JIRA)" <ji...@apache.org>.
     [ http://issues.apache.org/jira/browse/NUTCH-192?page=all ]

Stefan Groschupf updated NUTCH-192:
-----------------------------------

    Attachment: metadata08_02_06_FULL.patch

Please remove the last patch; I had attached the wrong file. This file is the patch containing all new classes and changes. Sorry.

> meta data support for CrawlDatum
> --------------------------------
>
>          Key: NUTCH-192
>          URL: http://issues.apache.org/jira/browse/NUTCH-192
>      Project: Nutch
>         Type: Improvement
>     Versions: 0.8-dev
>     Reporter: Stefan Groschupf
>      Fix For: 0.8-dev
>  Attachments: metadata010206.patch, metadata060206.patch, metadata08_02_06.patch, metadata08_02_06_FULL.patch, metadata300106.patch, metadata310106.patch
>
> Supporting meta data in CrawlDatum would help to get a set of new nutch features realized and makes a lot possible to smaller special focused search engines.



[jira] Commented: (NUTCH-192) meta data support for CrawlDatum

Posted by "Andrzej Bialecki (JIRA)" <ji...@apache.org>.
    [ http://issues.apache.org/jira/browse/NUTCH-192?page=comments#action_12364782 ] 

Andrzej Bialecki  commented on NUTCH-192:
-----------------------------------------

There is a very real hazard in the fact that we don't store the dictionary. Let's consider this example: two plugins invoke WritableName.setName() with different classes, ClassA and ClassB. We get the mapping ClassA -> 23, ClassB -> 24. The files written by these plugins use just the byte IDs, 23 and 24. Then someone changes the config file, and plugins are initialized in reverse order, so consequently we get ClassB -> 23, ClassA -> 24. And now the plugins cannot read the files they created because of the wrong class returned from MapWritable ...
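
The hazard described above can be reproduced with a small stand-in registry. This is a sketch under assumptions, not the WritableName code: the class OrderDependentRegistry and its methods are hypothetical, class names are plain strings, and the IDs start at 0 rather than at 23.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a registry that hands out IDs in registration order.
class OrderDependentRegistry {
  private final List<String> idToName = new ArrayList<>();

  // Assigns the next free ID to the class name, or returns the existing one.
  byte register(String className) {
    int existing = idToName.indexOf(className);
    if (existing >= 0) return (byte) existing;
    idToName.add(className);
    return (byte) (idToName.size() - 1);
  }

  // Resolves an ID that was read from a file back to a class name.
  String nameFor(byte id) {
    return idToName.get(id);
  }

  public static void main(String[] args) {
    OrderDependentRegistry writer = new OrderDependentRegistry();
    byte idOfA = writer.register("ClassA"); // gets ID 0
    writer.register("ClassB");              // gets ID 1

    // Same classes, but the plugins are initialized in reverse order.
    OrderDependentRegistry reader = new OrderDependentRegistry();
    reader.register("ClassB");
    reader.register("ClassA");

    // The ID stored on disk for ClassA now resolves to ClassB.
    System.out.println(reader.nameFor(idOfA)); // prints ClassB
  }
}
```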

So, I'm still convinced that we need to save the dictionary. Unfortunately, for small amounts of metadata (typical use case) it blows up the on-disk size of MapWritable, which is why I thought using Strings would be cheaper ...

Other things: In the javadoc for MapWritable it should be mentioned that any Writable type that one is going to use needs to be first registered with the WritableName.setName(). Or perhaps the method could do it automatically, but then the IDs will be unpredictable, depending on the order of iteration (which leads to the problem described above).

Also, there is a bug in setName(): if you try adding the same mapping twice (which could happen in different places), the method should allocate just one ID for the class. As it is now, it will allocate a new ID each time you call the method, even if the class name is the same. Just add this:

   public static synchronized void setName(Class writableClass, String name) {
     // put() returns the previous name, or null if the class was not yet registered
     Object o = CLASS_TO_NAME.put(writableClass, name);
     NAME_TO_CLASS.put(name, writableClass);
     if (o != null) return; // already has an ID
     // first registration: allocate the next free ID in both directions
     CLASS_TO_ID.put(writableClass, new Byte((byte)CLASS_TO_ID.size()));
     ID_TO_CLASS.put(new Byte((byte)ID_TO_CLASS.size()), writableClass);
   }

> meta data support for CrawlDatum
> --------------------------------
>
>          Key: NUTCH-192
>          URL: http://issues.apache.org/jira/browse/NUTCH-192
>      Project: Nutch
>         Type: Improvement
>     Versions: 0.8-dev
>     Reporter: Stefan Groschupf
>      Fix For: 0.8-dev
>  Attachments: metadata300106.patch, metadata310106.patch
>
> Supporting meta data in CrawlDatum would help to get a set of new nutch features realized and makes a lot possible to smaller special focused search engines.



[jira] Commented: (NUTCH-192) meta data support for CrawlDatum

Posted by "Andrzej Bialecki (JIRA)" <ji...@apache.org>.
    [ http://issues.apache.org/jira/browse/NUTCH-192?page=comments#action_12365536 ] 

Andrzej Bialecki  commented on NUTCH-192:
-----------------------------------------

Yes, that's an issue - due to the way WritableName is initialized it's difficult to add more mappings later, in a predictable fashion. Unless we read the mappings from an external resource, like a Configuration property.

We can proceed in two ways now - either we commit to Nutch all parts that we have (and move the Nutch-specific mappings to be initialized in MapWritable), or we expend additional effort to clean it up, load the mappings from a resource, and then commit MapWritable-related parts to Hadoop.

> meta data support for CrawlDatum
> --------------------------------
>
>          Key: NUTCH-192
>          URL: http://issues.apache.org/jira/browse/NUTCH-192
>      Project: Nutch
>         Type: Improvement
>     Versions: 0.8-dev
>     Reporter: Stefan Groschupf
>      Fix For: 0.8-dev
>  Attachments: metadata010206.patch, metadata060206.patch, metadata300106.patch, metadata310106.patch
>
> Supporting meta data in CrawlDatum would help to get a set of new nutch features realized and makes a lot possible to smaller special focused search engines.



[jira] Commented: (NUTCH-192) meta data support for CrawlDatum

Posted by "Andrzej Bialecki (JIRA)" <ji...@apache.org>.
    [ http://issues.apache.org/jira/browse/NUTCH-192?page=comments#action_12364542 ] 

Andrzej Bialecki  commented on NUTCH-192:
-----------------------------------------

I have two comments:

* it's not obvious to me what the strong arguments in favor of storing Writables are. I'd think that for the vast majority of applications Strings are sufficient, which would simplify the code and save a lot of space (at the cost of possible serialization from non-string values, in rare cases).

* if we really, really need Writables, then perhaps it would be better to store the mapping dictionary <class names, ids>, and then use a single byte as an id. I don't think one would need more than 256 different classes in a MapWritable, and this way we could avoid that static mapping table (which I'm afraid would cause its own problems with changes and versioning).

> meta data support for CrawlDatum
> --------------------------------
>
>          Key: NUTCH-192
>          URL: http://issues.apache.org/jira/browse/NUTCH-192
>      Project: Nutch
>         Type: Improvement
>     Versions: 0.8-dev
>     Reporter: Stefan Groschupf
>      Fix For: 0.8-dev
>  Attachments: metadata300106.patch
>
> Supporting meta data in CrawlDatum would help to get a set of new nutch features realized and makes a lot possible to smaller special focused search engines.



[jira] Commented: (NUTCH-192) meta data support for CrawlDatum

Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.
    [ http://issues.apache.org/jira/browse/NUTCH-192?page=comments#action_12364923 ] 

Doug Cutting commented on NUTCH-192:
------------------------------------

I'm worried that this will substantially slow things.

I'd like to see some effort made to ensure that:

1. If no metadata is used, then no MapWritables should be allocated.

2. If readFields() is called repeatedly on a single CrawlDatum instance, as few new objects should be allocated as possible.  If MapWritable were to extend HashMap rather than wrap it, and MapWritable.readFields() first called clear(), then the HashMap's entry table could be reused.  Better yet would be to try to reuse the entries in the table.  If an entry exists with the same classes, then it and its key and value instances could be reused.  This optimization would require the use of a more extensible HashMap, perhaps like that in Jakarta Commons Collections.  Alternately, one could use a linked list instead of a HashMap, which should be plenty fast for things this size.

If an entry were defined as:

class Entry {
  Writable key;
  Writable value;
  Entry next;
}

Then MapWritable could have fields:
  Entry first;
  Entry last;
  Entry old;

clear() would set old=first; and first=last=null.
allocateEntry(Class keyClass, Class valueClass) would scan old, splicing out and returning the first entry whose classes match these.  If none is found then a new entry would be allocated.
readFields() would first identify each key and value class, call allocateEntry(), then call entry.key.readFields() and entry.value.readFields() and finally set last.next=entry and last=entry.
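
The linked-list scheme outlined above could look roughly like the following. It is a sketch under assumptions and not the code from any attached patch: MiniWritable stands in for the read side of Writable, IntBox and LongBox stand in for concrete Writable types, the type IDs 1 and 2 and the helpers encodeSingle() and readFrom() are hypothetical, and the allocations counter exists only to show the reuse.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical sketch of the entry-recycling list; not the code from the patch.
class RecyclingMapSketch {
  // Minimal stand-in for the read side of Writable.
  interface MiniWritable { void readFields(DataInput in) throws IOException; }

  static class IntBox implements MiniWritable {
    int value;
    public void readFields(DataInput in) throws IOException { value = in.readInt(); }
  }

  static class LongBox implements MiniWritable {
    long value;
    public void readFields(DataInput in) throws IOException { value = in.readLong(); }
  }

  static class Entry {
    MiniWritable key;
    MiniWritable value;
    Entry next;
  }

  Entry first, last, old;
  int allocations = 0; // number of Entry objects created so far

  // Keeps the previous entries around so that readFields() can recycle them.
  void clear() {
    old = first;
    first = last = null;
  }

  // Returns a recycled entry whose key and value classes match, else a new one.
  Entry allocateEntry(Class<?> keyClass, Class<?> valueClass) {
    Entry prev = null;
    for (Entry e = old; e != null; prev = e, e = e.next) {
      if (e.key.getClass() == keyClass && e.value.getClass() == valueClass) {
        if (prev == null) old = e.next; else prev.next = e.next; // splice out
        e.next = null;
        return e;
      }
    }
    Entry fresh = new Entry();
    fresh.key = newInstance(keyClass);
    fresh.value = newInstance(valueClass);
    allocations++;
    return fresh;
  }

  void readFields(DataInput in) throws IOException {
    clear();
    int size = in.readInt();
    for (int i = 0; i < size; i++) {
      Class<?> keyClass = classFor(in.readByte());
      Class<?> valueClass = classFor(in.readByte());
      Entry entry = allocateEntry(keyClass, valueClass);
      entry.key.readFields(in);
      entry.value.readFields(in);
      if (last == null) first = entry; else last.next = entry; // append
      last = entry;
    }
  }

  // Assumed fixed type IDs: 1 = IntBox, 2 = LongBox.
  static Class<?> classFor(byte id) {
    if (id == 1) return IntBox.class;
    if (id == 2) return LongBox.class;
    throw new IllegalArgumentException("unknown type id: " + id);
  }

  static MiniWritable newInstance(Class<?> c) {
    if (c == IntBox.class) return new IntBox();
    if (c == LongBox.class) return new LongBox();
    throw new IllegalArgumentException("unregistered class: " + c.getName());
  }

  // Convenience for trying it out: one (int key, long value) entry as bytes.
  static byte[] encodeSingle(int key, long value) {
    try {
      ByteArrayOutputStream bytes = new ByteArrayOutputStream();
      DataOutputStream out = new DataOutputStream(bytes);
      out.writeInt(1);   // entry count
      out.writeByte(1);  // key type: IntBox
      out.writeByte(2);  // value type: LongBox
      out.writeInt(key);
      out.writeLong(value);
      return bytes.toByteArray();
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  void readFrom(byte[] data) {
    try {
      readFields(new DataInputStream(new ByteArrayInputStream(data)));
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }
}
```

Calling readFrom() twice with entries of the same key and value classes leaves allocations at 1, because the second call splices the existing entry out of old and overwrites its key and value in place.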

Also, why does MapWritable.write() create a DataOutputBuffer?  It should just write to out.




> meta data support for CrawlDatum
> --------------------------------
>
>          Key: NUTCH-192
>          URL: http://issues.apache.org/jira/browse/NUTCH-192
>      Project: Nutch
>         Type: Improvement
>     Versions: 0.8-dev
>     Reporter: Stefan Groschupf
>      Fix For: 0.8-dev
>  Attachments: metadata010206.patch, metadata300106.patch, metadata310106.patch
>
> Supporting meta data in CrawlDatum would help to get a set of new nutch features realized and makes a lot possible to smaller special focused search engines.



[jira] Updated: (NUTCH-192) meta data support for CrawlDatum

Posted by "Stefan Groschupf (JIRA)" <ji...@apache.org>.
     [ http://issues.apache.org/jira/browse/NUTCH-192?page=all ]

Stefan Groschupf updated NUTCH-192:
-----------------------------------

    Attachment: metadata310106.patch

Each key and value now takes 1 byte for the class type plus the size of the type itself, which means we can have keys and values of only 2 bytes each in the map.

> meta data support for CrawlDatum
> --------------------------------
>
>          Key: NUTCH-192
>          URL: http://issues.apache.org/jira/browse/NUTCH-192
>      Project: Nutch
>         Type: Improvement
>     Versions: 0.8-dev
>     Reporter: Stefan Groschupf
>      Fix For: 0.8-dev
>  Attachments: metadata300106.patch, metadata310106.patch
>
> Supporting meta data in CrawlDatum would help to get a set of new nutch features realized and makes a lot possible to smaller special focused search engines.



[jira] Commented: (NUTCH-192) meta data support for CrawlDatum

Posted by "Stefan Groschupf (JIRA)" <ji...@apache.org>.
    [ http://issues.apache.org/jira/browse/NUTCH-192?page=comments#action_12365368 ] 

Stefan Groschupf commented on NUTCH-192:
----------------------------------------

Makes sense; then only the package needs to be changed, since 'io' was moved to Hadoop. I was just guessing that a writable map could be a useful type for MapReduce users other than Nutch, e.g. in the area of bioinformatics.

> meta data support for CrawlDatum
> --------------------------------
>
>          Key: NUTCH-192
>          URL: http://issues.apache.org/jira/browse/NUTCH-192
>      Project: Nutch
>         Type: Improvement
>     Versions: 0.8-dev
>     Reporter: Stefan Groschupf
>      Fix For: 0.8-dev
>  Attachments: metadata010206.patch, metadata060206.patch, metadata300106.patch, metadata310106.patch
>
> Supporting meta data in CrawlDatum would help to get a set of new nutch features realized and makes a lot possible to smaller special focused search engines.



[jira] Commented: (NUTCH-192) meta data support for CrawlDatum

Posted by "Stefan Groschupf (JIRA)" <ji...@apache.org>.
    [ http://issues.apache.org/jira/browse/NUTCH-192?page=comments#action_12364788 ] 

Stefan Groschupf commented on NUTCH-192:
----------------------------------------

That's true. In any case I don't want to store the class-id map, since if we do that, you are right, we can use strings.
What do you think about having a map in the MapWritable itself where we manually assign IDs? This was my plan at the very beginning, but I thought that using WritableName would be better; of course I overlooked the problems you mentioned.
Do you think having a static block in the MapWritable like this will solve our problems?
CACHE.put(LongWritable.class, new Byte((byte) 1));
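
A fuller version of such a static block might look like the sketch below. It is only an illustration: the class FixedIdTableSketch and its methods are hypothetical, and JDK classes (Long, String) stand in for LongWritable and UTF8 so that the example is self-contained. The point is that the IDs are fixed in source code and therefore do not depend on the order in which plugins are loaded.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: IDs are fixed in source code, independent of load order.
class FixedIdTableSketch {
  private static final Map<Class<?>, Byte> CLASS_TO_ID = new HashMap<>();
  private static final Map<Byte, Class<?>> ID_TO_CLASS = new HashMap<>();

  static {
    // JDK classes stand in for LongWritable, UTF8, etc.
    add(Long.class, (byte) 1);
    add(String.class, (byte) 2);
  }

  // Registers one mapping and rejects accidental reuse of an ID or a class.
  private static void add(Class<?> clazz, byte id) {
    if (ID_TO_CLASS.containsKey(id) || CLASS_TO_ID.containsKey(clazz)) {
      throw new IllegalStateException("duplicate mapping for " + clazz.getName());
    }
    CLASS_TO_ID.put(clazz, id);
    ID_TO_CLASS.put(id, clazz);
  }

  static byte idFor(Class<?> clazz) {
    Byte id = CLASS_TO_ID.get(clazz);
    if (id == null) {
      throw new IllegalArgumentException("no ID assigned to " + clazz.getName());
    }
    return id;
  }

  static Class<?> classFor(byte id) {
    Class<?> clazz = ID_TO_CLASS.get(id);
    if (clazz == null) {
      throw new IllegalArgumentException("unknown ID " + id);
    }
    return clazz;
  }
}
```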

Thanks for taking time to discuss this.

> meta data support for CrawlDatum
> --------------------------------
>
>          Key: NUTCH-192
>          URL: http://issues.apache.org/jira/browse/NUTCH-192
>      Project: Nutch
>         Type: Improvement
>     Versions: 0.8-dev
>     Reporter: Stefan Groschupf
>      Fix For: 0.8-dev
>  Attachments: metadata300106.patch, metadata310106.patch
>
> Supporting meta data in CrawlDatum would help to get a set of new nutch features realized and makes a lot possible to smaller special focused search engines.



[jira] Commented: (NUTCH-192) meta data support for CrawlDatum

Posted by "Andrzej Bialecki (JIRA)" <ji...@apache.org>.
    [ http://issues.apache.org/jira/browse/NUTCH-192?page=comments#action_12364694 ] 

Andrzej Bialecki  commented on NUTCH-192:
-----------------------------------------

What I meant was that both keys and values should be Strings (or rather UTF8), for the sake of simplicity. Let's take your example: if we use Writables, then to store 1 ByteWritable you need:

* 1 byte - type id
* 1 byte - value
* plus whatever it takes to put the class name->id mapping in the MapWritable header (the mapping table): let's assume 40 bytes.

For storing one value that is a substantial overhead. For storing hundreds of values, the overhead per value approaches 1 byte asymptotically, because the mapping table is written only once.

So, the question really is what is the typical use scenario that we want to optimize: whether you intend to store hundreds of metadata values of different types, or just a couple. If the former, then using MapWritable makes sense, if the latter - using Strings is simpler.
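The arithmetic above can be sketched in a few lines of Java. This is only an illustration: the class and method names are hypothetical, and the 40-byte header is the assumption from the comment, taken to stay constant as more values of the same types are added.

```java
/** Hypothetical helper for the overhead estimate; not part of Nutch or Hadoop. */
class MapOverhead {
    // Assumed size of the class name -> id mapping table (the "let's assume 40 bytes" above).
    static final int HEADER_BYTES = 40;
    // One type-id byte is written next to every stored value.
    static final int TYPE_ID_BYTES = 1;

    /** Average overhead in bytes per value when n values share a single header. */
    static double overheadPerValue(int n) {
        if (n <= 0) {
            throw new IllegalArgumentException("n must be positive");
        }
        return TYPE_ID_BYTES + (double) HEADER_BYTES / n;
    }

    public static void main(String[] args) {
        // Print the amortized overhead for a few map sizes.
        for (int n : new int[] {1, 10, 100, 1000}) {
            System.out.printf("%4d values: %.2f bytes overhead per value%n",
                    n, overheadPerValue(n));
        }
    }
}
```

With a single value the overhead is 41 bytes; with 40 values it is already down to 2 bytes per value, which is why the answer depends on how many values a typical CrawlDatum would carry.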

> meta data support for CrawlDatum
> --------------------------------
>
>          Key: NUTCH-192
>          URL: http://issues.apache.org/jira/browse/NUTCH-192
>      Project: Nutch
>         Type: Improvement
>     Versions: 0.8-dev
>     Reporter: Stefan Groschupf
>      Fix For: 0.8-dev
>  Attachments: metadata300106.patch
>
> Supporting meta data in CrawlDatum would help to get a set of new nutch features realized and makes a lot possible to smaller special focused search engines.



[jira] Commented: (NUTCH-192) meta data support for CrawlDatum

Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.
    [ http://issues.apache.org/jira/browse/NUTCH-192?page=comments#action_12365366 ] 

Doug Cutting commented on NUTCH-192:
------------------------------------

No, the stuff you're doing with MapWritable has nothing to do with Hadoop, but is all to support features you're adding to Nutch.  So, if anywhere, it belongs somewhere in Nutch.

> meta data support for CrawlDatum
> --------------------------------
>
>          Key: NUTCH-192
>          URL: http://issues.apache.org/jira/browse/NUTCH-192
>      Project: Nutch
>         Type: Improvement
>     Versions: 0.8-dev
>     Reporter: Stefan Groschupf
>      Fix For: 0.8-dev
>  Attachments: metadata010206.patch, metadata060206.patch, metadata300106.patch, metadata310106.patch
>
> Supporting meta data in CrawlDatum would help to get a set of new nutch features realized and makes a lot possible to smaller special focused search engines.



[jira] Commented: (NUTCH-192) meta data support for CrawlDatum

Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.
    [ http://issues.apache.org/jira/browse/NUTCH-192?page=comments#action_12365450 ] 

Doug Cutting commented on NUTCH-192:
------------------------------------

Sorry, I misspoke and overstated things too.  There are problems, but not with MapWritable, rather with WritableName: this refers to some Nutch classes that are not in Hadoop.  Aside from that, I agree that MapWritable could be generally useful.  Sorry I wasn't thinking clearly when I made my previous comment.


> meta data support for CrawlDatum
> --------------------------------
>
>          Key: NUTCH-192
>          URL: http://issues.apache.org/jira/browse/NUTCH-192
>      Project: Nutch
>         Type: Improvement
>     Versions: 0.8-dev
>     Reporter: Stefan Groschupf
>      Fix For: 0.8-dev
>  Attachments: metadata010206.patch, metadata060206.patch, metadata300106.patch, metadata310106.patch
>
> Supporting meta data in CrawlDatum would help to get a set of new nutch features realized and makes a lot possible to smaller special focused search engines.



[jira] Commented: (NUTCH-192) meta data support for CrawlDatum

Posted by "Stefan Groschupf (JIRA)" <ji...@apache.org>.
    [ http://issues.apache.org/jira/browse/NUTCH-192?page=comments#action_12364683 ] 

Stefan Groschupf commented on NUTCH-192:
----------------------------------------

Andrzej, Doug, I'm not sure I understand you correctly: do you suggest having string keys and values, or just string keys?
It confuses me a bit, and I'm afraid I may be misunderstanding things because of my English, since I remember that one reason for having no meta data until today was performance and the size of the data.
In one of my personal use cases I have a set of meta data with definitely fewer than 255 entries, and I only need to store some long values.
So I would love to use key:ByteWritable and value:LongWritable.

Storing new LongWritable(23) or new UTF8("23") should make a significant difference in size. Also, parsing a byte, int or long from a string takes some time.
At least there is a nice side effect: since this map is also a Writable, we can store a Map in a Map, which allows hierarchical meta data.

I fully agree with having a manually created mapping table stored in the MapWritable class, and I will change this and commit a new patch.
Thanks for your comments!
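The size question raised above can be checked with plain java.io. This is a rough sketch, not a measurement of the Nutch classes: it assumes that a long-valued Writable serializes like DataOutput.writeLong and that a string value serializes like DataOutput.writeUTF (a 2-byte length prefix followed by the encoded characters). The class and method names are hypothetical.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

/** Hypothetical size comparison using java.io as a stand-in for the Writable types. */
class EncodedSize {
    /** Bytes used by a fixed-width binary long. */
    static int binaryLongSize(long value) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buf)) {
            out.writeLong(value);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buf.size();
    }

    /** Bytes used by the decimal string form: 2-byte length prefix plus the digits. */
    static int stringSize(long value) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buf)) {
            out.writeUTF(Long.toString(value));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buf.size();
    }

    public static void main(String[] args) {
        // Compare a small value with the largest possible long.
        for (long v : new long[] {23L, Long.MAX_VALUE}) {
            System.out.println(v + ": binary=" + binaryLongSize(v)
                    + " bytes, string=" + stringSize(v) + " bytes");
        }
    }
}
```

Under these assumptions the binary form is always 8 bytes, while the string form grows with the number of digits: 4 bytes for 23 and 21 bytes for Long.MAX_VALUE. So the size advantage of a binary long depends on the magnitude of the stored values, while the parsing cost of the string form applies in every case.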

> meta data support for CrawlDatum
> --------------------------------
>
>          Key: NUTCH-192
>          URL: http://issues.apache.org/jira/browse/NUTCH-192
>      Project: Nutch
>         Type: Improvement
>     Versions: 0.8-dev
>     Reporter: Stefan Groschupf
>      Fix For: 0.8-dev
>  Attachments: metadata300106.patch
>
> Supporting meta data in CrawlDatum would help to get a set of new nutch features realized and makes a lot possible to smaller special focused search engines.



[jira] Updated: (NUTCH-192) meta data support for CrawlDatum

Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.
     [ http://issues.apache.org/jira/browse/NUTCH-192?page=all ]

Doug Cutting updated NUTCH-192:
-------------------------------

    Attachment:     (was: metadata08_02_06.patch)

> meta data support for CrawlDatum
> --------------------------------
>
>          Key: NUTCH-192
>          URL: http://issues.apache.org/jira/browse/NUTCH-192
>      Project: Nutch
>         Type: Improvement
>     Versions: 0.8-dev
>     Reporter: Stefan Groschupf
>      Fix For: 0.8-dev
>  Attachments: metadata010206.patch, metadata060206.patch, metadata08_02_06_FULL.patch, metadata300106.patch, metadata310106.patch
>
> Supporting meta data in CrawlDatum would help to get a set of new nutch features realized and makes a lot possible to smaller special focused search engines.



[jira] Commented: (NUTCH-192) meta data support for CrawlDatum

Posted by "Andrzej Bialecki (JIRA)" <ji...@apache.org>.
    [ http://issues.apache.org/jira/browse/NUTCH-192?page=comments#action_12364791 ] 

Andrzej Bialecki  commented on NUTCH-192:
-----------------------------------------

We could take a middle ground: write out only the non-standard parts of the dictionary. In the vast majority of cases this is equivalent to not writing the dictionary, and in rare cases we still have this flexibility.

First we would need to encode the standard dictionary inside WritableName (I think it's a better place than MapWritable), but using a separate API so that it's clear you cannot extend it accidentally just by calling setName. I.e. something like this, in the WritableName.<clinit>:

    WritableName.setName(NullWritable.class, "null");
    WritableName.setID(NullWritable.class, 0);
    WritableName.setName(LongWritable.class, "long");
    WritableName.setID(LongWritable.class, 1);
   ...

* in WritableName.setID complain loudly if you overwrite an already existing ID.

* then in MapWritable use these 1-byte "standard IDs" as before. However:

* inside write(), first check that all types that the MapWritable uses for keys and values are registered in WritableName. For any non-registered types create a private additional dictionary, with IDs starting in the range above the latest "standard ID". This dictionary we will have to write to the output, if not empty.

* then inside write() write out all values and keys as before, using both the standard IDs and non-standard ones from the dictionary.
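The steps above could look roughly like the following self-contained sketch. Everything here is an assumption made for illustration: TypeIds is a hypothetical stand-in for the extended WritableName, setID mirrors the proposed call but is not an existing method, and plain JDK classes (Void, Long) take the place of NullWritable and LongWritable so the example compiles without Hadoop or Nutch on the classpath.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical stand-in for the proposed registry; this is not the real WritableName API. */
class TypeIds {
    private static final Map<Class<?>, Byte> STANDARD = new HashMap<>();
    private static final Map<Byte, Class<?>> BY_ID = new HashMap<>();
    private static int highestStandardId = -1;

    static {
        // Placeholder JDK classes stand in for NullWritable and LongWritable.
        setID(Void.class, 0);
        setID(Long.class, 1);
    }

    /** Registers a standard ID; complains loudly instead of overwriting an existing entry. */
    static synchronized void setID(Class<?> type, int id) {
        if (id < 0 || id > Byte.MAX_VALUE) {
            throw new IllegalArgumentException("ID out of range: " + id);
        }
        byte key = (byte) id;
        if (BY_ID.containsKey(key) || STANDARD.containsKey(type)) {
            throw new IllegalStateException("Conflicting registration for ID " + id);
        }
        STANDARD.put(type, key);
        BY_ID.put(key, type);
        highestStandardId = Math.max(highestStandardId, id);
    }

    /** Assigns private IDs, above the highest standard ID, to types without a standard ID. */
    static synchronized Map<Class<?>, Byte> privateDictionary(Class<?>... usedTypes) {
        Map<Class<?>, Byte> extra = new LinkedHashMap<>();
        int next = highestStandardId + 1;
        for (Class<?> type : usedTypes) {
            if (STANDARD.containsKey(type) || extra.containsKey(type)) {
                continue; // already covered, nothing to add for this type
            }
            if (next > Byte.MAX_VALUE) {
                throw new IllegalStateException("No free 1-byte IDs left");
            }
            extra.put(type, (byte) next++);
        }
        return extra;
    }
}
```

In this sketch, a write() implementation would call privateDictionary with the key and value types it holds and emit the returned map only when it is non-empty, which matches the "write out only the non-standard parts" idea.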

> meta data support for CrawlDatum
> --------------------------------
>
>          Key: NUTCH-192
>          URL: http://issues.apache.org/jira/browse/NUTCH-192
>      Project: Nutch
>         Type: Improvement
>     Versions: 0.8-dev
>     Reporter: Stefan Groschupf
>      Fix For: 0.8-dev
>  Attachments: metadata300106.patch, metadata310106.patch
>
> Supporting meta data in CrawlDatum would help to get a set of new nutch features realized and makes a lot possible to smaller special focused search engines.



[jira] Updated: (NUTCH-192) meta data support for CrawlDatum

Posted by "Stefan Groschupf (JIRA)" <ji...@apache.org>.
     [ http://issues.apache.org/jira/browse/NUTCH-192?page=all ]

Stefan Groschupf updated NUTCH-192:
-----------------------------------

    Attachment: metadata08_02_06.patch

Doug, I'm afraid there is a misunderstanding, or maybe I just do not understand your comments.
A plugin never needs to add a class-id mapping anymore. The later patches (after Andrzej's suggestions) can handle any kind of Writable. In case the class is not known in the mapping, we create an internal id-class tuple and write it to, or read it from, the 'header' of each MapWritable. So users can use any kind of custom Writable; this just takes some more space in the file (one byte for the id and a UTF8 for the class name). In case there is a frequently used new Writable, we can add it to the mapping.

So, as suggested, I moved the mapping from WritableName into a static block of MapWritable, and in case unknown Writables are used we read/write a header containing this id-class tuple. From my point of view this is the best solution for now, and I don't think we will have new, frequently used Writables that often.
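As a rough illustration of the header described above, the following sketch writes and reads id-class tuples with plain java.io. It is not taken from the patch: the class name HeaderSketch, the leading count byte, and the use of writeUTF for the class name are assumptions made so the example is self-contained.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical header codec for id -> class name tuples of unknown Writables. */
class HeaderSketch {
    /** Writes an entry count (assumed layout), then one id byte plus class name per entry. */
    static byte[] writeHeader(Map<Byte, String> unknown) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buf)) {
            out.writeByte(unknown.size());
            for (Map.Entry<Byte, String> entry : unknown.entrySet()) {
                out.writeByte(entry.getKey());   // one byte for the id
                out.writeUTF(entry.getValue());  // length-prefixed class name
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buf.toByteArray();
    }

    /** Reads the tuples back in the order they were written. */
    static Map<Byte, String> readHeader(byte[] bytes) {
        Map<Byte, String> result = new LinkedHashMap<>();
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(bytes))) {
            int count = in.readUnsignedByte();
            for (int i = 0; i < count; i++) {
                byte id = in.readByte();
                String className = in.readUTF();
                result.put(id, className);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return result;
    }
}
```

With this layout a map that uses only known Writables pays a single count byte, and each unknown class costs one id byte plus its length-prefixed name, which is the "some more space in the file" trade-off mentioned above.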


> meta data support for CrawlDatum
> --------------------------------
>
>          Key: NUTCH-192
>          URL: http://issues.apache.org/jira/browse/NUTCH-192
>      Project: Nutch
>         Type: Improvement
>     Versions: 0.8-dev
>     Reporter: Stefan Groschupf
>      Fix For: 0.8-dev
>  Attachments: metadata010206.patch, metadata060206.patch, metadata08_02_06.patch, metadata300106.patch, metadata310106.patch
>
> Supporting meta data in CrawlDatum would help to get a set of new nutch features realized and makes a lot possible to smaller special focused search engines.
