Posted to dev@gora.apache.org by "Alfonso Nishikawa (JIRA)" <ji...@apache.org> on 2015/03/03 23:09:06 UTC

[jira] [Comment Edited] (GORA-401) Serialization and deserialization of Persistent does not hold the entity dirty state from Map to Reduce

    [ https://issues.apache.org/jira/browse/GORA-401?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14345854#comment-14345854 ] 

Alfonso Nishikawa edited comment on GORA-401 at 3/3/15 10:08 PM:
-----------------------------------------------------------------

I uploaded [GORA-401v5.patch|https://issues.apache.org/jira/secure/attachment/12702265/GORA-401v5.patch], which has the missing ASF header in `MockPersistent` and is a bit cleaner.
About the testing stores:

- MemStore just uses a Map internally, so it does not distinguish dirty fields from clean ones when "writing" (actually a `#put()` into a Map).
- AvroStore does not write values and uses DatumWriter, so in this case it does not take the dirty bytes into account.

For reference, HBaseStore iterates over the fields (actually the top-most fields of the Avro schema) and persists those that are dirty.
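
To make the difference concrete, here is a minimal sketch in plain Java of a write that persists only dirty top-level fields, which is the behaviour described for HBaseStore. It does not use the Gora or HBase APIs: `Entity`, `putDirtyFields` and the row layout are hypothetical stand-ins, assumed only for illustration.

```java
import java.util.BitSet;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

public class DirtyWriteSketch {

    /** Hypothetical stand-in for a Persistent: field values plus one dirty bit per field. */
    static final class Entity {
        final String[] fieldNames;
        final Object[] values;
        final BitSet dirty = new BitSet();

        Entity(String... fieldNames) {
            this.fieldNames = fieldNames;
            this.values = new Object[fieldNames.length];
        }

        /** Setting a field stores the value and marks that field as dirty. */
        void set(int index, Object value) {
            values[index] = value;
            dirty.set(index);
        }
    }

    /** Iterates over the top-level fields and writes only those flagged as dirty. */
    static void putDirtyFields(Map<String, Map<String, Object>> rows, String key, Entity e) {
        Map<String, Object> row = rows.computeIfAbsent(key, k -> new LinkedHashMap<>());
        for (int i = 0; i < e.fieldNames.length; i++) {
            if (e.dirty.get(i)) {
                row.put(e.fieldNames[i], e.values[i]);
            }
        }
    }

    public static void main(String[] args) {
        // Existing row in the (assumed) datastore.
        Map<String, Map<String, Object>> rows = new HashMap<>();
        Map<String, Object> existing = new LinkedHashMap<>();
        existing.put("status", "fetched");
        existing.put("content", "<html>");
        rows.put("row1", existing);

        // Only "status" was loaded and modified; "content" stays clean and unset.
        Entity e = new Entity("status", "content");
        e.set(0, "parsed");
        putDirtyFields(rows, "row1", e);
        System.out.println(rows.get("row1")); // "content" is left untouched

        // If every field is flagged dirty, the unset "content" is written too.
        e.dirty.set(1);
        putDirtyFields(rows, "row1", e);
        System.out.println(rows.get("row1")); // "content" has been overwritten with null
    }
}
```

A store that simply puts the whole entity into a Map never consults the dirty bits, so a test against such a store cannot tell these two cases apart.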

In conclusion, I am not satisfied with this patch (dirty, MR tests in HBase, etc.). People do not seem to have problems with what has been reported here, do they? What should we do? :P

Thanks!



> Serialization and deserialization of Persistent does not hold the entity dirty state from Map to Reduce
> -------------------------------------------------------------------------------------------------------
>
>                 Key: GORA-401
>                 URL: https://issues.apache.org/jira/browse/GORA-401
>             Project: Apache Gora
>          Issue Type: Bug
>          Components: gora-core
>    Affects Versions: 0.4, 0.5
>         Environment: Tested on gora-0.4, but seems logically to hold on gora-0.5. HBase backend.
>            Reporter: Alfonso Nishikawa
>            Assignee: Alfonso Nishikawa
>            Priority: Critical
>              Labels: serialization
>             Fix For: 0.7
>
>         Attachments: GORA-401-tests.patch, GORA-401v1.patch, GORA-401v2.patch, GORA-401v3.patch, GORA-401v4.patch, GORA-401v5.patch
>
>   Original Estimate: 35h
>          Time Spent: 21h
>  Remaining Estimate: 14h
>
> After the removal of the __g__dirty field in GORA-326, the dirty field is no longer serialized. In GORA-321, {{[PersistentSerializer|https://github.com/apache/gora/blob/master/gora-core/src/main/java/org/apache/gora/mapreduce/PersistentSerializer.java]}} went from using {{[PersistentDatumWriter|https://github.com/apache/gora/blob/apache-gora-0.3/gora-core/src/main/java/org/apache/gora/avro/PersistentDatumWriter.java](/Reader)}} to Avro's {{SpecificDatumWriter}}, delegating the serialization of the dirty field to Avro (although it is not really desirable to have that field as a main field in the entities).
> The proposal is to reintroduce the {{PersistentDatumWriter/Reader}} which will serialize the internal fields of the entities.
> This bug affects, for example, Nutch, which loads only some fields in its phases, serializes entities (from Map to Reduce), and on deserialization finds all fields marked as "dirty", regardless of which fields were modified in the Map, and so overwrites all data in the datastore (deleting many things: downloaded content, parsed content, etc.).
> This effect can be seen in {{TestPersistentSerialization#testSerderEmployeeTwoFields}}, when debugging in {{TestIOUtils#testSerializeDeserialize}}. Proper breakpoints and inspections show that entities are "equal" when their fields are equal. This is fine as a definition of "equal", but another test must be added to check that serialization and deserialization keep the dirty state.
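
As a rough illustration of the proposal in the description above, the following plain-Java sketch writes the dirty bits ahead of the field values and restores them on read, so the dirty state survives a round trip. It uses only `java.io` and `java.util.BitSet`, not Avro or Gora; the `Entity` class and the wire layout are assumptions made for the example and are not what `PersistentDatumWriter/Reader` actually does.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Arrays;
import java.util.BitSet;

public class DirtyStateSerdeSketch {

    /** Hypothetical entity: nullable string fields plus one dirty bit per field. */
    static final class Entity {
        final String[] values;
        final BitSet dirty;

        Entity(String[] values, BitSet dirty) {
            this.values = values;
            this.dirty = dirty;
        }
    }

    /** Writes the dirty bits first, then each field with a presence flag. */
    static byte[] serialize(Entity e) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            byte[] dirtyBytes = e.dirty.toByteArray();
            out.writeInt(dirtyBytes.length);
            out.write(dirtyBytes);
            out.writeInt(e.values.length);
            for (String value : e.values) {
                out.writeBoolean(value != null);
                if (value != null) {
                    out.writeUTF(value);
                }
            }
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
        return bytes.toByteArray();
    }

    /** Reads the dirty bits back instead of marking every field as dirty. */
    static Entity deserialize(byte[] data) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
            byte[] dirtyBytes = new byte[in.readInt()];
            in.readFully(dirtyBytes);
            String[] values = new String[in.readInt()];
            for (int i = 0; i < values.length; i++) {
                values[i] = in.readBoolean() ? in.readUTF() : null;
            }
            return new Entity(values, BitSet.valueOf(dirtyBytes));
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
    }

    public static void main(String[] args) {
        BitSet dirty = new BitSet();
        dirty.set(1); // only the second field was modified in the "Map" phase
        Entity before = new Entity(new String[] {"id-1", "parsed", null}, dirty);
        Entity after = deserialize(serialize(before));
        System.out.println("values equal: " + Arrays.equals(before.values, after.values));
        System.out.println("dirty equal:  " + before.dirty.equals(after.dirty));
    }
}
```

The additional test asked for in the description would correspond to the second check here: comparing the dirty bits before and after the round trip, not only the field values.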



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)