Posted to dev@avro.apache.org by "Jeremy Hinegardner (JIRA)" <ji...@apache.org> on 2011/01/03 06:09:45 UTC

[jira] Created: (AVRO-724) C implementation does not write datum values that are larger than the memory write buffer (currently 16K)

C implementation does not write datum values that are larger than the memory write buffer (currently 16K)
---------------------------------------------------------------------------------------------------------

                 Key: AVRO-724
                 URL: https://issues.apache.org/jira/browse/AVRO-724
             Project: Avro
          Issue Type: Bug
          Components: c
    Affects Versions: 1.4.1
            Reporter: Jeremy Hinegardner


The current C implementation does not allow for datum values greater than 16K.

The {{avro_file_writer_append}} function flushes blocks to disk over time, but does not handle the case of a single datum being larger than {{avro_file_writer_t.datum_buffer}}.  This is noted in the source code:

{code:title=datafile.c:294-313}
int avro_file_writer_append(avro_file_writer_t w, avro_datum_t datum)
{
    int rval;
    if (!w || !datum) {
        return EINVAL;
    }
    rval = avro_write_data(w->datum_writer, w->writers_schema, datum);
    if (rval) {
        /* The buffer is full; flush the current block and retry. */
        check(rval, file_write_block(w));
        rval =
            avro_write_data(w->datum_writer, w->writers_schema, datum);
        if (rval) {
            /* TODO: if the encoded datum is larger than our buffer,
               just write a single large datum */
            return rval;
        }
    }
    w->block_count++;
    return 0;
}
{code}
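One way to attack the TODO above is to stop using a fixed-size buffer at all.  The sketch below is purely illustrative and none of these names are part of the Avro C API: a heap-backed buffer that doubles until the encoded datum fits, so a datum larger than the initial 16K no longer fails.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical sketch, not Avro C source: a heap-backed datum buffer
 * that grows geometrically until `needed` bytes fit. */
typedef struct {
    char  *buf;
    size_t capacity;
} datum_buffer_t;

/* Returns 0 on success, nonzero on allocation failure. */
static int datum_buffer_reserve(datum_buffer_t *b, size_t needed)
{
    size_t cap = b->capacity ? b->capacity : 16 * 1024;  /* start at the old fixed size */
    while (cap < needed) {
        cap *= 2;                  /* doubling keeps appends O(1) amortized */
    }
    if (cap != b->capacity) {
        char *p = realloc(b->buf, cap);  /* realloc(NULL, n) acts like malloc(n) */
        if (!p) {
            return 1;
        }
        b->buf = p;
        b->capacity = cap;
    }
    return 0;
}
```

With this in place, {{avro_file_writer_append}} could reserve the encoded size before retrying instead of returning an error.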

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] [Commented] (AVRO-724) C implementation does not write datum values that are larger than the memory write buffer (currently 16K)

Posted by "Doug Cutting (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/AVRO-724?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13168536#comment-13168536 ] 

Doug Cutting commented on AVRO-724:
-----------------------------------

> The ideal solution would be to have fixed length block header fields but that would require a change to the spec.

This makes sense.  To do this we'd probably want to increment the file format's magic number, i.e., from {'O','b','j',1} to {'O','b','j',2}.  And it would be best to update all implementations to read the new format before making it the default for any implementation.
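On the read side, the version bump Doug describes could be recognized with a check along these lines.  This is a hypothetical sketch, not actual Avro C source; the function name is invented, though the {'O','b','j',1} magic matches the spec.

```c
#include <string.h>

/* Hypothetical reader-side check: classify the 4-byte file magic.
 * Returns the format version (1 = current, 2 = the proposed bump),
 * or -1 if the file is not an Avro object container at all. */
static int magic_version(const char header[4])
{
    if (memcmp(header, "Obj", 3) != 0) {
        return -1;                 /* not an Avro object container file */
    }
    return header[3];              /* version byte: 1 today, 2 hypothetically */
}
```

Readers updated this way would accept both formats during the transition Doug suggests, before any writer defaults to the new one.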
                


[jira] Commented: (AVRO-724) C implementation does not write datum values that are larger than the memory write buffer (currently 16K)

Posted by "Matt Massie (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/AVRO-724?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12976904#action_12976904 ] 

Matt Massie commented on AVRO-724:
----------------------------------

Jeremy-

I'm familiar with the code for the C implementation since I wrote most of it.

If you look around line 277 of datafile.c you'll find the file_write_block() function.  This function writes a "block" of datums to a datafile.  From the spec:

{quote}
A file data block consists of:

A long indicating the count of objects in this block.
A long indicating the size in bytes of the serialized objects in the current block, after any codec is applied
The serialized objects. If a codec is specified, this is compressed by that codec.
The file's 16-byte sync marker.
{quote}

Unfortunately, the longs in the block header are zig-zag encoded and variable length (sigh), which makes it impossible to simply write the objects to disk immediately and then seek back to the block count/size offset to update it later.  To work around the file format, I just kept everything in memory and then flushed the block to disk all at once.
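The variable-length problem is easy to see from the encoding itself.  The sketch below implements the spec's zig-zag/varint encoding of a long (the function name is ours, not Avro's): a block count of 3 takes 1 header byte, while 100,000 takes 3, so the header's width isn't known until the block is closed.

```c
#include <stdint.h>

/* Avro's binary encoding of a long, per the spec: zig-zag the sign
 * into the low bit, then emit 7-bit groups with a continuation bit.
 * Writes 1..10 bytes into `out` and returns the count. */
static int encode_long(int64_t n, uint8_t *out)
{
    uint64_t z = ((uint64_t)n << 1) ^ (uint64_t)(n >> 63);  /* zig-zag */
    int len = 0;
    while (z > 0x7F) {
        out[len++] = (uint8_t)(z & 0x7F) | 0x80;  /* more bytes follow */
        z >>= 7;
    }
    out[len++] = (uint8_t)z;       /* final byte, high bit clear */
    return len;
}
```

Because the header's width depends on the values written into it, there is no fixed offset to seek back to and patch.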

The fastest workaround to the problem (assuming your objects are small enough to fit in memory) would be to simply increase the size of the memory buffer.  The avro_file_writer_t in datafile.c looks like the following:
{code}
struct avro_file_writer_t_ {
        avro_schema_t writers_schema;
        avro_writer_t writer;
        char sync[16];
        int block_count;
        avro_writer_t datum_writer;
        char datum_buffer[16 * 1024];
};
{code}

This is where the 16K limitation comes from.  Instead of having this static buffer, you could redefine it to be a 'char *' and allocate the memory from the heap using malloc() in avro_file_writer_create().
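That suggestion could look like the following sketch, using a trimmed-down stand-in for the struct (the real one also carries the schema, writer, and sync fields, and would use avro_malloc rather than plain malloc):

```c
#include <stdlib.h>

/* Trimmed-down, hypothetical stand-in for avro_file_writer_t_ with the
 * datum buffer moved to the heap at a caller-chosen size. */
struct file_writer_sketch {
    char  *datum_buffer;           /* was: char datum_buffer[16 * 1024] */
    size_t datum_buffer_size;
    int    block_count;
};

/* What avro_file_writer_create() would do: allocate the buffer up front.
 * Returns 0 on success, nonzero on allocation failure. */
static int file_writer_sketch_init(struct file_writer_sketch *w, size_t bufsize)
{
    w->datum_buffer = malloc(bufsize);
    if (!w->datum_buffer) {
        return 1;
    }
    w->datum_buffer_size = bufsize;
    w->block_count = 0;
    return 0;
}
```

A matching destroy function would free the buffer; callers who know their largest datum could then size the buffer accordingly.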

If your objects are too big for memory, an alternative (and better) solution would be to use a temporary file-backed datum_writer in the avro_file_writer_t instead of a memory-backed one.  When you were later ready to incorporate the objects into the datafile, you would (1) write the block count/size, (2) do a byte copy of the temporary file to the datafile, (3) write the sync marker and (4) truncate the temporary file for use in the next block.
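The four steps above can be sketched with plain stdio.  This is illustrative only: the block header here is a human-readable stand-in, since the real header longs would be zig-zag encoded, and a real implementation would truncate the temporary file rather than just rewinding it.

```c
#include <stdio.h>

/* Hypothetical flush of one block: datums have accumulated in `tmp`;
 * copy them into `datafile` framed by a header and sync marker. */
static int flush_block(FILE *datafile, FILE *tmp, long count,
                       const char sync[16])
{
    long size = ftell(tmp);        /* bytes buffered for this block */
    char chunk[4096];
    size_t n;

    fprintf(datafile, "%ld %ld ", count, size);   /* (1) stand-in header */
    rewind(tmp);
    while ((n = fread(chunk, 1, sizeof chunk, tmp)) > 0) {
        fwrite(chunk, 1, n, datafile);            /* (2) byte copy */
    }
    fwrite(sync, 1, 16, datafile);                /* (3) sync marker */
    rewind(tmp);                   /* (4) reuse the temp file; real code
                                      would truncate it here */
    return 0;
}
```

The key property is that the header is only written once the block's count and size are final, which sidesteps the seek-back problem entirely.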

The ideal solution would be to have fixed-length block header fields, but that would require a change to the spec.  I may open a discussion on the developers mailing list soon regarding this limitation, but please feel free to do so yourself if you like.

Good luck with writing a fix.  I look forward to seeing it.  It will be a nice contribution to the implementation.







[jira] [Commented] (AVRO-724) C implementation does not write datum values that are larger than the memory write buffer (currently 16K)

Posted by "Michael Cooper (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/AVRO-724?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13168131#comment-13168131 ] 

Michael Cooper commented on AVRO-724:
-------------------------------------

I have added a way to set the upper limit of the datum_buffer as part of AVRO-957.

The buffer is now allocated on the heap with avro_malloc instead of being a fixed-size array in the struct.

The relevant commit: https://github.com/hitwise/avro/commit/859f83e09354aeb13ac394777adc449586c6d770
                


[jira] Commented: (AVRO-724) C implementation does not write datum values that are larger than the memory write buffer (currently 16K)

Posted by "Jeremy Hinegardner (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/AVRO-724?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12976614#action_12976614 ] 

Jeremy Hinegardner commented on AVRO-724:
-----------------------------------------

I am going to attempt to work on this issue.  If anyone has guidance on an approach, I'd appreciate it; otherwise I'll figure something out.

