You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@pig.apache.org by "Russell Jurney (Created) (JIRA)" <ji...@apache.org> on 2012/03/26 01:02:27 UTC

[jira] [Created] (PIG-2614) AvroStorage crashes on LOADING a single bad error

AvroStorage crashes on LOADING a single bad error
-------------------------------------------------

                 Key: PIG-2614
                 URL: https://issues.apache.org/jira/browse/PIG-2614
             Project: Pig
          Issue Type: Bug
          Components: piggybank
    Affects Versions: 0.10
            Reporter: Russell Jurney
            Priority: Blocker
             Fix For: 0.10


AvroStorage dies when a single bad record exists, such as one with missing fields.  This is very bad on 'big data,' where bad records are inevitable.  See discussion at http://www.quora.com/Big-Data/In-Big-Data-ETL-how-many-records-are-an-acceptable-loss for more theory.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (PIG-2614) AvroStorage crashes on LOADING a single bad error

Posted by "Jonathan Coveney (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-2614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13239596#comment-13239596 ] 

Jonathan Coveney commented on PIG-2614:
---------------------------------------

Also, how are you setting these values? The proper way is, in a script:

set pig.piggybank.storage.avro.bad.record.threshold 0.01
set pig.piggybank.storage.avro.bad.record.min 100
                
> AvroStorage crashes on LOADING a single bad error
> -------------------------------------------------
>
>                 Key: PIG-2614
>                 URL: https://issues.apache.org/jira/browse/PIG-2614
>             Project: Pig
>          Issue Type: Bug
>          Components: piggybank
>    Affects Versions: 0.10, 0.11
>            Reporter: Russell Jurney
>              Labels: avro, avrostorage, bad, book, cutting, doug, for, my, pig, sadism
>             Fix For: 0.10, 0.11
>
>         Attachments: PIG-2614_0.patch
>
>
> AvroStorage dies when a single bad record exists, such as one with missing fields.  This is very bad on 'big data,' where bad records are inevitable.  See discussion at http://www.quora.com/Big-Data/In-Big-Data-ETL-how-many-records-are-an-acceptable-loss for more theory.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (PIG-2614) AvroStorage crashes on LOADING a single bad error

Posted by "Jonathan Coveney (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-2614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13244752#comment-13244752 ] 

Jonathan Coveney commented on PIG-2614:
---------------------------------------

Daniel,

The files you are referencing are from https://issues.apache.org/jira/browse/PIG-2551. I used PigCounterHelper. Those files will also be checked in with the JRuby patch (or we can just commit that one already, your call). I included the whole thing to be unambiguous, but can remove the other files.

Russell,

Perhaps an e2e test is better here? Is it that you can't get the function to work, or that you can't get data to work it on?
                
> AvroStorage crashes on LOADING a single bad error
> -------------------------------------------------
>
>                 Key: PIG-2614
>                 URL: https://issues.apache.org/jira/browse/PIG-2614
>             Project: Pig
>          Issue Type: Bug
>          Components: piggybank
>    Affects Versions: 0.10, 0.11
>            Reporter: Russell Jurney
>              Labels: avro, avrostorage, bad, book, cutting, doug, for, my, pig, sadism
>             Fix For: 0.10, 0.11
>
>         Attachments: PIG-2614_0.patch, PIG-2614_1.patch
>
>
> AvroStorage dies when a single bad record exists, such as one with missing fields.  This is very bad on 'big data,' where bad records are inevitable.  See discussion at http://www.quora.com/Big-Data/In-Big-Data-ETL-how-many-records-are-an-acceptable-loss for more theory.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (PIG-2614) AvroStorage crashes on LOADING a single bad error

Posted by "Russell Jurney (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-2614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13239339#comment-13239339 ] 

Russell Jurney commented on PIG-2614:
-------------------------------------

Followup^2 comment: I wish I could undo comments.  Retesting with the actual jar I get this...


Caused by: java.lang.ClassNotFoundException: org.slf4j.LoggerFactory
	at java.net.URLClassLoader$1.run(URLClassLoader.java:202)
	at java.security.AccessController.doPrivileged(Native Method)
	at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
	at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:247)
	... 7 more

Adding org.slf4j.LoggerFactory tells me this jar is missing.
                
> AvroStorage crashes on LOADING a single bad error
> -------------------------------------------------
>
>                 Key: PIG-2614
>                 URL: https://issues.apache.org/jira/browse/PIG-2614
>             Project: Pig
>          Issue Type: Bug
>          Components: piggybank
>    Affects Versions: 0.10, 0.11
>            Reporter: Russell Jurney
>              Labels: avro, avrostorage, bad, book, cutting, doug, for, my, pig, sadism
>             Fix For: 0.10, 0.11
>
>         Attachments: PIG-2614_0.patch
>
>
> AvroStorage dies when a single bad record exists, such as one with missing fields.  This is very bad on 'big data,' where bad records are inevitable.  See discussion at http://www.quora.com/Big-Data/In-Big-Data-ETL-how-many-records-are-an-acceptable-loss for more theory.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (PIG-2614) AvroStorage crashes on LOADING a single bad error

Posted by "Jonathan Coveney (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-2614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13264659#comment-13264659 ] 

Jonathan Coveney commented on PIG-2614:
---------------------------------------

Yeah, some bad data would be key... if the case with avro is that one bad row throws off the offset, I can deal with that, but it'll help immensely to have something to test against.
                
> AvroStorage crashes on LOADING a single bad error
> -------------------------------------------------
>
>                 Key: PIG-2614
>                 URL: https://issues.apache.org/jira/browse/PIG-2614
>             Project: Pig
>          Issue Type: Bug
>          Components: piggybank
>    Affects Versions: 0.10.0, 0.11
>            Reporter: Russell Jurney
>            Assignee: Jonathan Coveney
>              Labels: avro, avrostorage, bad, book, cutting, doug, for, my, pig, sadism
>             Fix For: 0.11, 0.10.1
>
>         Attachments: PIG-2614_0.patch, PIG-2614_1.patch
>
>
> AvroStorage dies when a single bad record exists, such as one with missing fields.  This is very bad on 'big data,' where bad records are inevitable.  See discussion at http://www.quora.com/Big-Data/In-Big-Data-ETL-how-many-records-are-an-acceptable-loss for more theory.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (PIG-2614) AvroStorage crashes on LOADING a single bad error

Posted by "Russell Jurney (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-2614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13243891#comment-13243891 ] 

Russell Jurney commented on PIG-2614:
-------------------------------------

Thank you sir, testing now!
                
> AvroStorage crashes on LOADING a single bad error
> -------------------------------------------------
>
>                 Key: PIG-2614
>                 URL: https://issues.apache.org/jira/browse/PIG-2614
>             Project: Pig
>          Issue Type: Bug
>          Components: piggybank
>    Affects Versions: 0.10, 0.11
>            Reporter: Russell Jurney
>              Labels: avro, avrostorage, bad, book, cutting, doug, for, my, pig, sadism
>             Fix For: 0.10, 0.11
>
>         Attachments: PIG-2614_0.patch, PIG-2614_1.patch
>
>
> AvroStorage dies when a single bad record exists, such as one with missing fields.  This is very bad on 'big data,' where bad records are inevitable.  See discussion at http://www.quora.com/Big-Data/In-Big-Data-ETL-how-many-records-are-an-acceptable-loss for more theory.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (PIG-2614) AvroStorage crashes on LOADING a single bad error

Posted by "Russell Jurney (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-2614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13243921#comment-13243921 ] 

Russell Jurney commented on PIG-2614:
-------------------------------------

I will write unit tests, but I'm having trouble getting this to work in my actual error case.  In my case, when one record is bad, all subsequent records are bad... the offset screws up or something in the read.

I'll make unit tests, though!
                
> AvroStorage crashes on LOADING a single bad error
> -------------------------------------------------
>
>                 Key: PIG-2614
>                 URL: https://issues.apache.org/jira/browse/PIG-2614
>             Project: Pig
>          Issue Type: Bug
>          Components: piggybank
>    Affects Versions: 0.10, 0.11
>            Reporter: Russell Jurney
>              Labels: avro, avrostorage, bad, book, cutting, doug, for, my, pig, sadism
>             Fix For: 0.10, 0.11
>
>         Attachments: PIG-2614_0.patch, PIG-2614_1.patch
>
>
> AvroStorage dies when a single bad record exists, such as one with missing fields.  This is very bad on 'big data,' where bad records are inevitable.  See discussion at http://www.quora.com/Big-Data/In-Big-Data-ETL-how-many-records-are-an-acceptable-loss for more theory.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Assigned] (PIG-2614) AvroStorage crashes on LOADING a single bad error

Posted by "Dmitriy V. Ryaboy (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-2614?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dmitriy V. Ryaboy reassigned PIG-2614:
--------------------------------------

    Assignee: Jonathan Coveney

Looks like Jon's all over this.. assigning to him.

Would be nice to make this a global setting (and punt per-loader settings to ONERROR implementation)

But that's outside the scope of this ticket, let's at least confirm this works.

Russell, if you can post input that knocks you to the wrong offset, that could help with reproducing the error you are getting with the current patch.
                
> AvroStorage crashes on LOADING a single bad error
> -------------------------------------------------
>
>                 Key: PIG-2614
>                 URL: https://issues.apache.org/jira/browse/PIG-2614
>             Project: Pig
>          Issue Type: Bug
>          Components: piggybank
>    Affects Versions: 0.10.0, 0.11
>            Reporter: Russell Jurney
>            Assignee: Jonathan Coveney
>              Labels: avro, avrostorage, bad, book, cutting, doug, for, my, pig, sadism
>             Fix For: 0.11, 0.10.1
>
>         Attachments: PIG-2614_0.patch, PIG-2614_1.patch
>
>
> AvroStorage dies when a single bad record exists, such as one with missing fields.  This is very bad on 'big data,' where bad records are inevitable.  See discussion at http://www.quora.com/Big-Data/In-Big-Data-ETL-how-many-records-are-an-acceptable-loss for more theory.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (PIG-2614) AvroStorage crashes on LOADING a single bad error

Posted by "Russell Jurney (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-2614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13239311#comment-13239311 ] 

Russell Jurney commented on PIG-2614:
-------------------------------------

Thanks a lot for helping out.  This feels like ONERROR :)  Ok, I asked the LinkedIn guys to try 0.10/JRuby, fwiw.

pig.piggybank.storage.avro.bad.record.threshold=0.99
pig.piggybank.storage.avro.bad.record.min=100

So I read these as 'don't die unless max(more than 1% of the records fail, 100 records fail)

Correct?
                
> AvroStorage crashes on LOADING a single bad error
> -------------------------------------------------
>
>                 Key: PIG-2614
>                 URL: https://issues.apache.org/jira/browse/PIG-2614
>             Project: Pig
>          Issue Type: Bug
>          Components: piggybank
>    Affects Versions: 0.10, 0.11
>            Reporter: Russell Jurney
>              Labels: avro, avrostorage, bad, book, cutting, doug, for, my, pig, sadism
>             Fix For: 0.10, 0.11
>
>         Attachments: PIG-2614_0.patch
>
>
> AvroStorage dies when a single bad record exists, such as one with missing fields.  This is very bad on 'big data,' where bad records are inevitable.  See discussion at http://www.quora.com/Big-Data/In-Big-Data-ETL-how-many-records-are-an-acceptable-loss for more theory.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (PIG-2614) AvroStorage crashes on LOADING a single bad error

Posted by "Cheolsoo Park (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-2614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13507564#comment-13507564 ] 

Cheolsoo Park commented on PIG-2614:
------------------------------------

In addition to applying the patch, the following commands should be also executed to run the unit test cases:
{code}
wget https://issues.apache.org/jira/secure/attachment/12555546/test_avro_files.tar.gz
tar -xf test_avro_files.tar.gz
svn rm contrib/piggybank/java/src/test/java/org/apache/pig/piggybank/test/storage/avro/avro_test_files/test_corrupted_file.avro
svn add contrib/piggybank/java/src/test/java/org/apache/pig/piggybank/test/storage/avro/avro_test_files/test_corrupted_file
svn rm contrib/piggybank/java/src/test/java/org/apache/pig/piggybank/test/storage/avro/avro_test_files/expected_testCorruptedFile.avro
svn add contrib/piggybank/java/src/test/java/org/apache/pig/piggybank/test/storage/avro/avro_test_files/expected_testCorruptedFile2.avro
svn add contrib/piggybank/java/src/test/java/org/apache/pig/piggybank/test/storage/avro/avro_test_files/expected_testCorruptedFile3.avro
{code}

Thanks!
                
> AvroStorage crashes on LOADING a single bad error
> -------------------------------------------------
>
>                 Key: PIG-2614
>                 URL: https://issues.apache.org/jira/browse/PIG-2614
>             Project: Pig
>          Issue Type: Bug
>          Components: piggybank
>    Affects Versions: 0.10.0, 0.11
>            Reporter: Russell Jurney
>            Assignee: Jonathan Coveney
>              Labels: avro, avrostorage, bad, book, cutting, doug, for, my, pig, sadism
>             Fix For: 0.11, 0.10.1
>
>         Attachments: PIG-2614_0.patch, PIG-2614_1.patch, PIG-2614_2.patch, test_avro_files.tar.gz
>
>
> AvroStorage dies when a single bad record exists, such as one with missing fields.  This is very bad on 'big data,' where bad records are inevitable.  See discussion at http://www.quora.com/Big-Data/In-Big-Data-ETL-how-many-records-are-an-acceptable-loss for more theory.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (PIG-2614) AvroStorage crashes on LOADING a single bad error

Posted by "Russell Jurney (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-2614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13239331#comment-13239331 ] 

Russell Jurney commented on PIG-2614:
-------------------------------------

Currently, it still dies after I apply the patch and set thresholds. 


java.io.IOException: java.lang.ArrayIndexOutOfBoundsException: 64
	at org.apache.pig.piggybank.storage.avro.AvroStorage.getNext(AvroStorage.java:275)
	at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigRecordReader.nextKeyValue(PigRecordReader.java:194)
	at org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.nextKeyValue(MapTask.java:532)
	at org.apache.avro.io.parsing.Symbol$Alternative.getSymbol(Symbol.java:364)
	at org.apache.avro.io.ResolvingDecoder.doAction(ResolvingDecoder.java:229)
	at org.apache.avro.io.parsing.Parser.advance(Parser.java:88)
	at org.apache.avro.io.ResolvingDecoder.readIndex(ResolvingDecoder.java:206)
	at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:142)
	at org.apache.pig.piggybank.storage.avro.PigAvroDatumReader.readRecord(PigAvroDatumReader.java:67)
	at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:138)
	at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:129)
	at org.apache.avro.file.DataFileStream.next(DataFileStream.java:233)
	at org.apache.avro.file.DataFileStream.next(DataFileStream.java:220)
	at org.apache.pig.piggybank.storage.avro.PigAvroRecordReader.getCurrentValue(PigAvroRecordReader.java:80)
	at org.apache.pig.piggybank.storage.avro.AvroStorage.getNext(AvroStorage.java:273)
	... 7 more

                
> AvroStorage crashes on LOADING a single bad error
> -------------------------------------------------
>
>                 Key: PIG-2614
>                 URL: https://issues.apache.org/jira/browse/PIG-2614
>             Project: Pig
>          Issue Type: Bug
>          Components: piggybank
>    Affects Versions: 0.10, 0.11
>            Reporter: Russell Jurney
>              Labels: avro, avrostorage, bad, book, cutting, doug, for, my, pig, sadism
>             Fix For: 0.10, 0.11
>
>         Attachments: PIG-2614_0.patch
>
>
> AvroStorage dies when a single bad record exists, such as one with missing fields.  This is very bad on 'big data,' where bad records are inevitable.  See discussion at http://www.quora.com/Big-Data/In-Big-Data-ETL-how-many-records-are-an-acceptable-loss for more theory.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Updated] (PIG-2614) AvroStorage crashes on LOADING a single bad error

Posted by "Jonathan Coveney (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-2614?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jonathan Coveney updated PIG-2614:
----------------------------------

    Attachment: PIG-2614_1.patch

Russell,

This patch gets rid of the logging dependency. Let me know if you have any issues with it!
                
> AvroStorage crashes on LOADING a single bad error
> -------------------------------------------------
>
>                 Key: PIG-2614
>                 URL: https://issues.apache.org/jira/browse/PIG-2614
>             Project: Pig
>          Issue Type: Bug
>          Components: piggybank
>    Affects Versions: 0.10, 0.11
>            Reporter: Russell Jurney
>              Labels: avro, avrostorage, bad, book, cutting, doug, for, my, pig, sadism
>             Fix For: 0.10, 0.11
>
>         Attachments: PIG-2614_0.patch, PIG-2614_1.patch
>
>
> AvroStorage dies when a single bad record exists, such as one with missing fields.  This is very bad on 'big data,' where bad records are inevitable.  See discussion at http://www.quora.com/Big-Data/In-Big-Data-ETL-how-many-records-are-an-acceptable-loss for more theory.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (PIG-2614) AvroStorage crashes on LOADING a single bad error

Posted by "Russell Jurney (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-2614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13239344#comment-13239344 ] 

Russell Jurney commented on PIG-2614:
-------------------------------------

I added that class to the PigAvroRecordReader class and register all these jars and still it don't go:

register /me/pig/build/ivy/lib/Pig/slf4j-api-1.6.1.jar
register /me/pig/build/ivy/lib/Pig/slf4j-log4j12-1.6.1.jar
register /me/pig/build/ivy/lib/Pig/log4j-1.2.16.jar
register /me/pig/build/ivy/lib/Pig/commons-logging-1.1.1.jar
                
> AvroStorage crashes on LOADING a single bad error
> -------------------------------------------------
>
>                 Key: PIG-2614
>                 URL: https://issues.apache.org/jira/browse/PIG-2614
>             Project: Pig
>          Issue Type: Bug
>          Components: piggybank
>    Affects Versions: 0.10, 0.11
>            Reporter: Russell Jurney
>              Labels: avro, avrostorage, bad, book, cutting, doug, for, my, pig, sadism
>             Fix For: 0.10, 0.11
>
>         Attachments: PIG-2614_0.patch
>
>
> AvroStorage dies when a single bad record exists, such as one with missing fields.  This is very bad on 'big data,' where bad records are inevitable.  See discussion at http://www.quora.com/Big-Data/In-Big-Data-ETL-how-many-records-are-an-acceptable-loss for more theory.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Updated] (PIG-2614) AvroStorage crashes on LOADING a single bad error

Posted by "Jonathan Coveney (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-2614?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jonathan Coveney updated PIG-2614:
----------------------------------

             Priority: Major  (was: Blocker)
    Affects Version/s: 0.11
        Fix Version/s: 0.11
    
> AvroStorage crashes on LOADING a single bad error
> -------------------------------------------------
>
>                 Key: PIG-2614
>                 URL: https://issues.apache.org/jira/browse/PIG-2614
>             Project: Pig
>          Issue Type: Bug
>          Components: piggybank
>    Affects Versions: 0.10, 0.11
>            Reporter: Russell Jurney
>              Labels: avro, avrostorage, bad, book, cutting, doug, for, my, pig, sadism
>             Fix For: 0.10, 0.11
>
>         Attachments: PIG-2614_0.patch
>
>
> AvroStorage dies when a single bad record exists, such as one with missing fields.  This is very bad on 'big data,' where bad records are inevitable.  See discussion at http://www.quora.com/Big-Data/In-Big-Data-ETL-how-many-records-are-an-acceptable-loss for more theory.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (PIG-2614) AvroStorage crashes on LOADING a single bad error

Posted by "Joseph Adler (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-2614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13506101#comment-13506101 ] 

Joseph Adler commented on PIG-2614:
-----------------------------------

Repeating an old question: is there any reason that this patch is only for Avro? I think this could work for all storage types.
                
> AvroStorage crashes on LOADING a single bad error
> -------------------------------------------------
>
>                 Key: PIG-2614
>                 URL: https://issues.apache.org/jira/browse/PIG-2614
>             Project: Pig
>          Issue Type: Bug
>          Components: piggybank
>    Affects Versions: 0.10.0, 0.11
>            Reporter: Russell Jurney
>            Assignee: Jonathan Coveney
>              Labels: avro, avrostorage, bad, book, cutting, doug, for, my, pig, sadism
>             Fix For: 0.11, 0.10.1
>
>         Attachments: PIG-2614_0.patch, PIG-2614_1.patch
>
>
> AvroStorage dies when a single bad record exists, such as one with missing fields.  This is very bad on 'big data,' where bad records are inevitable.  See discussion at http://www.quora.com/Big-Data/In-Big-Data-ETL-how-many-records-are-an-acceptable-loss for more theory.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (PIG-2614) AvroStorage crashes on LOADING a single bad error

Posted by "Russell Jurney (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-2614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13239368#comment-13239368 ] 

Russell Jurney commented on PIG-2614:
-------------------------------------

I've tried adding slf4j to PigRecordReader and what not, but I'm stumped and am gonna kick this back at you, having tried and failed to use it :D
                
> AvroStorage crashes on LOADING a single bad error
> -------------------------------------------------
>
>                 Key: PIG-2614
>                 URL: https://issues.apache.org/jira/browse/PIG-2614
>             Project: Pig
>          Issue Type: Bug
>          Components: piggybank
>    Affects Versions: 0.10, 0.11
>            Reporter: Russell Jurney
>              Labels: avro, avrostorage, bad, book, cutting, doug, for, my, pig, sadism
>             Fix For: 0.10, 0.11
>
>         Attachments: PIG-2614_0.patch
>
>
> AvroStorage dies when a single bad record exists, such as one with missing fields.  This is very bad on 'big data,' where bad records are inevitable.  See discussion at http://www.quora.com/Big-Data/In-Big-Data-ETL-how-many-records-are-an-acceptable-loss for more theory.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (PIG-2614) AvroStorage crashes on LOADING a single bad error

Posted by "Russell Jurney (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-2614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13239336#comment-13239336 ] 

Russell Jurney commented on PIG-2614:
-------------------------------------

Followup comment... it looks like you also need to do the check in reader.nextKeyValue()
                
> AvroStorage crashes on LOADING a single bad error
> -------------------------------------------------
>
>                 Key: PIG-2614
>                 URL: https://issues.apache.org/jira/browse/PIG-2614
>             Project: Pig
>          Issue Type: Bug
>          Components: piggybank
>    Affects Versions: 0.10, 0.11
>            Reporter: Russell Jurney
>              Labels: avro, avrostorage, bad, book, cutting, doug, for, my, pig, sadism
>             Fix For: 0.10, 0.11
>
>         Attachments: PIG-2614_0.patch
>
>
> AvroStorage dies when a single bad record exists, such as one with missing fields.  This is very bad on 'big data,' where bad records are inevitable.  See discussion at http://www.quora.com/Big-Data/In-Big-Data-ETL-how-many-records-are-an-acceptable-loss for more theory.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (PIG-2614) AvroStorage crashes on LOADING a single bad error

Posted by "Joseph Adler (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-2614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13510636#comment-13510636 ] 

Joseph Adler commented on PIG-2614:
-----------------------------------

I'd argue that it's not just true for different RecordReaders, but for different business problems. There are cases where it's OK to run a job even if 50% of your records are bad, and cases where an error on 1 in 1,000,000,000,000 records is unacceptable...

How about this: I'll take the existing patch and modify it to apply to all record readers (not just AvroStorage). I'll see if I can get this to work more generally. I think it will be about the same amount of work to modify org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigRecordReader as it would be to modify the rewritten AvroStorage, so why not just tackle the general problem?
                
> AvroStorage crashes on LOADING a single bad error
> -------------------------------------------------
>
>                 Key: PIG-2614
>                 URL: https://issues.apache.org/jira/browse/PIG-2614
>             Project: Pig
>          Issue Type: Bug
>          Components: piggybank
>    Affects Versions: 0.10.0, 0.11
>            Reporter: Russell Jurney
>            Assignee: Jonathan Coveney
>              Labels: avro, avrostorage, bad, book, cutting, doug, for, my, pig, sadism
>             Fix For: 0.11, 0.10.1
>
>         Attachments: PIG-2614_0.patch, PIG-2614_1.patch, PIG-2614_2.patch, test_avro_files.tar.gz
>
>
> AvroStorage dies when a single bad record exists, such as one with missing fields.  This is very bad on 'big data,' where bad records are inevitable.  See discussion at http://www.quora.com/Big-Data/In-Big-Data-ETL-how-many-records-are-an-acceptable-loss for more theory.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (PIG-2614) AvroStorage crashes on LOADING a single bad error

Posted by "Joseph Adler (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-2614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13509325#comment-13509325 ] 

Joseph Adler commented on PIG-2614:
-----------------------------------

Could I propose an alternative? 

I like this functionality, but I don't think that this should be specific to Avro records. I think that is should be straightforward to modify org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigRecordReader to implement this functionality for ALL LoadFunc types. Specifically, it should be possible to count the number of Exceptions thrown by the getNext method in the underlying load function (inside PigRecordReader.nextKeyValue).


                
> AvroStorage crashes on LOADING a single bad error
> -------------------------------------------------
>
>                 Key: PIG-2614
>                 URL: https://issues.apache.org/jira/browse/PIG-2614
>             Project: Pig
>          Issue Type: Bug
>          Components: piggybank
>    Affects Versions: 0.10.0, 0.11
>            Reporter: Russell Jurney
>            Assignee: Jonathan Coveney
>              Labels: avro, avrostorage, bad, book, cutting, doug, for, my, pig, sadism
>             Fix For: 0.11, 0.10.1
>
>         Attachments: PIG-2614_0.patch, PIG-2614_1.patch, PIG-2614_2.patch, test_avro_files.tar.gz
>
>
> AvroStorage dies when a single bad record exists, such as one with missing fields.  This is very bad on 'big data,' where bad records are inevitable.  See discussion at http://www.quora.com/Big-Data/In-Big-Data-ETL-how-many-records-are-an-acceptable-loss for more theory.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (PIG-2614) AvroStorage crashes on LOADING a single bad error

Posted by "Jonathan Coveney (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-2614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13239270#comment-13239270 ] 

Jonathan Coveney commented on PIG-2614:
---------------------------------------

Right now there is no plan to make it any more prominent than the fact that it is a Hadoop counter. I'm not sure how hard it is to print various stuff at the end... it's an interesting thought (to make it so functions can register information they want printed at the end). Either way, the counters + the fact that any threshholds are set by the user feel sufficient to me.

There's a thread going around about the RC. There are a handful open...if you need this patch to be in the RC, just post in the thread (with any others). My guess would be sometime next week, to give you a chance to write tests, Bill to finish up some of his fixes, Julien a chance to look over JRuby stuff, etc. Right now am trying to get a final list of what will go into the RC so we stop stuffing the ballot box, as it were :)
                
> AvroStorage crashes on LOADING a single bad error
> -------------------------------------------------
>
>                 Key: PIG-2614
>                 URL: https://issues.apache.org/jira/browse/PIG-2614
>             Project: Pig
>          Issue Type: Bug
>          Components: piggybank
>    Affects Versions: 0.10, 0.11
>            Reporter: Russell Jurney
>              Labels: avro, avrostorage, bad, book, cutting, doug, for, my, pig, sadism
>             Fix For: 0.10, 0.11
>
>         Attachments: PIG-2614_0.patch
>
>
> AvroStorage dies when a single bad record exists, such as one with missing fields.  This is very bad on 'big data,' where bad records are inevitable.  See discussion at http://www.quora.com/Big-Data/In-Big-Data-ETL-how-many-records-are-an-acceptable-loss for more theory.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Updated] (PIG-2614) AvroStorage crashes on LOADING a single bad error

Posted by "Jonathan Coveney (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-2614?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jonathan Coveney updated PIG-2614:
----------------------------------

    Attachment: PIG-2614_0.patch

works in both 0.10 and 0.11
                
> AvroStorage crashes on LOADING a single bad error
> -------------------------------------------------
>
>                 Key: PIG-2614
>                 URL: https://issues.apache.org/jira/browse/PIG-2614
>             Project: Pig
>          Issue Type: Bug
>          Components: piggybank
>    Affects Versions: 0.10, 0.11
>            Reporter: Russell Jurney
>            Priority: Blocker
>              Labels: avro, avrostorage, bad, book, cutting, doug, for, my, pig, sadism
>             Fix For: 0.10, 0.11
>
>         Attachments: PIG-2614_0.patch
>
>
> AvroStorage dies when a single bad record exists, such as one with missing fields.  This is very bad on 'big data,' where bad records are inevitable.  See discussion at http://www.quora.com/Big-Data/In-Big-Data-ETL-how-many-records-are-an-acceptable-loss for more theory.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Updated] (PIG-2614) AvroStorage crashes on LOADING a single bad error

Posted by "Cheolsoo Park (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-2614?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Cheolsoo Park updated PIG-2614:
-------------------------------

    Status: Patch Available  (was: Open)
    
> AvroStorage crashes on LOADING a single bad error
> -------------------------------------------------
>
>                 Key: PIG-2614
>                 URL: https://issues.apache.org/jira/browse/PIG-2614
>             Project: Pig
>          Issue Type: Bug
>          Components: piggybank
>    Affects Versions: 0.10.0, 0.11
>            Reporter: Russell Jurney
>            Assignee: Jonathan Coveney
>              Labels: avro, avrostorage, bad, book, cutting, doug, for, my, pig, sadism
>             Fix For: 0.11, 0.10.1
>
>         Attachments: PIG-2614_0.patch, PIG-2614_1.patch, PIG-2614_2.patch, test_avro_files.tar.gz
>
>
> AvroStorage dies when a single bad record exists, such as one with missing fields.  This is very bad on 'big data,' where bad records are inevitable.  See discussion at http://www.quora.com/Big-Data/In-Big-Data-ETL-how-many-records-are-an-acceptable-loss for more theory.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (PIG-2614) AvroStorage crashes on LOADING a single bad error

Posted by "Cheolsoo Park (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-2614?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Cheolsoo Park updated PIG-2614:
-------------------------------

    Attachment: test_avro_files.tar.gz
                PIG-2614_2.patch

Hi all,

I rebased the patch to trunk. Hopefully, this will make things more clear:
- Removed PIG-2551 code since it's already committed to trunk.
- Replaced the {{ignore_bad_file}} option that was committed in PIG-2909 with the {{bad.record.threshold}} and {{bad.record.min}} properties.
- Added unit test cases
{{testCorruptedFile1,2,3}}.

@Joe,
I am not sure if I fully understand your question. Please correct me if I am wrong.

You're right that {{InputErrorTracker}} can be used by any LoadFunc. What storages need to do is to create a {{InputErrorTracker}} and increase counters. Do you have a better suggestion?

Thanks!
                
> AvroStorage crashes on LOADING a single bad error
> -------------------------------------------------
>
>                 Key: PIG-2614
>                 URL: https://issues.apache.org/jira/browse/PIG-2614
>             Project: Pig
>          Issue Type: Bug
>          Components: piggybank
>    Affects Versions: 0.10.0, 0.11
>            Reporter: Russell Jurney
>            Assignee: Jonathan Coveney
>              Labels: avro, avrostorage, bad, book, cutting, doug, for, my, pig, sadism
>             Fix For: 0.11, 0.10.1
>
>         Attachments: PIG-2614_0.patch, PIG-2614_1.patch, PIG-2614_2.patch, test_avro_files.tar.gz
>
>
> AvroStorage dies when a single bad record exists, such as one with missing fields.  This is very bad on 'big data,' where bad records are inevitable.  See discussion at http://www.quora.com/Big-Data/In-Big-Data-ETL-how-many-records-are-an-acceptable-loss for more theory.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (PIG-2614) AvroStorage crashes on LOADING a single bad error

Posted by "Daniel Dai (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-2614?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Daniel Dai updated PIG-2614:
----------------------------

    Fix Version/s:     (was: 0.10.0)
                   0.10.1
    
> AvroStorage crashes on LOADING a single bad error
> -------------------------------------------------
>
>                 Key: PIG-2614
>                 URL: https://issues.apache.org/jira/browse/PIG-2614
>             Project: Pig
>          Issue Type: Bug
>          Components: piggybank
>    Affects Versions: 0.10.0, 0.11
>            Reporter: Russell Jurney
>              Labels: avro, avrostorage, bad, book, cutting, doug, for, my, pig, sadism
>             Fix For: 0.11, 0.10.1
>
>         Attachments: PIG-2614_0.patch, PIG-2614_1.patch
>
>
> AvroStorage dies when a single bad record exists, such as one with missing fields.  This is very bad on 'big data,' where bad records are inevitable.  See discussion at http://www.quora.com/Big-Data/In-Big-Data-ETL-how-many-records-are-an-acceptable-loss for more theory.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (PIG-2614) AvroStorage crashes on LOADING a single bad error

Posted by "Jonathan Coveney (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-2614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13509342#comment-13509342 ] 

Jonathan Coveney commented on PIG-2614:
---------------------------------------

I imagine we could make this work for any RecordReader, yes. I think the difference is just that different RecordReaders fail differently...for some, you can recover, for others, you cannot. I have no problem making this more generic, though I do not have the bandwidth for it atm. I think we should have these tests, decide whether to commit this to Avro, then have another JIRA which will make it more general.
                
> AvroStorage crashes on LOADING a single bad error
> -------------------------------------------------
>
>                 Key: PIG-2614
>                 URL: https://issues.apache.org/jira/browse/PIG-2614
>             Project: Pig
>          Issue Type: Bug
>          Components: piggybank
>    Affects Versions: 0.10.0, 0.11
>            Reporter: Russell Jurney
>            Assignee: Jonathan Coveney
>              Labels: avro, avrostorage, bad, book, cutting, doug, for, my, pig, sadism
>             Fix For: 0.11, 0.10.1
>
>         Attachments: PIG-2614_0.patch, PIG-2614_1.patch, PIG-2614_2.patch, test_avro_files.tar.gz
>
>
> AvroStorage dies when a single bad record exists, such as one with missing fields.  This is very bad on 'big data,' where bad records are inevitable.  See discussion at http://www.quora.com/Big-Data/In-Big-Data-ETL-how-many-records-are-an-acceptable-loss for more theory.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (PIG-2614) AvroStorage crashes on LOADING a single bad error

Posted by "Russell Jurney (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-2614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13239335#comment-13239335 ] 

Russell Jurney commented on PIG-2614:
-------------------------------------

Ah, I see.  The code is like this:

    @Override
    public Tuple getNext() throws IOException {
        try {
            if (!reader.nextKeyValue()) {
                return null;
            }
            return (Tuple) this.reader.getCurrentValue();
        } catch (Exception e) {
            throw new IOException(e);
        }
    }

So I need to remove these conditions.
                
> AvroStorage crashes on LOADING a single bad error
> -------------------------------------------------
>
>                 Key: PIG-2614
>                 URL: https://issues.apache.org/jira/browse/PIG-2614
>             Project: Pig
>          Issue Type: Bug
>          Components: piggybank
>    Affects Versions: 0.10, 0.11
>            Reporter: Russell Jurney
>              Labels: avro, avrostorage, bad, book, cutting, doug, for, my, pig, sadism
>             Fix For: 0.10, 0.11
>
>         Attachments: PIG-2614_0.patch
>
>
> AvroStorage dies when a single bad record exists, such as one with missing fields.  This is very bad on 'big data,' where bad records are inevitable.  See discussion at http://www.quora.com/Big-Data/In-Big-Data-ETL-how-many-records-are-an-acceptable-loss for more theory.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (PIG-2614) AvroStorage crashes on LOADING a single bad error

Posted by "Jonathan Coveney (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-2614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13239193#comment-13239193 ] 

Jonathan Coveney commented on PIG-2614:
---------------------------------------

Russell,

In Elephant-bird, there is a key elephantbird.mapred.input.bad.record.threshold. For whatever reason I felt like doing this, so find attached a patch that adds the functionality you want (note that it includes PIG-2551, which is more or less good to go... only because that patch brings in a Counter helper).

The default functionality does not change. On an error, it will die. However, there are not two keys that can be set:
pig.piggybank.storage.avro.bad.record.threshold
pig.piggybank.storage.avro.bad.record.min

The former sets the acceptable ratio threshhold. The latter sets the minimum number of errors before it can error out.

Here is where you come in:

Currently, the only error I log is on "reader.next()." Are there any other cases where errors (at least, errors indicating a bad row) can be thrown? And on an error, what do you want to happen? Skip the row, or return null? It seems to make sense to me to skip the record (also, the number of records processed and the number of errors thrown is logged in a Hadoop counter now).

Secondly, someone needs to make tests. It currently passes the tests, but that's because the default threshold and min are 0. I don't know what is and isn't a bad Avro file, though, so yeah. Hopefully the fact that I did the work implementing will motivate someone to add tests ;)
                
> AvroStorage crashes on LOADING a single bad error
> -------------------------------------------------
>
>                 Key: PIG-2614
>                 URL: https://issues.apache.org/jira/browse/PIG-2614
>             Project: Pig
>          Issue Type: Bug
>          Components: piggybank
>    Affects Versions: 0.10, 0.11
>            Reporter: Russell Jurney
>            Priority: Blocker
>              Labels: avro, avrostorage, bad, book, cutting, doug, for, my, pig, sadism
>             Fix For: 0.10, 0.11
>
>         Attachments: PIG-2614_0.patch
>
>
> AvroStorage dies when a single bad record exists, such as one with missing fields.  This is very bad on 'big data,' where bad records are inevitable.  See discussion at http://www.quora.com/Big-Data/In-Big-Data-ETL-how-many-records-are-an-acceptable-loss for more theory.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (PIG-2614) AvroStorage crashes on LOADING a single bad error

Posted by "Daniel Dai (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-2614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13244461#comment-13244461 ] 

Daniel Dai commented on PIG-2614:
---------------------------------

If unit test is too hard, you can elaborate your manual test step, attach the dataset you are using.

BTW, I also see there is some code not belonging to this issue in patch, can you cleanup?
                
> AvroStorage crashes on LOADING a single bad error
> -------------------------------------------------
>
>                 Key: PIG-2614
>                 URL: https://issues.apache.org/jira/browse/PIG-2614
>             Project: Pig
>          Issue Type: Bug
>          Components: piggybank
>    Affects Versions: 0.10, 0.11
>            Reporter: Russell Jurney
>              Labels: avro, avrostorage, bad, book, cutting, doug, for, my, pig, sadism
>             Fix For: 0.10, 0.11
>
>         Attachments: PIG-2614_0.patch, PIG-2614_1.patch
>
>
> AvroStorage dies when a single bad record exists, such as one with missing fields.  This is very bad on 'big data,' where bad records are inevitable.  See discussion at http://www.quora.com/Big-Data/In-Big-Data-ETL-how-many-records-are-an-acceptable-loss for more theory.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (PIG-2614) AvroStorage crashes on LOADING a single bad error

Posted by "Russell Jurney (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-2614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13239267#comment-13239267 ] 

Russell Jurney commented on PIG-2614:
-------------------------------------

reader.next() is where I've had the problem, and I'm not aware of any other cases where this gets thrown.

It makes sense to skip the record and adjust counters.  Assuming the skipped/err'd/dropped record count will output prominently at the end?

I can add tests.  When are we releasing an RC?
                
> AvroStorage crashes on LOADING a single bad error
> -------------------------------------------------
>
>                 Key: PIG-2614
>                 URL: https://issues.apache.org/jira/browse/PIG-2614
>             Project: Pig
>          Issue Type: Bug
>          Components: piggybank
>    Affects Versions: 0.10, 0.11
>            Reporter: Russell Jurney
>              Labels: avro, avrostorage, bad, book, cutting, doug, for, my, pig, sadism
>             Fix For: 0.10, 0.11
>
>         Attachments: PIG-2614_0.patch
>
>
> AvroStorage dies when a single bad record exists, such as one with missing fields.  This is very bad on 'big data,' where bad records are inevitable.  See discussion at http://www.quora.com/Big-Data/In-Big-Data-ETL-how-many-records-are-an-acceptable-loss for more theory.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (PIG-2614) AvroStorage crashes on LOADING a single bad error

Posted by "Joseph Adler (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-2614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13501339#comment-13501339 ] 

Joseph Adler commented on PIG-2614:
-----------------------------------

Just taking a look at this patch now

If I'm reading the code correctly, it should not be necesarry for the author of a UDF (specifically a LoadFunc) to do anything special to take advantage of this functionality. Is that correct?
                
> AvroStorage crashes on LOADING a single bad error
> -------------------------------------------------
>
>                 Key: PIG-2614
>                 URL: https://issues.apache.org/jira/browse/PIG-2614
>             Project: Pig
>          Issue Type: Bug
>          Components: piggybank
>    Affects Versions: 0.10.0, 0.11
>            Reporter: Russell Jurney
>            Assignee: Jonathan Coveney
>              Labels: avro, avrostorage, bad, book, cutting, doug, for, my, pig, sadism
>             Fix For: 0.11, 0.10.1
>
>         Attachments: PIG-2614_0.patch, PIG-2614_1.patch
>
>
> AvroStorage dies when a single bad record exists, such as one with missing fields.  This is very bad on 'big data,' where bad records are inevitable.  See discussion at http://www.quora.com/Big-Data/In-Big-Data-ETL-how-many-records-are-an-acceptable-loss for more theory.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (PIG-2614) AvroStorage crashes on LOADING a single bad error

Posted by "Russell Jurney (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-2614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13244413#comment-13244413 ] 

Russell Jurney commented on PIG-2614:
-------------------------------------

Hmmmm I'm not able to test this on real data. I don't know how to write bad avros that don't screw up all reads thereafter :(
                
> AvroStorage crashes on LOADING a single bad error
> -------------------------------------------------
>
>                 Key: PIG-2614
>                 URL: https://issues.apache.org/jira/browse/PIG-2614
>             Project: Pig
>          Issue Type: Bug
>          Components: piggybank
>    Affects Versions: 0.10, 0.11
>            Reporter: Russell Jurney
>              Labels: avro, avrostorage, bad, book, cutting, doug, for, my, pig, sadism
>             Fix For: 0.10, 0.11
>
>         Attachments: PIG-2614_0.patch, PIG-2614_1.patch
>
>
> AvroStorage dies when a single bad record exists, such as one with missing fields.  This is very bad on 'big data,' where bad records are inevitable.  See discussion at http://www.quora.com/Big-Data/In-Big-Data-ETL-how-many-records-are-an-acceptable-loss for more theory.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (PIG-2614) AvroStorage crashes on LOADING a single bad error

Posted by "Jonathan Coveney (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-2614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13239586#comment-13239586 ] 

Jonathan Coveney commented on PIG-2614:
---------------------------------------

Thanks for taking a look. I'll probably just remove the dependence on the logger and rely on the counters. As far as the bounds, it's should be:

pig.piggybank.storage.avro.bad.record.threshold=0.01
pig.piggybank.storage.avro.bad.record.min=100

the threshhold is the error/record ratio you're willing to tolerate.

Do you think there is any chance that you could submit a dataset with some known bad rows and some known good rows? That would let me troubleshoot this. I'll fix the logging issue though.
                
> AvroStorage crashes on LOADING a single bad error
> -------------------------------------------------
>
>                 Key: PIG-2614
>                 URL: https://issues.apache.org/jira/browse/PIG-2614
>             Project: Pig
>          Issue Type: Bug
>          Components: piggybank
>    Affects Versions: 0.10, 0.11
>            Reporter: Russell Jurney
>              Labels: avro, avrostorage, bad, book, cutting, doug, for, my, pig, sadism
>             Fix For: 0.10, 0.11
>
>         Attachments: PIG-2614_0.patch
>
>
> AvroStorage dies when a single bad record exists, such as one with missing fields.  This is very bad on 'big data,' where bad records are inevitable.  See discussion at http://www.quora.com/Big-Data/In-Big-Data-ETL-how-many-records-are-an-acceptable-loss for more theory.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira