Posted to dev@trafodion.apache.org by "Liu, Ming (Ming)" <mi...@esgyn.cn> on 2016/03/29 05:13:29 UTC

new requirements for Trafodion to be more tolerant during bulk loading

Hi, all,

Trafodion can bulk load data from HDFS into Trafodion tables. Currently, it has some strict requirements on the source data in order for a load to succeed.
Typically, the source data should be clean and contain relatively little 'dirty' data. However, there are special cases where the source data contains special values that we hope Trafodion can handle automatically:


Automatically remove '\r' when it appears as part of the '\r\n' DOS-format line delimiter.

Do not raise a SQL error; instead, convert bad data to NULL automatically, while still logging it into the error log files when required, so the change is not silent and the action remains traceable.

Allow '\n' inside a data field even when '\n' is the line terminator.

Automatically truncate overflowed strings and log them into the error log file, in order to keep them traceable (a small sketch of these conversions follows this list).
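
To make these four behaviors concrete, here is a minimal sketch in Java (illustration only; the class and method names are invented, not Trafodion code):

    import java.io.IOException;
    import java.io.Writer;

    // Sketch only: none of these names come from the Trafodion code base.
    public class TolerantFieldConverter {
        private final Writer errorLog; // the existing 'error log' file on HDFS

        public TolerantFieldConverter(Writer errorLog) {
            this.errorLog = errorLog;
        }

        // Strip a trailing '\r' left over from a DOS '\r\n' line delimiter.
        public String stripCarriageReturn(String line) {
            return line.endsWith("\r") ? line.substring(0, line.length() - 1) : line;
        }

        // Convert an unparsable numeric field to NULL instead of raising a
        // SQL error, and log it so the conversion stays traceable.
        public Integer toIntOrNull(String field, long lineNo) throws IOException {
            try {
                return Integer.valueOf(field.trim());
            } catch (NumberFormatException e) {
                errorLog.write("line " + lineNo + ": bad integer '" + field + "' -> NULL\n");
                return null; // loaded as a SQL NULL
            }
        }

        // Truncate an overflowing string to the column width and log it.
        public String truncate(String field, int width, long lineNo) throws IOException {
            if (field.length() <= width) return field;
            errorLog.write("line " + lineNo + ": string truncated to " + width + " chars\n");
            return field.substring(0, width);
        }
    }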

When the source data has the above 'issues', we currently have to run a special 'data clean' process before loading: convert DOS format into Unix format, then find and remove bad data. However, products like Hive can handle such 'bad' data as described above. So it would be helpful if Trafodion introduced a special mode that offers the same 'tolerance' during bulkload, for users who can confirm that these conversions are desired, so that the extra 'data clean' process becomes unnecessary. This matters especially when data is shared between Trafodion and other products like Hive.

I will file a JIRA if there are no objections; any suggestions and ideas are welcome!

Thanks,
Ming


Re: Re: new requirements for Trafodion to be more tolerant during bulk loading

Posted by Hans Zeller <ha...@esgyn.com>.
Hi Ming,

Yes, I think we could make this the default behavior: insert the error row
with NULL values and also log it in the error file. That may take some
effort, though, since we probably evaluate all the columns in a single
expression, and if we encounter an error we may abort processing at that
point.
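
Just to picture the alternative (purely illustrative Java, not how the
Trafodion executor is actually structured):

    // Illustration only: evaluate each column separately, substitute NULL
    // on a conversion error, and tell the caller the row needs logging,
    // instead of one big expression that aborts on the first error.
    public class RowConverter {
        public interface ColumnConverter {
            Object convert(String field) throws Exception;
        }

        public static Object[] convertRow(String[] fields,
                                          ColumnConverter[] converters,
                                          boolean[] hadError) {
            Object[] out = new Object[fields.length];
            hadError[0] = false;
            for (int i = 0; i < fields.length; i++) {
                try {
                    out[i] = converters[i].convert(fields[i]);
                } catch (Exception e) {
                    out[i] = null;      // becomes a SQL NULL in the loaded row
                    hadError[0] = true; // caller logs the row to the error file
                }
            }
            return out;
        }
    }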

About the line terminators in the middle: In my humble opinion this is a
user error and we should not go through a lot of trouble to guess what the
user may have meant when they fed us junk. The solution above should be
sufficient. We could have a mode similar to the CSV SerDe in Hive that
could understand quotes and escape symbols, if users need that ability.
But, if Hive does something different and if what Hive does is reasonable,
maybe we should go with the Hive behavior. In the longer term I would think
that supporting MapReduce SerDes instead of adding more complicated logic
to our standard processing might be the way to go.
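
For reference, quote-aware splitting in the spirit of the Hive CSV SerDe
might look roughly like the sketch below (the real SerDe is more capable;
this only shows the idea):

    import java.util.ArrayList;
    import java.util.List;

    // Sketch: a field wrapped in double quotes may contain the field
    // delimiter and doubled quotes ("" stands for a literal ").
    public class QuotedFieldSplitter {
        public static List<String> split(String record, char fieldDelim) {
            List<String> fields = new ArrayList<>();
            StringBuilder cur = new StringBuilder();
            boolean inQuotes = false;
            for (int i = 0; i < record.length(); i++) {
                char c = record.charAt(i);
                if (c == '"') {
                    if (inQuotes && i + 1 < record.length()
                            && record.charAt(i + 1) == '"') {
                        cur.append('"'); // escaped quote inside a quoted field
                        i++;
                    } else {
                        inQuotes = !inQuotes;
                    }
                } else if (c == fieldDelim && !inQuotes) {
                    fields.add(cur.toString());
                    cur.setLength(0);
                } else {
                    cur.append(c); // delimiters inside quotes are data
                }
            }
            fields.add(cur.toString());
            return fields;
        }
    }

Handling a record delimiter inside quotes would additionally require the
reader to keep appending physical lines while a quote is still open.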

Thanks,

Hans

On Thu, Mar 31, 2016 at 8:27 AM, Liu, Ming (Ming) <mi...@esgyn.cn> wrote:

> Hi, Hans,
>
> I need to spend more time studying Hive. At first glance, Hive simply
> treats invalid values as NULL, which is what I originally had in mind.
> The Trafodion bulkloader already has an option to log error rows into an
> 'error log' file on HDFS, so that part is already implemented. What I now
> think is: when reading an Integer column but getting 'abc' from the
> source, instead of raising an error and dumping that row into the 'error
> log' file, in a special mode Trafodion would still load the row with the
> invalid field as NULL and at the same time dump it into the 'error log'
> file for later investigation.
> But I do need to spend more time studying Hive's behavior; what I said
> applies only to Hive external tables, where there is no 'load' phase and
> the user simply does an 'hdfs put' of a file into HDFS. I need to try
> loading invalid data into a Hive table to see how Hive handles it.
>
> For '\n' in the middle, you pointed out a really difficult corner case,
> i.e. an extra '\n' in the last field. It is hard to get this right in all
> cases. What I originally had in mind is simple: each row has a fixed
> number of columns, so until HIVE_SCAN has read all the columns it would
> treat the line terminator as a normal character; but when the '\n' is in
> the last field, this fails. We do need to think more about this, and
> maybe we need to abandon this requirement. It is always simpler to fix it
> at the source by choosing another delimiter; 0x01 is always a good one.
> The fact is, some commercial RDBMSs can tolerate this kind of issue. For
> this specific case, '\n' in the last field, there may be a solution, like
> looking further ahead, but it is really difficult if the last field
> contains multiple '\n's. So this may not be a valid requirement, and Hive
> cannot handle it either. I need to study how that commercial RDBMS does
> it. I will update later; for now, we can ignore this requirement.
>
> Thanks,
> Ming
>
> -----Original Message-----
> From: Hans Zeller [mailto:hans.zeller@esgyn.com]
> Sent: March 31, 2016 1:49
> To: dev <de...@trafodion.incubator.apache.org>
> Subject: Re: new requirements for Trafodion to be more tolerant during
> bulk loading
>
> Hi Ming,
>
> Yes, it would be good if Trafodion behaved in ways similar to Hive.
> It would also be good if errors in the data didn't make the entire bulk
> load fail. Instead, those error rows should ideally go into a separate
> file or table (with additional fields for the source file name and line
> number if possible). That's not an easy thing to do, though. Could you
> say how much of this error logging functionality you would want, and how
> Hive solves it?
>
> I don't quite understand how we would recognize record delimiters in a
> data field. Do you want to ignore record delimiters that appear within
> field delimiters? So a record delimiter in the last field would cause the
> record to be truncated, but in fields other than the last it would be OK?
> I think it would be better to treat a row with a record delimiter in it
> as an error row (two error rows, actually). Possibly we could allow
> quoted strings, like
>
> "this is a single field with a field delimiter | and a record delimiter \n
> and a quote "" in it"
>
>
> Thanks,
>
> Hans
>
> On Mon, Mar 28, 2016 at 8:13 PM, Liu, Ming (Ming) <mi...@esgyn.cn>
> wrote:
>
> > Hi, all,
> >
> > Trafodion can bulk load data from HDFS into Trafodion tables.
> > Currently, it has some strict requirements on the source data in order
> > for a load to succeed.
> > Typically, the source data should be clean and contain relatively
> > little 'dirty' data. However, there are special cases where the source
> > data contains special values that we hope Trafodion can handle
> > automatically:
> >
> >
> > Automatically remove '\r' when it appears as part of the '\r\n'
> > DOS-format line delimiter.
> >
> > Do not raise a SQL error; instead, convert bad data to NULL
> > automatically, while still logging it into the error log files when
> > required, so the change is not silent and the action remains traceable.
> >
> > Allow '\n' inside a data field even when '\n' is the line terminator.
> >
> > Automatically truncate overflowed strings and log them into the error
> > log file, in order to keep them traceable.
> >
> > When the source data has the above 'issues', we currently have to run a
> > special 'data clean' process before loading: convert DOS format into
> > Unix format, then find and remove bad data. However, products like Hive
> > can handle such 'bad' data as described above. So it would be helpful
> > if Trafodion introduced a special mode that offers the same 'tolerance'
> > during bulkload, for users who can confirm that these conversions are
> > desired, so that the extra 'data clean' process becomes unnecessary.
> > This matters especially when data is shared between Trafodion and other
> > products like Hive.
> >
> > I will file a JIRA if there are no objections; any suggestions and
> > ideas are welcome!
> >
> > Thanks,
> > Ming
> >
> >
>

Re: new requirements for Trafodion to be more tolerant during bulk loading

Posted by "Liu, Ming (Ming)" <mi...@esgyn.cn>.
Hi, Hans,

I need to spend more time studying Hive. At first glance, Hive simply treats invalid values as NULL, which is what I originally had in mind. The Trafodion bulkloader already has an option to log error rows into an 'error log' file on HDFS, so that part is already implemented. What I now think is: when reading an Integer column but getting 'abc' from the source, instead of raising an error and dumping that row into the 'error log' file, in a special mode Trafodion would still load the row with the invalid field as NULL and at the same time dump it into the 'error log' file for later investigation.
But I do need to spend more time studying Hive's behavior; what I said applies only to Hive external tables, where there is no 'load' phase and the user simply does an 'hdfs put' of a file into HDFS. I need to try loading invalid data into a Hive table to see how Hive handles it.

For '\n' in the middle, you pointed out a really difficult corner case, i.e. an extra '\n' in the last field. It is hard to get this right in all cases. What I originally had in mind is simple: each row has a fixed number of columns, so until HIVE_SCAN has read all the columns it would treat the line terminator as a normal character; but when the '\n' is in the last field, this fails. We do need to think more about this, and maybe we need to abandon this requirement. It is always simpler to fix it at the source by choosing another delimiter; 0x01 is always a good one. The fact is, some commercial RDBMSs can tolerate this kind of issue. For this specific case, '\n' in the last field, there may be a solution, like looking further ahead, but it is really difficult if the last field contains multiple '\n's. So this may not be a valid requirement, and Hive cannot handle it either. I need to study how that commercial RDBMS does it. I will update later; for now, we can ignore this requirement.
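
To make the column-counting idea and its limitation concrete, here is a
hypothetical Java sketch (this is not HIVE_SCAN code):

    import java.io.BufferedReader;
    import java.io.IOException;

    // Sketch of the column-counting heuristic: keep consuming physical
    // lines, treating '\n' as ordinary data, until the expected number of
    // field delimiters has been seen. This recovers embedded '\n' in every
    // field except the last, where an embedded '\n' cannot be told apart
    // from the real end of the record.
    public class MultiLineRecordReader {
        public static String readRecord(BufferedReader in, char fieldDelim,
                                        int expectedCols) throws IOException {
            StringBuilder record = new StringBuilder();
            int delims = 0;
            String line;
            while ((line = in.readLine()) != null) {
                if (record.length() > 0)
                    record.append('\n'); // embedded terminator kept as data
                record.append(line);
                for (int i = 0; i < line.length(); i++)
                    if (line.charAt(i) == fieldDelim)
                        delims++;
                if (delims >= expectedCols - 1)
                    break; // all delimiters seen; the last field is ambiguous
            }
            return (record.length() == 0 && line == null) ? null
                                                          : record.toString();
        }
    }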

Thanks,
Ming

-----Original Message-----
From: Hans Zeller [mailto:hans.zeller@esgyn.com]
Sent: March 31, 2016 1:49
To: dev <de...@trafodion.incubator.apache.org>
Subject: Re: new requirements for Trafodion to be more tolerant during bulk loading

Hi Ming,

Yes, it would be good if Trafodion behaved in ways similar to Hive. It would also be good if errors in the data didn't make the entire bulk load fail. Instead, those error rows should ideally go into a separate file or table (with additional fields for the source file name and line number if possible). That's not an easy thing to do, though. Could you say how much of this error logging functionality you would want, and how Hive solves it?

I don't quite understand how we would recognize record delimiters in a data field. Do you want to ignore record delimiters that appear within field delimiters? So a record delimiter in the last field would cause the record to be truncated, but in fields other than the last it would be OK? I think it would be better to treat a row with a record delimiter in it as an error row (two error rows, actually). Possibly we could allow quoted strings, like

"this is a single field with a field delimiter | and a record delimiter \n and a quote "" in it"


Thanks,

Hans

On Mon, Mar 28, 2016 at 8:13 PM, Liu, Ming (Ming) <mi...@esgyn.cn> wrote:

> Hi, all,
>
> Trafodion can bulk load data from HDFS into Trafodion tables.
> Currently, it has some strict requirements on the source data in order
> for a load to succeed.
> Typically, the source data should be clean and contain relatively little
> 'dirty' data. However, there are special cases where the source data
> contains special values that we hope Trafodion can handle automatically:
>
>
> Automatically remove '\r' when it appears as part of the '\r\n'
> DOS-format line delimiter.
>
> Do not raise a SQL error; instead, convert bad data to NULL
> automatically, while still logging it into the error log files when
> required, so the change is not silent and the action remains traceable.
>
> Allow '\n' inside a data field even when '\n' is the line terminator.
>
> Automatically truncate overflowed strings and log them into the error
> log file, in order to keep them traceable.
>
> When the source data has the above 'issues', we currently have to run a
> special 'data clean' process before loading: convert DOS format into
> Unix format, then find and remove bad data. However, products like Hive
> can handle such 'bad' data as described above. So it would be helpful if
> Trafodion introduced a special mode that offers the same 'tolerance'
> during bulkload, for users who can confirm that these conversions are
> desired, so that the extra 'data clean' process becomes unnecessary.
> This matters especially when data is shared between Trafodion and other
> products like Hive.
>
> I will file a JIRA if there are no objections; any suggestions and ideas
> are welcome!
>
> Thanks,
> Ming
>
>

Re: new requirements for Trafodion to be more tolerant during bulk loading

Posted by Hans Zeller <ha...@esgyn.com>.
Hi Ming,

Yes, it would be good if Trafodion behaved in ways similar to Hive. It
would also be good if errors in the data didn't make the entire bulk load
fail. Instead, those error rows should ideally go into a separate file or
table (with additional fields for the source file name and line number if
possible). That's not an easy thing to do, though. Could you say how much
of this error logging functionality you would want, and how Hive solves it?
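
As a purely illustrative sketch of that error-file idea (all names here
are invented, not an existing Trafodion interface), each rejected row
could carry its origin:

    import java.io.IOException;
    import java.io.Writer;

    // Sketch: write each rejected row to a separate error file, tagged
    // with the source file name and line number so that the row can be
    // investigated later.
    public class ErrorRowWriter {
        private final Writer out;

        public ErrorRowWriter(Writer out) {
            this.out = out;
        }

        public void reject(String sourceFile, long lineNo, String rawRow,
                           String reason) throws IOException {
            // Tab-separated: origin first, then the diagnostic, then the data.
            out.write(sourceFile + "\t" + lineNo + "\t" + reason + "\t"
                      + rawRow + "\n");
        }
    }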

I don't quite understand how we would recognize record delimiters in a
data field. Do you want to ignore record delimiters that appear within
field delimiters? So a record delimiter in the last field would cause the
record to be truncated, but in fields other than the last it would be OK?
I think it would be better to treat a row with a record delimiter in it as
an error row (two error rows, actually). Possibly we could allow quoted
strings, like

"this is a single field with a field delimiter | and a record delimiter \n
and a quote "" in it"


Thanks,

Hans

On Mon, Mar 28, 2016 at 8:13 PM, Liu, Ming (Ming) <mi...@esgyn.cn> wrote:

> Hi, all,
>
> Trafodion can bulk load data from HDFS into Trafodion tables.
> Currently, it has some strict requirements on the source data in order
> for a load to succeed.
> Typically, the source data should be clean and contain relatively little
> 'dirty' data. However, there are special cases where the source data
> contains special values that we hope Trafodion can handle automatically:
>
>
> Automatically remove '\r' when it appears as part of the '\r\n'
> DOS-format line delimiter.
>
> Do not raise a SQL error; instead, convert bad data to NULL
> automatically, while still logging it into the error log files when
> required, so the change is not silent and the action remains traceable.
>
> Allow '\n' inside a data field even when '\n' is the line terminator.
>
> Automatically truncate overflowed strings and log them into the error
> log file, in order to keep them traceable.
>
> When the source data has the above 'issues', we currently have to run a
> special 'data clean' process before loading: convert DOS format into
> Unix format, then find and remove bad data. However, products like Hive
> can handle such 'bad' data as described above. So it would be helpful if
> Trafodion introduced a special mode that offers the same 'tolerance'
> during bulkload, for users who can confirm that these conversions are
> desired, so that the extra 'data clean' process becomes unnecessary.
> This matters especially when data is shared between Trafodion and other
> products like Hive.
>
> I will file a JIRA if there are no objections; any suggestions and ideas
> are welcome!
>
> Thanks,
> Ming
>
>