Posted to user@hive.apache.org by "Gerber, Bryan W" <Br...@pnnl.gov> on 2016/01/12 18:40:57 UTC

Loading data containing newlines

We are attempting to load CSV text files (compressed to bz2) that contain newlines in fields, using EXTERNAL tables and INSERT/SELECT into ORC format tables. Data volume is ~1 TB/day, so we are trying hard to avoid unpacking the files just to condition the data.

A few days of research has us ready to implement custom input/output formats to handle the ingest. Any other suggestions that might be less effort, with low impact on load times?

Thanks,
Bryan G.

RE: Loading data containing newlines

Posted by Mich Talebzadeh <mi...@peridale.co.uk>.
Thanks Ryan, very useful to know indeed.

 

From: Mich Talebzadeh [mailto:mich@peridale.co.uk] 
Sent: Friday, January 15, 2016 4:04 PM
To: user@hive.apache.org
Subject: RE: Loading data containing newlines

 

Ok, but I believe there are other similar approaches.

I can take a raw CSV file and customize it using existing shell commands like sed, awk, cut, grep, etc., for example getting rid of blank lines or replacing silly characters.

Bottom line, I eventually want to store that CSV file in a Hive table in a format I can run SQL queries on.

Is that a viable alternative?

Thanks

 

From: Marcin Tustin [mailto:mtustin@handybook.com] 
Sent: 15 January 2016 21:51
To: user@hive.apache.org
Subject: Re: Loading data containing newlines

 

You can open a file as an RDD of lines and map whatever custom tokenisation function you want over it; alternatively, you can partition down to a reasonable size and use mapPartitions to map the standard Python csv parser over the partitions.

 

In general, the advantage of Spark is that you can do anything you like rather than being limited to a specific set of primitives.
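As a minimal Scala sketch of that idea (Marcin is describing the Python API; this analogue and its record scanner are illustrative, not his code, and assume commons-compress is on the classpath): each bz2 file is read whole, because splitting on raw newlines is exactly what breaks quoted fields.

import org.apache.commons.compress.compressors.bzip2.BZip2CompressorInputStream
import org.apache.spark.{SparkConf, SparkContext}

object CsvNewlineSketch {
  // Split CSV text into records, treating a newline inside double quotes
  // as data rather than a record separator. (Simplified: a doubled quote ""
  // toggles the flag twice, so the state still nets out correctly. The
  // trailing .trim also drops the \r of CRLF line endings.)
  def splitRecords(text: String): Seq[String] = {
    val records = scala.collection.mutable.ArrayBuffer(new StringBuilder)
    var inQuotes = false
    for (c <- text) c match {
      case '"'               => inQuotes = !inQuotes; records.last.append(c)
      case '\n' if !inQuotes => records += new StringBuilder
      case other             => records.last.append(other)
    }
    records.map(_.toString.trim).filter(_.nonEmpty).toSeq
  }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("csv-newlines"))
    // binaryFiles yields one (path, stream) pair per file, so each bz2 file
    // is decompressed and scanned as a unit (it must fit in executor memory).
    val records = sc.binaryFiles("hdfs:///data/stg/table/*.bz2")
      .flatMap { case (_, portable) =>
        val in = new BZip2CompressorInputStream(portable.open())
        val text = try scala.io.Source.fromInputStream(in, "UTF-8").mkString
                   finally in.close()
        splitRecords(text)
      }
    records.take(5).foreach(println)
    sc.stop()
  }
}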

 

On Fri, Jan 15, 2016 at 4:42 PM, Mich Talebzadeh <mich@peridale.co.uk> wrote:

Hi Marcin,

 

Can you be specific in what way Spark is better suited for this operation compared to Hive?

 


 

From: Marcin Tustin [mailto:mtustin@handybook.com]
Sent: 15 January 2016 21:39
To: user@hive.apache.org
Subject: Re: Loading data containing newlines

 

I second this. I've generally found anything else to be disappointing when working with data which is at all funky. 

 

On Wed, Jan 13, 2016 at 8:13 PM, Alexander Pivovarov <apivovarov@gmail.com> wrote:

Time to use Spark and Spark SQL in addition to Hive?

It's probably going to happen sooner or later anyway.

 

I sent you a Spark solution yesterday. (You just need to write an unbzip2AndCsvToListOfArrays(file: String): List[Array[String]] function using BZip2CompressorInputStream and the Super CSV API.)

You can download Spark, open spark-shell, and run/debug the program on a single computer,

 

and then run it on a cluster if needed (e.g. Amazon EMR can spin up a Spark cluster in 7 min).
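As a rough sketch of the helper Alexander describes, as it might look pasted into spark-shell (assuming commons-compress and Super CSV are on the classpath; the body is a guess at the shape, not his original code):

import java.io.{BufferedInputStream, FileInputStream, InputStreamReader}
import org.apache.commons.compress.compressors.bzip2.BZip2CompressorInputStream
import org.supercsv.io.CsvListReader
import org.supercsv.prefs.CsvPreference
import scala.collection.JavaConverters._
import scala.collection.mutable.ListBuffer

// Decompress a .bz2 CSV file and parse it into rows. STANDARD_PREFERENCE
// makes Super CSV honour quoting, so newlines embedded in quoted fields
// stay inside the field instead of starting a new record.
def unbzip2AndCsvToListOfArrays(file: String): List[Array[String]] = {
  val reader = new CsvListReader(
    new InputStreamReader(
      new BZip2CompressorInputStream(
        new BufferedInputStream(new FileInputStream(file))), "UTF-8"),
    CsvPreference.STANDARD_PREFERENCE)
  try {
    val rows = ListBuffer.empty[Array[String]]
    var row = reader.read() // returns null at end of stream
    while (row != null) {
      rows += row.asScala.toArray
      row = reader.read()
    }
    rows.toList
  } finally reader.close()
}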

 

On Wed, Jan 13, 2016 at 4:13 PM, Gerber, Bryan W <Bryan.Gerber@pnnl.gov> wrote:

1. hdfs dfs -copyFromLocal /incoming/files/*.bz2 hdfs://host.name/data/stg/table/

2. CREATE EXTERNAL TABLE stg_<table> (cols…) ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde' STORED AS TEXTFILE LOCATION '/data/stg/table/'

3. CREATE TABLE <table> (cols…) STORED AS ORC tblproperties ("orc.compress"="ZLIB");

4. INSERT INTO TABLE <table> SELECT cols, udf1(cola), udf2(colb), functions(), etc. FROM stg_<table>

5. Delete files from hdfs://host.name/data/stg/table/

 

This has been working quite well, until our newest data arrived with fields containing embedded newlines; the underlying TEXTFILE input format splits records on every newline before the SerDe ever sees the quoting.

 

We are now looking into options further up the pipeline to see if we can condition the data earlier in the process.

 

From: Mich Talebzadeh [mailto:mich@peridale.co.uk]
Sent: Wednesday, January 13, 2016 10:34 AM
To: user@hive.apache.org
Subject: RE: Loading data containing newlines

 

Thanks Bryan.

 

Just to clarify do you use something like below?

 

1. hdfs dfs -copyFromLocal /var/tmp/t.bcp hdfs://rhes564.hedat.net:9000/misc/t.bcp

2. CREATE EXTERNAL TABLE <table_name> (col1 INT, col2 STRING, …) COMMENT 'load from bcp file' ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS ORC

 

Cheers,

 

 


 

From: Gerber, Bryan W [mailto:Bryan.Gerber@pnnl.gov]
Sent: 13 January 2016 18:12
To: user@hive.apache.org
Subject: RE: Loading data containing newlines

 

We are pushing the compressed text files into an HDFS directory backing the Hive EXTERNAL table, then running an INSERT from it into the table with ORC storage. We are letting Hive handle the ORC file creation process.

 

From: Mich Talebzadeh [mailto:mich@peridale.co.uk] 
Sent: Tuesday, January 12, 2016 4:41 PM
To: user@hive.apache.org
Subject: RE: Loading data containing newlines

 

Hi Bryan,

 

As a matter of interest, are you loading the text files into local directories in encrypted format at all and then pushing them into HDFS/Hive as ORC?

 

Thanks

 

 



RE: Loading data containing newlines

Posted by Ryan Harris <Ry...@zionsbancorp.com>.
Mich, if you have a toolpath that you can use to pipeline the required edits to the source file, you can use a chain similar to this:

hadoop fs -text ${hdfs_path}/${orig_filename} | iconv -f EBCDIC-US -t ASCII | sed 's/\(.\{133\}\)/\1\n/g' | gzip -c | /usr/bin/hadoop fs -put - /etl/${table_name}/load/${orig_filename}.gz

to clean up your source input data as you drop it into the initial external table location that Hive will use in a Hive-based ELT chain. (Here hadoop fs -text decompresses and streams the original file, iconv converts it from EBCDIC to ASCII, the sed expression inserts a newline after every 133 characters to delimit the fixed-width records, and the result is gzipped back into HDFS.)

It really depends on your upstream data path. If data were being collected by Flume, you might be able to clean it up there. It is also possible to handle this with custom Hive SerDes, but it depends on where you want to write the code and how much existing data you already have to deal with.

Spark is also a very flexible and useful tool for this sort of problem, and it offers numerous advantages when used as an execution engine, but setting up Spark strictly to resolve this issue seems like overkill to me.



RE: Loading data containing newlines

Posted by Mich Talebzadeh <mi...@peridale.co.uk>.
Interesting points.

I think we are now moving to another abstraction layer.

Recall that all these extra features (Spark, Scala, etc.) require a learning curve and potentially additional skill sets, which in practice may not be a viable option for many companies that already have a lot of investment in SQL-based tools and generally have Hive skills available. The exit barriers are considerable.

I use Hive on the Spark engine, and it does a good job performance-wise in addition to Hive on MR. Also, Hive SQL seems to be a fuller set compared to Spark SQL.

None of these issues are really show-stoppers in Hive, and I believe they can be resolved without moving to another tool set.

Having said that, I would be more than happy to explore the Spark features that you mentioned.

Cheers,

 

 


Re: Loading data containing newlines

Posted by Alexander Pivovarov <ap...@gmail.com>.
Hive requires you to provide a table schema even if you create a table based
on a folder of ORC files (an ORC file already has its schema internally).
It's a shame, because ORC was originally a Hive-internal project.

Spark can create a table based on ORC or Parquet files automatically, without
asking you to provide a schema.

val myTableDf = getSqlContext.read.format("orc")
  .load("s3n://mycompany/hive/warehouse/my_table_orc")

myTableDf.printSchema()

myTableDf.show()


You do not even need to register table to do SQL based ETL against
files (added in Spark 1.6)

SPARK-11197 SQL Queries on Files - Concise syntax for running SQL
queries over files of any supported format without registering a
table.

https://issues.apache.org/jira/browse/SPARK-11197
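For illustration, that feature lets a query point straight at the ORC
directory; a minimal sketch, reusing the path from the example above and
assuming the same SQL context:

sqlContext.sql(
  "SELECT * FROM orc.`s3n://mycompany/hive/warehouse/my_table_orc` LIMIT 10"
).show()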


I think it is now clearer why lots of companies are moving to Spark to do their ETL.



Re: Loading data containing newlines

Posted by Alexander Pivovarov <ap...@gmail.com>.
Probably Bryan can try both Hive and Spark and decide which one works
better for him.

The fact is, lots of companies are migrating from Hadoop/Hive to Spark.

If you like writing ETL using the Spark API, then you can use map,
reduceByKey, groupByKey, join, distinct, etc.
If you like using SQL, then you can do it by running
sqlContext.sql("select ....").
In addition to SQL you can also use the DataFrame API.

Hive only allows you to use SQL.

BTW, most Hive UDFs are available in Spark.
Plus, Spark allows you to create a UDF on the fly right in your script, e.g.

sqlContext.udf.register("cube", (in: java.lang.Long) => in * in * in)

sqlContext.sql("select cube(4) c").show()

+---+
| c |
+---+
| 64|
+---+

On Fri, Jan 15, 2016 at 3:03 PM, Mich Talebzadeh <mi...@peridale.co.uk>
wrote:

> Ok but I believe there are other similar approaches.
> [earlier quoted messages trimmed]

RE: Loading data containing newlines

Posted by Mich Talebzadeh <mi...@peridale.co.uk>.
Ok but I believe there are other similar approaches.

 

I can take a raw CSV file and customize it with existing shell commands such as sed, awk, cut and grep, for example removing blank lines or replacing stray characters.
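
For instance, a hypothetical pass of that kind (illustrative only; the paths and the exact fixes are placeholders, not taken from this thread):

# drop blank lines and strip carriage returns between staging locations
hdfs dfs -cat /data/stg/table/raw.csv |
  sed '/^[[:space:]]*$/d' |
  tr -d '\r' |
  hdfs dfs -put -f - /data/stg/table/clean.csv

Note that a line-oriented pass like this still cannot repair newlines embedded inside quoted fields, which is the problem Bryan raised.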

 

Bottom line: I want eventually to store that CSV file in a Hive table in a format that I can run SQL queries against.

 

Is that a viable alternative?

 

Thanks

 

 

 

Dr Mich Talebzadeh

From: Marcin Tustin [mailto:mtustin@handybook.com]
Sent: 15 January 2016 21:51
To: user@hive.apache.org
Subject: Re: Loading data containing newlines

[earlier quoted messages trimmed]

 




Re: Loading data containing newlines

Posted by Gopal Vijayaraghavan <go...@apache.org>.
> You can open a file as an RDD of lines, and map whatever custom
> tokenisation function you want over it;

That's what a SerDe does in Hive (like OpenCSVSerDe).

Once your record gets split into multiple lines, the problem becomes
more complex, since Spark's functional nature demands side-effect-free
map() operations.

You cannot depend on the previous row in any map(), particularly because
the natural operations are unordered and lazy.
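
A toy illustration of that point (a constructed example, not from the
original message):

// two physical lines that are really one logical CSV record
val lines = sc.parallelize(Seq("1,\"line one", "continues\"", "2,\"ok\""))
lines.map(_.split(",")).collect()
// record 1 arrives as two separate fragments no map() can rejoin safely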

The only way to get proper contractual ordering is to get something via
OrderedRDDFunctions [1].

> alternatively you can partition down to a reasonable size and use
> map_partitions to map the standard python csv parser over the partitions.


Yet again, this does not work once you get to delimiter-interspersed data
like the case discussed.

Partitioning the RDD, when you have mixed newlines, will give you half a
row as the first item in a partition.

> In general, the advantage of spark is that you can do anything you like
> rather than being limited to a specific set of primitives.


That is true once you've cut the records up at their true boundaries into
an RDD, but as long as you're using .textFile() you have the exact same
annoyances that Hadoop's TextInputFormat has.

HadoopRDD has the same issues with unescaped interspersed delimiters,
where a "\n" cannot be identified as a delimiter purely by itself or its
previous byte.

Once you write an InputFormat which has stateful readers across lines for
Hadoop, then that can be used by Spark too - but Spark by itself can't fix
delimiter interspersing.

I've had to implement this before, and it wasn't simple to scale it up
past 1 HDFS block of input without losing data.


However, when you're doing data cleansing, Spark makes it really easy to
drop partial rows and move ahead, unlike something like Hive.


Cheers,
Gopal
[1] - http://stackoverflow.com/questions/29268210/mind-blown-rdd-zip-method/29281548#29268210



Re: Loading data containing newlines

Posted by Marcin Tustin <mt...@handybook.com>.
You can open a file as an RDD of lines, and map whatever custom
tokenisation function you want over it; alternatively you can partition
down to a reasonable size and use map_partitions to map the standard python
csv parser over the partitions.

In general, the advantage of spark is that you can do anything you like
rather than being limited to a specific set of primitives.
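
A rough PySpark rendering of that suggestion (illustrative only; as Gopal
notes in this thread, partition boundaries are still drawn at raw
newlines, so a record spanning a partition boundary will still break):

import csv

rdd = sc.textFile("hdfs://host.name/data/stg/table/")
# csv.reader consumes an iterator of lines and can rejoin quoted
# multi-line records, but only within a single partition
parsed = rdd.mapPartitions(lambda lines: csv.reader(lines))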

On Fri, Jan 15, 2016 at 4:42 PM, Mich Talebzadeh <mi...@peridale.co.uk>
wrote:

> Hi Marcin,
>
> Can you be specific in what way Spark is better suited for this operation
> compared to Hive?
> [earlier quoted messages trimmed]


RE: Loading data containing newlines

Posted by Mich Talebzadeh <mi...@peridale.co.uk>.
Hi Marcin,

 

Can you be specific in what way Spark is better suited for this operation compared to Hive?

 

Dr Mich Talebzadeh

From: Marcin Tustin [mailto:mtustin@handybook.com]
Sent: 15 January 2016 21:39
To: user@hive.apache.org
Subject: Re: Loading data containing newlines

[earlier quoted messages trimmed]

 




Re: Loading data containing newlines

Posted by Marcin Tustin <mt...@handybook.com>.
I second this. I've generally found anything else to be disappointing when
working with data which is at all funky.

On Wed, Jan 13, 2016 at 8:13 PM, Alexander Pivovarov <ap...@gmail.com>
wrote:

> Time to use Spark and Spark-Sql in addition to Hive?
> It's probably going to happen sooner or later anyway.
> [earlier quoted messages trimmed]


Re: Loading data containing newlines

Posted by Alexander Pivovarov <ap...@gmail.com>.
Time to use Spark and Spark-Sql in addition to Hive?
It's probably going to happen sooner or later anyway.

I sent you the Spark solution yesterday.  (You just need to write an
unbzip2AndCsvToListOfArrays(file: String): List[Array[String]] function
using BZip2CompressorInputStream and the Super CSV API.)
You can download Spark, open spark-shell and run/debug the program on a
single computer,

and then run it on a cluster if needed (e.g. Amazon EMR can spin up a
Spark cluster in 7 min).
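
One way that description might be filled in, as a sketch under
assumptions rather than Alexander's actual off-list code: it takes the
already-open stream instead of a file path so it can be driven from
sc.binaryFiles (Spark 1.2+), and it assumes commons-compress and Super
CSV are on the classpath.

import java.io.InputStreamReader
import org.apache.commons.compress.compressors.bzip2.BZip2CompressorInputStream
import org.supercsv.io.CsvListReader
import org.supercsv.prefs.CsvPreference
import scala.collection.JavaConverters._
import scala.collection.mutable.ArrayBuffer

def unbzip2AndCsvToListOfArrays(in: java.io.InputStream): List[Array[String]] = {
  val reader = new CsvListReader(
    new InputStreamReader(new BZip2CompressorInputStream(in), "UTF-8"),
    CsvPreference.STANDARD_PREFERENCE)
  val rows = new ArrayBuffer[Array[String]]
  var row = reader.read()          // returns null at end of stream
  while (row != null) {
    rows += row.asScala.toArray    // quoted embedded newlines handled by read()
    row = reader.read()
  }
  reader.close()
  rows.toList
}

// Each bz2 file arrives as one (path, stream) pair, so no record is ever
// cut at a raw newline; parallelism is per file, which suits bz2 anyway.
val rows = sc.binaryFiles("hdfs://host.name/data/stg/table/*.bz2")
  .flatMap { case (_, stream) => unbzip2AndCsvToListOfArrays(stream.open()) }

The obvious caveat: each file's rows are materialised in task memory, so
this fits many modest files better than a few huge ones.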

On Wed, Jan 13, 2016 at 4:13 PM, Gerber, Bryan W <Br...@pnnl.gov>
wrote:

> [quoted messages trimmed; Bryan's pipeline appears in full in the next
> message]

RE: Loading data containing newlines

Posted by Mich Talebzadeh <mi...@peridale.co.uk>.
Hi Bryan,

 

Thanks for this detailed explanation.

 

We have also experimented with importing bzip2 files and Hive is pretty good
at handling them.

 

We also need to negotiate empty lines and columns defined as currencies. I
still think that mapping an external table to the raw (unfiltered) files is
the correct way of doing it (ELT as opposed to ETL).

 

Our problem is the best way of creating the target table in Hive so that it
holds the filtered data and lets us use the data rather than merely store
it. Do you have any suggestions in this area?
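
One possible shape for that step (a sketch only, with placeholder table
and column names; the regexp and types would need adjusting to the real
data):

-- target stays ORC, as in Bryan's pipeline quoted below; the conditioning
-- (blank-line filtering, currency-to-decimal) lives in the INSERT ... SELECT
CREATE TABLE sales (id INT, amount DECIMAL(18,2))
STORED AS ORC TBLPROPERTIES ("orc.compress"="ZLIB");

INSERT INTO TABLE sales
SELECT CAST(id AS INT),
       CAST(regexp_replace(amount, '[£$,]', '') AS DECIMAL(18,2))
FROM stg_sales
WHERE id IS NOT NULL AND id <> '';   -- rows from blank lines fall out here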

 

 

Cheers,

 

Dr Mich Talebzadeh

 



RE: Loading data containing newlines

Posted by "Gerber, Bryan W" <Br...@pnnl.gov>.
1.       hdfs dfs -copyFromLocal /incoming/files/*.bz2 hdfs://host.name/data/stg/table/

2.       CREATE EXTERNAL TABLE stg_<table> (cols...) ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde' STORED AS TEXTFILE LOCATION '/data/stg/table/'

3.       CREATE TABLE <table> (cols...) STORED AS ORC TBLPROPERTIES ("orc.compress"="ZLIB");

4.       INSERT INTO TABLE <table> SELECT cols, udf1(cola), udf2(colb), functions(), etc. FROM stg_<table>

5.       Delete the files from hdfs://host.name/data/stg/table/

This has been working quite well, until our newest data contains fields with embedded newlines.

We are now looking into options further up the pipeline to see if we can condition the data earlier in the process.
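For reference, here is step 2 written out with the serde's quoting
properties, as a sketch with placeholder column names. Note that
OpenCSVSerde reads every column as STRING, so any casts happen in the
INSERT/SELECT:

CREATE EXTERNAL TABLE stg_example (
  col_a STRING,
  col_b STRING,
  col_c STRING
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
  "separatorChar" = ",",
  "quoteChar"     = "\"",
  "escapeChar"    = "\\"
)
STORED AS TEXTFILE
LOCATION '/data/stg/example/';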


RE: Loading data containing newlines

Posted by Mich Talebzadeh <mi...@peridale.co.uk>.
Thanks Bryan.

Just to clarify, do you use something like the below?

1.  hdfs dfs -copyFromLocal /var/tmp/t.bcp hdfs://rhes564.hedat.net:9000/misc/t.bcp

2.  CREATE EXTERNAL TABLE <table_name> (col1 INT, col2 STRING, ...) COMMENT
'load from bcp file' ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS
ORC

 

Cheers,

 

 

Dr Mich Talebzadeh

 


 



Re: Loading data containing newlines

Posted by Gopal Vijayaraghavan <go...@apache.org>.
> We are pushing the compressed text files into HDFS directory for Hive
> EXTERNAL table, then using an INSERT on the table using ORC storage. We
> are letting Hive handle the ORC file creation process.

Are the compressed text files small enough to process one by one?

I wrote something similar last year for an EBCDIC case.

The only thing it can't do is split a file half-way through, so each file
is processed as a single stream with a simple state machine.

Cheers,
Gopal



RE: Loading data containing newlines

Posted by "Gerber, Bryan W" <Br...@pnnl.gov>.
We are pushing the compressed text files into HDFS directory for Hive EXTERNAL table, then using an INSERT on the table using ORC storage. We are letting Hive handle the ORC file creation process.


RE: Loading data containing newlines

Posted by Mich Talebzadeh <mi...@peridale.co.uk>.
Hi Bryan,

 

As a matter of interest, are you loading the text files into local
directories in encrypted form at all, and then pushing them into HDFS/Hive
as ORC?

 

Thanks

 

 

Dr Mich Talebzadeh

 


 



RE: Loading data containing newlines

Posted by Alexander Pivovarov <ap...@gmail.com>.
I tried spark-csv with a file having a newline inside a field value; it does
not work either.

$ cat /tmp/cars.csv
1,"Hello1
world"
2,"Hello2"
3,"Hello3"

scala> val df = sqlContext.read.
     |   format("com.databricks.spark.csv").
     |   load("/tmp/cars.csv")
java.io.IOException: (startline 1) EOF reached before encapsulated token finished
  at org.apache.commons.csv.Lexer.parseEncapsulatedToken(Lexer.java:282)
  at org.apache.commons.csv.Lexer.nextToken(Lexer.java:152)
  at org.apache.commons.csv.CSVParser.nextRecord(CSVParser.java:498)
  at org.apache.commons.csv.CSVParser.getRecords(CSVParser.java:365)
  at com.databricks.spark.csv.CsvRelation.inferSchema(CsvRelation.scala:223)
  at com.databricks.spark.csv.CsvRelation.<init>(CsvRelation.scala:72)
  at com.databricks.spark.csv.DefaultSource.createRelation(DefaultSource.scala:157)
  at com.databricks.spark.csv.DefaultSource.createRelation(DefaultSource.scala:44)

What you can do is:

1. read the list of files into an Array[String]

val files = getListOfFiles(dir)

2. create an RDD from it and repartition by files.length, so that each task
gets exactly one file

val filesRdd = sc.parallelize(files, files.length)

3. unbzip2 and parse, one file per task

val lines: RDD[Array[String]] = filesRdd.flatMap(file =>
  unbzip2AndCsvToListOfArrays(file))

unbzip2AndCsvToListOfArrays(file: String): List[Array[String]] can use a CSV
parser which understands newlines inside field values, e.g. Super CSV

4. create an RDD of Rows

val rows = lines.map(line => Row.fromSeq(line.toSeq))

5. create a DataFrame, where schema describes the column names and types

val df = getSqlContext.createDataFrame(rows, schema)

6. save the DataFrame as ORC

df.repartition(outputFilesCount).write.format("orc").save(outputPath)



RE: Loading data containing newlines

Posted by "Gerber, Bryan W" <Br...@pnnl.gov>.
From that wiki:
"This SerDe works for most CSV data, but does not handle embedded newlines."

The Hive SerDe interface is all downstream of the TextInputFormat, which has already split records by newlines.  In theory you can give it a different line delimiter, but Hive 1.2.1 does not support it: "FAILED: SemanticException 3:20 LINES TERMINATED BY only supports newline '\n' right now."
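For example, a hypothetical DDL along these lines fails during semantic
analysis with exactly that error:

CREATE TABLE t_example (c1 STRING, c2 STRING)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY ','
  LINES TERMINATED BY '\001'  -- anything other than '\n' is rejected
STORED AS TEXTFILE;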



Re: Loading data containing newlines

Posted by Alexander Pivovarov <ap...@gmail.com>.
Try the CSV serde. It should correctly parse a quoted field value with a
newline inside:
https://cwiki.apache.org/confluence/display/Hive/CSV+Serde

Hadoop should read the bz2 files automatically.

