Posted to users@apex.apache.org by bhidevivek <bh...@gmail.com> on 2017/05/16 19:53:29 UTC

HiveOutputModule creating extra directories, than specified, while saving data into HDFS

Hi All, I am trying to use the HiveOutputModule to insert ingested data into
a Hive external table. The table is already created with its location set to
the dt.application.<app_name>.operator.hiveOutput.prop.filePath property, and
the partition column is accessdate. With the configurations below in the
property file, the HDFS file structure I am expecting is:

/common/data/test/accessCounts
    |
    ----- accessdate=2017-05-15
    |         |
    |         ------- <fil1>
    |         ------- <fil2>
    ----- accessdate=2017-05-16
              |
              ------- <fil1>
              ------- <fil2>

but the actual structure looks like:

/common/data/test/accessCounts/<yarn_application_id_for_apex_ingest_appl>/10
    |
    ----- 2017-05-15
    |         |
    |         ------- <fil1>
    |         ------- <fil2>
    ----- 2017-05-16
              |
              ------- <fil1>
              ------- <fil2>

Questions
1. Why are the yarn_application_id and other extra directories created when
they are nowhere specified in the config?
2. What other configuration do I need to set to get the structure I want?
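For reference, the expected layout above follows Hive's standard partition
directory convention, where each partition lives in a column=value directory
under the table location. A minimal sketch of how such a path is formed (the
helper name is mine, for illustration only, not part of Malhar):

```java
// Sketch: Hive-style partition directory naming, column=value under the
// table's base location. Illustrative only; not the actual Malhar code.
public class PartitionPathSketch {
    static String partitionPath(String tableLocation, String column, String value) {
        // e.g. /common/data/test/accessCounts/accessdate=2017-05-15
        return tableLocation + "/" + column + "=" + value;
    }

    public static void main(String[] args) {
        System.out.println(partitionPath(
            "/common/data/test/accessCounts", "accessdate", "2017-05-15"));
    }
}
```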

HiveOutputModule Configs
==================

<property>
	<name>dt.application.<app_name>.operator.hiveOutput.prop.filePath</name>
	<value>/common/data/test/accessCounts</value>
</property>
<property>
	<name>dt.application.<app_name>.operator.hiveOutput.prop.databaseUrl</name>
	<value><jdbc_url></value>
</property>
<property>
	<name>dt.application.<app_name>.operator.hiveOutput.prop.databaseDriver</name>
	<value>org.apache.hive.jdbc.HiveDriver</value>
</property>
<property>
	<name>dt.application.<app_name>.operator.hiveOutput.prop.tablename</name>
	<value><hive table name where records need to be inserted></value>
</property>
<property>
	<name>dt.application.<app_name>.operator.hiveOutput.prop.password</name>
	<value><hive connection password></value>
</property>
<property>
	<name>dt.application.<app_name>.operator.hiveOutput.prop.userName</name>
	<value><hive connection user></value>
</property>
<property>
	<name>dt.application.<app_name>.operator.hiveOutput.prop.hiveColumns</name>
	<value>{col1,col2,col3,col4}</value>
</property>
<property>
	<name>dt.application.<app_name>.operator.hiveOutput.prop.hiveColumnDataTypes</name>
	<value>{STRING,STRING,STRING,STRING}</value>
</property>
<property>
	<name>dt.application.<app_name>.operator.hiveOutput.prop.hivePartitionColumns</name>
	<value>{accessdate}</value>
</property>
<property>
	<name>dt.application.<app_name>.operator.hiveOutput.prop.hivePartitionColumnDataTypes</name>
	<value>{STRING}</value>
</property>
<property>
	<name>dt.application.<app_name>.operator.hiveOutput.prop.expressionsForHiveColumns</name>
	<value>{"getCol1()","getCol2()","getCol3()","getCol4()"}</value>
</property>
<property>
	<name>dt.application.<app_name>.operator.hiveOutput.prop.expressionsForHivePartitionColumns</name>
	<value>{"getAccessdate()"}</value>
</property>
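As a side note on the config above, the expressionsForHiveColumns values are
getter expressions evaluated against the incoming POJO. Conceptually, this is
what happens for a bare getter call like "getCol1()" (a simplified sketch via
reflection; Malhar's actual expression evaluator is more general, and the
class and method names here are illustrative only):

```java
import java.lang.reflect.Method;

// Simplified illustration of evaluating a getter expression such as
// "getCol1()" against a POJO via reflection. This sketch handles only
// bare getter calls; it is not the actual Malhar expression machinery.
public class GetterExpressionSketch {
    static Object evaluate(Object pojo, String expression) throws Exception {
        String methodName = expression.replace("()", "");  // "getCol1()" -> "getCol1"
        Method getter = pojo.getClass().getMethod(methodName);
        return getter.invoke(pojo);
    }

    // Hypothetical POJO standing in for the ingested record type.
    public static class Record {
        public String getCol1() { return "value1"; }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(evaluate(new Record(), "getCol1()"));
    }
}
```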



--
View this message in context: http://apache-apex-users-list.78494.x6.nabble.com/HiveOutputModule-creating-extra-directories-than-specified-while-saving-data-into-HDFS-tp1620.html
Sent from the Apache Apex Users list mailing list archive at Nabble.com.

Re: HiveOutputModule creating extra directories, than specified, while saving data into HDFS

Posted by Sanjay Pujare <sa...@datatorrent.com>.
Tracing the path, this looks similar to the previous case, where this code
path is triggered when the number of empty windows exceeds a certain
threshold. Try setting the property I mentioned in the last email to a high
value so this case is not triggered.


Re: HiveOutputModule creating extra directories, than specified, while saving data into HDFS

Posted by Vivek Bhide <bh...@gmail.com>.
Hi Sanjay

I waited for the application to roll over to the next file, but as soon as
the file reached the size I defined, the operator started failing with the
error below.

Any suggestions on this error? FYI, I overrode the file size in my
application properties to 50 MB from the default 128 MB.


2017-05-17 20:18:46,608 INFO  stram.StreamingContainerParent
(StreamingContainerParent.java:log(170)) - child msg: Stopped running due to
an exception. java.lang.NullPointerException
	at
com.datatorrent.lib.io.fs.AbstractFileOutputOperator.requestFinalize(AbstractFileOutputOperator.java:742)
	at
com.datatorrent.lib.io.fs.AbstractFileOutputOperator.rotate(AbstractFileOutputOperator.java:883)
	at
com.datatorrent.contrib.hive.AbstractFSRollingOutputOperator.rotateCall(AbstractFSRollingOutputOperator.java:186)
	at
com.datatorrent.contrib.hive.AbstractFSRollingOutputOperator.endWindow(AbstractFSRollingOutputOperator.java:227)
	at
com.datatorrent.stram.engine.GenericNode.processEndWindow(GenericNode.java:153)
	at com.datatorrent.stram.engine.GenericNode.run(GenericNode.java:397)
	at
com.datatorrent.stram.engine.StreamingContainer$2.run(StreamingContainer.java:1428)
 context:
PTContainer[id=9(container_e3092_1491920474239_123256_01_000022),state=ACTIVE,operators=[PTOperator[id=10,name=hiveOutput$fsRolling,state=PENDING_DEPLOY]]]
2017-05-17 20:18:47,819 WARN  stram.StreamingContainerManager
(StreamingContainerManager.java:processOperatorFailure(1439)) - Operator
failure: PTOperator[id=10,name=hiveOutput$fsRolling,state=INACTIVE] count:
10
2017-05-17 20:18:47,819 ERROR stram.StreamingContainerManager
(StreamingContainerManager.java:processOperatorFailure(1446)) - Initiating
container restart after operator failure
PTOperator[id=10,name=hiveOutput$fsRolling,state=INACTIVE]
2017-05-17 20:18:47,837 INFO  stram.StreamingAppMasterService
(StreamingAppMasterService.java:sendContainerAskToRM(1174)) - Requested stop
container container_e3092_1491920474239_123256_01_000022
2017-05-17 20:18:47,837 INFO  impl.NMClientAsyncImpl
(NMClientAsyncImpl.java:run(536)) - Processing Event EventType:
STOP_CONTAINER for Container container_e3092_1491920474239_123256_01_000022
2017-05-17 20:18:47,838 INFO  impl.NMClientImpl
(NMClientImpl.java:stopContainer(242)) - ok, stopContainerInternal..
container_e3092_1491920474239_123256_01_000022
2017-05-17 20:18:47,838 INFO  impl.ContainerManagementProtocolProxy
(ContainerManagementProtocolProxy.java:newProxy(260)) - Opening proxy :
brdn1089.target.com:45454
2017-05-17 20:18:48,840 INFO  stram.StreamingAppMasterService
(StreamingAppMasterService.java:execute(954)) - Completed
containerId=container_e3092_1491920474239_123256_01_000022, state=COMPLETE,
exitStatus=-105, diagnostics=Container killed by the ApplicationMaster.
Container killed on request. Exit code is 143
Container exited with a non-zero exit code 143
	at
com.datatorrent.contrib.hive.AbstractFSRollingOutputOperator.endWindow(AbstractFSRollingOutputOperator.java:227)
	at
com.datatorrent.stram.engine.GenericNode.processEndWindow(GenericNode.java:153)
	at com.datatorrent.stram.engine.GenericNode.run(GenericNode.java:397)
	at
com.datatorrent.stram.engine.StreamingContainer$2.run(StreamingContainer.java:1428)
 context:
PTContainer[id=9(container_e3092_1491920474239_123256_01_000022),state=ACTIVE,operators=[PTOperator[id=10,name=hiveOutput$fsRolling,state=PENDING_DEPLOY]]]
2017-05-17 20:18:47,819 WARN  stram.StreamingContainerManager
(StreamingContainerManager.java:processOperatorFailure(1439)) - Operator
failure: PTOperator[id=10,name=hiveOutput$fsRolling,state=INACTIVE] count:
10




Re: HiveOutputModule creating extra directories, than specified, while saving data into HDFS

Posted by bhidevivek <bh...@gmail.com>.
Thank you, Sanjay. I will check and get back in case I still see a problem.




Re: HiveOutputModule creating extra directories, than specified, while saving data into HDFS

Posted by Sanjay Pujare <sa...@datatorrent.com>.
Vivek,

Take a look at HiveOutputModule.populateDAG() (
https://github.com/apache/apex-malhar/blob/master/hive/src/main/java/org/apache/apex/malhar/hive/HiveOutputModule.java
)

This is a sub-DAG with fsRolling (FSPojoToHiveOperator) and hiveStore
operators, using the file path you supplied (/common/data/test/accessCounts).

If you look at the code in
com.datatorrent.contrib.hive.AbstractFSRollingOutputOperator.setup(OperatorContext)
(the superclass of FSPojoToHiveOperator), it does construct a path for
rolling temporary files along the lines you observed. But the final output
should end up in the output path you specified, if you wait long enough for
those files to be finalized.
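To illustrate the kind of path construction setup() performs, here is a
minimal sketch of an operator namespacing its in-progress files under the
configured base path by appending the YARN application id and the physical
operator id. The helper and its exact path format are hypothetical, for
illustration; the real logic lives in AbstractFSRollingOutputOperator:

```java
// Sketch: how a rolling output operator can derive a temporary output
// directory per application and physical operator, which would explain
// the <app_id>/10 directories observed. Not the actual Malhar code.
public class RollingPathSketch {
    static String tempOutputPath(String filePath, String yarnAppId, int operatorId) {
        // e.g. /common/data/test/accessCounts/application_1491920474239_123256/10
        return filePath + "/" + yarnAppId + "/" + operatorId;
    }

    public static void main(String[] args) {
        System.out.println(tempOutputPath(
            "/common/data/test/accessCounts",
            "application_1491920474239_123256",
            10));
    }
}
```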



On Tue, May 16, 2017 at 12:53 PM, bhidevivek <bh...@gmail.com> wrote:

> H All, I am trying to use HiveOutput Module to insert the ingested data
> into
> hive external table. The table is already created with the same location as
> /dt.application.<app_name>.operator.hiveOutput.prop.filePath/ property and
> partition column is accessdate. With below configurations in property file,
> the hdfs file structure I am expecting is
>
> /common/data/test/accessCounts
>                                                 |
>                                                 ----- accessdate=2017-05-15
>                                                                         |
>
> ------- <fil1>
>
> ------- <fil2>
>                                                 ----- accessdate=2017-05-16
>                                                                         |
>
> ------- <fil1>
>
> ------- <fil2>
>
> but the actual structure look like
>
> /common/data/test/accessCounts/<yarn_application_id_for_apex_
> ingest_appl>/10
>
>                                                  |
>
>                                                  ----- 2017-05-15
>
>                                                            |
>
>                                                            ------- <fil1>
>
>                                                            ------- <fil2>
>
>                                                 |
>
>                                                  ----- 2017-05-16
>
>                                                            |
>
>                                                            ------- <fil1>
>
>                                                            ------- <fil2>
>
> Questions
> 1. Why the yarn_application_id and some other extra directories are created
> when it is no where specified in config
> 2. If I want to achieve the structure I want, what other configurations I
> will need to set?
>
> HiveOutputModule Configs
> ==================
>
> <property>
>                 <name>dt.application.<app_name>.operator.hiveOutput.
> prop.filePath
>                 </name>
>                 <value>/common/data/test/accessCounts</value>
>         </property>
>         <property>
>                 <name>dt.application.<app_name>.operator.hiveOutput.
> prop.databaseUrl
>                 </name>
>                 <value><jdbc_url></value>
>         </property>
>         <property>
>                 <name>dt.application.<app_name>.operator.hiveOutput.
> prop.databaseDriver
>                 </name>
>                 <value>org.apache.hive.jdbc.HiveDriver</value>
>         </property>
>         <property>
>                 <name>dt.application.<app_name>.operator.hiveOutput.
> prop.tablename
>                 </name>
>                 <value><hive table name where records needs to be
> inserted></value>
>         </property>
>         <property>
>
> <name>dt.application.<app_name>.operator.hiveOutput.
> prop.hivePartitionColumns
>                 </name>
>                 <value>{accessdate}</value>
>         </property>
>         <property>
>                 <name>dt.application.<app_name>.operator.hiveOutput.
> prop.password
>                 </name>
>                 <value><hive connection password></value>
>         </property>
>         <property>
>                 <name>dt.application.<app_name>.operator.hiveOutput.
> prop.userName
>                 </name>
>                 <value><hive connection user></value>
>         </property>
>         <property>
>                 <name>dt.application.<app_name>.operator.hiveOutput.
> prop.hiveColumns
>                 </name>
>                 <value>{col1,col2,col3,col4}</value>
>         </property>
>         <property>
>
> <name>dt.application.<app_name>.operator.hiveOutput.
> prop.hiveColumnDataTypes
>                 </name>
>                 <value>{STRING,STRING,STRING,STRING}</value>
>         </property>
>         <property>
>
> <name>dt.application.<app_name>.operator.hiveOutput.
> prop.hivePartitionColumns
>                 </name>
>                 <value>{accessdate}</value>
>         </property>
>         <property>
>
> <name>dt.application.<app_name>.operator.hiveOutput.prop.
> hivePartitionColumnDataTypes
>                 </name>
>                 <value>{STRING}</value>
>         </property>
>         <property>
>
> <name>dt.application.<app_name>.operator.hiveOutput.
> prop.expressionsForHiveColumns
>                 </name>
>                 <value>{"getCol1()","getCol2()","getCol3()","getCol4()"}</
> value>
>         </property>
>         <property>
>
> <name>dt.application.<app_name>.operator.hiveOutput.prop.
> expressionsForHivePartitionColumns
>                 </name>
>                 <value>{"getAccessdate()"}</value>
>         </property>
>
>
>
> --
> View this message in context: http://apache-apex-users-list.
> 78494.x6.nabble.com/HiveOutputModule-creating-extra-directories-than-
> specified-while-saving-data-into-HDFS-tp1620.html
> Sent from the Apache Apex Users list mailing list archive at Nabble.com.
>