Posted to mapreduce-user@hadoop.apache.org by rpereira <rp...@xs4all.nl> on 2016/04/23 12:35:22 UTC

Hadoop Streaming in combination with HDFS

Hi

I have a text file that I am processing with Hadoop Streaming.
I placed the file on HDFS.
My data transformation is a set of awk and sed commands that creates 
a table structure.
I can choose the number of mappers. When I use one mapper, the output is 
correct.
When I choose more than one mapper, the input is split up.
The split happens at an end of line (EOL).
I would like the split to happen just before a text marker instead.
The text blocks must not be split, because that would mean loss 
of information.
I would also like to be able to use more than one mapper.

Example:

============================================
Current situation :
	text mark 1
	   some data
		  ...
			 some data
	text mark 2
		  some data
	----------------split-----------------
		  ...
	   some data
	text mark 3
		  some data
		  ...
	   some data

============================================
Correct situation :
	text mark 1
	   some data
		  ...
			 some data
	----------------split here -------------
	text mark 2
		  some data
		  ...
	   some data
	----------------or split here ----------
	text mark 3
		  some data
		  ...
	   some data
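
To make the desired behaviour concrete, here is a minimal Python sketch of 
the splitting rule only. It is not a Hadoop API and not something I have 
running on the cluster. It assumes that every block starts with a line 
beginning with "text mark" (as in the example above), that a block belongs 
to the split in which its marker line starts, and that the whole input fits 
in memory. The names MARKER, line_spans and blocks_for_split are 
hypothetical.

```python
from typing import Iterator, List, Tuple

# Assumed marker prefix, taken from the example above.
MARKER = b"text mark"


def line_spans(data: bytes) -> Iterator[Tuple[int, int]]:
    """Yield (start, end) byte offsets of every line, newline included."""
    pos = 0
    while pos < len(data):
        nl = data.find(b"\n", pos)
        end = len(data) if nl == -1 else nl + 1
        yield pos, end
        pos = end


def blocks_for_split(
    data: bytes, start: int, end: int, marker: bytes = MARKER
) -> List[bytes]:
    """Return the complete blocks whose marker line begins in [start, end).

    A block runs from its marker line up to the next marker line (or the
    end of the data), so it may extend past `end`. Data before the first
    marker line belongs to no block and is not returned.
    """
    # Offsets of all lines that start with the marker (leading
    # whitespace is ignored, because the example indents its lines).
    marker_offsets = [
        s for s, e in line_spans(data)
        if data[s:e].lstrip().startswith(marker)
    ]
    blocks = []
    for i, off in enumerate(marker_offsets):
        if start <= off < end:
            if i + 1 < len(marker_offsets):
                block_end = marker_offsets[i + 1]
            else:
                block_end = len(data)
            blocks.append(data[off:block_end])
    return blocks


if __name__ == "__main__":
    text = (
        b"text mark 1\n some data\n"
        b"text mark 2\n some data\n ...\n"
        b"text mark 3\n some data\n"
    )
    middle = len(text) // 2  # a cut that falls inside a block
    for lo, hi in [(0, middle), (middle, len(text))]:
        print(blocks_for_split(text, lo, hi))
```

With this rule a cut that lands in the middle of a block does no harm: the 
mapper that owns the marker line reads on past the cut until the next 
marker, and the following mapper skips ahead to its first marker.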


I would rather not preprocess the file before placing it on HDFS to 
solve this issue. I want to work directly from HDFS and stay 
flexible in the number of mapper processes applied.

Is there any way to have the splitting done outside the 
text blocks, so that the text blocks stay complete?

Kind Regards
Rene
