Posted to mapreduce-user@hadoop.apache.org by rpereira <rp...@xs4all.nl> on 2016/04/23 12:35:22 UTC
Hadoop Streaming in combination with HDFS
Hi
I have a text file that I'm processing through Hadoop Streaming.
I placed the file on HDFS.
My data transformation is a set of awk and sed commands that create
a table structure.
I can choose the number of mappers. With one mapper the output is
correct.
With more than one mapper the input is split up, and the split is made
at an end of line (EOL), which can fall in the middle of a text block.
I would like the split to be made just before a text marker instead.
The text blocks must not be split up, because that means loss of
information.
I would still like to be able to use more than one mapper.
Example:
============================================
Current situation:
text mark 1
some data
...
some data
text mark 2
some data
----------------split-----------------
...
some data
text mark 3
some data
...
some data
============================================
Desired situation:
text mark 1
some data
...
some data
----------------split here -------------
text mark 2
some data
...
some data
----------------or split here ----------
text mark 3
some data
...
some data
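============================================
To make the desired boundaries concrete, here is a minimal Python sketch
of the grouping I am after. It is only an illustration, not Hadoop code:
the marker prefix "text mark" is taken from the example above, and the
function name split_into_blocks is made up for this sketch.

```python
from typing import Iterable, Iterator, List

# Assumption: every block starts with a line beginning with this prefix,
# as in the example above.
MARKER_PREFIX = "text mark"


def split_into_blocks(
    lines: Iterable[str], marker_prefix: str = MARKER_PREFIX
) -> Iterator[List[str]]:
    """Yield lists of lines; a new list starts at every marker line."""
    block: List[str] = []
    for line in lines:
        # A marker line closes the block collected so far (if any).
        if line.startswith(marker_prefix) and block:
            yield block
            block = []
        block.append(line)
    # Emit whatever is left after the last marker.
    if block:
        yield block


if __name__ == "__main__":
    sample = [
        "text mark 1", "some data", "some data",
        "text mark 2", "some data",
        "text mark 3", "some data", "some data",
    ]
    for number, block in enumerate(split_into_blocks(sample), start=1):
        print(number, block)
```

A split between mappers would only be acceptable at the boundary between
two of these blocks, never inside one.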
I would rather not preprocess the file before placing it on HDFS to
solve this issue. I want to work from the file as it is on HDFS and stay
flexible in the number of mapper processes applied.
Is there any way to have the splitting done outside the text blocks,
so that the text blocks stay complete?
Kind Regards
Rene