Posted to mapreduce-user@hadoop.apache.org by Mapred Learn <ma...@gmail.com> on 2011/06/14 02:18:55 UTC

Delimiter selection for Sequence Files

Hi,
I was thinking of using CTRL A as the delimiter, but the data that I am
loading into Hadoop already has CTRL A in it. What are other good choices of
delimiters that anybody might have used in this kind of scenario, considering
that I also want to query this data using Hive?

Thanks in advance
-JJ

Re: Delimiter selection for Sequence Files

Posted by Mapred Learn <ma...@gmail.com>.

Hi Harsh,
I am also trying to do something like:
 hadoop fs -text /user/cloudera/staging/test_file_0.seq|cut -f 3 -d '\x01'

And the file contents are:
0       340\x01234\x010067\x010.00\x01
1       230\x01454\x010045\x010.00\x01

But I get:
cut: the delimiter must be a single character
Try `cut --help' for more information.
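
The error above is what cut reports when the delimiter argument is longer than one character: inside single quotes the shell passes the four characters \x01 to cut unchanged. Below is a minimal sketch of one workaround, assuming the fields are separated by the real 0x01 byte rather than by the literal text \x01. The helper names and the sample data (modeled on the two records shown) are hypothetical. It lets printf emit the byte so that it also runs in plain sh; in Bash or Zsh, $'\x01' should be equivalent.

```shell
# Hypothetical sample mirroring the two records shown above:
# a key, a tab, then fields separated by Ctrl-A (octal 001).
sample_records() {
  printf '0\t340\001234\0010067\0010.00\001\n'
  printf '1\t230\001454\0010045\0010.00\001\n'
}

# Keep the value column (tab is cut's default delimiter), then take the
# third Ctrl-A separated field. printf turns \001 into a single byte
# before cut sees it, so cut receives a one-character delimiter.
third_field() {
  cut -f 2 | cut -d "$(printf '\001')" -f 3
}

sample_records | third_field
```

With the real file, the same third_field pipeline would presumably be fed by the hadoop fs -text command instead of sample_records.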




On Wed, Jun 15, 2011 at 11:16 AM, Harsh J <ha...@cloudera.com> wrote:

> For Hive, it's best if the delimiter character's value is also < 127 in
> decimal. I think Hive uses a signed byte to represent the delimiter,
> and that may lead to issues if a greater value is chosen.
>
> I've seen Hive take ASCII and octal representations in its statements
> for delimiters. You can use a hex value in your shell simply by
> passing it as a literal.
>
> For example, in Bash/Zsh I do:
> $ echo $'\x1B' # For the 'escape' character.
>
> On Wed, Jun 15, 2011 at 11:37 PM, Mapred Learn <ma...@gmail.com>
> wrote:
> > If I use the hex value of a character as the delimiter, e.g. \x01 for
> > CTRL A, can I use it as a delimiter in Hive and Unix cut commands?
> >
> >
> > On Tue, Jun 14, 2011 at 7:10 AM, Mapred Learn <ma...@gmail.com>
> > wrote:
> >>
> >> Thanks Joe for the reply!
> >> "@@##@@" looks like a big value for a delimiter.
> >> I will also choose something like a hex number so that it does not
> >> appear in the data.
> >> Sent from my iPhone
> >> On Jun 13, 2011, at 5:33 PM, Joe Stein <jo...@medialets.com> wrote:
> >>
> >> I have had quite a few data sets where I had no idea whether my
> >> delimiter was in there, so what I did was replace my delimiter with a
> >> string I knew would not be in there during map, and then in the
> >> reducer replace it back again.
> >>
> >> e.g.
> >>
> >> replace("^","@@##@@") for each line
> >>
> >> then use ^ as your delimiter
> >>
> >> and in the reducer replace("@@##@@","^") for each line
> >>
> >> and in your reducer output qualify things appropriately for how you
> >> want/need to deal with the output
> >>
> >> now if your problem is splitting each line during your map and not
> >> knowing what to split on... well that is very related to your context
> >>
> >> you could JOIN map side a list of all possible characters with your data
> >> set and then reduce output only characters not found and use that as
> >> your delimiter.... who knows maybe you will find out that ~ is not in
> >> your data...
> >>
> >> On Mon, Jun 13, 2011 at 8:18 PM, Mapred Learn <ma...@gmail.com>
> >> wrote:
> >>>
> >>> Hi,
> >>> I was thinking of using CTRL A as the delimiter, but the data that I
> >>> am loading into Hadoop already has CTRL A in it. What are other good
> >>> choices of delimiters that anybody might have used in this kind of
> >>> scenario, considering that I also want to query this data using Hive?
> >>>
> >>> Thanks in advance
> >>> -JJ
> >>
> >>
> >> --
> >> /*
> >> Joe Stein, 973-944-0094
> >> http://www.medialets.com
> >> Twitter: @allthingshadoop
> >> */
> >
> >
>
>
>
> --
> Harsh J
>
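
The escape-and-restore idea in Joe's quoted reply above can be sketched in shell as follows. This is only an illustration under stated assumptions: the helper names are hypothetical, the placeholder "@@##@@" is assumed never to occur in the real data, and sed and cut stand in for whatever the map and reduce code would actually do.

```shell
# Hypothetical helper names; "@@##@@" is the placeholder from the quoted
# reply and is assumed never to appear in the real data.
escape_field()  { sed 's/\^/@@##@@/g'; }
restore_field() { sed 's/@@##@@/^/g'; }

# Map side: escape any caret inside each field, then join the fields
# with a real caret as the delimiter.
first=$(printf '%s\n' 'a^b' | escape_field)
second=$(printf '%s\n' 'c' | escape_field)
record="${first}^${second}"

# Reduce side: split on the caret, then restore the original carets
# inside the extracted field.
printf '%s\n' "$record" | cut -d '^' -f 1 | restore_field
```

The trade-off is the one raised in the thread: the placeholder is long, and the scheme is only safe if the placeholder itself is known to be absent from the data.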

Re: Delimiter selection for Sequence Files

Posted by Mapred Learn <ma...@gmail.com>.

If I use the hex value of a character as the delimiter, e.g. \x01 for CTRL A,
can I use it as a delimiter in Hive and Unix cut commands?
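
One way to check what a given notation stands for is to print the byte's decimal value. The sketch below uses a hypothetical helper, byte_value, built on od. It is meant to show that hex \x01 and octal \001 name the same single byte, which is why either spelling can work once the tool in question actually receives that byte.

```shell
# Hypothetical helper: print the decimal value of the single byte in $1.
byte_value() {
  printf '%s' "$1" | od -An -tu1 | tr -d ' \n'
  echo
}

ctrl_a=$(printf '\001')   # octal 001 is hex 01, i.e. Ctrl-A
escape=$(printf '\033')   # octal 033 is hex 1B, i.e. Escape

byte_value "$ctrl_a"
byte_value "$escape"
```

The printed value gives a quick way to compare a candidate against whatever numeric limit the consuming tool imposes on delimiters.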




Re: Delimiter selection for Sequence Files

Posted by Mapred Learn <ma...@gmail.com>.

Thanks Joe for the reply!
"@@##@@" looks like a big value for a delimiter.
I will also choose something like a hex number so that it does not appear in the data.

Sent from my iPhone

On Jun 13, 2011, at 5:33 PM, Joe Stein <jo...@medialets.com> wrote:

> I have had quite a few data sets where I had no idea whether my delimiter was in there, so what I did was replace my delimiter with a string I knew would not be in there during map, and then in the reducer replace it back again.
> 
> e.g.
> 
> replace("^","@@##@@") for each line
> 
> then use ^ as your delimiter
> 
> and in the reducer replace("@@##@@","^") for each line
> 
> and in your reducer output qualify things appropriately for how you want/need to deal with the output
> 
> now if your problem is splitting each line during your map and not knowing what to split on... well that is very related to your context
> 
> you could JOIN map side a list of all possible characters with your data set and then reduce output only characters not found and use that as your delimiter.... who knows maybe you will find out that ~ is not in your data... 
> 
> On Mon, Jun 13, 2011 at 8:18 PM, Mapred Learn <ma...@gmail.com> wrote:
> > Hi,
> > I was thinking of using CTRL A as the delimiter, but the data that I am loading into Hadoop already has CTRL A in it. What are other good choices of delimiters that anybody might have used in this kind of scenario, considering that I also want to query this data using Hive?
> >
> > Thanks in advance
> > -JJ
> 
> 
> 
> -- 
> /*
> Joe Stein, 973-944-0094
> http://www.medialets.com
> Twitter: @allthingshadoop
> */
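
Joe's last suggestion, finding characters that never occur in the data, can be tried on a local sample before writing any MapReduce job. The sketch below is not the map-side join he describes. It is a plain grep loop over a hypothetical list of candidates, with a hypothetical helper name and made-up sample data, and a clean result on a sample does not guarantee the character is absent from the full data set.

```shell
# Hypothetical helper: print every candidate that never occurs in the file.
# grep -F treats each candidate as a literal string, so ^ and | are safe.
unused_delimiters() {
  file=$1
  shift
  for candidate in "$@"; do
    grep -qF -e "$candidate" "$file" || printf '%s\n' "$candidate"
  done
}

# Made-up sample data containing ^, | and , but no ~.
sample=$(mktemp)
printf 'a^b|c\nd,e\n' > "$sample"

unused_delimiters "$sample" '^' '|' ',' '~'

rm -f "$sample"
```

On this sample only ~ is reported, since the other three candidates all appear in the data.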