Posted to user@pig.apache.org by "HAJIHASHEMI, ZAHRA (AG/1000)" <za...@monsanto.com> on 2012/09/23 20:41:54 UTC

RE: Formatting file in pig

Hi all,

I'm new to Pig and need to reformat a file. I have a FASTA file in this format:

>1 abundance=7626 length=72 cross=0
CGACACGACTCTCGGCAACGGATA
CGACACGACTCTCGGCAACGGATAC
GACACGACTCTCGGCAACGGATA
>3 abundance=4639 length=22 cross=1
CGACACGACTCTCGGCAACGGA
CGACACGACTCTCGGCAACGGATA
CGACACGACTCTCGGCAACGGATA
>4 abundance=4302 length=24 cross=0
ACTTGTGCTGATTGGATGACTTGA
>5 abundance=3785 length=23 cross=0
GACACGACTCTCGGCAACGGATA

Each line that starts with '>' is the header of one sequence record, but the sequence itself may span several lines, as in record 1. The first number in each header line is the id.
In the formatted file, I do not need the id or the cross value.
I need each record on a single line, without the keywords "abundance", "length", and "cross". The ideal formatted file would look like this:
7626 72 CGACACGACTCTCGGCAACGGATACGACACGACTCTCGGCAACGGATACGACACGACTCTCGGCAACGGATA
4639 22 CGACACGACTCTCGGCAACGGACGACACGACTCTCGGCAACGGATACGACACGACTCTCGGCAACGGATA
4302 24 ACTTGTGCTGATTGGATGACTTGA
3785 23 GACACGACTCTCGGCAACGGATA

Can I do this formatting in Pig?
Any help is highly appreciated.


Zara
This e-mail message may contain privileged and/or confidential information, and is intended to be received only by persons entitled
to receive such information. If you have received this e-mail in error, please notify the sender immediately. Please delete it and
all attachments from any servers, hard drives or any other media. Other use of this e-mail by you is strictly prohibited.

All e-mails and attachments sent and received are subject to monitoring, reading and archival by Monsanto, including its
subsidiaries. The recipient of this e-mail is solely responsible for checking for the presence of "Viruses" or other "Malware".
Monsanto, along with its subsidiaries, accepts no liability for any damage caused by any such code transmitted by or accompanying
this e-mail or any attachment.


The information contained in this email may be subject to the export control laws and regulations of the United States, potentially
including but not limited to the Export Administration Regulations (EAR) and sanctions regulations issued by the U.S. Department of
Treasury, Office of Foreign Asset Controls (OFAC).  As a recipient of this information you are obligated to comply with all
applicable U.S. export laws and regulations.

Re: Formatting file in pig

Posted by TianYi Zhu <ti...@facilitatedigital.com>.
Hi Hajihashemi,

You can do this with a Python (easy to learn) or Perl (fast and powerful for
string processing) script. Pig may not be the right tool for this task.
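
A minimal sketch of such a Python script, assuming every header line looks
like ">id abundance=N length=N cross=N" as in the sample data. The names
format_fasta, HEADER, and SAMPLE are illustrative choices, not part of any
library, and the length field is copied from the header as-is rather than
recomputed from the sequence:

```python
import re

# Assumed header layout: ">1 abundance=7626 length=72 cross=0".
# The id and the cross field are matched loosely and then ignored.
HEADER = re.compile(r"^>\S+\s+abundance=(\d+)\s+length=(\d+)")


def format_fasta(lines):
    """Yield one 'abundance length sequence' string per FASTA record."""
    abundance = length = None
    parts = []  # sequence fragments of the record being read
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if abundance is not None:
                yield f"{abundance} {length} {''.join(parts)}"
            match = HEADER.match(line)
            if match is None:
                raise ValueError(f"unexpected header: {line!r}")
            abundance, length = match.groups()
            parts = []
        else:
            parts.append(line)
    # Emit the final record once the input is exhausted.
    if abundance is not None:
        yield f"{abundance} {length} {''.join(parts)}"


# Small sample taken from the question, used here only as a demonstration.
SAMPLE = """\
>1 abundance=7626 length=72 cross=0
CGACACGACTCTCGGCAACGGATA
CGACACGACTCTCGGCAACGGATAC
GACACGACTCTCGGCAACGGATA
>4 abundance=4302 length=24 cross=0
ACTTGTGCTGATTGGATGACTTGA
"""

for record in format_fasta(SAMPLE.splitlines()):
    print(record)
```

To process a real file, pass an open file object in place of
SAMPLE.splitlines(); the function reads line by line, so the whole file does
not need to fit in memory.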

Thanks.
TianYi
