Posted to mapreduce-user@hadoop.apache.org by John Dison <jd...@yahoo.com> on 2014/10/28 22:08:46 UTC

Need some help with RecordReader

Hello!

I have a file in the following format:
+++++ InvoiceNo=1
some
text1
+++++ InvoiceNo=2
some
more
text2
<...>

Each record starts with a line beginning with five "+", then number of
invoice.
Then several lines of text.
I want the invoice number to become a key for Map operation, and the text
to become a value.

As far as I understand, I need to implement some kind of custom
RecordReader class to parse that format.  But all examples I found on the
Internet deal with formats where there is some mark at the end of the
record, but in my case I only can see that records ended after reading the
first line of the next record.

I would be very thankful for any help with implementing such a
RecordReader.

Thanks in advance,
John.

Re: Need some help with RecordReader

Posted by Steve Lewis <lo...@gmail.com>.

This InputFormat reads a FASTA file (see below).
The format is a header line starting with ">",
followed by N lines of data.

The projects at
https://code.google.com/p/distributed-tools/
have other samples of more complex input formats.


>YDR356W SPC110 SGDID:S000002764, Chr IV from 1186099-1188933, Verified
ORF, "Inner plaque spindle pole body (SPB) component, ortholog of human
kendrin; involved in connecting nuclear microtubules to SPB; interacts with
Tub4p-complex and calmodulin; phosphorylated by Mps1p in cell
cycle-dependent manner"
MDEASHLPNGSLKNMEFTPVGFIKSKRNTTQTQVVSPTKVPNANNGDENEGPVKKRQRRS
IDDTIDSTRLFSEASQFDDSFPEIKANIPPSPRSGNVDKSRKRNLIDDLKKDVPMSQPLK
EQEVREHQMKKERFDRALESKLLGKRHITYANSDISNKELYINEIKSLKHEIKELRKEKN
DTLNNYDTLEEETDDLKNRLQALEKELDAKNKIVNSRKVDDHSGCIEEREQMERKLAELE
RRLRLDTRKGEHSLNISLPDDDELDRDYYNSHVYTRYHDYEYPLRFNLNRRGPYFERRLS
FKTVALLVLACVRMKRIAFYRRSDDNRLRILRDRIESSSGRISW
>YLR244C MAP1 SGDID:S000004234, Chr XII from 626333-625170, reverse
complement, Verified ORF, "Methionine aminopeptidase, catalyzes the
cotranslational removal of N-terminal methionine from nascent polypeptides;
function is partially redundant with that of Map2p"
MSTATTTVTTSDQASHPTKIYCSGLQCGRETSSQMKCPVCLKQGIVSIFCDTSCYENNYK
AHKALHNAKDGLEGAYDPFPKFKYSGKVKASYPLTPRRYVPEDIPKPDWAANGLPVSEQR
NDRLNNIPIYKKDQIKKIRKACMLGREVLDIAAAHVRPGITTDELDEIVHNETIKRGAYP
SPLNYYNFPKSLCTSVNEVICHGVPDKTVLKEGDIVNLDVSLYYQGYHADLNETYYVGEN
ISKEALNTTETSRECLKLAIKMCKPGTTFQELGDHIEKHATENKCSVVRTYCGHGVGEFF
HCSPNIPHYAKNRTPGVMKPGMVFTIEPMINEGTWKDMTWPDDWTSTTQDGKLSAQFEHT
LLVTEHGVEILTARNKKSPGGPRQRIK
>REV1_YJL076W NET1 SGDID:S000003612, Chr X from 295162-298731, Verified
ORF, "Core subunit of the RENT complex, which is a complex involved in
nucleolar silencing and telophase exit; stimulates transcription by RNA
polymerase I and regulates nucleolar structure"
MYKNPLLQSSEAITPGYGFQIPMTAQLSPPVLVVQLRLNAYQLSADGASQAMNTRSQNFYSPTFSVNASRFRKTFLLFKPDIIEDSLNLLTNTKECKVLFDPDLDCGSNDQLSLIEIDEQLSPYMKVINNVNFVDRLIVKYLSVPASDDLDIENKVSKRSKLVGSSSPIQQQPQVSQPSGNNLRAIKKRPITTTTTTGTPRMSGNTASRALPTSVRSSPPPYIQKEGIDEDEDDSNNSVIRIPPSQPQTPPPLFSRGADIGSSIKKIKSVIDEEVISSRDPDVTASKTKQQRNPTMTSMIPTGSLLRQGTLTVRHAHESVVKNIDQATVAATGGNAFSSSSASASFVLENRKPVPTVPRLMGSTIKIPIPREIESIKL
SSDSVSDSSSNSDSDSSSEDDSSSPAKGDDSSDGSDDSDSESKASIFSKGLAASASKKKKPILSAFGGSKFDKKK
>YJL077W-A YJL077W-A SGDID:S000028661, Chr X from 294716-294802, Dubious
ORF, "Identified by gene-trapping, microarray-based expression analysis,
and genome-wide homology searching"
MPGIAFKGKDMVKAIQFLEIVVPCHCTT




> Some Comment


On Tue, Oct 28, 2014 at 2:08 PM, John Dison <jd...@yahoo.com> wrote:

> Hello!
>
> I have a file in the following format:
> +++++ InvoiceNo=1
> some
> text1
> +++++ InvoiceNo=2
> some
> more
> text2
> <...>
>
> Each record starts with a line beginning with five "+", then number of
> invoice.
> Then several lines of text.
> I want the invoice number to become a key for Map operation, and the text
> to become a value.
>
> As far as I understand, I need to implement some kind of custom
> RecordReader class to parse that format.  But all examples I found on the
> Internet deal with formats where there is some mark at the end of the
> record, but in my case I only can see that records ended after reading the
> first line of the next record.
>
> I would be very thankful for any help with implementing such a
> RecordReader.
>
> Thanks in advance,
> John.
>



-- 
Steven M. Lewis PhD
4221 105th Ave NE
Kirkland, WA 98033
206-384-1340 (cell)
Skype lordjoe_com

Re: Need some help with RecordReader

Posted by jay vyas <ja...@gmail.com>.

Great question. I like the idea of using the existing FASTA record reader
if it works for you.
In general, you should know that this isn't too hard. If you want to
implement your own, here is how.

Yes, you're right that a file typically has delimiters at the end of records,
and so it makes sense that a format like FASTA, which only marks the start
of a record, is problematic for this.

The InputFormat method that creates a record reader has a signature
something like this:

    public RecordReader<Text, Text> createRecordReader(InputSplit arg0,
            TaskAttemptContext arg1) throws IOException, InterruptedException

Thus, a record reader has the whole split as its input.

So the record reader can simply start reading the file; when it sees the
+++++ demarcation, it can break off a new record, remember where it is,
and then begin reading again.
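
A minimal sketch of that look-ahead idea in plain Java, kept independent of
Hadoop so it can be run on its own. The class name InvoiceRecordParser, the
HEADER_PREFIX constant, and the choice to return a Map.Entry are assumptions
made for illustration; none of them come from Hadoop or from this thread. In a
real RecordReader the same logic would presumably sit inside nextKeyValue(),
with the key and value copied into Text objects.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.AbstractMap.SimpleEntry;
import java.util.Map;

// Hypothetical helper (not a Hadoop class): turns "+++++ InvoiceNo=N" blocks
// into (invoice number, text) pairs by remembering the header of the next record.
class InvoiceRecordParser {
    static final String HEADER_PREFIX = "+++++ InvoiceNo="; // assumed exact header form

    private final BufferedReader in;
    private String pendingHeader; // header line read while finishing the previous record

    InvoiceRecordParser(BufferedReader in) {
        this.in = in;
    }

    /** Returns the next (key, value) pair, or null when the input is exhausted. */
    Map.Entry<String, String> next() {
        try {
            String header = pendingHeader;
            pendingHeader = null;
            // Skip anything that precedes the first header line.
            while (header == null) {
                String line = in.readLine();
                if (line == null) {
                    return null;
                }
                if (line.startsWith(HEADER_PREFIX)) {
                    header = line;
                }
            }
            String key = header.substring(HEADER_PREFIX.length()).trim();
            StringBuilder value = new StringBuilder();
            boolean firstLine = true;
            String line;
            while ((line = in.readLine()) != null) {
                if (line.startsWith(HEADER_PREFIX)) {
                    pendingHeader = line; // the next record has started: keep it for later
                    break;
                }
                if (!firstLine) {
                    value.append('\n');
                }
                value.append(line);
                firstLine = false;
            }
            return new SimpleEntry<>(key, value.toString());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String sample = "+++++ InvoiceNo=1\nsome\ntext1\n+++++ InvoiceNo=2\nsome\nmore\ntext2\n";
        InvoiceRecordParser parser =
                new InvoiceRecordParser(new BufferedReader(new StringReader(sample)));
        Map.Entry<String, String> record;
        while ((record = parser.next()) != null) {
            System.out.println(record.getKey() + " -> " + record.getValue().replace("\n", " | "));
        }
    }
}
```

The only state carried between calls is the one header line that was read too
early, which is what lets a record end without an explicit end marker.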

Unfortunately you won't be able to extend KeyValueLineRecordReader.
Instead, you'll have to write a record reader that is somewhat similar to
LineRecordReader, except that you'll have to replace the "readLine" call with
something a little more intelligent (i.e. you'll have to keep reading until
you see the next record, return the finished sequence, and then start
assembling the next sequence, until the file is exhausted).
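
One detail such a reader also has to settle is what happens when a split
boundary falls in the middle of a record. The sketch below illustrates one
possible rule in plain Java over a byte array: a record belongs to the split in
which its header line begins, so a reader skips ahead to the first header at or
after its start offset and keeps reading past its end offset until the next
header. The class SplitOwnership, the method readSplit, and the "key:value"
output form are hypothetical names for illustration only; this is not Hadoop
code, and a real reader would work on a seekable stream rather than an array.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration (not Hadoop code): decide which records a byte
// range [start, end) owns when records are marked only by a header line.
class SplitOwnership {
    static final String HEADER_PREFIX = "+++++ InvoiceNo="; // assumed exact header form

    /** Returns "key:line1 line2 ..." for each record whose header begins in [start, end). */
    static List<String> readSplit(byte[] data, int start, int end) {
        List<String> records = new ArrayList<>();
        int pos = start;
        // A split that begins mid-line skips that line; the previous split reads it.
        if (pos > 0 && data[pos - 1] != '\n') {
            pos = nextLineStart(data, pos);
        }
        String key = null;
        List<String> lines = new ArrayList<>();
        while (pos < data.length) {
            int next = nextLineStart(data, pos);
            String line = lineAt(data, pos, next);
            if (line.startsWith(HEADER_PREFIX)) {
                if (key != null) {
                    records.add(key + ":" + String.join(" ", lines));
                    key = null;
                }
                if (pos >= end) {
                    break; // this header belongs to the following split
                }
                key = line.substring(HEADER_PREFIX.length()).trim();
                lines.clear();
            } else if (key != null) {
                lines.add(line); // text before the first owned header is ignored
            }
            pos = next;
        }
        if (key != null) {
            records.add(key + ":" + String.join(" ", lines)); // last record ends at end of data
        }
        return records;
    }

    /** Index just past the next '\n' at or after pos, or data.length if there is none. */
    static int nextLineStart(byte[] data, int pos) {
        while (pos < data.length && data[pos] != '\n') {
            pos++;
        }
        return Math.min(pos + 1, data.length);
    }

    /** Decodes data[from, to) as UTF-8 without its trailing line terminator. */
    static String lineAt(byte[] data, int from, int to) {
        int len = to - from;
        while (len > 0 && (data[from + len - 1] == '\n' || data[from + len - 1] == '\r')) {
            len--;
        }
        return new String(data, from, len, StandardCharsets.UTF_8);
    }
}
```

With this rule every record is produced by exactly one split, wherever the
boundary happens to fall, at the cost of each reader possibly reading some
bytes beyond its nominal end.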

So as a start you will want to copy LineRecordReader and compile it to
ensure that it's working in your Java setup, and then get it working with
the FASTA files.

On Tue, Oct 28, 2014 at 5:08 PM, John Dison <jd...@yahoo.com> wrote:

> Hello!
>
> I have a file in the following format:
> +++++ InvoiceNo=1
> some
> text1
> +++++ InvoiceNo=2
> some
> more
> text2
> <...>
>
> Each record starts with a line beginning with five "+", then number of
> invoice.
> Then several lines of text.
> I want the invoice number to become a key for Map operation, and the text
> to become a value.
>
> As far as I understand, I need to implement some kind of custom
> RecordReader class to parse that format.  But all examples I found on the
> Internet deal with formats where there is some mark at the end of the
> record, but in my case I only can see that records ended after reading the
> first line of the next record.
>
> I would be very thankful for any help with implementing such a
> RecordReader.
>
> Thanks in advance,
> John.
>

-- 
jay vyas
