Posted to common-user@hadoop.apache.org by Udaya Lakshmi <ud...@gmail.com> on 2010/01/28 10:59:38 UTC

Input file format doubt

Hi all..
  I have searched the documentation but could not find an input file
format that gives the line number as the key and the line as the value.
Did I miss something? Can someone give me a clue about how to implement
such an input file format?

Thanks,
Udaya.

Re: Input file format doubt

Posted by Udaya Lakshmi <ud...@gmail.com>.

Thank you Amogh. I will go through the link.

Udaya.


Re: Input file format doubt

Posted by Ravi <ra...@gmail.com>.

Thank you Amogh

Ravi.


Re: Input file format doubt

Posted by Ravi <ra...@gmail.com>.

I had the same doubt and could not find a clue either. Please post the
code if you can find it.


Re: Input file format doubt

Posted by Amogh Vasekar <am...@yahoo-inc.com>.

Hi,
Here's the relevant thread with Gordon, the author of the solution:

> I am in the process of learning Hadoop (and I think I've made a lot of
> progress). I have described the specific problem and solution on my blog:
> http://www.data-miners.com/blog/2009/11/hadoop-and-mapreduce-parallel-program.html
>
> Your particular solution won't work, because I need to do additional
> processing between the two passes.
>
> --gordon
>
> On Wed, Nov 25, 2009 at 1:50 AM, Amogh Vasekar <am...@yahoo-inc.com> wrote:

Amogh


Re: Input file format doubt

Posted by Ravi <ra...@gmail.com>.

Thank you Amogh.


Re: Input file format doubt

Posted by Amogh Vasekar <am...@yahoo-inc.com>.

Hi,
For global line numbers, you would need to know the ordering within each split generated from the input file. The standard input formats provide byte offsets as keys, so if the records are of equal length you can compute line numbers from those offsets.
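
As a rough illustration of the equal-length case, here is a minimal Python sketch (plain Python, not Hadoop code). It assumes every record, including its trailing newline, is exactly record_length bytes, and that the key handed to the mapper is the byte offset of the line within the file. The function name line_number and the sample data are hypothetical.

```python
def line_number(byte_offset: int, record_length: int) -> int:
    """Return the 0-based line number of the record starting at byte_offset.

    Assumption: every record, including its trailing newline, occupies
    exactly record_length bytes.
    """
    if record_length <= 0:
        raise ValueError("record_length must be positive")
    if byte_offset % record_length != 0:
        raise ValueError("offset does not fall on a record boundary")
    return byte_offset // record_length


# Simulate the (offset, line) pairs a line-oriented input format would supply.
data = b"aaaa\nbbbb\ncccc\n"  # three records of 5 bytes each
record_length = 5

numbered = []
offset = 0
for raw in data.splitlines(keepends=True):
    text = raw.decode().rstrip("\n")
    numbered.append((line_number(offset, record_length), text))
    offset += len(raw)  # advance to the start of the next record
```

Because each line number depends only on the record's own offset, no coordination between map tasks is needed. The trick stops working as soon as record lengths vary.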
I remember someone implementing sequential numbering using the partition id of each map task (mapred.task.partition) and posting it on his blog. I don't have the link handy right now, but I will send it to you off-list if I find it.
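
One way such a partition-id scheme might work for variable-length lines is a two-pass approach: first count the lines seen by each map task, then turn the counts into a starting line number per task, and finally number the lines within each task. The Python sketch below only simulates this in memory and is not the blog author's code. It assumes that partition ids increase in the same order as the splits appear in the file, which would have to be verified for a real job. The dictionary of splits and all function names are hypothetical.

```python
from itertools import accumulate


def count_lines(splits):
    """Pass 1: map each partition id to the number of lines in its split."""
    return {pid: len(lines) for pid, lines in splits.items()}


def starting_line_numbers(counts):
    """Prefix sums of the counts, taken in ascending partition-id order.

    Assumption: ascending partition id matches the order of the splits
    in the input file.
    """
    pids = sorted(counts)
    running = [0] + list(accumulate(counts[p] for p in pids))
    return dict(zip(pids, running))  # the final total is dropped by zip


def number_lines(splits, starts):
    """Pass 2: emit (global_line_number, line) pairs, 0-based."""
    out = []
    for pid in sorted(splits):
        for i, line in enumerate(splits[pid]):
            out.append((starts[pid] + i, line))
    return out


# Hypothetical input: partition id -> lines of that split, in file order.
splits = {0: ["a", "b"], 1: ["c"], 2: ["d", "e", "f"]}

counts = count_lines(splits)
starts = starting_line_numbers(counts)
numbered = number_lines(splits, starts)
```

In a real job the per-partition counts would have to be collected between the two passes, for example by a small first job whose output the second job reads.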

Amogh

