Posted to hdfs-user@hadoop.apache.org by Pat Ferrel <pa...@gmail.com> on 2012/12/12 01:49:45 UTC

Hadoop 101

Stupid question for the day…

I have a file created by a mahout job of the form:

0	[356:0.3481597,359:0.3481597,358:0.3481597,361:0.3481597,360:0.3481597]
8	[356:0.34786037,359:0.34786037,358:0.34786037,361:0.34786037,360:0.34786037]
25	[284:0.34821576,286:0.34821576,287:0.34821576,288:0.34821576,289:0.34821576]
28	[452:0.34802154,454:0.34802154,453:0.34802154,456:0.34802154,455:0.34802154]
…

If this were a SequenceFile I could read it and be merrily on my way but it's a text file. The classes written are key, value pairs <LongWritable, VectorWritable> but the file is tab delimited text. 

I was hoping to do something like:

SequenceFile.Reader reader = new SequenceFile.Reader(fs, inputFile, conf);
Writable userId = new LongWritable();
VectorWritable recommendations = new VectorWritable();
while (reader.next(userId, recommendations)) {
	//do something with each pair
}

But alas Google fails me. How do you read key/value pairs from text files outside of a map or reduce?

Re: Hadoop 101

Posted by Chris Embree <ce...@gmail.com>.

Just to be a picker of nits... this topic is more concisely Hadoop
Development 101.  I only mention this because I am a newbie hadoop admin
and this was over my head. ;)  Admins don't worry as much about Key Value
Pairs and parsing as we do about where is the script that starts the
NameNode. ;)


On Wed, Dec 12, 2012 at 11:16 PM, David Parks <da...@yahoo.com> wrote:

> Nothing that I'm aware of for text files, I'd just use standard unix utils
> to process it outside of Hadoop.
>
> As to getting a reader from any of the Input Formats, here's the typical
> example you'd follow to get the reader for a sequence file, you could
> extrapolate the example to access whichever reader you're interested in.
>
>
> http://my.safaribooksonline.com/book/databases/hadoop/9780596521974/file-based-data-structures/id3555432
>
>
> -----Original Message-----
> From: Pat Ferrel [mailto:pat.ferrel@gmail.com]
> Sent: Wednesday, December 12, 2012 11:37 PM
> To: user@hadoop.apache.org
> Subject: Re: Hadoop 101
>
> Yeah I found the TextInputFormat and KeyValueTextInputFormat and I know how
> to parse text--I'm just too lazy. I was hoping there was a Text equivalent
> of a SequenceFile that was hidden somewhere. As I said there is no mapper,
> this is running outside of hadoop M/R. So I at least need a line reader and
> not sure how the InputFormat works outside a mapper. But who cares, parsing
> is simple enough from scratch. All the KeyValueTextInputFormat gives me is
> splitting at the tab afaict.
>
> Actually this convinces me to look further into getting the values from
> method calls. They aren't quite what I want to begin with.
>
> Thanks for saving me more fruitless searches.
>
> On Dec 11, 2012, at 10:04 PM, David Parks <da...@yahoo.com> wrote:
>
> You use TextInputFormat, you'll get the following key<LongWritable>,
> value<Text> pairs in your mapper:
>
> file_position, your_input
>
> Example:
> 0, "0\t[356:0.3481597,359:0.3481597,358:0.3481597,361:0.3481597,360:0.3481597]"
> 100, "8\t[356:0.34786037,359:0.34786037,358:0.34786037,361:0.34786037,360:0.34786037]"
> 200, "25\t[284:0.34821576,286:0.34821576,287:0.34821576,288:0.34821576,289:0.34821576]"
>
> Then just parse it out in your mapper.
>
>
> -----Original Message-----
> From: Pat Ferrel [mailto:pat.ferrel@gmail.com]
> Sent: Wednesday, December 12, 2012 7:50 AM
> To: user@hadoop.apache.org
> Subject: Hadoop 101
>
> Stupid question for the day.
>
> I have a file created by a mahout job of the form:
>
> 0	[356:0.3481597,359:0.3481597,358:0.3481597,361:0.3481597,360:0.3481597]
> 8	[356:0.34786037,359:0.34786037,358:0.34786037,361:0.34786037,360:0.34786037]
> 25	[284:0.34821576,286:0.34821576,287:0.34821576,288:0.34821576,289:0.34821576]
> 28	[452:0.34802154,454:0.34802154,453:0.34802154,456:0.34802154,455:0.34802154]
> …
>
> If this were a SequenceFile I could read it and be merrily on my way but
> it's a text file. The classes written are key, value pairs <LongWritable,
> VectorWritable> but the file is tab delimited text.
>
> I was hoping to do something like:
>
> SequenceFile.Reader reader = new SequenceFile.Reader(fs, inputFile, conf);
> Writable userId = new LongWritable();
> VectorWritable recommendations = new VectorWritable();
> while (reader.next(userId, recommendations)) {
>         //do something with each pair
> }
>
> But alas Google fails me. How do you read key/value pairs from text
> files outside of a map or reduce?
>
>
>

RE: Hadoop 101

Posted by David Parks <da...@yahoo.com>.

Nothing that I'm aware of for text files, I'd just use standard unix utils
to process it outside of Hadoop.

As to getting a reader from any of the Input Formats, here's the typical
example you'd follow to get the reader for a sequence file, you could
extrapolate the example to access whichever reader you're interested in.

http://my.safaribooksonline.com/book/databases/hadoop/9780596521974/file-based-data-structures/id3555432


-----Original Message-----
From: Pat Ferrel [mailto:pat.ferrel@gmail.com] 
Sent: Wednesday, December 12, 2012 11:37 PM
To: user@hadoop.apache.org
Subject: Re: Hadoop 101

Yeah I found the TextInputFormat and KeyValueTextInputFormat and I know how
to parse text--I'm just too lazy. I was hoping there was a Text equivalent
of a SequenceFile that was hidden somewhere. As I said there is no mapper,
this is running outside of hadoop M/R. So I at least need a line reader and
not sure how the InputFormat works outside a mapper. But who cares, parsing
is simple enough from scratch. All the KeyValueTextInputFormat gives me is
splitting at the tab afaict.

Actually this convinces me to look further into getting the values from
method calls. They aren't quite what I want to begin with. 

Thanks for saving me more fruitless searches.

On Dec 11, 2012, at 10:04 PM, David Parks <da...@yahoo.com> wrote:

You use TextInputFormat, you'll get the following key<LongWritable>,
value<Text> pairs in your mapper:

file_position, your_input

Example:
0, "0\t[356:0.3481597,359:0.3481597,358:0.3481597,361:0.3481597,360:0.3481597]"
100, "8\t[356:0.34786037,359:0.34786037,358:0.34786037,361:0.34786037,360:0.34786037]"
200, "25\t[284:0.34821576,286:0.34821576,287:0.34821576,288:0.34821576,289:0.34821576]"

Then just parse it out in your mapper.
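
To make the key/value shape concrete outside a job, here is a rough plain-Java sketch of the pairing that TextInputFormat hands to a mapper: the key is the byte offset at which a line starts, and the value is the line itself. LineOffsets and Record are hypothetical names used only for illustration, not Hadoop classes, and the sketch assumes UTF-8 text with '\n' line endings. The 0, 100, 200 keys in the example above read as round illustrative numbers; real keys depend on how many bytes each line occupies.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration class; it is not part of Hadoop.
final class LineOffsets {

    /** One (byte offset, line text) pair, analogous to <LongWritable, Text>. */
    static final class Record {
        final long offset;
        final String line;

        Record(long offset, String line) {
            this.offset = offset;
            this.line = line;
        }
    }

    /** Splits UTF-8 bytes on '\n' and records the byte offset where each line starts. */
    static List<Record> split(byte[] data) {
        List<Record> records = new ArrayList<>();
        int start = 0;
        for (int i = 0; i < data.length; i++) {
            if (data[i] == '\n') {
                records.add(new Record(start,
                        new String(data, start, i - start, StandardCharsets.UTF_8)));
                start = i + 1; // the next line begins right after the newline byte
            }
        }
        if (start < data.length) { // last line without a trailing newline
            records.add(new Record(start,
                    new String(data, start, data.length - start, StandardCharsets.UTF_8)));
        }
        return records;
    }

    public static void main(String[] args) {
        byte[] data = "0\t[356:0.3481597]\n8\t[356:0.34786037]\n"
                .getBytes(StandardCharsets.UTF_8);
        for (Record r : split(data)) {
            System.out.println(r.offset + ", \"" + r.line + "\"");
        }
    }
}
```

The offset key is rarely useful for this kind of file; the user id still has to be parsed out of the value.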


-----Original Message-----
From: Pat Ferrel [mailto:pat.ferrel@gmail.com]
Sent: Wednesday, December 12, 2012 7:50 AM
To: user@hadoop.apache.org
Subject: Hadoop 101

Stupid question for the day.

I have a file created by a mahout job of the form:

0	[356:0.3481597,359:0.3481597,358:0.3481597,361:0.3481597,360:0.3481597]
8	[356:0.34786037,359:0.34786037,358:0.34786037,361:0.34786037,360:0.34786037]
25	[284:0.34821576,286:0.34821576,287:0.34821576,288:0.34821576,289:0.34821576]
28	[452:0.34802154,454:0.34802154,453:0.34802154,456:0.34802154,455:0.34802154]
…

If this were a SequenceFile I could read it and be merrily on my way but
it's a text file. The classes written are key, value pairs <LongWritable,
VectorWritable> but the file is tab delimited text. 

I was hoping to do something like:

SequenceFile.Reader reader = new SequenceFile.Reader(fs, inputFile, conf);
Writable userId = new LongWritable();
VectorWritable recommendations = new VectorWritable();
while (reader.next(userId, recommendations)) {
	//do something with each pair
}

But alas Google fails me. How do you read key/value pairs from text
files outside of a map or reduce?

Re: Hadoop 101

Posted by Pat Ferrel <pa...@gmail.com>.

Yeah, I found TextInputFormat and KeyValueTextInputFormat, and I know how to parse text; I'm just too lazy. I was hoping there was a text equivalent of a SequenceFile hidden somewhere. As I said, there is no mapper; this is running outside of Hadoop M/R. So I at least need a line reader, and I'm not sure how an InputFormat works outside a mapper. But who cares, parsing is simple enough from scratch. All KeyValueTextInputFormat gives me is splitting at the tab, afaict.

Actually this convinces me to look further into getting the values from method calls. They aren't quite what I want to begin with. 

Thanks for saving me more fruitless searches.
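
For the from-scratch route, a minimal sketch in plain Java might look like the following. The class name RecommendationTextReader and its methods are made up for illustration (nothing here comes from Hadoop or Mahout), and it assumes every non-blank line has the form userId<TAB>[itemId:score,...] as in the sample file. If the file lives on HDFS, the Reader could presumably wrap the stream returned by FileSystem.open(path) instead of a local file.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper; it is not part of Hadoop or Mahout.
final class RecommendationTextReader {

    /** Parses "[356:0.3481597,359:0.3481597]" into itemId -> score, keeping file order. */
    static Map<Long, Double> parseVector(String text) {
        String body = text.trim();
        if (body.length() < 2 || body.charAt(0) != '['
                || body.charAt(body.length() - 1) != ']') {
            throw new IllegalArgumentException("Expected [id:score,...] but got: " + text);
        }
        body = body.substring(1, body.length() - 1).trim();
        Map<Long, Double> scores = new LinkedHashMap<>();
        if (body.isEmpty()) {
            return scores; // "[]" means no recommendations for this user
        }
        for (String entry : body.split(",")) {
            int colon = entry.indexOf(':');
            if (colon < 0) {
                throw new IllegalArgumentException("Missing ':' in entry: " + entry);
            }
            long itemId = Long.parseLong(entry.substring(0, colon).trim());
            double score = Double.parseDouble(entry.substring(colon + 1).trim());
            scores.put(itemId, score);
        }
        return scores;
    }

    /** Reads every "userId<TAB>[...]" line into userId -> (itemId -> score). */
    static Map<Long, Map<Long, Double>> readAll(Reader source) throws IOException {
        Map<Long, Map<Long, Double>> result = new LinkedHashMap<>();
        try (BufferedReader reader = new BufferedReader(source)) {
            String line;
            while ((line = reader.readLine()) != null) {
                if (line.trim().isEmpty()) {
                    continue; // skip blank lines
                }
                int tab = line.indexOf('\t');
                if (tab < 0) {
                    throw new IllegalArgumentException("No tab in line: " + line);
                }
                long userId = Long.parseLong(line.substring(0, tab).trim());
                result.put(userId, parseVector(line.substring(tab + 1)));
            }
        }
        return result;
    }

    public static void main(String[] args) throws IOException {
        String sample = "0\t[356:0.3481597,359:0.3481597]\n"
                + "8\t[356:0.34786037,359:0.34786037]\n";
        for (Map.Entry<Long, Map<Long, Double>> e : readAll(new StringReader(sample)).entrySet()) {
            // do something with each pair
            System.out.println(e.getKey() + " -> " + e.getValue());
        }
    }
}
```

The loop in main plays the role of the while (reader.next(...)) loop in the original question, with a plain Map standing in for VectorWritable.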

On Dec 11, 2012, at 10:04 PM, David Parks <da...@yahoo.com> wrote:

You use TextInputFormat, you'll get the following key<LongWritable>,
value<Text> pairs in your mapper:

file_position, your_input

Example:
0, "0\t[356:0.3481597,359:0.3481597,358:0.3481597,361:0.3481597,360:0.3481597]"
100, "8\t[356:0.34786037,359:0.34786037,358:0.34786037,361:0.34786037,360:0.34786037]"
200, "25\t[284:0.34821576,286:0.34821576,287:0.34821576,288:0.34821576,289:0.34821576]"

Then just parse it out in your mapper.


-----Original Message-----
From: Pat Ferrel [mailto:pat.ferrel@gmail.com] 
Sent: Wednesday, December 12, 2012 7:50 AM
To: user@hadoop.apache.org
Subject: Hadoop 101

Stupid question for the day.

I have a file created by a mahout job of the form:

0	[356:0.3481597,359:0.3481597,358:0.3481597,361:0.3481597,360:0.3481597]
8	[356:0.34786037,359:0.34786037,358:0.34786037,361:0.34786037,360:0.34786037]
25	[284:0.34821576,286:0.34821576,287:0.34821576,288:0.34821576,289:0.34821576]
28	[452:0.34802154,454:0.34802154,453:0.34802154,456:0.34802154,455:0.34802154]
…

If this were a SequenceFile I could read it and be merrily on my way but
it's a text file. The classes written are key, value pairs <LongWritable,
VectorWritable> but the file is tab delimited text. 

I was hoping to do something like:

SequenceFile.Reader reader = new SequenceFile.Reader(fs, inputFile, conf);
Writable userId = new LongWritable();
VectorWritable recommendations = new VectorWritable();
while (reader.next(userId, recommendations)) {
	//do something with each pair
}

But alas Google fails me. How do you read key/value pairs from text
files outside of a map or reduce?

RE: Hadoop 101

Posted by David Parks <da...@yahoo.com>.

If you use TextInputFormat, you'll get the following key<LongWritable>,
value<Text> pairs in your mapper:

file_position, your_input

Example:
0, "0\t[356:0.3481597,359:0.3481597,358:0.3481597,361:0.3481597,360:0.3481597]"
100, "8\t[356:0.34786037,359:0.34786037,358:0.34786037,361:0.34786037,360:0.34786037]"
200, "25\t[284:0.34821576,286:0.34821576,287:0.34821576,288:0.34821576,289:0.34821576]"

Then just parse it out in your mapper.
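
As a rough illustration of that parsing step, the sketch below turns the bracketed part of a line (the text after the tab) into index/weight pairs. It is plain Java with no Hadoop or Mahout dependency; the class name VectorTextParser is hypothetical, and the "[index:weight,index:weight,...]" layout is assumed from the sample lines in this thread. In a mapper, the input string would presumably come from splitting value.toString() at the tab first.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class VectorTextParser {

    /** Parses text such as "[356:0.34,359:0.35]" into an index-to-weight map. */
    public static Map<Integer, Double> parse(String text) {
        String trimmed = text.trim();
        if (!trimmed.startsWith("[") || !trimmed.endsWith("]")) {
            throw new IllegalArgumentException("Expected [..] but got: " + text);
        }
        // Strip the surrounding brackets.
        String body = trimmed.substring(1, trimmed.length() - 1).trim();

        // LinkedHashMap keeps the entries in the order they appear in the file.
        Map<Integer, Double> entries = new LinkedHashMap<>();
        if (body.isEmpty()) {
            return entries;
        }
        for (String item : body.split(",")) {
            int colon = item.indexOf(':');
            if (colon < 0) {
                throw new IllegalArgumentException("Expected index:weight but got: " + item);
            }
            int index = Integer.parseInt(item.substring(0, colon).trim());
            double weight = Double.parseDouble(item.substring(colon + 1).trim());
            entries.put(index, weight);
        }
        return entries;
    }

    public static void main(String[] args) {
        Map<Integer, Double> v = parse("[356:0.3481597,359:0.3481597,358:0.3481597]");
        for (Map.Entry<Integer, Double> e : v.entrySet()) {
            System.out.println(e.getKey() + " = " + e.getValue());
        }
    }
}
```

A map is used only to keep the sketch free of dependencies; filling a Mahout vector from the same pairs would be a separate step.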


-----Original Message-----
From: Pat Ferrel [mailto:pat.ferrel@gmail.com] 
Sent: Wednesday, December 12, 2012 7:50 AM
To: user@hadoop.apache.org
Subject: Hadoop 101

Stupid question for the day.

I have a file created by a mahout job of the form:

0	[356:0.3481597,359:0.3481597,358:0.3481597,361:0.3481597,360:0.3481597]
8	[356:0.34786037,359:0.34786037,358:0.34786037,361:0.34786037,360:0.34786037]
25	[284:0.34821576,286:0.34821576,287:0.34821576,288:0.34821576,289:0.34821576]
28	[452:0.34802154,454:0.34802154,453:0.34802154,456:0.34802154,455:0.34802154]
…

If this were a SequenceFile I could read it and be merrily on my way but
it's a text file. The classes written are key, value pairs <LongWritable,
VectorWritable> but the file is tab delimited text. 

I was hoping to do something like:

SequenceFile.Reader reader = new SequenceFile.Reader(fs, inputFile, conf);
Writable userId = new LongWritable();
VectorWritable recommendations = new VectorWritable();
while (reader.next(userId, recommendations)) {
	//do something with each pair
}

But alas Google fails me. How do you read in key, values pairs from text
files outside of a map or reduce?
