Posted to common-user@hadoop.apache.org by "James R. Leek" <le...@llnl.gov> on 2009/11/30 02:00:40 UTC
Identifying lines in map()
I want to use hadoop to discover if there is any token that appears in
every line of a file. I thought that this should be pretty
straightforward, but I'm having a heck of a time with it. (I'm pretty
new to hadoop. I've been using it for about two weeks.)
My original idea was to have the mapper produce every token as the key,
with the line number as the value. But I couldn't find any InputFormat
that would give me line numbers.
However, it seemed that FileInputFormat would give me the position in
the file as the key, and the line as the value. I assume that the key
would be the position in the file of the beginning of the line. With
that I could have the token be the key, and the line position as the
value, and use a hash table in the reducer to determine if the token
appeared in every line. However, I found that it actually seems to give
the position of the input split. I figured this out because, rather
than getting 50,000 unique keys to the mapper (the number of lines in
the file), I was getting 220 unique keys. (The number of mappers/input
splits.)
So, what should I do?
Thanks,
Jim
Re: Identifying lines in map()
Posted by Brian Bockelman <bb...@cse.unl.edu>.
Hey Todd, James,
If you want to tweak this to reduce memory usage (please excuse my pseudo-python),
configure:
  common_tokens = set()
  processed_first_line = False
  line_count = 0

map:
  line_count += 1
  if processed_first_line == False:
    for each word in line:
      common_tokens.add(word)
    processed_first_line = True
  else:
    temp = set()
    for each word in line:
      temp.add(word)
    common_tokens.intersection_update(temp)

close:
  emit (null, line_count)
  for each word in common_tokens:
    emit (word, line_count)
combiner and reducer:
Mostly as below. Couldn't you avoid the post-processing step by selecting a "null" key that is sorted as the first element and using only one reducer? This way, you know the true # of lines prior to processing "real" keys.
(I might have to look up in my trusty ol' data structure book to see if the computational complexity of the map might be improved).
Basically, for each mapper, you're keeping track of all tokens that are in each line of the given input split and emitting only those, as opposed to emitting at least one record for each unique token in the file. Saves you some RAM and reduces the size of the intermediate files, which can be a real killer for a CPU-light problem such as this one.
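For what it's worth, the pseudo-python translates almost directly into real Python. Here is a local sketch of one mapper's work over its input split; the function name and record shapes are mine, not a Hadoop API:

```python
def map_split(lines):
    """Local sketch of one mapper over its input split: keep the
    running intersection of tokens seen on every line, plus a count
    of the lines in the split."""
    common_tokens = set()
    processed_first_line = False
    line_count = 0
    for line in lines:
        line_count += 1
        words = set(line.split())
        if not processed_first_line:
            common_tokens = words              # seed from the first line
            processed_first_line = True
        else:
            common_tokens.intersection_update(words)
    # close(): one record for the split's line count under a "null" key,
    # then one record per token that survived every intersection
    records = [(None, line_count)]
    records.extend((word, line_count) for word in sorted(common_tokens))
    return records
```

On the Amy/Sue/Fred example from this thread, it returns the null record (None, 3) plus one record each for John and Sue.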
Hope this helps,
Brian
On Nov 29, 2009, at 7:23 PM, Todd Lipcon wrote:
> Ah, I misunderstood.
>
> How about this?
>
> mapper:
>   for each word in line:
>     add word to a set()
>   for each word in set:
>     emit (word, 1)
>   emit (null, 1)
>
> combiner:
> sum up input values (just like word count)
>
> reducer:
> same as combiner, except you can avoid emitting any words which have a
> count less than any other count you previously emitted
>
> then post process - you have the "null" key which is the total line count.
> Any keys which have that same value appeared on every line.
>
> If you want to do away with the post process step, you might be able to do
> something clever involving using a counter in the map, and then reading that
> counter's value from the Reduce task. I've never tried it, but I *think* you
> should be able to get at your job's map counter values from the reduce
> execution (though not the combiner, necessarily). If you can do this, you'd
> know the total line count in the reducer and could avoid emitting any keys
> that aren't in every line.
>
> -Todd
>
> On Sun, Nov 29, 2009 at 5:18 PM, James R. Leek <le...@llnl.gov> wrote:
>
>> Thanks Todd, but your code seems to check if a given token exists on every
>> line. (Like the regex example.) I want to find any tokens that exist on
>> every line. So, given the input:
>>
>> Amy Sue Fred John
>> Jack Joe Sue John
>> Alice Bob Fred Sue John
>>
>> The output should be:
>> Sue
>> John
>>
>> because Sue and John appear on every line. I don't know Sue and John in
>> advance.
>>
>> Thanks,
>> Jim
>>
>>
>> Todd Lipcon wrote:
>>
>>> Hi James,
>>>
>>> Something like the following pseudocode:
>>>
>>> Mapper:
>>>   configure:
>>>     set instance variable "seenOnlyMatches" = true
>>>   map:
>>>     save OutputCollector
>>>     if current line doesn't match, set seenOnlyMatches to false
>>>   close:
>>>     output a single record containing the value of seenOnlyMatches (and a
>>>     null key)
>>>     super.close()
>>>
>>> Reducer:
>>> if any input records are false, output false. otherwise output true
>>>
>>> Make sense?
>>>
>>> -Todd
>>> On Sun, Nov 29, 2009 at 5:00 PM, James R. Leek <le...@llnl.gov> wrote:
>>>
>>>
>>>
>>>> I want to use hadoop to discover if there is any token that appears in
>>>> every line of a file. I thought that this should be pretty
>>>> straightforward,
>>>> but I'm having a heck of a time with it. (I'm pretty new to hadoop.
>>>> I've
>>>> been using it for about two weeks.)
>>>>
>>>> My original idea was to have the mapper produce every token as the key,
>>>> with the line number as the value. But I couldn't find any InputFormat
>>>> that
>>>> would give me line numbers.
>>>>
>>>> However, it seemed that FileInputFormat would give me the position in the
>>>> file as the key, and the line as the value. I assume that the key would
>>>> be
>>>> the position in the file of the beginning of the line. With that I could
>>>> have the token be the key, and the line position as the value, and use a
>>>> hash table in the reducer to determine if the token appeared in every
>>>> line.
>>>> However, I found that it actually seems to give the position of the
>>>> input
>>>> split. I figured this out because, rather than getting 50,000 unique
>>>> keys
>>>> to the mapper (the number of lines in the file), I was getting 220 unique
>>>> keys. (The number of mappers/input splits.)
>>>>
>>>> So, what should I do?
>>>>
>>>> Thanks,
>>>> Jim
>>>>
>>>>
>>>>
>>>
>>>
>>>
>>
>>
Re: Identifying lines in map()
Posted by Patrick Angeles <pa...@gmail.com>.
Interesting... you have more tokens per line than total lines?
LineRecordReader conveys the byte offset of the line as the key in the
mapper. If I understand correctly, though, that offset is relative to the
input split, so you could probably use a combination of offset and task ID.
However, based on what you said, if you don't have enough memory to keep a
set of tokens, you might be running into other problems (like having lines
that cannot fit in memory).
If memory is an issue, you could split up Todd's original algorithm in 2 M/R
phases, one to filter out duplicates per line and another that does the word
count.
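A rough local sketch of that two-phase split; the phase boundaries, function names, and signatures are my guesses at the shape, not working Hadoop code:

```python
from collections import Counter

def phase1_map(line):
    """Phase 1: dedupe per line, emitting each token at most once."""
    return [(token, 1) for token in set(line.split())]

def phase2_count(pairs, total_lines):
    """Phase 2: ordinary word count over the deduped records; a token
    that appeared on every line has a count equal to the line count."""
    counts = Counter()
    for token, one in pairs:
        counts[token] += one
    return sorted(t for t, c in counts.items() if c == total_lines)
```

Running phase 1 over each line of the Amy/Sue/Fred example and feeding all the records into phase 2 with total_lines = 3 yields John and Sue.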
On Sun, Nov 29, 2009 at 9:07 PM, James R. Leek <le...@llnl.gov> wrote:
>
> Oh, that's a good idea. Put the hashset in the mapper rather than the
>> reducer. Thanks.
>>
> Actually, thinking a little deeper about this, while this will work for my
> smaller examples, the lines get really long for the larger ones. I'm not
> sure I'll have enough memory to keep all the tokens in a line in one hash
> table. Although this will do for now, I'd appreciate any additional ideas.
>
> My original idea was to have an array in the reducer that kept line number.
> So:
>
> Mapper:
>   for each(token):
>     emit(token, linenumber)
>
> Reducer:
>   for linenumber in values:
>     lines[linenumber] = 1;
>
>   count = 0;
>   for existsInThisLine in lines:
>     count += existsInThisLine
>
>   if (count == totalNumberOfLines):
>     result(token, count)
>
> When I found I couldn't get the line number, I tried set of line positions,
> but the idea is the same. Of course, it turned out I was getting input
> split positions rather than line positions, that didn't help. Anyway, this
> way the size of the array or set is limited by the number of lines rather
> than the number of tokens.
> Jim
>
>
Re: Identifying lines in map()
Posted by "James R. Leek" <le...@llnl.gov>.
> Oh, that's a good idea. Put the hashset in the mapper rather than the
> reducer. Thanks.
Actually, thinking a little deeper about this, while this will work for
my smaller examples, the lines get really long for the larger ones. I'm
not sure I'll have enough memory to keep all the tokens in a line in one
hash table. Although this will do for now, I'd appreciate any
additional ideas.
My original idea was to have an array in the reducer that kept line
number. So:
Mapper:
  for each(token):
    emit(token, linenumber)

Reducer:
  for linenumber in values:
    lines[linenumber] = 1;

  count = 0;
  for existsInThisLine in lines:
    count += existsInThisLine

  if (count == totalNumberOfLines):
    result(token, count)
When I found I couldn't get the line number, I tried set of line
positions, but the idea is the same. Of course, it turned out I was
getting input split positions rather than line positions, that didn't
help. Anyway, this way the size of the array or set is limited by the
number of lines rather than the number of tokens.
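Assuming the line numbers had been available, that reducer might look like this per token (a local sketch; the names are mine):

```python
def reduce_token(token, line_numbers, total_number_of_lines):
    """For one token, mark every line it appeared on; it qualifies
    only if the marks cover every line in the file. Duplicate
    appearances on a line are harmless, since the slot stays 1."""
    lines = [0] * total_number_of_lines
    for n in line_numbers:
        lines[n] = 1
    count = sum(lines)
    if count == total_number_of_lines:
        return (token, count)
    return None
```

As noted above, the array grows with the number of lines rather than the number of tokens.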
Jim
Re: Identifying lines in map()
Posted by "James R. Leek" <le...@llnl.gov>.
Todd Lipcon wrote:
> Ah, I misunderstood.
>
> How about this?
>
> mapper:
>   for each word in line:
>     add word to a set()
>   for each word in set:
>     emit (word, 1)
>   emit (null, 1)
>
Oh, that's a good idea. Put the hashset in the mapper rather than the
reducer. Thanks.
I don't mind post-processing in this case, I have to do that anyway.
Jim
Re: Identifying lines in map()
Posted by Todd Lipcon <to...@cloudera.com>.
Ah, I misunderstood.
How about this?
mapper:
  for each word in line:
    add word to a set()
  for each word in set:
    emit (word, 1)
  emit (null, 1)
combiner:
sum up input values (just like word count)
reducer:
same as combiner, except you can avoid emitting any words which have a
count less than any other count you previously emitted
then post process - you have the "null" key which is the total line count.
Any keys which have that same value appeared on every line.
If you want to do away with the post process step, you might be able to do
something clever involving using a counter in the map, and then reading that
counter's value from the Reduce task. I've never tried it, but I *think* you
should be able to get at your job's map counter values from the reduce
execution (though not the combiner, necessarily). If you can do this, you'd
know the total line count in the reducer and could avoid emitting any keys
that aren't in every line.
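Simulated locally, that mapper plus the summing reducer and post-process might look as follows; the empty-string null key and the function names are stand-ins of mine:

```python
from collections import Counter

NULL_KEY = ""  # stand-in for the "null" key; a real job would pick a
               # sentinel that can never appear as a token

def map_line(line):
    """Emit each distinct word on the line once, plus one null record
    so the total line count can be recovered downstream."""
    records = [(word, 1) for word in set(line.split())]
    records.append((NULL_KEY, 1))
    return records

def reduce_and_postprocess(records):
    """Sum per key (the combiner and reducer do the same), then keep
    the words whose count equals the null key's total, i.e. words
    that appeared on every line."""
    counts = Counter()
    for key, value in records:
        counts[key] += value
    total_lines = counts.pop(NULL_KEY)
    return sorted(w for w, c in counts.items() if c == total_lines)
```

On the Amy/Sue/Fred example this leaves exactly John and Sue.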
-Todd
On Sun, Nov 29, 2009 at 5:18 PM, James R. Leek <le...@llnl.gov> wrote:
> Thanks Todd, but your code seems to check if a given token exists on every
> line. (Like the regex example.) I want to find any tokens that exist on
> every line. So, given the input:
>
> Amy Sue Fred John
> Jack Joe Sue John
> Alice Bob Fred Sue John
>
> The output should be:
> Sue
> John
>
> because Sue and John appear on every line. I don't know Sue and John in
> advance.
>
> Thanks,
> Jim
>
>
> Todd Lipcon wrote:
>
>> Hi James,
>>
>> Something like the following pseudocode:
>>
>> Mapper:
>>   configure:
>>     set instance variable "seenOnlyMatches" = true
>>   map:
>>     save OutputCollector
>>     if current line doesn't match, set seenOnlyMatches to false
>>   close:
>>     output a single record containing the value of seenOnlyMatches (and a
>>     null key)
>>     super.close()
>>
>> Reducer:
>> if any input records are false, output false. otherwise output true
>>
>> Make sense?
>>
>> -Todd
>> On Sun, Nov 29, 2009 at 5:00 PM, James R. Leek <le...@llnl.gov> wrote:
>>
>>
>>
>>> I want to use hadoop to discover if there is any token that appears in
>>> every line of a file. I thought that this should be pretty
>>> straightforward,
>>> but I'm having a heck of a time with it. (I'm pretty new to hadoop.
>>> I've
>>> been using it for about two weeks.)
>>>
>>> My original idea was to have the mapper produce every token as the key,
>>> with the line number as the value. But I couldn't find any InputFormat
>>> that
>>> would give me line numbers.
>>>
>>> However, it seemed that FileInputFormat would give me the position in the
>>> file as the key, and the line as the value. I assume that the key would
>>> be
>>> the position in the file of the beginning of the line. With that I could
>>> have the token be the key, and the line position as the value, and use a
>>> hash table in the reducer to determine if the token appeared in every
>>> line.
>>> However, I found that it actually seems to give the position of the
>>> input
>>> split. I figured this out because, rather than getting 50,000 unique
>>> keys
>>> to the mapper (the number of lines in the file), I was getting 220 unique
>>> keys. (The number of mappers/input splits.)
>>>
>>> So, what should I do?
>>>
>>> Thanks,
>>> Jim
>>>
>>>
>>>
>>
>>
>>
>
>
Re: Identifying lines in map()
Posted by "James R. Leek" <le...@llnl.gov>.
Thanks Todd, but your code seems to check if a given token exists on every
line. (Like the regex example.) I want to find any tokens that exist
on every line. So, given the input:
Amy Sue Fred John
Jack Joe Sue John
Alice Bob Fred Sue John
The output should be:
Sue
John
because Sue and John appear on every line. I don't know Sue and John in
advance.
Thanks,
Jim
Todd Lipcon wrote:
> Hi James,
>
> Something like the following pseudocode:
>
> Mapper:
>   configure:
>     set instance variable "seenOnlyMatches" = true
>   map:
>     save OutputCollector
>     if current line doesn't match, set seenOnlyMatches to false
>   close:
>     output a single record containing the value of seenOnlyMatches (and a
>     null key)
>     super.close()
>
> Reducer:
> if any input records are false, output false. otherwise output true
>
> Make sense?
>
> -Todd
> On Sun, Nov 29, 2009 at 5:00 PM, James R. Leek <le...@llnl.gov> wrote:
>
>
>> I want to use hadoop to discover if there is any token that appears in
>> every line of a file. I thought that this should be pretty straightforward,
>> but I'm having a heck of a time with it. (I'm pretty new to hadoop. I've
>> been using it for about two weeks.)
>>
>> My original idea was to have the mapper produce every token as the key,
>> with the line number as the value. But I couldn't find any InputFormat that
>> would give me line numbers.
>>
>> However, it seemed that FileInputFormat would give me the position in the
>> file as the key, and the line as the value. I assume that the key would be
>> the position in the file of the beginning of the line. With that I could
>> have the token be the key, and the line position as the value, and use a
>> hash table in the reducer to determine if the token appeared in every line.
>> However, I found that it actually seems to give the position of the input
>> split. I figured this out because, rather than getting 50,000 unique keys
>> to the mapper (the number of lines in the file), I was getting 220 unique
>> keys. (The number of mappers/input splits.)
>>
>> So, what should I do?
>>
>> Thanks,
>> Jim
>>
>>
>
>
Re: Identifying lines in map()
Posted by Todd Lipcon <to...@cloudera.com>.
Hi James,
Something like the following pseudocode:
Mapper:
  configure:
    set instance variable "seenOnlyMatches" = true
  map:
    save OutputCollector
    if current line doesn't match, set seenOnlyMatches to false
  close:
    output a single record containing the value of seenOnlyMatches (and a
    null key)
    super.close()
Reducer:
if any input records are false, output false. otherwise output true
Make sense?
-Todd
On Sun, Nov 29, 2009 at 5:00 PM, James R. Leek <le...@llnl.gov> wrote:
> I want to use hadoop to discover if there is any token that appears in
> every line of a file. I thought that this should be pretty straightforward,
> but I'm having a heck of a time with it. (I'm pretty new to hadoop. I've
> been using it for about two weeks.)
>
> My original idea was to have the mapper produce every token as the key,
> with the line number as the value. But I couldn't find any InputFormat that
> would give me line numbers.
>
> However, it seemed that FileInputFormat would give me the position in the
> file as the key, and the line as the value. I assume that the key would be
> the position in the file of the beginning of the line. With that I could
> have the token be the key, and the line position as the value, and use a
> hash table in the reducer to determine if the token appeared in every line.
> However, I found that it actually seems to give the position of the input
> split. I figured this out because, rather than getting 50,000 unique keys
> to the mapper (the number of lines in the file), I was getting 220 unique
> keys. (The number of mappers/input splits.)
>
> So, what should I do?
>
> Thanks,
> Jim
>
Re: Identifying lines in map()
Posted by Mike Kendall <mk...@justin.tv>.
This really seems like the kind of query that doesn't lend itself to
mapreduce very well... I'd probably do some kind of mapreduce
abuse... (using mapreduce for distributed computation but for nothing
else)
map:
  tokens = set of tokens in first line
  for each line:
    make set of tokens in this line
    tokens = intersection(tokens, tokens_this_line)
  print tokens

combiner: same as map
red: cat
This way each mapper will reduce all of its input to one line of
tokens found in all of its lines. Then you can re-run this on the output
until you get a small enough set that you can run the last job on one
box.
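One round of that map, simulated locally; the function and its return format are mine (each real mapper would print this line, and the cat reducer would pass it through):

```python
def map_round(lines):
    """Fold a whole input split into a single line holding only the
    tokens common to every line in the split."""
    it = iter(lines)
    tokens = set(next(it).split())        # seed from the first line
    for line in it:
        tokens &= set(line.split())       # running intersection
    return " ".join(sorted(tokens))
```

Because each round's outputs are themselves lines of tokens, the same function can be re-run over the collected outputs until one box can finish the job.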
-mike
On Sun, Nov 29, 2009 at 8:37 PM, Owen O'Malley <om...@apache.org> wrote:
>
> On Nov 29, 2009, at 5:00 PM, James R. Leek wrote:
>
>> I want to use hadoop to discover if there is any token that appears in every line of a file.
>
> What I would do:
>
> map:
>   generate sorted list of tokens (dropping duplicates) for current line
>   if this is the first record:
>     previous = current token list
>   else:
>     iterate through both lists deleting any tokens from previous that aren't in the current
>
> in map close:
>   for each token:
>     emit token, 1
>
> in reduce:
>   if there are M, where M is the number of maps, values for the key:
>     emit token, null
>
> memory in map is limited to roughly double the size of each line, which in most non-insane data sets is totally fine. Processing for each line is N lg N in the number of tokens in that line. Everything else is linear in the size of the answer.
>
> -- Owen
>
>
>
Re: Identifying lines in map()
Posted by Owen O'Malley <om...@apache.org>.
On Nov 29, 2009, at 5:00 PM, James R. Leek wrote:
> I want to use hadoop to discover if there is any token that appears in every line of a file.
What I would do:
map:
  generate sorted list of tokens (dropping duplicates) for current line
  if this is the first record:
    previous = current token list
  else:
    iterate through both lists deleting any tokens from previous that aren't in the current

in map close:
  for each token:
    emit token, 1

in reduce:
  if there are M, where M is the number of maps, values for the key:
    emit token, null
memory in map is limited to roughly double the size of each line, which in most non-insane data sets is totally fine. Processing for each line is N lg N in the number of tokens in that line. Everything else is linear in the size of the answer.
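A local sketch of the mapper side of this; the merge helper and both names are mine. The linear merge over two sorted lists is what keeps memory near twice the size of one line:

```python
def intersect_sorted(previous, current):
    """Linear merge of two sorted, duplicate-free token lists,
    keeping only the tokens present in both."""
    out, i, j = [], 0, 0
    while i < len(previous) and j < len(current):
        if previous[i] == current[j]:
            out.append(previous[i])
            i += 1
            j += 1
        elif previous[i] < current[j]:
            i += 1       # token only in previous: drop it
        else:
            j += 1       # token only in current: skip it
    return out

def map_split_sorted(lines):
    """Carry a sorted survivor list across the split's lines; the real
    close() would emit (token, 1) for each survivor."""
    previous = None
    for line in lines:
        current = sorted(set(line.split()))  # N lg N per line
        previous = current if previous is None else intersect_sorted(previous, current)
    return [(token, 1) for token in (previous or [])]
```

The reducer then just counts values per token and keeps those with exactly M, the number of maps.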
-- Owen