Posted to common-user@hadoop.apache.org by "James R. Leek" <le...@llnl.gov> on 2009/11/30 02:00:40 UTC

Identifying lines in map()

I want to use hadoop to discover if there is any token that appears in 
every line of a file.  I thought that this should be pretty 
straightforward, but I'm having a heck of a time with it.  (I'm pretty 
new to hadoop.  I've been using it for about two weeks.)

My original idea was to have the mapper produce every token as the key, 
with the line number as the value.  But I couldn't find any InputFormat 
that would give me line numbers.

However, it seemed that FileInputFormat would give me the position in 
the file as the key, and the line as the value.  I assume that the key 
would be the position in the file of the beginning of the line.  With 
that I could have the token be the key, and the line position as the 
value, and use a hash table in the reducer to determine if the token 
appeared in every line.  However, I found that it actually seems to give 
the position of the input split.  I figured this out because, rather 
than getting 50,000 unique keys to the mapper (the number of lines in 
the file), I was getting 220 unique keys.  (The number of mappers/input 
splits.)

So, what should I do?

Thanks,
Jim

Re: Identifying lines in map()

Posted by Brian Bockelman <bb...@cse.unl.edu>.
Hey Todd, James,

If you want to tweak this to reduce memory usage (please excuse my pseudo-python),

configure:
  common_tokens = set()
  processed_first_line = False
  line_count = 0
map:
  line_count += 1
  if not processed_first_line:
    for each word in line:
      common_tokens.add( word )
    processed_first_line = True
  else:
    temp = set()
    for each word in line:
      temp.add(word)
    common_tokens.intersection_update(temp)
close:
  emit (null, line_count)
  for each word in common_tokens:
    emit (word, line_count)
combiner and reducer:
  Mostly as below.  Couldn't you avoid the post-processing step by selecting a "null" key that is sorted as the first element and using only one reducer?  This way, you know the true # of lines prior to processing "real" keys.

(I might have to look up in my trusty ol' data structure book to see if the computational complexity of the map might be improved).

Basically, for each mapper, you're keeping track of all tokens that are in each line of the given input split and emitting only those, as opposed to emitting at least one record for each unique token in the file.  Saves you some RAM and reduces the size of the intermediate files, which can be a real killer for a CPU-light problem such as this one.
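Here is the same mapper as runnable Python, as a single-process sketch that reads stdin in place of the streaming API; whitespace tokenization and an empty string standing in for the null key are my assumptions:

import sys

# configure
common_tokens = set()
processed_first_line = False
line_count = 0

# map
for line in sys.stdin:
    line_count += 1
    words = set(line.split())        # assumes whitespace-separated tokens
    if not processed_first_line:
        common_tokens = words
        processed_first_line = True
    else:
        common_tokens.intersection_update(words)

# close
print("\t%d" % line_count)           # empty key stands in for "null"
for word in common_tokens:
    print("%s\t%d" % (word, line_count))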

Hope this helps,

Brian

On Nov 29, 2009, at 7:23 PM, Todd Lipcon wrote:

> Ah, I misunderstood.
> 
> How about this?
> 
> mapper:
>  for each word in line:
>   add word to a set()
>  for each word in set:
>    emit (word, 1)
>  emit (null, 1)
> 
> combiner:
>  sum up input values (just like word count)
> 
> reducer:
>  same as combiner, except you can avoid emitting any words which have a
> count less than any other count you previously emitted
> 
> then post process - you have the "null" key which is the total line count.
> Any keys which have that same value appeared on every line.
> 
> If you want to do away with the post process step, you might be able to do
> something clever involving using a counter in the map, and then reading that
> counter's value from the Reduce task. I've never tried it, but I *think* you
> should be able to get at your job's map counter values from the reduce
> execution (though not the combiner, necessarily). If you can do this, you'd
> know the total line count in the reducer and could avoid emitting any keys
> that aren't in every line.
> 
> -Todd
> 
> On Sun, Nov 29, 2009 at 5:18 PM, James R. Leek <le...@llnl.gov> wrote:
> 
>> Thanks Todd, but your code seems to check if a given token exists on every
>> line.  (Like the regex example.)  I want to find any tokens that exist on
>> every line.  So, given the input:
>> 
>> Amy Sue Fred John
>> Jack Joe Sue John
>> Alice Bob Fred Sue John
>> 
>> The output should be:
>> Sue
>> John
>> 
>> because Sue and John appear on every line.  I don't know Sue and John in
>> advance.
>> 
>> Thanks,
>> Jim
>> 
>> 
>> Todd Lipcon wrote:
>> 
>>> Hi James,
>>> 
>>> Something like the following pseudocode:
>>> 
>>> Mapper:
>>> configure:
>>>   set instance variable "seenOnlyMatches" = true
>>> map:
>>>   save OutputCollector
>>>   if current line doesn't match, set seenOnlyMatches to false
>>> close:
>>>   output a single record containing the value of seenOnlyMatches (and a
>>> null key)
>>>   super.close()
>>> 
>>> Reducer:
>>> if any input records are false, output false. otherwise output true
>>> 
>>> Make sense?
>>> 
>>> -Todd
>>> On Sun, Nov 29, 2009 at 5:00 PM, James R. Leek <le...@llnl.gov> wrote:
>>> 
>>> 
>>> 
>>>> I want to use hadoop to discover if there is any token that appears in
>>>> every line of a file.  I thought that this should be pretty
>>>> straightforward,
>>>> but I'm having a heck of a time with it.  (I'm pretty new to hadoop.
>>>> I've
>>>> been using it for about two weeks.)
>>>> 
>>>> My original idea was to have the mapper produce every token as the key,
>>>> with the line number as the value.  But I couldn't find any InputFormat
>>>> that
>>>> would give me line numbers.
>>>> 
>>>> However, it seemed that FileInputFormat would give me the position in the
>>>> file as the key, and the line as the value.  I assume that the key would
>>>> be
>>>> the position in the file of the beginning of the line.  With that I could
>>>> have the token be the key, and the line position as the value, and use a
>>>> hash table in the reducer to determine if the token appeared in every
>>>> line.
>>>> However, I found that it actually seems to give the position of the
>>>> input
>>>> split.  I figured this out because, rather than getting 50,000 unique
>>>> keys
>>>> to the mapper (the number of lines in the file), I was getting 220 unique
>>>> keys.  (The number of mappers/input splits.)
>>>> 
>>>> So, what should I do?
>>>> 
>>>> Thanks,
>>>> Jim
>>>> 
>>>> 
>>>> 
>>> 
>>> 
>>> 
>> 
>> 


Re: Identifying lines in map()

Posted by Patrick Angeles <pa...@gmail.com>.
Interesting... you have more tokens per line than total lines?

LineRecordReader conveys the position of the line as the key in the mapper,
not a line number. If I understand correctly, though, that position is
relative to the input split, so you could probably use a combination of
position and task ID.

However, based on what you said, if you don't have enough memory to keep a
set of tokens, you might be running into other problems (like having lines
that cannot fit in memory).

If memory is an issue, you could split up Todd's original algorithm in 2 M/R
phases, one to filter out duplicates per line and another that does the word
count.
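Something like the following, as a sketch of the two-phase split; the composite key, the separator byte, and whitespace tokenization are all assumptions (and line_id would need to be globally unique in a real job, e.g. a byte offset):

import sys

# Phase 1 map: one record per token occurrence, keyed by (line_id, token),
# so the dedup happens in the shuffle instead of the mapper's RAM
for line_id, line in enumerate(sys.stdin):
    for token in line.split():
        print("%d\x01%s\t1" % (line_id, token))   # \x01 as a key separator

# Phase 1 reduce would emit each (line_id, token) group once as (token, 1).
# Phase 2 is then plain word count; a token that appears on every line
# ends up with count == total number of lines.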


On Sun, Nov 29, 2009 at 9:07 PM, James R. Leek <le...@llnl.gov> wrote:

>
>  Oh, that's a good idea.  Put the hashset in the mapper rather than the
>> reducer.  Thanks.
>>
> Actually, thinking a little deeper about this, while this will work for my
> smaller examples, the lines get really long for the larger ones.  I'm not
> sure I'll have enough memory to keep all the tokens in a line in one hash
> table.  Although this will do for now, I'd appreciate any additional ideas.
>
> My original idea was to have an array in the reducer that kept line number.
>  So:
>
> Mapper:
>  for each(token):
>   emit(token, linenumber)
>
> Reducer:
>  for linenumber in values:
>   lines[linenumber] = 1;
>
>  count = 0;
>  for existsInThisLine in lines:
>    count += existsInThisLine
>
> if(count == totalNumberOfLines):
>  result(token, count)
>
> When I found I couldn't get the line number, I tried set of line positions,
> but the idea is the same.  Of course, it turned out I was getting input
> split positions rather than line positions, that didn't help.  Anyway, this
> way the size of the array or set is limited by the number of lines rather
> than the number of tokens.
> Jim
>
>

Re: Identifying lines in map()

Posted by "James R. Leek" <le...@llnl.gov>.
 
> Oh, that's a good idea.  Put the hashset in the mapper rather than the 
> reducer.  Thanks.
Actually, thinking a little deeper about this, while this will work for 
my smaller examples, the lines get really long for the larger ones.  I'm 
not sure I'll have enough memory to keep all the tokens in a line in one 
hash table.  Although this will do for now, I'd appreciate any 
additional ideas.

My original idea was to have an array in the reducer that kept line 
number.  So:

Mapper:
  for each(token):
    emit(token, linenumber)

Reducer:
  for linenumber in values:
    lines[linenumber] = 1;
 
  count = 0;
  for existsInThisLine in lines:
     count += existsInThisLine

if(count == totalNumberOfLines):
  result(token, count)

When I found I couldn't get the line number, I tried set of line 
positions, but the idea is the same.  Of course, it turned out I was 
getting input split positions rather than line positions, that didn't 
help.  Anyway, this way the size of the array or set is limited by the 
number of lines rather than the number of tokens. 
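For what it's worth, here is that reducer as Python, a local sketch for a single token, assuming the framework could hand us real line numbers (which is exactly the part that didn't work):

def reduce(token, linenumbers, total_number_of_lines):
    # mark each line the token appeared on
    lines = [0] * total_number_of_lines
    for n in linenumbers:
        lines[n] = 1
    # count how many distinct lines contained the token
    count = sum(lines)
    if count == total_number_of_lines:
        return (token, count)   # token appeared on every line
    return None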

Jim


Re: Identifying lines in map()

Posted by "James R. Leek" <le...@llnl.gov>.
Todd Lipcon wrote:
> Ah, I misunderstood.
>
> How about this?
>
> mapper:
>   for each word in line:
>    add word to a set()
>   for each word in set:
>     emit (word, 1)
>   emit (null, 1)
>   
Oh, that's a good idea.  Put the hashset in the mapper rather than the 
reducer.  Thanks.

I don't mind post-processing in this case, I have to do that anyway.

Jim

Re: Identifying lines in map()

Posted by Todd Lipcon <to...@cloudera.com>.
Ah, I misunderstood.

How about this?

mapper:
  for each word in line:
   add word to a set()
  for each word in set:
    emit (word, 1)
  emit (null, 1)

combiner:
  sum up input values (just like word count)

reducer:
  same as combiner, except you can avoid emitting any words which have a
count less than any other count you previously emitted

then post process - you have the "null" key which is the total line count.
Any keys which have that same value appeared on every line.

If you want to do away with the post process step, you might be able to do
something clever involving using a counter in the map, and then reading that
counter's value from the Reduce task. I've never tried it, but I *think* you
should be able to get at your job's map counter values from the reduce
execution (though not the combiner, necessarily). If you can do this, you'd
know the total line count in the reducer and could avoid emitting any keys
that aren't in every line.
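As a single-process Python sketch of the whole pipeline (map, combine/reduce, and the post-process collapsed into one script; whitespace tokenization and "" as the null key are assumptions):

import sys
from collections import Counter

counts = Counter()
for line in sys.stdin:
    # map: one (word, 1) per distinct word per line, plus the null-key record
    for word in set(line.split()):
        counts[word] += 1
    counts[""] += 1                  # "" stands in for the null key

# post-process: the null key's count is the total line count;
# any word with that same count appeared on every line
total_lines = counts.pop("", 0)
for word, count in sorted(counts.items()):
    if count == total_lines:
        print(word)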

-Todd

On Sun, Nov 29, 2009 at 5:18 PM, James R. Leek <le...@llnl.gov> wrote:

> Thanks Todd, but your code seems to check if a given token exists on every
> line.  (Like the regex example.)  I want to find any tokens that exist on
> every line.  So, given the input:
>
> Amy Sue Fred John
> Jack Joe Sue John
> Alice Bob Fred Sue John
>
> The output should be:
> Sue
> John
>
> because Sue and John appear on every line.  I don't know Sue and John in
> advance.
>
> Thanks,
> Jim
>
>
> Todd Lipcon wrote:
>
>> Hi James,
>>
>> Something like the following pseudocode:
>>
>> Mapper:
>>  configure:
>>    set instance variable "seenOnlyMatches" = true
>>  map:
>>    save OutputCollector
>>    if current line doesn't match, set seenOnlyMatches to false
>>  close:
>>    output a single record containing the value of seenOnlyMatches (and a
>> null key)
>>    super.close()
>>
>> Reducer:
>>  if any input records are false, output false. otherwise output true
>>
>> Make sense?
>>
>> -Todd
>> On Sun, Nov 29, 2009 at 5:00 PM, James R. Leek <le...@llnl.gov> wrote:
>>
>>
>>
>>> I want to use hadoop to discover if there is any token that appears in
>>> every line of a file.  I thought that this should be pretty
>>> straightforward,
>>> but I'm having a heck of a time with it.  (I'm pretty new to hadoop.
>>>  I've
>>> been using it for about two weeks.)
>>>
>>> My original idea was to have the mapper produce every token as the key,
>>> with the line number as the value.  But I couldn't find any InputFormat
>>> that
>>> would give me line numbers.
>>>
>>> However, it seemed that FileInputFormat would give me the position in the
>>> file as the key, and the line as the value.  I assume that the key would
>>> be
>>> the position in the file of the beginning of the line.  With that I could
>>> have the token be the key, and the line position as the value, and use a
>>> hash table in the reducer to determine if the token appeared in every
>>> line.
>>>  However, I found that it actually seems to give the position of the
>>> input
>>> split.  I figured this out because, rather than getting 50,000 unique
>>> keys
>>> to the mapper (the number of lines in the file), I was getting 220 unique
>>> keys.  (The number of mappers/input splits.)
>>>
>>> So, what should I do?
>>>
>>> Thanks,
>>> Jim
>>>
>>>
>>>
>>
>>
>>
>
>

Re: Identifying lines in map()

Posted by "James R. Leek" <le...@llnl.gov>.
Thanks Todd, but your code seems to check if a given token exists on every 
line.  (Like the regex example.)  I want to find any tokens that exist 
on every line.  So, given the input:

Amy Sue Fred John
Jack Joe Sue John
Alice Bob Fred Sue John

The output should be:
Sue
John

because Sue and John appear on every line.  I don't know Sue and John in 
advance.

Thanks,
Jim

Todd Lipcon wrote:
> Hi James,
>
> Something like the following pseudocode:
>
> Mapper:
>   configure:
>     set instance variable "seenOnlyMatches" = true
>   map:
>     save OutputCollector
>     if current line doesn't match, set seenOnlyMatches to false
>   close:
>     output a single record containing the value of seenOnlyMatches (and a
> null key)
>     super.close()
>
> Reducer:
>   if any input records are false, output false. otherwise output true
>
> Make sense?
>
> -Todd
> On Sun, Nov 29, 2009 at 5:00 PM, James R. Leek <le...@llnl.gov> wrote:
>
>   
>> I want to use hadoop to discover if there is any token that appears in
>> every line of a file.  I thought that this should be pretty straightforward,
>> but I'm having a heck of a time with it.  (I'm pretty new to hadoop.  I've
>> been using it for about two weeks.)
>>
>> My original idea was to have the mapper produce every token as the key,
>> with the line number as the value.  But I couldn't find any InputFormat that
>> would give me line numbers.
>>
>> However, it seemed that FileInputFormat would give me the position in the
>> file as the key, and the line as the value.  I assume that the key would be
>> the position in the file of the beginning of the line.  With that I could
>> have the token be the key, and the line position as the value, and use a
>> hash table in the reducer to determine if the token appeared in every line.
>>  However, I found that it actually seems to give the position of the input
>> split.  I figured this out because, rather than getting 50,000 unique keys
>> to the mapper (the number of lines in the file), I was getting 220 unique
>> keys.  (The number of mappers/input splits.)
>>
>> So, what should I do?
>>
>> Thanks,
>> Jim
>>
>>     
>
>   


Re: Identifying lines in map()

Posted by Todd Lipcon <to...@cloudera.com>.
Hi James,

Something like the following pseudocode:

Mapper:
  configure:
    set instance variable "seenOnlyMatches" = true
  map:
    save OutputCollector
    if current line doesn't match, set seenOnlyMatches to false
  close:
    output a single record containing the value of seenOnlyMatches (and a
null key)
    super.close()

Reducer:
  if any input records are false, output false. otherwise output true
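(A local Python sketch of that mapper lifecycle; note it answers "does this one known token appear on every line?", with the token hard-coded as an assumption:)

import sys

TOKEN = "Sue"                  # hypothetical known token to test for
seen_only_matches = True       # configure
for line in sys.stdin:         # map
    if TOKEN not in line.split():
        seen_only_matches = False
print(seen_only_matches)       # close: emit one record under a null key
# the reducer then ANDs the per-mapper booleans together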

Make sense?

-Todd
On Sun, Nov 29, 2009 at 5:00 PM, James R. Leek <le...@llnl.gov> wrote:

> I want to use hadoop to discover if there is any token that appears in
> every line of a file.  I thought that this should be pretty straightforward,
> but I'm having a heck of a time with it.  (I'm pretty new to hadoop.  I've
> been using it for about two weeks.)
>
> My original idea was to have the mapper produce every token as the key,
> with the line number as the value.  But I couldn't find any InputFormat that
> would give me line numbers.
>
> However, it seemed that FileInputFormat would give me the position in the
> file as the key, and the line as the value.  I assume that the key would be
> the position in the file of the beginning of the line.  With that I could
> have the token be the key, and the line position as the value, and use a
> hash table in the reducer to determine if the token appeared in every line.
>  However, I found that it actually seems to give the position of the input
> split.  I figured this out because, rather than getting 50,000 unique keys
> to the mapper (the number of lines in the file), I was getting 220 unique
> keys.  (The number of mappers/input splits.)
>
> So, what should I do?
>
> Thanks,
> Jim
>

Re: Identifying lines in map()

Posted by Mike Kendall <mk...@justin.tv>.
This really seems like the kind of query that doesn't lend itself to
MapReduce very well...  I'd probably do some kind of MapReduce
abuse... (using MapReduce for distributed computation but for nothing
else)

map:
tokens = set of tokens in first line
for each line:
    make set of tokens in this line
    tokens = intersection(tokens, tokens_this_line)
print tokens

combiner: same as map

reduce: cat (identity)

This way each mapper will reduce all of its input to one line of
tokens found in all lines.  Then you can re-run this on the output
until you get a small enough set that you can run the last job on one
box.
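One iteration of that map phase as real Python, a streaming-style sketch that reads lines on stdin and prints one line of surviving tokens (whitespace tokenization is an assumption):

import sys

tokens = None
for line in sys.stdin:
    line_tokens = set(line.split())
    # running intersection across every line this mapper sees
    tokens = line_tokens if tokens is None else tokens & line_tokens
if tokens:
    print(" ".join(sorted(tokens)))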

-mike

On Sun, Nov 29, 2009 at 8:37 PM, Owen O'Malley <om...@apache.org> wrote:
>
> On Nov 29, 2009, at 5:00 PM, James R. Leek wrote:
>
>> I want to use hadoop to discover if there is any token that appears in every line of a file.
>
> What I would do:
>
> map:
>   generate sorted list of tokens (dropping duplicates) for current line
>   if this is the first record:
>      previous = current token list
>   else:
>      iterate through both lists deleting any tokens from previous that aren't in the current
>
> in map close:
>  for each token:
>     emit token, 1
>
> in reduce:
>   if there are M values for the key, where M is the number of maps:
>     emit token, null
>
> memory in map is limited to roughly double the size of each line, which in most non-insane data sets is totally fine. Processing for each line is N lg N in the number of tokens in that line. Everything else is linear in the size of the answer.
>
> -- Owen
>
>
>

Re: Identifying lines in map()

Posted by Owen O'Malley <om...@apache.org>.
On Nov 29, 2009, at 5:00 PM, James R. Leek wrote:

> I want to use hadoop to discover if there is any token that appears in every line of a file.

What I would do:

map:
   generate sorted list of tokens (dropping duplicates) for current line
   if this is the first record:
      previous = current token list
   else:
      iterate through both lists deleting any tokens from previous that aren't in the current

in map close:
  for each token:
     emit token, 1

in reduce:
   if there are M values for the key, where M is the number of maps:
     emit token, null

memory in map is limited to roughly double the size of each line, which in most non-insane data sets is totally fine. Processing for each line is N lg N in the number of tokens in that line. Everything else is linear in the size of the answer.
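A Python sketch of that per-line merge (the linear intersection of two sorted, deduped token lists; whitespace tokenization is an assumption):

import sys

def intersect_sorted(a, b):
    # linear merge: keep only elements present in both sorted lists
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i]); i += 1; j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out

previous = None
for line in sys.stdin:
    current = sorted(set(line.split()))   # N lg N in the line's token count
    previous = current if previous is None else intersect_sorted(previous, current)

# close: emit each surviving token with a count of 1
for token in previous or []:
    print("%s\t1" % token)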

-- Owen