Posted to user@hadoop.apache.org by Björn-Elmar Macek <ma...@cs.uni-kassel.de> on 2012/08/09 16:47:59 UTC
OutputValueGroupingComparator gets strange inputs (topic changed
from "Logs cannot be created")
Hi again,
this is a direct response to my previous posting with the title "Logs
cannot be created", where logs could not be created (spill failed). I
got the hint that I should check permissions, but that was not the
problem, because I own the folders that were used for this.
I finally found an important hint in a log saying:
12/08/09 15:30:49 WARN mapred.JobClient: Error reading task output http://its-cs229.its.uni-kassel.de:50060/tasklog?plaintext=true&attemptid=attempt_201208091516_0001_m_000048_0&filter=stdout
12/08/09 15:30:49 WARN mapred.JobClient: Error reading task output http://its-cs229.its.uni-kassel.de:50060/tasklog?plaintext=true&attemptid=attempt_201208091516_0001_m_000048_0&filter=stderr
12/08/09 15:34:34 INFO mapred.JobClient: Task Id : attempt_201208091516_0001_m_000055_0, Status : FAILED
java.io.IOException: Spill failed
    at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(MapTask.java:1029)
    at org.apache.hadoop.mapred.MapTask$OldOutputCollector.collect(MapTask.java:592)
    at uni.kassel.macek.rtprep.RetweetMapper.map(RetweetMapper.java:26)
    at uni.kassel.macek.rtprep.RetweetMapper.map(RetweetMapper.java:12)
    at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
    at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:436)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:372)
    at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:396)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1093)
    at org.apache.hadoop.mapred.Child.main(Child.java:249)
Caused by: java.lang.NumberFormatException: For input string: ""
    at java.lang.NumberFormatException.forInputString(NumberFormatException.java:48)
    at java.lang.Integer.parseInt(Integer.java:468)
    at java.lang.Integer.parseInt(Integer.java:497)
    at uni.kassel.macek.rtprep.Tweet.getRT(Tweet.java:126)
    at uni.kassel.macek.rtprep.TwitterValueGroupingComparator.compare(TwitterValueGroupingComparator.java:47)
    at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.compare(MapTask.java:1111)
    at org.apache.hadoop.util.QuickSort.sortInternal(QuickSort.java:95)
    at org.apache.hadoop.util.QuickSort.sort(QuickSort.java:59)
    at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.sortAndSpill(MapTask.java:1399)
    at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.access$1800(MapTask.java:853)
    at org.apache.hadoop.mapred.MapTask$MapOutputBuffer$SpillThread.run(MapTask.java:1344)
corresponding to the following lines of code within the class
TwitterValueGroupingComparator:
public class TwitterValueGroupingComparator implements RawComparator<Text> {
    ...
    public int compare(byte[] text1, int start1, int length1,
                       byte[] text2, int start2, int length2) {

        byte[] tweet1 = new byte[length1]; // length1-1 (???)
        byte[] tweet2 = new byte[length2]; // length1-1 (???)

        System.arraycopy(text1, start1, tweet1, 0, length1); // start1+1 (???)
        System.arraycopy(text2, start2, tweet2, 0, length2); // start2+1 (???)

        Tweet atweet1 = new Tweet(new String(tweet1));
        Tweet atweet2 = new Tweet(new String(tweet2));

        String key1 = atweet1.getAuthor();
        String key2 = atweet2.getAuthor();

        ////////////////////////////////////////////////////////////////
        // THE FOLLOWING LINE IS THE ONE MENTIONED IN THE LOG (47)
        ////////////////////////////////////////////////////////////////
        if (atweet1.getRT() > 0 && !atweet1.getMention().equals(""))
            key1 = atweet1.getMention();
        if (atweet2.getRT() > 0 && !atweet2.getMention().equals(""))
            key2 = atweet2.getMention();

        return key1.compareTo(key2);
    }
}
Since I take the incoming bytes and interpret them as Tweets by
recreating the corresponding CSV strings and tokenizing them, I was
fairly sure that the problem was the leading bytes that Hadoop puts in
front of the data being compared. Because I never really understood what
Hadoop does to the strings before they are sent to the key comparator, I
simply appended all of them to a file in order to see for myself.
You can see the results here:
??2009-06-12 07:32:47, davedilbeck, tampabaycom, , ,
http://tinyurl.com/nookw7, 1, 1, Lots of vivid imagery here but it's
mostly bullshit Alex Sink June Cleaver or Joan Crawford, null
??2009-06-12 07:32:47, davedilbeck, tampabaycom, , ,
http://tinyurl.com/nookw7, 1, 1, Lots of vivid imagery here but it's
mostly bullshit Alex Sink June Cleaver or Joan Crawford, null
I2009-06-12 04:33:19, ntmp, tsukunep, , , , 1, 0, ??????????????????, null
??2009-06-12 07:32:47, davedilbeck, tampabaycom, , ,
http://tinyurl.com/nookw7, 1, 1, Lots of vivid imagery here but it's
mostly bullshit Alex Sink June Cleaver or Joan Crawford, null
^2009-06-12 04:33:20, aclouatre, , , , , 0, 0, Bored out of my mind
Watching food network, null
b2009-06-12 04:33:20, djnewera, adoremii369, , , , 1, 0, LOL WORDUP ANT
NOTHING LIKE THE HOOD, null
As you can see, there are different leading characters: sometimes it is
"??", other times "b" or "^", etc.
My question is now:
How many bytes do I have to cut off to get back the original Text, as
the String that I put into the key position of my mapper output? What
are the concepts behind this?
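For readers hitting the same thing: the leading bytes are most likely the serialized length of the Text key. Text.write() first emits the byte length of the UTF-8 payload as a variable-length integer (Hadoop's WritableUtils "vint" format) and then the payload itself, so the number of prefix bytes varies with the key length. The following self-contained sketch re-implements that encoding for illustration; it does not use Hadoop itself, so treat the helper names as stand-ins for WritableUtils.writeVInt / readVInt / decodeVIntSize:

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

public class VIntDemo {

    // Port of the WritableUtils vint encoding for int values: a single
    // byte for values in [-112, 127], otherwise a marker byte followed
    // by the value's big-endian bytes.
    public static byte[] writeVInt(int value) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        long i = value;
        if (i >= -112 && i <= 127) {
            out.write((byte) i);
            return out.toByteArray();
        }
        int len = -112;
        if (i < 0) {
            i ^= -1L; // one's complement for negative values
            len = -120;
        }
        long tmp = i;
        while (tmp != 0) {
            tmp >>= 8;
            len--;
        }
        out.write((byte) len);
        len = (len < -120) ? -(len + 120) : -(len + 112);
        for (int idx = len; idx != 0; idx--) {
            int shift = (idx - 1) * 8;
            out.write((byte) ((i & (0xFFL << shift)) >> shift));
        }
        return out.toByteArray();
    }

    // How many bytes the vint occupies, derived from its first byte.
    public static int decodeVIntSize(byte first) {
        if (first >= -112) return 1;
        if (first < -120) return -119 - first;
        return -111 - first;
    }

    // A Text key as the comparator sees it: vint(length) + UTF-8 payload.
    public static byte[] serializeLikeText(String s) {
        byte[] payload = s.getBytes(StandardCharsets.UTF_8);
        byte[] prefix = writeVInt(payload.length);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        out.write(prefix, 0, prefix.length);
        out.write(payload, 0, payload.length);
        return out.toByteArray();
    }

    // Recovering the original string: skip the vint, decode the rest.
    public static String deserialize(byte[] buf, int start, int length) {
        int skip = decodeVIntSize(buf[start]);
        return new String(buf, start + skip, length - skip, StandardCharsets.UTF_8);
    }
}
```

This would also explain the dump above: a CSV line of 98 bytes gets the single prefix byte 0x62, which prints as "b" (94 prints as "^", 73 as "I"), while a line longer than 127 bytes gets a two-byte prefix starting with a non-printable marker byte, which shows up as "??".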
Thanks for your help in advance!
Best regards,
Elmar Macek
Re: OutputValueGroupingComparator gets strange inputs (topic changed
from "Logs cannot be created")
Posted by Björn-Elmar Macek <ma...@cs.uni-kassel.de>.
OK, I found a tutorial for this myself. For everybody who runs into this
problem: here is a tutorial explaining WritableComparable types.
http://developer.yahoo.com/hadoop/tutorial/module5.html
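If I read it correctly, the point is to extend org.apache.hadoop.io.WritableComparator rather than implement RawComparator directly; that class already understands the serialized form and provides helpers such as WritableComparator.readVInt and WritableUtils.decodeVIntSize. As a rough, Hadoop-free sketch of what the byte-level comparison then amounts to (decodeVIntSize is re-implemented here so the sketch runs without Hadoop on the classpath, and the plain string comparison stands in for the Tweet-based key logic):

```java
import java.nio.charset.StandardCharsets;

public class PrefixAwareCompare {

    // Length in bytes of a Hadoop vint, from its first byte (mirrors
    // WritableUtils.decodeVIntSize).
    public static int decodeVIntSize(byte first) {
        if (first >= -112) return 1;
        if (first < -120) return -119 - first;
        return -111 - first;
    }

    // What the raw compare has to do for serialized Text keys: skip the
    // vint length prefix, then interpret the payload bytes as the string.
    public static int compare(byte[] b1, int s1, int l1,
                              byte[] b2, int s2, int l2) {
        int skip1 = decodeVIntSize(b1[s1]);
        int skip2 = decodeVIntSize(b2[s2]);
        String t1 = new String(b1, s1 + skip1, l1 - skip1, StandardCharsets.UTF_8);
        String t2 = new String(b2, s2 + skip2, l2 - skip2, StandardCharsets.UTF_8);
        // a real grouping comparator would now build Tweet objects and
        // derive the author/mention key; plain comparison stands in here
        return t1.compareTo(t2);
    }
}
```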
On 09.08.2012 17:14, Björn-Elmar Macek wrote:
> Ah OK, I get the idea: I can use the abstract class instead of the
> low-level interface, though I am not sure how to use it. It would be
> nice if more complex mechanics like the sorting had an up-to-date
> tutorial with some example code. If I find the time, I will write one,
> since I want to prepare a presentation on Hadoop anyway.
>
> Thanks for your help! I will try to use the abstract class.
>
>
> On 09.08.2012 17:03, Björn-Elmar Macek wrote:
>> Hi Bertrand,
>>
>> I am using RawComparator because it was used in a tutorial by a
>> well-known Hadoop author describing how to sort the input for the
>> reducer. Is there an easier alternative?
>>
>>
>> On 09.08.2012 16:57, Bertrand Dechoux wrote:
>>> I am just curious, but are you using Writable? If so, there is a
>>> WritableComparator...
>>> If you are going to interpret every byte (you create a String, so
>>> you do), there is no clear reason for choosing such a low-level API.
>>>
>>> Regards
>>>
>>> Bertrand
>>>
>>> On Thu, Aug 9, 2012 at 4:47 PM, Björn-Elmar Macek
>>> <macek@cs.uni-kassel.de <ma...@cs.uni-kassel.de>> wrote:
>>>
>>> [original message quoted in full; snipped]
>>>
>>> --
>>> Bertrand Dechoux
>>
>>
>
>
Re: OutputValueGroupingComparator gets strange inputs (topic changed
from "Logs cannot be created")
Posted by Björn-Elmar Macek <ma...@cs.uni-kassel.de>.
Ok, i found a tutorial for this myself. For everybody who ran into the
problem: here is a tutorial explaining WriteableComparable types.
http://developer.yahoo.com/hadoop/tutorial/module5.html
Am 09.08.2012 17:14, schrieb Björn-Elmar Macek:
> Ah ok, i got the idea: i can use the abstract class instead of the low
> level interface, though i am not sure, how to use it. It would just be
> nice, if complexer mechanics like the sorting would have an up-to-date
> tutorial with some example code. If i find the time, i will make one,
> since i want to make a presentation for Hadoop anyways.
>
> Thanks for your help! I will try to use the abstract class.
>
>
> Am 09.08.2012 17:03, schrieb Björn-Elmar Macek:
>> Hi Bertrand,
>>
>> i am using RawComperator because this one was used in the tutorial of
>> some famous (hadoop) guy describing how to sort the input for the
>> reducer. Is there an easier alternative?
>>
>>
>> Am 09.08.2012 16:57, schrieb Bertrand Dechoux:
>>> I am just curious but are you using Writable? If so there is a
>>> WritableComparator...
>>> If you are going to interpret every bytes (you create a String, so
>>> you do), there no clear reason for choosing such a low level API.
>>>
>>> Regards
>>>
>>> Bertrand
>>>
>>> On Thu, Aug 9, 2012 at 4:47 PM, Björn-Elmar Macek
>>> <macek@cs.uni-kassel.de <ma...@cs.uni-kassel.de>> wrote:
>>>
>>> Hi again,
>>>
>>> this is an direct response to my previous posting with the title
>>> "Logs cannot be created", where logs could not be created (Spill
>>> failed). I got the hint, that i gotta check privileges, but that
>>> was not the problem, because i own the folders that were used
>>> for this.
>>>
>>> I finally found an important hint in a log saying:
>>> 12/08/09 15:30:49 WARN mapred.JobClient: Error reading task
>>> outputhttp://its-cs229.its.uni-kassel.de:50060/tasklog?plaintext=true&attemptid=attempt_201208091516_0001_m_000048_0&filter=stdout
>>> <http://its-cs229.its.uni-kassel.de:50060/tasklog?plaintext=true&attemptid=attempt_201208091516_0001_m_000048_0&filter=stdout>
>>> 12/08/09 15:30:49 WARN mapred.JobClient: Error reading task
>>> outputhttp://its-cs229.its.uni-kassel.de:50060/tasklog?plaintext=true&attemptid=attempt_201208091516_0001_m_000048_0&filter=stderr
>>> <http://its-cs229.its.uni-kassel.de:50060/tasklog?plaintext=true&attemptid=attempt_201208091516_0001_m_000048_0&filter=stderr>
>>> 12/08/09 15:34:34 INFO mapred.JobClient: Task Id :
>>> attempt_201208091516_0001_m_000055_0, Status : FAILED
>>> java.io.IOException: Spill failed
>>> at
>>> org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(MapTask.java:1029)
>>> at
>>> org.apache.hadoop.mapred.MapTask$OldOutputCollector.collect(MapTask.java:592)
>>> at
>>> uni.kassel.macek.rtprep.RetweetMapper.map(RetweetMapper.java:26)
>>> at
>>> uni.kassel.macek.rtprep.RetweetMapper.map(RetweetMapper.java:12)
>>> at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
>>> at
>>> org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:436)
>>> at org.apache.hadoop.mapred.MapTask.run(MapTask.java:372)
>>> at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
>>> at java.security.AccessController.doPrivileged(Native
>>> Method)
>>> at javax.security.auth.Subject.doAs(Subject.java:396)
>>> at
>>> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1093)
>>> at org.apache.hadoop.mapred.Child.main(Child.java:249)
>>> Caused by: java.lang.NumberFormatException: For input string: ""
>>> at
>>> java.lang.NumberFormatException.forInputString(NumberFormatException.java:48)
>>> at java.lang.Integer.parseInt(Integer.java:468)
>>> at java.lang.Integer.parseInt(Integer.java:497)
>>> at uni.kassel.macek.rtprep.Tweet.getRT(Tweet.java:126)
>>> at
>>> uni.kassel.macek.rtprep.TwitterValueGroupingComparator.compare(TwitterValueGroupingComparator.java:47)
>>> at
>>> org.apache.hadoop.mapred.MapTask$MapOutputBuffer.compare(MapTask.java:1111)
>>> at
>>> org.apache.hadoop.util.QuickSort.sortInternal(QuickSort.java:95)
>>> at org.apache.hadoop.util.QuickSort.sort(QuickSort.java:59)
>>> at
>>> org.apache.hadoop.mapred.MapTask$MapOutputBuffer.sortAndSpill(MapTask.java:1399)
>>> at
>>> org.apache.hadoop.mapred.MapTask$MapOutputBuffer.access$1800(MapTask.java:853)
>>> at
>>> org.apache.hadoop.mapred.MapTask$MapOutputBuffer$SpillThread.run(MapTask.java:1344)
>>>
>>>
>>>
>>> corresponding to the following lines of code within the class
>>> TwitterValueGroupingComparator:
>>>
>>> public class TwitterValueGroupingComparator implements
>>> RawComparator<Text> {
>>> ...
>>> public int compare(byte[] text1, int start1, int length1,
>>> byte[] text2,
>>> int start2, int length2) {
>>>
>>> byte[] tweet1 = new byte[length1];// length1-1 (???)
>>> byte[] tweet2 = new byte[length2];// length1-1 (???)
>>>
>>> System.arraycopy(text1, start1, tweet1, 0, length1);//
>>> start1+1 (???)
>>> System.arraycopy(text2, start2, tweet2, 0, length2);//
>>> start2+1 (???)
>>>
>>> Tweet atweet1 = new Tweet(new String(tweet1));
>>> Tweet atweet2 = new Tweet(new String(tweet2));
>>>
>>>
>>> String key1 = atweet1.getAuthor();
>>> String key2 = atweet2.getAuthor();
>>> ////////////////////////////////////////////////////////////////
>>> //THE FOLLOWING LINE IS THE ONE MENTIONED IN THE LOG (47)
>>> /////////////////////////////////////////////////////////////////
>>> if (atweet1.getRT() > 0 && !atweet1.getMention().equals(""))
>>> key1 = atweet1.getMention();
>>> if (atweet2.getRT() > 0 && !atweet2.getMention().equals(""))
>>> key2 = atweet2.getMention();
>>>
>>> int realKeyCompare = key1.compareTo(key2);
>>> return realKeyCompare;
>>> }
>>>
>>> }
>>>
>>> As i am taking the incoming bytes and interpret them as Tweets
>>> by recreating the appropriate CSV-Strings and Tokenizing it, i
>>> was kind of sure, that the problem somehow are the leading
>>> bytes, that Hadoop puts in front of the data being compared.
>>> Since i never really understood what hadoop is doing to the
>>> strings when they are sent to the KeyComparator i simply
>>> appended all strings to a file in order to see myself.
>>>
>>> You can see the results here:
>>>
>>> ??2009-06-12 07:32:47, davedilbeck, tampabaycom, , ,
>>> http://tinyurl.com/nookw7, 1, 1, Lots of vivid imagery here but
>>> it's mostly bullshit Alex Sink June Cleaver or Joan Crawford, null
>>> ??2009-06-12 07:32:47, davedilbeck, tampabaycom, , ,
>>> http://tinyurl.com/nookw7, 1, 1, Lots of vivid imagery here but
>>> it's mostly bullshit Alex Sink June Cleaver or Joan Crawford, null
>>> I2009-06-12 04:33:19, ntmp, tsukunep, , , , 1, 0,
>>> ??????????????????, null
>>> ??2009-06-12 07:32:47, davedilbeck, tampabaycom, , ,
>>> http://tinyurl.com/nookw7, 1, 1, Lots of vivid imagery here but
>>> it's mostly bullshit Alex Sink June Cleaver or Joan Crawford, null
>>> ^2009-06-12 04:33:20, aclouatre, , , , , 0, 0, Bored out of my
>>> mind Watching food network, null
>>> b2009-06-12 04:33:20, djnewera, adoremii369, , , , 1, 0, LOL
>>> WORDUP ANT NOTHING LIKE THE HOOD, null
>>>
>>>
>>> As you can see there are different leading characters: sometimes
>>> its "??", other times its "b" or "^", etc.
>>>
>>> My question is now:
>>> How many bits do i have to cut off, so i get the original Text
>>> as a String that i put into the key-position of my mapper
>>> output? What are the concepts behind this?
>>>
>>> Thanks for your help in advance!
>>>
>>> Best regards,
>>> Elmar Macek
>>>
>>>
>>>
>>>
>>>
>>>
>>> --
>>> Bertrand Dechoux
>>
>>
>
>
Re: OutputValueGroupingComparator gets strange inputs (topic changed
from "Logs cannot be created")
Posted by Björn-Elmar Macek <ma...@cs.uni-kassel.de>.
Ok, i found a tutorial for this myself. For everybody who ran into the
problem: here is a tutorial explaining WriteableComparable types.
http://developer.yahoo.com/hadoop/tutorial/module5.html
Am 09.08.2012 17:14, schrieb Björn-Elmar Macek:
> Ah ok, i got the idea: i can use the abstract class instead of the low
> level interface, though i am not sure, how to use it. It would just be
> nice, if complexer mechanics like the sorting would have an up-to-date
> tutorial with some example code. If i find the time, i will make one,
> since i want to make a presentation for Hadoop anyways.
>
> Thanks for your help! I will try to use the abstract class.
>
>
> Am 09.08.2012 17:03, schrieb Björn-Elmar Macek:
>> Hi Bertrand,
>>
>> i am using RawComperator because this one was used in the tutorial of
>> some famous (hadoop) guy describing how to sort the input for the
>> reducer. Is there an easier alternative?
>>
>>
>> Am 09.08.2012 16:57, schrieb Bertrand Dechoux:
>>> I am just curious but are you using Writable? If so there is a
>>> WritableComparator...
>>> If you are going to interpret every bytes (you create a String, so
>>> you do), there no clear reason for choosing such a low level API.
>>>
>>> Regards
>>>
>>> Bertrand
>>>
>>> On Thu, Aug 9, 2012 at 4:47 PM, Björn-Elmar Macek
>>> <macek@cs.uni-kassel.de <ma...@cs.uni-kassel.de>> wrote:
>>>
>>> Hi again,
>>>
>>> this is an direct response to my previous posting with the title
>>> "Logs cannot be created", where logs could not be created (Spill
>>> failed). I got the hint, that i gotta check privileges, but that
>>> was not the problem, because i own the folders that were used
>>> for this.
>>>
>>> I finally found an important hint in a log saying:
>>> 12/08/09 15:30:49 WARN mapred.JobClient: Error reading task
>>> outputhttp://its-cs229.its.uni-kassel.de:50060/tasklog?plaintext=true&attemptid=attempt_201208091516_0001_m_000048_0&filter=stdout
>>> <http://its-cs229.its.uni-kassel.de:50060/tasklog?plaintext=true&attemptid=attempt_201208091516_0001_m_000048_0&filter=stdout>
>>> 12/08/09 15:30:49 WARN mapred.JobClient: Error reading task
>>> outputhttp://its-cs229.its.uni-kassel.de:50060/tasklog?plaintext=true&attemptid=attempt_201208091516_0001_m_000048_0&filter=stderr
>>> <http://its-cs229.its.uni-kassel.de:50060/tasklog?plaintext=true&attemptid=attempt_201208091516_0001_m_000048_0&filter=stderr>
>>> 12/08/09 15:34:34 INFO mapred.JobClient: Task Id :
>>> attempt_201208091516_0001_m_000055_0, Status : FAILED
>>> java.io.IOException: Spill failed
>>> at
>>> org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(MapTask.java:1029)
>>> at
>>> org.apache.hadoop.mapred.MapTask$OldOutputCollector.collect(MapTask.java:592)
>>> at
>>> uni.kassel.macek.rtprep.RetweetMapper.map(RetweetMapper.java:26)
>>> at
>>> uni.kassel.macek.rtprep.RetweetMapper.map(RetweetMapper.java:12)
>>> at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
>>> at
>>> org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:436)
>>> at org.apache.hadoop.mapred.MapTask.run(MapTask.java:372)
>>> at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
>>> at java.security.AccessController.doPrivileged(Native
>>> Method)
>>> at javax.security.auth.Subject.doAs(Subject.java:396)
>>> at
>>> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1093)
>>> at org.apache.hadoop.mapred.Child.main(Child.java:249)
>>> Caused by: java.lang.NumberFormatException: For input string: ""
>>> at
>>> java.lang.NumberFormatException.forInputString(NumberFormatException.java:48)
>>> at java.lang.Integer.parseInt(Integer.java:468)
>>> at java.lang.Integer.parseInt(Integer.java:497)
>>> at uni.kassel.macek.rtprep.Tweet.getRT(Tweet.java:126)
>>> at
>>> uni.kassel.macek.rtprep.TwitterValueGroupingComparator.compare(TwitterValueGroupingComparator.java:47)
>>> at
>>> org.apache.hadoop.mapred.MapTask$MapOutputBuffer.compare(MapTask.java:1111)
>>> at
>>> org.apache.hadoop.util.QuickSort.sortInternal(QuickSort.java:95)
>>> at org.apache.hadoop.util.QuickSort.sort(QuickSort.java:59)
>>> at
>>> org.apache.hadoop.mapred.MapTask$MapOutputBuffer.sortAndSpill(MapTask.java:1399)
>>> at
>>> org.apache.hadoop.mapred.MapTask$MapOutputBuffer.access$1800(MapTask.java:853)
>>> at
>>> org.apache.hadoop.mapred.MapTask$MapOutputBuffer$SpillThread.run(MapTask.java:1344)
>>>
>>>
>>>
>>> corresponding to the following lines of code within the class
>>> TwitterValueGroupingComparator:
>>>
>>> public class TwitterValueGroupingComparator implements
>>> RawComparator<Text> {
>>> ...
>>> public int compare(byte[] text1, int start1, int length1,
>>> byte[] text2,
>>> int start2, int length2) {
>>>
>>> byte[] tweet1 = new byte[length1];// length1-1 (???)
>>> byte[] tweet2 = new byte[length2];// length1-1 (???)
>>>
>>> System.arraycopy(text1, start1, tweet1, 0, length1);//
>>> start1+1 (???)
>>> System.arraycopy(text2, start2, tweet2, 0, length2);//
>>> start2+1 (???)
>>>
>>> Tweet atweet1 = new Tweet(new String(tweet1));
>>> Tweet atweet2 = new Tweet(new String(tweet2));
>>>
>>>
>>> String key1 = atweet1.getAuthor();
>>> String key2 = atweet2.getAuthor();
>>> ////////////////////////////////////////////////////////////////
>>> //THE FOLLOWING LINE IS THE ONE MENTIONED IN THE LOG (47)
>>> /////////////////////////////////////////////////////////////////
>>> if (atweet1.getRT() > 0 && !atweet1.getMention().equals(""))
>>> key1 = atweet1.getMention();
>>> if (atweet2.getRT() > 0 && !atweet2.getMention().equals(""))
>>> key2 = atweet2.getMention();
>>>
>>> int realKeyCompare = key1.compareTo(key2);
>>> return realKeyCompare;
>>> }
>>>
>>> }
>>>
>>> As i am taking the incoming bytes and interpret them as Tweets
>>> by recreating the appropriate CSV-Strings and Tokenizing it, i
>>> was kind of sure, that the problem somehow are the leading
>>> bytes, that Hadoop puts in front of the data being compared.
>>> Since i never really understood what hadoop is doing to the
>>> strings when they are sent to the KeyComparator i simply
>>> appended all strings to a file in order to see myself.
>>>
>>> You can see the results here:
>>>
>>> ??2009-06-12 07:32:47, davedilbeck, tampabaycom, , ,
>>> http://tinyurl.com/nookw7, 1, 1, Lots of vivid imagery here but
>>> it's mostly bullshit Alex Sink June Cleaver or Joan Crawford, null
>>> ??2009-06-12 07:32:47, davedilbeck, tampabaycom, , ,
>>> http://tinyurl.com/nookw7, 1, 1, Lots of vivid imagery here but
>>> it's mostly bullshit Alex Sink June Cleaver or Joan Crawford, null
>>> I2009-06-12 04:33:19, ntmp, tsukunep, , , , 1, 0,
>>> ??????????????????, null
>>> ??2009-06-12 07:32:47, davedilbeck, tampabaycom, , ,
>>> http://tinyurl.com/nookw7, 1, 1, Lots of vivid imagery here but
>>> it's mostly bullshit Alex Sink June Cleaver or Joan Crawford, null
>>> ^2009-06-12 04:33:20, aclouatre, , , , , 0, 0, Bored out of my
>>> mind Watching food network, null
>>> b2009-06-12 04:33:20, djnewera, adoremii369, , , , 1, 0, LOL
>>> WORDUP ANT NOTHING LIKE THE HOOD, null
>>>
>>>
>>> As you can see there are different leading characters: sometimes
>>> its "??", other times its "b" or "^", etc.
>>>
>>> My question is now:
>>> How many bits do i have to cut off, so i get the original Text
>>> as a String that i put into the key-position of my mapper
>>> output? What are the concepts behind this?
>>>
>>> Thanks for your help in advance!
>>>
>>> Best regards,
>>> Elmar Macek
>>>
>>>
>>>
>>>
>>>
>>>
>>> --
>>> Bertrand Dechoux
>>
>>
>
>
Re: OutputValueGroupingComparator gets strange inputs (topic changed
from "Logs cannot be created")
Posted by Björn-Elmar Macek <ma...@cs.uni-kassel.de>.
Ok, i found a tutorial for this myself. For everybody who ran into the
problem: here is a tutorial explaining WriteableComparable types.
http://developer.yahoo.com/hadoop/tutorial/module5.html
Am 09.08.2012 17:14, schrieb Björn-Elmar Macek:
> Ah ok, i got the idea: i can use the abstract class instead of the low
> level interface, though i am not sure, how to use it. It would just be
> nice, if complexer mechanics like the sorting would have an up-to-date
> tutorial with some example code. If i find the time, i will make one,
> since i want to make a presentation for Hadoop anyways.
>
> Thanks for your help! I will try to use the abstract class.
>
>
> Am 09.08.2012 17:03, schrieb Björn-Elmar Macek:
>> Hi Bertrand,
>>
>> i am using RawComperator because this one was used in the tutorial of
>> some famous (hadoop) guy describing how to sort the input for the
>> reducer. Is there an easier alternative?
>>
>>
>> Am 09.08.2012 16:57, schrieb Bertrand Dechoux:
>>> I am just curious but are you using Writable? If so there is a
>>> WritableComparator...
>>> If you are going to interpret every bytes (you create a String, so
>>> you do), there no clear reason for choosing such a low level API.
>>>
>>> Regards
>>>
>>> Bertrand
>>>
>>> On Thu, Aug 9, 2012 at 4:47 PM, Björn-Elmar Macek
>>> <macek@cs.uni-kassel.de <ma...@cs.uni-kassel.de>> wrote:
>>>
>>> Hi again,
>>>
>>> this is an direct response to my previous posting with the title
>>> "Logs cannot be created", where logs could not be created (Spill
>>> failed). I got the hint, that i gotta check privileges, but that
>>> was not the problem, because i own the folders that were used
>>> for this.
>>>
>>> I finally found an important hint in a log saying:
>>> 12/08/09 15:30:49 WARN mapred.JobClient: Error reading task outputhttp://its-cs229.its.uni-kassel.de:50060/tasklog?plaintext=true&attemptid=attempt_201208091516_0001_m_000048_0&filter=stdout
>>> 12/08/09 15:30:49 WARN mapred.JobClient: Error reading task outputhttp://its-cs229.its.uni-kassel.de:50060/tasklog?plaintext=true&attemptid=attempt_201208091516_0001_m_000048_0&filter=stderr
>>> 12/08/09 15:34:34 INFO mapred.JobClient: Task Id : attempt_201208091516_0001_m_000055_0, Status : FAILED
>>> java.io.IOException: Spill failed
>>>     at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(MapTask.java:1029)
>>>     at org.apache.hadoop.mapred.MapTask$OldOutputCollector.collect(MapTask.java:592)
>>>     at uni.kassel.macek.rtprep.RetweetMapper.map(RetweetMapper.java:26)
>>>     at uni.kassel.macek.rtprep.RetweetMapper.map(RetweetMapper.java:12)
>>>     at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
>>>     at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:436)
>>>     at org.apache.hadoop.mapred.MapTask.run(MapTask.java:372)
>>>     at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
>>>     at java.security.AccessController.doPrivileged(Native Method)
>>>     at javax.security.auth.Subject.doAs(Subject.java:396)
>>>     at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1093)
>>>     at org.apache.hadoop.mapred.Child.main(Child.java:249)
>>> Caused by: java.lang.NumberFormatException: For input string: ""
>>>     at java.lang.NumberFormatException.forInputString(NumberFormatException.java:48)
>>>     at java.lang.Integer.parseInt(Integer.java:468)
>>>     at java.lang.Integer.parseInt(Integer.java:497)
>>>     at uni.kassel.macek.rtprep.Tweet.getRT(Tweet.java:126)
>>>     at uni.kassel.macek.rtprep.TwitterValueGroupingComparator.compare(TwitterValueGroupingComparator.java:47)
>>>     at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.compare(MapTask.java:1111)
>>>     at org.apache.hadoop.util.QuickSort.sortInternal(QuickSort.java:95)
>>>     at org.apache.hadoop.util.QuickSort.sort(QuickSort.java:59)
>>>     at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.sortAndSpill(MapTask.java:1399)
>>>     at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.access$1800(MapTask.java:853)
>>>     at org.apache.hadoop.mapred.MapTask$MapOutputBuffer$SpillThread.run(MapTask.java:1344)
>>>
>>>
>>>
>>> corresponding to the following lines of code within the class
>>> TwitterValueGroupingComparator:
>>>
>>> public class TwitterValueGroupingComparator implements RawComparator<Text> {
>>>     ...
>>>     public int compare(byte[] text1, int start1, int length1,
>>>                        byte[] text2, int start2, int length2) {
>>>
>>>         byte[] tweet1 = new byte[length1]; // length1-1 (???)
>>>         byte[] tweet2 = new byte[length2]; // length2-1 (???)
>>>
>>>         System.arraycopy(text1, start1, tweet1, 0, length1); // start1+1 (???)
>>>         System.arraycopy(text2, start2, tweet2, 0, length2); // start2+1 (???)
>>>
>>>         Tweet atweet1 = new Tweet(new String(tweet1));
>>>         Tweet atweet2 = new Tweet(new String(tweet2));
>>>
>>>         String key1 = atweet1.getAuthor();
>>>         String key2 = atweet2.getAuthor();
>>>         ////////////////////////////////////////////////////////////////
>>>         // THE FOLLOWING LINE IS THE ONE MENTIONED IN THE LOG (47)
>>>         ////////////////////////////////////////////////////////////////
>>>         if (atweet1.getRT() > 0 && !atweet1.getMention().equals(""))
>>>             key1 = atweet1.getMention();
>>>         if (atweet2.getRT() > 0 && !atweet2.getMention().equals(""))
>>>             key2 = atweet2.getMention();
>>>
>>>         int realKeyCompare = key1.compareTo(key2);
>>>         return realKeyCompare;
>>>     }
>>> }
>>>
>>> As I take the incoming bytes and interpret them as Tweets by
>>> recreating the corresponding CSV strings and tokenizing them, I
>>> was fairly sure that the problem is somehow the leading bytes
>>> that Hadoop puts in front of the data being compared. Since I
>>> never really understood what Hadoop does to the strings when
>>> they are sent to the KeyComparator, I simply appended all the
>>> strings to a file in order to see for myself.
>>>
>>> You can see the results here:
>>>
>>> ??2009-06-12 07:32:47, davedilbeck, tampabaycom, , ,
>>> http://tinyurl.com/nookw7, 1, 1, Lots of vivid imagery here but
>>> it's mostly bullshit Alex Sink June Cleaver or Joan Crawford, null
>>> ??2009-06-12 07:32:47, davedilbeck, tampabaycom, , ,
>>> http://tinyurl.com/nookw7, 1, 1, Lots of vivid imagery here but
>>> it's mostly bullshit Alex Sink June Cleaver or Joan Crawford, null
>>> I2009-06-12 04:33:19, ntmp, tsukunep, , , , 1, 0,
>>> ??????????????????, null
>>> ??2009-06-12 07:32:47, davedilbeck, tampabaycom, , ,
>>> http://tinyurl.com/nookw7, 1, 1, Lots of vivid imagery here but
>>> it's mostly bullshit Alex Sink June Cleaver or Joan Crawford, null
>>> ^2009-06-12 04:33:20, aclouatre, , , , , 0, 0, Bored out of my
>>> mind Watching food network, null
>>> b2009-06-12 04:33:20, djnewera, adoremii369, , , , 1, 0, LOL
>>> WORDUP ANT NOTHING LIKE THE HOOD, null
>>>
>>>
>>> As you can see, the leading characters differ: sometimes it is
>>> "??", other times "b" or "^", etc.
>>>
>>> My question now is: how many bytes do I have to cut off to get
>>> back the original Text as a String, i.e. what I put into the key
>>> position of my mapper output? What are the concepts behind this?
>>>
>>> Thanks for your help in advance!
>>>
>>> Best regards,
>>> Elmar Macek
>>>
>>>
>>>
>>>
>>>
>>>
>>> --
>>> Bertrand Dechoux
>>
>>
>
>
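The leading bytes in the dump are Text's serialization header: a Text value is written as a variable-length int holding the UTF-8 byte count, followed by the bytes themselves, so the number of bytes to skip is not a constant. The sketch below decodes that prefix; the VInt logic is re-implemented after Hadoop's WritableUtils encoding so the example runs without Hadoop on the classpath (inside a job you would use WritableUtils.decodeVIntSize and WritableComparator.readVInt instead).

```java
// Sketch: recover the original String from the raw bytes a RawComparator
// receives for a serialized Text key. VInt decoding mirrors Hadoop's
// WritableUtils format (re-implemented here to keep the example standalone).
public final class TextRawDecode {

    // Number of bytes the variable-length int occupies, from its first byte.
    static int decodeVIntSize(byte value) {
        if (value >= -112) return 1;           // small values fit in one byte
        if (value < -120) return -119 - value; // negative, multi-byte
        return -111 - value;                   // positive, multi-byte
    }

    // Read the VInt starting at buf[start] (the Text's UTF-8 byte count).
    static int readVInt(byte[] buf, int start) {
        byte first = buf[start];
        int len = decodeVIntSize(first);
        if (len == 1) return first;
        int value = 0;
        for (int i = 1; i < len; i++) {
            value = (value << 8) | (buf[start + i] & 0xFF);
        }
        boolean negative = first < -120 || (first >= -112 && first < 0);
        return negative ? ~value : value;
    }

    // Skip the length prefix and decode the UTF-8 payload.
    static String decode(byte[] buf, int start) {
        int prefix = decodeVIntSize(buf[start]);
        int payload = readVInt(buf, start);
        return new String(buf, start + prefix, payload,
                java.nio.charset.StandardCharsets.UTF_8);
    }
}
```

This would also explain the dump above: a record shorter than 128 bytes gets a one-byte prefix that may render as a printable character ('I' is 0x49, i.e. length 73), while longer records take a marker byte plus a length byte, which shows up as the two-character "??".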
Re: OutputValueGroupingComparator gets strange inputs (topic changed
from "Logs cannot be created")
Posted by Björn-Elmar Macek <ma...@cs.uni-kassel.de>.
Ah, OK, I get the idea: I can use the abstract class instead of the
low-level interface, though I am not sure how to use it. It would be
nice if more complex mechanics like the sorting had an up-to-date
tutorial with some example code. If I find the time, I will write one,
since I want to prepare a presentation on Hadoop anyway.
Thanks for your help! I will try to use the abstract class.
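For reference, once the comparator receives deserialized keys, the grouping rule is plain string work. Here is a minimal sketch under stated assumptions: the CSV field positions (author at index 1, mention at index 2, retweet flag at index 6) are read off the dumped records and stand in for the real Tweet class, and the empty retweet field that caused the NumberFormatException is guarded explicitly. With a WritableComparator subclass built via super(Text.class, true), Hadoop deserializes both keys before calling compare, so this logic can run on Strings directly.

```java
// Sketch of the grouping rule on deserialized keys. The CSV layout is an
// assumption inferred from the dumped records, not the real Tweet class.
public final class GroupingKey {

    // Grouping key: the mentioned user for retweets, otherwise the author.
    static String groupingKey(String csv) {
        String[] f = csv.split(",\\s*", -1);
        String author = f[1];
        String mention = f[2];
        // Guard the empty string that triggered the NumberFormatException.
        int rt = f[6].isEmpty() ? 0 : Integer.parseInt(f[6]);
        return (rt > 0 && !mention.isEmpty()) ? mention : author;
    }

    // The comparator body: with super(Text.class, true) a WritableComparator
    // hands over deserialized Text keys, so compare() reduces to this.
    static int compare(String csv1, String csv2) {
        return groupingKey(csv1).compareTo(groupingKey(csv2));
    }
}
```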
On 09.08.2012 17:03, Björn-Elmar Macek wrote:
> Hi Bertrand,
>
> I am using RawComparator because it was used in a tutorial by some
> well-known Hadoop guy describing how to sort the input for the
> reducer. Is there an easier alternative?
>
>
>> On 09.08.2012 16:57, Bertrand Dechoux wrote:
>> I am just curious, but are you using Writable? If so, there is a
>> WritableComparator...
>> If you are going to interpret every byte (you create a String, so
>> you do), there is no clear reason to choose such a low-level API.
>>
>> Regards
>>
>> Bertrand
Re: OutputValueGroupingComparator gets strange inputs (topic changed
from "Logs cannot be created")
Posted by Bertrand Dechoux <de...@gmail.com>.
I would recommend looking at the Yahoo tutorial for more information.
Here is the part we are discussing:
http://developer.yahoo.com/hadoop/tutorial/module5.html#writable-comparator
Regards
Bertrand
On Thu, Aug 9, 2012 at 5:03 PM, Björn-Elmar Macek <ma...@cs.uni-kassel.de> wrote:
> Hi Bertrand,
>
> I am using RawComparator because it was used in a tutorial by some
> well-known Hadoop guy describing how to sort the input for the reducer. Is
> there an easier alternative?
>
--
Bertrand Dechoux
Re: OutputValueGroupingComparator gets strange inputs (topic changed
from "Logs cannot be created")
Posted by Björn-Elmar Macek <ma...@cs.uni-kassel.de>.
Ah ok, i got the idea: i can use the abstract class instead of the low
level interface, though i am not sure, how to use it. It would just be
nice, if complexer mechanics like the sorting would have an up-to-date
tutorial with some example code. If i find the time, i will make one,
since i want to make a presentation for Hadoop anyways.
Thanks for your help! I will try to use the abstract class.
Am 09.08.2012 17:03, schrieb Björn-Elmar Macek:
> Hi Bertrand,
>
> i am using RawComperator because this one was used in the tutorial of
> some famous (hadoop) guy describing how to sort the input for the
> reducer. Is there an easier alternative?
>
>
> Am 09.08.2012 16:57, schrieb Bertrand Dechoux:
>> I am just curious but are you using Writable? If so there is a
>> WritableComparator...
>> If you are going to interpret every bytes (you create a String, so
>> you do), there no clear reason for choosing such a low level API.
>>
>> Regards
>>
>> Bertrand
>>
>> On Thu, Aug 9, 2012 at 4:47 PM, Björn-Elmar Macek
>> <macek@cs.uni-kassel.de <ma...@cs.uni-kassel.de>> wrote:
>>
>> Hi again,
>>
>> this is an direct response to my previous posting with the title
>> "Logs cannot be created", where logs could not be created (Spill
>> failed). I got the hint, that i gotta check privileges, but that
>> was not the problem, because i own the folders that were used for
>> this.
>>
>> I finally found an important hint in a log saying:
>> 12/08/09 15:30:49 WARN mapred.JobClient: Error reading task
>> outputhttp://its-cs229.its.uni-kassel.de:50060/tasklog?plaintext=true&attemptid=attempt_201208091516_0001_m_000048_0&filter=stdout
>> <http://its-cs229.its.uni-kassel.de:50060/tasklog?plaintext=true&attemptid=attempt_201208091516_0001_m_000048_0&filter=stdout>
>> 12/08/09 15:30:49 WARN mapred.JobClient: Error reading task
>> outputhttp://its-cs229.its.uni-kassel.de:50060/tasklog?plaintext=true&attemptid=attempt_201208091516_0001_m_000048_0&filter=stderr
>> <http://its-cs229.its.uni-kassel.de:50060/tasklog?plaintext=true&attemptid=attempt_201208091516_0001_m_000048_0&filter=stderr>
>> 12/08/09 15:34:34 INFO mapred.JobClient: Task Id :
>> attempt_201208091516_0001_m_000055_0, Status : FAILED
>> java.io.IOException: Spill failed
>> at
>> org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(MapTask.java:1029)
>> at
>> org.apache.hadoop.mapred.MapTask$OldOutputCollector.collect(MapTask.java:592)
>> at
>> uni.kassel.macek.rtprep.RetweetMapper.map(RetweetMapper.java:26)
>> at
>> uni.kassel.macek.rtprep.RetweetMapper.map(RetweetMapper.java:12)
>> at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
>> at
>> org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:436)
>> at org.apache.hadoop.mapred.MapTask.run(MapTask.java:372)
>> at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
>> at java.security.AccessController.doPrivileged(Native Method)
>> at javax.security.auth.Subject.doAs(Subject.java:396)
>> at
>> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1093)
>> at org.apache.hadoop.mapred.Child.main(Child.java:249)
>> Caused by: java.lang.NumberFormatException: For input string: ""
>> at
>> java.lang.NumberFormatException.forInputString(NumberFormatException.java:48)
>> at java.lang.Integer.parseInt(Integer.java:468)
>> at java.lang.Integer.parseInt(Integer.java:497)
>> at uni.kassel.macek.rtprep.Tweet.getRT(Tweet.java:126)
>> at
>> uni.kassel.macek.rtprep.TwitterValueGroupingComparator.compare(TwitterValueGroupingComparator.java:47)
>> at
>> org.apache.hadoop.mapred.MapTask$MapOutputBuffer.compare(MapTask.java:1111)
>> at
>> org.apache.hadoop.util.QuickSort.sortInternal(QuickSort.java:95)
>> at org.apache.hadoop.util.QuickSort.sort(QuickSort.java:59)
>> at
>> org.apache.hadoop.mapred.MapTask$MapOutputBuffer.sortAndSpill(MapTask.java:1399)
>> at
>> org.apache.hadoop.mapred.MapTask$MapOutputBuffer.access$1800(MapTask.java:853)
>> at
>> org.apache.hadoop.mapred.MapTask$MapOutputBuffer$SpillThread.run(MapTask.java:1344)
>>
>>
>>
>> corresponding to the following lines of code within the class
>> TwitterValueGroupingComparator:
>>
>> public class TwitterValueGroupingComparator implements
>> RawComparator<Text> {
>> ...
>> public int compare(byte[] text1, int start1, int length1,
>> byte[] text2,
>> int start2, int length2) {
>>
>> byte[] tweet1 = new byte[length1];// length1-1 (???)
>> byte[] tweet2 = new byte[length2];// length1-1 (???)
>>
>> System.arraycopy(text1, start1, tweet1, 0, length1);//
>> start1+1 (???)
>> System.arraycopy(text2, start2, tweet2, 0, length2);//
>> start2+1 (???)
>>
>> Tweet atweet1 = new Tweet(new String(tweet1));
>> Tweet atweet2 = new Tweet(new String(tweet2));
>>
>>
>> String key1 = atweet1.getAuthor();
>> String key2 = atweet2.getAuthor();
>> ////////////////////////////////////////////////////////////////
>> //THE FOLLOWING LINE IS THE ONE MENTIONED IN THE LOG (47)
>> /////////////////////////////////////////////////////////////////
>> if (atweet1.getRT() > 0 && !atweet1.getMention().equals(""))
>> key1 = atweet1.getMention();
>> if (atweet2.getRT() > 0 && !atweet2.getMention().equals(""))
>> key2 = atweet2.getMention();
>>
>> int realKeyCompare = key1.compareTo(key2);
>> return realKeyCompare;
>> }
>>
>> }
>>
>> As i am taking the incoming bytes and interpret them as Tweets by
>> recreating the appropriate CSV-Strings and Tokenizing it, i was
>> kind of sure, that the problem somehow are the leading bytes,
>> that Hadoop puts in front of the data being compared. Since i
>> never really understood what hadoop is doing to the strings when
>> they are sent to the KeyComparator i simply appended all strings
>> to a file in order to see myself.
>>
>> You can see the results here:
>>
>> ??2009-06-12 07:32:47, davedilbeck, tampabaycom, , ,
>> http://tinyurl.com/nookw7, 1, 1, Lots of vivid imagery here but
>> it's mostly bullshit Alex Sink June Cleaver or Joan Crawford, null
>> ??2009-06-12 07:32:47, davedilbeck, tampabaycom, , ,
>> http://tinyurl.com/nookw7, 1, 1, Lots of vivid imagery here but
>> it's mostly bullshit Alex Sink June Cleaver or Joan Crawford, null
>> I2009-06-12 04:33:19, ntmp, tsukunep, , , , 1, 0,
>> ??????????????????, null
>> ??2009-06-12 07:32:47, davedilbeck, tampabaycom, , ,
>> http://tinyurl.com/nookw7, 1, 1, Lots of vivid imagery here but
>> it's mostly bullshit Alex Sink June Cleaver or Joan Crawford, null
>> ^2009-06-12 04:33:20, aclouatre, , , , , 0, 0, Bored out of my
>> mind Watching food network, null
>> b2009-06-12 04:33:20, djnewera, adoremii369, , , , 1, 0, LOL
>> WORDUP ANT NOTHING LIKE THE HOOD, null
>>
>>
>> As you can see, there are different leading characters: sometimes
>> it is "??", other times "b" or "^", etc.
>>
>> My question is now:
>> How many bytes do I have to cut off to get back, as a String, the
>> original Text that I put into the key position of my mapper
>> output? What is the concept behind this?
>>
>> Thanks for your help in advance!
>>
>> Best regards,
>> Elmar Macek
>>
>>
>>
>>
>>
>>
>> --
>> Bertrand Dechoux
>
>
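The varying leading characters in the dump above have a simple explanation: when a Text key is handed to a RawComparator, the byte range starts with Hadoop's variable-length integer (vint) encoding of the payload length, followed by the UTF-8 bytes. The prefix is measured in bytes, not bits, and its size depends on the length value, which is why the garbage differs per record. The sketch below re-implements the decoding logic (mirroring what org.apache.hadoop.io.WritableUtils.decodeVIntSize does) without any Hadoop dependency; the sample tweet line is illustrative only:

```java
import java.nio.charset.StandardCharsets;

public class TextKeyDecoder {

    // Mirrors the logic of WritableUtils.decodeVIntSize: the first byte
    // of a vint tells how many bytes the encoded integer occupies in total.
    static int decodeVIntSize(byte first) {
        if (first >= -112) {
            return 1;             // small lengths are stored in one byte
        } else if (first < -120) {
            return -119 - first;  // marker for a negative value, 2-9 bytes
        }
        return -111 - first;      // marker for a positive value, 2-9 bytes
    }

    // Strips the vint length prefix that Text.write() puts before the
    // UTF-8 payload, returning the original string.
    static String stripLengthPrefix(byte[] buf, int start, int length) {
        int prefix = decodeVIntSize(buf[start]);
        return new String(buf, start + prefix, length - prefix,
                          StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String tweet = "2009-06-12 04:33:20, aclouatre, , , , , 0, 0, sample";
        byte[] payload = tweet.getBytes(StandardCharsets.UTF_8);
        // For payloads shorter than 128 bytes the vint is a single byte
        // holding the length itself.
        byte[] serialized = new byte[payload.length + 1];
        serialized[0] = (byte) payload.length;
        System.arraycopy(payload, 0, serialized, 1, payload.length);

        System.out.println(
            stripLengthPrefix(serialized, 0, serialized.length));
    }
}
```

So instead of copying the full range, the comparator would first skip `decodeVIntSize(bytes[start])` bytes before interpreting the rest as UTF-8.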
Re: OutputValueGroupingComparator gets strange inputs (topic changed
from "Logs cannot be created")
Posted by Bertrand Dechoux <de...@gmail.com>.
I would recommend looking at the Yahoo tutorial for more information.
Here is the part we are discussing:
http://developer.yahoo.com/hadoop/tutorial/module5.html#writable-comparator
Regards
Bertrand
On Thu, Aug 9, 2012 at 5:03 PM, Björn-Elmar Macek <ma...@cs.uni-kassel.de>wrote:
> Hi Bertrand,
>
> I am using RawComparator because it was used in a tutorial by a
> well-known Hadoop author describing how to sort the input for the
> reducer. Is there an easier alternative?
>
>
> Am 09.08.2012 16:57, schrieb Bertrand Dechoux:
>
> I am just curious, but are you using Writable? If so, there is a
> WritableComparator...
> If you are going to interpret every byte (you create a String, so you
> do), there is no clear reason to choose such a low-level API.
>
> Regards
>
> Bertrand
>
> On Thu, Aug 9, 2012 at 4:47 PM, Björn-Elmar Macek <ma...@cs.uni-kassel.de>wrote:
>
>> Hi again,
>>
>> this is a direct response to my previous posting titled "Logs
>> cannot be created", where logs could not be created (Spill failed). I got
>> the hint to check privileges, but that was not the problem, because I
>> own the folders that were used for this.
>>
>> I finally found an important hint in a log saying:
>> 12/08/09 15:30:49 WARN mapred.JobClient: Error reading task output
>> http://its-cs229.its.uni-kassel.de:50060/tasklog?plaintext=true&attemptid=attempt_201208091516_0001_m_000048_0&filter=stdout
>> 12/08/09 15:30:49 WARN mapred.JobClient: Error reading task output
>> http://its-cs229.its.uni-kassel.de:50060/tasklog?plaintext=true&attemptid=attempt_201208091516_0001_m_000048_0&filter=stderr
>> 12/08/09 15:34:34 INFO mapred.JobClient: Task Id :
>> attempt_201208091516_0001_m_000055_0, Status : FAILED
>> java.io.IOException: Spill failed
>> at
>> org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(MapTask.java:1029)
>> at
>> org.apache.hadoop.mapred.MapTask$OldOutputCollector.collect(MapTask.java:592)
>> at
>> uni.kassel.macek.rtprep.RetweetMapper.map(RetweetMapper.java:26)
>> at
>> uni.kassel.macek.rtprep.RetweetMapper.map(RetweetMapper.java:12)
>> at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
>> at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:436)
>> at org.apache.hadoop.mapred.MapTask.run(MapTask.java:372)
>> at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
>> at java.security.AccessController.doPrivileged(Native Method)
>> at javax.security.auth.Subject.doAs(Subject.java:396)
>> at
>> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1093)
>> at org.apache.hadoop.mapred.Child.main(Child.java:249)
>> Caused by: java.lang.NumberFormatException: For input string: ""
>> at
>> java.lang.NumberFormatException.forInputString(NumberFormatException.java:48)
>> at java.lang.Integer.parseInt(Integer.java:468)
>> at java.lang.Integer.parseInt(Integer.java:497)
>> at uni.kassel.macek.rtprep.Tweet.getRT(Tweet.java:126)
>> at
>> uni.kassel.macek.rtprep.TwitterValueGroupingComparator.compare(TwitterValueGroupingComparator.java:47)
>> at
>> org.apache.hadoop.mapred.MapTask$MapOutputBuffer.compare(MapTask.java:1111)
>> at
>> org.apache.hadoop.util.QuickSort.sortInternal(QuickSort.java:95)
>> at org.apache.hadoop.util.QuickSort.sort(QuickSort.java:59)
>> at
>> org.apache.hadoop.mapred.MapTask$MapOutputBuffer.sortAndSpill(MapTask.java:1399)
>> at
>> org.apache.hadoop.mapred.MapTask$MapOutputBuffer.access$1800(MapTask.java:853)
>> at
>> org.apache.hadoop.mapred.MapTask$MapOutputBuffer$SpillThread.run(MapTask.java:1344)
>>
>>
>>
>> corresponding to the following lines of code within the class
>> TwitterValueGroupingComparator:
>>
>> public class TwitterValueGroupingComparator
>>         implements RawComparator<Text> {
>>     ...
>>     public int compare(byte[] text1, int start1, int length1,
>>                        byte[] text2, int start2, int length2) {
>>
>>         byte[] tweet1 = new byte[length1]; // length1-1 (???)
>>         byte[] tweet2 = new byte[length2]; // length2-1 (???)
>>
>>         System.arraycopy(text1, start1, tweet1, 0, length1); // start1+1 (???)
>>         System.arraycopy(text2, start2, tweet2, 0, length2); // start2+1 (???)
>>
>>         Tweet atweet1 = new Tweet(new String(tweet1));
>>         Tweet atweet2 = new Tweet(new String(tweet2));
>>
>>         String key1 = atweet1.getAuthor();
>>         String key2 = atweet2.getAuthor();
>>         ////////////////////////////////////////////////////////////////
>>         // THE FOLLOWING LINE IS THE ONE MENTIONED IN THE LOG (47)
>>         ////////////////////////////////////////////////////////////////
>>         if (atweet1.getRT() > 0 && !atweet1.getMention().equals(""))
>>             key1 = atweet1.getMention();
>>         if (atweet2.getRT() > 0 && !atweet2.getMention().equals(""))
>>             key2 = atweet2.getMention();
>>
>>         return key1.compareTo(key2);
>>     }
>> }
>>
>> Since I take the incoming bytes and interpret them as Tweets by
>> recreating the corresponding CSV strings and tokenizing them, I was fairly
>> sure the problem lies in the leading bytes that Hadoop puts in front of
>> the data being compared. As I never really understood what Hadoop does to
>> the strings before they are sent to the KeyComparator, I simply appended
>> all strings to a file to see for myself.
>>
>> You can see the results here:
>>
>> ??2009-06-12 07:32:47, davedilbeck, tampabaycom, , ,
>> http://tinyurl.com/nookw7, 1, 1, Lots of vivid imagery here but it's
>> mostly bullshit Alex Sink June Cleaver or Joan Crawford, null
>> ??2009-06-12 07:32:47, davedilbeck, tampabaycom, , ,
>> http://tinyurl.com/nookw7, 1, 1, Lots of vivid imagery here but it's
>> mostly bullshit Alex Sink June Cleaver or Joan Crawford, null
>> I2009-06-12 04:33:19, ntmp, tsukunep, , , , 1, 0, ??????????????????, null
>> ??2009-06-12 07:32:47, davedilbeck, tampabaycom, , ,
>> http://tinyurl.com/nookw7, 1, 1, Lots of vivid imagery here but it's
>> mostly bullshit Alex Sink June Cleaver or Joan Crawford, null
>> ^2009-06-12 04:33:20, aclouatre, , , , , 0, 0, Bored out of my mind
>> Watching food network, null
>> b2009-06-12 04:33:20, djnewera, adoremii369, , , , 1, 0, LOL WORDUP ANT
>> NOTHING LIKE THE HOOD, null
>>
>>
>> As you can see, there are different leading characters: sometimes it is
>> "??", other times "b" or "^", etc.
>>
>> My question is now:
>> How many bytes do I have to cut off to get back, as a String, the
>> original Text that I put into the key position of my mapper output?
>> What is the concept behind this?
>>
>> Thanks for your help in advance!
>>
>> Best regards,
>> Elmar Macek
>>
>>
>>
>>
>
>
> --
> Bertrand Dechoux
>
>
>
>
--
Bertrand Dechoux
Re: OutputValueGroupingComparator gets strange inputs (topic changed
from "Logs cannot be created")
Posted by Björn-Elmar Macek <ma...@cs.uni-kassel.de>.
Ah OK, I get the idea: I can use the abstract class instead of the low-level
interface, though I am not sure how to use it. It would be nice if more
complex mechanics like the sorting had an up-to-date tutorial with some
example code. If I find the time, I will write one, since I want to prepare
a Hadoop presentation anyway.
Thanks for your help! I will try to use the abstract class.
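What the abstract-class route buys can be sketched without Hadoop on the classpath: WritableComparator deserializes both byte ranges into typed keys before calling a typed compare, so the comparator never has to deal with the raw length prefix. The stand-in below is only a sketch under stated assumptions: the Tweet parsing, the CSV field positions (author at index 1, mention at index 2, retweet count at index 6), and the empty-field guard (which would avoid the NumberFormatException from the stack trace) are guesses inferred from the CSV lines quoted in this thread, not the original Tweet class:

```java
import java.util.Comparator;

public class GroupingSketch {

    // Minimal stand-in for the Tweet class from the thread, parsed from the
    // quoted CSV layout. Field positions are an assumption, not the original.
    static final class Tweet {
        final String author;
        final String mention;
        final int rt;

        Tweet(String csv) {
            String[] f = csv.split(",", -1);
            author  = f[1].trim();
            mention = f[2].trim();
            // Guard against the empty string that triggered the
            // NumberFormatException in the original getRT().
            String rtField = f[6].trim();
            rt = rtField.isEmpty() ? 0 : Integer.parseInt(rtField);
        }

        // Retweets group under the mentioned user, originals under the author,
        // mirroring the if-blocks in the quoted comparator.
        String groupKey() {
            return (rt > 0 && !mention.isEmpty()) ? mention : author;
        }
    }

    // With WritableComparator the framework hands over deserialized keys;
    // here a plain Comparator on already-decoded strings stands in for that.
    static final Comparator<String> GROUPING =
        Comparator.comparing((String csv) -> new Tweet(csv).groupKey());

    public static void main(String[] args) {
        String original = "2009-06-12 04:33:20, aclouatre, , , , , 0, 0, Bored, null";
        String retweet  = "2009-06-12 04:33:20, djnewera, adoremii369, , , , 1, 0, LOL, null";
        System.out.println(new Tweet(original).groupKey()); // the author
        System.out.println(new Tweet(retweet).groupKey());  // the mention
    }
}
```

The key point is that all byte-level concerns (length prefixes, offsets) stay in the framework; the subclass only expresses the grouping rule.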
Am 09.08.2012 17:03, schrieb Björn-Elmar Macek:
> Hi Bertrand,
>
> I am using RawComparator because it was used in a tutorial by a
> well-known Hadoop author describing how to sort the input for the
> reducer. Is there an easier alternative?
>
>
> Am 09.08.2012 16:57, schrieb Bertrand Dechoux:
>> I am just curious, but are you using Writable? If so, there is a
>> WritableComparator...
>> If you are going to interpret every byte (you create a String, so
>> you do), there is no clear reason to choose such a low-level API.
>>
>> Regards
>>
>> Bertrand
>>
Re: OutputValueGroupingComparator gets strange inputs (topic changed
from "Logs cannot be created")
Posted by Björn-Elmar Macek <ma...@cs.uni-kassel.de>.
Hi Bertrand,
I am using RawComparator because it was used in a tutorial by a
well-known Hadoop author describing how to sort the input for the
reducer. Is there an easier alternative?
Am 09.08.2012 16:57, schrieb Bertrand Dechoux:
> I am just curious, but are you using Writable? If so, there is a
> WritableComparator...
> If you are going to interpret every byte (you create a String, so you
> do), there is no clear reason to choose such a low-level API.
>
> Regards
>
> Bertrand
>
> On Thu, Aug 9, 2012 at 4:47 PM, Björn-Elmar Macek
> <macek@cs.uni-kassel.de <ma...@cs.uni-kassel.de>> wrote:
>
> Hi again,
>
> this is an direct response to my previous posting with the title
> "Logs cannot be created", where logs could not be created (Spill
> failed). I got the hint, that i gotta check privileges, but that
> was not the problem, because i own the folders that were used for
> this.
>
> I finally found an important hint in a log saying:
> 12/08/09 15:30:49 WARN mapred.JobClient: Error reading task
> outputhttp://its-cs229.its.uni-kassel.de:50060/tasklog?plaintext=true&attemptid=attempt_201208091516_0001_m_000048_0&filter=stdout
> <http://its-cs229.its.uni-kassel.de:50060/tasklog?plaintext=true&attemptid=attempt_201208091516_0001_m_000048_0&filter=stdout>
> 12/08/09 15:30:49 WARN mapred.JobClient: Error reading task
> outputhttp://its-cs229.its.uni-kassel.de:50060/tasklog?plaintext=true&attemptid=attempt_201208091516_0001_m_000048_0&filter=stderr
> <http://its-cs229.its.uni-kassel.de:50060/tasklog?plaintext=true&attemptid=attempt_201208091516_0001_m_000048_0&filter=stderr>
> 12/08/09 15:34:34 INFO mapred.JobClient: Task Id :
> attempt_201208091516_0001_m_000055_0, Status : FAILED
> java.io.IOException: Spill failed
> at
> org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(MapTask.java:1029)
> at
> org.apache.hadoop.mapred.MapTask$OldOutputCollector.collect(MapTask.java:592)
> at
> uni.kassel.macek.rtprep.RetweetMapper.map(RetweetMapper.java:26)
> at
> uni.kassel.macek.rtprep.RetweetMapper.map(RetweetMapper.java:12)
> at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
> at
> org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:436)
> at org.apache.hadoop.mapred.MapTask.run(MapTask.java:372)
> at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:396)
> at
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1093)
> at org.apache.hadoop.mapred.Child.main(Child.java:249)
> Caused by: java.lang.NumberFormatException: For input string: ""
> at
> java.lang.NumberFormatException.forInputString(NumberFormatException.java:48)
> at java.lang.Integer.parseInt(Integer.java:468)
> at java.lang.Integer.parseInt(Integer.java:497)
> at uni.kassel.macek.rtprep.Tweet.getRT(Tweet.java:126)
> at
> uni.kassel.macek.rtprep.TwitterValueGroupingComparator.compare(TwitterValueGroupingComparator.java:47)
> at
> org.apache.hadoop.mapred.MapTask$MapOutputBuffer.compare(MapTask.java:1111)
> at
> org.apache.hadoop.util.QuickSort.sortInternal(QuickSort.java:95)
> at org.apache.hadoop.util.QuickSort.sort(QuickSort.java:59)
> at
> org.apache.hadoop.mapred.MapTask$MapOutputBuffer.sortAndSpill(MapTask.java:1399)
> at
> org.apache.hadoop.mapred.MapTask$MapOutputBuffer.access$1800(MapTask.java:853)
> at
> org.apache.hadoop.mapred.MapTask$MapOutputBuffer$SpillThread.run(MapTask.java:1344)
>
>
>
> corresponding to the following lines of code within the class
> TwitterValueGroupingComparator:
>
> public class TwitterValueGroupingComparator implements RawComparator<Text> {
>     ...
>     public int compare(byte[] text1, int start1, int length1,
>                        byte[] text2, int start2, int length2) {
>
>         byte[] tweet1 = new byte[length1]; // length1-1 (???)
>         byte[] tweet2 = new byte[length2]; // length1-1 (???)
>
>         System.arraycopy(text1, start1, tweet1, 0, length1); // start1+1 (???)
>         System.arraycopy(text2, start2, tweet2, 0, length2); // start2+1 (???)
>
>         Tweet atweet1 = new Tweet(new String(tweet1));
>         Tweet atweet2 = new Tweet(new String(tweet2));
>
>         String key1 = atweet1.getAuthor();
>         String key2 = atweet2.getAuthor();
>         ////////////////////////////////////////////////////////////////
>         // THE FOLLOWING LINE IS THE ONE MENTIONED IN THE LOG (47)
>         ////////////////////////////////////////////////////////////////
>         if (atweet1.getRT() > 0 && !atweet1.getMention().equals(""))
>             key1 = atweet1.getMention();
>         if (atweet2.getRT() > 0 && !atweet2.getMention().equals(""))
>             key2 = atweet2.getMention();
>
>         int realKeyCompare = key1.compareTo(key2);
>         return realKeyCompare;
>     }
> }
>
> As I am taking the incoming bytes and interpreting them as Tweets by
> recreating the appropriate CSV strings and tokenizing them, I was
> fairly sure that the problem is somehow the leading bytes that
> Hadoop puts in front of the data being compared. Since I never
> really understood what Hadoop does to the strings when they
> are sent to the KeyComparator, I simply appended all strings to a
> file in order to see for myself.
>
> You can see the results here:
>
> ??2009-06-12 07:32:47, davedilbeck, tampabaycom, , ,
> http://tinyurl.com/nookw7, 1, 1, Lots of vivid imagery here but
> it's mostly bullshit Alex Sink June Cleaver or Joan Crawford, null
> ??2009-06-12 07:32:47, davedilbeck, tampabaycom, , ,
> http://tinyurl.com/nookw7, 1, 1, Lots of vivid imagery here but
> it's mostly bullshit Alex Sink June Cleaver or Joan Crawford, null
> I2009-06-12 04:33:19, ntmp, tsukunep, , , , 1, 0,
> ??????????????????, null
> ??2009-06-12 07:32:47, davedilbeck, tampabaycom, , ,
> http://tinyurl.com/nookw7, 1, 1, Lots of vivid imagery here but
> it's mostly bullshit Alex Sink June Cleaver or Joan Crawford, null
> ^2009-06-12 04:33:20, aclouatre, , , , , 0, 0, Bored out of my
> mind Watching food network, null
> b2009-06-12 04:33:20, djnewera, adoremii369, , , , 1, 0, LOL
> WORDUP ANT NOTHING LIKE THE HOOD, null
>
>
> As you can see, there are different leading characters: sometimes
> it's "??", other times it's "b" or "^", etc.
>
> My question is now:
> How many bytes do I have to cut off so that I get the original Text
> as a String that I put into the key position of my mapper output?
> What are the concepts behind this?
>
> Thanks for your help in advance!
>
> Best regards,
> Elmar Macek
>
>
>
>
>
>
> --
> Bertrand Dechoux
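The leading characters asked about here are, with high confidence, Hadoop's Text length prefix: Text.write() emits the string length as a variable-length integer (vint) before the UTF-8 bytes, and a RawComparator receives those raw serialized bytes. The sketch below is plain Java that simulates the single-byte prefix Hadoop uses for strings shorter than 128 bytes (no Hadoop dependency; with Hadoop on the classpath you would compute the prefix size with WritableUtils.decodeVIntSize(bytes[start]) instead of hard-coding 1):

```java
import java.nio.charset.StandardCharsets;

public class VintPrefixDemo {
    // Simulate how a short Text key arrives at a RawComparator: Hadoop's
    // vint encoding stores lengths below 128 in a single byte, followed by
    // the UTF-8 payload. A 98-byte record therefore "starts" with byte 98,
    // which prints as the letter 'b'.
    static byte[] serializeShortText(String s) {
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        byte[] out = new byte[utf8.length + 1];
        out[0] = (byte) utf8.length;           // the "strange" leading byte
        System.arraycopy(utf8, 0, out, 1, utf8.length);
        return out;
    }

    // Strip the prefix before interpreting the payload as a String. With
    // Hadoop on the classpath, compute the prefix size with
    // WritableUtils.decodeVIntSize(bytes[start]) instead of hard-coding 1.
    static String decodeShortText(byte[] bytes, int start, int length) {
        int prefix = 1;                        // one byte for lengths < 128
        return new String(bytes, start + prefix, length - prefix,
                StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String original = "2009-06-12 04:33:20, aclouatre, , , , , 0, 0";
        byte[] raw = serializeShortText(original);
        System.out.println(decodeShortText(raw, 0, raw.length));
    }
}
```

This also explains why different records appear to start with different garbage characters: the first byte is the record's length, so records of different lengths render as different characters.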
Re: OutputValueGroupingComparator gets strange inputs (topic changed
from "Logs cannot be created")
Posted by Björn-Elmar Macek <ma...@cs.uni-kassel.de>.
Hi Bertrand,
I am using a RawComparator because it was used in the tutorial of a
well-known Hadoop author describing how to sort the input for the
reducer. Is there an easier alternative?
On 09.08.2012 16:57, Bertrand Dechoux wrote:
> I am just curious, but are you using Writable? If so, there is a
> WritableComparator...
> If you are going to interpret every byte (you create a String, so you
> do), there is no clear reason for choosing such a low-level API.
>
> Regards
>
> Bertrand
>
> On Thu, Aug 9, 2012 at 4:47 PM, Björn-Elmar Macek
> <macek@cs.uni-kassel.de> wrote:
>
> Hi again,
>
> this is a direct response to my previous posting with the title
> "Logs cannot be created", where logs could not be created (Spill
> failed). I got the hint that I should check the privileges, but that
> was not the problem, because I own the folders that were used for
> this.
>
> I finally found an important hint in a log saying:
> 12/08/09 15:30:49 WARN mapred.JobClient: Error reading task output http://its-cs229.its.uni-kassel.de:50060/tasklog?plaintext=true&attemptid=attempt_201208091516_0001_m_000048_0&filter=stdout
> 12/08/09 15:30:49 WARN mapred.JobClient: Error reading task output http://its-cs229.its.uni-kassel.de:50060/tasklog?plaintext=true&attemptid=attempt_201208091516_0001_m_000048_0&filter=stderr
> 12/08/09 15:34:34 INFO mapred.JobClient: Task Id : attempt_201208091516_0001_m_000055_0, Status : FAILED
> java.io.IOException: Spill failed
> at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(MapTask.java:1029)
> at org.apache.hadoop.mapred.MapTask$OldOutputCollector.collect(MapTask.java:592)
> at uni.kassel.macek.rtprep.RetweetMapper.map(RetweetMapper.java:26)
> at uni.kassel.macek.rtprep.RetweetMapper.map(RetweetMapper.java:12)
> at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
> at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:436)
> at org.apache.hadoop.mapred.MapTask.run(MapTask.java:372)
> at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:396)
> at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1093)
> at org.apache.hadoop.mapred.Child.main(Child.java:249)
> Caused by: java.lang.NumberFormatException: For input string: ""
> at java.lang.NumberFormatException.forInputString(NumberFormatException.java:48)
> at java.lang.Integer.parseInt(Integer.java:468)
> at java.lang.Integer.parseInt(Integer.java:497)
> at uni.kassel.macek.rtprep.Tweet.getRT(Tweet.java:126)
> at uni.kassel.macek.rtprep.TwitterValueGroupingComparator.compare(TwitterValueGroupingComparator.java:47)
> at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.compare(MapTask.java:1111)
> at org.apache.hadoop.util.QuickSort.sortInternal(QuickSort.java:95)
> at org.apache.hadoop.util.QuickSort.sort(QuickSort.java:59)
> at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.sortAndSpill(MapTask.java:1399)
> at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.access$1800(MapTask.java:853)
> at org.apache.hadoop.mapred.MapTask$MapOutputBuffer$SpillThread.run(MapTask.java:1344)
>
>
>
> corresponding to the following lines of code within the class
> TwitterValueGroupingComparator:
>
> public class TwitterValueGroupingComparator implements RawComparator<Text> {
>     ...
>     public int compare(byte[] text1, int start1, int length1,
>                        byte[] text2, int start2, int length2) {
>
>         byte[] tweet1 = new byte[length1]; // length1-1 (???)
>         byte[] tweet2 = new byte[length2]; // length1-1 (???)
>
>         System.arraycopy(text1, start1, tweet1, 0, length1); // start1+1 (???)
>         System.arraycopy(text2, start2, tweet2, 0, length2); // start2+1 (???)
>
>         Tweet atweet1 = new Tweet(new String(tweet1));
>         Tweet atweet2 = new Tweet(new String(tweet2));
>
>         String key1 = atweet1.getAuthor();
>         String key2 = atweet2.getAuthor();
>         ////////////////////////////////////////////////////////////////
>         // THE FOLLOWING LINE IS THE ONE MENTIONED IN THE LOG (47)
>         ////////////////////////////////////////////////////////////////
>         if (atweet1.getRT() > 0 && !atweet1.getMention().equals(""))
>             key1 = atweet1.getMention();
>         if (atweet2.getRT() > 0 && !atweet2.getMention().equals(""))
>             key2 = atweet2.getMention();
>
>         int realKeyCompare = key1.compareTo(key2);
>         return realKeyCompare;
>     }
> }
>
> As I am taking the incoming bytes and interpreting them as Tweets by
> recreating the appropriate CSV strings and tokenizing them, I was
> fairly sure that the problem is somehow the leading bytes that
> Hadoop puts in front of the data being compared. Since I never
> really understood what Hadoop does to the strings when they
> are sent to the KeyComparator, I simply appended all strings to a
> file in order to see for myself.
>
> You can see the results here:
>
> ??2009-06-12 07:32:47, davedilbeck, tampabaycom, , ,
> http://tinyurl.com/nookw7, 1, 1, Lots of vivid imagery here but
> it's mostly bullshit Alex Sink June Cleaver or Joan Crawford, null
> ??2009-06-12 07:32:47, davedilbeck, tampabaycom, , ,
> http://tinyurl.com/nookw7, 1, 1, Lots of vivid imagery here but
> it's mostly bullshit Alex Sink June Cleaver or Joan Crawford, null
> I2009-06-12 04:33:19, ntmp, tsukunep, , , , 1, 0,
> ??????????????????, null
> ??2009-06-12 07:32:47, davedilbeck, tampabaycom, , ,
> http://tinyurl.com/nookw7, 1, 1, Lots of vivid imagery here but
> it's mostly bullshit Alex Sink June Cleaver or Joan Crawford, null
> ^2009-06-12 04:33:20, aclouatre, , , , , 0, 0, Bored out of my
> mind Watching food network, null
> b2009-06-12 04:33:20, djnewera, adoremii369, , , , 1, 0, LOL
> WORDUP ANT NOTHING LIKE THE HOOD, null
>
>
> As you can see, there are different leading characters: sometimes
> it's "??", other times it's "b" or "^", etc.
>
> My question is now:
> How many bytes do I have to cut off so that I get the original Text
> as a String that I put into the key position of my mapper output?
> What are the concepts behind this?
>
> Thanks for your help in advance!
>
> Best regards,
> Elmar Macek
>
>
>
>
>
>
> --
> Bertrand Dechoux
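The immediate exception in the trace above, Integer.parseInt("") inside Tweet.getRT(), is a symptom: once the comparator parses bytes that still carry the serialization prefix, the CSV fields shift and some come out empty. A defensive parse keeps a malformed record from killing the whole spill; this is only a sketch, and parseIntOrDefault is a made-up helper name, not part of the posted code:

```java
public class SafeParse {
    // Hypothetical helper (not in the posted code): parse a CSV field that
    // may be empty or corrupted, falling back to a default instead of
    // throwing the NumberFormatException that Integer.parseInt("") raises.
    static int parseIntOrDefault(String field, int fallback) {
        if (field == null) {
            return fallback;
        }
        String trimmed = field.trim();
        if (trimmed.isEmpty()) {
            return fallback;
        }
        try {
            return Integer.parseInt(trimmed);
        } catch (NumberFormatException e) {
            return fallback;    // e.g. a field polluted by prefix bytes
        }
    }

    public static void main(String[] args) {
        System.out.println(parseIntOrDefault("1", 0));   // prints 1
        System.out.println(parseIntOrDefault("", 0));    // prints 0
    }
}
```

Note that this only masks the symptom; the grouping comparator still needs to strip the length prefix (or operate on deserialized Text) for the keys to be compared correctly.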
Re: OutputValueGroupingComparator gets strange inputs (topic changed
from "Logs cannot be created")
Posted by Bertrand Dechoux <de...@gmail.com>.
I am just curious, but are you using Writable? If so, there is a
WritableComparator...
If you are going to interpret every byte (you create a String, so you do),
there is no clear reason for choosing such a low-level API.
Regards
Bertrand
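To make this suggestion concrete: once the keys are deserialized, the grouping rule itself is just an ordinary comparator. The sketch below restates the posted logic on plain Java objects; Tweet here is a minimal stand-in with only the fields the comparator uses, not the poster's actual class:

```java
import java.util.Comparator;

public class GroupingSketch {
    // Minimal stand-in for the poster's Tweet class: only the fields
    // the grouping comparator consults (the real class parses a CSV line).
    static final class Tweet {
        final String author;
        final String mention;
        final int rt;
        Tweet(String author, String mention, int rt) {
            this.author = author;
            this.mention = mention;
            this.rt = rt;
        }
    }

    // Same rule as the posted compare(): retweets group under the
    // mentioned user, everything else under the author.
    static final Comparator<Tweet> BY_GROUP_KEY = Comparator.comparing(
            t -> (t.rt > 0 && !t.mention.isEmpty()) ? t.mention : t.author);

    public static void main(String[] args) {
        Tweet retweet = new Tweet("djnewera", "adoremii369", 1);
        Tweet original = new Tweet("aclouatre", "", 0);
        System.out.println(BY_GROUP_KEY.compare(retweet, retweet)); // 0
        System.out.println(BY_GROUP_KEY.compare(original, retweet) < 0);
    }
}
```

In Hadoop, this logic would live in a WritableComparator subclass whose object-level compare() receives already-deserialized keys, so no manual byte surgery is needed.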
On Thu, Aug 9, 2012 at 4:47 PM, Björn-Elmar Macek <ma...@cs.uni-kassel.de> wrote:
> Hi again,
>
> this is a direct response to my previous posting with the title "Logs
> cannot be created", where logs could not be created (Spill failed). I got
> the hint that I should check the privileges, but that was not the problem,
> because I own the folders that were used for this.
>
> I finally found an important hint in a log saying:
> 12/08/09 15:30:49 WARN mapred.JobClient: Error reading task output http://its-cs229.its.uni-kassel.de:50060/tasklog?plaintext=true&attemptid=attempt_201208091516_0001_m_000048_0&filter=stdout
> 12/08/09 15:30:49 WARN mapred.JobClient: Error reading task output http://its-cs229.its.uni-kassel.de:50060/tasklog?plaintext=true&attemptid=attempt_201208091516_0001_m_000048_0&filter=stderr
> 12/08/09 15:34:34 INFO mapred.JobClient: Task Id : attempt_201208091516_0001_m_000055_0, Status : FAILED
> java.io.IOException: Spill failed
> at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(MapTask.java:1029)
> at org.apache.hadoop.mapred.MapTask$OldOutputCollector.collect(MapTask.java:592)
> at uni.kassel.macek.rtprep.RetweetMapper.map(RetweetMapper.java:26)
> at uni.kassel.macek.rtprep.RetweetMapper.map(RetweetMapper.java:12)
> at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
> at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:436)
> at org.apache.hadoop.mapred.MapTask.run(MapTask.java:372)
> at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:396)
> at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1093)
> at org.apache.hadoop.mapred.Child.main(Child.java:249)
> Caused by: java.lang.NumberFormatException: For input string: ""
> at java.lang.NumberFormatException.forInputString(NumberFormatException.java:48)
> at java.lang.Integer.parseInt(Integer.java:468)
> at java.lang.Integer.parseInt(Integer.java:497)
> at uni.kassel.macek.rtprep.Tweet.getRT(Tweet.java:126)
> at uni.kassel.macek.rtprep.TwitterValueGroupingComparator.compare(TwitterValueGroupingComparator.java:47)
> at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.compare(MapTask.java:1111)
> at org.apache.hadoop.util.QuickSort.sortInternal(QuickSort.java:95)
> at org.apache.hadoop.util.QuickSort.sort(QuickSort.java:59)
> at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.sortAndSpill(MapTask.java:1399)
> at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.access$1800(MapTask.java:853)
> at org.apache.hadoop.mapred.MapTask$MapOutputBuffer$SpillThread.run(MapTask.java:1344)
>
>
>
> corresponding to the following lines of code within the class
> TwitterValueGroupingComparator:
>
> public class TwitterValueGroupingComparator implements RawComparator<Text> {
>     ...
>     public int compare(byte[] text1, int start1, int length1,
>                        byte[] text2, int start2, int length2) {
>
>         byte[] tweet1 = new byte[length1]; // length1-1 (???)
>         byte[] tweet2 = new byte[length2]; // length1-1 (???)
>
>         System.arraycopy(text1, start1, tweet1, 0, length1); // start1+1 (???)
>         System.arraycopy(text2, start2, tweet2, 0, length2); // start2+1 (???)
>
>         Tweet atweet1 = new Tweet(new String(tweet1));
>         Tweet atweet2 = new Tweet(new String(tweet2));
>
>         String key1 = atweet1.getAuthor();
>         String key2 = atweet2.getAuthor();
>         ////////////////////////////////////////////////////////////////
>         // THE FOLLOWING LINE IS THE ONE MENTIONED IN THE LOG (47)
>         ////////////////////////////////////////////////////////////////
>         if (atweet1.getRT() > 0 && !atweet1.getMention().equals(""))
>             key1 = atweet1.getMention();
>         if (atweet2.getRT() > 0 && !atweet2.getMention().equals(""))
>             key2 = atweet2.getMention();
>
>         int realKeyCompare = key1.compareTo(key2);
>         return realKeyCompare;
>     }
> }
>
> As I am taking the incoming bytes and interpreting them as Tweets by
> recreating the appropriate CSV strings and tokenizing them, I was fairly
> sure that the problem is somehow the leading bytes that Hadoop puts in
> front of the data being compared. Since I never really understood what
> Hadoop does to the strings when they are sent to the KeyComparator, I
> simply appended all strings to a file in order to see for myself.
>
> You can see the results here:
>
> ??2009-06-12 07:32:47, davedilbeck, tampabaycom, , ,
> http://tinyurl.com/nookw7, 1, 1, Lots of vivid imagery here but it's
> mostly bullshit Alex Sink June Cleaver or Joan Crawford, null
> ??2009-06-12 07:32:47, davedilbeck, tampabaycom, , ,
> http://tinyurl.com/nookw7, 1, 1, Lots of vivid imagery here but it's
> mostly bullshit Alex Sink June Cleaver or Joan Crawford, null
> I2009-06-12 04:33:19, ntmp, tsukunep, , , , 1, 0, ??????????????????, null
> ??2009-06-12 07:32:47, davedilbeck, tampabaycom, , ,
> http://tinyurl.com/nookw7, 1, 1, Lots of vivid imagery here but it's
> mostly bullshit Alex Sink June Cleaver or Joan Crawford, null
> ^2009-06-12 04:33:20, aclouatre, , , , , 0, 0, Bored out of my mind
> Watching food network, null
> b2009-06-12 04:33:20, djnewera, adoremii369, , , , 1, 0, LOL WORDUP ANT
> NOTHING LIKE THE HOOD, null
>
>
> As you can see, there are different leading characters: sometimes it's
> "??", other times it's "b" or "^", etc.
>
> My question is now:
> How many bytes do I have to cut off so that I get the original Text as a
> String that I put into the key position of my mapper output? What are the
> concepts behind this?
>
> Thanks for your help in advance!
>
> Best regards,
> Elmar Macek
>
>
>
>
--
Bertrand Dechoux
Re: OutputValueGroupingComparator gets strange inputs (topic changed
from "Logs cannot be created")
Posted by Bertrand Dechoux <de...@gmail.com>.
I am just curious but are you using Writable? If so there is a
WritableComparator...
If you are going to interpret every bytes (you create a String, so you do),
there no clear reason for choosing such a low level API.
Regards
Bertrand
On Thu, Aug 9, 2012 at 4:47 PM, Björn-Elmar Macek <ma...@cs.uni-kassel.de>wrote:
> Hi again,
>
> this is an direct response to my previous posting with the title "Logs
> cannot be created", where logs could not be created (Spill failed). I got
> the hint, that i gotta check privileges, but that was not the problem,
> because i own the folders that were used for this.
>
> I finally found an important hint in a log saying:
> 12/08/09 15:30:49 WARN mapred.JobClient: Error reading task outputhttp://
> its-cs229.its.**uni-kassel.de:50060/tasklog?**plaintext=true&attemptid=**
> attempt_201208091516_0001_m_**000048_0&filter=stdout<http://its-cs229.its.uni-kassel.de:50060/tasklog?plaintext=true&attemptid=attempt_201208091516_0001_m_000048_0&filter=stdout>
> 12/08/09 15:30:49 WARN mapred.JobClient: Error reading task outputhttp://
> its-cs229.its.**uni-kassel.de:50060/tasklog?**plaintext=true&attemptid=**
> attempt_201208091516_0001_m_**000048_0&filter=stderr<http://its-cs229.its.uni-kassel.de:50060/tasklog?plaintext=true&attemptid=attempt_201208091516_0001_m_000048_0&filter=stderr>
> 12/08/09 15:34:34 INFO mapred.JobClient: Task Id :
> attempt_201208091516_0001_m_**000055_0, Status : FAILED
> java.io.IOException: Spill failed
> at org.apache.hadoop.mapred.**MapTask$MapOutputBuffer.**
> collect(MapTask.java:1029)
> at org.apache.hadoop.mapred.**MapTask$OldOutputCollector.**
> collect(MapTask.java:592)
> at uni.kassel.macek.rtprep.**RetweetMapper.map(**
> RetweetMapper.java:26)
> at uni.kassel.macek.rtprep.**RetweetMapper.map(**
> RetweetMapper.java:12)
> at org.apache.hadoop.mapred.**MapRunner.run(MapRunner.java:**50)
> at org.apache.hadoop.mapred.**MapTask.runOldMapper(MapTask.**
> java:436)
> at org.apache.hadoop.mapred.**MapTask.run(MapTask.java:372)
> at org.apache.hadoop.mapred.**Child$4.run(Child.java:255)
> at java.security.**AccessController.doPrivileged(**Native Method)
> at javax.security.auth.Subject.**doAs(Subject.java:396)
> at org.apache.hadoop.security.**UserGroupInformation.doAs(**
> UserGroupInformation.java:**1093)
> at org.apache.hadoop.mapred.**Child.main(Child.java:249)
> Caused by: java.lang.**NumberFormatException: For input string: ""
> at java.lang.**NumberFormatException.**forInputString(**
> NumberFormatException.java:48)
> at java.lang.Integer.parseInt(**Integer.java:468)
> at java.lang.Integer.parseInt(**Integer.java:497)
> at uni.kassel.macek.rtprep.Tweet.**getRT(Tweet.java:126)
> at uni.kassel.macek.rtprep.**TwitterValueGroupingComparator**
> .compare(**TwitterValueGroupingComparator**.java:47)
> at org.apache.hadoop.mapred.**MapTask$MapOutputBuffer.**
> compare(MapTask.java:1111)
> at org.apache.hadoop.util.**QuickSort.sortInternal(**
> QuickSort.java:95)
> at org.apache.hadoop.util.**QuickSort.sort(QuickSort.java:**59)
> at org.apache.hadoop.mapred.**MapTask$MapOutputBuffer.**
> sortAndSpill(MapTask.java:**1399)
> at org.apache.hadoop.mapred.**MapTask$MapOutputBuffer.**
> access$1800(MapTask.java:853)
> at org.apache.hadoop.mapred.**MapTask$MapOutputBuffer$**
> SpillThread.run(MapTask.java:**1344)
>
>
>
> corresponding to the following lines of code within the class
> TwitterValueGroupingComparator**:
>
> public class TwitterValueGroupingComparator implements RawComparator<Text> {
>     ...
>     public int compare(byte[] text1, int start1, int length1,
>                        byte[] text2, int start2, int length2) {
>
>         byte[] tweet1 = new byte[length1]; // length1 - 1 (???)
>         byte[] tweet2 = new byte[length2]; // length2 - 1 (???)
>
>         System.arraycopy(text1, start1, tweet1, 0, length1); // start1 + 1 (???)
>         System.arraycopy(text2, start2, tweet2, 0, length2); // start2 + 1 (???)
>
>         Tweet atweet1 = new Tweet(new String(tweet1));
>         Tweet atweet2 = new Tweet(new String(tweet2));
>
>         String key1 = atweet1.getAuthor();
>         String key2 = atweet2.getAuthor();
>
>         ////////////////////////////////////////////////////////////////
>         // THE FOLLOWING LINE IS THE ONE MENTIONED IN THE LOG (LINE 47)
>         ////////////////////////////////////////////////////////////////
>         if (atweet1.getRT() > 0 && !atweet1.getMention().equals(""))
>             key1 = atweet1.getMention();
>         if (atweet2.getRT() > 0 && !atweet2.getMention().equals(""))
>             key2 = atweet2.getMention();
>
>         return key1.compareTo(key2);
>     }
> }
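Aside from the byte-offset question, the immediate exception is Integer.parseInt("") inside Tweet.getRT: the retweet-count field comes out empty once the record is mis-framed. A hypothetical defensive parse (helper and field names are assumed here; the real Tweet source is not shown in this thread) would keep an empty field from aborting the spill thread:

```java
// Hypothetical defensive replacement for the parse inside Tweet.getRT:
// an empty or blank retweet-count field becomes 0 instead of throwing
// NumberFormatException out of the comparator (which kills the spill).
public class SafeParse {

    public static int parseRetweetCount(String field) {
        if (field == null) return 0;
        String s = field.trim();
        return s.isEmpty() ? 0 : Integer.parseInt(s);
    }

    public static void main(String[] args) {
        System.out.println(parseRetweetCount("3"));  // 3
        System.out.println(parseRetweetCount(""));   // 0
        System.out.println(parseRetweetCount(" "));  // 0
    }
}
```

This only masks the symptom, of course; the root cause is still the framing of the raw key bytes.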
>
> As I take the incoming bytes and interpret them as Tweets by recreating
> the appropriate CSV strings and tokenizing them, I was fairly sure that
> the problem is somehow the leading bytes that Hadoop puts in front of the
> data being compared. Since I never really understood what Hadoop does to
> the strings when they are sent to the key comparator, I simply appended
> all strings to a file in order to see for myself.
>
> You can see the results here:
>
> ??2009-06-12 07:32:47, davedilbeck, tampabaycom, , ,
> http://tinyurl.com/nookw7, 1, 1, Lots of vivid imagery here but it's
> mostly bullshit Alex Sink June Cleaver or Joan Crawford, null
> ??2009-06-12 07:32:47, davedilbeck, tampabaycom, , ,
> http://tinyurl.com/nookw7, 1, 1, Lots of vivid imagery here but it's
> mostly bullshit Alex Sink June Cleaver or Joan Crawford, null
> I2009-06-12 04:33:19, ntmp, tsukunep, , , , 1, 0, ??????????????????, null
> ??2009-06-12 07:32:47, davedilbeck, tampabaycom, , ,
> http://tinyurl.com/nookw7, 1, 1, Lots of vivid imagery here but it's
> mostly bullshit Alex Sink June Cleaver or Joan Crawford, null
> ^2009-06-12 04:33:20, aclouatre, , , , , 0, 0, Bored out of my mind
> Watching food network, null
> b2009-06-12 04:33:20, djnewera, adoremii369, , , , 1, 0, LOL WORDUP ANT
> NOTHING LIKE THE HOOD, null
>
>
> As you can see, there are different leading characters: sometimes it is
> "??", other times "b" or "^", etc.
>
> My question is now:
> How many bytes do I have to cut off to get back the original Text, as the
> String that I put into the key position of my mapper output? What are the
> concepts behind this?
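For what it is worth, the leading bytes asked about here are Hadoop's variable-length ("vint") length prefix, which Text writes in front of its UTF-8 payload; the raw comparator sees the serialized key including that prefix. The sketch below mirrors the logic of Hadoop's WritableUtils.decodeVIntSize so that it runs without Hadoop on the classpath; the class and method names are invented for illustration.

```java
// A serialized Text value is laid out as: vint-encoded byte length, then the
// UTF-8 payload. To recover the original string, skip the vint first.
public class VIntSkipSketch {

    // Number of bytes the vint occupies, given its first byte
    // (same rule as Hadoop's WritableUtils.decodeVIntSize).
    public static int vintSize(byte firstByte) {
        if (firstByte >= -112) return 1;               // value stored inline
        if (firstByte < -120) return -119 - firstByte; // negative, multi-byte
        return -111 - firstByte;                       // positive, multi-byte
    }

    // Recover the original string from a serialized Text record:
    // skip the vint length prefix, decode the UTF-8 payload.
    public static String stripLengthPrefix(byte[] buf, int start, int length) {
        int skip = vintSize(buf[start]);
        return new String(buf, start + skip, length - skip,
                          java.nio.charset.StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Hand-built record: one-byte vint (5) followed by "tweet".
        byte[] record = new byte[] {5, 't', 'w', 'e', 'e', 't'};
        System.out.println(vintSize(record[0]));                          // 1
        System.out.println(stripLengthPrefix(record, 0, record.length));  // tweet
    }
}
```

With Hadoop on the classpath, WritableUtils.decodeVIntSize(bytes[start]) returns the same prefix size directly, which is what Text's own raw comparator uses.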
>
> Thanks for your help in advance!
>
> Best regards,
> Elmar Macek
>
>
>
>
--
Bertrand Dechoux
Re: OutputValueGroupingComparator gets strange inputs (topic changed
from "Logs cannot be created")
Posted by Bertrand Dechoux <de...@gmail.com>.
I am just curious, but are you using Writable? If so, there is a
WritableComparator...
If you are going to interpret every byte (you create a String, so you do),
there is no clear reason for choosing such a low-level API.
Regards
Bertrand
--
Bertrand Dechoux
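The WritableComparator route Bertrand mentions would look roughly like the sketch below. This is untested, written against the old org.apache.hadoop.mapred API that the stack trace shows, and it assumes a Tweet can be built from the key's string form exactly as in the original comparator:

import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;

// Letting WritableComparator deserialize the keys means the raw bytes
// (vint length prefix included) never have to be handled by hand.
public class TweetGroupingComparator extends WritableComparator {

    protected TweetGroupingComparator() {
        // "true" asks the base class to create Text instances and
        // deserialize into them before calling compare() below.
        super(Text.class, true);
    }

    @Override
    @SuppressWarnings("rawtypes")
    public int compare(WritableComparable a, WritableComparable b) {
        Tweet t1 = new Tweet(a.toString());
        Tweet t2 = new Tweet(b.toString());

        String key1 = t1.getAuthor();
        String key2 = t2.getAuthor();
        if (t1.getRT() > 0 && !t1.getMention().equals("")) key1 = t1.getMention();
        if (t2.getRT() > 0 && !t2.getMention().equals("")) key2 = t2.getMention();
        return key1.compareTo(key2);
    }
}

// Wired up on the JobConf:
// conf.setOutputValueGroupingComparator(TweetGroupingComparator.class);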