Posted to mapreduce-user@hadoop.apache.org by jamal sasha <ja...@gmail.com> on 2013/05/29 23:54:18 UTC

Reading json format input

Hi,
   I am stuck again. :(
My input data is in HDFS. I am again trying to do wordcount, but there is a
slight difference.
The data is in JSON format.
So each line of data is:

{"author":"foo", "text": "hello"}
{"author":"foo123", "text": "hello world"}
{"author":"foo234", "text": "hello this world"}

So I want to do wordcount for the "text" part.
I understand that in the mapper I just have to parse this data as JSON and
extract "text", and the rest of the code is just the same, but I am trying to
switch from Python to Java Hadoop.
How do I do this?
Thanks
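
A minimal sketch of such a mapper, assuming the org.json library that comes up
later in this thread (the class name is illustrative and the driver is omitted):

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.json.JSONException;
import org.json.JSONObject;

public class JsonWordCountMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        try {
            // Each input line is one JSON object; pull out the "text" field.
            JSONObject json = new JSONObject(value.toString());
            String text = json.getString("text");
            // From here on it is the standard wordcount mapper.
            StringTokenizer tokens = new StringTokenizer(text);
            while (tokens.hasMoreTokens()) {
                word.set(tokens.nextToken());
                context.write(word, ONE);
            }
        } catch (JSONException e) {
            // Skip malformed lines instead of failing the whole task.
        }
    }
}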

Re: Reading json format input

Posted by Michael Segel <mi...@hotmail.com>.
You have the entire string. 
If you tokenize on commas ... 

Starting with:
>> {"author":"foo", "text": "hello"}
>> {"author":"foo123", "text": "hello world"}
>> {"author":"foo234", "text": "hello this world"}

You end up with two tokens:
{"author":"foo"    and    "text": "hello"}

So you can ignore the first token and split the second token on the colon (':').

This gives you "text" and "hello"}

You can again ignore the first token and you now have "hello"}

And now you can parse out the stuff within the quotes. 
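
A minimal sketch of this split-based approach (deliberately naive: it assumes
exactly the two-field layout above, with no commas, colons, or escaped quotes
inside the values):

String line = "{\"author\":\"foo\", \"text\": \"hello world\"}";
String[] fields = line.split(",", 2);   // {"author":"foo"  /  "text": "hello world"}
String[] kv = fields[1].split(":", 2);  // "text"  /  "hello world"}
String raw = kv[1].trim();              // "hello world"}
// Take everything between the first and last double quote.
String text = raw.substring(raw.indexOf('"') + 1, raw.lastIndexOf('"'));
System.out.println(text);               // hello world

For anything beyond this fixed layout, a real JSON parser (as discussed
elsewhere in the thread) is the safer choice.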

HTH


On May 29, 2013, at 6:44 PM, jamal sasha <ja...@gmail.com> wrote:

> Hi,
>   For some reason, this has to be in Java :(
> I am trying to use the org.json library, something like (in the mapper):
> JSONObject jsn = new JSONObject(value.toString());
> 
> String text = (String) jsn.get("text");
> StringTokenizer itr = new StringTokenizer(text);
> 
> But it's not working :(
> It would be better to get this thing working properly, but I wouldn't mind using a hack as well :)
> 
> 
> On Wed, May 29, 2013 at 4:30 PM, Michael Segel <mi...@hotmail.com> wrote:
> Yeah, 
> I have to agree w Russell. Pig is definitely the way to go on this. 
> 
> If you want to do it as a Java program you will have to do some work on the input string but it too should be trivial. 
> How formal do you want to go? 
> Do you want to strip it down or just find the quote after the text part? 
> 
> 
> On May 29, 2013, at 5:13 PM, Russell Jurney <ru...@gmail.com> wrote:
> 
>> Seriously consider Pig (free answer, 4 LOC):
>> 
>> my_data = LOAD 'my_data.json' USING com.twitter.elephantbird.pig.load.JsonLoader() AS json:map[];
>> words = FOREACH my_data GENERATE $0#'author' as author, FLATTEN(TOKENIZE($0#'text')) as word;
>> word_counts = FOREACH (GROUP words BY word) GENERATE group AS word, COUNT_STAR(words) AS word_count;
>> STORE word_counts INTO '/tmp/word_counts.txt';
>> 
>> It will be faster than the Java you'll likely write.
>> 
>> 
>> On Wed, May 29, 2013 at 2:54 PM, jamal sasha <ja...@gmail.com> wrote:
>> Hi,
>>    I am stuck again. :(
>> My input data is in hdfs. I am again trying to do wordcount but there is slight difference.
>> The data is in json format.
>> So each line of data is:
>> 
>> {"author":"foo", "text": "hello"}
>> {"author":"foo123", "text": "hello world"}
>> {"author":"foo234", "text": "hello this world"}
>> 
>> So I want to do wordcount for text part.
>> I understand that in mapper, I just have to pass this data as json and extract "text" and rest of the code is just the same but I am trying to switch from python to java hadoop. 
>> How do I do this.
>> Thanks
>> 
>> 
>> 
>> -- 
>> Russell Jurney twitter.com/rjurney russell.jurney@gmail.com datasyndrome.com
> 
> 


Re: Reading json format input

Posted by jamal sasha <ja...@gmail.com>.
OK, got this thing working.
Turns out that -libjars should be mentioned before specifying the HDFS input
and output paths, rather than after them.
:-/
Thanks everyone.
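
For context: -libjars is handled by Hadoop's GenericOptionsParser, which
expects the generic options to come before the job's own arguments and which
only runs when the driver goes through ToolRunner. A minimal sketch of such a
driver (class names and paths are hypothetical), invoked as:
hadoop jar wordcount.jar JsonWordCount -libjars /path/to/external.jar <input> <output>

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class JsonWordCount extends Configured implements Tool {

    @Override
    public int run(String[] args) throws Exception {
        // By the time run() is called, ToolRunner has already stripped the
        // generic options (-libjars, -D, -files) out of args, so args[0]
        // and args[1] are the input and output paths.
        Job job = new Job(getConf(), "json wordcount");
        job.setJarByClass(JsonWordCount.class);
        job.setMapperClass(JsonWordCountMapper.class); // mapper sketched earlier
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        // Reducer omitted; the point here is the ToolRunner wiring.
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        System.exit(ToolRunner.run(new Configuration(), new JsonWordCount(), args));
    }
}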


On Thu, May 30, 2013 at 1:35 PM, jamal sasha <ja...@gmail.com> wrote:

> Hi,
>   I did that, but I still get the same exception.
> I did:
> export HADOOP_CLASSPATH=/path/to/external.jar
> And then added -libjars /path/to/external.jar to my command, but
> still the same error
>
>
> On Thu, May 30, 2013 at 11:46 AM, Shahab Yunus <sh...@gmail.com> wrote:
>
>> For starters, you can specify them through the -libjars parameter when
>> you kick off your M/R job. This way the jars will be copied to all TTs.
>>
>> Regards,
>> Shahab
>>
>>
>> On Thu, May 30, 2013 at 2:43 PM, jamal sasha <ja...@gmail.com> wrote:
>>
>>> Hi, thanks guys.
>>>  I figured out the issue. Hence I have another question.
>>> I am using a third-party library, and I thought that once I have created
>>> the jar file I don't need to specify the dependencies, but apparently that's
>>> not the case. (error below)
>>> Very, very naive question...probably stupid. How do I specify third-party
>>> libraries (jars) in Hadoop?
>>>
>>> Error:
>>> Error: java.lang.ClassNotFoundException: org.json.JSONException
>>>  at java.net.URLClassLoader$1.run(URLClassLoader.java:202)
>>> at java.security.AccessController.doPrivileged(Native Method)
>>>  at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
>>> at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
>>>  at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
>>> at java.lang.ClassLoader.loadClass(ClassLoader.java:247)
>>>  at java.lang.Class.forName0(Native Method)
>>> at java.lang.Class.forName(Class.java:247)
>>> at
>>> org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:820)
>>>  at
>>> org.apache.hadoop.conf.Configuration.getClass(Configuration.java:865)
>>> at
>>> org.apache.hadoop.mapreduce.JobContext.getMapperClass(JobContext.java:199)
>>>  at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:719)
>>> at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)
>>>  at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
>>> at java.security.AccessController.doPrivileged(Native Method)
>>>  at javax.security.auth.Subject.doAs(Subject.java:396)
>>> at
>>> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1093)
>>>  at org.apache.hadoop.mapred.Child.main(Child.java:249)
>>>
>>>
>>>
>>> On Thu, May 30, 2013 at 2:02 AM, Pramod N <np...@gmail.com> wrote:
>>>
>>>> Whatever you are trying to do should work,
>>>> Here is the modified WordCount Map
>>>>
>>>>
>>>>     public void map(LongWritable key, Text value, Context context)
>>>>             throws IOException, InterruptedException {
>>>>         String line = value.toString();
>>>>
>>>>         JSONObject line_as_json = new JSONObject(line);
>>>>         String text = line_as_json.getString("text");
>>>>         StringTokenizer tokenizer = new StringTokenizer(text);
>>>>         while (tokenizer.hasMoreTokens()) {
>>>>             word.set(tokenizer.nextToken());
>>>>             context.write(word, one);
>>>>         }
>>>>     }
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> Pramod N <http://atmachinelearner.blogspot.in>
>>>> Bruce Wayne of web
>>>> @machinelearner <https://twitter.com/machinelearner>
>>>>
>>>> --
>>>>
>>>>
>>>> On Thu, May 30, 2013 at 8:42 AM, Rahul Bhattacharjee <
>>>> rahul.rec.dgp@gmail.com> wrote:
>>>>
>>>>> Whatever you have mentioned, Jamal, should work. You can debug this.
>>>>>
>>>>> Thanks,
>>>>> Rahul
>>>>>
>>>>>
>>>>> On Thu, May 30, 2013 at 5:14 AM, jamal sasha <ja...@gmail.com> wrote:
>>>>>
>>>>>> Hi,
>>>>>>   For some reason, this have to be in java :(
>>>>>> I am trying to use org.json library, something like (in mapper)
>>>>>> JSONObject jsn = new JSONObject(value.toString());
>>>>>>
>>>>>> String text = (String) jsn.get("text");
>>>>>> StringTokenizer itr = new StringTokenizer(text);
>>>>>>
>>>>>> But its not working :(
>>>>>> It would be better to get this thing properly but I wouldnt mind
>>>>>> using a hack as well :)
>>>>>>
>>>>>>
>>>>>> On Wed, May 29, 2013 at 4:30 PM, Michael Segel <
>>>>>> michael_segel@hotmail.com> wrote:
>>>>>>
>>>>>>> Yeah,
>>>>>>> I have to agree w Russell. Pig is definitely the way to go on this.
>>>>>>>
>>>>>>> If you want to do it as a Java program you will have to do some work
>>>>>>> on the input string but it too should be trivial.
>>>>>>> How formal do you want to go?
>>>>>>> Do you want to strip it down or just find the quote after the text
>>>>>>> part?
>>>>>>>
>>>>>>>
>>>>>>> On May 29, 2013, at 5:13 PM, Russell Jurney <
>>>>>>> russell.jurney@gmail.com> wrote:
>>>>>>>
>>>>>>> Seriously consider Pig (free answer, 4 LOC):
>>>>>>>
>>>>>>> my_data = LOAD 'my_data.json' USING
>>>>>>> com.twitter.elephantbird.pig.load.JsonLoader() AS json:map[];
>>>>>>> words = FOREACH my_data GENERATE $0#'author' as author,
>>>>>>> FLATTEN(TOKENIZE($0#'text')) as word;
>>>>>>> word_counts = FOREACH (GROUP words BY word) GENERATE group AS word,
>>>>>>> COUNT_STAR(words) AS word_count;
>>>>>>> STORE word_counts INTO '/tmp/word_counts.txt';
>>>>>>>
>>>>>>> It will be faster than the Java you'll likely write.
>>>>>>>
>>>>>>>
>>>>>>> On Wed, May 29, 2013 at 2:54 PM, jamal sasha <ja...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Hi,
>>>>>>>>    I am stuck again. :(
>>>>>>>> My input data is in hdfs. I am again trying to do wordcount but
>>>>>>>> there is slight difference.
>>>>>>>> The data is in json format.
>>>>>>>> So each line of data is:
>>>>>>>>
>>>>>>>> {"author":"foo", "text": "hello"}
>>>>>>>> {"author":"foo123", "text": "hello world"}
>>>>>>>> {"author":"foo234", "text": "hello this world"}
>>>>>>>>
>>>>>>>> So I want to do wordcount for text part.
>>>>>>>> I understand that in mapper, I just have to pass this data as json
>>>>>>>> and extract "text" and rest of the code is just the same but I am trying to
>>>>>>>> switch from python to java hadoop.
>>>>>>>> How do I do this.
>>>>>>>> Thanks
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Russell Jurney twitter.com/rjurney russell.jurney@gmail.com
>>>>>>> datasyndrome.com
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>

Re: Reading json format input

Posted by jamal sasha <ja...@gmail.com>.
Hi,
  I did that, but I still get the same exception.
I did:
export HADOOP_CLASSPATH=/path/to/external.jar
And then added -libjars /path/to/external.jar to my command, but still the
same error


On Thu, May 30, 2013 at 11:46 AM, Shahab Yunus <sh...@gmail.com> wrote:

> For starters, you can specify them through the -libjars parameter when you
> kick off your M/R job. This way the jars will be copied to all TTs.
>
> Regards,
> Shahab
>
>
> On Thu, May 30, 2013 at 2:43 PM, jamal sasha <ja...@gmail.com> wrote:
>
>> Hi, thanks guys.
>>  I figured out the issue. Hence I have another question.
>> I am using a third-party library, and I thought that once I have created
>> the jar file I don't need to specify the dependencies, but apparently that's
>> not the case. (error below)
>> Very, very naive question...probably stupid. How do I specify third-party
>> libraries (jars) in Hadoop?
>>
>> Error:
>> Error: java.lang.ClassNotFoundException: org.json.JSONException
>>  at java.net.URLClassLoader$1.run(URLClassLoader.java:202)
>> at java.security.AccessController.doPrivileged(Native Method)
>>  at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
>> at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
>>  at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
>> at java.lang.ClassLoader.loadClass(ClassLoader.java:247)
>>  at java.lang.Class.forName0(Native Method)
>> at java.lang.Class.forName(Class.java:247)
>> at
>> org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:820)
>>  at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:865)
>> at
>> org.apache.hadoop.mapreduce.JobContext.getMapperClass(JobContext.java:199)
>>  at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:719)
>> at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)
>>  at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
>> at java.security.AccessController.doPrivileged(Native Method)
>>  at javax.security.auth.Subject.doAs(Subject.java:396)
>> at
>> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1093)
>>  at org.apache.hadoop.mapred.Child.main(Child.java:249)
>>
>>
>>
>> On Thu, May 30, 2013 at 2:02 AM, Pramod N <np...@gmail.com> wrote:
>>
>>> Whatever you are trying to do should work,
>>> Here is the modified WordCount Map
>>>
>>>
>>>     public void map(LongWritable key, Text value, Context context)
>>>             throws IOException, InterruptedException {
>>>         String line = value.toString();
>>>
>>>         JSONObject line_as_json = new JSONObject(line);
>>>         String text = line_as_json.getString("text");
>>>         StringTokenizer tokenizer = new StringTokenizer(text);
>>>         while (tokenizer.hasMoreTokens()) {
>>>             word.set(tokenizer.nextToken());
>>>             context.write(word, one);
>>>         }
>>>     }
>>>
>>>
>>>
>>>
>>>
>>> Pramod N <http://atmachinelearner.blogspot.in>
>>> Bruce Wayne of web
>>> @machinelearner <https://twitter.com/machinelearner>
>>>
>>> --
>>>
>>>
>>> On Thu, May 30, 2013 at 8:42 AM, Rahul Bhattacharjee <
>>> rahul.rec.dgp@gmail.com> wrote:
>>>
>>>> Whatever you have mentioned, Jamal, should work. You can debug this.
>>>>
>>>> Thanks,
>>>> Rahul
>>>>
>>>>
>>>> On Thu, May 30, 2013 at 5:14 AM, jamal sasha <ja...@gmail.com> wrote:
>>>>
>>>>> Hi,
>>>>>   For some reason, this have to be in java :(
>>>>> I am trying to use org.json library, something like (in mapper)
>>>>> JSONObject jsn = new JSONObject(value.toString());
>>>>>
>>>>> String text = (String) jsn.get("text");
>>>>> StringTokenizer itr = new StringTokenizer(text);
>>>>>
>>>>> But its not working :(
>>>>> It would be better to get this thing properly but I wouldnt mind using
>>>>> a hack as well :)
>>>>>
>>>>>
>>>>> On Wed, May 29, 2013 at 4:30 PM, Michael Segel <
>>>>> michael_segel@hotmail.com> wrote:
>>>>>
>>>>>> Yeah,
>>>>>> I have to agree w Russell. Pig is definitely the way to go on this.
>>>>>>
>>>>>> If you want to do it as a Java program you will have to do some work
>>>>>> on the input string but it too should be trivial.
>>>>>> How formal do you want to go?
>>>>>> Do you want to strip it down or just find the quote after the text
>>>>>> part?
>>>>>>
>>>>>>
>>>>>> On May 29, 2013, at 5:13 PM, Russell Jurney <ru...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>> Seriously consider Pig (free answer, 4 LOC):
>>>>>>
>>>>>> my_data = LOAD 'my_data.json' USING
>>>>>> com.twitter.elephantbird.pig.load.JsonLoader() AS json:map[];
>>>>>> words = FOREACH my_data GENERATE $0#'author' as author,
>>>>>> FLATTEN(TOKENIZE($0#'text')) as word;
>>>>>> word_counts = FOREACH (GROUP words BY word) GENERATE group AS word,
>>>>>> COUNT_STAR(words) AS word_count;
>>>>>> STORE word_counts INTO '/tmp/word_counts.txt';
>>>>>>
>>>>>> It will be faster than the Java you'll likely write.
>>>>>>
>>>>>>
>>>>>> On Wed, May 29, 2013 at 2:54 PM, jamal sasha <ja...@gmail.com> wrote:
>>>>>>
>>>>>>> Hi,
>>>>>>>    I am stuck again. :(
>>>>>>> My input data is in hdfs. I am again trying to do wordcount but
>>>>>>> there is slight difference.
>>>>>>> The data is in json format.
>>>>>>> So each line of data is:
>>>>>>>
>>>>>>> {"author":"foo", "text": "hello"}
>>>>>>> {"author":"foo123", "text": "hello world"}
>>>>>>> {"author":"foo234", "text": "hello this world"}
>>>>>>>
>>>>>>> So I want to do wordcount for text part.
>>>>>>> I understand that in mapper, I just have to pass this data as json
>>>>>>> and extract "text" and rest of the code is just the same but I am trying to
>>>>>>> switch from python to java hadoop.
>>>>>>> How do I do this.
>>>>>>> Thanks
>>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Russell Jurney twitter.com/rjurney russell.jurney@gmail.com
>>>>>> datasyndrome.com
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>

Re: Reading json format input

Posted by Shahab Yunus <sh...@gmail.com>.
For starters, you can specify them through the -libjars parameter when you
kick off your M/R job. This way the jars will be copied to all TTs (TaskTrackers).
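
An alternative that avoids -libjars entirely is to bundle the dependency
inside the job jar's lib/ directory, which the task runtime adds to the
classpath automatically. A sketch of that layout (jar and class names are
hypothetical):

wordcount.jar
|-- JsonWordCount.class
|-- JsonWordCountMapper.class
`-- lib/
    `-- json.jar          (third-party dependency packaged inside the job jar)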

Regards,
Shahab


On Thu, May 30, 2013 at 2:43 PM, jamal sasha <ja...@gmail.com> wrote:

> Hi, thanks guys.
>  I figured out the issue. Hence I have another question.
> I am using a third-party library, and I thought that once I have created
> the jar file I don't need to specify the dependencies, but apparently that's
> not the case. (error below)
> Very, very naive question...probably stupid. How do I specify third-party
> libraries (jars) in Hadoop?
>
> Error:
> Error: java.lang.ClassNotFoundException: org.json.JSONException
>  at java.net.URLClassLoader$1.run(URLClassLoader.java:202)
> at java.security.AccessController.doPrivileged(Native Method)
>  at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
>  at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:247)
>  at java.lang.Class.forName0(Native Method)
> at java.lang.Class.forName(Class.java:247)
> at
> org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:820)
>  at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:865)
> at
> org.apache.hadoop.mapreduce.JobContext.getMapperClass(JobContext.java:199)
>  at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:719)
> at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)
>  at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
> at java.security.AccessController.doPrivileged(Native Method)
>  at javax.security.auth.Subject.doAs(Subject.java:396)
> at
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1093)
>  at org.apache.hadoop.mapred.Child.main(Child.java:249)
>
>
>
> On Thu, May 30, 2013 at 2:02 AM, Pramod N <np...@gmail.com> wrote:
>
>> Whatever you are trying to do should work,
>> Here is the modified WordCount Map
>>
>>
>>     public void map(LongWritable key, Text value, Context context)
>>             throws IOException, InterruptedException {
>>         String line = value.toString();
>>
>>         JSONObject line_as_json = new JSONObject(line);
>>         String text = line_as_json.getString("text");
>>         StringTokenizer tokenizer = new StringTokenizer(text);
>>         while (tokenizer.hasMoreTokens()) {
>>             word.set(tokenizer.nextToken());
>>             context.write(word, one);
>>         }
>>     }
>>
>>
>>
>>
>>
>> Pramod N <http://atmachinelearner.blogspot.in>
>> Bruce Wayne of web
>> @machinelearner <https://twitter.com/machinelearner>
>>
>> --
>>
>>
>> On Thu, May 30, 2013 at 8:42 AM, Rahul Bhattacharjee <
>> rahul.rec.dgp@gmail.com> wrote:
>>
>>> Whatever you have mentioned, Jamal, should work. You can debug this.
>>>
>>> Thanks,
>>> Rahul
>>>
>>>
>>> On Thu, May 30, 2013 at 5:14 AM, jamal sasha <ja...@gmail.com> wrote:
>>>
>>>> Hi,
>>>>   For some reason, this have to be in java :(
>>>> I am trying to use org.json library, something like (in mapper)
>>>> JSONObject jsn = new JSONObject(value.toString());
>>>>
>>>> String text = (String) jsn.get("text");
>>>> StringTokenizer itr = new StringTokenizer(text);
>>>>
>>>> But its not working :(
>>>> It would be better to get this thing properly but I wouldnt mind using
>>>> a hack as well :)
>>>>
>>>>
>>>> On Wed, May 29, 2013 at 4:30 PM, Michael Segel <
>>>> michael_segel@hotmail.com> wrote:
>>>>
>>>>> Yeah,
>>>>> I have to agree w Russell. Pig is definitely the way to go on this.
>>>>>
>>>>> If you want to do it as a Java program you will have to do some work
>>>>> on the input string but it too should be trivial.
>>>>> How formal do you want to go?
>>>>> Do you want to strip it down or just find the quote after the text
>>>>> part?
>>>>>
>>>>>
>>>>> On May 29, 2013, at 5:13 PM, Russell Jurney <ru...@gmail.com>
>>>>> wrote:
>>>>>
>>>>> Seriously consider Pig (free answer, 4 LOC):
>>>>>
>>>>> my_data = LOAD 'my_data.json' USING
>>>>> com.twitter.elephantbird.pig.load.JsonLoader() AS json:map[];
>>>>> words = FOREACH my_data GENERATE $0#'author' as author,
>>>>> FLATTEN(TOKENIZE($0#'text')) as word;
>>>>> word_counts = FOREACH (GROUP words BY word) GENERATE group AS word,
>>>>> COUNT_STAR(words) AS word_count;
>>>>> STORE word_counts INTO '/tmp/word_counts.txt';
>>>>>
>>>>> It will be faster than the Java you'll likely write.
>>>>>
>>>>>
>>>>> On Wed, May 29, 2013 at 2:54 PM, jamal sasha <ja...@gmail.com> wrote:
>>>>>
>>>>>> Hi,
>>>>>>    I am stuck again. :(
>>>>>> My input data is in hdfs. I am again trying to do wordcount but there
>>>>>> is slight difference.
>>>>>> The data is in json format.
>>>>>> So each line of data is:
>>>>>>
>>>>>> {"author":"foo", "text": "hello"}
>>>>>> {"author":"foo123", "text": "hello world"}
>>>>>> {"author":"foo234", "text": "hello this world"}
>>>>>>
>>>>>> So I want to do wordcount for text part.
>>>>>> I understand that in mapper, I just have to pass this data as json
>>>>>> and extract "text" and rest of the code is just the same but I am trying to
>>>>>> switch from python to java hadoop.
>>>>>> How do I do this.
>>>>>> Thanks
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Russell Jurney twitter.com/rjurney russell.jurney@gmail.com
>>>>> datasyndrome.com
>>>>>
>>>>>
>>>>>
>>>>
>>>
>>
>

Re: Reading json format input

Posted by Shahab Yunus <sh...@gmail.com>.
For starters, you can specify them through the -libjars parameter when you
kick off your M/R job. That way the jars will be copied to all TaskTrackers (TTs).
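
For illustration only (the jar and class names below are made up, not from
this thread): -libjars is handled by GenericOptionsParser, so it only takes
effect if the job's main class runs through ToolRunner, roughly like this:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class WordCountDriver extends Configured implements Tool {
    @Override
    public int run(String[] args) throws Exception {
        // Set up and submit the job here; getConf() already carries
        // whatever -libjars entries GenericOptionsParser picked up.
        return 0;
    }

    public static void main(String[] args) throws Exception {
        // Launched as, e.g.:
        //   hadoop jar wordcount.jar WordCountDriver \
        //       -libjars /path/to/json.jar <input> <output>
        int rc = ToolRunner.run(new Configuration(), new WordCountDriver(), args);
        System.exit(rc);
    }
}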

Regards,
Shahab


On Thu, May 30, 2013 at 2:43 PM, jamal sasha <ja...@gmail.com> wrote:

> Hi Thanks guys.
>  I figured out the issue. Hence i have another question.
> I am using a third party library and I thought that once I have created
> the jar file I dont need to specify the dependancies but aparently thats
> not the case. (error below)
> Very very naive question...probably stupid. How do i specify third party
> libraries (jar) in hadoop.
>
> Error:
> Error: java.lang.ClassNotFoundException: org.json.JSONException
>  at java.net.URLClassLoader$1.run(URLClassLoader.java:202)
> at java.security.AccessController.doPrivileged(Native Method)
>  at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
>  at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:247)
>  at java.lang.Class.forName0(Native Method)
> at java.lang.Class.forName(Class.java:247)
> at
> org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:820)
>  at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:865)
> at
> org.apache.hadoop.mapreduce.JobContext.getMapperClass(JobContext.java:199)
>  at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:719)
> at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)
>  at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
> at java.security.AccessController.doPrivileged(Native Method)
>  at javax.security.auth.Subject.doAs(Subject.java:396)
> at
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1093)
>  at org.apache.hadoop.mapred.Child.main(Child.java:249)
>
>
>
> On Thu, May 30, 2013 at 2:02 AM, Pramod N <np...@gmail.com> wrote:
>
>> Whatever you are trying to do should work,
>> Here is the modified WordCount Map
>>
>>
>>     public void map(LongWritable key, Text value, Context context)
>>             throws IOException, InterruptedException {
>>         String line = value.toString();
>>         JSONObject line_as_json = new JSONObject(line);
>>         String text = line_as_json.getString("text");
>>         StringTokenizer tokenizer = new StringTokenizer(text);
>>         while (tokenizer.hasMoreTokens()) {
>>             word.set(tokenizer.nextToken());
>>             context.write(word, one);
>>         }
>>     }
>>
>>
>>
>>
>>
>> Pramod N <http://atmachinelearner.blogspot.in>
>> Bruce Wayne of web
>> @machinelearner <https://twitter.com/machinelearner>
>>
>> --
>>
>>
>> On Thu, May 30, 2013 at 8:42 AM, Rahul Bhattacharjee <
>> rahul.rec.dgp@gmail.com> wrote:
>>
>>> Whatever you have mentioned Jamal should work.you can debug this.
>>>
>>> Thanks,
>>> Rahul
>>>
>>>
>>> On Thu, May 30, 2013 at 5:14 AM, jamal sasha <ja...@gmail.com>wrote:
>>>
>>>> Hi,
>>>>   For some reason, this have to be in java :(
>>>> I am trying to use org.json library, something like (in mapper)
>>>> JSONObject jsn = new JSONObject(value.toString());
>>>>
>>>> String text = (String) jsn.get("text");
>>>> StringTokenizer itr = new StringTokenizer(text);
>>>>
>>>> But its not working :(
>>>> It would be better to get this thing properly but I wouldnt mind using
>>>> a hack as well :)
>>>>
>>>>
>>>> On Wed, May 29, 2013 at 4:30 PM, Michael Segel <
>>>> michael_segel@hotmail.com> wrote:
>>>>
>>>>> Yeah,
>>>>> I have to agree w Russell. Pig is definitely the way to go on this.
>>>>>
>>>>> If you want to do it as a Java program you will have to do some work
>>>>> on the input string but it too should be trivial.
>>>>> How formal do you want to go?
>>>>> Do you want to strip it down or just find the quote after the text
>>>>> part?
>>>>>
>>>>>
>>>>> On May 29, 2013, at 5:13 PM, Russell Jurney <ru...@gmail.com>
>>>>> wrote:
>>>>>
>>>>> Seriously consider Pig (free answer, 4 LOC):
>>>>>
>>>>> my_data = LOAD 'my_data.json' USING
>>>>> com.twitter.elephantbird.pig.load.JsonLoader() AS json:map[];
>>>>> words = FOREACH my_data GENERATE $0#'author' as author,
>>>>> FLATTEN(TOKENIZE($0#'text')) as word;
>>>>> word_counts = FOREACH (GROUP words BY word) GENERATE group AS word,
>>>>> COUNT_STAR(words) AS word_count;
>>>>> STORE word_counts INTO '/tmp/word_counts.txt';
>>>>>
>>>>> It will be faster than the Java you'll likely write.
>>>>>
>>>>>
>>>>> On Wed, May 29, 2013 at 2:54 PM, jamal sasha <ja...@gmail.com>wrote:
>>>>>
>>>>>> Hi,
>>>>>>    I am stuck again. :(
>>>>>> My input data is in hdfs. I am again trying to do wordcount but there
>>>>>> is slight difference.
>>>>>> The data is in json format.
>>>>>> So each line of data is:
>>>>>>
>>>>>> {"author":"foo", "text": "hello"}
>>>>>> {"author":"foo123", "text": "hello world"}
>>>>>> {"author":"foo234", "text": "hello this world"}
>>>>>>
>>>>>> So I want to do wordcount for text part.
>>>>>> I understand that in mapper, I just have to pass this data as json
>>>>>> and extract "text" and rest of the code is just the same but I am trying to
>>>>>> switch from python to java hadoop.
>>>>>> How do I do this.
>>>>>> Thanks
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Russell Jurney twitter.com/rjurney russell.jurney@gmail.com
>>>>> datasyndrome.com
>>>>>
>>>>>
>>>>>
>>>>
>>>
>>
>

Re: Reading json format input

Posted by jamal sasha <ja...@gmail.com>.
Hi, thanks guys.
I figured out the issue, but now I have another question.
I am using a third-party library, and I thought that once I had created the
jar file I didn't need to specify its dependencies, but apparently that's not
the case (error below).
A very naive, probably stupid question: how do I specify third-party
libraries (jars) in Hadoop?

Error:
Error: java.lang.ClassNotFoundException: org.json.JSONException
at java.net.URLClassLoader$1.run(URLClassLoader.java:202)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
at java.lang.ClassLoader.loadClass(ClassLoader.java:247)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:247)
at
org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:820)
at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:865)
at
org.apache.hadoop.mapreduce.JobContext.getMapperClass(JobContext.java:199)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:719)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)
at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1093)
at org.apache.hadoop.mapred.Child.main(Child.java:249)



On Thu, May 30, 2013 at 2:02 AM, Pramod N <np...@gmail.com> wrote:

> Whatever you are trying to do should work,
> Here is the modified WordCount Map
>
>
>     public void map(LongWritable key, Text value, Context context)
>             throws IOException, InterruptedException {
>         String line = value.toString();
>         JSONObject line_as_json = new JSONObject(line);
>         String text = line_as_json.getString("text");
>         StringTokenizer tokenizer = new StringTokenizer(text);
>         while (tokenizer.hasMoreTokens()) {
>             word.set(tokenizer.nextToken());
>             context.write(word, one);
>         }
>     }
>
>
>
>
>
> Pramod N <http://atmachinelearner.blogspot.in>
> Bruce Wayne of web
> @machinelearner <https://twitter.com/machinelearner>
>
> --
>
>
> On Thu, May 30, 2013 at 8:42 AM, Rahul Bhattacharjee <
> rahul.rec.dgp@gmail.com> wrote:
>
>> Whatever you have mentioned Jamal should work.you can debug this.
>>
>> Thanks,
>> Rahul
>>
>>
>> On Thu, May 30, 2013 at 5:14 AM, jamal sasha <ja...@gmail.com>wrote:
>>
>>> Hi,
>>>   For some reason, this have to be in java :(
>>> I am trying to use org.json library, something like (in mapper)
>>> JSONObject jsn = new JSONObject(value.toString());
>>>
>>> String text = (String) jsn.get("text");
>>> StringTokenizer itr = new StringTokenizer(text);
>>>
>>> But its not working :(
>>> It would be better to get this thing properly but I wouldnt mind using a
>>> hack as well :)
>>>
>>>
>>> On Wed, May 29, 2013 at 4:30 PM, Michael Segel <
>>> michael_segel@hotmail.com> wrote:
>>>
>>>> Yeah,
>>>> I have to agree w Russell. Pig is definitely the way to go on this.
>>>>
>>>> If you want to do it as a Java program you will have to do some work on
>>>> the input string but it too should be trivial.
>>>> How formal do you want to go?
>>>> Do you want to strip it down or just find the quote after the text
>>>> part?
>>>>
>>>>
>>>> On May 29, 2013, at 5:13 PM, Russell Jurney <ru...@gmail.com>
>>>> wrote:
>>>>
>>>> Seriously consider Pig (free answer, 4 LOC):
>>>>
>>>> my_data = LOAD 'my_data.json' USING
>>>> com.twitter.elephantbird.pig.load.JsonLoader() AS json:map[];
>>>> words = FOREACH my_data GENERATE $0#'author' as author,
>>>> FLATTEN(TOKENIZE($0#'text')) as word;
>>>> word_counts = FOREACH (GROUP words BY word) GENERATE group AS word,
>>>> COUNT_STAR(words) AS word_count;
>>>> STORE word_counts INTO '/tmp/word_counts.txt';
>>>>
>>>> It will be faster than the Java you'll likely write.
>>>>
>>>>
>>>> On Wed, May 29, 2013 at 2:54 PM, jamal sasha <ja...@gmail.com>wrote:
>>>>
>>>>> Hi,
>>>>>    I am stuck again. :(
>>>>> My input data is in hdfs. I am again trying to do wordcount but there
>>>>> is slight difference.
>>>>> The data is in json format.
>>>>> So each line of data is:
>>>>>
>>>>> {"author":"foo", "text": "hello"}
>>>>> {"author":"foo123", "text": "hello world"}
>>>>> {"author":"foo234", "text": "hello this world"}
>>>>>
>>>>> So I want to do wordcount for text part.
>>>>> I understand that in mapper, I just have to pass this data as json and
>>>>> extract "text" and rest of the code is just the same but I am trying to
>>>>> switch from python to java hadoop.
>>>>> How do I do this.
>>>>> Thanks
>>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> Russell Jurney twitter.com/rjurney russell.jurney@gmail.com
>>>> datasyndrome.com
>>>>
>>>>
>>>>
>>>
>>
>

Re: Reading json format input

Posted by Pramod N <np...@gmail.com>.
Whatever you are trying to do should work.
Here is the modified WordCount map method:


    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String line = value.toString();
        JSONObject line_as_json = new JSONObject(line);
        String text = line_as_json.getString("text");
        StringTokenizer tokenizer = new StringTokenizer(text);
        while (tokenizer.hasMoreTokens()) {
            word.set(tokenizer.nextToken());
            context.write(word, one);
        }
    }
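
Note that the snippet assumes the usual WordCount mapper fields word and one.
A self-contained version might look like this (a sketch only; the class name
is made up):

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.json.JSONException;
import org.json.JSONObject;

public class JsonWordCountMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> {

    private static final IntWritable one = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String text;
        try {
            // Each input line is a single JSON record such as
            // {"author":"foo", "text": "hello world"}
            text = new JSONObject(value.toString()).getString("text");
        } catch (JSONException e) {
            // In older org.json releases JSONException is checked, so
            // translate it into something map() is allowed to throw.
            throw new IOException("Bad JSON record: " + value, e);
        }
        // Emit (word, 1) for every whitespace-separated token of "text".
        StringTokenizer tokenizer = new StringTokenizer(text);
        while (tokenizer.hasMoreTokens()) {
            word.set(tokenizer.nextToken());
            context.write(word, one);
        }
    }
}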





Pramod N <http://atmachinelearner.blogspot.in>
Bruce Wayne of web
@machinelearner <https://twitter.com/machinelearner>

--


On Thu, May 30, 2013 at 8:42 AM, Rahul Bhattacharjee <
rahul.rec.dgp@gmail.com> wrote:

> Whatever you have mentioned Jamal should work.you can debug this.
>
> Thanks,
> Rahul
>
>
> On Thu, May 30, 2013 at 5:14 AM, jamal sasha <ja...@gmail.com>wrote:
>
>> Hi,
>>   For some reason, this have to be in java :(
>> I am trying to use org.json library, something like (in mapper)
>> JSONObject jsn = new JSONObject(value.toString());
>>
>> String text = (String) jsn.get("text");
>> StringTokenizer itr = new StringTokenizer(text);
>>
>> But its not working :(
>> It would be better to get this thing properly but I wouldnt mind using a
>> hack as well :)
>>
>>
>> On Wed, May 29, 2013 at 4:30 PM, Michael Segel <michael_segel@hotmail.com
>> > wrote:
>>
>>> Yeah,
>>> I have to agree w Russell. Pig is definitely the way to go on this.
>>>
>>> If you want to do it as a Java program you will have to do some work on
>>> the input string but it too should be trivial.
>>> How formal do you want to go?
>>> Do you want to strip it down or just find the quote after the text part?
>>>
>>>
>>> On May 29, 2013, at 5:13 PM, Russell Jurney <ru...@gmail.com>
>>> wrote:
>>>
>>> Seriously consider Pig (free answer, 4 LOC):
>>>
>>> my_data = LOAD 'my_data.json' USING
>>> com.twitter.elephantbird.pig.load.JsonLoader() AS json:map[];
>>> words = FOREACH my_data GENERATE $0#'author' as author,
>>> FLATTEN(TOKENIZE($0#'text')) as word;
>>> word_counts = FOREACH (GROUP words BY word) GENERATE group AS word,
>>> COUNT_STAR(words) AS word_count;
>>> STORE word_counts INTO '/tmp/word_counts.txt';
>>>
>>> It will be faster than the Java you'll likely write.
>>>
>>>
>>> On Wed, May 29, 2013 at 2:54 PM, jamal sasha <ja...@gmail.com>wrote:
>>>
>>>> Hi,
>>>>    I am stuck again. :(
>>>> My input data is in hdfs. I am again trying to do wordcount but there
>>>> is slight difference.
>>>> The data is in json format.
>>>> So each line of data is:
>>>>
>>>> {"author":"foo", "text": "hello"}
>>>> {"author":"foo123", "text": "hello world"}
>>>> {"author":"foo234", "text": "hello this world"}
>>>>
>>>> So I want to do wordcount for text part.
>>>> I understand that in mapper, I just have to pass this data as json and
>>>> extract "text" and rest of the code is just the same but I am trying to
>>>> switch from python to java hadoop.
>>>> How do I do this.
>>>> Thanks
>>>>
>>>
>>>
>>>
>>> --
>>> Russell Jurney twitter.com/rjurney russell.jurney@gmail.com datasyndrome
>>> .com
>>>
>>>
>>>
>>
>

Re: Reading json format input

Posted by Rahul Bhattacharjee <ra...@gmail.com>.
Whatever you have mentioned, Jamal, should work. You can debug this.

Thanks,
Rahul


On Thu, May 30, 2013 at 5:14 AM, jamal sasha <ja...@gmail.com> wrote:

> Hi,
>   For some reason, this have to be in java :(
> I am trying to use org.json library, something like (in mapper)
> JSONObject jsn = new JSONObject(value.toString());
>
> String text = (String) jsn.get("text");
> StringTokenizer itr = new StringTokenizer(text);
>
> But its not working :(
> It would be better to get this thing properly but I wouldnt mind using a
> hack as well :)
>
>
> On Wed, May 29, 2013 at 4:30 PM, Michael Segel <mi...@hotmail.com>wrote:
>
>> Yeah,
>> I have to agree w Russell. Pig is definitely the way to go on this.
>>
>> If you want to do it as a Java program you will have to do some work on
>> the input string but it too should be trivial.
>> How formal do you want to go?
>> Do you want to strip it down or just find the quote after the text part?
>>
>>
>> On May 29, 2013, at 5:13 PM, Russell Jurney <ru...@gmail.com>
>> wrote:
>>
>> Seriously consider Pig (free answer, 4 LOC):
>>
>> my_data = LOAD 'my_data.json' USING
>> com.twitter.elephantbird.pig.load.JsonLoader() AS json:map[];
>> words = FOREACH my_data GENERATE $0#'author' as author,
>> FLATTEN(TOKENIZE($0#'text')) as word;
>> word_counts = FOREACH (GROUP words BY word) GENERATE group AS word,
>> COUNT_STAR(words) AS word_count;
>> STORE word_counts INTO '/tmp/word_counts.txt';
>>
>> It will be faster than the Java you'll likely write.
>>
>>
>> On Wed, May 29, 2013 at 2:54 PM, jamal sasha <ja...@gmail.com>wrote:
>>
>>> Hi,
>>>    I am stuck again. :(
>>> My input data is in hdfs. I am again trying to do wordcount but there is
>>> slight difference.
>>> The data is in json format.
>>> So each line of data is:
>>>
>>> {"author":"foo", "text": "hello"}
>>> {"author":"foo123", "text": "hello world"}
>>> {"author":"foo234", "text": "hello this world"}
>>>
>>> So I want to do wordcount for text part.
>>> I understand that in mapper, I just have to pass this data as json and
>>> extract "text" and rest of the code is just the same but I am trying to
>>> switch from python to java hadoop.
>>> How do I do this.
>>> Thanks
>>>
>>
>>
>>
>> --
>> Russell Jurney twitter.com/rjurney russell.jurney@gmail.com datasyndrome.
>> com
>>
>>
>>
>

Re: Reading json format input

Posted by Michael Segel <mi...@hotmail.com>.
You have the entire string. 
If you tokenize on commas ... 

Starting with :
>> {"author":"foo", "text": "hello"}
>> {"author":"foo123", "text": "hello world"}
>> {"author":"foo234", "text": "hello this world"}

You end up with 
{"author":"foo",    and "text":"hello"}

So you can ignore the first token, then again split the token on the colon (':')

This gives you "text" and "hello"}

You can again ignore the first token and you now have "hello"}

And now you can parse out the stuff within the quotes. 
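
For illustration, that hack might look roughly like this (it assumes the exact
one-line records shown above and breaks if a value contains a comma, colon, or
escaped quote):

static String extractText(String line) {
    // {"author":"foo", "text": "hello"}  ->   "text": "hello"}
    String afterComma = line.split(",", 2)[1];
    //  "text": "hello"}  ->   "hello"}
    String afterColon = afterComma.split(":", 2)[1];
    // Keep only what sits between the outermost quotes.
    int first = afterColon.indexOf('"');
    int last = afterColon.lastIndexOf('"');
    return afterColon.substring(first + 1, last);
}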

HTH


On May 29, 2013, at 6:44 PM, jamal sasha <ja...@gmail.com> wrote:

> Hi,
>   For some reason, this have to be in java :(
> I am trying to use org.json library, something like (in mapper)
> JSONObject jsn = new JSONObject(value.toString());
> 
> String text = (String) jsn.get("text");
> StringTokenizer itr = new StringTokenizer(text);
> 
> But its not working :(
> It would be better to get this thing properly but I wouldnt mind using a hack as well :)
> 
> 
> On Wed, May 29, 2013 at 4:30 PM, Michael Segel <mi...@hotmail.com> wrote:
> Yeah, 
> I have to agree w Russell. Pig is definitely the way to go on this. 
> 
> If you want to do it as a Java program you will have to do some work on the input string but it too should be trivial. 
> How formal do you want to go? 
> Do you want to strip it down or just find the quote after the text part? 
> 
> 
> On May 29, 2013, at 5:13 PM, Russell Jurney <ru...@gmail.com> wrote:
> 
>> Seriously consider Pig (free answer, 4 LOC):
>> 
>> my_data = LOAD 'my_data.json' USING com.twitter.elephantbird.pig.load.JsonLoader() AS json:map[];
>> words = FOREACH my_data GENERATE $0#'author' as author, FLATTEN(TOKENIZE($0#'text')) as word;
>> word_counts = FOREACH (GROUP words BY word) GENERATE group AS word, COUNT_STAR(words) AS word_count;
>> STORE word_counts INTO '/tmp/word_counts.txt';
>> 
>> It will be faster than the Java you'll likely write.
>> 
>> 
>> On Wed, May 29, 2013 at 2:54 PM, jamal sasha <ja...@gmail.com> wrote:
>> Hi,
>>    I am stuck again. :(
>> My input data is in hdfs. I am again trying to do wordcount but there is slight difference.
>> The data is in json format.
>> So each line of data is:
>> 
>> {"author":"foo", "text": "hello"}
>> {"author":"foo123", "text": "hello world"}
>> {"author":"foo234", "text": "hello this world"}
>> 
>> So I want to do wordcount for text part.
>> I understand that in mapper, I just have to pass this data as json and extract "text" and rest of the code is just the same but I am trying to switch from python to java hadoop. 
>> How do I do this.
>> Thanks
>> 
>> 
>> 
>> -- 
>> Russell Jurney twitter.com/rjurney russell.jurney@gmail.com datasyndrome.com
> 
> 


Re: Reading json format input

Posted by Rahul Bhattacharjee <ra...@gmail.com>.
Whatever you have mentioned Jamal should work.you can debug this.

Thanks,
Rahul


On Thu, May 30, 2013 at 5:14 AM, jamal sasha <ja...@gmail.com> wrote:

> Hi,
>   For some reason, this have to be in java :(
> I am trying to use org.json library, something like (in mapper)
> JSONObject jsn = new JSONObject(value.toString());
>
> String text = (String) jsn.get("text");
> StringTokenizer itr = new StringTokenizer(text);
>
> But its not working :(
> It would be better to get this thing properly but I wouldnt mind using a
> hack as well :)
>
>
> On Wed, May 29, 2013 at 4:30 PM, Michael Segel <mi...@hotmail.com>wrote:
>
>> Yeah,
>> I have to agree w Russell. Pig is definitely the way to go on this.
>>
>> If you want to do it as a Java program you will have to do some work on
>> the input string but it too should be trivial.
>> How formal do you want to go?
>> Do you want to strip it down or just find the quote after the text part?
>>
>>
>> On May 29, 2013, at 5:13 PM, Russell Jurney <ru...@gmail.com>
>> wrote:
>>
>> Seriously consider Pig (free answer, 4 LOC):
>>
>> my_data = LOAD 'my_data.json' USING
>> com.twitter.elephantbird.pig.load.JsonLoader() AS json:map[];
>> words = FOREACH my_data GENERATE $0#'author' as author,
>> FLATTEN(TOKENIZE($0#'text')) as word;
>> word_counts = FOREACH (GROUP words BY word) GENERATE group AS word,
>> COUNT_STAR(words) AS word_count;
>> STORE word_counts INTO '/tmp/word_counts.txt';
>>
>> It will be faster than the Java you'll likely write.
>>
>>
>> On Wed, May 29, 2013 at 2:54 PM, jamal sasha <ja...@gmail.com>wrote:
>>
>>> Hi,
>>>    I am stuck again. :(
>>> My input data is in hdfs. I am again trying to do wordcount but there is
>>> slight difference.
>>> The data is in json format.
>>> So each line of data is:
>>>
>>> {"author":"foo", "text": "hello"}
>>> {"author":"foo123", "text": "hello world"}
>>> {"author":"foo234", "text": "hello this world"}
>>>
>>> So I want to do wordcount for text part.
>>> I understand that in mapper, I just have to pass this data as json and
>>> extract "text" and rest of the code is just the same but I am trying to
>>> switch from python to java hadoop.
>>> How do I do this.
>>> Thanks
>>>
>>
>>
>>
>> --
>> Russell Jurney twitter.com/rjurney russell.jurney@gmail.com datasyndrome.
>> com
>>
>>
>>
>

Re: Reading json format input

Posted by jamal sasha <ja...@gmail.com>.
Hi,
  For some reason, this has to be in Java :(
I am trying to use the org.json library, something like this (in the mapper):
JSONObject jsn = new JSONObject(value.toString());

String text = (String) jsn.get("text");
StringTokenizer itr = new StringTokenizer(text);

But it's not working :(
It would be better to do this properly, but I wouldn't mind using a
hack as well :)


On Wed, May 29, 2013 at 4:30 PM, Michael Segel <mi...@hotmail.com>wrote:

> Yeah,
> I have to agree w Russell. Pig is definitely the way to go on this.
>
> If you want to do it as a Java program you will have to do some work on
> the input string but it too should be trivial.
> How formal do you want to go?
> Do you want to strip it down or just find the quote after the text part?
>
>
> On May 29, 2013, at 5:13 PM, Russell Jurney <ru...@gmail.com>
> wrote:
>
> Seriously consider Pig (free answer, 4 LOC):
>
> my_data = LOAD 'my_data.json' USING
> com.twitter.elephantbird.pig.load.JsonLoader() AS json:map[];
> words = FOREACH my_data GENERATE $0#'author' as author,
> FLATTEN(TOKENIZE($0#'text')) as word;
> word_counts = FOREACH (GROUP words BY word) GENERATE group AS word,
> COUNT_STAR(words) AS word_count;
> STORE word_counts INTO '/tmp/word_counts.txt';
>
> It will be faster than the Java you'll likely write.
>
>
> On Wed, May 29, 2013 at 2:54 PM, jamal sasha <ja...@gmail.com>wrote:
>
>> Hi,
>>    I am stuck again. :(
>> My input data is in hdfs. I am again trying to do wordcount but there is
>> slight difference.
>> The data is in json format.
>> So each line of data is:
>>
>> {"author":"foo", "text": "hello"}
>> {"author":"foo123", "text": "hello world"}
>> {"author":"foo234", "text": "hello this world"}
>>
>> So I want to do wordcount for text part.
>> I understand that in mapper, I just have to pass this data as json and
>> extract "text" and rest of the code is just the same but I am trying to
>> switch from python to java hadoop.
>> How do I do this.
>> Thanks
>>
>
>
>
> --
> Russell Jurney twitter.com/rjurney russell.jurney@gmail.com datasyndrome.
> com
>
>
>
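
A minimal, self-contained version of that mapper might look like this (a sketch only: the class name is illustrative, it assumes the org.json jar is on the job's classpath and the new org.apache.hadoop.mapreduce API, and it would be paired with the usual summing reducer):

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.json.JSONException;
import org.json.JSONObject;

// Sketch: parse each input line as a JSON object, tokenize its "text"
// field, and emit (word, 1) pairs for the reducer to sum.
public class JsonWordCountMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        try {
            JSONObject json = new JSONObject(value.toString());
            String text = json.getString("text");
            StringTokenizer itr = new StringTokenizer(text);
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        } catch (JSONException e) {
            // Malformed line: skip it rather than failing the whole task.
        }
    }
}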

Re: Reading json format input

Posted by Michael Segel <mi...@hotmail.com>.
Yeah, 
I have to agree with Russell. Pig is definitely the way to go on this. 

If you want to do it as a Java program, you will have to do some work on the input string, but it too should be trivial. 
How formal do you want to go? 
Do you want to strip it down or just find the quote after the text part? 


On May 29, 2013, at 5:13 PM, Russell Jurney <ru...@gmail.com> wrote:

> Seriously consider Pig (free answer, 4 LOC):
> 
> my_data = LOAD 'my_data.json' USING com.twitter.elephantbird.pig.load.JsonLoader() AS json:map[];
> words = FOREACH my_data GENERATE $0#'author' as author, FLATTEN(TOKENIZE($0#'text')) as word;
> word_counts = FOREACH (GROUP words BY word) GENERATE group AS word, COUNT_STAR(words) AS word_count;
> STORE word_counts INTO '/tmp/word_counts.txt';
> 
> It will be faster than the Java you'll likely write.
> 
> 
> On Wed, May 29, 2013 at 2:54 PM, jamal sasha <ja...@gmail.com> wrote:
> Hi,
>    I am stuck again. :(
> My input data is in hdfs. I am again trying to do wordcount but there is slight difference.
> The data is in json format.
> So each line of data is:
> 
> {"author":"foo", "text": "hello"}
> {"author":"foo123", "text": "hello world"}
> {"author":"foo234", "text": "hello this world"}
> 
> So I want to do wordcount for text part.
> I understand that in mapper, I just have to pass this data as json and extract "text" and rest of the code is just the same but I am trying to switch from python to java hadoop. 
> How do I do this.
> Thanks
> 
> 
> 
> -- 
> Russell Jurney twitter.com/rjurney russell.jurney@gmail.com datasyndrome.com
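
A quick-and-dirty version of the "find the quote after the text part" approach could be a small helper like this (class and method names are illustrative; it assumes one flat JSON object per line and no escaped quotes inside the value):

public class TextFieldExtractor {
    // Pull out the value of the "text" field by plain string scanning.
    // Works for simple one-line objects like the samples above; a real
    // JSON parser is still the safer choice for anything messier.
    static String extractText(String line) {
        int keyPos = line.indexOf("\"text\"");
        if (keyPos < 0) return null;                    // no "text" field on this line
        int openQuote = line.indexOf('"', keyPos + 6);  // quote that starts the value
        if (openQuote < 0) return null;
        int closeQuote = line.indexOf('"', openQuote + 1);
        if (closeQuote < 0) return null;
        return line.substring(openQuote + 1, closeQuote);
    }

    public static void main(String[] args) {
        // Prints: hello world
        System.out.println(extractText("{\"author\":\"foo123\", \"text\": \"hello world\"}"));
    }
}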


Re: Reading json format input

Posted by Russell Jurney <ru...@gmail.com>.
Seriously consider Pig (free answer, 4 LOC):

my_data = LOAD 'my_data.json' USING com.twitter.elephantbird.pig.load.JsonLoader() AS json:map[];
words = FOREACH my_data GENERATE $0#'author' AS author, FLATTEN(TOKENIZE($0#'text')) AS word;
word_counts = FOREACH (GROUP words BY word) GENERATE group AS word, COUNT_STAR(words) AS word_count;
STORE word_counts INTO '/tmp/word_counts.txt';

It will be faster than the Java you'll likely write.


On Wed, May 29, 2013 at 2:54 PM, jamal sasha <ja...@gmail.com> wrote:

> Hi,
>    I am stuck again. :(
> My input data is in hdfs. I am again trying to do wordcount but there is
> slight difference.
> The data is in json format.
> So each line of data is:
>
> {"author":"foo", "text": "hello"}
> {"author":"foo123", "text": "hello world"}
> {"author":"foo234", "text": "hello this world"}
>
> So I want to do wordcount for text part.
> I understand that in mapper, I just have to pass this data as json and
> extract "text" and rest of the code is just the same but I am trying to
> switch from python to java hadoop.
> How do I do this.
> Thanks
>



-- 
Russell Jurney twitter.com/rjurney russell.jurney@gmail.com datasyndrome.com

Re: Reading json format input

Posted by Rishi Yadav <ri...@infoobjects.com>.
For that, you have to write intermediate data only if word equals "text":

String[] words = line.split("\\W+");

for (String word : words) {
    // emit a count only when the token is the literal word "text"
    if (word.equals("text"))
        context.write(new Text(word), new IntWritable(1));
}

I am assuming you have a huge volume of data for this; otherwise MapReduce
will be overkill and a simple regex will do.



On Wed, May 29, 2013 at 4:45 PM, jamal sasha <ja...@gmail.com> wrote:

> Hi Rishi,
>    But I dont want the wordcount of all the words..
> In json, there is a field "text".. and those are the words I wish to count?
>
>
> On Wed, May 29, 2013 at 4:43 PM, Rishi Yadav <ri...@infoobjects.com>wrote:
>
>> Hi Jamal,
>>
>> I took your input and put it in sample wordcount program and it's working
>> just fine and giving this output.
>>
>> author 3
>> foo234 1
>> text 3
>> foo 1
>> foo123 1
>> hello 3
>> this 1
>> world 2
>>
>>
>> When we split using
>>
>> String[] words = input.split("\\W+");
>>
>> it takes care of all non-alphanumeric characters.
>>
>> Thanks and Regards,
>>
>> Rishi Yadav
>>
>> On Wed, May 29, 2013 at 2:54 PM, jamal sasha <ja...@gmail.com>wrote:
>>
>>> Hi,
>>>    I am stuck again. :(
>>> My input data is in hdfs. I am again trying to do wordcount but there is
>>> slight difference.
>>> The data is in json format.
>>> So each line of data is:
>>>
>>> {"author":"foo", "text": "hello"}
>>> {"author":"foo123", "text": "hello world"}
>>> {"author":"foo234", "text": "hello this world"}
>>>
>>> So I want to do wordcount for text part.
>>> I understand that in mapper, I just have to pass this data as json and
>>> extract "text" and rest of the code is just the same but I am trying to
>>> switch from python to java hadoop.
>>> How do I do this.
>>> Thanks
>>>
>>
>>
>

Re: Reading json format input

Posted by jamal sasha <ja...@gmail.com>.
Hi Rishi,
   But I don't want the word count of all the words.
In the JSON, there is a field "text", and those are the words I wish to count.


On Wed, May 29, 2013 at 4:43 PM, Rishi Yadav <ri...@infoobjects.com> wrote:

> Hi Jamal,
>
> I took your input and put it in sample wordcount program and it's working
> just fine and giving this output.
>
> author 3
> foo234 1
> text 3
> foo 1
> foo123 1
> hello 3
> this 1
> world 2
>
>
> When we split using
>
> String[] words = input.split("\\W+");
>
> it takes care of all non-alphanumeric characters.
>
> Thanks and Regards,
>
> Rishi Yadav
>
> On Wed, May 29, 2013 at 2:54 PM, jamal sasha <ja...@gmail.com>wrote:
>
>> Hi,
>>    I am stuck again. :(
>> My input data is in hdfs. I am again trying to do wordcount but there is
>> slight difference.
>> The data is in json format.
>> So each line of data is:
>>
>> {"author":"foo", "text": "hello"}
>> {"author":"foo123", "text": "hello world"}
>> {"author":"foo234", "text": "hello this world"}
>>
>> So I want to do wordcount for text part.
>> I understand that in mapper, I just have to pass this data as json and
>> extract "text" and rest of the code is just the same but I am trying to
>> switch from python to java hadoop.
>> How do I do this.
>> Thanks
>>
>
>
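
Concretely, with the three sample lines above, counting only the words inside the "text" field should give hello 3, world 2, and this 1; the field names and author values should not be counted at all.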

Re: Reading json format input

Posted by Rishi Yadav <ri...@infoobjects.com>.
Hi Jamal,

I took your input and put it in a sample wordcount program, and it's working
just fine, giving this output:

author 3
foo234 1
text 3
foo 1
foo123 1
hello 3
this 1
world 2


When we split using

String[] words = input.split("\\W+");

it takes care of all non-word characters (everything except letters, digits, and underscore).

Thanks and Regards,

Rishi Yadav

On Wed, May 29, 2013 at 2:54 PM, jamal sasha <ja...@gmail.com> wrote:

> Hi,
>    I am stuck again. :(
> My input data is in hdfs. I am again trying to do wordcount but there is
> slight difference.
> The data is in json format.
> So each line of data is:
>
> {"author":"foo", "text": "hello"}
> {"author":"foo123", "text": "hello world"}
> {"author":"foo234", "text": "hello this world"}
>
> So I want to do wordcount for text part.
> I understand that in mapper, I just have to pass this data as json and
> extract "text" and rest of the code is just the same but I am trying to
> switch from python to java hadoop.
> How do I do this.
> Thanks
>
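
To see why the field names show up in those counts, it helps to run that split on one raw line (a standalone sketch; the class name is illustrative):

public class SplitDemo {
    public static void main(String[] args) {
        String line = "{\"author\":\"foo123\", \"text\": \"hello world\"}";
        // \W+ matches runs of non-word characters, so the JSON punctuation
        // is discarded, but the key names survive as ordinary tokens.
        String[] words = line.split("\\W+");
        for (String w : words) {
            System.out.println(w);  // prints an empty string, then: author, foo123, text, hello, world
        }
    }
}

Splitting the whole line counts "author" and "text" along with the values, which is exactly the mismatch jamal raised earlier in the thread.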
