Posted to common-user@hadoop.apache.org by hari939 <ha...@gmail.com> on 2009/04/18 14:18:23 UTC

Using the Stanford NLP with hadoop

My project of parsing through material for a semantic search engine requires
me to use the Stanford NLP parser
(http://nlp.stanford.edu/software/lex-parser.shtml) on a Hadoop cluster.

To use the Stanford NLP parser, one must create a lexical parser object
using an englishPCFG.ser.gz file as a constructor's parameter.
I have tried loading the file onto HDFS in the /user/root/ folder
and have also tried packing the file into the jar of the Java program.
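
For reference, constructing the parser locally from the serialized grammar
looks roughly like this (a minimal sketch; the class name is made up, and it
assumes the constructor that takes an ObjectInputStream):

    import java.io.FileInputStream;
    import java.io.ObjectInputStream;
    import java.util.zip.GZIPInputStream;
    import edu.stanford.nlp.parser.lexparser.LexicalizedParser;

    public class LocalParseDemo {
        public static void main(String[] args) throws Exception {
            // Deserialize the grammar from a local copy of englishPCFG.ser.gz.
            LexicalizedParser lp = new LexicalizedParser(
                new ObjectInputStream(
                    new GZIPInputStream(
                        new FileInputStream("englishPCFG.ser.gz"))));
        }
    }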

I am new to the Hadoop platform and am not very familiar with some of its
salient features.

Looking forward to any form of help.

Re: Using the Stanford NLP with hadoop

Posted by Stuart Sierra <th...@gmail.com>.
On Tue, Apr 21, 2009 at 4:58 PM, Kevin Peterson <kp...@biz360.com> wrote:
> I'm interested to know if you have found any other open source parsers in
> Java, or at least with Java bindings.

Stanford is one of the best, although it is slow.  LingPipe
<http://alias-i.com/lingpipe/> is free for non-commercial use, and
they link to most of the open-source toolkits here:
<http://alias-i.com/lingpipe/web/competition.html>  It seems like most
NLP toolkits don't attempt full sentence parsing, but instead focus on
tagging, chunking, or entity recognition.

-Stuart

Re: Using the Stanford NLP with hadoop

Posted by akhil1988 <ak...@gmail.com>.
Hi hari!

To get the englishPCFG.ser.gz file into the current working directory of the
task trackers, use the DistributedCache class.

First, put your englishPCFG.ser.gz into HDFS using the "hadoop fs -put"
command.

Now suppose your file is sitting in HDFS at /home/hari/englishPCFG.ser.gz.

Now in the main function, before the job is submitted, use the following
statements (conf is your job's configuration):

DistributedCache.addCacheFile(
    new URI("/home/hari/englishPCFG.ser.gz#englishPCFG.ser.gz"), conf);
DistributedCache.createSymlink(conf);

Now the englishPCFG.ser.gz file is present in the current working directory
of the tasks (under the name after the '#'), and you can access it just as
you would access any other file in your normal Java programs. So in your
case, you can directly give "englishPCFG.ser.gz" as the argument to the
constructor.

Hope this helps.

Best,
Akhil


Re: Using the Stanford NLP with hadoop

Posted by hari939 <ha...@gmail.com>.
By 'ClassName', which class are you actually referring to? The class in which
the LexicalizedParser is invoked?

In my code, the class that implements the parser is named 'parse', and this
is the code that I used:
  lp = new LexicalizedParser(new ObjectInputStream(new
GZIPInputStream(parse.class.getResourceAsStream("/englishPCFG.ser.gz"))));

The program runs to completion and the map-reduce job is declared successful
every time, even if the code is changed to
lp = new LexicalizedParser(new ObjectInputStream(new
GZIPInputStream(parse.class.getResourceAsStream("/englishPCF_G.ser.gz"))));

This suggests that getResourceAsStream does not throw an exception when the
file is missing; it just returns null, I guess.
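
(A null check before wrapping the stream would make the failure visible; a
minimal sketch, reusing the same 'parse' class:)

    // getResourceAsStream returns null for a missing resource
    // rather than throwing an exception.
    java.io.InputStream in =
        parse.class.getResourceAsStream("/englishPCFG.ser.gz");
    if (in == null) {
        throw new RuntimeException("englishPCFG.ser.gz not found on classpath");
    }
    lp = new LexicalizedParser(new java.io.ObjectInputStream(
        new java.util.zip.GZIPInputStream(in)));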

Any ideas?


Kevin Peterson-3 wrote:
> 
> On Sat, Apr 18, 2009 at 5:18 AM, hari939  wrote:
> 
>>
>> My project of parsing through material for a semantic search engine
>> requires me to use the Stanford NLP parser
>> (http://nlp.stanford.edu/software/lex-parser.shtml) on a Hadoop cluster.
>>
>> To use the Stanford NLP parser, one must create a lexical parser object
>> using an englishPCFG.ser.gz file as a constructor's parameter.
>> I have tried loading the file onto HDFS in the /user/root/ folder
>> and have also tried packing the file into the jar of the Java program.
> 
> 
> Use getResourceAsStream to read it from the jar.
> 
> Use the ObjectInputStream constructor.
> 
> That is, new LexicalizedParser(new ObjectInputStream(new
> GZIPInputStream(ClassName.class.getResourceAsStream("/englishPCFG.ser.gz"))))
> 
> I'm interested to know if you have found any other open source parsers in
> Java, or at least with Java bindings.
> 
> 


Re: Using the Stanford NLP with hadoop

Posted by Kevin Peterson <kp...@biz360.com>.
On Sat, Apr 18, 2009 at 5:18 AM, hari939 <ha...@gmail.com> wrote:

>
> My project of parsing through material for a semantic search engine
> requires me to use the Stanford NLP parser
> (http://nlp.stanford.edu/software/lex-parser.shtml) on a Hadoop cluster.
>
> To use the Stanford NLP parser, one must create a lexical parser object
> using an englishPCFG.ser.gz file as a constructor's parameter.
> I have tried loading the file onto HDFS in the /user/root/ folder
> and have also tried packing the file into the jar of the Java program.


Use getResourceAsStream to read it from the jar.

Use the ObjectInputStream constructor.

That is, new LexicalizedParser(new ObjectInputStream(new
GZIPInputStream(ClassName.class.getResourceAsStream("/englishPCFG.ser.gz"))))
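
For a map-reduce job, the parser would typically be built once per task
rather than once per record; a rough sketch against the old mapred API (the
mapper name and key/value types are just for illustration):

    import java.io.IOException;
    import java.io.ObjectInputStream;
    import java.util.zip.GZIPInputStream;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.*;
    import edu.stanford.nlp.parser.lexparser.LexicalizedParser;

    public class ParseMapper extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, Text> {

        private LexicalizedParser lp;

        @Override
        public void configure(JobConf job) {
            try {
                // Load the serialized grammar from the job jar exactly once
                // per task, not once per input record.
                lp = new LexicalizedParser(new ObjectInputStream(
                    new GZIPInputStream(ParseMapper.class
                        .getResourceAsStream("/englishPCFG.ser.gz"))));
            } catch (Exception e) {
                throw new RuntimeException("could not load englishPCFG.ser.gz", e);
            }
        }

        public void map(LongWritable key, Text value,
                        OutputCollector<Text, Text> out, Reporter reporter)
                throws IOException {
            // ... parse value.toString() with lp and emit results ...
        }
    }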

I'm interested to know if you have found any other open source parsers in
Java, or at least with Java bindings.

Re: Using the Stanford NLP with hadoop

Posted by Bradford Stephens <br...@gmail.com>.
Greetings,

There's a way you can distribute files along with your MR job as part
of a "payload", or you could save the file in the same spot on every
machine of your cluster with some rsyncing, and hard-code loading it.

This may be of some help:
http://hadoop.apache.org/core/docs/r0.18.2/api/org/apache/hadoop/filecache/DistributedCache.html
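
For what it's worth, if the driver goes through ToolRunner, the same
mechanism is available from the command line via the generic -files option; a
sketch, with the class name made up and assuming a Hadoop version whose
GenericOptionsParser supports -files:

    import org.apache.hadoop.conf.Configured;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.util.Tool;
    import org.apache.hadoop.util.ToolRunner;

    // Implementing Tool lets ToolRunner strip generic options such as
    // -files, which copies the listed files into the DistributedCache.
    public class ParseTool extends Configured implements Tool {
        public int run(String[] args) throws Exception {
            JobConf conf = new JobConf(getConf(), ParseTool.class);
            // ... configure mapper, input/output paths from args ...
            JobClient.runJob(conf);
            return 0;
        }

        public static void main(String[] args) throws Exception {
            System.exit(ToolRunner.run(new ParseTool(), args));
        }
    }

invoked with something like:

    hadoop jar parse.jar ParseTool -files /home/hari/englishPCFG.ser.gz <in> <out>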

On Sat, Apr 18, 2009 at 5:18 AM, hari939 <ha...@gmail.com> wrote:
>
> My project of parsing through material for a semantic search engine
> requires me to use the Stanford NLP parser
> (http://nlp.stanford.edu/software/lex-parser.shtml) on a Hadoop cluster.
>
> To use the Stanford NLP parser, one must create a lexical parser object
> using an englishPCFG.ser.gz file as a constructor's parameter.
> I have tried loading the file onto HDFS in the /user/root/ folder
> and have also tried packing the file into the jar of the Java program.
>
> I am new to the Hadoop platform and am not very familiar with some of its
> salient features.
>
> Looking forward to any form of help.