Posted to hdfs-user@hadoop.apache.org by qiaoresearcher <qi...@gmail.com> on 2014/04/24 19:58:12 UTC

hadoop+python+text mining

I have Hadoop and Python installed with NLTK. Now I have a large input
file which has three columns:
column 1  | column 2 | column 3
positive    id1        some tweet message
negative    id2        other tweet message
positive    id3        tweet message
negative    id4        tweet message
positive    id5        tweet message
....        ...        ....

I want to use text mining to construct TF-IDF vectors from the tweet
messages (also removing stop words, stemming, etc.) and then use some
classifier to classify each tweet message as positive or negative. I know
how to do it using just Python and NLTK. But how do I do the same thing on
Hadoop?

thanks!

Re: hadoop+python+text mining

Posted by Peyman Mohajerian <mo...@gmail.com>.
At a high level, I think you have these choices and more:
1) Hadoop Streaming: you can leverage some of your Python code, but not
all, because you have to deal with map/reduce.
2) Use Mahout.
3) Use a distro of R that works with Hadoop.
..
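
To make option 1 a bit more concrete, here is a minimal sketch of one map/reduce step toward TF-IDF: counting document frequency, i.e. how many tweets each term appears in. It is only an illustration under some assumptions, not a complete pipeline: the three columns are assumed to be whitespace-separated with no header line, the tiny stop-word list and regex tokenizer are placeholders for whatever NLTK stop words and stemmer you already use, and the names map_lines, reduce_lines and tokenize are made up for this example.

```python
import re
import sys
from itertools import groupby

# Tiny illustrative stop-word list (an assumption; NLTK's list could be used instead).
STOP_WORDS = {"a", "an", "the", "is", "to", "and", "of"}
TOKEN_RE = re.compile(r"[a-z0-9']+")


def tokenize(text):
    """Lowercase the text, split it into word tokens, and drop stop words."""
    return [t for t in TOKEN_RE.findall(text.lower()) if t not in STOP_WORDS]


def map_lines(lines):
    """Mapper: emit 'term<TAB>1' once per distinct term in each tweet."""
    for line in lines:
        parts = line.strip().split(None, 2)  # label, id, message
        if len(parts) < 3:
            continue  # skip malformed or empty lines
        _label, _doc_id, message = parts
        for term in sorted(set(tokenize(message))):
            yield f"{term}\t1"


def reduce_lines(lines):
    """Reducer: sum the counts per term; input must already be sorted by term."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in lines)
    for term, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{term}\t{sum(int(count) for _, count in group)}"


if __name__ == "__main__" and len(sys.argv) == 2 and sys.argv[1] in ("map", "reduce"):
    # Under Hadoop Streaming the script reads stdin and writes stdout.
    step = map_lines if sys.argv[1] == "map" else reduce_lines
    for out in step(sys.stdin):
        print(out)
```

With Hadoop Streaming, the same script would be passed as the -mapper command (with the argument "map") and as the -reducer command (with the argument "reduce"); Hadoop sorts the mapper output by key before the reducer sees it, which is what the groupby call relies on. Term frequencies per tweet and the final TF-IDF weights would need further steps in the same style.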


On Thu, Apr 24, 2014 at 1:58 PM, qiaoresearcher <qi...@gmail.com> wrote:

> I have Hadoop and python installed with nltk. Now I have an large input
> file which has three columns:
> column 1  | column 2 | column 3
> positive         id1          some tweet message
> negative       id2          other tweet message
> positive         id3          tweet message
> negative       id4          tweet message
> positive         id5          tweet message
> ....                    ...                ....
>
> I want to use text mining to construct TFIDF vectors from the tweet
> messages (also use stop words, stem, etc) and then use some classifier to
> classify tweet message as positive or negative. I know how to do it just
> using python and nltk. But how to do the same thing on hadoop?
>
> thanks!
>
>
>
