Posted to user@spark.apache.org by SK <sk...@gmail.com> on 2014/09/23 23:04:11 UTC

HdfsWordCount only counts some of the words

Hi,

I tried out the HdfsWordCount program in the Streaming module on a cluster.
Based on the output, I find that it counts only a few of the words. How can
I have it count all the words in the text? I have only one text file in the
directory.

thanks



--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/HdfsWordCount-only-counts-some-of-the-words-tp14929.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


Re: HdfsWordCount only counts some of the words

Posted by "aka.fe2s" <ak...@gmail.com>.
I guess it's because this example is stateless, so it outputs counts only
for the given batch's RDD. Take a look at the stateful word counter,
StatefulNetworkWordCount.scala.
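
For reference, a rough sketch of the stateful approach: updateStateByKey keeps a
running total per word across batches and needs a checkpoint directory. The class
name, batch interval and checkpoint path below are placeholders, not taken from
the actual example.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.StreamingContext._

object StatefulHdfsWordCount {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("StatefulHdfsWordCount")
    val ssc = new StreamingContext(conf, Seconds(10))
    // updateStateByKey keeps per-key state across batches, so checkpointing is required
    ssc.checkpoint("hdfs:///tmp/checkpoint")   // placeholder path

    // Add this batch's counts for a word to its previous running total
    val updateFunc = (newCounts: Seq[Int], state: Option[Int]) =>
      Some(newCounts.sum + state.getOrElse(0))

    val lines = ssc.textFileStream(args(0))   // watch an HDFS directory for new files
    val wordCounts = lines.flatMap(_.split(" "))
                          .map(word => (word, 1))
                          .updateStateByKey[Int](updateFunc)
    wordCounts.print()   // note: still shows only the first elements of each batch

    ssc.start()
    ssc.awaitTermination()
  }
}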


Re: HdfsWordCount only counts some of the words

Posted by Sean Owen <so...@cloudera.com>.
If you look at the code for HdfsWordCount, you see it calls print(), which
by default prints only the first 10 elements of each RDD. If you are just
talking about the console output, then it is not expected to print all the
words to begin with.
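
If you actually want to see every count, one option (a sketch, not something
the example does itself) is to replace the print() call with an output
operation that writes the whole RDD; the output path below is a placeholder:

// wordCounts is the (word, count) DStream the example builds
wordCounts.saveAsTextFiles("hdfs:///tmp/wordcounts")   // one output dir per batch

// or, for small results, pull each batch back to the driver and print it all
wordCounts.foreachRDD { rdd =>
  rdd.collect().foreach(println)
}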


RE: HdfsWordCount only counts some of the words

Posted by SK <sk...@gmail.com>.
I execute it as follows:

$SPARK_HOME/bin/spark-submit   --master <master url>  --class 
org.apache.spark.examples.streaming.HdfsWordCount 
target/scala-2.10/spark_stream_examples-assembly-1.0.jar  <hdfsdir>

After I start the job, I add a new test file to hdfsdir. It is a large text
file which I cannot copy here, but it probably has at least 100 distinct
words. However, the streaming output shows only about 5-6 words along with
their counts, as follows. I then stop the job after some time.

Time ...

(word1, cnt1)
(word2, cnt2)
(word3, cnt3)
(word4, cnt4)
(word5, cnt5)

Time ...

Time ...




--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/HdfsWordCount-only-counts-some-of-the-words-tp14929p14967.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


RE: HdfsWordCount only counts some of the words

Posted by "Liu, Raymond" <ra...@intel.com>.
It should count all the words, so you probably need to post more details on
how you run it, along with the logs and output.
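
For context, the heart of the bundled example is roughly the following
simplified sketch (the two-second batch interval is an assumption): it watches
the given HDFS directory and, for each batch, counts the words in whatever new
files appeared.

val sparkConf = new SparkConf().setAppName("HdfsWordCount")
val ssc = new StreamingContext(sparkConf, Seconds(2))

// Only files created in the directory after the job starts are picked up
val lines = ssc.textFileStream(args(0))
val words = lines.flatMap(_.split(" "))
val wordCounts = words.map(x => (x, 1)).reduceByKey(_ + _)
wordCounts.print()

ssc.start()
ssc.awaitTermination()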

Best Regards,
Raymond Liu

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
For additional commands, e-mail: user-help@spark.apache.org