Posted to user@spark.apache.org by mi...@cloudtechnologypartners.co.uk on 2016/02/10 21:14:12 UTC

retrieving all the rows with collect()

 

Hi,

I have a bunch of files stored in the HDFS directory /unix_files, 319
files in total.

scala> val errlog = sc.textFile("/unix_files/*.ksh")

scala> errlog.filter(line => line.contains("sed"))count()
res104: Long = 1113

So it returns 1113 lines containing "sed".

If I want to see the collection I can do

scala> errlog.filter(line => line.contains("sed"))collect()

res105: Array[String] = Array(" DSQUERY=${1} ; DBNAME=${2} ; ERROR=0 ;
PROGNAME=$(basename $0 | sed -e s/.ksh//)", # . in environment based on
argument for script., " exec sp_spaceused", " exec sp_spaceused",
PROGNAME=$(basename $0 | sed -e s/.ksh//), " BACKUPSERVER=$5 # Server
that is used to load the transaction dump", " BACKUPSERVER=$5 # Server
that is used to load the transaction dump", " BACKUPSERVER=$5 # Server
that is used to load the transaction dump", " cat
$TMPDIR/${DBNAME}_trandump.sql | sed s/${DSQUERY}/${REMOTESERVER}/ >
$TMPDIR/${DBNAME}_trandump.tmpsql", cat
$TMPDIR/${DBNAME}_tran_transfer.sql | sed s/${DSQUERY}/${REMOTESERVER}/
> $TMPDIR/${DBNAME}_tran_transfer.tmpsql, PROGNAME=$(basename $0 | sed -e s/.ksh//), " B...
scala>

Is there any way I can retrieve all of these instances? Or are they
perhaps all there, wrapped up, and I only see a few lines?

Thanks,

Mich

 

Re: retrieving all the rows with collect()

Posted by Ted Yu <yu...@gmail.com>.
Mich:
When you execute the statements in the Spark shell, you can see the
types of the intermediate results.

scala> val errlog = sc.textFile("/home/john/s.out")
errlog: org.apache.spark.rdd.RDD[String] = /home/john/s.out MapPartitionsRDD[1] at textFile at <console>:24

scala> val sed = errlog.filter(line => line.contains("sed"))
sed: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[2] at filter at <console>:26

scala> sed.collect()
res0: Array[String] = Array([WARNING] Unrecognised ...

Cheers
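
The types printed by the shell show which calls return another RDD and
which return a plain Scala value on the driver. Below is a minimal
sketch of the same steps with the types written out. It assumes Spark
is on the classpath; the local master, the application name, the object
name and the in-memory sample lines are assumptions made for
illustration and stand in for the files read with sc.textFile. In
spark-shell the predefined `sc` could be used instead of creating a
context.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object IntermediateTypes {
  // Runs the pipeline on sample data and returns (count, collected rows).
  def run(): (Long, Array[String]) = {
    // Assumption: a local, in-process context created only for this sketch.
    val conf = new SparkConf().setMaster("local[*]").setAppName("intermediate-types")
    val sc = new SparkContext(conf)
    try {
      // Hypothetical sample lines standing in for sc.textFile(...).
      val errlog: RDD[String] = sc.parallelize(Seq(
        "PROGNAME=$(basename $0 | sed -e s/.ksh//)",
        "echo done",
        "cat in.sql | sed s/a/b/ > out.sql"
      ))

      // filter is a transformation: it returns a new RDD.
      val sed: RDD[String] = errlog.filter(line => line.contains("sed"))

      // count and collect are actions: they return values to the driver.
      val n: Long = sed.count()
      val rows: Array[String] = sed.collect()
      (n, rows)
    } finally {
      sc.stop()
    }
  }

  def main(args: Array[String]): Unit = {
    val (n, rows) = run()
    println(s"$n matching lines")
    rows.foreach(println)
  }
}
```

Naming the intermediate values, as in the session above, makes it easy
to see where the data stops being an RDD and becomes a local Array.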


Re: retrieving all the rows with collect()

Posted by Mich Talebzadeh <mi...@cloudtechnologypartners.co.uk>.
 

Thanks Jakob, much appreciated.

Mich 


Re: retrieving all the rows with collect()

Posted by Jakob Odersky <ja...@odersky.com>.
Exactly!
As a final note, `foreach` is also defined on RDDs. This means that
you don't need to `collect()` the results into an array (which could
give you an OutOfMemoryError in case the RDD is really really large)
before printing them.

Personally, when I learn to use a new library, I like to look at its
Scaladoc (http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.rdd.RDD
for Spark) and test it in the REPL/worksheets (for Spark you already
have `spark-shell`).

best,
--Jakob
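
To illustrate the point about `foreach` on RDDs, here is a minimal
sketch that prints matching lines without collecting the whole result,
and uses `take(n)` to bring only a bounded number of rows to the
driver. It assumes Spark is on the classpath; the local master, the
object and method names and the sample lines are assumptions made for
illustration. Note that on a cluster `rdd.foreach(println)` writes to
the executors' stdout rather than to the driver's console; in local
mode, as here, both end up in the same console.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object PrintWithoutCollect {
  // Prints every matching line from the RDD itself, then returns at most
  // `n` matching lines to the caller.
  def firstMatches(n: Int): Array[String] = {
    // Assumption: a local, in-process context created only for this sketch.
    val conf = new SparkConf().setMaster("local[*]").setAppName("print-without-collect")
    val sc = new SparkContext(conf)
    try {
      // Hypothetical sample lines standing in for sc.textFile(...).
      val errlog = sc.parallelize(Seq(
        "PROGNAME=$(basename $0 | sed -e s/.ksh//)",
        "echo done",
        "cat in.sql | sed s/a/b/ > out.sql"
      ))
      val sed = errlog.filter(line => line.contains("sed"))

      // foreach is defined on the RDD, so no local array is built here.
      sed.foreach(line => println(line))

      // take(n) transfers only up to n rows to the driver.
      sed.take(n)
    } finally {
      sc.stop()
    }
  }

  def main(args: Array[String]): Unit = {
    firstMatches(1).foreach(println)
  }
}
```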


---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
For additional commands, e-mail: user-help@spark.apache.org


Re: retrieving all the rows with collect()

Posted by Mich Talebzadeh <mi...@cloudtechnologypartners.co.uk>.
 

Many thanks Jakob. 

So it basically boils down to this form, with the periods added as
suggested, which looks clearer:

val errlog = sc.textFile("/unix_files/*.ksh")
errlog.filter(line => line.contains("sed")).collect().foreach(line => println(line))

Regards, 

Mich 


-- 

Dr Mich Talebzadeh

LinkedIn
https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

http://talebzadehmich.wordpress.com

NOTE: The information in this email is proprietary and confidential.
This message is for the designated recipient only, if you are not the
intended recipient, you should destroy it immediately. Any information
in this message shall not be understood as given or endorsed by Cloud
Technology Partners Ltd, its subsidiaries or their employees, unless
expressly so stated. It is the responsibility of the recipient to ensure
that this email is virus free, therefore neither Cloud Technology
partners Ltd, its subsidiaries nor their employees accept any
responsibility.

 


Re: retrieving all the rows with collect()

Posted by Jakob Odersky <ja...@odersky.com>.
Hi Mich,
your assumptions 1 to 3 are all correct (nitpick: they're method
*calls*, the methods being the part before the parentheses, but I
assume that's what you meant). The last one is also a method call but
uses syntactic sugar on top: `foreach(println)` boils down to
`foreach(line => println(line))`.

On an unrelated side note, I would suggest putting a period before
every method call; it makes things easier to read and is actually
required in certain circumstances. Specifically, I would add a period
before collect() and foreach().

best,
--Jakob
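
The equivalence described above can be tried without Spark, since
`filter` and `foreach` are also defined on ordinary Scala collections.
This is a minimal sketch; the object name, the helper `matches` and the
sample lines are assumptions made for illustration.

```scala
object MethodCallSyntax {
  // Keeps the lines that contain the substring "sed".
  def matches(lines: Seq[String]): Seq[String] =
    lines.filter(line => line.contains("sed"))

  def main(args: Array[String]): Unit = {
    // Hypothetical sample lines.
    val lines = Seq("echo start", "basename $0 | sed -e s/.ksh//", "echo done")

    // Explicit function literal passed to foreach.
    matches(lines).foreach(line => println(line))

    // Shorthand: the method println is turned into a function automatically.
    matches(lines).foreach(println)
  }
}
```

Both `foreach` calls print the same lines; the dotted form
`a.filter(...).foreach(...)` is used throughout.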




Re: retrieving all the rows with collect()

Posted by Mich Talebzadeh <mi...@cloudtechnologypartners.co.uk>.
 

Hi Chandeep 

Many thanks for your help 

In the line below 

errlog.filter(line => line.contains("sed"))collect()foreach(println) 

Can you please clarify the components with the correct naming, as I am
new to Scala?

   1. errlog --> is the RDD?
   2. filter(line => line.contains("sed")) is a method
   3. collect() is another method?
   4. foreach(println)?

Thanks 



 

Re: retrieving all the rows with collect()

Posted by Chandeep Singh <ch...@gmail.com>.
Hi Mich,

If you would like to print everything to the console you could use

errlog.filter(line => line.contains("sed"))collect()foreach(println)

or you could always save to a file using any of the saveAs methods.

Thanks,
Chandeep
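
As one possible reading of "any of the saveAs methods", the sketch
below uses `saveAsTextFile`, which writes the RDD as a directory of
part files. It assumes Spark is on the classpath; the local master, the
object name, the sample lines and the temporary output path are
assumptions made for illustration. The output directory must not exist
before the call.

```scala
import java.nio.file.Files
import org.apache.spark.{SparkConf, SparkContext}

object SaveMatches {
  // Writes the matching lines under `outputDir` and returns how many there were.
  def run(outputDir: String): Long = {
    // Assumption: a local, in-process context created only for this sketch.
    val conf = new SparkConf().setMaster("local[*]").setAppName("save-matches")
    val sc = new SparkContext(conf)
    try {
      // Hypothetical sample lines standing in for sc.textFile(...).
      val errlog = sc.parallelize(Seq(
        "PROGNAME=$(basename $0 | sed -e s/.ksh//)",
        "echo done",
        "cat in.sql | sed s/a/b/ > out.sql"
      ))
      val sed = errlog.filter(line => line.contains("sed"))

      // Creates `outputDir` and writes one part file per partition.
      sed.saveAsTextFile(outputDir)
      sed.count()
    } finally {
      sc.stop()
    }
  }

  def main(args: Array[String]): Unit = {
    // Hypothetical output location: a path that does not exist yet.
    val out = Files.createTempDirectory("sed-matches").resolve("out").toString
    val n = run(out)
    println(s"wrote $n lines under $out")
  }
}
```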
