Posted to user@spark.apache.org by Andrew Davidson <ae...@ucsc.edu.INVALID> on 2021/12/21 01:21:33 UTC

??? INFO CreateViewCommand:57 - Try to uncache `rawCounts` before replacing.

Happy Holidays

I am a newbie

I have 16,000 data files; all files have the same number of rows and columns, and the row ids are identical and in the same order. I want to create a new data frame that contains the 3rd column from each data file. My PySpark script runs correctly when I test on a small number of files, however I get an OOM when I run on all 16,000.

To try and debug, I ran a small test and set the log level to INFO. I found the following:

2021-12-21 00:47:04 INFO  CreateViewCommand:57 - Try to uncache `rawCounts` before replacing.

        for i in range(1, len(self.sampleNamesList)):
            sampleName = self.sampleNamesList[i]

            # select the key and counts from the sample
            qsdf = quantSparkDFList[i]
            sampleSDF = qsdf \
                .select(["Name", "NumReads"]) \
                .withColumnRenamed("NumReads", sampleName)

            sampleSDF.createOrReplaceTempView("sample")

            # The sample name must be back-quoted, or column names
            # containing a '-' (like GTEX-1117F-0426-SM-5EGHI) raise an
            # error: Spark parses the '-' as an expression. '_' is also
            # a special character for the SQL LIKE operator.
            # https://stackoverflow.com/a/63899306/4586180
            sqlStmt = '''select rc.*, `{}`
                         from
                             rawCounts as rc,
                             sample
                         where
                             rc.Name == sample.Name'''.format(sampleName)

            rawCountsSDF = self.spark.sql(sqlStmt)
            rawCountsSDF.createOrReplaceTempView("rawCounts")

The way I wrote my script, I do a lot of transformations; the first action comes at the end of the script:

    retCountDF.coalesce(1).write.csv(outfileCount, mode='overwrite', header=True)

Should I be calling self.spark.sql("UNCACHE TABLE rawCounts") before calling rawCountsSDF.createOrReplaceTempView("rawCounts")? I expected Spark to manage the cache automatically, given that I do not explicitly call cache().
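
For reference, a minimal sketch of explicit cleanup with the standard Catalog and SQL APIs (note the view is named "rawCounts"; rawCountsSDF is only the Python variable):

    # Drop the previous temp view before re-registering it. This, like
    # UNCACHE TABLE, only releases cached data if cache()/persist()/
    # CACHE TABLE was actually used, which is not the case here.
    self.spark.catalog.dropTempView("rawCounts")
    # or, via SQL, release only the cached data and keep the view:
    # self.spark.sql("UNCACHE TABLE rawCounts")
    rawCountsSDF.createOrReplaceTempView("rawCounts")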


Why do I not get a similar warning from this call?
            sampleSDF.createOrReplaceTempView( "sample" )

Will this reduce my memory requirements?


Kind regards

Andy

Re: ??? INFO CreateViewCommand:57 - Try to uncache `rawCounts` before replacing.

Posted by Sean Owen <sr...@gmail.com>.
16,000 joins are never going to work out, though you could do them all at
once and avoid the immediate issue. If they really are the same rows in the
same order, maybe you can read the files as lines of text and use zip().
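
A minimal sketch of that idea (paths, delimiter, and column positions are hypothetical; RDD.zip() requires both RDDs to have the same number of partitions and the same number of elements per partition, which identically shaped files read the same way should satisfy):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    sc = spark.sparkContext

    paths = ["sample1.tsv", "sample2.tsv"]  # ... all 16,000 files

    # Seed with the row ids (1st column) of the first file.
    combined = sc.textFile(paths[0]).map(lambda line: [line.split("\t")[0]])

    for p in paths:
        # Extract the 3rd column of each file and append it with zip().
        col3 = sc.textFile(p).map(lambda line: line.split("\t")[2])
        combined = combined.zip(col3).map(lambda pair: pair[0] + [pair[1]])

Even so, chaining 16,000 zips builds a very deep lineage, so periodic checkpointing (sc.setCheckpointDir(...) plus combined.checkpoint()) would likely still be needed.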


Re: ??? INFO CreateViewCommand:57 - Try to uncache `rawCounts` before replacing.

Posted by Andrew Davidson <ae...@ucsc.edu.INVALID>.
Hi Jun

Thank you for your reply. My question is: what is best practice? My for
loop runs over 16,000 joins, and I get an out-of-memory exception.

What is the intended use of createOrReplaceTempView if I need to manage the
cache or create a unique name each time?



Kind regards

Andy


Re: ??? INFO CreateViewCommand:57 - Try to uncache `rawCounts` before replacing.

Posted by Jun Zhu <ju...@vungle.com.INVALID>.
Hi

As far as I know, the warning is caused by creating temp views with the
same name: rawCountsSDF.createOrReplaceTempView( "rawCounts" ).
You create a view "rawCounts"; then, on the second round of the for loop,
you create a new view named "rawCounts", and Spark 3 uncaches the
previous "rawCounts".

Correct me if I'm wrong.

Regards

