Posted to user@spark.apache.org by ๏̯͡๏ <ÐΞ€ρ@Ҝ>, de...@gmail.com on 2015/07/01 17:01:31 UTC

StorageLevel.MEMORY_AND_DISK_SER

How do I persist an RDD using StorageLevel.MEMORY_AND_DISK_SER?


-- 
Deepak

Re: StorageLevel.MEMORY_AND_DISK_SER

Posted by Raghavendra Pandey <ra...@gmail.com>.
For that you need to change the serialize and deserialize behavior of your
class.
Preferably, you can use Kryo serializers and override the behavior.
For details you can look at
https://github.com/EsotericSoftware/kryo/blob/master/README.md
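As a hedged sketch of the suggestion above (the class and app names here are hypothetical, not from this thread), enabling Kryo in Spark might look like:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical record class standing in for whatever the RDD stores.
case class DetailRecord(id: Long, payload: String)

val conf = new SparkConf()
  .setAppName("kryo-example")
  // Switch persisted/shuffled data from Java serialization to Kryo.
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  // Registering classes keeps Kryo from writing full class names per object.
  .registerKryoClasses(Array(classOf[DetailRecord]))

val sc = new SparkContext(conf)
```

With this configuration, MEMORY_AND_DISK_SER blocks are serialized by Kryo rather than by default Java serialization.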

Re: StorageLevel.MEMORY_AND_DISK_SER

Posted by ๏̯͡๏ <ÐΞ€ρ@Ҝ>, de...@gmail.com.
I originally assumed that persisting is similar to writing, but it's not.
Hence I want to change the behavior of intermediate persists.


Re: StorageLevel.MEMORY_AND_DISK_SER

Posted by Raghavendra Pandey <ra...@gmail.com>.
So do you want to change the behavior of the persist API, or write the RDD to
disk?

Re: StorageLevel.MEMORY_AND_DISK_SER

Posted by ๏̯͡๏ <ÐΞ€ρ@Ҝ>, de...@gmail.com.
I think I want to use persist then, and write my intermediate RDDs to
disk+mem.


Re: StorageLevel.MEMORY_AND_DISK_SER

Posted by Raghavendra Pandey <ra...@gmail.com>.
I think the persist API is internal to the RDD, whereas the write API is for
saving content on disk.
RDD persist will dump your object bytes, serialized, on the disk. If you want
to change that behavior, you need to override the serialization of the class
that you are storing in the RDD.
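A hedged sketch of what overriding a class's serialization could look like with Kryo's KryoSerializable interface (DetailRecord is a hypothetical stand-in for the class stored in the RDD, not a class from this thread):

```scala
import com.esotericsoftware.kryo.{Kryo, KryoSerializable}
import com.esotericsoftware.kryo.io.{Input, Output}

// Hypothetical record whose serialized form is written field by field,
// instead of relying on Kryo's default FieldSerializer.
class DetailRecord(var id: Long, var payload: String) extends KryoSerializable {
  def this() = this(0L, null) // Kryo needs a no-arg constructor

  override def write(kryo: Kryo, output: Output): Unit = {
    output.writeLong(id)
    output.writeString(payload)
  }

  override def read(kryo: Kryo, input: Input): Unit = {
    id = input.readLong()
    payload = input.readString()
  }
}
```

Note that this override only takes effect for persisted blocks when Spark is configured to use the Kryo serializer (spark.serializer set to org.apache.spark.serializer.KryoSerializer).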

Re: StorageLevel.MEMORY_AND_DISK_SER

Posted by ๏̯͡๏ <ÐΞ€ρ@Ҝ>, de...@gmail.com.
This is my write API. How do I integrate it here?


 // Assumed imports for this snippet (DetailOutputRecord, SchemaUtil, and
 // _detail come from the surrounding codebase):
 // import org.apache.avro.generic.GenericRecord
 // import org.apache.avro.mapred.AvroKey
 // import org.apache.avro.mapreduce.{AvroJob, AvroKeyOutputFormat}
 // import org.apache.hadoop.io.NullWritable
 // import org.apache.hadoop.mapreduce.Job
 // import org.apache.spark.rdd.RDD

 protected def writeOutputRecords(detailRecords:
RDD[(AvroKey[DetailOutputRecord], NullWritable)], outputDir: String) {
    val writeJob = new Job()
    val schema = SchemaUtil.outputSchema(_detail)
    AvroJob.setOutputKeySchema(writeJob, schema)
    // Reduce the number of output files before writing.
    val outputRecords = detailRecords.coalesce(100)
    outputRecords.saveAsNewAPIHadoopFile(outputDir,
      classOf[AvroKey[GenericRecord]],
      classOf[org.apache.hadoop.io.NullWritable],
      classOf[AvroKeyOutputFormat[GenericRecord]],
      writeJob.getConfiguration)
  }
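A hedged sketch of how persist could be combined with the write method above, assuming detailRecords is reused by more than one downstream action (names as in the snippet):

```scala
import org.apache.spark.storage.StorageLevel

// Cache the intermediate RDD in serialized form, spilling to disk when
// memory is tight, then reuse it for the Avro write and any other actions.
val persisted = detailRecords.persist(StorageLevel.MEMORY_AND_DISK_SER)

writeOutputRecords(persisted, outputDir) // existing write path, unchanged
// ... other actions that reuse `persisted` ...

persisted.unpersist() // release the cached blocks when done
```

If the RDD is only ever written once and never reused, persisting it buys nothing; persist pays off when the same intermediate result feeds multiple actions.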


Re: StorageLevel.MEMORY_AND_DISK_SER

Posted by Koert Kuipers <ko...@tresata.com>.
rdd.persist(StorageLevel.MEMORY_AND_DISK_SER)
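Spelled out with its import (rdd here stands for any already-built RDD):

```scala
import org.apache.spark.storage.StorageLevel

val persisted = rdd.persist(StorageLevel.MEMORY_AND_DISK_SER)
// persist only marks the RDD; the first action materializes the cache.
persisted.count()
```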
