Posted to dev@systemml.apache.org by Aishwarya Chaurasia <ai...@gmail.com> on 2017/04/23 16:53:32 UTC

Re: Please reply asap : Regarding incubator systemml/breast_cancer project

Hey,

Thank you so much for your help, sir. We were finally able to run
preprocess.py without any errors, and the results were satisfactory,
i.e. we got five sets of DataFrames, as you said we would.

But alas! When we tried to run MachineLearning.ipynb, the same NameError
came up: https://paste.fedoraproject.org/paste/l3LFJreg~vnYEDTSTQH73l5M1UNdIGYhyRLivL9gydE=

Could you guide us again as to how to proceed now?
Also, could you please provide an overview of the process
MachineLearning.ipynb follows to train on the samples?
We have also tried every solution we could find to remove the NameError
for sc. It would be really kind of you if you looked into the matter asap.

Thanks a lot!

On 22-Apr-2017 5:19 PM, "Aishwarya Chaurasia" <ai...@gmail.com>
wrote:

> Hey,
>
> Thank you so much for your help, sir. We were finally able to run
> preprocess.py without any errors, and the results were satisfactory,
> i.e. we got five sets of DataFrames, as you said we would.
>
> But alas! When we tried to run MachineLearning.ipynb, the same NameError
> came up: https://paste.fedoraproject.org/paste/l3LFJreg~vnYEDTSTQH73l5M1UNdIGYhyRLivL9gydE=
>
> Could you guide us again as to how to proceed now?
> Also, could you please provide an overview of the process
> MachineLearning.ipynb follows to train on the samples?
>
> Thanks a lot!
>
> On 20-Apr-2017 12:16 AM, <du...@gmail.com> wrote:
>
>> Hi Aishwarya,
>>
>> Looks like you've just encountered an out of memory error on one of the
>> executors.  Therefore, you just need to adjust the `spark.executor.memory`
>> and `spark.driver.memory` settings with higher amounts of RAM.  What is
>> your current setup?  I.e. are you using a cluster of machines, or a single
>> machine?  We generally use a large driver on one machine, and then a single
>> large executor on each other machine.  I would give a sizable amount of
>> memory to the driver, and about half the possible memory on the executors
>> so that the Python processes have enough memory as well.  PySpark has JVM
>> and Python components, and the Spark memory settings only pertain to the
>> JVM side, thus the need to save about half the executor memory for the
>> Python side.
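
The memory advice above can be sketched as a small helper that builds a
`spark-submit` command. This is only an illustration: the function name
`spark_submit_command` and the 16 GB / 32 GB figures are assumptions, not
values from this thread. `--driver-memory` and `--executor-memory` are the
`spark-submit` flags that correspond to `spark.driver.memory` and
`spark.executor.memory`.

```python
import shlex


def spark_submit_command(script, driver_gb, machine_gb,
                         spark_submit="spark-submit"):
    """Build a spark-submit command (hypothetical helper).

    The executor gets about half of a worker machine's RAM, leaving the
    rest for the Python worker processes, as suggested above.
    """
    executor_gb = max(1, machine_gb // 2)
    return [
        spark_submit,
        "--driver-memory", f"{driver_gb}g",
        "--executor-memory", f"{executor_gb}g",
        script,
    ]


# Assumed sizes for illustration: 16 GB for the driver, 32 GB per worker.
cmd = spark_submit_command("preprocess.py", driver_gb=16, machine_gb=32)
print(" ".join(shlex.quote(part) for part in cmd))
```

The right numbers depend on the actual hardware, so treat the halving rule
as a starting point rather than a fixed setting.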
>>
>> Thanks!
>>
>> - Mike
>>
>> --
>>
>> Mike Dusenberry
>> GitHub: github.com/dusenberrymw
>> LinkedIn: linkedin.com/in/mikedusenberry
>>
>> Sent from my iPhone.
>>
>>
>> > On Apr 19, 2017, at 5:53 AM, Aishwarya Chaurasia <aishwarya2612@gmail.com> wrote:
>> >
>> > Hello sir,
>> >
>> > We also wanted to ensure that the spark-submit command we're using is the
>> > correct one for running 'preprocess.py'.
>> > Command :  /home/new/sparks/bin/spark-submit preprocess.py
>> >
>> >
>> > Thank you.
>> > Aishwarya Chaurasia.
>> >
>> > On 19-Apr-2017 3:55 PM, "Aishwarya Chaurasia" <ai...@gmail.com>
>> > wrote:
>> >
>> > Hello sir,
>> > On running the file preprocess.py we are getting the following error :
>> >
>> > https://paste.fedoraproject.org/paste/IAvqiiyJChSC0V9eeETe2F5M1UNdIGYhyRLivL9gydE=
>> >
>> > Can you please help us by looking into the error and kindly telling us
>> > the solution for it?
>> > Thanks a lot.
>> > Aishwarya Chaurasia
>> >
>> >
>> >> On 19-Apr-2017 12:43 AM, <du...@gmail.com> wrote:
>> >>
>> >> Hi Aishwarya,
>> >>
>> >> Certainly, here is some more detailed information about `preprocess.py`:
>> >>
>> >>  * The preprocessing Python script is located at
>> >> https://github.com/apache/incubator-systemml/blob/master/projects/breast_cancer/preprocess.py.
>> >> Note that this is different than the library module at
>> >> https://github.com/apache/incubator-systemml/blob/master/projects/breast_cancer/breastcancer/preprocessing.py.
>> >>  * This script is used to preprocess a set of histology slide images,
>> >> which are `.svs` files in our case, and `.tiff` files in your case.
>> >>  * Lines 63-79 contain "settings" such as the output image sizes, folder
>> >> paths, etc.  Of particular interest, line 72 has the folder path for the
>> >> original slide images that should be commonly accessible from all
>> >> machines being used, and lines 74-79 contain the names of the output
>> >> DataFrames that will be saved.
>> >>  * Line 82 performs the actual preprocessing and creates a Spark
>> >> DataFrame with the following columns: slide number, tumor score,
>> >> molecular score, sample.  The "sample" in this case is the actual small,
>> >> chopped-up section of the image that has been extracted and flattened
>> >> into a row Vector.  For test images without labels (`training=false`),
>> >> only the slide number and sample will be contained in the DataFrame
>> >> (i.e. no labels).  This calls the `preprocess(...)` function located on
>> >> line 371 of
>> >> https://github.com/apache/incubator-systemml/blob/master/projects/breast_cancer/breastcancer/preprocessing.py,
>> >> which is a different file.
>> >>  * Line 87 simply saves the above DataFrame to HDFS with the name from
>> >> line 74.
>> >>  * Line 93 splits the above DataFrame row-wise into separate "training"
>> >> and "validation" DataFrames, based on the split percentage from line 70
>> >> (`train_frac`).  This is performed so that downstream machine learning
>> >> tasks can learn from the training set, and validate performance and
>> >> hyperparameter choices on the validation set.  These DataFrames will
>> >> start with the same columns as the above DataFrame.  If `add_row_indices`
>> >> from line 69 is true, then an additional row index column (`__INDEX`)
>> >> will be prepended.  This is useful for SystemML in downstream machine
>> >> learning tasks as it gives the DataFrame row numbers like a real matrix
>> >> would have, and SystemML is built to operate on matrices.
>> >>  * Lines 97 & 98 simply save the training and validation DataFrames
>> >> using the names defined on lines 76 & 78.
>> >>  * Lines 103-137 create smaller train and validation DataFrames by
>> >> taking small row-wise samples of the full train and validation
>> >> DataFrames.  The percentage of the sample is defined on line 111
>> >> (`p=0.01` for a 1% sample).  This is generally useful for quicker
>> >> downstream tasks without having to load in the larger DataFrames,
>> >> assuming you have a large amount of data.  For us, we have ~7TB of data,
>> >> so having 1% sampled DataFrames is useful for quicker downstream tests.
>> >> Once again, the same columns from the larger train and validation
>> >> DataFrames will be used.
>> >>  * Lines 146 & 147 simply save these sampled train and validation
>> >> DataFrames.
>> >>
>> >> As a summary, after running `preprocess.py`, you will be left with the
>> >> following saved DataFrames in HDFS:
>> >>  * Full DataFrame
>> >>  * Training DataFrame
>> >>  * Validation DataFrame
>> >>  * Sampled training DataFrame
>> >>  * Sampled validation DataFrame
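
The split, row-index, and sampling steps described above can be mimicked in
plain Python to see how the five outputs relate to each other. This is a
rough analogy on a list of tuples, not the code in `preprocess.py`: the
helper names, the seed, and the 1-based index are assumptions for
illustration. PySpark offers `DataFrame.randomSplit` and `DataFrame.sample`
for this kind of operation; whether the script uses them is not stated in
this thread.

```python
import random


def split_rows(rows, train_frac, seed=42):
    """Assign each row to train with probability train_frac, else validation."""
    rng = random.Random(seed)
    train, val = [], []
    for row in rows:
        (train if rng.random() < train_frac else val).append(row)
    return train, val


def sample_rows(rows, p, seed=42):
    """Keep each row independently with probability p (e.g. 0.01 for ~1%)."""
    rng = random.Random(seed)
    return [row for row in rows if rng.random() < p]


def add_row_index(rows):
    """Prepend a row number to every row (1-based here, as an assumption)."""
    return [(i, *row) for i, row in enumerate(rows, start=1)]


# Made-up rows: (slide number, tumor score, molecular score, sample vector).
rows = [(n, 1, 0.5, [0.0, 1.0]) for n in range(1000)]

train, val = split_rows(rows, train_frac=0.8)
train_sample = sample_rows(train, p=0.01)
val_sample = sample_rows(val, p=0.01)
train_indexed = add_row_index(train)

print(len(rows), len(train), len(val), len(train_sample), len(val_sample))
```

The full list, the two splits, and the two samples correspond to the five
saved DataFrames listed above.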
>> >>
>> >> As for visualization, you may visualize a "sample" (i.e. small,
>> >> chopped-up section of original image) from a DataFrame by using the
>> >> `breastcancer.visualization.visualize_sample(...)` function.  You will
>> >> need to do this after creating the DataFrames.  Here is a snippet to
>> >> visualize the first row sample in a DataFrame, where `df` is one of the
>> >> DataFrames from above:
>> >>
>> >> ```
>> >> from breastcancer.visualization import visualize_sample
>> >> visualize_sample(df.first().sample)
>> >> ```
>> >>
>> >> Please let me know if you have any additional questions.
>> >>
>> >> Thanks!
>> >>
>> >> - Mike
>> >>
>> >>
>> >>> On Apr 15, 2017, at 4:38 AM, Aishwarya Chaurasia <aishwarya2612@gmail.com> wrote:
>> >>>
>> >>> Hello sir,
>> >>> Can you please elaborate on what output we should expect? We tried
>> >>> executing the preprocess.py file using spark-submit, and it keeps on
>> >>> adding the tiles to the RDD. While running the visualisation.py file,
>> >>> it isn't showing any output. Can you please help us out asap by stating
>> >>> the output we will be getting and the sequence of execution of files?
>> >>> Thank you.
>> >>>
>> >>>> On 07-Apr-2017 5:54 AM, <du...@gmail.com> wrote:
>> >>>>
>> >>>> Hi Aishwarya,
>> >>>>
>> >>>> Thanks for sharing more info on the issue!
>> >>>>
>> >>>> To facilitate easier usage, I've updated the preprocessing code by
>> >>>> pulling out most of the logic into a `breastcancer/preprocessing.py`
>> >>>> module, leaving just the execution in the `Preprocessing.ipynb`
>> >>>> notebook.  There is also a `preprocess.py` script with the same
>> >>>> contents as the notebook for use with `spark-submit`.  The choice of
>> >>>> the notebook or the script is just a matter of convenience, as they
>> >>>> both import from the same `breastcancer/preprocessing.py` module.
>> >>>>
>> >>>> As part of the updates, I've added an explicit SparkSession parameter
>> >>>> (`spark`) to the `preprocess(...)` function, and updated the body to
>> >>>> use this SparkSession object rather than the older SparkContext `sc`
>> >>>> object.  Previously, the `preprocess(...)` function accessed the `sc`
>> >>>> object that was pulled in from the enclosing scope, which would work
>> >>>> while all of the code was colocated within the notebook, but not if the
>> >>>> code was extracted and imported.  The explicit parameter now allows for
>> >>>> the code to be imported.
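
The scoping problem described above can be reproduced without Spark. In
the sketch below, `FakeContext` and `FakeSession` are hypothetical
stand-ins for SparkContext and SparkSession, and the two `preprocess_*`
functions are simplified illustrations, not the project's real code. A
function looks up free names such as `sc` in the module where it is
defined, not in the code that calls it, so an `sc` that exists only on the
caller's side is not visible.

```python
class FakeContext:
    """Hypothetical stand-in for a SparkContext."""

    def parallelize(self, items):
        return list(items)


class FakeSession:
    """Hypothetical stand-in for a SparkSession."""

    def __init__(self):
        self.sparkContext = FakeContext()


def preprocess_implicit(paths):
    # `sc` is resolved in the globals of the module defining this function.
    return sc.parallelize(paths)


def preprocess_explicit(spark, paths):
    # Everything the function needs is passed in, so it can live anywhere.
    return spark.sparkContext.parallelize(paths)


def caller():
    sc = FakeContext()  # exists only inside caller(), like a notebook-only name
    try:
        preprocess_implicit(["slide1.tiff"])
        return "ok"
    except NameError as err:
        return f"NameError: {err}"


implicit_result = caller()
explicit_result = preprocess_explicit(FakeSession(), ["slide1.tiff"])
print(implicit_result)
print(explicit_result)
```

The implicit version fails with a NameError even though the caller has an
`sc`, while the explicit version works because the session is an argument.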
>> >>>>
>> >>>> Can you please try again with the latest updates?  We are currently
>> >>>> using Spark 2.x with Python 3.  If you use the notebook, the pyspark
>> >>>> kernel should have a `spark` object available that can be supplied to
>> >>>> the functions (as is done now in the notebook), and if you use the
>> >>>> `preprocess.py` script with `spark-submit`, the `spark` object will be
>> >>>> created explicitly by the script.
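
For a standalone script, creating the `spark` object explicitly might look
roughly like the sketch below. The helper names and the application name
are assumptions for illustration, and the actual code in `preprocess.py`
may differ. `SparkSession.builder.appName(...).getOrCreate()` and the
`sparkContext` attribute are part of the Spark 2.x Python API; the import
is kept inside the function so the file loads even where pyspark is not
installed.

```python
def get_spark(app_name="breast_cancer_preprocessing"):
    """Create or reuse a SparkSession (hypothetical helper; needs pyspark)."""
    from pyspark.sql import SparkSession
    return SparkSession.builder.appName(app_name).getOrCreate()


def get_context(spark):
    """Return the SparkContext that belongs to a SparkSession."""
    return spark.sparkContext


# Intended usage where pyspark is available:
#   spark = get_spark()
#   sc = get_context(spark)
```

Deriving `sc` from `spark` this way may be one option for code that still
expects a name called `sc`.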
>> >>>>
>> >>>> For a bit of context to others, Aishwarya initially reached out to
>> >>>> find out if our breast cancer project could be applied to TIFF images,
>> >>>> rather than the SVS images we are currently using (the answer is "yes"
>> >>>> so long as they are "generic tiled TIFF" images, according to the
>> >>>> OpenSlide documentation), and then followed up with Spark issues
>> >>>> related to the preprocessing code.  This conversation has been promptly
>> >>>> moved to the mailing list so that others in the community can benefit.
>> >>>>
>> >>>>
>> >>>> Thanks!
>> >>>>
>> >>>> -Mike
>> >>>>
>> >>>>
>> >>>>> On Apr 6, 2017, at 5:09 AM, Aishwarya Chaurasia <aishwarya2612@gmail.com> wrote:
>> >>>>>
>> >>>>> Hey,
>> >>>>>
>> >>>>> The object sc is already defined in pyspark and yet this name error
>> >>>>> keeps occurring. We are using spark 2.*
>> >>>>>
>> >>>>> Here is the link to the error that we are getting:
>> >>>>> https://paste.fedoraproject.org/paste/89iQODxzpNZVbSfgwocH8l5M1UNdIGYhyRLivL9gydE=
>> >>>>
>> >>
>>
>