Posted to dev@systemml.apache.org by Aishwarya Chaurasia <ai...@gmail.com> on 2017/04/24 09:28:42 UTC

Re: Please reply ASAP : Regarding incubator systemml/breast_cancer project

Hello sir,

Thanks a lot for replying, sir. Unfortunately it did not work. Although
the NameError did not appear this time, another error came up:

https://paste.fedoraproject.org/paste/TUMtSIb88Q73FYekwJmM7V5M1UNdIGYhyRLivL9gydE=

This error was obtained after executing the second block of code of
MachineLearning.py in the terminal ( ml = MLContext(sc) ).

We have installed only the bleeding-edge version of SystemML, and the
installation was done correctly. We are in a fix now. :/
Kindly look into the matter ASAP.

On 24-Apr-2017 12:15 PM, "Mike Dusenberry" <du...@gmail.com> wrote:

Hi Aishwarya,

Glad to hear that the preprocessing stage was successful!  As for the
`MachineLearning.ipynb` notebook, here is a general guide:


   - The `MachineLearning.ipynb` notebook essentially (1) loads in the
   training and validation DataFrames from the preprocessing step, (2)
   converts them to normalized & one-hot encoded SystemML matrices for
   consumption by the ML algorithms, and (3) explores training a couple of
   models.
   - To run, you'll need to start Jupyter in the context of PySpark via
   `PYSPARK_PYTHON=python3 PYSPARK_DRIVER_PYTHON=jupyter
   PYSPARK_DRIVER_PYTHON_OPTS="notebook" pyspark --jars
   $SYSTEMML_HOME/target/SystemML.jar`.  Note that if you have installed
   SystemML with pip from PyPI (`pip3 install systemml`), this will install
   our 0.13 release, and the `--jars $SYSTEMML_HOME/target/SystemML.jar`
   will not be necessary.  If you instead have installed a bleeding-edge
   version of SystemML locally (git clone locally, maven build, `pip3 install -e
   src/main/python` as listed in `projects/breast_cancer/README.md`), the
   `--jars $SYSTEMML_HOME/target/SystemML.jar` part *is* necessary.  We are
   about to release 0.14, and for this project, I *would* recommend using a
   bleeding-edge install.
   - Once Jupyter has been started in the context of PySpark, the `sc`
   SparkContext object should be available.  Please let me know if you
   continue to see this issue.
   - The "Read in train & val data" section simply reads in the training
   and validation data generated in the preprocessing stage.  Be sure that
the
   `size` setting is the same as the preprocessing size.  The percentage `p`
   setting determines whether the full or sampled DataFrames are loaded.  If
   you set `p = 1`, the full DataFrames will be used.  If you instead would
   prefer to use the smaller sampled DataFrames while getting started,
please
   set it to the same value as used in the preprocessing to generate the
   smaller sampled DataFrames.
   - The `Extract X & Y matrices` section splits each of the train and
   validation DataFrames into effectively X & Y matrices (still as DataFrame
   types), with X containing the images, and Y containing the labels.
   - The `Convert to SystemML Matrices` section passes the X & Y DataFrames
   into a SystemML script that performs some normalization of the images &
   one-hot encoding of the labels, and then returns SystemML `Matrix` types
   (a rough sketch of this step follows after this list).  These are now
   ready to be passed into the subsequent algorithms.
   - The "Trigger Caching" and "Save Matrices" are experimental features,
   and not necessary to execute.
   - Next comes the two algorithms being explored in this notebook.  The
   "Softmax Classifier" is just a multi-class logistic regression model, and
   is simply there to serve as a baseline comparison with the subsequent
   convolutional neural net model.  You may wish to simply skip this softmax
   model and move to the latter convnet model further down in the notebook.
   - The actual softmax model is located at
   [https://github.com/apache/incubator-systemml/blob/master/projects/breast_cancer/softmax_clf.dml],
   and the notebook calls functions from that file.
   - The softmax sanity check just ensures that the model is able to
   completely overfit when given a tiny sample size.  This should yield
   ~100% training accuracy if the sample size in this section is small
   enough.  This is just a check to ensure that nothing else is wrong with
   the math or the data.
   - The softmax "Train" section will train a softmax model and return the
   weights (`W`) and biases (`b`) of the model as SystemML `Matrix` objects.
   Please adjust the hyperparameters in this section to your problem.
   - The softmax "Eval" section takes the trained weights and biases and
   evaluates the training and validation performance.
   - The next model is a LeNet-like convnet model.  The actual model is
   located at
   [https://github.com/apache/incubator-systemml/blob/master/projects/breast_cancer/convnet.dml],
   and the notebook simply calls functions from that file.
   - Once again, there is an initial sanity check for the ability to
   overfit on a small amount of data.
   - The "Hyperparameter Search" contains a script to sample different
   hyperparams for the convnet, and save the hyperparams + validation
accuracy
   of each set after a single epoch of training.  These string files will be
   saved to HDFS.  Please feel free to adjust the range of the
hyperparameters
   for your problem.  Please also feel free to try using the `parfor`
   (parallel for-loop) instead of the while loop to speed up this section.
   Note that this is still a work in progress.  The hyperparameter tuning in
   this section makes use of random search (as opposed to grid search),
which
   has been promoted by Bengio et al. to speed up the search time.
   - The "Train" section trains the convnet and returns the weights and
   biases as SystemML `Matrix` types.  In this section, please replace the
   hyperparameters with the best ones from above, and please increase the
   number of epochs given your time constraints.
   - The "Eval" section evaluates the performance of the trained convnet.
   - Although it is not shown in the notebook yet, to save the weights and
   biases, please use the `toDF()` method on each weight and bias matrix
   (i.e. `Wc1.toDF()`) to convert it to a Spark DataFrame, and then simply
   save the DataFrame as desired (a short sketch of this follows after this
   list).
   - Finally, please feel free to extend the model in `convnet.dml` for
   your particular problem!  The LeNet-like model just serves as a simple
   convnet, but there are much richer models currently, such as resnets,
   that we are experimenting with.  To make larger models such as resnets
   easier to define, we are also working on other tools for converting
   model definitions + pretrained weights from other systems into SystemML.
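
For illustration, here is a rough sketch of the DataFrame-to-SystemML conversion
step mentioned above.  The DML snippet and the names `df_X` and `df_Y` are
placeholders for this sketch, not the notebook's actual script; the MLContext
Python API is shown roughly as it existed around the 0.13/0.14 releases.

```
# Hedged sketch of the MLContext Python API (0.13/0.14-era); not the notebook's script.
from systemml import MLContext, dml

ml = MLContext(sc)  # `sc` is the SparkContext provided by the PySpark kernel

# Placeholder DML: scale the images and one-hot encode labels assumed to be in 1..K.
script = (dml("""
              X = X_in / 255
              Y = table(seq(1, nrow(Y_in)), Y_in)
              """)
          .input(X_in=df_X, Y_in=df_Y)  # df_X, df_Y: the X & Y DataFrames from the previous step
          .output("X", "Y"))

X, Y = ml.execute(script).get("X", "Y")  # SystemML `Matrix` objects, ready for the algorithms
```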

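For the weight-saving bullet above, a minimal sketch (the matrix names `Wc1`,
`bc1`, etc. and the output path are illustrative; use whichever matrices the
"Train" section actually returns):

```
# Hedged sketch: persist trained SystemML `Matrix` objects as Spark DataFrames.
# `Wc1`, `bc1`, ... stand in for whichever matrices the "Train" section returned.
trained = {"Wc1": Wc1, "bc1": bc1, "Wc2": Wc2, "bc2": bc2}

for name, mat in trained.items():
    df = mat.toDF()  # convert the SystemML Matrix to a Spark DataFrame
    df.write.save("hdfs:///models/convnet/{}.parquet".format(name),
                  format="parquet", mode="overwrite")
```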

Also, please keep in mind that the deep learning support in SystemML is
still a work in progress.  Therefore, if you run into issues, please let us
know and we'll do everything possible to help get things running!


Thanks!

- Mike


--

Michael W. Dusenberry
GitHub: github.com/dusenberrymw
LinkedIn: linkedin.com/in/mikedusenberry

On Sat, Apr 22, 2017 at 4:49 AM, Aishwarya Chaurasia <
aishwarya2612@gmail.com> wrote:

> Hey,
>
> Thank you so much for your help sir. We were finally able to run
> preprocess.py without any errors. And the results obtained were
> satisfactory, i.e. we got five sets of data frames like you said we would.
>
> But alas! when we tried to run MachineLearning.ipynb the same NameError
> came : https://paste.fedoraproject.org/paste/l3LFJreg~vnYEDTSTQH73l5M1UNdIGYhyRLivL9gydE=
>
> Could you guide us again as to how to proceed now?
> Also, could you please provide an overview of the process
> MachineLearning.ipynb is following to train the samples.
>
> Thanks a lot!
>
> On 20-Apr-2017 12:16 AM, <du...@gmail.com> wrote:
>
> > Hi Aishwarya,
> >
> > Looks like you've just encountered an out of memory error on one of the
> > executors.  Therefore, you just need to adjust the `spark.executor.memory`
> > and `spark.driver.memory` settings with higher amounts of RAM.  What is
> > your current setup?  I.e. are you using a cluster of machines, or a single
> > machine?  We generally use a large driver on one machine, and then a single
> > large executor on each other machine.  I would give a sizable amount of
> > memory to the driver, and about half the possible memory on the executors
> > so that the Python processes have enough memory as well.  PySpark has JVM
> > and Python components, and the Spark memory settings only pertain to the
> > JVM side, thus the need to save about half the executor memory for the
> > Python side.
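
A hedged illustration of those settings (the memory values below are made up
and should be sized to your machines): when the SparkSession is created inside
`preprocess.py` itself, the executor memory can be set programmatically, while
driver memory is normally passed to `spark-submit` (e.g. `--driver-memory 20g`)
because the driver JVM is already running by the time the script executes.

```
# Hedged sketch: configure executor memory when building the SparkSession yourself.
# The app name and memory values are placeholders, not the project's actual settings.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("breast_cancer_preprocessing")   # hypothetical name
         .config("spark.executor.memory", "50g")   # roughly half of each worker's RAM
         .getOrCreate())
sc = spark.sparkContext
```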
> >
> > Thanks!
> >
> > - Mike
> >
> > --
> >
> > Mike Dusenberry
> > GitHub: github.com/dusenberrymw
> > LinkedIn: linkedin.com/in/mikedusenberry
> >
> > Sent from my iPhone.
> >
> >
> > > On Apr 19, 2017, at 5:53 AM, Aishwarya Chaurasia <
> > aishwarya2612@gmail.com> wrote:
> > >
> > > Hello sir,
> > >
> > > We also wanted to ensure that the spark-submit command we're using is
> > > the correct one for running 'preprocess.py'.
> > > Command :  /home/new/sparks/bin/spark-submit preprocess.py
> > >
> > >
> > > Thank you.
> > > Aishwarya Chaurasia.
> > >
> > > On 19-Apr-2017 3:55 PM, "Aishwarya Chaurasia" <aishwarya2612@gmail.com
> >
> > > wrote:
> > >
> > > Hello sir,
> > > On running the file preprocess.py we are getting the following error :
> > >
> > > https://paste.fedoraproject.org/paste/IAvqiiyJChSC0V9eeETe2F5M1UNdIGYhyRLivL9gydE=
> > >
> > > Can you please help us by looking into the error and kindly tell us
> > > the solution for it.
> > > Thanks a lot.
> > > Aishwarya Chaurasia
> > >
> > >
> > >> On 19-Apr-2017 12:43 AM, <du...@gmail.com> wrote:
> > >>
> > >> Hi Aishwarya,
> > >>
> > >> Certainly, here is some more detailed information about
> > >> `preprocess.py`:
> > >>
> > >>  * The preprocessing Python script is located at
> > >> https://github.com/apache/incubator-systemml/blob/master/projects/breast_cancer/preprocess.py.
> > >> Note that this is different than the library module at
> > >> https://github.com/apache/incubator-systemml/blob/master/projects/breast_cancer/breastcancer/preprocessing.py.
> > >>  * This script is used to preprocess a set of histology slide images,
> > >> which are `.svs` files in our case, and `.tiff` files in your case.
> > >>  * Lines 63-79 contain "settings" such as the output image sizes,
> > >> folder paths, etc.  Of particular interest, line 72 has the folder path
> > >> for the original slide images that should be commonly accessible from
> > >> all machines being used, and lines 74-79 contain the names of the output
> > >> DataFrames that will be saved.
> > >>  * Line 82 performs the actual preprocessing and creates a Spark
> > >> DataFrame with the following columns: slide number, tumor score,
> > >> molecular score, sample.  The "sample" in this case is the actual small,
> > >> chopped-up section of the image that has been extracted and flattened
> > >> into a row Vector.  For test images without labels (`training=false`),
> > >> only the slide number and sample will be contained in the DataFrame
> > >> (i.e. no labels).  This calls the `preprocess(...)` function located on
> > >> line 371 of
> > >> https://github.com/apache/incubator-systemml/blob/master/projects/breast_cancer/breastcancer/preprocessing.py,
> > >> which is a different file.
> > >>  * Line 87 simply saves the above DataFrame to HDFS with the name
> > >> from line 74.
> > >>  * Line 93 splits the above DataFrame row-wise into separate "training"
> > >> and "validation" DataFrames, based on the split percentage from line 70
> > >> (`train_frac`).  This is performed so that downstream machine learning
> > >> tasks can learn from the training set, and validate performance and
> > >> hyperparameter choices on the validation set.  These DataFrames will
> > >> start with the same columns as the above DataFrame.  If
> > >> `add_row_indices` from line 69 is true, then an additional row index
> > >> column (`__INDEX`) will be prepended.  This is useful for SystemML in
> > >> downstream machine learning tasks as it gives the DataFrame row numbers
> > >> like a real matrix would have, and SystemML is built to operate on
> > >> matrices.
> > >>  * Lines 97 & 98 simply save the training and validation DataFrames
> > >> using the names defined on lines 76 & 78.
> > >>  * Lines 103-137 create smaller train and validation DataFrames by
> > >> taking small row-wise samples of the full train and validation
> > >> DataFrames.  The percentage of the sample is defined on line 111
> > >> (`p=0.01` for a 1% sample).  This is generally useful for quicker
> > >> downstream tasks without having to load in the larger DataFrames,
> > >> assuming you have a large amount of data.  For us, we have ~7TB of data,
> > >> so having 1% sampled DataFrames is useful for quicker downstream tests.
> > >> Once again, the same columns from the larger train and validation
> > >> DataFrames will be used.
> > >>  * Lines 146 & 147 simply save these sampled train and validation
> > >> DataFrames.
> > >>
> > >> As a summary, after running `preprocess.py`, you will be left with the
> > >> following saved DataFrames in HDFS:
> > >>  * Full DataFrame
> > >>  * Training DataFrame
> > >>  * Validation DataFrame
> > >>  * Sampled training DataFrame
> > >>  * Sampled validation DataFrame
> > >>
> > >> As for visualization, you may visualize a "sample" (i.e. small,
> > >> chopped-up section of original image) from a DataFrame by using the
> > >> `breastcancer.visualization.visualize_sample(...)` function.  You will
> > >> need to do this after creating the DataFrames.  Here is a snippet to
> > >> visualize the first row sample in a DataFrame, where `df` is one of the
> > >> DataFrames from above:
> > >>
> > >> ```
> > >> from breastcancer.visualization import visualize_sample
> > >> visualize_sample(df.first().sample)
> > >> ```
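
A small follow-on sketch for later sessions (the Parquet name below is a
placeholder; use whichever name `preprocess.py` actually saved, and `spark` is
the SparkSession): the saved DataFrames can be read back from HDFS and
visualized the same way.

```
# Hedged sketch: reload a previously saved DataFrame and visualize its first sample.
# "train_sample.parquet" is a hypothetical file name.
from breastcancer.visualization import visualize_sample

df = spark.read.load("train_sample.parquet")  # Parquet is the default format
visualize_sample(df.first().sample)
```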
> > >>
> > >> Please let me know if you have any additional questions.
> > >>
> > >> Thanks!
> > >>
> > >> - Mike
> > >>
> > >> --
> > >>
> > >> Mike Dusenberry
> > >> GitHub: github.com/dusenberrymw
> > >> LinkedIn: linkedin.com/in/mikedusenberry
> > >>
> > >> Sent from my iPhone.
> > >>
> > >>
> > >>> On Apr 15, 2017, at 4:38 AM, Aishwarya Chaurasia <
> > >> aishwarya2612@gmail.com> wrote:
> > >>>
> > >>> Hello sir,
> > >>> Can you please elaborate more on what output we would be getting?
> > >>> We tried executing the preprocess.py file using spark-submit; it keeps
> > >>> on adding the tiles in the RDD, and while running the visualisation.py
> > >>> file it isn't showing any output. Can you please help us out ASAP,
> > >>> stating the output we will be getting and the sequence of execution of
> > >>> files.
> > >>> Thank you.
> > >>>
> > >>>> On 07-Apr-2017 5:54 AM, <du...@gmail.com> wrote:
> > >>>>
> > >>>> Hi Aishwarya,
> > >>>>
> > >>>> Thanks for sharing more info on the issue!
> > >>>>
> > >>>> To facilitate easier usage, I've updated the preprocessing code by
> > >>>> pulling out most of the logic into a `breastcancer/preprocessing.py`
> > >>>> module, leaving just the execution in the `Preprocessing.ipynb`
> > >>>> notebook.  There is also a `preprocess.py` script with the same
> > >>>> contents as the notebook for use with `spark-submit`.  The choice of
> > >>>> the notebook or the script is just a matter of convenience, as they
> > >>>> both import from the same `breastcancer/preprocessing.py` package.
> > >>>>
> > >>>> As part of the updates, I've added an explicit SparkSession parameter
> > >>>> (`spark`) to the `preprocess(...)` function, and updated the body to
> > >>>> use this SparkSession object rather than the older SparkContext `sc`
> > >>>> object.  Previously, the `preprocess(...)` function accessed the `sc`
> > >>>> object that was pulled in from the enclosing scope, which would work
> > >>>> while all of the code was colocated within the notebook, but not if
> > >>>> the code was extracted and imported.  The explicit parameter now
> > >>>> allows for the code to be imported.
> > >>>>
> > >>>> Can you please try again with the latest updates?  We are currently
> > >>>> using Spark 2.x with Python 3.  If you use the notebook, the pyspark
> > >>>> kernel should have a `spark` object available that can be supplied to
> > >>>> the functions (as is done now in the notebook), and if you use the
> > >>>> `preprocess.py` script with `spark-submit`, the `spark` object will be
> > >>>> created explicitly by the script.
> > >>>>
> > >>>> For a bit of context to others, Aishwarya initially reached out to
> > >>>> find out if our breast cancer project could be applied to TIFF images,
> > >>>> rather than the SVS images we are currently using (the answer is "yes"
> > >>>> so long as they are "generic tiled TIFF" images, according to the
> > >>>> OpenSlide documentation), and then followed up with Spark issues
> > >>>> related to the preprocessing code.  This conversation has been
> > >>>> promptly moved to the mailing list so that others in the community can
> > >>>> benefit.
> > >>>>
> > >>>>
> > >>>> Thanks!
> > >>>>
> > >>>> -Mike
> > >>>>
> > >>>> --
> > >>>>
> > >>>> Mike Dusenberry
> > >>>> GitHub: github.com/dusenberrymw
> > >>>> LinkedIn: linkedin.com/in/mikedusenberry
> > >>>>
> > >>>> Sent from my iPhone.
> > >>>>
> > >>>>
> > >>>>> On Apr 6, 2017, at 5:09 AM, Aishwarya Chaurasia <
> > >> aishwarya2612@gmail.com>
> > >>>> wrote:
> > >>>>>
> > >>>>> Hey,
> > >>>>>
> > >>>>> The object sc is already defined in pyspark and yet this NameError
> > >>>>> keeps occurring. We are using spark 2.*
> > >>>>>
> > >>>>> Here is the link to the error that we are getting :
> > >>>>> https://paste.fedoraproject.org/paste/89iQODxzpNZVbSfgwocH8l5M1UNdIGYhyRLivL9gydE=
> > >>>>
> > >>
> >
>

Re: Please reply ASAP : Regarding incubator systemml/breast_cancer project

Posted by du...@gmail.com.
Hi Aishwarya,

Yes, it is quite strange that Jupyter isn't running on the PySpark kernel even though it's being started in that manner.  The good news is that we do use this everyday, so once we find the root issue with your Jupyter, it should work great!  Let's try temporarily removing all of the existing Jupyter/IPython settings & kernels and basically start fresh.  Assuming you are on OS X / macOS or Linux, can you do the following? (Please double check the exact paths, as I'm typing on a phone.)

* Stop Jupyter, and make sure that it is not running.
* Temporarily remove the Jupyter kernels.  First, you will need to see where they are installed, and then just rename that path.
    `jupyter kernelspec list`
    # look at the paths above.  For example, on macOS, the kernels may be located at ~/Library/Jupyter/kernels, and thus to move them aside, you would use the following.  Update this as needed for the exact paths listed above.
    `mv ~/Library/Jupyter/kernels ~/Library/Jupyter/kernels_OLD`
* Temporarily remove the Jupyter & IPython settings:
    `mv ~/.jupyter ~/.jupyter_OLD`
    `mv ~/.ipython ~/.ipython_OLD`
* Make sure Jupyter is up to date:
    `pip3 install -U ipython jupyter`

After that, please ensure that Jupyter is not running, then start it in the context of PySpark as sent previously.  Once Jupyter is started this time, there should only be one kernel listed, and `sc` should be available.
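
As a quick, illustrative sanity check (not part of the original steps), the first notebook cell can verify that the PySpark kernel is really active before moving on:

```
# Hedged sketch: confirm `sc` exists and that the SystemML jar is reachable.
print(sc)           # should print a SparkContext, not raise a NameError
print(sc.version)   # Spark version, e.g. 2.x

from systemml import MLContext
ml = MLContext(sc)  # the step that previously failed; requires SystemML.jar on the classpath
```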

Can you try that?

--

Mike Dusenberry
GitHub: github.com/dusenberrymw
LinkedIn: linkedin.com/in/mikedusenberry

Sent from my iPhone.


> On Apr 26, 2017, at 2:13 AM, Aishwarya Chaurasia <ai...@gmail.com> wrote:
> 
> Hi sir,
> The sc NameError persists.
> 
> (1) There is only one jupyter server running. And that was started with the
> pyspark command in the previous mail.
> (2) Two kernels are appearing in the change kernel option - Python3 and
> Python2. Tried with both of them and the result is the same.
> 
> How is jupyter not being able to run on the pyspark kernel when we have
> started the notebook with the pyspark command only?
> 
> Is it possible to create a .py file of MachineLearning.ipynb, as was done
> with preprocessing.ipynb, explicitly creating a SparkContext() ?
> 
>> On 25-Apr-2017 11:57 PM, <du...@gmail.com> wrote:
>> 
>> Hi Aishwarya,
>> 
>> Unfortunately this mailing list removes all images, so I can't view your
>> screenshot.  I'm assuming that it is the same issue with the missing
>> SparkContext `sc` object, but please let me know if it is a different
>> issue.  This sounds like it could be an issue with multiple kernels
>> installed in Jupyter.  When you start the notebook, can you see if there
>> are multiple kernels listed in the "Kernel" -> "Change Kernel" menu?  If
>> so, please try one of the other kernels to see if Jupyter is starting by
>> default with a non-spark kernel.  Also, is it possible that you have more
>> than one instance of the Jupyter server running?  I.e. for this scenario,
>> we start Jupyter itself directly via pyspark using the command sent
>> previously, whereas usually Jupyter can just be started with `jupyter
>> notebook`.  In the latter case, PySpark (and thus `sc`) would *not* be
>> available (unless you've set up special PySpark kernels separately).  In
>> summary, can you (1) check for other kernels via the menus, and (2) check
>> for other running Jupyter servers that are non-PySpark?
>> 
>> As for the other inquiry, great question!  When training models, it's
>> quite useful to track the loss and other metrics (i.e. accuracy) from
>> *both* the training and validation sets.  The reasoning is that it allows
>> for a more holistic view of the overall learning process, such as
>> evaluating whether any overfitting or underfitting is occurring.  For
>> example, say that you train a model and achieve an accuracy of 80% on the
>> validation set.  Is this good?  Is this the best that can be done?  Without
>> also tracking performance on the training set, it can be difficult to make
>> these decisions.  Say that you then measure the performance on the training
>> set and find that the model achieves 100% accuracy on that data.  That
>> might be a good indication that your model is overfitting the training set,
>> and that a combination of more data, regularization, and a smaller model
>> may be helpful in raising the generalization performance, i.e. the
>> performance on the validation set and future real examples on which you
>> wish to make predictions.  If on the other hand, the model achieved an 82%
>> on the training set, this could be a good indication that the model is
>> underfitting, and that a combination of a more expressive model and better
>> data could be helpful.  In summary, tracking performance on both the
>> training and validation datasets can be useful for determining ways in
>> which to improve the overall learning process.
>> 
>> 
>> - Mike
>> 
>> --
>> 
>> Mike Dusenberry
>> GitHub: github.com/dusenberrymw
>> LinkedIn: linkedin.com/in/mikedusenberry
>> 
>> Sent from my iPhone.
>> 
>> 
>>> On Apr 25, 2017, at 8:47 AM, Aishwarya Chaurasia <
>> aishwarya2612@gmail.com> wrote:
>>> 
>>> We had another query, sir. We read the entire MachineLearning.ipynb code.
>>> In it, the training samples and the validation samples have both been
>>> evaluated separately and their respective losses and accuracies obtained.
>>> Why are the training samples being evaluated again if they were used to
>>> train the model in the first place? Shouldn't only the validation data
>>> frames be evaluated to find out the loss and accuracy?
>>> 
>>> Thank you
>>> 
>>> On 25-Apr-2017 4:00 PM, "Aishwarya Chaurasia" <ai...@gmail.com>
>>> wrote:
>>> 
>>>> Hello sir,
>>>> 
>>>> The NameError is occurring again, sir. Why does it keep resurfacing?
>>>> 
>>>> Attaching the screenshot of the error.
>>>> 
>>>>> On 25-Apr-2017 2:50 AM, <du...@gmail.com> wrote:
>>>>> 
>>>>> Hi Aishwarya,
>>>>> 
>>>>> For the error message, that just means that the SystemML jar isn't
>>>>> being found.  Can you add a
>>>>> `--driver-class-path $SYSTEMML_HOME/target/SystemML.jar`
>>>>> to the invocation of Jupyter?  I.e. `PYSPARK_PYTHON=python3
>>>>> PYSPARK_DRIVER_PYTHON=jupyter PYSPARK_DRIVER_PYTHON_OPTS="notebook"
>>>>> pyspark --jars $SYSTEMML_HOME/target/SystemML.jar --driver-class-path
>>>>> $SYSTEMML_HOME/target/SystemML.jar`. There was a PySpark bug that was
>>>>> supposed to have been fixed in Spark 2.x, but it's possible that it is
>>>>> still an issue.
>>>>> 
>>>>> As for the output, the notebook will create SystemML `Matrix` objects
>>>>> for all of the weights and biases of the trained models.  To save, please
>>>>> convert each one to a DataFrame, i.e. `Wc1.toDF()`, repeated for each
>>>>> matrix, and then simply save the DataFrames.  This could be done all at
>>>>> once like this for a SystemML Matrix object `Wc1`:
>>>>> `Wc1.toDF().write.save("path/to/save/Wc1.parquet", format="parquet")`.
>>>>> Just repeat for each matrix returned by the "Train" code for the
>>>>> algorithms.  At that point, you will have a set of saved DataFrames
>>>>> representing a trained SystemML model, and these can be used in
>>>>> downstream classification tasks in a similar manner to the "Eval" sections.
>>>>> 
>>>>> -Mike
>>>>> 
>>>>> --
>>>>> 
>>>>> Mike Dusenberry
>>>>> GitHub: github.com/dusenberrymw
>>>>> LinkedIn: linkedin.com/in/mikedusenberry
>>>>> 
>>>>> Sent from my iPhone.
>>>>> 
>>>>> 
>>>>>> On Apr 24, 2017, at 3:07 AM, Aishwarya Chaurasia <
>>>>> aishwarya2612@gmail.com> wrote:
>>>>>> 
>>>>>> Furthermore:
>>>>>> What is the output of MachineLearning.ipynb you're obtaining sir?
>>>>>> We are actually nearing our deadline for our problem.
>>>>>> Thanks a lot.
>>>>>> 

Re: Please reply ASAP : Regarding incubator systemml/breast_cancer project

Posted by Aishwarya Chaurasia <ai...@gmail.com>.
Hi sir,
The sc NameError persists.

(1) There is only one jupyter server running. And that was started with the
pyspark command in the previous mail.
(2) Two kernels are appearing in the change kernel option - Python3 and
Python2. Tried with both of them and the result is the same.

How is jupyter not being able to run on the pyspark kernel when we have
started the notebook with the pyspark command only?

Is it possible to create a .py file of MachineLearning.ipynb, as was done
with preprocessing.ipynb, explicitly creating a SparkContext() ?

> >>>>>
> >>>>> Thank you so much for your help sir. We were finally able to run
> >>>>> preprocess.py without any errors. And the results obtained were
> >>>>> satisfactory, i.e. we got five sets of data frames like you said we
> would.
> >>>>>
> >>>>> But alas! when we tried to run MachineLearning.ipynb the same
> NameError
> >>>>> came : https://paste.fedoraproject.org/paste/l3LFJreg~
> >>>>> vnYEDTSTQH73l5M1UNdIGYhyRLivL9gydE=
> >>>>>
> >>>>> Could you guide us again as to how to proceed now?
> >>>>> Also, could you please provide an overview of the process
> >>>>> MachineLearning.ipynb is following to train the samples.
> >>>>>
> >>>>> Thanks a lot!
> >>>>>
> >>>>>> On 20-Apr-2017 12:16 AM, <du...@gmail.com> wrote:
> >>>>>>
> >>>>>> Hi Aishwarya,
> >>>>>>
> >>>>>> Looks like you've just encountered an out of memory error on one of
> >>> the
> >>>>>> executors.  Therefore, you just need to adjust the
> >>>>> `spark.executor.memory`
> >>>>>> and `spark.driver.memory` settings with higher amounts of RAM.  What
> >>> is
> >>>>>> your current setup?  I.e. are you using a cluster of machines, or a
> >>>>> single
> >>>>>> machine?  We generally use a large driver on one machine, and then a
> >>>>> single
> >>>>>> large executor on each other machine.  I would give a sizable amount
> >>> of
> >>>>>> memory to the driver, and about half the possible memory on the
> >>>> executors
> >>>>>> so that the Python processes have enough memory as well.  PySpark
> has
> >>>> JVM
> >>>>>> and Python components, and the Spark memory settings only pertain to
> >>> the
> >>>>>> JVM side, thus the need to save about half the executor memory for
> the
> >>>>>> Python side.
> >>>>>>
> >>>>>> Thanks!
> >>>>>>
> >>>>>> - Mike
> >>>>>>
> >>>>>> --
> >>>>>>
> >>>>>> Mike Dusenberry
> >>>>>> GitHub: github.com/dusenberrymw
> >>>>>> LinkedIn: linkedin.com/in/mikedusenberry
> >>>>>>
> >>>>>> Sent from my iPhone.
> >>>>>>
> >>>>>>
> >>>>>>> On Apr 19, 2017, at 5:53 AM, Aishwarya Chaurasia <
> >>>>>> aishwarya2612@gmail.com> wrote:
> >>>>>>>
> >>>>>>> Hello sir,
> >>>>>>>
> >>>>>>> We also wanted to ensure that the spark-submit command we're using
> is
> >>>>> the
> >>>>>>> correct one for running 'preprocess.py'.
> >>>>>>> Command :  /home/new/sparks/bin/spark-submit preprocess.py
> >>>>>>>
> >>>>>>>
> >>>>>>> Thank you.
> >>>>>>> Aishwarya Chaurasia.
> >>>>>>>
> >>>>>>> On 19-Apr-2017 3:55 PM, "Aishwarya Chaurasia" <
> >>> aishwarya2612@gmail.com
> >>>>>>
> >>>>>>> wrote:
> >>>>>>>
> >>>>>>> Hello sir,
> >>>>>>> On running the file preprocess.py we are getting the following
> error
> >>> :
> >>>>>>>
> >>>>>>> https://paste.fedoraproject.org/paste/
> IAvqiiyJChSC0V9eeETe2F5M1UNdIG
> >>>>>>> YhyRLivL9gydE=
> >>>>>>>
> >>>>>>> Can you please help us by looking into the error and kindly tell us
> >>>> the
> >>>>>>> solution for it.
> >>>>>>> Thanks a lot.
> >>>>>>> Aishwarya Chaurasia
> >>>>>>>
> >>>>>>>
> >>>>>>>> On 19-Apr-2017 12:43 AM, <du...@gmail.com> wrote:
> >>>>>>>>
> >>>>>>>> Hi Aishwarya,
> >>>>>>>>
> >>>>>>>> Certainly, here is some more detailed information
> >>>>> about`preprocess.py`:
> >>>>>>>>
> >>>>>>>> * The preprocessing Python script is located at
> >>>>>>>> https://github.com/apache/incubator-systemml/blob/master/
> >>>>>>>> projects/breast_cancer/preprocess.py. Note that this is different
> >>>>> than
> >>>>>>>> the library module at https://github.com/apache/incu
> >>>>>>>> bator-systemml/blob/master/projects/breast_cancer/breastc
> >>>>>>>> ancer/preprocessing.py.
> >>>>>>>> * This script is used to preprocess a set of histology slide
> images,
> >>>>>>>> which are `.svs` files in our case, and `.tiff` files in your
> case.
> >>>>>>>> * Lines 63-79 contain "settings" such as the output image sizes,
> >>>>> folder
> >>>>>>>> paths, etc.  Of particular interest, line 72 has the folder path
> for
> >>>>> the
> >>>>>>>> original slide images that should be commonly accessible from all
> >>>>>> machines
> >>>>>>>> being used, and lines 74-79 contain the names of the output
> >>>> DataFrames
> >>>>>> that
> >>>>>>>> will be saved.
> >>>>>>>> * Line 82 performs the actual preprocessing and creates a Spark
> >>>>>>>> DataFrame with the following columns: slide number, tumor score,
> >>>>>> molecular
> >>>>>>>> score, sample.  The "sample" in this case is the actual small,
> >>>>>> chopped-up
> >>>>>>>> section of the image that has been extracted and flattened into a
> >>> row
> >>>>>>>> Vector.  For test images without labels (`training=false`), only
> the
> >>>>>> slide
> >>>>>>>> number and sample will be contained in the DataFrame (i.e. no
> >>>> labels).
> >>>>>>>> This calls the `preprocess(...)` function located on line 371 of
> >>>>>>>> https://github.com/apache/incubator-systemml/blob/master/
> >>>>>>>> projects/breast_cancer/breastcancer/preprocessing.py, which is a
> >>>>>>>> different file.
> >>>>>>>> * Line 87 simply saves the above DataFrame to HDFS with the name
> >>>> from
> >>>>>>>> line 74.
> >>>>>>>> * Line 93 splits the above DataFrame row-wise into separate
> >>>>> "training"
> >>>>>>>> and "validation" DataFrames, based on the split percentage from
> line
> >>>>> 70
> >>>>>>>> (`train_frac`).  This is performed so that downstream machine
> >>>> learning
> >>>>>>>> tasks can learn from the training set, and validate performance
> and
> >>>>>>>> hyperparameter choices on the validation set.  These DataFrames
> will
> >>>>>> start
> >>>>>>>> with the same columns as the above DataFrame.  If
> `add_row_indices`
> >>>>> from
> >>>>>>>> line 69 is true, then an additional row index column (`__INDEX`)
> >>> will
> >>>>> be
> >>>>>>>> prepended.  This is useful for SystemML in downstream machine
> >>>> learning
> >>>>>>>> tasks as it gives the DataFrame row numbers like a real matrix
> would
> >>>>>> have,
> >>>>>>>> and SystemML is built to operate on matrices.
> >>>>>>>> * Lines 97 & 98 simply save the training and validation DataFrames
> >>>>>> using
> >>>>>>>> the names defined on lines 76 & 78.
> >>>>>>>> * Lines 103-137 create smaller train and validation DataFrames by
> >>>>>> taking
> >>>>>>>> small row-wise samples of the full train and validation
> DataFrames.
> >>>>> The
> >>>>>>>> percentage of the sample is defined on line 111 (`p=0.01` for a 1%
> >>>>>>>> sample).  This is generally useful for quicker downstream tasks
> >>>>> without
> >>>>>>>> having to load in the larger DataFrames, assuming you have a large
> >>>>>> amount
> >>>>>>>> of data.  For us, we have ~7TB of data, so having 1% sampled
> >>>>> DataFrames
> >>>>>> is
> >>>>>>>> useful for quicker downstream tests.  Once again, the same columns
> >>>>> from
> >>>>>> the
> >>>>>>>> larger train and validation DataFrames will be used.
> >>>>>>>> * Lines 146 & 147 simply save these sampled train and validation
> >>>>>>>> DataFrames.
> >>>>>>>>
> >>>>>>>> As a summary, after running `preprocess.py`, you will be left with
> >>>> the
> >>>>>>>> following saved DataFrames in HDFS:
> >>>>>>>> * Full DataFrame
> >>>>>>>> * Training DataFrame
> >>>>>>>> * Validation DataFrame
> >>>>>>>> * Sampled training DataFrame
> >>>>>>>> * Sampled validation DataFrame
> >>>>>>>>
> >>>>>>>> As for visualization, you may visualize a "sample" (i.e. small,
> >>>>>> chopped-up
> >>>>>>>> section of original image) from a DataFrame by using the `
> >>>>>>>> breastcancer.visualization.visualize_sample(...)` function.  You
> >>> will
> >>>>>>>> need to do this after creating the DataFrames.  Here is a snippet
> to
> >>>>>>>> visualize the first row sample in a DataFrame, where `df` is one
> of
> >>>>> the
> >>>>>>>> DataFrames from above:
> >>>>>>>>
> >>>>>>>> ```
> >>>>>>>> from breastcancer.visualization import visualize_sample
> >>>>>>>> visualize_sample(df.first().sample)
> >>>>>>>> ```
> >>>>>>>>
> >>>>>>>> Please let me know if you have any additional questions.
> >>>>>>>>
> >>>>>>>> Thanks!
> >>>>>>>>
> >>>>>>>> - Mike
> >>>>>>>>
> >>>>>>>> --
> >>>>>>>>
> >>>>>>>> Mike Dusenberry
> >>>>>>>> GitHub: github.com/dusenberrymw
> >>>>>>>> LinkedIn: linkedin.com/in/mikedusenberry
> >>>>>>>>
> >>>>>>>> Sent from my iPhone.
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>> On Apr 15, 2017, at 4:38 AM, Aishwarya Chaurasia <
> >>>>>>>> aishwarya2612@gmail.com> wrote:
> >>>>>>>>>
> >>>>>>>>> Hello sir,
> >>>>>>>>> Can you please elaborate more on what output we would be getting
> >>>>>> because
> >>>>>>>> we
> >>>>>>>>> tried executing the preprocess.py file using spark-submit, it
> keeps
> >>>> on
> >>>>>>>>> adding the tiles in rdd and while running the visualisation.py
> file
> >>>>> it
> >>>>>>>>> isn't showing any output. Can you please help us out asap stating
> >>>> the
> >>>>>>>>> output we will be getting and the sequence of execution of files.
> >>>>>>>>> Thank you.
> >>>>>>>>>
> >>>>>>>>>> On 07-Apr-2017 5:54 AM, <du...@gmail.com> wrote:
> >>>>>>>>>>
> >>>>>>>>>> Hi Aishwarya,
> >>>>>>>>>>
> >>>>>>>>>> Thanks for sharing more info on the issue!
> >>>>>>>>>>
> >>>>>>>>>> To facilitate easier usage, I've updated the preprocessing code
> by
> >>>>>>>> pulling
> >>>>>>>>>> out most of the logic into a `breastcancer/preprocessing.py`
> >>>>> module,
> >>>>>>>>>> leaving just the execution in the `Preprocessing.ipynb`
> notebook.
> >>>>>>>> There is
> >>>>>>>>>> also a `preprocess.py` script with the same contents as the
> >>>> notebook
> >>>>>> for
> >>>>>>>>>> use with `spark-submit`.  The choice of the notebook or the
> script
> >>>>> is
> >>>>>>>> just
> >>>>>>>>>> a matter of convenience, as they both import from the same
> >>>>>>>>>> `breastcancer/preprocessing.py` package.
> >>>>>>>>>>
> >>>>>>>>>> As part of the updates, I've added an explicit SparkSession
> >>>>> parameter
> >>>>>>>>>> (`spark`) to the `preprocess(...)` function, and updated the
> body
> >>>> to
> >>>>>> use
> >>>>>>>>>> this SparkSession object rather than the older SparkContext `sc`
> >>>>>> object.
> >>>>>>>>>> Previously, the `preprocess(...)` function accessed the `sc`
> >>> object
> >>>>>> that
> >>>>>>>>>> was pulled in from the enclosing scope, which would work while
> all
> >>>>> of
> >>>>>>>> the
> >>>>>>>>>> code was colocated within the notebook, but not if the code was
> >>>>>>>> extracted
> >>>>>>>>>> and imported.  The explicit parameter now allows for the code to
> >>> be
> >>>>>>>>>> imported.
> >>>>>>>>>>
> >>>>>>>>>> Can you please try again with the latest updates?  We are
> >>> currently
> >>>>>>>> using
> >>>>>>>>>> Spark 2.x with Python 3.  If you use the notebook, the pyspark
> >>>>> kernel
> >>>>>>>>>> should have a `spark` object available that can be supplied to
> the
> >>>>>>>>>> functions (as is done now in the notebook), and if you use the
> >>>>>>>>>> `preprocess.py` script with `spark-submit`, the `spark` object
> >>> will
> >>>>> be
> >>>>>>>>>> created explicitly by the script.
> >>>>>>>>>>
> >>>>>>>>>> For a bit of context to others, Aishwarya initially reached out
> to
> >>>>>> find
> >>>>>>>>>> out if our breast cancer project could be applied to TIFF
> images,
> >>>>>> rather
> >>>>>>>>>> than the SVS images we are currently using (the answer is "yes"
> so
> >>>>>> long
> >>>>>>>> as
> >>>>>>>>>> they are "generic tiled TIFF images, according to the OpenSlide
> >>>>>>>>>> documentation), and then followed up with Spark issues related
> to
> >>>>> the
> >>>>>>>>>> preprocessing code.  This conversation has been promptly moved
> to
> >>>>> the
> >>>>>>>>>> mailing list so that others in the community can benefit.
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> Thanks!
> >>>>>>>>>>
> >>>>>>>>>> -Mike
> >>>>>>>>>>
> >>>>>>>>>> --
> >>>>>>>>>>
> >>>>>>>>>> Mike Dusenberry
> >>>>>>>>>> GitHub: github.com/dusenberrymw
> >>>>>>>>>> LinkedIn: linkedin.com/in/mikedusenberry
> >>>>>>>>>>
> >>>>>>>>>> Sent from my iPhone.
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>> On Apr 6, 2017, at 5:09 AM, Aishwarya Chaurasia <
> >>>>>>>> aishwarya2612@gmail.com>
> >>>>>>>>>> wrote:
> >>>>>>>>>>>
> >>>>>>>>>>> Hey,
> >>>>>>>>>>>
> >>>>>>>>>>> The object sc is already defined in pyspark and yet this name
> >>>> error
> >>>>>>>> keeps
> >>>>>>>>>>> occurring. We are using spark 2.*
> >>>>>>>>>>>
> >>>>>>>>>>> Here is the link to error that we are getting :
> >>>>>>>>>>> https://paste.fedoraproject.org/paste/
> >>>>> 89iQODxzpNZVbSfgwocH8l5M1UNdIG
> >>>>>>>>>> YhyRLivL9gydE=
> >>>>>>>>>>
> >>>>>>>>
> >>>>>>
> >>>>>
> >>>
> >>
>

Re: Please reply ASAP : Regarding incubator systemml/breast_cancer project

Posted by du...@gmail.com.
Hi Aishwarya,

Unfortunately this mailing list removes all images, so I can't view your screenshot.  I'm assuming that it is the same issue with the missing SparkContext `sc` object, but please let me know if it is a different issue.  This sounds like it could be an issue with multiple kernels installed in Jupyter.  When you start the notebook, can you see if there are multiple kernels listed in the "Kernel" -> "Change Kernel" menu?  If so, please try one of the other kernels to see if Jupyter is starting by default with a non-spark kernel.  Also, is it possible that you have more than one instance of the Jupyter server running?  I.e. for this scenario, we start Jupyter itself directly via pyspark using the command sent previously, whereas usually Jupyter can just be started with `jupyter notebook`.  In the latter case, PySpark (and thus `sc`) would *not* be available (unless you've set up special PySpark kernels separately).  In summary, can you (1) check for other kernels via the menus, and (2) check for other running Jupyter servers that are non-PySpark?
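As a quick way to confirm which kernel a given notebook is actually using, a small check like the following can be run in a cell.  This is just an illustrative sketch, not part of the project code; it only assumes the standard `sc` name that a PySpark-backed kernel provides:

```
# Illustrative notebook-cell check: a PySpark-backed kernel exposes `sc`;
# a plain Jupyter kernel does not, which produces the NameError seen earlier.
try:
    print("SparkContext is available, Spark version:", sc.version)
except NameError:
    print("No `sc` object found -- this kernel was not started through pyspark.")
```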

As for the other inquiry, great question!  When training models, it's quite useful to track the loss and other metrics (i.e. accuracy) from *both* the training and validation sets.  The reasoning is that it allows for a more holistic view of the overall learning process, such as evaluating whether any overfitting or underfitting is occurring.  For example, say that you train a model and achieve an accuracy of 80% on the validation set.  Is this good?  Is this the best that can be done?  Without also tracking performance on the training set, it can be difficult to make these decisions.  Say that you then measure the performance on the training set and find that the model achieves 100% accuracy on that data.  That might be a good indication that your model is overfitting the training set, and that a combination of more data, regularization, and a smaller model may be helpful in raising the generalization performance, i.e. the performance on the validation set and future real examples on which you wish to make predictions.  If on the other hand, the model achieved an 82% on the training set, this could be a good indication that the model is underfitting, and that a combination of a more expressive model and better data could be helpful.  In summary, tracking performance on both the training and validation datasets can be useful for determining ways in which to improve the overall learning process.
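To make that reasoning concrete, here is a tiny sketch using made-up accuracy numbers and arbitrary thresholds (not results from the notebook) showing how the gap between training and validation performance guides the diagnosis:

```
# Hypothetical accuracies, purely to illustrate the reasoning above.
train_acc = 1.00   # accuracy measured on the training set
val_acc = 0.80     # accuracy measured on the validation set

gap = train_acc - val_acc
if gap > 0.10:
    # Training performance far exceeds validation performance: likely
    # overfitting -- more data, regularization, or a smaller model may help.
    print("Likely overfitting (train/val gap = %.2f)" % gap)
elif train_acc < 0.85:
    # Both numbers are low and close together: likely underfitting -- a more
    # expressive model or better data may help.
    print("Likely underfitting (train accuracy = %.2f)" % train_acc)
else:
    print("Training and validation performance are reasonably balanced.")
```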


- Mike

--

Mike Dusenberry
GitHub: github.com/dusenberrymw
LinkedIn: linkedin.com/in/mikedusenberry

Sent from my iPhone.


> On Apr 25, 2017, at 8:47 AM, Aishwarya Chaurasia <ai...@gmail.com> wrote:
> 
> We had another query, sir. We read the entire MachineLearning.ipynb code.
> In it, the training samples and the validation samples have both been
> evaluated separately and their respective losses and accuracies obtained.
> Why are the training samples being evaluated again if they were used to
> train the model in the first place? Shouldn't only the validation data
> frames be evaluated to find out the loss and accuracy?
> 
> Thank you
> 
> On 25-Apr-2017 4:00 PM, "Aishwarya Chaurasia" <ai...@gmail.com>
> wrote:
> 
>> Hello sir,
>> 
>> The NameError is occurring again, sir. Why does it keep resurfacing?
>> 
>> Attaching the screenshot of the error.
>> 
>>> On 25-Apr-2017 2:50 AM, <du...@gmail.com> wrote:
>>> 
>>> Hi Aishwarya,
>>> 
>>> For the error message, that just means that the SystemML jar isn't being
>>> found.  Can you add a `--driver-class-path $SYSTEMML_HOME/target/SystemML.jar`
>>> to the invocation of Jupyter?  I.e. `PYSPARK_PYTHON=python3
>>> PYSPARK_DRIVER_PYTHON=jupyter PYSPARK_DRIVER_PYTHON_OPTS="notebook"
>>> pyspark  --jars $SYSTEMML_HOME/target/SystemML.jar --driver-class-path
>>> $SYSTEMML_HOME/target/SystemML.jar`. There was a PySpark bug that was
>>> supposed to have been fixed in Spark 2.x, but it's possible that it is
>>> still an issue.
>>> 
>>> As for the output, the notebook will create SystemML `Matrix` objects for
>>> all of the weights and biases of the trained models.  To save, please
>>> convert each one to a DataFrame, i.e. `Wc1.toDF()` and repeated for each
>>> matrix, and then simply save the DataFrames.  This could be done all at
>>> once like this for a SystemML Matrix object `Wc1`:
>>> `Wc1.toDF().write.save("path/to/save/Wc1.parquet", format="parquet")`.
>>> Just repeat for each matrix returned by the "Train" code for the
>>> algorithms.  At that point, you will have a set of saved DataFrames
>>> representing a trained SystemML model, and these can be used in downstream
>>> classification tasks in a similar manner to the "Eval" sections.
>>> 
>>> -Mike
>>> 
>>> --
>>> 
>>> Mike Dusenberry
>>> GitHub: github.com/dusenberrymw
>>> LinkedIn: linkedin.com/in/mikedusenberry
>>> 
>>> Sent from my iPhone.
>>> 
>>> 
>>>> On Apr 24, 2017, at 3:07 AM, Aishwarya Chaurasia <
>>> aishwarya2612@gmail.com> wrote:
>>>> 
>>>> Further more :
>>>> What is the output of MachineLearning.ipynb you're obtaining sir?
>>>> We are actually nearing our deadline for our problem.
>>>> Thanks a lot.
>>>> 
>>>> On 24-Apr-2017 2:58 PM, "Aishwarya Chaurasia" <ai...@gmail.com>
>>>> wrote:
>>>> 
>>>> Hello sir,
>>>> 
>>>> Thanks a lot for replying sir. But unfortunately it did not work.
>>> Although
>>>> the NameError did not appear this time but another error came about :
>>>> 
>>>> https://paste.fedoraproject.org/paste/TUMtSIb88Q73FYekwJmM7V
>>>> 5M1UNdIGYhyRLivL9gydE=
>>>> 
>>>> This error was obtained after executing the second block of code of
>>>> MachineLearning.py in terminal. ( ml = MLContext(sc) )
>>>> 
>>>> We have installed the bleeding-edge version of systemml only and the
>>>> installation was done correctly. We are in a fix now. :/
>>>> Kindly look into the matter asap
>>>> 
>>>> On 24-Apr-2017 12:15 PM, "Mike Dusenberry" <du...@gmail.com>
>>> wrote:
>>>> 
>>>> Hi Aishwarya,
>>>> 
>>>> Glad to hear that the preprocessing stage was successful!  As for the
>>>> `MachineLearning.ipynb` notebook, here is a general guide:
>>>> 
>>>> 
>>>>  - The `MachineLearning.ipynb` notebook essentially (1) loads in the
>>>>  training and validation DataFrames from the preprocessing step, (2)
>>>>  converts them to normalized & one-hot encoded SystemML matrices for
>>>>  consumption by the ML algorithms, and (3) explores training a couple
>>> of
>>>>  models.
>>>>  - To run, you'll need to start Jupyter in the context of PySpark via
>>>>  `PYSPARK_PYTHON=python3 PYSPARK_DRIVER_PYTHON=jupyter
>>>>  PYSPARK_DRIVER_PYTHON_OPTS="notebook" pyspark  --jars
>>>>  $SYSTEMML_HOME/target/SystemML.jar`.  Note that if you have installed
>>>>  SystemML with pip from PyPI (`pip3 install systemml`), this will
>>> install
>>>>  our 0.13 release, and the `--jars $SYSTEMML_HOME/target/SystemML.jar`
>>>> will
>>>>  not be necessary.  If you instead have installed a bleeding-edge
>>> version
>>>> of
>>>>  SystemML locally (git clone locally, maven build, `pip3 install -e
>>>>  src/main/python` as listed in `projects/breast_cancer/README.md`),
>>> the
>>>>  `--jars $SYSTEMML_HOME/target/SystemML.jar` part *is* necessary.  We
>>> are
>>>>  about to release 0.14, and for this project, I *would* recommend
>>> using a
>>>>  bleeding edge install.
>>>>  - Once Jupyter has been started in the context of PySpark, the `sc`
>>>>  SparkContext object should be available.  Please let me know if you
>>>>  continue to see this issue.
>>>>  - The "Read in train & val data" section simply reads in the training
>>>>  and validation data generated in the preprocessing stage.  Be sure
>>> that
>>>> the
>>>>  `size` setting is the same as the preprocessing size.  The percentage
>>> `p`
>>>>  setting determines whether the full or sampled DataFrames are
>>> loaded.  If
>>>>  you set `p = 1`, the full DataFrames will be used.  If you instead
>>> would
>>>>  prefer to use the smaller sampled DataFrames while getting started,
>>>> please
>>>>  set it to the same value as used in the preprocessing to generate the
>>>>  smaller sampled DataFrames.
>>>>  - The `Extract X & Y matrices` section splits each of the train and
>>>>  validation DataFrames into effectively X & Y matrices (still as
>>> DataFrame
>>>>  types), with X containing the images, and Y containing the labels.
>>>>  - The `Convert to SystemML Matrices` section passes the X & Y
>>> DataFrames
>>>>  into a SystemML script that performs some normalization of the images
>>> &
>>>>  one-hot encoding of the labels, and then returns SystemML `Matrix`
>>> types.
>>>>  These are now ready to be passed into the subsequent algorithms.
>>>>  - The "Trigger Caching" and "Save Matrices" are experimental features,
>>>>  and not necessary to execute.
>>>>  - Next comes the two algorithms being explored in this notebook.  The
>>>>  "Softmax Classifier" is just a multi-class logistic regression model,
>>> and
>>>>  is simply there to serve as a baseline comparison with the subsequent
>>>>  convolutional neural net model.  You may wish to simply skip this
>>> softmax
>>>>  model and move to the latter convnet model further down in the
>>> notebook.
>>>>  - The actual softmax model is located at [
>>>>  https://github.com/apache/incubator-systemml/blob/master/
>>>> projects/breast_cancer/softmax_clf.dml],
>>>>  and the notebook calls functions from that file.
>>>>  - The softmax sanity check just ensures that the model is able to
>>>>  completely overfit when given a tiny sample size.  This should yield
>>>> ~100%
>>>>  training accuracy if the sample size in this section is small enough.
>>>> This
>>>>  is just a check to ensure that nothing else is wrong with the math or
>>> the
>>>>  data.
>>>>  - The softmax "Train" section will train a softmax model and return
>>> the
>>>>  weights (`W`) and biases (`b`) of the model as SystemML `Matrix`
>>> objects.
>>>>  Please adjust the hyperparameters in this section to your problem.
>>>>  - The softmax "Eval" section takes the trained weights and biases and
>>>>  evaluates the training and validation performance.
>>>>  - The next model is a LeNet-like convnet model.  The actual model is
>>>>  located at [
>>>>  https://github.com/apache/incubator-systemml/blob/master/
>>>> projects/breast_cancer/convnet.dml],
>>>>  and the notebook simply calls functions from that file.
>>>>  - Once again, there is an initial sanity check for the ability to
>>>>  overfit on a small amount of data.
>>>>  - The "Hyperparameter Search" contains a script to sample different
>>>>  hyperparams for the convnet, and save the hyperparams + validation
>>>> accuracy
>>>>  of each set after a single epoch of training.  These string files
>>> will be
>>>>  saved to HDFS.  Please feel free to adjust the range of the
>>>> hyperparameters
>>>>  for your problem.  Please also feel free to try using the `parfor`
>>>>  (parallel for-loop) instead of the while loop to speed up this
>>> section.
>>>>  Note that this is still a work in progress.  The hyperparameter
>>> tuning in
>>>>  this section makes use of random search (as opposed to grid search),
>>>> which
>>>>  has been promoted by Bengio et al. to speed up the search time.
>>>>  - The "Train" section trains the convnet and returns the weights and
>>>>  biases as SystemML `Matrix` types.  In this section, please replace
>>> the
>>>>  hyperparameters with the best ones from above, and please increase the
>>>>  number of epochs given your time constraints.
>>>>  - The "Eval" section evaluates the performance of the trained convnet.
>>>>  - Although it is not shown in the notebook yet, to save the weights
>>> and
>>>>  biases, please use the `toDF()` method on each weight and biases (i.e.
>>>>  `Wc1.toDF()`) to convert to a Spark DataFrame, and then simply save
>>> the
>>>>  DataFrame as desired.
>>>>  - Finally, please feel free to extend the model in `convnet.dml` for
>>>>  your particular problem!  The LeNet-like model just serves as a simple
>>>>  convnet, but there are much richer models currently, such as resnets,
>>>> that
>>>>  we are experimenting with.  To make larger models such as resnets
>>> easier
>>>> to
>>>>  define, we are also working on other tools for converting model
>>>> definitions
>>>>  + pretrained weights from other systems into SystemML.
>>>> 
>>>> 
>>>> Also, please keep in mind that the deep learning support in SystemML is
>>>> still a work in progress.  Therefore, if you run into issues, please
>>> let us
>>>> know and we'll do everything possible to help get things running!
>>>> 
>>>> 
>>>> Thanks!
>>>> 
>>>> - Mike
>>>> 
>>>> 
>>>> --
>>>> 
>>>> Michael W. Dusenberry
>>>> GitHub: github.com/dusenberrymw
>>>> LinkedIn: linkedin.com/in/mikedusenberry
>>>> 
>>>> On Sat, Apr 22, 2017 at 4:49 AM, Aishwarya Chaurasia <
>>>> aishwarya2612@gmail.com> wrote:
>>>> 
>>>>> Hey,
>>>>> 
>>>>> Thank you so much for your help sir. We were finally able to run
>>>>> preprocess.py without any errors. And the results obtained were
>>>>> satisfactory, i.e. we got five sets of data frames like you said we would.
>>>>> 
>>>>> But alas! when we tried to run MachineLearning.ipynb the same NameError
>>>>> came : https://paste.fedoraproject.org/paste/l3LFJreg~
>>>>> vnYEDTSTQH73l5M1UNdIGYhyRLivL9gydE=
>>>>> 
>>>>> Could you guide us again as to how to proceed now?
>>>>> Also, could you please provide an overview of the process
>>>>> MachineLearning.ipynb is following to train the samples.
>>>>> 
>>>>> Thanks a lot!
>>>>> 
>>>>>> On 20-Apr-2017 12:16 AM, <du...@gmail.com> wrote:
>>>>>> 
>>>>>> Hi Aishwarya,
>>>>>> 
>>>>>> Looks like you've just encountered an out of memory error on one of
>>> the
>>>>>> executors.  Therefore, you just need to adjust the
>>>>> `spark.executor.memory`
>>>>>> and `spark.driver.memory` settings with higher amounts of RAM.  What
>>> is
>>>>>> your current setup?  I.e. are you using a cluster of machines, or a
>>>>> single
>>>>>> machine?  We generally use a large driver on one machine, and then a
>>>>> single
>>>>>> large executor on each other machine.  I would give a sizable amount
>>> of
>>>>>> memory to the driver, and about half the possible memory on the
>>>> executors
>>>>>> so that the Python processes have enough memory as well.  PySpark has
>>>> JVM
>>>>>> and Python components, and the Spark memory settings only pertain to
>>> the
>>>>>> JVM side, thus the need to save about half the executor memory for the
>>>>>> Python side.
>>>>>> 
>>>>>> Thanks!
>>>>>> 
>>>>>> - Mike
>>>>>> 
>>>>>> --
>>>>>> 
>>>>>> Mike Dusenberry
>>>>>> GitHub: github.com/dusenberrymw
>>>>>> LinkedIn: linkedin.com/in/mikedusenberry
>>>>>> 
>>>>>> Sent from my iPhone.
>>>>>> 
>>>>>> 
>>>>>>> On Apr 19, 2017, at 5:53 AM, Aishwarya Chaurasia <
>>>>>> aishwarya2612@gmail.com> wrote:
>>>>>>> 
>>>>>>> Hello sir,
>>>>>>> 
>>>>>>> We also wanted to ensure that the spark-submit command we're using is
>>>>> the
>>>>>>> correct one for running 'preprocess.py'.
>>>>>>> Command :  /home/new/sparks/bin/spark-submit preprocess.py
>>>>>>> 
>>>>>>> 
>>>>>>> Thank you.
>>>>>>> Aishwarya Chaurasia.
>>>>>>> 
>>>>>>> On 19-Apr-2017 3:55 PM, "Aishwarya Chaurasia" <
>>> aishwarya2612@gmail.com
>>>>>> 
>>>>>>> wrote:
>>>>>>> 
>>>>>>> Hello sir,
>>>>>>> On running the file preprocess.py we are getting the following error
>>> :
>>>>>>> 
>>>>>>> https://paste.fedoraproject.org/paste/IAvqiiyJChSC0V9eeETe2F5M1UNdIG
>>>>>>> YhyRLivL9gydE=
>>>>>>> 
>>>>>>> Can you please help us by looking into the error and kindly tell us
>>>> the
>>>>>>> solution for it.
>>>>>>> Thanks a lot.
>>>>>>> Aishwarya Chaurasia
>>>>>>> 
>>>>>>> 
>>>>>>>> On 19-Apr-2017 12:43 AM, <du...@gmail.com> wrote:
>>>>>>>> 
>>>>>>>> Hi Aishwarya,
>>>>>>>> 
>>>>>>>> Certainly, here is some more detailed information
>>>>> about`preprocess.py`:
>>>>>>>> 
>>>>>>>> * The preprocessing Python script is located at
>>>>>>>> https://github.com/apache/incubator-systemml/blob/master/
>>>>>>>> projects/breast_cancer/preprocess.py. Note that this is different
>>>>> than
>>>>>>>> the library module at https://github.com/apache/incu
>>>>>>>> bator-systemml/blob/master/projects/breast_cancer/breastc
>>>>>>>> ancer/preprocessing.py.
>>>>>>>> * This script is used to preprocess a set of histology slide images,
>>>>>>>> which are `.svs` files in our case, and `.tiff` files in your case.
>>>>>>>> * Lines 63-79 contain "settings" such as the output image sizes,
>>>>> folder
>>>>>>>> paths, etc.  Of particular interest, line 72 has the folder path for
>>>>> the
>>>>>>>> original slide images that should be commonly accessible from all
>>>>>> machines
>>>>>>>> being used, and lines 74-79 contain the names of the output
>>>> DataFrames
>>>>>> that
>>>>>>>> will be saved.
>>>>>>>> * Line 82 performs the actual preprocessing and creates a Spark
>>>>>>>> DataFrame with the following columns: slide number, tumor score,
>>>>>> molecular
>>>>>>>> score, sample.  The "sample" in this case is the actual small,
>>>>>> chopped-up
>>>>>>>> section of the image that has been extracted and flattened into a
>>> row
>>>>>>>> Vector.  For test images without labels (`training=false`), only the
>>>>>> slide
>>>>>>>> number and sample will be contained in the DataFrame (i.e. no
>>>> labels).
>>>>>>>> This calls the `preprocess(...)` function located on line 371 of
>>>>>>>> https://github.com/apache/incubator-systemml/blob/master/
>>>>>>>> projects/breast_cancer/breastcancer/preprocessing.py, which is a
>>>>>>>> different file.
>>>>>>>> * Line 87 simply saves the above DataFrame to HDFS with the name
>>>> from
>>>>>>>> line 74.
>>>>>>>> * Line 93 splits the above DataFrame row-wise into separate
>>>>> "training"
>>>>>>>> and "validation" DataFrames, based on the split percentage from line
>>>>> 70
>>>>>>>> (`train_frac`).  This is performed so that downstream machine
>>>> learning
>>>>>>>> tasks can learn from the training set, and validate performance and
>>>>>>>> hyperparameter choices on the validation set.  These DataFrames will
>>>>>> start
>>>>>>>> with the same columns as the above DataFrame.  If `add_row_indices`
>>>>> from
>>>>>>>> line 69 is true, then an additional row index column (`__INDEX`)
>>> will
>>>>> be
>>>>>>>> prepended.  This is useful for SystemML in downstream machine
>>>> learning
>>>>>>>> tasks as it gives the DataFrame row numbers like a real matrix would
>>>>>> have,
>>>>>>>> and SystemML is built to operate on matrices.
>>>>>>>> * Lines 97 & 98 simply save the training and validation DataFrames
>>>>>> using
>>>>>>>> the names defined on lines 76 & 78.
>>>>>>>> * Lines 103-137 create smaller train and validation DataFrames by
>>>>>> taking
>>>>>>>> small row-wise samples of the full train and validation DataFrames.
>>>>> The
>>>>>>>> percentage of the sample is defined on line 111 (`p=0.01` for a 1%
>>>>>>>> sample).  This is generally useful for quicker downstream tasks
>>>>> without
>>>>>>>> having to load in the larger DataFrames, assuming you have a large
>>>>>> amount
>>>>>>>> of data.  For us, we have ~7TB of data, so having 1% sampled
>>>>> DataFrames
>>>>>> is
>>>>>>>> useful for quicker downstream tests.  Once again, the same columns
>>>>> from
>>>>>> the
>>>>>>>> larger train and validation DataFrames will be used.
>>>>>>>> * Lines 146 & 147 simply save these sampled train and validation
>>>>>>>> DataFrames.
>>>>>>>> 
>>>>>>>> As a summary, after running `preprocess.py`, you will be left with
>>>> the
>>>>>>>> following saved DataFrames in HDFS:
>>>>>>>> * Full DataFrame
>>>>>>>> * Training DataFrame
>>>>>>>> * Validation DataFrame
>>>>>>>> * Sampled training DataFrame
>>>>>>>> * Sampled validation DataFrame
>>>>>>>> 
>>>>>>>> As for visualization, you may visualize a "sample" (i.e. small,
>>>>>> chopped-up
>>>>>>>> section of original image) from a DataFrame by using the `
>>>>>>>> breastcancer.visualization.visualize_sample(...)` function.  You
>>> will
>>>>>>>> need to do this after creating the DataFrames.  Here is a snippet to
>>>>>>>> visualize the first row sample in a DataFrame, where `df` is one of
>>>>> the
>>>>>>>> DataFrames from above:
>>>>>>>> 
>>>>>>>> ```
>>>>>>>> from breastcancer.visualization import visualize_sample
>>>>>>>> visualize_sample(df.first().sample)
>>>>>>>> ```
>>>>>>>> 
>>>>>>>> Please let me know if you have any additional questions.
>>>>>>>> 
>>>>>>>> Thanks!
>>>>>>>> 
>>>>>>>> - Mike
>>>>>>>> 
>>>>>>>> --
>>>>>>>> 
>>>>>>>> Mike Dusenberry
>>>>>>>> GitHub: github.com/dusenberrymw
>>>>>>>> LinkedIn: linkedin.com/in/mikedusenberry
>>>>>>>> 
>>>>>>>> Sent from my iPhone.
>>>>>>>> 
>>>>>>>> 
>>>>>>>>> On Apr 15, 2017, at 4:38 AM, Aishwarya Chaurasia <
>>>>>>>> aishwarya2612@gmail.com> wrote:
>>>>>>>>> 
>>>>>>>>> Hello sir,
>>>>>>>>> Can you please elaborate more on what output we would be getting
>>>>>> because
>>>>>>>> we
>>>>>>>>> tried executing the preprocess.py file using spark-submit, it keeps
>>>> on
>>>>>>>>> adding the tiles in rdd and while running the visualisation.py file
>>>>> it
>>>>>>>>> isn't showing any output. Can you please help us out asap stating
>>>> the
>>>>>>>>> output we will be getting and the sequence of execution of files.
>>>>>>>>> Thank you.
>>>>>>>>> 
>>>>>>>>>> On 07-Apr-2017 5:54 AM, <du...@gmail.com> wrote:
>>>>>>>>>> 
>>>>>>>>>> Hi Aishwarya,
>>>>>>>>>> 
>>>>>>>>>> Thanks for sharing more info on the issue!
>>>>>>>>>> 
>>>>>>>>>> To facilitate easier usage, I've updated the preprocessing code by
>>>>>>>> pulling
>>>>>>>>>> out most of the logic into a `breastcancer/preprocessing.py`
>>>>> module,
>>>>>>>>>> leaving just the execution in the `Preprocessing.ipynb` notebook.
>>>>>>>> There is
>>>>>>>>>> also a `preprocess.py` script with the same contents as the
>>>> notebook
>>>>>> for
>>>>>>>>>> use with `spark-submit`.  The choice of the notebook or the script
>>>>> is
>>>>>>>> just
>>>>>>>>>> a matter of convenience, as they both import from the same
>>>>>>>>>> `breastcancer/preprocessing.py` package.
>>>>>>>>>> 
>>>>>>>>>> As part of the updates, I've added an explicit SparkSession
>>>>> parameter
>>>>>>>>>> (`spark`) to the `preprocess(...)` function, and updated the body
>>>> to
>>>>>> use
>>>>>>>>>> this SparkSession object rather than the older SparkContext `sc`
>>>>>> object.
>>>>>>>>>> Previously, the `preprocess(...)` function accessed the `sc`
>>> object
>>>>>> that
>>>>>>>>>> was pulled in from the enclosing scope, which would work while all
>>>>> of
>>>>>>>> the
>>>>>>>>>> code was colocated within the notebook, but not if the code was
>>>>>>>> extracted
>>>>>>>>>> and imported.  The explicit parameter now allows for the code to
>>> be
>>>>>>>>>> imported.
>>>>>>>>>> 
>>>>>>>>>> Can you please try again with the latest updates?  We are
>>> currently
>>>>>>>> using
>>>>>>>>>> Spark 2.x with Python 3.  If you use the notebook, the pyspark
>>>>> kernel
>>>>>>>>>> should have a `spark` object available that can be supplied to the
>>>>>>>>>> functions (as is done now in the notebook), and if you use the
>>>>>>>>>> `preprocess.py` script with `spark-submit`, the `spark` object
>>> will
>>>>> be
>>>>>>>>>> created explicitly by the script.
>>>>>>>>>> 
>>>>>>>>>> For a bit of context to others, Aishwarya initially reached out to
>>>>>> find
>>>>>>>>>> out if our breast cancer project could be applied to TIFF images,
>>>>>> rather
>>>>>>>>>> than the SVS images we are currently using (the answer is "yes" so
>>>>>> long
>>>>>>>> as
>>>>>>>>>> they are "generic tiled TIFF images, according to the OpenSlide
>>>>>>>>>> documentation), and then followed up with Spark issues related to
>>>>> the
>>>>>>>>>> preprocessing code.  This conversation has been promptly moved to
>>>>> the
>>>>>>>>>> mailing list so that others in the community can benefit.
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> Thanks!
>>>>>>>>>> 
>>>>>>>>>> -Mike
>>>>>>>>>> 
>>>>>>>>>> --
>>>>>>>>>> 
>>>>>>>>>> Mike Dusenberry
>>>>>>>>>> GitHub: github.com/dusenberrymw
>>>>>>>>>> LinkedIn: linkedin.com/in/mikedusenberry
>>>>>>>>>> 
>>>>>>>>>> Sent from my iPhone.
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>>> On Apr 6, 2017, at 5:09 AM, Aishwarya Chaurasia <
>>>>>>>> aishwarya2612@gmail.com>
>>>>>>>>>> wrote:
>>>>>>>>>>> 
>>>>>>>>>>> Hey,
>>>>>>>>>>> 
>>>>>>>>>>> The object sc is already defined in pyspark and yet this name
>>>> error
>>>>>>>> keeps
>>>>>>>>>>> occurring. We are using spark 2.*
>>>>>>>>>>> 
>>>>>>>>>>> Here is the link to error that we are getting :
>>>>>>>>>>> https://paste.fedoraproject.org/paste/
>>>>> 89iQODxzpNZVbSfgwocH8l5M1UNdIG
>>>>>>>>>> YhyRLivL9gydE=
>>>>>>>>>> 
>>>>>>>> 
>>>>>> 
>>>>> 
>>> 
>> 

Re: Please reply ASAP : Regarding incubator systemml/breast_cancer project

Posted by Aishwarya Chaurasia <ai...@gmail.com>.
We had another query, sir. We read the entire MachineLearning.ipynb code.
In it, the training samples and the validation samples have both been
evaluated separately and their respective losses and accuracies obtained.
Why are the training samples being evaluated again if they were used to
train the model in the first place? Shouldn't only the validation data
frames be evaluated to find out the loss and accuracy?

Thank you

On 25-Apr-2017 4:00 PM, "Aishwarya Chaurasia" <ai...@gmail.com>
wrote:

> Hello sir,
>
> The NameError is occurring again, sir. Why does it keep resurfacing?
>
> Attaching the screenshot of the error.
>
> On 25-Apr-2017 2:50 AM, <du...@gmail.com> wrote:
>
>> Hi Aishwarya,
>>
>> For the error message, that just means that the SystemML jar isn't being
>> found.  Can you add a `--driver-class-path $SYSTEMML_HOME/target/SystemML.jar`
>> to the invocation of Jupyter?  I.e. `PYSPARK_PYTHON=python3
>> PYSPARK_DRIVER_PYTHON=jupyter PYSPARK_DRIVER_PYTHON_OPTS="notebook"
>> pyspark  --jars $SYSTEMML_HOME/target/SystemML.jar --driver-class-path
>> $SYSTEMML_HOME/target/SystemML.jar`. There was a PySpark bug that was
>> supposed to have been fixed in Spark 2.x, but it's possible that it is
>> still an issue.
>>
>> As for the output, the notebook will create SystemML `Matrix` objects for
>> all of the weights and biases of the trained models.  To save, please
>> convert each one to a DataFrame, i.e. `Wc1.toDF()` and repeated for each
>> matrix, and then simply save the DataFrames.  This could be done all at
>> once like this for a SystemML Matrix object `Wc1`:
>> `Wc1.toDF().write.save("path/to/save/Wc1.parquet", format="parquet")`.
>> Just repeat for each matrix returned by the "Train" code for the
>> algorithms.  At that point, you will have a set of saved DataFrames
>> representing a trained SystemML model, and these can be used in downstream
>> classification tasks in a similar manner to the "Eval" sections.
>>
>> -Mike
>>
>> --
>>
>> Mike Dusenberry
>> GitHub: github.com/dusenberrymw
>> LinkedIn: linkedin.com/in/mikedusenberry
>>
>> Sent from my iPhone.
>>
>>
>> > On Apr 24, 2017, at 3:07 AM, Aishwarya Chaurasia <
>> aishwarya2612@gmail.com> wrote:
>> >
>> > Further more :
>> > What is the output of MachineLearning.ipynb you're obtaining sir?
>> > We are actually nearing our deadline for our problem.
>> > Thanks a lot.
>> >
>> > On 24-Apr-2017 2:58 PM, "Aishwarya Chaurasia" <ai...@gmail.com>
>> > wrote:
>> >
>> > Hello sir,
>> >
>> > Thanks a lot for replying sir. But unfortunately it did not work.
>> Although
>> > the NameError did not appear this time but another error came about :
>> >
>> > https://paste.fedoraproject.org/paste/TUMtSIb88Q73FYekwJmM7V
>> > 5M1UNdIGYhyRLivL9gydE=
>> >
>> > This error was obtained after executing the second block of code of
>> > MachineLearning.py in terminal. ( ml = MLContext(sc) )
>> >
>> > We have installed the bleeding-edge version of systemml only and the
>> > installation was done correctly. We are in a fix now. :/
>> > Kindly look into the matter asap
>> >
>> > On 24-Apr-2017 12:15 PM, "Mike Dusenberry" <du...@gmail.com>
>> wrote:
>> >
>> > Hi Aishwarya,
>> >
>> > Glad to hear that the preprocessing stage was successful!  As for the
>> > `MachineLearning.ipynb` notebook, here is a general guide:
>> >
>> >
>> >   - The `MachineLearning.ipynb` notebook essentially (1) loads in the
>> >   training and validation DataFrames from the preprocessing step, (2)
>> >   converts them to normalized & one-hot encoded SystemML matrices for
>> >   consumption by the ML algorithms, and (3) explores training a couple
>> of
>> >   models.
>> >   - To run, you'll need to start Jupyter in the context of PySpark via
>> >   `PYSPARK_PYTHON=python3 PYSPARK_DRIVER_PYTHON=jupyter
>> >   PYSPARK_DRIVER_PYTHON_OPTS="notebook" pyspark  --jars
>> >   $SYSTEMML_HOME/target/SystemML.jar`.  Note that if you have installed
>> >   SystemML with pip from PyPI (`pip3 install systemml`), this will
>> install
>> >   our 0.13 release, and the `--jars $SYSTEMML_HOME/target/SystemML.jar`
>> > will
>> >   not be necessary.  If you instead have installed a bleeding-edge
>> version
>> > of
>> >   SystemML locally (git clone locally, maven build, `pip3 install -e
>> >   src/main/python` as listed in `projects/breast_cancer/README.md`),
>> the
>> >   `--jars $SYSTEMML_HOME/target/SystemML.jar` part *is* necessary.  We
>> are
>> >   about to release 0.14, and for this project, I *would* recommend
>> using a
>> >   bleeding edge install.
>> >   - Once Jupyter has been started in the context of PySpark, the `sc`
>> >   SparkContext object should be available.  Please let me know if you
>> >   continue to see this issue.
>> >   - The "Read in train & val data" section simply reads in the training
>> >   and validation data generated in the preprocessing stage.  Be sure
>> that
>> > the
>> >   `size` setting is the same as the preprocessing size.  The percentage
>> `p`
>> >   setting determines whether the full or sampled DataFrames are
>> loaded.  If
>> >   you set `p = 1`, the full DataFrames will be used.  If you instead
>> would
>> >   prefer to use the smaller sampled DataFrames while getting started,
>> > please
>> >   set it to the same value as used in the preprocessing to generate the
>> >   smaller sampled DataFrames.
>> >   - The `Extract X & Y matrices` section splits each of the train and
>> >   validation DataFrames into effectively X & Y matrices (still as
>> DataFrame
>> >   types), with X containing the images, and Y containing the labels.
>> >   - The `Convert to SystemML Matrices` section passes the X & Y
>> DataFrames
>> >   into a SystemML script that performs some normalization of the images
>> &
>> >   one-hot encoding of the labels, and then returns SystemML `Matrix`
>> types.
>> >   These are now ready to be passed into the subsequent algorithms.
>> >   - The "Trigger Caching" and "Save Matrices" are experimental features,
>> >   and not necessary to execute.
>> >   - Next comes the two algorithms being explored in this notebook.  The
>> >   "Softmax Classifier" is just a multi-class logistic regression model,
>> and
>> >   is simply there to serve as a baseline comparison with the subsequent
>> >   convolutional neural net model.  You may wish to simply skip this
>> softmax
>> >   model and move to the latter convnet model further down in the
>> notebook.
>> >   - The actual softmax model is located at [
>> >   https://github.com/apache/incubator-systemml/blob/master/
>> > projects/breast_cancer/softmax_clf.dml],
>> >   and the notebook calls functions from that file.
>> >   - The softmax sanity check just ensures that the model is able to
>> >   completely overfit when given a tiny sample size.  This should yield
>> > ~100%
>> >   training accuracy if the sample size in this section is small enough.
>> > This
>> >   is just a check to ensure that nothing else is wrong with the math or
>> the
>> >   data.
>> >   - The softmax "Train" section will train a softmax model and return
>> the
>> >   weights (`W`) and biases (`b`) of the model as SystemML `Matrix`
>> objects.
>> >   Please adjust the hyperparameters in this section to your problem.
>> >   - The softmax "Eval" section takes the trained weights and biases and
>> >   evaluates the training and validation performance.
>> >   - The next model is a LeNet-like convnet model.  The actual model is
>> >   located at [
>> >   https://github.com/apache/incubator-systemml/blob/master/
>> > projects/breast_cancer/convnet.dml],
>> >   and the notebook simply calls functions from that file.
>> >   - Once again, there is an initial sanity check for the ability to
>> >   overfit on a small amount of data.
>> >   - The "Hyperparameter Search" contains a script to sample different
>> >   hyperparams for the convnet, and save the hyperparams + validation
>> > accuracy
>> >   of each set after a single epoch of training.  These string files
>> will be
>> >   saved to HDFS.  Please feel free to adjust the range of the
>> > hyperparameters
>> >   for your problem.  Please also feel free to try using the `parfor`
>> >   (parallel for-loop) instead of the while loop to speed up this
>> section.
>> >   Note that this is still a work in progress.  The hyperparameter
>> tuning in
>> >   this section makes use of random search (as opposed to grid search),
>> > which
>> >   has been promoted by Bengio et al. to speed up the search time.
>> >   - The "Train" section trains the convnet and returns the weights and
>> >   biases as SystemML `Matrix` types.  In this section, please replace
>> the
>> >   hyperparameters with the best ones from above, and please increase the
>> >   number of epochs given your time constraints.
>> >   - The "Eval" section evaluates the performance of the trained convnet.
>> >   - Although it is not shown in the notebook yet, to save the weights
>> and
>> >   biases, please use the `toDF()` method on each weight and biases (i.e.
>> >   `Wc1.toDF()`) to convert to a Spark DataFrame, and then simply save
>> the
>> >   DataFrame as desired.
>> >   - Finally, please feel free to extend the model in `convnet.dml` for
>> >   your particular problem!  The LeNet-like model just serves as a simple
>> >   convnet, but there are much richer models currently, such as resnets,
>> > that
>> >   we are experimenting with.  To make larger models such as resnets
>> easier
>> > to
>> >   define, we are also working on other tools for converting model
>> > definitions
>> >   + pretrained weights from other systems into SystemML.
>> >
>> >
>> > Also, please keep in mind that the deep learning support in SystemML is
>> > still a work in progress.  Therefore, if you run into issues, please
>> let us
>> > know and we'll do everything possible to help get things running!
>> >
>> >
>> > Thanks!
>> >
>> > - Mike
>> >
>> >
>> > --
>> >
>> > Michael W. Dusenberry
>> > GitHub: github.com/dusenberrymw
>> > LinkedIn: linkedin.com/in/mikedusenberry
>> >
>> > On Sat, Apr 22, 2017 at 4:49 AM, Aishwarya Chaurasia <
>> > aishwarya2612@gmail.com> wrote:
>> >
>> >> Hey,
>> >>
>> >> Thank you so much for your help sir. We were finally able to run
>> >> preprocess.py without any errors. And the results obtained were
>> >> satisfactory, i.e. we got five sets of data frames like you said we would.
>> >>
>> >> But alas! when we tried to run MachineLearning.ipynb the same NameError
>> >> came : https://paste.fedoraproject.org/paste/l3LFJreg~
>> >> vnYEDTSTQH73l5M1UNdIGYhyRLivL9gydE=
>> >>
>> >> Could you guide us again as to how to proceed now?
>> >> Also, could you please provide an overview of the process
>> >> MachineLearning.ipynb is following to train the samples.
>> >>
>> >> Thanks a lot!
>> >>
>> >>> On 20-Apr-2017 12:16 AM, <du...@gmail.com> wrote:
>> >>>
>> >>> Hi Aishwarya,
>> >>>
>> >>> Looks like you've just encountered an out of memory error on one of
>> the
>> >>> executors.  Therefore, you just need to adjust the
>> >> `spark.executor.memory`
>> >>> and `spark.driver.memory` settings with higher amounts of RAM.  What
>> is
>> >>> your current setup?  I.e. are you using a cluster of machines, or a
>> >> single
>> >>> machine?  We generally use a large driver on one machine, and then a
>> >> single
>> >>> large executor on each other machine.  I would give a sizable amount
>> of
>> >>> memory to the driver, and about half the possible memory on the
>> > executors
>> >>> so that the Python processes have enough memory as well.  PySpark has
>> > JVM
>> >>> and Python components, and the Spark memory settings only pertain to
>> the
>> >>> JVM side, thus the need to save about half the executor memory for the
>> >>> Python side.
>> >>>
>> >>> Thanks!
>> >>>
>> >>> - Mike
>> >>>
>> >>> --
>> >>>
>> >>> Mike Dusenberry
>> >>> GitHub: github.com/dusenberrymw
>> >>> LinkedIn: linkedin.com/in/mikedusenberry
>> >>>
>> >>> Sent from my iPhone.
>> >>>
>> >>>
>> >>>> On Apr 19, 2017, at 5:53 AM, Aishwarya Chaurasia <
>> >>> aishwarya2612@gmail.com> wrote:
>> >>>>
>> >>>> Hello sir,
>> >>>>
>> >>>> We also wanted to ensure that the spark-submit command we're using is
>> >> the
>> >>>> correct one for running 'preprocess.py'.
>> >>>> Command :  /home/new/sparks/bin/spark-submit preprocess.py
>> >>>>
>> >>>>
>> >>>> Thank you.
>> >>>> Aishwarya Chaurasia.
>> >>>>
>> >>>> On 19-Apr-2017 3:55 PM, "Aishwarya Chaurasia" <
>> aishwarya2612@gmail.com
>> >>>
>> >>>> wrote:
>> >>>>
>> >>>> Hello sir,
>> >>>> On running the file preprocess.py we are getting the following error
>> :
>> >>>>
>> >>>> https://paste.fedoraproject.org/paste/IAvqiiyJChSC0V9eeETe2F5M1UNdIG
>> >>>> YhyRLivL9gydE=
>> >>>>
>> >>>> Can you please help us by looking into the error and kindly tell us
>> > the
>> >>>> solution for it.
>> >>>> Thanks a lot.
>> >>>> Aishwarya Chaurasia
>> >>>>
>> >>>>
>> >>>>> On 19-Apr-2017 12:43 AM, <du...@gmail.com> wrote:
>> >>>>>
>> >>>>> Hi Aishwarya,
>> >>>>>
>> >>>>> Certainly, here is some more detailed information
>> >> about`preprocess.py`:
>> >>>>>
>> >>>>> * The preprocessing Python script is located at
>> >>>>> https://github.com/apache/incubator-systemml/blob/master/
>> >>>>> projects/breast_cancer/preprocess.py.  Note that this is different
>> >> than
>> >>>>> the library module at https://github.com/apache/incu
>> >>>>> bator-systemml/blob/master/projects/breast_cancer/breastc
>> >>>>> ancer/preprocessing.py.
>> >>>>> * This script is used to preprocess a set of histology slide images,
>> >>>>> which are `.svs` files in our case, and `.tiff` files in your case.
>> >>>>> * Lines 63-79 contain "settings" such as the output image sizes,
>> >> folder
>> >>>>> paths, etc.  Of particular interest, line 72 has the folder path for
>> >> the
>> >>>>> original slide images that should be commonly accessible from all
>> >>> machines
>> >>>>> being used, and lines 74-79 contain the names of the output
>> > DataFrames
>> >>> that
>> >>>>> will be saved.
>> >>>>> * Line 82 performs the actual preprocessing and creates a Spark
>> >>>>> DataFrame with the following columns: slide number, tumor score,
>> >>> molecular
>> >>>>> score, sample.  The "sample" in this case is the actual small,
>> >>> chopped-up
>> >>>>> section of the image that has been extracted and flattened into a
>> row
>> >>>>> Vector.  For test images without labels (`training=false`), only the
>> >>> slide
>> >>>>> number and sample will be contained in the DataFrame (i.e. no
>> > labels).
>> >>>>> This calls the `preprocess(...)` function located on line 371 of
>> >>>>> https://github.com/apache/incubator-systemml/blob/master/
>> >>>>> projects/breast_cancer/breastcancer/preprocessing.py, which is a
>> >>>>> different file.
>> >>>>> * Line 87 simply saves the above DataFrame to HDFS with the name
>> > from
>> >>>>> line 74.
>> >>>>> * Line 93 splits the above DataFrame row-wise into separate
>> >> "training"
>> >>>>> and "validation" DataFrames, based on the split percentage from line
>> >> 70
>> >>>>> (`train_frac`).  This is performed so that downstream machine
>> > learning
>> >>>>> tasks can learn from the training set, and validate performance and
>> >>>>> hyperparameter choices on the validation set.  These DataFrames will
>> >>> start
>> >>>>> with the same columns as the above DataFrame.  If `add_row_indices`
>> >> from
>> >>>>> line 69 is true, then an additional row index column (`__INDEX`)
>> will
>> >> be
>> >>>>> prepended.  This is useful for SystemML in downstream machine
>> > learning
>> >>>>> tasks as it gives the DataFrame row numbers like a real matrix would
>> >>> have,
>> >>>>> and SystemML is built to operate on matrices.
>> >>>>> * Lines 97 & 98 simply save the training and validation DataFrames
>> >>> using
>> >>>>> the names defined on lines 76 & 78.
>> >>>>> * Lines 103-137 create smaller train and validation DataFrames by
>> >>> taking
>> >>>>> small row-wise samples of the full train and validation DataFrames.
>> >> The
>> >>>>> percentage of the sample is defined on line 111 (`p=0.01` for a 1%
>> >>>>> sample).  This is generally useful for quicker downstream tasks
>> >> without
>> >>>>> having to load in the larger DataFrames, assuming you have a large
>> >>> amount
>> >>>>> of data.  For us, we have ~7TB of data, so having 1% sampled
>> >> DataFrames
>> >>> is
>> >>>>> useful for quicker downstream tests.  Once again, the same columns
>> >> from
>> >>> the
>> >>>>> larger train and validation DataFrames will be used.
>> >>>>> * Lines 146 & 147 simply save these sampled train and validation
>> >>>>> DataFrames.
>> >>>>>
>> >>>>> As a summary, after running `preprocess.py`, you will be left with
>> > the
>> >>>>> following saved DataFrames in HDFS:
>> >>>>> * Full DataFrame
>> >>>>> * Training DataFrame
>> >>>>> * Validation DataFrame
>> >>>>> * Sampled training DataFrame
>> >>>>> * Sampled validation DataFrame
>> >>>>>
>> >>>>> As for visualization, you may visualize a "sample" (i.e. small,
>> >>> chopped-up
>> >>>>> section of original image) from a DataFrame by using the `
>> >>>>> breastcancer.visualization.visualize_sample(...)` function.  You
>> will
>> >>>>> need to do this after creating the DataFrames.  Here is a snippet to
>> >>>>> visualize the first row sample in a DataFrame, where `df` is one of
>> >> the
>> >>>>> DataFrames from above:
>> >>>>>
>> >>>>> ```
>> >>>>> from breastcancer.visualization import visualize_sample
>> >>>>> visualize_sample(df.first().sample)
>> >>>>> ```
>> >>>>>
>> >>>>> Please let me know if you have any additional questions.
>> >>>>>
>> >>>>> Thanks!
>> >>>>>
>> >>>>> - Mike
>> >>>>>
>> >>>>> --
>> >>>>>
>> >>>>> Mike Dusenberry
>> >>>>> GitHub: github.com/dusenberrymw
>> >>>>> LinkedIn: linkedin.com/in/mikedusenberry
>> >>>>>
>> >>>>> Sent from my iPhone.
>> >>>>>
>> >>>>>
>> >>>>>> On Apr 15, 2017, at 4:38 AM, Aishwarya Chaurasia <
>> >>>>> aishwarya2612@gmail.com> wrote:
>> >>>>>>
>> >>>>>> Hello sir,
>> >>>>>> Can you please elaborate more on what output we would be getting
>> >>> because
>> >>>>> we
>> >>>>>> tried executing the preprocess.py file using spark submit it keeps
>> > on
>> >>>>>> adding the tiles in rdd and while running the visualisation.py file
>> >> it
>> >>>>>> isn't showing any output. Can you please help us out asap stating
>> > the
>> >>>>>> output we will be getting and the sequence of execution of files.
>> >>>>>> Thank you.
>> >>>>>>
>> >>>>>>> On 07-Apr-2017 5:54 AM, <du...@gmail.com> wrote:
>> >>>>>>>
>> >>>>>>> Hi Aishwarya,
>> >>>>>>>
>> >>>>>>> Thanks for sharing more info on the issue!
>> >>>>>>>
>> >>>>>>> To facilitate easier usage, I've updated the preprocessing code by
>> >>>>> pulling
>> >>>>>>> out most of the logic into a `breastcancer/preprocessing.py`
>> >> module,
>> >>>>>>> leaving just the execution in the `Preprocessing.ipynb` notebook.
>> >>>>> There is
>> >>>>>>> also a `preprocess.py` script with the same contents as the
>> > notebook
>> >>> for
>> >>>>>>> use with `spark-submit`.  The choice of the notebook or the script
>> >> is
>> >>>>> just
>> >>>>>>> a matter of convenience, as they both import from the same
>> >>>>>>> `breastcancer/preprocessing.py` package.
>> >>>>>>>
>> >>>>>>> As part of the updates, I've added an explicit SparkSession
>> >> parameter
>> >>>>>>> (`spark`) to the `preprocess(...)` function, and updated the body
>> > to
>> >>> use
>> >>>>>>> this SparkSession object rather than the older SparkContext `sc`
>> >>> object.
>> >>>>>>> Previously, the `preprocess(...)` function accessed the `sc`
>> object
>> >>> that
>> >>>>>>> was pulled in from the enclosing scope, which would work while all
>> >> of
>> >>>>> the
>> >>>>>>> code was colocated within the notebook, but not if the code was
>> >>>>> extracted
>> >>>>>>> and imported.  The explicit parameter now allows for the code to
>> be
>> >>>>>>> imported.
>> >>>>>>>
>> >>>>>>> Can you please try again with the latest updates?  We are
>> currently
>> >>>>> using
>> >>>>>>> Spark 2.x with Python 3.  If you use the notebook, the pyspark
>> >> kernel
>> >>>>>>> should have a `spark` object available that can be supplied to the
>> >>>>>>> functions (as is done now in the notebook), and if you use the
>> >>>>>>> `preprocess.py` script with `spark-submit`, the `spark` object
>> will
>> >> be
>> >>>>>>> created explicitly by the script.
>> >>>>>>>
>> >>>>>>> For a bit of context to others, Aishwarya initially reached out to
>> >>> find
>> >>>>>>> out if our breast cancer project could be applied to TIFF images,
>> >>> rather
>> >>>>>>> than the SVS images we are currently using (the answer is "yes" so
>> >>> long
>> >>>>> as
>> >>>>>>> they are "generic tiled TIFF images", according to the OpenSlide
>> >>>>>>> documentation), and then followed up with Spark issues related to
>> >> the
>> >>>>>>> preprocessing code.  This conversation has been promptly moved to
>> >> the
>> >>>>>>> mailing list so that others in the community can benefit.
>> >>>>>>>
>> >>>>>>>
>> >>>>>>> Thanks!
>> >>>>>>>
>> >>>>>>> -Mike
>> >>>>>>>
>> >>>>>>> --
>> >>>>>>>
>> >>>>>>> Mike Dusenberry
>> >>>>>>> GitHub: github.com/dusenberrymw
>> >>>>>>> LinkedIn: linkedin.com/in/mikedusenberry
>> >>>>>>>
>> >>>>>>> Sent from my iPhone.
>> >>>>>>>
>> >>>>>>>
>> >>>>>>>> On Apr 6, 2017, at 5:09 AM, Aishwarya Chaurasia <
>> >>>>> aishwarya2612@gmail.com>
>> >>>>>>> wrote:
>> >>>>>>>>
>> >>>>>>>> Hey,
>> >>>>>>>>
>> >>>>>>>> The object sc is already defined in pyspark and yet this name
>> > error
>> >>>>> keeps
>> >>>>>>>> occurring. We are using spark 2.*
>> >>>>>>>>
>> >>>>>>>> Here is the link to error that we are getting :
>> >>>>>>>> https://paste.fedoraproject.org/paste/
>> >> 89iQODxzpNZVbSfgwocH8l5M1UNdIG
>> >>>>>>> YhyRLivL9gydE=
>> >>>>>>>
>> >>>>>
>> >>>
>> >>
>>
>

Re: Please reply ASAP : Regarding incubator systemml/breast_cancer project

Posted by Aishwarya Chaurasia <ai...@gmail.com>.
Hello sir,

The NameError is occurring again, sir. Why does it keep resurfacing?

Attaching the screenshot of the error.

On 25-Apr-2017 2:50 AM, <du...@gmail.com> wrote:

> Hi Aishwarya,
>
> For the error message, that just means that the SystemML jar isn't being
> found.  Can you add a `--driver-class-path $SYSTEMML_HOME/target/SystemML.jar`
> to the invocation of Jupyter?  I.e. `PYSPARK_PYTHON=python3
> PYSPARK_DRIVER_PYTHON=jupyter PYSPARK_DRIVER_PYTHON_OPTS="notebook"
> pyspark  --jars $SYSTEMML_HOME/target/SystemML.jar --driver-class-path
> $SYSTEMML_HOME/target/SystemML.jar`. There was a PySpark bug that was
> supposed to have been fixed in Spark 2.x, but it's possible that it is
> still an issue.
>
> As for the output, the notebook will create SystemML `Matrix` objects for
> all of the weights and biases of the trained models.  To save, please
> convert each one to a DataFrame, i.e. `Wc1.toDF()` and repeated for each
> matrix, and then simply save the DataFrames.  This could be done all at
> once like this for a SystemML Matrix object `Wc1`:
> `Wc1.toDF().write.save("path/to/save/Wc1.parquet", format="parquet")`.
> Just repeat for each matrix returned by the "Train" code for the
> algorithms.  At that point, you will have a set of saved DataFrames
> representing a trained SystemML model, and these can be used in downstream
> classification tasks in a similar manner to the "Eval" sections.
>
> -Mike
>
> --
>
> Mike Dusenberry
> GitHub: github.com/dusenberrymw
> LinkedIn: linkedin.com/in/mikedusenberry
>
> Sent from my iPhone.
>
>
> > On Apr 24, 2017, at 3:07 AM, Aishwarya Chaurasia <
> aishwarya2612@gmail.com> wrote:
> >
> > Further more :
> > What is the output of MachineLearning.ipynb you're obtaining sir?
> > We are actually nearing our deadline for our problem.
> > Thanks a lot.
> >
> > On 24-Apr-2017 2:58 PM, "Aishwarya Chaurasia" <ai...@gmail.com>
> > wrote:
> >
> > Hello sir,
> >
> > Thanks a lot for replying sir. But unfortunately it did not work.
> Although
> > the NameError did not appear this time but another error came about :
> >
> > https://paste.fedoraproject.org/paste/TUMtSIb88Q73FYekwJmM7V
> > 5M1UNdIGYhyRLivL9gydE=
> >
> > This error was obtained after executing the second block of code of
> > MachineLearning.py in terminal. ( ml = MLContext(sc) )
> >
> > We have installed the bleeding-edge version of systemml only and the
> > installation was done correctly. We are in a fix now. :/
> > Kindly look into the matter asap
> >
> > On 24-Apr-2017 12:15 PM, "Mike Dusenberry" <du...@gmail.com>
> wrote:
> >
> > Hi Aishwarya,
> >
> > Glad to hear that the preprocessing stage was successful!  As for the
> > `MachineLearning.ipynb` notebook, here is a general guide:
> >
> >
> >   - The `MachineLearning.ipynb` notebook essentially (1) loads in the
> >   training and validation DataFrames from the preprocessing step, (2)
> >   converts them to normalized & one-hot encoded SystemML matrices for
> >   consumption by the ML algorithms, and (3) explores training a couple of
> >   models.
> >   - To run, you'll need to start Jupyter in the context of PySpark via
> >   `PYSPARK_PYTHON=python3 PYSPARK_DRIVER_PYTHON=jupyter
> >   PYSPARK_DRIVER_PYTHON_OPTS="notebook" pyspark  --jars
> >   $SYSTEMML_HOME/target/SystemML.jar`.  Note that if you have installed
> >   SystemML with pip from PyPy (`pip3 install systemml`), this will
> install
> >   our 0.13 release, and the `--jars $SYSTEMML_HOME/target/SystemML.jar`
> > will
> >   not be necessary.  If you instead have installed a bleeding-edge
> version
> > of
> >   SystemML locally (git clone locally, maven build, `pip3 install -e
> >   src/main/python` as listed in `projects/breast_cancer/README.md`), the
> >   `--jars $SYSTEMML_HOME/target/SystemML.jar` part *is* necessary.  We
> are
> >   about to release 0.14, and for this project, I *would* recommend using
> a
> >   bleeding edge install.
> >   - Once Jupyter has been started in the context of PySpark, the `sc`
> >   SparkContext object should be available.  Please let me know if you
> >   continue to see this issue.
> >   - The "Read in train & val data" section simply reads in the training
> >   and validation data generated in the preprocessing stage.  Be sure that
> > the
> >   `size` setting is the same as the preprocessing size.  The percentage
> `p`
> >   setting determines whether the full or sampled DataFrames are loaded.
> If
> >   you set `p = 1`, the full DataFrames will be used.  If you instead
> would
> >   prefer to use the smaller sampled DataFrames while getting started,
> > please
> >   set it to the same value as used in the preprocessing to generate the
> >   smaller sampled DataFrames.
> >   - The `Extract X & Y matrices` section splits each of the train and
> >   validation DataFrames into effectively X & Y matrices (still as
> DataFrame
> >   types), with X containing the images, and Y containing the labels.
> >   - The `Convert to SystemML Matrices` section passes the X & Y
> DataFrames
> >   into a SystemML script that performs some normalization of the images &
> >   one-hot encoding of the labels, and then returns SystemML `Matrix`
> types.
> >   These are now ready to be passed into the subsequent algorithms.
> >   - The "Trigger Caching" and "Save Matrices" are experimental features,
> >   and not necessary to execute.
> >   - Next comes the two algorithms being explored in this notebook.  The
> >   "Softmax Classifier" is just a multi-class logistic regression model,
> and
> >   is simply there to serve as a baseline comparison with the subsequent
> >   convolutional neural net model.  You may wish to simply skip this
> softmax
> >   model and move to the latter convnet model further down in the
> notebook.
> >   - The actual softmax model is located at [
> >   https://github.com/apache/incubator-systemml/blob/master/
> > projects/breast_cancer/softmax_clf.dml],
> >   and the notebook calls functions from that file.
> >   - The softmax sanity check just ensures that the model is able to
> >   completely overfit when given a tiny sample size.  This should yield
> > ~100%
> >   training accuracy if the sample size in this section is small enough.
> > This
> >   is just a check to ensure that nothing else is wrong with the math or
> the
> >   data.
> >   - The softmax "Train" section will train a softmax model and return the
> >   weights (`W`) and biases (`b`) of the model as SystemML `Matrix`
> objects.
> >   Please adjust the hyperparameters in this section to your problem.
> >   - The softmax "Eval" section takes the trained weights and biases and
> >   evaluates the training and validation performance.
> >   - The next model is a LeNet-like convnet model.  The actual model is
> >   located at [
> >   https://github.com/apache/incubator-systemml/blob/master/
> > projects/breast_cancer/convnet.dml],
> >   and the notebook simply calls functions from that file.
> >   - Once again, there is an initial sanity check for the ability to
> >   overfit on a small amount of data.
> >   - The "Hyperparameter Search" contains a script to sample different
> >   hyperparams for the convnet, and save the hyperparams + validation
> > accuracy
> >   of each set after a single epoch of training.  These string files will
> be
> >   saved to HDFS.  Please feel free to adjust the range of the
> > hyperparameters
> >   for your problem.  Please also feel free to try using the `parfor`
> >   (parallel for-loop) instead of the while loop to speed up this section.
> >   Note that this is still a work in progress.  The hyperparameter tuning
> in
> >   this section makes use of random search (as opposed to grid search),
> > which
> >   has been promoted by Bengio et al. to speed up the search time.
> >   - The "Train" section trains the convnet and returns the weights and
> >   biases as SystemML `Matrix` types.  In this section, please replace the
> >   hyperparameters with the best ones from above, and please increase the
> >   number of epochs given your time constraints.
> >   - The "Eval" section evaluates the performance of the trained convnet.
> >   - Although it is not shown in the notebook yet, to save the weights and
> >   biases, please use the `toDF()` method on each weight and biases (i.e.
> >   `Wc1.toDF()`) to convert to a Spark DataFrame, and then simply save the
> >   DataFrame as desired.
> >   - Finally, please feel free to extend the model in `convnet.dml` for
> >   your particular problem!  The LeNet-like model just serves as a simple
> >   convnet, but there are much richer models currently, such as resnets,
> > that
> >   we are experimenting with.  To make larger models such as resnets
> easier
> > to
> >   define, we are also working on other tools for converting model
> > definitions
> >   + pretrained weights from other systems into SystemML.
> >
> >
> > Also, please keep in mind that the deep learning support in SystemML is
> > still a work in progress.  Therefore, if you run into issues, please let
> us
> > know and we'll do everything possible to help get things running!
> >
> >
> > Thanks!
> >
> > - Mike
> >
> >
> > --
> >
> > Michael W. Dusenberry
> > GitHub: github.com/dusenberrymw
> > LinkedIn: linkedin.com/in/mikedusenberry
> >
> > On Sat, Apr 22, 2017 at 4:49 AM, Aishwarya Chaurasia <
> > aishwarya2612@gmail.com> wrote:
> >
> >> Hey,
> >>
> >> Thank you so much for your help sir. We were finally able to run
> >> preprocess.py without any errors. And the results obtained were
> >> satisfactory, i.e. we got the five DataFrames like you said we would.
> >>
> >> But alas! when we tried to run MachineLearning.ipynb the same NameError
> >> came : https://paste.fedoraproject.org/paste/l3LFJreg~
> >> vnYEDTSTQH73l5M1UNdIGYhyRLivL9gydE=
> >>
> >> Could you guide us again as to how to proceed now?
> >> Also, could you please provide an overview of the process
> >> MachineLearning.ipynb is following to train the samples.
> >>
> >> Thanks a lot!
> >>
> >>> On 20-Apr-2017 12:16 AM, <du...@gmail.com> wrote:
> >>>
> >>> Hi Aishwarya,
> >>>
> >>> Looks like you've just encountered an out of memory error on one of the
> >>> executors.  Therefore, you just need to adjust the
> >> `spark.executor.memory`
> >>> and `spark.driver.memory` settings with higher amounts of RAM.  What is
> >>> your current setup?  I.e. are you using a cluster of machines, or a
> >> single
> >>> machine?  We generally use a large driver on one machine, and then a
> >> single
> >>> large executor on each other machine.  I would give a sizable amount of
> >>> memory to the driver, and about half the possible memory on the
> > executors
> >>> so that the Python processes have enough memory as well.  PySpark has
> > JVM
> >>> and Python components, and the Spark memory settings only pertain to
> the
> >>> JVM side, thus the need to save about half the executor memory for the
> >>> Python side.
> >>>
> >>> Thanks!
> >>>
> >>> - Mike
> >>>
> >>> --
> >>>
> >>> Mike Dusenberry
> >>> GitHub: github.com/dusenberrymw
> >>> LinkedIn: linkedin.com/in/mikedusenberry
> >>>
> >>> Sent from my iPhone.
> >>>
> >>>
> >>>> On Apr 19, 2017, at 5:53 AM, Aishwarya Chaurasia <
> >>> aishwarya2612@gmail.com> wrote:
> >>>>
> >>>> Hello sir,
> >>>>
> >>>> We also wanted to ensure that the spark-submit command we're using is
> >> the
> >>>> correct one for running 'preprocess.py'.
> >>>> Command :  /home/new/sparks/bin/spark-submit preprocess.py
> >>>>
> >>>>
> >>>> Thank you.
> >>>> Aishwarya Chaurasia.
> >>>>
> >>>> On 19-Apr-2017 3:55 PM, "Aishwarya Chaurasia" <
> aishwarya2612@gmail.com
> >>>
> >>>> wrote:
> >>>>
> >>>> Hello sir,
> >>>> On running the file preprocess.py we are getting the following error :
> >>>>
> >>>> https://paste.fedoraproject.org/paste/IAvqiiyJChSC0V9eeETe2F5M1UNdIG
> >>>> YhyRLivL9gydE=
> >>>>
> >>>> Can you please help us by looking into the error and kindly tell us
> > the
> >>>> solution for it.
> >>>> Thanks a lot.
> >>>> Aishwarya Chaurasia
> >>>>
> >>>>
> >>>>> On 19-Apr-2017 12:43 AM, <du...@gmail.com> wrote:
> >>>>>
> >>>>> Hi Aishwarya,
> >>>>>
> >>>>> Certainly, here is some more detailed information
> >> about`preprocess.py`:
> >>>>>
> >>>>> * The preprocessing Python script is located at
> >>>>> https://github.com/apache/incubator-systemml/blob/master/
> >>>>> projects/breast_cancer/preprocess.py.  Note that this is different
> >> than
> >>>>> the library module at https://github.com/apache/incu
> >>>>> bator-systemml/blob/master/projects/breast_cancer/breastc
> >>>>> ancer/preprocessing.py.
> >>>>> * This script is used to preprocess a set of histology slide images,
> >>>>> which are `.svs` files in our case, and `.tiff` files in your case.
> >>>>> * Lines 63-79 contain "settings" such as the output image sizes,
> >> folder
> >>>>> paths, etc.  Of particular interest, line 72 has the folder path for
> >> the
> >>>>> original slide images that should be commonly accessible from all
> >>> machines
> >>>>> being used, and lines 74-79 contain the names of the output
> > DataFrames
> >>> that
> >>>>> will be saved.
> >>>>> * Line 82 performs the actual preprocessing and creates a Spark
> >>>>> DataFrame with the following columns: slide number, tumor score,
> >>> molecular
> >>>>> score, sample.  The "sample" in this case is the actual small,
> >>> chopped-up
> >>>>> section of the image that has been extracted and flattened into a row
> >>>>> Vector.  For test images without labels (`training=false`), only the
> >>> slide
> >>>>> number and sample will be contained in the DataFrame (i.e. no
> > labels).
> >>>>> This calls the `preprocess(...)` function located on line 371 of
> >>>>> https://github.com/apache/incubator-systemml/blob/master/
> >>>>> projects/breast_cancer/breastcancer/preprocessing.py, which is a
> >>>>> different file.
> >>>>> * Line 87 simply saves the above DataFrame to HDFS with the name
> > from
> >>>>> line 74.
> >>>>> * Line 93 splits the above DataFrame row-wise into separate
> >> "training"
> >>>>> and "validation" DataFrames, based on the split percentage from line
> >> 70
> >>>>> (`train_frac`).  This is performed so that downstream machine
> > learning
> >>>>> tasks can learn from the training set, and validate performance and
> >>>>> hyperparameter choices on the validation set.  These DataFrames will
> >>> start
> >>>>> with the same columns as the above DataFrame.  If `add_row_indices`
> >> from
> >>>>> line 69 is true, then an additional row index column (`__INDEX`) will
> >> be
> >>>>> prepended.  This is useful for SystemML in downstream machine
> > learning
> >>>>> tasks as it gives the DataFrame row numbers like a real matrix would
> >>> have,
> >>>>> and SystemML is built to operate on matrices.
> >>>>> * Lines 97 & 98 simply save the training and validation DataFrames
> >>> using
> >>>>> the names defined on lines 76 & 78.
> >>>>> * Lines 103-137 create smaller train and validation DataFrames by
> >>> taking
> >>>>> small row-wise samples of the full train and validation DataFrames.
> >> The
> >>>>> percentage of the sample is defined on line 111 (`p=0.01` for a 1%
> >>>>> sample).  This is generally useful for quicker downstream tasks
> >> without
> >>>>> having to load in the larger DataFrames, assuming you have a large
> >>> amount
> >>>>> of data.  For us, we have ~7TB of data, so having 1% sampled
> >> DataFrames
> >>> is
> >>>>> useful for quicker downstream tests.  Once again, the same columns
> >> from
> >>> the
> >>>>> larger train and validation DataFrames will be used.
> >>>>> * Lines 146 & 147 simply save these sampled train and validation
> >>>>> DataFrames.
> >>>>>
> >>>>> As a summary, after running `preprocess.py`, you will be left with
> > the
> >>>>> following saved DataFrames in HDFS:
> >>>>> * Full DataFrame
> >>>>> * Training DataFrame
> >>>>> * Validation DataFrame
> >>>>> * Sampled training DataFrame
> >>>>> * Sampled validation DataFrame
> >>>>>
> >>>>> As for visualization, you may visualize a "sample" (i.e. small,
> >>> chopped-up
> >>>>> section of original image) from a DataFrame by using the `
> >>>>> breastcancer.visualization.visualize_sample(...)` function.  You
> will
> >>>>> need to do this after creating the DataFrames.  Here is a snippet to
> >>>>> visualize the first row sample in a DataFrame, where `df` is one of
> >> the
> >>>>> DataFrames from above:
> >>>>>
> >>>>> ```
> >>>>> from breastcancer.visualization import visualize_sample
> >>>>> visualize_sample(df.first().sample)
> >>>>> ```
> >>>>>
> >>>>> Please let me know if you have any additional questions.
> >>>>>
> >>>>> Thanks!
> >>>>>
> >>>>> - Mike
> >>>>>
> >>>>> --
> >>>>>
> >>>>> Mike Dusenberry
> >>>>> GitHub: github.com/dusenberrymw
> >>>>> LinkedIn: linkedin.com/in/mikedusenberry
> >>>>>
> >>>>> Sent from my iPhone.
> >>>>>
> >>>>>
> >>>>>> On Apr 15, 2017, at 4:38 AM, Aishwarya Chaurasia <
> >>>>> aishwarya2612@gmail.com> wrote:
> >>>>>>
> >>>>>> Hello sir,
> >>>>>> Can you please elaborate more on what output we would be getting
> >>> because
> >>>>> we
> >>>>>> tried executing the preprocess.py file using spark submit it keeps
> > on
> >>>>>> adding the tiles in rdd and while running the visualisation.py file
> >> it
> >>>>>> isn't showing any output. Can you please help us out asap stating
> > the
> >>>>>> output we will be getting and the sequence of execution of files.
> >>>>>> Thank you.
> >>>>>>
> >>>>>>> On 07-Apr-2017 5:54 AM, <du...@gmail.com> wrote:
> >>>>>>>
> >>>>>>> Hi Aishwarya,
> >>>>>>>
> >>>>>>> Thanks for sharing more info on the issue!
> >>>>>>>
> >>>>>>> To facilitate easier usage, I've updated the preprocessing code by
> >>>>> pulling
> >>>>>>> out most of the logic into a `breastcancer/preprocessing.py`
> >> module,
> >>>>>>> leaving just the execution in the `Preprocessing.ipynb` notebook.
> >>>>> There is
> >>>>>>> also a `preprocess.py` script with the same contents as the
> > notebook
> >>> for
> >>>>>>> use with `spark-submit`.  The choice of the notebook or the script
> >> is
> >>>>> just
> >>>>>>> a matter of convenience, as they both import from the same
> >>>>>>> `breastcancer/preprocessing.py` package.
> >>>>>>>
> >>>>>>> As part of the updates, I've added an explicit SparkSession
> >> parameter
> >>>>>>> (`spark`) to the `preprocess(...)` function, and updated the body
> > to
> >>> use
> >>>>>>> this SparkSession object rather than the older SparkContext `sc`
> >>> object.
> >>>>>>> Previously, the `preprocess(...)` function accessed the `sc` object
> >>> that
> >>>>>>> was pulled in from the enclosing scope, which would work while all
> >> of
> >>>>> the
> >>>>>>> code was colocated within the notebook, but not if the code was
> >>>>> extracted
> >>>>>>> and imported.  The explicit parameter now allows for the code to be
> >>>>>>> imported.
> >>>>>>>
> >>>>>>> Can you please try again with the latest updates?  We are currently
> >>>>> using
> >>>>>>> Spark 2.x with Python 3.  If you use the notebook, the pyspark
> >> kernel
> >>>>>>> should have a `spark` object available that can be supplied to the
> >>>>>>> functions (as is done now in the notebook), and if you use the
> >>>>>>> `preprocess.py` script with `spark-submit`, the `spark` object will
> >> be
> >>>>>>> created explicitly by the script.
> >>>>>>>
> >>>>>>> For a bit of context to others, Aishwarya initially reached out to
> >>> find
> >>>>>>> out if our breast cancer project could be applied to TIFF images,
> >>> rather
> >>>>>>> than the SVS images we are currently using (the answer is "yes" so
> >>> long
> >>>>> as
> >>>>>>> they are "generic tiled TIFF images", according to the OpenSlide
> >>>>>>> documentation), and then followed up with Spark issues related to
> >> the
> >>>>>>> preprocessing code.  This conversation has been promptly moved to
> >> the
> >>>>>>> mailing list so that others in the community can benefit.
> >>>>>>>
> >>>>>>>
> >>>>>>> Thanks!
> >>>>>>>
> >>>>>>> -Mike
> >>>>>>>
> >>>>>>> --
> >>>>>>>
> >>>>>>> Mike Dusenberry
> >>>>>>> GitHub: github.com/dusenberrymw
> >>>>>>> LinkedIn: linkedin.com/in/mikedusenberry
> >>>>>>>
> >>>>>>> Sent from my iPhone.
> >>>>>>>
> >>>>>>>
> >>>>>>>> On Apr 6, 2017, at 5:09 AM, Aishwarya Chaurasia <
> >>>>> aishwarya2612@gmail.com>
> >>>>>>> wrote:
> >>>>>>>>
> >>>>>>>> Hey,
> >>>>>>>>
> >>>>>>>> The object sc is already defined in pyspark and yet this name
> > error
> >>>>> keeps
> >>>>>>>> occurring. We are using spark 2.*
> >>>>>>>>
> >>>>>>>> Here is the link to error that we are getting :
> >>>>>>>> https://paste.fedoraproject.org/paste/
> >> 89iQODxzpNZVbSfgwocH8l5M1UNdIG
> >>>>>>> YhyRLivL9gydE=
> >>>>>>>
> >>>>>
> >>>
> >>
>

Re: Please reply ASAP : Regarding incubator systemml/breast_cancer project

Posted by du...@gmail.com.
Hi Aishwarya,

For the error message, that just means that the SystemML jar isn't being found.  Can you add a `--driver-class-path $SYSTEMML_HOME/target/SystemML.jar` to the invocation of Jupyter?  I.e. `PYSPARK_PYTHON=python3 PYSPARK_DRIVER_PYTHON=jupyter PYSPARK_DRIVER_PYTHON_OPTS="notebook" pyspark  --jars $SYSTEMML_HOME/target/SystemML.jar --driver-class-path $SYSTEMML_HOME/target/SystemML.jar`. There was a PySpark bug that was supposed to have been fixed in Spark 2.x, but it's possible that it is still an issue.
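
For a quick sanity check that the jar is actually visible on both the driver and executor classpaths, something like the following can be run in the first notebook cell (a minimal sketch; it assumes the bleeding-edge `systemml` Python package has been installed as described in `projects/breast_cancer/README.md`):

```
# Minimal check that SystemML can be reached from the pyspark kernel.
from systemml import MLContext, dml

ml = MLContext(sc)  # `sc` is provided by the pyspark kernel
ml.execute(dml("print('SystemML is on the classpath')"))
```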

As for the output, the notebook will create SystemML `Matrix` objects for all of the weights and biases of the trained models.  To save, please convert each one to a DataFrame, i.e. `Wc1.toDF()`, repeated for each matrix, and then simply save the DataFrames.  This could be done all at once like this for a SystemML Matrix object `Wc1`: `Wc1.toDF().write.save("path/to/save/Wc1.parquet", format="parquet")`.  Just repeat for each matrix returned by the "Train" code for the algorithms.  At that point, you will have a set of saved DataFrames representing a trained SystemML model, and these can be used in downstream classification tasks in a similar manner to the "Eval" sections.
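
As a rough sketch of that save-and-reload flow (the matrix names and HDFS paths below are only placeholders for whatever your "Train" section actually returns):

```
# Sketch: persist trained SystemML Matrix objects as Parquet files, then read
# them back later for "Eval"-style use.  Names and paths are placeholders.
weights = {"Wc1": Wc1, "bc1": bc1, "Wc2": Wc2, "bc2": bc2}
for name, matrix in weights.items():
    matrix.toDF().write.save("breastcancer/{}.parquet".format(name),
                             format="parquet")

# In a later session:
Wc1_df = spark.read.load("breastcancer/Wc1.parquet")
```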

-Mike

--

Mike Dusenberry
GitHub: github.com/dusenberrymw
LinkedIn: linkedin.com/in/mikedusenberry

Sent from my iPhone.


> On Apr 24, 2017, at 3:07 AM, Aishwarya Chaurasia <ai...@gmail.com> wrote:
> 
> Further more :
> What is the output of MachineLearning.ipynb you're obtaining sir?
> We are actually nearing our deadline for our problem.
> Thanks a lot.
> 
> On 24-Apr-2017 2:58 PM, "Aishwarya Chaurasia" <ai...@gmail.com>
> wrote:
> 
> Hello sir,
> 
> Thanks a lot for replying sir. But unfortunately it did not work. Although
> the NameError did not appear this time but another error came about :
> 
> https://paste.fedoraproject.org/paste/TUMtSIb88Q73FYekwJmM7V
> 5M1UNdIGYhyRLivL9gydE=
> 
> This error was obtained after executing the second block of code of
> MachineLearning.py in terminal. ( ml = MLContext(sc) )
> 
> We have installed the bleeding-edge version of systemml only and the
> installation was done correctly. We are in a fix now. :/
> Kindly look into the matter asap
> 
> On 24-Apr-2017 12:15 PM, "Mike Dusenberry" <du...@gmail.com> wrote:
> 
> Hi Aishwarya,
> 
> Glad to hear that the preprocessing stage was successful!  As for the
> `MachineLearning.ipynb` notebook, here is a general guide:
> 
> 
>   - The `MachineLearning.ipynb` notebook essentially (1) loads in the
>   training and validation DataFrames from the preprocessing step, (2)
>   converts them to normalized & one-hot encoded SystemML matrices for
>   consumption by the ML algorithms, and (3) explores training a couple of
>   models.
>   - To run, you'll need to start Jupyter in the context of PySpark via
>   `PYSPARK_PYTHON=python3 PYSPARK_DRIVER_PYTHON=jupyter
>   PYSPARK_DRIVER_PYTHON_OPTS="notebook" pyspark  --jars
>   $SYSTEMML_HOME/target/SystemML.jar`.  Note that if you have installed
>   SystemML with pip from PyPy (`pip3 install systemml`), this will install
>   our 0.13 release, and the `--jars $SYSTEMML_HOME/target/SystemML.jar`
> will
>   not be necessary.  If you instead have installed a bleeding-edge version
> of
>   SystemML locally (git clone locally, maven build, `pip3 install -e
>   src/main/python` as listed in `projects/breast_cancer/README.md`), the
>   `--jars $SYSTEMML_HOME/target/SystemML.jar` part *is* necessary.  We are
>   about to release 0.14, and for this project, I *would* recommend using a
>   bleeding edge install.
>   - Once Jupyter has been started in the context of PySpark, the `sc`
>   SparkContext object should be available.  Please let me know if you
>   continue to see this issue.
>   - The "Read in train & val data" section simply reads in the training
>   and validation data generated in the preprocessing stage.  Be sure that
> the
>   `size` setting is the same as the preprocessing size.  The percentage `p`
>   setting determines whether the full or sampled DataFrames are loaded.  If
>   you set `p = 1`, the full DataFrames will be used.  If you instead would
>   prefer to use the smaller sampled DataFrames while getting started,
> please
>   set it to the same value as used in the preprocessing to generate the
>   smaller sampled DataFrames.
>   - The `Extract X & Y matrices` section splits each of the train and
>   validation DataFrames into effectively X & Y matrices (still as DataFrame
>   types), with X containing the images, and Y containing the labels.
>   - The `Convert to SystemML Matrices` section passes the X & Y DataFrames
>   into a SystemML script that performs some normalization of the images &
>   one-hot encoding of the labels, and then returns SystemML `Matrix` types.
>   These are now ready to be passed into the subsequent algorithms.
>   - The "Trigger Caching" and "Save Matrices" are experimental features,
>   and not necessary to execute.
>   - Next comes the two algorithms being explored in this notebook.  The
>   "Softmax Classifier" is just a multi-class logistic regression model, and
>   is simply there to serve as a baseline comparison with the subsequent
>   convolutional neural net model.  You may wish to simply skip this softmax
>   model and move to the latter convnet model further down in the notebook.
>   - The actual softmax model is located at [
>   https://github.com/apache/incubator-systemml/blob/master/
> projects/breast_cancer/softmax_clf.dml],
>   and the notebook calls functions from that file.
>   - The softmax sanity check just ensures that the model is able to
>   completely overfit when given a tiny sample size.  This should yield
> ~100%
>   training accuracy if the sample size in this section is small enough.
> This
>   is just a check to ensure that nothing else is wrong with the math or the
>   data.
>   - The softmax "Train" section will train a softmax model and return the
>   weights (`W`) and biases (`b`) of the model as SystemML `Matrix` objects.
>   Please adjust the hyperparameters in this section to your problem.
>   - The softmax "Eval" section takes the trained weights and biases and
>   evaluates the training and validation performance.
>   - The next model is a LeNet-like convnet model.  The actual model is
>   located at [
>   https://github.com/apache/incubator-systemml/blob/master/
> projects/breast_cancer/convnet.dml],
>   and the notebook simply calls functions from that file.
>   - Once again, there is an initial sanity check for the ability to
>   overfit on a small amount of data.
>   - The "Hyperparameter Search" contains a script to sample different
>   hyperparams for the convnet, and save the hyperparams + validation
> accuracy
>   of each set after a single epoch of training.  These string files will be
>   saved to HDFS.  Please feel free to adjust the range of the
> hyperparameters
>   for your problem.  Please also feel free to try using the `parfor`
>   (parallel for-loop) instead of the while loop to speed up this section.
>   Note that this is still a work in progress.  The hyperparameter tuning in
>   this section makes use of random search (as opposed to grid search),
> which
>   has been promoted by Bengio et al. to speed up the search time.
>   - The "Train" section trains the convnet and returns the weights and
>   biases as SystemML `Matrix` types.  In this section, please replace the
>   hyperparameters with the best ones from above, and please increase the
>   number of epochs given your time constraints.
>   - The "Eval" section evaluates the performance of the trained convnet.
>   - Although it is not shown in the notebook yet, to save the weights and
>   biases, please use the `toDF()` method on each weight and biases (i.e.
>   `Wc1.toDF()`) to convert to a Spark DataFrame, and then simply save the
>   DataFrame as desired.
>   - Finally, please feel free to extend the model in `convnet.dml` for
>   your particular problem!  The LeNet-like model just serves as a simple
>   convnet, but there are much richer models currently, such as resnets,
> that
>   we are experimenting with.  To make larger models such as resnets easier
> to
>   define, we are also working on other tools for converting model
> definitions
>   + pretrained weights from other systems into SystemML.
> 
> 
> Also, please keep in mind that the deep learning support in SystemML is
> still a work in progress.  Therefore, if you run into issues, please let us
> know and we'll do everything possible to help get things running!
> 
> 
> Thanks!
> 
> - Mike
> 
> 
> --
> 
> Michael W. Dusenberry
> GitHub: github.com/dusenberrymw
> LinkedIn: linkedin.com/in/mikedusenberry
> 
> On Sat, Apr 22, 2017 at 4:49 AM, Aishwarya Chaurasia <
> aishwarya2612@gmail.com> wrote:
> 
>> Hey,
>> 
>> Thank you so much for your help sir. We were finally able to run
>> preprocess.py without any errors. And the results obtained were
>> satisfactory, i.e. we got the five DataFrames like you said we would.
>> 
>> But alas! when we tried to run MachineLearning.ipynb the same NameError
>> came : https://paste.fedoraproject.org/paste/l3LFJreg~
>> vnYEDTSTQH73l5M1UNdIGYhyRLivL9gydE=
>> 
>> Could you guide us again as to how to proceed now?
>> Also, could you please provide an overview of the process
>> MachineLearning.ipynb is following to train the samples.
>> 
>> Thanks a lot!
>> 
>>> On 20-Apr-2017 12:16 AM, <du...@gmail.com> wrote:
>>> 
>>> Hi Aishwarya,
>>> 
>>> Looks like you've just encountered an out of memory error on one of the
>>> executors.  Therefore, you just need to adjust the
>> `spark.executor.memory`
>>> and `spark.driver.memory` settings with higher amounts of RAM.  What is
>>> your current setup?  I.e. are you using a cluster of machines, or a
>> single
>>> machine?  We generally use a large driver on one machine, and then a
>> single
>>> large executor on each other machine.  I would give a sizable amount of
>>> memory to the driver, and about half the possible memory on the
> executors
>>> so that the Python processes have enough memory as well.  PySpark has
> JVM
>>> and Python components, and the Spark memory settings only pertain to the
>>> JVM side, thus the need to save about half the executor memory for the
>>> Python side.
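>>> 
>>> For example, with `spark-submit` those settings can be passed straight on the
>>> command line (the sizes here are only placeholders to adjust to your
>>> machines): `/home/new/sparks/bin/spark-submit --driver-memory 20g
>>> --executor-memory 50g preprocess.py`, which sets `spark.driver.memory` and
>>> `spark.executor.memory` respectively.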
>>> 
>>> Thanks!
>>> 
>>> - Mike
>>> 
>>> --
>>> 
>>> Mike Dusenberry
>>> GitHub: github.com/dusenberrymw
>>> LinkedIn: linkedin.com/in/mikedusenberry
>>> 
>>> Sent from my iPhone.
>>> 
>>> 
>>>> On Apr 19, 2017, at 5:53 AM, Aishwarya Chaurasia <
>>> aishwarya2612@gmail.com> wrote:
>>>> 
>>>> Hello sir,
>>>> 
>>>> We also wanted to ensure that the spark-submit command we're using is
>> the
>>>> correct one for running 'preprocess.py'.
>>>> Command :  /home/new/sparks/bin/spark-submit preprocess.py
>>>> 
>>>> 
>>>> Thank you.
>>>> Aishwarya Chaurasia.
>>>> 
>>>> On 19-Apr-2017 3:55 PM, "Aishwarya Chaurasia" <aishwarya2612@gmail.com
>>> 
>>>> wrote:
>>>> 
>>>> Hello sir,
>>>> On running the file preprocess.py we are getting the following error :
>>>> 
>>>> https://paste.fedoraproject.org/paste/IAvqiiyJChSC0V9eeETe2F5M1UNdIG
>>>> YhyRLivL9gydE=
>>>> 
>>>> Can you please help us by looking into the error and kindly tell us
> the
>>>> solution for it.
>>>> Thanks a lot.
>>>> Aishwarya Chaurasia
>>>> 
>>>> 
>>>>> On 19-Apr-2017 12:43 AM, <du...@gmail.com> wrote:
>>>>> 
>>>>> Hi Aishwarya,
>>>>> 
>>>>> Certainly, here is some more detailed information
>> about`preprocess.py`:
>>>>> 
>>>>> * The preprocessing Python script is located at
>>>>> https://github.com/apache/incubator-systemml/blob/master/
>>>>> projects/breast_cancer/preprocess.py.  Note that this is different
>> than
>>>>> the library module at https://github.com/apache/incu
>>>>> bator-systemml/blob/master/projects/breast_cancer/breastc
>>>>> ancer/preprocessing.py.
>>>>> * This script is used to preprocess a set of histology slide images,
>>>>> which are `.svs` files in our case, and `.tiff` files in your case.
>>>>> * Lines 63-79 contain "settings" such as the output image sizes,
>> folder
>>>>> paths, etc.  Of particular interest, line 72 has the folder path for
>> the
>>>>> original slide images that should be commonly accessible from all
>>> machines
>>>>> being used, and lines 74-79 contain the names of the output
> DataFrames
>>> that
>>>>> will be saved.
>>>>> * Line 82 performs the actual preprocessing and creates a Spark
>>>>> DataFrame with the following columns: slide number, tumor score,
>>> molecular
>>>>> score, sample.  The "sample" in this case is the actual small,
>>> chopped-up
>>>>> section of the image that has been extracted and flattened into a row
>>>>> Vector.  For test images without labels (`training=false`), only the
>>> slide
>>>>> number and sample will be contained in the DataFrame (i.e. no
> labels).
>>>>> This calls the `preprocess(...)` function located on line 371 of
>>>>> https://github.com/apache/incubator-systemml/blob/master/
>>>>> projects/breast_cancer/breastcancer/preprocessing.py, which is a
>>>>> different file.
>>>>> * Line 87 simply saves the above DataFrame to HDFS with the name
> from
>>>>> line 74.
>>>>> * Line 93 splits the above DataFrame row-wise into separate
>> "training"
>>>>> and "validation" DataFrames, based on the split percentage from line
>> 70
>>>>> (`train_frac`).  This is performed so that downstream machine
> learning
>>>>> tasks can learn from the training set, and validate performance and
>>>>> hyperparameter choices on the validation set.  These DataFrames will
>>> start
>>>>> with the same columns as the above DataFrame.  If `add_row_indices`
>> from
>>>>> line 69 is true, then an additional row index column (`__INDEX`) will
>> be
>>>>> prepended.  This is useful for SystemML in downstream machine
> learning
>>>>> tasks as it gives the DataFrame row numbers like a real matrix would
>>> have,
>>>>> and SystemML is built to operate on matrices.
>>>>> * Lines 97 & 98 simply save the training and validation DataFrames
>>> using
>>>>> the names defined on lines 76 & 78.
>>>>> * Lines 103-137 create smaller train and validation DataFrames by
>>> taking
>>>>> small row-wise samples of the full train and validation DataFrames.
>> The
>>>>> percentage of the sample is defined on line 111 (`p=0.01` for a 1%
>>>>> sample).  This is generally useful for quicker downstream tasks
>> without
>>>>> having to load in the larger DataFrames, assuming you have a large
>>> amount
>>>>> of data.  For us, we have ~7TB of data, so having 1% sampled
>> DataFrames
>>> is
>>>>> useful for quicker downstream tests.  Once again, the same columns
>> from
>>> the
>>>>> larger train and validation DataFrames will be used.
>>>>> * Lines 146 & 147 simply save these sampled train and validation
>>>>> DataFrames.
>>>>> 
>>>>> As a summary, after running `preprocess.py`, you will be left with
> the
>>>>> following saved DataFrames in HDFS:
>>>>> * Full DataFrame
>>>>> * Training DataFrame
>>>>> * Validation DataFrame
>>>>> * Sampled training DataFrame
>>>>> * Sampled validation DataFrame
>>>>> 
>>>>> As for visualization, you may visualize a "sample" (i.e. small,
>>> chopped-up
>>>>> section of original image) from a DataFrame by using the `
>>>>> breastcancer.visualization.visualize_sample(...)` function.  You will
>>>>> need to do this after creating the DataFrames.  Here is a snippet to
>>>>> visualize the first row sample in a DataFrame, where `df` is one of
>> the
>>>>> DataFrames from above:
>>>>> 
>>>>> ```
>>>>> from breastcancer.visualization import visualize_sample
>>>>> visualize_sample(df.first().sample)
>>>>> ```
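>>>>> 
>>>>> If the DataFrames were saved in an earlier session, a minimal sketch for
>>>>> reading one back before visualizing (the path below is just a placeholder
>>>>> for whichever name was configured on lines 74-79):
>>>>> 
>>>>> ```
>>>>> df = spark.read.load("train.parquet")  # placeholder name
>>>>> visualize_sample(df.first().sample)
>>>>> ```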
>>>>> 
>>>>> Please let me know if you have any additional questions.
>>>>> 
>>>>> Thanks!
>>>>> 
>>>>> - Mike
>>>>> 
>>>>> --
>>>>> 
>>>>> Mike Dusenberry
>>>>> GitHub: github.com/dusenberrymw
>>>>> LinkedIn: linkedin.com/in/mikedusenberry
>>>>> 
>>>>> Sent from my iPhone.
>>>>> 
>>>>> 
>>>>>> On Apr 15, 2017, at 4:38 AM, Aishwarya Chaurasia <
>>>>> aishwarya2612@gmail.com> wrote:
>>>>>> 
>>>>>> Hello sir,
>>>>>> Can you please elaborate more on what output we would be getting
>>> because
>>>>> we
>>>>>> tried executing the preprocess.py file using spark submit it keeps
> on
>>>>>> adding the tiles in rdd and while running the visualisation.py file
>> it
>>>>>> isn't showing any output. Can you please help us out asap stating
> the
>>>>>> output we will be getting and the sequence of execution of files.
>>>>>> Thank you.
>>>>>> 
>>>>>>> On 07-Apr-2017 5:54 AM, <du...@gmail.com> wrote:
>>>>>>> 
>>>>>>> Hi Aishwarya,
>>>>>>> 
>>>>>>> Thanks for sharing more info on the issue!
>>>>>>> 
>>>>>>> To facilitate easier usage, I've updated the preprocessing code by
>>>>> pulling
>>>>>>> out most of the logic into a `breastcancer/preprocessing.py`
>> module,
>>>>>>> leaving just the execution in the `Preprocessing.ipynb` notebook.
>>>>> There is
>>>>>>> also a `preprocess.py` script with the same contents as the
> notebook
>>> for
>>>>>>> use with `spark-submit`.  The choice of the notebook or the script
>> is
>>>>> just
>>>>>>> a matter of convenience, as they both import from the same
>>>>>>> `breastcancer/preprocessing.py` package.
>>>>>>> 
>>>>>>> As part of the updates, I've added an explicit SparkSession
>> parameter
>>>>>>> (`spark`) to the `preprocess(...)` function, and updated the body
> to
>>> use
>>>>>>> this SparkSession object rather than the older SparkContext `sc`
>>> object.
>>>>>>> Previously, the `preprocess(...)` function accessed the `sc` object
>>> that
>>>>>>> was pulled in from the enclosing scope, which would work while all
>> of
>>>>> the
>>>>>>> code was colocated within the notebook, but not if the code was
>>>>> extracted
>>>>>>> and imported.  The explicit parameter now allows for the code to be
>>>>>>> imported.
>>>>>>> 
>>>>>>> Can you please try again with the latest updates?  We are currently
>>>>> using
>>>>>>> Spark 2.x with Python 3.  If you use the notebook, the pyspark
>> kernel
>>>>>>> should have a `spark` object available that can be supplied to the
>>>>>>> functions (as is done now in the notebook), and if you use the
>>>>>>> `preprocess.py` script with `spark-submit`, the `spark` object will
>> be
>>>>>>> created explicitly by the script.
>>>>>>> 
>>>>>>> For a bit of context to others, Aishwarya initially reached out to
>>> find
>>>>>>> out if our breast cancer project could be applied to TIFF images,
>>> rather
>>>>>>> than the SVS images we are currently using (the answer is "yes" so
>>> long
>>>>> as
>>>>>>> they are "generic tiled TIFF images", according to the OpenSlide
>>>>>>> documentation), and then followed up with Spark issues related to
>> the
>>>>>>> preprocessing code.  This conversation has been promptly moved to
>> the
>>>>>>> mailing list so that others in the community can benefit.
>>>>>>> 
>>>>>>> 
>>>>>>> Thanks!
>>>>>>> 
>>>>>>> -Mike
>>>>>>> 
>>>>>>> --
>>>>>>> 
>>>>>>> Mike Dusenberry
>>>>>>> GitHub: github.com/dusenberrymw
>>>>>>> LinkedIn: linkedin.com/in/mikedusenberry
>>>>>>> 
>>>>>>> Sent from my iPhone.
>>>>>>> 
>>>>>>> 
>>>>>>>> On Apr 6, 2017, at 5:09 AM, Aishwarya Chaurasia <
>>>>> aishwarya2612@gmail.com>
>>>>>>> wrote:
>>>>>>>> 
>>>>>>>> Hey,
>>>>>>>> 
>>>>>>>> The object sc is already defined in pyspark and yet this name
> error
>>>>> keeps
>>>>>>>> occurring. We are using spark 2.*
>>>>>>>> 
>>>>>>>> Here is the link to error that we are getting :
>>>>>>>> https://paste.fedoraproject.org/paste/
>> 89iQODxzpNZVbSfgwocH8l5M1UNdIG
>>>>>>> YhyRLivL9gydE=
>>>>>>> 
>>>>> 
>>> 
>> 

Re: Please reply ASAP : Regarding incubator systemml/breast_cancer project

Posted by Aishwarya Chaurasia <ai...@gmail.com>.
Further more :
What is the output of MachineLearning.ipynb you're obtaining sir?
We are actually nearing our deadline for our problem.
Thanks a lot.

On 24-Apr-2017 2:58 PM, "Aishwarya Chaurasia" <ai...@gmail.com>
wrote:

Hello sir,

Thanks a lot for replying sir. But unfortunately it did not work. Although
the NameError did not appear this time but another error came about :

https://paste.fedoraproject.org/paste/TUMtSIb88Q73FYekwJmM7V
5M1UNdIGYhyRLivL9gydE=

This error was obtained after executing the second block of code of
MachineLearning.py in terminal. ( ml = MLContext(sc) )

We have installed the bleeding-edge version of systemml only and the
installation was done correctly. We are in a fix now. :/
Kindly look into the matter asap

On 24-Apr-2017 12:15 PM, "Mike Dusenberry" <du...@gmail.com> wrote:

Hi Aishwarya,

Glad to hear that the preprocessing stage was successful!  As for the
`MachineLearning.ipynb` notebook, here is a general guide:


   - The `MachineLearning.ipynb` notebook essentially (1) loads in the
   training and validation DataFrames from the preprocessing step, (2)
   converts them to normalized & one-hot encoded SystemML matrices for
   consumption by the ML algorithms, and (3) explores training a couple of
   models.
   - To run, you'll need to start Jupyter in the context of PySpark via
   `PYSPARK_PYTHON=python3 PYSPARK_DRIVER_PYTHON=jupyter
   PYSPARK_DRIVER_PYTHON_OPTS="notebook" pyspark  --jars
   $SYSTEMML_HOME/target/SystemML.jar`.  Note that if you have installed
   SystemML with pip from PyPy (`pip3 install systemml`), this will install
   our 0.13 release, and the `--jars $SYSTEMML_HOME/target/SystemML.jar`
will
   not be necessary.  If you instead have installed a bleeding-edge version
of
   SystemML locally (git clone locally, maven build, `pip3 install -e
   src/main/python` as listed in `projects/breast_cancer/README.md`), the
   `--jars $SYSTEMML_HOME/target/SystemML.jar` part *is* necessary.  We are
   about to release 0.14, and for this project, I *would* recommend using a
   bleeding edge install.
   - Once Jupyter has been started in the context of PySpark, the `sc`
   SparkContext object should be available.  Please let me know if you
   continue to see this issue.
   - The "Read in train & val data" section simply reads in the training
   and validation data generated in the preprocessing stage.  Be sure that
the
   `size` setting is the same as the preprocessing size.  The percentage `p`
   setting determines whether the full or sampled DataFrames are loaded.  If
   you set `p = 1`, the full DataFrames will be used.  If you instead would
   prefer to use the smaller sampled DataFrames while getting started,
please
   set it to the same value as used in the preprocessing to generate the
   smaller sampled DataFrames.
   - The `Extract X & Y matrices` section splits each of the train and
   validation DataFrames into effectively X & Y matrices (still as DataFrame
   types), with X containing the images, and Y containing the labels.
   - The `Convert to SystemML Matrices` section passes the X & Y DataFrames
   into a SystemML script that performs some normalization of the images &
   one-hot encoding of the labels, and then returns SystemML `Matrix` types.
   These are now ready to be passed into the subsequent algorithms.
   - The "Trigger Caching" and "Save Matrices" are experimental features,
   and not necessary to execute.
   - Next comes the two algorithms being explored in this notebook.  The
   "Softmax Classifier" is just a multi-class logistic regression model, and
   is simply there to serve as a baseline comparison with the subsequent
   convolutional neural net model.  You may wish to simply skip this softmax
   model and move to the latter convnet model further down in the notebook.
   - The actual softmax model is located at [
   https://github.com/apache/incubator-systemml/blob/master/
projects/breast_cancer/softmax_clf.dml],
   and the notebook calls functions from that file.
   - The softmax sanity check just ensures that the model is able to
   completely overfit when given a tiny sample size.  This should yield
~100%
   training accuracy if the sample size in this section is small enough.
This
   is just a check to ensure that nothing else is wrong with the math or the
   data.
   - The softmax "Train" section will train a softmax model and return the
   weights (`W`) and biases (`b`) of the model as SystemML `Matrix` objects.
   Please adjust the hyperparameters in this section to your problem.
   - The softmax "Eval" section takes the trained weights and biases and
   evaluates the training and validation performance.
   - The next model is a LeNet-like convnet model.  The actual model is
   located at [
   https://github.com/apache/incubator-systemml/blob/master/
projects/breast_cancer/convnet.dml],
   and the notebook simply calls functions from that file.
   - Once again, there is an initial sanity check for the ability to
   overfit on a small amount of data.
   - The "Hyperparameter Search" contains a script to sample different
   hyperparams for the convnet, and save the hyperparams + validation
accuracy
   of each set after a single epoch of training.  These string files will be
   saved to HDFS.  Please feel free to adjust the range of the
hyperparameters
   for your problem.  Please also feel free to try using the `parfor`
   (parallel for-loop) instead of the while loop to speed up this section.
   Note that this is still a work in progress.  The hyperparameter tuning in
   this section makes use of random search (as opposed to grid search),
which
   has been promoted by Bengio et al. to speed up the search time.
   - The "Train" section trains the convnet and returns the weights and
   biases as SystemML `Matrix` types.  In this section, please replace the
   hyperparameters with the best ones from above, and please increase the
   number of epochs given your time constraints.
   - The "Eval" section evaluates the performance of the trained convnet.
   - Although it is not shown in the notebook yet, to save the weights and
   biases, please use the `toDF()` method on each weight and biases (i.e.
   `Wc1.toDF()`) to convert to a Spark DataFrame, and then simply save the
   DataFrame as desired.
   - Finally, please feel free to extend the model in `convnet.dml` for
   your particular problem!  The LeNet-like model just serves as a simple
   convnet, but there are much richer models currently, such as resnets,
that
   we are experimenting with.  To make larger models such as resnets easier
to
   define, we are also working on other tools for converting model
definitions
   + pretrained weights from other systems into SystemML.


Also, please keep in mind that the deep learning support in SystemML is
still a work in progress.  Therefore, if you run into issues, please let us
know and we'll do everything possible to help get things running!


Thanks!

- Mike


--

Michael W. Dusenberry
GitHub: github.com/dusenberrymw
LinkedIn: linkedin.com/in/mikedusenberry

On Sat, Apr 22, 2017 at 4:49 AM, Aishwarya Chaurasia <
aishwarya2612@gmail.com> wrote:

> Hey,
>
> Thank you so much for your help sir. We were finally able to run
> preprocess.py without any errors. And the results obtained were
> satisfactory, i.e. we got the five DataFrames like you said we would.
>
> But alas! when we tried to run MachineLearning.ipynb the same NameError
> came : https://paste.fedoraproject.org/paste/l3LFJreg~
> vnYEDTSTQH73l5M1UNdIGYhyRLivL9gydE=
>
> Could you guide us again as to how to proceed now?
> Also, could you please provide an overview of the process
> MachineLearning.ipynb is following to train the samples.
>
> Thanks a lot!
>
> On 20-Apr-2017 12:16 AM, <du...@gmail.com> wrote:
>
> > Hi Aishwarya,
> >
> > Looks like you've just encountered an out of memory error on one of the
> > executors.  Therefore, you just need to adjust the
> `spark.executor.memory`
> > and `spark.driver.memory` settings with higher amounts of RAM.  What is
> > your current setup?  I.e. are you using a cluster of machines, or a
> single
> > machine?  We generally use a large driver on one machine, and then a
> single
> > large executor on each other machine.  I would give a sizable amount of
> > memory to the driver, and about half the possible memory on the
executors
> > so that the Python processes have enough memory as well.  PySpark has
JVM
> > and Python components, and the Spark memory settings only pertain to the
> > JVM side, thus the need to save about half the executor memory for the
> > Python side.
> >
> > Thanks!
> >
> > - Mike
> >
> > --
> >
> > Mike Dusenberry
> > GitHub: github.com/dusenberrymw
> > LinkedIn: linkedin.com/in/mikedusenberry
> >
> > Sent from my iPhone.
> >
> >
> > > On Apr 19, 2017, at 5:53 AM, Aishwarya Chaurasia <
> > aishwarya2612@gmail.com> wrote:
> > >
> > > Hello sir,
> > >
> > > We also wanted to ensure that the spark-submit command we're using is
> the
> > > correct one for running 'preprocess.py'.
> > > Command :  /home/new/sparks/bin/spark-submit preprocess.py
> > >
> > >
> > > Thank you.
> > > Aishwarya Chaurasia.
> > >
> > > On 19-Apr-2017 3:55 PM, "Aishwarya Chaurasia" <aishwarya2612@gmail.com
> >
> > > wrote:
> > >
> > > Hello sir,
> > > On running the file preprocess.py we are getting the following error :
> > >
> > > https://paste.fedoraproject.org/paste/IAvqiiyJChSC0V9eeETe2F5M1UNdIG
> > > YhyRLivL9gydE=
> > >
> > > Can you please help us by looking into the error and kindly tell us
the
> > > solution for it.
> > > Thanks a lot.
> > > Aishwarya Chaurasia
> > >
> > >
> > >> On 19-Apr-2017 12:43 AM, <du...@gmail.com> wrote:
> > >>
> > >> Hi Aishwarya,
> > >>
> > >> Certainly, here is some more detailed information
> about`preprocess.py`:
> > >>
> > >>  * The preprocessing Python script is located at
> > >> https://github.com/apache/incubator-systemml/blob/master/
> > >> projects/breast_cancer/preprocess.py.  Note that this is different
> than
> > >> the library module at https://github.com/apache/incu
> > >> bator-systemml/blob/master/projects/breast_cancer/breastc
> > >> ancer/preprocessing.py.
> > >>  * This script is used to preprocess a set of histology slide images,
> > >> which are `.svs` files in our case, and `.tiff` files in your case.
> > >>  * Lines 63-79 contain "settings" such as the output image sizes,
> folder
> > >> paths, etc.  Of particular interest, line 72 has the folder path for
> the
> > >> original slide images that should be commonly accessible from all
> > machines
> > >> being used, and lines 74-79 contain the names of the output
DataFrames
> > that
> > >> will be saved.
> > >>  * Line 82 performs the actual preprocessing and creates a Spark
> > >> DataFrame with the following columns: slide number, tumor score, molecular
> > >> score, sample.  The "sample" in this case is the actual small, chopped-up
> > >> section of the image that has been extracted and flattened into a row
> > >> Vector.  For test images without labels (`training=false`), only the slide
> > >> number and sample will be contained in the DataFrame (i.e. no labels).
> > >> This calls the `preprocess(...)` function located on line 371 of
> > >> https://github.com/apache/incubator-systemml/blob/master/projects/breast_cancer/breastcancer/preprocessing.py,
> > >> which is a different file.
> > >>  * Line 87 simply saves the above DataFrame to HDFS with the name from
> > >> line 74.
> > >>  * Line 93 splits the above DataFrame row-wise into separate "training"
> > >> and "validation" DataFrames, based on the split percentage from line 70
> > >> (`train_frac`).  This is performed so that downstream machine learning
> > >> tasks can learn from the training set, and validate performance and
> > >> hyperparameter choices on the validation set.  These DataFrames will start
> > >> with the same columns as the above DataFrame.  If `add_row_indices` from
> > >> line 69 is true, then an additional row index column (`__INDEX`) will be
> > >> prepended.  This is useful for SystemML in downstream machine learning
> > >> tasks as it gives the DataFrame row numbers like a real matrix would have,
> > >> and SystemML is built to operate on matrices.
> > >>  * Lines 97 & 98 simply save the training and validation DataFrames using
> > >> the names defined on lines 76 & 78.
> > >>  * Lines 103-137 create smaller train and validation DataFrames by taking
> > >> small row-wise samples of the full train and validation DataFrames.  The
> > >> percentage of the sample is defined on line 111 (`p=0.01` for a 1%
> > >> sample).  This is generally useful for quicker downstream tasks without
> > >> having to load in the larger DataFrames, assuming you have a large amount
> > >> of data.  For us, we have ~7TB of data, so having 1% sampled DataFrames is
> > >> useful for quicker downstream tests.  Once again, the same columns from the
> > >> larger train and validation DataFrames will be used.
> > >>  * Lines 146 & 147 simply save these sampled train and validation
> > >> DataFrames.
> > >>
> > >> As a summary, after running `preprocess.py`, you will be left with the
> > >> following saved DataFrames in HDFS (a rough sketch of reading them back in
> > >> follows this list):
> > >>  * Full DataFrame
> > >>  * Training DataFrame
> > >>  * Validation DataFrame
> > >>  * Sampled training DataFrame
> > >>  * Sampled validation DataFrame
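> > >>
> > >> As a rough sketch of reading these back in downstream (the file names and
> > >> the Parquet format here are assumptions -- use whatever names were
> > >> configured on lines 74-79 of `preprocess.py`):
> > >>
> > >> ```
> > >> # Assumed, illustrative names; `spark.read.load` defaults to Parquet.
> > >> train_df = spark.read.load("train_256.parquet")
> > >> val_df = spark.read.load("val_256.parquet")
> > >> train_df.printSchema()  # slide num, tumor score, molecular score, sample (+ __INDEX)
> > >> ```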
> > >>
> > >> As for visualization, you may visualize a "sample" (i.e. a small,
> > >> chopped-up section of the original image) from a DataFrame by using the
> > >> `breastcancer.visualization.visualize_sample(...)` function.  You will
> > >> need to do this after creating the DataFrames.  Here is a snippet to
> > >> visualize the first row's sample in a DataFrame, where `df` is one of the
> > >> DataFrames from above:
> > >>
> > >> ```
> > >> from breastcancer.visualization import visualize_sample
> > >> visualize_sample(df.first().sample)
> > >> ```
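> > >>
> > >> To look at a particular slide rather than just the first row, you could
> > >> filter first, e.g. `visualize_sample(df.filter(df.slide_num == 1).first().sample)`,
> > >> assuming the slide number column is named `slide_num`.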
> > >>
> > >> Please let me know if you have any additional questions.
> > >>
> > >> Thanks!
> > >>
> > >> - Mike
> > >>
> > >> --
> > >>
> > >> Mike Dusenberry
> > >> GitHub: github.com/dusenberrymw
> > >> LinkedIn: linkedin.com/in/mikedusenberry
> > >>
> > >> Sent from my iPhone.
> > >>
> > >>
> > >>> On Apr 15, 2017, at 4:38 AM, Aishwarya Chaurasia <
> > >> aishwarya2612@gmail.com> wrote:
> > >>>
> > >>> Hello sir,
> > >>> Can you please elaborate more on what output we should be getting?  We
> > >>> tried executing the preprocess.py file using spark-submit and it keeps
> > >>> adding the tiles to the RDD, and while running the visualization.py file
> > >>> it isn't showing any output.  Can you please help us out asap, stating
> > >>> the output we will be getting and the sequence in which to execute the
> > >>> files?
> > >>> Thank you.
> > >>>
> > >>>> On 07-Apr-2017 5:54 AM, <du...@gmail.com> wrote:
> > >>>>
> > >>>> Hi Aishwarya,
> > >>>>
> > >>>> Thanks for sharing more info on the issue!
> > >>>>
> > >>>> To facilitate easier usage, I've updated the preprocessing code by
> > >>>> pulling out most of the logic into a `breastcancer/preprocessing.py`
> > >>>> module, leaving just the execution in the `Preprocessing.ipynb` notebook.
> > >>>> There is also a `preprocess.py` script with the same contents as the
> > >>>> notebook for use with `spark-submit`.  The choice of the notebook or the
> > >>>> script is just a matter of convenience, as they both import from the same
> > >>>> `breastcancer/preprocessing.py` package.
> > >>>>
> > >>>> As part of the updates, I've added an explicit SparkSession parameter
> > >>>> (`spark`) to the `preprocess(...)` function, and updated the body to use
> > >>>> this SparkSession object rather than the older SparkContext `sc` object.
> > >>>> Previously, the `preprocess(...)` function accessed the `sc` object that
> > >>>> was pulled in from the enclosing scope, which would work while all of the
> > >>>> code was colocated within the notebook, but not if the code was extracted
> > >>>> and imported.  The explicit parameter now allows the code to be imported.
> > >>>>
> > >>>> Can you please try again with the latest updates?  We are currently
> > >>>> using Spark 2.x with Python 3.  If you use the notebook, the pyspark
> > >>>> kernel should have a `spark` object available that can be supplied to the
> > >>>> functions (as is done now in the notebook), and if you use the
> > >>>> `preprocess.py` script with `spark-submit`, the `spark` object will be
> > >>>> created explicitly by the script.
> > >>>>
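> > >>>> As a minimal sketch of the script-style usage (the arguments other than
> > >>>> `spark` are placeholders here -- see `preprocess.py` for the actual ones):
> > >>>>
> > >>>> ```
> > >>>> from pyspark.sql import SparkSession
> > >>>> from breastcancer.preprocessing import preprocess
> > >>>>
> > >>>> spark = SparkSession.builder.appName("preprocess").getOrCreate()
> > >>>>
> > >>>> # Pass the session in explicitly; remaining arguments are placeholders.
> > >>>> df = preprocess(spark, slide_nums=[1, 2, 3])
> > >>>> ```
> > >>>>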
> > >>>> For a bit of context to others, Aishwarya initially reached out to find
> > >>>> out if our breast cancer project could be applied to TIFF images, rather
> > >>>> than the SVS images we are currently using (the answer is "yes" so long
> > >>>> as they are "generic tiled TIFF" images, according to the OpenSlide
> > >>>> documentation), and then followed up with Spark issues related to the
> > >>>> preprocessing code.  This conversation has been promptly moved to the
> > >>>> mailing list so that others in the community can benefit.
> > >>>>
> > >>>>
> > >>>> Thanks!
> > >>>>
> > >>>> -Mike
> > >>>>
> > >>>> --
> > >>>>
> > >>>> Mike Dusenberry
> > >>>> GitHub: github.com/dusenberrymw
> > >>>> LinkedIn: linkedin.com/in/mikedusenberry
> > >>>>
> > >>>> Sent from my iPhone.
> > >>>>
> > >>>>
> > >>>>> On Apr 6, 2017, at 5:09 AM, Aishwarya Chaurasia <
> > >> aishwarya2612@gmail.com>
> > >>>> wrote:
> > >>>>>
> > >>>>> Hey,
> > >>>>>
> > >>>>> The object `sc` is already defined in pyspark, and yet this NameError
> > >>>>> keeps occurring.  We are using Spark 2.x.
> > >>>>>
> > >>>>> Here is the link to the error that we are getting:
> > >>>>> https://paste.fedoraproject.org/paste/89iQODxzpNZVbSfgwocH8l5M1UNdIGYhyRLivL9gydE=
> > >>>>
> > >>
> >
>