Executing Open Source Code in Machine Learning Pipelines

Hi. My name is Radhikha Myneni, and I work in data mining and
machine learning R&D at SAS. In this video, we
are going to look at how to execute open source
code, specifically Python and R in SAS Visual Data Mining and
Machine Learning pipelines. SAS Visual Data Mining
and Machine Learning encompasses many
tools, including those for data preparation,
interactive model building, defining modeling
flows with pipelines, managing and deploying
models, et cetera. Model Studio is a component
of SAS Visual Data Mining and Machine Learning that aids
in defining and automating modeling with pipelines. It includes nodes that can
be added to pipeline flows for building various models
and eventually comparing them to pick the best model. The Open Source Code node
is new in Model Studio 8.3 and will be available
in July 2018. It supports the
Python and R languages and requires them and
any necessary packages to be installed on the SAS
Compute Server, which is the engine that runs SAS code. The Open Source Code node
downloads a data sample from the SAS Cloud Analytic
Server, also known as CAS, to the Compute Server
during execution. This data transfer happens
with Comma-Separated Value, or CSV, files. The user can choose to
work with CSV files. But for convenience,
the node also makes these data
available as a data frame. Let’s start by describing the
setup steps for the Open Source Code node. As I’ve already
mentioned, Python and R have to be installed
on the Compute Server. The node itself is agnostic to
the version of Python or R installed. That is, any version can be
used as long as the code written in the Node Editor matches
the installed version of the software. I mention this because the
syntax for Python 2 and Python 3 is slightly different and
sometimes not interchangeable. To make the node
accessible to all users, Python and R should be installed
with administrative privileges. And the directories where the
executables python and Rscript reside should be placed
in the system path. On Windows, the system
environment variables PYTHONHOME and RHOME can be
defined as an alternative to point to these executables.
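For example, you could run a quick check like this on the Compute Server (a minimal sketch, not part of the node itself) to confirm that both executables are resolvable from the system path:

    # Run on the SAS Compute Server: verify that the executables the
    # Open Source Code node needs can be found on the system path.
    import shutil

    for exe in ("python", "Rscript"):
        path = shutil.which(exe)
        print(exe, "->", path if path else "NOT FOUND on the system path")
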
Now let’s get into the node details. The Open Source Code node is located in the Miscellaneous
group as shown, right above the SAS Code node. You can place this node
in the Preprocessing group as shown in the leftmost flow. Or you can move it to the
Supervised Learning group as shown on the right. When you have a Modeling node that
trains and makes predictions, you should move the node
to Supervised Learning. We will get into those
details in a little bit. But first, let’s talk
about the node properties. In the Open Source Code node,
you can choose Python or R by using the Language property. Use the Data Sample
group of properties to select how much input
data and what kind of sample needs to be downloaded
when you execute this node. Getting into the
details, the data sample is downloaded as a
CSV file from CAS. Remember, Python
and R are installed on the Compute Server. So data has to move from CAS to
the Compute Server for the node to do its work. So be sure to use
caution and avoid moving large amounts of data. By default, a stratified
sample of 10,000 observations is created based on the
partition variable and the class target, when applicable. Be aware that the Open Source
Code node is working off a sample here. This becomes
an important aspect to consider if you
choose to compare this node with
other modeling nodes in the pipeline
that use full data. Use the Open button
in the node properties to access the Code Editor. There you can type
in Python or R code. As you can see, the Code Editor
has a fair amount of context highlighting for readability. The variables that you
see on the left-hand side are created by the
precursor code. You can hover the
mouse pointer over them to see their descriptions. So what is this
precursor code? During node execution,
various code snippets are generated and put
together before and after the user code, as
shown in this slide. Which snippets get added depends somewhat
on the node properties. So why are these
code snippets needed? The simple answer
is for convenience. Let me show you an example. This is the first part
of the precursor code. You can see a sample,
both in Python and in R. It defines
variables like dm_nodedir, which represents the transient
working directory of the node during execution. dm_dec_target has the name
of the target variable. dm_class_input contains a
list of categorical variables in your data. As you can see, these
are convenient variables that you can use when
coding in Python or R.
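For example, user code can lean on these variables directly. Here is a minimal Python sketch; it assumes the precursor code has already run, so these variables exist:

    # dm_dec_target, dm_class_input, and dm_nodedir are defined by the
    # precursor code before this user code runs.
    import os

    print("Target variable:", dm_dec_target)
    print("Categorical inputs:", dm_class_input)
    print("Node working directory:", dm_nodedir, os.listdir(dm_nodedir))
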
The node properties include a Generate data frame property that is selected by default.
When this property is selected, the second part of the
precursor code is added. What this code is doing is
converting the downloaded CSV input data to an R data frame or
a pandas data frame in Python. Again, this is for convenience. In R, all the
categorical variables are also converted to
factors as part of this step.
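In Python, the effect is roughly the following (a sketch only; the data frame name dm_inputdf and the file name node_data.csv are illustrative assumptions, not names this video confirms):

    # Read the sample that the node downloaded from CAS into a pandas
    # data frame. Both names below are illustrative assumptions.
    import os
    import pandas as pd

    dm_inputdf = pd.read_csv(os.path.join(dm_nodedir, "node_data.csv"))
    print(dm_inputdf.head())
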
The posterior code snippet, as the name implies, is added after the user code
and converts the data frame containing predictions
back to CSV so they can be uploaded into
CAS for model assessment. What I mean by model
assessment is the computation of model statistics
like misclassification rate for binary
or nominal targets or mean squared error
for interval targets. So in summary, when
you run the node, the actual code that is
executed is a combination of the precursor
code, the code entered by the user in the Code
Editor, and the posterior code. Now let’s look at
the node results. After the node
executes successfully, you can view them by
right-clicking the node and selecting Results. In the R example shown
here, two output files are generated after a Random
Forest model is trained. The rpt_forestMsePlot file
is a PNG, or image file, that plots the mean squared
error during training. And the rpt_forestIMP.csv file
contains variable importance in a CSV or tabular format. After successful
execution, they show up like this in the node results. There are two requirements
for output from the Python or R execution
to be viewed in the node. First, output file names should
be saved with an rpt_ prefix, because this tells the node
to display them in the results. And second, the file extension
has to be .csv, .txt, .png, .jpeg, or .gif, because this tells the
node how to display them. CSV files are
displayed as tables. TXT files are displayed
as plain text. And PNG, JPEG, and GIF files
are displayed as images.
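To make that concrete, here is a Python analog of the R example (a minimal sketch; the file names and toy values are illustrative, and dm_nodedir comes from the precursor code):

    import os
    import pandas as pd
    import matplotlib
    matplotlib.use("Agg")  # render to files; there is no display on the server
    import matplotlib.pyplot as plt

    # A CSV with the rpt_ prefix shows up in the results as a table.
    pd.DataFrame({"Variable": ["LOAN", "VALUE"],
                  "Importance": [0.62, 0.38]}).to_csv(
        os.path.join(dm_nodedir, "rpt_forestIMP.csv"), index=False)

    # A PNG with the rpt_ prefix shows up in the results as an image.
    plt.plot([10, 50, 100], [0.30, 0.22, 0.19])
    plt.title("Mean squared error during training")
    plt.savefig(os.path.join(dm_nodedir, "rpt_forestMsePlot.png"))
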
Now let’s look at what it means to move the Open Source Code node to Supervised Learning. To reiterate, the
Open Source Code node can be executed in the
Preprocessing group or Supervised Learning group. What that means
is that you can do preprocessing work
in this node, or you can build supervised models. Typically, you move this
node to Supervised Learning to perform model
assessment and model comparison with other modeling
nodes in the pipeline. For this to happen, you need to
ensure that model predictions are saved in the dm_scoreddf
data frame if the Generate data frame property is enabled, or
in the node_scored.csv file if Generate data
frame is disabled. In addition, the
prediction variables in this data frame or CSV should
follow a naming convention. When the target is interval,
the prediction variable should be named P underscore
target variable name. And when the target
is categorical, the posterior probabilities
for every level of the target should be generated with the
naming convention P underscore target variable name, followed
by the corresponding target level. For example, if you have a
binary target variable called BAD with levels 0
and 1, the variable names of the posterior probabilities
should be P_BAD0 and P_BAD1, respectively.
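Here is a minimal Python sketch of that convention. It assumes the Generate data frame property is enabled and that the precursor code has defined dm_inputdf and dm_interval_input, names this video does not show; scikit-learn is my illustrative choice, not a requirement of the node:

    import pandas as pd
    from sklearn.linear_model import LogisticRegression

    # Fit a simple model on the interval inputs of the downloaded sample.
    X = dm_inputdf[dm_interval_input].fillna(0)
    y = dm_inputdf[dm_dec_target]
    model = LogisticRegression(max_iter=1000).fit(X, y)

    # For a binary target named BAD, assessment expects columns named
    # P_BAD0 and P_BAD1 in the dm_scoreddf data frame.
    dm_scoreddf = pd.DataFrame(
        model.predict_proba(X),
        columns=["P_" + dm_dec_target + str(c) for c in model.classes_])
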
When the Open Source Code node is in Supervised Learning and prediction
data and variables are named, as mentioned
in the previous slide, then you should see
additional charts in the node results
that come from model assessment. These charts include
Lift, Fit Statistics, ROC for binary target, et cetera. In Supervised Learning, the node
also performs model comparison. That is, you can compare Python
or R models with other modeling nodes in the pipeline. There is one point I
want to emphasize here. If your data are big,
the Open Source Code node will download a sample and
train by using that sample, while other nodes will
train on the complete data. In this scenario, if you
are comparing models, you need to be aware that
the number of observations in the validation data
used for comparison might be different for
the Open Source model compared to the other
models in the pipeline. In summary, the Open
Source Code node supports both Python
and R languages. It can display results. It can produce
assessment statistics and enable you to compare
your model with other models in the pipeline. What the Open Source Code
node cannot do is be part of an ensemble. It also does not support
registering to a database, publishing to Model Manager, or
downloading scoring code or scoring APIs. This is because there
is no compatible scoring code available for this node. Thank you for watching. You can get more
information and sign up for a free trial of SAS
Visual Data Mining and Machine Learning at the URL shown here.
