What Is Azure Machine Learning Studio? | Microsoft Docs



Table of Contents
 Machine Learning Studio Documentation
 Overview
 Machine Learning Studio
 ML Studio capabilities
 ML Studio basics (infographic)
 Frequently asked questions
 What's new?
 Get Started
 Create your first experiment
 Example walkthrough
 Create a predictive solution
 1 - Create a workspace
 2 - Upload data
 3 - Create experiment
 4 - Train and evaluate
 5 - Deploy web service
 6 - Access web service
 Data Science for Beginners
 1 - Five questions
 2 - Is your data ready?
 3 - Ask the right question
 4 - Predict an answer
 5 - Copy other people's work
 R quick start
 How To
 Set up tools and utilities
 Manage a workspace
 Acquire and understand data
 Import training data

 Develop models
 Create and train models
 Operationalize models
 Overview
 Deploy models
 Manage web services
 Retrain models
 Consume models
 Examples
 Sample experiments
 Sample datasets
 Customer churn example
 Reference
 Azure PowerShell module (New)
 Azure PowerShell module (Classic)
 Algorithm & Module reference
 REST management APIs
 Web service error codes
 Related
 Azure AI Gallery
 Overview
 Industries
 Solutions
 Experiments
 Jupyter Notebooks
 Competitions
 Competitions FAQ
 Tutorials
 Collections
 Custom Modules
 Resources
 Azure Roadmap

 Net# Neural Networks Language
 Pricing
 Service updates
 Blog
 MSDN forum
 Stack Overflow
 Videos

What is Azure Machine Learning Studio?

4/9/2018 • 9 min to read

Microsoft Azure Machine Learning Studio is a collaborative, drag-and-drop tool you can use to build, test, and deploy predictive analytics solutions on your data. Machine Learning Studio publishes models as web services that can easily be consumed by custom apps or BI tools such as Excel.

Machine Learning Studio is where data science, predictive analytics, cloud resources, and your data meet.

NOTE

You can try Azure Machine Learning for free. No credit card or Azure subscription is required. Get started now.

The Machine Learning Studio interactive workspace

To develop a predictive analysis model, you typically use data from one or more sources, transform and analyze that data through various data manipulation and statistical functions, and generate a set of results. Developing a model like this is an iterative process. As you modify the various functions and their parameters, your results converge until you are satisfied that you have a trained, effective model.

Azure Machine Learning Studio gives you an interactive, visual workspace to easily build, test, and iterate on a predictive analysis model. You drag and drop datasets and analysis modules onto an interactive canvas, connecting them together to form an experiment, which you run in Machine Learning Studio. To iterate on your model design, you edit the experiment, save a copy if desired, and run it again. When you're ready, you can convert your training experiment to a predictive experiment, and then publish it as a web service so that your model can be accessed by others.

No programming is required. You construct your predictive analysis model by visually connecting datasets and modules.

TIP

To download and print a diagram that gives an overview of the capabilities of Machine Learning Studio, see Overview diagram of Azure Machine Learning Studio capabilities.

Get started with Machine Learning Studio

When you first enter Machine Learning Studio you see the Home page. From here you can view documentation, videos, and webinars, and find other valuable resources.

Click the upper-left menu and you'll see several options.

Cortana Intelligence

Click Cortana Intelligence and you'll be taken to the home page of the Cortana Intelligence Suite. The Cortana Intelligence Suite is a fully managed big data and advanced analytics suite to transform your data into intelligent action. See the Suite home page for full documentation, including customer stories.

Azure Machine Learning Studio

There are two options here: Home, the page where you started, and Studio.

Click Studio and you'll be taken to the Azure Machine Learning Studio. First you'll be asked to sign in using your Microsoft account, or your work or school account. Once signed in, you'll see the following tabs on the left:

PROJECTS - Collections of experiments, datasets, notebooks, and other resources representing a single project
EXPERIMENTS - Experiments that you have created and run or saved as drafts
WEB SERVICES - Web services that you have deployed from your experiments
NOTEBOOKS - Jupyter notebooks that you have created
DATASETS - Datasets that you have uploaded into Studio
TRAINED MODELS - Models that you have trained in experiments and saved in Studio
SETTINGS - A collection of settings that you can use to configure your account and resources

Gallery

Click Gallery and you'll be taken to the Azure AI Gallery. The Gallery is a place where a community of data scientists and developers share solutions created using components of the Cortana Intelligence Suite.

For more information about the Gallery, see Share and discover solutions in the Azure AI Gallery.

Components of an experiment

An experiment consists of datasets that provide data to analytical modules, which you connect together to construct a predictive analysis model. Specifically, a valid experiment has these characteristics:

The experiment has at least one dataset and one module
Datasets may be connected only to modules
Modules may be connected to either datasets or other modules
All input ports for modules must have some connection to the data flow
All required parameters for each module must be set

You can create an experiment from scratch, or you can use an existing sample experiment as a template. For more information, see Copy example experiments to create new machine learning experiments.

For an example of creating a simple experiment, see Create a simple experiment in Azure Machine Learning Studio.

For a more complete walkthrough of creating a predictive analytics solution, see Develop a predictive solution with Azure Machine Learning.

Datasets

A dataset is data that has been uploaded to Machine Learning Studio so that it can be used in the modeling process. A number of sample datasets are included with Machine Learning Studio for you to experiment with, and you can upload more datasets as you need them. Here are some examples of included datasets:

MPG data for various automobiles - Miles per gallon (MPG) values for automobiles identified by number of cylinders, horsepower, etc.
Breast cancer data - Breast cancer diagnosis data.
Forest fires data - Forest fire sizes in northeast Portugal.

As you build an experiment you can choose from the list of datasets available to the left of the canvas.

For a list of sample datasets included in Machine Learning Studio, see Use the sample data sets in Azure Machine Learning Studio.

Modules

A module is an algorithm that you can perform on your data. Machine Learning Studio has a number of modules ranging from data ingress functions to training, scoring, and validation processes. Here are some examples of included modules:

Convert to ARFF - Converts a .NET serialized dataset to Attribute-Relation File Format (ARFF).
Compute Elementary Statistics - Calculates elementary statistics such as mean, standard deviation, etc.
Linear Regression - Creates an online gradient descent-based linear regression model.
Score Model - Scores a trained classification or regression model.

As you build an experiment you can choose from the list of modules available to the left of the canvas.

A module may have a set of parameters that you can use to configure the module's internal algorithms. When you select a module on the canvas, the module's parameters are displayed in the Properties pane to the right of the canvas. You can modify the parameters in that pane to tune your model.

For some help navigating through the large library of machine learning algorithms available, see How to choose algorithms for Microsoft Azure Machine Learning.

Deploying a predictive analytics web service

Once your predictive analytics model is ready, you can deploy it as a web service right from Machine Learning Studio. For more details on this process, see Deploy an Azure Machine Learning web service.
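The validity rules listed under "Components of an experiment" can be checked mechanically. The following is a minimal Python sketch of such a check, not anything Studio exposes: the `Node` class, the edge tuples, and `validate_experiment` are hypothetical names invented for illustration, and reading "datasets may be connected only to modules" as "nothing may flow into a dataset" is an assumption.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical stand-in for an item on the canvas (not a Studio API)."""
    name: str
    kind: str                       # "dataset" or "module"
    input_ports: int = 0            # number of input ports (modules only)
    required_params: dict = field(default_factory=dict)  # name -> value, None if unset


def validate_experiment(nodes, edges):
    """Return a list of rule violations; an empty list means no rule was broken.

    edges is a list of (source_name, target_name, target_port_index) tuples.
    """
    by_name = {n.name: n for n in nodes}
    problems = []

    # Rule: at least one dataset and one module.
    kinds = {n.kind for n in nodes}
    if "dataset" not in kinds or "module" not in kinds:
        problems.append("needs at least one dataset and one module")

    # Rule: datasets connect only to modules (assumed: a dataset never receives input).
    connected_ports = set()
    for src, dst, port in edges:
        if by_name[dst].kind == "dataset":
            problems.append(f"{src} -> {dst}: a dataset cannot receive a connection")
        connected_ports.add((dst, port))

    for n in nodes:
        if n.kind != "module":
            continue
        # Rule: every module input port is connected to the data flow.
        for port in range(n.input_ports):
            if (n.name, port) not in connected_ports:
                problems.append(f"{n.name}: input port {port} is not connected")
        # Rule: every required parameter is set.
        for param, value in n.required_params.items():
            if value is None:
                problems.append(f"{n.name}: required parameter '{param}' is not set")
    return problems


# Example: the training module has an unconnected port and an unset parameter.
nodes = [
    Node("mpg_data", "dataset"),
    Node("split", "module", input_ports=1, required_params={"fraction": 0.7}),
    Node("train", "module", input_ports=2, required_params={"label_column": None}),
]
edges = [("mpg_data", "split", 0), ("split", "train", 1)]
problems = validate_experiment(nodes, edges)
```

In Studio itself these checks happen on the canvas when you run the experiment; the sketch only makes the rules concrete.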

Key machine learning terms and concepts

Machine learning terms can be confusing. Here are definitions of key terms to help you. Use the comments section that follows to tell us about any other term you'd like defined.

Data exploration, descriptive analytics, and predictive analytics

Data exploration is the process of gathering information about a large and often unstructured data set in order to find characteristics for focused analysis.

Data mining refers to automated data exploration.

Descriptive analytics is the process of analyzing a data set in order to summarize what happened. The vast majority of business analytics - such as sales reports, web metrics, and social networks analysis - are descriptive.

Predictive analytics is the process of building models from historical or current data in order to forecast future outcomes.

Supervised and unsupervised learning

Supervised learning algorithms are trained with labeled data - in other words, data composed of examples of the answers wanted. For instance, a model that identifies fraudulent credit card use would be trained from a data set with labeled data points of known fraudulent and valid charges. Most machine learning is supervised.

Unsupervised learning is used on data with no labels, and the goal is to find relationships in the data. For instance, you might want to find groupings of customer demographics with similar buying habits.

Model training and evaluation

A machine learning model is an abstraction of the question you are trying to answer or the outcome you want to predict. Models are trained and evaluated from existing data.

Training data

When you train a model from data, you use a known data set and make adjustments to the model based on the data characteristics to get the most accurate answer. In Azure Machine Learning, a model is built from an algorithm module that processes training data and functional modules, such as a scoring module.

In supervised learning, if you're training a fraud detection model, you use a set of transactions that are labeled as either fraudulent or valid. You split your data set randomly, and use part to train the model and part to test or evaluate the model.

Evaluation data

Once you have a trained model, evaluate the model using the remaining test data. You use data you already know the outcomes for, so that you can tell whether your model predicts accurately.

Other common machine learning terms

algorithm: A self-contained set of rules used to solve problems through data processing, math, or automated reasoning.

anomaly detection: A model that flags unusual events or values and helps you discover problems. For example, credit card fraud detection looks for unusual purchases.

categorical data: Data that is organized by categories and that can be divided into groups. For example, a categorical data set for autos could specify year, make, model, and price.

classification: A model for organizing data points into categories based on a data set for which category groupings are already known.

feature engineering: The process of extracting or selecting features related to a data set in order to enhance the data set and improve outcomes. For instance, airfare data could be enhanced by days of the week and holidays. See Feature selection and engineering in Azure Machine Learning.

module: A functional part in a Machine Learning Studio model, such as the Enter Data module that enables entering and editing small data sets. An algorithm is also a type of module in Machine Learning Studio.

model: A supervised learning model is the product of a machine learning experiment composed of training data, an algorithm module, and functional modules, such as a Score Model module.

numerical data: Data that has meaning as measurements (continuous data) or counts (discrete data). Also referred to as quantitative data.

partition: The method by which you divide data into samples. See Partition and Sample for more information.

prediction: A prediction is a forecast of a value or values from a machine learning model. You might also see the term "predicted score." However, predicted scores are not the final output of a model. An evaluation of the model follows the score.

regression: A model for predicting a value based on independent variables, such as predicting the price of a car based on its year and make.

score: A predicted value generated from a trained classification or regression model, using the Score Model module in Machine Learning Studio. Classification models also return a score for the probability of the predicted value. Once you've generated scores from a model, you can evaluate the model's accuracy using the Evaluate Model module.

sample: A part of a data set intended to be representative of the whole. Samples can be selected randomly or based on specific features of the data set.

Next steps

You can learn the basics of predictive analytics and machine learning using a step-by-step tutorial and by building on samples.
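As a first hands-on step outside Studio, the train-then-evaluate workflow described under "Model training and evaluation" can be sketched in a few lines of Python. This is an illustrative sketch only: the transaction data is made up, the "model" is a single amount threshold rather than any Studio algorithm, and the function names are hypothetical.

```python
import random


def split_rows(rows, train_fraction=0.7, seed=0):
    """Shuffle labeled rows and split them into training and evaluation sets."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def train_threshold(train_rows):
    """Toy training step: pick the amount threshold that best fits the training labels."""
    candidates = sorted({amount for amount, _ in train_rows})

    def correct_at(t):
        return sum((amount >= t) == is_fraud for amount, is_fraud in train_rows)

    return max(candidates, key=correct_at)


def evaluate(threshold, test_rows):
    """Fraction of held-out rows whose known label matches the prediction."""
    correct = sum((amount >= threshold) == is_fraud for amount, is_fraud in test_rows)
    return correct / len(test_rows)


# Made-up labeled transactions: (amount, is_fraud). Amounts of 500 or more are labeled fraudulent.
transactions = [(amount, amount >= 500) for amount in range(10, 1010, 10)]

train_rows, test_rows = split_rows(transactions)   # random split: part to train, part to test
threshold = train_threshold(train_rows)            # fit on the training part only
accuracy = evaluate(threshold, test_rows)          # judge on data with known outcomes
```

The point of the split is that `accuracy` is measured on rows the training step never saw, which is what tells you whether the model predicts accurately.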

Overview diagram of Azure Machine Learning Studio capabilities

3/21/2018 • 1 min to read

The Microsoft Azure Machine Learning Studio Capabilities Overview diagram gives you a high-level overview of how you can use Machine Learning Studio to develop a predictive analytics model and operationalize it in the Azure cloud.

Azure Machine Learning Studio offers a large number of machine learning algorithms, along with modules that help with data input, output, preparation, and visualization. Using these components you can develop a predictive analytics experiment, iterate on it, and use it to train your model. Then with one click you can operationalize your model in the Azure cloud so that it can be used to score new data.

This diagram demonstrates how all those pieces fit together.

NOTE

See Machine learning algorithm cheat sheet for Microsoft Azure Machine Learning Studio for additional help in navigating through and choosing the machine learning algorithms available in Machine Learning Studio.

Download the Machine Learning Studio overview diagram

Download the Microsoft Azure Machine Learning Studio Capabilities Overview diagram and get a high-level view of the capabilities of Machine Learning Studio. To keep it nearby, you can print the diagram in tabloid size (11 x 17 in.).

Download the diagram here: Microsoft Azure Machine Learning Studio Capabilities Overview

More help with Machine Learning Studio

For an overview of Microsoft Azure Machine Learning, see Introduction to machine learning on Microsoft Azure.
For an overview of Machine Learning Studio, see What is Azure Machine Learning Studio?
For a detailed discussion of the machine learning algorithms available in Machine Learning Studio, see How to choose algorithms for Microsoft Azure Machine Learning.

NOTE

You can try Azure Machine Learning for free. No credit card or Azure subscription is required. Get started now.

Downloadable Infographic: Machine learning basics with algorithm examples

3/21/2018 • 1 min to read

Download this easy-to-understand infographic overview of machine learning basics to learn about popular algorithms used to answer common machine learning questions. Algorithm examples help the machine learning beginner understand which algorithms to use and what they're used for.

Popular algorithms in Machine Learning Studio

Azure Machine Learning Studio comes with a large library of algorithms for predictive analytics. The infographic identifies four popular families of algorithms - regression, anomaly detection, clustering, and classification - and provides links to working examples in the Azure AI Gallery. The Gallery contains example experiments and tutorials that demonstrate how these algorithms can be applied in many real-world solutions.

Download the infographic with algorithm examples

Download: Infographic of machine learning basics with links to algorithm examples (PDF)

More help with algorithms for beginners and advanced users

For a deeper discussion of the different types of machine learning algorithms, how they're used, and how to choose the right one for your solution, see How to choose algorithms for Microsoft Azure Machine Learning.
For a list by category of all the machine learning algorithms available in Machine Learning Studio, see Initialize Model in the Machine Learning Studio Algorithm and Module Help.
For a complete alphabetical list of algorithms and modules in Machine Learning Studio, see A-Z list of Machine Learning Studio modules in Machine Learning Studio Algorithm and Module Help.
To download and print a diagram that gives an overview of the capabilities of Machine Learning Studio, see Overview diagram of Azure Machine Learning Studio capabilities.
For an overview of the Azure AI Gallery and the many community-generated resources available there, see Share and discover resources in the Azure AI Gallery.

NOTE

You can try Azure Machine Learning for free. No credit card or Azure subscription is required. Get started now.

Azure Machine Learning frequently asked questions: Billing, capabilities, limitations, and support

4/25/2018 • 29 min to read

Here are some frequently asked questions (FAQs) and corresponding answers about Azure Machine Learning, a cloud service for developing predictive models and operationalizing solutions through web services. These FAQs cover how to use the service, including its billing model, capabilities, limitations, and support.

Have a question you can't find here?

Azure Machine Learning has a forum on MSDN where members of the data science community can ask questions about Azure Machine Learning. The Azure Machine Learning team monitors the forum. Go to the Azure Machine Learning Forum to search for answers or to post a new question of your own.

General questions

What is Azure Machine Learning?

Azure Machine Learning is a fully managed service that you can use to create, test, operate, and manage predictive analytic solutions in the cloud. With only a browser, you can sign in, upload data, and immediately start machine-learning experiments. Drag-and-drop predictive modeling, a large palette of modules, and a library of starting templates make common machine-learning tasks simple and quick. For more information, see the Azure Machine Learning service overview. For an introduction to machine learning that explains key terminology and concepts, see Introduction to Azure Machine Learning.

NOTE

You can try Azure Machine Learning for free. No credit card or Azure subscription is required. Get started now.

What is Machine Learning Studio?

Machine Learning Studio is a workbench environment that you access by using a web browser. Machine Learning

Studio hosts a palette of modules in a visual composition interface that helps you build an end-to-end, data-science

workflow in the form of an experiment.

For more information about Machine Learning Studio, see What is Machine Learning Studio?

What is the Machine Learning API service?

The Machine Learning API service enables you to deploy predictive models, such as those that you build in Machine

Learning Studio, as scalable, fault-tolerant web services. The web services that the Machine Learning API service

creates are REST APIs that provide an interface for communication between external applications and your

predictive analytics models.

For more information, see How to consume an Azure Machine Learning Web service.

Where are my Classic web services listed? Where are my New (Azure Resource Manager-based) web

services listed?

Web services created using the Classic deployment model and web services created using the New Azure Resource Manager deployment model are listed in the Microsoft Azure Machine Learning Web Services portal. Classic web services are also listed in Machine Learning Studio on the Web services tab.

Azure Machine Learning questions

What are Azure Machine Learning web services?

Machine Learning web services provide an interface between an application and a Machine Learning workflow scoring model. An external application can use Azure Machine Learning to communicate with a Machine Learning workflow scoring model in real time. A call to a Machine Learning web service returns prediction results to an external application. To make a call to a web service, you pass an API key that was created when you deployed the web service. A Machine Learning web service is based on REST, a popular architecture choice for web programming projects.

Azure Machine Learning has two types of web services:

Request-Response Service (RRS): A low latency, highly scalable service that provides an interface to the stateless models created and deployed by using Machine Learning Studio.
Batch Execution Service (BES): An asynchronous service that scores a batch of data records.

There are several ways to consume the REST API and access the web service. For example, you can write an application in C#, R, or Python by using the sample code that's generated for you when you deploy the web service.

The sample code is available on:

The Consume page for the web service in the Azure Machine Learning Web Services portal
The API Help Page in the web service dashboard in Machine Learning Studio

You can also use the sample Microsoft Excel workbook that's created for you and is available in the web service dashboard in Machine Learning Studio.

What are the main updates to Azure Machine Learning?

For the latest updates, see What's new in Azure Machine Learning.

Machine Learning Studio questions

Import and export data for Machine Learning

What data sources does Machine Learning support?

You can bring data into a Machine Learning Studio experiment in three ways:

Upload a local file as a dataset
Use a module to import data from cloud data services
Import a dataset saved from another experiment

To learn more about supported file formats, see Import training data into Machine Learning Studio.

How large can the data set be for my modules?

Modules in Machine Learning Studio support datasets of up to 10 GB of dense numerical data for common use

cases. If a module takes more than one input, the 10 GB value is the total of all input sizes. You can also sample

larger datasets by using queries from Hive or Azure SQL Database, or you can use Learning with Counts

preprocessing before ingestion.

The following types of data can expand to larger datasets during feature normalization and are limited to less than 10 GB:

Sparse
Categorical
Strings
Binary data

The following modules are limited to datasets less than 10 GB:

Recommender modules
Synthetic Minority Oversampling Technique (SMOTE) module
Scripting modules: R, Python, SQL
Modules where the output data size can be larger than input data size, such as Join or Feature Hashing
Cross-validation, Tune Model Hyperparameters, Ordinal Regression, and One-vs-All Multiclass, when the number of iterations is very large

What are the limits for data upload?

For datasets that are larger than a couple of GB, upload data to Azure Storage or Azure SQL Database, or use Azure HDInsight rather than directly uploading from a local file.

Can I read data from Amazon S3?

If you have a small amount of data and want to expose it via an HTTP URL, then you can use the Import Data module. For larger amounts of data, transfer it to Azure Storage first, and then use the Import Data module to bring it into your experiment.

Is there a built-in image input capability?

You can learn about image input capability in the Import Images reference.

Modules

The algorithm, data source, data format, or data transformation operation that I am looking for isn't in Azure Machine Learning Studio. What are my options?

You can go to the user feedback forum to see feature requests that we are tracking. Add your vote to a request if a capability that you're looking for has already been requested. If the capability that you're looking for doesn't exist, create a new request. You can view the status of your request in this forum, too. We track this list closely and update the status of feature availability frequently. In addition, you can use the built-in support for R and Python to create custom transformations when needed.

Can I bring my existing code into Machine Learning Studio?

Yes, you can bring your existing R or Python code into Machine Learning Studio, run it in the same experiment with Azure Machine Learning learners, and deploy the solution as a web service via Azure Machine Learning. For more information, see Extend your experiment with R and Execute Python machine learning scripts in Azure Machine Learning Studio.

Is it possible to use something like PMML to define a model?

No, Predictive Model Markup Language (PMML) is not supported. You can use custom R and Python code to define a module.

How many modules can I execute in parallel in my experiment?

You can execute up to four modules in parallel in an experiment.

Data processing

Is there an ability to visualize data (beyond R visualizations) interactively within the experiment?

Click the output of a module to visualize the data and get statistics.

When previewing results or data in a browser, the number of rows and columns is limited. Why?

Because large amounts of data might be sent to a browser, data size is limited to prevent slowing down Machine Learning Studio. To visualize all the data/results, it's better to download the data and use Excel or another tool.

Algorithms

What existing algorithms are supported in Machine Learning Studio?

Machine Learning Studio provides state-of-the-art algorithms, such as Scalable Boosted Decision trees, Bayesian Recommendation systems, Deep Neural Networks, and Decision Jungles developed at Microsoft Research. Scalable open-source machine learning packages, like Vowpal Wabbit, are also included. Machine Learning Studio supports machine learning algorithms for multiclass and binary classification, regression, and clustering. See the complete list of Machine Learning Modules.

Do you automatically suggest the right Machine Learning algorithm to use for my data?

No, but Machine Learning Studio has various ways to compare the results of each algorithm to determine the right one for your problem.

Do you have any guidelines on picking one algorithm over another for the provided algorithms?

See How to choose an algorithm.

Are the provided algorithms written in R or Python?

No, these algorithms are mostly written in compiled languages to provide better performance.

Are any details of the algorithms provided?

The documentation provides some information about the algorithms, and it describes the tuning parameters that you can use to optimize an algorithm for your use.

Is there any support for online learning?

No, currently only programmatic retraining is supported.

Can I visualize the layers of a Neural Net Model by using the built-in module?

No.

Can I create my own modules in C# or some other language?

Currently, you can only use R to create new custom modules.

R module

What R packages are available in Machine Learning Studio?

Machine Learning Studio supports more than 400 CRAN R packages today, and here is the current list of all included packages. Also, see Extend your experiment with R to learn how to retrieve this list yourself. If the package that you want is not in this list, provide the name of the package at the user feedback forum.

Is it possible to build a custom R module?

Yes, see Author custom R modules in Azure Machine Learning for more information.

Is there a REPL environment for R?

No, there is no Read-Eval-Print-Loop (REPL) environment for R in the studio.

Python module

Is it possible to build a custom Python module?

Not currently, but you can use one or more Execute Python Script modules to get the same result.

Is there a REPL environment for Python?

You can use the Jupyter Notebooks in Machine Learning Studio. For more information, see Introducing Jupyter Notebooks in Azure Machine Learning Studio.

Web service

Retrain

How do I retrain Azure Machine Learning models programmatically?

Use the retraining APIs. For more information, see Retrain Machine Learning models programmatically. Sample code is also available in the Microsoft Azure Machine Learning Retraining Demo.

Create

Can I deploy the model locally or in an application that doesn't have an Internet connection?

No.

Is there a baseline latency that is expected for all web services?

See the Azure subscription limits.

Use
When would I want to run my predictive model as a Batch Execution service versus a Request Response
service?
The Request Response service (RRS) is a low-latency, high-scale web service that is used to provide an interface to
stateless models that are created and deployed from the experimentation environment. The Batch Execution
service (BES) is a service that asynchronously scores a batch of data records. The input for BES is like the data input
that RRS uses. The main difference is that BES reads a block of records from a variety of sources, such as Azure
Blob storage, Azure Table storage, Azure SQL Database, HDInsight (Hive query), and HTTP sources. For more
information, see How to consume an Azure Machine Learning Web service.
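To make the RRS side of this answer concrete, here is a hedged Python sketch that builds (but does not send) a request to a Request Response service. The URL and key are placeholders, and the `Inputs`/`GlobalParameters` payload layout and the bearer-token header are assumptions for illustration; the sample code generated for your own web service is the authoritative source for the exact request shape.

```python
import json
import urllib.request


def build_rrs_request(url, api_key, column_names, rows):
    """Build a POST request for a Request Response service call without sending it."""
    body = {
        # Assumed payload layout; check the sample code generated for your service.
        "Inputs": {"input1": {"ColumnNames": column_names, "Values": rows}},
        "GlobalParameters": {},
    }
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            # The API key created at deployment is passed with every call.
            "Authorization": "Bearer " + api_key,
        },
        method="POST",
    )


request = build_rrs_request(
    "https://example.invalid/score",   # placeholder endpoint URL
    "YOUR_API_KEY",                    # placeholder API key
    ["cylinders", "horsepower"],       # hypothetical input columns
    [["4", "95"]],                     # one record to score
)

# Against a real endpoint you would send it with:
#   response_body = urllib.request.urlopen(request).read()
```

A BES call follows a different, asynchronous pattern (submit a job, then poll for completion), so this sketch does not apply to it.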
How do I update the model for the deployed web service?
To update a predictive model for an already deployed service, modify and rerun the experiment that you used to
author and save the trained model. After you have a new version of the trained model available, Machine Learning
Studio asks you if you want to update your web service. For details about how to update a deployed web service,
see Deploy a Machine Learning web service.
You can also use the Retraining APIs. For more information, see Retrain Machine Learning models
programmatically. Sample code is also available in the Microsoft Azure Machine Learning Retraining Demo.
How do I monitor my web service deployed in production?
After you deploy a predictive model, you can monitor it from the Azure Machine Learning Web Services portal.
Each deployed service has its own dashboard where you can see monitoring information for that service. For more
information about how to manage your deployed web services, see Manage a Web service using the Azure
Machine Learning Web Services portal and Manage an Azure Machine Learning workspace.
Is there a place where I can see the output of my RRS/BES?
For RRS, the web service response is typically where you see the result. You can also write it to Azure Blob storage.
For BES, the output is written to a blob by default. You can also write the output to a database or table by using the
Export Data module.
Can I create web services only from models that were created in Machine Learning Studio?
No, you can also create web services directly by using Jupyter Notebooks and RStudio.

Scalability

Security and availability

Where can I find information about error codes?

See Machine Learning Module Error Codes for a list of error codes and descriptions.

What is the scalability of the web service?

Currently, the default endpoint is provisioned with 20 concurrent RRS requests. You can scale this to
200 concurrent requests per endpoint, and you can scale each web service to 10,000 endpoints, as
described in Scaling a Web Service. For BES, each endpoint can process 40 requests at a time. Requests
beyond 40 are queued and run automatically as the queue drains.
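The limits in this answer can be turned into a small capacity check. The following is a minimal sketch that uses only the figures quoted above; the function names are hypothetical and are not part of any Azure SDK.

```python
RRS_MAX_CONCURRENCY_PER_ENDPOINT = 200   # scaled-up RRS limit per endpoint
MAX_ENDPOINTS_PER_SERVICE = 10_000       # endpoints per web service
BES_CONCURRENT_PER_ENDPOINT = 40         # BES requests processed at a time


def split_bes_requests(submitted):
    """Return (running, queued) for BES requests sent to one endpoint."""
    if submitted < 0:
        raise ValueError("submitted must be non-negative")
    running = min(submitted, BES_CONCURRENT_PER_ENDPOINT)
    return running, submitted - running


def max_rrs_concurrency(endpoints, per_endpoint=RRS_MAX_CONCURRENCY_PER_ENDPOINT):
    """Upper bound on concurrent RRS requests across a web service."""
    if not 1 <= endpoints <= MAX_ENDPOINTS_PER_SERVICE:
        raise ValueError("endpoints is outside the documented range")
    if not 1 <= per_endpoint <= RRS_MAX_CONCURRENCY_PER_ENDPOINT:
        raise ValueError("per_endpoint is outside the documented range")
    return endpoints * per_endpoint


print(split_bes_requests(55))   # 40 run now, 15 wait in the queue
print(max_rrs_concurrency(10_000))
```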

Are R jobs spread across nodes?

No.

How much data can I use for training?

Modules in Machine Learning Studio support datasets of up to 10 GB of dense numerical data for common use

cases. If a module takes more than one input, the total size for all inputs is 10 GB. You can also sample larger

datasets via Hive queries, via Azure SQL Database queries, or by preprocessing with Learning with Counts

modules before ingestion.

The following types of data can expand to larger datasets during feature normalization and are limited to less than

10 GB:

Sparse

Categorical

Strings

Binary data

The following modules are limited to datasets less than 10 GB:

Recommender modules

Synthetic Minority Oversampling Technique (SMOTE) module

Scripting modules: R, Python, SQL

Modules where the output data size can be larger than input data size, such as Join or Feature Hashing

Cross-Validate, Tune Model Hyperparameters, Ordinal Regression, and One-vs-All Multiclass, when number of

iterations is very large

For datasets that are larger than a few GBs, upload data to Azure Storage or Azure SQL Database, or use

HDInsight rather than directly uploading from a local file.
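To get a feel for the 10 GB figure, the sketch below estimates the size of a dense numeric dataset and checks the combined size of a module's inputs against the limit. It assumes 8 bytes per value (64-bit floating point) and 1 GB = 2^30 bytes; neither assumption comes from this FAQ, and the estimate does not capture the expansion that sparse, categorical, string, or binary data can undergo during feature normalization.

```python
LIMIT_BYTES = 10 * 1024**3   # 10 GB, assuming 1 GB = 2**30 bytes
BYTES_PER_VALUE = 8          # assumption: each value is a 64-bit float


def dense_size_bytes(rows, columns, bytes_per_value=BYTES_PER_VALUE):
    """Approximate in-memory size of a dense numeric table."""
    return rows * columns * bytes_per_value


def fits_module_limit(*input_sizes_bytes):
    """True if the total size of all module inputs is within the limit."""
    return sum(input_sizes_bytes) <= LIMIT_BYTES


training = dense_size_bytes(rows=1_000_000, columns=100)
print(training, fits_module_limit(training))
# Two 6 GB inputs to the same module exceed the combined limit.
print(fits_module_limit(6 * 1024**3, 6 * 1024**3))
```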

Are there any vector size limitations?

Rows and columns are each limited to the .NET limitation of Max Int: 2,147,483,647.

Can I adjust the size of the virtual machine that runs the web service?

No.

Who can access the HTTP endpoint for the web service by default? How do I restrict access to the endpoint?

After a web service is deployed, a default endpoint is created for that service. The default endpoint can be called by
using its API key. You can add more endpoints with their own keys from the Web Services portal or
programmatically by using the Web Service Management APIs. Access keys are needed to make calls to the web
service. For more information, see How to consume an Azure Machine Learning Web service.

What happens if my Azure storage account can't be found?

Machine Learning Studio relies on a user-supplied Azure storage account to save intermediary data when it
executes the workflow. This storage account is provided to Machine Learning Studio when a workspace is created.
After the workspace is created, if the storage account is deleted and can no longer be found, the workspace will
stop functioning, and all experiments in that workspace will fail.

If you accidentally deleted the storage account, recreate the storage account with the same name in the same
region as the deleted storage account. After that, resync the access key.

What happens if my storage account access key is out of sync?

Machine Learning Studio relies on a user-supplied Azure storage account to store intermediary data when it
executes the workflow. This storage account is provided to Machine Learning Studio when a workspace is created,
and the access keys are associated with that workspace. If the access keys are changed after the workspace is
created, the workspace can no longer access the storage account. It will stop functioning and all experiments in that
workspace will fail.

If you changed storage account access keys, resync the access keys in the workspace by using the Azure portal.

Support and training

Where can I get training for Azure Machine Learning?

The Azure Machine Learning Documentation Center hosts video tutorials and how-to guides. These step-by-step
guides introduce the services and explain the data science life cycle of importing data, cleaning data, building
predictive models, and deploying them in production by using Azure Machine Learning.

We add new material to the Machine Learning Center on an ongoing basis. You can submit requests for additional
learning material on Machine Learning Center at the user feedback forum.

You can also find training at Microsoft Virtual Academy.

How do I get support for Azure Machine Learning?

To get technical support for Azure Machine Learning, go to Azure Support, and select Machine Learning.

Azure Machine Learning also has a community forum on MSDN where you can ask questions about Azure
Machine Learning. The Azure Machine Learning team monitors the forum. Go to Azure Forum.

Billing questions

How does Machine Learning billing work?

Azure Machine Learning has two components: Machine Learning Studio and Machine Learning web services.

While you are evaluating Machine Learning Studio, you can use the Free billing tier. The Free tier also lets you

deploy a Classic web service that has limited capacity.

If you decide that Azure Machine Learning meets your needs, you can sign up for the Standard tier. To sign up, you

must have a Microsoft Azure subscription.

In the Standard tier, you are billed monthly for each workspace that you define in Machine Learning Studio. When

you run an experiment in the studio, you are billed for the compute resources that the experiment uses while it runs.
When you deploy a Classic web service, transactions and compute hours are billed on a Pay As You Go basis.

NOTE

New (Resource Manager-based) web services introduce billing plans that allow for more predictability in costs.

Tiered pricing offers discounted rates to customers who need a large amount of capacity.

When you create a plan, you commit to a fixed cost that comes with an included quantity of API compute hours

and API transactions. If you need more included quantities, you can add instances to your plan. If you need a lot

more included quantities, you can choose a higher tier plan that provides considerably more included quantities

and a better discounted rate.

After the included quantities in existing instances are used up, additional usage is charged at the overage rate that's

associated with the billing plan tier.
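The relationship between included quantities, instances, and overage can be sketched as follows. The included quantities per instance used here (100,000 transactions and 25 compute hours) are hypothetical placeholders, and the overage rates reuse the Standard S1 overage figures quoted later in this FAQ; see Machine Learning Pricing for actual values.

```python
def overage_charge(used_transactions, used_hours, instances,
                   included_transactions, included_hours,
                   rate_per_1k_transactions, rate_per_hour):
    """Charge for usage beyond the included quantities of a plan."""
    # Each instance adds one more unit of included quantities.
    total_included_tx = instances * included_transactions
    total_included_hours = instances * included_hours
    # Only usage above the included quantities is billed as overage.
    over_tx = max(0, used_transactions - total_included_tx)
    over_hours = max(0.0, used_hours - total_included_hours)
    return over_tx / 1000 * rate_per_1k_transactions + over_hours * rate_per_hour


charge = overage_charge(
    used_transactions=250_000, used_hours=60.0, instances=2,
    included_transactions=100_000,  # hypothetical per-instance quantity
    included_hours=25.0,            # hypothetical per-instance quantity
    rate_per_1k_transactions=0.50, rate_per_hour=2.0,
)
print(round(charge, 2))
```

With these placeholder numbers, 50,000 transactions and 10 compute hours fall outside the included quantities, so only that portion is charged at the overage rate.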

Included quantities are reallocated every 30 days, and unused included quantities do not roll over to the next period.

For additional billing and pricing information, see Machine Learning Pricing.

Does Machine Learning have a free trial?

Azure Machine Learning has a free subscription option that's explained in Machine Learning Pricing. Machine

Learning Studio has an eight-hour quick evaluation trial that's available when you sign in to Machine Learning

Studio.

In addition, when you sign up for an Azure free trial, you can try any Azure services for a month. To learn more

about the Azure free trial, visit Azure free trial FAQ.

What is a transaction?

A transaction represents an API call that Azure Machine Learning responds to. Transactions from Request-

Response Service (RRS) and Batch Execution Service (BES) calls are aggregated and charged against your billing

plan.

Can I use the included transaction quantities in a plan for both RRS and BES transactions?

Yes, your transactions from your RRS and BES are aggregated and charged against your billing plan.

What is an API compute hour?

An API compute hour is the billing unit for the time that API calls take to run by using Machine Learning compute

resources. All your calls are aggregated for billing purposes.

How long does a typical production API call take?

Production API call times can vary significantly, generally ranging from hundreds of milliseconds to a few seconds.

Some API calls might require minutes depending on the complexity of the data processing and machine-learning

model. The best way to estimate production API call times is to benchmark a model on the Machine Learning

service.
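One simple way to benchmark call times is to time repeated calls and look at the median and the worst case. The sketch below is a generic timing harness, not an Azure tool; `fake_api_call` is a stand-in that you would replace with a real call to your own web service.

```python
import statistics
import time


def benchmark(call, runs=20):
    """Time `call` several times and summarize the durations in seconds."""
    durations = []
    for _ in range(runs):
        start = time.perf_counter()
        call()
        durations.append(time.perf_counter() - start)
    return {
        "median_s": statistics.median(durations),
        "max_s": max(durations),
    }


def fake_api_call():
    # Stand-in for a real request to a deployed web service.
    time.sleep(0.01)


result = benchmark(fake_api_call, runs=5)
print(result)
```

The median is less sensitive than the mean to an occasional slow call, which is why it is reported alongside the maximum.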

What is a Studio compute hour?

A Studio compute hour is the billing unit for the aggregate time that your experiments use compute resources in

studio.

In New (Azure Resource Manager-based) web services, what is the Dev/Test tier meant for?

Resource Manager-based web services provide multiple tiers that you can use to provision your billing plan. The

Dev/Test pricing tier provides limited, included quantities that allow you to test your experiment as a web service

without incurring costs. You have the opportunity to see how it works.

Are there separate storage charges?

The Machine Learning Free tier does not require or allow separate storage. The Machine Learning Standard tier
requires users to have an Azure storage account. Azure Storage is billed separately.

Does Machine Learning support high availability?

Yes. For details, see Machine Learning Pricing for a description of the service level agreement (SLA).

What specific kind of compute resources will my production API calls be run on?

The Machine Learning service is a multitenant service. Actual compute resources that are used on the back end
vary and are optimized for performance and predictability.

Management of New (Resource Manager-based) web services

What happens if I delete my plan?

The plan is removed from your subscription, and you are billed for prorated usage.

NOTE

You cannot delete a plan that a web service is using. To delete the plan, you must either assign a new plan to the web service
or delete the web service.

What is a plan instance?

A plan instance is a unit of included quantities that you can add to your billing plan. When you select a billing tier

for your billing plan, it comes with one instance. If you need more included quantities, you can add instances of the

selected billing tier to your plan.

How many plan instances can I add?

You can have one instance of the Dev/Test pricing tier in a subscription.

For Standard S1, Standard S2, and Standard S3 tiers, you can add as many as necessary.

Depending on your anticipated usage, it might be more cost effective to upgrade to a tier that has more included quantities

rather than add instances to the current tier.

What happens when I change plan tiers (upgrade / downgrade)?

The old plan is deleted and the current usage is billed on a prorated basis. A new plan with the full included

quantities of the upgraded/downgraded tier is created for the rest of the period.

Included quantities are allocated per period, and unused quantities do not roll over.

What happens when I increase the instances in a plan?

Quantities are included on a prorated basis and may take 24 hours to be effective.

What happens when I delete an instance of a plan?

The instance is removed from your subscription, and you are billed for prorated usage.

How do I sign up for a plan?

You have two ways to create billing plans.

When you first deploy a Resource Manager-based web service, you can choose an existing plan or create a new
plan.

Plans that you create in this manner are in your default region, and your web service will be deployed to that
region.

If you want to deploy services to regions other than your default region, you may want to define your billing plans
before you deploy your service.

In that case, you can sign in to the Azure Machine Learning Web Services portal, and go to the Plans page. From
there, you can add plans, delete plans, and modify existing plans.

Which plan should I choose to start off with?

We recommend that you start with the Standard S1 tier and monitor your service for usage. If you find that
you are using your included quantities rapidly, you can add instances or move to a higher tier and get better
discounted rates. You can adjust your billing plan as needed throughout your billing cycle.

Which regions are the new plans available in?

The new billing plans are available in the three production regions in which we support the new web services:

South Central US

West Europe

Southeast Asia

I have web services in multiple regions. Do I need a plan for every region?

Yes. Plan pricing varies by region. When you deploy a web service to another region, you need to assign it a plan
that is specific to that region. For more information, see Products available by region.

New web services: Overages

How do I check if I exceeded my web service usage?

You can view the usage on all your plans on the Plans page in the Azure Machine Learning Web Services portal.

In the Transactions and Compute columns of the table, you can see the included quantities of the plan and the

percentage used.

What happens when I use up the included quantities in the Dev/Test pricing tier?

Services that have a Dev/Test pricing tier assigned to them are stopped until the next period or until you move

them to a paid tier.

For Classic web services and overages of New (Resource Manager-based) web services, how are prices

calculated for Request Response (RRS) and Batch (BES) workloads?

For an RRS workload, you are charged for every API transaction call that you make and for the compute time
that's associated with those requests. Your RRS production API transaction costs are calculated as the total number
of API calls that you make multiplied by the price per 1,000 transactions (prorated by individual transaction). Your
RRS production API compute hour costs are calculated as the amount of time required for each API call to run,
multiplied by the total number of API transactions, multiplied by the price per production API compute hour.

For example, for Standard S1 overage, 1,000,000 API transactions that take 0.72 seconds each to run would result
in $500 in production API transaction costs (1,000,000 * $0.50/1K API transactions) and $400 in production API
compute hours (1,000,000 * 0.72 sec * $2/hr), for a total of $900.
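The RRS formula above, and the BES formula described in the rest of this answer, can be checked with a few lines of arithmetic. This is a minimal sketch that reproduces the two worked examples in this answer using the Standard S1 overage rates they quote ($0.50 per 1,000 transactions and $2 per compute hour); the function names are hypothetical.

```python
SECONDS_PER_HOUR = 3600


def rrs_overage_cost(api_calls, seconds_per_call, price_per_1k, price_per_hour):
    """Return (transaction cost, compute cost) for an RRS workload."""
    transaction_cost = api_calls / 1000 * price_per_1k
    compute_hours = api_calls * seconds_per_call / SECONDS_PER_HOUR
    return round(transaction_cost, 2), round(compute_hours * price_per_hour, 2)


def bes_overage_cost(jobs, rows_per_job, seconds_per_row, price_per_1k, price_per_hour):
    """Return (transaction cost, compute cost) for a BES workload.

    For BES, each submitted job counts as one transaction.
    """
    transaction_cost = jobs / 1000 * price_per_1k
    compute_hours = jobs * rows_per_job * seconds_per_row / SECONDS_PER_HOUR
    return round(transaction_cost, 2), round(compute_hours * price_per_hour, 2)


rrs = rrs_overage_cost(1_000_000, 0.72, price_per_1k=0.50, price_per_hour=2.0)
# 100 jobs per day over a 31-day month is 3,100 jobs.
bes = bes_overage_cost(100 * 31, 500, 0.72, price_per_1k=0.50, price_per_hour=2.0)
print(rrs, round(sum(rrs), 2))
print(bes, round(sum(bes), 2))
```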

For a BES workload, you are charged in the same manner. However, the API transaction costs represent the
number of batch jobs that you submit, and the compute costs represent the compute time that's associated with
those batch jobs. Your BES production API transaction costs are calculated as the total number of jobs submitted
multiplied by the price per 1,000 transactions (prorated by individual transaction). Your BES production API
compute hour costs are calculated as the amount of time required for each row in your job to run multiplied by the
total number of rows in your job multiplied by the total number of jobs multiplied by the price per production API
compute hour. When you use the Machine Learning calculator, the transaction meter represents the number of jobs
that you plan to submit, and the time-per-transaction field represents the combined time that's needed for all rows
in each job to run.

For example, assume Standard S1 overage, and you submit 100 jobs per day that each consist of 500 rows that
take 0.72 seconds each. Your monthly overage costs would be $1.55 in production API transaction costs (100 jobs
per day = 3,100 jobs/mo * $0.50/1K API transactions) and $620 in production API compute hours (500 rows *
0.72 sec * 3,100 jobs * $2/hr), for a total of $621.55.

Azure Machine Learning Classic web services

Is Pay As You Go still available?

Yes, Classic web services are still available in Azure Machine Learning.

Azure Machine Learning Free and Standard tier

What is included in the Azure Machine Learning Free tier?

The Azure Machine Learning Free tier is intended to provide an in-depth introduction to the Azure Machine

Learning Studio. All you need is a Microsoft account to sign up. The Free tier includes free access to one Azure

Machine Learning Studio workspace per Microsoft account. In this tier, you can use up to 10 GB of storage and

operationalize models as staging APIs. Free tier workloads are not covered by an SLA and are intended for

development and personal use only.

Free tier workspaces have the following limitations:

Workloads can't access data by connecting to an on-premises server that runs SQL Server.

You cannot deploy New (Resource Manager-based) web services.

What is included in the Azure Machine Learning Standard tier and plans?

The Azure Machine Learning Standard tier is a paid production version of Azure Machine Learning Studio. The

Azure Machine Learning Studio monthly fee is billed on a per workspace per month basis and prorated for partial

months. Azure Machine Learning Studio experiment hours are billed per compute hour for active experimentation.

Billing is prorated for partial hours.

The Azure Machine Learning API service is billed depending on whether it's a Classic web service or a New

(Resource Manager-based) web service.

The following charges are aggregated per workspace for your subscription.

Machine Learning Workspace Subscription: The Machine Learning workspace subscription is a monthly fee that

provides access to a Machine Learning Studio workspace. The subscription is required to run experiments in the

studio and to utilize the production APIs.

Studio Experiment hours: This meter aggregates all compute charges that are accrued by running experiments

in Machine Learning Studio and running production API calls in the staging environment.

Access data by connecting to an on-premises server that runs SQL Server in your models for your training and

scoring.

For Classic web services:

Production API Compute Hours: This meter includes compute charges that are accrued by web services

running in production.

Production API Transactions (in 1000s): This meter includes charges that are accrued per call to your

production web service.

Apart from the preceding charges, in the case of Resource Manager-based web service, charges are aggregated to

the selected plan:

Standard S1/S2/S3 API Plan (Units): This meter represents the type of instance that's selected for Resource

Manager-based web services.

Standard S1/S2/S3 Overage API Compute Hours: This meter includes compute charges that are accrued by

Resource Manager-based web services that run in production after the included quantities in existing instances

are used up. The additional usage is charged at the overage rate that's associated with the S1/S2/S3 plan tier.

Standard S1/S2/S3 Overage API Transactions (in 1,000s): This meter includes charges that are accrued per call

to your production Resource Manager-based web service after the included quantities in existing instances are

used up. The additional usage is charged at the overage rate that's associated with the S1/S2/S3 plan tier.

Included Quantity API Compute Hours: With Resource Manager-based web services, this meter represents the

included quantity of API compute hours.

Included Quantity API Transactions (in 1,000s): With Resource Manager-based web services, this meter

represents the included quantity of API transactions.

How do I sign up for Azure Machine Learning Free tier?

All you need is a Microsoft account. Go to Azure Machine Learning home, and then click Start Now. Sign in with

your Microsoft account and a workspace in Free tier is created for you. You can start to explore and create Machine

Learning experiments right away.

How do I sign up for Azure Machine Learning Standard tier?

You must first have access to an Azure subscription to create a Standard Machine Learning workspace. You can

purchase a paid Azure subscription outright. You can then create a Machine Learning workspace from the

Microsoft Azure portal after you gain access to the subscription. View the step-by-step instructions.

Alternatively, you can be invited by a Standard Machine Learning workspace owner to access the owner's

workspace.

Can I specify my own Azure Blob storage account to use with the Free tier?

No, the Standard tier is equivalent to the version of the Machine Learning service that was available before the

tiers were introduced.

Can I deploy my machine learning models as APIs in the Free tier?

Yes, you can operationalize machine learning models to staging API services as part of the Free tier. To put the

staging API service into production and get a production endpoint for the operationalized service, you must use

the Standard tier.

What is the difference between Azure free trial and Azure Machine Learning Free tier?

The Microsoft Azure free trial offers credits that you can apply to any Azure service for one month. The Azure

Machine Learning Free tier offers continuous access specifically to Azure Machine Learning for non-production

workloads.

How do I move an experiment from the Free tier to the Standard tier?

To copy your experiments from the Free tier to the Standard tier:

1. Sign in to Azure Machine Learning Studio, and make sure that you can see both the Free workspace and the
Standard workspace in the workspace selector in the top navigation bar.

2. Switch to the Free workspace if you are in the Standard workspace.

3. In the experiment list view, select an experiment that you'd like to copy, and then click the Copy command
button.

4. Select the Standard workspace from the dialog box that opens, and then click the Copy button. All the
associated datasets, trained models, and so on are copied together with the experiment into the Standard workspace.

5. You need to rerun the experiment and republish your web service in the Standard workspace.

Studio workspace

Will I see different bills for different workspaces?

Workspace charges are broken out separately for each applicable meter on a single bill.

What specific kind of compute resources will my experiments be run on?

The Machine Learning service is a multitenant service. Actual compute resources that are used on the back end
vary and are optimized for performance and predictability.

Guest Access

What is Guest Access to Azure Machine Learning Studio?

Guest Access is a restricted trial experience. You can create and run experiments in Azure Machine Learning Studio

at no cost and without authentication. Guest sessions are non-persistent (cannot be saved) and limited to eight

hours. Other limitations include lack of support for R and Python, lack of staging APIs, and restricted dataset size

and storage capacity. By comparison, users who choose to sign in with a Microsoft account have full access to the

Free tier of Machine Learning Studio that's described previously, which includes a persistent workspace and more

comprehensive capabilities. To choose your free Machine Learning experience, click Get started on

https://studio.azureml.net, and then select Guest Access or sign in with a Microsoft account.

What's New in Azure Machine Learning

3/21/2018 • 1 min to read • Edit Online

The March 2017 release of Microsoft Azure Machine Learning updates provides the following feature:

Dedicated Capacity for Azure Machine Learning BES Jobs

Machine Learning Batch Pool processing uses the Azure Batch service to provide customer-managed scale
for the Azure Machine Learning Batch Execution Service. Batch Pool processing allows you to create Azure
Batch pools on which you can submit batch jobs and have them execute in a predictable manner.

For more information, see Azure Batch service for Machine Learning jobs.

The August 2016 release of Microsoft Azure Machine Learning updates provides the following features:

Classic Web services can now be managed in the new Microsoft Azure Machine Learning Web Services portal
that provides one place to manage all aspects of your Web service.

Provides web service usage statistics.

Simplifies testing of Azure Machine Learning Request-Response calls using sample data.

Provides a new Batch Execution Service test page with sample data and job submission history.

Provides easier endpoint management.

The July 2016 release of Microsoft Azure Machine Learning updates provides the following features:

Web services are now managed as Azure resources through Azure Resource Manager interfaces,
allowing for the following enhancements:

Incorporates a new subscription-based, multi-region web service deployment model using Resource Manager

based APIs leveraging the Resource Manager Resource Provider for Web Services.

Introduces new pricing plans and plan management capabilities using the new Resource Manager RP for

Billing.

Provides web service usage statistics.

Simplifies testing of Azure Machine Learning Request-Response calls using sample data.

Provides a new Batch Execution Service test page with sample data and job submission history.

There are new REST APIs to deploy and manage your Resource Manager based Web services.

There is a new Microsoft Azure Machine Learning Web Services portal that provides one place to

manage all aspects of your Web service.

You can now deploy your web service to multiple regions without needing to create a subscription in

each region.

In addition, the Machine Learning Studio has been updated to allow you to deploy to the new Web service model

or continue to deploy to the classic Web service model.

Machine learning tutorial: Create your first data

science experiment in Azure Machine Learning

Studio

4/9/2018 • 16 min to read • Edit Online

If you've never used Azure Machine Learning Studio before, this tutorial is for you.

In this tutorial, we'll walk through how to use Studio for the first time to create a machine learning experiment.
The experiment will test an analytical model that predicts the price of an automobile based on different variables
such as make and technical specifications.

NOTE

This tutorial shows you the basics of how to drag and drop modules onto your experiment, connect them together, run
the experiment, and look at the results. We're not going to discuss the general topic of machine learning or how to select
and use the 100+ built-in algorithms and data manipulation modules included in Studio.

If you're new to machine learning, the video series Data Science for Beginners might be a good place to start. This video
series is a great introduction to machine learning using everyday language and concepts.

If you're familiar with machine learning, but you're looking for more general information about Machine Learning Studio,
and the machine learning algorithms it contains, here are some good resources:

What is Machine Learning Studio? - This is a high-level overview of Studio.

Machine learning basics with algorithm examples - This infographic is useful if you want to learn more about the
different types of machine learning algorithms included with Machine Learning Studio.

Machine Learning Guide - This guide covers information similar to the infographic above, but in an interactive format.

Machine learning algorithm cheat sheet and How to choose algorithms for Microsoft Azure Machine Learning - This
downloadable poster and accompanying article discuss the Studio algorithms in depth.

Machine Learning Studio: Algorithm and Module Help - This is the complete reference for all Studio modules, including
machine learning algorithms.

You can try Azure Machine Learning for free. No credit card or Azure subscription is required. Get started now.

How does Machine Learning Studio help?

Machine Learning Studio makes it easy to set up an experiment using drag-and-drop modules preprogrammed

with predictive modeling techniques.

Using an interactive, visual workspace, you drag and drop datasets and analysis modules onto an interactive

canvas. You connect them together to form an experiment that you run in Machine Learning Studio. You create

a model, train the model, and score and test the model.

You can iterate on your model design, editing the experiment and running it until it gives you the results you're

looking for. When your model is ready, you can publish it as a web service so that others can send it new data

and get predictions in return.

Open Machine Learning Studio

To get started with Studio, go to https://studio.azureml.net. If you’ve signed in to Machine Learning Studio before,
click Sign In. Otherwise, click Sign up here and choose between free and paid options.

Five steps to create an experiment

In this machine learning tutorial, you'll follow five basic steps to build an experiment in Machine Learning Studio
to create, train, and score your model:

Create a model
  Step 1: Get data
  Step 2: Prepare the data
  Step 3: Define features

Train the model
  Step 4: Choose and apply a learning algorithm

Score and test the model
  Step 5: Predict new automobile prices

TIP

You can find a working copy of the following experiment in the Azure AI Gallery. Go to Your first data science
experiment - Automobile price prediction and click Open in Studio to download a copy of the experiment into your
Machine Learning Studio workspace.

Step 1: Get data

The first thing you need to perform machine learning is data. There are several sample datasets included with

Machine Learning Studio that you can use, or you can import data from many sources. For this example, we'll

use the sample dataset, Automobile price data (Raw), that's included in your workspace. This dataset includes

entries for various individual automobiles, including information such as make, model, technical specifications,

and price.

Here's how to get the dataset into your experiment.

1. Create a new experiment by clicking +NEW at the bottom of the Machine Learning Studio window, select

EXPERIMENT, and then select Blank Experiment.

2. The experiment is given a default name that you can see at the top of the canvas. Select this text and

rename it to something meaningful, for example, Automobile price prediction. The name doesn't need

to be unique.

3. To the left of the experiment canvas is a palette of datasets and modules. Type automobile in the Search

box at the top of this palette to find the dataset labeled Automobile price data (Raw). Drag this dataset

to the experiment canvas.

Find the automobile dataset and drag it onto the experiment canvas

To see what this data looks like, click the output port at the bottom of the automobile dataset, and then select

Visualize.

Click the output port and select "Visualize"

TIP

Datasets and modules have input and output ports represented by small circles - input ports at the top, output ports at
the bottom. To create a flow of data through your experiment, you'll connect an output port of one module to an input
port of another. At any time, you can click the output port of a dataset or module to see what the data looks like at that
point in the data flow.

In this sample dataset, each instance of an automobile appears as a row, and the variables associated with each
automobile appear as columns. Given the variables for a specific automobile, we're going to try to predict the
price in the far-right column (column 26, titled "price").

View the automobile data in the data visualization window

Close the visualization window by clicking the "x" in the upper-right corner.

Step 2: Prepare the data

A dataset usually requires some preprocessing before it can be analyzed. For example, you might have noticed
the missing values present in the columns of various rows. These missing values need to be cleaned so the
model can analyze the data correctly. In our case, we'll remove any rows that have missing values. Also, the
normalized-losses column has a large proportion of missing values, so we'll exclude that column from the
model altogether.

TIP

Cleaning the missing values from input data is a prerequisite for using most of the modules.

First we add a module that removes the normalized-losses column completely, and then we add another

module that removes any row that has missing data.

1. Type select columns in the Search box at the top of the module palette to find the Select Columns in

Dataset module, then drag it to the experiment canvas. This module allows us to select which columns of

data we want to include or exclude in the model.

2. Connect the output port of the Automobile price data (Raw) dataset to the input port of the Select

Columns in Dataset module.

Add the "Select Columns in Dataset" module to the experiment canvas and connect it

3. Click the Select Columns in Dataset module and click Launch column selector in the Properties pane.

On the left, click With rules

Under Begin With, click All columns. This directs Select Columns in Dataset to pass through all the

columns (except those columns we're about to exclude).

From the drop-downs, select Exclude and column names, and then click inside the text box. A list of

columns is displayed. Select normalized-losses, and it's added to the text box.

Click the check mark (OK) button to close the column selector (on the lower-right).

Launch the column selector and exclude the "normalized-losses" column

Now the properties pane for Select Columns in Dataset indicates that it will pass through all

columns from the dataset except normalized-losses.

The properties pane shows that the "normalized-losses" column is excluded

TIP

You can add a comment to a module by double-clicking the module and entering text. This can help you

see at a glance what the module is doing in your experiment. In this case double-click the Select Columns in

Dataset module and type the comment "Exclude normalized losses."

Double-click a module to add a comment

4. Drag the Clean Missing Data module to the experiment canvas and connect it to the Select Columns in

Dataset module. In the Properties pane, select Remove entire row under Cleaning mode. This directs

Clean Missing Data to clean the data by removing rows that have any missing values. Double-click the

module and type the comment "Remove missing value rows."

Set the cleaning mode to "Remove entire row" for the "Clean Missing Data" module

5. Run the experiment by clicking RUN at the bottom of the page.

When the experiment has finished running, all the modules have a green check mark to indicate that they

finished successfully. Notice also the Finished running status in the upper-right corner.
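The two cleaning steps above (exclude one column, then remove every row that still has a missing value) can be sketched outside Studio as follows. This is a minimal illustration, not what the Select Columns in Dataset or Clean Missing Data modules do internally. The sample rows, the make names, and the use of "?" as a missing-value marker are assumptions made for the example.

```python
# Values treated as "missing" in this sketch (an assumption for illustration).
MISSING = {"", "?", None}


def drop_column(rows, column):
    """Return copies of the rows without the named column."""
    return [{k: v for k, v in row.items() if k != column} for row in rows]


def drop_incomplete_rows(rows):
    """Keep only the rows in which every remaining value is present."""
    return [row for row in rows if not any(v in MISSING for v in row.values())]


# Made-up sample rows; they are not taken from the automobile dataset.
rows = [
    {"make": "car-a", "normalized-losses": "164", "horsepower": "102", "price": "13950"},
    {"make": "car-b", "normalized-losses": "?", "horsepower": "111", "price": "13495"},
    {"make": "car-c", "normalized-losses": "?", "horsepower": "?", "price": "9295"},
]

# Step 1: exclude the column with many missing values.
without_losses = drop_column(rows, "normalized-losses")
# Step 2: remove any row that still contains a missing value.
cleaned = drop_incomplete_rows(without_losses)

print(len(cleaned))
```

Dropping the column first matters: the second row is missing only normalized-losses, so it survives once that column is gone, while the third row is still removed because its horsepower value is missing.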

After running it, the experiment should look something like this

TIP

Why did we run the experiment now? By running the experiment, the column definitions for our data pass from the dataset, through the Select Columns in Dataset module, and through the Clean Missing Data module. This means that any modules we connect to Clean Missing Data will also have this same information.

All we have done in the experiment up to this point is clean the data. If you want to view the cleaned dataset, click the left output port of the Clean Missing Data module and select Visualize. Notice that the normalized-losses column is no longer included, and there are no missing values.

Now that the data is clean, we're ready to specify what features we're going to use in the predictive model.

Step 3: Define features

In machine learning, features are individual measurable properties of something you’re interested in. In our dataset, each row represents one automobile, and each column is a feature of that automobile.

Finding a good set of features for creating a predictive model requires experimentation and knowledge about the problem you want to solve. Some features are better for predicting the target than others. Also, some features have a strong correlation with other features and can be removed. For example, city-mpg and highway-mpg are closely related so we can keep one and remove the other without significantly affecting the prediction.

Let's build a model that uses a subset of the features in our dataset. You can come back later and select different features, run the experiment again, and see if you get better results. But to start, let's try the following features:

make, body-style, wheel-base, engine-size, horsepower, peak-rpm, highway-mpg, price

1. Drag another Select Columns in Dataset module to the experiment canvas. Connect the left output port of

the Clean Missing Data module to the input of the Select Columns in Dataset module.

Connect the "Select Columns in Dataset" module to the "Clean Missing Data" module

2. Double-click the module and type "Select features for prediction."

3. Click Launch column selector in the Properties pane.

4. Click With rules.

5. Under Begin With, click No columns. In the filter row, select Include and column names and select our list of column names in the text box. This directs the module to not pass through any columns (features) except the ones that we specify.

6. Click the check mark (OK) button.

Select the columns (features) to include in the prediction

This produces a filtered dataset containing only the features we want to pass to the learning algorithm we'll use in the next step. Later, you can return and try again with a different selection of features.

Step 4: Choose and apply a learning algorithm

Now that the data is ready, constructing a predictive model consists of training and testing. We'll use our data to

train the model, and then we'll test the model to see how closely it's able to predict prices.

Classification and regression are two types of supervised machine learning algorithms. Classification predicts an

answer from a defined set of categories, such as a color (red, blue, or green). Regression is used to predict a

number.

Because we want to predict price, which is a number, we'll use a regression algorithm. For this example, we'll use

a simple linear regression model.

TIP
If you want to learn more about different types of machine learning algorithms and when to use them, you might view the
first video in the Data Science for Beginners series, The five questions data science answers. You might also look at the
infographic Machine learning basics with algorithm examples, or check out the Machine learning algorithm cheat sheet.
We train the model by giving it a set of data that includes the price. The model scans the data and looks for
correlations between an automobile's features and its price. Then we'll test the model - we'll give it a set of
features for automobiles we're familiar with and see how close the model comes to predicting the known price.
We'll use our data for both training the model and testing it by splitting the data into separate training and
testing datasets.
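The idea of holding back part of the data for testing can be sketched in a few lines. The function below is a hypothetical stand-in, not the Split Data module itself: it shuffles the rows with a seeded pseudo-random generator and cuts the result at a given fraction. The name split_rows, the default seed, and the 200 placeholder rows are assumptions for illustration; the 0.75 fraction matches the value used in the steps that follow.

```python
import random


def split_rows(rows, fraction=0.75, seed=0):
    """Shuffle a copy of the rows with a seeded generator, then cut at `fraction`."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = int(round(len(shuffled) * fraction))
    return shuffled[:cut], shuffled[cut:]


rows = list(range(200))  # placeholders standing in for 200 dataset rows
train, test = split_rows(rows, fraction=0.75, seed=42)

print(len(train), len(test))  # prints "150 50"
```

Using the same seed reproduces the same split, which makes results comparable between runs; changing the seed draws a different random sample.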
1.  Select and drag the Split Data module to the experiment canvas and connect it to the last Select Columns
in Dataset module.
2.  Click the Split Data module to select it. Find the Fraction of rows in the first output dataset (in the
Properties pane to the right of the canvas) and set it to 0.75. This way, we'll use 75 percent of the data to
train the model, and hold back 25 percent for testing (later, you can experiment with using different
percentages).

Set the split fraction of the "Split Data" module to 0.75
TIP
By changing the Random seed parameter, you can produce different random samples for training and testing.
This parameter controls the seeding of the pseudo-random number generator.
3.  Run the experiment. When the experiment is run, the Select Columns in Dataset and Split Data modules
pass column definitions to the modules we'll be adding next.
4.  To select the learning algorithm, expand the Machine Learning category in the module palette to the left
of the canvas, and then expand Initialize Model. This displays several categories of modules that can be
used to initialize machine learning algorithms. For this experiment, select the Linear Regression module
under the Regression category, and drag it to the experiment canvas. (You can also find the module by
typing "linear regression" in the palette Search box.)
5.  Find and drag the Train Model module to the experiment canvas. Connect the output of the Linear
Regression module to the left input of the Train Model module, and connect the training data output (left
port) of the Split Data module to the right input of the Train Model module.

Connect the "Train Model" module to both the "Linear Regression" and "Split Data" modules

6. Click the Train Model module, click Launch column selector in the Properties pane, and then select the

price column. This is the value that our model is going to predict.

You select the price column in the column selector by moving it from the Available columns list to the

Selected columns list.

Select the price column for the "Train Model" module

7. Run the experiment.

We now have a trained regression model that can be used to score new automobile data to make price

predictions.
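To make "training" and "scoring" concrete, here is a minimal sketch of ordinary least squares with a single feature. It is a simplification: the experiment above passes several features to the Linear Regression module, and this sketch does not claim to reproduce that module's implementation or settings. The function names and the training pairs are made up for illustration.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def predict(model, x):
    """Score one new value with a fitted (intercept, slope) pair."""
    intercept, slope = model
    return intercept + slope * x


# Made-up training pairs of (engine-size, price); not taken from the dataset.
engine_sizes = [100, 120, 140, 160]
prices = [8000, 10000, 12000, 14000]

model = fit_line(engine_sizes, prices)  # "train"
print(predict(model, 130))              # "score" a new automobile; prints 11000.0
```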

Step 5: Predict new automobile prices

After running, the experiment should now look something like this

Now that we've trained the model using 75 percent of our data, we can use it to score the other 25 percent of the

data to see how well our model functions.

1. Find and drag the Score Model module to the experiment canvas. Connect the output of the Train Model

module to the left input port of Score Model. Connect the test data output (right port) of the Split Data

module to the right input port of Score Model.

Connect the "Score Model" module to both the "Train Model" and "Split Data" modules


2. Run the experiment and view the output from the Score Model module (click the output port of Score

Model and select Visualize). The output shows the predicted values for price and the known values from

the test data.

Output of the "Score Model" module

3. Finally, we test the quality of the results. Select and drag the Evaluate Model module to the experiment

canvas, and connect the output of the Score Model module to the left input of Evaluate Model.

TIP

There are two input ports on the Evaluate Model module because it can be used to compare two models side by side. Later, you can add another algorithm to the experiment and use Evaluate Model to see which one gives better results.

4. Run the experiment.

To view the output from the Evaluate Model module, click the output port, and then select Visualize.

Final experiment

Evaluation results for the experiment

The following statistics are shown for our model:

Mean Absolute Error (MAE): The average of absolute errors (an error is the difference between the

predicted value and the actual value).

Root Mean Squared Error (RMSE): The square root of the average of squared errors of predictions made

on the test dataset.

Relative Absolute Error: The average of absolute errors relative to the absolute difference between actual

values and the average of all actual values.

Relative Squared Error: The average of squared errors relative to the squared difference between the actual

values and the average of all actual values.

Coefficient of Determination: Also known as the R squared value, this is a statistical metric indicating

how well a model fits the data.

For each of the error statistics, smaller is better. A smaller value indicates that the predictions more closely match

the actual values. For Coefficient of Determination, the closer its value is to one (1.0), the better the

predictions.
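The statistics described above can be computed from paired actual and predicted values as in the following sketch. It uses the standard textbook definitions, with the coefficient of determination taken as one minus the relative squared error; that Evaluate Model computes its values in exactly this way is an assumption. The function name and the sample numbers are made up for illustration.

```python
import math


def regression_metrics(actual, predicted):
    """Return MAE, RMSE, RAE, RSE, and R squared for paired values."""
    n = len(actual)
    mean_actual = sum(actual) / n
    abs_err = [abs(p - a) for a, p in zip(actual, predicted)]
    sq_err = [(p - a) ** 2 for a, p in zip(actual, predicted)]
    # Deviations of the actual values from their own mean (the baseline).
    abs_dev = sum(abs(a - mean_actual) for a in actual)
    sq_dev = sum((a - mean_actual) ** 2 for a in actual)
    rse = sum(sq_err) / sq_dev
    return {
        "mae": sum(abs_err) / n,
        "rmse": math.sqrt(sum(sq_err) / n),
        "rae": sum(abs_err) / abs_dev,
        "rse": rse,
        "r2": 1.0 - rse,
    }


actual = [10.0, 20.0, 30.0, 40.0]      # made-up known values
predicted = [12.0, 18.0, 33.0, 37.0]   # made-up model outputs
metrics = regression_metrics(actual, predicted)
print(metrics)
```

For these sample values the mean absolute error is 2.5 and the coefficient of determination is about 0.948, which illustrates the rule of thumb above: small errors go together with a value close to 1.0.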

The final experiment should look something like this:

The final experiment

Next steps

Now that you've completed the first machine learning tutorial and have your experiment set up, you can

continue to improve the model and then deploy it as a predictive web service.


Iterate to try to improve the model - For example, you can change the features you use in your

prediction. Or you can modify the properties of the Linear Regression algorithm or try a different

algorithm altogether. You can even add multiple machine learning algorithms to your experiment at one

time and compare two of them by using the Evaluate Model module. For an example of how to compare

multiple models in a single experiment, see Compare Regressors in the Azure AI Gallery.

TIP

To copy any iteration of your experiment, use the SAVE AS button at the bottom of the page. You can see all the iterations of your experiment by clicking VIEW RUN HISTORY at the bottom of the page. For more details, see Manage experiment iterations in Azure Machine Learning Studio.

Deploy the model as a predictive web service - When you're satisfied with your model, you can deploy it

as a web service to be used to predict automobile prices by using new data. For more details, see Deploy an

Azure Machine Learning web service.

Want to learn more? For a more extensive and detailed walkthrough of the process of creating, training, scoring,

and deploying a model, see Develop a predictive solution by using Azure Machine Learning.

Walkthrough: Develop a predictive analytics solution for credit risk assessment in Azure Machine Learning

3/21/2018 • 2 min to read

In this walkthrough, we take an extended look at the process of developing a predictive analytics solution in Machine Learning Studio. We develop a simple model in Machine Learning Studio, and then deploy it as an Azure Machine Learning web service where the model can make predictions using new data.

This walkthrough assumes that you've used Machine Learning Studio at least once before, and that you have some understanding of machine learning concepts. But it doesn't assume you're an expert in either.

If you've never used Azure Machine Learning Studio before, you might want to start with the tutorial, Create your first data science experiment in Azure Machine Learning Studio. That tutorial takes you through Machine Learning Studio for the first time. It shows you the basics of how to drag-and-drop modules onto your experiment, connect them together, run the experiment, and look at the results. Another tool that may be helpful for getting started is a diagram that gives an overview of the capabilities of Machine Learning Studio. You can download and print it from here: Overview diagram of Azure Machine Learning Studio capabilities.

If you're new to the field of machine learning in general, there's a video series that might be helpful to you. It's called Data Science for Beginners and it can give you a great introduction to machine learning using everyday language and concepts.

NOTE

You can try Azure Machine Learning for free. No credit card or Azure subscription is required. Get started now.

The problem

Suppose you need to predict an individual's credit risk based on the information they gave on a credit application.

Credit risk assessment is a complex problem, but we can simplify it a bit for this walkthrough. We'll use it as an example of how you can create a predictive analytics solution using Microsoft Azure Machine Learning. To do this, we use Azure Machine Learning Studio and a Machine Learning web service.

The solution

In this detailed walkthrough, we start with publicly available credit risk data and develop and train a predictive

model based on that data. Then we deploy the model as a web service so it can be used by others for credit risk

assessment.

To create this credit risk assessment solution, we follow these steps:

1. Create a Machine Learning workspace

2. Upload existing data

3. Create an experiment

4. Train and evaluate the models

5. Deploy the web service

6. Access the web service

TIP

You can find a working copy of the experiment that we develop in this walkthrough in the Azure AI Gallery. Go to

Walkthrough - Credit risk prediction and click Open in Studio to download a copy of the experiment into your

Machine Learning Studio workspace.

This walkthrough is based on a simplified version of the sample experiment, Binary Classification: Credit risk prediction, also

available in the Gallery.

Walkthrough Step 1: Create a Machine Learning workspace

3/21/2018 • 1 min to read

This is the first step of the walkthrough, Develop a predictive analytics solution in Azure Machine Learning.

1. Create a Machine Learning workspace

2. Upload existing data

3. Create a new experiment

4. Train and evaluate the models

5. Deploy the Web service

6. Access the Web service

To use Machine Learning Studio, you need to have a Microsoft Azure Machine Learning workspace. This

workspace contains the tools you need to create, manage, and publish experiments.

The administrator for your Azure subscription needs to create the workspace and then add you as an owner or

contributor. For details, see Create and share an Azure Machine Learning workspace.

After your workspace is created, open Machine Learning Studio (https://studio.azureml.net/Home). If you have

more than one workspace, you can select the workspace in the toolbar in the upper-right corner of the window.

TIP

If you were made an owner of the workspace, you can share the experiments you're working on by inviting others to the workspace. You can do this in Machine Learning Studio on the SETTINGS page. You just need the Microsoft account or organizational account for each user.

On the SETTINGS page, click USERS, then click INVITE MORE USERS at the bottom of the window.

Next: Upload existing data

Walkthrough Step 2: Upload existing data into an Azure Machine Learning experiment

5/1/2018 • 3 min to read

This is the second step of the walkthrough, Develop a predictive analytics solution in Azure Machine Learning.

1. Create a Machine Learning workspace

2. Upload existing data

3. Create a new experiment

4. Train and evaluate the models

5. Deploy the Web service

6. Access the Web service

To develop a predictive model for credit risk, we need data that we can use to train and then test the model. For this walkthrough, we'll use the "UCI Statlog (German Credit Data) Data Set" from the UC Irvine Machine Learning repository. You can find it here:

http://archive.ics.uci.edu/ml/datasets/Statlog+(German+Credit+Data)

We'll use the file named german.data. Download this file to your local hard drive.

The german.data dataset contains rows of 20 variables for 1000 past applicants for credit. These 20 variables represent the dataset's set of features (the feature vector), which provides identifying characteristics for each credit applicant. An additional column in each row represents the applicant's calculated credit risk, with 700 applicants identified as a low credit risk and 300 as a high risk.

The UCI website provides a description of the attributes of the feature vector for this data. This includes financial information, credit history, employment status, and personal information. For each applicant, a binary rating has been given indicating whether they are a low or high credit risk.

We'll use this data to train a predictive analytics model. When we're done, our model should be able to accept a feature vector for a new individual and predict whether he or she is a low or high credit risk.

Here's an interesting twist. The description of the dataset on the UCI website mentions what it costs if we misclassify a person's credit risk. If the model predicts a high credit risk for someone who is actually a low credit risk, the model has made a misclassification. But the reverse misclassification is five times more costly to the financial institution: if the model predicts a low credit risk for someone who is actually a high credit risk.

So, we want to train our model so that the cost of this latter type of misclassification is five times higher than misclassifying the other way. One simple way to do this when training the model in our experiment is by duplicating (five times) those entries that represent someone with a high credit risk. Then, if the model misclassifies someone as a low credit risk when they're actually a high risk, the model does that same misclassification five times, once for each duplicate. This will increase the cost of this error in the training results.

Convert the dataset format

The original dataset uses a blank-separated format. Machine Learning Studio works better with a comma-

separated value (CSV) file, so we'll convert the dataset by replacing spaces with commas.
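As one scripted option, the conversion can be sketched in Python with the standard csv module. The function name spaces_to_csv and the two sample lines are made up for illustration; they are not rows copied from german.data.

```python
import csv
import io


def spaces_to_csv(text):
    """Convert blank-separated lines of text into comma-separated lines."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for line in text.splitlines():
        fields = line.split()  # splits on any run of whitespace
        if fields:             # skip empty lines
            writer.writerow(fields)
    return out.getvalue()


# Made-up sample in the same blank-separated style.
sample = "A1 6 A3 100 1\nA2 48 A3 200 2\n"
converted = spaces_to_csv(sample)
print(converted, end="")

# To convert the downloaded file (assuming it is in the current directory):
# with open("german.data") as src, open("german.csv", "w", newline="") as dst:
#     dst.write(spaces_to_csv(src.read()))
```

One design difference from a plain character replacement: split() treats a run of several blanks as a single separator, so doubled spaces do not produce empty fields.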

There are many ways to convert this data. One way is by using the following Windows PowerShell command:

cat german.data | %{$_ -replace " ",","} | sc german.csv

Another way is by using the Unix sed command:

sed 's/ /,/g' german.data > german.csv

In either case, we have created a comma-separated version of the data in a file named german.csv that we can use in our experiment.

Upload the dataset to Machine Learning Studio

Once the data has been converted to CSV format, we need to upload it into Machine Learning Studio.

1. Open the Machine Learning Studio home page (https://studio.azureml.net).

2. Click the menu in the upper-left corner of the window, click Azure Machine Learning, select

Studio, and sign in.

3. Click +NEW at the bottom of the window.

4. Select DATASET.

5. Select FROM LOCAL FILE.

6. In the Upload a new dataset dialog, click Browse and find the german.csv file you created.

7. Enter a name for the dataset. For this walkthrough, call it "UCI German Credit Card Data".

8. For data type, select Generic CSV File With no header (.nh.csv).

9. Add a description if you’d like.

10. Click the OK check mark.

This uploads the data into a dataset module that we can use in an experiment.

You can manage datasets that you've uploaded to Studio by clicking the DATASETS tab to the left of the Studio

window.

For more information about importing other types of data into an experiment, see Import your training data into

Azure Machine Learning Studio.

Next: Create a new experiment

Walkthrough Step 3: Create a new Azure Machine Learning experiment

3/21/2018 • 6 min to read

This is the third step of the walkthrough, Develop a predictive analytics solution in Azure Machine Learning

1. Create a Machine Learning workspace

2. Upload existing data

3. Create a new experiment

4. Train and evaluate the models

5. Deploy the Web service

6. Access the Web service

The next step in this walkthrough is to create an experiment in Machine Learning Studio that uses the dataset we

uploaded.

1. In Studio, click +NEW at the bottom of the window.

2. Select EXPERIMENT, and then select "Blank Experiment".

3. Select the default experiment name at the top of the canvas and rename it to something meaningful.

TIP

It's a good practice to fill in Summary and Description for the experiment in the Properties pane. These properties give you the chance to document the experiment so that anyone who looks at it later will understand your goals and methodology.

4. In the module palette to the left of the experiment canvas, expand Saved Datasets.

5. Find the dataset you created under My Datasets and drag it onto the canvas. You can also find the dataset by entering the name in the Search box above the palette.

Prepare the data

You can view the first 100 rows of the data and some statistical information for the whole dataset: Click the

output port of the dataset (the small circle at the bottom) and select Visualize.
Because the data file didn't come with column headings, Studio has provided generic headings (Col1, Col2, etc.).
Good headings aren't essential to creating a model, but they make it easier to work with the data in the
experiment. Also, when we eventually publish this model in a web service, the headings help identify the columns
to the user of the service.
We can add column headings using the Edit Metadata module. You use the Edit Metadata module to change
metadata associated with a dataset. In this case, we use it to provide more friendly names for column headings.
To use Edit Metadata, you first specify which columns to modify (in this case, all of them). Next, you specify the
action to be performed on those columns (in this case, changing column headings).
1.  In the module palette, type "metadata" in the Search box. The Edit Metadata module appears in the module list.
2.  Click and drag the Edit Metadata module onto the canvas and drop it below the dataset we added earlier.
3.  Connect the dataset to the Edit Metadata module: click the output port of the dataset (the small circle at the bottom
of the dataset), drag to the input port of Edit Metadata (the small circle at the top of the module), then
release the mouse button. The dataset and module remain connected even if you move either around on
the canvas.
The experiment should now look something like this:
The red exclamation mark indicates that we haven't set the properties for this module yet. We'll do that
next.
TIP
You can add a comment to a module by double-clicking the module and entering text. This can help you see at a
glance what the module is doing in your experiment. In this case, double-click the Edit Metadata module and type
the comment "Add column headings". Click anywhere else on the canvas to close the text box. To display the
comment, click the down-arrow on the module.
4.  Select Edit Metadata, and in the Properties pane to the right of the canvas, click Launch column
selector.
5.  In the Select columns dialog, select all the rows in Available Columns and click > to move them to
Selected Columns. The dialog should look like this:

6. Click the OK check mark.

7. Back in the Properties pane, look for the New column names parameter. In this field, enter a list of names for the 21 columns in the dataset, separated by commas and in column order. You can obtain the column names from the dataset documentation on the UCI website, or for convenience you can copy and paste the following list:

Status of checking account, Duration in months, Credit history, Purpose, Credit amount, Savings account/bond, Present employment since, Installment rate in percentage of disposable income, Personal status and sex, Other debtors, Present residence since, Property, Age in years, Other installment plans, Housing, Number of existing credits, Job, Number of people providing maintenance for, Telephone, Foreign worker, Credit risk

The Properties pane looks like this:

TIP

If you want to verify the column headings, run the experiment (click RUN below the experiment canvas). When it finishes running (a green check mark appears on Edit Metadata), click the output port of the Edit Metadata module, and select Visualize. You can view the output of any module in the same way to view the progress of the data through the experiment.

Create training and test datasets

We need some data to train the model and some to test it. So in the next step of the experiment, we split the

dataset into two separate datasets: one for training our model and one for testing it.

To do this, we use the Split Data module.

1. Find the Split Data module, drag it onto the canvas, and connect it to the Edit Metadata module.

2. By default, the split ratio is 0.5 and the Randomized split parameter is set. This means that a random half of the data is output through one port of the Split Data module, and half through the other. You can adjust these parameters, as well as the Random seed parameter, to change the split between training and testing data. For this example, we leave them as-is.

TIP

The property Fraction of rows in the first output dataset determines how much of the data is output through the left output port. For instance, if you set the ratio to 0.7, then 70% of the data is output through the left port and 30% through the right port.

3. Double-click the Split Data module and enter the comment, "Training/testing data split 50%".

We can use the outputs of the Split Data module however we like, but let's choose to use the left output as

training data and the right output as testing data.

As mentioned in the previous step, the cost of misclassifying a high credit risk as low is five times higher than the

cost of misclassifying a low credit risk as high. To account for this, we generate a new dataset that reflects this cost

function. In the new dataset, each high risk example is replicated five times, while each low risk example is not

replicated.

We can do this replication using R code:

1. Find and drag the Execute R Script module onto the experiment canvas.

2. Connect the left output port of the Split Data module to the first input port ("Dataset1") of the Execute R Script module.

3. Double-click the Execute R Script module and enter the comment, "Set cost adjustment".

4. In the Properties pane, delete the default text in the R Script parameter and enter this script:

# Read the dataset connected to the first input port.
dataset1 <- maml.mapInputPort(1)
# Start with the low-risk rows (column 21 equals 1), kept once.
data.set <- dataset1[dataset1[, 21] == 1, ]
# Collect the high-risk rows (column 21 equals 2).
pos <- dataset1[dataset1[, 21] == 2, ]
# Append the high-risk rows five times.
for (i in 1:5) data.set <- rbind(data.set, pos)
# Send the result to the module's output port.
maml.mapOutputPort("data.set")

We need to do this same replication operation for each output of the Split Data module so that the training and

testing data have the same cost adjustment. The easiest way to do this is by duplicating the Execute R Script

module we just made and connecting it to the other output port of the Split Data module.

1. Right-click the Execute R Script module and select Copy.

2. Right-click the experiment canvas and select Paste.

3. Drag the new module into position, and then connect the right output port of the Split Data module to the

first input port of this new Execute R Script module.

4. At the bottom of the canvas, click Run.

TIP

The copy of the Execute R Script module contains the same script as the original module. When you copy and paste a module on the canvas, the copy retains all the properties of the original.

Our experiment now looks something like this:

For more information on using R scripts in your experiments, see Extend your experiment with R.
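For readers who think more easily in Python, the replication performed by the R script in this step can be sketched as below. It is an illustration only and is not meant to be pasted into the Execute R Script module. The function name and the short sample rows are made up; the defaults mirror the script, where column 21 (index 20 when counting from zero) holds 1 for low risk and 2 for high risk.

```python
def replicate_high_risk(rows, label_index=20, low_risk=1, high_risk=2, copies=5):
    """Return the low-risk rows once, followed by `copies` copies of the high-risk rows."""
    low = [row for row in rows if row[label_index] == low_risk]
    high = [row for row in rows if row[label_index] == high_risk]
    return low + high * copies


# Made-up two-column rows (name, credit risk), so the label is at index 1 here.
rows = [["a", 1], ["b", 2], ["c", 1]]
adjusted = replicate_high_risk(rows, label_index=1)

print(len(adjusted))  # prints 7: two low-risk rows plus five copies of the high-risk row
```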

Next: Train and evaluate the models

Walkthrough Step 4: Train and evaluate the predictive analytic models
3/21/2018 • 8 min to read
This topic contains the fourth step of the walkthrough, Develop a predictive analytics solution in Azure Machine
Learning
1.  Create a Machine Learning workspace
2.  Upload existing data
3.  Create a new experiment
4.  Train and evaluate the models
5.  Deploy the Web service
6.  Access the Web service
One of the benefits of using Azure Machine Learning Studio for creating machine learning models is the ability to
try more than one type of model at a time in a single experiment and compare the results. This type of
experimentation helps you find the best solution for your problem.
In the experiment we're developing in this walkthrough, we'll create two different types of models and then
compare their scoring results to decide which algorithm we want to use in our final experiment.

Train the models
There are various models we could choose from. To see the models available, expand the Machine Learning
node in the module palette, and then expand Initialize Model and the nodes beneath it. For the purposes of this
experiment, we'll select the Two-Class Support Vector Machine (SVM) and the Two-Class Boosted Decision Tree
modules.
TIP
To get help deciding which Machine Learning algorithm best suits the particular problem you're trying to solve, see How to
choose algorithms for Microsoft Azure Machine Learning.
We'll add both the Two-Class Boosted Decision Tree module and Two-Class Support Vector Machine module in
this experiment.

Two-Class Boosted Decision Tree
First, let's set up the boosted decision tree model.
1.  Find the Two-Class Boosted Decision Tree module in the module palette and drag it onto the canvas.
2.  Find the Train Model module, drag it onto the canvas, and then connect the output of the Two-Class
Boosted Decision Tree module to the left input port of the Train Model module.
The Two-Class Boosted Decision Tree module initializes the generic model, and Train Model uses training
data to train the model.
3.  Connect the left output of the left Execute R Script module to the right input port of the Train Model
module (we decided in Step 3 of this walkthrough to use the data coming from the left side of the Split
Data module for training).

TIP: We don't need two of the inputs and one of the outputs of the Execute R Script module for this experiment, so we
can leave them unattached.

This portion of the experiment now looks something like this:

Now we need to tell the Train Model module that we want the model to predict the Credit Risk value.
1.  Select the Train Model module. In the Properties pane, click Launch column selector.
2.  In the Select a single column dialog, type "credit risk" in the search field under Available Columns,
select "Credit risk" below, and click the right arrow button (>) to move "Credit risk" to Selected Columns.
3.  Click the OK check mark.

Two-Class Support Vector Machine
Next, we set up the SVM model.

First, a little explanation about SVM. Boosted decision trees work well with features of any type. However, since
the SVM module generates a linear classifier, the model that it generates has the best test error when all numeric
features have the same scale. To convert all numeric features to the same scale, we use a "Tanh" transformation
(with the Normalize Data module). This transforms our numbers into the [0,1] range. The SVM module converts
string features to categorical features and then to binary 0/1 features, so we don't need to manually transform
string features. Also, we don't want to transform the Credit Risk column (column 21) - it's numeric, but it's the
value we're training the model to predict, so we need to leave it alone.
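To make the idea of "same scale" a little more concrete, here's a minimal Python sketch. The formula is an assumption: this walkthrough doesn't say exactly what the Normalize Data module computes, so the sketch uses a common tanh-estimator form, 0.5 * (tanh(0.01 * (x - mean) / std) + 1), which lands in the (0, 1) interval. The rows and the two feature column names are made up for illustration; only "Credit risk" comes from the walkthrough, and it's skipped just as the text describes.

```python
import math
import statistics

def tanh_normalize(values, scale=0.01):
    """Rescale a numeric column into (0, 1) with a tanh estimator.

    Assumed formula (not taken from the Normalize Data module):
    0.5 * (tanh(scale * (x - mean) / std) + 1)
    """
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    if std == 0:
        # A constant column carries no scale information; map it to the midpoint.
        return [0.5 for _ in values]
    return [0.5 * (math.tanh(scale * (x - mean) / std) + 1.0) for x in values]

# Hypothetical rows: two numeric features plus the numeric label column.
rows = [
    {"Duration in months": 6, "Credit amount": 1169, "Credit risk": 1},
    {"Duration in months": 48, "Credit amount": 5951, "Credit risk": 2},
    {"Duration in months": 12, "Credit amount": 2096, "Credit risk": 1},
]
label = "Credit risk"

normalized = [dict(row) for row in rows]
for column in rows[0]:
    if column == label:
        continue  # leave the value we're predicting untouched
    scaled = tanh_normalize([row[column] for row in rows])
    for row, value in zip(normalized, scaled):
        row[column] = value

for row in normalized:
    print(row)
```

After this runs, both feature columns sit on the same scale regardless of their original units, while the label column keeps its original values.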
To set up the SVM model, do the following:
1.  Find the Two-Class Support Vector Machine module in the module palette and drag it onto the canvas.
2.  Right-click the Train Model module, select Copy, and then right-click the canvas and select Paste. The copy
of the Train Model module has the same column selection as the original.
3.  Connect the output of the Two-Class Support Vector Machine module to the left input port of the second
Train Model module.
4.  Find the Normalize Data module and drag it onto the canvas.
5.  Connect the left output of the left Execute R Script module to the input of this module (notice that the
output port of a module may be connected to more than one other module).
6.  Connect the left output port of the Normalize Data module to the right input port of the second Train
Model module.
This portion of our experiment should now look something like this:
Now configure the Normalize Data module:
1.  Click to select the Normalize Data module. In the Properties pane, select Tanh for the Transformation
method parameter.
2.  Click Launch column selector, select "No columns" for Begin With, select Include in the first
dropdown, select column type in the second dropdown, and select Numeric in the third dropdown. This
specifies that all the numeric columns (and only numeric) are transformed.
3.  Click the plus sign (+) to the right of this row - this creates a row of dropdowns. Select Exclude in the first
dropdown, select column names in the second dropdown, and enter "Credit risk" in the text field. This
specifies that the Credit Risk column should be ignored (we need to do this because this column is
numeric and so would be transformed if we didn't exclude it).
4.  Click the OK check mark.
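The column selector rules above (begin with no columns, include every numeric column, then exclude "Credit risk" by name) can be sketched in a few lines of Python. This is only an illustration of the selection logic, not how Studio implements it; the schema dictionary and the feature column names are hypothetical.

```python
def select_columns(schema, include_type, exclude_names):
    """Start from no columns, include by column type, then exclude by name."""
    selected = [name for name, col_type in schema.items() if col_type == include_type]
    return [name for name in selected if name not in exclude_names]

# Hypothetical schema: column name -> column type.
schema = {
    "Status of checking account": "string",
    "Duration in months": "numeric",
    "Credit amount": "numeric",
    "Credit risk": "numeric",
}

to_transform = select_columns(schema, include_type="numeric", exclude_names={"Credit risk"})
print(to_transform)
```

Without the exclude rule, "Credit risk" would be picked up by the numeric rule and transformed along with the features.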

The Normalize Data module is now set to perform a Tanh transformation on all numeric columns except for the
Credit Risk column.

Score and evaluate the models
We use the testing data that was separated out by the Split Data module to score our trained models. We can
then compare the results of the two models to see which generated better results.

Add the Score Model modules
1.  Find the Score Model module and drag it onto the canvas.
2.  Connect the Train Model module that's connected to the Two-Class Boosted Decision Tree module to the
left input port of the Score Model module.
3.  Connect the right Execute R Script module (our testing data) to the right input port of the Score Model
module.
The Score Model module can now take the credit information from the testing data, run it through the
model, and compare the predictions the model generates with the actual credit risk column in the testing
data.
4.  Copy and paste the Score Model module to create a second copy.
5.  Connect the output of the SVM model (that is, the output port of the Train Model module that's connected
to the Two-Class Support Vector Machine module) to the input port of the second Score Model module.
6.  For the SVM model, we have to do the same transformation to the test data as we did to the training data.
So copy and paste the Normalize Data module to create a second copy and connect it to the right Execute
R Script module.
7.  Connect the left output of the second Normalize Data module to the right input port of the second Score
Model module.

Add the Evaluate Model module
To evaluate the two scoring results and compare them, we use an Evaluate Model module.
1.  Find the Evaluate Model module and drag it onto the canvas.
2.  Connect the output port of the Score Model module associated with the boosted decision tree model to
the left input port of the Evaluate Model module.
3.  Connect the other Score Model module to the right input port.

Run the experiment and check the results
To run the experiment, click the RUN button below the canvas. It may take a few minutes. A spinning indicator on
each module shows that it's running, and then a green check mark shows when the module is finished. When all
the modules have a check mark, the experiment has finished running.

The experiment should now look something like this:

To check the results, click the output port of the Evaluate Model module and select Visualize.

The Evaluate Model module produces a pair of curves and metrics that allow you to compare the results of the
two scored models. You can view the results as Receiver Operating Characteristic (ROC) curves, Precision/Recall
curves, or Lift curves. Additional data displayed includes a confusion matrix, cumulative values for the area under
the curve (AUC), and other metrics. You can change the threshold value by moving the slider left or right and see
how it affects the set of metrics.
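If you'd like a feel for what the threshold slider and the AUC value represent, here's a small Python sketch. It isn't the Evaluate Model module's code; it simply counts a confusion matrix at a few thresholds and computes AUC by pairwise ranking, using made-up labels and scores.

```python
def confusion_matrix(labels, scores, threshold=0.5):
    """Count TP, FP, TN, FN when scores >= threshold are called positive."""
    tp = fp = tn = fn = 0
    for label, score in zip(labels, scores):
        predicted = score >= threshold
        if predicted and label:
            tp += 1
        elif predicted and not label:
            fp += 1
        elif not predicted and label:
            fn += 1
        else:
            tn += 1
    return {"tp": tp, "fp": fp, "tn": tn, "fn": fn}

def auc(labels, scores):
    """Area under the ROC curve via pairwise ranking (ties count half)."""
    positives = [s for y, s in zip(labels, scores) if y]
    negatives = [s for y, s in zip(labels, scores) if not y]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

# Hypothetical scored rows: 1 = positive class, paired with a scored probability.
labels = [1, 0, 1, 1, 0, 0]
scores = [0.9, 0.4, 0.65, 0.3, 0.2, 0.7]

for threshold in (0.3, 0.5, 0.7):
    print(threshold, confusion_matrix(labels, scores, threshold))
print("AUC:", auc(labels, scores))
```

Moving the threshold changes the four confusion-matrix counts, but the AUC stays the same because it depends only on how the scores rank positives against negatives.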

To the right of the graph, click Scored dataset or Scored dataset to compare to highlight the associated curve

and to display the associated metrics below. In the legend for the curves, "Scored dataset" corresponds to the left

input port of the Evaluate Model module - in our case, this is the boosted decision tree model. "Scored dataset to

compare" corresponds to the right input port - the SVM model in our case. When you click one of these labels,

the curve for that model is highlighted and the corresponding metrics are displayed, as shown in the following

graphic.

By examining these values, you can decide which model is closest to giving you the results you're looking for. You
can go back and iterate on your experiment by changing parameter values in the different models.

The science and art of interpreting these results and tuning the model performance is outside the scope of this
walkthrough. For additional help, you might read the following articles:
- How to evaluate model performance in Azure Machine Learning
- Choose parameters to optimize your algorithms in Azure Machine Learning
- Interpret model results in Azure Machine Learning

TIP: Each time you run the experiment, a record of that iteration is kept in the Run History. You can view these iterations, and
return to any of them, by clicking VIEW RUN HISTORY below the canvas. You can also click Prior Run in the Properties
pane to return to the iteration immediately preceding the one you have open.
You can make a copy of any iteration of your experiment by clicking SAVE AS below the canvas. Use the experiment's
Summary and Description properties to keep a record of what you've tried in your experiment iterations.
For more details, see Manage experiment iterations in Azure Machine Learning Studio.

Next: Deploy the web service

Walkthrough Step 5: Deploy the Azure Machine Learning web service
4/25/2018 • 9 min to read

This is the fifth step of the walkthrough, Develop a predictive analytics solution in Azure Machine Learning:
1.  Create a Machine Learning workspace
2.  Upload existing data
3.  Create a new experiment
4.  Train and evaluate the models
5.  Deploy the web service
6.  Access the web service

To give others a chance to use the predictive model we've developed in this walkthrough, we can deploy it as a
web service on Azure.

Up to this point we've been experimenting with training our model. But the deployed service is no longer going
to do training - it's going to generate new predictions by scoring the user's input based on our model. So we're
going to do some preparation to convert this experiment from a training experiment to a predictive
experiment.

This is a three-step process:
1.  Remove one of the models
2.  Convert the training experiment we've created into a predictive experiment
3.  Deploy the predictive experiment as a web service

Remove one of the models
First, we need to trim this experiment down a little. We currently have two different models in the experiment,
but we only want to use one model when we deploy this as a web service.

Let's say we've decided that the boosted tree model performed better than the SVM model. So the first thing to

do is remove the Two-Class Support Vector Machine module and the modules that were used for training it. You

may want to make a copy of the experiment first by clicking Save As at the bottom of the experiment canvas.

We need to delete the following modules:
- Two-Class Support Vector Machine
- Train Model and Score Model modules that were connected to it
- Normalize Data (both of them)
- Evaluate Model (because we're finished evaluating the models)

Select each module and press the Delete key, or right-click the module and select Delete.

Our model should now look something like this:

Now we're ready to deploy this model using the Two-Class Boosted Decision Tree.

Convert the training experiment to a predictive experiment
To get this model ready for deployment, we need to convert this training experiment to a predictive experiment.
This involves three steps:
1.  Save the model we've trained and then replace our training modules
2.  Trim the experiment to remove modules that were only needed for training
3.  Define where the web service will accept input and where it generates the output

We could do this manually, but fortunately all three steps can be accomplished by clicking Set Up Web Service
at the bottom of the experiment canvas (and selecting the Predictive Web Service option).

TIP: If you want more details on what happens when you convert a training experiment to a predictive experiment, see How to
prepare your model for deployment in Azure Machine Learning Studio.

When you click Set Up Web Service, several things happen:
- The trained model is converted to a single Trained Model module and stored in the module palette to the
  left of the experiment canvas (you can find it under Trained Models)
- Modules that were used for training are removed; specifically:
    - Two-Class Boosted Decision Tree
    - Train Model
    - Split Data
    - the second Execute R Script module that was used for test data
- The saved trained model is added back into the experiment
- Web service input and Web service output modules are added (these identify where the user's data will
  enter the model, and what data is returned, when the web service is accessed)

NOTE: You can see that the experiment is saved in two parts under tabs that have been added at the top of the experiment
canvas. The original training experiment is under the tab Training experiment, and the newly created predictive
experiment is under Predictive experiment. The predictive experiment is the one we'll deploy as a web service.
We need to take one additional step with this particular experiment. We added two Execute R Script modules to
provide a weighting function to the data. That was just a trick we needed for training and testing, so we can take
out those modules in the final model. Machine Learning Studio removed one Execute R Script module when it
removed the Split Data module. Now we can remove the other and connect Metadata Editor directly to Score Model.
Our experiment should now look like this:

NOTE: You may be wondering why we left the UCI German Credit Card Data dataset in the predictive experiment. The service is
going to score the user's data, not the original dataset, so why leave the original dataset in the model?
It's true that the service doesn't need the original credit card data. But it does need the schema for that data, which
includes information such as how many columns there are and which columns are numeric. This schema information is
necessary to interpret the user's data. We leave these components connected so that the scoring module has the dataset
schema when the service is running. The data isn't used, just the schema.

Run the experiment one last time (click Run). If you want to verify that the model is still working, click the output
of the Score Model module and select View Results. You can see that the original data is displayed, along with
the credit risk value ("Scored Labels") and the scoring probability value ("Scored Probabilities").

Deploy the web service
You can deploy the experiment as either a Classic web service or a New web service that's based on Azure
Resource Manager.

Deploy as a Classic web service
To deploy a Classic web service derived from our experiment, click Deploy Web Service below the canvas and
select Deploy Web Service [Classic]. Machine Learning Studio deploys the experiment as a web service and
takes you to the dashboard for that web service. From this page you can return to the experiment (View
snapshot or View latest) and run a simple test of the web service (see Test the web service below). There is
also information here for creating applications that can access the web service (more on that in the next step of
this walkthrough).

You can configure the service by clicking the CONFIGURATION tab. Here you can modify the service name (it's
given the experiment name by default) and give it a description. You can also give more friendly labels for the
input and output data.

Deploy as a New web service
NOTE: To deploy a New web service you must have sufficient permissions in the subscription to which you are deploying the web
service. For more information, see Manage a web service using the Azure Machine Learning Web Services portal.

To deploy a New web service derived from our experiment:
1.  Click Deploy Web Service below the canvas and select Deploy Web Service [New]. Machine
Learning Studio transfers you to the Azure Machine Learning web services Deploy Experiment page.
2.  Enter a name for the web service.
3.  For Price Plan, you can select an existing pricing plan, or select "Create new" and give the new plan a
name and select the monthly plan option. The plan tiers default to the plans for your default region and
your web service is deployed to that region.
4.  Click Deploy.

After a few minutes, the Quickstart page for your web service opens.

You can configure the service by clicking the Configure tab. Here you can modify the service title and give it a
description.

To test the web service, click the Test tab (see Test the web service below). For information on creating
applications that can access the web service, click the Consume tab (the next step in this walkthrough will go
into more detail).

TIP: You can update the web service after you've deployed it. For example, if you want to change your model, then you can edit
the training experiment, tweak the model parameters, and click Deploy Web Service, selecting Deploy Web Service
[Classic] or Deploy Web Service [New]. When you deploy the experiment again, it replaces the web service, now using
your updated model.

Test the web service

When the web service is accessed, the user's data enters through the Web service input module where it's

passed to the Score Model module and scored. The way we've set up the predictive experiment, the model

expects data in the same format as the original credit risk dataset. The results are returned to the user from the

web service through the Web service output module.

TIP: The way we have the predictive experiment configured, the entire results from the Score Model module are returned. This
includes all the input data plus the credit risk value and the scoring probability. But you can return something different if
you want - for example, you could return just the credit risk value. To do this, insert a Project Columns module between
Score Model and the Web service output to eliminate columns you don't want the web service to return.
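The column trimming described above amounts to keeping a named subset of each scored row. Here's a minimal Python sketch of that idea; it's an illustration only, not the Project Columns module itself. The "Scored Labels" and "Scored Probabilities" names come from this walkthrough, while the input column names and values are hypothetical.

```python
def project_columns(row, keep):
    """Return only the named columns, in the order given."""
    return {name: row[name] for name in keep}

# Hypothetical Score Model output for one applicant: input columns
# plus the two columns that scoring adds.
scored_row = {
    "Duration in months": 12,
    "Credit amount": 2096,
    "Scored Labels": 1,
    "Scored Probabilities": 0.13,
}

response = project_columns(scored_row, keep=["Scored Labels"])
print(response)
```

Returning only the columns a caller needs keeps the response small and avoids echoing the caller's input back to them.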

You can test a Classic web service either in Machine Learning Studio or in the Azure Machine Learning
Web Services portal. You can test a New web service only in the Machine Learning Web Services portal.

TIP: When testing in the Azure Machine Learning Web Services portal, you can have the portal create sample data that you
can use to test the Request-Response service. On the Configure page, select "Yes" for Sample Data Enabled?. When you
open the Request-Response tab on the Test page, the portal fills in sample data taken from the original credit risk dataset.

Test a Classic web service
You can test a Classic web service in Machine Learning Studio or in the Machine Learning Web Services portal.

Test in Machine Learning Studio
1.  On the DASHBOARD page for the web service, click the Test button under Default Endpoint. A dialog
pops up and asks you for the input data for the service. These are the same columns that appeared in the
original credit risk dataset.
2.  Enter a set of data and then click OK.

Test in the Machine Learning Web Services portal
1.  On the DASHBOARD page for the web service, click the Test preview link under Default Endpoint.
The test page in the Azure Machine Learning Web Services portal for the web service endpoint opens
and asks you for the input data for the service. These are the same columns that appeared in the original
credit risk dataset.
2.  Click Test Request-Response.

Test a New web service
You can test a New web service only in the Machine Learning Web Services portal.
1.  In the Azure Machine Learning Web Services portal, click Test at the top of the page. The Test page
opens and you can input data for the service. The input fields displayed correspond to the columns that
appeared in the original credit risk dataset.
2.  Enter a set of data and then click Test Request-Response.

The results of the test are displayed on the right-hand side of the page in the output column.

Manage the web service

Once you've deployed your web service, whether Classic or New, you can manage it from the Microsoft Azure

Machine Learning Web Services portal.

To monitor the performance of your web service:

1. Sign in to the Microsoft Azure Machine Learning Web Services portal

2. Click Web services

3. Click your web service

4. Click the Dashboard

Next: Access the web service

Walkthrough Step 6: Access the Azure Machine Learning web service
3/21/2018 • 1 min to read

This is the last step of the walkthrough, Develop a predictive analytics solution in Azure Machine Learning:
1.  Create a Machine Learning workspace
2.  Upload existing data
3.  Create a new experiment
4.  Train and evaluate the models
5.  Deploy the Web service
6.  Access the Web service

In the previous step in this walkthrough we deployed a web service that uses our credit risk prediction model.
Now users are able to send data to it and receive results.

The Web service is an Azure web service that can receive and return data using REST APIs in one of two ways:
- Request/Response - The user sends one or more rows of credit data to the service by using the HTTP
  protocol, and the service responds with one or more sets of results.
- Batch Execution - The user stores one or more rows of credit data in an Azure blob and then sends the blob
  location to the service. The service scores all the rows of data in the input blob, stores the results in another
  blob, and returns the URL of that container.
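To give a rough idea of what a Request/Response call might look like from code, here's a Python sketch that builds (but doesn't send) an HTTP POST carrying two rows to score. Everything specific in it is an assumption: the URL and API key are placeholders, the column names are hypothetical, and the JSON layout and headers are a guess at a typical shape. The starter code provided with your own web service shows the real request format.

```python
import json
import urllib.request

def build_scoring_request(url, api_key, column_names, rows):
    """Build (but don't send) an HTTP POST carrying rows to score."""
    # Assumed payload layout; check the service's own starter code for the real one.
    body = {
        "Inputs": {"input1": {"ColumnNames": column_names, "Values": rows}},
        "GlobalParameters": {},
    }
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": "Bearer " + api_key,
        },
        method="POST",
    )

request = build_scoring_request(
    url="https://example.invalid/score",  # placeholder endpoint
    api_key="YOUR_API_KEY",               # placeholder key
    column_names=["Duration in months", "Credit amount"],  # hypothetical columns
    rows=[["12", "2096"], ["48", "5951"]],
)
payload = json.loads(request.data)
print(request.get_method(), len(payload["Inputs"]["input1"]["Values"]), "rows")
# Sending it would look like: urllib.request.urlopen(request).read()
```

Batch Execution follows the same overall pattern, except that the request would point at a blob location rather than carrying the rows inline.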

The quickest and easiest way to access a Classic web service is through the Azure ML Request-Response Service

Web App or Azure ML Batch Execution Service Web App Template.

These web app templates can build a custom web app that knows your web service's input data and what it will

return. All you need to do is provide access to your web service and data, and the template does the rest.

For more information on using the web app templates, see Consume an Azure Machine Learning Web service

with a web app template.

You can also develop a custom application to access the web service using starter code provided for you in R, C#,

and Python programming languages.

You can find complete details in How to consume an Azure Machine Learning Web service.

Data Science for Beginners video 1: The 5 questions data science answers
3/20/2018 • 4 min to read

Get a quick introduction to data science from Data Science for Beginners in five short videos from a top data
scientist. These videos are basic but useful, whether you're interested in doing data science or you work with data
scientists.

This first video is about the kinds of questions that data science can answer. To get the most out of the series,
watch them all. Go to the list of videos

Other videos in this series
Data Science for Beginners is a quick introduction to data science taking about 25 minutes total. Check out all five
videos:
- Video 1: The 5 questions data science answers
- Video 2: Is your data ready for data science? (4 min 56 sec)
- Video 3: Ask a question you can answer with data (4 min 17 sec)
- Video 4: Predict an answer with a simple model (7 min 42 sec)
- Video 5: Copy other people's work to do data science (3 min 18 sec)

Transcript: The 5 questions data science answers
Hi! Welcome to the video series Data Science for Beginners.

Data Science can be intimidating, so I'll introduce the basics here without any equations or computer
programming jargon.

In this first video, we'll talk about "The 5 questions data science answers."

Data Science uses numbers and names (also known as categories or labels) to predict answers to questions.

It might surprise you, but there are only five questions that data science answers:
- Is this A or B?
- Is this weird?
- How much – or – How many?
- How is this organized?
- What should I do next?

Each one of these questions is answered by a separate family of machine learning methods, called algorithms.

It's helpful to think about an algorithm as a recipe and your data as the ingredients. An algorithm tells how to
combine and mix the data in order to get an answer. Computers are like a blender. They do most of the hard work
of the algorithm for you and they do it pretty fast.

Question 1: Is this A or B? uses classification algorithms
Let's start with the question: Is this A or B?

This family of algorithms is called two-class classification.

It's useful for any question that has just two possible answers.

For example:
- Will this tire fail in the next 1,000 miles: Yes or no?
- Which brings in more customers: a $5 coupon or a 25% discount?

This question can also be rephrased to include more than two options: Is this A or B or C or D, etc.? This is called
multiclass classification and it's useful when you have several, or several thousand, possible answers.
Multiclass classification chooses the most likely one.

Question 2: Is this weird? uses anomaly detection algorithms
The next question data science can answer is: Is this weird? This question is answered by a family of algorithms
called anomaly detection.

If you have a credit card, you’ve already benefited from anomaly detection. Your credit card company analyzes

your purchase patterns, so that they can alert you to possible fraud. Charges that are "weird" might be a purchase

at a store where you don't normally shop or buying an unusually pricey item.

This question can be useful in lots of ways. For instance:
- If you have a car with pressure gauges, you might want to know: Is this pressure gauge reading normal?
- If you're monitoring the internet, you’d want to know: Is this message from the internet typical?

Anomaly detection flags unexpected or unusual events or behaviors. It gives clues where to look for problems.

Question 3: How much? or How many? uses regression algorithms
Machine learning can also predict the answer to How much? or How many? The algorithm family that answers
this question is called regression.

Regression algorithms make numerical predictions, such as:
- What will the temperature be next Tuesday?
- What will my fourth quarter sales be?

They help answer any question that asks for a number.

Question 4: How is this organized? uses clustering algorithms
Now the last two questions are a bit more advanced.

Sometimes you want to understand the structure of a data set - How is this organized? For this question, you
don’t have examples that you already know outcomes for.

There are a lot of ways to tease out the structure of data. One approach is clustering. It separates data into natural
"clumps," for easier interpretation. With clustering, there is no one right answer.

Common examples of clustering questions are:
- Which viewers like the same types of movies?
- Which printer models fail the same way?

By understanding how data is organized, you can better understand - and predict - behaviors and events.

Question 5: What should I do now? uses reinforcement learning algorithms
The last question – What should I do now? – uses a family of algorithms called reinforcement learning.

Reinforcement learning was inspired by how the brains of rats and humans respond to punishment and rewards.

These algorithms learn from outcomes, and decide on the next action.

Typically, reinforcement learning is a good fit for automated systems that have to make lots of small decisions

without human guidance.

Questions it answers are always about what action should be taken - usually by a machine or a robot. Examples
are:
- If I'm a temperature control system for a house: Adjust the temperature or leave it where it is?
- If I'm a self-driving car: At a yellow light, brake or accelerate?
- For a robot vacuum: Keep vacuuming, or go back to the charging station?

Reinforcement learning algorithms gather data as they go, learning from trial and error.

So that's it - The 5 questions data science can answer.

Next steps
- Try a first data science experiment with Machine Learning Studio
- Get an introduction to Machine Learning on Microsoft Azure

Is your data ready for data science?
3/20/2018 • 4 min to read

Video 2: Data Science for Beginners series
Learn how to evaluate your data to make sure it meets basic criteria to be ready for data science.

To get the most out of the series, watch them all. Go to the list of videos

Other videos in this series
Data Science for Beginners is a quick introduction to data science in five short videos.
- Video 1: The 5 questions data science answers (5 min 14 sec)
- Video 2: Is your data ready for data science?
- Video 3: Ask a question you can answer with data (4 min 17 sec)
- Video 4: Predict an answer with a simple model (7 min 42 sec)
- Video 5: Copy other people's work to do data science (3 min 18 sec)

Transcript: Is your data ready for data science?
Welcome to "Is your data ready for data science?" the second video in the series Data Science for Beginners.

Before data science can give you the answers you want, you have to give it some high-quality raw materials to
work with. Just like making a pizza, the better the ingredients you start with, the better the final product.

Criteria for data
In data science, there are certain ingredients that must be pulled together, including:
- Relevant
- Connected
- Accurate
- Enough to work with

Is your data relevant?
So the first ingredient - you need data that's relevant.

On the left, the table presents the blood alcohol level of seven people tested outside a Boston bar, the Red Sox
batting average in their last game, and the price of milk in the nearest convenience store.

This is all perfectly legitimate data. Its only fault is that it isn’t relevant. There's no obvious relationship between
these numbers. If someone gave you the current price of milk and the Red Sox batting average, there's no way you
could guess their blood alcohol content.

Now look at the table on the right. This time each person’s body mass was measured as well as the number of
drinks they’ve had. The numbers in each row are now relevant to each other. If I gave you my body mass and the
number of Margaritas I've had, you could make a guess at my blood alcohol content.

Do you have connected data?
The next ingredient is connected data.

Here is some relevant data on the quality of hamburgers: grill temperature, patty weight, and rating in the local

food magazine. But notice the gaps in the table on the left.

Most data sets are missing some values. It's common to have holes like this and there are ways to work around

them. But if there's too much missing, your data begins to look like Swiss cheese.

If you look at the table on the left, there's so much missing data, it's hard to come up with any kind of relationship
between grill temperature and patty weight. This example shows disconnected data.

The table on the right, though, is full and complete - an example of connected data.

Is your data accurate?
The next ingredient is accuracy. Here are four targets to hit.

Look at the target in the upper right. There is a tight grouping right around the bull's-eye. That, of course, is
accurate. Oddly, in the language of data science, performance on the target right below it is also considered
accurate.

If you mapped out the center of these arrows, you'd see that it's very close to the bull's-eye. The arrows are spread
out all around the target, so they're considered imprecise, but they're centered around the bull's-eye, so they're
considered accurate.

Now look at the upper-left target. Here the arrows hit very close together, a tight grouping. They're precise, but
they're inaccurate because the center is way off the bull's-eye. The arrows in the bottom-left target are both
inaccurate and imprecise. This archer needs more practice.

Do you have enough data to work with?
Finally, ingredient #4 is sufficient data.

Think of each data point in your table as being a brush stroke in a painting. If you have only a few of them, the
painting can be fuzzy - it's hard to tell what it is.

If you add some more brush strokes, then your painting starts to get a little sharper.

When you have barely enough strokes, you only see enough to make some broad decisions. Is it somewhere I
might want to visit? It looks bright, that looks like clean water – yes, that’s where I’m going on vacation.

As you add more data, the picture becomes clearer and you can make more detailed decisions. Now you can look
at the three hotels on the left bank. You can notice the architectural features of the one in the foreground. You
might even choose to stay on the third floor because of the view.

With data that's relevant, connected, accurate, and enough, you have all the ingredients needed to do some
high-quality data science.

Be sure to check out the other four videos in Data Science for Beginners from Microsoft Azure Machine Learning.

Next steps
- Try a first data science experiment with Machine Learning Studio
- Get an introduction to Machine Learning on Microsoft Azure

Ask a question you can answer with data

3/20/2018 • 4 min to read

Video 3: Data Science for Beginners series

Learn how to formulate a data science problem as a question in Data Science for Beginners video 3. This video includes a comparison of questions for classification and regression algorithms.

To get the most out of the series, watch them all. Go to the list of videos

Other videos in this series

Data Science for Beginners is a quick introduction to data science in five short videos.

Video 1: The 5 questions data science answers (5 min 14 sec)

Video 2: Is your data ready for data science? (4 min 56 sec)

Video 3: Ask a question you can answer with data

Video 4: Predict an answer with a simple model (7 min 42 sec)

Video 5: Copy other people's work to do data science (3 min 18 sec)

Transcript: Ask a question you can answer with data

Welcome to the third video in the series "Data Science for Beginners."

In this one, you'll get some tips for formulating a question you can answer with data.

You might get more out of this video if you first watch the two earlier videos in this series: "The 5 questions data science can answer" and "Is your data ready for data science?"

Ask a sharp question

We've talked about how data science is the process of using names (also called categories or labels) and numbers to predict an answer to a question. But it can't be just any question; it has to be a sharp question.

A vague question doesn't have to be answered with a name or a number. A sharp question must.

Imagine you found a magic lamp with a genie who will truthfully answer any question you ask. But it's a mischievous genie, and he'll try to make his answer as vague and confusing as he can get away with. You want to pin him down with a question so airtight that he can't help but tell you what you want to know.

If you were to ask a vague question, like "What's going to happen with my stock?", the genie might answer, "The price will change". That's a truthful answer, but it's not very helpful.

But if you were to ask a sharp question, like "What will my stock's sale price be next week?", the genie can't help but give you a specific answer and predict a sale price.

Examples of your answer: Target data

Once you formulate your question, check to see whether you have examples of the answer in your data.

If our question is "What will my stock's sale price be next week?" then we have to make sure our data includes the stock price history.

If our question is "Which car in my fleet is going to fail first?" then we have to make sure our data includes information about previous failures.

These examples of answers are called a target. A target is what we are trying to predict about future data points, whether it's a category or a number.

If you don't have any target data, you'll need to get some. You won't be able to answer your question without it.

Reformulate your question

Sometimes you can reword your question to get a more useful answer.

The question "Is this data point A or B?" predicts the category (or name or label) of something. To answer it, we

use a classification algorithm.

The question "How much?" or "How many?" predicts an amount. To answer it we use a regression algorithm.

To see how we can transform these, let's look at the question, "Which news story is the most interesting to this

reader?" It asks for a prediction of a single choice from many possibilities - in other words "Is this A or B or C or

D?" - and would use a classification algorithm.

But, this question may be easier to answer if you reword it as "How interesting is each story on this list to this

reader?" Now you can give each article a numerical score, and then it's easy to identify the highest-scoring article.

This is a rephrasing of the classification question into a regression question or How much?
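As a small illustration of that rewording, the R sketch below starts from one numerical score per story and recovers the single "most interesting" story by ranking. The story names and scores are invented; in practice they would come from whatever regression model you trained.

```r
# Hypothetical regression output: one predicted interest score per story
scores <- c(politics = 0.42, sports = 0.77, science = 0.91, weather = 0.15)

# "How interesting is each story?" yields numbers; sorting them answers
# "Which story is the most interesting?"
ranked <- sort(scores, decreasing = TRUE)
top_story <- names(scores)[which.max(scores)]

print(ranked)
print(top_story)
```

A side benefit of the regression framing is that the full ranking is available, not just the winner.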

How you ask a question is a clue to which algorithm can give you an answer.

You'll find that certain families of algorithms - like the ones in our news story example - are closely related. You can reformulate your question to use the algorithm that gives you the most useful answer.

But, most important, ask that sharp question - the question that you can answer with data. And be sure you have the right data to answer it.

We've talked about some basic principles for asking a question you can answer with data.

Be sure to check out the other videos in "Data Science for Beginners" from Microsoft Azure Machine Learning.

Next steps

Try a first data science experiment with Machine Learning Studio

Get an introduction to Machine Learning on Microsoft Azure

Predict an answer with a simple model

3/20/2018 • 5 min to read

Video 4: Data Science for Beginners series

Learn how to create a simple regression model to predict the price of a diamond in Data Science for Beginners video 4. We'll draw a regression model with target data.

To get the most out of the series, watch them all. Go to the list of videos

Other videos in this series

Data Science for Beginners is a quick introduction to data science in five short videos.

Video 1: The 5 questions data science answers (5 min 14 sec)

Video 2: Is your data ready for data science? (4 min 56 sec)

Video 3: Ask a question you can answer with data (4 min 17 sec)

Video 4: Predict an answer with a simple model

Video 5: Copy other people's work to do data science (3 min 18 sec)

Transcript: Predict an answer with a simple model

Welcome to the fourth video in the "Data Science for Beginners" series. In this one, we'll build a simple model and make a prediction.

A model is a simplified story about our data. I'll show you what I mean.

Collect relevant, accurate, connected, enough data

Say I want to shop for a diamond. I have a ring that belonged to my grandmother with a setting for a 1.35 carat

diamond, and I want to get an idea of how much it will cost. I take a notepad and pen into the jewelry store, and I

write down the price of all of the diamonds in the case and how much they weigh in carats. Starting with the first

diamond - it's 1.01 carats and $7,366.

Now I go through and do this for all the other diamonds in the store.

Notice that our list has two columns. Each column has a different attribute - weight in carats and price - and each row is a single data point that represents a single diamond.

We've actually created a small data set here - a table. Notice that it meets our criteria for quality:

The data is relevant - weight is definitely related to price

It's accurate - we double-checked the prices that we wrote down

It's connected - there are no blank spaces in either of these columns

And, as we'll see, it's enough data to answer our question

Ask a sharp question

Now we'll pose our question in a sharp way: "How much will it cost to buy a 1.35 carat diamond?"

Our list doesn't have a 1.35 carat diamond in it, so we'll have to use the rest of our data to get an answer to the question.

Plot the existing data

The first thing we'll do is draw a horizontal number line, called an axis, to chart the weights. The range of the

weights is 0 to 2, so we'll draw a line that covers that range and put ticks for each half carat.

Next we'll draw a vertical axis to record the price and connect it to the horizontal weight axis. This will be in units

of dollars. Now we have a set of coordinate axes.

We're going to take this data now and turn it into a scatter plot. This is a great way to visualize numerical data sets.

For the first data point, we eyeball a vertical line at 1.01 carats. Then, we eyeball a horizontal line at $7,366. Where

they meet, we draw a dot. This represents our first diamond.

Now we go through each diamond on this list and do the same thing. When we're through, this is what we get: a

bunch of dots, one for each diamond.

Draw the model through the data points

Now if you look at the dots and squint, the collection looks like a fat, fuzzy line. We can take our marker and draw

a straight line through it.

By drawing a line, we created a model. Think of this as taking the real world and making a simplistic cartoon

version of it. Now the cartoon is wrong - the line doesn't go through all the data points. But, it's a useful

simplification.

It's OK that the line doesn't pass exactly through every dot. Data scientists explain this by saying that there's the model - that's the line - and then each dot has some noise or variance associated with it. There's the underlying perfect relationship, and then there's the gritty, real world that adds noise and uncertainty.

Because we're trying to answer the question How much? this is called a regression. And because we're using a straight line, it's a linear regression.

Use the model to find the answer

Now we have a model and we ask it our question: How much will a 1.35 carat diamond cost?

To answer our question, we eyeball 1.35 carats and draw a vertical line. Where it crosses the model line, we eyeball

a horizontal line to the dollar axis. It hits right at 10,000. Boom! That's the answer: A 1.35 carat diamond costs

about $10,000.
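For readers who would like to see the same steps with a computer, here is a minimal R sketch that fits a straight line and reads off a price for 1.35 carats. Only the first row (1.01 carats, $7,366) comes from the transcript; every other value is invented for illustration, so the number it prints should not be expected to match the hand-drawn answer.

```r
# Hypothetical price list; only the first row is taken from the transcript
diamonds <- data.frame(
  carats = c(1.01, 0.75, 1.20, 1.50, 0.90, 1.75, 1.10, 1.60),
  price  = c(7366, 5200, 8900, 11500, 6400, 13200, 8100, 12000)
)

# Fit a straight line: price = intercept + slope * carats
model <- lm(price ~ carats, data = diamonds)

# Ask the model the sharp question: how much for 1.35 carats?
estimate <- predict(model, newdata = data.frame(carats = 1.35))

print(coef(model))
print(estimate)
```

Here `lm()` plays the role of the marker and `predict()` plays the role of the two eyeballed lines.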

Create a confidence interval

It's natural to wonder how precise this prediction is. It's useful to know whether the 1.35 carat diamond will be

very close to $10,000, or a lot higher or lower. To figure this out, let's draw an envelope around the regression line

that includes most of the dots. This envelope is called our confidence interval: We're pretty confident that prices

fall within this envelope, because in the past most of them have. We can draw two more horizontal lines from

where the 1.35 carat line crosses the top and the bottom of that envelope.
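If you wanted a computed counterpart to this hand-drawn envelope, one candidate in R is a prediction interval from `predict()`. Treat the sketch below as an assumption about which statistical interval best matches the drawing, not as what the video does. The data is made up except for the first row.

```r
# Hypothetical price list; only the first row is taken from the transcript
diamonds <- data.frame(
  carats = c(1.01, 0.75, 1.20, 1.50, 0.90, 1.75, 1.10, 1.60),
  price  = c(7366, 5200, 8900, 11500, 6400, 13200, 8100, 12000)
)

model <- lm(price ~ carats, data = diamonds)

# Lower and upper bounds for the price of a single new 1.35 carat diamond
band <- predict(model,
                newdata  = data.frame(carats = 1.35),
                interval = "prediction",
                level    = 0.95)

print(band)  # columns: fit (the line), lwr and upr (the envelope)
```

The `level` argument controls how much of the scatter the band is meant to cover; 0.95 is a common default rather than a value taken from the video.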

Now we can say something about our confidence interval: We can say confidently that the price of a 1.35 carat diamond is about $10,000 - but it might be as low as $8,000 and it might be as high as $12,000.

We're done, with no math or computers

We did what data scientists get paid to do, and we did it just by drawing:

We asked a question that we could answer with data

We built a model using linear regression

We made a prediction, complete with a confidence interval

And we didn't use math or computers to do it.

Now if we'd had more information, like...

the cut of the diamond

color variations (how close the diamond is to being white)

the number of inclusions in the diamond

...then we would have had more columns. In that case, math becomes helpful. If you have more than two columns, it's hard to draw dots on paper. The math lets you fit that line or that plane to your data very nicely.

Also, if instead of just a handful of diamonds, we had two thousand or two million, then we could do that work much faster with a computer.

Today, we've talked about how to do linear regression, and we made a prediction using data.

Be sure to check out the other videos in "Data Science for Beginners" from Microsoft Azure Machine Learning.

Next steps

Try a first data science experiment with Machine Learning Studio

Get an introduction to Machine Learning on Microsoft Azure

Copy other people's work to do data science

3/20/2018 • 3 min to read

Video 5: Data Science for Beginners series

One of the trade secrets of data science is getting other people to do your work for you. Find a clustering algorithm example in Azure AI Gallery to use for your own machine learning experiment.

IMPORTANT

Cortana Intelligence Gallery was renamed Azure AI Gallery. As a result, text and images in this transcript vary slightly from the video, which uses the former name.

To get the most out of the series, watch them all. Go to the list of videos

Other videos in this series

Data Science for Beginners is a quick introduction to data science in five short videos.

Video 1: The 5 questions data science answers (5 min 14 sec)

Video 2: Is your data ready for data science? (4 min 56 sec)

Video 3: Ask a question you can answer with data (4 min 17 sec)

Video 4: Predict an answer with a simple model (7 min 42 sec)

Video 5: Copy other people's work to do data science

Transcript: Copy other people's work to do data science

Welcome to the fifth video in the series "Data Science for Beginners."

In this one, you’ll discover a place to find examples that you can borrow from as a starting point for your own work. You might get the most out of this video if you first watch the earlier videos in this series.

One of the trade secrets of data science is getting other people to do your work for you.

Find examples in the Azure AI Gallery

Microsoft has a cloud-based service called Azure Machine Learning Studio that you're welcome to try for free. It

provides you with a workspace where you can experiment with different machine learning algorithms, and, when

you've got your solution worked out, you can launch it as a web service.

Part of this service is something called the Azure AI Gallery. It contains resources, including a collection of Azure

Machine Learning experiments, or models, that people have built and contributed for others to use. These

experiments are a great way to leverage the thought and hard work of others to get you started on your own

solutions. Everyone is welcome to browse through it.

Find and use a clustering algorithm example

If you click Experiments at the top, you'll see a number of the most recent and popular experiments in the

gallery. You can search through the rest of experiments by clicking Browse All at the top of the screen, and there

you can enter search terms and choose search filters.

So, for instance, let's say you want to see an example of how clustering works, so you search for "clustering

sweep" experiments.

Here's an interesting one that someone contributed to the gallery.

Click on that experiment and you get a web page that describes the work that this contributor did, along with

some of their results.

Notice the link that says Open in Studio.

I can click on that and it takes me right to Azure Machine Learning Studio. It creates a copy of the experiment and puts it in my own workspace. This includes the contributor's dataset, all the processing that they did, all of the algorithms that they used, and how they saved out the results.

And now I have a starting point. I can swap out their data for my own and do my own tweaking of the model. This gives me a running start, and it lets me build on the work of people who really know what they’re doing.

Find experiments that demonstrate machine learning techniques

There are other experiments in the Azure AI Gallery that were contributed specifically to provide how-to examples

for people new to data science. For instance, there's an experiment in the gallery that demonstrates how to handle

missing values (Methods for handling missing values). It walks you through 15 different ways of substituting

empty values, and talks about the benefits of each method and when to use it.
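To give a flavor of what "substituting empty values" can look like, here is a minimal R sketch of one very simple strategy, filling holes with the median of the observed values. It is an assumed example; whether the gallery experiment includes this exact method is not stated in the transcript, and the numbers are made up.

```r
# Hypothetical column with holes (NA marks a missing value)
weights <- c(0.25, NA, 0.33, 0.50, NA)

# Replace each missing entry with the median of the values we do have
fill_value <- median(weights, na.rm = TRUE)
filled <- ifelse(is.na(weights), fill_value, weights)

print(filled)
```

Median substitution is easy to apply but flattens variation in the column, which is one reason a comparison of several methods is worth reading before choosing one.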

Azure AI Gallery is a place to find working experiments that you can use as a starting point for your own solutions.

Be sure to check out the other videos in "Data Science for Beginners" from Microsoft Azure Machine Learning.

Next steps

Try your first data science experiment with Azure Machine Learning

Get an introduction to Machine Learning on Microsoft Azure

Quickstart tutorial for the R programming language for Azure Machine Learning

3/21/2018 • 53 min to read

Introduction

This quickstart tutorial helps you start extending Azure Machine Learning by using the R programming language. Follow this R programming tutorial to create, test and execute R code within Azure Machine Learning. As you work through the tutorial, you will create a complete forecasting solution by using the R language in Azure Machine Learning.

Microsoft Azure Machine Learning contains many powerful machine learning and data manipulation modules. The powerful R language has been described as the lingua franca of analytics. Happily, analytics and data manipulation in Azure Machine Learning can be extended by using R. This combination provides the scalability and ease of deployment of Azure Machine Learning with the flexibility and deep analytics of R.

NOTE

You can try Azure Machine Learning for free. No credit card or Azure subscription is required. Get started now.

Forecasting and the dataset

Forecasting is a widely employed and quite useful analytical method. Common uses range from predicting sales of seasonal items and determining optimal inventory levels to predicting macroeconomic variables. Forecasting is typically done with time series models.

Time series data is data in which the values have a time index. The time index can be regular, e.g. every month or every minute, or irregular. A time series model is based on time series data. The R programming language contains a flexible framework and extensive analytics for time series data.

In this quickstart guide we will be working with California dairy production and pricing data. This data includes monthly information on the production of several dairy products and the price of milk fat, a benchmark commodity.

The data used in this article, along with R scripts, can be downloaded here. This data was originally synthesized from information available from the University of Wisconsin at http://future.aae.wisc.edu/tab/production.html.

Organization

We will progress through several steps as you learn how to create, test and execute analytics and data

manipulation R code in the Azure Machine Learning environment.

First we will explore the basics of using the R language in the Azure Machine Learning Studio environment.

Then we progress to discussing various aspects of I/O for data, R code and graphics in the Azure Machine

Learning environment.

We will then construct the first part of our forecasting solution by creating code for data cleaning and

transformation.

With our data prepared we will perform an analysis of the correlations between several of the variables in our

dataset.

Finally, we will create a seasonal time series forecasting model for milk production.

Interact with R language in Machine Learning Studio

This section takes you through some basics of interacting with the R programming language in the Machine Learning Studio environment. The R language provides a powerful tool to create customized analytics and data manipulation modules within the Azure Machine Learning environment.

I will use RStudio to develop, test and debug R code on a small scale. This code is then cut and pasted into an Execute R Script module in Machine Learning Studio, ready to run.

The Execute R Script module

Within Machine Learning Studio, R scripts are run within the Execute R Script module. An example of the Execute R Script module in Machine Learning Studio is shown in Figure 1.

Figure 1. The Machine Learning Studio environment showing the Execute R Script module selected.

Referring to Figure 1, let's look at some of the key parts of the Machine Learning Studio environment for working with the Execute R Script module.

The modules in the experiment are shown in the center pane.
The upper part of the right pane contains a window to view and edit your R scripts.
The lower part of the right pane shows some properties of the Execute R Script module. You can view the error and output logs by clicking on the appropriate spots of this pane.

We will, of course, be discussing the Execute R Script module in greater detail in the rest of this document.

When working with complex R functions, I recommend that you edit, test and debug in RStudio. As with any software development, extend your code incrementally and test it on small simple test cases. Then cut and paste your functions into the R script window of the Execute R Script module. This approach allows you to harness both the RStudio integrated development environment (IDE) and the power of Azure Machine Learning.

Execute R code

Any R code in the Execute R Script module will execute when you run the experiment by clicking on the Run button. When execution has completed, a check mark will appear on the Execute R Script icon.

Defensive R coding for Azure Machine Learning
If you are developing R code for, say, a web service by using Azure Machine Learning, you should definitely plan
how your code will deal with an unexpected data input and exceptions. To maintain clarity, I have not included
much in the way of checking or exception handling in most of the code examples shown. However, as we proceed I
will give you several examples of functions by using R's exception handling capability.
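As a first taste of that capability, here is a minimal sketch of a defensive function built on base R's `tryCatch()`. The function name `safe_ratio` and its checks are hypothetical and only illustrate the pattern of catching an error, logging a message and returning a fallback value instead of halting.

```r
# Hypothetical helper: compute a ratio, but fail soft on bad input
safe_ratio <- function(num, den) {
  tryCatch({
    if (!is.numeric(num) || !is.numeric(den)) stop("inputs must be numeric")
    if (any(den == 0)) stop("division by zero")
    num / den
  },
  error = function(e) {
    # Report the problem and return NA instead of stopping the script
    message("safe_ratio failed: ", conditionMessage(e))
    NA_real_
  })
}

print(safe_ratio(10, 4))
print(safe_ratio(10, 0))
print(safe_ratio("a", 2))
```

Whether returning `NA` or re-raising the error is the better choice depends on how the downstream code, or the caller of a web service, should react to bad input.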

If you need a more complete treatment of R exception handling, I recommend you read the applicable sections of the book by Wickham listed in Appendix B - Further Reading.

Debug and test R in Machine Learning Studio

To reiterate, I recommend you test and debug your R code on a small scale in RStudio. However, there are cases where you will need to track down R code problems in the Execute R Script module itself. In addition, it is good practice to check your results in Machine Learning Studio.

Output from the execution of your R code on the Azure Machine Learning platform is found primarily in output.log. Some additional information will be seen in error.log.

If an error occurs in Machine Learning Studio while running your R code, your first course of action should be to look at error.log. This file can contain useful error messages to help you understand and correct your error. To view error.log, click on View error log on the properties pane for the Execute R Script module containing the error.

For example, I ran the following R code, with an undefined variable y, in an Execute R Script module:

x <- 1.0
z <- x + y

This code fails to execute, resulting in an error condition. Clicking on View error log on the properties pane produces the display shown in Figure 2.

Figure 2. Error message pop-up.

It looks like we need to look in output.log to see the R error message. Click on the Execute R Script module and then click on the View output log item on the properties pane to the right. A new browser window opens, and I see the following.

[Critical] Error: Error 0063: The following error occurred during evaluation of R script:
---------- Start of error message from R ----------
object 'y' not found
----------- End of error message from R -----------

This error message contains no surprises and clearly identifies the problem.

To inspect the value of any object in R, you can print these values to the output.log file. The rules for examining

object values are essentially the same as in an interactive R session. For example, if you type a variable name on a

line, the value of the object will be printed to the output.log file.
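The sketch below shows the two usual ways of getting a value into the log, with made-up data. One detail worth knowing from base R is that a bare variable name is only printed automatically at the top level of a script; inside a function you need an explicit `print()` or `str()` call. The helper name `describe` is hypothetical.

```r
x <- c(2.5, 3.5, 4.0)

# At top level, a bare name prints the value, as in an interactive session
x

# Inside a function, nothing is printed unless you ask for it explicitly
describe <- function(v) {
  print(summary(v))   # five-number summary plus mean
  str(v)              # compact view of type and first values
  invisible(mean(v))  # return the mean without printing it again
}

m <- describe(x)
```

Using `str()` on a data frame is often the quickest way to confirm column types after data arrives from an input port.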

Packages in Machine Learning Studio

Azure Machine Learning comes with over 350 preinstalled R language packages. You can use the following code in the Execute R Script module to retrieve a list of the preinstalled packages.

data.set <- data.frame(installed.packages())
maml.mapOutputPort("data.set")

If you don't understand the last line of this code at the moment, read on. In the rest of this document we will extensively discuss using R in the Azure Machine Learning environment.

Introduction to RStudio

RStudio is a widely used IDE for R. I will use RStudio for editing, testing and debugging some of the R code used in this quick start guide. Once R code is tested and ready, you simply cut and paste from the RStudio editor into a Machine Learning Studio Execute R Script module.

If you do not have the R programming language installed on your desktop machine, I recommend you install it now. Free downloads of the open source R language are available at the Comprehensive R Archive Network (CRAN) at http://www.r-project.org/. There are downloads available for Windows, Mac OS, and Linux/UNIX. Choose a nearby mirror and follow the download directions. In addition, CRAN contains a wealth of useful analytics and data manipulation packages.

If you are new to RStudio, you should download and install the desktop version. You can find the RStudio downloads for Windows, Mac OS, and Linux/UNIX at http://www.rstudio.com/products/RStudio/. Follow the directions provided to install RStudio on your desktop machine.

A tutorial introduction to RStudio is available at https://support.rstudio.com/hc/sections/200107586-Using-RStudio.

I provide some additional information on using RStudio in Appendix A.

Get data in and out of the Execute R Script module

In this section we will discuss how you get data into and out of the Execute R Script module. We will review how to handle various data types read into and out of the Execute R Script module.

The complete code for this section is in the zip file you downloaded earlier.

Load and check data in Machine Learning Studio

Load the dataset

We will start by loading the cadairydata.csv file into Azure Machine Learning Studio.

Start your Azure Machine Learning Studio environment.

Click on + NEW at the lower left of your screen and select Dataset.

Select From Local File, and then Browse to select the file.

Make sure you have selected Generic CSV file with header (.csv) as the type for the dataset.

Click the check mark.

After the dataset has been uploaded, you should see the new dataset by clicking on the Datasets tab.

Create an experiment

Now that we have some data in Machine Learning Studio, we need to create an experiment to do the analysis.

Click on + NEW at the lower left and select Experiment, then Blank Experiment.

You can name your experiment by selecting and modifying the Experiment created on ... title at the top of the page. For example, change it to CA Dairy Analysis.

On the left of the experiment page, expand Saved Datasets, and then My Datasets. You should see the cadairydata.csv that you uploaded earlier.

Drag and drop the cadairydata.csv dataset onto the experiment.

In the Search experiment items box on the top of the left pane, type Execute R Script. You will see the module appear in the search list.

Drag and drop the Execute R Script module onto your canvas.

Connect the output of the cadairydata.csv dataset to the leftmost input (Dataset1) of the Execute R Script module.

Don't forget to click on 'Save'!

At this point your experiment should look something like Figure 3.

Figure 3. The CA Dairy Analysis experiment with dataset and Execute R Script module.

Check on the data

Let's have a look at the data we have loaded into our experiment. In the experiment, click on the output of the cadairydata.csv dataset and select visualize. You should see something like Figure 4.

Figure 4. Summary of the cadairydata.csv dataset.

In this view we see a lot of useful information. We can see the first several rows of that dataset. If we select a column, the Statistics section shows more information about the column. For example, the Feature Type row shows us what data types Azure Machine Learning Studio assigned to the column. Having a quick look like this is a good sanity check before we start to do any serious work.

First R script

Let's create a simple first R script to experiment with in Azure Machine Learning Studio. I have created and tested

the following script in RStudio.

## Only one of the following two lines should be used

## If running in Machine Learning Studio, use the first line with maml.mapInputPort()

## If in RStudio, use the second line with read.csv()

cadairydata <- maml.mapInputPort(1)

# cadairydata <- read.csv("cadairydata.csv", header = TRUE, stringsAsFactors = FALSE)

str(cadairydata)

pairs(~ Cotagecheese.Prod + Icecream.Prod + Milk.Prod + N.CA.Fat.Price, data = cadairydata)

## The following line should be executed only when running in

## Azure Machine Learning Studio

maml.mapOutputPort('cadairydata')
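If commenting lines in and out between RStudio and Machine Learning Studio becomes tedious, one possible alternative is a small helper that picks the data source at run time. This is a sketch under an assumption: that `maml.mapInputPort` is visible as an ordinary function inside the Execute R Script module and is absent elsewhere. The helper name `load_input` and the stand-in CSV are hypothetical.

```r
# Hypothetical helper: use the Studio input port when it exists,
# otherwise fall back to a local CSV file
load_input <- function(port = 1, local_csv = "cadairydata.csv") {
  if (exists("maml.mapInputPort", mode = "function")) {
    maml.mapInputPort(port)
  } else {
    read.csv(local_csv, header = TRUE, stringsAsFactors = FALSE)
  }
}

# Demo outside Studio: write a tiny stand-in file and load it
tmp <- tempfile(fileext = ".csv")
write.csv(data.frame(Month = c("Jan", "Feb"), Milk.Prod = c(2.11, 1.93)),
          tmp, row.names = FALSE)
demo <- load_input(local_csv = tmp)
str(demo)
```

The same `exists()` check could guard the `maml.mapOutputPort()` call at the end of a script.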

Now I need to transfer my first R script to Azure Machine Learning Studio. I could simply cut and paste. However, in this case, I will transfer my R script via a zip file.

Data input to the Execute R Script module

Let's have a look at the inputs to the Execute R Script module. In this example we will read the California dairy data into the Execute R Script module.

There are three possible inputs for the Execute R Script module. You may use any one or all of these inputs, depending on your application. It is also perfectly reasonable to use an R script that takes no input at all.

Let's look at each of these inputs, going from left to right. You can see the names of each of the inputs by placing your cursor over the input and reading the tooltip.

Script Bundle

The Script Bundle input allows you to pass the contents of a zip file into the Execute R Script module. You can use one of the following commands to read the contents of the zip file into your R code.

source("src/yourfile.R") # Reads a zipped R script
load("src/yourData.rdata") # Reads a zipped R data file

NOTE

Azure Machine Learning treats files in the zip as if they are in the src/ directory, so you need to prefix your file names with this directory name. For example, if the zip contains the files yourfile.R and yourData.rdata in the root of the zip, you would address these as src/yourfile.R and src/yourData.rdata when using source and load.

We already discussed loading datasets in Load the dataset. Once you have created and tested the R script shown in the previous section, do the following:

1. Save the R script into a .R file. I call my script file "simpleplot.R". Here are the contents.

## Only one of the following two lines should be used
## If running in Machine Learning Studio, use the first line with maml.mapInputPort()
## If in RStudio, use the second line with read.csv()
cadairydata <- maml.mapInputPort(1)
# cadairydata <- read.csv("cadairydata.csv", header = TRUE, stringsAsFactors = FALSE)
str(cadairydata)
pairs(~ Cotagecheese.Prod + Icecream.Prod + Milk.Prod + N.CA.Fat.Price, data = cadairydata)
## The following line should be executed only when running in
## Azure Machine Learning Studio
maml.mapOutputPort('cadairydata')

2. Create a zip file and copy your script into this zip file. On Windows, you can right-click on the file and select Send to, and then Compressed folder. This will create a new zip file containing the "simpleplot.R" file.

3. Add your file to the datasets in Machine Learning Studio, specifying the type as zip. You should now see the zip file in your datasets.

4. Drag and drop the zip file from datasets onto the ML Studio canvas.

5. Connect the output of the zip data icon to the Script Bundle input of the Execute R Script module.

6. Type the source() function with the name of the script inside your zip file into the code window for the Execute R Script module. In my case I typed source("src/simpleplot.R").

7. Make sure you click Save.

Once these steps are complete, the Execute R Script module will execute the R script in the zip file when the experiment is run. At this point your experiment should look something like Figure 5.

Figure 5. Experiment using zipped R script.

Dataset1

You can pass a rectangular table of data to your R code by using the Dataset1 input. In our simple script the maml.mapInputPort(1) function reads the data from port 1. This data is then assigned to a dataframe variable name in your code. In our simple script the first line of code performs the assignment.

cadairydata <- maml.mapInputPort(1)

Execute your experiment by clicking on the Run button. When the execution finishes, click on the Execute R Script module and then click View output log on the properties pane. A new page should appear in your browser showing the contents of the output.log file. When you scroll down you should see something like the following.

[ModuleOutput] InputDataStructure
[ModuleOutput]
[ModuleOutput] {
[ModuleOutput] "InputName":Dataset1
[ModuleOutput] "Rows":228
[ModuleOutput] "Cols":9
[ModuleOutput] "ColumnTypes":System.Int32,3,System.Double,5,System.String,1
[ModuleOutput] }

Farther down the page is more detailed information on the columns, which will look something like the following.

[ModuleOutput] [1] "Loading variable port1..."

[ModuleOutput]

[ModuleOutput] 'data.frame': 228 obs. of 9 variables:

[ModuleOutput]

[ModuleOutput] $ Column 0 : int 1 2 3 4 5 6 7 8 9 10 ...

[ModuleOutput]

[ModuleOutput] $ Year.Month : num 1995 1995 1995 1995 1995 ...

[ModuleOutput]

[ModuleOutput] $ Month.Number : int 1 2 3 4 5 6 7 8 9 10 ...

[ModuleOutput]

[ModuleOutput] $ Year : int 1995 1995 1995 1995 1995 1995 1995 1995 1995 1995 ...

[ModuleOutput]

[ModuleOutput] $ Month : chr "Jan" "Feb" "Mar" "Apr" ...

[ModuleOutput]

[ModuleOutput] $ Cotagecheese.Prod: num 4.37 3.69 4.54 4.28 4.47 ...

[ModuleOutput]

[ModuleOutput] $ Icecream.Prod : num 51.6 56.1 68.5 65.7 73.7 ...

[ModuleOutput]

[ModuleOutput] $ Milk.Prod : num 2.11 1.93 2.16 2.13 2.23 ...

[ModuleOutput]

[ModuleOutput] $ N.CA.Fat.Price : num 0.98 0.892 0.892 0.897 0.897 ...

These results are mostly as expected, with 228 observations and 9 columns in the dataframe. We can see the column names, the R data type and a sample of each column.

NOTE

This same printed output is conveniently available from the R Device output of the Execute R Script module. We will discuss the outputs of the Execute R Script module in the next section.

Dataset2

The behavior of the Dataset2 input is identical to that of Dataset1. Using this input you can pass a second rectangular table of data into your R code. The function maml.mapInputPort(2), with the argument 2, is used to pass this data.

Execute R Script outputs

Output a dataframe

You can output the contents of an R dataframe as a rectangular table through the Result Dataset1 port by using the maml.mapOutputPort() function. In our simple R script this is performed by the following line.

maml.mapOutputPort('cadairydata')

After running the experiment, click on the Result Dataset1 output port and then click on Visualize. You should see something like Figure 6.

Figure 6. The visualization of the output of the California dairy data.

This output looks identical to the input, exactly as we expected.

R Device output

The Device output of the Execute R Script module contains messages and graphics output. Both standard output

and standard error messages from R are sent to the R Device output port.

To view the R Device output, click on the port and then on Visualize. We see the standard output and standard

error from the R script in Figure 7.

Figure 7. Standard output and standard error from the R Device port.

Scrolling down we see the graphics output from our R script in Figure 8.

Figure 8. Graphics output from the R Device port.

Data filtering and transformation

In this section we will perform some basic data filtering and transformation operations on the California dairy data. By the end of this section we will have data in a format suitable for building an analytic model.

More specifically, in this section we will perform several common data cleaning and transformation tasks: type transformation, filtering on dataframes, adding new computed columns, and value transformations. This background should help you deal with the many variations encountered in real-world problems.

The complete R code for this section is available in the zip file you downloaded earlier.

Type transformations

Now that we can read the California dairy data into the R code in the Execute R Script module, we need to ensure

that the data in the columns has the intended type and format.

R is a dynamically typed language, which means that data types are coerced from one to another as required. The

atomic data types in R include numeric, logical and character. The factor type is used to compactly store categorical

data. You can find much more information on data types in the references in Appendix B - Further reading.

When tabular data is read into R from an external source, it is always a good idea to check the resulting types in the

columns. You may want a column of type character, but in many cases this will show up as factor or vice versa. In

other cases a column you think should be numeric is represented by character data, e.g. '1.23' rather than 1.23 as a

floating point number.

Fortunately, it is easy to convert one type to another, as long as mapping is possible. For example, you cannot

convert 'Nevada' into a numeric value, but you can convert it to a factor (categorical variable). As another example,

you can convert a numeric 1 into a character '1' or a factor.
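To make these conversions concrete, here is a small sketch on made-up values (none of them come from the dairy data). It also shows one pitfall that is easy to hit: calling as.numeric() directly on a factor returns the internal level codes, not the labels.

```r
## Character to numeric works when the text looks like a number
nums <- as.numeric(c("1.23", "4.5"))

## 'Nevada' has no numeric meaning, so the result is NA (with a warning)
not.a.number <- suppressWarnings(as.numeric("Nevada"))

## The same text converts cleanly to a factor
states <- as.factor(c("Nevada", "Utah", "Nevada"))

## Numeric to character
one <- as.character(1)

## Pitfall: a factor holding digits converts to its level codes
f <- factor(c("10", "20", "10"))
codes  <- as.numeric(f)                 # level codes
values <- as.numeric(as.character(f))   # the numbers you probably wanted

str(list(nums, not.a.number, states, one, codes, values))
```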

The syntax for any of these conversions is simple: as.datatype(). These type conversion functions include the following.

as.numeric()
as.character()
as.logical()
as.factor()

Looking at the data types of the columns we input in the previous section: all columns are of type numeric, except for the column labeled 'Month', which is of type character. Let's convert this to a factor and test the results.

I have deleted the line that created the scatterplot matrix and added a line converting the 'Month' column to a factor. In my experiment I will just cut and paste the R code into the code window of the Execute R Script Module. You could also update the zip file and upload it to Azure Machine Learning Studio, but this takes several steps.

## Only one of the following two lines should be used
## If running in Machine Learning Studio, use the first line with maml.mapInputPort()
## If in RStudio, use the second line with read.csv()
cadairydata <- maml.mapInputPort(1)
# cadairydata <- read.csv("cadairydata.csv", header = TRUE, stringsAsFactors = FALSE)
## Ensure the coding is consistent and convert column to a factor
cadairydata$Month <- as.factor(cadairydata$Month)
str(cadairydata) # Check the result
## The following line should be executed only when running in
## Azure Machine Learning Studio
maml.mapOutputPort('cadairydata')

Let's execute this code and look at the output log for the R script. The relevant data from the log is shown in Figure 9.

[ModuleOutput] [1] "Loading variable port1..."
[ModuleOutput]
[ModuleOutput] 'data.frame': 228 obs. of 9 variables:
[ModuleOutput]
[ModuleOutput] $ Column 0 : int 1 2 3 4 5 6 7 8 9 10 ...
[ModuleOutput]
[ModuleOutput] $ Year.Month : num 1995 1995 1995 1995 1995 ...
[ModuleOutput]
[ModuleOutput] $ Month.Number : int 1 2 3 4 5 6 7 8 9 10 ...
[ModuleOutput]
[ModuleOutput] $ Year : int 1995 1995 1995 1995 1995 1995 1995 1995 1995 1995 ...
[ModuleOutput]
[ModuleOutput] $ Month : Factor w/ 14 levels "Apr","April",..: 6 5 9 1 11 8 7 3 14 13 ...
[ModuleOutput]
[ModuleOutput] $ Cotagecheese.Prod: num 4.37 3.69 4.54 4.28 4.47 ...
[ModuleOutput]
[ModuleOutput] $ Icecream.Prod : num 51.6 56.1 68.5 65.7 73.7 ...
[ModuleOutput]
[ModuleOutput] $ Milk.Prod : num 2.11 1.93 2.16 2.13 2.23 ...
[ModuleOutput]
[ModuleOutput] $ N.CA.Fat.Price : num 0.98 0.892 0.892 0.897 0.897 ...
[ModuleOutput]
[ModuleOutput] [1] "Saving variable cadairydata ..."
[ModuleOutput]
[ModuleOutput] [1] "Saving the following item(s): .maml.oport1"

Figure 9. Summary of the dataframe with a factor variable.

The type for Month should now say 'Factor w/ 14 levels'. This is a problem since there are only 12 months in the year. You can also check to see that the type in Visualize of the Result Dataset port is 'Categorical'.

The problem is that the 'Month' column has not been coded systematically. In some cases a month is called April and in others it is abbreviated as Apr. We can solve this problem by trimming the string to 3 characters. The line of code now looks like the following:

## Ensure the coding is consistent and convert column to a factor

cadairydata$Month <- as.factor(substr(cadairydata$Month, 1, 3))
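The effect of the trimming can be checked on a small vector before running the whole experiment. This is a sketch on a made-up sample, not the actual 'Month' column. The last line is an optional variation that is not part of the tutorial code: passing the built-in month.abb constant as the levels puts the factor in calendar order instead of alphabetical order.

```r
## Made-up sample mixing full and abbreviated month names
month.raw <- c("Jan", "Feb", "Apr", "April", "Jun", "June")

## Without trimming, each spelling becomes its own level
n.before <- nlevels(as.factor(month.raw))

## After trimming to 3 characters the variants collapse
month.trimmed <- substr(month.raw, 1, 3)
n.after <- nlevels(as.factor(month.trimmed))

## Optional: calendar-ordered levels (assumes English three-letter abbreviations)
month.ordered <- factor(month.trimmed, levels = month.abb)

print(c(before = n.before, after = n.after))
print(levels(month.ordered))
```

Note that trimming only helps when every variant of a month shares its first three letters; differences in letter case would need a separate fix.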

[ModuleOutput] [1] "Loading variable port1..."

[ModuleOutput]

[ModuleOutput] 'data.frame': 228 obs. of 9 variables:

[ModuleOutput]

[ModuleOutput] $ Column 0 : int 1 2 3 4 5 6 7 8 9 10 ...

[ModuleOutput]

[ModuleOutput] $ Year.Month : num 1995 1995 1995 1995 1995 ...

[ModuleOutput]

[ModuleOutput] $ Month.Number : int 1 2 3 4 5 6 7 8 9 10 ...

[ModuleOutput]

[ModuleOutput] $ Year : int 1995 1995 1995 1995 1995 1995 1995 1995 1995 1995 ...

[ModuleOutput]

[ModuleOutput] $ Month : Factor w/ 12 levels "Apr","Aug","Dec",..: 5 4 8 1 9 7 6 2 12 11 ...

[ModuleOutput]

[ModuleOutput] $ Cotagecheese.Prod: num 4.37 3.69 4.54 4.28 4.47 ...

[ModuleOutput]

[ModuleOutput] $ Icecream.Prod : num 51.6 56.1 68.5 65.7 73.7 ...

[ModuleOutput]

[ModuleOutput] $ Milk.Prod : num 2.11 1.93 2.16 2.13 2.23 ...

[ModuleOutput]

[ModuleOutput] $ N.CA.Fat.Price : num 0.98 0.892 0.892 0.897 0.897 ...

[ModuleOutput]

[ModuleOutput] [1] "Saving variable cadairydata ..."

[ModuleOutput]

[ModuleOutput] [1] "Saving the following item(s): .maml.oport1"

Rerun the experiment and view the output log. The expected results are shown in Figure 10.

Figure 10. Summary of the dataframe with correct number of factor levels.

Our factor variable now has the desired 12 levels.

Basic data frame filtering

R dataframes support powerful filtering capabilities. Datasets can be subsetted by using logical filters on either rows or columns. In many cases, complex filter criteria will be required. The references in Appendix B - Further reading contain extensive examples of filtering dataframes.

There is one bit of filtering we should do on our dataset. If you look at the columns in the cadairydata dataframe, you will see two unnecessary columns. The first column just holds a row number, which is not very useful. The second column, Year.Month, contains redundant information. We can easily exclude these columns by using the following R code.

NOTE

From now on in this section, I will just show you the additional code I am adding in the Execute R Script module. I will add each new line before the str() function. I use this function to verify my results in Azure Machine Learning Studio.

I add the following line to my R code in the Execute R Script module.

# Remove two columns we do not need
cadairydata <- cadairydata[, c(-1, -2)]
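As a sketch of the same kind of filtering on a toy dataframe (the column names and values here are invented for illustration), columns can be dropped either by position, as in the tutorial code, or by name. Dropping by name is an alternative I find less fragile if the column order ever changes. The last line shows a logical filter on rows.

```r
## Toy dataframe; names and values are illustrative only
toy <- data.frame(Row.Id = 1:3,
                  Year.Month = c(1995.01, 1995.02, 1995.03),
                  Year = rep(1995L, 3),
                  Milk.Prod = c(2.11, 1.93, 2.16))

## Drop the first two columns by position (negative indices)
by.position <- toy[, c(-1, -2)]

## Drop the same columns by name
by.name <- toy[, !(names(toy) %in% c("Row.Id", "Year.Month"))]

## Keep only the rows that satisfy a logical condition
high.milk <- toy[toy$Milk.Prod > 2, ]

str(by.position)
print(high.milk)
```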

Run this code in your experiment and check the result from the output log. These results are shown in Figure 11.

[ModuleOutput] [1] "Loading variable port1..."

[ModuleOutput]

[ModuleOutput] 'data.frame': 228 obs. of 7 variables:

[ModuleOutput]

[ModuleOutput] $ Month.Number : int 1 2 3 4 5 6 7 8 9 10 ...

[ModuleOutput]

[ModuleOutput] $ Year : int 1995 1995 1995 1995 1995 1995 1995 1995 1995 1995 ...

[ModuleOutput]

[ModuleOutput] $ Month : Factor w/ 12 levels "Apr","Aug","Dec",..: 5 4 8 1 9 7 6 2 12 11 ...

[ModuleOutput]

[ModuleOutput] $ Cotagecheese.Prod: num 4.37 3.69 4.54 4.28 4.47 ...

[ModuleOutput]

[ModuleOutput] $ Icecream.Prod : num 51.6 56.1 68.5 65.7 73.7 ...

[ModuleOutput]

[ModuleOutput] $ Milk.Prod : num 2.11 1.93 2.16 2.13 2.23 ...

[ModuleOutput]

[ModuleOutput] $ N.CA.Fat.Price : num 0.98 0.892 0.892 0.897 0.897 ...

[ModuleOutput]

[ModuleOutput] [1] "Saving variable cadairydata ..."

[ModuleOutput]

[ModuleOutput] [1] "Saving the following item(s): .maml.oport1"

Figure 11. Summary of the dataframe with two columns removed.

Good news! We get the expected results.

Add a new column

To create time series models it will be convenient to have a column containing the months since the start of the time series. We will create a new column 'Month.Count'.

To help organize the code we will create our first simple function, num.month(). We will then apply this function to create a new column in the dataframe. The new code is as follows.

## Create a new column with the month count
## Function to find the number of months from the first
## month of the time series
num.month <- function(Year, Month) {
  ## Find the starting year
  min.year <- min(Year)
  ## Compute the number of months from the start of the time series
  12 * (Year - min.year) + Month - 1
}
## Compute the new column for the dataframe
cadairydata$Month.Count <- num.month(cadairydata$Year, cadairydata$Month.Number)
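A quick way to convince yourself that num.month() behaves as intended is to call it on a few hand-picked dates. The sketch below repeats the function definition so that it runs on its own; the three input dates are made up.

```r
## Same definition as in the tutorial, repeated so this sketch is self-contained
num.month <- function(Year, Month) {
  min.year <- min(Year)
  12 * (Year - min.year) + Month - 1
}

## January 1995, December 1995 and January 1996
counts <- num.month(Year = c(1995, 1995, 1996), Month = c(1, 12, 1))
print(counts)
```

Because the starting year is taken as min(Year), the count is relative to the earliest year in the vector passed in, so the function should be applied to the whole column at once rather than row by row.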

Now run the updated experiment and use the output log to view the results. These results are shown in Figure 12.

[ModuleOutput] [1] "Loading variable port1..."

[ModuleOutput]

[ModuleOutput] 'data.frame': 228 obs. of 8 variables:

[ModuleOutput]

[ModuleOutput] $ Month.Number : int 1 2 3 4 5 6 7 8 9 10 ...

[ModuleOutput]

[ModuleOutput] $ Year : int 1995 1995 1995 1995 1995 1995 1995 1995 1995 1995 ...

[ModuleOutput]

[ModuleOutput] $ Month : Factor w/ 12 levels "Apr","Aug","Dec",..: 5 4 8 1 9 7 6 2 12 11 ...

[ModuleOutput]

[ModuleOutput] $ Cotagecheese.Prod: num 4.37 3.69 4.54 4.28 4.47 ...

[ModuleOutput]

[ModuleOutput] $ Icecream.Prod : num 51.6 56.1 68.5 65.7 73.7 ...

[ModuleOutput]

[ModuleOutput] $ Milk.Prod : num 2.11 1.93 2.16 2.13 2.23 ...

[ModuleOutput]

[ModuleOutput] $ N.CA.Fat.Price : num 0.98 0.892 0.892 0.897 0.897 ...

[ModuleOutput]

[ModuleOutput] $ Month.Count : num 0 1 2 3 4 5 6 7 8 9 ...

[ModuleOutput]

[ModuleOutput] [1] "Saving variable cadairydata ..."

[ModuleOutput]

[ModuleOutput] [1] "Saving the following item(s): .maml.oport1"

Figure 12. Summary of the dataframe with the additional column.

It looks like everything is working. We have the new column with the expected values in our dataframe.

Value transformations

In this section we will perform some simple transformations on the values in some of the columns of our

dataframe. The R language supports nearly arbitrary value transformations. The references in Appendix B -

Further Reading contain extensive examples.

If you look at the values in the summaries of our dataframe you should see something odd here. Is more ice cream

than milk produced in California? No, of course not, as this makes no sense, sad as this fact may be to some of us

ice cream lovers. The units are different. The price is in units of US pounds, milk is in units of 1 M US pounds, ice

cream is in units of 1,000 US gallons, and cottage cheese is in units of 1,000 US pounds. Assuming ice cream

weighs about 6.5 pounds per gallon, we can easily do the multiplication to convert these values so they are all in

equal units of 1,000 pounds.
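The arithmetic can be checked by hand on a single row. In this sketch the four values are the first-row figures printed in the dataframe summaries above, and the multipliers are the ones used in this tutorial's transformation code (1 for cottage cheese, 6.5 for ice cream, 1,000 for milk, and 1,000 for the price column).

```r
## First-row values as printed (rounded) in the str() output
first.row <- c(Cotagecheese.Prod = 4.37, Icecream.Prod = 51.6,
               Milk.Prod = 2.11, N.CA.Fat.Price = 0.98)

## Multipliers in the same column order
multipliers <- c(1.0, 6.5, 1000.0, 1000.0)

## Rescale, then take the log used later for the linear model
scaled <- first.row * multipliers
logged <- log(scaled)

print(scaled)
print(round(logged, 2))
```

After rescaling, milk is far larger than ice cream, which resolves the oddity noted above. Small differences from the log values shown later are expected, because the inputs here are the rounded figures from the printed summary.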

For our forecasting model we use a multiplicative model for trend and seasonal adjustment of this data. A log

transformation allows us to use a linear model, simplifying this process. We can apply the log transformation in the

same function where the multiplier is applied.

In the following code, I define a new function, log.transform(), and apply it to the columns containing the numerical values. The R Map() function is used to apply the log.transform() function to the selected columns of the dataframe. Map() is similar to lapply() but allows for more than one list of arguments to the function. Note that a list of multipliers supplies the second argument to the log.transform() function. The na.omit() function is used as a bit of cleanup to ensure we do not have missing or undefined values in the dataframe.

log.transform <- function(invec, multiplier = 1) {

## Function for the transformation, which is the log

## of the input value times a multiplier

  warningmessages <- c("ERROR: Non-numeric argument encountered in function log.transform",
                       "ERROR: Arguments to function log.transform must be greater than zero",
                       "ERROR: Argument multiplier to function log.transform must be a scalar",
                       "ERROR: Invalid time series value encountered in function log.transform"
  )
  ## Check the input arguments
  if(!is.numeric(invec) || !is.numeric(multiplier)) {warning(warningmessages[1]); return(NA)}
  ## Zero is rejected as well, since log(0) is -Inf
  if(any(invec <= 0.0) || any(multiplier <= 0.0)) {warning(warningmessages[2]); return(NA)}
  if(length(multiplier) != 1) {warning(warningmessages[3]); return(NA)}

## Wrap the multiplication in tryCatch

## If there is an exception, print the warningmessage to

## standard error and return NA

tryCatch(log(multiplier * invec),

error = function(e){warning(warningmessages[4]); NA})

}

## Apply the transformation function to the 4 columns

## of the dataframe with production data

multipliers <- list(1.0, 6.5, 1000.0, 1000.0)

cadairydata[, 4:7] <- Map(log.transform, cadairydata[, 4:7], multipliers)

## Get rid of any rows with NA values

cadairydata <- na.omit(cadairydata)
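The way Map() pairs each column with its own multiplier is easier to see on a tiny example. The sketch below uses a stripped-down helper, scale.then.log(), which is a hypothetical stand-in for log.transform() without the argument checks, and a made-up two-column dataframe.

```r
## Hypothetical, check-free stand-in for log.transform()
scale.then.log <- function(x, multiplier = 1) log(multiplier * x)

## Made-up data; the NA shows what na.omit() removes afterwards
toy <- data.frame(a = c(1, 10, 100), b = c(2, 20, NA))

## Map() walks the columns and the multipliers in parallel:
## column a gets multiplier 1, column b gets multiplier 1000
toy[, 1:2] <- Map(scale.then.log, toy[, 1:2], list(1, 1000))

## Drop any row that still contains a missing value
cleaned <- na.omit(toy)

print(toy)
print(nrow(cleaned))
```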

There is quite a bit happening in the log.transform() function. Most of this code is checking for potential

problems with the arguments or dealing with exceptions, which can still arise during the computations. Only a few

lines of this code actually do the computations.

The goal of defensive programming is to keep the failure of a single function from halting processing altogether. An abrupt failure of a long-running analysis can be quite frustrating for users. To avoid this situation, default return values must be chosen that will limit damage to downstream processing. A message is also produced to alert users that something has gone wrong.

If you are not used to defensive programming in R, all this code may seem a bit overwhelming. I will walk you

through the major steps:

1. A vector of four messages is defined. These messages are used to communicate information about some of the

possible errors and exceptions that can occur with this code.

2. I return a value of NA for each case. There are many other possibilities that might have fewer side effects. I

could return a vector of zeroes, or the original input vector, for example.

3. Checks are run on the arguments to the function. In each case, if an error is detected, a default value is returned

and a message is produced by the warning() function. I am using warning() rather than stop() as the latter

will terminate execution, exactly what I am trying to avoid. Note that I have written this code in a procedural

style, as in this case a functional approach seemed complex and obscure.

4. The log computations are wrapped in tryCatch() so that exceptions will not cause an abrupt halt to processing.

Without tryCatch() most errors raised by R functions result in a stop signal, which does just that.
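The difference between halting and carrying on can be sketched in a few lines. Here safe.log() is a hypothetical helper written only for this illustration; it follows the same pattern as the steps above by converting an error into a warning plus a default return value of NA.

```r
## Hypothetical helper: turn an error into a warning and a default value
safe.log <- function(x) {
  tryCatch(log(x),
           error = function(e) {
             warning("safe.log: could not take the log; returning NA")
             NA
           })
}

inputs <- list(10, "ten", 100)   # the second element is deliberately invalid

## Unprotected: the first error aborts the whole lapply() call
halted <- tryCatch(lapply(inputs, log), error = function(e) "halted")

## Protected: the bad element yields NA and the remaining elements are processed
results <- suppressWarnings(lapply(inputs, safe.log))

print(halted)
str(results)
```

suppressWarnings() is used here only to keep the sketch quiet; in an experiment you would normally leave the warnings visible so that they show up in the output log.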

Execute this R code in your experiment and have a look at the printed output in the output.log file. You will now see

the transformed values of the four columns in the log, as shown in Figure 13.

[ModuleOutput] [1] "Loading variable port1..."

[ModuleOutput]

[ModuleOutput] 'data.frame': 228 obs. of 8 variables:

[ModuleOutput]

[ModuleOutput] $ Month.Number : int 1 2 3 4 5 6 7 8 9 10 ...

[ModuleOutput]

[ModuleOutput] $ Year : int 1995 1995 1995 1995 1995 1995 1995 1995 1995 1995 ...

[ModuleOutput]

[ModuleOutput] $ Month : Factor w/ 12 levels "Apr","Aug","Dec",..: 5 4 8 1 9 7 6 2 12 11 ...

[ModuleOutput]

[ModuleOutput] $ Cotagecheese.Prod: num 1.47 1.31 1.51 1.45 1.5 ...

[ModuleOutput]

[ModuleOutput] $ Icecream.Prod : num 5.82 5.9 6.1 6.06 6.17 ...

[ModuleOutput]

[ModuleOutput] $ Milk.Prod : num 7.66 7.57 7.68 7.66 7.71 ...

[ModuleOutput]

[ModuleOutput] $ N.CA.Fat.Price : num 6.89 6.79 6.79 6.8 6.8 ...

[ModuleOutput]

[ModuleOutput] $ Month.Count : num 0 1 2 3 4 5 6 7 8 9 ...

[ModuleOutput]

[ModuleOutput] [1] "Saving variable cadairydata ..."

[ModuleOutput]

[ModuleOutput] [1] "Saving the following item(s): .maml.oport1"

Figure 13. Summary of the transformed values in the dataframe.

We see the values have been transformed. Milk production now greatly exceeds all other dairy product production, recalling that we are now looking at a log scale.

At this point our data is cleaned up and we are ready for some modeling. Looking at the visualization summary for the Result Dataset output of our Execute R Script module, you will see the 'Month' column is 'Categorical' with 12 unique values, again, just as we wanted.

Time series objects and correlation analysis

In this section we will explore a few basic R time series objects and analyze the correlations between some of the variables. Our goal is to output a dataframe containing the pairwise correlation information at several lags.

The complete R code for this section is in the zip file you downloaded earlier.

Time series objects in R

As already mentioned, time series are a series of data values indexed by time. R time series objects are used to create and manage the time index. There are several advantages to using time series objects. Time series objects free you from the many details of managing the time series index values that are encapsulated in the object. In addition, time series objects allow you to use the many time series methods for plotting, printing, modeling, etc.

The POSIXct time series class is commonly used and is relatively simple. This time series class measures time from the start of the epoch, January 1, 1970. We will use POSIXct time series objects in this example. Other widely used R time series object classes include zoo and xts, extensible time series.

Time series object example

Let's get started with our example. Drag and drop a new Execute R Script module into your experiment. Connect the Result Dataset1 output port of the existing Execute R Script module to the Dataset1 input port of the new Execute R Script module.

As in the earlier examples, at some points I will show only the lines of R code added at each step.

Reading the dataframe

As a first step, let's read in a dataframe and make sure we get the expected results. The following code should do the job.

# Comment the following if using RStudio
cadairydata <- maml.mapInputPort(1)
str(cadairydata) # Check the results

Now, run the experiment. The log of the new Execute R Script module should look like Figure 14.

[ModuleOutput] [1] "Loading variable port1..."
[ModuleOutput]
[ModuleOutput] 'data.frame': 228 obs. of 8 variables:
[ModuleOutput]
[ModuleOutput] $ Month.Number : int 1 2 3 4 5 6 7 8 9 10 ...
[ModuleOutput]
[ModuleOutput] $ Year : int 1995 1995 1995 1995 1995 1995 1995 1995 1995 1995 ...
[ModuleOutput]
[ModuleOutput] $ Month : Factor w/ 12 levels "Apr","Aug","Dec",..: 5 4 8 1 9 7 6 2 12 11 ...
[ModuleOutput]
[ModuleOutput] $ Cotagecheese.Prod: num 1.47 1.31 1.51 1.45 1.5 ...
[ModuleOutput]
[ModuleOutput] $ Icecream.Prod : num 5.82 5.9 6.1 6.06 6.17 ...
[ModuleOutput]
[ModuleOutput] $ Milk.Prod : num 7.66 7.57 7.68 7.66 7.71 ...
[ModuleOutput]
[ModuleOutput] $ N.CA.Fat.Price : num 6.89 6.79 6.79 6.8 6.8 ...
[ModuleOutput]
[ModuleOutput] $ Month.Count : num 0 1 2 3 4 5 6 7 8 9 ...

Figure 14. Summary of the dataframe in the Execute R Script module.

This data is of the expected types and format. Note that the 'Month' column is of type factor and has the expected number of levels.

Creating a time series object

We need to add a time series object to our dataframe. Replace the current code with the following, which adds a new column of class POSIXct.

# Comment the following if using RStudio
cadairydata <- maml.mapInputPort(1)
## Create a new column as a POSIXct object
Sys.setenv(TZ = "PST8PDT")
cadairydata$Time <- as.POSIXct(strptime(paste(as.character(cadairydata$Year), "-",
    as.character(cadairydata$Month.Number), "-01 00:00:00", sep = ""), "%Y-%m-%d %H:%M:%S"))
str(cadairydata) # Check the results
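If you want to experiment with the date construction outside the experiment, the following sketch builds the same kind of column on a made-up three-row dataframe. It differs from the tutorial code in two ways that are my own choices, not the tutorial's: the date string is assembled with sprintf(), and the time zone is passed through the tz argument (the zone name "America/Los_Angeles" is an assumption) instead of being set for the whole session.

```r
## Made-up stand-in for the Year and Month.Number columns
toy <- data.frame(Year = c(1995L, 1995L, 1996L),
                  Month.Number = c(1L, 2L, 1L))

## First day of each month, at midnight, as POSIXct
toy$Time <- as.POSIXct(sprintf("%d-%02d-01 00:00:00", toy$Year, toy$Month.Number),
                       format = "%Y-%m-%d %H:%M:%S",
                       tz = "America/Los_Angeles")   # assumed zone name

## POSIXct stores seconds since the epoch, so it sorts and subtracts like a number
seconds <- as.numeric(toy$Time)

str(toy)
print(diff(toy$Time))
```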

Now, check the log. It should look like Figure 15.

[ModuleOutput] [1] "Loading variable port1..."

[ModuleOutput]

[ModuleOutput] 'data.frame': 228 obs. of 9 variables:

[ModuleOutput]

[ModuleOutput] $ Month.Number : int 1 2 3 4 5 6 7 8 9 10 ...

[ModuleOutput]

[ModuleOutput] $ Year : int 1995 1995 1995 1995 1995 1995 1995 1995 1995 1995 ...

[ModuleOutput]

[ModuleOutput] $ Month : Factor w/ 12 levels "Apr","Aug","Dec",..: 5 4 8 1 9 7 6 2 12 11 ...

[ModuleOutput]

[ModuleOutput] $ Cotagecheese.Prod: num 1.47 1.31 1.51 1.45 1.5 ...

[ModuleOutput]

[ModuleOutput] $ Icecream.Prod : num 5.82 5.9 6.1 6.06 6.17 ...

[ModuleOutput]

[ModuleOutput] $ Milk.Prod : num 7.66 7.57 7.68 7.66 7.71 ...

[ModuleOutput]

[ModuleOutput] $ N.CA.Fat.Price : num 6.89 6.79 6.79 6.8 6.8 ...

[ModuleOutput]

[ModuleOutput] $ Month.Count : num 0 1 2 3 4 5 6 7 8 9 ...

[ModuleOutput]

[ModuleOutput] $ Time : POSIXct, format: "1995-01-01" "1995-02-01" ...

Figure 15. Summary of the dataframe with a time series object.

We can see from the summary that the new column is in fact of class POSIXct.

Exploring and transforming the data

Let's explore some of the variables in this dataset. A scatterplot matrix is a good way to produce a quick look. I am replacing the str() function in the previous R code with the following line.

pairs(~ Cotagecheese.Prod + Icecream.Prod + Milk.Prod + N.CA.Fat.Price, data = cadairydata, main = "Pairwise Scatterplots of dairy time series")

Run this code and see what happens. The plot produced at the R Device port should look like Figure 16.

Figure 16. Scatterplot matrix of selected variables.

There is some odd-looking structure in the relationships between these variables. Perhaps this arises from trends in the data and from the fact that we have not standardized the variables.

Correlation analysis

To perform correlation analysis we need to both de-trend and standardize the variables. We could simply use the R scale() function, which both centers and scales variables. This function might well run faster. However, I want to show you an example of defensive programming in R.

The ts.detrend() function shown below performs both of these operations. The following two lines of code de-trend the data and then standardize the values.

ts.detrend <- function(ts, Time, min.length = 3){
  ## Function to de-trend and standardize a time series
  ## Define the warning messages
  messages <- c('ERROR: ts.detrend requires arguments ts and Time to have the same length',
                'ERROR: ts.detrend requires argument ts to be of type numeric',
                paste('WARNING: ts.detrend has encountered a time series with length less than',
                      as.character(min.length)),
                'ERROR: ts.detrend has encountered a Time argument not of class POSIXct',
                'ERROR: Detrend regression has failed in ts.detrend',
                'ERROR: Exception occurred in ts.detrend while standardizing time series'
  )
  # Create a vector of zeros to return as a default in some cases
  zerovec <- rep(0.0, length(ts))
  # The input arguments are not of the same length, return ts and quit
  if(length(Time) != length(ts)) {warning(messages[1]); return(ts)}
  # If the ts is not numeric, just return a zero vector and quit
  if(!is.numeric(ts)) {warning(messages[2]); return(zerovec)}
  # If the ts is too short, just return it and quit
  if((ts.length <- length(ts)) < min.length) {warning(messages[3]); return(ts)}
  ## Check that the Time argument is of class POSIXct
  if(class(Time)[[1]] != "POSIXct") {warning(messages[4]); return(ts)}
  ## De-trend the time series by using a linear model
  ts.frame <- data.frame(ts = ts, Time = Time)
  tryCatch({ts <- ts - fitted(lm(ts ~ Time, data = ts.frame))},
           error = function(e){warning(messages[5]); zerovec})
  ## Standardize by dividing by the sample standard deviation
  tryCatch( {stdev <- sqrt(sum((ts - mean(ts))^2)/(ts.length - 1))
             ts <- ts/stdev},
           error = function(e){warning(messages[6]); zerovec})
}
## Apply the ts.detrend function to the variables of interest
df.detrend <- data.frame(lapply(cadairydata[, 4:7], ts.detrend, cadairydata$Time))
## Plot the results to look at the relationships
pairs(~ Cotagecheese.Prod + Icecream.Prod + Milk.Prod + N.CA.Fat.Price, data = df.detrend, main = "Pairwise Scatterplots of detrended standardized time series")

There is quite a bit happening in the ts.detrend() function. Most of this code is checking for potential problems with the arguments or dealing with exceptions, which can still arise during the computations. Only a few lines of this code actually do the computations.

We have already discussed an example of defensive programming in Value transformations. Both computation blocks are wrapped in tryCatch(). For some errors it makes sense to return the original input vector, and in other cases, I return a vector of zeros.

Note that the linear regression used for de-trending is a time series regression. The predictor variable is a time series object.

Once ts.detrend() is defined we apply it to the variables of interest in our dataframe. We must coerce the resulting list created by lapply() to a dataframe by using data.frame(). Because of defensive aspects of ts.detrend(), failure to process one of the variables will not prevent correct processing of the others.

The final line of code creates a pairwise scatterplot. After running the R code, the results of the scatterplot are shown in Figure 17.

Figure 17. Pairwise scatterplot of de-trended and standardized time series.

You can compare these results to those shown in Figure 16. With the trend removed and the variables standardized, we see a lot less structure in the relationships between these variables.

The code to compute the correlations as R ccf objects is as follows.

## A function to compute pairwise correlations from a
## list of time series value vectors
pair.cor <- function(pair.ind, ts.list, lag.max = 1, plot = FALSE){
  ccf(ts.list[[pair.ind[1]]], ts.list[[pair.ind[2]]], lag.max = lag.max, plot = plot)
}
## A list of the pairwise indices
corpairs <- list(c(1,2), c(1,3), c(1,4), c(2,3), c(2,4), c(3,4))
## Compute the list of ccf objects
cadairycorrelations <- lapply(corpairs, pair.cor, df.detrend)
cadairycorrelations

Running this code produces the log shown in Figure 18.

[ModuleOutput] Loading objects:

[ModuleOutput] port1

[ModuleOutput] [1] "Loading variable port1..."

[ModuleOutput] [[1]]

[ModuleOutput]

[ModuleOutput] Autocorrelations of series 'X', by lag

[ModuleOutput]

[ModuleOutput] -1 0 1

[ModuleOutput] 0.148 0.358 0.317

[ModuleOutput]

[ModuleOutput] [[2]]

[ModuleOutput]

[ModuleOutput] Autocorrelations of series 'X', by lag

[ModuleOutput]

[ModuleOutput] -1 0 1

[ModuleOutput] -0.395 -0.186 -0.238

[ModuleOutput]

[ModuleOutput] [[3]]

[ModuleOutput]

[ModuleOutput] Autocorrelations of series 'X', by lag

[ModuleOutput]

[ModuleOutput] -1 0 1

[ModuleOutput] -0.059 -0.089 -0.127

[ModuleOutput]

[ModuleOutput] [[4]]

[ModuleOutput]

[ModuleOutput] Autocorrelations of series 'X', by lag

[ModuleOutput]

[ModuleOutput] -1 0 1

[ModuleOutput] 0.140 0.294 0.293

[ModuleOutput]

[ModuleOutput] [[5]]

[ModuleOutput]

[ModuleOutput] Autocorrelations of series 'X', by lag

[ModuleOutput]

[ModuleOutput] -1 0 1

[ModuleOutput] -0.002 -0.074 -0.124

Figure 18. List of ccf objects from the pairwise correlation analysis.

There is a correlation value for each lag. None of these correlation values is large enough to be significant. We can therefore conclude that we can model each variable independently.

Output a dataframe

We have computed the pairwise correlations as a list of R ccf objects. This presents a bit of a problem as the Result

Dataset output port really requires a dataframe. Further, the ccf object is itself a list and we want only the values in

the first element of this list, the correlations at the various lags.

The following code extracts the correlation values at each lag from the list of ccf objects, which are themselves lists.

df.correlations <- data.frame(do.call(rbind, lapply(cadairycorrelations, '[[', 1)))

c.names <- c("correlation pair", "-1 lag", "0 lag", "+1 lag")

r.names <- c("Corr Cot Cheese - Ice Cream",

"Corr Cot Cheese - Milk Prod",

"Corr Cot Cheese - Fat Price",

"Corr Ice Cream - Milk Prod",

"Corr Ice Cream - Fat Price",

"Corr Milk Prod - Fat Price")

## Build a dataframe with the row names column and the

## correlation data frame and assign the column names

outframe <- cbind(r.names, df.correlations)

colnames(outframe) <- c.names

outframe

## WARNING!

## The following line works only in Azure Machine Learning

## When running in RStudio, this code will result in an error

#maml.mapOutputPort('outframe')

The first line of code is a bit tricky, and some explanation may help you understand it. Working from the inside out

we have the following:

1. The '[[' operator with the argument '1' selects the vector of correlations at the lags from the first element of the

ccf object list.

2. The do.call() function applies the rbind() function over the elements of the list returned by lapply().

3. The data.frame() function coerces the result produced by do.call() to a dataframe.

Note that the row names are in a column of the dataframe. Doing so preserves the row names when they are

output from the Execute R Script.
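The extraction idiom described above can be tried on a small stand-in list. This is a minimal sketch only: the series x and y and the names pair.ccf and demo.correlations are hypothetical and are not part of the tutorial's data. It relies on the point made in the text, that the first element of a ccf object holds the correlations at each lag.

```r
## Two short made-up series; these are not the California dairy data
x <- c(1.0, 2.1, 2.9, 4.2, 5.1, 5.8, 7.2, 8.1)
y <- c(2.0, 1.5, 3.1, 3.9, 5.5, 5.9, 6.8, 8.4)

## A list of ccf objects, mimicking the structure of cadairycorrelations
pair.ccf <- list(
  ccf(x, y, lag.max = 1, plot = FALSE),
  ccf(y, x, lag.max = 1, plot = FALSE))

## '[[' with the argument 1 pulls the first element (the correlations)
## out of each ccf object; rbind() stacks them, one row per pair
demo.correlations <- data.frame(do.call(rbind, lapply(pair.ccf, '[[', 1)))
colnames(demo.correlations) <- c("-1 lag", "0 lag", "+1 lag")
demo.correlations
```

The result has one row per ccf object and one column per lag, which is the shape the Result Dataset port needs.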

Running the code produces the output shown in Figure 19 when I Visualize the output at the Result Dataset port.

The row names are in the first column, as intended.

Figure 19. Results output from the correlation analysis.

Time series example: seasonal forecasting

Our data is now in a form suitable for analysis, and we have determined there are no significant correlations between the variables. Let's move on and create a time series forecasting model. Using this model we will forecast California milk production for the 12 months of 2013.

Our forecasting model will have two components, a trend component and a seasonal component. The complete forecast is the product of these two components. This type of model is known as a multiplicative model. The alternative is an additive model. We have already applied a log transformation to the variables of interest, which makes this analysis tractable: on the log scale, the product of the trend and seasonal components becomes a sum, which a linear model can fit.

The complete R code for this section is in the zip file you downloaded earlier.

Creating the dataframe for analysis

Start by adding a new Execute R Script module to your experiment. Connect the Result Dataset output of the existing Execute R Script module to the Dataset1 input of the new module. The result should look something like Figure 20.

Figure 20. The experiment with the new Execute R Script module added.

As with the correlation analysis we just completed, we need to add a column with a POSIXct time series object. The following code will do just this.

# If running in Machine Learning Studio, uncomment the first line with maml.mapInputPort()
cadairydata <- maml.mapInputPort(1)

## Create a new column as a POSIXct object
Sys.setenv(TZ = "PST8PDT")
cadairydata$Time <- as.POSIXct(strptime(paste(as.character(cadairydata$Year), "-",
    as.character(cadairydata$Month.Number), "-01 00:00:00", sep = ""), "%Y-%m-%d %H:%M:%S"))
str(cadairydata)

Run this code and look at the log. The result should look like Figure 21.

[ModuleOutput] [1] "Loading variable port1..."

[ModuleOutput]

[ModuleOutput] 'data.frame': 228 obs. of 9 variables:

[ModuleOutput]

[ModuleOutput] $ Month.Number : int 1 2 3 4 5 6 7 8 9 10 ...

[ModuleOutput]

[ModuleOutput] $ Year : int 1995 1995 1995 1995 1995 1995 1995 1995 1995 1995 ...

[ModuleOutput]

[ModuleOutput] $ Month : Factor w/ 12 levels "Apr","Aug","Dec",..: 5 4 8 1 9 7 6 2 12 11 ...

[ModuleOutput]

[ModuleOutput] $ Cotagecheese.Prod: num 1.47 1.31 1.51 1.45 1.5 ...

[ModuleOutput]

[ModuleOutput] $ Icecream.Prod : num 5.82 5.9 6.1 6.06 6.17 ...

[ModuleOutput]

[ModuleOutput] $ Milk.Prod : num 7.66 7.57 7.68 7.66 7.71 ...

[ModuleOutput]

[ModuleOutput] $ N.CA.Fat.Price : num 6.89 6.79 6.79 6.8 6.8 ...

[ModuleOutput]

[ModuleOutput] $ Month.Count : num 0 1 2 3 4 5 6 7 8 9 ...

[ModuleOutput]

[ModuleOutput] $ Time : POSIXct, format: "1995-01-01" "1995-02-01" ...

Figure 21. A summary of the dataframe.

With this result, we are ready to start our analysis.

Create a training dataset

With the dataframe constructed we need to create a training dataset. This data will include all of the observations except the last 12, of the year 2013, which is our test dataset. The following code subsets the dataframe and creates plots of the four dairy production and price variables. An anonymous function defines some arguments for plot(), and Map() then iterates it over the other two arguments, the columns to plot and their axis labels. If you are thinking that a for loop would have worked fine here, you are correct. But, since R is a functional language I am showing you a functional approach.

cadairytrain <- cadairydata[1:216, ]

Ylabs <- list("Log CA Cottage Cheese Production, 1000s lb",
    "Log CA Ice Cream Production, 1000s lb",
    "Log CA Milk Production 1000s lb",
    "Log North CA Milk Fat Price per 1000 lb")

Map(function(y, Ylabs){plot(cadairytrain$Time, y, xlab = "Time", ylab = Ylabs, type = "l")}, cadairytrain[, 4:7], Ylabs)

Running the code produces the series of time series plots from the R Device output shown in Figure 22. Note that the time axis is in units of dates, a nice benefit of the time series plot method.
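For readers new to Map(), here is a minimal sketch of the same pattern without any plotting. The dataframe demo.df and the labels in demo.labs are made up for illustration. Map() calls the anonymous function once per column, pairing each column with the label in the same position, much as the plotting code pairs the columns of cadairytrain with Ylabs.

```r
## Hypothetical two-column dataframe and matching labels
demo.df <- data.frame(a = c(1, 2, 3), b = c(10, 20, 30))
demo.labs <- list("Series A", "Series B")

## Map() walks the columns of demo.df and the elements of demo.labs in step
summaries <- Map(function(y, lab) paste(lab, "has mean", mean(y)),
                 demo.df, demo.labs)

## The equivalent for loop, for comparison
summaries.loop <- list()
for (i in seq_along(demo.labs)) {
  summaries.loop[[i]] <- paste(demo.labs[[i]], "has mean", mean(demo.df[[i]]))
}
```

Both approaches build the same list of strings; Map() additionally carries over the column names of its first argument as names of the result.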

Figure 22. Time series plots of California dairy production and price data.

A trend model

Having created a time series object and having had a look at the data, let's start to construct a trend model for the

California milk production data. We can do this with a time series regression. However, it is clear from the plot that

we will need more than a slope and intercept to accurately model the observed trend in the training data.

Given the small scale of the data, I will build the model for trend in RStudio and then cut and paste the resulting model into Azure Machine Learning. RStudio provides an interactive environment well suited to this type of analysis.

As a first attempt, I will try a polynomial regression with powers up to 3. There is a real danger of over-fitting these kinds of models. Therefore, it is best to avoid high order terms. The I() function inhibits formula interpretation of its contents (it treats them 'as is'), so that an expression such as Month.Count^2 is evaluated as arithmetic rather than as a formula operator.

milk.lm <- lm(Milk.Prod ~ Time + I(Month.Count^2) + I(Month.Count^3), data = cadairytrain)
summary(milk.lm)

This generates the following.

## Call:
## lm(formula = Milk.Prod ~ Time + I(Month.Count^2) + I(Month.Count^3),
##     data = cadairytrain)
## Residuals:
##      Min       1Q   Median       3Q      Max
## -0.12667 -0.02730  0.00236  0.02943  0.10586
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)
## (Intercept)       6.33e+00   1.45e-01   43.60   <2e-16 ***
## Time              1.63e-09   1.72e-10    9.47   <2e-16 ***
## I(Month.Count^2) -1.71e-06   4.89e-06   -0.35    0.726
## I(Month.Count^3) -3.24e-08   1.49e-08   -2.17    0.031 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Residual standard error: 0.0418 on 212 degrees of freedom
## Multiple R-squared: 0.941, Adjusted R-squared: 0.94
## F-statistic: 1.12e+03 on 3 and 212 DF, p-value: <2e-16

From the P values (Pr(>|t|)) in this output, we can see that the squared term may not be significant. I will use the update() function to modify this model by dropping the squared term.

milk.lm <- update(milk.lm, . ~ . - I(Month.Count^2))
summary(milk.lm)

This generates the following.

## Call:
## lm(formula = Milk.Prod ~ Time + I(Month.Count^3), data = cadairytrain)
## Residuals:
##      Min       1Q   Median       3Q      Max
## -0.12597 -0.02659  0.00185  0.02963  0.10696
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)
## (Intercept)       6.38e+00   4.07e-02   156.6   <2e-16 ***
## Time              1.57e-09   4.32e-11    36.3   <2e-16 ***
## I(Month.Count^3) -3.76e-08   2.50e-09   -15.1   <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Residual standard error: 0.0417 on 213 degrees of freedom
## Multiple R-squared: 0.941, Adjusted R-squared: 0.94
## F-statistic: 1.69e+03 on 2 and 213 DF, p-value: <2e-16

This looks better. All of the terms are significant. However, the <2e-16 value is simply the smallest P value that R prints, and should not be taken too seriously.

As a sanity test, let's make a time series plot of the California dairy production data with the trend curve shown. I have added the following code in the Azure Machine Learning Execute R Script module (not RStudio) to create the model and make a plot. The result is shown in Figure 23.

milk.lm <- lm(Milk.Prod ~ Time + I(Month.Count^3), data = cadairytrain)
plot(cadairytrain$Time, cadairytrain$Milk.Prod, xlab = "Time", ylab = "Log CA Milk Production 1000s lb", type = "l")
lines(cadairytrain$Time, predict(milk.lm, cadairytrain), lty = 2, col = 2)

Figure 23. California milk production data with trend model shown.

It looks like the trend model fits the data fairly well. Further, there does not seem to be evidence of over-fitting, such as odd wiggles in the model curve.

Seasonal model

With a trend model in hand, we need to push on and include the seasonal effects. We will use the month of the year as a dummy variable in the linear model to capture the month-by-month effect. Note that when you introduce factor variables into a model, the intercept must not be computed. If you keep the intercept, the formula is over-specified and R will drop one of the desired factor levels but keep the intercept term.

Since we have a satisfactory trend model we can use the update() function to add the new terms to the existing model. The -1 in the update formula drops the intercept term. Continuing in RStudio for the moment:

milk.lm2 <- update(milk.lm, . ~ . + Month - 1)
summary(milk.lm2)

This generates the following.

## Call:
## lm(formula = Milk.Prod ~ Time + I(Month.Count^3) + Month - 1,
##     data = cadairytrain)
## Residuals:
##      Min       1Q   Median       3Q      Max
## -0.06879 -0.01693  0.00346  0.01543  0.08726
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)
## Time              1.57e-09   2.72e-11    57.7   <2e-16 ***
## I(Month.Count^3) -3.74e-08   1.57e-09   -23.8   <2e-16 ***
## MonthApr          6.40e+00   2.63e-02   243.3   <2e-16 ***
## MonthAug          6.38e+00   2.63e-02   242.2   <2e-16 ***
## MonthDec          6.38e+00   2.64e-02   241.9   <2e-16 ***
## MonthFeb          6.31e+00   2.63e-02   240.1   <2e-16 ***
## MonthJan          6.39e+00   2.63e-02   243.1   <2e-16 ***
## MonthJul          6.39e+00   2.63e-02   242.6   <2e-16 ***
## MonthJun          6.38e+00   2.63e-02   242.4   <2e-16 ***
## MonthMar          6.42e+00   2.63e-02   244.2   <2e-16 ***
## MonthMay          6.43e+00   2.63e-02   244.3   <2e-16 ***
## MonthNov          6.34e+00   2.63e-02   240.6   <2e-16 ***
## MonthOct          6.37e+00   2.63e-02   241.8   <2e-16 ***
## MonthSep          6.34e+00   2.63e-02   240.6   <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Residual standard error: 0.0263 on 202 degrees of freedom
## Multiple R-squared: 1, Adjusted R-squared: 1
## F-statistic: 1.42e+06 on 14 and 202 DF, p-value: <2e-16

We see that the model no longer has an intercept term and has 12 significant month factors. This is exactly what we wanted to see.

Let's make another time series plot of the California dairy production data to see how well the seasonal model is working. I have added the following code in the Azure Machine Learning Execute R Script to create the model and make a plot.

milk.lm2 <- lm(Milk.Prod ~ Time + I(Month.Count^3) + Month - 1, data = cadairytrain)
plot(cadairytrain$Time, cadairytrain$Milk.Prod, xlab = "Time", ylab = "Log CA Milk Production 1000s lb", type = "l")
lines(cadairytrain$Time, predict(milk.lm2, cadairytrain), lty = 2, col = 2)

Running this code in Azure Machine Learning produces the plot shown in Figure 24.

Figure 24. California milk production with model including seasonal effects.

The fit to the data shown in Figure 24 is rather encouraging. Both the trend and the seasonal effect (monthly variation) look reasonable.

As another check on our model, let's have a look at the residuals. The following code computes the predicted values from our two models, computes the residuals for the seasonal model, and then plots these residuals for the training data.

## Compute predictions from our models
predict1 <- predict(milk.lm, cadairydata)
predict2 <- predict(milk.lm2, cadairydata)

## Compute and plot the residuals
residuals <- cadairydata$Milk.Prod - predict2
plot(cadairytrain$Time, residuals[1:216], xlab = "Time", ylab = "Residuals of Seasonal Model")

The residual plot is shown in Figure 25.

Figure 25. Residuals of the seasonal model for the training data.

These residuals look reasonable. There is no particular structure, except the effect of the 2008-2009 recession, which our model does not account for particularly well.

The plot shown in Figure 25 is useful for detecting any time-dependent patterns in the residuals. The explicit approach of computing and plotting the residuals I used places the residuals in time order on the plot. If, on the other hand, I had plotted milk.lm$residuals, the plot would not have been in time order.

You can also use plot.lm() to produce a series of diagnostic plots.

## Show the diagnostic plots for the model
plot(milk.lm2, ask = FALSE)

This code produces a series of diagnostic plots shown in Figure 26.

Figure 26. Diagnostic plots for the seasonal model.

There are a few highly influential points identified in these plots, but nothing to cause great concern. Further, we can see from the Normal Q-Q plot that the residuals are close to normally distributed, an important assumption for linear models.

Forecasting and model evaluation

There is just one more thing to do to complete our example. We need to compute forecasts and measure the error against the actual data. Our forecast will be for the 12 months of 2013. We can compute an error measure for this forecast against the actual data, which is not part of our training dataset. Additionally, we can compare performance on the 18 years of training data to the 12 months of test data.
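The split sizes follow from the data: the str() output shows 228 monthly observations starting in 1995, which spans 19 years through 2013, so holding out the final 12 months leaves 216 observations, or 18 years, for training. A small sketch of that arithmetic, using hypothetical variable names:

```r
n.obs <- 228                             # rows in cadairydata, per the str() output
n.test <- 12                             # the 12 months of 2013
train.idx <- 1:(n.obs - n.test)          # rows 1 to 216, the 18 training years
test.idx <- (n.obs - n.test + 1):n.obs   # rows 217 to 228, the forecast year
```

These are the same index ranges that appear as 1:216 and 217:228 in the tutorial's code.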

A number of metrics are used to measure the performance of time series models. In our case we will use the root mean square (RMS) error. The following function computes the RMS error between two series.

RMS.error <- function(series1, series2, is.log = TRUE, min.length = 2){
  ## Function to compute the RMS error or difference between two
  ## series or vectors
  messages <- c("ERROR: Input arguments to function RMS.error of wrong type encountered",
    "ERROR: Input vector to function RMS.error is too short",
    "ERROR: Input vectors to function RMS.error must be of same length",
    "WARNING: Function RMS.error has received invalid input time series.")

  ## Check the arguments
  if(!is.numeric(series1) | !is.numeric(series2) | !is.logical(is.log) | !is.numeric(min.length)) {
    warning(messages[1])
    return(NA)}

  if(length(series1) < min.length) {
    warning(messages[2])
    return(NA)}

  if((length(series1) != length(series2))) {
    warning(messages[3])
    return(NA)}

  ## If is.log is TRUE exponentiate the values, else just copy
  if(is.log) {
    tryCatch( {
      temp1 <- exp(series1)
      temp2 <- exp(series2) },
      error = function(e){warning(messages[4]); NA}
    )
  } else {
    temp1 <- series1
    temp2 <- series2
  }

  ## Compute the RMS error
  tryCatch( {
    sqrt(sum((temp1 - temp2)^2) / length(temp1))},
    error = function(e){warning(messages[4]); NA})
}

As with the log.transform() function we discussed in the "Value transformations" section, there is quite a lot of error checking and exception recovery code in this function. The principles employed are the same. The work is done in two places wrapped in tryCatch(). First, the time series are exponentiated, since we have been working with the logs of the values. Second, the actual RMS error is computed.
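Stripped of its argument checks, the calculation reduces to a few lines. The sketch below uses two made-up log-scale vectors (log.actual and log.predicted are hypothetical names, not tutorial variables) to show the exponentiate-then-compare order described above. It is an illustration of the arithmetic, not a replacement for RMS.error().

```r
## Made-up values on the log scale
log.actual <- log(c(100, 110, 120, 130))
log.predicted <- log(c(102, 108, 123, 126))

## Step 1: undo the log transformation
actual <- exp(log.actual)
predicted <- exp(log.predicted)

## Step 2: root mean square of the differences
rms <- sqrt(sum((actual - predicted)^2) / length(actual))
rms
```

Because the differences are taken after exponentiation, the error is expressed in the original units of the series rather than in log units.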

Equipped with a function to measure the RMS error, let's build and output a dataframe containing the RMS errors.

We will include terms for the trend model alone and the complete model with seasonal factors. The following code

does the job by using the two linear models we have constructed.

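## Compute predictions from our models

predict1 <- predict(milk.lm, cadairydata)

predict2 <- predict(milk.lm2, cadairydata)

## Compute the RMS error in a dataframe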

## Include the row names in the first column so they will

## appear in the output of the Execute R Script

RMS.df <- data.frame(

rowNames = c("Trend Model", "Seasonal Model"),

Training = c(

RMS.error(predict1[1:216], cadairydata$Milk.Prod[1:216]),

RMS.error(predict2[1:216], cadairydata$Milk.Prod[1:216])),

Forecast = c(

RMS.error(predict1[217:228], cadairydata$Milk.Prod[217:228]),

RMS.error(predict2[217:228], cadairydata$Milk.Prod[217:228]))

)

RMS.df

## The following line should be executed only when running in

## Azure Machine Learning Studio

maml.mapOutputPort('RMS.df')

Running this code produces the output shown in Figure 27 at the Result Dataset output port.

Figure 27. Comparison of RMS errors for the models.

From these results, we see that adding the seasonal factors to the model reduces the RMS error significantly. Not too surprisingly, the RMS error for the training data is a bit less than for the forecast.

APPENDIX A: Guide to RStudio

RStudio is quite well documented, so in this appendix I will provide some links to the key sections of the RStudio

documentation to get you started.

1. Creating projects

You can organize and manage your R code into projects by using RStudio. Documentation on using projects can be found at https://support.rstudio.com/hc/articles/200526207-Using-Projects.

I recommend that you follow these directions and create a project for the R code examples in this document.

2. Editing and executing R code

RStudio provides an integrated environment for editing and executing R code. Documentation can be found

at https://support.rstudio.com/hc/articles/200484448-Editing-and-Executing-Code.

3. Debugging

RStudio includes powerful debugging capabilities. Documentation for these features is at

https://support.rstudio.com/hc/articles/200713843-Debugging-with-RStudio.

The breakpoint troubleshooting features are documented at https://support.rstudio.com/hc/articles/200534337-Breakpoint-Troubleshooting.

APPENDIX B: Further reading
This R programming tutorial covers the basics of what you need to use the R language with Azure Machine Learning Studio. If you are not familiar with R, two introductions are available on CRAN:

R for Beginners by Emmanuel Paradis is a good place to start, at http://cran.r-project.org/doc/contrib/Paradis-rdebuts_en.pdf.

An Introduction to R by W. N. Venables et al. goes into a bit more depth, at http://cran.r-project.org/doc/manuals/R-intro.html.

There are many books on R that can help you get started. Here are a few I find useful:

The Art of R Programming: A Tour of Statistical Software Design by Norman Matloff is an excellent introduction to programming in R.

R Cookbook by Paul Teetor provides a problem and solution approach to using R.

R in Action by Robert Kabacoff is another useful introductory book. The companion Quick R website is a useful resource at http://www.statmethods.net/.

R Inferno by Patrick Burns is a surprisingly humorous book that deals with a number of tricky and difficult topics that can be encountered when programming in R. The book is available for free at http://www.burns-stat.com/documents/books/the-r-inferno/.

If you want a deep dive into advanced topics in R, have a look at the book Advanced R by Hadley Wickham. The online version of this book is available for free at http://adv-r.had.co.nz/.

A catalogue of R time series packages can be found in the CRAN Task View for time series analysis: http://cran.r-project.org/web/views/TimeSeries.html. For information on specific time series object packages, you should refer to the documentation for that package.

The book Introductory Time Series with R by Paul Cowpertwait and Andrew Metcalfe provides an introduction to using R for time series analysis. Many more theoretical texts provide R examples.

Some great internet resources:

DataCamp: DataCamp teaches R in the comfort of your browser with video lessons and coding exercises. There are interactive tutorials on the latest R techniques and packages. Take the free interactive R tutorial at https://www.datacamp.com/courses/introduction-to-r

A guide on Getting started with R from Programiz: https://www.programiz.com/r-programming

A quick R tutorial by Kelly Black from Clarkson University: http://www.cyclismo.org/tutorial/R/

60+ R resources listed at http://www.computerworld.com/article/2497464/business-intelligence-60-r-resources-to-improve-your-data-skills.html

Create and share an Azure Machine Learning workspace

3/21/2018 • 2 min to read

This menu links to topics that describe how to set up the various data science environments used by the Cortana Analytics Process (CAPS).

To use Azure Machine Learning Studio, you need to have a Machine Learning workspace. This workspace contains the tools you need to create, manage, and publish experiments.

NOTE

You can try Azure Machine Learning for free. No credit card or Azure subscription is required. Get started now.

To create a workspace

1. Sign in to the Azure portal

NOTE

To sign in and create a workspace, you need to be an Azure subscription administrator.

2. Click +New

3. In the search box, type Machine Learning Studio Workspace and select the matching item. Then, click Create at the bottom of the page.

4. Enter your workspace information:

The workspace name may be up to 260 characters, not ending in a space. The name can't include these

characters: < > * % & : \ ? + /

The web service plan you choose (or create), along with the associated pricing tier you select, is used

if you deploy web services from this workspace.

5. Click Create.

Once the workspace is deployed, you can open it in Machine Learning Studio.

1. Browse to Machine Learning Studio at https://studio.azureml.net/.

2. Select your workspace in the upper-right-hand corner.

3. Click my experiments.

For information about managing your workspace, see Manage an Azure Machine Learning workspace. If you encounter a problem creating your workspace, see Troubleshooting guide: Create and connect to a Machine Learning workspace.

Sharing an Azure Machine Learning workspace

Once a Machine Learning workspace is created, you can invite users to your workspace to share access to your workspace and all its experiments, datasets, notebooks, etc. You can add users in one of two roles:

User - A workspace user can create, open, modify, and delete experiments, datasets, etc. in the workspace.

Owner - An owner can invite and remove users in the workspace, in addition to what a user can do.

NOTE

The administrator account that creates the workspace is automatically added to the workspace as workspace Owner. However, other administrators or users in that subscription are not automatically granted access to the workspace; you need to invite them explicitly.

To share a workspace

1. Sign in to Machine Learning Studio at https://studio.azureml.net/Home

2. In the left panel, click SETTINGS

3. Click the USERS tab

4. Click INVITE MORE USERS at the bottom of the page

5. Enter one or more email addresses. The users need a valid Microsoft account or an organizational account

(from Azure Active Directory).

6. Select whether you want to add the users as Owner or User.

7. Click the OK checkmark button.

Each user you add will receive an email with instructions on how to sign in to the shared workspace.

NOTE

For users to be able to deploy or manage web services in this workspace, they must be a contributor or administrator in the

Azure subscription.

Manage an Azure Machine Learning workspace

4/11/2018 • 1 min to read

For information on managing Web services in the Machine Learning Web Services portal, see Manage a Web service using the Azure Machine Learning Web Services portal.

You can manage Machine Learning workspaces in the Azure portal.

NOTE

You can try Azure Machine Learning for free. No credit card or Azure subscription is required. Get started now.

Use the Azure portal

To manage a workspace in the Azure portal:

1. Sign in to the Azure portal using an Azure subscription administrator account.

2. In the search box at the top of the page, enter "machine learning workspaces" and then select Machine Learning Workspaces.

3. Click the workspace you want to manage.

In addition to the standard resource management information and options available, you can:

View Properties - This page displays the workspace and resource information, and you can change the subscription and resource group that this workspace is connected with.

Resync Storage Keys - The workspace maintains keys to the storage account. If the storage account changes keys, then you can click Resync keys to synchronize the keys with the workspace.

To manage the web services associated with this workspace, use the Machine Learning Web Services portal. See Manage a Web service using the Azure Machine Learning Web Services portal for complete information.

NOTE

To deploy or manage New web services you must be assigned a contributor or administrator role on the subscription to which the web service is deployed. If you invite another user to a machine learning workspace, you must assign them to a contributor or administrator role on the subscription before they can deploy or manage web services.

For more information on setting access permissions, see View access assignments for users and groups in the Azure portal.

Next steps

Learn more about deploying Machine Learning with Azure Resource Manager Templates.

Troubleshooting guide: Create and connect to a Machine Learning workspace

3/21/2018 • 1 min to read

This guide provides solutions for some frequently encountered challenges when you are setting up Azure Machine Learning workspaces.

NOTE

You can try Azure Machine Learning for free. No credit card or Azure subscription is required. Get started now.

Workspace owner

To open a workspace in Machine Learning Studio, you must be signed in to the Microsoft Account you used to create the workspace, or you need to receive an invitation from the owner to join the workspace. From the Azure portal you can manage the workspace, which includes the ability to configure access.

For more information on managing a workspace, see Manage an Azure Machine Learning workspace.

Allowed regions

Machine Learning is currently available in a limited number of regions. If your subscription does not include one of these regions, you may see the error message, “You have no subscriptions in the allowed regions.”

To request that a region be added to your subscription, create a new Microsoft support request from the Azure portal, choose Billing as the problem type, and follow the prompts to submit your request.

Storage account

The Machine Learning service needs a storage account to store data. You can use an existing storage account, or you can create a new storage account when you create the new Machine Learning workspace (if you have quota to create a new storage account).

After the new Machine Learning workspace is created, you can sign in to Machine Learning Studio by using the Microsoft account you used to create the workspace. If you encounter the error message, “Workspace Not Found” (similar to the following screenshot), please use the following steps to delete your browser cookies.

To delete browser cookies

1. If you use Internet Explorer, click the Tools button in the upper-right corner and select Internet options.

2. Under the General tab, click Delete…

3. In the Delete Browsing History dialog box, make sure Cookies and website data is selected, and click Delete.

After the cookies are deleted, restart the browser and then go to the Microsoft Azure Machine Learning page. When you are prompted for a user name and password, enter the same Microsoft account you used to create the workspace.

Comments

Our goal is to make the Machine Learning experience as seamless as possible. Please post any comments and issues at the Azure Machine Learning forum to help us serve you better.

Deploy Machine Learning Workspace Using Azure Resource Manager

4/18/2018 • 3 min to read

Introduction

Using an Azure Resource Manager deployment template saves you time by giving you a scalable way to deploy interconnected components with a validation and retry mechanism. To set up Azure Machine Learning Workspaces, for example, you need to first configure an Azure storage account and then deploy your workspace. Imagine doing this manually for hundreds of workspaces. An easier alternative is to use an Azure Resource Manager template to deploy an Azure Machine Learning Workspace and all its dependencies. This article takes you through this process step-by-step. For a great overview of Azure Resource Manager, see Azure Resource Manager overview.

Step-by-step: create a Machine Learning Workspace

We will create an Azure resource group, then deploy a new Azure storage account and a new Azure Machine Learning Workspace using a Resource Manager template. Once the deployment is complete, we will print out important information about the workspaces that were created (the primary key, the workspaceID, and the URL to the workspace).

Create an Azure Resource Manager template

A Machine Learning Workspace requires an Azure storage account to store the dataset linked to it. The following template uses the name of the resource group to generate the storage account name and the workspace name. It also uses the storage account name as a property when creating the workspace.

{
    "contentVersion": "1.0.0.0",
    "$schema": "http://schema.management.azure.com/schemas/2015-01-01/deploymentTemplate.json#",
    "variables": {
        "namePrefix": "[resourceGroup().name]",
        "location": "[resourceGroup().location]",
        "mlVersion": "2016-04-01",
        "stgVersion": "2015-06-15",
        "storageAccountName": "[concat(variables('namePrefix'),'stg')]",
        "mlWorkspaceName": "[concat(variables('namePrefix'),'mlwk')]",
        "mlResourceId": "[resourceId('Microsoft.MachineLearning/workspaces', variables('mlWorkspaceName'))]",
        "stgResourceId": "[resourceId('Microsoft.Storage/storageAccounts', variables('storageAccountName'))]",
        "storageAccountType": "Standard_LRS"
    },
    "resources": [
        {
            "apiVersion": "[variables('stgVersion')]",
            "name": "[variables('storageAccountName')]",
            "type": "Microsoft.Storage/storageAccounts",
            "location": "[variables('location')]",
            "properties": {
                "accountType": "[variables('storageAccountType')]"
            }
        },
        {
            "apiVersion": "[variables('mlVersion')]",
            "type": "Microsoft.MachineLearning/workspaces",
            "name": "[variables('mlWorkspaceName')]",
            "location": "[variables('location')]",
            "dependsOn": ["[variables('stgResourceId')]"],
            "properties": {
                "UserStorageAccountId": "[variables('stgResourceId')]"
            }
        }
    ],
    "outputs": {
        "mlWorkspaceObject": {"type": "object", "value": "[reference(variables('mlResourceId'), variables('mlVersion'))]"},
        "mlWorkspaceToken": {"type": "string", "value": "[listWorkspaceKeys(variables('mlResourceId'), variables('mlVersion')).primaryToken]"},
        "mlWorkspaceWorkspaceID": {"type": "string", "value": "[reference(variables('mlResourceId'), variables('mlVersion')).WorkspaceId]"},
        "mlWorkspaceWorkspaceLink": {"type": "string", "value": "[concat('https://studio.azureml.net/Home/ViewWorkspace/', reference(variables('mlResourceId'), variables('mlVersion')).WorkspaceId)]"}
    }
}
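Because the template derives both resource names from the resource group name, it can help to check a candidate name before deploying. The following Python sketch mirrors the template's concat() expressions and the storage account naming rule stated later in this article (3 to 24 characters, numbers and lower-case letters only). The function names are hypothetical and are not part of any Azure SDK; Azure may enforce further rules, such as global uniqueness, that this sketch does not check.

```python
import re

# Suffixes taken from the template's concat() expressions.
STORAGE_SUFFIX = "stg"
WORKSPACE_SUFFIX = "mlwk"

# Naming rule quoted in this article: 3 to 24 characters,
# numbers and lower-case letters only.
_STORAGE_NAME_RE = re.compile(r"[a-z0-9]{3,24}")


def derived_names(resource_group_name):
    """Mirror the template's name derivation (hypothetical helper)."""
    return {
        "storageAccountName": resource_group_name + STORAGE_SUFFIX,
        "mlWorkspaceName": resource_group_name + WORKSPACE_SUFFIX,
    }


def is_valid_storage_account_name(name):
    """Check a name against the length and character rule only."""
    return _STORAGE_NAME_RE.fullmatch(name) is not None


def workspace_link(workspace_id):
    """Build the Studio URL the same way the mlWorkspaceWorkspaceLink output does."""
    return "https://studio.azureml.net/Home/ViewWorkspace/" + workspace_id


if __name__ == "__main__":
    names = derived_names("uniquenamerequired523")
    print(names["storageAccountName"],
          is_valid_storage_account_name(names["storageAccountName"]))
```

Under this rule a resource group name can be at most 21 characters long, because the template appends the three-character suffix "stg".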

Deploy the resource group, based on the template

- Save this template as mlworkspace.json file under c:\temp.
- Open PowerShell
- Install modules for Azure Resource Manager and Azure Service Management

# Install the Azure Resource Manager modules from the PowerShell Gallery (press “A”)
Install-Module AzureRM -Scope CurrentUser

# Install the Azure Service Management modules from the PowerShell Gallery (press “A”)
Install-Module Azure -Scope CurrentUser

These steps download and install the modules necessary to complete the remaining steps. This only needs to be done once in the environment where you are executing the PowerShell commands.

- Authenticate to Azure

# Authenticate (enter your credentials in the pop-up window)
Connect-AzureRmAccount

This step needs to be repeated for each session. Once authenticated, your subscription information should be displayed.

Now that we have access to Azure, we can create the resource group.

- Create a resource group

$rg = New-AzureRmResourceGroup -Name "uniquenamerequired523" -Location "South Central US"
$rg

Verify that the resource group is correctly provisioned. ProvisioningState should be “Succeeded.” The resource group name is used by the template to generate the storage account name. The storage account name must be between 3 and 24 characters in length and use numbers and lower-case letters only.

- Using the resource group deployment, deploy a new Machine Learning Workspace.

# Deploy the template into the resource group. TemplateFile is the location of the JSON template.
$rgd = New-AzureRmResourceGroupDeployment -Name "demo" -TemplateFile "C:\temp\mlworkspace.json" -ResourceGroupName $rg.ResourceGroupName

Once the deployment is completed, it is straightforward to access properties of the workspace you deployed. For example, you can access the Primary Key Token.

# Access Azure ML Workspace Token after its deployment.
$rgd.Outputs.mlWorkspaceToken.Value

Another way to retrieve the tokens of an existing workspace is to use the Invoke-AzureRmResourceAction command. For example, you can list the primary and secondary tokens of all workspaces.

# List the primary and secondary tokens of all workspaces
Get-AzureRmResource | Where-Object { $_.ResourceType -like "*MachineLearning/workspaces*" } | ForEach-Object { Invoke-AzureRmResourceAction -ResourceId $_.ResourceId -Action listworkspacekeys -Force }

After the workspace is provisioned, you can also automate many Azure Machine Learning Studio tasks using the PowerShell Module for Azure Machine Learning.

Next Steps

- Learn more about authoring Azure Resource Manager Templates.
- Have a look at the Azure Quickstart Templates Repository.
- Watch this video about Azure Resource Manager.


Import your training data into Azure Machine Learning Studio from various data sources

3/21/2018 • 3 min to read

To use your own data in Machine Learning Studio to develop and train a predictive analytics solution, you can:

- upload data ahead of time from a file on your hard drive to create a dataset module in your workspace
- access data from one of several online data sources while your experiment is running using the Import Data module
- use data from another Azure Machine Learning experiment saved as a dataset
- use data from an on-premises SQL Server database

Each of these options is described in one of the topics on the menu below. These topics show you how to import data from these various data sources to use in Machine Learning Studio.

NOTE

There are a number of sample datasets available in Machine Learning Studio that you can use for training data. For information on these, see Use the sample datasets in Azure Machine Learning Studio.

This introductory topic also discusses how to get data ready for use in Machine Learning Studio and describes which data formats and data types are supported.

You can try Azure Machine Learning for free. No credit card or Azure subscription is required. Get started now.

Get data ready for use in Azure Machine Learning Studio

Machine Learning Studio is designed to work with rectangular or tabular data, such as text data that's delimited or structured data from a database, though in some circumstances non-rectangular data may be used.

It's best if your data is relatively clean. That is, you'll want to take care of issues such as unquoted strings before you upload the data into your experiment.

However, there are modules available in Machine Learning Studio that enable some manipulation of data within your experiment. Depending on the machine learning algorithms you'll be using, you may need to decide how you'll handle data structural issues such as missing values and sparse data, and there are modules that can help with that. Look in the Data Transformation section of the module palette for modules that perform these functions.

At any point in your experiment you can view or download the data that's produced by a module by clicking the output port. Depending on the module, there may be different download options available, or you may be able to visualize the data within your web browser in Machine Learning Studio.

Data formats and data types supported

You can import a number of data types into your experiment, depending on what mechanism you use to import data and where it's coming from:

Plain text (.txt)

Comma-separated values (CSV) with a header (.csv) or without (.nh.csv)

Tab-separated values (TSV) with a header (.tsv) or without (.nh.tsv)

Excel file

Azure table

Hive table

SQL database table

OData values

SVMLight data (.svmlight) (see the SVMLight definition for format information)

Attribute Relation File Format (ARFF) data (.arff) (see the ARFF definition for format information)

Zip file (.zip)

R object or workspace file (.RData)

If you import data in a format such as ARFF that includes metadata, Machine Learning Studio uses this metadata

to define the heading and data type of each column.

If you import data such as TSV or CSV format that doesn't include this metadata, Machine Learning Studio infers

the data type for each column by sampling the data. If the data also doesn't have column headings, Machine

Learning Studio provides default names.

You can explicitly specify or change the headings and data types for columns using the Edit Metadata module.

The following data types are recognized by Machine Learning Studio:

String

Integer

Double

Boolean

DateTime

TimeSpan
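To make the idea of inferring column types by sampling more concrete, here is a minimal Python sketch. It is an illustration only: the algorithm Machine Learning Studio actually uses is not described in this article, the default column names (`Col1`, `Col2`, ...) are hypothetical, and TimeSpan detection is omitted. The sketch tries progressively wider types on a sample of rows and falls back to String.

```python
import csv
import io
from datetime import datetime


def _parses(value, parser):
    """Return True if parser(value) succeeds without a ValueError."""
    try:
        parser(value)
        return True
    except ValueError:
        return False


def infer_type(values):
    """Return the narrowest type name that fits every non-empty sampled value."""
    values = [v for v in values if v != ""]
    if not values:
        return "String"
    if all(v.lower() in ("true", "false") for v in values):
        return "Boolean"
    if all(_parses(v, int) for v in values):
        return "Integer"
    if all(_parses(v, float) for v in values):
        return "Double"
    if all(_parses(v, datetime.fromisoformat) for v in values):
        return "DateTime"
    return "String"


def infer_schema(text, has_header, sample_size=100):
    """Infer (column name, type name) pairs from delimited text.

    Assumes every sampled row has the same number of fields.
    """
    rows = list(csv.reader(io.StringIO(text)))
    if not rows:
        return []
    if has_header:
        names, rows = rows[0], rows[1:]
    else:
        # Hypothetical default names; Studio's real defaults may differ.
        names = ["Col%d" % (i + 1) for i in range(len(rows[0]))]
    sample = rows[:sample_size]
    return [(name, infer_type([row[i] for row in sample]))
            for i, name in enumerate(names)]


if __name__ == "__main__":
    data = "id,price,day\n1,2.5,2018-03-21\n2,3,2018-03-22\n"
    print(infer_schema(data, has_header=True))
```

Sampling keeps inference cheap, but a value that appears only after the sampled rows can contradict the inferred type, which is one reason to set types explicitly when you know them.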

Machine Learning Studio uses an internal data type called Data Table to pass data between modules. You can

explicitly convert your data into Data Table format using the Convert to Dataset module.

Any module that accepts formats other than Data Table will convert the data to Data Table silently before passing

it to the next module.

If necessary, you can convert Data Table format back into CSV, TSV, ARFF, or SVMLight format using other

conversion modules. Look in the Data Format Conversions section of the module palette for modules that

perform these functions.

Import training data from a file on your hard drive into Machine Learning Studio

3/20/2018 • 1 min to read

Learn how to upload a data file from your hard drive to use as training data in Azure Machine Learning Studio. By importing the data file, you have a dataset module ready for use in your workspace.

Steps to import data from a local file

To import data from a local hard drive, do the following:

1. Click +NEW at the bottom of the Machine Learning Studio window.

2. Select DATASET and FROM LOCAL FILE.

3. In the Upload a new dataset dialog, browse to the file you want to upload.

4. Enter a name, identify the data type, and optionally enter a description. A description is recommended - it allows you to record any characteristics about the data that you want to remember when using the data in the future.

5. The checkbox This is the new version of an existing dataset allows you to update an existing dataset with new data. Click this checkbox and then enter the name of an existing dataset.

During upload, you'll see a message that your file is being uploaded. Upload time depends on the size of your data and the speed of your connection to the service. If you know the file will take a long time, you can do other things inside Machine Learning Studio while you wait. However, closing the browser causes the data upload to fail.

Dataset module is ready for use

Once your data is uploaded, it's stored in a dataset module and is available to any experiment in your workspace.

When you're editing an experiment, you can find the datasets you've created in the My Datasets list under the Saved Datasets list in the module palette. You can drag and drop the dataset onto the experiment canvas when you want to use the dataset for further analytics and machine learning.

Import data into Azure Machine Learning Studio from various online data sources with the Import Data module

3/20/2018 • 7 min to read

This article describes the support for importing online data from various sources and the information needed to move data from these sources into an Azure Machine Learning experiment.

NOTE

This article provides general information about the Import Data module. For more detailed information about the types of data you can access, formats, parameters, and answers to common questions, see the module reference topic for the Import Data module.

Introduction

By using the Import Data module, you can access data from one of several online data sources while your experiment is running in Azure Machine Learning Studio:

- A Web URL using HTTP
- Hadoop using HiveQL
- Azure blob storage
- Azure table
- Azure SQL database or SQL Server on Azure VM
- On-premises SQL Server database
- A data feed provider, OData currently
- Azure Cosmos DB

To access online data sources in your Studio experiment, add the Import Data module to your experiment, select the Data source, and then provide the parameters needed to access the data. The online data sources that are supported are itemized in the table below. This table also summarizes the file formats that are supported and parameters that are used to access the data.

Note that because this training data is accessed while your experiment is running, it's only available in that experiment. By comparison, data that has been stored in a dataset module is available to any experiment in your workspace.

IMPORTANT

Currently, the Import Data and Export Data modules can read and write data only from Azure storage created using the Classic deployment model. In other words, the new Azure Blob Storage account type that offers a hot storage access tier or cool storage access tier is not yet supported.

Generally, any Azure storage accounts that you might have created before this service option became available should not be affected. If you need to create a new account, select Classic for the Deployment model, or use Resource manager and select General purpose rather than Blob storage for Account kind.

For more information, see Azure Blob Storage: Hot and Cool Storage Tiers.

Supported online data sources

Azure Machine Learning Import Data module supports the following data sources:

DATA SOURCE: Web URL via HTTP

DESCRIPTION: Reads data in comma-separated values (CSV), tab-separated values (TSV), attribute-relation file format (ARFF), and Support Vector Machines (SVM-light) formats, from any web URL that uses HTTP.

PARAMETERS:
- URL: Specifies the full name of the file, including the site URL and the file name, with any extension.
- Data format: Specifies one of the supported data formats: CSV, TSV, ARFF, or SVM-light. If the data has a header row, it is used to assign column names.

DATA SOURCE: Hadoop/HDFS

DESCRIPTION: Reads data from distributed storage in Hadoop. You specify the data you want by using HiveQL, a SQL-like query language. HiveQL can also be used to aggregate data and perform data filtering before you add the data to Machine Learning Studio.

PARAMETERS:
- Hive database query: Specifies the Hive query used to generate the data.
- HCatalog server URI: Specifies the name of your cluster using the format <your cluster name>.azurehdinsight.net.
- Hadoop user account name: Specifies the Hadoop user account name used to provision the cluster.
- Hadoop user account password: Specifies the credentials used when provisioning the cluster. For more information, see Create Hadoop clusters in HDInsight.
- Location of output data: Specifies whether the data is stored in a Hadoop distributed file system (HDFS) or in Azure.
  - If you store output data in HDFS, specify the HDFS server URI. (Be sure to use the HDInsight cluster name without the HTTPS:// prefix).
  - If you store your output data in Azure, you must specify the Azure storage account name, Storage access key and Storage container name.

DATA SOURCE: SQL database

DESCRIPTION: Reads data that is stored in an Azure SQL database or in a SQL Server database running on an Azure virtual machine.

PARAMETERS:
- Database server name: Specifies the name of the server on which the database is running.
  - In case of Azure SQL Database, enter the server name that is generated. Typically it has the form <generated_identifier>.database.windows.net.
  - In case of a SQL Server hosted on an Azure virtual machine, enter tcp:<Virtual Machine DNS Name>, 1433
- Database name: Specifies the name of the database on the server.
- Server user account name: Specifies a user name for an account that has access permissions for the database.
- Server user account password: Specifies the password for the user account.
- Database query: Enter a SQL statement that describes the data you want to read.

DATA SOURCE: On-premises SQL database

DESCRIPTION: Reads data that is stored in an on-premises SQL database.

PARAMETERS:
- Data gateway: Specifies the name of the Data Management Gateway installed on a computer where it can access your SQL Server database. For information about setting up the gateway, see Perform advanced analytics with Azure Machine Learning using data from an on-premises SQL server.
- Database server name: Specifies the name of the server on which the database is running.
- Database name: Specifies the name of the database on the server.
- Server user account name: Specifies a user name for an account that has access permissions for the database.
- User name and password: Click Enter values to enter your database credentials. You can use Windows Integrated Authentication or SQL Server Authentication depending upon how your on-premises SQL Server is configured.
- Database query: Enter a SQL statement that describes the data you want to read.

DATA SOURCE: Azure Table

DESCRIPTION: Reads data from the Table service in Azure Storage. If you read large amounts of data infrequently, use the Azure Table Service. It provides a flexible, non-relational (NoSQL), massively scalable, inexpensive, and highly available storage solution.

PARAMETERS: The options in the Import Data module change depending on whether you are accessing public information or a private storage account. The Authentication Type, which can have a value of "PublicOrSAS" or "Account", selects between the two, and each has its own set of parameters.

Public or Shared Access Signature (SAS) URI: The parameters are:
- Table URI: Specifies the Public or SAS URL for the table.
- Rows to scan for property names: The values are TopN to scan the specified number of rows, or ScanAll to get all rows in the table.
  - If the data is homogeneous and predictable, it is recommended that you select TopN and enter a number for N. For large tables, this can result in quicker reading times.
  - If the data is structured with sets of properties that vary based on the depth and position of the table, choose the ScanAll option to scan all rows. This ensures the integrity of your resulting property and metadata conversion.

Private Storage Account: The parameters are:
- Account name: Specifies the name of the account that contains the table to read.
- Account key: Specifies the storage key associated with the account.
- Table name: Specifies the name of the table that contains the data to read.
- Rows to scan for property names: The values are TopN to scan the specified number of rows, or ScanAll to get all rows in the table.
  - If the data is homogeneous and predictable, we recommend that you select TopN and enter a number for N. For large tables, this can result in quicker reading times.
  - If the data is structured with sets of properties that vary based on the depth and position of the table, choose the ScanAll option to scan all rows. This ensures the integrity of your resulting property and metadata conversion.

DATA SOURCE: Azure Blob Storage

DESCRIPTION: Reads data stored in the Blob service in Azure Storage, including images, unstructured text, or binary data. You can use the Blob service to publicly expose data, or to privately store application data. You can access your data from anywhere by using HTTP or HTTPS connections.

PARAMETERS: The options in the Import Data module change depending on whether you are accessing public information or a private storage account. The Authentication Type, which can have a value of either "PublicOrSAS" or "Account", selects between the two.

Public or Shared Access Signature (SAS) URI: The parameters are:
- URI: Specifies the Public or SAS URL for the storage blob.
- File Format: Specifies the format of the data in the Blob service. The supported formats are CSV, TSV, and ARFF.

Private Storage Account: The parameters are:
- Account name: Specifies the name of the account that contains the blob you want to read.
- Account key: Specifies the storage key associated with the account.
- Path to container, directory, or blob: Specifies the name of the blob that contains the data to read.
- Blob file format: Specifies the format of the data in the blob service. The supported data formats are CSV, TSV, ARFF, CSV with a specified encoding, and Excel.
  - If the format is CSV or TSV, be sure to indicate whether the file contains a header row.
  - You can use the Excel option to read data from Excel workbooks. In the Excel data format option, indicate whether the data is in an Excel worksheet range, or in an Excel table. In the Excel sheet or embedded table option, specify the name of the sheet or table that you want to read from.

DATA SOURCE: Data Feed Provider

DESCRIPTION: Reads data from a supported feed provider. Currently only the Open Data Protocol (OData) format is supported.

PARAMETERS:
- Data content type: Specifies the OData format.
- Source URL: Specifies the full URL for the data feed. For example, the following URL reads from the Northwind sample database: http://services.odata.org/northwind/northwind.svc/
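The trade-off between the TopN and ScanAll options for Azure Table can be illustrated with a small Python sketch. This is not the Import Data module's implementation. It models table rows as plain dictionaries and uses a hypothetical function, `scan_property_names`, to show why scanning only the first N rows is faster but can miss properties that appear only in later rows.

```python
def scan_property_names(rows, mode="ScanAll", n=None):
    """Collect property names, in first-seen order, from table rows.

    rows is a list of dicts, one per table entity (an assumption made
    for illustration). mode is "TopN" or "ScanAll", echoing the option
    names described above.
    """
    if mode == "TopN":
        if n is None or n < 1:
            raise ValueError("TopN requires a positive n")
        rows = rows[:n]
    elif mode != "ScanAll":
        raise ValueError("mode must be 'TopN' or 'ScanAll'")

    names = []
    for row in rows:
        for key in row:
            if key not in names:
                names.append(key)
    return names


if __name__ == "__main__":
    table = [
        {"PartitionKey": "a", "RowKey": "1", "Temp": 20},
        {"PartitionKey": "a", "RowKey": "2", "Temp": 21},
        {"PartitionKey": "b", "RowKey": "1", "Humidity": 0.4},
    ]
    print(scan_property_names(table, "TopN", n=2))
    print(scan_property_names(table, "ScanAll"))
```

In this sample data the third row introduces a property that the first two rows lack, so a TopN scan with N = 2 would not discover it, while ScanAll would.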

Next steps

Deploying Azure ML web services that use Data Import and Data Export modules

Import your data into Azure Machine Learning Studio from another experiment

3/5/2018 • 1 min to read

There will be times when you'll want to take an intermediate result from one experiment and use it as part of another experiment. To do this, you save the module's output as a dataset:

1. Click the output of the module that you want to save as a dataset.

2. Click Save as Dataset.

3. When prompted, enter a name and a description that would allow you to identify the dataset easily.

4. Click the OK checkmark.

When the save finishes, the dataset will be available for use within any experiment in your workspace. You can find

it in the Saved Datasets list in the module palette.

Perform advanced analytics with Azure Machine Learning using data from an on-premises SQL Server database

3/21/2018 • 9 min to read

Often enterprises that work with on-premises data would like to take advantage of the scale and agility of the cloud for their machine learning workloads. But they don't want to disrupt their current business processes and workflows by moving their on-premises data to the cloud. Azure Machine Learning now supports reading your data from an on-premises SQL Server database and then training and scoring a model with this data. You no longer have to manually copy and sync the data between the cloud and your on-premises server. Instead, the Import Data module in Azure Machine Learning Studio can now read directly from your on-premises SQL Server database for your training and scoring jobs.

This article provides an overview of how to ingress on-premises SQL Server data into Azure Machine Learning. It assumes that you're familiar with Azure Machine Learning concepts such as workspaces, modules, datasets, and experiments.

NOTE

This feature is not available for free workspaces. For more information about Machine Learning pricing and tiers, see Azure Machine Learning Pricing.

You can try Azure Machine Learning for free. No credit card or Azure subscription is required. Get started now.

Install the Microsoft Data Management Gateway

To access an on-premises SQL Server database in Azure Machine Learning, you need to download and install the Microsoft Data Management Gateway. When you configure the gateway connection in Machine Learning Studio, you have the opportunity to download and install the gateway using the Download and register data gateway dialog described below.

You also can install the Data Management Gateway ahead of time by downloading and running the MSI setup

package from the Microsoft Download Center. Choose the latest version, selecting either 32-bit or 64-bit as

appropriate for your computer. The MSI can also be used to upgrade an existing Data Management Gateway to

the latest version, with all settings preserved.

The gateway has the following prerequisites:

- The supported Windows operating system versions are Windows 7, Windows 8/8.1, Windows 10, Windows Server 2008 R2, Windows Server 2012, and Windows Server 2012 R2.
- The recommended configuration for the gateway machine is at least 2 GHz, 4 cores, 8GB RAM, and 80GB disk.
- If the host machine hibernates, the gateway won’t respond to data requests. Therefore, configure an appropriate power plan on the computer before installing the gateway. If the machine is configured to hibernate, the gateway installation displays a message.

Because copy activity occurs at a specific frequency, the resource usage (CPU, memory) on the machine also follows the same pattern with peak and idle times. Resource utilization also depends heavily on the amount of data being moved. When multiple copy jobs are in progress, you'll observe resource usage go up during peak times. While the minimum configuration listed above is technically sufficient, you may want to have a configuration with more resources than the minimum configuration depending on your specific load for data movement.

Consider the following when setting up and using a Data Management Gateway:

- You can install only one instance of Data Management Gateway on a single computer.
- You can use a single gateway for multiple on-premises data sources.
- You can connect multiple gateways on different computers to the same on-premises data source.
- You configure a gateway for only one workspace at a time. Currently, gateways can’t be shared across workspaces.
- You can configure multiple gateways for a single workspace. For example, you may want to use a gateway that's connected to your test data sources during development and a production gateway when you're ready to operationalize.
- The gateway does not need to be on the same machine as the data source. But staying closer to the data source reduces the time for the gateway to connect to the data source. We recommend that you install the gateway on a machine that's different from the one that hosts the on-premises data source so that the gateway and data source don't compete for resources.
- You need to use the Data Management Gateway for Azure Machine Learning even if you are using Azure ExpressRoute for other data. You should treat your data source as an on-premises data source (that's behind a firewall) even when you use ExpressRoute. Use the Data Management Gateway to establish connectivity between Machine Learning and the data source.
- If you already have a gateway installed on your computer serving Power BI or Azure Data Factory scenarios, install a separate gateway for Azure Machine Learning on another computer.

NOTE

You can't run Data Management Gateway and Power BI Gateway on the same computer.

You can find detailed information on installation prerequisites, installation steps, and troubleshooting tips in the article Data Management Gateway.

Ingress data from your on-premises SQL Server database into Azure Machine Learning

In this walkthrough, you will set up a Data Management Gateway in an Azure Machine Learning workspace, configure it, and then read data from an on-premises SQL Server database.

TIP

Before you start, disable your browser’s pop-up blocker for studio.azureml.net. If you're using the Google Chrome browser, download and install one of the several plug-ins available at Google Chrome WebStore Click Once App Extension.

Step 1: Create a gateway

The first step is to create and set up the gateway to access your on-premises SQL database.

1. Log in to Azure Machine Learning Studio and select the workspace that you want to work in.

2. Click the SETTINGS blade on the left, and then click the DATA GATEWAYS tab at the top.

3. Click NEW DATA GATEWAY at the bottom of the screen.

4. In the New data gateway dialog, enter the Gateway Name and optionally add a Description. Click the

arrow on the bottom right-hand corner to go to the next step of the configuration.

5. In the Download and register data gateway dialog, copy the GATEWAY REGISTRATION KEY to the

clipboard.

6. If you have not yet downloaded and installed the Microsoft Data Management Gateway, then click Download

data management gateway. This takes you to the Microsoft Download Center where you can select the

gateway version you need, download it, and install it. You can find detailed information on installation

prerequisites, installation steps, and troubleshooting tips in the beginning sections of the article Move data

between on-premises sources and cloud with Data Management Gateway.

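7. After the gateway is installed, the Data Management Gateway Configuration Manager will open and the Register gateway dialog is displayed. Paste the Gateway Registration Key that you copied to the clipboard and click Register.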

8. If you already have a gateway installed, run the Data Management Gateway Configuration Manager. Click

Change key, paste the Gateway Registration Key that you copied to the clipboard in the previous step, and

click OK.

9. When the installation is complete, the Register gateway dialog for Microsoft Data Management Gateway

Configuration Manager is displayed. Paste the GATEWAY REGISTRATION KEY that you copied to the

clipboard in a previous step and click Register.

10. The gateway configuration is complete when the following values are set on the Home tab in Microsoft

Data Management Gateway Configuration Manager:

Gateway name and Instance name are set to the name of the gateway.

Registration is set to Registered.

Status is set to Started.

The status bar at the bottom displays Connected to Data Management Gateway Cloud Service

along with a green check mark.

Azure Machine Learning Studio also gets updated when the registration is successful.

11. In the Download and register data gateway dialog, click the check mark to complete the setup. The

Settings page displays the gateway status as "Online". In the right-hand pane, you'll find status and other

useful information.

12. In the Microsoft Data Management Gateway Configuration Manager switch to the Certificate tab. The

certificate specified on this tab is used to encrypt/decrypt credentials for the on-premises data store that

you specify in the portal. This certificate is the default certificate. Microsoft recommends changing this to

your own certificate that you back up in your certificate management system. Click Change to use your

own certificate instead.

13. (optional) If you want to enable verbose logging in order to troubleshoot issues with the gateway, in the

Microsoft Data Management Gateway Configuration Manager switch to the Diagnostics tab and check

the Enable verbose logging for troubleshooting purposes option. The logging information can be

found in the Windows Event Viewer under the Applications and Services Logs -> Data Management

Gateway node. You can also use the Diagnostics tab to test the connection to an on-premises data source

using the gateway.

This completes the gateway setup process in Azure Machine Learning. You're now ready to use your on-premises data.

You can create and set up multiple gateways in Studio for each workspace. For example, you may have a gateway that you want to connect to your test data sources during development, and a different gateway for your production data sources. Azure Machine Learning gives you the flexibility to set up multiple gateways depending upon your corporate environment. Currently you can’t share a gateway between workspaces and only one gateway can be installed on a single computer. For more information, see Move data between on-premises sources and cloud with Data Management Gateway.

Step 2: Use the gateway to read data from an on-premises data source

After you set up the gateway, you can add an Import Data module to an experiment that inputs the data from the on-premises SQL Server database.

1. In Machine Learning Studio, select the EXPERIMENTS tab, click +NEW in the lower-left corner, and select

Blank Experiment (or select one of several sample experiments available).

2. Find and drag the Import Data module to the experiment canvas.

3. Click Save as below the canvas. Enter "Azure Machine Learning On-Premises SQL Server Tutorial" for the

experiment name, select the workspace, and click the OK check mark.

4. Click the Import Data module to select it, then in the Properties pane to the right of the canvas, select "On-

Premises SQL Database" in the Data source dropdown list.

5. Select the Data gateway you installed and registered. You can set up another gateway by selecting "(add new Data Gateway…)".

6. Enter the SQL Database server name and Database name, along with the SQL Database query you want to execute.

7. Click Enter values under User name and password and enter your database credentials. You can use Windows Integrated Authentication or SQL Server Authentication depending upon how your on-premises SQL Server is configured.

The message "values required" changes to "values set" with a green check mark. You only need to enter the credentials once unless the database information or password changes. Azure Machine Learning uses the certificate you provided when you installed the gateway to encrypt the credentials in the cloud. Azure never stores on-premises credentials without encryption.

8. Click RUN to run the experiment.

Once the experiment finishes running, you can visualize the data you imported from the database by clicking the

output port of the Import Data module and selecting Visualize.

Once you finish developing your experiment, you can deploy and operationalize your model. Using the Batch

Execution Service, data from the on-premises SQL Server database configured in the Import Data module will

be read and used for scoring. While you can use the Request Response Service for scoring on-premises data,

Microsoft recommends using the Excel Add-in instead. Currently, writing to an on-premises SQL Server database

through Export Data is not supported either in your experiments or published web services.

Application Lifecycle Management in Azure Machine

Learning Studio

3/21/2018 • 6 min to read

Azure Machine Learning Studio is a tool for developing machine learning experiments that are operationalized in
the Azure cloud platform. It is like the Visual Studio IDE and a scalable cloud service merged into a single platform.
You can incorporate standard Application Lifecycle Management (ALM) practices, from versioning various assets to
automated execution and deployment, into Azure Machine Learning Studio. This article discusses some of the
options and approaches.

Versioning experiment

There are two recommended ways to version your experiments. You can either rely on the built-in run history or
export the experiment in JSON format and manage it externally. Each approach has its pros and cons.

Experiment snapshots using Run History

In the Azure Machine Learning Studio execution model, an immutable snapshot of the

experiment is submitted to the job scheduler whenever you click Run in the experiment editor. To view this list of

snapshots, click Run History on the command bar in the experiment editor view.

You can then open the snapshot in Locked mode by clicking the name of the experiment at the time the experiment

was submitted to run and the snapshot was taken. Notice that only the first item in the list, which represents the

current experiment, is in an Editable state. Also notice that each snapshot can be in various Status states as well,

including Finished (Partial run), Failed, Failed (Partial run), or Draft.

After it's opened, you can save the snapshot experiment as a new experiment and then modify it. If your experiment

snapshot contains assets such as trained models, transforms, or datasets that have updated versions, the snapshot

retains the references to the original version when the snapshot was taken. If you save the locked snapshot as a

new experiment, Azure Machine Learning Studio detects the existence of a newer version of these assets, and

automatically updates them in the new experiment.

If you delete the experiment, all snapshots of that experiment are deleted.

Export/import experiment in JSON format

The run history snapshots keep an immutable version of the experiment in Azure Machine Learning Studio every
time it is submitted to run. You can also save a local copy of the experiment and check it in to your favorite source
control system, such as Team Foundation Server, and later re-create an experiment from that local file. You can
use the Azure Machine Learning PowerShell commandlets Export-AmlExperimentGraph and
Import-AmlExperimentGraph to accomplish that.

The JSON file is a textual representation of the experiment graph, which might include references to assets in the
workspace such as a dataset or a trained model. It doesn't contain a serialized version of the asset. If you attempt to
import the JSON document back into the workspace, the referenced assets must already exist with the same asset
IDs that are referenced in the experiment. Otherwise you cannot access the imported experiment.

Versioning trained model

A trained model in Azure Machine Learning is serialized into a format known as an iLearner file (.iLearner), and is
stored in the Azure Blob storage account associated with the workspace. One way to get a copy of the iLearner file
is through the retraining API. This article explains how the retraining API works. The high-level steps:

1. Set up your training experiment.
2. Add a web service output port to the Train Model module, or the module that produces the trained model, such
as Tune Model Hyperparameters or Create R Model.
3. Run your training experiment and then deploy it as a model training web service.
4. Call the BES endpoint of the training web service, and specify the desired iLearner file name and Blob storage
account location where it will be stored.
5. Harvest the produced iLearner file after the BES call finishes.

Another way to retrieve the iLearner file is through the PowerShell commandlet
Download-AmlExperimentNodeOutput. This might be easier if you just want to get a copy of the iLearner file
without the need to retrain the model programmatically.

After you have the iLearner file containing the trained model, you can then employ your own versioning strategy.
The strategy can be as simple as applying a prefix or postfix as a naming convention and just leaving the iLearner
file in Blob storage, or copying/importing it into your version control system.

The saved iLearner file can then be used for scoring through deployed web services.

Versioning web service

You can deploy two types of web services from an Azure Machine Learning experiment. The classic web service is
tightly coupled with the experiment as well as the workspace. The new web service uses the Azure Resource
Manager framework, and it is no longer coupled with the original experiment or the workspace.

Classic web service

To version a classic web service, you can take advantage of the web service endpoint construct. Here is a typical

flow:

1. From your predictive experiment, you deploy a new classic web service, which contains a default endpoint.

2. You create a new endpoint named ep2, which exposes the current version of the experiment/trained model.

3. You go back and update your predictive experiment and trained model.

4. You redeploy the predictive experiment, which will then update the default endpoint. But this will not alter ep2.

5. You create an additional endpoint named ep3, which exposes the new version of the experiment and trained

model.

6. Go back to step 3 if needed.

Over time, you might have many endpoints created in the same web service. Each endpoint represents a
point-in-time copy of the experiment containing the point-in-time version of the trained model. You can then use
external logic to determine which endpoint to call, which effectively means selecting a version of the trained model
for the scoring run.

You can also create many identical web service endpoints, and then patch different versions of the iLearner file to
the endpoints to achieve a similar effect. This article explains in more detail how to accomplish that.

New web service

If you create a new Azure Resource Manager-based web service, the endpoint construct is no longer available.
Instead, you can generate web service definition (WSD) files, in JSON format, from your predictive experiment by
using the Export-AmlWebServiceDefinitionFromExperiment PowerShell commandlet, or by using the
Export-AzureRmMlWebservice PowerShell commandlet from a deployed Resource Manager-based web service.

After you have exported the WSD file and placed it under version control, you can also deploy the WSD as a new
web service in a different web service plan in a different Azure region. Just make sure you supply the proper storage
account configuration as well as the new web service plan ID. To patch in different iLearner files, you can modify
the WSD file and update the location reference of the trained model, and deploy it as a new web service.

Automate experiment execution and deployment

An important aspect of ALM is to be able to automate the execution and deployment process of the application. In
Azure Machine Learning, you can accomplish this by using the PowerShell module. Here is an example of
end-to-end steps that are relevant to a standard ALM automated execution/deployment process by using the Azure
Machine Learning Studio PowerShell module. Each step is linked to one or more PowerShell commandlets that you
can use to accomplish that step.

1.  Upload a dataset.
2.  Copy a training experiment into the workspace from a workspace or from Gallery, or import an exported
experiment from local disk.
3.  Update the dataset in the training experiment.
4.  Run the training experiment.
5.  Promote the trained model.
6.  Copy a predictive experiment into the workspace.
7.  Update the trained model in the predictive experiment.
8.  Run the predictive experiment.
9.  Deploy a web service from the predictive experiment.
10.  Test the web service RRS or BES endpoint.

Next steps

Download the Azure Machine Learning Studio PowerShell module and start to automate your ALM tasks.
Learn how to create and manage a large number of ML models by using just a single experiment through
PowerShell and the retraining API.
Learn more about deploying Azure Machine Learning web services.

Manage experiment iterations in Azure Machine

Learning Studio

3/21/2018 • 4 min to read

Developing a predictive analysis model is an iterative process: as you modify the various functions and
parameters of your experiment, your results converge until you are satisfied that you have a trained, effective
model. Key to this process is tracking the various iterations of your experiment parameters and configurations.

You can try Azure Machine Learning for free. No credit card or Azure subscription is required. Get started now.

You can review previous runs of your experiments at any time in order to challenge, revisit, and ultimately either
confirm or refine previous assumptions. When you run an experiment, Machine Learning Studio keeps a history of
the run, including dataset, module, and port connections and parameters. This history also captures results,
runtime information such as start and stop times, log messages, and execution status. You can look back at any of
these runs at any time to review the chronology of your experiment and intermediate results. You can even use a
previous run of your experiment to launch into a new phase of inquiry and discovery on your path to creating
simple, complex, or even ensemble modeling solutions.

NOTE

When you view a previous run of an experiment, that version of the experiment is locked and can't be edited. You can,
however, save a copy of it by clicking SAVE AS and providing a new name for the copy. Machine Learning Studio opens the
new copy, which you can then edit and run. This copy of your experiment is available in the EXPERIMENTS list along with all
your other experiments.

Viewing the Prior Run

When you have an experiment open that you have run at least once, you can view the preceding run of the
experiment by clicking Prior Run in the properties pane.

For example, suppose you create an experiment and run versions of it at 11:23, 11:42, and 11:55. If you open the
last run of the experiment (11:55) and click Prior Run, the version you ran at 11:42 is opened.

Viewing the Run History

You can view all the previous runs of an experiment by clicking View Run History in an open experiment.

For example, suppose you create an experiment with the Linear Regression module and you want to observe the
effect of changing the value of Learning rate on your experiment results. You run the experiment multiple times
with different values for this parameter, as follows:

LEARNING RATE VALUE    RUN START TIME
0.1                    9/11/2014 4:18:58 pm
0.2                    9/11/2014 4:24:33 pm
0.4                    9/11/2014 4:28:36 pm
0.5                    9/11/2014 4:33:31 pm

If you click VIEW RUN HISTORY, you see a list of all these runs:

Click any of these runs to view a snapshot of the experiment at the time you ran it. The configuration, parameter
values, comments, and results are all preserved to give you a complete record of that run of your experiment.

TIP

To document your iterations of the experiment, you can modify the title each time you run it, you can update the Summary
of the experiment in the properties pane, and you can add or update comments on individual modules to record your
changes. The title, summary, and module comments are saved with each run of the experiment.

The list of experiments in the EXPERIMENTS tab in Machine Learning Studio always displays the latest version
of an experiment. If you open a previous run of the experiment (using Prior Run or VIEW RUN HISTORY), you
can return to the draft version by clicking VIEW RUN HISTORY and selecting the iteration that has a STATE of
Editable.

Iterating on a Previous Run

When you click Prior Run or VIEW RUN HISTORY and open a previous run, you can view a finished experiment

in read-only mode.

If you want to begin an iteration of your experiment starting with the way you configured it for a previous run, you

can do this by opening the run and clicking SAVE AS. This creates a new experiment, with a new title, an empty

run history, and all the components and parameter values of the previous run. This new experiment is listed in the

EXPERIMENTS tab in the Machine Learning Studio home page, and you can modify and run it, initiating a new

run history for this iteration of your experiment.

For example, suppose you have the experiment run history shown in the previous section. You want to observe

what happens when you set the Learning rate parameter to 0.4, and try different values for the Number of

training epochs parameter.

1. Click VIEW RUN HISTORY and open the iteration of the experiment that you ran at 4:28:36 pm (in which you

set the parameter value to 0.4).

2. Click SAVE AS.

3. Enter a new title and click the OK checkmark. A new copy of the experiment is created.

4. Modify the Number of training epochs parameter.

5. Click RUN.

You can now continue to modify and run this version of your experiment, building a new run history to record your

work.
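The SAVE AS behavior described above can be pictured with a small Python model. This is only a conceptual sketch, not a Machine Learning Studio API: the `Experiment` and `Run` classes and the `save_as` method are hypothetical names introduced here to show that a copy starts from the chosen run's parameters with an empty run history, while the original runs stay unchanged.

```python
from dataclasses import dataclass, field


@dataclass
class Run:
    start_time: str
    parameters: dict


@dataclass
class Experiment:
    title: str
    parameters: dict
    run_history: list = field(default_factory=list)

    def run(self, start_time):
        # Each run stores its own copy of the parameters, so later edits
        # to the experiment do not alter the recorded run.
        self.run_history.append(Run(start_time, dict(self.parameters)))

    def save_as(self, run, new_title):
        # The copy takes the chosen run's parameters and an empty run history.
        return Experiment(new_title, dict(run.parameters))


if __name__ == "__main__":
    original = Experiment("Learning rate study", {"Learning rate": 0.1})
    original.run("4:18:58 pm")
    original.parameters["Learning rate"] = 0.4
    original.run("4:28:36 pm")

    # Start a new iteration from the 0.4 run and vary a different parameter.
    copy = original.save_as(original.run_history[1], "Epoch study")
    copy.parameters["Number of training epochs"] = 20
    copy.run("5:02:10 pm")
    print(len(original.run_history), len(copy.run_history))
```

In this model, changing the copy never touches the snapshots in the original experiment's history, which mirrors the read-only nature of previous runs.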

Create many Machine Learning models and web

service endpoints from one experiment using

PowerShell

3/21/2018 • 9 min to read

Here's a common machine learning problem: You want to create many models that have the same training
workflow and use the same algorithm. But you want them to have different training datasets as input. This article
shows you how to do this at scale in Azure Machine Learning Studio using just a single experiment.

For example, let's say you own a global bike rental franchise business. You want to build a regression model to
predict the rental demand based on historic data. You have 1,000 rental locations across the world and you've
collected a dataset for each location. Each dataset includes important features such as date, time, weather, and
traffic that are specific to that location.

You could train your model once using a merged version of all the datasets across all locations. But each of your
locations has a unique environment. So a better approach would be to train your regression model separately
using the dataset for each location. That way, each trained model could take into account the different store sizes,
volume, geography, population, bike-friendly traffic environment, and more.

That may be the best approach, but you don't want to create 1,000 training experiments in Azure Machine
Learning with each one representing a unique location. Besides being an overwhelming task, it also seems
inefficient since each experiment would have all the same components except for the training dataset.

Fortunately, you can accomplish this by using the Azure Machine Learning retraining API and automating the task
with Azure Machine Learning PowerShell.

NOTE

To make your sample run faster, reduce the number of locations from 1,000 to 10. But the same principles and procedures
apply to 1,000 locations. However, if you do want to train from 1,000 datasets you might want to run the following
PowerShell scripts in parallel. How to do that is beyond the scope of this article, but you can find examples of PowerShell
multi-threading on the Internet.

Set up the training experiment

Use the example training experiment that's in the Cortana Intelligence Gallery. Open this experiment in your
Azure Machine Learning Studio workspace.

NOTE

In order to follow along with this example, you may want to use a standard workspace rather than a free workspace. You
create one endpoint for each location, for a total of 10 endpoints, and that requires a standard workspace since a free
workspace is limited to 3 endpoints. If you only have a free workspace, just change the scripts to allow for only three locations.

The experiment uses an Import Data module to import the training dataset customer001.csv from an Azure

storage account. Let's assume you have collected training datasets from all bike rental locations and stored them

in the same blob storage location with file names ranging from rentalloc001.csv to rentalloc010.csv.
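The naming scheme can be generated rather than typed by hand. The following Python sketch is illustrative only: the `rentalloc` prefix and three-digit zero padding follow the file and endpoint names used in this article, while the helper name `dataset_names` and its parameters are assumptions introduced here.

```python
def dataset_names(count, prefix="rentalloc", width=3, ext=".csv"):
    """Return zero-padded dataset file names such as rentalloc001.csv."""
    return [f"{prefix}{i:0{width}d}{ext}" for i in range(1, count + 1)]


if __name__ == "__main__":
    # One training dataset per rental location.
    for name in dataset_names(10):
        print(name)
```

The PowerShell scripts later in this article use the same idea, padding a loop counter with `PadLeft(3, '0')`, so that names sort consistently.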

Note that a Web Service Output module has been added to the Train Model module. When this experiment is
deployed as a web service, the endpoint associated with that output returns the trained model in the format of an
.ilearner file.

Also note that you set up a web service parameter that defines the URL that the Import Data module uses. This
allows you to use the parameter to specify individual training datasets to train the model for each location. There
are other ways you could have done this. You can use a SQL query with a web service parameter to get data from
a SQL Azure database. Or you can use a Web Service Input module to pass in a dataset to the web service.

Now, let's run this training experiment using the default value rental001.csv as the training dataset. If you view the
output of the Evaluate module (click the output and select Visualize), you can see you get a decent performance
of AUC = 0.91. At this point, you're ready to deploy a web service out of this training experiment.

Deploy the training and scoring web services

To deploy the training web service, click the Set Up Web Service button below the experiment canvas and select
Deploy Web Service. Call this web service "Bike Rental Training".

Now you need to deploy the scoring web service. To do this, click Set Up Web Service below the canvas and
select Predictive Web Service. This creates a scoring experiment. You need to make a few minor adjustments to
make it work as a web service. Remove the label column "cnt" from the input data and limit the output to only the
instance id and the corresponding predicted value.

To save yourself that work, you can open the predictive experiment in the Gallery that has already been prepared.

To deploy the web service, run the predictive experiment, then click the Deploy Web Service button below the
canvas. Name the scoring web service "Bike Rental Scoring".

Create 10 identical web service endpoints with PowerShell

This web service comes with a default endpoint. But you're not as interested in the default endpoint since it can't

be updated. What you need to do is to create 10 additional endpoints, one for each location. You can do this with

PowerShell.

First, you set up the PowerShell environment:

    Import-Module .\AzureMLPS.dll
    # Assume the default configuration file exists and is properly set to point to the valid Workspace.
    $scoringSvc = Get-AmlWebService | where Name -eq 'Bike Rental Scoring'
    $trainingSvc = Get-AmlWebService | where Name -eq 'Bike Rental Training'

Then, run the following PowerShell command:

    # Create 10 endpoints on the scoring web service.
    For ($i = 1; $i -le 10; $i++){
        $seq = $i.ToString().PadLeft(3, '0');
        $endpointName = 'rentalloc' + $seq;
        Write-Host ('adding endpoint ' + $endpointName + '...')
        Add-AmlWebServiceEndpoint -WebServiceId $scoringSvc.Id -EndpointName $endpointName `
            -Description $endpointName
    }

You have now created 10 endpoints, and they all contain the same trained model trained on customer001.csv. You can
view them in the Azure portal.

Update the endpoints to use separate training datasets using PowerShell

The next step is to update the endpoints with models uniquely trained on each customer's individual data. But first

you need to produce these models from the Bike Rental Training web service. Let's go back to the Bike Rental

Training web service. You need to call its BES endpoint 10 times with 10 different training datasets in order to

produce 10 different models. Use the Invoke-AmlWebServiceBESEndpoint PowerShell cmdlet to do this.

You will also need to provide credentials for your blob storage account in $configContent, namely in the fields
AccountName, AccountKey, and RelativeLocation. The AccountName can be one of your account names, as seen in
the Azure portal (Storage tab). Once you click on a storage account, its AccountKey can be found by pressing the
Manage Access Keys button at the bottom and copying the Primary Access Key. The RelativeLocation is the
path relative to your storage where a new model will be stored. For instance, the path hai/retrain/bike_rental/ in
the following script points to a container named hai, and /retrain/bike_rental/ are subfolders. Currently, you
cannot create subfolders through the portal UI, but there are several Azure Storage Explorers that allow you to do
so. It is recommended that you create a new container in your storage to store the new trained models (.iLearner
files) as follows: from your storage page, click the Add button at the bottom and name it retrain. In summary,
the necessary changes to the following script pertain to AccountName, AccountKey, and RelativeLocation (for
example, "retrain/model' + $seq + '.ilearner").

    # Invoke the retraining API 10 times
    # This is the default (and the only) endpoint on the training web service
    $trainingSvcEp = (Get-AmlWebServiceEndpoint -WebServiceId $trainingSvc.Id)[0];
    $submitJobRequestUrl = $trainingSvcEp.ApiLocation + '/jobs?api-version=2.0';
    $apiKey = $trainingSvcEp.PrimaryKey;
    For ($i = 1; $i -le 10; $i++){
        $seq = $i.ToString().PadLeft(3, '0');
        $inputFileName = 'https://bostonmtc.blob.core.windows.net/hai/retrain/bike_rental/BikeRental' + $seq + '.csv';
        $configContent = '{ "GlobalParameters": { "URI": "' + $inputFileName + '" }, "Outputs": { "output1": { ' +
            '"ConnectionString": "DefaultEndpointsProtocol=https;AccountName=<myaccount>;AccountKey=<mykey>", ' +
            '"RelativeLocation": "hai/retrain/bike_rental/model' + $seq + '.ilearner" } } }';
        Write-Host ('training regression model on ' + $inputFileName + ' for rental location ' + $seq + '...');
        Invoke-AmlWebServiceBESEndpoint -JobConfigString $configContent -SubmitJobRequestUrl $submitJobRequestUrl `
            -ApiKey $apiKey
    }

NOTE

The BES endpoint is the only supported mode for this operation. RRS cannot be used for producing trained models.

As you can see above, instead of constructing 10 different BES job configuration JSON files, you dynamically create
the config string and then feed it to the JobConfigString parameter of the Invoke-AmlWebServiceBESEndpoint
cmdlet. There's really no need to keep a copy on disk.

If everything goes well, after a while you should see 10 .iLearner files, from model001.ilearner to
model010.ilearner, in your Azure storage account. Now you're ready to update the 10 scoring web service
endpoints with these models using the Patch-AmlWebServiceEndpoint PowerShell cmdlet. Remember again
that you can only patch the non-default endpoints you programmatically created earlier.

    # Patch the 10 endpoints with respective .ilearner models
    $baseLoc = 'http://bostonmtc.blob.core.windows.net/'
    $sasToken = '<my_blob_sas_token>'
    For ($i = 1; $i -le 10; $i++){
        $seq = $i.ToString().PadLeft(3, '0');
        $endpointName = 'rentalloc' + $seq;
        $relativeLoc = 'hai/retrain/bike_rental/model' + $seq + '.ilearner';
        Write-Host ('Patching endpoint ' + $endpointName + '...');
        Patch-AmlWebServiceEndpoint -WebServiceId $scoringSvc.Id -EndpointName $endpointName `
            -ResourceName 'Bike Rental [trained model]' -BaseLocation $baseLoc `
            -RelativeLocation $relativeLoc -SasBlobToken $sasToken
    }

This should run fairly quickly. When the execution finishes, you'll have successfully created 10 predictive web
service endpoints. Each one will contain a trained model uniquely trained on the dataset specific to a rental
location, all from a single training experiment. To verify this, you can try calling these endpoints using the
Invoke-AmlWebServiceRRSEndpoint cmdlet, providing them with the same input data. You should expect to
see different prediction results since the models are trained with different training sets.

Full PowerShell script

Here's the listing of the full source code:

    Import-Module .\AzureMLPS.dll
    # Assume the default configuration file exists and is properly set to point to the valid workspace.
    $scoringSvc = Get-AmlWebService | where Name -eq 'Bike Rental Scoring'
    $trainingSvc = Get-AmlWebService | where Name -eq 'Bike Rental Training'

    # Create 10 endpoints on the scoring web service
    For ($i = 1; $i -le 10; $i++){
        $seq = $i.ToString().PadLeft(3, '0');
        $endpointName = 'rentalloc' + $seq;
        Write-Host ('adding endpoint ' + $endpointName + '...')
        Add-AmlWebServiceEndpoint -WebServiceId $scoringSvc.Id -EndpointName $endpointName `
            -Description $endpointName
    }

    # Invoke the retraining API 10 times to produce 10 regression models in .ilearner format
    $trainingSvcEp = (Get-AmlWebServiceEndpoint -WebServiceId $trainingSvc.Id)[0];
    $submitJobRequestUrl = $trainingSvcEp.ApiLocation + '/jobs?api-version=2.0';
    $apiKey = $trainingSvcEp.PrimaryKey;
    For ($i = 1; $i -le 10; $i++){
        $seq = $i.ToString().PadLeft(3, '0');
        $inputFileName = 'https://bostonmtc.blob.core.windows.net/hai/retrain/bike_rental/BikeRental' + $seq + '.csv';
        $configContent = '{ "GlobalParameters": { "URI": "' + $inputFileName + '" }, "Outputs": { "output1": { ' +
            '"ConnectionString": "DefaultEndpointsProtocol=https;AccountName=<myaccount>;AccountKey=<mykey>", ' +
            '"RelativeLocation": "hai/retrain/bike_rental/model' + $seq + '.ilearner" } } }';
        Write-Host ('training regression model on ' + $inputFileName + ' for rental location ' + $seq + '...');
        Invoke-AmlWebServiceBESEndpoint -JobConfigString $configContent -SubmitJobRequestUrl $submitJobRequestUrl `
            -ApiKey $apiKey
    }

    # Patch the 10 endpoints with respective .ilearner models
    $baseLoc = 'http://bostonmtc.blob.core.windows.net/'
    $sasToken = '<my_blob_sas_token>'
    For ($i = 1; $i -le 10; $i++){
        $seq = $i.ToString().PadLeft(3, '0');
        $endpointName = 'rentalloc' + $seq;
        $relativeLoc = 'hai/retrain/bike_rental/model' + $seq + '.ilearner';
        Write-Host ('Patching endpoint ' + $endpointName + '...');
        Patch-AmlWebServiceEndpoint -WebServiceId $scoringSvc.Id -EndpointName $endpointName `
            -ResourceName 'Bike Rental [trained model]' -BaseLocation $baseLoc `
            -RelativeLocation $relativeLoc -SasBlobToken $sasToken
    }
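
The script assembles the BES job configuration by string concatenation, which is easy to get wrong when quoting. As a cross-check, the same JSON document can be built from a dictionary and then serialized. The following is a Python sketch rather than part of the AzureMLPS module: `build_bes_job_config` is a hypothetical helper, the JSON field names are copied from the script above, and the account name and key remain placeholders.

```python
import json


def build_bes_job_config(input_uri, account_name, account_key, relative_location):
    """Return the BES job configuration used above as a JSON string."""
    connection_string = (
        "DefaultEndpointsProtocol=https;"
        f"AccountName={account_name};AccountKey={account_key}"
    )
    config = {
        "GlobalParameters": {"URI": input_uri},
        "Outputs": {
            "output1": {
                "ConnectionString": connection_string,
                "RelativeLocation": relative_location,
            }
        },
    }
    return json.dumps(config)


if __name__ == "__main__":
    base = "https://bostonmtc.blob.core.windows.net/hai/retrain/bike_rental/"
    for i in range(1, 11):
        seq = f"{i:03d}"
        print(build_bes_job_config(
            input_uri=base + "BikeRental" + seq + ".csv",
            account_name="<myaccount>",   # placeholder, as in the script
            account_key="<mykey>",        # placeholder, as in the script
            relative_location="hai/retrain/bike_rental/model" + seq + ".ilearner",
        ))
```

Serializing from a dictionary guarantees well-formed JSON, so a malformed configuration can be ruled out if a job submission fails.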

How to choose algorithms for Microsoft Azure

Machine Learning

3/21/2018 • 15 min to read

The answer to the question "What machine learning algorithm should I use?" is always "It depends." It depends
on the size, quality, and nature of the data. It depends on what you want to do with the answer. It depends on how
the math of the algorithm was translated into instructions for the computer you are using. And it depends on how
much time you have. Even the most experienced data scientists can't tell which algorithm will perform best before
trying them.

The Machine Learning Algorithm Cheat Sheet

The Microsoft Azure Machine Learning Algorithm Cheat Sheet helps you choose the right machine
learning algorithm for your predictive analytics solutions from the Microsoft Azure Machine Learning library of
algorithms. This article walks you through how to use it.

NOTE

To download the cheat sheet and follow along with this article, go to Machine learning algorithm cheat sheet for Microsoft
Azure Machine Learning Studio.

This cheat sheet has a very specific audience in mind: a beginning data scientist with undergraduate-level
machine learning, trying to choose an algorithm to start with in Azure Machine Learning Studio. That means that
it makes some generalizations and oversimplifications, but it points you in a safe direction. It also means that
there are lots of algorithms not listed here. As Azure Machine Learning grows to encompass a more complete set
of available methods, we'll add them.

These recommendations are compiled feedback and tips from many data scientists and machine learning experts.
We didn't agree on everything, but I've tried to harmonize our opinions into a rough consensus. Most of the
statements of disagreement begin with "It depends…"

How to use the cheat sheet

Read the path and algorithm labels on the chart as "For <path label>, use <algorithm>." For example, "For
speed, use two class logistic regression." Sometimes more than one branch applies. Sometimes none of them are
a perfect fit. They're intended to be rule-of-thumb recommendations, so don't worry about it being exact. Several
data scientists I talked with said that the only sure way to find the very best algorithm is to try all of them.

Here's an example from the Azure AI Gallery of an experiment that tries several algorithms against the same data
and compares the results: Compare Multi-class Classifiers: Letter recognition.

TIP

To download and print a diagram that gives an overview of the capabilities of Machine Learning Studio, see Overview
diagram of Azure Machine Learning Studio capabilities.

Flavors of machine learning

Supervised

Unsupervised

Reinforcement learning

Considerations when choosing an algorithm

Accuracy

Training time

Linearity

Supervised learning algorithms make predictions based on a set of examples. For instance, historical stock prices

can be used to hazard guesses at future prices. Each example used for training is labeled with the value of interest

—in this case the stock price. A supervised learning algorithm looks for patterns in those value labels. It can use

any information that might be relevant—the day of the week, the season, the company's financial data, the type of

industry, the presence of disruptive geopolitical events—and each algorithm looks for different types of patterns.

After the algorithm has found the best pattern it can, it uses that pattern to make predictions for unlabeled testing

data—tomorrow's prices.
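
As a toy illustration of learning from labeled examples, the Python sketch below fits a straight line to a handful of (day, price) pairs by ordinary least squares and uses it to predict the price for an unlabeled day. The data are made up, and a straight line is only one of many patterns a supervised algorithm might look for. It is not how any particular Azure Machine Learning module is implemented.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(model, x):
    """Apply the fitted line to a new, unlabeled input."""
    slope, intercept = model
    return slope * x + intercept


if __name__ == "__main__":
    days = [1, 2, 3, 4]              # features (hypothetical trading days)
    prices = [3.0, 5.0, 7.0, 9.0]    # labels (hypothetical prices)
    model = fit_line(days, prices)
    print(predict(model, 5))         # prediction for the unlabeled day 5
```

The labeled pairs play the role of the training examples, and the call to `predict` plays the role of scoring unlabeled data.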

Supervised learning is a popular and useful type of machine learning. With one exception, all the modules in

Azure Machine Learning are supervised learning algorithms. There are several specific types of supervised

learning that are represented within Azure Machine Learning: classification, regression, and anomaly detection.

Classification. When the data are being used to predict a category, supervised learning is also called

classification. This is the case when assigning an image as a picture of either a 'cat' or a 'dog'. When there are

only two choices, it's called two-class or binomial classification. When there are more categories, as when

predicting the winner of the NCAA March Madness tournament, this problem is known as multi-class

classification.

Regression. When a value is being predicted, as with stock prices, supervised learning is called regression.

Anomaly detection. Sometimes the goal is to identify data points that are simply unusual. In fraud detection,

for example, any highly unusual credit card spending patterns are suspect. The possible variations are so

numerous and the training examples so few that it's not feasible to learn what fraudulent activity looks like.

The approach that anomaly detection takes is to simply learn what normal activity looks like (using a history

of non-fraudulent transactions) and identify anything that is significantly different.
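
The "learn what normal looks like" idea can be sketched in a few lines of Python. This example assumes a deliberately simple notion of normal (the mean and standard deviation of past transaction amounts) and flags anything more than three standard deviations away. The threshold, the function names, and the data are illustrative assumptions, not the method used by the anomaly detection modules in Azure Machine Learning.

```python
from statistics import mean, stdev


def fit_normal_profile(amounts):
    """Summarize non-fraudulent history as (mean, standard deviation)."""
    return mean(amounts), stdev(amounts)


def is_anomalous(amount, profile, threshold=3.0):
    """Flag an amount that lies more than `threshold` deviations from the mean."""
    mu, sigma = profile
    return abs(amount - mu) > threshold * sigma


if __name__ == "__main__":
    history = [20, 25, 22, 19, 24, 21, 23, 20]   # made-up normal spending
    profile = fit_normal_profile(history)
    for amount in (22, 500):
        print(amount, is_anomalous(amount, profile))
```

Note that only normal transactions are needed to build the profile. No examples of fraud are required.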

In unsupervised learning, data points have no labels associated with them. Instead, the goal of an unsupervised

learning algorithm is to organize the data in some way or to describe its structure. This can mean grouping it into

clusters or finding different ways of looking at complex data so that it appears simpler or more organized.

In reinforcement learning, the algorithm gets to choose an action in response to each data point. The learning

algorithm also receives a reward signal a short time later, indicating how good the decision was. Based on this,

the algorithm modifies its strategy in order to achieve the highest reward. Currently there are no reinforcement

learning algorithm modules in Azure Machine Learning. Reinforcement learning is common in robotics, where

the set of sensor readings at one point in time is a data point, and the algorithm must choose the robot's next

action. It is also a natural fit for Internet of Things applications.
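
The choose, observe, adjust loop described above can be sketched with a toy example. This is a minimal sketch, assuming an epsilon-greedy strategy and fixed payoffs per action; both are illustrative choices, and all names and numbers are hypothetical.

```python
import random

def run_epsilon_greedy(payoffs, steps=200, epsilon=0.1, seed=0):
    """Choose actions, observe rewards, and refine per-action value estimates."""
    rng = random.Random(seed)
    estimates = [0.0] * len(payoffs)
    counts = [0] * len(payoffs)
    for step in range(steps):
        if step < len(payoffs):
            action = step  # try every action once before trusting the estimates
        elif rng.random() < epsilon:
            action = rng.randrange(len(payoffs))  # explore a random action
        else:
            action = max(range(len(payoffs)), key=lambda a: estimates[a])  # exploit
        reward = payoffs[action]  # reward signal (deterministic in this toy setup)
        counts[action] += 1
        # Running average of the rewards seen so far for this action.
        estimates[action] += (reward - estimates[action]) / counts[action]
    return estimates, counts

estimates, counts = run_epsilon_greedy([0.2, 0.8, 0.5])
best_action = max(range(len(estimates)), key=lambda a: estimates[a])
print(best_action, counts)
```

In a realistic setting the rewards would be noisy and delayed, which is what makes the occasional exploration step worthwhile.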

Getting the most accurate answer possible isn't always necessary. Sometimes an approximation is adequate,

depending on what you want to use it for. If that's the case, you may be able to cut your processing time

dramatically by sticking with more approximate methods. Another advantage of more approximate methods is

that they naturally tend to avoid overfitting.

The number of minutes or hours necessary to train a model varies a great deal between algorithms. Training time

is often closely tied to accuracy—one typically accompanies the other. In addition, some algorithms are more

sensitive to the number of data points than others. When time is limited it can drive the choice of algorithm,

especially when the data set is large.

Lots of machine learning algorithms make use of linearity. Linear classification algorithms assume that classes
can be separated by a straight line (or its higher-dimensional analog). These include logistic regression and
support vector machines (as implemented in Azure Machine Learning). Linear regression algorithms assume that
data trends follow a straight line. These assumptions aren't bad for some problems, but on others they bring
accuracy down.

Non-linear class boundary - relying on a linear classification algorithm would result in low accuracy

Data with a nonlinear trend - using a linear regression method would generate much larger errors than
necessary

Despite their dangers, linear algorithms are very popular as a first line of attack. They tend to be algorithmically
simple and fast to train.

Number of parameters

Parameters are the knobs a data scientist gets to turn when setting up an algorithm. They are numbers that affect

the algorithm's behavior, such as error tolerance or number of iterations, or options between variants of how the

algorithm behaves. The training time and accuracy of the algorithm can sometimes be quite sensitive to getting

just the right settings. Typically, algorithms with large numbers of parameters require the most trial and error to find

a good combination.
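
To get a feel for why the search gets expensive, here is a small sketch that enumerates every combination of a few candidate settings and keeps the best one. The parameter names, candidate values, and scoring function are hypothetical stand-ins; in practice the score would come from training and evaluating a model with each setting.

```python
import itertools

def grid_combinations(grid):
    """Yield every combination of settings from a dict of name -> candidate values."""
    names = list(grid)
    for values in itertools.product(*(grid[name] for name in names)):
        yield dict(zip(names, values))

def best_setting(grid, score):
    """Return the combination with the highest score."""
    return max(grid_combinations(grid), key=score)

# Hypothetical parameters and candidate values (illustrative only).
grid = {
    "learning_rate": [0.01, 0.1, 1.0],
    "iterations": [10, 100, 1000],
    "tolerance": [1e-3, 1e-5],
}

def toy_score(setting):
    """Stand-in for 'train a model and measure its accuracy'."""
    return -abs(setting["learning_rate"] - 0.1) - abs(setting["iterations"] - 100) / 1000

print(len(list(grid_combinations(grid))))  # 3 * 3 * 2 combinations
print(best_setting(grid, toy_score))
```

Each additional parameter multiplies the number of combinations, so a fourth parameter with three candidate values would triple the work.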

Alternatively, there is a parameter sweeping module block in Azure Machine Learning that automatically tries all
parameter combinations at whatever granularity you choose. While this is a great way to make sure you've
spanned the parameter space, the time required to train a model increases exponentially with the number of
parameters.

The upside is that having many parameters typically indicates that an algorithm has greater flexibility. It can often
achieve very good accuracy, provided you can find the right combination of parameter settings.

Number of features

For certain types of data, the number of features can be very large compared to the number of data points. This is
often the case with genetics or textual data. The large number of features can bog down some learning
algorithms, making training time unfeasibly long. Support Vector Machines are particularly well suited to this
case (see below).

Special cases

Some learning algorithms make particular assumptions about the structure of the data or the desired results. If
you can find one that fits your needs, it can give you more useful results, more accurate predictions, or faster
training times.

ALGORITHM                           | ACCURACY | TRAINING TIME | LINEARITY | PARAMETERS | NOTES
Two-class classification            |          |               |           |            |
logistic regression                 |          | ●             | ●         | 5          |
decision forest                     | ●        | ○             |           | 6          |
decision jungle                     | ●        | ○             |           | 6          | Low memory footprint
boosted decision tree               | ●        | ○             |           | 6          | Large memory footprint
neural network                      | ●        |               |           | 9          | Additional customization is possible
averaged perceptron                 | ○        | ○             | ●         | 4          |
support vector machine              |          | ○             | ●         | 5          | Good for large feature sets
locally deep support vector machine | ○        |               |           | 8          | Good for large feature sets
Bayes' point machine                |          | ○             | ●         | 3          |
Multi-class classification          |          |               |           |            |
logistic regression                 |          | ●             | ●         | 5          |
decision forest                     | ●        | ○             |           | 6          |
decision jungle                     | ●        | ○             |           | 6          | Low memory footprint
neural network                      | ●        |               |           | 9          | Additional customization is possible
one-v-all                           | -        | -             | -         | -          | See properties of the two-class method selected
Regression                          |          |               |           |            |
linear                              |          | ●             | ●         | 4          |
Bayesian linear                     |          | ○             | ●         | 2          |
decision forest                     | ●        | ○             |           | 6          |
boosted decision tree               | ●        | ○             |           | 5          | Large memory footprint
fast forest quantile                | ●        | ○             |           | 9          | Distributions rather than point predictions
neural network                      | ●        |               |           | 9          | Additional customization is possible
Poisson                             |          |               | ●         | 5          | Technically log-linear. For predicting counts
ordinal                             |          |               |           | 0          | For predicting rank-ordering
Anomaly detection                   |          |               |           |            |
support vector machine              | ○        | ○             |           | 2          | Especially good for large feature sets
PCA-based anomaly detection         |          | ○             | ●         | 3          |
K-means                             |          | ○             | ●         | 4          | A clustering algorithm

Algorithm properties:

● - shows excellent accuracy, fast training times, and the use of linearity

○ - shows good accuracy and moderate training times

Algorithm notes

Linear regression

As mentioned previously, linear regression fits a line (or plane, or hyperplane) to the data set. It's a workhorse,
simple and fast, but it may be overly simplistic for some problems. Check here for a linear regression tutorial.

Data with a linear trend

Logistic regression

Although it confusingly includes 'regression' in the name, logistic regression is actually a powerful tool for
two-class and multiclass classification. It's fast and simple. The fact that it uses an 'S'-shaped curve instead of a

straight line makes it a natural fit for dividing data into groups. Logistic regression gives linear class boundaries,

so when you use it, make sure a linear approximation is something you can live with.
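
As a rough illustration of the 'S'-shaped curve and the linear boundary it produces, the sketch below fits a logistic curve to made-up two-class data with a single feature. It assumes plain batch gradient descent on the log loss, which is one simple way to fit the model and not necessarily what the Azure Machine Learning module does; the data and function names are hypothetical.

```python
import math

def sigmoid(z):
    """The S-shaped logistic curve."""
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic_1d(xs, ys, learning_rate=0.1, epochs=5000):
    """Fit p(y=1|x) = sigmoid(w*x + b) by batch gradient descent on log loss."""
    center = sum(xs) / len(xs)  # centering the feature speeds up convergence
    centered = [x - center for x in xs]
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        errors = [sigmoid(w * x + b) - y for x, y in zip(centered, ys)]
        w -= learning_rate * sum(e * x for e, x in zip(errors, centered)) / n
        b -= learning_rate * sum(errors) / n
    return w, b - w * center  # express the intercept in the original units

# Made-up data: class 0 at low x, class 1 at high x, with some overlap.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [0, 0, 0, 1, 0, 1, 1, 1]

w, b = fit_logistic_1d(xs, ys)
boundary = -b / w  # the x value where the predicted probability is 0.5
print(round(boundary, 3))
```

With one feature the class boundary is a single point; with more features the same model gives a line or hyperplane.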

A logistic regression to two-class data with just one feature - the class boundary is the point at which the
logistic curve is just as close to both classes

Trees, forests, and jungles

Decision forests (regression, two-class, and multiclass), decision jungles (two-class and multiclass), and boosted

decision trees (regression and two-class) are all based on decision trees, a foundational machine learning concept.

There are many variants of decision trees, but they all do the same thing—subdivide the feature space into

regions with mostly the same label. These can be regions of consistent category or of constant value, depending

on whether you are doing classification or regression.

A decision tree subdivides a feature space into regions of roughly uniform values

Because a feature space can be subdivided into arbitrarily small regions, it's easy to imagine dividing it finely

enough to have one data point per region. This is an extreme example of overfitting. In order to avoid this, a large

set of trees is constructed, with special mathematical care taken that the trees are not correlated. The average of

this "decision forest" is a tree that avoids overfitting. Decision forests can use a lot of memory. Decision jungles

are a variant that consumes less memory at the expense of a slightly longer training time.

Boosted decision trees avoid overfitting by limiting how many times they can subdivide and how few data points

are allowed in each region. The algorithm constructs a sequence of trees, each of which learns to compensate for

the error left by the tree before. The result is a very accurate learner that tends to use a lot of memory. For the full

technical description, check out Friedman's original paper.
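
The idea of each tree compensating for the error left by the one before can be sketched on a one-feature regression problem. This is a simplified illustration, not the module's implementation: it assumes squared error, uses one-split trees ("stumps") as the most restricted form of limited subdivision, and applies a shrinkage factor; the data and names are made up.

```python
def fit_stump(xs, targets):
    """Fit a one-split tree: pick the threshold that minimizes squared error."""
    best = None
    for threshold in sorted(set(xs))[1:]:
        left = [t for x, t in zip(xs, targets) if x < threshold]
        right = [t for x, t in zip(xs, targets) if x >= threshold]
        left_mean, right_mean = sum(left) / len(left), sum(right) / len(right)
        error = (sum((t - left_mean) ** 2 for t in left)
                 + sum((t - right_mean) ** 2 for t in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x < threshold else right_mean

def fit_boosted_stumps(xs, ys, rounds=20, shrinkage=0.5):
    """Build stumps in sequence, each fitted to the residual error so far."""
    base = sum(ys) / len(ys)
    stumps = []
    predictions = [base] * len(xs)
    for _ in range(rounds):
        residuals = [y - p for y, p in zip(ys, predictions)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        predictions = [p + shrinkage * stump(x) for x, p in zip(xs, predictions)]
    return lambda x: base + shrinkage * sum(stump(x) for stump in stumps)

def mean_squared_error(model, xs, ys):
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

# Made-up data with a step-like, nonlinear trend.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [1.0, 1.2, 0.9, 4.0, 4.2, 3.9, 8.0, 8.1]

baseline = lambda x: sum(ys) / len(ys)  # always predict the average
boosted = fit_boosted_stumps(xs, ys)
print(mean_squared_error(baseline, xs, ys), mean_squared_error(boosted, xs, ys))
```

Keeping every stump in memory is a small-scale view of why the full algorithm tends to have a large memory footprint.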

Fast forest quantile regression is a variation of decision trees for the special case where you want to know not

only the typical (median) value of the data within a region, but also its distribution in the form of quantiles.

Neural networks and perceptronsNeural networks and perceptrons

Neural networks are brain-inspired learning algorithms covering multiclass, two-class, and regression problems.

They come in an infinite variety, but the neural networks within Azure Machine Learning are all of the form of

directed acyclic graphs. That means that input features are passed forward (never backward) through a sequence

of layers before being turned into outputs. In each layer, inputs are weighted in various combinations, summed,

and passed on to the next layer. This combination of simple calculations results in the ability to learn sophisticated

class boundaries and data trends, seemingly by magic. Many-layered networks of this sort perform the "deep

learning" that fuels so much tech reporting and science fiction.
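
The forward pass described above is short enough to write out. The sketch below uses hand-picked weights and a sigmoid activation (both assumptions made for illustration; a real network learns its weights during training) to show two simple layers producing a class boundary that no single straight line could.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def dense_layer(inputs, weights, biases, activation):
    """Weight the inputs in several combinations, sum them, and apply an activation."""
    return [
        activation(sum(w * x for w, x in zip(row, inputs)) + bias)
        for row, bias in zip(weights, biases)
    ]

def forward(inputs, layers):
    """Pass features forward through each layer in order; nothing flows backward."""
    values = inputs
    for weights, biases, activation in layers:
        values = dense_layer(values, weights, biases, activation)
    return values

# Hand-picked, illustrative weights: two inputs -> two hidden units -> one output.
layers = [
    ([[20.0, 20.0], [-20.0, -20.0]], [-10.0, 30.0], sigmoid),  # hidden layer
    ([[20.0, 20.0]], [-30.0], sigmoid),                        # output layer
]

for point in ([0, 0], [0, 1], [1, 0], [1, 1]):
    print(point, round(forward(point, layers)[0]))
```

The output is 1 only when exactly one input is 1, a pattern that a linear classifier cannot separate.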

This high performance doesn't come for free, though. Neural networks can take a long time to train, particularly

for large data sets with lots of features. They also have more parameters than most algorithms, which means that

parameter sweeping expands the training time a great deal. And for those overachievers who wish to specify their

own network structure, the possibilities are inexhaustible.

The boundaries learned by neural networks can be complex and irregular

The two-class averaged perceptron is neural networks' answer to skyrocketing training times. It uses a network
structure that gives linear class boundaries. It is almost primitive by today's standards, but it has a long history of
working robustly and is small enough to learn quickly.

SVMs

Support vector machines (SVMs) find the boundary that separates classes by as wide a margin as possible. When

the two classes can't be clearly separated, the algorithms find the best boundary they can. As written in Azure

Machine Learning, the two-class SVM does this with a straight line only. (In SVM-speak, it uses a linear kernel.)

Because it makes this linear approximation, it is able to run fairly quickly. Where it really shines is with

feature-intense data, like text or genomic data. In these cases SVMs are able to separate classes more quickly and with less

overfitting than most other algorithms, in addition to requiring only a modest amount of memory.
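
For a concrete picture of a linear-kernel SVM, here is a minimal sketch that trains one by subgradient descent on the regularized hinge loss. This is one textbook way to fit a linear SVM and is an assumption for illustration, not a description of how the Azure Machine Learning module is implemented; the data, function names, and settings are made up.

```python
def train_linear_svm(points, labels, regularization=0.01, learning_rate=0.1, epochs=200):
    """Minimize regularization * |w|^2 / 2 + average hinge loss by subgradient descent."""
    dims = len(points[0])
    w = [0.0] * dims
    b = 0.0
    n = len(points)
    for _ in range(epochs):
        grad_w = [regularization * wi for wi in w]
        grad_b = 0.0
        for x, y in zip(points, labels):
            margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
            if margin < 1:  # point is inside the margin or misclassified
                for i in range(dims):
                    grad_w[i] -= y * x[i] / n
                grad_b -= y / n
        w = [wi - learning_rate * gi for wi, gi in zip(w, grad_w)]
        b -= learning_rate * grad_b
    return w, b

def predict(w, b, x):
    """Classify by which side of the straight-line boundary the point falls on."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b >= 0 else -1

# Made-up, clearly separable two-class data; labels are -1 and +1.
points = [(-3.0, -2.0), (-2.0, -3.0), (-2.0, -2.0), (2.0, 2.0), (3.0, 2.0), (2.0, 3.0)]
labels = [-1, -1, -1, 1, 1, 1]

w, b = train_linear_svm(points, labels)
print([predict(w, b, x) for x in points])
```

Only points that fall inside the margin contribute to each update, which is why the boundary ends up determined by the few points closest to it.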

A typical support vector machine class boundary maximizes the margin separating two classes

Another product of Microsoft Research, the two-class locally deep SVM is a non-linear variant of SVM that
retains most of the speed and memory efficiency of the linear version. It is ideal for cases where the linear
approach doesn't give accurate enough answers. The developers kept it fast by breaking down the problem into a
bunch of small linear SVM problems. Read the full description for the details on how they pulled off this trick.

Using a clever extension of nonlinear SVMs, the one-class SVM draws a boundary that tightly outlines the entire
data set. It is useful for anomaly detection. Any new data points that fall far outside that boundary are unusual
enough to be noteworthy.

Bayesian methods

Bayesian methods have a highly desirable quality: they avoid overfitting. They do this by making some
assumptions beforehand about the likely distribution of the answer. Another byproduct of this approach is that
they have very few parameters. Azure Machine Learning has Bayesian algorithms for both classification
(Two-class Bayes' point machine) and regression (Bayesian linear regression). Note that these assume that the
data can be split or fit with a straight line.

On a historical note, Bayes' point machines were developed at Microsoft Research. They have some exceptionally
beautiful theoretical work behind them. The interested student is directed to the original article in JMLR and an
insightful blog by Chris Bishop.

Specialized algorithms

If you have a very specific goal you may be in luck. Within the Azure Machine Learning collection, there are
algorithms that specialize in:
rank prediction (ordinal regression),
count prediction (Poisson regression),
anomaly detection (one based on principal components analysis and one based on support vector machines),
clustering (K-means).

PCA-based anomaly detection - the vast majority of the data falls into a stereotypical distribution; points

deviating dramatically from that distribution are suspect

A data set is grouped into five clusters using K-means
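
For readers who want to see the clustering loop itself, the following is a minimal sketch of K-means (Lloyd's algorithm). It uses two clusters instead of five to keep the data small, and it seeds the centers with the first k points, which is a naive assumption made for illustration; the data and names are made up.

```python
def kmeans(points, k, iterations=20):
    """Alternate between assigning points to the nearest center and re-centering."""
    centers = [points[i] for i in range(k)]  # naive seeding: the first k points
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])),
            )
            clusters[nearest].append(p)
        # Move each center to the mean of its cluster (keep it if the cluster is empty).
        centers = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster)) if cluster else centers[i]
            for i, cluster in enumerate(clusters)
        ]
    return centers, clusters

# Made-up, unlabeled points that form two obvious groups.
points = [(1.0, 1.0), (9.0, 9.0), (1.0, 2.0), (2.0, 1.0), (8.0, 9.0), (9.0, 8.0)]
centers, clusters = kmeans(points, k=2)
print(centers)
print(clusters)
```

No labels are used anywhere; the grouping comes entirely from the distances between points.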

There is also an ensemble one-v-all multiclass classifier, which breaks the N-class classification problem into N-1
two-class classification problems. The accuracy, training time, and linearity properties are determined by the
two-class classifiers used.

A pair of two-class classifiers combine to form a three-class classifier

Azure Machine Learning also includes access to a powerful machine learning framework under the title of
Vowpal Wabbit. VW defies categorization here, since it can learn both classification and regression problems and
can even learn from partially unlabeled data. You can configure it to use any one of a number of learning
algorithms, loss functions, and optimization algorithms. It was designed from the ground up to be efficient,
parallel, and extremely fast. It handles ridiculously large feature sets with little apparent effort. Started and led by
Microsoft Research's own John Langford, VW is a Formula One entry in a field of stock car algorithms. Not every
problem fits VW, but if yours does, it may be worth your while to climb the learning curve on its interface. It's also
available as stand-alone open source code in several languages.

More help with algorithms

For a downloadable infographic that describes algorithms and provides examples, see Downloadable

Infographic: Machine learning basics with algorithm examples.

For a list by category of all the machine learning algorithms available in Azure Machine Learning Studio, see

Initialize Model in the Machine Learning Studio Algorithm and Module Help.

For a complete alphabetical list of algorithms and modules in Azure Machine Learning Studio, see A-Z list of

Machine Learning Studio modules in Machine Learning Studio Algorithm and Module Help.

To download and print a diagram that gives an overview of the capabilities of Azure Machine Learning Studio,

see Overview diagram of Azure Machine Learning Studio capabilities.

Machine learning algorithm cheat sheet for Microsoft

Azure Machine Learning Studio

3/21/2018 • 5 min to read

The Microsoft Azure Machine Learning Algorithm Cheat Sheet helps you choose the right algorithm for a
predictive analytics model.

Azure Machine Learning Studio has a large library of algorithms from the regression, classification, clustering,
and anomaly detection families. Each is designed to address a different type of machine learning problem.

Download: Machine learning algorithm cheat sheet

Download the cheat sheet here: Machine Learning Algorithm Cheat Sheet (11x17 in.)

Download and print the Machine Learning Algorithm Cheat Sheet in tabloid size to keep it handy and get help
choosing an algorithm.

NOTE

See the article How to choose algorithms for Microsoft Azure Machine Learning for a detailed guide to using this cheat
sheet.

More help with algorithms

For help in using this cheat sheet for choosing the right algorithm, plus a deeper discussion of the different
types of machine learning algorithms and how they're used, see How to choose algorithms for Microsoft Azure
Machine Learning.

NOTE

Notes and terminology definitions for the machine learning algorithm cheat sheet