# Azure DP-100 Summary
# Major Components of AML Workspaces
Workspace -> Top Level Resource for AML
- Compute Instances
- User Roles
- Compute Targets
- Experiments
- Pipelines
- Datasets
- Models (registered models)
- Deployment Endpoints
# Creating an AML Workspace
Resources created alongside the workspace
- Azure Storage Account
- Azure Container Registry
- Azure Application Insights
- Azure Key Vault
Workspace Settings
- Access Controls
- Event Subscriptions (generate alerts or triggers based on events)
- Alerts & Diagnostic Settings
AML Studio
- Author
  - Notebooks
  - Automated ML
  - Designer
- Assets
  - Datasets
  - Experiments
  - Models
  - Endpoints
- Monitoring
  - Compute
  - Datastores
  - Data Labelling
# IAM/RBAC
Users in Azure Active Directory are assigned specific roles, which grant access to resources in multiple ways (CLI, Portal, etc.).
# Experiments
A grouping of many runs
- Run: Single execution of a training script
  - Info Stored: Metadata about the run, metrics, etc.
Run Configuration
- Used when we want to run a training experiment on different compute targets.
Estimator Class
- Allows creating a run configuration with the AML Python SDK
Designer pipelines can only run on an Azure Machine Learning compute cluster
# Data Objects
Pipelines
Independently executable workflow of an ML task (orchestration).
- Steps that don't need to be re-run are skipped
- Each step can run in a separate compute target
- Dependencies are managed by the pipeline
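The step-skipping behaviour above can be illustrated with a toy sketch. This is not the AML SDK: `ToyPipeline`, `add_step`, and `run` are hypothetical names, and the fingerprinting scheme is an assumption chosen for illustration. A step re-runs only when its own parameters or its upstream outputs change.

```python
import hashlib
import json


class ToyPipeline:
    """Toy illustration of step reuse; hypothetical, not the AML SDK."""

    def __init__(self):
        self._steps = []     # (name, func, dependency names), in execution order
        self._cache = {}     # name -> (fingerprint, output)
        self.executed = []   # steps that actually ran during the last run()

    def add_step(self, name, func, depends_on=()):
        # Steps must be added after the steps they depend on.
        self._steps.append((name, func, tuple(depends_on)))

    def run(self, params):
        outputs = {}
        self.executed = []
        for name, func, deps in self._steps:
            inputs = [outputs[d] for d in deps]
            # Fingerprint the step's parameter and its upstream outputs.
            fingerprint = hashlib.sha256(
                json.dumps([params.get(name), inputs], sort_keys=True).encode()
            ).hexdigest()
            cached = self._cache.get(name)
            if cached and cached[0] == fingerprint:
                outputs[name] = cached[1]  # nothing changed: reuse the old output
            else:
                outputs[name] = func(params.get(name), *inputs)
                self._cache[name] = (fingerprint, outputs[name])
                self.executed.append(name)
        return outputs


pipeline = ToyPipeline()
pipeline.add_step("prep", lambda p: [x * p for x in [1, 2, 3]])
pipeline.add_step("train", lambda p, data: sum(data) + p, depends_on=["prep"])

first = pipeline.run({"prep": 2, "train": 0})
first_executed = list(pipeline.executed)    # both steps run the first time

second = pipeline.run({"prep": 2, "train": 1})
second_executed = list(pipeline.executed)   # only "train" changed, so only it re-runs
```

In this sketch the dependency order is simply the order in which steps are added; a real orchestrator would resolve the dependency graph itself.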
Datastores
An abstraction over Azure Storage services.
Datasets
References to where the data lives (tabular, file)
Dataset Management
- Versioning and tracking
- Monitor (data drift)
- Open datasets
Data Drift
- Change in model input data that leads to the degradation of model performance
- Possible causes:
  - Upstream process change
  - Quality issues
  - Natural causes
  - Covariate shift
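As a rough illustration of how a change in input data can be quantified, the sketch below computes a Population Stability Index (PSI) between a baseline sample and a newer sample of one feature. PSI is swapped in here as a simple, well-known drift measure; it is not claimed to be what AML's dataset monitors compute, and the function name, bin count, and sample data are assumptions.

```python
import math


def population_stability_index(baseline, target, bins=10):
    """PSI between two samples of one numeric feature (0.0 means identical bin shares)."""
    lo, hi = min(baseline), max(baseline)
    width = (hi - lo) / bins or 1.0  # fall back to 1.0 if the baseline is constant

    def proportions(values):
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            idx = min(max(idx, 0), bins - 1)  # clamp values outside the baseline range
            counts[idx] += 1
        # Floor each share at a tiny value so log() never sees zero.
        return [max(c / len(values), 1e-6) for c in counts]

    expected = proportions(baseline)
    actual = proportions(target)
    return sum((a - e) * math.log(a / e) for a, e in zip(actual, expected))


baseline = [float(v) for v in range(100)]
shifted = [v + 30.0 for v in baseline]   # simulated upstream change

psi_same = population_stability_index(baseline, baseline)
psi_shifted = population_stability_index(baseline, shifted)
```

A larger value means the newer sample is distributed less like the baseline; any alerting threshold is a judgment call for the specific dataset.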
Other Notes
- CSV files can expand up to 10x when loaded into a dataframe, and you want roughly double that amount of RAM (e.g. 20 GB of RAM for a 1 GB dataset).
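The RAM rule of thumb above is simple arithmetic; a minimal sketch, with the function and parameter names made up for illustration:

```python
def estimate_ram_gb(csv_size_gb, expansion_factor=10, headroom_factor=2):
    """Rule of thumb from the note: up to 10x expansion in a dataframe, then double it."""
    return csv_size_gb * expansion_factor * headroom_factor


ram_for_1gb_csv = estimate_ram_gb(1)  # 1 * 10 * 2 = 20 GB
```

Both factors are worst-case estimates from the note, not measured values; actual expansion depends on column types.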
# Feature Selection (AML Exam Concepts)
- Pearson's correlation
  - It makes no difference which variable is treated as dependent and which as independent
  - Suited to linearly related data
- Mutual Info Score
  - Measures the reduction in uncertainty about one variable given knowledge of another
  - Self-information: h(x) = -log(p(x))
- Kendall's Correlation Coefficient
  - A nonparametric analysis of the strength of a relationship between 2 variables
  - Variables are measured on an ordinal scale and the data needs to have a monotonic relationship
  - Usually preferred over Spearman
- Chi-Squared Stat
  - Reveals how close expected values are to actual results
  - Used for categorical variables
- Fisher Score
  - Measures the variance between expected value and observed value
  - Determines whether features are independent
- Count-based Feature Selection
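To make several of the measures above concrete, here is a minimal pure-Python sketch of Pearson's correlation, Kendall's tau, the chi-squared statistic, and mutual information. These are textbook formulas, not the AML implementations; the Kendall version is tau-a and assumes no ties, and all sample data is made up. Fisher score and count-based selection are not covered.

```python
import math
from collections import Counter


def pearson(x, y):
    """Pearson's r: strength of the linear relationship between x and y."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def kendall_tau(x, y):
    """Kendall's tau-a: (concordant - discordant pairs) / all pairs; assumes no ties."""
    n = len(x)
    score = 0
    for i in range(n):
        for j in range(i + 1, n):
            s = (x[i] - x[j]) * (y[i] - y[j])
            score += (s > 0) - (s < 0)
    return score / (n * (n - 1) / 2)


def chi_squared(observed):
    """Chi-squared statistic for a contingency table given as a list of rows."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(observed):
        for j, obs in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            stat += (obs - expected) ** 2 / expected
    return stat


def mutual_information(x, y):
    """Mutual information (in nats) between two discrete sequences."""
    n = len(x)
    px, py, pxy = Counter(x), Counter(y), Counter(zip(x, y))
    return sum(
        (c / n) * math.log((c / n) / ((px[a] / n) * (py[b] / n)))
        for (a, b), c in pxy.items()
    )


x = [1, 2, 3, 4, 5]
y_linear = [2, 4, 6, 8, 10]
y_cubic = [1, 8, 27, 64, 125]            # monotonic but not linear

r_linear = pearson(x, y_linear)           # close to 1 for a perfectly linear relation
r_swapped = pearson(y_linear, x)          # same value: argument order does not matter
r_cubic = pearson(x, y_cubic)             # below 1: the relation is not linear
tau_cubic = kendall_tau(x, y_cubic)       # 1.0: every pair is ordered the same way

chi_independent = chi_squared([[10, 10], [10, 10]])   # observed equals expected
chi_dependent = chi_squared([[20, 0], [0, 20]])

mi_identical = mutual_information([0, 0, 1, 1], [0, 0, 1, 1])
mi_independent = mutual_information([0, 0, 1, 1], [0, 1, 0, 1])
```

The cubic example shows why a rank-based measure such as Kendall's tau is useful for monotonic, non-linear relationships, where Pearson's r understates the dependence.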
# Inferencing Notes
- For inferencing in production, use AKS (which supports GPU).
- AML compute instances used for deploying real-time services do not provide GPU and should mainly be used for batch-inferencing models.
- Azure Container Instances (ACI) are for testing/debugging and don't provide GPU (low-scale CPU workloads only).
# Authentication
Authentication Docs
Consuming a Web Service
Auth Method | ACI | AKS |
---|---|---|
Key | Disabled by default | Enabled by default |
Token | Not available | Disabled by default |
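When either method in the table is enabled, the client sends the key or token as a Bearer credential in the `Authorization` header. The sketch below only builds the request without sending it. The helper name, the URL, and the credential are placeholders, and the `{"data": ...}` payload shape is an assumption, since the expected input depends on the service's scoring script.

```python
import json
import urllib.request


def build_scoring_request(scoring_uri, payload, credential=None):
    """Build (but do not send) a POST request for a deployed web service."""
    headers = {"Content-Type": "application/json"}
    if credential is not None:
        # Keys and tokens are both passed as a Bearer credential.
        headers["Authorization"] = f"Bearer {credential}"
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        scoring_uri, data=body, headers=headers, method="POST"
    )


# Placeholder values for illustration only.
request = build_scoring_request(
    "http://example.com/score",
    {"data": [[1.0, 2.0, 3.0]]},
    credential="<key-or-token>",
)
anonymous_request = build_scoring_request("http://example.com/score", {"data": []})

# urllib.request.urlopen(request) would send it; omitted because the endpoint is hypothetical.
```

Leaving `credential` unset matches a service with authentication disabled, such as ACI with its default settings.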
# List of Common Modules in AML
- Clean Missing Data
- Select Columns in Dataset
- Normalize Data
- Partition and Sample Data
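For intuition, the four modules above correspond roughly to the following plain-Python operations. This is a sketch of the general idea only: the Designer modules offer more modes than shown, and the function names and the chosen modes (mean replacement, min-max scaling, head split) are assumptions.

```python
def clean_missing(values):
    """Replace None with the mean of the observed values (one possible cleaning mode)."""
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]


def select_columns(rows, columns):
    """Keep only the named columns from a list of dict rows."""
    return [{c: row[c] for c in columns} for row in rows]


def min_max_normalize(values):
    """Rescale values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


def partition(rows, fraction):
    """Split rows into the first `fraction` of rows and the remainder."""
    cut = int(len(rows) * fraction)
    return rows[:cut], rows[cut:]


rows = [
    {"age": 20.0, "income": 10.0, "id": 1},
    {"age": None, "income": 20.0, "id": 2},
    {"age": 40.0, "income": 30.0, "id": 3},
]
features = select_columns(rows, ["age", "income"])
ages = clean_missing([row["age"] for row in features])
scaled_ages = min_max_normalize(ages)
train_rows, test_rows = partition(list(range(10)), 0.8)
```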
# List of Common Functions/Methods in the SDK
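Package | Class | Example | Description |
---|---|---|---|
azureml.core | Workspace | Workspace.from_config() | Load a workspace from a saved config file |
azureml.core | Experiment | Experiment(workspace=ws, name='name') | Create or get an experiment in the workspace |
azureml.core | Experiment | experiment.start_logging() | Start an interactive run and return a Run object |
azureml.core | Datastore | Datastore.register_azure_blob_container(...) | Register an Azure Blob container as a datastore |
azureml.core | Datastore | Datastore.get(workspace, datastore_name) | Get a registered datastore by name |
azureml.core | Dataset | Dataset.Tabular.from_delimited_files(path=datastore_path) | Create a tabular dataset from delimited (e.g. CSV) files |
azureml.core | Dataset | my_dataset.register(ws, name, description) | Register the dataset in the workspace |
azureml.core | Dataset | Dataset.get_by_name(ws, name) | Get a registered dataset by name |
azureml.train.estimator | Estimator | Estimator(source_directory, script_params, ...) | Define a training run (script, parameters, compute target) |
azureml.core | Run | Run.get_context() | Get the current run from inside a training script |
azureml.core | Run | run.log(), run.log_list(), run.log_row(), etc. | Log metrics to the run |
azureml.core | Run | run.get_details(), run.get_metrics(), run.get_file_names() | Retrieve run details, logged metrics, and output file names |
azureml.widgets | RunDetails | RunDetails(run).show() | Show a notebook widget with run progress |
azureml.core.webservice | AciWebservice | aci_service.get_logs() | Retrieve logs from a deployed ACI web service |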