This project evaluates the quality of 17 different classifiers from Spark ML, Keras, and Scikit-Learn in order to detect or minimize ML bugs at an early stage, before a model is deployed. This is achieved by testing the code, the model, and the data, and by evaluating the individual classifiers against ML quality attributes on three popular classification datasets.
● The goal is to show how open-source ML systems should be tested using state-of-the-art solutions, i.e., model behavioral testing, to build user confidence in deploying these systems in operational settings.
The implementation of the project requires addressing several questions.
Our objective is to determine ways to improve ML system quality by testing the data, model, and code using quantitative and qualitative metrics: performance, reproducibility, correctness, robustness, and explainability. Therefore, the experimentation (implementation) must be able to answer the following questions, and the design should address each of them.
● What are the most appropriate or ideal classifiers for the problem at hand, what are the most effective evaluation metrics, and what makes the classifier perform best?
○ Precision, Recall, Accuracy, ROC, confusion matrix, classification report
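A minimal sketch of how these metrics could be computed with Scikit-Learn, assuming a binary task and a fitted classifier that exposes predict_proba; the names clf, X_test, and y_test are placeholders for whichever classifier and split are under test:

```python
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             roc_auc_score, confusion_matrix, classification_report)

# clf is any fitted classifier; X_test / y_test are the held-out split
y_pred = clf.predict(X_test)
y_proba = clf.predict_proba(X_test)[:, 1]  # positive-class scores for ROC AUC

print("Accuracy :", accuracy_score(y_test, y_pred))
print("Precision:", precision_score(y_test, y_pred))
print("Recall   :", recall_score(y_test, y_pred))
print("ROC AUC  :", roc_auc_score(y_test, y_proba))
print("Confusion matrix:\n", confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))
```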
● Which classifiers remain robust under data transformations such as shuffling the training instances, adding adversarial examples, and scaling the data? In other words, which classifier is robust to slight changes in the input data or to synthetic datasets? Also,
○ What are the main factors or parameters that contribute to sensitivity? (See the perturbation sketch below.)
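One way to probe this question is a simple perturbation (invariance) check: predictions on slightly noised inputs should largely agree with predictions on the originals. A hedged sketch, assuming a fitted Scikit-Learn classifier clf and a numeric feature matrix X_test; the function name and noise scale are illustrative choices, not a prescribed method:

```python
import numpy as np

rng = np.random.default_rng(42)

def prediction_stability(clf, X, noise_scale=0.01, n_trials=5):
    """Fraction of predictions unchanged after small Gaussian perturbations."""
    base = clf.predict(X)
    agreements = []
    for _ in range(n_trials):
        X_noisy = X + rng.normal(0.0, noise_scale * X.std(axis=0), size=X.shape)
        agreements.append(np.mean(clf.predict(X_noisy) == base))
    return float(np.mean(agreements))

# A stability score close to 1.0 suggests robustness to small input changes
# print(prediction_stability(clf, X_test))
```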
● What methods or model parameters (if any) make the black-box decision-making process more explainable, and which classifier outputs are explainable and interpretable? i.e.,
■ Explainability in native form (without any explainability tools) and with explainability tools (SHAP and LIME), as sketched after these points
■ For example, the decision making of a decision tree is easy to understand at a high level (a chain of if-else statements)
■ How do the input features contribute to the model output?
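A minimal sketch of tool-based explainability, assuming a fitted tree-based Scikit-Learn model (here called rf) and a pandas feature frame X_test; the shap package must be installed separately, and the variable names are placeholders:

```python
import shap

# TreeExplainer supports tree ensembles such as RandomForest / gradient boosting
explainer = shap.TreeExplainer(rf)
shap_values = explainer.shap_values(X_test)

# Global view: which features contribute most to the model output on average
shap.summary_plot(shap_values, X_test)
```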
● What are the main factors, parameters, or methods that enhance ML reproducibility, and why is model reproducibility difficult to achieve? And
○ Which classifier is reproducible and why?
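A sketch of a basic reproducibility check: train the same classifier twice with all seeds fixed and compare the resulting predictions. The names and the choice of RandomForest are illustrative; the same idea applies to PySpark (seed=) and Keras (tf.random.set_seed):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def train_once(X_train, y_train, X_test, seed=0):
    model = RandomForestClassifier(n_estimators=100, random_state=seed)
    model.fit(X_train, y_train)
    return model.predict(X_test)

# run_a = train_once(X_train, y_train, X_test, seed=0)
# run_b = train_once(X_train, y_train, X_test, seed=0)
# reproducible = np.array_equal(run_a, run_b)  # expected True when seeds and data match
```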
● What are the most appropriate classifiers and ideal performance metrics for raw data (without any transformation), unnormalized data (cleaned and transformed but not normalized), normalized data (cleaned, transformed, and normalized), and imbalanced data?
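These data variants could be prepared with standard Scikit-Learn constructs; a hedged sketch in which the imbalance handling simply uses class weights rather than resampling, and logistic regression stands in for any classifier:

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# Unnormalized variant: cleaned/encoded features, no scaling
unnormalized_clf = LogisticRegression(max_iter=1000, random_state=0)

# Normalized variant: the same model preceded by standard scaling
normalized_clf = Pipeline([
    ("scaler", StandardScaler()),
    ("model", LogisticRegression(max_iter=1000, random_state=0)),
])

# Imbalanced variant: reweight classes instead of resampling
weighted_clf = LogisticRegression(max_iter=1000, class_weight="balanced", random_state=0)
```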
● Which combination of qualitative and quantitative metrics (performance, robustness, correctness, reproducibility, and explainability) should ML practitioners consider or prioritize to get a holistic view of model behavior before they deploy a model? (A summary-table sketch follows these sub-points.)
○ Why does accuracy alone not provide a complete picture of the model?
○ Is it possible to tell how each of the metrics correlates with the classifiers?
■ For example, does a decision tree classifier emphasize robustness over explainability?
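To support that holistic view, the individual scores could be collected into one summary table per classifier. A minimal sketch with pandas; evaluate_accuracy and check_reproducibility are hypothetical helpers standing in for the project's own checks, and prediction_stability refers to the perturbation sketch above:

```python
import pandas as pd

results = []
for name, clf in classifiers.items():           # e.g. {"sklearn_svc": ..., "spark_svc": ...}
    results.append({
        "classifier": name,
        "accuracy": evaluate_accuracy(clf),      # hypothetical helper defined elsewhere
        "robustness": prediction_stability(clf, X_test),
        "reproducible": check_reproducibility(clf),  # hypothetical helper defined elsewhere
    })

summary = pd.DataFrame(results).set_index("classifier")
print(summary.sort_values("accuracy", ascending=False))
```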
● Do we get the same results from the various classification models when we apply the same data processing, use the same or similar model parameters/hyperparameters, and keep all other settings the same?
○ We need to provide valid reasons for both yes and no answers
● What are the unique challenges of model behavioral testing when applied to classification models from Scikit-Learn, Keras, and Spark ML?
● How can we adjust the workflow to handle data and concept drift?
○ It is possible that the data and the concept, i.e., the relationship between the inputs and the target, change over time and affect the quality of the ML system (model quality), for example its predictive power. What are the best practices to minimize this effect? (See the drift-monitoring sketch below.)
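A common first-line practice is to monitor the distribution of incoming features against the training distribution, for example with a two-sample Kolmogorov-Smirnov test, and retrain when drift is detected. A hedged sketch using SciPy, assuming numeric NumPy feature matrices:

```python
import numpy as np
from scipy.stats import ks_2samp

def drifted_features(X_train, X_live, alpha=0.01):
    """Return indices of features whose live distribution differs from training."""
    drifted = []
    for j in range(X_train.shape[1]):
        stat, p_value = ks_2samp(X_train[:, j], X_live[:, j])
        if p_value < alpha:
            drifted.append(j)
    return drifted

# If drifted_features(...) is non-empty, consider retraining or recalibrating the model
```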
● The analysis must be performed between classifiers within the same library and between libraries (the focus should be on the latter). For example,
○ Scikit-Learn linear SVM with other classifiers in Scikit-Learn and with Spark ML Linear SVM classifiers
○ We have 17 classifiers/algorithms to be evaluated or analyzed
■ Spark ML = 8, Scikit-Learn = 8, Keras = 1
Tools and Technical composition
● Programming Language and IDE: Python, Jupyter Notebook
● Development OS: Ubuntu (as long as we are using Jupyter, any OS is fine)
● Development Approach: Test-Driven Development
● Program constructs: Classes and Functions (I love functions)
● Required skillsets: the project is quite challenging
○ Someone who has done projects in:
○ ML, ML testing and quality assurance, EDA, ML workflow orchestration (tracking), ML model behavioral testing, etc.
● ML model properties to be evaluated practically
○ Performance, robustness, reproducibility, correctness, explainability, and interpretability
● ML frameworks to be used for the implementation: Spark ML, Scikit-Learn, and Keras
● It is crucial to clearly understand the differences between model evaluation and model testing, as well as between ML evaluation and ML testing
○ The primary focus of this project is model testing, not model evaluation
■ Model evaluation mainly relies on the model's performance metrics, whereas model testing goes well beyond that (see: "Writing Test Cases for Machine Learning Systems", Analytics Vidhya). A behavioral-test sketch follows.
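To make the distinction concrete, model testing adds behavioral checks on top of metric scores, for example a minimum functionality test and an invariance test written as pytest-style cases. A sketch, assuming a fitted classifier clf and hand-crafted rows whose expected labels are known; the feature values are purely illustrative:

```python
import numpy as np

def test_minimum_functionality():
    # Obvious, hand-crafted cases the model must get right regardless of overall accuracy
    X_trivial = np.array([[0.0, 0.0], [10.0, 10.0]])   # illustrative feature rows
    expected = np.array([0, 1])
    assert np.array_equal(clf.predict(X_trivial), expected)

def test_invariance_to_irrelevant_change():
    # A tiny perturbation of an irrelevant feature should not flip the prediction
    x = np.array([[1.0, 2.0]])
    x_perturbed = x + np.array([[0.0, 1e-6]])
    assert clf.predict(x)[0] == clf.predict(x_perturbed)[0]
```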
Classification Algorithms
We have selected 16 classifiers that share the same mathematical intuition in both Spark ML (i.e., PySpark) and Scikit-Learn. We also have one general classifier from Keras, which we want to evaluate against the rest of the classifiers. The classification algorithms are listed below; note that each algorithm has two versions, one from PySpark and one from Scikit-Learn.
● Linear SVC or SVM.SVC(2)
○ Max iterations, C (regularization strength), penalty (loss), fit intercept, random state
● Logistic Regression(2)
○ Solver, penalty, C (regularization strength), max iterations, random state
● Decision Trees(2) & Random Forest(2)
○ Max depth, number of estimators, impurity measure, max features, bootstrap technique, random state
● Gaussian Naive Bayes (2)
○ Smoothing, model type (Multinomial, Gaussian, Bernoulli)
○ The APIs do not expose many hyperparameters, as the model generalizes well
● GradientBoostingClassifier (Scikit-Learn) (1) and GBTClassifier (Spark ML) (1)
○ Max features, number of estimators, max depth, learning rate, loss/loss type, bootstrap technique, random state
● MLPClassifier(2)
○ Hidden layer size, activation, solver, max iterations, learning rate, batch size, alpha (regularization), random state
● One-vs-Rest(2)
○ estimator(baseline estimator), number of parallel jobs
● Keras Classifier(1)
○ Binary or multi-class general classifier, random state
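A minimal sketch of what the general Keras classifier could look like (binary case, seeds fixed for reproducibility); the layer sizes and optimizer are illustrative assumptions, not a prescribed architecture:

```python
import tensorflow as tf

tf.keras.utils.set_random_seed(42)  # fixes Python, NumPy, and TensorFlow seeds (TF >= 2.7)

def build_keras_classifier(n_features):
    model = tf.keras.Sequential([
        tf.keras.layers.Input(shape=(n_features,)),
        tf.keras.layers.Dense(32, activation="relu"),
        tf.keras.layers.Dense(1, activation="sigmoid"),  # use softmax + more units for multi-class
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    return model

# model = build_keras_classifier(X_train.shape[1])
# model.fit(X_train, y_train, epochs=20, batch_size=32, verbose=0)
```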
● Hyperparameter selection
○ The hyperparameters should be selected in such a way that they significantly contribute to the quality of the ML models. Considering three to four hyperparameters for each classifier seems reasonable.
○ The same set of hyperparameters should be available in both Scikit-Learn and Spark ML; otherwise, the comparison would not be fair (see the sketch below).
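For illustration, logistic regression exposes a matching set of hyperparameters in both libraries. Note that Scikit-Learn's C acts as an inverse regularization strength while Spark ML uses regParam directly, so the values must be aligned carefully (the exact mapping also depends on how each library scales the loss); the sketch below only shows where each setting lives:

```python
from sklearn.linear_model import LogisticRegression
from pyspark.ml.classification import LogisticRegression as SparkLogisticRegression

reg_strength = 1.0  # shared regularization setting (illustrative)

# Scikit-Learn: C is an inverse regularization strength
sk_lr = LogisticRegression(C=1.0 / reg_strength, max_iter=100, random_state=42)

# Spark ML: regParam is the regularization strength itself
spark_lr = SparkLogisticRegression(regParam=reg_strength, maxIter=100)
```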