CSCI 141 - Spring 2020
Assignment 5: Cancer Classification using Machine Learning
1 Overview
The goal of this project is to gain more practice with using functions, lists and dictionaries, and to gain some intuition for Machine Learning, the field of computer science concerned with writing algorithms that allow computers to “learn” from data. One field in which these techniques are making a difference is medicine.
The problem we’ll be solving is as follows: Given a data file containing hundreds of patient records with values describing measurements of cancer tumors and whether each tumor is malignant or benign, develop a simple rule-based classifier that can be used to predict whether an as-yet-unseen tumor is malignant or benign.
The general idea is that malignant tumors are different than benign tumors. Malignant tumors tend to have larger radii, to be more smooth, to be more symmetric, etc. Measurements have been taken on many tumors whose class (malignant or benign) is known. The code you are going to write will get the average score across all the malignant tumors for an attribute (e.g. ‘area’) as well as the average score for that attribute for benign tumors. Let’s say that the average area for malignant tumors is 100, and for benign tumors is 50. We can then use that information to try to predict whether a given tumor is malignant or benign.
Imagine you are presented with a new tumor and told the area was 99. All else being equal, we would have reason to think this tumor is more likely to be malignant than had its area been 51. Based on this intuition, we are going to create a simple classification scheme. We will calculate the midpoint between the malignant average and the benign average (75 in our hypothetical example), and simply say that for each new tumor, if its value for that attribute is greater than or equal to the midpoint value for that attribute, that is one vote for the tumor being malignant. Each attribute that we are using produces a vote, and at the end of counting votes for each attribute, if the malignant votes are greater than or equal to the benign votes, we predict that the tumor is malignant.
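The single-attribute rule above can be sketched in a few lines of Python. This is only an illustration using the hypothetical averages from the text (100 and 50); the function name vote is an assumption made for this example and is not part of the skeleton code.

```python
# Hypothetical averages for one attribute, taken from the example in the text.
malignant_avg = 100
benign_avg = 50

# The midpoint sits halfway between the two class averages.
midpoint = (malignant_avg + benign_avg) / 2


def vote(value, midpoint):
    """Return "M" if value is at or above the midpoint, otherwise "B".

    Illustrative helper only; the name is not taken from the skeleton code.
    """
    if value >= midpoint:
        return "M"
    return "B"


print(midpoint)            # 75.0
print(vote(99, midpoint))  # M
print(vote(51, midpoint))  # B
print(vote(75, midpoint))  # M (a value equal to the midpoint counts as malignant)
```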
2 Machine Learning Framework
“Machine learning” is a popular buzzword that might evoke computer brain simulations, or robots walking among humans. In reality (for now, anyway), machine learning refers to something less fanciful: algorithms that use previously observed data to make predictions about new data. It may sound less glamorous than fully sentient robots, but that’s exactly what was described above! Machine learning allows us to solve problems by considering hundreds or thousands of attributes (and their combinations) - far more than a human alone could do. You can get more sophisticated about the specifics of how you go about this, but that’s the core of what machine learning really means.
If using data to make predictions on new data is our goal, you might think it makes sense to use all the data we have to learn from. But in fact, if we truly don’t know the labels (e.g., malignant or benign) of the data we’re testing our algorithm on, we won’t have any idea whether it’s doing a good job! For this reason, it makes sense to split the data we have labels for into a training set, which we’ll use to “learn” from, and a test set, which we’ll use to evaluate how well the algorithm does on new data (i.e., data it wasn’t trained on). We will take about 80% of the data as our training set, and use the remaining 20% as our test set.
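The data files for this assignment already come split, so you will not need to do this yourself. Purely to illustrate the idea, here is a minimal sketch of an 80/20 split; the function name split_records, its parameters, and the toy records are all assumptions made for this example.

```python
import random


def split_records(records, train_fraction=0.8, seed=0):
    """Shuffle a copy of records and split it into (training, test) lists.

    Hypothetical helper: not part of the skeleton code.
    """
    shuffled = list(records)               # copy so the caller's list is untouched
    random.Random(seed).shuffle(shuffled)  # fixed seed makes the split repeatable
    cutoff = round(len(shuffled) * train_fraction)
    return shuffled[:cutoff], shuffled[cutoff:]


# Ten made-up records standing in for labeled tumors.
records = [{"ID": i} for i in range(10)]
train, test = split_records(records)
print(len(train), len(test))  # 8 2
```

Shuffling before splitting keeps the test set from being, say, only the last patients in the file.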
2.1 Training Phase
Here’s how our classifier will work: In the training phase, we will “learn” (read: compute) the average value of each attribute (e.g. area, smoothness, etc.) among the malignant tumors. We will also “learn” (again: compute) the average value of each attribute among benign tumors. Then we’ll compute the midpoint for each attribute. This collection of midpoints, one for each attribute, is our classifier.
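One possible way to organize the training phase is sketched below. It is not the skeleton code's structure: the function names average_by_class and train_classifier, the shortened attribute list, and the toy records are all assumptions made for this example.

```python
# Only two attributes are used here to keep the example short;
# the assignment uses all ten.
ATTRIBUTES = ["radius", "area"]


def average_by_class(records, attributes, label):
    """Average each attribute over the records whose "class" equals label."""
    matching = [r for r in records if r["class"] == label]
    return {a: sum(r[a] for r in matching) / len(matching) for a in attributes}


def train_classifier(records, attributes):
    """Return a dictionary mapping each attribute to its midpoint value."""
    malignant = average_by_class(records, attributes, "M")
    benign = average_by_class(records, attributes, "B")
    return {a: (malignant[a] + benign[a]) / 2 for a in attributes}


# Made-up training records (not real data).
toy_training_set = [
    {"ID": 1, "radius": 20.0, "area": 110.0, "class": "M"},
    {"ID": 2, "radius": 16.0, "area": 90.0, "class": "M"},
    {"ID": 3, "radius": 10.0, "area": 50.0, "class": "B"},
]

midpoints = train_classifier(toy_training_set, ATTRIBUTES)
print(midpoints)  # {'radius': 14.0, 'area': 75.0}
```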
2.2 Testing Phase
Having trained our classifier, we can now use it to make an educated guess about the label of a new tumor if we have the measurements of all of its attributes. Our educated guess will be pretty simple:
If the tumor’s value for an attribute is greater than or equal to the midpoint value for that attribute, cast one vote for the tumor being malignant.
If the tumor’s attribute value is less than the midpoint, cast one vote for the tumor being benign.
Tally up the votes cast according to these rules for each of the ten attributes. If the malignant votes are greater than or equal to the benign votes, we predict that the tumor is malignant; otherwise, we predict that it is benign.
If we want to use this classifier to diagnose people, we have an important question to answer: how good are our guesses? To answer this question, we’ll test our algorithm on the 20% of our data that we held out as the test set, which we didn’t use to train the classifier, but for which we do know the correct labels. Our rate of accuracy on these data should be indicative of how well our classifier will do on new, unlabeled tumors.
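The voting rule and the accuracy check can be sketched as follows. The function names classify and accuracy, the two-attribute midpoints, and the test records are assumptions made for this example, not the skeleton code's interface or real data.

```python
def classify(record, midpoints):
    """Predict "M" or "B" for one record by counting per-attribute votes."""
    malignant_votes = 0
    benign_votes = 0
    for attribute, midpoint in midpoints.items():
        if record[attribute] >= midpoint:
            malignant_votes += 1
        else:
            benign_votes += 1
    # Ties go to malignant, as described in the text.
    return "M" if malignant_votes >= benign_votes else "B"


def accuracy(test_set, midpoints):
    """Return the fraction of records whose prediction matches the true class."""
    correct = sum(1 for r in test_set if classify(r, midpoints) == r["class"])
    return correct / len(test_set)


# Made-up midpoints and test records (two attributes instead of ten).
midpoints = {"radius": 14.0, "area": 75.0}
toy_test_set = [
    {"radius": 15.0, "area": 60.0, "class": "M"},   # 1-1 tie, predicted M
    {"radius": 9.0, "area": 40.0, "class": "B"},    # 0-2, predicted B
    {"radius": 13.0, "area": 80.0, "class": "B"},   # 1-1 tie, predicted M (wrong)
    {"radius": 19.0, "area": 120.0, "class": "M"},  # 2-0, predicted M
]

print(accuracy(toy_test_set, midpoints))  # 0.75
```

With ten attributes a 5-5 tie is possible, which is why the rule for ties matters.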
3 Dataset Description
You have been provided with cancerTrainingData.txt, a text file containing the 80% of the data that we’ll use as our training set.
The file has many numbers per patient record, some of which refer to attributes of the tumor. The skeleton code includes the function make_training_set(), which reads in the important information from this file and produces a list of dictionaries. Each dictionary contains attributes for a single tumor as follows:
0. ID
1. radius
2. texture
3. perimeter
4. area
5. smoothness
6. compactness
7. concavity
8. concave
9. symmetry
10. fractal
11. class
The middle 10 attributes (numbered 1 through 10) are the numbers that describe the tumor. The first attribute is just the patient ID number, and the last attribute is the actual, real-life state of the tumor, namely, malignant (represented by “M”) or benign (represented by “B”).
We don’t need to know what these attributes mean: all we need to know is that they are measurements of the tumors, and that benign and malignant tumors tend to have different attribute values. For all 10 tumor attributes, higher values indicate malignancy when compared to the midpoint values. Pictorially, the list of dictionaries looks like this (two are shown, but the list contains many more than that):
[Figure: the list of dictionaries, with two entries shown]
The dictionary stored in the 0th spot in the list gives the attributes for the 0th tumor: training_set[0]["class"] gives the true class label (in this case, “B” for benign) of the 0th tumor.
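To make the structure concrete, here is a sketch of what a two-entry list of dictionaries could look like and how to index into it. The keys follow the attribute list above; every value is made up for illustration and does not come from cancerTrainingData.txt.

```python
# Hypothetical values only: two made-up tumor records using the keys listed above.
training_set = [
    {"ID": 1001, "radius": 12.3, "texture": 14.1, "perimeter": 78.9,
     "area": 464.5, "smoothness": 0.09, "compactness": 0.07,
     "concavity": 0.04, "concave": 0.02, "symmetry": 0.17,
     "fractal": 0.06, "class": "B"},
    {"ID": 1002, "radius": 19.8, "texture": 22.4, "perimeter": 130.2,
     "area": 1210.0, "smoothness": 0.11, "compactness": 0.16,
     "concavity": 0.19, "concave": 0.10, "symmetry": 0.20,
     "fractal": 0.06, "class": "M"},
]

# Index into the list first, then look up a key in that tumor's dictionary.
print(training_set[0]["class"])  # B
print(training_set[1]["area"])   # 1210.0

# Collect one attribute across all tumors.
radii = [tumor["radius"] for tumor in training_set]
print(radii)  # [12.3, 19.8]
```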
4 Getting Started
Download the skeleton code (cancer_classifier.py), training set (cancerTrainingData.txt), and the test set (cancerTestingData.txt). Make sure all three files are in the same directory, or the main program will not be able to load the data from the files.
In some browsers, clicking the link to each data file simply opens the file in your browser, which isn’t helpful. To download the data files, I recommend right-clicking the link from Canvas or the course webpage and selecting “Save File As...”, or your browser’s equivalent. Choose the same location as you’ve saved the skeleton code and save the files without changing their names to be sure that the program will be able to read them correctly.