ASSIGNMENT
Submission:
Source Code: Python source files (upload a .zip file in case of multiple files) containing your code only (no test data needed) and a ReadMe.txt file (template provided) describing how to run your code. Note that we will NOT debug your code. If your code does not execute as described in ReadMe.txt, you will receive a zero grade.
Presentation Slide: One slide only, in PPT/PPTX/PDF format, to be used during the oral presentations (see below). If your submitted file spans more than one page, we will extract the first page for the oral presentation.
Presentations:
Everyone is required to deliver a 3-minute flash presentation, accompanied by the submitted slide, following the Three Minute Thesis (3MT) format, with an additional 2 minutes for Q&A:
Your presentation should at least contain methods (i.e., implementation), results (e.g., output), and conclusions.
Including appropriate graphics and visuals (e.g., figures, plots) in the presentation slide to help illustrate key concepts or results will be positively evaluated.
Any additional scientific insights, challenges faced, limitations of your implementation, efficiency analyses, or comparisons with alternative approaches will be positively evaluated.
Implementing decision tree for protein RSA prediction
Objective: Implement decision tree for protein relative solvent accessibility prediction.
Note: You must use the standard Python programming language. You are NOT allowed to use non-standard packages or libraries (e.g., Biopython, scikit-learn, SciPy, NumPy, etc.).
A: Raw Data:
Two directories (fasta and sa) are supplied. The fasta directory contains 150 protein sequences in FASTA format. A FASTA file consists of a header line beginning with '>' followed by the sequence on one or more subsequent lines.
The true binary relative solvent accessibility (RSA) labels of these proteins can be found in the sa directory. These files are also in FASTA format. Each RSA label takes one of two values:
'E': exposed
'B': buried
N.B. The true RSA labels are calculated using the DSSP (Dictionary of Protein Secondary Structure: Pattern Recognition of Hydrogen-Bonded and Geometrical Features. Kabsch and Sander, 1983) software at a 25% threshold.
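Since no external packages are allowed, the raw data has to be parsed by hand. A minimal sketch in standard Python, assuming the usual FASTA layout ('>' header line followed by sequence lines); the function takes any iterable of lines, so an open file handle can be passed directly:

```python
def parse_fasta(lines):
    """Parse FASTA-formatted lines into a dict mapping header -> sequence.

    Assumes each record starts with a '>' header line followed by one or
    more sequence lines (standard FASTA layout); blank lines are skipped.
    """
    records = {}
    header = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:]
            records[header] = []
        else:
            records[header].append(line)
    # Join multi-line sequences into single strings
    return {h: "".join(parts) for h, parts in records.items()}
```

The same parser works for both the fasta and sa directories, since the label files share the FASTA layout; e.g. `parse_fasta(open("fasta/example.fasta"))` (hypothetical path).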
B: Curating Training and Test Datasets:
Divide the raw data into non-overlapping sets of training (~75%) and test (~25%) datasets using simple random sampling without replacement.
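One way to implement this split with the stdlib only is to shuffle the protein identifiers once and slice, which is simple random sampling without replacement; the `seed` parameter is an optional extra for reproducibility:

```python
import random

def split_ids(protein_ids, train_frac=0.75, seed=None):
    """Partition protein IDs into non-overlapping train/test sets.

    Simple random sampling without replacement: shuffle the IDs once,
    then cut at ~train_frac of the list.
    """
    rng = random.Random(seed)  # private RNG so a seed gives a repeatable split
    ids = list(protein_ids)
    rng.shuffle(ids)
    cut = int(len(ids) * train_frac)
    return ids[:cut], ids[cut:]
```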
C. Feature Extraction:
Using chemical properties of 20 naturally occurring amino acid residues as detailed in Table 1 and Figure 1, construct a feature matrix (or vector) for the training and test datasets.
Table 1. Chemical properties of 20 naturally occurring amino acid residues (Livingstone & Barton, CABIOS, 9, 745-756, 1993)
Figure 1. Venn diagram of chemical properties of 20 naturally occurring amino acid residues (Livingstone & Barton, CABIOS, 9, 745-756, 1993)
Specifically, the feature set should include the following binary attributes:
Attribute   | Description
------------|-----------------------------------------
Hydrophobic | Whether a residue is hydrophobic
Polar       | Whether a residue is polar
Small       | Whether a residue is small
Proline     | Whether a residue is Proline (PRO, P)
Tiny        | Whether a residue is tiny
Aliphatic   | Whether a residue is aliphatic
Aromatic    | Whether a residue is aromatic
Positive    | Whether a residue is positively charged
Negative    | Whether a residue is negatively charged
Charged     | Whether a residue is charged
The output labels are already binary (e.g. 1 for exposed, 0 for buried or vice versa).
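The ten binary features per residue can be derived from fixed property sets. The memberships below are a sketch based on the Livingstone & Barton classification and should be verified residue by residue against Table 1 and Figure 1 before use:

```python
# ASSUMPTION: these membership sets are sketched from the Livingstone &
# Barton (1993) classification; verify each one against Table 1 / Figure 1.
PROPERTIES = {
    "Hydrophobic": set("ACFGHIKLMTVWY"),
    "Polar":       set("CDEHKNQRSTWY"),
    "Small":       set("ACDGNPSTV"),
    "Proline":     set("P"),
    "Tiny":        set("ACGS"),
    "Aliphatic":   set("ILV"),
    "Aromatic":    set("FHWY"),
    "Positive":    set("HKR"),
    "Negative":    set("DE"),
    "Charged":     set("DEHKR"),
}

FEATURE_NAMES = list(PROPERTIES)  # fixed column order for the feature matrix

def residue_features(residue):
    """Return the 10 binary features (0/1) for one amino acid residue."""
    return [int(residue in PROPERTIES[name]) for name in FEATURE_NAMES]

def sequence_features(sequence):
    """Feature matrix for a whole protein: one 10-element row per residue."""
    return [residue_features(r) for r in sequence]
```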
D. Decision Tree Learning using ID3 on Training Set:
Implement the ID3 decision tree learning algorithm, which grows the tree greedily top-down using information gain, to learn the best hypothesis on the training dataset.
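Since all attributes here are binary, each internal node has exactly two children. A minimal ID3 sketch using stdlib only, representing an internal node as a tuple `(attribute_index, child_if_0, child_if_1)` and a leaf as a bare label (this representation is one possible choice, not mandated by the assignment):

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a label list, in bits."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(rows, labels, attr):
    """Gain from splitting (rows, labels) on binary attribute index attr."""
    total = entropy(labels)
    for value in (0, 1):
        subset = [lab for row, lab in zip(rows, labels) if row[attr] == value]
        if subset:
            total -= len(subset) / len(labels) * entropy(subset)
    return total

def id3(rows, labels, attrs):
    """Greedy top-down growth: leaf = label, node = (attr, if_0, if_1)."""
    if len(set(labels)) == 1:           # pure node: stop
        return labels[0]
    if not attrs:                       # attributes exhausted: majority vote
        return Counter(labels).most_common(1)[0][0]
    best = max(attrs, key=lambda a: information_gain(rows, labels, a))
    children = []
    for value in (0, 1):
        idx = [i for i, row in enumerate(rows) if row[best] == value]
        if not idx:                     # empty branch: majority of parent
            children.append(Counter(labels).most_common(1)[0][0])
        else:
            children.append(id3([rows[i] for i in idx],
                                [labels[i] for i in idx],
                                [a for a in attrs if a != best]))
    return (best, children[0], children[1])
```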
E. Decision Tree Classification on Test Set:
Implement a decision tree classification algorithm that walks the trained tree generated in step D and outputs predicted labels on the test dataset.
N.B. ID3 decision tree is an offline-learning algorithm. Therefore, training and classification should be implemented separately. The classification algorithm should take a protein sequence in FASTA format as an input and predict labels in a standalone mode. You may save the parameters learned during training in a file that can be fed into the classifier, in an offline mode.
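One way to satisfy the offline requirement is to serialize the learned tree as JSON and have the standalone classifier load it back. A sketch, assuming the tree is nested tuples `(attribute_index, child_if_0, child_if_1)` with leaves as labels (note that JSON round-trips tuples as lists, which the walker accepts):

```python
import json

def save_tree(tree, path):
    """Persist the learned tree so the classifier can run offline."""
    with open(path, "w") as fh:
        json.dump(tree, fh)

def load_tree(path):
    """Reload a saved tree; tuples come back as lists, which predict() handles."""
    with open(path) as fh:
        return json.load(fh)

def predict(tree, row):
    """Walk the tree for one feature row until a leaf label is reached."""
    while isinstance(tree, (tuple, list)):
        attr, if0, if1 = tree
        tree = if1 if row[attr] else if0
    return tree
```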
F. Evaluate Accuracy:
Use Precision, Recall, and F1 score to evaluate the accuracy of the decision tree classifier implemented in step E on the test dataset.
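These metrics reduce to counting true positives, false positives, and false negatives for a chosen positive class. A stdlib-only sketch, with 'E' (exposed) assumed as the positive class by default:

```python
def precision_recall_f1(true_labels, pred_labels, positive="E"):
    """Precision, recall, and F1 over paired true/predicted label lists.

    'E' (exposed) is treated as the positive class by default; pass
    positive="B" to evaluate the buried class instead.
    """
    pairs = list(zip(true_labels, pred_labels))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1
```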