
INSTRUCTIONS TO CANDIDATES
ANSWER ALL QUESTIONS

ASSIGNMENT

Submission:

 

  1. Source Code: Python source files (upload .zip file in case of multiple files) containing your code only (no test data needed) and ReadMe.txt file (template provided) describing how to run your code. Note that we will NOT debug your code. If your code does not execute as described in ReadMe.txt, you will receive a zero grade.

  2. Presentation Slide: One slide only in PPT/PPTX/PDF format to be used during the oral presentations (see below). If your submitted file spans more than one page, we will extract the first page for the oral presentation.

 

Presentations:

 

Everyone is required to deliver a 3-minute flash presentation, accompanied by the submitted slide, following the Three Minute Thesis (3MT) format, with an additional 2 minutes for Q&A:

 

  1. Your presentation should at least contain methods (i.e., implementation), results (e.g., output), and

  2. Having appropriate graphics and visuals (e.g., figures, plots) in the presentation slides to help illustrate key concepts or results will be positively

  3. Any additional scientific insights and/or challenges faced and/or limitations of your implementation and/or efficiency analyses and/or comparisons with alternative approaches will be positively

 

Implementing decision tree for protein RSA prediction

 

Objective: Implement a decision tree for protein relative solvent accessibility (RSA) prediction.

 

Note: You must use the standard Python programming language. You are NOT allowed to use non-standard packages or libraries (e.g. Biopython, scikit-learn, SciPy, NumPy, etc.).

 

A: Raw Data:

 

Two directories (fasta and sa) are supplied. The fasta directory contains 150 protein sequences in FASTA format. A FASTA file is as follows:

 

The true binary relative solvent accessibility (RSA) labels of these proteins can be found in the sa directory. These files are also in FASTA format. RSA labels have two possible values:

 

‘E’: exposed
‘B’: buried

 

N.B. The true RSA labels are calculated using the DSSP (Dictionary of Protein Secondary Structure: Pattern Recognition of Hydrogen-Bonded and Geometrical Features. Kabsch and Sander, 1983) software at a 25% threshold.
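As a minimal sketch of reading the raw data, the snippet below parses FASTA-formatted text into a dictionary using only the standard library. The function names `parse_fasta` and `read_fasta` are hypothetical, and the assumption that a sequence and its label string share the same header and the same length is not stated in the handout, so check it against the supplied files.

```python
def parse_fasta(text):
    """Parse FASTA-formatted text into a dict mapping header -> sequence."""
    records = {}
    header = None
    chunks = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                records[header] = "".join(chunks)
            header = line[1:].strip()
            chunks = []
        elif header is not None:
            # Sequence data may be wrapped over several lines.
            chunks.append(line)
    if header is not None:
        records[header] = "".join(chunks)
    return records


def read_fasta(path):
    """Read one FASTA file from disk (path is supplied by the caller)."""
    with open(path) as handle:
        return parse_fasta(handle.read())


# Illustrative, made-up records: a sequence and its RSA label string.
example_seq = parse_fasta(">demo\nMKV\nLA\n")
example_sa = parse_fasta(">demo\nEEBBB\n")
```

The same parser can serve both directories, since the label files are described as FASTA-formatted as well.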

 

B: Curating Training and Test Datasets:

 

Divide the raw data into non-overlapping sets of training (~75%) and test (~25%) datasets using simple random sampling without replacement.
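One way to sketch simple random sampling without replacement is to shuffle the protein identifiers once and cut the list. The function name `split_train_test`, the fixed seed, and the choice to split by whole protein (rather than by residue) are assumptions made for illustration.

```python
import random


def split_train_test(ids, train_fraction=0.75, seed=0):
    """Shuffle the ids once and cut the list, so the two sets cannot overlap."""
    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    shuffled = list(ids)
    rng.shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Hypothetical identifiers standing in for the 150 supplied proteins.
protein_ids = ["protein_%03d" % i for i in range(150)]
train_ids, test_ids = split_train_test(protein_ids)
```

Splitting by protein keeps all residues of one chain on the same side of the split; a residue-level split is also possible but is a different design choice.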

 

 

C.  Feature Extraction:

 

Using chemical properties of 20 naturally occurring amino acid residues as detailed in Table 1 and Figure 1, construct a feature matrix (or vector) for the training and test datasets.

 

 

Table 1. Chemical properties of 20 naturally occurring amino acid residues (Livingstone & Barton, CABIOS, 9, 745-756, 1993)

 
   

 

 

 

Figure 1. Venn diagram of chemical properties of 20 naturally occurring amino acid residues (Livingstone & Barton, CABIOS, 9, 745-756, 1993)

 

 

 

Specifically, the feature set should include the following binary attributes:

 

Attribute     Description
Hydrophobic   Whether a residue is hydrophobic
Polar         Whether a residue is polar
Small         Whether a residue size is small
Proline       Whether a residue is Proline (PRO, P)
Tiny          Whether a residue size is tiny
Aliphatic     Whether a residue is Aliphatic
Aromatic      Whether a residue is Aromatic
Positive      Whether a residue is Positively Charged
Negative      Whether a residue is Negatively Charged
Charged       Whether a residue is Charged

The output labels are already binary (e.g. 1 for exposed, 0 for buried or vice versa).
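The mapping from residues to the ten binary attributes might be sketched as follows. Table 1 is not reproduced here, so the membership sets in `PROPERTY_SETS` are assumptions based on a commonly cited reading of the Livingstone & Barton scheme and must be verified against Table 1 and Figure 1 before use. The names `residue_features` and `build_dataset` are hypothetical.

```python
# ASSUMPTION: these membership sets are placeholders to be checked against
# Table 1 and Figure 1; they are not taken from the handout itself.
PROPERTY_SETS = [
    ("hydrophobic", set("ACFGHIKLMTVWY")),
    ("polar", set("CDEHKNQRSTWY")),
    ("small", set("ACDGNPSTV")),
    ("proline", set("P")),
    ("tiny", set("AGS")),
    ("aliphatic", set("ILV")),
    ("aromatic", set("FHWY")),
    ("positive", set("HKR")),
    ("negative", set("DE")),
    ("charged", set("DEHKR")),
]
ATTRIBUTES = [name for name, _ in PROPERTY_SETS]


def residue_features(residue):
    """Return the 10 binary attributes for a one-letter residue code."""
    return [1 if residue in members else 0 for _, members in PROPERTY_SETS]


def build_dataset(sequence, labels, exposed_symbol="E"):
    """Turn one protein into (feature rows, binary targets); 1 means exposed."""
    if len(sequence) != len(labels):
        raise ValueError("sequence and label string differ in length")
    rows = [residue_features(res) for res in sequence.upper()]
    targets = [1 if sym == exposed_symbol else 0 for sym in labels.upper()]
    return rows, targets


rows, targets = build_dataset("KD", "EB")
```

Residues outside the 20 standard codes (for example `X`) map to an all-zero row in this sketch; whether that is the right treatment is left open.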

 

D.  Decision Tree Learning using ID3 on Training Set:

 

Implement the ID3 decision tree learning algorithm, which grows the tree greedily from the top down, using information gain to learn the best hypothesis on the training dataset.
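A compact sketch of ID3 for binary attributes is shown below, under a few assumptions: each row is a list of 0/1 values, the tree is stored as nested dictionaries with string keys `"0"` and `"1"` (chosen so it can later be written to JSON), and an empty branch falls back to the majority label of its parent. None of these representation choices come from the handout.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())


def information_gain(rows, labels, attr):
    """Entropy reduction obtained by splitting on binary attribute `attr`."""
    total = len(labels)
    remainder = 0.0
    for value in (0, 1):
        subset = [y for x, y in zip(rows, labels) if x[attr] == value]
        if subset:
            remainder += len(subset) / total * entropy(subset)
    return entropy(labels) - remainder


def majority(labels):
    """Most frequent label in the list."""
    return Counter(labels).most_common(1)[0][0]


def id3(rows, labels, attributes):
    """Return a leaf label, or a node {"attr": i, "0": subtree, "1": subtree}."""
    if len(set(labels)) == 1:
        return labels[0]          # pure node
    if not attributes:
        return majority(labels)   # nothing left to split on
    # Greedy step: pick the attribute with the highest information gain.
    best = max(attributes, key=lambda a: information_gain(rows, labels, a))
    remaining = [a for a in attributes if a != best]
    node = {"attr": best}
    for value in (0, 1):
        idx = [i for i, x in enumerate(rows) if x[best] == value]
        if not idx:
            node[str(value)] = majority(labels)  # empty branch
        else:
            node[str(value)] = id3([rows[i] for i in idx],
                                   [labels[i] for i in idx], remaining)
    return node


# Toy data: attribute 0 separates the classes perfectly, attribute 1 does not.
toy_rows = [[1, 0], [1, 1], [0, 0], [0, 1]]
toy_labels = [0, 0, 1, 1]
toy_tree = id3(toy_rows, toy_labels, [0, 1])
```

Because every attribute is used at most once on any path, the recursion depth is bounded by the number of attributes.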

 

E.  Decision Tree Classification on Test Set:

 

Implement a decision tree classification algorithm that walks the trained tree generated in step D and outputs predicted labels on the test dataset.

 

N.B. ID3 decision tree is an offline-learning algorithm. Therefore, training and classification should be implemented separately. The classification algorithm should take a protein sequence in FASTA format as an input and predict labels in a standalone mode. You may save the parameters learned during training in a file that can be fed into the classifier, in an offline mode.
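The separation between training and classification could look roughly like the sketch below: the trained tree is written to a file, and a standalone classifier loads it and walks it for each residue. The nested-dictionary tree layout, the JSON file format, and the names `classify`, `save_tree`, and `load_tree` are all assumptions; the handout only says that learned parameters may be saved to a file.

```python
import json
import os
import tempfile


def classify(tree, features):
    """Walk from the root to a leaf, following the branch for each tested attribute."""
    node = tree
    while isinstance(node, dict):
        node = node[str(features[node["attr"]])]
    return node  # leaf: the predicted binary label


def save_tree(tree, path):
    """Persist the trained tree so the classifier can run on its own."""
    with open(path, "w") as handle:
        json.dump(tree, handle)


def load_tree(path):
    """Load a tree previously written by save_tree."""
    with open(path) as handle:
        return json.load(handle)


# Hypothetical tree: test attribute 0 first, then attribute 9 on the "1" branch.
tree = {"attr": 0, "0": 1, "1": {"attr": 9, "0": 0, "1": 1}}

# Round trip through a temporary file, as the training script might do.
fd, model_path = tempfile.mkstemp(suffix=".json")
os.close(fd)
save_tree(tree, model_path)
restored = load_tree(model_path)
os.remove(model_path)

features = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]  # attribute 0 set, attribute 9 not
prediction = classify(restored, features)
```

A full classifier would additionally parse the input FASTA file and convert each residue to its feature vector before calling `classify`.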

 

 

F.  Evaluate Accuracy:

 

Use Precision, Recall, and F-1 score to evaluate the accuracy of the decision tree classifier implemented in step E on the test dataset.
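The three metrics can be sketched from counts of true positives, false positives, and false negatives. Treating label 1 (exposed) as the positive class and returning 0.0 when a denominator is zero are assumptions of this example; the function name `precision_recall_f1` is hypothetical.

```python
def precision_recall_f1(true_labels, predicted_labels, positive=1):
    """Compute precision, recall, and F-1 for the chosen positive class."""
    tp = fp = fn = 0
    for truth, pred in zip(true_labels, predicted_labels):
        if pred == positive and truth == positive:
            tp += 1   # correctly predicted positive
        elif pred == positive:
            fp += 1   # predicted positive, actually negative
        elif truth == positive:
            fn += 1   # missed positive
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1


# Made-up labels: 2 true positives, 1 false positive, 1 false negative.
y_true = [1, 1, 0, 0, 1]
y_pred = [1, 0, 0, 1, 1]
precision, recall, f1 = precision_recall_f1(y_true, y_pred)
```

Here precision and recall are both 2/3, so the F-1 score, their harmonic mean, is also 2/3.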
