
Introduction

You may be asking yourself, “What is the importance of learning about Data Science and Machine Learning in a cybersecurity class?” The short answer is that data science is a useful set of tools for handling the massive amounts of data that flow through IT systems, and it is used by many security teams, either explicitly or within the tools and programs they rely on, so it is important to have a basic understanding of how it works. This project walks through a simplified scenario where data science can be applied. If it sparks your interest, there are plenty of other ML-focused classes at GaTech that you may be interested in taking, as well as a wealth of training material on YouTube, Coursera, Udacity, Udemy, DataCamp, and similar platforms that you can use to go deeper into the field.

Scenario:

You are an analyst on the security team of a midsized software company that runs a messaging app (a Slack, Google Chat, or Microsoft Teams competitor). It is Monday morning and you see an email from your manager setting up a meeting to discuss a new security feature that needs to be implemented in the product ASAP. You join the meeting and learn that there has recently been a big uptick in malicious executable files being sent over the chat app, and it is starting to generate bad press for the company.

A few analysts on the team have already analyzed a set of files sent over the app and classified them as malicious or benign. They also used a Python library (pefile) to extract attributes of each executable file and have created a CSV with those extracted attributes and a column named class, where 1 denotes a malicious file and 0 denotes a benign file. They documented their preprocessing work in a readme in the git repo (urwithajit9/ClaMP: A Malware classifier dataset built with header fields’ values of Portable Executable files (github.com)) and shared the repo with the software engineers so they can start writing code that will generate those features for every executable file sent over the messaging app.

Your boss turns to you and says: I would like you to help us understand how big a problem this is on our app, and to write a model that takes in these features and produces a propensity score from 0 to 1, where scores closer to 0 mean a low likelihood of the file being malicious and scores closer to 1 mean a higher likelihood. Also, since the team may want to reuse this type of work in the future for different types of files or with different extracted attributes, you should create functions that can be reused with minimal rework. Once you produce a model, you will share your code and the trained model file with the software engineers, who will integrate the model into the messaging app and score every file uploaded to it.

General Advice

Develop locally, then test in the autograder once you are confident your code runs without errors. You can run the Python files locally, develop in a local VS Code/Jupyter notebook, or use a hosted web notebook like Google Colab.

Do not use print statements in your Gradescope submissions. While print statements are useful for debugging issues locally, in an autograder context they can leak sensitive answer information. We have detections in Gradescope that will block you from viewing the scores/outputs of your code if you use print statements anywhere in your submitted code. If you try to brute-force, hack, or game the autograder or extract information from it, you will receive a 0 for the whole assignment.

Read the Python library documentation. You will be using pandas, scikit-learn, and yellowbrick.

Do not hard code solutions (see screenshot below for what not to do):

 

 

Task 0 (20 points)

We have a Canvas quiz that is meant to test that you have read the library documentation for the packages we use for this class. It is not meant to be tricky and can be completed before you start the project or after you finish it. 

Useful Links:

scikit-learn Documentation

Deliverables:

Complete Canvas Quiz

 

Task 1 (5 points)

Let's first get familiar with some pandas basics. Pandas is a library for working with data frames, which you can think of as Python objects that hold tabular data. In this section you will write a very simple function that takes in a pandas dataframe of the file attributes and classifications and returns some simple statistics: the count of rows, the count of columns, the count of rows where the classification is 1 (positive), the count of rows where the classification is 0 (negative), and the percentage of rows classified as 1 (percent positive). See the function skeleton in task1.py and implement each of these. In the real world you would also use plotting tools such as Power BI, Tableau, Data Studio, or Matplotlib to create graphics and other visuals to better understand the dataset you are working with; this step is generally known as Exploratory Data Analysis. Since we are using an autograder for this class, we will skip the plotting for this project.
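As a point of reference, a minimal sketch of what find_dataset_statistics could look like is shown below. The skeleton in task1.py defines the exact signature and return order, so the parameter names, the label-column argument, and the percentage convention (0–100) here are assumptions for illustration only:

```python
import pandas as pd

def find_dataset_statistics(dataset: pd.DataFrame, label_col: str) -> tuple:
    # Total number of rows and columns in the dataframe
    n_records = len(dataset)
    n_columns = len(dataset.columns)
    # Rows labeled malicious (1) and benign (0); the label column name is
    # passed in here, but your skeleton may fix it (e.g. "class")
    n_positive = int((dataset[label_col] == 1).sum())
    n_negative = int((dataset[label_col] == 0).sum())
    # Share of malicious rows, expressed as a percentage of all rows
    perc_positive = 100.0 * n_positive / n_records if n_records else 0.0
    return n_records, n_columns, n_negative, n_positive, perc_positive
```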

 

Useful Links:

pandas documentation — pandas 1.5.3 documentation (pydata.org)

What is Exploratory Data Analysis? | IBM

Top Data Visualization Tools | KDnuggets 

 

Deliverables:

Complete find_dataset_statistics function in task1.py

Submit task1.py to gradescope

 

 

Task 2 (10 points)

Now that you have a basic understanding of pandas and the dataset, it is time to dive into some more complex data processing tasks. The first subtask is splitting your dataset into features and targets (columns) and into training and test sets (rows). These are basic concepts in model building: it is important to hold out a subset of your data when you train a model so you can estimate its expected performance on unseen samples and determine whether the resulting model is overfit (performs much better on training data than on test data). Preprocessing matters because most models only take in numerical values, so categorical features need to be “encoded” to numerical values before models can use them. Numerical scaling can be more or less useful depending on the type of model, but it is especially important for linear models. These preprocessing techniques give you options for augmenting your dataset and improving model performance.
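To make the two kinds of splits concrete, here is a rough sketch using scikit-learn's train_test_split. The helper function name and the split parameters (test_size, random_state, stratify) are illustrative assumptions, not requirements from the provided skeleton:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

def split_features_and_target(dataset: pd.DataFrame, target_col: str):
    # Column split: separate the label column (target) from the attributes (features)
    features = dataset.drop(columns=[target_col])
    target = dataset[target_col]
    return features, target

# Row split: hold out a test set so performance can be checked on unseen samples.
# features, target = split_features_and_target(df, "class")
# X_train, X_test, y_train, y_test = train_test_split(
#     features, target, test_size=0.2, random_state=0, stratify=target
# )
```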

For the one-hot encoding and scaling functions you should return a dataframe with the encoded/scaled columns concatenated to the ‘other’ columns you did not transform. For the PCA functions you should return just the PCA dataframe. For the feature engineering function you should return the engineered column attached to the input dataframe.
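For example, a scaling helper that follows this convention (fit on the training set, transform both sets, then re-attach the untransformed columns) might be sketched as follows; the function name and the choice of StandardScaler are assumptions for illustration:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

def scale_numeric_columns(train_df, test_df, numeric_cols):
    # Fit the scaler on the training data only
    scaler = StandardScaler()
    scaler.fit(train_df[numeric_cols])

    def apply(df):
        scaled = pd.DataFrame(
            scaler.transform(df[numeric_cols]),
            columns=numeric_cols,
            index=df.index,
        )
        # Concatenate the scaled columns with the 'other' columns left untouched
        other = df.drop(columns=numeric_cols)
        return pd.concat([scaled, other], axis=1)

    return apply(train_df), apply(test_df)
```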

Example Output for one hot encoding (where color and version are encoded):

 

Note: For these functions (and in data science/ML in general) you should be training/fitting on the train set only, and predicting/transforming on both the train and test sets.

Note: for one-hot encoding, please use the format columnName_columnValue for the new encoded column names. I.e., in the sklearn example of a column named `gender` with values `male` and `female`, you would use `gender_male` and `gender_female` as your new column names after one-hot encoding. If the test set contains a third gender, you would denote that value, unseen at training time, with 0s in each encoded column.

Note: When using sklearn.preprocessing.OneHotEncoder, check the output of the transformation; it may cause issues if it is not in a format that pandas dataframes expect. To see this and debug it, try running your code locally (or in a Google Colab notebook) on the training CSV with a few categorical columns.
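One common way to satisfy both notes above is OneHotEncoder with handle_unknown="ignore" (categories unseen at fit time become all-zero rows) plus get_feature_names_out for the columnName_columnValue names, converting the array output back into a dataframe before concatenating. The wrapper below is only a sketch; its name and parameters are not part of the provided skeleton, and on scikit-learn versions older than 1.2 the sparse_output argument is called sparse:

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

def one_hot_encode_columns(train_df, test_df, categorical_cols):
    # handle_unknown="ignore" encodes unseen categories as all zeros;
    # sparse_output=False returns a dense array (use sparse=False on sklearn < 1.2)
    encoder = OneHotEncoder(handle_unknown="ignore", sparse_output=False)
    encoder.fit(train_df[categorical_cols])

    def apply(df):
        encoded = pd.DataFrame(
            encoder.transform(df[categorical_cols]),
            # get_feature_names_out yields names like gender_male, gender_female
            columns=encoder.get_feature_names_out(categorical_cols),
            index=df.index,
        )
        # Attach the columns that were not encoded
        other = df.drop(columns=categorical_cols)
        return pd.concat([encoded, other], axis=1)

    return apply(train_df), apply(test_df)
```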

 

 
