


1.1 Cleaning and Exploring Data with Pandas

1.2 This assignment

In this homework, you will investigate restaurant food safety scores for restaurants in San Francisco. Here is a sample score card for a restaurant. The scores and violation information have been made available by the San Francisco Department of Public Health. We have made these data available to you in class_share on JupyterHub along with the link below. The main goal for this assignment is to understand how restaurants are scored. We will walk through the various steps of exploratory data analysis to do this. We will provide comments and insights along the way to give you a sense of how we arrive at each discovery and what next steps it leads to.

As we clean and explore these data, you will gain practice with:

  • Reading simple csv files

  • Working with data at different levels of granularity

  • Identifying the type of data collected, missing values, anomalies, etc.

  • Exploring characteristics and distributions of individual variables

 

1.4 Score breakdown

Question    Points
1a          1
1b          0
1c          0
1d          3
1e          1
2a          1
2b          2
3a          2
3b          0
3c          2
3d          1
3e          1
4a          2
4b          3
5a          1
5b          1
5c          1
6a          2
6b          3
6c          3
7a          2
7b          2
7c          6
7d          2
Total       45

To start the assignment, run the cell below to set up some imports. In many of these assignments (and your future adventures as a data scientist) you will use os, zipfile, pandas, numpy, matplotlib.pyplot, and seaborn. Import each of these libraries as their commonly used abbreviations (e.g., pd, np, plt, and sns).

We could write a few lines of code that are built to download this specific data file, but it’s a better idea to have a general function that we can reuse for all of our assignments. Since this class isn’t really about the nuances of the Python file system libraries, we’ve provided a function for you in ds112_utils.py called fetch_and_cache that can download files from the internet.

This function has the following arguments:

  • data_url: the web address to download

  • file: the file in which to save the results

  • data_dir: (default="data") the location to save the data

  • force: if true the file is always re-downloaded

The way this function works is that it checks to see if data_dir/file already exists. If it does not exist already or if force=True, the file at data_url is downloaded and placed at data_dir/file. The process of storing a data file for reuse later is called caching. If data_dir/file already exists and force=False, nothing is downloaded, and instead a message is printed letting you know the date of the cached file.

The function returns a pathlib.Path object representing the file. A pathlib.Path is an object that stores filepaths, e.g. ~/Dropbox/ds112/my_chart.png.
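The behaviour described above can be sketched in a few lines. This is not the code in ds112_utils.py, only a minimal illustration under stated assumptions: the name fetch_and_cache_sketch and the exact wording of the printed messages are hypothetical, and the download uses the standard library's urllib.

```python
import time
from pathlib import Path
from urllib.request import urlopen


def fetch_and_cache_sketch(data_url, file, data_dir="data", force=False):
    """Download data_url to data_dir/file unless a cached copy already exists.

    Hypothetical stand-in for fetch_and_cache; returns a pathlib.Path.
    """
    data_dir = Path(data_dir)
    data_dir.mkdir(parents=True, exist_ok=True)
    file_path = data_dir / file

    if force or not file_path.exists():
        # No cached copy (or a refresh was requested): download and store it.
        with urlopen(data_url) as response:
            file_path.write_bytes(response.read())
        print("Downloaded", file_path)
    else:
        # Cached copy found: report its date and download nothing.
        cached_on = time.ctime(file_path.stat().st_mtime)
        print("Using cached version last modified:", cached_on)
    return file_path
```

Calling the function twice with the same arguments downloads once and then reuses the cached file; passing force=True downloads again.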

The code below uses ds112_utils.py to download the data from the following URL:

https://cims.nyu.edu/~policast/hw2-SFBusinesses.zip

 

1.6 1: Loading Food Safety Data

Alright, great, now we have data.zip. We don’t have any specific questions yet, so let’s focus on understanding the structure of the data. Recall this involves answering questions such as

  • Is the data in a standard format or encoding?

  • Is the data organized in records?

  • What are the fields in each record?

Let’s start by looking at the contents of the zip file. We could in theory do this by manually opening up the zip file on our computers or using a shell command like !unzip, but on this homework we’re going to do almost everything in Python for maximum portability and automation.

Goal: Fill in the code below so that my_zip is a zipfile.ZipFile object corresponding to the downloaded zip file, and so that list_names contains a list of the names of all files inside the downloaded zip file.

 

Creating a zipfile.ZipFile object is a good start (the Python docs have further details). You might also look back at the code from the case study from Demo 3. It’s OK to copy and paste code from the Demo 3 file, though you might get more out of this exercise if you type out an answer.

 

In your answer above, if you see something like zipfile.ZipFile('data.zip'..., we suggest changing it to read zipfile.ZipFile(dest_path... or alternately zipfile.ZipFile(target_file_name.... In general, we strongly suggest having your filenames hard coded ONLY ONCE in any given IPython notebook. It is very dangerous to hard code things twice, because if you change one but forget to change the other, you can end up with very hard to find bugs.
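As an illustration of both points (opening an archive and naming the file only once), here is a minimal sketch. It builds a tiny stand-in archive in a temporary directory so it runs without the real download; the archive contents and the work_dir name are made up for the example, and dest_path here is just a local variable, not the one from the assignment.

```python
import tempfile
import zipfile
from pathlib import Path

# Build a tiny stand-in archive so the example runs without the real download.
work_dir = Path(tempfile.mkdtemp())
dest_path = work_dir / "demo.zip"  # the only place the file name is written
with zipfile.ZipFile(dest_path, "w") as zf:
    zf.writestr("legend.csv", "Minimum_Score,Maximum_Score,Description\n0,70,Poor\n")
    zf.writestr("notes.txt", "hello\n")

# Open the archive for reading and list the names of its members.
my_zip = zipfile.ZipFile(dest_path, "r")
list_names = my_zip.namelist()
print(list_names)
my_zip.close()
```

Every later use of the archive goes through dest_path, so renaming the file means changing one line.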

Now display the files’ names and their sizes.

If you’re not sure how to proceed, read about the attributes of a ZipFile object in the Python docs linked above.

We expect an output that looks something like this:

violations.csv     3726206
businesses.csv     660231
inspections.csv    466106
legend.csv         120
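One way to get at names and sizes is ZipFile.infolist(), which returns one ZipInfo per member with filename and file_size attributes. The sketch below runs on a small in-memory archive with made-up members, not on the assignment's zip file.

```python
import io
import zipfile

# Small in-memory archive standing in for the downloaded zip file.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zf:
    zf.writestr("a.csv", "x,y\n1,2\n")
    zf.writestr("b.csv", "x\n")

# Each ZipInfo records the member's name and its uncompressed size in bytes.
with zipfile.ZipFile(buffer) as my_zip:
    sizes = {info.filename: info.file_size for info in my_zip.infolist()}

for name, size in sizes.items():
    print(name, size, sep="\t")
```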

Often when working with zipped data, we’ll never unzip the actual zipfile. This saves space on our local computer. However, for this HW, the files are small, so we’re just going to unzip everything. This has the added benefit that you can look inside the csv files using a text editor, which might be handy for more deeply understanding what’s going on. The cell below will unzip the csv files into a subdirectory called "data". Try running the code below.
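A minimal sketch of this extraction step, using a small stand-in archive built on the fly so that it runs anywhere. The names demo_zip and work_dir are hypothetical; in the assignment the archive is the downloaded zip file and the target is the "data" folder next to the notebook.

```python
import tempfile
import zipfile
from pathlib import Path

# Stand-in archive with a single CSV member.
work_dir = Path(tempfile.mkdtemp())
demo_zip = work_dir / "demo.zip"
with zipfile.ZipFile(demo_zip, "w") as zf:
    zf.writestr("legend.csv", "Minimum_Score,Maximum_Score,Description\n0,70,Poor\n")

# Extract every member into a "data" subdirectory (created if missing).
data_dir = work_dir / "data"
with zipfile.ZipFile(demo_zip, "r") as my_zip:
    my_zip.extractall(data_dir)

# Collect the extracted file names for inspection.
extracted = sorted(p.name for p in data_dir.iterdir())
```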

When you ran the code above, nothing was printed. However, this code should have created a folder called "data", and in it should be the four CSV files. Assuming you’re using Datahub, use your web browser to verify that these files were created, and try to open up legend.csv to see what’s inside. You should see something that looks like:

"Minimum_Score","Maximum_Score","Description"
0,70,"Poor"
71,85,"Needs Improvement"
86,90,"Adequate"
91,100,"Good"

Question 1b: Programmatically Looking Inside the Files

What we saw when we opened the file above is good news! It looks like this file is indeed a csv file. Let’s check the other three files. This time, rather than opening up the files manually, let’s use Python to print out the first 5 lines of each. The ds112_utils library has a function called head that will allow you to retrieve the first N lines of a file as a list. For example, ds112_utils.head('data/legend.csv', 5) will return the first 5 lines of "data/legend.csv". Try using this function to print out the first 5 lines of all four files that we just extracted from the zipfile.
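To make the behaviour of head concrete, here is a rough equivalent. It is an assumption about how such a helper could be written, not the ds112_utils source; the name head_sketch and the demo file are hypothetical.

```python
import tempfile
from itertools import islice
from pathlib import Path


def head_sketch(filename, lines=5):
    """Return the first `lines` lines of a text file as a list of strings."""
    with open(filename, "r") as f:
        # islice stops after `lines` lines, so large files are not read fully.
        return list(islice(f, lines))


# Demo on a throwaway file with seven lines.
demo_file = Path(tempfile.mkdtemp()) / "demo.csv"
demo_file.write_text("".join(f"row{i}\n" for i in range(7)))
first_lines = head_sketch(demo_file, 5)
print(first_lines)
```

Each returned string keeps its trailing newline, which is why printing the list shows `\n` at the end of every entry.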

Question 1c: Reading in the Files

Based on the above information, let’s attempt to load businesses.csv, inspections.csv, and violations.csv into pandas data frames with the following names: bus, ins, and vio respectively.

Note: Because of character encoding issues one of the files (bus) will require an additional argument encoding='ISO-8859-1' when calling pd.read_csv.
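The following sketch shows why the encoding argument matters. It uses a made-up two-column CSV (the column names and the business name are invented for the example) whose bytes are valid ISO-8859-1 but not valid UTF-8.

```python
import io

import pandas as pd

# A made-up CSV whose bytes are ISO-8859-1 encoded ("é" is the single byte 0xE9).
raw = "business_id,name\n1,Café Demo\n".encode("ISO-8859-1")

# The same bytes are not valid UTF-8, so decoding them as UTF-8 fails.
try:
    raw.decode("utf-8")
    decodes_as_utf8 = True
except UnicodeDecodeError:
    decodes_as_utf8 = False

# Telling pandas the actual encoding lets the file load cleanly.
bus_demo = pd.read_csv(io.BytesIO(raw), encoding="ISO-8859-1")
print(bus_demo)
```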

Now that you’ve read in the files, let’s try some pd.DataFrame methods. Use the DataFrame.head command to show the top few lines of the bus, ins, and vio dataframes.

The DataFrame.describe method can also be handy for computing summaries of various statistics of our dataframes. Try it out with each of our 3 dataframes.
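For a feel of what these two methods return, here is a small sketch on an invented dataframe (the column names and values are placeholders, not taken from the assignment data). Note that describe summarizes only the numeric columns by default.

```python
import pandas as pd

# Invented data: one numeric column and one string column.
demo = pd.DataFrame({
    "score": [100, 94, 82, 76, 90, 68],
    "postal_code": ["94110", "94103", "94110", "94133", "94103", "94110"],
})

# head(n) returns the first n rows as a new dataframe.
top_rows = demo.head(3)
print(top_rows)

# describe() computes count, mean, std, min, quartiles and max per numeric column.
summary = demo.describe()
print(summary)
```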

Question 1d: Verify Your Files were Read Correctly

Now, we perform some sanity checks for you to verify that you loaded the data with the right structure. Run the following cells to load some basic utilities (you do not need to change these at all):

First, we check the basic structure of the data frames you created. Next, we’ll check that the statistics match what we expect. The following are hard-coded statistical summaries of the correct data.

The code below defines a testing function that we’ll use to verify that your data has the same statistics as what we expect. Run these cells to define the function. The df_allclose function has this name because we are verifying that all of the statistics for your dataframe are close to the expected values. Why not df_allequal? It’s a bad idea in almost all cases to compare two floating point values like 37.780435 for exact equality, as rounding error can cause spurious failures.
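A short sketch of the difference between exact and tolerance-based comparison. The first part uses plain floats; the second compares summary statistics of two nearly identical dataframes, with a made-up latitude column standing in for real data.

```python
import numpy as np
import numpy.testing as npt
import pandas as pd

# Rounding error makes an exact comparison fail where a tolerant one succeeds.
print(0.1 + 0.2 == 0.3)
print(np.isclose(0.1 + 0.2, 0.3))

# Two made-up dataframes that differ only by a tiny perturbation.
stats = ["min", "50%", "max"]
expected = pd.DataFrame({"latitude": [37.70, 37.78, 37.82]})
loaded = expected + 1e-6

exp_summary = expected.describe().loc[stats]
act_summary = loaded.describe().loc[stats]

# Exact equality rejects the perturbed data...
exactly_equal = act_summary.equals(exp_summary)
print(exactly_equal)

# ...while a relative tolerance accepts it (raises AssertionError otherwise).
npt.assert_allclose(act_summary, exp_summary, rtol=5e-2)
```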

Do not delete the empty cell below!

[ ]: """Run this cell to load this utility comparison function that we will use in
     various tests below.

     Do not modify the function in any way.
     """

     def df_allclose(actual, desired, columns=None, rtol=5e-2):
         """Compare selected columns of two dataframes on a few summary statistics.

         Compute the min, median and max of the two dataframes on the given columns,
         and compare that they match numerically to the given relative tolerance.

         If they don't match, an AssertionError is raised (by `numpy.testing`).
         """
         import numpy.testing as npt

         # summary statistics to compare on
         stats = ['min', '50%', 'max']

         # For the desired values, we can provide a full DF with the same structure as
         # the actual data, or pre-computed summary statistics.
         # We assume a pre-computed summary was provided if columns is None. In that case,
         # `desired` *must* have the same structure as the actual's summary
         if columns is None:
             des = desired
             columns = desired.columns
         else:
             des = desired[columns].describe().loc[stats]

         # Extract summary stats from actual DF
         act = actual[columns].describe().loc[stats]

         # Compare the two summaries. This final call is inferred from the docstring,
         # which says numpy.testing raises the AssertionError on a mismatch.
         npt.assert_allclose(act, des, rtol=rtol)
