In this assignment, you will implement Batch Gradient Descent to fit a line to a two-dimensional data set.

1 Description

In this assignment, you will implement Batch Gradient Descent to fit a line to a two-dimensional data set. You will implement a set of Spark jobs that learn the parameters of such a line from the New York City taxi trip reports for 2013. The dataset was released under the Freedom of Information Law (FOIL) and made public by Chris Whong (https://chriswhong.com/open-data/foil_nyc_taxi/). See Assignment 1 for details about this data set.

We would like to train a linear model between travel distance in miles and fare amount (the money paid to the taxis).
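As a rough orientation, batch gradient descent for a line y = m*x + b can be sketched in plain Python as below. This is only an illustration under stated assumptions: the mean squared error cost, the learning rate, the iteration count, and the names `batch_gradient_descent`, `learning_rate`, and `num_iterations` are not given in this handout, and the toy points are made up. In a Spark job, the two sums over all points would typically be computed with a map and a reduce over the cleaned RDD.

```python
def batch_gradient_descent(points, learning_rate=0.05, num_iterations=2000):
    """Fit y = m*x + b by batch gradient descent on mean squared error.

    Assumption: cost = (1/n) * sum((y - (m*x + b))**2); the handout does not
    state the cost function, the step size, or the stopping rule.
    """
    m, b = 0.0, 0.0
    n = len(points)
    for _ in range(num_iterations):
        # "Batch" means every data point contributes to each gradient step.
        grad_m = sum(-2.0 * x * (y - (m * x + b)) for x, y in points) / n
        grad_b = sum(-2.0 * (y - (m * x + b)) for x, y in points) / n
        m -= learning_rate * grad_m
        b -= learning_rate * grad_b
    return m, b


# Made-up (distance, fare) pairs that lie exactly on y = 2x + 3.
toy_points = [(1.0, 5.0), (2.0, 7.0), (3.0, 9.0)]
gd_m, gd_b = batch_gradient_descent(toy_points)
print(round(gd_m, 4), round(gd_b, 4))
```

On real trip data the learning rate usually needs tuning, because distances and fares are on larger scales than this toy example.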

2 Taxi Data Set - Same data set as Assignment 1

This is the same data set as used for Assignment 1. Please have a look at the table description there.

The data set is in Comma-Separated Values (CSV) format. When you read a line and split it on the comma character ",", you get a string array of length 17. With indices starting from zero, this assignment needs index 5, trip distance (in miles), and index 11, fare amount (in dollars), as stated in the following table.

Table 1: Taxi Data Set fields
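A minimal sketch of the field extraction described above, in plain Python so it can be tried without a cluster. The function name `parse_distance_and_fare` and the placeholder row are hypothetical; only the field count (17) and the two indices (5 and 11) come from the text.

```python
def parse_distance_and_fare(line):
    """Split one CSV line and return (trip_distance, fare_amount) as floats.

    Returns None when the line does not have 17 fields or when either
    field is not numeric.
    """
    fields = line.split(",")
    if len(fields) != 17:
        return None
    try:
        return float(fields[5]), float(fields[11])
    except ValueError:
        return None


# Made-up row: 17 fields, with only indices 5 and 11 holding numbers.
sample = ",".join(["x"] * 5 + ["2.5"] + ["x"] * 5 + ["10.0"] + ["x"] * 5)
print(parse_distance_and_fare(sample))
```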

You can use the following PySpark Code to clean up the data.

def isfloat(value):
    # Return True if value can be converted to a float.
    try:
        float(value)
        return True
    except (TypeError, ValueError):
        return False

def correctRows(p):
    # Keep rows with 17 fields whose trip distance (index 5) and
    # fare amount (index 11) are non-zero numbers.
    if len(p) == 17:
        if isfloat(p[5]) and isfloat(p[11]):
            if float(p[5]) != 0 and float(p[11]) != 0:
                return p

testDataFrame = spark.read.format('csv').\
    options(header='false', inferSchema='true', sep=",").\
    load(testFile)

testRDD = testDataFrame.rdd.map(tuple)
taxilinesCorrected = testRDD.filter(correctRows)

In addition to the above filtering, you should remove all rides whose total amount is larger than 600 USD or less than 1 USD. To avoid repeating this computation in each run, you can preprocess and clean the data once and store the result in your own cluster storage.
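The extra total-amount filter could look roughly like the predicate below. Note the assumption: this handout does not say which index holds the total amount, so `TOTAL_AMOUNT_INDEX = 16` (the last of the 17 fields) is a guess that should be checked against the table in Assignment 1. The name `has_valid_total` is hypothetical.

```python
# Assumption: total amount is the last of the 17 fields; verify against
# the field table in Assignment 1 before relying on this index.
TOTAL_AMOUNT_INDEX = 16


def has_valid_total(p, low=1.0, high=600.0):
    """Return True when the row's total amount lies between low and high USD."""
    try:
        total = float(p[TOTAL_AMOUNT_INDEX])
    except (TypeError, ValueError, IndexError):
        return False
    return low <= total <= high


# Made-up rows for a quick local check.
ok_row = ["x"] * 16 + ["12.5"]
too_expensive = ["x"] * 16 + ["700"]
print(has_valid_total(ok_row), has_valid_total(too_expensive))
```

In a Spark job, a predicate like this could be passed to a second `filter` call on the cleaned RDD, after which the result can be saved once and reused in later runs.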

3 Obtaining the Dataset

A small data set (93 MB compressed, 384 MB uncompressed; roughly 2 million taxi trips) is available on Google Storage for implementation and testing purposes:

https://storage.googleapis.com/met-cs-777-data/taxi-data-sorted-small.csv.bz2

The whole data set (8 GB) is available at:

https://storage.googleapis.com/met-cs-777-data/taxi-data-sorted-large.csv.bz2

When running your code on the cluster, you can access the data sets using the following internal URLs:

Google Cloud

Small Data Set gs://met-cs-777-data/taxi-data-sorted-small.csv.bz2

Large Data Set gs://met-cs-777-data/taxi-data-sorted-large.csv.bz2

Table 2: Data set on Google Cloud Storage - URL

4 Assignment Tasks

4.1 Task 1: Simple Linear Regression (4 points)

We want to fit a simple line to our data (distance, money). Consider the Simple Linear Regression model given in equation (1). The slope m of the line and its y-intercept are calculated based on equations (2) and (3).
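Equations (1) to (3) are not reproduced in this copy of the handout. Assuming they are the standard least-squares closed form for y = m*x + b, the computation can be sketched as below. The helper names `to_sums`, `add_sums`, and `fit_line` are hypothetical, and the sample points are made up. The map-then-reduce shape is chosen because it mirrors how the five sums could be gathered in a single pass over an RDD.

```python
from functools import reduce


def to_sums(point):
    """Map one (x, y) pair to the partial sums (n, x, y, x*y, x*x)."""
    x, y = point
    return (1, x, y, x * y, x * x)


def add_sums(a, b):
    """Add two tuples of partial sums element-wise."""
    return tuple(u + v for u, v in zip(a, b))


def fit_line(points):
    """Closed-form least-squares slope m and intercept b (assumed formulas)."""
    n, sx, sy, sxy, sxx = reduce(add_sums, map(to_sums, points))
    denom = n * sxx - sx * sx
    if denom == 0:
        raise ValueError("all x values are identical; the slope is undefined")
    m = (n * sxy - sx * sy) / denom
    b = (sxx * sy - sx * sxy) / denom
    return m, b


# Made-up (distance, fare) pairs that lie exactly on y = 2x + 3.
sample_points = [(1.0, 5.0), (2.0, 7.0), (3.0, 9.0)]
slope, intercept = fit_line(sample_points)
print(slope, intercept)
```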
