1 Description
In this assignment, you will implement Batch Gradient Descent to fit a line to a two-dimensional data set. You will implement a set of Spark jobs that learn the parameters of such a line from the New York City taxi trip reports of 2013. The data set was released under FOIL (the Freedom of Information Law) and made public by Chris Whong (https://chriswhong.com/open-data/foil_nyc_taxi/). See Assignment 1 for details about this data set.
We would like to train a linear model that relates travel distance in miles to fare amount (the money paid for the taxi ride).
2 Taxi Data Set - Same data set as Assignment 1
This is the same data set as used for Assignment 1. Please have a look at the table description there.
The data set is in Comma-Separated Values (CSV) format. When you read a line and split it on the comma character ",", you get a string array of length 17. With indices starting from zero, this assignment needs index 5, trip distance (in miles), and index 11, fare amount (in dollars), as stated in the following table.
Table 1: Taxi Data Set fields
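As a small illustration of the field layout described above, the following sketch splits one line on commas and picks out indices 5 and 11. The helper name parse_distance_fare and the sample row are made up for illustration; only the two indices and the field count of 17 come from the text.

```python
def parse_distance_fare(line):
    """Return (trip_distance, fare_amount) as floats, or None for a malformed line."""
    fields = line.split(",")
    if len(fields) != 17:  # every valid record has 17 fields
        return None
    try:
        return float(fields[5]), float(fields[11])
    except ValueError:
        return None


# Synthetic row: placeholder strings everywhere except the two fields we need.
sample_fields = ["x"] * 17
sample_fields[5] = "2.5"    # trip distance in miles
sample_fields[11] = "10.0"  # fare amount in dollars
sample_line = ",".join(sample_fields)

print(parse_distance_fare(sample_line))
```

Returning None for malformed lines keeps the function easy to combine with a later filter step that drops the None results.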
You can use the following PySpark code to clean up the data.

    def isfloat(value):
        # True if value can be converted to a float
        try:
            float(value)
            return True
        except (TypeError, ValueError):
            return False

    def correctRows(p):
        # Keep rows with 17 fields and non-zero numeric distance and fare.
        # Returns the row (truthy) to keep it, None (falsy) to drop it.
        if len(p) == 17:
            if isfloat(p[5]) and isfloat(p[11]):
                if float(p[5]) != 0 and float(p[11]) != 0:
                    return p

    testDataFrame = spark.read.format('csv').\
        options(header='false', inferSchema='true', sep=",").\
        load(testFile)

    testRDD = testDataFrame.rdd.map(tuple)
    taxilinesCorrected = testRDD.filter(correctRows)
In addition to the above filtering, you should remove all rides whose total amount is larger than 600 USD or less than 1 USD. To avoid repeating this computation in each run, you can preprocess and clean the data once and store the result in your own cluster storage.
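The total-amount rule above can be sketched as a plain predicate. The text here does not say which column holds the total amount, so TOTAL_AMOUNT_INDEX = 16 is an assumption that should be checked against the table in Assignment 1; the names total_in_range and make_row and the sample values are likewise hypothetical.

```python
TOTAL_AMOUNT_INDEX = 16  # ASSUMPTION: total amount is the last of the 17 fields
MIN_TOTAL = 1.0          # rides below 1 USD are removed
MAX_TOTAL = 600.0        # rides above 600 USD are removed


def total_in_range(p, idx=TOTAL_AMOUNT_INDEX):
    """True if the row's total amount lies within [MIN_TOTAL, MAX_TOTAL]."""
    try:
        total = float(p[idx])
    except (TypeError, ValueError, IndexError):
        return False
    return MIN_TOTAL <= total <= MAX_TOTAL


def make_row(total):
    """Build a synthetic 17-field row with the given total amount."""
    row = [0.0] * 17
    row[TOTAL_AMOUNT_INDEX] = total
    return tuple(row)


rows = [make_row(t) for t in (0.5, 1.0, 12.3, 600.0, 750.0)]
kept = [r for r in rows if total_in_range(r)]
print([r[TOTAL_AMOUNT_INDEX] for r in kept])
```

On the cluster, the same predicate could be passed to taxilinesCorrected.filter(...), and the cleaned RDD written once with saveAsTextFile to a storage path of your own so later runs can start from the cleaned data.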
3 Obtaining the Dataset
Small data set (93 MB compressed, 384 MB uncompressed, roughly 2 million taxi trips) for implementation and testing purposes. This is available at Google Storage:
https://storage.googleapis.com/met-cs-777-data/taxi-data-sorted-small.csv.bz2
The whole data set (8 GB) is available at:
https://storage.googleapis.com/met-cs-777-data/taxi-data-sorted-large.csv.bz2
When running your code on the cluster, you can access the data sets using the following internal URLs:
Data Set         Google Cloud
Small Data Set   gs://met-cs-777-data/taxi-data-sorted-small.csv.bz2
Large Data Set   gs://met-cs-777-data/taxi-data-sorted-large.csv.bz2
Table 2: Data set on Google Cloud Storage - URL
4 Assignment Tasks
4.1 Task 1: Simple Linear Regression (4 points)
We want to fit a simple line to our data (distance, money). Consider the Simple Linear Regression model given in equation (1). The slope m of the line and its y-intercept are calculated using equations (2) and (3).
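Equations (1) to (3) are not reproduced in this text, so the sketch below assumes they are the usual least-squares closed forms for a line y = m*x + b, with b as a label for the y-intercept: m = (n*Sxy - Sx*Sy) / (n*Sxx - Sx^2) and b = (Sxx*Sy - Sx*Sxy) / (n*Sxx - Sx^2), where Sx, Sy, Sxy and Sxx are the sums of x, y, x*y and x*x over all n points. The function names and the sample points are hypothetical. The code uses plain Python map and reduce so that it runs without a cluster.

```python
from functools import reduce


def to_sums(point):
    """Map one (x, y) pair to its contribution (n, Sx, Sy, Sxy, Sxx)."""
    x, y = point
    return (1, x, y, x * y, x * x)


def add_sums(a, b):
    """Combine two partial sum tuples element-wise."""
    return tuple(u + v for u, v in zip(a, b))


def fit_line(points):
    """Return (m, b) from the assumed closed-form least-squares formulas."""
    n, sx, sy, sxy, sxx = reduce(add_sums, map(to_sums, points))
    denom = n * sxx - sx * sx
    if denom == 0:
        raise ValueError("all x values are identical; the slope is undefined")
    m = (n * sxy - sx * sy) / denom
    b = (sxx * sy - sx * sxy) / denom
    return m, b


# Synthetic points that lie exactly on y = 2x + 1.
points = [(1.0, 3.0), (2.0, 5.0), (3.0, 7.0), (4.0, 9.0)]
m, b = fit_line(points)
print(m, b)
```

Because the whole computation reduces to five running sums, the same two helper functions could be applied to an RDD of (distance, fare) pairs with its map and reduce methods, which needs only a single pass over the data.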