Hello World,
TL;DR: Principal Component Analysis to find relationships between multiple variables in the Chicago Taxi Trips dataset, followed by a model to predict taxi fares.
This assignment was given by Prof Patrick Healy at the University of Limerick for the Big Data and Visualisation module (CS6502).
Task
In Lab 11 you learned how to create and evaluate a machine learning model and then predict with it. In this project you are asked to practice those skills on a different dataset – Chicago Taxi Trips. This dataset includes taxi trips from 2013 to the present, reported to the City of Chicago.
Your task here is to
1. use the Python scikit-learn library to investigate the similarities and relationships that PCA can reveal. It will not be possible for you to analyze the entire data set in this way, so you should perform your analysis on *the first 1000 rows* of the table. Ideally the output of this phase should guide the direction you take in the second, ML, part of the assignment. You will need to decide which columns should be part of your analysis and which should be ignored. See some of the tutorial links we have posted in the lecture slides for help in deciding which columns are appropriate to consider.
2. based on the information you gained from step 1, create a model to predict the taxi fare (the “fare” column in the dataset).
Note that you may need to clean the data, pick a list of features (feature engineering), and then design your model. Please email your zipped solution pack to the lecturer by the end of week 14 with subject “cs6502:ml proj”.
Note that your solution pack should contain
- the query you used to select the first 1000 rows
- key steps for your PCA (setup, commands, etc.)
- the SQL for model creation
- the SQL to evaluate the model
- the SQL to predict using the model above
- links to the BigQuery commands you composed
- a screenshot of the model evaluation report
You will be assessed on the quality of the information you glean and feed into the ML part, and on how you present this to us. Again, see how this is done in the tutorials. (Note: as the tutorials make clear, the R statistical package has very good support for this type of analysis. If you wish to take this task on using R, that will be OK too.)
Solutions
Link to Google Colab .ipynb file: https://colab.research.google.com/drive/1LdTYfayhNcmg_rcU-DUZWwORD2zQ3Zu1?usp=sharing
The following document contains the solution pack:
- the query used to select the first 1000 rows
- key steps for the PCA (setup, commands, etc.)
- the SQL for model creation
- the SQL to evaluate the model
- the SQL to predict using the model above
- links to the BigQuery commands composed
- a screenshot of the model evaluation report
The query used to select the first 1000 rows
SELECT
  fare,
  EXTRACT(DAYOFWEEK FROM trip_start_timestamp) AS dayofweek,
  EXTRACT(HOUR FROM trip_start_timestamp) AS hourofday,
  trip_miles,
  trip_seconds,
  pickup_longitude,
  pickup_latitude,
  dropoff_longitude,
  dropoff_latitude,
  pickup_community_area,
  dropoff_community_area,
  trip_start_timestamp
FROM
  `bigquery-public-data.chicago_taxi_trips.taxi_trips`
WHERE
  trip_seconds >= 60
  AND trip_miles > 0.5
  AND fare > 0
  AND pickup_longitude IS NOT NULL
  AND pickup_latitude IS NOT NULL
  AND dropoff_longitude IS NOT NULL
  AND dropoff_latitude IS NOT NULL
  AND pickup_longitude <> dropoff_longitude
  AND pickup_latitude <> dropoff_latitude
  AND pickup_community_area IS NOT NULL
  AND dropoff_community_area IS NOT NULL
  AND MOD(ABS(FARM_FINGERPRINT(CAST(trip_start_timestamp AS STRING))), 5000) = 1
LIMIT
  1000
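The result set was exported from the BigQuery console as a CSV and loaded into Colab below. As a hedged alternative (not part of the submitted solution pack; the project id is a placeholder), the same query can be pulled straight into pandas with the official BigQuery client:

# Run the query above directly from Python instead of exporting a CSV
from google.cloud import bigquery

client = bigquery.Client(project="your-gcp-project")  # placeholder project id
sql = """
SELECT fare, trip_miles, trip_seconds, trip_start_timestamp
FROM `bigquery-public-data.chicago_taxi_trips.taxi_trips`
WHERE fare > 0
LIMIT 1000
"""  # abbreviated here; use the full query shown above
df = client.query(sql).to_dataframe()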
Key steps for your PCA (setup, commands, etc.)
PCA and data exploration, cleaning, feature engineering – Google Colab link
https://colab.research.google.com/drive/1LdTYfayhNcmg_rcU-DUZWwORD2zQ3Zu1?usp=sharing
Importing
import pandas as pd
import numpy as np

# Visualizations
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
plt.style.use('fivethirtyeight')
palette = sns.color_palette('Paired', 10)

# PCA and scaling
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Pandas display options
pd.set_option('display.float_format', lambda x: '%.3f' % x)

# Set random seed
RSEED = 10
Importing the data set (1000 sampled rows from the Chicago Taxi Trips dataset)
url = "/content/results-20200510-190212.csv"
df = pd.read_csv(url)
df.head()
|   | fare | dayofweek | hourofday | trip_miles | trip_seconds | pickup_longitude | pickup_latitude | dropoff_longitude | dropoff_latitude | pickup_community_area | dropoff_community_area | trip_start_timestamp |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 8.650 | 4 | 23 | 2.600 | 420 | -87.649 | 41.923 | -87.633 | 41.900 | 7 | 8 | 2015-07-22 23:45:00 UTC |
| 1 | 8.450 | 4 | 23 | 2.300 | 540 | -87.613 | 41.892 | -87.643 | 41.879 | 8 | 28 | 2015-07-22 23:45:00 UTC |
| 2 | 7.250 | 4 | 23 | 1.800 | 480 | -87.619 | 41.891 | -87.643 | 41.879 | 8 | 28 | 2015-07-22 23:45:00 UTC |
| 3 | 6.050 | 4 | 23 | 1.300 | 300 | -87.629 | 41.900 | -87.613 | 41.892 | 8 | 8 | 2015-07-22 23:45:00 UTC |
| 4 | 7.050 | 4 | 23 | 1.600 | 420 | -87.638 | 41.893 | -87.619 | 41.891 | 8 | 8 | 2015-07-22 23:45:00 UTC |
df.describe()
|   | fare | dayofweek | hourofday | trip_miles | trip_seconds | pickup_longitude | pickup_latitude | dropoff_longitude | dropoff_latitude | pickup_community_area | dropoff_community_area |
|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 1000.000 | 1000.000 | 1000.000 | 1000.000 | 1000.000 | 1000.000 | 1000.000 | 1000.000 | 1000.000 | 1000.000 | 1000.000 |
| mean | 13.257 | 3.217 | 16.230 | 3.926 | 824.164 | -87.656 | 41.894 | -87.643 | 41.896 | 24.643 | 20.044 |
| std | 11.697 | 1.318 | 4.679 | 5.090 | 719.886 | 0.072 | 0.040 | 0.042 | 0.037 | 19.623 | 15.238 |
| min | 4.450 | 1.000 | 13.000 | 0.510 | 60.000 | -87.914 | 41.713 | -87.914 | 41.706 | 1.000 | 1.000 |
| 25% | 6.750 | 1.000 | 13.000 | 1.177 | 420.000 | -87.645 | 41.879 | -87.648 | 41.881 | 8.000 | 8.000 |
| 50% | 8.750 | 4.000 | 13.000 | 1.900 | 633.500 | -87.632 | 41.892 | -87.632 | 41.892 | 28.000 | 8.000 |
| 75% | 13.300 | 4.000 | 23.000 | 3.900 | 960.000 | -87.621 | 41.900 | -87.621 | 41.901 | 32.000 | 32.000 |
| max | 124.250 | 4.000 | 23.000 | 64.400 | 13620.000 | -87.583 | 42.010 | -87.573 | 42.010 | 77.000 | 77.000 |
plt.figure(figsize = (10, 6))
sns.distplot(df['fare']);
plt.title('Distribution of Fare');
Binning
# Bin the fare and convert to string
df['fare-bin'] = pd.cut(df['fare'], bins = list(range(0, 50, 5))).astype(str)
# Uppermost bin
df.loc[df['fare-bin'] == 'nan', 'fare-bin'] = '[45+]'
# Adjust bin so the sorting is correct
df.loc[df['fare-bin'] == '(5.0, 10.0]', 'fare-bin'] = '(05.0, 10.0]'
# Bar plot of value counts
df['fare-bin'].value_counts().sort_index().plot.bar(color = 'b', edgecolor = 'k');
plt.title('Fare Binned');

Latitude/longitude distribution of pickups and dropoffs
fig, axes = plt.subplots(1, 2, figsize = (20, 8), sharex=True, sharey=True)
axes = axes.flatten()
# Plot Longitude (x) and Latitude (y)
sns.regplot('pickup_longitude', 'pickup_latitude', fit_reg = False,
            data = df.sample(100, random_state = RSEED), ax = axes[0]);
sns.regplot('dropoff_longitude', 'dropoff_latitude', fit_reg = False,
            data = df.sample(100, random_state = RSEED), ax = axes[1]);
axes[0].set_title('Pickup Locations')
axes[1].set_title('Dropoff Locations');

# Absolute difference in latitude and longitude
df['abs_lat_diff'] = (df['dropoff_latitude'] - df['pickup_latitude']).abs()
df['abs_lon_diff'] = (df['dropoff_longitude'] - df['pickup_longitude']).abs()
sns.lmplot('abs_lat_diff', 'abs_lon_diff', fit_reg = False,
           data = df.sample(900, random_state=RSEED));
plt.title('Absolute latitude difference vs Absolute longitude difference');

sns.lmplot('abs_lat_diff', 'abs_lon_diff', hue = 'fare-bin', size = 8, palette=palette,
           fit_reg = False, data = df.sample(900, random_state=RSEED));
plt.title('Absolute latitude difference vs Absolute longitude difference');
It does seem that rides with a larger absolute difference in both longitude and latitude tend to cost more. To capture both differences in a single variable, we can add up the two differences in latitude and longitude, and also take the square root of the sum of the squared differences. The former feature is called the Manhattan distance - or l1 norm - and the latter is called the Euclidean distance - or l2 norm. Both of these distances are specific examples of the general Minkowski distance.
Manhattan and Euclidean Distance
The Minkowski Distance between two points is expressed as:
$$D(X, Y) = \left( \sum_{i=1}^{n} |x_i - y_i|^p \right)^{1/p}$$
If p = 1 this is the Manhattan distance, and if p = 2 it is the Euclidean distance. You may also see these referred to as the l1 or l2 norm, where the number indicates p in the equation.
I should point out that these equations are only valid for actual distances in a Cartesian coordinate system, and here we only use them to find relative distances. To find actual distances in terms of kilometers we would have to work with the latitude and longitude geographical coordinate system, which is what the Haversine formula does (a sketch follows the distance function below).
def minkowski_distance(x1, x2, y1, y2, p):
    return ((abs(x2 - x1) ** p) + (abs(y2 - y1)) ** p) ** (1 / p)
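As a quick sanity check (not in the original notebook), plugging in the first row of the preview table above (values as displayed, rounded to three decimals) gives a Manhattan distance of 0.016 + 0.023 = 0.039 degrees and a Euclidean distance of roughly 0.028 degrees:

minkowski_distance(-87.649, -87.633, 41.923, 41.900, 1)  # ~0.039 (Manhattan, p = 1)
minkowski_distance(-87.649, -87.633, 41.923, 41.900, 2)  # ~0.028 (Euclidean, p = 2)

For reference, here is a minimal sketch of the Haversine distance mentioned above, assuming coordinates in degrees and an Earth radius of 6371 km; this helper is an illustration, not part of the submitted solution:

def haversine_km(lon1, lat1, lon2, lat2):
    """Great-circle distance in kilometres between two (lon, lat) points given in degrees."""
    lon1, lat1, lon2, lat2 = map(np.radians, [lon1, lat1, lon2, lat2])
    dlon = lon2 - lon1
    dlat = lat2 - lat1
    a = np.sin(dlat / 2.0) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2.0) ** 2
    return 2 * 6371 * np.arcsin(np.sqrt(a))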
# Create a color mapping based on fare bins
color_mapping = {fare_bin: palette[i] for i, fare_bin in enumerate(df['fare-bin'].unique())}
color_mapping
df['color'] = df['fare-bin'].map(color_mapping)
plot_data = df.sample(100, random_state = RSEED)
df['manhattan'] = minkowski_distance(df['pickup_longitude'], df['dropoff_longitude'],
                                     df['pickup_latitude'], df['dropoff_latitude'], 1)
# Calculate distribution by each fare bin
plt.figure(figsize = (12, 6))
for f, grouped in df.groupby('fare-bin'):
    sns.kdeplot(grouped['manhattan'], label = f'{f}', color = list(grouped['color'])[0]);
plt.xlabel('degrees'); plt.ylabel('density')
plt.title('Manhattan Distance by Fare Amount');

df['euclidean'] = minkowski_distance(df['pickup_longitude'], df['dropoff_longitude'],
                                     df['pickup_latitude'], df['dropoff_latitude'], 2)
# Calculate distribution by each fare bin
plt.figure(figsize = (12, 6))
for f, grouped in df.groupby('fare-bin'):
    sns.kdeplot(grouped['euclidean'], label = f'{f}', color = list(grouped['color'])[0]);
plt.xlabel('degrees'); plt.ylabel('density')
plt.title('Euclidean Distance by Fare Amount');
These features do seem to have some differences between the different fare amounts, so they might be helpful in predicting the fare.
grouped = df.groupby('fare-bin')['euclidean'].agg(['mean', 'count'])
grouped.sort_index(ascending=True)
| fare-bin | mean | count |
|---|---|---|
| (0.0, 5.0] | 0.013 | 21 |
| (05.0, 10.0] | 0.020 | 578 |
| (10.0, 15.0] | 0.043 | 196 |
| (15.0, 20.0] | 0.069 | 54 |
| (20.0, 25.0] | 0.087 | 18 |
| (25.0, 30.0] | 0.161 | 31 |
| (30.0, 35.0] | 0.173 | 21 |
| (35.0, 40.0] | 0.236 | 19 |
| (40.0, 45.0] | 0.279 | 36 |
| [45+] | 0.273 | 26 |
df.groupby('fare-bin')['euclidean'].mean().plot.bar(color = 'b');
plt.title('Average Euclidean Distance by Fare Bin');
There is a very clear relationship between the fare bin and the average distance of the trip! This should give us confidence that this feature will be useful to a model.
corrs = df.corr()
corrs['fare'].plot.bar(color = 'g', figsize = (10, 10));
plt.title('Correlation with Fare Amount');
corrs = df.corr()
plt.figure(figsize = (14, 14))
sns.heatmap(corrs, annot = True, vmin = -1, vmax = 1, fmt = '.3f', cmap=plt.cm.PiYG_r);

The basic idea when using PCA as a tool for feature selection is to select variables according to the magnitude (from largest to smallest in absolute value) of their coefficients (loadings).
from sklearn.preprocessing import StandardScaler

features = ['manhattan', 'dayofweek', 'hourofday', 'trip_seconds', 'pickup_longitude',
            'pickup_latitude', 'dropoff_longitude', 'dropoff_latitude',
            'pickup_community_area', 'dropoff_community_area']

# Separate out the features
x = df.loc[:, features].values

# Separate out the target
y = df.loc[:, ['fare']].values

# Standardize the features
x = StandardScaler().fit_transform(x)
x
# np.cov expects variables in rows, so work with the transposed feature matrix
features = x.T
covariance_matrix = np.cov(features)
print(covariance_matrix)

# Eigendecomposition of the covariance matrix
eig_vals, eig_vecs = np.linalg.eig(covariance_matrix)
print('Eigenvectors \n%s' % eig_vecs)
print('\nEigenvalues \n%s' % eig_vals)

# Proportion of total variance explained by the first component
eig_vals[0] / sum(eig_vals)
Standardize the data: since PCA yields a feature subspace that maximizes the variance along the axes, it makes sense to standardize the data, especially if it was measured on different scales. That is why the data was transformed onto unit scale above (mean = 0 and variance = 1), which is also a requirement for the optimal performance of many machine learning algorithms.
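As a quick hedged sanity check (not in the original notebook) that the scaling behaved as described, every standardized column should have roughly zero mean and unit variance:

# x is the standardized feature matrix produced by StandardScaler above
print(x.mean(axis=0).round(3))  # expect ~0 for every column
print(x.std(axis=0).round(3))   # expect ~1 for every column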
Ref: https://stackoverflow.com/questions/50796024/feature-variable-importance-after-a-pca-analysis
pca = PCA()
features = pca.fit_transform(x)
explained_variance = pca.explained_variance_ratio_
explained_variance
def myplot(score, coeff, labels=None):
    xs = score[:, 0]
    ys = score[:, 1]
    n = coeff.shape[0]
    scalex = 1.0 / (xs.max() - xs.min())
    scaley = 1.0 / (ys.max() - ys.min())
    plt.scatter(xs * scalex, ys * scaley, c = y)
    for i in range(n):
        plt.arrow(0, 0, coeff[i, 0], coeff[i, 1], color = 'r', alpha = 0.5)
        if labels is None:
            plt.text(coeff[i, 0] * 1.15, coeff[i, 1] * 1.15, "Var" + str(i + 1),
                     color = 'g', ha = 'center', va = 'center')
        else:
            plt.text(coeff[i, 0] * 1.15, coeff[i, 1] * 1.15, labels[i],
                     color = 'g', ha = 'center', va = 'center')
    plt.xlim(-1, 1)
    plt.ylim(-1, 1)
    plt.xlabel("PC{}".format(1))
    plt.ylabel("PC{}".format(2))
    plt.grid()

# Call the function. Use only the first 2 PCs.
myplot(features[:, 0:2], np.transpose(pca.components_[0:2, :]))
plt.show()
pca.explained_variance_ratio_
cumsum = np.cumsum(pca.explained_variance_ratio_)
%matplotlib inline
import matplotlib.pyplot as plt
plt.xlabel("number of components")
plt.ylabel("Cumulative explained variance")
plt.plot(cumsum,'--o')
print(abs( pca.components_ ))
Here, pca.components_ has shape [n_components, n_features]. Looking at PC1 (the first principal component), i.e. the first row, the 1st, 4th, 5th and 9th variables have the largest absolute loadings, which in this case are 'manhattan' (the Manhattan distance), 'trip_seconds', 'pickup_longitude' and 'pickup_community_area'.
However, pickup_longitude is a raw, unprocessed feature, whereas the Manhattan distance is an engineered feature produced by the feature-processing step above, so we will be using the Manhattan distance.
NOTE: We are not going to use trip_miles because (a) it is usually not available at the beginning of a journey, and (b) it is highly correlated with the Manhattan distance, and we should not feed the model multiple variables that are highly correlated with each other if we want to keep the generalisation error down.
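To make the loading-based selection described above easy to reproduce, here is a hedged sketch that ranks the original variables by the absolute size of their PC1 loadings; the index list is assumed to match the order of the `features` list defined before scaling.

# Rank the original variables by |loading| on the first principal component
feature_names = ['manhattan', 'dayofweek', 'hourofday', 'trip_seconds',
                 'pickup_longitude', 'pickup_latitude', 'dropoff_longitude',
                 'dropoff_latitude', 'pickup_community_area', 'dropoff_community_area']
pc1_loadings = pd.Series(np.abs(pca.components_[0]), index=feature_names)
print(pc1_loadings.sort_values(ascending=False))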
Now let's move on to the ML part in BigQuery on GCP.
The SQL for model creation: https://console.cloud.google.com/bigquery?sq=1085073370079:ac74aed7fd1b4bdc8f6b85a60dfa5bc6
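The saved query behind the link is not reproduced in this post, so here is a hedged sketch of what a BigQuery ML linear-regression model creation for this task typically looks like; the `my_project.chicago_taxi.fare_model` name and the exact feature list are assumptions, not the submitted query.

CREATE OR REPLACE MODEL `my_project.chicago_taxi.fare_model`
OPTIONS (model_type = 'linear_reg', input_label_cols = ['fare']) AS
SELECT
  fare,
  trip_seconds,
  EXTRACT(DAYOFWEEK FROM trip_start_timestamp) AS dayofweek,
  EXTRACT(HOUR FROM trip_start_timestamp) AS hourofday,
  -- Engineered Manhattan distance, mirroring the PCA feature above
  ABS(dropoff_latitude - pickup_latitude)
    + ABS(dropoff_longitude - pickup_longitude) AS manhattan
FROM
  `bigquery-public-data.chicago_taxi_trips.taxi_trips`
WHERE
  trip_seconds >= 60
  AND trip_miles > 0.5
  AND fare > 0
  AND pickup_latitude IS NOT NULL
  AND dropoff_latitude IS NOT NULL
  AND pickup_longitude IS NOT NULL
  AND dropoff_longitude IS NOT NULL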
Model test/predict – the SQL to predict using the model: https://console.cloud.google.com/bigquery?sq=1085073370079:39a02464cfc54a80bd8cfba1d1722440
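Again as a hedged sketch only (the model name and feature expressions mirror the assumed training query above): ML.PREDICT returns a predicted_fare column alongside the columns passed in.

SELECT
  predicted_fare,
  fare
FROM
  ML.PREDICT(MODEL `my_project.chicago_taxi.fare_model`,
    (SELECT
       fare,
       trip_seconds,
       EXTRACT(DAYOFWEEK FROM trip_start_timestamp) AS dayofweek,
       EXTRACT(HOUR FROM trip_start_timestamp) AS hourofday,
       ABS(dropoff_latitude - pickup_latitude)
         + ABS(dropoff_longitude - pickup_longitude) AS manhattan
     FROM `bigquery-public-data.chicago_taxi_trips.taxi_trips`
     WHERE trip_seconds >= 60 AND trip_miles > 0.5 AND fare > 0
       AND pickup_latitude IS NOT NULL AND dropoff_latitude IS NOT NULL
       AND pickup_longitude IS NOT NULL AND dropoff_longitude IS NOT NULL
     LIMIT 100))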
Evaluate model – the SQL to evaluate the model: https://console.cloud.google.com/bigquery?sq=1085073370079:fa04e26003b94cef86773dd3423da54e
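And a hedged sketch of the evaluation step; called without a second argument, ML.EVALUATE reports regression metrics (mean_absolute_error, mean_squared_error, r2_score, etc. for a linear_reg model) on the evaluation data BigQuery typically holds out during training. The model name is again an assumption.

SELECT
  *
FROM
  ML.EVALUATE(MODEL `my_project.chicago_taxi.fare_model`)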
References:
https://stackoverflow.com/questions/40120696/trying-to-get-the-euclidean-distance-through-a-query
https://towardsdatascience.com/another-machine-learning-walk-through-and-a-challenge-8fae1e187a64