Transport for London (TfL) Cycle Data Analysis Project – Programming & Data Science

University: University of Greenwich
Subject: Programming

Programming and Data Science for the Professions: Group project specification

Programming and Data Science in the real world

In this collaborative project, you will engage with real-world data from Transport for London (TfL). You will be using a set of TfL data on bicycle usage across central London at various sites between 2014 and 2019.

The data (obtained originally from the TfL website) consists of various files from a large number of monitoring sites across central London. There is one file for each quarter starting from Q1 in 2014 and ending with Q4 in 2019, and these can be found on Moodle in a zipped folder.

The first piece of code that uses these files will take a little time to work through them, so do not be concerned if your program takes some time to run.

Overview of the task

The aim of this project is to focus on the data from three different monitoring sites (the set of three will be different for each group). We will imagine that TfL are interested in the volume of cycle traffic passing each point, with a view to introducing a charging regime for busy periods.

  • The first task will be to aggregate the data into a single file in a suitable format, eliminating irrelevant data.
  • The second task will generate various scatter plots to illustrate different aspects of the data, first over the whole time period, and second by considering only the time of day.
  • The third task will consider how to aggregate and average the data in an appropriate manner and display a corresponding plot.
  • The final task will apply some basic machine learning tools to identify possible charging periods.

To finish, you will consider to what extent your results could be used to derive a meaningful charging regime.

General advice

Your code should include comments: adding these as you go will help you keep track of what is going on, and make it easier for other group members to understand. Some marks are also awarded for the quality of your comments.

It is also always a good idea to test different bits of your code with examples where you know what to expect.

The tasks are not independent of each other and will need to be completed in order. Groups are strongly recommended to work together on each task rather than trying to divide the tasks between different members.

Detailed instructions

Before you work with the data, it will help to open the files you have been given in Excel and examine their format. You will also be collecting your results into a Word file, which will be referred to as your results file here.

If the project specification asks you to use a certain method then you must do so – using a different method will reduce your mark.

Step 1:

Your answer code for this question should be in a python file called Step1.py. For this question you should carry out the following steps. First we prepare some data:

a. Make a list containing the three station names assigned to your group.

b. Create a counter (set to zero) which will be used to number the rows in your output file.

The main part of this step will append rows to the output file. If we run this more than once the file will get too large, so we will start by ensuring that the output file is empty. To do this

c. Open the file project_data.csv using the option “w” to write to the file, and then use the command pass to do nothing with this file open. This will create an empty version of the file.

The main part of your code for this step will loop through the various csv data files and import the relevant data. This will be much easier if you ensure that the data files are in the same directory as your code. You will also need to unzip the files from the zipped version given on Moodle.

d. Create a loop variable i running over the years 2014-2019, and inside this create a second loop variable j running from 1 to 4. Inside these loops complete the following steps.

e. Create a string corresponding to the file name for the data in quarter j of year i. For example if i=2015 and j=3 then the string should be “2015-Q3-Central.csv”.

f. Open the csv file with the given string as filename and while it is open complete the following steps.

g. For each row in the file, if the entry in position 1 corresponds to one of the stations in your station list then:

1. Replace the entry in position 0 of the row by a string made up of the year and quarter of the file. For example, with the file mentioned above this should be 2015Q3.

2. Recall that your row is regarded as a list while you work with it. Make a new list called out_row by appending the current counter value to the beginning of the list. Then increase the counter value by 1.

3. Open the file project_data.csv in append mode and add the out_row to this file.

After you have done the above your file project_data.csv should contain all of the data for your three stations with the year and quarter in a better format and with a new initial column numbering the rows. The looped code above might take a while to run so do not be concerned if it takes some time! To test it you might want to initially run a version with fewer years in it.
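Steps a to g can be sketched roughly as follows. This is only an illustration under assumptions: the station names used with it are placeholders, the function names `quarter_filename` and `collect_rows` and the `data_dir` parameter are hypothetical, and skipping a missing quarter file is a convenience chosen here for testing with fewer files, not something the specification asks for.

```python
import csv
from pathlib import Path


def quarter_filename(year, quarter):
    """Build the file name for one quarter, e.g. 2015-Q3-Central.csv."""
    return f"{year}-Q{quarter}-Central.csv"


def collect_rows(stations, years, data_dir=".", out_name="project_data.csv"):
    """Copy the rows for the chosen stations into one numbered csv file."""
    data_dir = Path(data_dir)
    out_path = data_dir / out_name
    counter = 0  # numbers the rows of the output file

    # Opening in "w" mode and doing nothing leaves an empty file behind.
    with open(out_path, "w", newline=""):
        pass

    for i in years:
        for j in range(1, 5):
            in_path = data_dir / quarter_filename(i, j)
            if not in_path.exists():
                # Assumption: quietly skip quarters that are not present.
                continue
            with open(in_path, newline="") as in_file:
                for row in csv.reader(in_file):
                    if len(row) > 1 and row[1] in stations:
                        row[0] = f"{i}Q{j}"        # e.g. 2015Q3
                        out_row = [counter] + row  # counter goes first
                        counter += 1
                        with open(out_path, "a", newline="") as out_file:
                            csv.writer(out_file).writerow(out_row)
    return counter
```

A call such as `collect_rows(["SITE_A", "SITE_B", "SITE_C"], range(2014, 2020))` would cover every quarter from 2014 to 2019; passing `range(2014, 2015)` first is one way to try the code on a single year.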

h. Define a list called col_names containing the column names “Quarter”, “Station”, “Date”, “Weather”, “Time”, “Day”, “Drop1”, “Direction”, “Drop2”, “Mode” and “Count”. These are almost the names in the original file, but we have used “Drop1” and “Drop2” to label the columns we plan to drop.

i. Use Pandas to read the file project_data.csv that you have just created as a dataframe using col_names for the column names.

j. Drop the columns labelled “Drop1” and “Drop2” from the dataframe.

k. Create a new column in your dataframe labelled “Full_time” by concatenating the entries in the columns labelled by “Date” and “Time” with a space inserted between them.

l. Output your final dataframe as the file CycleData.xlsx on a sheet called Sheet1.
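Steps h to k might look something like the sketch below. It rests on assumptions: the function name `build_frame` is hypothetical, and so is the extra name "Row", which is added here so that the counter column written in the earlier loop has a label and can serve as the index. Other ways of handling that first column are possible.

```python
import pandas as pd

col_names = ["Quarter", "Station", "Date", "Weather", "Time", "Day",
             "Drop1", "Direction", "Drop2", "Mode", "Count"]


def build_frame(source="project_data.csv"):
    """Read the aggregated csv, drop unused columns and add Full_time."""
    # "Row" labels the counter column so that it becomes the index.
    df = pd.read_csv(source, names=["Row"] + col_names, index_col="Row")
    df = df.drop(columns=["Drop1", "Drop2"])
    # Join date and time as text, separated by a single space.
    df["Full_time"] = df["Date"].astype(str) + " " + df["Time"].astype(str)
    return df
```

The result could then be written out with `build_frame().to_excel("CycleData.xlsx", sheet_name="Sheet1")`, which needs an Excel writer engine such as openpyxl to be installed.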

Step 2:

The remaining steps will rely on the file called CycleData.xlsx from Step 1. If you are not able to complete step 1 (or you want to start writing your code for later steps before it is finished) then you should use the dummy file provided on Moodle. Note that if this dummy file is used for your final analysis, then your project mark will be reduced.

In order to use the functions created in this step you will need to understand the nature of the data in this file, and should spend a little time reviewing this data in Excel.

Your answer code for this question should be in a python file called Step2.py. For this question you should carry out the following steps:

a. Load the file called CycleData.xlsx as a dataframe using Pandas, using column 0 as the index column. Call this dataframe my_data.

b. Ask the user to enter various choices of data using the following prompts:

1. “Enter desired station: ” (store this as station)

2. “Enter desired direction: ” (store this as direction)

3. “Do you want to restrict to private cycles only (Y/N)?: ” (store this as private_only)

4. “Do you want to display the data by time (T) or by date and time (D)?: ” (store this as period_type)

5. “Do you want to colour code by Weather, Direction, or Mode?: ” (store this as shade)

c. For each response your code should use an appropriate method to put it into a consistent format (e.g. always upper case, or always lower case).

d. Write a function which given a station, a shade, and a period_type will produce a Seaborn scatterplot using for x values either the “Time” column (if T was chosen) or the “Full_time” column (if D was chosen) and for y values the “Count” column, with hue given by the choice of shade, and data restricted to the rows of the dataframe corresponding to the station choice.

e. Try out your function using the date and time option. You will see the default output is too compressed. Adjust your function so that the scatterplot is drawn with a figsize of (20, 5) and try again.
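A minimal sketch of the function described in steps d and e is given below. The name `plot_station` is hypothetical, and passing the dataframe in as the first argument is an assumption made to keep the example self-contained; the specification itself only lists station, shade and period_type as inputs.

```python
import matplotlib.pyplot as plt
import seaborn as sns


def plot_station(my_data, station, shade, period_type):
    """Scatter Count against time for one station, coloured by shade."""
    # "T" plots against time of day, anything else against date and time.
    x_column = "Time" if period_type == "T" else "Full_time"
    subset = my_data[my_data["Station"] == station]

    # A wide figure stops the date and time plot looking too compressed.
    fig, ax = plt.subplots(figsize=(20, 5))
    sns.scatterplot(data=subset, x=x_column, y="Count", hue=shade, ax=ax)
    return ax
```

Returning the axes object leaves it to the caller to decide whether to call `plt.show()` or save the figure.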

We now want to create three variants of the function defined in steps d and e.

f. Create three further copies of your function (with different names but keeping the same figsize) and modify them as follows:

1. One version should restrict the data to the rows corresponding to those with the given station choice and only where the Mode is “Private cycles”.

2. One version should use an additional input variable called direction. This version should restrict the data to the rows corresponding to those with both the given station choice and direction choice.

3. The final version should also use an additional input variable called direction. This version should restrict the data to the rows corresponding to those with the given station choice and direction choice and only where the Mode is “Private cycles”.

Write some code which depending on the various input values from step b above calls the relevant function from the four just defined. If “Any” is entered as the direction by the user your code should call the relevant function without restriction on direction. You should now be able to plot the data either by time or by date and time for a given station in various ways.
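The tidying of user input and the choice between the four functions could be sketched as below. Everything named here is hypothetical: `normalise_choice`, `choose_variant` and the four function names it returns are placeholders, and title case is only an assumption about how directions such as "Northbound" are written in the data. Station names may need a different treatment depending on how they appear in your file.

```python
def normalise_choice(text, style="title"):
    """Strip stray spaces and force a consistent case."""
    text = text.strip()
    if style == "upper":
        return text.upper()
    return text.title()


def choose_variant(private_only, direction):
    """Return the (hypothetical) name of the plotting function to call."""
    restrict_mode = private_only == "Y"
    restrict_direction = direction != "Any"
    if restrict_mode and restrict_direction:
        return "plot_private_direction"
    if restrict_mode:
        return "plot_private"
    if restrict_direction:
        return "plot_direction"
    return "plot_all"
```

In a script the inputs would come from the prompts, for example `direction = normalise_choice(input("Enter desired direction: "))` and `private_only = normalise_choice(input("Do you want to restrict to private cycles only (Y/N)?: "), "upper")`.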

Step 3:

While visualizing our data in step 2 we saw that as well as a large number of private cycles, there are also a small number of cycle hires included in the data. We also saw that the plots were hard to read using date and time, even when we adjusted the format of our scattergraph. The aim of this section is to work out the average number of cycles for each time period, regardless of their type or the date. For this we will find the groupby method to be very useful.

We have not seen the groupby method in class, and you will need to find out more about how to use it from the web.

Your answer code for this question should be in a python file called Step3.py. For this question you should carry out the following steps:

a. Load the file called CycleData.xlsx as a dataframe using Pandas, using column 0 as the index column. Call this dataframe my_data.

b. As in Step 2, ask the user to enter a desired station and direction, and ensure that these are in a consistent format using a suitable method to adjust them.

c. Write a function which given the user’s choice of station and direction, plots the average number of cycles at that time for that choice of variables. To do this:

1. Create a new dataframe by restricting to the rows of the original dataframe corresponding to the given choice of station and direction.

2. Use the groupby method to create a new dataframe where you group by “Date” and “Time” and apply the groupby.sum method to the “Count” column. This adds the private and hired values together for each date and time.

3. Take this new dataframe and apply the groupby method again to create another dataframe where you only group by “Time” and apply the groupby.mean function to the “Count” column. This calculates the mean of the various values for each given timeslot.

4. Plot a scatterplot as in Step 2 of “Time” against “Count” for this final dataframe. You should have a much simpler plot with just one point per time period.
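The two groupby stages in parts 1 to 3 can be illustrated with the sketch below. The function name `average_by_time` is hypothetical, and `as_index=False` is a choice made here so that "Date", "Time" and "Count" remain ordinary columns after grouping; the plotting in part 4 is left out.

```python
import pandas as pd


def average_by_time(my_data, station, direction):
    """Mean total cycles per time slot for one station and direction."""
    mask = (my_data["Station"] == station) & (my_data["Direction"] == direction)
    subset = my_data[mask]

    # Stage 1: add private and hired counts together for each date and time.
    totals = subset.groupby(["Date", "Time"], as_index=False)["Count"].sum()

    # Stage 2: average those totals over all dates for each time slot.
    return totals.groupby("Time", as_index=False)["Count"].mean()
```

For instance, if one time slot had totals of 12 and 24 cycles on two different dates, the final dataframe would hold the single value 18 for that slot.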

Step 4:

In this final step you will use a decision tree to suggest possible peak charging periods for cycle users. For simplicity we shall restrict our attention to Private cycles only.

Your answer code for this question should be in a python file called Step4.py. For this question you should carry out the following steps:

a. Load the file called CycleData.xlsx as a dataframe using Pandas, using column 0 as the index column. Call this dataframe data.

b. Write a function which given a choice of station does the following:

1. Makes a new dataframe by using a mask to only copy the rows for the given station choice.

2. We now want to fit a decision tree to our data, comparing Time and Count. An example of this procedure was carried out in the handbook, after the example of a linear regression.

3. Unfortunately the time variable as provided is regarded as a string by Python. So we need to convert it into a datetime variable. Use the Pandas to_datetime method on the column of Time data where the format is given by “%H:%M:%S” to create a new column in your dataframe called RealTime. You will need to look this up on the web. Once done the RealTime column can be used as numerical data.

4. Form arrays x and y corresponding to RealTime and Count by reshaping them.

5. Split the data into training sets and test sets using the train_test_split function from sklearn.model_selection. To ensure that your training sets match those in my answer use random_state=1 inside this function.

6. Use a DecisionTreeRegressor to fit the training data. You will need to choose a max_depth value; the plots at the end of this question will enable you to tweak this to the value you think most appropriate, but try starting with depth 2 or 3.

7. Plot your decision tree (it should look like the example in Figure 26.1 of the handbook).

8. Add a column to the dataframe of the predicted values from your regressor for all values (not just those in the test set).

9. Use the Seaborn scatterplot to plot the Count column (on the y axis) against the RealTime column (on the x axis) as in the previous steps.

10. Add a scatterplot of this new column against the RealTime column to your previous scatterplot (you can do this just by calling the scatterplot command again).
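The data preparation and fitting in parts 1 to 8 could be sketched as follows, leaving the plots aside. Several things here are assumptions rather than part of the specification: the function name `fit_station_tree`, the column name "Predicted", the extra restriction to “Private cycles” in the mask (taken from the introduction to this step), and the explicit conversion of RealTime to whole numbers with `astype("int64")` so that the regressor receives plain numeric input.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor


def fit_station_tree(data, station, max_depth=3):
    """Fit a decision tree of Count against time of day for one station."""
    # Assumption: keep private cycles only, as the step introduction suggests.
    mask = (data["Station"] == station) & (data["Mode"] == "Private cycles")
    station_data = data[mask].copy()

    # Time arrives as text, so convert it to a datetime column first.
    station_data["RealTime"] = pd.to_datetime(station_data["Time"],
                                              format="%H:%M:%S")

    # Reshape both columns into arrays with a single column each.
    x = station_data["RealTime"].astype("int64").to_numpy().reshape(-1, 1)
    y = station_data["Count"].to_numpy().reshape(-1, 1)

    x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=1)
    model = DecisionTreeRegressor(max_depth=max_depth)
    model.fit(x_train, y_train)

    # Predict for every row, not just the rows held back for testing.
    station_data["Predicted"] = model.predict(x)
    return model, station_data
```

The returned model could then be passed to `sklearn.tree.plot_tree`, and the "Count" and "Predicted" columns plotted against "RealTime" with two calls to the Seaborn scatterplot.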

We want to use the predicted values to suggest two periods of time each day when TfL should charge for peak usage. Try different values for the max_depth, and pick the one that you think is the most appropriate, based on the reasonableness of the scatterplot.

Notice that the decision tree does not present the datetime version of the data in an easily readable form. If we had more time we would work out what time periods correspond to the choice you have made.
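One possible way to read a split value back as a clock time is sketched below, purely as an illustration. It assumes the tree was fitted on the RealTime column converted to whole numbers with `astype("int64")`, and the function name `split_times` is hypothetical. In scikit-learn the split values of a fitted tree are held in `model.tree_.threshold`, with leaf nodes marked by a negative entry in `model.tree_.feature`, so the thresholds passed in would be those belonging to non-leaf nodes. Rather than converting units by hand, the sketch looks up the earliest recorded time that lies above each threshold.

```python
import pandas as pd


def split_times(times, thresholds):
    """For each numeric threshold, find the earliest time string above it."""
    times = pd.Series(times)
    real_time = pd.to_datetime(times, format="%H:%M:%S")
    numeric = real_time.astype("int64")  # same numeric form as used for fitting

    result = []
    for threshold in sorted(thresholds):
        later = times[numeric > threshold]
        # Zero-padded HH:MM:SS strings sort in chronological order,
        # so the smallest string is the earliest time.
        result.append(later.min() if not later.empty else None)
    return result
```

Each time returned marks the start of the period on the upper side of a split, which gives a rough idea of where a suggested peak period begins or ends.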

Final submission

The first station in the list of three that you were given will be referred to as your primary station. For this station the primary direction will be taken to be either Northbound or Eastbound (depending on which directions occur at this station).

As well as your code, you will need to submit a single file summarizing your results. In this file you should include the following.

  • An example of a scatter plot from step 2 where you pick the primary station and direction, only include private cycles, colour code by the weather, and use date and time.

This plot is quite hard to read, so the remaining plots will focus on time data only. But for this first plot you should discuss whether there is any evidence that cycle use is seasonal or depends on the weather.

  • An example of a scatter plot from step 2 where you pick the primary station and direction, include all types of cycle, colour code by Mode and use time only.
  • An example of a scatter plot from step 2 where you pick the primary station, include all types of cycle and directions, colour code by Direction and use time only.
  • An example of a scatter plot from step 3 for the primary station and direction.

What does the second of these plots tell you about peak travel periods at the primary station in each direction? Do you get similar results for the other two stations? Can you give any possible reasons for these results?

Finally you should include

  • Your final scatter plot and decision tree plot from step 4 for the primary station and your preferred choice of max_depth.

Explain why you have chosen this value of max_depth, and where you would choose the two peak periods to be on the given plot from the tree data. Do you think that this method produces a sensible peak period for TfL to use across London? You may wish to try running step 4 for different stations as part of your answer to this question.

At the end of your project one member of the group should upload the following files:

  • All files containing your Python code, i.e. the files Step1.py, Step2.py, Step3.py and Step4.py.
  • The results file (a Word file) containing your various outputs, and your discussion of what you can conclude from them.

Do not submit your code files as images or embedded in a Word file – it must be possible to open them in a Python editor.

Marking

You should read through the project instructions and mark scheme to ensure that you complete everything that is required to maximise your possible mark.

Programming and Data Science for the Professions: Project instructions and mark scheme

Project instructions

The project is a group activity, and one copy of the submission should be submitted per group. The various groups have different starting data and it is important that you only use the data for your group. Different groups will get different answers, so do not worry if your results are not the same!

This project is intended to take a number of weeks, and you should not be surprised if certain parts prove to be quite challenging. The project comes in several steps, each with a number of parts. For the more complicated steps I have included detailed instructions and you should follow these carefully. For some parts you may need to look up python functions online. It is possible to get a very good grade even if not all parts of the project are successfully completed. If you are not able to complete your code you should still submit partial solutions.

With electronic submissions of code, there can be the temptation to copy another group’s work. This is very easy to detect, and both groups will be investigated for Academic Misconduct in such cases. You should not use ChatGPT or other AI tools to construct your code (not least because they often get things wrong).

Details of the tasks that you need to complete can be found in the Group Project Specification, together with details of what you need to submit at the end. The project should be written in Python. If you are asked to use a certain method in the specification then this is required; using a different method will result in a loss of marks.

After the project is marked, the individual student grades will be adjusted using a process of peer review. Details of this will be provided separately. You should keep a record of your individual contribution to the project. The groups are designed to contain a range of abilities; peer review is intended to measure the effort that a student puts into their contribution rather than their individual aptitude for programming.

Where a student fails to participate in an assessment activity, their final component mark may be replaced by a zero, after a review by the module leader. Students who do not participate sufficiently in their group work will also struggle to produce an adequate individual video presentation.

Project groups work best when the members meet up regularly and discuss their work and help each other. You are expected to meet with your group regularly, and the weekly lab sessions are provided for this.

Groups who allocate different parts to different members and then try to work on these completely independently often find this causes problems – you are recommended not to do this.

There will be a separate individual submission of a brief recorded presentation on your results, after the group component has been submitted. Details of this will be provided later in the term.

Mark scheme

You will receive marks for each part of your code, so that even if you are not able to correctly complete all of the parts, you will still gain credit for those parts which are correct. You will also get partial marks for routines which are partially successful.

The detailed mark scheme is given below.

For each step, the mark given will depend on the extent to which the code has the correct logical structure and the correct syntax. Code with a small number of minor errors in a given step will still get the majority of the marks for that step even if it does not run or gives the wrong answer.

There should be enough comments that someone else reading your code would be able to understand the main features. You should use comments to describe what each part of your code does, and what the main variables represent.

If the dummy data file is used then, as well as losing marks for step 1, there will be a maximum of 15 marks available for step 5 (the results file described under Final submission).

Steps 1-4

Each of the first four steps is worth 20 marks. These will be awarded according to the following criteria:

Marks       Criteria
0 marks     No (or virtually no) code submitted.
5 marks     Some attempt made to complete the step, but with significant errors of logic or syntax, and/or significant missing parts.
10 marks    A reasonable attempt to complete the step but with some errors of logic or syntax or missing parts.
15 marks    A good attempt to complete the step with only some very minor omissions or errors, or with no omissions/errors but insufficient comments.
20 marks    An excellent attempt to complete the step with no errors or omissions and good use of comments.

Step 5

This step is worth 20 marks.

Marks       Criteria
0 marks     No results submitted.
5 marks     1 or 2 correct plots only, with or without related discussion.
10 marks    3-4 correct plots and some attempt to discuss the results.
15 marks    All plots correct, and some attempt to discuss the results, or 3-4 plots correct and a good discussion of the results.
20 marks    All plots correct and a good discussion of the results.

Once the final group mark has been determined, peer review will be used to adjust this mark to reflect the contribution of each student. Details of how this will work will be provided separately.
