It seems that the Piccadilly line is only just getting back to normal after a problem with flat spots on the wheels of 50% of the trains: [TfL Link].
Looking at the data for the number of trains running, this seems to stem from around the 24th November (Thursday) when the numbers started to drop off. Analysing this data is problematic because of the noise inherent in the data collection process and the need to take weekends into account. There is also the launch of the Night Tube on the Piccadilly line which happened on Friday 16th December.
Plotting the total number of tubes running over a 24 hour period as a moving average makes things a bit simpler:
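As a rough illustration of the smoothing step, here is a minimal sketch of a trailing moving average. The function name, the sample counts and the window size are all made up for the example; with hourly samples, a window of 24 would give the 24 hour average described above.

```python
def moving_average(counts, window):
    """Trailing moving average over `window` consecutive samples.

    Only full windows are reported, so the result has
    len(counts) - window + 1 entries.
    """
    if window <= 0:
        raise ValueError("window must be positive")
    averages = []
    running_total = 0.0
    for i, value in enumerate(counts):
        running_total += value
        if i >= window:
            # Drop the sample that has just slid out of the window.
            running_total -= counts[i - window]
        if i >= window - 1:
            averages.append(running_total / window)
    return averages


# Hypothetical counts of trains running, one sample per hour.
hourly_counts = [10, 12, 14, 12, 10, 8]
smoothed = moving_average(hourly_counts, window=3)
```

A trailing window lags the raw data slightly, which is worth remembering when reading dates off the smoothed chart.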
The data break is immediately evident on the 8th and 9th December, but the numbers can be seen to be dropping from the 23rd to the 28th of November. Then the 5th, 6th and 7th of December (Mon to Wed), just before the data break, are particularly bad. It’s interesting to note that there were more tubes running on the 10th and 11th of December (Sat/Sun) than there were running on the 5th and 6th (Mon/Tue), which seems to be the worst period for the Piccadilly Line.
While this is quite an interesting exercise, the real value of this type of analysis is in the effect it has on the commuter. Gaps of 15 minutes between tubes were reported, and sections of the line had no service at all. What I need to develop now is a way of generating these spatial analytics automatically from the data as we collect it.
Somebody managed to do a SQL injection attack on MapTube recently, so it hasn’t been working properly for a while. Now that the vulnerability has been identified and fixed though, it’s back to normal again.
Looking through the logs, they’ve spent the best part of a month trying to do this, so I wish I had seen it earlier. It’s also been flagged by the main firewall as malicious.
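For anyone wondering what the usual defence against this kind of attack looks like, here is a minimal sketch of a parameterised query. This is not the actual MapTube fix or schema: the `maps` table, the `find_maps_by_title` function and the use of SQLite are all assumptions made purely for illustration.

```python
import sqlite3

# Throwaway in-memory database with a hypothetical table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE maps (id INTEGER PRIMARY KEY, title TEXT)")
conn.executemany(
    "INSERT INTO maps (title) VALUES (?)",
    [("Tube lines",), ("Bus stops",)],
)


def find_maps_by_title(connection, title):
    # The ? placeholder means the driver passes `title` as data,
    # so it is never interpreted as part of the SQL statement.
    cursor = connection.execute(
        "SELECT id, title FROM maps WHERE title = ?", (title,)
    )
    return cursor.fetchall()


# A classic injection string simply matches nothing.
hostile_input = "x' OR '1'='1"
rows_for_attack = find_maps_by_title(conn, hostile_input)
rows_for_real = find_maps_by_title(conn, "Bus stops")
```

Had the query been built by string concatenation instead, the hostile input would have turned the WHERE clause into a condition that is always true.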
I’ve had this idea for a while, but it occurred to me that we should be doing some spatial analysis on where all these attacks are coming from. They use groups of IP addresses which they change every day, but we have years’ worth of data now for a number of different web servers which could be analysed. The same applies to all the spam email that we’re filtering out. Just looking at the web server logs for this morning from midnight to 9am, there were 15 potential attacks and there were also 39 the day before, so there’s a lot of potential data there if we started putting it all together. It’s all just information theory.
Just to follow up on the last post about the launch of the “Night Tube”, the service launched on the Jubilee Line last Friday. There are now close to 40 tubes running overnight at the weekend on the Central, Victoria and Jubilee lines. The chart above shows the number of tubes running on Thursday 6th October through to 23:57 on Sunday night. The morning and evening peak rush hours are evident for Thursday and Friday, then the first Jubilee Line night services can be seen in the trough between Friday and Saturday.
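A first step towards putting it all together might look something like the sketch below, which counts flagged requests per day and per source network. The records are invented (the addresses come from the ranges reserved for documentation), the /24 grouping is just an assumption about how the attackers' address blocks might be bucketed, and turning a network into a location would need a separate geolocation lookup which isn't shown.

```python
from collections import Counter
import ipaddress

# Hypothetical, already-parsed log records: (ISO date, source IP).
flagged_requests = [
    ("2016-12-20", "203.0.113.7"),
    ("2016-12-20", "203.0.113.99"),
    ("2016-12-21", "198.51.100.14"),
]


def network_of(ip, prefix_length=24):
    """Return the enclosing network as a string, e.g. '203.0.113.0/24'."""
    network = ipaddress.ip_network(f"{ip}/{prefix_length}", strict=False)
    return str(network)


# How many flagged requests arrived on each day.
attacks_per_day = Counter(day for day, _ in flagged_requests)

# How many came from each block of addresses, across all days.
attacks_per_network = Counter(network_of(ip) for _, ip in flagged_requests)
```

Grouping by network rather than by single address is one way of coping with the daily change of IP addresses, assuming the new addresses tend to come from the same blocks.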
The interesting thing to do now would be to run a public transport accessibility analysis using the real-time running data to see which parts of the city are now more connected. As today is the second day of the Southern Rail strike, that might make another interesting subject. Using the Census travel to work data we could forecast the areas where people are going to be late for work because of transport failures. That could potentially give a measure of what effect any strikes, or even just “congestion” generally, is having on London.
On a technical level, one thing which is now becoming apparent is that the number of drop-outs from the API has increased. It used to be that there would be a few per day on the Northern Line (biggest data), but now they seem to be occurring randomly across all the lines. The CASA API has been collecting data since the London Olympics in 2012, so it’s long overdue for some maintenance.
The first night tubes started running last Friday evening, so I couldn’t help wondering what that does to the number of tubes running. The graph above shows the total number of tubes running from Thursday 18th to Sunday 22nd August. My first reaction was, “what night service?”, but then I read the TfL statement and realised that this is only the Central and Victoria lines, with the rest to follow in the Autumn. The arrows on the diagram above show where the extra services show up in the statistics.
The following graph of only the Central and Victoria lines shows it a lot better:
The total of around 20 tubes running through the night, with a capacity of around 800 passengers each (TfL Rolling Stock), is significant extra capacity. It’s just that the peak rush hour service is so much bigger by comparison. I also wonder whether they were testing the Central line on the Thursday evening, because of the large number showing up there overnight. Normally we get a small residual number of tubes moving during the shutdown period which I’ve always put down to engineering services.
What is going to be interesting is to see how the service adapts to usage over time.
The graphic says it all really. The width of the stream graph shows the total number of tubes running, with a breakdown by each tube line displayed in the regular line colour (red=Central Line, green=District Line etc).
Basically, there’s nothing running, apart from a “special service” on the Waterloo and City line. I’ve never seen it like this before, as, in the previous strikes, they’ve always managed to run about a 30% service.
Despite being told that everything was going to shut down completely by 6pm last night, it appears that the shutdown began around 6pm and wasn’t complete until just after 9pm, although I wouldn’t like to have been trying to get home during that time. From the pictures on the news last night it looked like complete chaos, which just goes to reinforce the fact that we need to establish a method for measuring how many people the tube network is carrying (i.e. the “crush factor”).
As part of my PhD I’ve been looking at a lot of real-time data about tubes, buses and trains. In fact, I probably started from the point where I already had a lot of data and was wondering what to do with it. While I would not class this as “Big Data”, the complex nature and real-time element make it difficult to analyse and visualise.
The image above shows the bus network displayed using my virtual Earth viewer. Having previously done a lot of work on the tube network, it only took about half a day’s work to get the buses into the system. One reason for this is that I’ve implemented an agent-based modelling (ABM) system similar to NetLogo, so I just have to write the code to load agents and links from CSV files (easy!). The simulation is a bit harder to do, but not much.
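To give a flavour of the "easy" part, here is a minimal sketch of loading agents and links from CSV. It is not the code from my viewer: the `Agent` and `Link` classes, the column names (`id`, `lat`, `lon`, `from`, `to`) and the sample rows are all assumptions made for the example.

```python
import csv
import io
from dataclasses import dataclass


@dataclass
class Agent:
    agent_id: str
    lat: float
    lon: float


@dataclass
class Link:
    from_id: str
    to_id: str


def load_agents(csv_file):
    """Read stop agents into a dict keyed by id (assumed columns: id, lat, lon)."""
    return {
        row["id"]: Agent(row["id"], float(row["lat"]), float(row["lon"]))
        for row in csv.DictReader(csv_file)
    }


def load_links(csv_file):
    """Read directed links between agents (assumed columns: from, to)."""
    return [Link(row["from"], row["to"]) for row in csv.DictReader(csv_file)]


# In-memory stand-ins for the real files, with invented rows.
stops_csv = io.StringIO("id,lat,lon\nA,51.50,-0.12\nB,51.51,-0.10\n")
links_csv = io.StringIO("from,to\nA,B\n")

agents = load_agents(stops_csv)
links = load_links(links_csv)
```

Keying the agents by id makes it cheap to resolve the two ends of every link once both files are loaded.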
Although I knew the bus data was about 10 times bigger than the tube data, what I hadn’t bargained for was the fact that there are 21,987 bus stops (agent nodes), 53,896 route points (links) and up to 7,000 live buses (moving agents). The other weird thing is that TfL seem to be missing 409 bus stops from their master list as there are stops contained in the routes that I don’t have positions for. There are also a lot of invalid lines in the data that look as if there has been an error extracting the data from a database. I had a really interesting discussion with last Thursday’s visitor about that fact because he couldn’t believe it. I think I’m right in saying that there is a theory about complexity that goes along the lines of “any sufficiently complex data analysed deeply enough will always show inconsistencies”? In other words, we just have to deal with it.
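The missing stops check boils down to a set difference between the stops the routes refer to and the stops in the master list. Here is a minimal sketch with invented identifiers; the data structures are assumptions, not the format of the TfL files.

```python
def stops_missing_positions(routes, stop_positions):
    """Return the stop ids used by routes that have no known position.

    routes: dict of route id -> ordered list of stop ids.
    stop_positions: dict of stop id -> (lat, lon) from the master list.
    """
    referenced = {stop_id for stops in routes.values() for stop_id in stops}
    return referenced - set(stop_positions)


# Invented example: S3 appears on a route but not in the master list.
routes = {"R1": ["S1", "S2"], "R2": ["S2", "S3"]}
stop_positions = {"S1": (51.50, -0.12), "S2": (51.51, -0.10)}

missing = stops_missing_positions(routes, stop_positions)
```

Running this sort of check on every load is a cheap way of dealing with the inconsistencies rather than being surprised by them later.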
Putting in some buildings gives a much better appreciation of just how big it is:
If you look closely, you can just see the bus stops in the river which are pontoons for the boats. The coloured cubes representing the stops are all 100m on each axis. It now all gets worse, because that graph containing 53,896 route points has to be fragmented using the road network and a routing algorithm to make the buses travel along the roads or rivers. I’ll have to implement this just as soon as I get the data displaying at a reasonable frame rate.
To really put things into the correct scale, and thinking of the Highways Agency’s UK-wide road network, which is on my list:
I just like the Winter Blue Marble image, which you don’t see very often. The Google Earth images are all the Blue Marble composite.
So, getting back to the PhD topic, which is about the algorithms which make all of this work, I obviously need to improve the graphics a lot, but most of the building blocks are now in place. I’m a graphics programmer, so the graphics engine is obviously hacked to pieces and I need to tidy it up. All the numbers in the top left of the images are the frame rates, which should be a lot higher than about 4-6 frames per second. If you take the geometry representing the bus routes (links), it’s a mesh with over 6 million points, and it’s taxing my graphics card a bit. Top of the range GPUs these days will do over a teraflop, which used to be supercomputing territory not long ago, but use them in 64-bit mode and the performance drops drastically. I still have some shader tricks to use which will improve the ABM performance a huge amount.
Finally, I have to answer the question, “what’s the point of it all?”. I wanted to analyse real-time dynamic data using a system that allowed me to explore the data visually in both time and space. Why is there a bus route from NW London going diagonally across to the SE in the first image? You can just see the white line going through the buildings, but it looks like an error in the data. Programming the model to simulate the buses allows you to explore the real-time element, but the aim is to have more in the way of analysis and data-mining than the simple widgets you get with NetLogo.
Now I have two networks, my first question is to look at how they compare to each other. I have the whole of 2014 to use for the analysis and a tool which (might) now let me do it.
I’ve always wondered whether the peaks in the bus, tube and train numbers occur at the same times and whether there is any spatial variation.
Just as an update, here’s a movie I uploaded which shows the bus network much better than any words can:
I had a pleasant surprise on my commute home last night when I found myself on one of the new ten car trains that South West Trains have bought. They’ve coupled two brand new cars onto the front of their existing stock, so we could all see it was a new train as it approached the platform. Being able to get a seat was also a new experience.
Now, in transport modelling terms, that means that they have potentially increased their people carrying capacity by 25%. If they were running 8 car trains before and can now run 10 car trains, that’s a significant increase.
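The 25% figure is simple arithmetic, sketched below under the assumption that every car carries the same number of people, so capacity scales with the number of cars. The function name is made up for the example.

```python
def capacity_increase_percent(cars_before, cars_after):
    """Percentage increase in capacity, assuming equal capacity per car."""
    return (cars_after - cars_before) / cars_before * 100.0


# Going from 8 car to 10 car trains adds 2 cars on a base of 8.
increase = capacity_increase_percent(8, 10)
```

The "potentially" matters, because the increase only applies to the services that actually run as 10 cars.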
What I don’t know is how many of these trains they’ve got and when and where they’re running them. I’ve looked at the Network Rail data feed, but it doesn’t give you the size of the train. I need to look into the data a bit more deeply as there might be a physical train identifier that I’ve missed. They all have “leading car id”, but I can’t find this in the data feed. Even a map of all the stations that have been converted to take 10 cars would be interesting.
To mark the event I’ve added a new feature to the homepage which should make it more dynamic. Now, if I blog about any maps, they will automatically appear on the MapTube front page with the text, images and map links extracted directly from the RSS feed. Along with the ‘topicality index’, which places maps for data which is currently in the media on the front page, this should keep the website up to date with the latest events. It’s also telling me what information we don’t currently have so we can gradually fill the gaps in our knowledge.
I’m hoping to follow this up in the next month with some real-time data feeds and more interactivity on the maps.
Just for completeness, I’ve updated the two graphs of the numbers of buses running on 13th January with the complete set of data up to 23:59 that night.
The first graph shows the total number of buses running on Tuesday 13th (red) against the previous day (blue). The second graph shows the ratio red/blue × 100%, in other words what percentage of a normal day’s buses were running. It levels off at around 24% quite definitely and never reaches the 33% which is the official TfL figure for the percentage of services running. The mean value for 7am to midnight works out as 23.7%, so either TfL have a different way of calculating the figure, or our data is wrong. This is something I’ve been wondering about for a while now, as we assume the data from the TfL Countdown API is accurate, but have no independent cross check. Coding errors can also lead to data issues and we know that during the last tube strike lots of extra buses ran which didn’t have the “iBus” tracker on them and so didn’t show up on our data. Having said that, there is nothing to suggest a problem with the data.
One other thing I was wondering about was what effect the strike would have on tube overcrowding. Having seen a news report from Vauxhall bus garage the previous day, I realised the huge number of people this was going to affect. If you’re a commuter changing trains at Vauxhall, then your logic goes something like this:
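For reference, the calculation behind the second graph can be sketched as follows. The counts here are invented and will not reproduce the 23.7% figure; the function names and the choice to skip samples where the normal day has no buses are my assumptions for the example.

```python
def percent_of_normal(strike_counts, normal_counts):
    """Strike/normal ratio as a percentage for each time sample.

    Samples where the normal day shows no buses give None,
    because the ratio is undefined there.
    """
    return [
        (strike / normal * 100.0) if normal else None
        for strike, normal in zip(strike_counts, normal_counts)
    ]


def mean_percent(percentages):
    """Mean of the defined percentages, ignoring None entries."""
    valid = [p for p in percentages if p is not None]
    return sum(valid) / len(valid)


# Invented counts at matching times on the two days.
normal_day = [4000, 5000, 0, 6000]
strike_day = [1000, 1250, 0, 1500]

ratios = percent_of_normal(strike_day, normal_day)
average = mean_percent(ratios)
```

Note that the mean of the per-sample ratios is not the same thing as the ratio of the daily totals, which is one way two people could get different percentages from the same data.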
1. “There is a bus strike, so everybody who normally catches a bus from there is going to try to get on the tube. The tube will be packed.”
2. “There is a bus strike, so the delivery of people by bus to the tube station will be much lower than normal. The tube will be empty.”
It’s all a question of numbers, but, at the moment, it’s not something which we have the data to even attempt to answer. By collecting data about unusual events like this, though, we might gain some insight into what happens on a normal day.
Another day, another major public transport failure. I didn’t think the bus strike was having much of an impact until I got into the office and had a look at the statistics.
The graph above shows the number of buses running on the two days using the same horizontal time axis, so 0915 is a quarter past nine on both days. The red line shows how few buses are running by comparison. From the data, I can calculate this as about 24% of normal.
By plotting the ratio of the number of buses running on Tuesday (strike) divided by the number on Monday (no strike), the fall off in numbers from around 4am this morning is visible. From around 7am until 12pm, this levels off at about 24%.
The numbers don’t tell the whole story, though:
It looks as though there are more buses in London than in the suburbs, but it’s not showing the huge gaps we saw during the May 2012 strike which were caused by only selected unions striking.
Both these maps are online on MapTube at the following link: