FTTC installed…and then the problems started.

Once again, Trefor.net welcomes contributor Tim Bray, Technical Director for ProVu Communications. “FTTC — Upgrade Your Router” is Tim’s second “Broadband Week” post.

At ProVu we don’t often do onsite installations, preferring instead to leave them to our resellers. Sometimes, though, a problem comes along that requires us to get involved and help figure out what is going on.

One of our customers’ sites was activated for FTTC broadband. This customer ran an office with a small call centre and about 10 office PCs, and they thought the higher bandwidth would be useful. Zen (the ISP, in this case) had a special offer on ADSL to FTTC upgrades, so the time seemed right for an upgrade. Our customer swapped their onsite router out for a model that could do both ADSL and FTTC, and all appeared ready for an easy changeover once the Openreach engineer arrived.


On the scheduled day the Openreach man showed up, and our customer had just 10 minutes of downtime while he performed the jumpering in the cabinet. Up came the new 40 Mbps download line (which also had, more importantly, a massive upload speed). Magic. Everything worked, and the internet seemed to be lightning fast. And then the problems started. “The internet is slow!” “We’ve got bad call quality!” And so a site that had been working perfectly had stopped doing so because of a service upgrade.

We added lots of monitoring, using Smokeping and Nagios. Sure enough, we found intermittent, heavy packet loss on the line that came and went, usually at quiet times such as evenings and weekends. We could tell that something on the network was opening a large number of sessions through the NAT in the router, and we knew that the problems started as we got towards 600 TCP sessions. We wondered whether, with FTTC, opening a browser window with all your saved tabs would make the computer hit those tabbed sites (Facebook, Twitter, Gmail, BBC News and all their associated ad networks and image CDNs) all at the same time, opening too many ports too quickly.

Running just a small consumer-type router, we couldn’t diagnose the issue to the point of determining its cause. We needed better instrumentation to investigate further, so we decided to install a proper Linux server as a router in place of the dedicated hardware. BT Openreach provides PPPoE termination, so it is easy to deploy standard PC hardware with two Ethernet cards as a router. We used Munin to add every kind of monitoring. We had graphs of UDP sessions, TCP sessions, and traffic graphs for voice traffic against other traffic…you name it, we graphed it.

Everything we could think of that might help us figure out what was causing the issues was in place. And it was at that moment that the problems went away. Again, magic. Once the new router was installed, everything worked. We saw large throughput and many sessions through the router, but no corresponding packet loss. And no user complaints.

Very puzzling.

Then one Saturday I noticed the traffic graph on the router rise up to 30 Mbps download speed and stay there. Not the first time this had happened, of course, but it was the first time I was there to watch. My suspicions were raised, so I phoned the call centre. “No, all our calls are fine.” The new router was coping with this traffic fine. So I ran Wireshark and discovered that the call centre staff were watching telly using Sky Player on a sneaked-in laptop. And from watching the trace, I could see that Sky Player was streaming the video by opening a new TCP session every few seconds, which, coupled with the large number of phone calls, must have been what was overwhelming the old router.
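For anyone wanting to spot the same pattern in a trace, here is a minimal sketch of counting new TCP sessions per time window. It is not what I ran on the day (that was just Wireshark and eyeballs). It assumes the trace has already been exported to simple `(timestamp, tcp_flags)` pairs, and the function name and the 10-second window are made up for illustration.

```python
from collections import Counter

SYN = 0x02  # TCP SYN flag bit
ACK = 0x10  # TCP ACK flag bit


def new_sessions_per_window(packets, window_seconds=10):
    """Count TCP connection attempts (SYN set, ACK clear) per time window.

    `packets` is an iterable of (timestamp_seconds, tcp_flags) tuples,
    a hypothetical simplified export of a packet trace.
    Returns {window_index: number_of_new_sessions}.
    """
    counts = Counter()
    for timestamp, flags in packets:
        # A SYN without ACK is the first packet of a new TCP session.
        if flags & SYN and not flags & ACK:
            counts[int(timestamp // window_seconds)] += 1
    return dict(counts)


# Illustrative data only: three connection attempts, one SYN-ACK, one plain ACK.
sample = [(0.5, 0x02), (1.0, 0x12), (3.0, 0x10), (4.2, 0x02), (12.0, 0x02)]
print(new_sessions_per_window(sample))
```

A steady count of new sessions in every window from a single host, alongside a flat-out download graph, is the kind of signature the Sky Player traffic left.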

I phoned the call centre manager with my findings, and she sussed that they were watching the footie. And regarding a remedy, let’s just say some HR Department action occurred!

At this point, let me sum up the learning points:

  1. A bigger router might be needed for FTTC, as the router, not the ISP, could be the slowest bit.
  2. The router might have a limit for packets per second.
  3. Even a small office can open a lot of ports through a NAT, something with which small routers cannot cope.
  4. With a good enough router, it is possible to run a small call centre and stream TV at the same time.
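To put some rough numbers on point 2, here is a back-of-envelope sketch. The 40 Mbps figure is the download speed of the line described above. The packet sizes, the 10 simultaneous calls and the 20 ms voice packet interval are illustrative assumptions, not measurements from this site.

```python
def packets_per_second(bits_per_second, packet_bytes):
    """Packets per second needed to fill a link with packets of one size."""
    return bits_per_second / (packet_bytes * 8)


def voip_packets_per_second(calls, packet_interval_ms=20):
    """Packets per second for a number of calls, counting both directions.

    Assumes one RTP stream each way per call and a fixed packet interval
    (20 ms is a common default, assumed here rather than measured).
    """
    per_direction = 1000 / packet_interval_ms
    return calls * 2 * per_direction


LINE_BPS = 40_000_000  # the 40 Mbps download speed of the new line

# Large packets (bulk downloads) versus small packets (worst case for the router).
print(round(packets_per_second(LINE_BPS, 1500)))  # full-size Ethernet payloads
print(round(packets_per_second(LINE_BPS, 64)))    # tiny packets
print(round(voip_packets_per_second(10)))         # 10 hypothetical calls
```

The same bandwidth can mean a few thousand packets per second or tens of thousands, depending on packet size, and voice traffic is all small packets. A router rated only in Mbps tells you nothing about how it copes with the latter.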

As an aside, I think this is a great example of where IPv6 would help. IPv4 with NAT is stateful on the router: the router has to record each session and rewrite the packets. IPv6, though, would be stateless, so the router would only need to pass on the packets rather than track sessions and rewrite port numbers. Also, there is the old adage: use a separate connection for voice and data. I suspect that some of the poor voice quality that encourages this is actually the voice and data traffic combining to overwhelm the router, rather than there simply not being enough bandwidth. Bufferbloat may be part of the problem as well. But I suspect that a router with more grunt may mean the second line isn’t required.
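The stateful part is easy to picture with a toy model. The sketch below is not how any particular router implements NAT. It is a simplified port-translating table with a fixed capacity, and the class name, the starting port and the limit of 600 (borrowed from the session count where our problems began, not a known limit of the old router) are all assumptions for illustration.

```python
class NatTable:
    """Toy model of a port-translating NAT with a fixed-size state table."""

    def __init__(self, max_entries):
        self.max_entries = max_entries
        self.entries = {}        # (inside_ip, inside_port, proto) -> outside port
        self.next_port = 40000   # arbitrary starting point for translated ports
        self.dropped = 0         # sessions refused because the table was full

    def translate(self, inside_ip, inside_port, proto):
        """Return the outside port for a session, or None if the table is full."""
        key = (inside_ip, inside_port, proto)
        if key in self.entries:
            return self.entries[key]       # existing session: reuse its mapping
        if len(self.entries) >= self.max_entries:
            self.dropped += 1              # no room left: the new session fails
            return None
        port = self.next_port
        self.next_port += 1
        self.entries[key] = port           # new session: state must be stored
        return port


# 700 new sessions from one hypothetical PC against a 600-entry table.
table = NatTable(max_entries=600)
for i in range(700):
    table.translate("192.168.1.10", 10000 + i, "tcp")
print(len(table.entries), table.dropped)
```

Every new session costs the router a table entry and a lookup on every packet. A plain routed IPv6 path has no such table to fill, although a stateful firewall in front of it would bring some of that tracking back.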

I’ve done various consultancy jobs to investigate ‘SIP phones dropped off network’ reports, and by scripting a monitor for the NAT state table I have found the router/firewall simply dropping the session from the NAT table, which is either a bug or a lack of capacity in the device.
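As a rough idea of what such a script can look like, here is a minimal sketch. It assumes a Linux-based router exposing its state table in the netfilter conntrack text format (as in `/proc/net/nf_conntrack`), and phones registering over UDP port 5060. The sample lines, addresses and function names are made up for illustration, and other routers will present their tables differently.

```python
from collections import Counter

# Two made-up lines in the style of /proc/net/nf_conntrack (documentation addresses).
SAMPLE = [
    "ipv4     2 tcp      6 431999 ESTABLISHED src=192.168.1.10 dst=203.0.113.80 "
    "sport=51234 dport=443 src=203.0.113.80 dst=198.51.100.5 sport=443 dport=51234 "
    "[ASSURED] mark=0 use=2",
    "ipv4     2 udp      17 29 src=192.168.1.20 dst=203.0.113.9 "
    "sport=5060 dport=5060 src=203.0.113.9 dst=198.51.100.5 sport=5060 dport=5060 "
    "[ASSURED] mark=0 use=2",
]


def parse_conntrack_line(line):
    """Return (protocol, fields) for one conntrack-style line.

    `fields` keeps the first occurrence of each key=value token, which is
    the original (inside to outside) direction of the session.
    """
    tokens = line.split()
    proto = next((t for t in tokens if t in ("tcp", "udp", "icmp")), None)
    fields = {}
    for token in tokens:
        if "=" in token:
            key, value = token.split("=", 1)
            fields.setdefault(key, value)
    return proto, fields


def count_by_protocol(lines):
    """Count state-table entries per protocol."""
    return Counter(parse_conntrack_line(line)[0] for line in lines)


def sip_session_present(lines, phone_ip, sip_port="5060"):
    """True if the table still holds a UDP entry from the phone to the SIP port."""
    for line in lines:
        proto, fields = parse_conntrack_line(line)
        if (proto == "udp" and fields.get("src") == phone_ip
                and fields.get("dport") == sip_port):
            return True
    return False


print(count_by_protocol(SAMPLE))
print(sip_session_present(SAMPLE, "192.168.1.20"))
```

Run something like this every few seconds and log the results. If a phone’s entry vanishes while the phone still believes it is registered, the router has dropped the session.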
