S Anand

Tube announcements

I was travelling on the Jubilee line, just pulling into Stratford (the last stop), when I heard this announcement.

“The next station is Stratford, where this train terminates. Thank you for travelling on the Jubilee line, and I hope you have a very pleasant evening.”

(pause)

“Unless, of course, you were the person who pulled the passenger alarm at Westminster, in which case I don’t care what kind of an evening you have.”

In-cell Excel charts

Juice Analytics has some Excel graphing tips. You can make charts like the ones below without using Excel’s chart feature, using just text in cells.

These are useful because the charts are aligned with the data.

Excel Gantt charts using just text

Excel bar charts using just text
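The post does not spell out the formula, but text charts like these are presumably built by repeating a character in proportion to the value (Excel's REPT function can do this). A rough Python sketch of the same idea follows; the function name `text_bar` and the figures are made up purely for illustration.

```python
def text_bar(value, max_value, width=20, char="|"):
    """Return a run of `char` whose length is proportional to value / max_value."""
    if max_value <= 0:
        return ""
    return char * round(width * value / max_value)


# Made-up figures, purely for illustration.
sales = {"North": 120, "South": 72, "East": 30}
largest = max(sales.values())

for region, amount in sales.items():
    # Label, value and bar sit on one line, so the "chart" stays aligned with the data.
    print(f"{region:<6}{amount:>4} {text_bar(amount, largest)}")
```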

I once used a similar technique to display people’s staffing position. The sheet below lists people, the projects they’re on and how long they’ll be on them. The coloured cells to the right are a calendar view of the same information, which makes it easy to read.

Excel Staffing Plan without using charts

The trick is to place each week for each person as a thin cell, like below. Then the cell is populated with a formula that makes it 0 or 1 depending on whether the person is available that week or not. (The blue row #2 stores the start date of the week, and I compare this with the end date of each person’s project.)

Excel Staffing Plan - formula
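The 0-or-1 logic can be sketched outside Excel too. This is a minimal Python version under one assumption: a cell is 1 when the week starts on or before the person's project end date, and 0 afterwards. The names, dates and helper functions are hypothetical, not taken from the sheet.

```python
from datetime import date, timedelta


def week_starts(first_week, weeks):
    """Start dates of consecutive weeks (the equivalent of the blue row #2)."""
    return [first_week + timedelta(weeks=i) for i in range(weeks)]


def staffing_row(project_end, starts):
    """One thin cell per week: 1 if the week starts on or before the project end, else 0."""
    return [1 if start <= project_end else 0 for start in starts]


# Hypothetical people and project end dates.
starts = week_starts(date(2024, 1, 1), 6)
project_ends = {"Asha": date(2024, 1, 19), "Ravi": date(2024, 2, 2)}

for name, end in project_ends.items():
    # "#" plays the role of a coloured cell, "." of an empty one.
    cells = "".join("#" if busy else "." for busy in staffing_row(end, starts))
    print(f"{name:<6}{cells}")
```

In the spreadsheet, the colouring is done by formatting the cells rather than by printing characters, but the comparison behind each cell is the same.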

And then, you can turn on conditional formatting.

Excel Staffing Plan - conditional formatting

Tamil songs by Ilayaraja in 1980s

More songs by Ilayaraja, composed in the 1980s. Can you guess which movie they are from? (Films are NOT repeated)

Don’t worry about the spelling. Just spell it like it sounds, and the box will turn green.

Some people will never program

All teachers of programming find that their results display a ‘double hump’. It is as if there are two populations: those who can [program], and those who cannot [program], each with its own independent bell curve.

Hotcaptchas

Prove that you are human by picking 3 hot people. Interestingly, while I was able to pick women with 100% accuracy, I just couldn’t pick out hot men. I wonder if the women really can…

Normalising non-normal distributions is bad

I was working with the treasury of a bank. They were trying to estimate how much money could flow out of their savings account in a day, worst case.

I took their total savings account balance at the end of each day and found the standard deviation. I took thrice the standard deviation, and said, “You can be 99.7% sure that your daily loss won’t be more than 1.5% of the balance.”

That would be right if it were a normal distribution. But it’s not.

Banks have millions of savings accounts, each of which is like a random variable. But unless those variables are independent and have finite standard deviations, the central limit theorem doesn’t apply.

Firstly, savings account transactions are not independent. If there’s a run on the bank, everyone pulls out their money. Whenever a company declares a dividend, a large number of savings accounts are credited. Salary accounts are credited at the end of the month. As a rule of thumb, you could say that if one savings account goes up, the others are likely to as well.
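To get a feel for how much dependence matters, here is a small sketch using the standard formula for the variance of a sum of equally correlated variables, Var(sum) = n·σ²·(1 + (n − 1)·ρ). The account count, the unit standard deviation and the 0.01 correlation are illustrative assumptions, not figures from the bank.

```python
import math


def std_of_total(n_accounts, sigma, rho):
    """Std dev of the sum of n variables, each with std dev sigma and pairwise correlation rho."""
    return sigma * math.sqrt(n_accounts * (1 + (n_accounts - 1) * rho))


# Illustrative assumptions: a million accounts, unit std dev each.
independent = std_of_total(1_000_000, 1.0, 0.0)
correlated = std_of_total(1_000_000, 1.0, 0.01)

print(f"independent accounts: {independent:,.0f}")
print(f"correlation of 0.01:  {correlated:,.0f}")
print(f"ratio: {correlated / independent:.0f}x")
```

Even a pairwise correlation as small as 0.01 makes the total swing about a hundred times more than it would if the accounts were independent. This only addresses the size of the swings, not their shape, but it shows why the "millions of accounts average out" intuition breaks down.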

Secondly, savings account transactions are not normally distributed. If you take a single savings account, you won’t find a bunch of debits and credits. Every month, you’ll find one large credit for the salary, one mid-sized debit for monthly expenses, and several small debits for individual transactions (bills, ATM, etc.). Once in several years, you’ll find a gigantic debit (purchase of car or house, wedding, etc.) or a gigantic credit (retirement / pension fund, sale of property, etc.).

As a result, the total savings balance is likely to fluctuate a LOT more than a normal distribution would predict.

If I had just looked at the data, I’d have found several occurrences of fluctuations greater than 1.5%. The normal distribution predicts that there should be fewer than 0.3% of such cases. That’s about 1 per year. I’d have visually been able to spot nearly one a month. I’d also have been able to spot the huge 4% swings that do happen once in a few years.
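The "about 1 per year" figure can be checked with a few lines of Python. The sketch below assumes the 1.5% limit is three standard deviations, counts swings in both directions, and treats every calendar day as an observation. The short list passed to `count_exceedances` is a made-up placeholder, not the bank's data.

```python
import math


def normal_two_sided_tail(k):
    """P(|Z| > k) for a standard normal Z."""
    return math.erfc(k / math.sqrt(2))


def expected_exceedances(days, k):
    """How many days out of `days` a normal distribution expects beyond k standard deviations."""
    return days * normal_two_sided_tail(k)


def count_exceedances(daily_changes_pct, limit_pct):
    """How many observed daily changes are larger than the limit, in either direction."""
    return sum(1 for change in daily_changes_pct if abs(change) > limit_pct)


per_year = expected_exceedances(365, 3)
print(f"normal distribution expects {per_year:.2f} breaches of 3 sigma per year")

# Placeholder series, only to show how the observed count would be obtained.
print(count_exceedances([0.2, -1.6, 1.5, 2.0, -0.4], 1.5))
```

Comparing the observed count for a year of real data against `per_year` is the check described above: roughly one breach a month would be an order of magnitude more than the normal curve allows.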

People wiser than me have made the same mistake. I was interning at Lehman Brothers when they were planning to launch a new electronic bond-trading product. My task was to trace the bond price movement.

The data we had was bad. Many bonds jumped as much as 40% in a single day, due to data errors. The bulk of my task was to clean out these errors.

After cleaning up, there were still two jumps that couldn’t be explained. I went to my boss, who recognised them at sight. One was a sudden drop in the price of all Government bonds in December 1998. The other was a 32% drop in the price of Hikari Tsushin, a mobile phone retailer, on the day they went bankrupt.

We concluded that the daily price drop wouldn’t be more than 9%, to a 95% confidence level. If that was right, a 32% drop in one day would happen once in a million years. Yet, we had Hikari Tsushin just the previous year.
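The "once in a million years" arithmetic can be reconstructed roughly as follows. This is a sketch under assumptions that are not stated in the post: daily price changes are normal with zero mean, the 9% figure is a one-sided 95% bound, and there are 250 trading days in a year.

```python
import math
from statistics import NormalDist

TRADING_DAYS_PER_YEAR = 250  # assumption, not from the post

z_95 = NormalDist().inv_cdf(0.95)                # one-sided 95% point, about 1.645
sigma_pct = 9 / z_95                             # implied daily std dev, in percent
z_drop = 32 / sigma_pct                          # a 32% drop, in units of sigma
p_drop = 0.5 * math.erfc(z_drop / math.sqrt(2))  # P(Z > z_drop), the upper tail
years_between = 1 / (p_drop * TRADING_DAYS_PER_YEAR)

print(f"implied sigma: {sigma_pct:.2f}% per day")
print(f"a 32% drop is {z_drop:.1f} sigma")
print(f"expected once every {years_between:,.0f} years")
```

Under these assumptions the answer comes out at over a million years, in line with the figure quoted above.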

We didn’t bother about it. In fact, we didn’t even think about it. If we’d checked, we’d have found that the daily price drop was closer to 12% or something, to a 95% confidence level.

Summary: Force-fitting a normal distribution to non-normal data can understate the worst-case scenario. Often you’re better off inferring confidence levels directly from the raw data than from a fitted distribution.
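Reading a confidence level straight from the raw data amounts to sorting the observations and picking the value at the desired rank. The sketch below uses the nearest-rank method. The function name and the twenty daily drops are made up for illustration, and a real estimate would need far more observations.

```python
import math


def empirical_worst_case(values, confidence_pct=95):
    """Smallest observed value that at least confidence_pct percent of observations do not exceed."""
    ordered = sorted(values)
    rank = math.ceil(confidence_pct * len(ordered) / 100)  # nearest-rank percentile
    return ordered[max(rank, 1) - 1]


# Made-up daily drops in percent, with two large outliers.
daily_drops_pct = [0.5, 1, 1, 1.5, 2, 2, 2.5, 3, 3, 3.5,
                   4, 4, 4.5, 5, 5, 6, 7, 8, 15, 30]

print(empirical_worst_case(daily_drops_pct, 95))
```

No distribution is assumed here, so the outliers in the data shape the answer directly instead of being smoothed away by a fitted curve.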


Normalising non-random samples is bad

I rate movies on a scale of 1 (bad) to 5 (good). This is an absolute scale. Initially, I assumed that I would watch as many good movies as bad ones. So I’d have about as many 1s as 5s, and 2s as 4s. But, when I looked at my ratings for movies over the last year, I had far more 4s than 2s. My movie ratings were not normal.

Rating   Frequency
1          8
2         31
3         98
4         81
5         18
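The lopsidedness in the table above is easy to quantify. A short Python check, using only the frequencies from the table (the variable names are arbitrary):

```python
# Frequencies copied from the rating table.
ratings = {1: 8, 2: 31, 3: 98, 4: 81, 5: 18}

total = sum(ratings.values())
mean_rating = sum(rating * count for rating, count in ratings.items()) / total
low = ratings[1] + ratings[2]    # movies rated below the midpoint of the scale
high = ratings[4] + ratings[5]   # movies rated above the midpoint of the scale

print(f"movies rated: {total}")
print(f"mean rating:  {mean_rating:.2f}")
print(f"rated 1-2: {low}, rated 4-5: {high}")
```

A symmetric set of ratings would have a mean of 3 and about as many movies rated 1-2 as rated 4-5. Here the mean sits above 3 and the high ratings outnumber the low ones by more than two to one.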

The reason is clear. I pick good movies rather than bad ones, based on reviews. If I rated every movie there was, the ratings may be normally distributed (or they may not). But when I pick movies, I consciously reject those I know would have a low rating (based on reviews), so my ratings would be more clustered around the top.

Even if I redefined my scale, I’d still have more than 50% above the average. This is not a contradiction. I watch a LOT of good movies with very similar ratings, and a few disastrously bad movies. The good movies will have a higher-than-average rating, and there’ll be more of them than the bad movies. This is a skewed or asymmetric distribution.

So, selective picking can wreck the normal curve.

Yet, almost everything is selectively picked. Colleges try to pick the best students. Organisations tend to pick the best employees. If they rate performance, they’re likely to find a bias towards the higher side, at least in the good colleges and organisations. Force-fitting a normal distribution pushes down genuinely good people. (In bad colleges and organisations, it pushes up genuinely bad people.)