
Tamil songs by Ilayaraja in 1980s

July 20, 2006 January 25, 2025 / Quizzes / 197 Comments

More songs by Ilayaraja, composed in the 1980s. Can you guess which movie they are from? (Films are NOT repeated)

Don’t worry about the spelling. Just spell it like it sounds, and the box will turn green.

Some people will never program

July 20, 2006 July 20, 2006 / Links / Leave a Comment

All teachers of programming find that their results display a ‘double hump’. It is as if there are two populations: those who can [program], and those who cannot [program], each with its own independent bell curve.

Hotcaptchas

July 20, 2006 July 20, 2006 / Links / Leave a Comment

Prove that you are human by picking 3 hot people. Interestingly, while I was able to pick women with 100% accuracy, I just couldn’t pick out hot men. I wonder if the women really can…

Videos you can learn from

July 20, 2006 July 20, 2006 / Top 10 lists / 1 Comment

Berkeley webcasts of their courses.
Google TechTalks.
Authors@Google.
LongNow seminars about long term thinking.
UCTV Video on Demand.
Nova.
Computer History Museum.

Normalising non-normal distributions is bad

July 19, 2006 December 1, 2019 / Business realities / 5 Comments

I was working with the treasury of a bank. They were trying to estimate how much money could flow out of their savings account in a day, worst case.

I took their total savings account balance at the end of each day and found the standard deviation. I took thrice the standard deviation, and said, “You can be 99.7% sure that your daily loss won’t be more than 1.5% of the balance.”

That would be right if it were a normal distribution. But it’s not.

Banks have millions of savings accounts, each of which is like a random variable. But unless they’re independent, and they have finite standard deviations, the central limit theorem won’t work.

Firstly, savings account transactions are not independent. If there’s a run on the bank, everyone pulls out their money at once. Whenever a company declares a dividend, a large number of savings accounts are credited. Salary accounts are credited at the end of the month. As a rule of thumb, if one savings account goes up, the others are likely to as well.
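A rough simulation makes the point about independence. This is not the bank’s data: the sketch invents accounts whose daily change is partly a shock shared by everyone (a salary day, a dividend, a run) and partly their own noise. The function std_of_total, the shared_weight parameter and all the numbers are assumptions for illustration.

```python
import random
import statistics

def std_of_total(n_accounts, n_days, shared_weight, seed=0):
    """Std dev of the summed daily change across accounts (synthetic data)."""
    rng = random.Random(seed)
    totals = []
    for _ in range(n_days):
        shared = rng.gauss(0, 1)  # one shock that hits every account that day
        total = 0.0
        for _ in range(n_accounts):
            own = rng.gauss(0, 1)  # this account's own, independent movement
            total += shared_weight * shared + (1 - shared_weight) * own
        totals.append(total)
    return statistics.pstdev(totals)

independent = std_of_total(200, 500, shared_weight=0.0)
dependent = std_of_total(200, 500, shared_weight=0.5)
print(f"independent accounts: {independent:.1f}")
print(f"accounts sharing a common shock: {dependent:.1f}")
```

In the second run each account is individually less volatile, yet the total swings several times more, because the shared part does not average out across accounts.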

Secondly, savings account transactions are not normally distributed. If you take a single savings account, you won’t find a bunch of debits and credits. Every month, you’ll find one large credit for the salary, one mid-sized debit for monthly expenses, and several small debits for individual transactions (bills, ATM, etc.) Once in several years, you’ll find a gigantic debit (purchase of car or house, wedding, etc.) or a gigantic credit (retirement / pension fund, sale of property, etc.)

As a result, the savings account is likely to fluctuate a LOT more than if it were a normal distribution.

If I had just looked at the data, I’d have found several occurrences of fluctuations greater than 1.5%. The normal distribution predicts that fewer than 0.3% of days should show such a swing. That’s about 1 day per year. I’d have visually been able to spot nearly one a month. I’d also have been able to spot the huge 4% swings that do happen once in a few years.
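The “about 1 per year” figure can be checked in a couple of lines. The sketch assumes one balance reading per calendar day (365 a year) and counts swings in either direction beyond three standard deviations.

```python
from statistics import NormalDist

DAYS_PER_YEAR = 365  # assumption: one balance reading per calendar day

# Two-sided chance of landing more than 3 standard deviations from the mean.
tail = 2 * (1 - NormalDist().cdf(3))
expected_per_year = tail * DAYS_PER_YEAR

print(f"chance per day: {tail:.2%}")
print(f"expected days per year: {expected_per_year:.2f}")
```

Roughly one such day a year under the normal assumption, against the nearly one a month that the data showed.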

People wiser than me have made the same mistake. I was interning at Lehman Brothers when they were planning to launch a new electronic bond-trading product. My task was to trace the bond price movement.

The data we had was bad. Many bonds jumped as much as 40% in a single day, due to data errors. The bulk of my task was to clean out these errors.

After cleaning up, there were still two jumps that couldn’t be explained. I went to my boss, who recognised them on sight. One was a sudden drop in the price of all Government bonds in December 1998. The other was a 32% drop in the price of Hikari Tsushin, a mobile phone retailer, on the day they went bankrupt.

We concluded that the daily price drop wouldn’t be more than 9%, to a 95% confidence level. If that was right, a 32% drop in one day would happen once in a million years. Yet, we had Hikari Tsushin just the previous year.
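The “once in a million years” figure can be roughly reconstructed. The sketch below rests on assumptions that are not in the original analysis: that the 9% limit is one-sided, that the average daily move is zero, and that there are 250 trading days in a year.

```python
from statistics import NormalDist

TRADING_DAYS_PER_YEAR = 250  # assumption, not from the original analysis

std_normal = NormalDist()
# Assumption: "9% at 95% confidence" is a one-sided limit on the daily drop.
sigma = 9.0 / std_normal.inv_cdf(0.95)  # implied daily std dev, in percent
z = 32.0 / sigma                        # a 32% drop, in standard deviations
p_per_day = std_normal.cdf(-z)          # normal chance of a drop that big
years_between = 1 / (p_per_day * TRADING_DAYS_PER_YEAR)

print(f"implied sigma: {sigma:.2f}%")
print(f"a 32% drop is {z:.2f} sigma")
print(f"expected once every {years_between:,.0f} years")
```

Under these assumptions the answer lands between one and two million years, the same order of magnitude as the figure above.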

We didn’t bother about it. In fact, we didn’t even think about it. If we’d checked, we’d have found that the daily price drop was closer to 12% or something, to a 95% confidence level.

Summary: Force-fitting a normal distribution to non-normal data can understate the worst-case scenario. Often you’re better off inferring confidence levels directly from the raw data than from a fitted distribution.
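Here is a minimal sketch of the two routes side by side. The data is synthetic, not the bank’s or Lehman’s: mostly calm days with an occasional wild one, where the 5% share of wild days and their size are made-up assumptions. The 99.7% level matches the three-standard-deviation rule used earlier.

```python
import random
from statistics import NormalDist

rng = random.Random(42)
# Synthetic daily changes (percent): mostly calm days plus occasional wild ones.
changes = [rng.gauss(0, 5 if rng.random() < 0.05 else 1) for _ in range(100_000)]

LEVEL = 0.997  # confidence level for the worst-case daily drop

# Route 1: fit a normal distribution, read the limit off the fitted curve.
fitted = NormalDist.from_samples(changes)
normal_limit = -fitted.inv_cdf(1 - LEVEL)

# Route 2: read the limit straight off the sorted raw data.
ordered = sorted(changes)
raw_limit = -ordered[int((1 - LEVEL) * len(ordered))]

print(f"fitted normal says the drop stays under {normal_limit:.2f}%")
print(f"raw data says the drop stays under {raw_limit:.2f}%")
```

On this fat-tailed sample the fitted curve gives a noticeably smaller worst case than the raw data. How large the gap is depends on the data and on the confidence level chosen.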


Normalising non-random samples is bad

July 19, 2006 July 19, 2006 / Business realities / 2 Comments

I rate movies on a scale of 1 (bad) to 5 (good). This is an absolute scale. Initially, I assumed that I would watch as many good movies as bad ones. So I’d have about as many 1s as 5s, and 2s as 4s. But, when I looked at my ratings for movies over the last year, I had far more 4s than 2s. My movie ratings were not normal.

Rating	Frequency
1	8
2	31
3	98
4	81
5	18

The reason is clear. I pick good movies rather than bad ones, based on reviews. If I rated every movie there was, the ratings may be normally distributed (or they may not). But when I pick movies, I consciously reject those I know would have a low rating (based on reviews), so my ratings would be more clustered around the top.

Even if I redefined my scale, I’d still have more than 50% above the average. This is not a contradiction. I watch a LOT of good movies with very similar ratings, and a few disastrously bad movies. The good movies will have a higher-than-average rating, and there’ll be more of them than the bad movies. This is a skewed or asymmetric distribution.
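The asymmetry in the table above can be measured directly. The sketch below uses only the five counts from the table and computes the skewness, which would be zero for a perfectly symmetric set of ratings.

```python
# Rating -> number of movies, copied from the table above.
counts = {1: 8, 2: 31, 3: 98, 4: 81, 5: 18}

total = sum(counts.values())
mean = sum(r * n for r, n in counts.items()) / total
variance = sum(n * (r - mean) ** 2 for r, n in counts.items()) / total
third_moment = sum(n * (r - mean) ** 3 for r, n in counts.items()) / total
skewness = third_moment / variance ** 1.5  # 0 for a symmetric distribution

print(f"movies rated: {total}")
print(f"mean rating: {mean:.2f}")
print(f"4s for every 2: {counts[4] / counts[2]:.1f}")
print(f"skewness: {skewness:.2f}")
```

The skewness comes out negative: a thin tail of bad movies sits below a cluster of good ones.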

So, selective picking can wreck the normal curve.

Yet, almost everything is selectively picked. Colleges try and pick the best students. Organisations tend to pick the best employees. If they rate performance, they’re likely to find a bias towards the higher side — at least, the good colleges and organisations. Force fitting a normal distribution pushes down genuinely good people. (In bad colleges and organisations, it pushes up genuinely bad people).

Not all distributions are normal

July 19, 2006 July 19, 2006 / Business realities / 1 Comment

14 years ago, I was introduced to the process of normalising grades. Professors “fit” students’ marks into a normal distribution and assign grades based on that. (I still don’t know how they do it).

Since then, I’ve encountered normalising a lot. My performance at work is normalised. I normalise my song ratings and movie ratings. I’ve normalised all kinds of things at work: lead-time of delivery of fans, movements in savings account balances, calls to a call centre, demand for a resource… you name it.

(What I mean by normalising is, I find the mean and standard deviation, and assume that it’s a normal distribution with that mean and standard deviation. For things under my control, like movie ratings, I revise the ratings to fit a normal distribution.)

In fact, I normalise everything I encounter by default.

A few years ago, I started feeling uncomfortable about this. I’ve now figured out why normalising is bad — at least when done blindly like I do.

First, let’s explore why normalising is good. Normalising eliminates biases. If the Prof in Section A grades higher than the Prof in Section B, normalising takes care of it. If a Prof is extremist (more A’s as well as F’s), normalising takes care of it. If a Prof is skewed (lots below average, few extremely high above average), normalising takes care of it.
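One simple way to remove the first two biases (a generous Prof, an extremist Prof) is to convert each section’s marks to z-scores. This is a sketch of that one step, not necessarily how any professor does it. The marks are made up, and standardise is a hypothetical helper. It only shifts and rescales, so it does not reshape a skewed section.

```python
from statistics import mean, pstdev

def standardise(marks):
    """Rescale marks to mean 0 and standard deviation 1 (z-scores)."""
    m, s = mean(marks), pstdev(marks)
    return [(x - m) / s for x in marks]

# Made-up marks: the Section A Prof grades exactly 15 points higher.
section_a = [88, 92, 75, 81, 95, 70]
section_b = [73, 77, 60, 66, 80, 55]

z_a = standardise(section_a)
z_b = standardise(section_b)
print([round(z, 2) for z in z_a])
print([round(z, 2) for z in z_b])
```

After standardising, the two sections line up student for student, so the 15-point generosity no longer matters.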

Eliminating biases makes sense if Section A is fundamentally like Section B. It’s not better, nor more extremist, nor more skewed. If the sections are large enough and picked randomly, this assumption is correct. If Section A represents the smarter half, or people born in the second half of the year, or people from the Western states, or any other non-random selection, this need not be correct.

An aside: You may wonder why “people born in the second half of the year” is a non-random selection. If school admissions start in September, and admissions start when you’re 3 years old, kids born in September will be nearly 4 years old when they join. Kids born in August will be just over 3 years old. That one-year difference, to a three-year-old, is HUGE. For example, you will find a birth date bias in football, with most premiership players being born in the months of September – November.

Normalising goes a step further than eliminating bias, however. Normalising forces a normal distribution. This would be right if the underlying data is normally distributed. But if not, we may be making a mistake by force-fitting.

The Central Limit Theorem says that if you add up random variables, you get (approximately) a normal distribution, provided it’s a large sample, the variables are independent, and each has a finite standard deviation.

This means that many things you get by adding random variables are normally distributed. For example:

Number of heads when you toss a coin (add up each coin toss)
Average age of an army platoon (add up each soldier’s age)
Terminus-to-terminus time for a bus (add up the time between each stop)
Price movement of a stock exchange index (add up each stock’s price movement)
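The coin-toss example is easy to check by simulation. The sketch below tosses a fair coin 10,000 times, repeats that 5,000 times, and counts how often the number of heads falls within one standard deviation of the mean. The trial sizes and the seed are arbitrary choices.

```python
import random

rng = random.Random(1)
TOSSES, TRIALS = 10_000, 5_000
mean = TOSSES / 2
sigma = (TOSSES ** 0.5) / 2  # std dev of the head count for a fair coin

within_one_sigma = 0
for _ in range(TRIALS):
    # Each bit of a random TOSSES-bit integer stands for one coin toss.
    heads = bin(rng.getrandbits(TOSSES)).count("1")
    if abs(heads - mean) <= sigma:
        within_one_sigma += 1

share = within_one_sigma / TRIALS
print(f"within one sigma: {share:.1%} (a normal curve predicts about 68.3%)")
```

The tosses are independent, there are many of them, and each has a finite standard deviation, so the head count behaves very much like a normal distribution.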

But a lot of real-life data is NOT normally distributed. The usual reasons are:

It’s not the sum of random variables
It doesn’t satisfy the central limit theorem (independence, large sample, finite standard deviations)

Here are some non-normal distributions that are NOT the sum of random variables:

Soldier’s age within an army platoon. What random variables could you add up? You’ll probably find a lot of people at age 18, because that’s the minimum age. A little fewer at age 19 (last year’s recruits). Far fewer at age 20 (2 years minimum service accomplished). Certainly not a normal distribution.
Price movement of a single stock. What random variables could you add up? You’ll find that there are far larger price movements than a normal distribution predicts.

Here are some non-normal distributions that don’t satisfy the central limit theorem. (These are, in fact, things I said were normally distributed earlier. You see? It’s easy to think things are normal, but in reality they’re not.)

The terminus-to-terminus time for a bus. The number of bus stops is quite small. More importantly, the time between stops isn’t independent. If there’s a traffic jam, an entire section of the route will take more time. If there’s a delay between points 2 and 3, it’s likely that there’ll be a delay between points 1 and 2, and between points 3 and 4, as well.
The price movement of a stock exchange index. The price movement of stocks follows a power-law distribution, which does not have a finite standard deviation. Also, the price movements are not independent.
See more non-normal distributions.
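To see what “no finite standard deviation” looks like in practice, the sketch below compares normal samples with samples from a Pareto power law. The shape of 1.5 is an assumption chosen because it gives infinite variance; it is not an estimate for any real stock. The helper largest_share_of_squares is hypothetical.

```python
import random

def largest_share_of_squares(values):
    """Fraction of the sum of squares contributed by the single largest value."""
    squares = [v * v for v in values]
    return max(squares) / sum(squares)

rng = random.Random(7)
N = 100_000
normal_moves = [rng.gauss(0, 1) for _ in range(N)]
# Pareto with shape 1.5: a power law whose variance is infinite (shape <= 2).
power_law_moves = [rng.paretovariate(1.5) for _ in range(N)]

normal_share = largest_share_of_squares(normal_moves)
power_law_share = largest_share_of_squares(power_law_moves)
print(f"normal: largest value is {normal_share:.4%} of the sum of squares")
print(f"power law: largest value is {power_law_share:.1%} of the sum of squares")
```

With normal data no single observation matters much. With the power law, one observation carries a visible chunk of the sum of squares, so the measured standard deviation depends mostly on whether that one extreme day happens to be in the sample.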

Summary: Don’t assume that anything you see is a normal distribution. It usually isn’t.

I’ll shortly talk about what happens when you assume something’s a normal distribution, when it really is not.
