S Anand

Visualising student performance 2

This earlier visualisation was revised based on feedback from teachers. It’s split into two parts: one focused on performance by subject, and the other on the performance of each student.

Students’ performance by subject

Visualisation by subject

This is fairly simple. Under each subject, we have a list of students, sorted by marks and grouped by grade. The primary use of this is to identify top performers and bottom performers at a glance. It also gives an indication of the grade distribution.

For example, here’s mathematics.

Student scores in a subject

Grades are colour-coded intuitively, following the rainbow: violet is high, red is low.

Colour coding of grades 

The little graphs on the left show the performance in individual exams, and can be used to identify trends. For example, from the graph to the left of Karen’s score:

A single student's score

… you can see that she’d have been an A1 student (the first two bars are coloured A1) but for the dip in the last exam (which is coloured A2).

Finally, there’s a histogram showing the grades within the subject.

Histogram of grades

Incidentally, while the names are fictitious, the data is not. This graph shows a bimodal distribution and may indicate cheating.

Students’ performance

Visualisation by student 

This is useful when you want to take a closer look at a single student. On the left are the total scores across subjects.

Visualisation of total scores

Because of the colour coding, it’s easy to get a visual sense of performance across subjects. For example, in the first row, Kristina is having some trouble with Mathematics. And in the last row, Elsie is doing quite well.

To give a better sense of the performance, the next visualisation plots the relative performance of each student.

Visualisation of relative performance

From this, it’s easy to see that Kristina is in the bottom quarter of the class in English and Science, and isn’t doing too well in Mathematics either. Gretchen and Elsie, on the other hand, are consistently doing well. Patrick may need some help with Mathematics as well. (Incidentally, the colours have no meaning. They just make the overlaps less confusing.)

Next to that is the break-up of each subject’s score.

Visualisation of score break-up

The first number in each subject is the total score. The colour indicates the grade. The graph next to it, as before, is the trend in marks across exams. The same scores are shown alongside as numbers inside circles. The colour of the circle is the grade for that exam.

In some ways, this visualisation is less information-dense than the earlier visualisation. But this is intentional. Redundancy can help with speed of interpretation, and a reduced information density is also less intimidating to first-time readers.

Google search via e-mail

I’ve updated Mixamail to access Google search results via e-mail.

For those new here, Mixamail is an e-mail client for Twitter. It lets you read and update Twitter just using your e-mail (you’ll have to register once via Twitter, though).

Now, you can send an e-mail to twitter@mixamail.com with a subject of “Google” and a body containing your query. You’ll get a reply within a few seconds (~20 seconds on my BlackBerry) with the top 8 search results along with the snippets.

It’s the snippets that contain the useful information, as far as I’m concerned. Just yesterday, I managed to find the show timings for Manmadan Ambu at the Ilford Cine World via a search on Mixamail. (A win for Mixamail, though the movie was a let-down, given expectations.)

You don’t need to be registered to use this. So if you’re ever stuck with just e-mail access, just mail twitter@mixamail.com with a subject “Google”.

PS: The code is on Github.

Visualising student performance

I’ve been helping with visualising student scores for ReportBee, and here’s what we’ve currently come up with.

class-scores

Each row is a student’s performance across subjects. Let’s walk through each element here.

The first column shows their relative performance across different subjects. Each dot is their rank in a subject. The dots are colour coded based on the subject (and you can see the colours on the image at the top: English is black, Mathematics is dark blue, etc.)

class-scores-2

The grey boxes in the middle indicate the 2nd and 3rd quartiles. A dot to the left of them means the student is in the bottom quartile; a dot to the right means the top quartile. Student 30, for example, is in the bottom quartile in almost every subject.
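As an illustration of the underlying computation (hypothetical marks and a plain-Python sketch; ReportBee’s actual code isn’t shown here), the quarter a student falls into can be derived by ranking their mark against the class:

```python
def quartile(mark, class_marks):
    """Return which quarter of the class a mark falls in:
    1 = bottom quarter, ..., 4 = top quarter."""
    ranked = sorted(class_marks)
    n = len(ranked)
    # Count how many marks in the class fall strictly below this one
    below = sum(1 for m in ranked if m < mark)
    return min(4, below * 4 // n + 1)

# Hypothetical marks for one subject
maths = [35, 42, 48, 55, 61, 64, 70, 78, 85, 92]
```

With this data, `quartile(35, maths)` gives 1 (a dot on the left of the grey boxes) and `quartile(92, maths)` gives 4 (a dot on the right).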

This view lets teachers quickly explain how a student is performing – either to the headmistress, or parents, or the student. There is a big difference between a consistently good performer, a consistently poor performer, and one that is very good in some subjects, very poor in others. This view lets the teachers identify which type the student falls under.

For example, student 29 is doing very well in a few subjects, OK in some, but is very bad at computer science. This is clearly an intelligent student, so perhaps a different teaching method might help with computer science. Student 30 is doing badly in almost every subject. So the problem is not subject-specific – it is more general (perhaps motivation, home atmosphere, ability, etc.) Student 31 is consistently in the middle, but above average.

class-scores-3

The bars in the middle show a more detailed view, using the students’ marks. The zoomed view above shows the English, Mathematics and Social Science marks for the same 3 students (29, 30, 31). The grey boxes have the same meaning. Anyone to the right of those is in the top quarter. Anyone to the left is in the bottom quarter.

Some of the bars have a red or a green circle at the end.

class-scores-5

The green circle indicates that the student has a top score in the subject. The red circle indicates that the student has a bottom score in the subject. This lets teachers quickly narrow down to the best and worst performers in each subject.

The bars on top of the subjects show the histogram of students’ performances. It is a useful view to get a sense of the spread of marks.

class-scores-4

For example, English is biased significantly more towards the top half than Mathematics or Science. Mathematics has many “trailing” students at the bottom, while English has fewer, and Social Science has many more.

Most of these elements are intuitive, really. Once explained (and often, even when not explained), they are easy to remember and apply.

So far, this visualisation answers descriptive questions, like:

  • Where does this student stand with respect to the class?
  • Is this student a consistent performer, or does his performance vary a lot?
  • Does this subject have a consistent performance, or does it vary a lot?

We’re now working on drawing insights from this data. For example:

  • Is there a difference between the performance across sections?
  • Do students who perform well in science also do well in mathematics?
  • Can we group students into “types” or clusters based on their performances?

Will share those shortly.

What does India search for?

Over the last couple of years, I’ve been tracking the top 5 hot searches in India on Google Trends (http://www.google.co.in/trends). Here are the results:

If you’re interested in making visualisations out of it, please feel free. But there’s one particular thing I’m trying out, which is to categorise these searches and see if there’s a trend around that. I’ve added a “Tag” column.

Could you please help me tag the spreadsheet: https://spreadsheets.google.com/ccc?key=0Av599tR_jVYgdE5zTU5QWjcxVWVCaTBuY3d0NkUtc1E&hl=en_GB

It’s publicly editable, no special access required. If you could stick to the tags I already have (Business, Education, Entertainment, News, Politics, Sports, Technology), that would be great. If not, that’s fine as well.

And if you’ve made any visualisations or done any analysis using this data, please do drop a comment.

Visualising the Wilson score for ratings

Reddit’s new comment sorting system (charmingly explained by Randall Munroe) uses what’s called a Wilson score confidence interval.

I’ll wait here while you read those articles. If you ever want to implement user-ratings, you need to read them.

The summary is: don’t use the average rating. Use something else, which in this case is the Wilson score. It says that if you got 3 negative ratings and no positive ratings, your rating shouldn’t be zero. Rather, you can be 95% sure that, given a chance, it’ll end up at 0.47 or above, so rate it as 0.47.
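As a sketch of the underlying formula (plain Python; z = 1.96 is the usual two-sided 95% value, and the exact numbers depend on which z-value and which bound of the interval you use), the Wilson score lower bound can be computed as:

```python
import math

def wilson_lower_bound(positive, negative, z=1.96):
    """Lower bound of the Wilson score confidence interval
    for the true proportion of positive ratings (~95% by default)."""
    n = positive + negative
    if n == 0:
        return 0.0
    phat = positive / n
    centre = phat + z * z / (2 * n)
    margin = z * math.sqrt((phat * (1 - phat) + z * z / (4 * n)) / n)
    return (centre - margin) / (1 + z * z / n)
```

With 3 positive ratings and no negative ones, the naive average is 100% but this lower bound is only about 0.44; with no ratings at all it is 0.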

I understand this stuff better visually, so I tried to see what the rating would be for various positive and negative scores. Here’s the plot.

The axes on the floor show the number of positive and negative ratings (you can figure out which is which), and the height of the surface is the average rating it should get.

You can see that if there are only positive ratings, the average rating is 100% (because there’s a 95% chance it’ll end up at 100% or above). If there are only negative ratings, the rating falls off sharply. In the early stages, a few positive ratings can correct that very quickly, but over time, the correction’s a lot weaker.

You can move your mouse over the visualisation to control the angle. (For those reading this via the RSS feed, you may need to visit my blog.) Try it out: I understood the behaviour a lot better this way.

Yahoo Clues API

Yahoo Clues is like Google Insights for Search. It has one interesting thing that the latter doesn’t though: search flows.

It doesn’t have an official API, so I thought I’d document the unofficial one. The API endpoint is

http://clues.yahoo.com/clue

The query parameters are:

  • q1 – the first query string
  • q2 – the second query string
  • ts – the time span. 0 = today, 1 = past 7 days, 2 = past 30 days
  • tz – time zone? Not sure how it works. It’s just set to “0” for me
  • s – the format? No value other than “j” seems to work

So a search for “gmat” for the last 30 days looks like this:

http://clues.yahoo.com/clue?s=j&q1=gmat&q2=&ts=2&tz=0

The response has all the elements required to render the page, but the search flows are located at:

  • response.data[2].qf.prevMax – an array of queries that often precede the current one
  • response.data[2].qf.nextMax – an array of queries that often follow the current one

The other parameters (such as demographic, geographic and search volume information) are pretty interesting as well, but that’s something you should be able to extract more reliably from Google Insights for Search.
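Putting the pieces above together, here’s a sketch of a client using only the standard library (keep in mind the endpoint is unofficial and undocumented, so it may change or disappear):

```python
import json
import urllib.parse
import urllib.request

ENDPOINT = 'http://clues.yahoo.com/clue'

def clues_url(query, timespan=2):
    """Build the (unofficial) Yahoo Clues request URL."""
    params = urllib.parse.urlencode({
        's': 'j',        # response format; only "j" seems to work
        'q1': query,     # first query string
        'q2': '',        # second query string (blank for a single query)
        'ts': timespan,  # 0 = today, 1 = past 7 days, 2 = past 30 days
        'tz': 0,         # time zone; "0" works
    })
    return ENDPOINT + '?' + params

def search_flows(query, timespan=2):
    """Return (preceding, following) query lists from the response."""
    with urllib.request.urlopen(clues_url(query, timespan)) as f:
        response = json.load(f)
    flows = response['data'][2]['qf']
    return flows['prevMax'], flows['nextMax']
```

For example, `clues_url('gmat')` reproduces the “gmat” URL shown above.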

Automated image enhancement

There are some standard enhancements that I apply to my photos consistently: auto-levels, increase saturation, increase sharpness, etc. I’d also read that Flickr sharpens uploads (at least, the resized ones) so that they look better.

So last week, I took 100 of my photos and created 4 versions of each image:

  1. The base image itself (example)
  2. A sharpened version (example). I used a sharpening factor of 200%
  3. A saturated version (example). I used a saturation factor of 125%
  4. An auto-levelled version (example)
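The post doesn’t say which tools produced the variants, so as an approximation, the three enhancements can be sketched with Pillow’s `ImageEnhance` and `ImageOps` modules (factor values taken from the list above):

```python
from PIL import Image, ImageEnhance, ImageOps

def versions(path):
    """Create the four test variants of an image: base, sharpened,
    saturated, and auto-levelled (approximated with Pillow)."""
    base = Image.open(path).convert('RGB')
    return {
        'base': base,
        'sharp': ImageEnhance.Sharpness(base).enhance(2.0),   # 200% sharpening
        'saturated': ImageEnhance.Color(base).enhance(1.25),  # 125% saturation
        'levels': ImageOps.autocontrast(base),                # auto-levels
    }
```

Each variant can then be saved out, e.g. `versions('photo.jpg')['sharp'].save('photo-sharp.jpg')`.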

I created a test asking people to compare these. The differences between these are not always noticeable when placed side-by-side, so the test flashed two images at the same place.

After about 800 ratings, here are the results. (Or, see the raw data.)

Sharpening clearly helps. 86% of the sharpened images were marked as better than the base images. Only 2 images (base/sharp, base/sharp) received consistent feedback that the sharpened images were worse. (I have my doubts about those two as well.) On the whole, it seems fairly clear that sharpening helps.

Saturation and levels were roughly equal, and somewhat unclear. 69% of the saturated images and 68% of auto-levelled images were marked as better than the base images. And almost an equal number of images (52%) showed saturation as being better than the auto-levelled version. For a majority of images (60%), there’s a divided opinion on whether saturation was better than levelling or the other way around.

On the whole, sharpening is a clear win. When in doubt, sharpen images.

For saturation and levelling, there certainly appears to be potential. 2 in 3 images are improved by either of these techniques. But it isn’t entirely obvious which (or both) to apply.

Is there someone out there with some image processing experience to shed light on this?

Surviving in prison

As promised, here are some tips from the trenches on surviving in prison. (For those who don’t follow my blog, prison is where your Internet access is restricted.)

There are two things you need to know better: software and people. I’ll try and cover the software in this post, and the more important topic in the next.

Portable apps

You’re often not in control of your laptops / PCs. You don’t have administrator access. You can’t install software. The solution is to install Portable Apps. Most popular applications have been converted into Portable Apps that you can install on to a USB stick. Just plug them into any machine and use them. I use Firefox and Skype quite extensively this way, but increasingly, I have a preference for Portable Apps for just about everything. It makes my bloated Start Menu a lot more manageable. Some of the other portable apps I have are: Audacity, Camstudio, GIMP, Inkscape and Notepad++.

Admin access

The other possibility is that you try and gain admin access. I did this once at a client site (a large bank). We didn’t have admin access. I wasn’t particularly thrilled. So I borrowed a floppy, installed an offline password recovery tool, rebooted, and got the admin password within a few minutes. This was with the full knowledge of the (somewhat worried) client. This is where the people part comes in, and I’ll talk about that later.

Proxies

But before you do any of these, you need to be able to download the files, most of which are executables. Those are probably blocked. Heck, the sites from which you can download these files are probably blocked in the first place.

Sometimes, internal proxies help. Proxies for different geographies may have different degrees of freedom. When I was at IBM, the Internet was accessible from most US proxies, just not from the Indian proxy. So it may just be a matter of finding the right internal proxy.

Or you can search for external public proxies. Sadly, many of these are blocked. Another option is for you to set up your own proxy. You can install mirrorrr on AppEngine for free, for example.

The most effective option, of course, is to use SSH tunnels. I’ve covered this in some detail earlier.

Google

Google has a wide range of tools that can help access blocked sites. If the site you’re accessing provides public RSS feeds, use Google Reader to access these. Public feeds for Twitter, for example, are available as RSS feeds.

Google’s cache is another way of getting the same information. Search for the URL, click on the “Cache” link to read the text even if it’s blocked.

To find more such help, Google for it!

Peopleware

… but all of this is, honestly, just a small part of it. The key, really, is to understand the people restricting your access. I’ll talk about this next.

Shortening sentences

When writing Mixamail, I wanted tweets automatically shortened to 140 characters – but in the most readable manner.

Some steps are obvious. Removing redundant spaces, for example. And URL shortening. I use bit.ly because it has an API. I’ll switch to Goo.gl, once theirs is out.

I tried a few more strategies:

  1. Replace words with short forms. “u” for “you”, “&” for and, etc.
  2. Remove articles – a, an, the
  3. Remove optional punctuation – comma, semicolon, colon and quotes, in particular
  4. Replace “one” with “1”, “to” or “too” with 2, etc. “Before” becomes “Be4”, for example
  5. Remove spaces after punctuations. So “a, b” becomes “a,b” – the space after the comma is removed
  6. Remove vowels in the middle. nglsh s lgbl wtht vwls.

How did they pan out? I tested these out on the English sentences in the Tanaka Corpus, which has about 150,000 sentences. (No, they’re not typical tweets, but hey…) Applying each strategy independently, here is the percentage reduction in the size of the text:

2.0% Remove optional punctuations – comma, semicolon, colon and quotes
2.2% Remove spaces after punctuations. So “a, b” becomes “a,b”
3.3% Replace words with short forms. “u” for “you”, “&” for and, etc.
3.3% Replace “one” with “1”, “to” or “too” with 2, etc.
6.7% Remove articles – a, an, the
18.2% Remove vowels in the middle

Touching punctuation doesn’t have much impact. There isn’t that much of it anyway. Word substitution helps, but not too much. I could’ve gone in for a wider base, but the key is the last one: removing vowels in the middle cuts a whopping 18%! That’s tough to beat with any strategy. So I decided to just stop there.

The overall reduction, applying all of the above, is about 22%. So there’s a decent chance you can type in a 180-character tweet, and Mixamail.com will still tweet it intelligibly.

I had one such tweet a few days ago. I try and stay well within 140, but this one was just too long.

The Lesson: If you’re writing an app (or building anything), find a use for yourself. There’s no better motivation — and it won’t ever be a wasted effort.

That was 156 characters. It got shortened to:

Lesson If u’re writing app (or building anything) find use 4 yourself. There’s no better motivation — & it won’t ever be wasted ef4t.

Perfectly acceptable.

You may notice that Mixamail didn’t have to employ vowel shortening. It applies the most readable shortenings first, checks if the result is within 140 characters, and tries the next strategy only if required.
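That staged approach can be sketched as follows (a simplified subset of the rules, with hypothetical helper names, not Mixamail’s actual code):

```python
import re

def strip_spaces(text):
    """Collapse runs of whitespace into single spaces."""
    return re.sub(r'\s+', ' ', text).strip()

def drop_articles(text):
    """Remove the articles a, an, the."""
    return re.sub(r'\b(a|an|the)\s+', '', text, flags=re.IGNORECASE)

def drop_vowels(text):
    """Remove vowels from the middle of words,
    e.g. "shortening" -> "shrtnng"."""
    return re.sub(r'\B[aeiou]\B', '', text)

def shorten(text, limit=140):
    """Apply the most readable transformations first; stop as soon
    as the text fits within the limit."""
    for step in (strip_spaces, drop_articles, drop_vowels):
        if len(text) <= limit:
            break
        text = step(text)
    return text
```

Text that already fits is returned untouched, and the harsher steps (like vowel removal) only kick in when the gentler ones aren’t enough.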

If anyone has a simple, readable way of shortening Tweets further, please let me know!

HTML5: Up and Running

HTML5: Up and Running is the book version of Mark Pilgrim’s comprehensive introduction to HTML5 at DiveIntoHTML5.org. Whether you buy the book or read it online, it’s the best introduction to the topic you’ll find.

Mark begins with the history of HTML5 (using email archaeology, as he calls it). You’d never guess that many of the problems we have with XHTML, MIME types, etc. have roots in discussions over 20 years ago. From then on, he moves into feature detection (which uses the Modernizr library), new tags, canvas, video, geo-location, storage, offline web apps, new form features and microdata. Each chapter can be read independently – so if you’re planning to use this as a reference, you may be better off reading the links kept up-to-date at DiveIntoHTML5.org. If you’re interested in learning about the features, it’s a very readable book, terse, simple, and above all, delightfully intelligent.

Incidentally, if you’re starting off on a new HTML5 project, you’re probably best off using HTML5BoilerPlate.com. It’s very actively maintained, and contains some really nifty tricks, like the protocol-relative URL.

Disclosure: I’m writing this post as part of O’Reilly’s blogger review program. While I’m not getting paid to review books, I sure am getting to read them for free.