My earlier list of statistically improbable phrases in Calvin and Hobbes is technically just a list of “Statistically Improbable Words”. I re-did the same analysis using phrases. Here are the top 20 statistically improbable phrases (2 – 4 words only):
baby sitter, chocolate frosted sugar bombs, comic books, doing homework, fearless spaceman spiff, good night, hamster huey, ice cream, miss wormwood, new year, peanut butter, really think, slimy girls, spaceman spiff, stuffed tiger, stupendous man, sugar bombs, susie derkins, watch tv, water balloon
That is, these are the 2-4 word phrases whose frequency in Calvin and Hobbes is substantially (at least 5 times) higher than in the other books I have.
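A minimal sketch of that comparison in Python, assuming both texts are already split into lists of lowercase words. The function names, the `min_count` filter and the add-one smoothing are illustrative assumptions, not the exact method used to produce the list above.

```python
from collections import Counter

def ngrams(words, n_min=2, n_max=4):
    """Yield every run of n_min..n_max consecutive words as one string."""
    for n in range(n_min, n_max + 1):
        for i in range(len(words) - n + 1):
            yield " ".join(words[i:i + n])

def improbable_phrases(target_words, reference_words, ratio=5.0, min_count=2):
    """Return {phrase: score} for phrases at least `ratio` times more
    frequent in the target text than in the reference text."""
    target = Counter(ngrams(target_words))
    reference = Counter(ngrams(reference_words))
    t_total = sum(target.values())
    r_total = sum(reference.values())
    result = {}
    for phrase, count in target.items():
        if count < min_count:
            continue  # assumption: ignore phrases seen only once
        t_freq = count / t_total
        # Assumption: add-one smoothing, so a phrase that never appears
        # in the reference text does not cause a division by zero.
        r_freq = (reference[phrase] + 1) / (r_total + 1)
        score = t_freq / r_freq
        if score >= ratio:
            result[phrase] = score
    return result
```

Calling `improbable_phrases(calvin_words, other_words)` with two word lists (both names hypothetical) returns the phrases that clear the 5x threshold, along with their ratios.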
While doing this, the single biggest problem that stumped me was: what is a word?
- Is “it’s” one word or two words?
- Is “six-year-old” one word or three words?
- How do I distinguish between abbreviations (g.r.o.s.s.) and full-stops without a space ( … homework.what’s a …)?
- Does a comma always split words? (It doesn’t in numbers, like “3,500”)
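One possible set of answers to these questions can be written as a regular expression. This is only a sketch, and the choices are assumptions: it keeps “it’s” and “six-year-old” as single words, treats runs of single letters followed by full stops as abbreviations, and keeps commas inside numbers.

```python
import re

# Alternatives are tried in order, so numbers and abbreviations win over
# plain words when more than one pattern could match.
TOKEN = re.compile(r"""
    \d+(?:,\d{3})*(?:\.\d+)?     # numbers such as 3,500 or 2.5
  | (?:[a-z]\.){2,}              # dotted abbreviations such as g.r.o.s.s.
  | [a-z]+(?:['’-][a-z]+)*       # words, keeping it's and six-year-old whole
""", re.VERBOSE | re.IGNORECASE)

def tokenize(text):
    """Split text into lowercase tokens using the rules above."""
    return TOKEN.findall(text.lower())
```

With these rules a full stop with no space after it still splits the sentence, because “homework.what’s” does not look like a run of single letters: it comes out as “homework” and “what’s”.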
The other problem is that phrases with more words are more improbable. Right now, if a phrase occurs 5 times more frequently in Calvin and Hobbes than in my other books, I include it. But three-word phrases rarely occur that often, and four-word phrases even less so. Maybe I should have a lower cutoff for longer phrases.
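A length-dependent cutoff could look something like the sketch below. The values for three- and four-word phrases are made up purely for illustration. Only the cutoff of 5 comes from the analysis above.

```python
# Illustrative cutoffs only: the analysis uses 5 for every phrase length,
# and the lower values for longer phrases are untested guesses.
CUTOFF_BY_LENGTH = {2: 5.0, 3: 3.0, 4: 2.0}

def passes_cutoff(phrase, ratio, cutoffs=CUTOFF_BY_LENGTH, default=5.0):
    """Return True if the frequency ratio clears the cutoff for this phrase's length."""
    n_words = len(phrase.split())
    return ratio >= cutoffs.get(n_words, default)
```

Under these guessed values, a four-word phrase that is only 2.5 times more frequent would be kept, while a two-word phrase with the same ratio would still be dropped.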
Anyway, this analysis is a crude first approximation. Clearly Amazon’s gotten much further with their system.
Fantastic job man! You have unlimited patience!
Hey, only just now came across your page, but of the hundreds, if not thousands of C&H sites and tools, this is the most useful I’ve seen! R
hello
nice stuff. I intend splitting a text into single words. Can you please give me a hint how to do this? I guess there are simple programs doing this.
Many thanks
Hey Stud, Satish here, your junior from IIMB. Trying to get in touch with you. Do mail me at satishkgv@hcl.in and let us get in touch.
Nice. Do you have a page where I can try out v2 (phrases)?