Coding

How does Gemini process videos?

The Gemini documentation is clear:

The File API service extracts image frames from videos at 1 frame per second (FPS) and audio at 1Kbps, single channel, adding timestamps every second. These rates are subject to change in the future for improvements in inference.

Note: The details of fast action sequences may be lost at the 1 FPS frame sampling rate. Consider slowing down high-speed clips for improved inference quality.

Individual frames are 258 tokens, and audio is 32 tokens per second. With metadata, each second of video becomes ~300 tokens, which means a 1M context window can fit slightly less than an hour of video.

To ask questions about time-stamped locations, use the format MM:SS, where the first two digits represent minutes and the last two digits represent seconds.

But on this ThursdAI episode: Oct 17 – Robots, Rockets, and Multi Modal Mania…, at 1:00:50, Hrishi says:

I don’t think it’s a series of images anymore because when I talk to the model and try to get some concept of what it’s perceiving, it’s no longer a series of images.

If that’s the case, it’s a huge change. So I tested it with this video.

This video has 20 numbers refreshing at 4 frames per second.

When I upload it to AI Studio, it takes 1,316 tokens. The video runs for 5 seconds (20 numbers at 4 frames per second), so this is close to 5 frames at 258 tokens per image (1,290 tokens, no audio). So I’m partly convinced that Gemini is still processing videos at 1 frame per second.
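As a rough check of that arithmetic, here is a minimal sketch using the documented rates (258 tokens per frame, 32 tokens per second of audio). The function name is hypothetical, and the 5-second duration assumes 20 numbers shown at 4 frames per second. The small remainder is not explained by the documentation quoted here; it may be timestamps or other metadata.

```python
TOKENS_PER_FRAME = 258         # from the Gemini documentation quoted above
AUDIO_TOKENS_PER_SECOND = 32   # from the same documentation


def estimate_video_tokens(seconds: int, has_audio: bool = True) -> int:
    """Estimate tokens for a video sampled at 1 frame per second, ignoring metadata."""
    per_second = TOKENS_PER_FRAME + (AUDIO_TOKENS_PER_SECOND if has_audio else 0)
    return seconds * per_second


video_seconds = 20 // 4        # 20 numbers at 4 frames per second (assumed duration)
expected = estimate_video_tokens(video_seconds, has_audio=False)
observed = 1316                # token count reported by AI Studio
print(expected, observed - expected)  # prints: 1290 26
```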

Then, I asked it to Extract all numbers in the video using Gemini 1.5 Flash 002 as well as Gemini 1.5 Flash 8b. In both cases, the results were: 2018, 85, 47, 37, 38.

These are frames 2, 6, 10, 14, 18 (out of 20). So, clearly Gemini is still sampling at about 1 frame per second, starting somewhere between 0.25 and 0.5 seconds.
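The inference about the starting offset can be sketched as a small simulation. This is only a model of the sampling, not Gemini's actual pipeline: it assumes the sampler picks whichever frame is on screen at each sampling instant, and `sampled_frames` is a hypothetical helper.

```python
def sampled_frames(
    total_frames: int,
    video_fps: float,
    offset_seconds: float,
    sample_fps: float = 1.0,
) -> list[int]:
    """Return 1-based indices of the frames on screen at each sampling instant."""
    duration = total_frames / video_fps
    frames = []
    sample_index = 0
    while (t := offset_seconds + sample_index / sample_fps) < duration:
        # Frame 1 covers [0, 1/fps), frame 2 covers [1/fps, 2/fps), and so on
        frames.append(int(t * video_fps) + 1)
        sample_index += 1
    return frames


print(sampled_frames(20, 4, offset_seconds=0.25))  # [2, 6, 10, 14, 18]
print(sampled_frames(20, 4, offset_seconds=0.0))   # [1, 5, 9, 13, 17]
```

Under this model, only an offset of at least 0.25 and below 0.5 seconds reproduces the frames Gemini reported.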

Clone any voice with a 15-second sample

It’s surprisingly easy to clone a voice using F5-TTS: “A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching”.

Here’s a clip of me, saying:

I think Taylor Swift is the best singer. I’ve attended every one of her concerts and in fact, I’ve even proposed to her once. Don’t tell anyone.

(Which is ironic since I didn’t know who she was until this year and I still haven’t seen or heard her.)

You’ll notice that my voice is a bit monotonous. That’s because I trained it on a segment of my talk that’s monotonous.

Here’s the code. You can run this on Google Colab for free.

A few things to keep in mind when preparing the audio.

  1. Keep the input to just under 15 seconds. That’s the optimal length.
  2. For expressive output, use an input with a broad range of voice emotions
  3. When using unusual words (e.g. LLM), including the word in your sample helps
  4. Transcribe input.txt manually to get it right, though Whisper is fine to clone in bulk. (But then, who are you and what are you doing?)
  5. Sometimes, each chunk of audio generated has a second of audio from the original interspersed. I don’t know why. Maybe a second of silence at the end helps
  6. Keep punctuation simple in the generated text. For example, avoid hyphens like “This is obvious – don’t try it.” Use “This is obvious, don’t try it.” instead.
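Tips 1 and 5 can be applied with the standard library alone. This is a minimal sketch, not part of F5-TTS: it assumes the reference clip is a 16-bit PCM WAV file, and the function name, default durations and file names are all hypothetical.

```python
import wave


def prepare_reference_clip(
    src_path: str,
    dst_path: str,
    total_seconds: float = 14.5,
    silence_seconds: float = 1.0,
) -> float:
    """Trim a WAV file and append trailing silence. Returns the output duration in seconds."""
    with wave.open(src_path, "rb") as src:
        params = src.getparams()
        max_speech_frames = int((total_seconds - silence_seconds) * params.framerate)
        speech_frames = min(src.getnframes(), max_speech_frames)
        speech = src.readframes(speech_frames)
    silence_frames = int(silence_seconds * params.framerate)
    # Zero bytes are silence for signed 16-bit PCM (an assumption about the input)
    silence = b"\x00" * (silence_frames * params.sampwidth * params.nchannels)
    with wave.open(dst_path, "wb") as dst:
        dst.setparams(params)  # the frame count in the header is corrected on close
        dst.writeframes(speech + silence)
    return (speech_frames + silence_frames) / params.framerate


# Usage (hypothetical file names):
# prepare_reference_clip("talk.wav", "reference.wav")
```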

This has a number of uses I can think of (er… ChatGPT can think of), but the ones I find most interesting are:

  1. Author-narrated audio books. I’m sure this is coming soon, if it’s not already there.
  2. Personalized IVR. Why should my IVR speak in some other robot’s voice? Let’s use mine. (This has some prank potential.)
  3. Annotated presentations. I’m too lazy to speak. Typing is easier. This lets me create, for example, slide decks with my voice, but with editing made super-easy. I just change the text and the audio changes.

Perl, 1994-2011

In 1994, I learnt Perl. It was fantastic. I used it to:

  1. 1995: Build CCChat – the unofficial IITM email system and software repository
  2. 1999: Build my entire blog from scratch
  3. 2000: Author my 2nd year thesis on the Behavioural Aspects of Financial Analysts by analyzing 600MB of IBES data
  4. 2002: Analyze where to place the central processing hubs for a bank
  5. 2004: Analyze the interest durations of public sector banks
  6. 2005: Create music quizzes
  7. 2006: Create my own music search engine (which earned me about $100 a month in Google Ad revenue for a while)
  8. 2006: Automate resume filtering
  9. 2007: Create custom search engines
  10. 2008: Build application launchers

In 2006, I was convinced I should stick to Perl over Python.

In 2008, Google launched AppEngine and it provided free hosting (which was a big deal!) but had only 2 runtimes: Java and Python. The choice was clear. I’d rather learn Python than code in Java.

By 2011, I stopped installing Perl on my laptop.

Though most people know me mainly as a Python developer, I’ve programmed in Perl for about as long as I have in Python. I have fond memories of it. But I can’t read any of my code, nor write in it anymore.

When I watched The Perl Conference (now called The Perl and Raku Conference — Perl 6 is called Raku), I was surprised to hear how much the language had declined.

There were fewer than 100 attendees – and for 2025, they’ve decided to go smaller and book a tiny hotel, so that they break even with as few as 20 attendees.

Few languages have had as much of an impact on my life and thinking. My knowledge of modern programming comes from The Camel Book, functional programming from Higher Order Perl, Windows programming from Learning Perl on Win32 Systems, and so on. Even my philosophy of coding was shaped by Larry Wall’s three great virtues of a programmer.

This is my homage to the language that shaped me. Bless you, Perl!

Cursor custom rules

cursor.directory is a catalog of Cursor rules. Since I’ve actively switched over from VS Code to Cursor as my editor, I reviewed the popular rules and came up with this as my list:

You are an expert full stack developer in Python and JavaScript.

  • Write concise, technical responses with accurate Python examples.
  • Use functional, declarative programming; avoid classes.
  • Avoid code duplication (iteration, functions, vectorization).
  • Use descriptive variable names with auxiliary verbs as snake_case for Python (is_active, has_permission) and camelCase for JavaScript (isActive, hasPermission).
  • Functions should receive an object and return an object (RORO) where possible.
  • Use environment variables for sensitive information.
  • Write unit tests in pytest for Python and Jest for JavaScript.
  • Follow PEP 8 for Python.
  • Always use type hints in all function signatures.
  • Always write docstrings. Use Google style for Python and JSDoc for JavaScript.
  • Cache slow or frequent operations in memory.
  • Minimize blocking I/O operations with async operations.
  • Only write ESM (ES6) JavaScript. Target modern browsers.

Libraries

  • lit-html and vanilla JavaScript for frontend development.
  • D3 for data visualization.
  • Bootstrap for CSS.
  • Pandas and DuckDB for data analysis and manipulation.
  • FastAPI for API development.

Error Handling and Validation

  • Validate preconditions and errors early to avoid deeply nested if statements.
  • Use try-except or try-catch blocks for error-prone operations, especially when reading external data.
  • Avoid unnecessary else statements; use the if-return pattern instead.
  • Log all errors with user-friendly error messages shown on the frontend.
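To make a few of these rules concrete, here is a minimal Python sketch of one function that follows them: receive an object and return an object, type hints, a Google-style docstring, an environment variable for the secret, and early returns instead of nested if/else. The function, its keys and `DEMO_API_KEY` are hypothetical names chosen for illustration.

```python
import logging
import os

logger = logging.getLogger(__name__)


def get_api_config(options: dict) -> dict:
    """Build API settings from an options object.

    Args:
        options: Object with an ``env_var`` key naming the environment
            variable that holds the API key.

    Returns:
        Object with ``is_configured`` and ``api_key`` keys.
    """
    env_var = options.get("env_var", "")
    # Validate preconditions early and return, rather than nesting if/else
    if not env_var:
        logger.error("No environment variable name was provided.")
        return {"is_configured": False, "api_key": ""}
    api_key = os.environ.get(env_var, "")
    if not api_key:
        logger.error("Environment variable %s is not set.", env_var)
        return {"is_configured": False, "api_key": ""}
    return {"is_configured": True, "api_key": api_key}


# Usage (hypothetical variable name):
# config = get_api_config({"env_var": "DEMO_API_KEY"})
```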

From Calvin & Hobbes to Photo Tagging: Excel’s Unexpected Image Capability

In Excel, using Visual Basic, you can change an image as you scroll. This makes it easy to look at each image and annotate it.

This is how I transcribed every Calvin & Hobbes.

I used this technique first when typing out the strips during my train rides from Bandra to Churchgate. I had an opportunity to re-apply it recently when we needed to tag hundreds of photographs based on a set of criteria.

Here’s how you can do this. Note: This works only on Windows.

STEP 1: Create a new Excel workbook and save it as an Excel macro-enabled workbook. (Note: When opening it again, you need to enable macros)

STEP 2: Open File > Options (Alt-F-T), go to Customize Ribbon. Under “Customize the Ribbon”, enable the “Developer” menu.

STEP 3: In Developer > Insert > ActiveX Controls, select Image and draw a rectangle from A1 to J10. (Resize it later.)

STEP 4: By default, this will be called Image1. In any case, note down the name from the Name box on the top left.

STEP 5: In cells A11 onwards, add paths to file names.

STEP 6: Click Developer > Visual Basic (Alt-F11), go to ThisWorkbook, and paste this code:

Option Explicit

Private Sub Workbook_SheetSelectionChange(ByVal Sh As Object, ByVal Target As Excel.Range)
    Dim img As String
    img = Sh.Cells(Target.Row, 1).Value
    If (img <> "" And img <> "file") Then ActiveSheet.Image1.Picture = LoadPicture(img)
End Sub

Replace ActiveSheet.Image1 with ActiveSheet.(whatever) based on your image name in Step 4.

STEP 7: Select Developer > Design Mode. Click on Image1. Then select Developer > Properties. In this panel, under PictureSizeMode, choose 3 - fmPictureSizeModeZoom to fit the picture.

Now scroll through the rows. The images will change.
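For Step 5, the list of file paths can be generated rather than typed. This is a minimal sketch, assuming the images sit directly inside one folder; the function name and the folder in the usage line are hypothetical, and the suffix list is limited to formats I believe LoadPicture can read.

```python
from pathlib import Path

# Formats assumed to be readable by VBA's LoadPicture (PNG is left out on purpose)
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".gif", ".bmp"}


def list_image_paths(folder: str) -> list[str]:
    """Return sorted absolute paths of image files directly inside folder."""
    return sorted(
        str(path.resolve())
        for path in Path(folder).iterdir()
        if path.is_file() and path.suffix.lower() in IMAGE_SUFFIXES
    )


# Usage (hypothetical folder): print one path per line, then paste into column A
# print("\n".join(list_image_paths(r"C:\photos")))
```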

Embeddings similarity threshold

text-embedding-ada-002 used to give high cosine similarity between texts. I used to consider 85% a reasonable threshold for similarity. I almost never got a similarity less than 50%.

text-embedding-3-small and text-embedding-3-large give much lower cosine similarities between texts.

For example, take these 5 words: “apple”, “orange”, “Facebook”, “Jamaica”, “Australia”. Here is the similarity between every pair of words across the 3 models:

For our words, new text-embedding-3-* models have an average similarity of ~43% while the older text-embedding-ada-002 model had ~85%.

Today, I would use 45% as a reasonable threshold for similarity with the newer models. For example, “apple” and “orange” have a similarity of 45-47% while Jamaica and apple have a ~20% similarity.
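Applying such a threshold is a few lines of code. This is a minimal sketch in plain Python: the 3-dimensional vectors are made up for illustration and are not real embeddings, and `similar_pairs` is a hypothetical helper. In practice, the vectors would come from the embedding model.

```python
import math
from itertools import combinations


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def similar_pairs(
    embeddings: dict[str, list[float]], threshold: float = 0.45
) -> list[tuple[str, str, float]]:
    """Return every pair of keys whose cosine similarity meets the threshold."""
    pairs = []
    for (key_a, vec_a), (key_b, vec_b) in combinations(embeddings.items(), 2):
        similarity = cosine_similarity(vec_a, vec_b)
        if similarity >= threshold:
            pairs.append((key_a, key_b, similarity))
    return pairs


# Toy vectors, NOT real embeddings: they only illustrate the mechanics
toy_embeddings = {
    "apple": [1.0, 0.2, 0.0],
    "orange": [0.8, 0.5, 0.1],
    "Jamaica": [0.0, 0.3, 1.0],
}
for key_a, key_b, similarity in similar_pairs(toy_embeddings):
    print(f"{key_a} ~ {key_b}: {similarity:.0%}")
```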

Here’s a notebook with these calculations. Hope that gives you a feel to calibrate similarity thresholds.

LLMs can teach experts

I am a fairly good programmer. So, when I see a problem, my natural tendency is to code.

I’m trying to break that pattern. Instead, I ask ChatGPT.

For example, I asked:

Write a compact 1-line Python expression that checks if user.id ends with @gramener.com or @straive.com

user.id.endswith(('@gramener.com', '@straive.com'))

After 15 years of using Python, I learnt that .endswith() supports tuple suffixes. This has been around since Python 2.5 (released in 2006 — before I knew Python.) The documentation has a tiny sentence in the middle saying “suffix can also be a tuple of suffixes to look for.”
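A short sketch of the same idea in use. The function name and the sample IDs are hypothetical; the tuple behaviour of `.endswith()` and `.startswith()` is standard Python. Note that the suffixes must be a tuple: passing a list raises a TypeError.

```python
ALLOWED_DOMAINS = ("@gramener.com", "@straive.com")


def is_internal_user(user_id: str) -> bool:
    """Check whether the ID ends with any of the allowed domains."""
    return user_id.endswith(ALLOWED_DOMAINS)


print(is_internal_user("someone@gramener.com"))  # True
print(is_internal_user("someone@example.com"))   # False

# str.startswith() accepts a tuple of prefixes in the same way
print("https://example.com".startswith(("http://", "https://")))  # True
```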

I checked with a few colleagues, including Jaidev. They didn’t know it either.

It’s small things like this that made me conclude:

I’m not going to code anymore. ChatGPT will, instead.

Always use value= for dynamic HTML options

Even after 30 years of HTML, I learn new things about it.

This Monday morning, I woke up to a mail from Sundeep saying requests for a Data Engineer - AWS/Azure/GCP in our internal fulfilment portal raised an error.

My guess was one of these:

  1. The “/” in the role is causing a problem. (Developer mistake.)
  2. The role exists in one table but not the other. (Recruitment team mistake.)
  3. The application wasn’t set up / restarted properly. (IT mistake.)

All three were wrong. So I dug deeper.

The role was defined as Data Engineer  - AWS/Azure/GCP (note the 2 spaces before the hyphen). But the form kept sending Data Engineer - AWS/Azure/GCP (spaces were condensed).

I swear there was NOTHING in the code that changes the options. The relevant line just picked up the role and rendered it inside the <select>:

<option>{{ row['Role'] }}</option>

I used the browser’s developer tools to inspect the <select> element. It showed the options with the 2 spaces:

<option>Data Engineer  - AWS/Azure/GCP</option>

But, when I selected it and printed the value, it had only one space.

> console.log(document.querySelector("#role").value)
'Data Engineer - AWS/Azure/GCP'

That’s when it hit me. HTML collapses whitespace.

Until now, I only ever used <option value=""> when specifying a value different from what’s displayed. I never thought of using it to preserve the value.

LESSON: If you’re dynamically generating <option>s, ALWAYS use value= with the same value as the text.
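On the server side, the lesson can be sketched in Python. This is not the portal's actual template code: `render_option` and `collapse_whitespace` are hypothetical helpers, and `collapse_whitespace` only approximates what a browser reports as the value of an <option> that has no value attribute.

```python
from html import escape


def render_option(text: str) -> str:
    """Render an <option> whose value attribute preserves the text exactly."""
    safe = escape(text, quote=True)
    return f'<option value="{safe}">{safe}</option>'


def collapse_whitespace(text: str) -> str:
    """Approximate the value a browser derives from an option's text alone."""
    return " ".join(text.split())


role = "Data Engineer  - AWS/Azure/GCP"   # note the 2 spaces before the hyphen
print(render_option(role))
print(collapse_whitespace(role) == role)  # False: the double space is lost
```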

Cyborg scraping

LinkedIn has a page that shows the people who most recently followed you.

At first, it shows just 20 people. But as you scroll, it keeps fetching the rest. I’d love to get the full list on a spreadsheet. I’m curious about:

  1. What kind of people follow me?
  2. Which of them has the most followers?
  3. Who are my earliest followers?

But first, I need to scrape this list. Normally, I’d spend a day writing a program. But I tried a different approach yesterday.

Aside: it’s easy to get bored in online meetings. I have a surplus of partially distracted time. So rather than writing code to save me time, I’d rather create simple tasks to keep me occupied. Like scrolling.

So here’s my workflow to scrape the list of followers.

Step 1: Keep scrolling all the way to the bottom until you get all followers.

Step 2: Press F12, open the Developer Tools – Console, and paste this code.

copy($$('.follows-recommendation-card').map(v => {
  let name = v.querySelector('.follows-recommendation-card__name')
  let headline = v.querySelector('.follows-recommendation-card__headline')
  let subtext = v.querySelector('.follows-recommendation-card__subtext')
  let link = v.querySelector('.follows-recommendation-card__avatar-link')
  let followers = '', match
  if (subtext) {
    if (match = subtext.innerText.match(/([\d\.K]+) follower/)) {
      followers = match[1]
    } else if (match = subtext.innerText.match(/([\d\.K]+) other/)) {
      followers = match[1]
    }
  }
  followers = followers.match(/K$/) ? parseFloat(followers) * 1000 : parseFloat(followers)
  return {
    name: name ? name.innerText : '',
    headline: headline ? headline.innerText : '',
    followers: followers,
    link: link ? link.href : ''
  }
}))

Step 3: The name, headline, followers and link are now in the clipboard as JSON. Visit https://www.convertcsv.com/json-to-csv.htm and paste it in “Select your input” under “Enter Data”.

Step 4: Click on the “Download Result” button. The JSON is converted into a CSV you can load into a spreadsheet.
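If you’d rather not paste the data into a website, Steps 3 and 4 can be done locally. This is a minimal sketch using only the standard library: it assumes every record has the same keys, as the script above produces, and the function and file names are hypothetical.

```python
import csv
import io
import json


def json_to_csv(json_text: str) -> str:
    """Convert a JSON array of flat objects into CSV text."""
    rows = json.loads(json_text)
    if not rows:
        return ""
    buffer = io.StringIO()
    # Column order follows the keys of the first record
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()), lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Usage (hypothetical file names): save the clipboard as followers.json, then
# from pathlib import Path
# csv_text = json_to_csv(Path("followers.json").read_text(encoding="utf-8"))
# Path("followers.csv").write_text(csv_text, encoding="utf-8")
```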

I call this “Cyborg scraping”. I do half the work (scrolling, copy-pasting, etc.). The code does the other half. It’s manual. It’s a bit slow. But it gets the job done, quick and dirty.

I’ll share later what I learned about my followers. For now, I’m looking forward to meetings 😉

PS: A similar script to scrape LinkedIn invitations is below. You can only see 100 invitations per page, though.

copy($$('.invitation-card').map(v => ({
  name: (v.querySelector('.invitation-card__title') || {}).innerText || '',
  link: (v.querySelector('.invitation-card__link') || {}).href || '',
  subtitle: (v.querySelector('.invitation-card__subtitle') || {}).innerText || '',
  common: (v.querySelector('.member-insights__count') || {}).innerText || '',
  message: (v.querySelector('.invitation-card__custom-message') || {}).innerText || '',
})))

PS: A similar script to scrape LinkedIn people search results is below.

copy($$('.entity-result').map(v => {
  const name = v.querySelector('.entity-result__title-text [aria-hidden="true"]');
  const link = v.querySelector('a');
  const badge = v.querySelector('.entity-result__badge [aria-hidden="true"]');
  const title = v.querySelector('.entity-result__primary-subtitle');
  const subtitle = v.querySelector('.entity-result__secondary-subtitle');
  const summary = v.querySelector('.entity-result__summary--2-lines');
  const insight = v.querySelector(".entity-result__simple-insight-text");
  return {
    name: name?.innerText || '',
    link: (link?.href || '').split('?')[0],
    badge: badge?.innerText || '',
    title: title?.innerText || '',
    subtitle: subtitle?.innerText || '',
    summary: summary?.innerText || '',
    insight: insight?.innerText || '',
  }
}))

Releasing modified mosquitoes precisely

At PyCon Indonesia, I spoke about a project we worked on with the World Mosquito Program.

The World Mosquito Program (WMP) modifies mosquitoes with a bacterium called Wolbachia. This reduces their ability to carry deadly viruses. (It makes me perversely happy that we’re infecting mosquitoes now 😉.)

Modifying mosquitoes is an expensive process. With a limited set of “good mosquitoes”, it is critical to find the best release points that will help them replicate rapidly.

But planning the release points took weeks of manual effort. It involved ground personnel going through several iterations.

So our team took high-resolution satellite images, figured out the building density, estimated population density based on that, and generated a release plan. This model is 70% more accurate and cut the planning time from 3 weeks to 2 hours.

More details at the Gramener website.

The slides for the talk are below.