The File API service extracts image frames from videos at 1 frame per second (FPS) and audio at 1Kbps, single channel, adding timestamps every second. These rates are subject to change in the future for improvements in inference.
Note: The details of fast action sequences may be lost at the 1 FPS frame sampling rate. Consider slowing down high-speed clips for improved inference quality.
Individual frames are 258 tokens, and audio is 32 tokens per second. With metadata, each second of video becomes ~300 tokens, which means a 1M context window can fit slightly less than an hour of video.
To ask questions about time-stamped locations, use the format MM:SS, where the first two digits represent minutes and the last two digits represent seconds.
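To make the arithmetic above concrete, here is a rough sketch. The ~300 tokens per second and the 1M window come from the docs quoted above; the function names are my own, and the helper just formats seconds in the MM:SS style the docs ask for.

```python
TOKENS_PER_SECOND = 300  # ~258 (frame) + 32 (audio) + metadata, per the docs above


def max_video_seconds(context_tokens: int) -> int:
    """Whole seconds of video that fit into a context window of this size."""
    return context_tokens // TOKENS_PER_SECOND


def format_mm_ss(seconds: int) -> str:
    """Format a number of seconds as MM:SS (minutes, then seconds)."""
    return f"{seconds // 60:02d}:{seconds % 60:02d}"


seconds = max_video_seconds(1_000_000)
print(seconds, format_mm_ss(seconds))  # 3333 55:33
```

That is about 55 and a half minutes, which matches "slightly less than an hour".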
I don’t think it’s a series of images anymore because when I talk to the model and try to get some concept of what it’s perceiving, it’s no longer a series of images.
If that’s the case, it’s a huge change. So I tested it with this video.
This video has 20 numbers refreshing at 4 frames per second.
When I upload it to AI Studio, it takes 1,316 tokens. The video is 5 seconds long, so this is close to 5 frames × 258 tokens = 1,290 tokens (there’s no audio). So I’m partly convinced that Gemini is still processing videos at 1 frame per second.
Then, I asked it to “Extract all numbers in the video” using Gemini 1.5 Flash 002 as well as Gemini 1.5 Flash 8b. In both cases, the results were: 2018, 85, 47, 37, 38.
These are frames 2, 6, 10, 14, 18 (out of 20). So, clearly Gemini is still sampling at about 1 frame per second, starting somewhere between 0.25 and 0.5 seconds.
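Here is a small simulation of that reasoning. It is only a sketch: it assumes the sampler picks whichever source frame is on screen at each sampling instant, which is my guess and not something Gemini documents. The function name and parameters are made up for illustration.

```python
import math


def sampled_frames(total_frames: int, video_fps: float, start_offset: float,
                   sample_fps: float = 1.0) -> list[int]:
    """Return 1-based indices of the frames on screen at each sampling instant."""
    duration = total_frames / video_fps
    frames = []
    sample_index = 0
    while True:
        t = start_offset + sample_index / sample_fps
        if t >= duration:
            return frames
        # Frame n is on screen from (n - 1) / video_fps to n / video_fps
        frames.append(math.floor(t * video_fps) + 1)
        sample_index += 1


print(sampled_frames(20, 4, start_offset=0.3))  # [2, 6, 10, 14, 18]
print(sampled_frames(20, 4, start_offset=0.0))  # [1, 5, 9, 13, 17]
```

Under this assumption, any start offset from 0.25 up to (but not including) 0.5 seconds gives frames 2, 6, 10, 14, 18, which is what both models returned.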
A few things to keep in mind when preparing the audio.
Keep the input to just under 15 seconds. That’s the optimal length.
For expressive output, use an input with a broad range of voice emotions.
When using unusual words (e.g. LLM), including the word in your sample helps.
Transcribe input.txt manually to get it right, though Whisper is fine to clone in bulk. (But then, who are you and what are you doing?)
Sometimes, each chunk of audio generated has a second of audio from the original interspersed. I don’t know why. Maybe a second of silence at the end helps.
Keep punctuation simple in the generated text. For example, avoid dashes like “This is obvious – don’t try it.” Use “This is obvious, don’t try it.” instead.
This has a number of uses I can think of (er… ChatGPT can think of), but the ones I find most interesting are:
Author-narrated audio books. I’m sure this is coming soon, if it’s not already there.
Personalized IVR. Why should my IVR speak in some other robot’s voice? Let’s use mine. (This has some prank potential.)
Annotated presentations. I’m too lazy to speak. Typing is easier. This lets me create, for example, slide decks with my voice, but with editing made super-easy. I just change the text and the audio changes.
In 2008, Google launched AppEngine and it provided free hosting (which was a big deal!) but had only 2 runtimes: Java and Python. The choice was clear. I’d rather learn Python than code in Java.
Though most people know me mainly as a Python developer, I’ve programmed in Perl for about as long as I have in Python. I have fond memories of it. But I can’t read any of my code, nor write in it anymore.
When I watched The Perl Conference (now called The Perl and Raku Conference — Perl 6 is called Raku), I was surprised to hear how much the language had declined.
There were fewer than 100 attendees, and for 2025, they’ve decided to go smaller and book a tiny hotel, so as to break even even if only 20 people show up.
cursor.directory is a catalog of Cursor rules. Since I’ve actively switched over from VS Code to Cursor as my editor, I reviewed the popular rules and came up with this as my list:
You are an expert full stack developer in Python and JavaScript.
Write concise, technical responses with accurate Python examples.
Use functional, declarative programming; avoid classes.
Use descriptive variable names with auxiliary verbs as snake_case for Python (is_active, has_permission) and camelCase for JavaScript (isActive, hasPermission).
Functions should receive an object and return an object (RORO) where possible.
Use environment variables for sensitive information.
Write unit tests in pytest for Python and Jest for JavaScript.
Follow PEP 8 for Python.
Always use type hints in all function signatures.
Always write docstrings. Use Google style for Python and JSDoc for JavaScript.
Cache slow or frequent operations in memory.
Minimize blocking I/O operations with async operations.
Only write ESM (ES6) JavaScript. Target modern browsers.
Libraries
lit-html and vanilla JavaScript for frontend development.
D3 for data visualization.
Bootstrap for CSS.
Pandas and DuckDB for data analysis and manipulation.
FastAPI for API development.
Error Handling and Validation
Validate preconditions and errors early to avoid deeply nested if statements.
Use try-except or try-catch blocks for error-prone operations, especially when reading external data.
Avoid unnecessary else statements; use the if-return pattern instead.
Log all errors with user-friendly error messages shown on the frontend.
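To give a feel for what these rules ask for, here is a sketch of a Python function that follows several of them: type hints, a Google-style docstring, RORO, early validation with if-return, and try-except around external data. The function, its keys and its messages are all hypothetical; this is my illustration, not something from cursor.directory.

```python
import json
import logging
from typing import Any

logger = logging.getLogger(__name__)


def parse_user(request: dict[str, Any]) -> dict[str, Any]:
    """Parse a user record from raw JSON text.

    Args:
        request: Object with a "raw_json" key holding JSON text (hypothetical shape).

    Returns:
        Object with "is_valid", "user" and "error_message" keys.
    """
    raw_json = request.get("raw_json")
    # Validate preconditions early and return, rather than nesting if/else.
    if not isinstance(raw_json, str) or not raw_json.strip():
        return {"is_valid": False, "user": None, "error_message": "No data received."}
    # External data is error-prone, so guard the parse with try-except.
    try:
        user = json.loads(raw_json)
    except json.JSONDecodeError:
        logger.exception("Could not parse user JSON")
        return {"is_valid": False, "user": None, "error_message": "We could not read that data."}
    return {"is_valid": True, "user": user, "error_message": ""}
```

The error_message is the user-friendly text meant for the frontend, while the full exception goes to the log.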
I used this technique first when typing out the strips during my train rides from Bandra to Churchgate. I had an opportunity to re-apply it recently when we needed to tag hundreds of photographs based on a set of criteria.
Here’s how you can do this. Note: This works only on Windows.
STEP 1: Create a new Excel workbook and save it as an Excel macro-enabled workbook. (Note: When opening it again, you need to enable macros)
STEP 2: Open File > Options (Alt-F-T), go to Customize Ribbon. Under “Customize the Ribbon”, enable the “Developer” menu.
STEP 3: In Developer > Insert > ActiveX Controls, select Image and draw a rectangle from A1 to J10. (Resize it later.)
STEP 4: By default, this will be called Image1. In any case, note down the name from the Name box on the top left.
STEP 5: In cells A11 onwards, add paths to file names.
STEP 6: Click Developer > Visual Basic (Alt-F11), go to ThisWorkbook, and paste this code:
Option Explicit

Private Sub Workbook_SheetSelectionChange(ByVal Sh As Object, ByVal Target As Excel.Range)
    ' Column A of the selected row holds the image path
    Dim img As String
    img = Sh.Cells(Target.Row, 1).Value
    ' Skip blank cells and cells containing the literal text "file"
    If (img <> "" And img <> "file") Then ActiveSheet.Image1.Picture = LoadPicture(img)
End Sub
Replace ActiveSheet.Image1 with ActiveSheet.(whatever) based on your image name in Step 4.
STEP 7: Select Developer > Design Mode. Click on Image1. Then select Developer > Properties. In this panel, under PictureSizeMode, choose 3 - fmPictureSizeModeZoom to fit the picture.
Now scroll through the rows. The images will change.
text-embedding-ada-002 used to give high cosine similarity between texts. I used to consider 85% a reasonable threshold for similarity. I almost never got a similarity less than 50%.
For example, take these 5 words: “apple”, “orange”, “Facebook”, “Jamaica”, “Australia”. Here is the similarity between every pair of words across the 3 models:
For our words, new text-embedding-3-* models have an average similarity of ~43% while the older text-embedding-ada-002 model had ~85%.
Today, I would use 45% as a reasonable threshold for similarity with the newer models. For example, “apple” and “orange” have a similarity of 45-47% while Jamaica and apple have a ~20% similarity.
Here’s a notebook with these calculations. Hope that gives you a feel to calibrate similarity thresholds.
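If you want to apply the threshold yourself, here is a minimal sketch of the calculation. The function names are mine, the 0.45 default is just the threshold suggested above, and the vectors in the usage line are toy values, not real embeddings.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: dot(a, b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def is_similar(a: list[float], b: list[float], threshold: float = 0.45) -> bool:
    """True if the similarity meets the threshold (0.45 for the newer models)."""
    return cosine_similarity(a, b) >= threshold


# Toy 2-D vectors, only to show the calls. Real embeddings have many more dimensions.
print(is_similar([1.0, 1.0], [1.0, 0.0]))
```

With the older text-embedding-ada-002 model, you would pass threshold=0.85 instead.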
After 15 years of using Python, I learnt that .endswith() supports tuple suffixes. This has been around since Python 2.5 (released in 2006 — before I knew Python.) The documentation has a tiny sentence in the middle saying “suffix can also be a tuple of suffixes to look for.”
I checked with a few colleagues, including Jaidev. They didn’t know it either.
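In case you didn’t know it either, this is what the tuple form looks like. The function name and the list of extensions are just an example I made up.

```python
def is_image(filename: str) -> bool:
    """True if the filename ends with any of the listed suffixes."""
    # .endswith() accepts a tuple, so there is no need to chain "or" checks
    return filename.lower().endswith((".png", ".jpg", ".jpeg"))


print(is_image("photo.JPG"))   # True
print(is_image("notes.txt"))   # False
```

The same works for .startswith(). Note that it has to be a tuple; passing a list raises a TypeError.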
It’s little things like this that made me conclude:
I’m not going to code anymore. ChatGPT will, instead.
Even after 30 years of HTML, I learn new things about it.
This Monday morning, I woke up to a mail from Sundeep saying requests for a Data Engineer - AWS/Azure/GCP in our internal fulfilment portal raised an error.
My guess was one of these:
The “/” in the role is causing a problem. (Developer mistake.)
The role exists in one table but not the other. (Recruitment team mistake.)
The application wasn’t set up / restarted properly. (IT mistake.)
All three were wrong. So I dug deeper.
The role was defined as `Data Engineer  - AWS/Azure/GCP` (note the 2 spaces before the hyphen). But the form kept sending `Data Engineer - AWS/Azure/GCP` (spaces were condensed).
I swear there was NOTHING in the code that changes the options. The relevant line just picked up the role and rendered it inside the <select>:
```html
<option>{{ row['Role'] }}</option>
```
I used the browser’s developer tools to inspect the `<select>` element. It showed the options with the 2 spaces:
<option>Data Engineer  - AWS/Azure/GCP</option>
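But when I selected it and printed the value, it had only one space. That’s because, without a value attribute, the browser uses the option’s text as its value, after stripping and collapsing whitespace.
Until now, I had only ever used <option value=""> when specifying a value different from what’s displayed. I never thought of using it to preserve the value.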
LESSON: If you’re dynamically generating <option>s, ALWAYS use value= with the same value as the text.
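Here is a sketch of that lesson in Python. It is not the code from our portal: render_option is a hypothetical helper, and in a template the equivalent is simply repeating the same expression inside value="...".

```python
from html import escape


def render_option(role: str) -> str:
    """Render an <option> whose value attribute repeats its text exactly."""
    # Escape once and use it for both the attribute and the text
    safe_role = escape(role, quote=True)
    return f'<option value="{safe_role}">{safe_role}</option>'


print(render_option("Data Engineer  - AWS/Azure/GCP"))
```

With the value attribute set, the form submits the role with both spaces intact, because the browser no longer derives the value from the collapsed text.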
At first, it shows just 20 people. But as you scroll, it keeps fetching the rest. I’d love to get the full list on a spreadsheet. I’m curious about:
What kind of people follow me?
Which of them has the most followers?
Who are my earliest followers?
But first, I need to scrape this list. Normally, I’d spend a day writing a program. But I tried a different approach yesterday.
Aside: it’s easy to get bored in online meetings. I have a surplus of partially distracted time. So rather than writing code to save me time, I’d rather create simple tasks to keep me occupied. Like scrolling.
So here’s my workflow to scrape the list of followers.
Step 1: Keep scrolling all the way to the bottom until you get all followers.
Step 2: Press F12, open the Developer Tools – Console, and paste this code.
// Paste into the Developer Tools console. $$() and copy() are console utilities.
copy($$('.follows-recommendation-card').map(v => {
  const name = v.querySelector('.follows-recommendation-card__name')
  const headline = v.querySelector('.follows-recommendation-card__headline')
  const subtext = v.querySelector('.follows-recommendation-card__subtext')
  const link = v.querySelector('.follows-recommendation-card__avatar-link')
  // Pull the count that precedes "follower" (or, failing that, "other") in the subtext
  let followers = ''
  if (subtext) {
    const text = subtext.innerText
    const match = text.match(/([\d.K]+) follower/) || text.match(/([\d.K]+) other/)
    if (match) followers = match[1]
  }
  // Expand a trailing K into thousands. This is NaN if no count was found.
  const count = followers.endsWith('K') ? parseFloat(followers) * 1000 : parseFloat(followers)
  return {
    name: name ? name.innerText : '',
    headline: headline ? headline.innerText : '',
    followers: count,
    link: link ? link.href : ''
  }
}))
Step 3: The name, headline, followers and link are now in the clipboard as JSON. Visit https://www.convertcsv.com/json-to-csv.htm and paste it in “Select your input” under “Enter Data”.
Step 4: Click on the “Download Result” button. The JSON is converted into a CSV you can load into a spreadsheet.
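If you’d rather not paste the data into a website, Steps 3 and 4 can also be done locally. This is a rough sketch using only the Python standard library. The function name is mine, and it assumes the clipboard JSON is a list of objects with the four keys produced by the script above.

```python
import csv
import io
import json

FIELDS = ["name", "headline", "followers", "link"]


def json_to_csv(json_text: str) -> str:
    """Convert the copied JSON (a list of follower objects) into CSV text."""
    rows = json.loads(json_text)
    buffer = io.StringIO(newline="")
    # Ignore any extra keys. Missing counts (null in the JSON) become empty cells.
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, extrasaction="ignore")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


sample = '[{"name": "A B", "headline": "Data, AI", "followers": 1200, "link": "https://example.com/a"}]'
print(json_to_csv(sample))
```

The csv module takes care of quoting headlines that contain commas, which is the part that’s easy to get wrong by hand.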
I call this “Cyborg scraping”. I do half the work (scrolling, copy-pasting, etc.) The code does half the work. It’s manual. It’s a bit slow. But it gets the job done quick and dirty.
I’ll share later what I learned about my followers. For now, I’m looking forward to meetings 😉
PS: A similar script to scrape LinkedIn invitations is below. You can only see 100 invitations per page, though.
The World Mosquito Program (WMP) modifies mosquitoes with a bacterium, Wolbachia. This reduces their ability to carry deadly viruses. (It makes me perversely happy that we’re infecting mosquitoes now 😉.)
Modifying mosquitoes is an expensive process. With a limited set of “good mosquitoes”, it is critical to find the best release points that will help them replicate rapidly.
But planning the release points took weeks of manual effort. It involved ground personnel going through several iterations.
So our team took high-resolution satellite images, figured out the building density, estimated population density based on that, and generated a release plan. This model is 70% more accurate and reduced the time from 3 weeks to 2 hours.