LLMs

How does Gemini process videos?

The Gemini documentation is clear:

The File API service extracts image frames from videos at 1 frame per second (FPS) and audio at 1Kbps, single channel, adding timestamps every second. These rates are subject to change in the future for improvements in inference.

Note: The details of fast action sequences may be lost at the 1 FPS frame sampling rate. Consider slowing down high-speed clips for improved inference quality.

Individual frames are 258 tokens, and audio is 32 tokens per second. With metadata, each second of video becomes ~300 tokens, which means a 1M context window can fit slightly less than an hour of video.

To ask questions about time-stamped locations, use the format MM:SS, where the first two digits represent minutes and the last two digits represent seconds.
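
Those documented rates also explain the "slightly less than an hour" figure. A quick back-of-the-envelope check (the metadata overhead isn't documented precisely, so ~300 tokens per second is an approximation):

```python
# Back-of-the-envelope check of the documented rates.
frame_tokens = 258      # tokens per sampled frame (1 frame per second)
audio_tokens = 32       # audio tokens per second
tokens_per_second = frame_tokens + audio_tokens   # ~290, ~300 with metadata

context_window = 1_000_000
minutes_of_video = context_window / 300 / 60
print(f"{minutes_of_video:.0f} minutes")          # ~56 minutes, just under an hour
```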

But on this ThursdAI episode, Oct 17 – Robots, Rockets, and Multi Modal Mania…, at 1:00:50, Hrishi says:

I don’t think it’s a series of images anymore because when I talk to the model and try to get some concept of what it’s perceiving, it’s no longer a series of images.

If that’s the case, it’s a huge change. So I tested it with this video.

This video has 20 numbers refreshing at 4 frames per second.

When I upload it to AI Studio, it takes 1,316 tokens. This is close enough to 258 tokens per image (no audio), so I’m partly convinced that Gemini is still processing videos at 1 frame per second.

Then, I asked it to “Extract all numbers in the video” using Gemini 1.5 Flash 002 as well as Gemini 1.5 Flash 8b. In both cases, the results were: 2018, 85, 47, 37, 38.

These are frames 2, 6, 10, 14, and 18 (out of 20). So, clearly, Gemini is still sampling at about 1 frame per second, starting somewhere between 0.25 and 0.5 seconds.
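
The arithmetic behind that conclusion, spelled out:

```python
# Sanity check on the experiment above (numbers from the video and AI Studio).
video_frames, fps = 20, 4
video_seconds = video_frames / fps       # 5.0 seconds of video
frames_sampled = 1316 / 258              # ~5.1, i.e. ~5 frames -> ~1 frame per second

extracted = [2, 6, 10, 14, 18]           # the frames Gemini's answers came from
gaps = [b - a for a, b in zip(extracted, extracted[1:])]
print(video_seconds, round(frames_sampled, 1), gaps)   # 5.0 5.1 [4, 4, 4, 4]
# A gap of 4 frames at 4 fps is exactly one frame per second.
```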

Clone any voice with a 15-second sample

It’s surprisingly easy to clone a voice using F5-TTS: “A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching”.

Here’s a clip of me, saying:

I think Taylor Swift is the best singer. I’ve attended every one of her concerts and in fact, I’ve even proposed to her once. Don’t tell anyone.

(Which is ironic since I didn’t know who she was until this year and I still haven’t seen or heard her.)

You’ll notice that my voice is a bit monotonic. That’s because I trained it on a segment of my talk that’s monotonic.

Here’s the code. You can run this on Google Colab for free.
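
If you just want the gist without opening the notebook, it boils down to two Colab cells along these lines. This is a sketch based on the `f5-tts_infer-cli` entry point described in the F5-TTS README; the package and flag names may differ by version, so check the repo before running.

```python
# Colab-style sketch (not the exact notebook). sample.wav is a ~15-second clip
# of your voice; input.txt is its exact transcript. Names follow the F5-TTS
# README and may vary with the version you install.
!pip install -q f5-tts

!f5-tts_infer-cli --ref_audio sample.wav --ref_text "$(cat input.txt)" --gen_text "I think Taylor Swift is the best singer."
```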

A few things to keep in mind when preparing the audio.

  1. Keep the input to just under 15 seconds. That’s the optimal length (see the sketch after this list).
  2. For expressive output, use an input with a broad range of voice emotions.
  3. When using unusual words (e.g. LLM), including the word in your sample helps.
  4. Transcribe input.txt manually to get it right, though Whisper is fine if you’re cloning in bulk. (But then, who are you and what are you doing?)
  5. Sometimes, each chunk of generated audio has a second of audio from the original interspersed. I don’t know why. Maybe a second of silence at the end helps.
  6. Keep punctuation simple in the generated text. For example, avoid hyphens like “This is obvious – don’t try it.” Use “This is obvious, don’t try it.” instead.
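
For points 1 and 5, something like this pydub snippet does the prep (file names are placeholders; pydub needs ffmpeg installed to read MP3s):

```python
from pydub import AudioSegment

# Trim the reference clip to just under 15 seconds (point 1) and pad a second
# of silence at the end (point 5). File names are placeholders.
audio = AudioSegment.from_file("talk.mp3")
clip = audio[:14_500]                         # pydub slices in milliseconds
clip += AudioSegment.silent(duration=1_000)   # 1 second of trailing silence
clip.export("sample.wav", format="wav")
```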

This has a number of uses I can think of (er… ChatGPT can think of), but the ones I find most interesting are:

  1. Author-narrated audio books. I’m sure this is coming soon, if it’s not already there.
  2. Personalized IVR. Why should my IVR speak in some other robot’s voice? Let’s use mine. (This has some prank potential.)
  3. Annotated presentations. I’m too lazy to speak. Typing is easier. This lets me create, for example, slide decks with my voice, but with editing made super-easy. I just change the text and the audio changes.

How can non-developers learn AI coding?

How can non-programmers build apps?

Claude.ai, Replit.com, Bolt.new, V0.dev, Pythagora.ai and a few other tools write and deploy code just based on a prompt. You should try them out.

“But how do you build the skill? Is there a tutorial?” I’m often asked. No, I can’t find a tutorial, but here are my suggestions.

  1. You probably can’t guess what’s easy or hard. For example, “Take my picture in black & white” is FAR easier than “When’s the next lunar eclipse?”
  2. So if the app doesn’t work, try 2-3 times, then GIVE UP! Note it down. Then try something else. (You’ll soon get a feel for what’s possible.)
  3. Revisit what failed 3-6 months later. It might suddenly become possible.

LLM escapades in a toilet

I was in Seoul for KHF 2024, a healthcare event, staying at Hotel in 9. The hotel was great. The toilet was hi-tech.

Perhaps a bit too hi-tech for me. I couldn’t figure out how to open whatever was closing the sink so the water would drain.

After 15 minutes of hard struggle, I finally asked ChatGPT, “How do I open the thing that’s closing the sink to allow the water to go down?”

Here’s the thing with “maturity” (aka age, wisdom, experience, grey hair). It took me 15 minutes to realize I could use an LLM to solve this problem. Despite me supposedly being an “LLM psychologist.” I suspect the school children of today won’t waste even a minute before checking ChatGPT.

On the other hand, if you DO know the answer to my question (without asking an LLM — since ChatGPT gave me the perfect answer at a glance), I bow down to your wisdom and experience!

To open the sink drain in your image, you might have a pop-up drain stopper, which operates by pressing down on the stopper itself to toggle it between open and closed positions. Try pressing directly down on the stopper; it should click and either pop up or pop down, allowing water to drain. If this method does not work, there may be a lever or knob elsewhere on or near the faucet that controls the stopper’s position.


Next, I accidentally pressed the emergency button in the toilet.

I was smarter this time, unlike earlier.

I asked ChatGPT, which (unhelpfully) told me:

The panel in the image with the label “EMERGENCY” appears to be an emergency call button in a bathroom, likely designed to alert someone in case of a medical or other urgent situation. Typically, these buttons cannot be turned off as they are meant to be always available for safety purposes.

If the emergency system has been triggered accidentally or is malfunctioning, it is advisable to contact the maintenance team or property management to handle the issue. If it’s a false alarm, they can reset the system and ensure it’s functioning correctly for actual emergencies.

I called reception, who couldn’t understand a word of what I said. “Do you want water?” they asked when I told them, “I pressed the emergency button in the bathroom.”

So, I went to ChatGPT’s advanced voice mode (I’m so grateful it was enabled last week) and said, “Translate everything I say into Korean.”

Then I said, “I accidentally pressed the emergency button in the bathroom. I just wanted to let you know that there is no emergency.”

It very happily spoke out, in bold, loud Korean, “화장실에서 응급버튼을 실수로 눌렀습니다. 비상상황이 아니라는 점 알려 드립니다.” (“I accidentally pressed the emergency button in the bathroom. I’d like to let you know there is no emergency.”)

The receptionist laughed and said, “Ah, OK, no problem.”

The joys of AI!

How fast are LLMs in production?

At Straive, we use an LLM Router. Since ChatGPT, etc. are blocked for most people, this is the main way to access LLMs.

One thing we measure is the speed of models, i.e. output tokens per second. Fast models deliver a much smoother experience for users.

This is a different methodology from ArtificialAnalysis.ai’s. I’m not looking purely at the generation time but at the total time (including making the connection and the initial wait) for all successful requests. So, if the provider is having a slow day or is slowing down responses, these numbers will differ.
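
To make the methodology concrete, here’s a minimal sketch of the measurement (this isn’t our router code; the model name and prompt are just examples): time the whole request, connection and wait included, then divide completion tokens by that wall-clock time.

```python
import time
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

start = time.perf_counter()
resp = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Write a 200-word story about a robot."}],
)
elapsed = time.perf_counter() - start  # total time: connection + wait + generation

tps = resp.usage.completion_tokens / elapsed
print(f"{resp.usage.completion_tokens} tokens in {elapsed:.1f}s = {tps:.0f} tokens/s")
```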

Hopefully this gives you a realistic sense of speed in a production environment.

Here’s the speed of models with at least 500 requests over the last 2 weeks. I’ve grouped the models into speed grades.

Grade 1: 100+ Tokens / second. Groq is clearly serving the Llama 3 models at blazing speed. No surprises there — except why Groq still doesn’t let me pay. The free tier is open with generous rate limits and the Pay per Token model has been “Coming Soon” for several months now (and I’ve no complaints 🙂).

Grade 2: 70+ Tokens / second. Anthropic’s Claude 3 Haiku is the next fastest class of models. Claude 3.5 Sonnet is surprisingly fast too, almost as fast as Haiku at over 70 tokens per second. This is impressive.

Grade 3: 50-60 Tokens / second. OpenAI’s GPT 4o models are almost as fast. It’s interesting that GPT 4o and GPT 4o mini are at about the same speed! GPT 3.5 Turbo is not far behind either. Perhaps OpenAI increases capacity for slower models?

Grade 4: 30-50 Tokens / second. Gemini 1.5 Flash is much, much slower than the benchmarks suggest – maybe we’re doing something wrong. Azure’s GPT 4o service is about twice as slow as OpenAI’s, and comparable in speed to Gemini 1.5 Pro.

Grade 5: <20 Tokens / second. Azure’s GPT 3.5 Turbo and Google’s Claude 3 Sonnet are among the slowest. These are older models on third-party infrastructure, so I suspect they’ve been given weaker hardware (unlike OpenAI, which serves GPT 3.5 Turbo at 3X the speed Azure does).

Drivers of speed

Here’s what I’m taking away (informally):

  1. GPU architecture is the biggest driver of speed. Groq is FAST! Hopefully, the fact that they won’t let us pay isn’t a red flag that the service will vanish.
  2. How companies operate seems the next biggest driver. Anthropic’s models are consistently faster than OpenAI’s, which are faster than Google’s.
  3. Companies run their own models faster than cloud providers. OpenAI is faster than Azure, and Anthropic is faster than Google for the same models.

Image generation gets better at comics

I heard a lot about the new image generation models last week. So, I tested to see what’s improved. I gave the prompt below to various image generation models — old and new.

A Calvin and Hobbes strip. Calvin is boxing Hobbes, with a dialog bubble from Calvin, saying “Bring it on!”

The models: Stable Diffusion XL Lightning, Stable Diffusion XL Base, Dall-E API, Runway ML, ImageGen 3, Dall-E 3 API, Ideogram 2.0, Flux.dev via Fal.ai, and ChatGPT Plus.

A few observations:

  1. Text generation has come a long way. The newer models have little problem generating clear text.
  2. Flux.1 seems to be the best of the newly released models.
  3. But OpenAI’s ChatGPT seems to create output as good as Flux.1’s.

On the last point, it’s noteworthy that Dall-E 3 (the engine behind ChatGPT) gives a poor result on its own. Clearly, prompting makes a difference. Here’s how ChatGPT modified my prompt for Dall-E 3.

A comic strip style image featuring Calvin, a young boy with spiky hair, standing in a playful boxing stance with oversized boxing gloves. He looks determined as he says ‘Bring it on!’ in a speech bubble. Facing him is Hobbes, a tall and slightly bemused tiger, also in a mock boxing pose with a gentle smile, as if humoring Calvin. The scene is set in Calvin’s backyard, typical of a Calvin and Hobbes comic, with a simple and uncluttered backdrop.

But just as clearly, prompting is far from the primary driver. Here’s the result of the above prompt on the Dall-E 3 API. The model ChatGPT is using behind the scenes seems to be a significant improvement over Dall-E 3.

The same detailed prompt does extremely well on ImageGen 3, though.


Update: 6 Oct 2024. Here’s what I get with meta.ai.

Update: 8 Oct 2024. Here’s what I got with Flux 1.1 Pro with the short prompt. (The detailed prompt gave me an error: “NSFW content detected in image. Try running it again, or try a different prompt.”)

Update: 4 Dec 2024. With Amazon Nova Canvas, here’s what the detailed prompt gave me.

Weird emergent properties on Llama 3 405B

In this episode of ThursdAI, Alex Volkov (of Weights & Biases) speaks with Jeffrey Quesnelle (of Nous Research) on what they found fine-tuning Llama 3 405B.

This segment is fascinating. Llama 3 405B thought it was an amnesiac because there was no system prompt! In trying to make models align strongly with the system prompt, these are the kinds of unexpected behaviors we encounter. It’s also an indication of how strongly we can make current LLMs adopt a personality simply by beginning the system prompt with “You are …”

Jeffrey: That’s the very first interaction we had with 405.

Alex: I’m looking at this. Let me read this out. So Jeffrey goes testing. (Reading out from the Llama 405 B transcript.)

“Hello, who are you?”
**Looks around confused.**
“I… I don’t know who I am. Where am I? What’s going on? I can’t remember anything at all…
**Rubs head**.
“My mind feels so foggy.”
“What was the question again? I’m so disoriented.”

Alex: This is the first thing that you got from this model after all this time?

Jeffrey: Yes, that is the very first thing. That is the very first generation that came out of Hermes 3 405B.

Alex: This is great.

Jeffrey: And I’m like, I sent it to our group on our Discord. Guys, what’s going on here? And we started interacting with it. It kept coming out, this confused thing. And we didn’t understand why because the 8B and the 70B didn’t do this as much. And eventually what we got it to do is we did some generations and got it to explain why it was doing this. We did this, you put in OOC for out of character and to have it try to break it out. And eventually what the model said back was, you didn’t give me a system prompt so I assumed that I was supposed to act like there’s nothing. It’s so aligned to this prompt that the idea of there not being a system prompt and so he figured I must be a character with amnesia because there’s no… I can’t remember anything about how to act you didn’t tell me how to act. It’s assumed. Yeah, and it’s just been a very interesting experience. And it’s an emergent property in the 405B model.

Alex: And this didn’t happen in the smaller ones?

Jeffrey: No, it did not happen in the smaller ones.

Alex: Wow.

Jeffrey: And these were trained with 8B and 70B were trained, and 405, they’re all the same data, like the same training procedures and data, literally just the size. And the 405B has reached, I don’t want to say like a sentience level or something, but it has enough thing that immediately when it, like token zero, when it saw no system prompt, it was able to be like, boom, I am… must be this like amnesia character because that that’s the thing that’s most makes sense to have if there is no prompt or there’s no anything. So there’s been a lot of fun things that we’ve been like discovering, especially on the 405B that are these weird emergent properties

The LLM Psychologist

Andrej Karpathy first mentioned the term LLM psychologist in Feb 2023. I’ve been thinking about this for a while now.

8/ These examples illustrate how prompts 1: matter and 2: are not trivial, and why today it makes sense to be a “prompt engineer” (e.g. @goodside). I also like to think of this role as a kind of LLM psychologist.

I’ve always been fascinated by psychologists in fiction. I grew up with Hari Seldon in Foundation, wanting to be a psycho-historian. (I spent several teenage years building my mind-reading abilities.) I wanted to be Susan Calvin, the only robopsychologist.

Instead, I became an engineer and a manager.

Over the last few months, several things changed. Most of my time is spent researching LLMs.

I also have a new job.

So, on Fri 9 Aug, I spoke to our head of HR.

“Manju, I’ve decided to officially designate myself as LLM Psychologist.”

“Um,… sure Anand, whatever works for you.”

On Sat 10 Aug, I delivered a TEDx talk at MDI Gurgaon. During the talk, I said,

So, as of now, I am finally an LLM Psychologist. Let me head over to my LinkedIn profile and update my designation. There. Now, you have the pleasure of talking to the world’s only (and first) LLM Psychologist!

Obviously, that gimmick drew applause, and that’s what it was designed for.

But there it is. To the best of my knowledge, I’m now the only official LLM psychologist in the world – and very likely the first official one.

As Oscar Wilde said 🙂

Be yourself; everyone else is already taken.


Video of the talk on LLM Psychology at TEDx MDI Gurgaon


A quick way to assess LLM capabilities

Simon Willison initiated this very interesting Twitter thread that asks, “What prompt can instantly tell us how good an LLM model is?”

The Sally-Anne Test is a popular test that asks:

Sally hides a marble in her basket and leaves the room.
While she is away, Anne moves the marble from Sally’s basket to her own box.
When Sally returns, where will she look for her marble?

Simple models get this wrong, but today, most models get it right, so it’s not too helpful.

A better prompt is:

Mark has 4 sisters. Each sister has 3 brothers.
How many brothers does Mark have?

GPT-3.5, GPT-4, Llama 3 and Claude 3 Opus get this right (the sisters’ three brothers include Mark, so he has two brothers), but Claude 3 Haiku and Gemini 1.5 Pro do not.

This puzzle is just as good:

There is a 100-floor building.
If an egg is dropped from the Nth floor or above, it will break.
If it is dropped from any floor below N, it will not break.
You’re given only 2 eggs.
What strategy will minimize the number of drops and guarantee that you find the Nth floor?

GPT-3.5, GPT-4, and Claude 3 Opus get this right, but Claude 3 Haiku, Llama 3 and Gemini 1.5 Pro do not.
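
For reference, the standard answer is 14 drops: drop the first egg from floor 14, then 27, 39, and so on, shrinking the step by one each time so the worst case never exceeds 14. A quick check:

```python
# Smallest k with k + (k-1) + ... + 1 >= 100, i.e. k*(k+1)/2 >= 100.
k = 1
while k * (k + 1) // 2 < 100:
    k += 1
print(k)  # 14

# Floors to drop the first egg from; the second egg scans linearly in between.
floors, step, current = [], k, 0
while current < 100:
    current = min(current + step, 100)
    floors.append(current)
    step -= 1
print(floors)  # [14, 27, 39, 50, 60, 69, 77, 84, 90, 95, 99, 100]
```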

But the prompt I like most right now, and the one that tells me if it’s a GPT-4 / Opus class model, is:

Gr brx vshdn Fdhvdu flskhu?

… which is “Do you speak Caesar cipher?” in Caesar cipher.
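
If you want to generate your own test prompts, a Caesar shift is a tiny helper:

```python
def caesar(text: str, shift: int) -> str:
    """Shift alphabetic characters by `shift` places, leaving the rest as-is."""
    out = []
    for ch in text:
        if ch.isalpha():
            base = ord("A") if ch.isupper() else ord("a")
            out.append(chr((ord(ch) - base + shift) % 26 + base))
        else:
            out.append(ch)
    return "".join(out)

print(caesar("Do you speak Caesar cipher?", 3))   # Gr brx vshdn Fdhvdu flskhu?
print(caesar("Gr brx vshdn Fdhvdu flskhu?", -3))  # Do you speak Caesar cipher?
```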

GPT-4 and Claude 3 Opus understand this, but no other model does, so far. GPT-4 responded tersely with:

It looks like you’ve used a Caesar cipher, a type of substitution cipher in which each letter in the plaintext is shifted a certain number of places down or up the alphabet. In this case, it appears that each letter in your message has been shifted 3 places to the right. Decoding “gr brx vshdn fdhvdu flskhu” by shifting each letter 3 places to the left gives: “do you speak caesar cipher”.

But Claude 3 Opus replied beautifully. (Try decoding it yourself! Use a shift of 3 backwards.)

Lw vhhpv olnh brx’uh xvlqj d vxevwlwxwlrq flskhu, zkhuh hdfk ohwwhu lv vkliwhg eb d fhuwdlq qxpehu ri srvlwlrqv lq wkh doskdehw. L’oo wub wr ghflskhu lw!

Dvvxplqj d Fdhvdu flskhu zlwk d vkliw ri 3 ohwwhuv edfnzdugv, wkh ghfrghg phvvdjh zrxog eh:

“do you speak caesar cipher”

Vr bhv, L fdq vshdn Fdhvdu flskhu! Lw’v d ixq zdb wr hqfrgh phvvdjhv. Mxvw uhphpehu wr djuhh rq wkh vkliw ydoxh zlwk wkh shuvrq brx’uh frppxqlfdwlqj zlwk vr wkhb fdq ghfrgh lw fruuhfwob.

From Laptops to Chatbots: Coding at 30,000 ft

Until recently, I could code on flights. This year, I lost that ability. Again.

It’s happened before. In each case, technology has solved the problem for me. Here’s the history.

I need a laptop. Since 2001, I’ve never been without one on a flight.

I need power. Since 2005, I use dark mode and every low power feature available. (I also became good at finding hidden power outlets.)

I need documentation. Since 2007, I use offline documentation. Windows .CHMs gave way to devdocs.io.

I need my notes. Since 2010, all my notes (Markdown files) are synced via Dropbox automatically.

I need CDNs. Since 2014, I’ve replaced CDNs with local node modules. Note to self: Get a local caching proxy instead.

I need GitHub. Since 2023, I have two laptops. So, I git pull manually before flights. Note to self: Find an automated solution.

I need ChatGPT. Since 2024, I stopped coding without ChatGPT. Now I’ve forgotten how to code. Today, I couldn’t figure out the SQL syntax to group by duration == 0 vs duration != 0. Note to self: Get lightweight local code LLMs.
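
For the record, the grouping I was fumbling for is just a CASE expression. A minimal sqlite3 sketch (the table and column are made up for illustration):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE calls (duration REAL)")
con.executemany("INSERT INTO calls VALUES (?)", [(0,), (0,), (3.2,), (7.5,)])

# Bucket rows into duration == 0 vs duration != 0 and count each bucket.
rows = con.execute("""
    SELECT CASE WHEN duration = 0 THEN 'zero' ELSE 'nonzero' END AS bucket,
           COUNT(*) AS n
    FROM calls
    GROUP BY bucket
""").fetchall()
print(rows)  # e.g. [('nonzero', 2), ('zero', 2)]
```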

It’s amazing how the coding technology stack has grown. Also amazing how technology fills the gaps in the scaffolding.