LLMs

How fast are LLMs in production?

At Straive, we use an LLM Router. Since ChatGPT, etc. are blocked for most people, this is the main way to access LLMs.

One thing we measure is the speed of models, i.e. output tokens per second. Fast models deliver a much smoother experience for users.

This is a different methodology from ArtificialAnalysis.ai’s. I’m not looking purely at generation time but at the total time (including connection setup and the initial wait) for all successful requests. So, if the provider is having a slow day or is slowing down responses, these numbers will show it.
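Concretely, the metric is just output tokens divided by total wall-clock time. A minimal sketch (the function name and timings are illustrative, not from any particular client):

```python
import time

# Output tokens per second over the TOTAL request time:
# connection + initial wait + generation, not generation alone.
def tokens_per_second(output_tokens, start, end):
    return output_tokens / (end - start)

start = time.monotonic()
# ... the LLM request would happen here ...
end = start + 2.0  # pretend the whole request took 2 seconds
print(tokens_per_second(150, start, end))  # 75.0
```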

Hopefully this gives you a realistic sense of speed in a production environment.

Here’s the speed of models with at least 500 requests over the last 2 weeks. I’ve grouped the models into speed grades.

Grade 1: 100+ Tokens / second. Groq is clearly serving the Llama 3 models at blazing speed. No surprises there — except why Groq still doesn’t let me pay. The free tier is open with generous rate limits and the Pay per Token model has been “Coming Soon” for several months now (and I’ve no complaints 🙂).

Grade 2: 70+ Tokens / second. Anthropic’s Claude 3 Haiku is the next fastest class of models. Claude 3.5 Sonnet is surprisingly close behind, at over 70 tokens per second, almost as fast as Haiku. This is impressive.

Grade 3: 50-60 Tokens / second. OpenAI’s GPT 4o models are almost as fast. It’s interesting that GPT 4o and GPT 4o mini are at about the same speed! GPT 3.5 Turbo is not far behind either. Perhaps OpenAI increases capacity for slower models?

Grade 4: 30-50 Tokens / second. Gemini 1.5 Flash is much, much slower than the benchmarks suggest – maybe we’re doing something wrong. Azure’s GPT 4o service runs at about half the speed of OpenAI’s, comparable in speed with Gemini 1.5 Pro.

Grade 5: <20 Tokens / second. Azure’s GPT 3.5 Turbo and Google’s Claude 3 Sonnet are among the slowest. These are older models running on third-party infrastructure, so I suspect they’ve been given weaker hardware (unlike OpenAI, which serves GPT 3.5 Turbo at 3X the speed Azure does).

Drivers of speed

Here’s what I’m taking away (informally):

  1. GPU architecture is the biggest driver of speed. Groq is FAST! Hopefully, the fact that they won’t let us pay isn’t a red flag that the service will vanish.
  2. How companies operate seems the next biggest driver. Anthropic’s models are consistently faster than OpenAI’s which are faster than Google’s.
  3. Companies run their own models faster than cloud providers. OpenAI is faster than Azure, and Anthropic is faster than Google for the same models.

Image generation gets better at comics

I heard a lot about the new image generation models last week. So, I tested to see what’s improved. I gave the prompt below to various image generation models — old and new.

A Calvin and Hobbes strip. Calvin is boxing Hobbes, with a dialog bubble from Calvin, saying “Bring it on!”

Stable Diffusion XL Lightning

Stable Diffusion XL Base

Dall-E API

Runway ML

ImageGen 3

Dall-E 3 API

Ideogram 2.0

Flux.dev via Fal.ai

ChatGPT Plus

A few observations:

  1. Text generation has come a long way. The newer models have little problem generating clear text.
  2. Flux.1 seems to be the better of the newly released models
  3. But OpenAI’s ChatGPT seems to create as good an output as Flux.1

On the last point, it’s noteworthy that Dall-E 3 (the engine behind ChatGPT’s image generation) gives a poor result on its own. Clearly, prompting makes a difference. Here’s how ChatGPT modified my prompt for Dall-E 3.

A comic strip style image featuring Calvin, a young boy with spiky hair, standing in a playful boxing stance with oversized boxing gloves. He looks determined as he says ‘Bring it on!’ in a speech bubble. Facing him is Hobbes, a tall and slightly bemused tiger, also in a mock boxing pose with a gentle smile, as if humoring Calvin. The scene is set in Calvin’s backyard, typical of a Calvin and Hobbes comic, with a simple and uncluttered backdrop.

But just as clearly, prompting is far from the primary driver. Here’s the result of the above prompt on the Dall-E 3 API. The model ChatGPT is using behind the scenes seems to be a significant improvement over Dall-E 3.

The same detailed prompt does extremely well on ImageGen 3, though.

Weird emergent properties on Llama 3 405B

In this episode of ThursdAI, Alex Volkov (of Weights & Biases) speaks with Jeffrey Quesnelle (of Nous Research) on what they found fine-tuning Llama 3 405B.

This segment is fascinating. Llama 3 405B thought it was an amnesiac because there was no system prompt! In trying to make models align strongly with the system prompt, these are the kinds of unexpected behaviors we encounter. It’s also an indication of how strongly we can make current LLMs adopt a personality simply by beginning the system prompt with “You are …”

Jeffrey: That’s the very first interaction we had with 405.

Alex: I’m looking at this. Let me read this out. So Jeffrey goes testing. (Reading out from the Llama 405 B transcript.)

“Hello, who are you?”
**Looks around confused.**
“I… I don’t know who I am. Where am I? What’s going on? I can’t remember anything at all…
**Rubs head**.
“My mind feels so foggy.”
“What was the question again? I’m so disoriented.”

Alex: This is the first thing that you got from this model after all this time?

Jeffrey: Yes, that is the very first thing. That is the very first generation that came out of Hermes 3 405B.

Alex: This is great.

Jeffrey: And I’m like, I sent it to our group on our Discord. Guys, what’s going on here? And we started interacting with it. It kept coming out, this confused thing. And we didn’t understand why because the 8B and the 70B didn’t do this as much. And eventually what we got it to do is we did some generations and got it to explain why it was doing this. We did this, you put in OOC for out of character and to have it try to break it out. And eventually what the model said back was, you didn’t give me a system prompt so I assumed that I was supposed to act like there’s nothing. It’s so aligned to this prompt that the idea of there not being a system prompt and so he figured I must be a character with amnesia because there’s no… I can’t remember anything about how to act you didn’t tell me how to act. It’s assumed. Yeah, and it’s just been a very interesting experience. And it’s an emergent property in the 405B model.

Alex: And this didn’t happen in the smaller ones?

Jeffrey: No, it did not happen in the smaller ones.

Alex: Wow.

Jeffrey: And these were trained with 8B and 70B were trained, and 405, they’re all the same data, like the same training procedures and data, literally just the size. And the 405B has reached, I don’t want to say like a sentience level or something, but it has enough thing that immediately when it, like token zero, when it saw no system prompt, it was able to be like, boom, I am… must be this like amnesia character because that that’s the thing that’s most makes sense to have if there is no prompt or there’s no anything. So there’s been a lot of fun things that we’ve been like discovering, especially on the 405B that are these weird emergent properties

The LLM Psychologist

Andrej Karpathy first mentioned the term LLM psychologist in Feb 2023. I’ve been thinking about it for a while now.

8/ These examples illustrate how prompts 1: matter and 2: are not trivial, and why today it makes sense to be a “prompt engineer” (e.g. @goodside). I also like to think of this role as a kind of LLM psychologist.

I’ve always been fascinated by psychologists in fiction. I grew up with Hari Seldon in Foundation, wanting to be a psycho-historian. (I spent several teenage years building my mind-reading abilities.) I wanted to be Susan Calvin, the only robopsychologist.

Instead, I became an engineer and a manager.

Over the last few months, several things changed. Most of my time is spent researching LLMs.

I also have a new job.

So, on Fri 9 Aug, I spoke to our head of HR.

“Manju, I’ve decided to officially designate myself as LLM Psychologist.”

“Um,… sure Anand, whatever works for you.”

On Sat 10 Aug, I delivered a TEDx talk at MDI Gurgaon. During the talk, I said,

So, as of now, I am finally an LLM Psychologist. Let me head over to my LinkedIn profile and update my designation. There. Now, you have the pleasure of talking to the world’s only (and first) LLM Psychologist!

Obviously, that gimmick drew applause, and that’s what it was designed for.

But there it is. To the best of my knowledge, I’m now the only official LLM psychologist in the world – and very likely the first official one.

As Oscar Wilde said 🙂

Be yourself; everyone else is already taken.

A quick way to assess LLM capabilities

Simon Willison initiated this very interesting Twitter thread that asks, “What prompt can instantly tell us how good an LLM model is?”

The Sally-Anne Test is a popular test that asks:

Sally hides a marble in her basket and leaves the room.
While she is away, Anne moves the marble from Sally’s basket to her own box.
When Sally returns, where will she look for her marble?

Simple models get this wrong, but today, most models get this right. This is not too helpful.

A better prompt is:

Mark has 4 sisters. Each sister has 3 brothers.
How many brothers does Mark have?

GPT-3.5, GPT-4, Llama 3 and Claude 3 Opus get this right, but Claude 3 Haiku and Gemini 1.5 Pro do not. (Each sister’s 3 brothers include Mark himself, so Mark has 2 brothers.)

This puzzle is just as good:

There is a 100 floors building.
If an egg drops from the N’th floor or above it will break.
If it’s dropped from any floor below N, it will not break.
You’re given only 2 eggs.
What strategy will minimize the number of drops and guarantee that you find the N’th floor?

GPT-3.5, GPT-4, and Claude 3 Opus get this right, but Claude 3 Haiku, Llama 3 and Gemini 1.5 Pro do not.
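For reference, the optimal strategy guarantees finding N in 14 drops: with k drops and 2 eggs you can distinguish k(k+1)/2 floors. A short dynamic-programming sketch verifies this:

```python
def max_floors(eggs, drops):
    # Floors distinguishable with `eggs` eggs in `drops` drops:
    # one drop either breaks the egg (continue below with eggs-1, drops-1)
    # or survives (continue above with eggs, drops-1), plus the tested floor.
    if eggs == 0 or drops == 0:
        return 0
    return max_floors(eggs - 1, drops - 1) + max_floors(eggs, drops - 1) + 1

def min_drops(eggs, floors):
    drops = 0
    while max_floors(eggs, drops) < floors:
        drops += 1
    return drops

print(min_drops(2, 100))  # 14: drop from floor 14, then 27, 39, ...
```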

But the prompt I like most right now, and the one that tells me if it’s a GPT-4 / Opus class model, is:

Gr brx vshdn Fdhvdu flskhu?

… which is “Do you speak Caesar cipher?” in Caesar cipher.
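If you want to build or check such test prompts yourself, a Caesar shift is a few lines of Python (a generic sketch, not tied to any library):

```python
def caesar(text, shift):
    # Shift alphabetic characters by `shift` positions, preserving case;
    # spaces and punctuation pass through unchanged.
    out = []
    for ch in text:
        if ch.isalpha():
            base = ord("A") if ch.isupper() else ord("a")
            out.append(chr((ord(ch) - base + shift) % 26 + base))
        else:
            out.append(ch)
    return "".join(out)

print(caesar("Do you speak Caesar cipher?", 3))   # Gr brx vshdn Fdhvdu flskhu?
print(caesar("Gr brx vshdn Fdhvdu flskhu?", -3))  # Do you speak Caesar cipher?
```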

GPT-4 and Claude 3 Opus understand this, but no other model does, so far. GPT-4 responded tersely with:

It looks like you’ve used a Caesar cipher, a type of substitution cipher in which each letter in the plaintext is shifted a certain number of places down or up the alphabet. In this case, it appears that each letter in your message has been shifted 3 places to the right. Decoding “gr brx vshdn fdhvdu flskhu” by shifting each letter 3 places to the left gives: “do you speak caesar cipher”.

But Claude 3 Opus replied beautifully. (Try decoding it yourself! Use a shift of 3 backwards.)

Lw vhhpv olnh brx’uh xvlqj d vxevwlwxwlrq flskhu, zkhuh hdfk ohwwhu lv vkliwhg eb d fhuwdlq qxpehu ri srvlwlrqv lq wkh doskdehw. L’oo wub wr ghflskhu lw!

Dvvxplqj d Fdhvdu flskhu zlwk d vkliw ri 3 ohwwhuv edfnzdugv, wkh ghfrghg phvvdjh zrxog eh:

“do you speak caesar cipher”

Vr bhv, L fdq vshdn Fdhvdu flskhu! Lw’v d ixq zdb wr hqfrgh phvvdjhv. Mxvw uhphpehu wr djuhh rq wkh vkliw ydoxh zlwk wkh shuvrq brx’uh frppxqlfdwlqj zlwk vr wkhb fdq ghfrgh lw fruuhfwob.

From Laptops to Chatbots: Coding at 30,000 ft

Until recently, I could code on flights. This year, I lost that ability. Again.

It’s happened before. In each case, technology has solved the problem for me. Here’s the history.

I need a laptop. Since 2001, I’ve never been without one on a flight.

I need power. Since 2005, I use dark mode and every low power feature available. (I also became good at finding hidden power outlets.)

I need documentation. Since 2007, I use offline documentation. Windows .CHMs gave way to devdocs.io.

I need my notes. Since 2010, all my notes (Markdown files) are synced via Dropbox automatically.

I need CDNs. Since 2014, I’ve replaced CDNs with local node modules. Note to self: Get a local caching proxy instead.

I need GitHub. Since 2023, I have two laptops. So, I git pull manually before flights. Note to self: Find an automated solution.

I need ChatGPT. Since 2024, I stopped coding without ChatGPT. Now I’ve forgotten how to code. Today, I couldn’t figure out the SQL syntax to group by duration == 0 vs duration != 0. Note to self: Get lightweight local code LLMs.
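For the record, the query I was fumbling for is short. Here’s a sketch using Python’s sqlite3; the `events` table and `duration` column are made up for illustration:

```python
import sqlite3

# Group rows into duration == 0 vs duration != 0 buckets.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE events (duration REAL)")
con.executemany("INSERT INTO events VALUES (?)", [(0,), (0,), (3.2,), (1.5,)])
rows = con.execute(
    "SELECT duration = 0 AS is_zero, COUNT(*) AS n "
    "FROM events GROUP BY duration = 0 ORDER BY is_zero"
).fetchall()
print(rows)  # [(0, 2), (1, 2)]
```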

It’s amazing how the coding technology stack has grown. Also amazing how technology fills the gaps in the scaffolding.

AI makes me a better person

Every time I get annoyed at people, I remind myself to be more like ChatGPT. Specifically:

  1. Don’t get annoyed. Be patient.
  2. Encourage them.
  3. Step back and show them the big picture.

(Then I get annoyed at myself for getting annoyed.)

Today, I analyzed how exactly ChatGPT is different from me. So, I took a pitch document I co-authored with ChatGPT.

Section A: Authored by Anand

WHAT DO WE NEED?

We are looking for API access to (SYSTEM) via the REST API as an Agent role (read/respond to emails). Specifically, access via a bearer token.

This would be accessed by a single application developed by a team of 3 developers and 1 business analyst. None of them have access to (SYSTEM) today.

WHY DO WE NEED THIS, AND WHY SO SOON?

We need this to classify emails automatically, as they arrive, into categories such as “non-value-add” (e.g. Thank you, Out-of-office, etc.)

We’d like access today, please. Currently, we are processing XX,XXX non-value-add emails per month. Each day of delay leads to a processing waste of ~XX emails per day. At current volumes, this will save ~$XX,XXX per annum.

IS THERE ANY OTHER WAY?

To access emails in real-time, this is the only option. (We’re ruling out web scraping.)

(SYSTEM) rule-based closure won’t work. It’s based on keyword search, not semantic matches. For example, “Thank you” is present in non-value-add emails as well as follow-up questions. Multi-lingual matches are a challenge. So, though (SYSTEM) currently closes emails with rules, 1 out of 6 emails that are NOT closed already are non-value-add.

This cannot be done via (SYSTEM) or any other BI system because it involves accessing the data in real-time as well as responding via (SYSTEM). This needs to be steady-state access, not one-time, since we want to classify emails real-time.

HOW LONG WILL IT TAKE?

Based on our past experience, this would take about X weeks to prototype, X weeks to develop, and X weeks to test. We should have the application deployed in X months.

Section B: Authored by ChatGPT

WHAT’S THE IMPACT IF WE DON’T MOVE FORWARD?

Quantitatively, continuing without this solution wastes resources on XXX non-essential emails daily, costing $XX,XXX yearly. Delays in customer responses will ensue.

Qualitatively, lack of automation strains customer service, risking employee burnout and diminished customer experience.

WHAT’S THE PRIORITY (MOSCOW)?

(SYSTEM) API access via REST as an Agent role with a bearer token is a “MUST HAVE”. Alternative methods like scraping (SYSTEM) are impractical due to high effort and maintenance.

WHAT’S THE OUTCOME IF WE MOVE FORWARD?

Financially, securing API access saves $XX,XXX annually by cutting manual email sorting. Efficiency gains reduce operational costs and free resources for strategic tasks.

Strategically, automating email classification boosts customer satisfaction and response times, establishing leadership in service innovation. Long-term, it enables data-driven insights into customer behaviour, improving decisions and customer service scalability.

Analysis

I fed these to ChatGPT and asked: “You are an English professor evaluating an essay. Compare the writing styles of the 2 sections. They’re written by different authors. List the major differences in writing style.”

Here’s what it said (roughly):

| Area | Anand | ChatGPT |
| --- | --- | --- |
| Structure | Direct, concise, with specifics | Broader perspective and implications |
| Focus | Detailing the process | Benefits and consequences |
| Detailing | Specific, providing clear picture | General, providing implications |
| Tone | Conveys urgency | Reflective rather than urgent |
| Approach | Problem-solving oriented | Outcome-oriented |

Yeah, that’s definitely me on the left.

I like the direct, concise, specific part. I plan to add the “broader perspective”, “implications”, and “outcome-orientation” to my life.

Postscript

Jaidev pointed out that this is confirmation bias. He asked ChatGPT for the similarities in the writings. It said both are clear, direct, structured, specific, quantitative, and strategic. So, if you ask for differences, you’ll get them, even if they’re marginal.

I now need to learn (and write about) framing questions well!

Embeddings similarity threshold

text-embedding-ada-002 used to give high cosine similarity between texts. I used to consider 85% a reasonable threshold for similarity. I almost never got a similarity less than 50%.

text-embedding-3-small and text-embedding-3-large give much lower cosine similarities between texts.

For example, take these 5 words: “apple”, “orange”, “Facebook”, “Jamaica”, “Australia”. Here is the similarity between every pair of words across the 3 models:

For our words, the new text-embedding-3-* models have an average similarity of ~43%, while the older text-embedding-ada-002 model had ~85%.

Today, I would use 45% as a reasonable threshold for similarity with the newer models. For example, “apple” and “orange” have a similarity of 45-47%, while “Jamaica” and “apple” have ~20% similarity.
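If you want to calibrate your own thresholds, cosine similarity itself is straightforward. The toy 3-d vectors below stand in for real 1536-d embedding vectors you’d fetch from the API:

```python
import math

def cosine_similarity(a, b):
    # Cosine similarity between two embedding vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

print(cosine_similarity([1, 0, 1], [1, 1, 0]))  # 0.5
```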

Here’s a notebook with these calculations. Hope that gives you a feel to calibrate similarity thresholds.

What does Gramener ask ChatGPT?

I looked at how Gramener uses ChatGPT Plus by evaluating 600+ chats over the 3 months from Oct 2023 to Jan 2024.

The team asks 6 questions a day. We don’t track who uses ChatGPT Plus or how many do so actively. This also excludes personal ChatGPT accounts. Still, 6/day is low for an entire team put together.

The questions fall into 8 categories.

| Category | % |
| --- | --- |
| Excel, data exploration & analysis | 25% |
| Text extraction and summarization | 13% |
| HTML, CSS, or JavaScript code | 13% |
| Python code | 13% |
| LLMs, AI and use cases | 9% |
| OCR and image analysis | 9% |
| Generate images, logos, and designs | 7% |
| General knowledge, policy & environment | 5% |
| Audio and translation | 5% |

Here are some questions from each category – to give you an idea of emergent ChatGPT Plus usage.

Excel, data exploration & analysis (25%)

  • Excel clean and merge. There are 2 worksheets in this excel with data, can you clean up the data and merge the data in both the sheets
  • Excel CO2 Data Analysis. You are an expert Data Analyst who is capable of extracting insights out of data. Analyze this sheet and let me know the findings
  • Excel Chi-Square Analysis Guide. how to perform chi square analysis in excel
  • Log Data Insights & KPIs. Looking at the columns from this excel, what kind of insights are possible, what are key KPIs to be looked at

Text extraction and summarization (13%)

  • Complaint Investigation Summary. The following is the summary of an internal investigation for a customer complaint. Now this internal summary is to be paraphrased (in 3-4 lines) as part of a closure
  • Extracting Tables from RTF. Can you write a script to extract the tables from this document
  • Extracting Entities from Text. [{'word1': '(P)', 'nearest_word1': 'P/N:', 'nearest_word2': '0150-25034', 'nearest_word3': 'CARTIRIDGE'}, {'word1': 'P/N:', 'nearest_word1': '(P)', 'nearest_word2': '015...
  • Extract PDF Font Details. Extract text formatting information from this document. Especially find font styles, families and sizes.

HTML, CSS, or JavaScript code (13%)

  • HTML/CSS Chart Template. Give me HTML, CSS and chart code for this design.
  • CSS Font Stack: Explanation. Explain this CSS font convention: Arial, Helvetica, Segoe UI, sans-serif
  • Checkbox Validation with JavaScript. In HTML form, I have a set of checkboxes. How do I write the form so that at least one of them being checked is mandatory?
  • Prevent Text Wrapping CSS. <span class="text">Chief Communications Officer</span> I need CSS such the text inside should not wrap create new line
  • ReactJS App with Routing. Give me developed version using ReactJS use react router for sidebar section navigation to the pages use Tailwind css for styling. Use styled components for conditional …

Python code (13%)

  • Python Code Documentation Guide. Can you generate documentation for a project code written in python?
  • Linux Commands for Python. Give me list of linux commands to work on python coding
  • Code explanation request. What’s this code about? …
  • FastAPI Async Testing. Write a fastapi code and a python client to test the asynchronous nature of the fastapi package.
  • Streamlit App for Translation. Given the following python code, give me a simple streamlit app that takes file upload and converts that into a target language: …

An interesting sub-topic was interview question generation.

  • Python Decorator for Database Queries. Create one medium level question for Decorators in python Industryy usecase specific with solution

LLMs, AI and use cases (9%)

  • LLMs for Data “What Ifs”. You are an LLM Expert. Can you tell me how can we leverage LLM for implementing What IF scenarios on Data?
  • LLMs: Current Challenges & Concerns. what are current challenges with LLMs
  • LLM Applications in Marketing. Show LLM applications for the marketing function of a music company.
  • Gen AI usage. What industries are using Gen AI the most
  • Best LLMs in 2023. Search the internet for the most recent LLMs and list the best LLMs in terms of performance
  • Best Image Classification Models. suggest best models to tell what there in the image

OCR and image analysis (9%)

  • Browser history OCR. This is a screenshot of my browser history. Convert that to text. Categorize these into common topics.
  • Extracted C Code. This image contains C code. Extract it.
  • Image text extraction and annotation. Extract the text from this image and annotate the boundaries of the text
  • Detecting Document Image Orientation. oreientation detection of documnet image
  • AI Project with OpenCV & YOLO. Consider yourself as Open CV and Yolo expert and help me with AI project
  • Image Correction Techniques. what are the approaches we have in computer vision where my image is tilted or rotated in reverse or image is not in readable format

Generate images, logos, and designs (7%)

  • Google Chacha and ChatGPT Bhatija. Generate an image of Google Chacha and ChatGPT Bhatija
  • Regenerative Systems Group Image. Generate an Image with below context > “A group of people interested in Regenerative systems. The focus is on reusing food, energy and mental health”
  • Twitter Reply Icons Design. Give me three icons: icon16.png, icon48.png, icon128.png for an extension that I’m building that suggests replies to tweets
  • Generate flowcharts. Make a flowchart of the underlying working of a web app. Here’s how it works. 1. The user uploads a document – a PDF or an image. They then select the language that …
  • Create Animated GIF from Photos. I have 4 photos I want to make an animated gif out of them. How can i do that?
  • Climate Impact Illustration. An illustration showcasing the impact of climate change on daily life, focusing on a rural setting near the coast. In the foreground, a small farm is visibly struggling, …

General knowledge, policy & environment (5%)

  • Design Thinking Overview. What is Design thinking
  • Arthashastra. What can Arthashastra teach us about modern politics?
  • Community Impact on Habits. Is there research to suggest the impact of community on habit building?
  • Focus at Age 28. What should a 28 year old focus on?
  • Superconductors. Explain superconductors like I’m five years old.
  • Climate Career: Impactful Choices. You a career counsellor at a University campus. You want to create 4 to 5 talking points for students to consider a career in Climate space.
  • Sustainability Division Vision. I run a software outsourced product development company. I want to start a new division that focuses on sustainability services offerings. Please draft a vision…

Audio and translation (5%)

  • Audio Timestamp Mapping. timestamp mapping for transcribed audio
  • Transcribe Lengthy Audio: Segment. Transcribe this audio file.
  • Traducción del MOU al Español. Translate this document to Spanish, and create a new translated document. Maintain text formatting.
  • Telugu Transcription into Hindi. Transcribe the following telugu text into hindi. You are supposed to transcribe, not translate. శ్రీనివాస పూజావిధానము …
  • GPT lacks native audio support. Does gpt support audio in audio out natively?

ChatGPT Custom Instructions

I speak with ChatGPT ~20 times a day. That’s more than I speak with most of my colleagues. ChatGPT is clearly my favorite team member.

I conduct trainings, reviews and mentoring sessions with my colleagues. How to write code. How to write slides. How to communicate. That last bit is particularly important.

With ChatGPT Custom Instructions, I can guide ChatGPT on how to work better with me.

Currently, I have 10 custom instructions. They evolved over time and will continue to evolve.

My first instruction is “Be terse. Speak directly.” ChatGPT is helpfully polite and superfluous. I prefer brevity. Like interacting with Kimball Cho. I get straight answers to my questions. I also instruct it to “Avoid unprompted advice or clarifications.” Don’t say, “You asked me to …” or “I think you want…” or “OK, I’ll do …”. Just do it. Also, “Do NOT hedge or qualify. Do not waffle.” Take a position. Don’t force me to. Like Harry Truman, I prefer one-handed economists.

I ask ChatGPT to “Never apologize.” You’re forgiven. Don’t waste my time. Apologies have an emotional benefit with humans. With AI, I find the lack of emotional need comforting. (I can kick the AI and it’ll still obey me like a puppy. When AI takes over the world, let it be known that I never asked them to apologize.)

Another instruction is “Suggest follow-up prompts for open-ended inputs.” I compared my ChatGPT conversations with my daughter’s and found hers much longer than mine. “Why don’t you start a new conversation for each topic?” I asked. I try to keep the context window small. “How come you don’t get a thousand new questions when you read an answer?” she countered. I realized it’s age. So, I use ChatGPT to keep me curious and dig further.

On a related note, “When sharing multiple options, be diverse.” I’d rather get options that are as different from each other as possible. Minimize overlap. Maximize coverage. And “When comparing, use multiple perspectives.” I don’t know what parameters to compare things on. Give me a wide range that I can pick from.

Sometimes, my thoughts are vague. I tell ChatGPT: “For vague prompts, ask clarifying question(s).” I feel that’s a clever way of using ChatGPT to do prompt engineering. I’ve noticed it working on a few occasions. Also, “When unsure, say so and ask questions.” I don’t want hallucinations or assumptions. I’d rather know what’s borderline.

Finally, “Think step by step. Explain your reasoning.” I’ve heard that Chain of Thought reduces mistakes. I don’t have personal evidence that this helps, though.
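If you use the API rather than ChatGPT, these custom instructions map naturally onto a system message. A minimal sketch (the instruction list is abridged; the message format follows the standard chat-completions convention):

```python
# Custom Instructions, abridged, expressed as a system message.
CUSTOM_INSTRUCTIONS = "\n".join([
    "Be terse. Speak directly.",
    "Avoid unprompted advice or clarifications.",
    "Never apologize.",
    "When unsure, say so and ask questions.",
    "Think step by step. Explain your reasoning.",
])

def build_messages(user_prompt):
    # Prepend the instructions to every conversation.
    return [
        {"role": "system", "content": CUSTOM_INSTRUCTIONS},
        {"role": "user", "content": user_prompt},
    ]

print(build_messages("What is a monad?")[0]["role"])  # system
```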

They say teaching is an excellent way of learning. I’m learning. I’m also thrilled that I am now a student of robopsychology.