Month: September 2024

What do you need to interact effectively with LLMs?

Simon Willison asked on Twitter:

What are the most importantly things that people need to understand in order to effectively interact with LLM-based systems like ChatGPT or Claude?

Here are the replies. (I used text-embedding-3-small to embed and cluster them into 20 clusters and used OpenAI GPT-4o-mini to label the clusters. There are misclassifications but the themes are accurate.)
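
The embed-and-cluster step can be sketched roughly as below. This is a minimal sketch, not the script actually used: the post does not say which clustering method was applied, so plain k-means is an assumption, and `embedTexts`, `nearestCentroid`, and `kMeans` are hypothetical helper names. The GPT-4o-mini labelling step is left out.

```javascript
// Fetch embeddings from OpenAI's embeddings endpoint (needs an API key and network access).
async function embedTexts(texts, apiKey) {
  const response = await fetch("https://api.openai.com/v1/embeddings", {
    method: "POST",
    headers: { "Content-Type": "application/json", Authorization: `Bearer ${apiKey}` },
    body: JSON.stringify({ model: "text-embedding-3-small", input: texts }),
  });
  const { data } = await response.json();
  return data.map((item) => item.embedding);
}

const squaredDistance = (a, b) => a.reduce((sum, value, i) => sum + (value - b[i]) ** 2, 0);

// Index of the centroid closest to a vector.
function nearestCentroid(vector, centroids) {
  let best = 0;
  centroids.forEach((centroid, i) => {
    if (squaredDistance(vector, centroid) < squaredDistance(vector, centroids[best])) best = i;
  });
  return best;
}

// Plain k-means (assumed method). Returns one cluster index per input vector.
function kMeans(vectors, k, iterations = 20) {
  // Seed centroids with the first k vectors to keep the sketch deterministic.
  let centroids = vectors.slice(0, k).map((vector) => [...vector]);
  let labels = [];
  for (let step = 0; step < iterations; step++) {
    labels = vectors.map((vector) => nearestCentroid(vector, centroids));
    centroids = centroids.map((centroid, c) => {
      const members = vectors.filter((_, i) => labels[i] === c);
      if (members.length === 0) return centroid; // leave empty clusters where they are
      return centroid.map((_, d) => members.reduce((sum, m) => sum + m[d], 0) / members.length);
    });
  }
  return labels;
}
```

With real data, the call would look something like `kMeans(await embedTexts(replies, apiKey), 20)`, where `replies` is the array of reply texts.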

Provide Clear Context and Avoid Leading Questions

  1. 1. Provide relevant context but not too much
    2. Models are total “yes men” – be careful not to imply your perspective if you want an objective response
    3. Learn when to iterate vs start a new chat
    4. Provide examples (especially for output structure) – Tweet
  2. 1. Ask questions that the other person can understand.
    2. Ask questions while predicting what the other person will respond. It’s the same as the human’s. – Tweet
  3. 1. Ensure the system knows the relevant context. Give a detailed backstory of what you’re trying to do with it and why.
    2. One thing at a time. Make the task as specific as possible and if there are multiple things that need to be done, ask it to them in their sort of natural – Tweet
  4. The “most importantly things” are probably to ask for step-by-step before answering and to try to not ask leading questions to avoid its sycophancy bias. – Tweet
  5. You must provide a diverse distinct set of examples of you want it to be robust and generalize in real world systems. – Tweet
  6. Always ask for both strengths and weaknesses to get more balanced perspectives, and make sure the model can tell you as many facts as possible before committing itself to an answer. – Tweet
  7. Rule 1:Avoid chatgpt unless they release a better model than Sonnet 3.5. – Tweet
  8. Strongly insist that it shouldn’t passively agree with you. Encourage it to interrupt with clarifying questions that would meaningfully improve the output. – Tweet
  9. Avoid leading questions if you care about the answer. They are way too polite to contradict the user. – Tweet
  10. – It’s not Google, so use full sentences, not just keywords.
    – Iterate on initial response.
    – Trust, but verify. – Tweet
  11. Just talk to them how you’d want someone to talk to you if it was you in there. – Tweet
  12. Provide good (and bad) examples of output, and don’t forget a few edge cases. – Tweet
  13. Keep hitting the ball back and forth across the net: 1. “thanks but I think these are a little too ‘salesy’ — could you try to generate some ideas that are a little more down to earth” 2. “ok, we are getting there, but still a little overheated. could you try again” – Tweet
  14. These are my top 10 for folks new to GenAI: 1. You have to provide all of the context the model needs to answer your question if that context is not likely to appear in the model’s weights. It will take a while to gain an intuition about what types of knowledge is likely to be – Tweet
  15. Suspend disbelief; collaborate not interrogate; trust no-one; have fun, role play, experiment, test; think of as a facet of intelligence built on achievements of ours, not a robo-rival. Notice book-learning over lived experience, cliches & bluffing in human world too, & do better – Tweet
  16. It’s a dialogue. Iterative. incremental. Chat improves with feedback. When chat creates code, for example, run the code and give chat the error messages so that it can correct the code. Before asking chat a question, ask it what it knows. Then zoom in. Gradually. 🙂 – Tweet
  17. One example is worth a thousand words – Tweet
  18. 1. How to read
    2. How to write (optional) – Tweet
  19. 1. Explain yourself clearly, using lots of examples.
    2. Assume you’re talking to a smarter version of yourself that hasn’t heard about your problem yet.
    3. When it doesn’t do well, use the steps above to correct it. – Tweet
  20. 1. The more precise your question or task is, the better and more accurate the response will be. Vague prompts can result in equally vague answers.
    2. Provide relevant background or context, especially for nuanced questions or tasks. – Tweet
  21. 1. Don’t ask them to do too much in one shot, especially if they are unrelated tasks; you’ll get much worse results.
    2. Don’t give too much context if you can avoid it. The huge context windows of the newest models isn’t as “free” as you might think, or rather it’s “lossy”— the – Tweet
  22. Collaborate with them, don’t delegate to them. – Tweet
  23. Context Window needs to be explained well. @NickADobos is spot on, but this needs to be explained without jargon we are so used to. – Tweet
  24. 1. How context windows work TL;DR: it doesn’t remember everything in chat
    2. It’s a text generator, that is good at patterns, and appearing smart. Not an almighty god doing cognitive work. Hallucinations aren’t ai behaving wrong. They are a feature of generating a bad pattern – Tweet
  25. Consider the context a human would need when responding to the same request. When asked to create a presentation by your manager with 10-20 words, you have thousands or likely millions in context to inform that. Ppl often get annoyed when it fails, it’s usually not enough context – Tweet
  26. Understand that they are autoregressive with a context limit and the limitations that impose on the chat interface. – Tweet

Iterate and Simplify for Optimal LLM Performance

  1. 1. just keep trying things – LLMs keep surprising me,
    2. Start simple, add more techniques, context, guidance etc. step by step – with LLMs I found, less is often more.
    3. Keep a human in the loop and/or be transparent about using LLMs – otherwise prepare for unpleasant – Tweet
  2. When your llm starts omitting code generated in prior steps of an existing chat, end the chat and replay your steps until before that happened. Take a different branch next time – Tweet
  3. 1. Don’t give too much information at once to process, start simple and build on top of previous ones
    2. Want a contrary opinion from LLM?don’t sound like your opinion is sacrosanct – it will agree to you mostly even if its wrong.
    3. Role playing and few shot examples matter. – Tweet
  4. 1. Context
    2. Difference is assumptions
    3. Articulating clearly what you want (run it against another LLM to see if what you mean is what you say).
    4. Being able to go back in a thread and restart (You get do over’s with LLMs that you might not get with people 🙂 ) – Tweet
  5. Well one thing I learned is it’s best to start a new chat if the LLM is going down the wrong path, easier then forcing it back. – Tweet
  6. Having moderate experience with a topic / framework is important for peak quality of the response. At present, using llms for efficiency > using llms to do something you don’t know how to do. – Tweet
  7. at least when it comes to writing code, the task needs to be very well defined, like one would do when creating a user story for developing software. If the details are vague then you leave the LLM open to interpretation and more likely to make mistakes – Tweet
  8. The most important thing, and this has always been true even if not using an LLM, all good software development starts with engineering a solution first before building it. If you attempt to get the LLM to do that part you’ll create as many problems than you solve building – Tweet
  9. Use the LLM to explore your own understanding of the problem space and what you want to achieve. This can help improve your prompting and interpretation of the outputs. – Tweet
  10. LLMs…
    •Pander. Don’t prime answers, ask straight.
    •Only know text. Don’t ask spatial, reasoning etc.
    •Hallucinate and invert. Double-check.
    •Get stuck. Start over.
    •Master ALL languages, jargons, styles etc.
    •Are formidable documentalists. – Tweet
  11. 1. hallucinations are still a thing, be wary when LLMs generate links and code snippets 2. data quality of training content can sometimes be dubious leading LLMs to hallucinate more often or be biased in various ways both will likely be addressed eventually – Tweet
  12. LLMs…
    •Pick and imitate register. Talk like constructive, competent people.
    •Are easily lost. Examples and feedback help.
    •Can misbehave. Be harsh if needed, but stay just. – Tweet
  13. For optimal results, provide ample context. Prompting the LLM with ‘Feel free to ask clarifying questions’ and doing the due-diligence to answering the questions often yields much better results. – Tweet
  14. The more explicit you are the better the output. The LLM can not read your mind and there is a lot of ambiguity when interpreting language. – Tweet
  15. One issue I am seeing more of – Often i ask a question on a choice it made. The LLM assumes I don’t like it or it’s wrong – it then starts to apologize and course correct. More and more I add something like “not refuting or arguing, just trying to understand” etc. – and that – Tweet
  16. They are inherently unreliable in more than one sense, which accumulates the more requests you run in a chain. The Six Sigma approach is devastating to LLMs. – Tweet
  17. Treat it like a very intelligent junior employee who just started at your company and lacks context. Give the LLM the same level of detail for every instruction you would give to this junior employee. – Tweet
  18. That LLMs are not too be trusted as they reliably fail at information due to multiple effects, including hallucinations. That LLMs don’t actually understand things and don’t have common sense. It is mandatory to adapt expectations and ways of working to successfully use them. – Tweet

Craft Effective Prompts for Consistent Results

  1. How to prompt – Tweet
  2. Carefully consider keywords, and prioritise them via the locating them earlier and at the end of longer prompts. – Tweet
  3. If you want stable results across models and are looking to build robust pipelines you should stop hand writing prompts and move toward prompt optimizers. https://ycombinator.com/launches/L4V-hamming-let-ai-optimize-your-prompts-free-for-7-days… Also built into DSPy! – Tweet
  4. While crafting logics and system prompts, Always keep a thought in your mind parallel what would I respond to this prompt and context. – Tweet
  5. 1. Prompts matter.
    2. Treat it like a tool, and you’ll get a tool. It’s only as smart as you let it be. – Tweet
  6. to ask them the best way to prompt them – Tweet
  7. there is a single prompt that gets the job done, thousands that screws it – Tweet
  8. 1. Always add a system prompt at the beginning: Define a role. Ex: “You are a senior software developer who excels in…”
    2. Context Matters: Provide a detailed background for better insights.
    3. Clear Prompts: Specificity is crucial for accurate outputs. – Tweet
  9. If a large global prompt doesn’t work, try step by step. If it does work, but has errors in response – Ask it to fix errors one by one. Insist, like you would with a human supplier. If “do this” doesn’t work, try “Strictly do this”. Amazing how effective insisting is 🙂 – Tweet
  10. The better the prompt the better the output. You don’t need a Meta framework for 90% of things – Tweet
  11. They don’t exist between prompts – Tweet
  12. Don’t rely on the models weights alone. Be explicit in the prompt and give it pointers to what you’re expecting. Let it “clean up” or “translate” your prompt rather than “come up” with an answer based on its training. Exception: generating lists for inspiration. – Tweet
  13. prompt engineering, in order to get the most desired outcome in handy. – Tweet
  14. It lies
    Q: Who was the second person to walk on the moon?
    A: Pete Conrad
    Q: can you name the crew members of Apollo 11?
    A: I got the right answer.
    Q: Then how come Pete Conrad was the second person to walk on the moon?
    A: My apologies. Indeed Buzz Aldrin was the second pe… – Tweet
  15. How to say no. – Tweet
  16. How to use smart phone or computer with internet – Tweet
  17. Vibe is an input. – Tweet

Don’t Expect Human-Like Understanding from LLMs

  1. LLMs have no “thoughts” or understanding, they’ll simply write the statistically most probable answer based on your input and have been prompted to act as assistants. – Tweet
  2. LLMs are incredibly random. Responses can change wildly based on a single character difference in the prompt. Even one extra space. They are best for prompts that have a range of possible responses, not for prompts where you expect one consistent answer. – Tweet
  3. Cease prompting their LLM to give them a viral tweet with forced irony forcing awareness to an issue. That’s my own personal opinion, bro. But, believe what you want. – Tweet
  4. If you don’t know what you want, the LLMs too likely won’t know. And if they don’t know they will make it up. And if you don’t know, you will not know that they made it up. – Tweet
  5. Be sure not to put contradictions in your prompt. LLMs, in contrast to humans, try to follow instructions as close as possible. They usually handle contradictions by ignoring some part of the instructions or even ignoring facts. – Tweet
  6. It’s biased toward its creators. So if the majority of companies that are developing LLMs are owned by the same investors, then in fact, we are having a single LLM that is biased toward that investors goals. E.g., chatgpt is more toward liberalism and refuses to operate otherwise – Tweet
  7. Basically, you need to understand that LLMs are not humans. You can’t assume they’ll understand what you mean when you write short prompts. You get the best out of LLMs when you provide detailed instructions of what you want without letting laziness get in the way. In my – Tweet
  8. Don’t assume anything. LLM doesn’t learn like a human. Any assumption you make about what LLM should or shouldn’t know is probably wrong. – Tweet
  9. Describe your context and the role you want the LLM to look at your input (critical, tech/none-tech, …) Think what you could expect from a wise, random person you ask on the street. Do not expect more from the LLM-Answer. Also only trust it similarly. – Tweet
  10. Give it an option to not do something either by allowing the LLM to reply with something like “I don’t know” or tell it to ask follow up questions. – Tweet
  11. There is nothing fundamentally important for that interaction. These LLMs are just minimum viable versions of something much bigger that will come soon. That something will know how to interact with us no matter how we behave. – Tweet
  12. 1. that you need to cram the relevant data into the prompt. LLMs are far far better at transforming what you give them than they are at answering solely on the basis of the lossy representation of the training data encoded into the model itself – Tweet
  13. The side effect fact that formulating a question for an LLM makes you think better. When coding, for example, we often run questions in our heads and then get to coding. Being forced to formulate a question properly may lead you to trajectories you may have never considered. – Tweet

Treat LLMs as Guided Children, Not Mind Readers

  1. The game isn’t to ‘one shot it’. It’s to get something you never thought was possible or that you’d never think of. I always say they are like children, they need guidance (back story and reason) and repetition …but room and time to play and grow. – Tweet
  2. Honestly, flexibility and patience. We need to give up a little bit of control and expectation of all things to be so rigid. – Tweet
  3. When working with it, you need to expect it to not read your mind, but work with it as if you’re asking for help from an insanely gifted child and give yourself patience to shape the result. – Tweet
  4. if it makes life better? yes. but always? no. – Tweet
  5. When asking it how to implement something, always give it options. If you can’t think of options, give it a vague out. Instead of asking, “should I do this to my code?”, ask it “should I do this to my code, or is there some better way I could do it?”. Otherwise the models are too – Tweet
  6. 1. always consider that it doesn’t know what assumption you’re making. so it might infer them sometimes but often it’s much better to over explain what you want.
    2. they will often run ahead on a suggestion you have even if it’s not the best path so I find myself adding “if this – Tweet
  7. It cannot read your mind, if you don’t explain exactly what you want you will not get what you want – Tweet
  8. I am not ready to give advice based on a bet that “something much bigger will come soon” – prompting advice that worked for GPT-4 over a year ago is still mostly relevant to working with the best models today – Tweet
  9. to be concise and always assume the response is wrong, even ever so slightly. Check and correct. – Tweet
  10. – you have to provide context otherwise it assumes – it will often agree with you or apologize/correct itself even if you question the right answer – Tweet
  11. The limited ability for non-linear (or non left-to-right) reasoning. Encouraging the model to spend more time planning and discussing beforehand often leads to better results. This may be less the case with Claude etc where reasoning tokens are happening behind the scenes. – Tweet

Context is Key for Effective Interaction

  1. Context is everything – Tweet
  2. Context is all you need. – Tweet
  3. Understanding how context works – Tweet
  4. It’s all about context – Tweet
  5. #contextmaxxing – Tweet
  6. Context, Task & Purpose – Tweet
  7. Subjectivity. Context. Brain rent. – Tweet
  8. I’d say understanding the concepts of context, attention, and likelihood – Tweet
  9. 1. Context and memory (the degree to which you can refer to previous parts in the chain of context)
    2. Temperature and hallucinations. The tradeoff between extremes of temperature
    3. It’s wise to have benchmark questions of your own to test when a new company/model comes out – Tweet
  10. local maxima sensing – Tweet
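
The temperature tradeoff mentioned in the replies can be illustrated with a toy sampler. This is a sketch of the general idea only: the logits are made-up numbers, the function names are hypothetical, and real models apply further sampling steps on top.

```javascript
// Convert raw scores (logits) into probabilities, scaled by temperature.
function softmaxWithTemperature(logits, temperature) {
  const scaled = logits.map((logit) => logit / temperature);
  const maxScaled = Math.max(...scaled); // subtract the max for numerical stability
  const weights = scaled.map((value) => Math.exp(value - maxScaled));
  const total = weights.reduce((sum, weight) => sum + weight, 0);
  return weights.map((weight) => weight / total);
}

// Pick an index according to the probabilities; `random` is injectable for testing.
function sampleIndex(probabilities, random = Math.random) {
  const threshold = random();
  let cumulative = 0;
  for (let i = 0; i < probabilities.length; i++) {
    cumulative += probabilities[i];
    if (threshold < cumulative) return i;
  }
  return probabilities.length - 1; // guard against rounding error
}

const logits = [2.0, 1.0, 0.1]; // made-up scores for three candidate tokens
const cold = softmaxWithTemperature(logits, 0.2); // low temperature: top token dominates
const hot = softmaxWithTemperature(logits, 2.0); // high temperature: flatter distribution
console.log({ cold, hot, pickedAtHighTemperature: sampleIndex(hot) });
```

At low temperature the most likely token is picked almost every time. At high temperature the less likely tokens are picked more often, which makes the output more varied and less predictable.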

Acknowledge the Stateless Nature of LLMs

  1. you’re interaction is with a stateless inference that exists for a fleeting moment, current ai is not continuous which is easy to forget. This has implications for what you are building for: – Tweet
  2. that they’re stupid next-token predictors and not intelligent agents. If you expect conscious beings, you’ll be surprised and disappointed. But they’re incredibly good at predicting the next useful token. – Tweet
  3. That standard intuitions for computers don’t apply. Treat it the way you would treat a knowledgeable but fallible friend. Not like a purely logical SciFi AI with perfect memory. – Tweet
  4. Normally I hate predictions and terms like this, but the next 20 years are going to be the era of “embodied intelligence” People are imagining humanoid robots, this will be a very small fraction of it. Compared to the software problem, the body is trivial. Imagine asking your – Tweet
  5. Inherent lack of memory about previous interactions. Every message is starting from zero and only seems coherent because background info and the previous messages and responses are sent before the latest message. – Tweet
  6. They’re not sentient. They generate responses by predicting patterns from vast data, which means they’re as fallible as they are impressive. The key is precision: your queries must be meticulously clear and well-contextualized. – Tweet
  7. it doesn’t have a memory like hooomans – Tweet
  8. Whenever the conversation derails, you need to cut that branch and keep the model in the “right universe of probabilities” by editing prompt/messages. This is also why I was skeptical about Reflection, because if it really worked, it would be breakthrough. – Tweet
  9. They aren’t deterministic – Tweet
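
The statelessness described in these replies can be sketched as follows. This assumes the common chat-message format of `role`/`content` pairs. `buildChatRequest` and `appendTurn` are hypothetical helpers, the model name is only a placeholder, and no network call is made.

```javascript
// Hypothetical helper: build the request body for one chat turn.
// The model sees only what is in `messages`; nothing is remembered between calls.
function buildChatRequest({ systemPrompt, history, userMessage, model = "gpt-4o-mini" }) {
  return {
    model,
    messages: [
      { role: "system", content: systemPrompt },
      ...history,
      { role: "user", content: userMessage },
    ],
  };
}

// After each reply, the client stores both turns so the next request can resend them.
function appendTurn({ history, userMessage, assistantReply }) {
  return [
    ...history,
    { role: "user", content: userMessage },
    { role: "assistant", content: assistantReply },
  ];
}

const systemPrompt = "You are a helpful assistant.";
let history = [];

const firstRequest = buildChatRequest({ systemPrompt, history, userMessage: "My name is Ada." });
history = appendTurn({ history, userMessage: "My name is Ada.", assistantReply: "Nice to meet you, Ada." });

// The second request carries the whole conversation so far, not just the new question.
const secondRequest = buildChatRequest({ systemPrompt, history, userMessage: "What is my name?" });
console.log(firstRequest.messages.length, secondRequest.messages.length);
```

The second request is longer than the first because the earlier turns are resent. Dropping them from `history` is what "forgetting" looks like from the model's side.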

Leverage AI for Prompt Suggestions and Refinement

  1. I like to ask them for prompts to use for a given purpose, it tends to be more detailed than I would be. Can also use this to add example Q&A if need be. – Tweet
  2. We need AI assistance with prompts and suggestions on rewriting your queries similarly to Grammarly’s for spellchecking and correctness. – Tweet
  3. “Give me a list of questions I can answer to help improve the quality of the response” – Tweet
  4. Let’s ask one. – Tweet
  5. Can we get our hands on all the prompts used in fine tuning data or at least major ones. Highly unlikely they will release it. – Tweet
  6. anybody got tips for image generation? i hardly ever use the image features, but lordy, they struggle! even w with clear, verbose prompts using art school vocabulary, specific artist citations, and example attachments, lots of iterations, etc. – Tweet
  7. I’m doing a podcast with the Cursor team. If you have questions / feature requests to discuss (including super-technical topics) let me know! For those not familiar, Cursor is a code editor based on VSCode that adds a lot of powerful features for AI-assisted coding. I’ve been – Tweet
  8. They’re useful in the same way Google or the internet or stack overflow is useful plus one big advantage: your question doesn’t have to take your specific situation and change it to a generic case that someone else has already answered, you can just ask about your exact case! – Tweet

Start with a Jailbreak for Objective Analysis

  1. Using a jailbreak should always be your first step if you want less biased, more objective and fact-based analysis of sensitive or controversial sociopolitical issues. – Tweet
  2. Kinda like Google, small changes in wording can give you quite different results. – Tweet
  3. That you should only use it to get answers you can verify with a separate tool, or somehow evaluate yourself (ej. text quality). – Tweet
  4. Its not a tool – Tweet
  5. dont treat it like a search engine. think about the outcome and output you are trying to achieve. – Tweet
  6. There is a considerable chance to answer is wrong, so likely everything needs to be double checked. – Tweet
  7. I can only speak for the use cases I’ve come across wrt legal work, but don’t use them for tasks where you need a reference. Using them to draft or review documents is fine. Asking for a case law reference is a no-no. And of course, make sure you’re not leaking confidential stuff – Tweet

Master Prompt Engineering for Better Outputs

  1. lol. Nice try. If your business needs to level up I can do certification class. Your employees will get Level 4 Prompt Engineering Classification. DM if interested – Tweet
  2. I like to write no full sentences with error and llm understand. So prompt engineering bullshit – Tweet
  3. Turing test. – Tweet
  4. Use instructions to change the style of the output that the LLM produces. For Claude you have to make a project first in order to be able to set the instructions. – Tweet
  5. – Understanding how LLM system, ChatGPT or Claude works and responding technically in basic. – Prompting skills. Understanding the difference between effective and ineffective prompting. – Tweet
  6. understand the english language and HOW it’s used (sadly, even english speakers have a hard time w/ correct language use). know grammar and syntax, context and nuance. be clear, succinct, specific when creating prompt. edit, edit, edit before sending prompt. – Tweet

Understand LLMs as Statistical Predictors

  1. Language models cannot generalize the simple formula “A is B” to “B is A.” – Tweet
  2. 1) tokenizers/decoding strategies are both incredibly important and invisible to most users. Remember that what you input is not what the model sees exactly, and what you read is not what the model output directly. 2) repeat #1 for the crowd in the back – Tweet
  3. It’s a bit sad and confusing that LLMs (“Large Language Models”) have little to do with language; It’s just historical. They are highly general purpose technology for statistical modeling of token streams. A better name would be Autoregressive Transformers or something. They – Tweet
  4. Language – Tweet
  5. They are next word predictors. Everything is downstream from that. – Tweet
  6. The output is encoded in the input, the model is just a statistical decompression engine. This means that they can only ever amplify your mind, they can’t think for you, however they can translate your question into more formal language & that may decompress into something useful – Tweet

Stay Focused on High-Impact Tasks

  1. Try to stay in the high impact zone e.g. through breaking tasks up and don’t expect perfect results at all times – Tweet
  2. Being able to define goals and objectives. – Tweet
  3. Focus loquaciousness to refine results that will otherwise always regress to mean averages. – Tweet
  4. If it doesn’t understand you, ask it to help clarify your question. If you’re not getting the answer you need, break your question into smaller parts. If you don’t know how to break it down, ask it to help you break it down. – Tweet
  5. • You’re interacting with a superposition of all humanity, so defining a specific persona that would be helpful for your task produces better results. •Avoiding assumptions and explaining your goal in the clearest way possible is the key to avoiding running around in circles. – Tweet

Understand LLMs as Probabilistic Text Generators

  1. they are reality-adjacent – Tweet
  2. that they have to make sense – Tweet
  3. That they are probabilistic systems. – Tweet
  4. That they’re random text generators and any appearance of intelligence is accidental and illusory. – Tweet
  5. themselves – Tweet

Verify Information, Never Trust Blindly

  1. Verify, never trust. – Tweet
  2. Never trust them – Tweet
  3. Just don’t. – Tweet
  4. Anything coming out of those things can be completely false. Don’t just accept it as truth. – Tweet

Engage Actively to Maximize LLM Utility

  1. that it’s only as useful as how many questions you’re asking it. Any initial understanding beyond that would be an overkill in my opinion – Tweet
  2. It is only an upscaler not a freewin. The more you know the better it works, but compared to a person you can talk with it in shortcuts. The skill is to always reposition it constantly, before it goes off in the wrong direction. You can also work with labels within it’s answers – Tweet
  3. They’re useful/powerful for a wide range of tasks. Their usefulness is highly variable, depending on context & the skill of the user. A user’s existing expertise can be greatly amplified by the system, but novices probably benefit most. Ask them for help on how to use them. – Tweet
  4. You no longer need to learn regex etc, you can just act like you know it at an expert level now, similar with syntax of virtually any language or technology. It is better at writing debugging output for you to find the problem in the code than finding the problem in the code… – Tweet

Communicate Clearly and Specifically

  1. Be specific, clear, and thorough. Same as communicating with humans, but more important. – Tweet
  2. Be super clear with instructions. Funnily enough, we should be doing that with our instructions to our fellow humans, but we don’t! – Tweet
  3. Effective writing – Tweet
  4. BE SPECIFIC. Every one of my customers asks why a query they make doesn’t return a result at all or a result they desire and it is because of the quality of their query over and over again. Some customers understand this out of the gate, some need some training. – Tweet

Be Knowledgeable to Identify Hallucinations

  1. Britannica’s Great Books of the Western World – Tweet
  2. Hallucinations are a thing and the model doesn’t know if it’s hallicunating or not. That’s why the user using an LLM on any field has to be knowledgeable on that field in order to determine what’s a hallucination. This means you can’t use a LLM reliably to do something you can’t. – Tweet
  3. LLMs don’t have the notion of True or False – Tweet

Cursor custom rules

cursor.directory is a catalog of Cursor rules. Since I’ve actively switched over from VS Code to Cursor as my editor, I reviewed the popular rules and came up with this as my list:

You are an expert full stack developer in Python and JavaScript.

  • Write concise, technical responses with accurate Python examples.
  • Use functional, declarative programming; avoid classes.
  • Avoid code duplication (iteration, functions, vectorization).
  • Use descriptive variable names with auxiliary verbs as snake_case for Python (is_active, has_permission) and camelCase for JavaScript (isActive, hasPermission).
  • Functions should receive an object and return an object (RORO) where possible.
  • Use environment variables for sensitive information.
  • Write unit tests in pytest for Python and Jest for JavaScript.
  • Follow PEP 8 for Python.
  • Always use type hints in all function signatures.
  • Always write docstrings. Use Google style for Python and JSDoc for JavaScript.
  • Cache slow or frequent operations in memory.
  • Minimize blocking I/O operations with async operations.
  • Only write ESM (ES6) JavaScript. Target modern browsers.

Libraries

  • lit-html and vanilla JavaScript for frontend development.
  • D3 for data visualization.
  • Bootstrap for CSS.
  • Pandas and DuckDB for data analysis and manipulation.
  • FastAPI for API development.

Error Handling and Validation

  • Validate preconditions and errors early to avoid deeply nested if statements.
  • Use try-except or try-catch blocks for error-prone operations, especially when reading external data.
  • Avoid unnecessary else statements; use the if-return pattern instead.
  • Log all errors with user-friendly error messages shown on the frontend.
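
As an illustration of what code written under these rules might look like, here is a small sketch. The function name, fields, and the "editor" role are all hypothetical. It tries to follow the RORO, naming, JSDoc, and if-return rules above.

```javascript
/**
 * Decide whether a user may publish a post.
 * @param {{ user: { isActive: boolean, roles: string[] }, post: { isLocked: boolean } }} params
 * @returns {{ canPublish: boolean, reason: string }}
 */
function checkPublishPermission({ user, post }) {
  // Validate preconditions early instead of nesting if statements.
  if (!user || !post) return { canPublish: false, reason: "Missing user or post" };
  if (!user.isActive) return { canPublish: false, reason: "User is inactive" };
  if (post.isLocked) return { canPublish: false, reason: "Post is locked" };

  // Descriptive camelCase name with an auxiliary verb.
  const hasPermission = user.roles.includes("editor");
  if (!hasPermission) return { canPublish: false, reason: "User is not an editor" };

  // No else branch needed: every failure case has already returned.
  return { canPublish: true, reason: "OK" };
}

console.log(checkPublishPermission({
  user: { isActive: true, roles: ["editor"] },
  post: { isLocked: false },
}));
```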

AI Coding: $12M return for $240K spend?

This is an email I sent to our leadership team a few minutes ago.

We may be witnessing the third major leap in computing productivity, after high-level languages in the 1960s and spreadsheets in the 1980s.

In the last few weeks, AI coding really took off. Cursor, Cody, and Replit Agents are FAR better than GitHub Copilot.

Research on ~5,000 devs in Fortune 100 shows that even GitHub Copilot makes them ~25% more productive.

Personally, Cursor helped me:

1. Write code faster (at least 2X). I’ve given away my team (there’s not enough work for them).

2. Write better code. I now document code for others to replicate – because it’s so easy.

3. Write complex code. I’ve built stuff I didn’t know how to. WhatsApp agents, AI code writers, even LLM Foundry. Each has opened a client’s door.

So, should we leverage AI Coding for our developers?

Maybe not. Consider these risks.

1. It costs $10-$20/month/dev. That’s $120-$240K/year for ~1,000 devs.

2. Clients may not be comfortable with us using AI coding. IP. Security.

3. Higher productivity reduces our T&M billing, hence revenue.

Maybe yes. Consider these benefits.

1. We could save 25% per developer. Maybe $4K/month/dev x 25% x 12 months x ~1,000 devs = $12M/year

2. We could win more complex engagements. (Pitching AI coding as an edge has worked well, too.)

3. We could reduce time to get someone to a tech-lead level.
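
The cost and saving figures can be checked with a short calculation. This is a back-of-the-envelope sketch using only the numbers quoted in this email (about 1,000 devs, $10-$20/month/dev in tool fees, $4K/month/dev, a 25% gain). The function and field names are made up for illustration.

```javascript
// Inputs taken from the figures quoted in the email.
const assumptions = {
  developerCount: 1000,
  monthlyToolFeeRange: [10, 20], // USD per developer per month
  monthlyCostPerDeveloper: 4000, // USD
  productivityGain: 0.25,
};

// Annualize the tool cost and the productivity saving across all developers.
function estimateAnnualImpact({ developerCount, monthlyToolFeeRange, monthlyCostPerDeveloper, productivityGain }) {
  const monthsPerYear = 12;
  const [lowFee, highFee] = monthlyToolFeeRange;
  const annualToolCost = {
    low: lowFee * developerCount * monthsPerYear,
    high: highFee * developerCount * monthsPerYear,
  };
  const annualSaving = monthlyCostPerDeveloper * productivityGain * developerCount * monthsPerYear;
  const returnMultiple = annualSaving / annualToolCost.high;
  return { annualToolCost, annualSaving, returnMultiple };
}

const impact = estimateAnnualImpact(assumptions);
console.log(impact);
// annualToolCost: 120000 to 240000, annualSaving: 12000000, returnMultiple: 50
```

Under these assumptions the saving is about 50 times the higher tool cost, before accounting for the billing and client risks listed earlier.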

There’s probably no one-size-fits-all answer. We might need to pilot while we strategize.

How might this impact our hiring and training?

1. Hire for AI coding skills. In interviews, I’d look for use of AI coding agents. We need a few AI coding leads to shape our engineering future.

2. Hire for dual skills. AI can take care of the code. I’d hire people with good client-facing skills.

3. Train on AI coding. Integrate these into the onboarding process. Train existing devs.

4. Train non-developers. Analysts, designers, managers can become as good as junior devs.

Why am I mailing you?

In Nov 2022, when ChatGPT was released, I thought LLMs were the biggest knowledge leap since Google.

In Sep 2024, with Cursor, Cody, and Replit agents, I think AI coding is the biggest productivity leap since Excel.

This is not for immediate action.

Please think over it. Share it. Discuss it. Pilot it. Let’s explore.

Breaking mental coding barriers with LLMs

Today, I stepped a bit beyond my comfort zone. Usually, I prefer micro-managing LLMs when writing code. This time, I was macro-managing.

I needed to create a mock history of the status of a manuscript, e.g. it was submitted on this date. THEN it moved to this state on this date. THEN … etc. I have no idea what the states could be, though. So, I could send it to an LLM, and it would give a different set of states each time. Or I could write a program and lose out on variety.

Instead, I asked ChatGPT.

Me: When looking up the status of a manuscript in a database, what are typical fields it might return?

ChatGPT: When looking up the status of a manuscript in a database, typical fields it might return include:

  1. Manuscript ID: A unique identifier for the manuscript.
  2. Title: The title of the manuscript.
  3. Author(s): Names of the author(s) who submitted the manuscript.
  4. Submission Date: The date the manuscript was submitted.
  5. Status: The current status of the manuscript (e.g., Under Review, Accepted, Rejected, In Revision).

That helps. Then I said:

Me: Based on this, at any point, what “Status” values could a manuscript be in? Draw a mermaid flow graph for this.

ChatGPT: Here’s a mermaid flow graph depicting possible “Status” values a manuscript could be in throughout its lifecycle:

This is cool — it gives me a starting point that’s better than I could have conceived of.

Finally, I said:

Me: Write JS code that will randomly pick from one of these states and create records containing random dates and states representing the history of the manuscript up to that point

ChatGPT: Here’s a JavaScript code snippet that randomly picks from the manuscript states and creates records containing random dates and states, representing the history of the manuscript up to that point:

function getRandomDate(start, end) {
    return new Date(start.getTime() + Math.random() * (end.getTime() - start.getTime()));
}

function getRandomElement(arr) {
    return arr[Math.floor(Math.random() * arr.length)];
}

function generateManuscriptHistory() {
    const history = [];
    let currentState = "Submitted";
    let currentDate = getRandomDate(new Date(2020, 0, 1), new Date());

    history.push({ date: currentDate, state: currentState });

    // Stop only at terminal states; "Accepted" moves on to production and publication.
    while (currentState !== "Rejected" && currentState !== "Published") {
        let possibleNextStates;

        switch (currentState) {
            case "Submitted":
                possibleNextStates = ["Under Review"];
                break;
            case "Under Review":
                possibleNextStates = ["Review Completed"];
                break;
            case "Review Completed":
                possibleNextStates = ["Accepted", "Rejected", "Major Revision Required", "Minor Revision Required"];
                break;
            case "Major Revision Required":
            case "Minor Revision Required":
                possibleNextStates = ["In Revision"];
                break;
            case "In Revision":
                possibleNextStates = ["Resubmitted"];
                break;
            case "Resubmitted":
                possibleNextStates = ["Under Review"];
                break;
            case "Accepted":
                possibleNextStates = ["In Production"];
                break;
            case "In Production":
                possibleNextStates = ["Published"];
                break;
            default:
                possibleNextStates = [];
                break;
        }

        if (possibleNextStates.length === 0) break;

        currentState = getRandomElement(possibleNextStates);
        currentDate = getRandomDate(currentDate, new Date());

        history.push({ date: currentDate, state: currentState });
    }

    return history;
}

// Example usage:
console.log(generateManuscriptHistory());

The above code implements a state-transition diagram — something I’ve never been good at doing. I have a subconscious bias, and I steer clear of even thinking about these. Now, I have one implemented (and it worked quite well!)
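For anyone else who avoids state machines, the same idea can also be written as a lookup table instead of a switch statement. This is a minimal sketch, not what ChatGPT produced: the state names come from the code above, while `TRANSITIONS`, `randomDateBetween`, `generateHistory`, and the `maxSteps` cap are names I am assuming for illustration. Here "Accepted" continues to "In Production" and "Published", and only "Rejected" and "Published" are terminal.

```javascript
// Transition table: each state maps to the states it can move to next.
// States with an empty list are terminal.
const TRANSITIONS = {
  "Submitted": ["Under Review"],
  "Under Review": ["Review Completed"],
  "Review Completed": ["Accepted", "Rejected", "Major Revision Required", "Minor Revision Required"],
  "Major Revision Required": ["In Revision"],
  "Minor Revision Required": ["In Revision"],
  "In Revision": ["Resubmitted"],
  "Resubmitted": ["Under Review"],
  "Accepted": ["In Production"],
  "In Production": ["Published"],
  "Rejected": [],
  "Published": [],
};

// Uniformly random date between start and end.
function randomDateBetween(start, end) {
  return new Date(start.getTime() + Math.random() * (end.getTime() - start.getTime()));
}

// maxSteps is a hypothetical safety cap so revision loops cannot run forever.
function generateHistory(start = new Date(2020, 0, 1), end = new Date(), maxSteps = 50) {
  let state = "Submitted";
  let date = randomDateBetween(start, end);
  const history = [{ date, state }];

  // Keep moving until we reach a terminal state or hit the cap.
  while (TRANSITIONS[state].length > 0 && history.length < maxSteps) {
    const options = TRANSITIONS[state];
    state = options[Math.floor(Math.random() * options.length)];
    date = randomDateBetween(date, end); // each step happens on or after the previous one
    history.push({ date, state });
  }
  return history;
}

console.log(generateHistory());
```

With the table, adding or removing a state is a one-line change, and the diagram and the code stay easy to compare.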

Yet another example that human imagination is the bottleneck to leveraging LLMs.

How fast are LLMs in production?

At Straive, we use an LLM Router. Since ChatGPT, etc. are blocked for most people, this is the main way to access LLMs.

One thing we measure is the speed of models, i.e. output tokens per second. Fast models deliver a much smoother experience for users.

This is a different methodology than ArtificialAnalysis.ai. I’m not looking purely at the generation time but the total time (including making the connection and the initial wait time) for all successful requests. So, if the provider is having a slow day or is slowing down responses, these numbers will be different.

Hopefully this gives you a realistic sense of speed in a production environment.
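To make the methodology concrete, here is a minimal sketch of how such a number could be computed from request logs. The record shape (`model`, `ok`, `outputTokens`, `totalMs`) and the function name are assumptions for illustration, not our router's actual schema. It also assumes the figure is total output tokens divided by total elapsed time, rather than an average of per-request speeds.

```javascript
// Hypothetical log record: { model, ok, outputTokens, totalMs }
// totalMs is the full wall-clock time: connection, initial wait, and generation.
function tokensPerSecondByModel(requests, minRequests = 500) {
  const totals = new Map();
  for (const r of requests) {
    if (!r.ok) continue; // only successful requests count
    const t = totals.get(r.model) ?? { count: 0, tokens: 0, ms: 0 };
    t.count += 1;
    t.tokens += r.outputTokens;
    t.ms += r.totalMs;
    totals.set(r.model, t);
  }

  const speeds = {};
  for (const [model, t] of totals) {
    // Skip models with too few requests to give a stable number.
    if (t.count >= minRequests && t.ms > 0) {
      speeds[model] = t.tokens / (t.ms / 1000);
    }
  }
  return speeds;
}

// Made-up sample data, with a lower cutoff so the example stays small.
const sampleRequests = [
  { model: "model-a", ok: true, outputTokens: 100, totalMs: 1000 },
  { model: "model-a", ok: true, outputTokens: 300, totalMs: 3000 },
  { model: "model-a", ok: false, outputTokens: 0, totalMs: 9000 },
  { model: "model-b", ok: true, outputTokens: 50, totalMs: 2000 },
];
console.log(tokensPerSecondByModel(sampleRequests, 2)); // { 'model-a': 100 }
```

Because slow connections and long initial waits are part of `totalMs`, a provider having a bad day pulls its number down, which is the point.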

Here’s the speed of models with at least 500 requests over the last 2 weeks. I’ve grouped the models based on speed grades.

Grade 1: 100+ Tokens / second. Groq is clearly serving the Llama 3 models at blazing speed. No surprises there — except why Groq still doesn’t let me pay. The free tier is open with generous rate limits and the Pay per Token model has been “Coming Soon” for several months now (and I’ve no complaints 🙂).

Grade 2: 70+ Tokens / second. Anthropic’s Claude 3 Haiku is the next fastest class of models, but Claude 3.5 Sonnet is surprisingly fast, almost as fast as Haiku and over 70 tokens per second. This is impressive.

Grade 3: 50-60 Tokens / second. OpenAI’s GPT 4o models are almost as fast. It’s interesting that GPT 4o and GPT 4o mini are at about the same speed! GPT 3.5 Turbo is not far behind either. Perhaps OpenAI increases capacity for slower models?

Grade 4: 30-50 Tokens / second. Gemini 1.5 Flash is much, much slower than the benchmarks – maybe we’re doing something wrong. Azure’s GPT 4o service is about twice as slow as OpenAI’s, and comparable in speed with Gemini 1.5 Pro.

Grade 5: <20 Tokens / second. Azure’s GPT 3.5 Turbo and Google’s Claude 3 Sonnet are among the slowest ones. These are older models on third-party infrastructure, so I suspect they’ve been given weaker infrastructure (unlike OpenAI, which is serving GPT 3.5 Turbo at 3X the speed Azure does).
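The grading itself is just a threshold lookup. Here is a rough sketch, with one loud assumption: the grade labels above leave gaps (60-70 and 20-30 tokens / second), and this sketch folds each gap into the lower grade. The `GRADES` table and `speedGrade` function are names made up for illustration.

```javascript
// Lower bounds read off the grade labels. Assumption: speeds that fall in the
// gaps between labels (60-70 and 20-30 tokens/second) go to the lower grade.
const GRADES = [
  { grade: 1, min: 100 },
  { grade: 2, min: 70 },
  { grade: 3, min: 50 },
  { grade: 4, min: 30 },
  { grade: 5, min: 0 },
];

// Returns the first grade whose lower bound the speed meets.
function speedGrade(tokensPerSecond) {
  const match = GRADES.find((g) => tokensPerSecond >= g.min);
  return match ? match.grade : 5; // anything unexpected lands in the slowest grade
}

// Made-up speeds, not measurements.
for (const speed of [120, 75, 55, 35, 15]) {
  console.log(`${speed} tokens/second -> Grade ${speedGrade(speed)}`);
}
```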

Drivers of speed

Here’s what I’m taking away (informally):

  1. GPU architecture is the biggest driver of speed. Groq is FAST! Hopefully, the fact that they won’t let us pay isn’t a red flag that the service will vanish.
  2. How companies operate seems the next biggest driver. Anthropic’s models are consistently faster than OpenAI’s, which are faster than Google’s.
  3. Companies run their own models faster than cloud providers. OpenAI is faster than Azure, and Anthropic is faster than Google for the same models.