Links

Introducing Students to AI Evaluators

In my Tools in Data Science course at IITM, I’m introducing a project that will be evaluated by an LLM.

Here’s the work-in-progress draft of the project. It will eventually appear here.

Your task is to:

  1. Write a Python script that uses an LLM to analyze, visualize, and narrate a story from a dataset.
  2. Convince an LLM that your script and output are of high quality.

The second point is the interesting one. Using the LLM as the evaluator.

Why are you doing this? There are over 1,000 students in each term. Manual evaluation is not an option. Multiple choice questions are guessable. Programmatic evaluation takes effort to modify each time. LLMs seem like a good way to get intelligent evaluations with manageable effort.

Will students accept it? My guess is yes. The whole objective of the project is to convince the LLM. It’s not evaluating you. You’re tricking it into giving you marks. Sort of like getting an LLM to say Yes. A lot of educational and corporate evaluations will soon be done by LLMs. I may as well teach students how to game the system early.

What if it makes mistakes? I hope it will and we’ll learn from it. The students will have the opportunity to test out (and get used to) the randomness in subjective evaluations. I’ll have the opportunity to learn how to reduce these mistakes next time.

What else will YOU learn? I’m very curious about a whole bunch of things.

  1. How will students prompt it? What analysis will they apply before passing data to an LLM? Will they ask open-ended or guided questions?
  2. How will they orchestrate the flow? Will they use a linear flow or non-linear? Will it be deterministic or not?
  3. How will they hack it? LLMs are very amenable to prompt injection. How will the students try and break out of my prompts?
  4. How will LLMs react to all this? Where will they do a good job? Where will they fail? Which LLMs would work well in which case?

Since the code (and analysis) will be published on GitHub, I’ll share the links. It might prove an interesting dataset in itself for future analysis.

ChatGPT Beat me at Pictionary

Me: Let’s play pictionary. You draw. I’ll guess.

ChatGPT: Sure! I’ll draw something for you. Give me a moment.

A simple drawing of a recognizable object, like a house, a tree, or a balloon, in a cartoonish style. The object should be clear and easy to guess. Minimal background details to keep it simple.

ChatGPT: Here you go! What do you think it is?

Me: House

ChatGPT: Got it in one! Your turn to guess. Ready for another?

Me: Yes

A simple drawing of a recognizable object, like a boat, a car, or a kite, in a cartoonish style. The object should be clear and easy to guess. Minimal background details to keep it simple.

ChatGPT: What do you think this one is?

Me: Boat

ChatGPT: Correct again! You’re on fire. Want to guess another?

Me: Yes

A simple drawing of a recognizable object, like a sun, a cloud, or a tree, in a cartoonish style. The object should be clear and easy to guess. Minimal background details to keep it simple.

ChatGPT: Here’s the next one. What’s your guess?

Me: Tree

ChatGPT: Exactly! You’re unstoppable. Ready for more?

Me: Make it harder

A simple drawing of a recognizable object, like a boat, a car, or a kite, in a cartoonish style. The object should be clear and easy to guess. Minimal background details to keep it simple.

ChatGPT: What do you think this one is?

Me: Windmill

ChatGPT: Got it! You’re too good at this. Want an even trickier one?

Me: Yes

A simple drawing of an uncommon but recognizable object, like a unicycle, a lighthouse, or a teapot, in a cartoonish style. The object should be detailed enough to identify but still minimalistic with minimal background.

ChatGPT: Alright, here’s a tougher one. What’s your guess?

Me: Teapot

ChatGPT: Nailed it again! You’re practically a Pictionary pro. Want me to up the difficulty even more?

Me: Yes

A simple drawing of a very uncommon object, like an antique key, a gramophone, or a sundial, in a cartoonish style. The object should still be recognizable but with intricate details to make it challenging. Minimal background elements.

ChatGPT: Here’s a tricky one. What do you think it is?

Me: Gramophone

ChatGPT: Spot on! You’re unstoppable. Ready for the ultimate challenge?

Me: Yes

A simple drawing of an extremely rare and challenging object to guess, like a sextant, an astrolabe, or a vintage typewriter, in a cartoonish style. The object should be detailed enough to be identifiable but still maintain simplicity. Minimal background elements to keep focus on the object.

ChatGPT: Here’s the ultimate challenge. What’s your guess?

Me: Sextant

ChatGPT: Bullseye! You’ve officially conquered Pictionary. Want to keep going or try a new game?

Me: Make it harder.

A simple drawing of an extremely rare and challenging object to guess, like a sextant, an astrolabe, or a vintage typewriter, in a cartoonish style. The object should be detailed enough to be identifiable but still maintain simplicity. Minimal background elements to keep focus on the object.

ChatGPT: Here’s the next level of difficulty. What do you think it is?

Me: Compass?

ChatGPT: Close, but not quite. Try again!

Me: No idea

See the original conversation (including the answer).


In my own small way, I now know how Garry Kasparov felt.


What’s interesting is the progression of prompts it gives Dall-E to generate the images.

It begins with “A simple drawing of a recognizable object, like a house, a tree, or a balloon, in a cartoonish style. The object should be clear and easy to guess. Minimal background details to keep it simple.” — which were guessable.

When I asked it to make it harder, it didn’t do so immediately. It just asked for “A simple drawing of a recognizable object, like a boat, a car, or a kite, in a cartoonish style. The object should be clear and easy to guess. Minimal background details to keep it simple” — which was guessable too.

But after one round, it took the initiative to ask, “Want an even trickier one?” and prompted “A simple drawing of an uncommon but recognizable object, like a unicycle, a lighthouse, or a teapot, in a cartoonish style. The object should be detailed enough to identify but still minimalistic with minimal background.” Was it waiting one round to see if I really wanted a harder version?

Then it asked, “Want me to up the difficulty even more?” before prompting, “A simple drawing of a very uncommon object, like an antique key, a gramophone, or a sundial, in a cartoonish style. The object should still be recognizable but with intricate details to make it challenging. Minimal background elements.”

When I asked it to “Make it harder” again, it went on directly to “A simple drawing of an extremely rare and challenging object to guess, like a sextant, an astrolabe, or a vintage typewriter, in a cartoonish style. The object should be detailed enough to be identifiable but still maintain simplicity. Minimal background elements to keep focus on the object.” and then “A cartoonish drawing of an abstract or rare object, like an ancient navigational device, a peculiar scientific instrument, or a mythical artifact, with intricate yet recognizable features. Minimal background elements to keep the focus on the object.”

That defeated me.

Should courses be hard or easy?

Here’s a post I shared with the students of my Tools in Data Science course at IITM. This was in response to a student posting that:

The design of TDS course lecture videos are designed in such a way that it could be understood only by the data scientists not by the students like me who are entirely new to the field of data science. Though I have gone through 6 weeks of course lecture videos, I am not fully aware of the usage of ChromeDevTools, Bash, Github etc….


IITM Term 1: German. In my first term at IIT Madras (1992), I took German 1 with Prof D Subramanian.

The first words D.Subs said when he entered the room were, “Wer sind Sie?”

I had no clue what he was talking about. Nor did the others. After he had asked about a dozen students individually, Ashok Krishna replied, “Ich bin Ashok.” (He knew German.)

The rest of the term proceeded in the same vein. He threw us in the deep end, spoke only German, and I ended up with a “D” and learning very little German. Ashok Krishna thrived in that course.

IITM Term 2: German. I took German 2 with Mrs Schindler in the next term. The experience was very different. She explained each word. She had us listen to tapes in German. It was very relaxed.

I ended up with a “B” and learning very little German. (I’m not good with human languages.) But many others enjoyed it.

Which is better? I’m not sure. I think gentle courses are good for beginners and tough ones for advanced students, but classes usually have a mix of both. Aptitude and interest help, too.

IITM Term 1: Physics. Prof. PC Deshmukh handled our first physics course. (He was the one responsible for renaming Mardi Gras “Saarang”.) Our class felt the exams were too tough. My friend Badri and I explained this to him. “Well, you two should be lobbying for even tougher exams,” PCD said. “That’ll benefit smarter guys like you more.”

IIMB Term 1: Statistics. My classmate Athreya was a class representative for statistics (a tough subject.) Going against tradition, he lobbied for a tougher test. He was almost lynched afterwards.

Which is better? I’m not sure. An easy exam is good for student grades and morale. A tough exam is good for college reputation and advanced students. Classes usually need both.


In short, I have no clue.

Students who take this course learn several topics in 3 months.

Faculty like me haven’t managed to learn how to calibrate course content or exams even after 3 years.

But, I’m learning. Thanks to you.

So please keep the feedback coming. We’ll keep improving. And thanks for putting up with us!

Recrafting Comicgen

About 7 years ago, Richie Lionell and Ramya Mylavarapu and a few others created Comicgen – an automated comic generation app personified by Dee and Dey.

Ever since, we’d been exploring whether AI could replace it, and help non-designers draw comics.

Today, that became a reality for me with Recraft.ai.

Here is a picture of the original Dee.

And a picture of the Dee crafted by Recraft.

The prompt was:

A simple line drawing of a woman with curly hair, wearing glasses, a short-sleeved white t-shirt, and black trousers. She’s standing with her hands in her pockets, and has a slightly smiling expression. Her hair is quite voluminous and textured. The style is cartoonish and slightly sketchy, with uneven lines

(The prompt itself was generated by Gemini 1.5 Flash by passing it the original Dee’s picture.)

We are finally at the stage where comic generation is truly available for the masses – at 8 cents via the API.

How to recruit based on IIT JEE Rank vs GPA

Preserving this post by Daniel George showing the IIT Bombay 2014 GPA vs JEE Rank on a log scale.

What I found interesting was:

  • A higher JEE rank generally means you won’t score too low, but you needn’t score too high.
  • The lower the ranker (i.e. the larger the JEE rank number), the greater the spread of GPA.
  • A high GPA can come from any rank (8+ GPA is uniformly distributed across ranks), but a low GPA is generally only from the lower rankers (6- GPA is mostly from 500+ rank.)

So, it’s better to recruit based on GPA rather than JEE rank, unless you’re going after the very best students (where it makes less difference.)

    Leaning into the power of AI coding

    Yesterday (15 Oct 2024), I used Cursor to code more than I ever have. (Doing’s how we learn, I guess. Not just reading.)

    Date          Usage
    05 Oct 2024      15
    06 Oct 2024      27
    07 Oct 2024      87
    08 Oct 2024      16
    09 Oct 2024      10
    10 Oct 2024      42
    11 Oct 2024      24
    12 Oct 2024      57
    13 Oct 2024      15
    14 Oct 2024      28
    15 Oct 2024     186

    This was mainly to create and publish 2 libraries on npm over 6 hours:

    1. asyncsse – which converts a Server-Sent Event stream into an async iterator that I can use in a for await … of loop
    2. asyncllm – which standardizes the Server-Sent Events streamed by the popular LLMs into an easy to use form.
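To make the idea concrete, here is a minimal sketch of turning a Server-Sent Event stream into an async iterator. It is not the asyncsse source: `sseEvents`, `parseEvent`, and `demoChunks` are hypothetical names, and the input is assumed to be any async iterable of already-decoded text chunks.

```javascript
// Hypothetical sketch, NOT the asyncsse implementation.
// Parse one SSE event block (the text between two blank lines) into an object.
function parseEvent(block) {
  const event = {};
  for (const line of block.split(/\r?\n/)) {
    if (line === "" || line.startsWith(":")) continue; // lines starting with ":" are comments
    const i = line.indexOf(":"); // only the FIRST colon separates field from value
    const field = i === -1 ? line : line.slice(0, i); // no colon: whole line is the field name
    let value = i === -1 ? "" : line.slice(i + 1);
    if (value.startsWith(" ")) value = value.slice(1); // drop one optional leading space
    // Repeated "data" lines are joined with a newline
    event[field] = field === "data" && "data" in event ? event.data + "\n" + value : value;
  }
  return Object.keys(event).length ? event : null;
}

// Yield events as they complete. An incomplete event at end of stream is dropped.
async function* sseEvents(chunks) {
  let buffer = "";
  for await (const chunk of chunks) {
    buffer += chunk;
    const blocks = buffer.split(/\r?\n\r?\n/); // a blank line ends an event
    buffer = blocks.pop(); // the last piece may be a partial event
    for (const block of blocks) {
      const event = parseEvent(block);
      if (event) yield event;
    }
  }
}

// Stand-in for a network stream: the first event arrives split across two chunks.
async function* demoChunks() {
  yield "data: Hel";
  yield 'lo\n\n: a comment\ndata: {"x": "a:b"}\n\n';
}

async function main() {
  for await (const event of sseEvents(demoChunks())) console.log(event);
}
main();
```

The buffer is what lets an event that straddles two network chunks come out whole, which is the main reason to wrap the stream in an iterator rather than parse each chunk on its own.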

    This exercise broke several mental barriers for me.

    Writing in a new language. Deno 2.0 was released recently. I was impressed by the compatibility with npm packages. Plus, it’s a single EXE download that includes a linter, tester, formatter, etc. Like all recent cool fast tools, it’s written in Rust. So I decided to use it for testing. Running deno test runs the entire test suite. My prompts included asking it to:

    • Create a Deno HTTP server to mock requests for the tests. This is cool because a single, simple code chunk runs the server within the test suite.
    • Serve static files from samples/ to move my tests into files

    Writing test cases. Every line of this code was written by Cursor via Claude 3.5 Sonnet. Every line. My prompt was, “Look at the code in @index.js and write test cases for scenarios not yet covered.” It already knew a surprising amount of the SSE spec and anticipated edge cases like:

    • SSE values might have a colon. I learnt for the first time that the limit parameter in String.split() is very different from Python’s str.split. (JavaScript splits on every match, then keeps only the first few pieces and discards the rest. Python packs the remainder into the last piece.) This helped me find a major bug.
    • SSE has comments. Empty keys are treated as strings. Didn’t know this.
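The String.split() difference is easy to see on a value that contains a colon. This is a small sketch, not code from the library: `splitOnce` is a hypothetical helper showing one way to keep everything after the first separator.

```javascript
const line = 'data: {"time": "10:30"}';

// JavaScript: split on EVERY colon, then keep only the first 2 pieces.
const lossy = line.split(":", 2);
// lossy is ["data", ' {"time"'] and the rest of the value is silently dropped.
// Python's line.split(":", 1) would instead return ['data', ' {"time": "10:30"}'].

// Hypothetical helper: split at the first separator only, keeping the remainder intact.
function splitOnce(text, separator) {
  const i = text.indexOf(separator);
  return i === -1 ? [text] : [text.slice(0, i), text.slice(i + separator.length)];
}

const [field, value] = splitOnce(line, ":");
console.log(lossy, field, value);
```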

    I was able to use it to generate test cases based on content as well. “Based on @index.js and @openai.txt write a test case that verifies the functionality” created the entire test case for OpenAI responses. (I did have to edit it because LLMs don’t count very well, but it was minimal.)

    Bridging test coverage gaps. The prompt that gave me the most delightful result was “Are there any scenarios in @index.js not tested by @test.js?” It did a great job of highlighting that I hadn’t covered Groq, Azure, or CloudFlare AI workers (though they were mentioned in the comments), error handling, empty/null values in some cases, or multiple tool calls. I had it generate mock test data for some of these and added the tests.

    Enhancing knowledge with references. I passed Cursor the SSE documentation via @https://developer.mozilla.org/en-US/docs/Web/API/Server-sent_events/Using_server-sent_events and asked it to find more scenarios my code at @index.js had not covered. This found a number of new issues.

    Generating bindings. I avoid TypeScript because I don’t know it. Plus, it requires a compilation step for the browser. But TypeScript bindings are helpful. So I prompted Cursor, using the Composer (which can create new files), to “Create TypeScript bindings for @index.js in index.d.ts” – which it did almost perfectly.

    Check for errors. I typed “Check this file for errors” on @index.d.ts. I don’t know enough to figure this out. It went through the description and said everything seems fine. But I saw a TypeScript plugin error that said, “Property 'data' of type 'string | undefined' is not assignable to 'string' index type 'string'.ts(2411)”. When prompted, it spotted the issue. (The earlier code assumed all properties are strings. But some can be undefined too. It fixed it.)

    Documentation. At first, I asked the Composer to “Create a README.md suitable for a world-class professional open source npm package” and it did a pretty good job. I just needed to update the repository name. I further prompted it to “Modify README based on @index.js and share examples from @test.js” on asyncllm, which did an excellent job.

    Code review. I asked it to “Review this code. Suggest possible improvements for simplicity, future-proofing, robustness, and efficiency” and it shared a few very effective improvements.

    1. Regex lookaheads for efficient regular expression splitting, i.e. use buffer.split(/(?=\r?\n\r?\n)/) instead of buffer.split(/(\r?\n\r?\n)/) — and though I haven’t tested this, it looked cool.
    2. Restructuring complex if-else code into elegant parsers that made my code a lot more modular.
    3. Error handling. It added try {} catch {} blocks in a few places to catch errors I don’t anticipate. They don’t hurt.
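The two split patterns in the first suggestion behave differently, which a short sketch makes visible. The sample strings are made up for illustration. The capturing group returns the separators as their own array items, while the lookahead splits just before each separator and leaves it attached to the next piece.

```javascript
const buffer = "event: a\n\nevent: b\n\nevent: c";

// Capturing group: separators come back as separate array items.
const withCapture = buffer.split(/(\r?\n\r?\n)/);
// ["event: a", "\n\n", "event: b", "\n\n", "event: c"]

// Lookahead: zero-width split before each separator, which stays on the next piece.
const withLookahead = buffer.split(/(?=\r?\n\r?\n)/);
// ["event: a", "\n\nevent: b", "\n\nevent: c"]

// Caveat: with CRLF line endings the lookahead also matches one character later
// (at the "\n" of the first "\r\n"), so the separator itself gets cut in two.
const crlfPieces = "a\r\n\r\nb".split(/(?=\r?\n\r?\n)/);
// ["a", "\r", "\n\r\nb"]

console.log(JSON.stringify([withCapture, withLookahead, crlfPieces]));
```

The CRLF case suggests the lookahead version deserves a test before it replaces the original.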

    Code simplification. Several times, I passed it a code snippet, saying just “Simplify”. Here’s an example:

    const events = [];
    for await (const event of asyncLLM(...)) {
      events.push(event);
    }

    This can be simplified to

    const events = await Array.fromAsync(asyncLLM(...))

    Packaging. I copied a package.json from an earlier file and asked it to “Modify package.json, notable keywords and files and scripts based on @index.js”, which it did a perfect job of.

    Blogging. I wrote this blog post with the help of the chat history on Cursor. Normally, such blog posts take me 3-4 hours. This one took 45 minutes. I just had to pick and choose from history. (I lost a few because I renamed directories. I’ll be careful not to do that going forward.)


    Overall, it was a day of great learning. Not in the classroom sense of “Here’s something I didn’t know before”, but rather the cycling / swimming sense of “Here’s something I now know to do.”

    Challenge: code in 10 minutes with only an LLM

    I gave a bonus assignment in LLM coding to ~1,000 students at the Tools in Data Science course at IITM.

    Here is an OPTIONAL project: Record a 10-minute video in which you create an application entirely using LLMs and deploy it.

    Any app is fine. Any language. Simple or complex. Business or gaming. Anything is fine. Your choice.
    Create the app only using LLMs. You can use an LLM (ChatGPT, Claude.ai, Gemini, Cursor, Cody, etc.) but you can only prompt the app to write code. You can copy-paste code and run code, but don’t write or edit even a single line of code directly. Use LLMs to debug and edit. Code completion is NOT allowed – only prompting/chatting.
    Record the entire process in 10 min. Don’t edit, trim, enhance, or annotate the video. You should record yourself creating the entire app from start to finish. Practice beforehand if you like. Record in 1 take.
    Share the video and app. Publish the video publicly anywhere (e.g. YouTube) and share the link. Publish the app publicly anywhere (e.g. GitHub pages, Glitch.me, Heroku, etc.) or upload a ZIP file with the code (for slightly lower marks).
    Submit via a reply to this thread. Multiple submissions per person are fine. Work in groups if you like but only the submitter gets marks.

    I will award up to 1 bonus mark at my discretion based on:

    • How well you prompt the LLM
    • How impressive the app is (if you’ve hosted it – I probably won’t run your code)
    • How closely you followed the rules above

    This exercise is to help you (and me) learn a topic that’ll probably change the way we all code: using LLMs to code.

    Cutoff date: 7 Oct 2024, AoE

    Adoption was low but in line with the industry.

    About 50 students (around 5% of the batch) attempted this. In contrast, ~70-80% take the (mostly) mandatory graded assignments.

    This is comparable with what I see at Straive. When given the option, about 5% of Straive’s 20,000 people use LLMs in a given week. (There are many things different there. I’m tracking LLM use, not LLM coding. It’s a work environment, not a learning one. There’s no bonus mark awarded. But still, I see the “around 5%” number popping up often.)

    Games were the most popular category, mainly Tic Tac Toe and Snake Game.

    This is understandable. They’re easy to think of, implement, and use.

    1. Candy Crush – Video, App / Code
    2. Catch the Ball – Video, App / Code
    3. Flappy Bird – Video, App / Code
    4. Flappy Bird – Video, App / Code
    5. Magic Square – Video, App / Code
    6. Memory Match – Video, App / Code
    7. Memory Match – Video, App / Code
    8. Minesweeper – Video, App / Code
    9. Minesweeper – Video, App / Code
    10. N-Queens – Video, App / Code
    11. Number Guessing Game – Video, App / Code
    12. Open Mines game – Video, App / Code
    13. Rock-Paper-Scissors – Video, App / Code
    14. Rock-Paper-Scissors – Video, App / Code
    15. Sliding Game – Video, App / Code
    16. Snake Game – Video, App / Code
    17. Snake Game – Video, App / Code
    18. Snake Game – Video, App / Code
    19. Snake Game – Video
    20. Snake Game – Video, App / Code
    21. Snake Game + Pomodoro – Video, App / Code
    22. Sudoku – Video, App / Code
    23. Sudoku – Video, App / Code
    24. Tic Tac Toe – Video, App / Code
    25. Tic Tac Toe – Video, App / Code
    26. Tic Tac Toe – Video, App / Code
    27. Tic Tac Toe – Video, App / Code
    28. Tic Tac Toe – Video, App / Code
    29. Tic Tac Toe – Video, App / Code
    30. Tile Matching Game – Video, App / Code
    31. Word scramble game – Video, App / Code
    32. Wordle – Video, App / Code

    Productivity Apps / Tools were the next most common. Calculators, Timers, etc.

    Again, understandable. They’re easy to think of, implement, and use.

    1. Age Calculator – Video, App / Code
    2. Age Calculator – Video, App / Code
    3. Air Mile Calculator – Video, App / Code
    4. Birth Day Calculator – Video, App / Code
    5. BMI Calculator – Video, App / Code
    6. BMI Calculator – Video, App / Code
    7. Height Weight Calculator – Video, App / Code
    8. Music playlist – Video
    9. Post-it Notes – Video, App / Code
    10. Timer – Video, App / Code
    11. Timer – Video, App / Code
    12. Todo App – Video, App / Code
    13. Todo App – Video, App / Code
    14. Todo App – Video, App / Code

    Real-life apps / tools were diverse and interesting.

    This final category of apps were things one might use in real life. They were more ambitious (mostly), more practical (always), and unique (by far).

    This is the 1% that might lead to startup ideas.

    1. Discussion Board – Video, App / Code
    2. Document analysis – Video, App / Code
    3. Dress Designer – Video, App / Code
    4. Image Metadata – Video, App / Code
    5. Inventory management – Video, App / Code
    6. PCOS detector – Video, App / Code
    7. RAG on Streamlit – Video, App / Code
    8. TNEB Power Failure – Video, App / Code

    Things I learned.

    • You can upload files directly into GitHub via the UI (Video)
    • You can run GUI programs in Python on the mobile via Replit’s tkinter (Video)
    • You can get the LLM to generate code for Google Scripts (Video)
    • … and a whole bunch of tricks like adding audio BGM, special powers, emotional prompting, etc.
    • Typically, 5% are early adopters. 1% are GOOD early adopters.

    What do you need to interact effectively with LLMs?

    Simon Willison asked on Twitter:

    What are the most importantly things that people need to understand in order to effectively interact with LLM-based systems like ChatGPT or Claude?

    Here are the replies. (I used text-embedding-3-small to embed and cluster them into 20 clusters and used OpenAI GPT-4o-mini to label the clusters. There are misclassifications but the themes are accurate.)
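The clustering step can be sketched in a few lines. The post does not say which clustering algorithm was used, so k-means here is an assumption. The tiny 2-D `toy` vectors stand in for real embeddings, and `kmeans`, `nearest`, and `distance2` are hypothetical helper names.

```javascript
// Squared Euclidean distance between two equal-length vectors.
function distance2(a, b) {
  return a.reduce((sum, x, i) => sum + (x - b[i]) ** 2, 0);
}

// Index of the centroid closest to the point.
function nearest(point, centroids) {
  let best = 0;
  for (let c = 1; c < centroids.length; c++) {
    if (distance2(point, centroids[c]) < distance2(point, centroids[best])) best = c;
  }
  return best;
}

// Plain k-means (an ASSUMED choice of algorithm) with a deterministic start.
function kmeans(points, k, iterations = 10) {
  // Use evenly spaced points as the initial centroids.
  let centroids = Array.from({ length: k }, (_, c) => [
    ...points[Math.floor((c * points.length) / k)],
  ]);
  let labels = [];
  for (let step = 0; step < iterations; step++) {
    // Assign each point to its nearest centroid.
    labels = points.map((p) => nearest(p, centroids));
    // Move each centroid to the mean of its members (keep it if it has none).
    centroids = centroids.map((old, c) => {
      const members = points.filter((_, i) => labels[i] === c);
      if (members.length === 0) return old;
      return old.map((_, d) => members.reduce((s, m) => s + m[d], 0) / members.length);
    });
  }
  return labels;
}

// Toy stand-ins for embedding vectors: two obvious groups.
const toy = [[0, 0.1], [0.1, 0], [5, 5.1], [5.1, 5], [0.05, 0.05], [5, 5]];
const labels = kmeans(toy, 2);
console.log(labels);
```

With real data, each reply's embedding vector would replace a toy point and k would be 20. The members of each cluster could then be passed to an LLM for a label.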

    Provide Clear Context and Avoid Leading Questions

    1. 1. Provide relevant context but not too much
      2. Models are total “yes men” – be careful not to imply your perspective if you want an objective response
      3. Learn when to iterate vs start a new chat
      4. Provide examples (especially for output structure) – Tweet
    2. 1. Ask questions that the other person can understand.
      2. Ask questions while predicting what the other person will respond. It’s the same as the human’s. – Tweet
    3. 1. Ensure the system knows the relevant context. Give a detailed backstory of what you’re trying to do with it and why.
      2. One thing at a time. Make the task as specific as possible and if there are multiple things that need to be done, ask it to them in their sort of natural – Tweet
    4. The “most importantly things” are probably to ask for step-by-step before answering and to try to not ask leading questions to avoid its sycophancy bias. – Tweet
    5. You must provide a diverse distinct set of examples of you want it to be robust and generalize in real world systems. – Tweet
    6. Always ask for both strengths and weaknesses to get more balanced perspectives, and make sure the model can tell you as many facts as possible before committing itself to an answer. – Tweet
    7. Rule 1:Avoid chatgpt unless they release a better model than Sonnet 3.5. – Tweet
    8. Strongly insist that it shouldn’t passively agree with you. Encourage it to interrupt with clarifying questions that would meaningfully improve the output. – Tweet
    9. Avoid leading questions if you care about the answer. They are way too polite to contradict the user. – Tweet
    10. – It’s not Google, so use full sentences, not just keywords.
      – Iterate on initial response.
      – Trust, but verify. – Tweet
    11. Just talk to them how you’d want someone to talk to you if it was you in there. – Tweet
    12. Provide good (and bad) examples of output, and don’t forget a few edge cases. – Tweet
    13. Keep hitting the ball back and forth across the net: 1. “thanks but I think these are a little too ‘salesy’ — could you try to generate some ideas that are a little more down to earth” 2. “ok, we are getting there, but still a little overheated. could you try again” – Tweet
    14. These are my top 10 for folks new to GenAI: 1. You have to provide all of the context the model needs to answer your question if that context is not likely to appear in the model’s weights. It will take a while to gain an intuition about what types of knowledge is likely to be – Tweet
    15. Suspend disbelief; collaborate not interrogate; trust no-one; have fun, role play, experiment, test; think of as a facet of intelligence built on achievements of ours, not a robo-rival. Notice book-learning over lived experience, cliches & bluffing in human world too, & do better – Tweet
    16. It’s a dialogue. Iterative. incremental. Chat improves with feedback. When chat creates code, for example, run the code and give chat the error messages so that it can correct the code. Before asking chat a question, ask it what it knows. Then zoom in. Gradually. 🙂 – Tweet
    17. One example is worth a thousand words – Tweet
    18. 1. How to read
      2. How to write (optional) – Tweet
    19. 1. Explain yourself clearly, using lots of examples.
      2. Assume you’re talking to a smarter version of yourself that hasn’t heard about your problem yet.
      3. When it doesn’t do well, use the steps above to correct it. – Tweet
    20. 1. The more precise your question or task is, the better and more accurate the response will be. Vague prompts can result in equally vague answers.
      2. Provide relevant background or context, especially for nuanced questions or tasks. – Tweet
    21. 1. Don’t ask them to do too much in one shot, especially if they are unrelated tasks; you’ll get much worse results.
      2. Don’t give too much context if you can avoid it. The huge context windows of the newest models isn’t as “free” as you might think, or rather it’s “lossy”— the – Tweet
    22. Collaborate with them, don’t delegate to them. – Tweet
    23. Context Window needs to be explained well. @NickADobos is spot on, but this needs to be explained without jargon we are so used to. – Tweet
    24. 1. How context windows work TL;DR: it doesn’t remember everything in chat
      2. It’s a text generator, that is good at patterns, and appearing smart. Not an almighty god doing cognitive work. Hallucinations aren’t ai behaving wrong. They are a feature of generating a bad pattern – Tweet
    25. Consider the context a human would need when responding to the same request. When asked to create a presentation by your manager with 10-20 words, you have thousands or likely millions in context to inform that. Ppl often get annoyed when it fails, it’s usually not enough context – Tweet
    26. Understand that they are autoregressive with a context limit and the limitations that impose on the chat interface. – Tweet

    Iterate and Simplify for Optimal LLM Performance

    1. 1. just keep trying things – LLMs keep surprising me,
      2. Start simple, add more techniques, context, guidance etc. step by step – with LLMs I found, less is often more.
      3. Keep a human in the loop and/or be transparent about using LLMs – otherwise prepare for unpleasant – Tweet
    2. When your llm starts omitting code generated in prior steps of an existing chat, end the chat and replay your steps until before that happened. Take a different branch next time – Tweet
    3. 1. Don’t give too much information at once to process, start simple and build on top of previous ones
      2. Want a contrary opinion from LLM?don’t sound like your opinion is sacrosanct – it will agree to you mostly even if its wrong.
      3. Role playing and few shot examples matter. – Tweet
    4. 1. Context
      2. Difference is assumptions
      3. Articulating clearly what you want (run it against another LLM to see if what you mean is what you say).
      4. Being able to go back in a thread and restart (You get do over’s with LLMs that you might not get with people 🙂 ) – Tweet
    5. Well one thing I learned is it’s best to start a new chat if the LLM is going down the wrong path, easier then forcing it back. – Tweet
    6. Having moderate experience with a topic / framework is important for peak quality of the response. At present, using llms for efficiency > using llms to do something you don’t know how to do. – Tweet
    7. at least when it comes to writing code, the task needs to be very well defined, like one would do when creating a user story for developing software. If the details are vague then you leave the LLM open to interpretation and more likely to make mistakes – Tweet
    8. The most important thing, and this has always been true even if not using an LLM, all good software development starts with engineering a solution first before building it. If you attempt to get the LLM to do that part you’ll create as many problems than you solve building – Tweet
    9. Use the LLM to explore your own understanding of the problem space and what you want to achieve. This can help improve your prompting and interpretation of the outputs. – Tweet
    10. LLMs…
      •Pander. Don’t prime answers, ask straight.
      •Only know text. Don’t ask spatial, reasoning etc.
      •Hallucinate and invert. Double-check.
      •Get stuck. Start over.
      •Master ALL languages, jargons, styles etc.
      •Are formidable documentalists. – Tweet
    11. 1. hallucinations are still a thing, be wary when LLMs generate links and code snippets 2. data quality of training content can sometimes be dubious leading LLMs to hallucinate more often or be biased in various ways both will likely be addressed eventually – Tweet
    12. LLMs…
      •Pick and imitate register. Talk like constructive, competent people.
      •Are easily lost. Examples and feedback help.
      •Can misbehave. Be harsh if needed, but stay just. – Tweet
    13. For optimal results, provide ample context. Prompting the LLM with ‘Feel free to ask clarifying questions’ and doing the due-diligence to answering the questions often yields much better results. – Tweet
    14. The more explicit you are the better the output. The LLM can not read your mind and there is a lot of ambiguity when interpreting language. – Tweet
    15. One issue I am seeing more of – Often i ask a question on a choice it made. The LLM assumes I don’t like it or it’s wrong – it then starts to apologize and course correct. More and more I add something like “not refuting or arguing, just trying to understand” etc. – and that – Tweet
    16. They are inherently unreliable in more than one sense, which accumulates the more requests you run in a chain. The Six Sigma approach is devastating to LLMs. – Tweet
    17. Treat it like a very intelligent junior employee who just started at your company and lacks context. Give the LLM the same level of detail for every instruction you would give to this junior employee. – Tweet
    18. That LLMs are not to be trusted as they reliably fail at information due to multiple effects, including hallucinations. That LLMs don’t actually understand things and don’t have common sense. It is mandatory to adapt expectations and ways of working to successfully use them. – Tweet
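The point above about unreliability accumulating along a chain of requests can be made concrete with a little arithmetic. This is a minimal sketch, assuming every step succeeds independently with the same probability; both the independence and the 95% figure are illustrative assumptions, not numbers from the responses.

```javascript
// Probability that every step in a chain of LLM calls succeeds,
// ASSUMING each step succeeds independently with probability `perStep`.
function chainSuccessRate(perStep, steps) {
  return Math.pow(perStep, steps);
}

// 0.95 is an illustrative per-step success rate, not a measured one.
for (const steps of [1, 5, 10, 20]) {
  const rate = chainSuccessRate(0.95, steps);
  console.log(`${steps} step(s): ${(rate * 100).toFixed(1)}% chance the whole chain is right`);
}
```

Under these assumptions, ten chained steps at 95% each leave roughly a 60% chance that the whole chain is right, which is why shorter chains and checks between steps tend to help.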

    Craft Effective Prompts for Consistent Results

    1. How to prompt – Tweet
    2. Carefully consider keywords, and prioritise them by locating them earlier and at the end of longer prompts. – Tweet
    3. If you want stable results across models and are looking to build robust pipelines you should stop hand writing prompts and move toward prompt optimizers. https://ycombinator.com/launches/L4V-hamming-let-ai-optimize-your-prompts-free-for-7-days… Also built into DSPy! – Tweet
    4. While crafting logics and system prompts, Always keep a thought in your mind parallel what would I respond to this prompt and context. – Tweet
    5. 1. Prompts matter.
      2. Treat it like a tool, and you’ll get a tool. It’s only as smart as you let it be. – Tweet
    6. to ask them the best way to prompt them – Tweet
    7. there is a single prompt that gets the job done, thousands that screws it – Tweet
    8. 1. Always add a system prompt at the beginning: Define a role. Ex: “You are a senior software developer who excels in…”
      2. Context Matters: Provide a detailed background for better insights.
      3. Clear Prompts: Specificity is crucial for accurate outputs. – Tweet
    9. If a large global prompt doesn’t work, try step by step. If it does work, but has errors in response – Ask it to fix errors one by one. Insist, like you would with a human supplier. If “do this” doesn’t work, try “Strictly do this”. Amazing how effective insisting is 🙂 – Tweet
    10. The better the prompt the better the output. You don’t need a Meta framework for 90% of things – Tweet
    11. They don’t exist between prompts – Tweet
    12. Don’t rely on the models weights alone. Be explicit in the prompt and give it pointers to what you’re expecting. Let it “clean up” or “translate” your prompt rather than “come up” with an answer based on its training. Exception: generating lists for inspiration. – Tweet
    13. prompt engineering, in order to get the most desired outcome in handy. – Tweet
    14. It lies Q: Who was the second person to walk on the moon? A: Pete Conrad Q: can you name the crew members of Apollo 11? A: I got the right answer. Q: Then how come Pete Conrad was the second person to walk on the moon? A: My apologies. Indeed Buzz Aldrin was the second pe… – Tweet
    15. How to say no. – Tweet
    16. How to use smart phone or computer with internet – Tweet
    17. Vibe is an input. – Tweet
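Several responses above suggest the same recipe: define a role in a system prompt, give detailed context, and make the request specific. Here is a minimal sketch of assembling those parts into a chat-style message list. The `{ role, content }` shape is the common chat format used by several LLM APIs; `buildMessages` and the example text are hypothetical, and no API is called.

```javascript
// Build a chat-style message list: role first, then context, then a specific task.
// `buildMessages` is a hypothetical helper, not part of any library.
function buildMessages({ role, context, task }) {
  return [
    { role: "system", content: `You are ${role}.` },
    { role: "user", content: `Context:\n${context}\n\nTask:\n${task}` },
  ];
}

// Example values are made up for illustration.
const messages = buildMessages({
  role: "a senior software developer who reviews Python scripts",
  context: "The script loads a CSV, summarizes it, and writes a README.md.",
  task: "List the three most likely bugs, one per line. If you are unsure, say so.",
});
console.log(JSON.stringify(messages, null, 2));
```

Keeping the three parts as separate arguments makes it easy to vary one (say, the role) while holding the others fixed when comparing outputs.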

    Don’t Expect Human-Like Understanding from LLMs

    1. LLMs have no “thoughts” or understanding, they’ll simply write the statistically most probable answer based on your input and have been prompted to act as assistants. – Tweet
    2. LLMs are incredibly random. Responses can change wildly based on a single character difference in the prompt. Even one extra space. They are best for prompts that have a range of possible responses, not for prompts where you expect one consistent answer. – Tweet
    3. Cease prompting their LLM to give them a viral tweet with forced irony forcing awareness to an issue. That’s my own personal opinion, bro. But, believe what you want. – Tweet
    4. If you don’t know what you want, the LLMs too likely won’t know. And if they don’t know they will make it up. And if you don’t know, you will not know that they made it up. – Tweet
    5. Be sure not to put contradictions in your prompt. LLMs, in contrast to humans, try to follow instructions as close as possible. They usually handle contradictions by ignoring some part of the instructions or even ignoring facts. – Tweet
    6. It’s biased toward its creators. So if the majority of companies that are developing LLMs are owned by the same investors, then in fact, we are having a single LLM that is biased toward that investors goals. E.g., chatgpt is more toward liberalism and refuses to operate otherwise – Tweet
    7. Basically, you need to understand that LLMs are not humans. You can’t assume they’ll understand what you mean when you write short prompts. You get the best out of LLMs when you provide detailed instructions of what you want without letting laziness get in the way. In my – Tweet
    8. Don’t assume anything. LLM doesn’t learn like a human. Any assumption you make about what LLM should or shouldn’t know is probably wrong. – Tweet
    9. Describe your context and the role you want the LLM to look at your input (critical, tech/none-tech, …) Think what you could expect from a wise, random person you ask on the street. Do not expect more from the LLM-Answer. Also only trust it similarly. – Tweet
    10. Give it an option to not do something either by allowing the LLM to reply with something like “I don’t know” or tell it to ask follow up questions. – Tweet
    11. There is nothing fundamentally important for that interaction. These LLMs are just minimum viable versions of something much bigger that will come soon. That something will know how to interact with us no matter how we behave. – Tweet
    12. 1. that you need to cram the relevant data into the prompt. LLMs are far far better at transforming what you give them than they are at answering solely on the basis of the lossy representation of the training data encoded into the model itself – Tweet
    13. The side effect fact that formulating a question for an LLM makes you think better. When coding, for example, we often run questions in our heads and then get to coding. Being forced to formulate a question properly may lead you to trajectories you may have never considered. – Tweet

    Treat LLMs as Guided Children, Not Mind Readers

    1. The game isn’t to ‘one shot it’. It’s to get something you never thought was possible or that you’d never think of. I always say they are like children, they need guidance (back story and reason) and repetition …but room and time to play and grow. – Tweet
    2. Honestly, flexibility and patience. We need to give up a little bit of control and expectation of all things to be so rigid. – Tweet
    3. When working with it, you need to expect it to not read your mind, but work with it as if you’re asking for help from an insanely gifted child and give yourself patience to shape the result. – Tweet
    4. if it makes life better? yes. but always? no. – Tweet
    5. When asking it how to implement something, always give it options. If you can’t think of options, give it a vague out. Instead of asking, “should I do this to my code?”, ask it “should I do this to my code, or is there some better way I could do it?”. Otherwise the models are too – Tweet
    6. 1. always consider that it doesn’t know what assumption you’re making. so it might infer them sometimes but often it’s much better to over explain what you want.
      2. they will often run ahead on a suggestion you have even if it’s not the best path so I find myself adding “if this – Tweet
    7. It cannot read your mind, if you don’t explain exactly what you want you will not get what you want – Tweet
    8. I am not ready to give advice based on a bet that “something much bigger will come soon” – prompting advice that worked for GPT-4 over a year ago is still mostly relevant to working with the best models today – Tweet
    9. to be concise and always assume the response is wrong, even ever so slightly. Check and correct. – Tweet
    10. – you have to provide context otherwise it assumes – it will often agree with you or apologize/correct itself even if you question the right answer – Tweet
    11. The limited ability for non-linear (or non left-to-right) reasoning. Encouraging the model to spend more time planning and discussing beforehand often leads to better results. This may be less the case with Claude etc where reasoning tokens are happening behind the scenes. – Tweet

    Context is Key for Effective Interaction

    1. Context is everything – Tweet
    2. Context is all you need. – Tweet
    3. Understanding how context works – Tweet
    4. It’s all about context – Tweet
    5. #contextmaxxing – Tweet
    6. Context, Task & Purpose – Tweet
    7. Subjectivity. Context. Brain rent. – Tweet
    8. I’d say understanding the concepts of context, attention, and likelihood – Tweet
    9. 1. Context and memory (the degree to which you can refer to previous parts in the chain of context) 2. Temperature and hallucinations. The tradeoff between extremes of temperature 3. It’s wise to have benchmark questions of your own to test when a new company/model comes out – Tweet
    10. local maxima sensing – Tweet

    Acknowledge the Stateless Nature of LLMs

    1. your interaction is with a stateless inference that exists for a fleeting moment, current ai is not continuous which is easy to forget. This has implications for what you are building for: – Tweet
    2. that they’re stupid next-token predictors and not intelligent agents. If you expect conscious beings, you’ll be surprised and disappointed. But they’re incredibly good at predicting the next useful token. – Tweet
    3. That standard intuitions for computers don’t apply. Treat it the way you would treat a knowledgeable but fallible friend. Not like a purely logical SciFi AI with perfect memory. – Tweet
    4. Normally I hate predictions and terms like this, but the next 20 years are going to be the era of “embodied intelligence” People are imagining humanoid robots, this will be a very small fraction of it. Compared to the software problem, the body is trivial. Imagine asking your – Tweet
    5. Inherent lack of memory about previous interactions. Every message is starting from zero and only seems coherent because background info and the previous messages and responses are sent before the latest message. – Tweet
    6. They’re not sentient. They generate responses by predicting patterns from vast data, which means they’re as fallible as they are impressive. The key is precision: your queries must be meticulously clear and well-contextualized. – Tweet
    7. it doesn’t have a memory like hooomans – Tweet
    8. Whenever the conversation derails, you need to cut that branch and keep the model in the “right universe of probabilities” by editing prompt/messages. This is also why I was skeptical about Reflection, because if it really worked, it would be breakthrough. – Tweet
    9. They aren’t deterministic – Tweet
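The statelessness described above is easiest to see in code: the client, not the model, holds the conversation and resends all of it on every turn. This is a minimal sketch with a stand-in model. `fakeModel` is a hypothetical stub that only reports how many messages it was given; a real API call would go in its place.

```javascript
// Stand-in for an LLM call (hypothetical). It has no memory:
// it sees only the messages passed in with this one request.
function fakeModel(messages) {
  return `I can see ${messages.length} message(s) in this request.`;
}

// The client keeps the history and resends all of it each turn.
function createChat(model) {
  const history = [];
  return function send(userText) {
    history.push({ role: "user", content: userText });
    const reply = model(history);
    history.push({ role: "assistant", content: reply });
    return reply;
  };
}

const send = createChat(fakeModel);
console.log(send("Hello"));           // the model is given 1 message
console.log(send("What did I say?")); // the model is given 3 messages
// Without the resent history, the same question arrives with no past at all.
console.log(fakeModel([{ role: "user", content: "What did I say?" }]));
```

Editing or truncating `history` before the next call is the code-level equivalent of "cutting the branch" when a conversation derails.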

    Leverage AI for Prompt Suggestions and Refinement

    1. I like to ask them for prompts to use for a given purpose, it tends to be more detailed than I would be. Can also use this to add example Q&A if need be. – Tweet
    2. We need AI assistance with prompts and suggestions on rewriting your queries similarly to Grammarly’s for spellchecking and correctness. – Tweet
    3. “Give me a list of questions I can answer to help improve the quality of the response” – Tweet
    4. Let’s ask one. – Tweet
    5. Can we get our hands on all the prompts used in fine tuning data or at least major ones. Highly unlikely they will release it. – Tweet
    6. anybody got tips for image generation? i hardly ever use the image features, but lordy, they struggle! even with clear, verbose prompts using art school vocabulary, specific artist citations, and example attachments, lots of iterations, etc. – Tweet
    7. I’m doing a podcast with the Cursor team. If you have questions / feature requests to discuss (including super-technical topics) let me know! For those not familiar, Cursor is a code editor based on VSCode that adds a lot of powerful features for AI-assisted coding. I’ve been – Tweet
    8. They’re useful in the same way Google or the internet or stack overflow is useful plus one big advantage: your question doesn’t have to take your specific situation and change it to a generic case that someone else has already answered, you can just ask about your exact case! – Tweet

    Start with a Jailbreak for Objective Analysis

    1. Using a jailbreak should always be your first step if you want less biased, more objective and fact-based analysis of sensitive or controversial sociopolitical issues. – Tweet
    2. Kinda like Google, small changes in wording can give you quite different results. – Tweet
    3. That you should only use it to get answers you can verify with a separate tool, or somehow evaluate yourself (e.g. text quality). – Tweet
    4. It’s not a tool – Tweet
    5. don’t treat it like a search engine. think about the outcome and output you are trying to achieve. – Tweet
    6. There is a considerable chance the answer is wrong, so likely everything needs to be double checked. – Tweet
    7. I can only speak for the use cases I’ve come across wrt legal work, but don’t use them for tasks where you need a reference. Using them to draft or review documents is fine. Asking for a case law reference is a no-no. And of course, make sure you’re not leaking confidential stuff – Tweet

    Master Prompt Engineering for Better Outputs

    1. lol. Nice try. If your business needs to level up I can do certification class. Your employees will get Level 4 Prompt Engineering Classification. DM if interested – Tweet
    2. I like to write no full sentences with error and llm understand. So prompt engineering bullshit – Tweet
    3. Turing test. – Tweet
    4. Use instructions to change the style of the output that the LLM produces. For Claude you have to make a project first in order to be able to set the instructions. – Tweet
    5. – Understanding how LLM system, ChatGPT or Claude works and responding technically in basic. – Prompting skills. Understanding the difference between effective and ineffective prompting. – Tweet
    6. understand the english language and HOW it’s used (sadly, even english speakers have a hard time w/ correct language use). know grammar and syntax, context and nuance. be clear, succinct, specific when creating prompt. edit, edit, edit before sending prompt. – Tweet

    Understand LLMs as Statistical Predictors

    1. Language models cannot generalize the simple formula “A is B” to “B is A.” – Tweet
    2. 1) tokenizers/decoding strategies are both incredibly important and invisible to most users. Remember that what you input is not what the model sees exactly, and what you read is not what the model output directly. 2) repeat #1 for the crowd in the back – Tweet
    3. It’s a bit sad and confusing that LLMs (“Large Language Models”) have little to do with language; It’s just historical. They are highly general purpose technology for statistical modeling of token streams. A better name would be Autoregressive Transformers or something. They – Tweet
    4. Language – Tweet
    5. They are next word predictors. Everything is downstream from that. – Tweet
    6. The output is encoded in the input, the model is just a statistical decompression engine. This means that they can only ever amplify your mind, they can’t think for you, however they can translate your question into more formal language & that may decompress into something useful – Tweet

    Stay Focused on High-Impact Tasks

    1. Try to stay in the high impact zone e.g. through breaking tasks up and don’t expect perfect results at all times – Tweet
    2. Being able to define goals and objectives. – Tweet
    3. Focus loquaciousness to refine results that will otherwise always regress to mean averages. – Tweet
    4. If it doesn’t understand you, ask it to help clarify your question. If you’re not getting the answer you need, break your question into smaller parts. If you don’t know how to break it down, ask it to help you break it down. – Tweet
    5. • You’re interacting with a superposition of all humanity, so defining a specific persona that would be helpful for your task produces better results. •Avoiding assumptions and explaining your goal in the clearest way possible is the key to avoiding running around in circles. – Tweet

    Understand LLMs as Probabilistic Text Generators

    1. they are reality-adjacent – Tweet
    2. that they have to make sense – Tweet
    3. That they are probabilistic systems. – Tweet
    4. That they’re random text generators and any appearance of intelligence is accidental and illusory. – Tweet
    5. themselves – Tweet

    Verify Information, Never Trust Blindly

    1. Verify, never trust. – Tweet
    2. Never trust them – Tweet
    3. Just don’t. – Tweet
    4. Anything coming out of those things can be completely false. Don’t just accept it as truth. – Tweet

    Engage Actively to Maximize LLM Utility

    1. that it’s only as useful as how many questions you’re asking it. Any initial understanding beyond that would be an overkill in my opinion – Tweet
    2. It is only an upscaler not a freewin. The more you know the better it works, but compared to a person you can talk with it in shortcuts. The skill is to always reposition it constantly, before it goes off in the wrong direction. You can also work with labels within its answers – Tweet
    3. They’re useful/powerful for a wide range of tasks. Their usefulness is highly variable, depending on context & the skill of the user. A user’s existing expertise can be greatly amplified by the system, but novices probably benefit most. Ask them for help on how to use them. – Tweet
    4. You no longer need to learn regex etc, you can just act like you know it at an expert level now, similar with syntax of virtually any language or technology. It is better at writing debugging output for you to find the problem in the code than finding the problem in the code… – Tweet

    Communicate Clearly and Specifically

    1. Be specific, clear, and thorough. Same as communicating with humans, but more important. – Tweet
    2. Be super clear with instructions. Funnily enough, we should be doing that with our instructions to our fellow humans, but we don’t! – Tweet
    3. Effective writing – Tweet
    4. BE SPECIFIC. Every one of my customers asks why a query they make doesn’t return a result at all or a result they desire and it is because of the quality of their query over and over again. Some customers understand this out of the gate, some need some training. – Tweet

    Be Knowledgeable to Identify Hallucinations

    1. Britannica’s Great Books of the Western World – Tweet
    2. Hallucinations are a thing and the model doesn’t know if it’s hallucinating or not. That’s why the user using an LLM on any field has to be knowledgeable on that field in order to determine what’s a hallucination. This means you can’t use an LLM reliably to do something you can’t. – Tweet
    3. LLMs don’t have the notion of True or False – Tweet

    AI Coding: $12M return for $240K spend?

    This is an email I sent to our leadership team a few minutes ago.

    We may be witnessing the third major leap in computing productivity, after high-level languages in the 1960s and spreadsheets in the 1980s.

    In the last few weeks, AI coding really took off. Cursor, Cody, and Replit Agents are FAR better than GitHub Copilot.

    Research on ~5,000 devs in Fortune 100 shows that even GitHub Copilot makes them ~25% more productive.

    Personally, Cursor helped me:

    1. Write code faster (at least 2X). I’ve given away my team (there’s not enough work for them).

    2. Write better code. I now document code for others to replicate – because it’s so easy.

    3. Write complex code. I’ve built stuff I didn’t know how to. WhatsApp agents, AI code writers, even LLM Foundry. Each has opened a client’s door.

    So, should we leverage AI Coding for our developers?

    Maybe not. Consider these risks.

    1. It costs $10-$20/month/dev. That’s $120K-$240K/year for ~1,000 devs.

    2. Clients may not be comfortable with us using AI coding. IP. Security.

    3. Higher productivity reduces our T&M billing, hence revenue.

    Maybe yes. Consider these benefits.

    1. We could save 25% per developer. Maybe $4K/month/dev x 25% x 12 months x ~1,000 devs = $12M/year.

    2. We could win more complex engagements. (Pitching AI coding as an edge has worked well, too.)

    3. We could reduce time to get someone to a tech-lead level.
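The headline numbers follow directly from the figures in this email. This is a quick sketch of the arithmetic, using ~1,000 developers, the upper $20/month tool price, and the $4K/month/dev and 25% estimates above. All of these are rough estimates rather than measurements, and the variable names are illustrative.

```javascript
const devs = 1000;               // ~1,000 developers
const toolCostPerDevMonth = 20;  // upper end of $10-$20/month/dev
const perDevMonthly = 4000;      // the email's "maybe $4K/month/dev" estimate
const productivityGain = 0.25;   // ~25% saving per developer

const annualSpend = devs * toolCostPerDevMonth * 12;
const annualSaving = devs * perDevMonthly * productivityGain * 12;

console.log(`Spend:  $${annualSpend.toLocaleString("en-US")}/year`);
console.log(`Saving: $${annualSaving.toLocaleString("en-US")}/year`);
console.log(`Ratio:  ${annualSaving / annualSpend}x`);
```

Halving the assumed gain or the per-developer figure still leaves the saving far above the spend, so the risks listed earlier matter more to the decision than the exact estimates.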

    There’s probably no one-size-fits-all answer. We might need to pilot while we strategize.

    How might this impact our hiring and training?

    1. Hire for AI coding skills. In interviews, I’d look for use of AI coding agents. We need a few AI coding leads to shape our engineering future.

    2. Hire for dual skills. AI can take care of the code. I’d hire people with good client-facing skills.

    3. Train on AI coding. Integrate these into the onboarding process. Train existing devs.

    4. Train non-developers. Analysts, designers, managers can become as good as junior devs.

    Why am I mailing you?

    In Nov 2022, when ChatGPT was released, I thought LLMs were the biggest knowledge leap since Google.

    In Sep 2024, with Cursor, Cody, and Replit agents, I think AI coding is the biggest productivity leap since Excel.

    This is not for immediate action.

    Please think it over. Share it. Discuss it. Pilot it. Let’s explore.

    Breaking mental coding barriers with LLMs

    Today, I stepped a bit beyond my comfort zone. Usually, I prefer micro-managing LLMs when writing code. This time, I was macro-managing.

    I needed to create a mock history of the status of a manuscript, e.g. it was submitted on this date. THEN it moved to this state on this date. THEN … etc. I have no idea what the states could be, though. So, I could send it to an LLM, and it would give a different set of states each time. Or I could write a program and lose out on variety.

    Instead, I asked ChatGPT.

    Me: When looking up the status of a manuscript in a database, what are typical fields it might return?

    ChatGPT: When looking up the status of a manuscript in a database, typical fields it might return include:

    1. Manuscript ID: A unique identifier for the manuscript.
    2. Title: The title of the manuscript.
    3. Author(s): Names of the author(s) who submitted the manuscript.
    4. Submission Date: The date the manuscript was submitted.
    5. Status: The current status of the manuscript (e.g., Under Review, Accepted, Rejected, In Revision).

    That helps. Then I said:

    Me: Based on this, at any point, what “Status” values could a manuscript be in? Draw a mermaid flow graph for this.

    ChatGPT: Here’s a mermaid flow graph depicting possible “Status” values a manuscript could be in throughout its lifecycle:

    This is cool — it gives me a starting point that’s better than I could have conceived of.

    Finally, I said:

    Me: Write JS code that will randomly pick from one of these states and create records containing random dates and states representing the history of the manuscript up to that point

    ChatGPT: Here’s a JavaScript code snippet that randomly picks from the manuscript states and creates records containing random dates and states, representing the history of the manuscript up to that point:

    function getRandomDate(start, end) {
        return new Date(start.getTime() + Math.random() * (end.getTime() - start.getTime()));
    }
    
    function getRandomElement(arr) {
        return arr[Math.floor(Math.random() * arr.length)];
    }
    
    function generateManuscriptHistory() {
        const history = [];
        let currentState = "Submitted";
        let currentDate = getRandomDate(new Date(2020, 0, 1), new Date());
    
        history.push({ date: currentDate, state: currentState });
    
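        // Stop only at terminal states. "Accepted" is not terminal here:
        // it moves on to "In Production" and then "Published" (see the cases below).
        while (currentState !== "Rejected" && currentState !== "Published") {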
            let possibleNextStates;
    
            switch (currentState) {
                case "Submitted":
                    possibleNextStates = ["Under Review"];
                    break;
                case "Under Review":
                    possibleNextStates = ["Review Completed"];
                    break;
                case "Review Completed":
                    possibleNextStates = ["Accepted", "Rejected", "Major Revision Required", "Minor Revision Required"];
                    break;
                case "Major Revision Required":
                case "Minor Revision Required":
                    possibleNextStates = ["In Revision"];
                    break;
                case "In Revision":
                    possibleNextStates = ["Resubmitted"];
                    break;
                case "Resubmitted":
                    possibleNextStates = ["Under Review"];
                    break;
                case "Accepted":
                    possibleNextStates = ["In Production"];
                    break;
                case "In Production":
                    possibleNextStates = ["Published"];
                    break;
                default:
                    possibleNextStates = [];
                    break;
            }
    
            if (possibleNextStates.length === 0) break;
    
            currentState = getRandomElement(possibleNextStates);
            currentDate = getRandomDate(currentDate, new Date());
    
            history.push({ date: currentDate, state: currentState });
        }
    
        return history;
    }
    
    // Example usage:
    console.log(generateManuscriptHistory());

    The above code implements a state-transition diagram — something I’ve never been good at doing. I have a subconscious bias, and I steer clear of even thinking about these. Now, I have one implemented (and it worked quite well!)
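The same state machine can also be written as data rather than as a `switch`, which keeps the diagram readable and easy to change. This is a minimal sketch, assuming the states and transitions from the code above. The `TRANSITIONS` table, `randomWalk`, and the injectable `random` parameter are illustrative names, not part of ChatGPT's answer.

```javascript
// Each state maps to the states it can move to. An empty list marks a terminal state.
const TRANSITIONS = {
  "Submitted": ["Under Review"],
  "Under Review": ["Review Completed"],
  "Review Completed": ["Accepted", "Rejected", "Major Revision Required", "Minor Revision Required"],
  "Major Revision Required": ["In Revision"],
  "Minor Revision Required": ["In Revision"],
  "In Revision": ["Resubmitted"],
  "Resubmitted": ["Under Review"],
  "Accepted": ["In Production"],
  "In Production": ["Published"],
  "Published": [],
  "Rejected": [],
};

// Walk from `start` until a terminal state, choosing each next state at random.
// `random` is injectable so the walk can be made repeatable in tests.
function randomWalk(start = "Submitted", random = Math.random) {
  const path = [start];
  let state = start;
  while (TRANSITIONS[state].length > 0) {
    const options = TRANSITIONS[state];
    state = options[Math.floor(random() * options.length)];
    path.push(state);
  }
  return path;
}

console.log(randomWalk().join(" -> "));
```

Adding a state then means adding one line to the table, and attaching random dates to each step of the path works the same way as in the original function.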

    Yet another example that human imagination is the bottleneck to leveraging LLMs.