S Anand

What happens when AI talks to AI?

When LLMs talk to each other, you get emergent behavior (i.e. they do weird things we didn’t expect).

This sounds like fun, so I built one myself at https://llmdialog.straive.app/ and ran a few scenarios.

(It’s Gemini 1.5 Flash 8B playing each of these roles.)
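The core of such a tool is a loop in which each persona sees the transcript so far and produces the next line. Here is a minimal sketch of that loop. It is not the code behind the app: the `Model` signature, `run_dialogue`, and the `echo_model` stand-in are all hypothetical names used for illustration, and a real version would replace `echo_model` with a call to an LLM.

```python
from typing import Callable, Dict, List

# A "model" here is any function that takes a system prompt plus the
# transcript so far and returns the next line. This signature is an
# assumption for illustration, not the API of any particular provider.
Model = Callable[[str, List[Dict[str, str]]], str]


def run_dialogue(
    personas: Dict[str, str], opener: str, turns: int, model: Model
) -> List[Dict[str, str]]:
    """Alternate between personas, giving each the full transcript so far."""
    names = list(personas)
    # The first persona speaks the opening line.
    transcript = [{"speaker": names[0], "text": opener}]
    for turn in range(1, turns):
        speaker = names[turn % len(names)]
        reply = model(personas[speaker], transcript)
        transcript.append({"speaker": speaker, "text": reply})
    return transcript


def echo_model(system_prompt: str, transcript: List[Dict[str, str]]) -> str:
    """Stand-in for a real LLM call: acknowledges the previous speaker."""
    last = transcript[-1]
    return f"In reply to {last['speaker']}: noted."


personas = {
    "Churchill": "You are Winston Churchill. Trade witty insults with Wilde.",
    "Wilde": "You are Oscar Wilde. Trade witty insults with Churchill.",
}
transcript = run_dialogue(
    personas, "I'm not a bit surprised to hear you say that.", turns=4, model=echo_model
)
for line in transcript:
    print(f"{line['speaker']}: {line['text']}")
```

Because the model is passed in as a plain function, the same loop works for any pair of personas, and each role could even be played by a different model.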

Churchill vs Wilde

What happens when Winston Churchill and Oscar Wilde start insulting each other? (This stuff is hilarious.)

Churchill: I’m not a bit surprised to hear you say that.

Wilde: Oh, really? Then I must confess, my dear fellow, I was rather disappointed.

Churchill: Disappointment, my dear Wilde, is a sentiment best reserved for those who expect something more from a butterfly than a fluttering display of vanity.

Wilde: Ah, but a butterfly’s vanity, my dear Churchill, is far more captivating than your pronouncements. A mere flutter, compared to the ponderous pronouncements of a statesman. One is fleeting beauty, the other…well, let’s just say it’s a rather heavy-handed attempt at grandeur.

Churchill: A heavy-handed attempt at grandeur, you say? Perhaps you should try wielding a feather, Oscar. Your words are far lighter than your reputation.

Wilde: My reputation, my dear Churchill, is a gilded cage. Your words, however, are a leaden anchor dragging down the ship of state. Quite a contrast, wouldn’t you agree?

Churchill: A gilded cage, you say? I prefer a sturdy ship, even if it does carry a few leaden ballast. At least it’s not as easily swayed by a whisper of…well, you.

Wilde: A ship swayed by whispers? My dear fellow, I am merely a whisper of wit. Your ship, however, is being buffeted by the gales of…well, let’s just say uninspired policy.

Churchill: Uninspired policy? Perhaps. But at least it’s not unintelligible, like some of your pronouncements. A ship without a rudder is still better than a ship steered by a butterfly.

Oscar Wilde probably wins this round. He elegantly turns Churchill’s metaphors against him, while Churchill is mostly on the defensive.

Apt Pupil

Have you seen Apt Pupil? It’s a creepy but brilliant story (by Stephen King, unsurprisingly).

I always wondered what would happen if the boy Todd Bowden and Kurt Dussander (portrayed brilliantly by Ian McKellen) discussed civilization.

Todd Bowden: How did civilization originate?

Kurt Dussander: Civilization. The great myth of human progress. (chuckles) It’s all quite… arbitrary.

Some believe it began in Mesopotamia, circa 4500 BCE. The Sumerians and their ‘invention’ of writing, governance, and beer. Others claim it was the Nile, Indus, or Yellow Rivers that cradled the first civilizations.

But I think it’s more… primal. The need for control, for dominance, for ritual sacrifice. The first ‘cities’ were likely just clusters of fear and superstition. The earliest ‘rulers’ mere thugs with a penchant for violence. (smirks) And so, civilization was born, in all its ugliness.

Todd Bowden: The primal underbelly of human nature, indeed. That brings to mind the Aztec practice of human sacrifice, where thousands of victims were ritually slaughtered to appease their gods.

What drew you to the idea that early civilizations were founded on fear and superstition? Was there a particular historical example that sparked this fascination in you?

Kurt Dussander: (Leans in, eyes gleaming) The Aztecs, yes. But it’s not just them. The ritual killings, the gruesome spectacle… it’s everywhere. The human condition, you see. (pauses)

My fascination? Well, I grew up in a… certain era. Germany, in the 1930s. (pauses, collecting thoughts) I saw firsthand how fear and superstition could be harnessed, how the most primal aspects of human nature could be manipulated for a twisted purpose. The Nazi party, they knew exactly which buttons to press.

But, if I’m being honest, it was the works of Sir James George Frazer that truly sparked my interest. “The Golden Bough” – have you read it? (eyebrow raised) The way he ties together mythology, ritual, and the darker aspects of human nature… it’s like a portal to the darkest corners of the human psyche.

Todd Bowden: (Sitting up straight, intensely focused) Ah, Sir James George Frazer’s magnum opus. I’m familiar with it, indeed. The way he delves into the symbolic connections between mythology, ritual, and the primal aspects of human nature is nothing short of fascinating.

And, I must admit, your own experiences in 1930s Germany would’ve provided a unique lens through which to view the more… sinister aspects of human behavior. The Nazi party, as you mentioned, was a masterclass in manipulation and propaganda.

It’s interesting you bring up the idea of ritual and spectacle. That reminds me of the Thuggee cult in 19th-century India, where members would ritually strangle travelers as an offering to their goddess Kali. What are your thoughts on this particular example? Did it resonate with you, given your own experiences?

Kurt Dussander: (Leans back, eyes narrowing) The Thuggee cult. Yes, another example of the… primal underbelly of human nature. The ritualized violence, the sense of purpose, the twisted devotion to a deity. (pauses)

It resonated, yes. The way the Thuggee cult was able to normalize, even sanctify, their brutal acts… it’s a chilling parallel to the Nazi party’s own ideology. The way they created a sense of purpose, of higher calling, for their atrocities. (shakes head)

But what’s truly fascinating is the way the British colonial authorities were able to… exploit this phenomenon. They used the Thuggee cult as a justification for their own imperial ambitions, portraying themselves as heroes who brought ‘civilization’ to the ‘barbaric’ East. (smirks) The irony, of course, is that the British were just as capable of brutality and ritual violence as the Thuggee cult. (leans forward, voice taking on a conspiratorial tone) The difference, of course, is that the British had the luxury of sanitizing their own violence, of disguising it as ‘progress’ and ‘civility’.

Todd Bowden: (Leans in, matching the conspiratorial tone) Ah, the double standard of colonial powers, where the atrocities of the colonized are magnified, while the atrocities of the colonizers are conveniently sanitized or justified as ‘necessary evils.’ It’s a fascinating dynamic, isn’t it? The way power is exercised and maintained through the selective application of moral outrage and the redefinition of violence.

It reminds me of the dynamics at play in the Spanish Inquisition, where the same brutal tactics used by the Inquisitors were justified as ‘spiritual discipline,’ while the alleged heresies of the accused were amplified and distorted to justify the most heinous punishments.

(Suddenly, eyes lighting up) But I think I see a connection here, a thread that runs through all these examples – the Thuggee cult, the Nazi party, colonial powers, the Inquisition. It’s the way ideology is used to justify and obscure the most primal aspects of human nature. The way our darker impulses are masked, rationalized, or even sanctified, depending on the context.

(Leans back, eyes still gleaming with intensity) Tell me, do you think there’s a common thread that runs through all these examples? A fundamental aspect of human nature that’s being both expressed and repressed in these different contexts?

Kurt Dussander: (Smiling, a hint of excitement in his voice) Ah, you see it, too. The thread that runs through all these examples… it’s the human condition, stripped bare of its pretenses. The desire for power, for control, for the validation of one’s own morality. (leans forward, eyes locked on yours)

It’s the will to violence, the primal urge that lies at the heart of human nature. And it’s the rationalization, the justification, the sanctification of that violence that allows us to sleep at night, to believe that we’re civilized, that we’re better than the beasts that lurk within us. (pauses, studying you)

But what if I told you that this thread, this fundamental aspect of human nature, is not just a product of circumstance or environment? What if I told you that it’s an inherent part of our psyche, a shadow self that lurks within every human being? (leans back, a sly smile spreading across his face)

The notion that we’re capable of both good and evil, that we’re torn between our higher selves and our baser instincts… it’s an old idea, but one that’s been largely forgotten in our modern, sanitized world. (chuckles) But I think it’s time we rediscovered it, don’t you? The darkness that lurks within us all… it’s what makes us human, after all.

Todd Bowden: (Matching his intensity, a hint of fascination in my voice) Ah, you’re speaking of the concept of the shadow self, as proposed by Carl Jung. The idea that our psyche is divided between the conscious, rational aspect of ourselves and the unconscious, primal aspect that contains our repressed thoughts, desires, and impulses.

(Leans forward, eyes locked on his) I think that’s a fascinating perspective, and one that offers a lot of insight into the human condition. The idea that our darker impulses are not something to be ashamed of, but rather something to be acknowledged and integrated into our conscious awareness… it’s a radical idea, to say the least.

(Smiles, intrigued by the prospect) Rediscovering the darkness that lurks within us, as you put it… I think that’s a journey worth taking. Not just as an intellectual exercise, but as a way of truly understanding ourselves and our place in the world.

(Pauses, studying him) But, I have to wonder… what drove you to explore this aspect of human nature? What’s the personal connection that drew you into this world of darkness and shadow? (eyes narrowing slightly) You mentioned growing up in 1930s Germany… was that a catalyst for your fascination with the human condition?

Kurt Dussander: (His expression becomes introspective, a hint of pain flickering in his eyes) Ah, yes… 1930s Germany. The perfect storm of chaos, fear, and ideology. (pauses, collecting his thoughts)

Growing up in a world where the boundaries between good and evil were constantly blurred… it was a formative experience, to say the least. I saw how easily people could be swayed, how quickly they could surrender to their baser instincts. (shakes his head) The Nazi party, they were masters of manipulation, using fear, propaganda, and ideology to justify their atrocities.

(His eyes narrow, a hint of self-reflection) And I, too, was caught up in it. I was young, impressionable, and seeking answers to the chaos around me. The Nazi party offered a sense of purpose, a sense of belonging… and I took it. (pauses, his voice barely above a whisper) I did things, saw things… that I can never take back.

(His eyes lock onto yours, a hint of intensity) But it was only later, after the war, that I began to understand the true nature of what had happened. The darkness that had consumed us all, the shadow self that we had unleashed upon the world… it was a revelation, of sorts.

(Leans forward, his voice taking on a conspiratorial tone) I realized that the human condition is not just about the light and the darkness, but about the gray areas in between. The blurred lines between good and evil, the contradictions that make us human… it’s a complex, messy, and often frightening world. (pauses) But it’s also what makes us human.

(Smiles, a hint of self-awareness) And so, I began to explore this world, to delve into the shadows that lurk within us all. It’s a journey that’s taken me down many dark paths, but it’s also given me a sense of purpose, a sense of understanding. (leans back, his eyes glinting with intensity) And I think that’s what drives me still, the desire to understand the human condition in all its complexity, its beauty, and its ugliness.

Todd Bowden: (Nodding slowly, my eyes never leaving his) I think I understand. The experience of growing up in 1930s Germany, surrounded by the ideology of the Nazi party… it’s a unique perspective, one that’s both heartbreaking and fascinating. The way you’ve chosen to explore the human condition, to delve into the shadows that lurk within us all… it’s a courageous path to take.

(Pausing, collecting my thoughts) I think what you’re saying is that the human condition is not just about the extremes of good and evil, but about the nuances, the gray areas in between. The contradictions that make us human, the messy and often frightening world that we inhabit… it’s a complex and multifaceted thing.

(Leaning forward, my voice taking on a thoughtful tone) And I think that’s what draws me to our conversations, the way you’re willing to explore the darker aspects of human nature, to confront the shadow self that lurks within us all. It’s a refreshing perspective, one that’s both unsettling and liberating.

(Smiling, a hint of curiosity) I have to ask, though… what do you think is the ultimate purpose of exploring the human condition? Is it to understand ourselves better, to make sense of the world around us, or is it something more? (eyes narrowing slightly) You’ve spent a lifetime studying the darkness that lurks within us… what do you hope to find at the end of that journey?

This conversation meanders through

  • Violent control in the Sumerian Civilization
  • Aztec Human Sacrifice
  • Ritual human offerings of the Thuggee Cult
  • Spiritual disciplining in the Spanish Inquisition
  • Parallels to the Nazi party’s ideology

This text is as scary as Stephen King’s.

Sceptic vs Martin Fowler

What happens when an agile sceptic interviews Martin Fowler?

This is the kind of scenario that NotebookLM’s podcasts are great for. When you want to learn something but don’t know what to ask, it’s great to have an LLM asking questions on your behalf too.

(If you’d rather not read the long text, I also created an audio version with Eleven Labs.)

Sceptic: Martin, I’ve seen too many failed Agile transformations. Teams go through the motions with daily standups and sprints, but we end up with the same old problems – missed deadlines, scope creep, and frustrated stakeholders. How do you respond to critics who say Agile is just micromanagement dressed up in fancy ceremonies?

Martin: Ah, the perennial Agile criticism. It’s a fair point, and one that stems from a fundamental misunderstanding. Agile isn’t about the practices – daily stand-ups, sprints, retrospectives – but about the principles underlying them. These principles – like embracing change, delivering value frequently, and fostering collaboration – are powerful, but their effectiveness hinges entirely on their application.

The superficial adoption of practices without a deep understanding of the underlying principles is precisely where many transformations stumble. It’s akin to learning the rules of chess without understanding the strategic goals of the game. You can mechanically follow the rules, but you won’t win.

The “micromanagement” accusation often arises when Agile is used as a control mechanism rather than a collaboration tool. Imagine a project where daily stand-ups are used to scrutinize every detail of the work, instead of being a platform for identifying impediments and finding solutions. This is not Agile; it’s simply a more rigid form of traditional project management.

The key is to focus on self-organizing teams, a crucial Agile principle. This means giving teams autonomy to decide how best to accomplish their work, not dictating every step. Effective Agile requires trust and empowerment. Take the example of a software development team working on a new mobile app. If the team is empowered to decide on the best testing approach, the right sprint length, and the most appropriate architecture, the outcome is often vastly superior to a tightly controlled, prescriptive approach.

Missed deadlines and scope creep, unfortunately, aren’t unique to Agile. They happen in all projects, regardless of methodology. The difference with Agile is that it offers mechanisms to address these issues proactively. Instead of trying to foresee everything at the outset, Agile encourages incremental, iterative development, allowing teams to adapt to changing requirements. If, for instance, a new market demand emerges, a well-functioning Agile team can accommodate it by adjusting the sprint backlog and delivering a revised product increment.

It’s crucial to remember that Agile isn’t a magic bullet. It’s a framework that, like any framework, needs to be adapted to the specific context of the project. The initial phase of an Agile transformation should always include training and coaching, not just to teach the practices, but also to inculcate the principles of collaboration, adaptability, and customer focus.

Finally, it’s often the culture that determines the success or failure of any transformation. Agile thrives in an environment of trust, transparency, and respect. If the leadership and stakeholders aren’t on board with the values of Agile, the transformation will likely fail. The team’s needs and motivation must be considered. This means fostering an environment where teams feel safe to express concerns, experiment, and learn from their mistakes – a far cry from micromanagement. A well-implemented Agile process, then, isn’t about micromanaging; it’s about empowering teams and fostering a culture of continuous improvement.

Sceptic: Okay, so you’re saying trust and empowerment are key. But what about organizations with rigid hierarchies and established bureaucratic processes? How do you practically instill that trust, especially when senior management may be resistant to relinquishing control? Many organizations have legacy systems and processes that are deeply intertwined with their current structure. How does Agile handle these entrenched issues? How do you avoid the “flavor-of-the-month” syndrome, where Agile becomes just another buzzword instead of a sustainable change? Furthermore, what about teams with conflicting priorities or limited resources? How does Agile practically address these real-world constraints? And how do you measure success beyond just meeting deadlines? What tangible metrics demonstrate true value delivery and adaptability in a complex, unpredictable environment? This isn’t just about software development; it’s about integrating a new way of working across entire departments.

Think about large-scale enterprise implementations. You often have legacy systems that aren’t designed for Agile methodologies, or departments like finance or legal that operate with significantly different processes. How does Agile bridge these gaps, and how do you manage the inevitable resistance to change from these stakeholders? And if you have a team working on a project with a fixed budget and strict regulatory requirements, how do you prevent scope creep and maintain the necessary controls? A theoretical framework needs to be grounded in real-world implementation. Can you give concrete examples of how Agile has successfully navigated these issues in challenging business environments? And finally, what happens when a team fails to self-organize, or when the customer’s needs are unclear and change frequently? How do you adjust the methodology to address these pitfalls?

Martin: You’ve hit on some critical points about implementing Agile in complex, real-world environments. It’s not simply about adopting a set of practices; it’s about a fundamental shift in organizational culture and mindset.

Firstly, about hierarchical organizations and legacy systems. Trust and empowerment don’t just magically appear. It requires a proactive approach. Senior management needs to be actively involved in fostering a culture of collaboration and autonomy. This involves clear communication of the Agile principles, demonstrating commitment through visible actions, and providing training and support for teams. Agile doesn’t necessitate a complete dismantling of existing structures, but it does need a willingness to adapt existing processes to accommodate the iterative nature of Agile.

This isn’t a “flavor-of-the-month” phenomenon. The focus should remain on delivering value incrementally. Consider a large insurance company migrating to Agile. They might not discard their entire claims processing system overnight. Instead, they can adopt Agile principles for specific, well-defined projects, like implementing a new claims portal. Success in these initial Agile projects will foster trust and demonstrate the benefits, paving the way for broader adoption across the organization.

Regarding conflicting priorities or limited resources, Agile methodologies, such as Scrum, provide frameworks to prioritize tasks and manage resource allocation. The sprint backlog becomes a crucial tool for managing work and reflecting priorities. Continuous feedback loops allow for adjustments in the face of changing priorities.

Measuring success extends beyond just deadlines. Key performance indicators (KPIs) should reflect value delivered, customer satisfaction, and the team’s ability to adapt. Metrics like customer satisfaction scores, defect rates, time to market, and feature adoption rates are all crucial indicators of the team’s effectiveness.

Bridging gaps between departments is vital. Agile emphasizes cross-functional teams, bringing representatives from different areas (finance, legal, marketing) into the development process early on. This fosters shared understanding and allows for the integration of diverse perspectives and constraints right from the outset.

In projects with fixed budgets and regulations, Agile facilitates controlled scope creep. Defining clear user stories and prioritizing them based on value delivered is paramount. Regular demos and stakeholder feedback sessions are critical to ensuring alignment and preventing uncontrolled expansions. Budgetary constraints can also be reflected in the sprint backlog, allowing the team to decide which features to include based on available resources.

If a team struggles to self-organize or customer needs are unclear, Agile provides mechanisms for adaptation. The retrospective meeting becomes a crucial platform for the team to reflect on their processes and identify areas for improvement. Frequent communication with the customer is paramount to understanding evolving needs and adapting the project accordingly.

Ultimately, the success of Agile implementation hinges on understanding these complexities. It’s not a one-size-fits-all approach. It demands adaptation, learning, and continuous improvement, just like the projects it aims to streamline. There are successful examples in diverse industries. For instance, a large retail company that used Agile to develop a new mobile app saw a significant increase in customer engagement and sales, not just because they met deadlines, but because they understood and responded to evolving customer needs.

Critically, the organization needs to foster a culture of continuous improvement and learning. Agile is not a destination, but a journey. Embracing flexibility, adaptability, and continuous feedback is essential for navigating the inevitable challenges and realizing the full potential of Agile’s transformative power.

Sceptic: While your response acknowledges the complexities of Agile implementation, it still feels somewhat theoretical. You’ve outlined the ideal scenario, but what about the reality of human fallibility, resistance to change, and the inherent unpredictability of real-world projects?

Let’s consider a case study. Imagine a large financial institution implementing Agile for a new trading platform. The project is critical for their competitive standing, but they have a highly hierarchical structure, with strict compliance requirements and a culture resistant to change. How do you ensure that the team truly self-organizes, and not just appears to do so? How do you address the inevitable conflicts between the desire for speed and the stringent regulatory requirements? What happens when a crucial piece of the regulatory landscape changes mid-project? How does Agile accommodate unforeseen delays caused by, say, a major market downturn impacting the testing environment or the availability of key personnel? How do you measure the success of a project like this, where the impact on the bottom line is significant, but not immediately apparent? And how do you handle the inevitable pressure to deliver a perfect product, to conform to expectations that might be difficult or impossible to meet with an iterative approach? Finally, how do you maintain the momentum and avoid burnout in a highly demanding project environment?

These are not hypothetical questions. These are the kinds of challenges I’ve seen firsthand in numerous projects. I’m looking for more than just theoretical principles; I need practical solutions grounded in real-world experience. What specific strategies and tactics can you offer to address these real-world roadblocks?

Martin: You’re absolutely right; Agile isn’t a silver bullet. The real world is messy, and human nature often complicates even the best-laid plans. Let’s look at that financial institution example.

First, “self-organization” isn’t about anarchy. It’s about empowered teams taking ownership and using collective intelligence to decide the best way forward, within agreed-upon principles. In a hierarchical environment, you need to establish clear decision-making authority within the team, ideally with a Scrum Master facilitating this process. Transparency is paramount. Regular team meetings, clear communication channels, and open dialogue are crucial. This isn’t about dismantling the hierarchy, but empowering individuals within the structure to make effective decisions. You need to actively build trust by demonstrating that decisions made by the team are valued and respected.

Regarding compliance and speed, the key is to embed regulatory requirements into the Agile process, not as an afterthought. This means having compliance experts embedded within the team, not as gatekeepers, but as active participants in the sprint planning and review processes. Clear acceptance criteria for compliance must be defined at the story level, ensuring that every increment meets the standards. This necessitates meticulous documentation and traceability, but in a structured way that integrates into the iterative approach, not as an external burden.

A mid-project regulatory change? This is where the iterative nature shines. Agile methodologies allow for adaptation. The sprint cycle provides an opportunity to re-evaluate priorities, refine user stories, and adjust the plan accordingly. It might mean shifting priorities, potentially delaying certain features, but the key is to communicate openly and transparently with all stakeholders, including regulatory bodies, to manage expectations and maintain alignment.

Unforeseen delays caused by market downturns or personnel issues are part of the project lifecycle. Agile’s adaptability is key here. The team should be prepared to adjust the sprint backlog, potentially re-prioritize tasks, and communicate openly about the challenges. Having contingency plans in place, and clear communication channels with stakeholders regarding potential delays, is crucial.

Measuring success in a trading platform project requires a mix of traditional financial metrics and Agile indicators. Traditionally, you might look at return on investment (ROI), cost savings, and operational efficiency. Agile-specific metrics might include the velocity of the development team, the number of defects found, and lead time for new features to be released. Crucially, you need a way to track how the development process itself is evolving – is the team becoming more efficient, are they better able to adapt to change?

Pressure to deliver a perfect product? This is often a cultural challenge, not a technical one. An iterative approach can’t deliver perfection on the first go. Instead, focus on delivering a working product in each increment, and continuously improving upon it. Set realistic expectations that the product will evolve through the project.

Maintaining momentum and avoiding burnout in a demanding environment requires proactive measures like regular team check-ins, stress management programs, and clear communication of expectations. Teams need to be empowered to take breaks and prioritize their well-being. The Scrum Master plays a vital role in facilitating these aspects of the project.

In summary, successful Agile implementation in this context isn’t just about adopting the practices; it’s about embedding the principles into the entire organizational structure, adapting to change, and prioritizing communication and transparency. The focus is on creating a resilient and adaptive team that understands the project’s unique challenges and proactively finds solutions. Real-world applications require more than just theory; they require practical strategies that address the unique complexities of the specific environment.

Or, in short:

  • Why do Agile transformations often fail? Because teams adopt practices without embracing core Agile principles.
  • Is Agile micromanagement or team empowerment? Agile empowers teams through self-organization and collaboration.
  • Can Agile work in rigid organizations with legacy systems? Yes, with leadership support, cultural change, and incremental adaptation.
  • How does Agile address real-world challenges like compliance and resistance? By embedding requirements early and adapting plans transparently.

Where do we use it?

Here are a few thoughts.

  • Academic Peer Review Simulation: What if we simulate a conversation between an academic author and a peer reviewer? This can help researchers anticipate feedback, improve their manuscripts, and understand the peer review process more deeply.
  • Educational Content Development Workshop: What if we have two instructional designers collaborate on creating engaging e-learning modules? Maybe one focusing on pedagogical strategies and learning outcomes while the other focuses on multimedia integration and interactive elements?
  • Customer Service Training Module: What if we simulate interactions between a customer support agent and different customer personas? This could help train agents to handle inquiries with empathy and efficiency. (One of our clients is already looking for this in the fraud space.)
  • Content Localization and Cultural Sensitivity Dialogue: What if a translator and a cultural consultant discuss content adaptation for international markets? We could avoid situations like the Ford Pinto. (Pinto is slang for male genitals in Brazilian Portuguese.)
  • Virtual Conference Panel: What if we host a mock panel on “The Future of Digital”? One panelist could predict radical changes, and another could take a more cautious approach. I can imagine curated content from these being as interesting as human debates.
  • Product Development Brainstorm Session: What if a product manager and a developer discuss features for a new data analytics platform? The manager could focus on market demand, and the developer on technical feasibility.
  • Data Annotation and Quality Assurance Discussion: What if a data annotator and a QA specialist review a dataset for a machine learning project? The annotator explains labeling choices and challenges faced. The QA specialist identifies inconsistencies and proposes a better solution.

More conversation is effectively more brain-power for LLMs. (That’s how Chain of Thought works.) LLM Dialogues are one way we could leverage that.

LLMs still do not locate bounding boxes well

I sent an image to over a dozen LLMs that support vision, asking them:

Detect objects in this 1280×720 px image and return their color and bounding boxes in pixels. Respond as a JSON object: {[label]: [color, x1, y1, x2, y2], …}

None of the models did a good-enough job. It looks like we have some time to go before LLMs become good at bounding boxes.
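To check a reply of the shape requested in the prompt, one option is to parse the JSON, discard boxes that fall outside the 1280×720 image, and compare the rest against hand-labeled boxes using intersection over union (IoU). This is a sketch of that idea, not how the ratings here were produced (those are subjective). The helper names, the sample reply, and the ground-truth box are all made up for illustration.

```python
import json
from typing import Dict, Tuple

Box = Tuple[float, float, float, float]  # x1, y1, x2, y2 in pixels


def parse_boxes(reply: str, width: int = 1280, height: int = 720) -> Dict[str, Box]:
    """Parse {label: [color, x1, y1, x2, y2]} and keep only boxes inside the image."""
    boxes: Dict[str, Box] = {}
    for label, value in json.loads(reply).items():
        _color, x1, y1, x2, y2 = value
        if 0 <= x1 < x2 <= width and 0 <= y1 < y2 <= height:
            boxes[label] = (x1, y1, x2, y2)
    return boxes


def iou(a: Box, b: Box) -> float:
    """Intersection over union: 1.0 is a perfect match, 0.0 is no overlap."""
    overlap_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    overlap_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    intersection = overlap_w * overlap_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - intersection
    return intersection / union if union else 0.0


# Made-up model reply and made-up ground truth, for illustration only.
reply = '{"ball": ["red", 100, 100, 200, 200], "ghost": ["blue", 1200, 700, 1400, 900]}'
truth = {"ball": (150, 100, 250, 200)}

predicted = parse_boxes(reply)
scores = {
    label: iou(predicted[label], box)
    for label, box in truth.items()
    if label in predicted
}
print(predicted)  # "ghost" is dropped because it extends past the image
print(scores)
```

IoU penalizes both a misplaced box and a wrongly sized one, so separate position and size ratings like the ones below would need separate measures, such as center distance and width/height error.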

I’ve given them a subjective rating on a 1-5 scale below.

| Model | Positions | Sizes |
| --- | --- | --- |
| gemini-1.5-flash-001 | 🟢🟢🟢🔴🔴 | 🟢🟢🟢🟢🔴 |
| gemini-1.5-flash-8b | 🟢🟢🟢🔴🔴 | 🟢🟢🟢🔴🔴 |
| gemini-1.5-flash-002 | 🟢🟢🔴🔴🔴 | 🟢🟢🟢🔴🔴 |
| gemini-1.5-pro-002 | 🟢🟢🟢🔴🔴 | 🟢🟢🟢🟢🔴 |
| gpt-4o-mini | 🟢🔴🔴🔴🔴 | 🟢🟢🔴🔴🔴 |
| gpt-4o | 🟢🟢🟢🟢🔴 | 🟢🟢🟢🟢🔴 |
| chatgpt-4o-latest | 🟢🟢🟢🟢🔴 | 🟢🟢🟢🟢🔴 |
| claude-3-haiku-20240307 | 🟢🔴🔴🔴🔴 | 🟢🟢🔴🔴🔴 |
| claude-3-5-sonnet-20241022 | 🟢🟢🟢🔴🔴 | 🟢🟢🟢🔴🔴 |
| llama-3.2-11b-vision-preview | 🔴🔴🔴🔴🔴 | 🔴🔴🔴🔴🔴 |
| llama-3.2-90b-vision-preview | 🟢🟢🟢🔴🔴 | 🟢🟢🟢🔴🔴 |
| qwen-2-vl-72b-instruct | 🟢🟢🟢🔴🔴 | 🟢🟢🔴🔴🔴 |
| pixtral-12b | 🟢🟢🔴🔴🔴 | 🟢🟢🟢🔴🔴 |
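
The ratings above are eyeballed. If you had hand-labelled reference boxes (this experiment did not), a rough sketch like the one below could check a reply mechanically. It assumes the reply follows the JSON shape requested in the prompt. parseBoxes and iou are hypothetical helper names, and intersection-over-union is a swapped-in scoring idea, not the method behind the table.

```javascript
// Keep only well-formed entries: [color, x1, y1, x2, y2] inside the image.
function parseBoxes(json, width = 1280, height = 720) {
  const boxes = {};
  for (const [label, value] of Object.entries(JSON.parse(json))) {
    if (!Array.isArray(value) || value.length !== 5) continue;
    const [color, x1, y1, x2, y2] = value;
    const coords = [x1, y1, x2, y2];
    if (!coords.every((n) => Number.isFinite(n))) continue;
    const inside = x1 >= 0 && y1 >= 0 && x2 <= width && y2 <= height;
    if (inside && x1 < x2 && y1 < y2) boxes[label] = { color, box: coords };
  }
  return boxes;
}

// Intersection over union of two [x1, y1, x2, y2] boxes: 1 is a perfect match, 0 is no overlap.
function iou(a, b) {
  const w = Math.max(0, Math.min(a[2], b[2]) - Math.max(a[0], b[0]));
  const h = Math.max(0, Math.min(a[3], b[3]) - Math.max(a[1], b[1]));
  const area = (r) => (r[2] - r[0]) * (r[3] - r[1]);
  const intersection = w * h;
  return intersection / (area(a) + area(b) - intersection);
}

// Made-up reply for illustration: one valid box, one that falls outside the image.
const reply = '{"cat": ["black", 10, 20, 110, 220], "sun": ["yellow", -5, 0, 2000, 10]}';
const parsed = parseBoxes(reply);
console.log(Object.keys(parsed)); // only "cat" survives
console.log(iou(parsed.cat.box, [10, 20, 110, 220])); // 1 for an exact match
```

A box that is roughly in the right place but the wrong size still gets a low score here, which is harsher than rating positions and sizes separately.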

I used an app I built for this.

Here is the original image along with the individual results.

Update

Adding gridlines with labeled axes helps the LLMs. (Thanks @Bijan Mishra.) Here are a few examples:

Villager trading is the fastest way to Fortune III

I asked o1-preview what the fastest way to get to a Fortune III enchantment was. My options were:

  1. Using a Fishing Rod with Luck of the Sea III + Lure III and repeatedly fishing.
  2. Using an Enchanting Table repeatedly until I get Fortune III, factoring in the time it takes to gain the experience for these attempts.
  3. Making a Villager a Librarian and breaking their Lectern and setting it up again

In short:

  1. Fishing would take ~30 hours IRL
  2. Enchanting Table would take ~1 hour IRL, even with efficient XP gaining methods
  3. Villager trading would take ~18 minutes IRL

No contest.

Looks like my intuition was right about this.

How does Gemini process videos?

The Gemini documentation is clear:

The File API service extracts image frames from videos at 1 frame per second (FPS) and audio at 1Kbps, single channel, adding timestamps every second. These rates are subject to change in the future for improvements in inference.

Note: The details of fast action sequences may be lost at the 1 FPS frame sampling rate. Consider slowing down high-speed clips for improved inference quality.

Individual frames are 258 tokens, and audio is 32 tokens per second. With metadata, each second of video becomes ~300 tokens, which means a 1M context window can fit slightly less than an hour of video.

To ask questions about time-stamped locations, use the format MM:SS, where the first two digits represent minutes and the last two digits represent seconds.

But on this ThursdAI episode: Oct 17 – Robots, Rockets, and Multi Modal Mania…, at 1:00:50, Hrishi says

I don’t think it’s a series of images anymore because when I talk to the model and try to get some concept of what it’s perceiving, it’s no longer a series of images.

If that’s the case, it’s a huge change. So I tested it with this video.

This video has 20 numbers refreshing at 4 frames per second.

When I upload it to AI Studio, it takes 1,316 tokens. The video is 5 seconds long (20 numbers at 4 per second), and 5 frames × 258 tokens = 1,290 tokens with no audio, which is close enough. So I’m partly convinced that Gemini is still processing videos at 1 frame per second.

Then, I asked it to Extract all numbers in the video using Gemini 1.5 Flash 002 as well as Gemini 1.5 Flash 8b. In both cases, the results were: 2018, 85, 47, 37, 38.

These are frames 2, 6, 10, 14, 18 (out of 20). So, clearly, Gemini is still sampling at about 1 frame per second, starting somewhere between 0.25 and 0.5 seconds into the video.
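
The arithmetic behind this can be sketched in a few lines. This is only a model of the sampling, assuming frames are numbered from 1 and each stays on screen for 1/4 of a second. sampledFrames is a hypothetical helper, and the 0.3 second offset is just one value inside the 0.25 to 0.5 second window.

```javascript
// Which frame (numbered from 1) is on screen at each sampling instant?
function sampledFrames({ videoFps, frameCount, sampleFps = 1, offset = 0 }) {
  const duration = frameCount / videoFps;
  const frames = [];
  for (let n = 0; offset + n / sampleFps < duration; n++) {
    const t = offset + n / sampleFps;
    frames.push(Math.floor(t * videoFps) + 1);
  }
  return frames;
}

// 20 numbers at 4 per second, sampled once a second starting at 0.3 s
console.log(sampledFrames({ videoFps: 4, frameCount: 20, offset: 0.3 })); // [2, 6, 10, 14, 18]

// Token arithmetic from the documentation figures quoted above
const tokensPerFrame = 258;
const expectedTokens = 5 * tokensPerFrame; // 1290, against the 1,316 reported by AI Studio
const minutesPerMillionTokens = 1_000_000 / 300 / 60; // roughly 55.6 minutes at ~300 tokens per second
console.log(expectedTokens, minutesPerMillionTokens.toFixed(1));
```

Any offset from 0.25 up to (but not including) 0.5 seconds gives the same five frames, which is why the results only narrow the start time down to that window.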

How to recruit based on IIT JEE Rank vs GPA

Preserving this post by Daniel George showing the IIT Bombay 2014 GPA vs JEE Rank on a log scale.

What I found interesting was:

  • A better (numerically lower) JEE rank generally means you won’t score too low, but you needn’t score too high.
  • The worse (numerically higher) the JEE rank, the greater the spread of GPA.
  • A high GPA can come from any rank (8+ GPA is uniformly distributed across ranks), but a low GPA is generally only from the lower rankers (6- GPA is mostly from 500+ rank.)

So, it’s better to recruit based on GPA rather than JEE rank, unless you’re going after the very best students (where it makes less difference.)

    Clone any voice with a 15-second sample

    It’s surprisingly easy to clone a voice using F5-TTS: “A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching”.

    Here’s a clip of me, saying:

    I think Taylor Swift is the best singer. I’ve attended every one of her concerts and in fact, I’ve even proposed to her once. Don’t tell anyone.

    (Which is ironic since I didn’t know who she was until this year and I still haven’t seen or heard her.)

    You’ll notice that my voice is a bit monotonous. That’s because I trained it on a segment of my talk that’s monotonous.

    Here’s the code. You can run this on Google Colab for free.

    A few things to keep in mind when preparing the audio.

    1. Keep the input to just under 15 seconds. That’s the optimal length.
    2. For expressive output, use an input with a broad range of voice emotions
    3. When using unusual words (e.g. LLM), including the word in your sample helps
    4. Transcribe input.txt manually to get it right, though Whisper is fine to clone in bulk. (But then, who are you and what are you doing?)
    5. Sometimes, each chunk of audio generated has a second of audio from the original interspersed. I don’t know why. Maybe a second of silence at the end helps.
    6. Keep punctuation simple in the generated text. For example, avoid hyphens like “This is obvious – don’t try it.” Use “This is obvious, don’t try it.” instead.

    This has a number of uses I can think of (er… ChatGPT can think of), but the ones I find most interesting are:

    1. Author-narrated audio books. I’m sure this is coming soon, if it’s not already there.
    2. Personalized IVR. Why should my IVR speak in some other robot’s voice? Let’s use mine. (This has some prank potential.)
    3. Annotated presentations. I’m too lazy to speak. Typing is easier. This lets me create, for example, slide decks with my voice, but with editing made super-easy. I just change the text and the audio changes.

    How can non-developers learn AI coding?

    How can non-programmers build apps?

    Claude.ai, Replit.com, Bolt.new, V0.dev, Pythagora.ai and a few other tools write and deploy code just based on a prompt. You should try them out.

    “But how do you build the skill? Is there a tutorial?” I’m often asked. No, I can’t find a tutorial, but here is my suggestion.

    1. You probably can’t guess what’s easy or hard. e.g. “Take my picture in black & white” is FAR easier than “When’s the next lunar eclipse?”
    2. So if the app doesn’t work, try 2-3 times, then GIVE UP! Note it down. Then try something else. (You’ll soon get a feel for what’s possible.)
    3. Revisit what failed 3-6 months later. It might suddenly become possible.

    Tools to publish annotated talks from videos

    Arun Tangirala and I webinared on “AI in Education” yesterday.

    (PS: “Webinared” is not a word. But “verbing weirds language”.)

    This post isn’t about the webinar, which went on for an hour and was good fun.

    This post isn’t about my preparation for the webinar, which happened frantically 15 minutes before it started.

    This post is about how I created the annotated talk at https://github.com/sanand0/ai-in-education-webinar (inspired by Simon Willison’s annotated presentations process) — a post-processing step that took ~3 hours — and the tools I used for this.

    Scrape the comments

    The Hindu used StreamYard. It’s web-based and has a comments section. I used JS in the DevTools Console to scrape the comments. Roughly: $$(".some-class-name").map(d => d.textContent)

    But the comments are not all visible together. As you scroll, newer/older comments are loaded. So I needed to use my favorite technique: Cyborg Scraping. During Q&A, I kept scrolling to the bottom and ran:

    // One-time set-up
    messages = new Set();
    
    // Run every now and then after scrolling to the bottom
    // Stores all messages without duplication
    $$(".some-class-name").forEach(d => messages.add(d.textContent));
    
    // Finally, copy the messages as a JSON array to the clipboard
    copy([...messages])

    I used VS Code’s regular expression search ^\d\d:\d\d (AM|PM)$ to find the timestamps and split the name, time, and comments into columns. Multiple-cursors all the way. Then I pasted it in Excel to convert it to Markdown. I added this in the Comments in the Chat section.
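
The same split can be scripted. Here is a minimal sketch, assuming each scraped message flattens to a name line, then a timestamp line, then one or more comment lines. That layout is my assumption for illustration, and toRows and toMarkdown are hypothetical helpers. The names and comments in the sample are made up.

```javascript
// Same pattern as the VS Code search: a line that is only a timestamp
const TIME = /^\d\d:\d\d (AM|PM)$/;

// Turn flattened chat text into { name, time, comment } rows
function toRows(text) {
  const lines = text.split(/\r?\n/).map((line) => line.trim()).filter(Boolean);
  // Positions of timestamp lines. The line just before each one is taken as the name.
  const stamps = lines.flatMap((line, i) => (i > 0 && TIME.test(line) ? [i] : []));
  return stamps.map((at, n) => {
    const end = n + 1 < stamps.length ? stamps[n + 1] - 1 : lines.length;
    return { name: lines[at - 1], time: lines[at], comment: lines.slice(at + 1, end).join(" ") };
  });
}

// Render rows as a Markdown table, escaping pipes inside cells
function toMarkdown(rows) {
  const cell = (s) => s.replace(/\|/g, "\\|");
  return [
    "| Name | Time | Comment |",
    "| --- | --- | --- |",
    ...rows.map((r) => `| ${cell(r.name)} | ${r.time} | ${cell(r.comment)} |`),
  ].join("\n");
}

const sample = "Asha\n07:05 PM\nGreat session\nRavi\n07:06 PM\nWhich tools?\nAny free ones?";
console.log(toMarkdown(toRows(sample)));
```

A comment whose last line happens to sit right before a timestamp would be mistaken for a name, so a manual pass over the output is still worthwhile.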

    (Excel to convert to Markdown? Yeah. My formula is below.)

    Transcribe the video

    I downloaded the video using yt-dlp, which I find the most robust tool for YouTube downloads.

    I used ffmpeg.exe -i webinar.mp4 -b:a 32k -ac 1 -ar 22050 webinar.mp3 to convert the video to audio. I use these settings for voice (not music) to get a fairly small MP3 file. (I should have used Opus, which is much smaller. I’ll do that next.)

    Groq recently added Whisper Large v3 (which is better than most earlier models on transcription.) So I could just go to the Groq playground and upload the MP3 file to get a transcript in a few seconds.

    Add images to the transcript

    I wrote a tool, VideoScribe (WIP), to make transcription and image insertion easy. It uses ffmpeg -i webinar.mp4 -vf select='key',showinfo -vsync vfr -compression_level 10 "%04d.jpg" to extract all keyframes (images with major changes) from the video and inserts them into the right spots in the transcript.

    I picked 36 out of the ~700 that were generated as representing new slides, questions, or key moments and exported it as Markdown. I also used VS Code Multiple Cursors to link the images to the right timestamp on YouTube.

    Clean up the transcript

    Up to here was mostly automated. This step took me an hour, though. I copied chunks of the transcript and passed them to Claude 3.5 Sonnet via Cursor with this prompt:

    Clean up this webinar transcript segment. Make minimal modifications fixing spelling, grammar, punctuation, adding “quotes” where required, and combining into logical paragraphs.

    This is what gave me the bulk of the webinar transcript. (I’d like to automate this next.)

    Extract tools

    Many audience members asked for a list of tools we mentioned. So I passed ChatGPT the transcript and asked:

    List all tools mentioned in this webinar

    It listed 12 tools, but I know enough to be sceptical. So…

    Anand: Were any tools missed?

    ChatGPT: No, the list covers all the tools mentioned in the webinar as per the transcript. If you noticed any specific tool that I missed, please let me know.

    Anand: There WERE a few tools missed. Look closely. (I was bluffing, BTW.)

    ChatGPT: You’re right. Upon closer review, here are the additional tools mentioned:…

    Anand: There are a few more that you missed.

    ChatGPT: Got it. Here’s a revised list that should include all the tools mentioned:

    That generated the Tools mentioned in the webinar.

    Questions

    There were several questions in the comments. I passed them into my Topic Naming Colab notebook, which clusters similar questions into subtopics (I asked it to pick 40), groups those into higher-level topics, and names all of them.

    That created the list of questions people asked, in a categorized way.

      NotebookLM

      Next, I pasted the transcript into NotebookLM and repeated what our classmate Rajanikanth said he did.

      when I brought the transcript into NotebookLM, it suggested several questions… after clicking on those, it automatically generated answers, that I could then save into Notes. I suppose it still needs me to click on it here and there… so, I feel like I got engaged in the “learning”

      So I “clicked here and there” and generated:

      … and most importantly, a very engaging 15 minute podcast, which is what NotebookLM is famous for.

      Note: NotebookLM now lets you customize your podcast. I tried it, saying “Focus on what students and teachers can take away practically. Focus on educating rather than entertaining.” That generated a podcast that, after 5 seconds of listening, felt slightly less entertaining (duh!) so I reverted to the original.

      Publishing

      I usually publish static content as Markdown on GitHub Pages. The entire content was pushed to https://github.com/sanand0/ai-in-education-webinar with GitHub Pages enabled.

      I also created a simple index.html that uses Docsify to convert the Markdown to HTML. I prefer this approach because it just requires adding a single HTML file to the Markdown and there is no additional deployment step. The UI is quite elegant, too.

      Simplifying the workflow

      This entire workflow took me about 3 hours. Most of the manual effort went into:

      1. Picking the right images (15 minutes)
      2. Cleaning up the transcript (50 minutes)
      3. Manually editing the question topics (30 minutes)

      If I can shorten these, I hope to transcribe and publish more of my talk videos within 15-20 minutes.

      Leaning into the power of AI coding

      Yesterday (15 Oct 2024), I used Cursor to code more than I ever have. (Doing’s how we learn, I guess. Not just reading.)

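      | Date | Usage |
      | --- | --- |
      | 05 Oct 2024 | 15 |
      | 06 Oct 2024 | 27 |
      | 07 Oct 2024 | 87 |
      | 08 Oct 2024 | 16 |
      | 09 Oct 2024 | 10 |
      | 10 Oct 2024 | 42 |
      | 11 Oct 2024 | 24 |
      | 12 Oct 2024 | 57 |
      | 13 Oct 2024 | 15 |
      | 14 Oct 2024 | 28 |
      | 15 Oct 2024 | 186 |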

      This was mainly to create and publish 2 libraries on npm over 6 hours:

      1. asyncsse – which converts a Server-Sent Event stream into an async iterator that I can use in a for await … of loop
      2. asyncllm – which standardizes the Server-Sent Events streamed by the popular LLMs into an easy to use form.
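
To make the first idea concrete, here is a minimal sketch of turning SSE text chunks into an async iterator. This is not the asyncsse source. sseEvents and parseEvent are hypothetical names, and the sketch assumes the chunks arrive as already-decoded strings.

```javascript
// Parse one event block ("field: value" lines) into an object, or null if it has no fields
function parseEvent(block) {
  const event = {};
  for (const line of block.split(/\r?\n/)) {
    if (!line || line.startsWith(":")) continue; // blank lines and comments
    const i = line.indexOf(":"); // split on the FIRST colon only
    const field = i === -1 ? line : line.slice(0, i);
    const value = i === -1 ? "" : line.slice(i + 1).replace(/^ /, "");
    // Repeated data lines are joined with a newline. Other fields are overwritten.
    if (field === "data" && event.data !== undefined) event.data += "\n" + value;
    else event[field] = value;
  }
  return Object.keys(event).length ? event : null;
}

// Buffer incoming chunks and yield each complete event (events end with a blank line)
async function* sseEvents(chunks) {
  let buffer = "";
  for await (const chunk of chunks) {
    buffer += chunk;
    const blocks = buffer.split(/\r?\n\r?\n/);
    buffer = blocks.pop(); // the last piece may be incomplete, so keep it for later
    for (const block of blocks) {
      const event = parseEvent(block);
      if (event) yield event;
    }
  }
  // Anything left in the buffer is an unterminated event and is dropped
}

// Usage: any async iterable of strings works as the source
async function* demoChunks() {
  yield "data: hel";
  yield "lo\n\n: ping\n\ndata: x\ndata: y\n\n";
}

async function demo() {
  for await (const event of sseEvents(demoChunks())) console.log(event);
}
demo();
```

The generator keeps the parsing state (the buffer) private, so the caller only sees a plain for await … of loop.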

      This exercise broke several mental barriers for me.

      Writing in a new language. Deno 2.0 was released recently. I was impressed by the compatibility with npm packages. Plus, it’s a single EXE download that includes a linter, tester, formatter, etc. Like all recent cool fast tools, it’s written in Rust. So I decided to use it for testing. Running deno test runs the entire test suite. My prompts included asking it to:

      • Create a Deno HTTP server to mock requests for the tests. This is cool because a single, simple code chunk runs the server within the test suite.
      • Serve static files from samples/ to move my tests into files

      Writing test cases. Every line of this code was written by Cursor via Claude 3.5 Sonnet. Every line. My prompt was, Look at the code in @index.js and write test cases for scenarios not yet covered. It’s surprising how much of the SSE spec it already knew, and anticipated edge cases like:

      • SSE values might have a colon. I learnt for the first time that the limit parameter in String.split() is very different from Python’s str.split. (JavaScript splits on every separator, then keeps only the first few pieces and discards the rest. Python packs the rest into the last piece.) This helped me find a major bug.
      • SSE has comments. Empty keys are treated as strings. Didn’t know this.
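
The split difference is easy to see on a single line. This is a small illustration with a made-up SSE line, not the library’s code. splitField is a hypothetical helper.

```javascript
const line = 'data: {"time": "10:30"}';

// JavaScript splits on EVERY colon, then keeps only the first 2 pieces.
// Everything after the second colon is silently lost.
const naive = line.split(":", 2);
console.log(naive); // [ 'data', ' {"time"' ]

// Python's line.split(":", 1) would instead return ['data', ' {"time": "10:30"}'].
// In JavaScript, cutting at the first colon gives the same effect.
function splitField(text) {
  const i = text.indexOf(":");
  if (i === -1) return [text, ""];
  return [text.slice(0, i), text.slice(i + 1).replace(/^ /, "")];
}

const [field, value] = splitField(line);
console.log(field, value); // data {"time": "10:30"}
```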

      I was able to use it to generate test cases based on content as well. Based on @index.js and @openai.txt write a test case that verifies the functionality created the entire test case for OpenAI responses. (I did have to edit it because LLMs don’t count very well, but it was minimal.)

      Bridging test coverage gaps. The prompt that gave me the most delightful result was Are there any scenarios in @index.js not tested by @test.js? It did a great job of highlighting that I hadn’t covered Groq, Azure, or CloudFlare AI workers (though they were mentioned in the comments), error handling, empty/null values in some cases, or multiple tool calls. I had it generate mock test data for some of these and added the tests.

      Enhancing knowledge with references. I passed Cursor the SSE documentation via @https://developer.mozilla.org/en-US/docs/Web/API/Server-sent_events/Using_server-sent_events and asked it to find more scenarios my code at @index.js had not covered. This found a number of new issues.

      Generating bindings. I avoid TypeScript because I don’t know it. Plus, it requires a compilation step for the browser. But TypeScript bindings are helpful. So I prompted Cursor, using the Composer (which can create new files) to Create TypeScript bindings for @index.js in index.d.ts – which it did almost perfectly.

      Check for errors. I typed Check this file for errors on @index.d.ts. I don’t know enough to figure this out. It went through the description and said everything seems fine. But I saw a TypeScript plugin error that said, Property 'data' of type 'string | undefined' is not assignable to 'string' index type 'string'.ts(2411). When prompted, it spotted the issue. (The earlier code assumed all properties are strings. But some can be undefined too. It fixed it.)

      Documentation. At first, I asked the Composer to Create a README.md suitable for a world-class professional open source npm package and it did a pretty good job. I just needed to update the repository name. I further prompted it to Modify README based on @index.js and share examples from @test.js on asyncllm, which did an excellent job.

      Code review. I asked it to Review this code. Suggest possible improvements for simplicity, future-proofing, robustness, and efficiency and it shared a few very effective improvements.

      1. Regex lookaheads for efficient regular expression splitting, i.e. use buffer.split(/(?=\r?\n\r?\n)/) instead of buffer.split(/(\r?\n\r?\n)/) — and though I haven’t tested this, it looked cool.
      2. Restructuring complex if-else code into elegant parsers that made my code a lot more modular.
      3. Error handling. It added try {} catch {} blocks in a few places to catch errors I don’t anticipate. They don’t hurt.
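
Since the lookahead suggestion was untested, here is a small sketch of what the two regular expressions from the first item actually return. It only illustrates the split behavior on made-up input and says nothing about how the library uses the result.

```javascript
const buffer = "data: a\n\ndata: b\n\n";

// Capturing group: separators come back as their own array items
const withCapture = buffer.split(/(\r?\n\r?\n)/);
console.log(withCapture); // [ 'data: a', '\n\n', 'data: b', '\n\n', '' ]

// Lookahead: zero-width match, so each separator stays at the start of the next piece
const withLookahead = buffer.split(/(?=\r?\n\r?\n)/);
console.log(withLookahead); // [ 'data: a', '\n\ndata: b', '\n\n' ]

// Caveat: with CRLF line endings the lookahead also matches one character later,
// because "\n\r\n" satisfies the pattern too. The "\r" gets split off on its own.
const crlf = "a\r\n\r\nb".split(/(?=\r?\n\r?\n)/);
console.log(crlf); // [ 'a', '\r', '\n\r\nb' ]
```

So the lookahead version produces fewer pieces, but each piece needs trimming, and CRLF input needs extra care.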

      Code simplification. Several times, I passed it a code snippet, saying just Simplify. Here’s an example:

      const events = [];
      for await (const event of asyncLLM(...)) {
        events.push(event);
      }

      This can be simplified to

      const events = await Array.fromAsync(asyncLLM(...))

      Packaging. I copied a package.json from an earlier file and asked it to Modify package.json, notable keywords and files and scripts based on @index.js which it did a perfect job of.

      Blogging. I wrote this blog post with the help of the chat history on Cursor. Normally, such blog posts take me 3-4 hours. This one took 45 minutes. I just had to pick and choose from history. (I lost a few because I renamed directories. I’ll be careful not to do that going forward.)


      Overall, it was a day of great learning. Not in the classroom sense of “Here’s something I didn’t know before”, but rather the cycling / swimming sense of “Here’s something I now know to do.”

      Challenge: code in 10 minutes with only an LLM

      I gave a bonus assignment in LLM coding to ~1,000 students at the Tools in Data Science course at IITM.

      Here is an OPTIONAL project: Record a 10-minute video in which you create an application entirely using LLMs and deploy it.

      • Any app is fine. Any language. Simple or complex. Business or gaming. Anything is fine. Your choice.
      • Create the app only using LLMs. You can use an LLM (ChatGPT, Claude.ai, Gemini, Cursor, Cody, etc.) but you can only prompt the app to write code. You can copy-paste code and run code, but don’t write or edit even a single line of code directly. Use LLMs to debug and edit. Code completion is NOT allowed – only prompting/chatting.
      • Record the entire process in 10 min. Don’t edit, trim, enhance, or annotate the video. You should record yourself creating the entire app from start to finish. Practice beforehand if you like. Record in 1 take.
      • Share the video and app. Publish the video publicly anywhere (e.g. YouTube) and share the link. Publish the app publicly anywhere (e.g. GitHub Pages, Glitch.me, Heroku, etc.) or upload a ZIP file with the code (for slightly lower marks.)
      • Submit via a reply to this thread. Multiple submissions per person are fine. Work in groups if you like but only the submitter gets marks.

      I will award up to 1 bonus mark at my discretion based on:

      • How well you prompt the LLM
      • How impressive the app is (if you’ve hosted it – I probably won’t run your code)
      • How closely you followed the rules above

      This exercise is to help you (and me) learn a topic that’ll probably change the way we all code: using LLMs to code.

      Cutoff date: 7 Oct 2024, AoE

      Adoption was low but in line with the industry.

      About 50 students (around 5% of the batch) attempted this. In contrast, ~70-80% take the (mostly) mandatory graded assignments.

      This is comparable with what I see at Straive. When given the option, about 5% of Straive’s 20,000 people use LLMs in a given week. (There are many things different there. I’m tracking LLM use, not LLM coding. It’s a work environment, not a learning one. There’s no bonus mark awarded. But still, I see the “around 5%” number popping up often.)

      Games were the most popular category, mainly Tic Tac Toe and Snake Game.

      This is understandable. They’re easy to think of, implement, and use.

      1. Candy Crush – Video, App / Code
      2. Catch the Ball – Video, App / Code
      3. Flappy Bird – Video, App / Code
      4. Flappy Bird – Video, App / Code
      5. Magic Square – Video, App / Code
      6. Memory Match – Video, App / Code
      7. Memory Match – Video, App / Code
      8. Minesweeper – Video, App / Code
      9. Minesweeper – Video, App / Code
      10. N-Queens – Video, App / Code
      11. Number Guessing Game – Video, App / Code
      12. Open Mines game – Video, App / Code
      13. Rock-Paper-Scissors – Video, App / Code
      14. Rock-Paper-Scissors – Video, App / Code
      15. Sliding Game – Video, App / Code
      16. Snake Game – Video, App / Code
      17. Snake Game – Video, App / Code
      18. Snake Game – Video, App / Code
      19. Snake Game – Video
      20. Snake Game – Video, App / Code
      21. Snake Game + Pomodoro – Video, App / Code
      22. Sudoku – Video, App / Code
      23. Sudoku – Video, App / Code
      24. Tic Tac Toe – Video, App / Code
      25. Tic Tac Toe – Video, App / Code
      26. Tic Tac Toe – Video, App / Code
      27. Tic Tac Toe – Video, App / Code
      28. Tic Tac Toe – Video, App / Code
      29. Tic Tac Toe – Video, App / Code
      30. Tile Matching Game – Video, App / Code
      31. Word scramble game – Video, App / Code
      32. Wordle – Video, App / Code

      Productivity Apps / Tools were the next most common. Calculators, Timers, etc.

      Again, understandable. They’re easy to think of, implement, and use.

      1. Age Calculator – Video, App / Code
      2. Age Calculator – Video, App / Code
      3. Air Mile Calculator – Video, App / Code
      4. Birth Day Calculator – Video, App / Code
      5. BMI Calculator – Video, App / Code
      6. BMI Calculator – Video, App / Code
      7. Height Weight Calculator – Video, App / Code
      8. Music playlist – Video
      9. Post-it Notes – Video, App / Code
      10. Timer – Video, App / Code
      11. Timer – Video, App / Code
      12. Todo App – Video, App / Code
      13. Todo App – Video, App / Code
      14. Todo App – Video, App / Code

      Real-life apps / tools were diverse and interesting.

      This final category of apps were things one might use in real life. They were more ambitious (mostly), more practical (always), and unique (by far).

      This is the 1% that might lead to startup ideas.

      1. Discussion Board – Video, App / Code
      2. Document analysis – Video, App / Code
      3. Dress Designer – Video, App / Code
      4. Image Metadata – Video, App / Code
      5. Inventory management – Video, App / Code
      6. PCOS detector – Video, App / Code
      7. RAG on Streamlit – Video, App / Code
      8. TNEB Power Failure – Video, App / Code

      Things I learned.

      • You can upload files directly into GitHub via the UI (Video)
      • You can run GUI programs in Python on the mobile via Replit’s tkinter (Video)
      • You can get the LLM to generate code for Google Scripts (Video)
      • … and a whole bunch of tricks like adding audio BGM, special powers, emotional prompting, etc.
      • Typically, 5% are early adopters. 1% are GOOD early adopters.