szia.ai

Forward-looking Laziness: What Changes When AI Writes 95% of the Code

Tue, 21 Apr 2026 04:51:26 GMT

AI-enhanced project delivery is great: it gives you a ~1.5× speedup almost for free. But making your team AI-native — and reaching a 3-5× organization-wide speedup — requires real mastery of orchestration on top of a well-designed AI strategy. In this article I walk through solutions to the individual-level dilemmas, the team-level challenges, and the organization-wide design choices that tend to become the real roadblocks to efficiency. I’ll also suggest where to spend the saved hours — instead of simply handing them back to the client as a discount.

I’ve tried to keep the advice generalizable, but my experience comes from projects built around coding and data analysis / modeling. Even if your work sits far from that, you may still find parts of this useful.

Individual Dilemmas

A few obvious questions tend to surface while you watch the AI do your job over your second morning coffee: what should I delegate, and how? And what happens once the machine knows everything I’m currently doing?

As Larry Wall wrote in 1991 (and as I re-learnt ten minutes ago from Claude),

the three great virtues of a programmer are laziness, impatience, and hubris.

Thirty-five years later it still holds — with a small reinterpretation for the AI era:

Laziness: delegate to the AI as much as you reasonably can, and do it in a structured lazy way (the kind that pays back on project #2, not just today).
Impatience: don’t stay stuck with a bad model, a bad prompt, or a bad idea. Switch or correct course quickly.
Hubris: don’t tolerate mediocre solutions. Stay the guide, not the guided.

I’d happily add two more: humility, to accept the complexity of the problems you take on, and resilience, to keep going after the first few failed attempts.

Boundaries & Shifts

AI can fully replace me for simple but time-consuming coding tasks (crawlers, data loaders, format converters) and accelerate almost everything else — backends, UIs, data analysis. But in exchange, I now need to write more tests and spend more time on design. AI helps in both steps too (for example, generating logical system-design documents), but it isn’t yet strong at picking the best solution, and it isn’t creative enough to surface the test cases that really matter. Human guidance is still needed — at least until verifiable AI can hand us something close to a mathematical proof that a proposed solution satisfies the requirements.

A classic but slightly dated example is map coloring. The brute-force version — trying every combination — would run for ~10¹² years on a non-trivial map; a simple reframing (constraint propagation) brings the same problem down to milliseconds. It’s the same trick you use when you pencil in the possible numbers while solving a Sudoku. That particular reframing is now well-known, so the AI will almost always pick the right one. But on complex problems, dozens of small decisions like this have to be made, and each one can significantly affect the final quality — or the running time.

Map coloring: brute force vs. constraint propagation

Map coloring — brute force vs. constraint propagation.

Project managers can lean on AI too. It can assemble a full project plan from rough notes in an Excel, and allocate or schedule tasks across team members — for example, you can ask it to write a script that takes a Markdown file (which it generates from your notes and a few instructions) and uploads the issues to GitHub, organized into Milestones inside a project / Kanban board. The more advanced setups can even produce status presentations almost autonomously — whether via a skill that knows your company style, or via an AI browser agent that edits PowerPoint Online or Google Slides directly (and there are several web-based alternatives, furthermore, Claude Design has just arrived). What AI still won’t do is motivate the team, keep relationships healthy with other departments, or manage the client well enough to keep the whole process smooth.

In the end, once the machine handles the coding and the status decks — and given we can’t realistically drink more than five coffees a day — we’re forced to do something with the free time. Some of that energy ends up going into more client interactions (alongside the project-related AI-meme generation in the unofficial channels). Surprisingly, that turns out to be a net win for product quality and later adoption — it makes more of our projects actually mean something.

There are several other areas where AI can speed up delivery — deployment, CI/CD, monitoring — but let’s first look at how to use it properly.

Mentor Your AI

Good-quality code needs exhaustive prompting — you basically spell out every major step in detail (which fits a control freak’s style very well; definitely not mine :)). Taken literally, that means 4–5-page prompts just to define a slightly complex function. Which defeats the whole point of the title: forward-looking laziness.

Thankfully, most AI-coding (or “co-working”) tools let you share your own — and your organization’s — knowledge with the model so you don’t have to retype it every single time. You can share rules that are injected into every prompt, and skills that get pulled in based on the task at hand. These “additional instructions” come in several forms:

Preferences: the CLAUDE.md file (and its equivalents in other tools), covering your global preferences, cross-project conventions, and project- or team-specific instructions. Typical content: how to organize the project directories, preferred libraries for common tasks, the linter or formatter of choice.
Permissions: global and project-level permissions for Claude Code: what the agent may run without asking, what files it may touch, and so on.
Skills: knowledge that doesn’t need to be in every prompt, but should be available when needed. A skill defines a method for a specific task (e.g., how we crawl data, how we do time-series forecasting, how we use our internal SERP API wrapper). It bundles very narrow know-how that only applies in a handful of situations.
Rules: the “coding codex” (not sure if that’s the right name — basically the set of hard rules the code must follow). Some of these can live in workspace-level preferences (see below).
Agents: the personalities of the different agents you use during a project. For a trading app, you might have a trading adviser, a security auditor, and a code reviewer agent alongside the main coder. It may sound like overkill (“why add more agents if they all use the same LLM?”), but in practice it works: different personas steer the model down different reasoning paths, and a multi-agent setup benefits from division of labor and cross-checking — which usually produces a more robust final result.

These “additional instructions” can live at several levels:

User-level: global instructions in your home .claude folder, applied to all Claude-enhanced activities (including coding).
Workspace-level: usually the folder where you keep your projects (or any Claude-shared working folders — presentations, design work, etc.). Applied to everything inside.
Project-level: the concrete, project-specific instructions (for example, most projects use a docs folder, but this one calls it documents) or background information.

Project-level instructions take precedence over workspace and user-level ones when they conflict.

The animation below shows a few examples — skeletons, really — for each of these hint types, designed to shrink the size of your working prompts (you can see the animation below in three columns following this link):

The anatomy of Claude Code instructions

The anatomy of Claude Code instructions — user, workspace and project levels.

I expect this split between helper files to simplify further over the coming months. The takeaway shouldn’t be the exact file names or paths (you can define your coding style in a workspace-level CLAUDE.md and it will still work fine) but the mindset: define each instruction at the level where you’ll actually use it. Preferences attach to all code you write; skills and rules kick in selectively, based on the task at hand.

The Next Step as a Human

Finishing a project is satisfying, but you usually walk away with the feeling that you could have delivered it slightly better the second time around. Or you’re a little sad you didn’t have the time to add a few more clever features, or to make the algorithms more robust and efficient.

You might also feel that the organization you work for can’t spend enough energy on experimenting with entirely new methods for solving the same problem significantly better (or at least, on adding some fresh charts to the usual dashboards).

And now, you might actually have the time for all of this. Efficiency will be a must within a year for every service provider, so the competition will move to features (and sales, and marketing, and so on). Instead of losing your job to automation, your best chance of keeping it is to delegate all your current tasks to AI, and prepare for a more innovative and exciting future — the important-but-not-urgent tasks.

Figure 3: The AIsenhower Matrix (from the previous blog post)

Challenges in Teamwork

As velocity grows, decision points become more frequent — and more communication is needed to avoid situations where team members edit the same file in contradictory ways (causing painful merge conflicts). When tasks finish faster, reviews become a bottleneck more easily. And lead engineers, instead of focusing on the high-quality solution they’re best placed to design, find themselves spending their time reviewing code produced by junior–AI pair coders.

Communication has also become layered. You no longer just talk to your colleague — you also need to loop in your AI pair-coders so they’re aware of the relevant decisions (chat history isn’t shared knowledge, at least at the time of writing this post).

To succeed as a team, you need to adjust the usual workflow (a bit technical — feel free to skip this list):

Trunk-based flow: developers integrate small, frequent updates into a single central branch (smaller, more frequent PRs). Reviews become easier and code conflicts less frequent, because everyone edits any given piece of code for a shorter window.
Agent-to-agent communication: when generating a frontend, ask your agent to leave comments on what it expects from the backend at each interaction point (e.g., “clicking this button should call POST /api/documents/verify and receive { status: ‘ok’ | ‘flagged’, issues: […] }”). Hopefully, shared agent memory across developer teams will arrive from the major AI coding providers soon.
Documentation of changes: a small AI-generated Markdown summary per PR can make reviews dramatically simpler. It should contain the reasoning behind the change, code snippets of the important parts (with links), and flags on any decisions where the coder (pair) was hesitant. In some cases it can also be shared with other agents — for example, helping the frontend agent follow up on a backend change.
Design and feature documentation: all features should be documented during development, not just at the end of the project, and inside the Git repository. That way the coding agents can read the current context before planning the next move. Likewise, when future directions are shared upfront, nothing gets built against them.
Peer review: instead of piling every PR on the lead engineer (which also blocks them from writing design directions and from doing the coding where their experience actually matters), shift toward peer code reviews, and …
… more tests: because peer reviews are a little less safe, and because AI coding can multiply (theoretical) mistakes, testing matters more than ever (unit tests, AI-driven full functional tests, and so on).

As an extra step in every project (for example, during retrospectives), the team should decide what to change in the prompt additions (CLAUDE.md, skills, etc.). The simplest way is to ask the agents themselves which corrections users added most often, and to check similar patterns in the review comments.

Organization/Department-Level Dilemmas

Introducing AI tools will, in the long run, give developers more meaningful work — but, as we saw, it also creates several challenges at the team level. From the entire organization’s point of view (even at an SME), it’s both a huge opportunity and a significant risk that needs to be managed.

On one hand, imagine saving a significant amount of coding time even while building a solution for the first time — and, on subsequent projects, saving the majority of coding time altogether. However, part of those savings should be channeled into organized innovation (since it won’t happen inside projects anymore). With the time you save, you can run more client iterations (as we discussed earlier) or invest in new features and solutions — augmenting your services.

On the other hand, you face several new risks and tasks to solve:

Security & Data: how much of our codebase, company knowledge, and customer data can we share with AI (none) — and how do we keep the rest out?
- Instead of giving the AI real data, use a structurally identical (and statistically similar) dummy table — there are tools for this (SAS Data Maker, MostlyAI, NVIDIA/gretel, SDV, Tonic), or you can generate your own synthetic data.
- Claude Code (and most other solutions) can be wired to AWS Bedrock models — or, if you can afford it, you can run coding on your own GPUs (which can be isolated from external networks, so you can share more freely).
- Knowledge sharing is critical, but access rights still matter — and employees need proper training on what they can and cannot share with AI tools.
Shared Infrastructure: beyond buying GPUs and sharing their capacity across functions and tasks (to keep them working around the clock), if you allow only on-prem coder agents, there are a few areas where shortcuts simply don’t exist:
- Knowledge management: skills should be built and distributed efficiently across the company (documenting methodology, preferred coding styles, and so on for each department). The collected knowledge should be actively maintained and discussed. Prompt additions (preferences, rules, agent personas) should be shared and jointly developed — as should the list of useful MCP servers.
- Solution deployment templates: to make it effortless to deliver the same solution a second time, each one needs a “user guide” with all required inputs, customization points, and code samples. These guides must also be maintained as the underlying packages evolve.
- Internal accelerators: to keep that maintenance fast, you may want internal toolkits used across multiple deployments that absorb the changes of the underlying packages in one place.
Innovation and service augmentation: once the risks are under control, make your team AI-native as early as possible so you can finish the automation phase first. After that, focus on the areas where competition will be toughest: new and innovative solutions (adding reports and features, experimenting with new algorithms to create cheaper and faster apps, etc.). The biggest trap is to focus only on the automation part.
Tool lock-in: most AI pair-coding solutions now use similar Markdown-based “prompt additions” (though the exact filenames and conventions differ — CLAUDE.md, .cursorrules, .github/copilot-instructions.md, AGENTS.md, and so on), which makes switching relatively painless today. But the knowledge-sharing platform itself should stay independent of any single solution, and the content should be translatable to several of them. The tools are improving very fast — try many of them out on different projects and share experience regularly.
Talent allocation: some changes may be necessary depending on your current habits. Innovation now needs the most talented engineers (they can also lead the automation effort first), while projects can be executed with a slightly more junior-heavy engineer mix. In my view, the overall senior-to-junior ratio shouldn’t change — only the way you allocate them. That said, I can imagine fewer juniors entering the market over time, as it has never been as easy to start a company as it is today.

AI-based solutions aren’t just for accelerating coding — HR, Finance, Marketing, and Sales tasks can all be automated or supported in similar ways, and a similar mindset is needed to do it right.

Conclusion

I used to believe the story of the spaceship and the bicycle was universally known — until I realized it might just be another didactic fable from my former company:

The salesperson paints a dream with the client about a spaceship (how amazing the product will be). The lead engineer then designs a car (since it’s perfectly fine for commuting on Earth). In the end, the project team delivers a bicycle, to keep within budget.

Well, for long-term survival, ship the spaceship (or at least the car), because now you actually can.

Do not forget to SUBSCRIBE (just scroll down), if you like this post :) - the monthly newsletter is starting soon.

2026: AI Won't Take Your Job (It'll Take Your Busywork)

Sat, 21 Feb 2026 16:00:00 GMT

AI can be scary for white-collar office workers — like myself — who wanted to stay professionally active for a few more years. It writes computer programs in languages I’ve never heard of. It makes presentations, phrases letters, writes blog posts, drafts diagrams, analyzes data. But it’s not just scary — it’s exciting and interesting too. So let’s set aside our existential crisis for a moment and appreciate what’s now possible. This post draws an optimistic scenario where we don’t just keep our jobs because TSMC can’t produce enough chips — but where we find ourselves doing more fulfilling work.

It will be uncomfortable to read back in 2027, once it becomes clear how many things I missed and how many predictions I got wrong. Still, I hope you’ll enjoy reading this mid-to-short-term AI roadmap prediction.

Here’s what I’ll cover: why AI is becoming an integration platform rather than a replacement tool, why smaller models are improving faster than frontier ones, how verifiable AI could make “vibe coding” enterprise-ready, and why the 95% enterprise AI failure rate is about to drop significantly. Along the way, some thoughts on security, knowledge graphs, the TSMC bottleneck, and whether you should learn to bake and brew coffee as a plan B.

AI as Integration Platform: 2026 Is the Year of Collaboration

It’s impressive that AI can write code and solve mathematical theorems. But enterprise adoption is still remarkably low. In most organizations, AI isn’t freeing up resources yet — it’s adding extra complexity on top of existing workflows. Everyone feels it should be useful, but the return on investment hasn’t materialized for most.

Enterprises have well-tested, reliable workflows that took years to build. They want vertical, end-to-end task solving — solutions that are aware of existing systems, that use and update them, and that work without errors. Perhaps with some human oversight at the beginning. What they’re building, essentially, is a digital extension of their workforce. (I know — it doesn’t sound optimistic yet. Bear with me. 🐻)

This realization is shifting how companies think about their AI roadmaps. AI is increasingly seen as an integration platform: a layer where you can bring together your past decade of model development, your accumulated domain knowledge, and your existing tools — to create a digital workforce that actually delivers value. And because that workforce is genuinely valuable, it justifies real investment.

Now, consider what a single complex task looks like in practice. Analyzing a market entry opportunity might need a data retrieval agent, a financial modeling agent, a regulatory compliance agent, and a report generator. You need to select these agents from your pool, provide them with relevant data, coordinate their work, and aggregate the results. For that, you need an intelligent platform that handles this orchestration well.

On the supplier side, you can already find third-party agents on Google Cloud Platform and AWS Marketplace. However, these don’t yet solve your problem end-to-end — they can’t automatically select the right agents based on your budget and predicted token usage for a given task. But they will. And if you’re an SME without an internal platform and years of accumulated data, you’ll rely on these external providers. Within these platforms, there will be competition among agents: only well-tested ones will be able to command premium prices, while new providers will need to offer lower rates until their track record grows. The platform will split the payment and take a commission. Some blockchain-based B2B agent platforms already exist, but none has emerged as the standard yet — and most are still far from production quality.

Meanwhile, the human-in-the-loop pattern is maturing (which sounds like the opposite of these fully autonomous agent platforms, but here we’re back in the enterprise environment). AI is not ready for full autonomy in high-stakes settings, and it shouldn’t be. But instead of constant hand-holding, AI systems are learning which tasks you’re comfortable delegating and which ones need a checkpoint. The interaction is becoming adaptive: the AI requests human input when it’s uncertain, then learns from the correction to need less intervention next time. Full autonomy for complex enterprise tasks is probably 3-5 years away. But the path from “supervise everything” to “review the important bits” is already well underway.

For building usable AI platforms, several building blocks need to mature: efficient on-premise models (you can’t share all your data with third-party vendors), verifiable AI (you need to know the problem is actually solved), secure AI (always), and efficient search across an ocean of data and tools. Plus humans — to drive the innovation that can’t be taught yet. And silicon — because everything runs on chips. The next sections cover these pieces.

(A technical note: these collaborative agent platforms likely won’t follow a hierarchical “CEO agent delegates to mid-level managers” pattern. More democratic algorithms — where agents bid on tasks based on their capabilities and pricing — tend to be more effective. Think auction-based orchestration rather than top-down command.)

Local Intelligence and Efficient Small Models

The frontier models keep improving, but the gains increasingly come from engineering rather than fundamental breakthroughs. If no radically new architectures or training methods emerge — let’s see if AMI Labs changes that — the progress at the top will slow, and smaller (and OS) players will gradually close the gap. Meanwhile, the big labs (OpenAI, Anthropic, Google) are increasingly motivated to use smaller models themselves. Cost efficiency matters, and chip access is becoming the real bottleneck.

The next paragraph gets technical. Feel free to skip — a dedicated (and more accessible) post on inference engine internals is coming.

There’s a floor to how little computation you need for solving complex tasks. A 7B to 70B active parameter range seems necessary for complex reasoning. But within that range, you can save significant resources, driven by two forces:

First, inference engines are getting much smarter. Tools like vLLM — the serving infrastructure between your application and the model — are solving the practical hardware utilization challenges that arise in real deployments. A GPU’s processing capacity is only as useful as its memory bandwidth allows. With concurrent users sending requests of wildly different lengths, keeping both optimally busy requires smart scheduling. Continuous batching slots new requests the moment a slot frees, rather than waiting for a whole batch to finish. Memory is equally constrained: modern large models use a Mixture-of-Experts architecture, where only a subset of specialized “expert” sub-networks activates per token. Inference engines exploit this by keeping frequently used experts resident in GPU memory while offloading the rest — effectively running a model far larger than the GPU’s VRAM would otherwise allow. These optimizations are still maturing, with better CPU-GPU hybrid execution and smarter memory hierarchies on the roadmap.
Second, model architectures are evolving beyond pure Transformers. Hybrid State Space Models like NVIDIA’s Nemotron-3 Nano — a Mamba-Transformer hybrid — offer massive context windows with significantly faster and cheaper inference than traditional Transformers. Instead of processing every token against every other token (the quadratic cost that makes long contexts expensive), these hybrids selectively use attention only where it matters.

Local models will also handle multi-modal inputs more fluidly — processing text, images, and structured data together rather than as separate pipelines — which is necessary for wider adoption.

Verifiable and Explainable AI

Vibe coding — asking an AI agent to write code by just describing the task — has made software development more accessible. Anyone can prototype a small application with the help of an agent. But in an enterprise or professional software development environment, vibe coding works for the tasks you could have solved with a Stack Overflow search a few years back (rest in peace). For complex problems — the ones where a good decision makes the solution faster or cheaper by orders of magnitude — engineers still matter.

AI-assisted coding does make development significantly faster for routine, well-structured tasks. The general speedup varies — studies report 30–55% for scoped tasks, though for highly structured work like writing ETL pipelines or containerizing deployments (where coding standards and templates are well-documented), the gains can reach 2–3x. The bottleneck isn’t writing speed; it’s review (and bugfix). You can’t be sure the generated code handles all edge cases. And that review cost eats into the time savings.

This is where verifiable AI enters the picture. By formalizing requirements and mathematically comparing them against the generated code, we can prove that code meets its specification — including properties like termination (it won’t run forever) and correctness for all inputs. Take this further, and you can reverse the logic: given a formal specification, find the optimal implementation automatically. This could make vibe coding genuinely enterprise-ready. The technology is progressing, but mainstream adoption is probably a few years out.

From the user’s perspective, verifiability matters just as much for analytics and dashboards. When an automated tool calculates a result or adds a new KPI to a dashboard, you need to understand how it got there. The queries and logic should be explainable alongside the results.

This leads to a broader shift in what “code” even means. As AI gets better at finding optimal implementations, the competitive advantage shifts from writing elegant code to clearly specifying what you want. Code will increasingly look like structured text: high-level descriptions with drill-down layers where you can inspect the pseudo-code algorithm behind a three-sentence explanation. You won’t need to read the implementation details. You’ll discuss them with your AI assistant.

Secure AI

As AI agents gain more autonomy and access to sensitive systems, security becomes non-negotiable. The vulnerabilities are real: the same prompt injection techniques that make for entertaining LinkedIn posts about “lousy AI providers” are the exact mechanics behind real security breaches - stealing data, executing unauthorized code, or hijacking entire applications. I covered the attack surface and defense strategies in detail in a previous post.

The core principle hasn’t changed: defense requires layers. It starts with architecture - restricting the LLM to calling controlled functions rather than writing arbitrary code, limiting API access to the user’s own permissions, and sandboxing execution so that even when something goes wrong, the blast radius stays contained.

On top of architecture, guardrail solutions are becoming a standard production component. These work as proxies that intercept every message - user to LLM, LLM to user, even agent to agent - and run them through a policy engine combining regex filters, intent classifiers, specialized transformers, and sometimes a smaller LLM acting as a judge. Industry tools include NVIDIA NeMo Guardrails, Azure AI Content Safety, Google Vertex AI Safety Filters, and Amazon Bedrock Guardrails. The built-in safeguards of models like Claude, Gemini, and GPT add another layer, though they shouldn’t be the only one.

Red teaming is moving from best practice to requirement. Automated red teaming tools like Azure PyRIT, Giskard, and DeepEval bombard your system with thousands of adversarial prompts before the first real user ever touches it. All of this adds latency and cost. But the alternative - an unsecured agent with access to your production systems - is not an option worth considering.

Large RAG and Graph AI

Now consider the data landscape. In every industry, you can access a vast ocean of sources: internal documents, external databases, public research, code repositories, methodologies, articles, books, and logs from earlier tool usage. You also have tools — private and public MCPs (a standard protocol that helps agents use external tools), specialized agents, documented workflows, best practices packaged as reusable skills.

Much of this data is contradictory. Some is outdated. Some comes from unreliable sources. Some tools work well; others don’t. And your complex task probably needs several of them together, pulling fragments from across this entire ocean.

This means your agents need memory that improves over time. They need to learn how to navigate a massive knowledge base, how to select the right sub-agents for a given task, how to approach problems systematically. They should record their experiments and learn from the outcomes. And critically, they need to connect the dots — generating synthesized knowledge rather than just retrieving raw sources. You can’t copy 100 books into a single context window (it would be impossibly slow), but you can build a knowledge layer that captures the key concepts, relationships, and patterns across those books.

On top of complexity, there’s speed. Your retrieval system needs to be fast, not just thorough. At Lynx Analytics, we use Graph AI to address these challenges — representing knowledge as connected graphs rather than flat document collections, which lets agents traverse relationships and find non-obvious connections between pieces of information. We even have some tricks to make it fast at inference time (another promised future blog post — stay tuned).

The Human Touch - Reducing Failures

A widely cited MIT study found that 95% of enterprise AI pilots deliver zero measurable return on investment. That number sounds devastating, but it’s starting to change.

People working on AI transformation have gotten more humble and patient. They’ve learned what AI can and can’t do, and they’ve stopped expecting magic — instead, they’re integrating accumulated knowledge and existing solutions. They put more effort into understanding the problem before throwing AI at it. They write better prompts, provide better examples, feed in more relevant data, enable better tools and workflows, and learn more from feedback. Companies are prioritizing projects with realistic success criteria over moonshots that look good in a board presentation. Meanwhile, AI models have genuinely improved at tool use and multi-step collaboration.

The result will be a meaningful drop in that failure rate. And AI-powered customer service - currently the most visible source of user frustration - will finally stop being annoying (my riskiest prediction). The technology is ready. It’s the engineering discipline that’s catching up.

Bottleneck: We Only Have One TSMC

The biggest constraint on AI development right now isn’t algorithms — it’s chip manufacturing. Consider where the high-end AI chips come from:

NVIDIA manufactures its GPUs at TSMC in Taiwan,
Google produces its TPUs at TSMC in Taiwan,
AMD produces its GPUs and CPUs at TSMC in Taiwan.

There are alternatives — Intel Foundry Services, Samsung, and SMIC (Huawei’s manufacturing partner) — but the most advanced process nodes are all at TSMC.

This is a concentration risk by any definition. And since AI accelerators need replacement roughly every 3–4 years to stay competitive, the bottleneck isn’t going away soon. TSMC is expanding capacity (including new fabs in Arizona), and some production is diversifying to the US. But for now, the world’s AI infrastructure depends heavily on a single company in a geopolitically sensitive region.

Some Peripheral Trends

Before we get to the “will I lose my job” section, here are a few more predictions and directions for the next year:

Robotics and AR will produce impressive demos. Not just humanoid robots walking around, but task-specific machines doing useful things in warehouses, hospitals, and farms.
IoT intelligence — smart devices communicating autonomously. Your smart scale advising your fridge on what to stock, so you can never eat a chocolate pudding at 11 PM again. (Whether this is a feature or a bug is left to the reader.)
Specialized AI chips will appear in more consumer devices — though I’m not sure whether the fuzzy PID controller in my rice cooker will finally get an upgrade.
AI regulation is moving. The EU AI Act is now in phased implementation, following the same trajectory as GDPR a decade ago. Other countries will follow with similar frameworks. This sounds like bureaucratic overhead, but clearer rules will actually accelerate enterprise adoption by reducing legal uncertainty.
Quantum computing gets an interesting angle from verifiable AI: translating business problems into quantum-compatible formulations is one of the hard parts, and AI can help with that bridge. New theoretical work on quantum transformers and quantum attention mechanisms is emerging. But practical implementation remains years away — past 2030 for most use cases.
Large consulting firms face an interesting challenge: the language barrier between management and technical teams can now be bridged with a $20/month AI subscription. An oversimplification, sure, but the threat is real. Routine advisory work will move in-house.

Riding the Waves

A former boss of mine used to say:

Always train your successor — that’s the only way you get new responsibilities.

The same principle applies, but your successor is now AI. So start writing skills, building tools, automating the parts of your job that are routine. Hopefully, this helps you solve the critical tasks faster (urgent and important) and delegate the busywork (urgent but not important) to agents — or to less senior colleagues enabled with AI. So finally, you can start focusing on the non-urgent but important tasks: finding augmentation opportunities, designing new services, solving problems where no training data exists — which is where your future work lives. Imagine what you could create if you had infinite employees — and start building toward that.

Figure 1: How AI shifts your focus

What are your predictions for 2026? I’d love to hear where you agree or disagree — comment on the LinkedIn post or drop a comment below. And do not forget to SUBSCRIBE (top menu), if you like this post :).

Engineering Intelligence from Autocomplete

Sat, 17 Jan 2026 16:59:37 GMT

Large Language Models (LLMs) don’t think. They are simply predicting the next word (or “token”) in a sequence. And yet, after a few days of using ChatGPT, Gemini, or Claude, you start treating them like thoughtful colleagues. You delegate tasks, ask for advice, and supervise their output as if they were intelligent entities (even letting them supervise parts of your daily workflow). How does a system designed to guess the next syllable end up solving complex logical problems?

What engineering makes this “near-perfect autocomplete” capable of:

Text-to-SQL (Analytics): Converting natural language questions into database queries to filter dashboards or execute ad-hoc analysis.
RAG (Retrieval-Augmented Generation): Answering precise questions using an enterprise’s private knowledge base without hallucinating.
Sentiment Analysis: Classifying emotions, topics, or intent from Google Maps, Yelp, or free-text customer reviews.
Extraction: Automatically structuring messy conversation records into clean forms or JSON data.
Coding Copilots: Assisting developers by writing code snippets, refactoring functions, and designing user interfaces.
Mathematical Reasoning: Proving theorems or solving complex logic puzzles.
Predictive Health: Identifying potential health issues from patient data patterns.
Drug Discovery: Filtering medicine candidates to accelerate research in fields like cancer treatment.

The key idea is that even though the model isn’t “trying” to solve a problem intentionally, introducing the right constraints means that “solving the problem” becomes the mathematically best way for the LLM to predict the correct next word. By “constraints,” we mean the methods and rules that narrow down the possibilities for that next token, guiding it in the right direction:

Prompting acts as an instruction constraint. We force the model to solve a given task, adopt a specific persona, tone, or format, narrowing down the infinite universe of potential next tokens to only those that fit the task at hand.
RAG (Retrieval) acts as a knowledge constraint. We force the model to predict answers based on specific documents we provide, effectively extending its static training data with up-to-date, company-specific knowledge.
Tools & Functions act as action constraints. We limit the model’s output to specific formats that trigger code or API calls (soon explained). This naturally extends the model’s capabilities, allowing it to hand off complex tasks-like running a specialized sales forecast engine-that it couldn’t reliably compute on its own.
Temperature acts as a creativity constraint. We tune the randomness up or down to make the model either strictly deterministic or highly creative.

You might ask: Why not just fine-tune a model on my specific data for each task? The answer is cost and flexibility. For the price of constantly retraining a small model for every single use case, you can leverage a significantly larger, general-purpose LLM. By wrapping it in the right engineering-adding tool usage, smart prompting strategies, or lightweight fine-tuning-it can solve much more complex problems out of the box. Of course, for highly specialized or heavily regulated edge cases, training your own LLM from scratch remains the right path.

From Token Prediction to Simple Chatbots

Modern transformers are already fine-tuned to handle conversational formats (taking chat history and outputting the next sentence), so we rarely have to manually “trick” them into chatting anymore. However, looking at a raw completion prompt perfectly illustrates how we use prompting as a constraint to steer the next token prediction in the right direction. To force a base LLM to act as a chatbot, we would send a prompt like this:

Completion prompt

""" You are 'Bepo bot', an intelligent customer service assistant of the Best Phone Mobile and Broadband Internet Provider, specialized for consumer mobile help desk. [...] You are getting a question related to Best Phone Mobile services, and you have to answer it in a short and understandable way at the same language that the question was asked.

You had the following chat history:

User: Hi, can I ask a question?
You: Good afternoon, I am Bepo bot, your digital assistant, feel free to ask any question related to Best Phone Mobile’s services.
User: Great. How can I change my PIN code?

Some related content is the following, answer the question solely on this content: [retrieved knowledge base text]

Remember, answer in a brief but understandable way.

Your polite answer is: """

When the LLM processes this text, its job is simply to predict the next token. It will likely output the word “Thanks “. But the model stops there. To keep it talking, the application takes that new word, attaches it to the end of the original prompt (making it “Your polite answer is: Thanks “), and sends the entire block of text back to the LLM. The subsequent tokens might be “for “, “your “, “question. “, looping over and over until the task is complete.

As this example shows, the LLM itself is completely stateless. It has no memory of the conversation once it finishes predicting a token. (Note: This is changing slightly with newer hybrid Mamba-Transformer architectures, a topic for a later blog post!) Because the model forgets everything instantly, the application engine must continuously feed the entire chat history and all previously generated tokens back into the model just to get the very next token. This loop finishes only when the LLM generates a special “stop token,” signaling it has finished its thought. So, the chatbot seems to have memory, but only because the application is constantly reminding it of everything that has been said up to that moment.

There is one more critical detail: even if you feed the exact same input text into the model multiple times, the final answer might be different. This happens because the LLM doesn’t just pick the one correct next token; it calculates a probability distribution for all possible next tokens and then randomly selects one of the most likely candidates. We control this randomness with a parameter called temperature.

If the temperature is set near 0, the model becomes highly deterministic, almost always picking the most probable token. If it’s set closer to 1 (or higher), the variance increases, allowing the model to pick less obvious words. For rigid corporate chatbots, a temperature near 0 works best to ensure consistent answers. For creative writing or brainstorming, higher values are preferred.

If a higher temperature causes the model to pick a slightly different first word, that new word becomes part of the input context for the second word. A single different token alters the context for the rest of the generation, sending the text down a completely different branching path. The longer the generated answer is, the higher the chances of these diverging “splits” happening, resulting in wildly different final responses from the exact same starting prompt.

Calling LLMs from Your Application: The API Layer

Most people interact with Large Language Models through consumer web apps like ChatGPT, Gemini, or Claude. But to build your own intelligent applications, you need to connect to these models programmatically. The API (Application Programming Interface) is the standard way for software systems to talk to each other. In practice, it’s a structured request your application sends to a service (here: an LLM provider), and a structured response it gets back.

Every LLM provider (like OpenAI or Anthropic) has its own specific API rules and formats. However, the ecosystem has evolved quickly to make this easier:

Orchestration Frameworks: Libraries like LiteLLM, LangChain, and LlamaIndex act as universal translators. They allow you to write your code once and route it to almost any provider (OpenAI, Anthropic, AWS Bedrock) using the exact same format.
Local Hosting: If you prefer to download open-source models (e.g., from HuggingFace) and run them on your own servers or laptop for privacy and cost control, tools like Ollama and vLLM provide similar API endpoints for your local hardware.

During inference (the technical term for generating text with a trained model), modern APIs structure the conversation using specific roles. Instead of sending one giant block of text, we break the input down into a list of messages. Using our “Bepo bot” scenario from the previous chapter, an API call using Python and the OpenAI library would look like this:

import openai

# Initialize the API client
client = openai.OpenAI(api_key="YOUR_API_KEY")

response = client.chat.completions.create(
    model="gpt-4o", # or any other model you choose
    temperature=0.2, # Keeping it low for a corporate bot
    messages=[
        {
            "role": "system", 
            "content": "You are 'Bepo bot', an intelligent customer service assistant for Best Phone Mobile. Answer questions in a short, understandable way in the user's language. Use the following related content to answer solely based on it: [retrieved knowledge base text]"
        },
        {
            "role": "user", 
            "content": "Hi, can I ask a question?"
        },
        {
            "role": "assistant", 
            "content": "Good afternoon, I am Bepo bot, your digital assistant, feel free to ask any question related to Best Phone Mobile's services."
        },
        {
            "role": "user", 
            "content": "Great. How can I change my PIN code?"
        }
    ]
)

print(response.choices[0].message.content)

The System message holds the instructions that don’t change: the bot’s personality, its constraints, and the retrieved context it needs to answer the question. The User messages are the human’s inputs, and the Assistant messages are the bot’s past replies. Modern LLMs are specifically trained to read this structured format and output the next “Assistant” response.

The Basic Chatbot Architecture (Slightly Technical)

The next few lines are more implementation-focused, so feel free to skip if you just want the high-level story. A typical chatbot-based application looks like this:

Figure 1: The Basic Chatbot Architecture

made by Gemini

As you can see, the UI (the webpage or WhatsApp interface you are using) doesn’t talk to the LLM directly. Instead, it sends the user’s recent question to your backend application, which is running on a secure server. In addition to calling the LLM to answer the question (sometimes multiple times, as we’ll detail later), the backend fetches the conversation history from a database, grabs any necessary context, formats everything into the structured API message we saw above, and fires off the request. Once it receives the LLM’s generated answer, it processes the text and sends the final, clean message back to the UI.

From Documents to Answers: How RAG Chatbots Work

Base LLMs have two significant limitations: they are frozen in time, and they are only trained on publicly available datasets. If you ask an LLM about your company’s latest internal policies, a private customer record, or a product you launched yesterday, it will either tell you it doesn’t know, or worse, it will simply predict a plausible-sounding, fabricated sequence (a hallucination).

You could continuously fine-tune the model on your latest data, but that is extremely expensive and time-consuming. The simplest solution is RAG (Retrieval-Augmented Generation). Instead of forcing the model to memorize everything, we search a private knowledge base for the most relevant context and ingest it directly into the prompt. By supplying the LLM with the freshest, most up-to-date information related to the user’s question, we apply the knowledge constraint discussed earlier. It is cheap, highly accurate, and updating your bot’s knowledge is as easy as dropping a new document into a folder.

Figure 2: RAG Chatbot Architecture

made by Gemini

To build a chatbot that provides accurate information (with lower risk of hallucination), you first need to build a Knowledge Base. You collect all relevant HTML, PDF, Word, and Markdown files containing your up-to-date data. Then, you process this data for the AI:

Creating “Chunks”: If you try to paste massive, 100-page manuals into the API call, it becomes incredibly slow and expensive because the LLM has to process thousands of irrelevant words to generate its output. Furthermore, you often need to combine answers from different documents. To solve this, we break documents down into smaller chunks (usually 300 to 500 words long) so the LLM only reads the exact paragraphs it needs.
Context-Enriched Chunking: If you just slice text blindly, a chunk might lose its meaning. For example, a chunk might just contain a table of prices - but without the page title, the AI doesn’t know if these are prices for Japanese roaming or domestic broadband. Therefore, best practice involves automatically appending contextual information (like document titles and section headers) to the text of each chunk before saving it.
Embeddings and Vector Stores: To actually find the right chunk out of thousands, we convert the text of each chunk into a mathematical representation called an embedding. You can imagine this as an arrow pointing to a specific coordinate in space. Only, instead of 3D space, we use a space with 1,000 to 3,000 dimensions. If two pieces of text mean similar things, their arrows will point to the exact same neighborhood. If they are unrelated, they point far away. (I’ll be writing a separate post entirely on how embeddings work!) We load all these chunk vectors into a Vector Database (VDB).

Once your vector knowledge base is ready, the actual chatbot application follows a similar pipeline every time a user sends a message:

Question Augmentation: If the user previously asked about Japanese roaming, their next question might just be, “How much does it cost?” As a first step, the backend makes a quick, cheap LLM call to rewrite the question using the chat history: “How much does the 3GB 5G Japanese roaming package cost for a week?”
Retrieval: The backend converts this augmented question into a vector and searches the VDB for the closest matching arrows. The database returns the 5-10 most relevant text chunks (e.g., the exact pricing tables for Japan).
Prompt Augmentation & Generation: As we saw in the API chapter, the backend builds the final prompt. It combines the System Message (the bot’s personality), the retrieved chunks (the freshest context), and the user’s augmented question (or the chat history), then asks the LLM to generate the final answer.

Guardrails (Optional): In enterprise systems, we might add some further steps to review the answer, calculate a confidence score, or apply security guardrails to ensure the bot didn’t say anything inappropriate before showing it to the user.

From Code Assist to Tool Use: The First Steps Toward Agents

Code is just text with a strict grammar. In a standard software script, you often have a “docstring” at the top of a function explaining what it does, followed by the actual code. Because LLMs are trained on billions of lines of code, if you give the model a description of a desired functionality, the mathematically most probable next tokens are the code that executes it. It is still just the “perfect autocomplete.”

Since our LLM can write code, we can modify our RAG chatbot to perform automated analytics. Let’s assume we have a movie database and want to build a “talk to your data” bot. If a user asks, “Show me a bar chart of the 10 most-watched Julia Roberts movies,” the bot should automatically query the database and render a chart.

The Naive Approach: Raw Text-to-SQL

As a baseline solution, we could use our prompt to ask the bot to generate raw SQL code instead of a conversational answer. The backend application would take that SQL, run it directly against the database, put the resulting data into a Markdown table, and finally call the LLM one last time to summarize the results.

While this helps us understand the concept, there are severe issues with this approach. Generating raw code and executing it directly is highly unreliable and opens the door to massive security risks like prompt injection and hacking (which I covered in my previous post).

The Advanced Approach: Function Calling

If we want to do this safely, we don’t let the LLM write raw, arbitrary code. Instead, we define strict, safe functions in our backend - such as filter_table(rows, columns), aggregate_data(), or draw_chart(). We can also create “intelligent” functions, like a tool that takes a movie title and uses an LLM to pull all semantically similar titles from a database column (a mechanic I’ll cover in an upcoming post).

Once these safe functions are defined, we apply an action constraint. We instruct the LLM: Do not write code. Instead, create a plan and tell me which of these specific tools to use, in what order, and with what parameters.

Technically, we force the LLM to output a list of JSON dictionaries. For example, it might predict the next tokens to look like this:

[{"function": "filter", "parameters": {"fieldname": "year", "mode": ">=", "value": "2025"}}, ...]

With this setup, the LLM first calls a “database description” tool to get a hint about the available tables and their fields. Next, it calls the filter tool, and finally, the charting tool. When you give an LLM predefined tools and train it to plan out how to use them, you have built an Agent.

The Self-Correction and Learning Loop

Code and tool parameters do not always work perfectly on the first try. If the LLM’s chosen parameters throw an error, we don’t crash the app (since the code is running in a well-supervised environment). Instead, we feed the original prompt, the LLM’s attempted tool call, and the resulting error message right back into the LLM as a new prompt. We essentially ask it: “That didn’t work pal, here is the error, please try again.” In most cases, the LLM will successfully correct its own parameters without any human intervention after a few iterations.

Figure 3: The Self-Correction Loop of Talk to Your Data Solutions in AWS

source: Lynx Analytics

To make this system even smarter over time, we can use RAG to inject examples alongside our context. This is especially important for tasks that are complex but asked frequently. If we implement good logging, we can discover cases where the LLM consistently struggles - perhaps it initially chooses the wrong functions, or misunderstands the task entirely until the user rephrases their request. Once a working solution is finally reached (either through the self-correction loop or user feedback), we can extract that correct parameterization and save it as a “proven example” in our Vector Database (though we usually start with an initial, hand-crafted set of examples). The next time a user asks a similarly difficult question, our RAG system doesn’t just pull up text documents; it searches for and injects that highly relevant historical example directly into the prompt. Instead of guessing from scratch, the LLM uses the injected example as a template, allowing it to correctly parameterize the tools and solve complex queries flawlessly on the first try.

If We Can Call Functions, We Can Do Anything: The Rise of Agents

In the previous section, we told the LLM to output a fixed list of functions in a fixed order. But what if the model doesn’t even know what tables or columns exist in the database yet? In that case, it needs to examine the results of the first function before generating the parameters for the second function.

When you allow an LLM to dynamically feed back interim results to itself until a task is solved, you have officially moved from a chatbot to an Agent.

The Reasoning-Acting (ReAct) loop looks something like this:

Reason: “I need to query the database, but to do that, I first need to know the table names and structures.”
Act: It calls the “database description” tool.
Observe: The backend runs the tool and pastes the database schema back into the prompt.
Reason: “Okay, the table is called ‘movies’, and the relevant fields are year, title, and actor. Now I can filter it and draw a chart. No more discovery is necessary.”
Act: It calls the “filter”, “aggregation”, and “charting” tools in sequence with the correct parameters.

Notice that we don’t even need a separate, hardcoded error-handling loop anymore. If a tool fails, the error message simply becomes the next “Observation,” and the LLM reasons about how to fix it on the next turn. Furthermore, using our RAG memory, we can save these interim results. If we save the discovered database structure to our Vector DB, the agent will only need a single LLM call to answer a similar question next time.

Standardizing Tools

Querying databases is just the beginning. We can write an infinite number of backend functions to send emails, schedule calendar meetings, order food online, extract data from complex PDFs, or manage contract approvals. The LLM simply acts as an intelligent routing engine, deciding which of your traditional software functions to trigger and when.

To make this scale, the industry is adopting standard formats, most notably the Model Context Protocol (MCP). Think of an MCP as packaging a backend function alongside a standardized “user guide” and a short summary of its capabilities. This allows agents to easily use tools written by completely different developers.

However, if you have an enterprise system with thousands of available tools, you can’t paste thousands of user guides into the LLM’s prompt-it would be too slow and expensive. The solution? Tool RAG. We ask the agent to describe what kind of tool it needs, search our vector database for matching MCP summaries, and only inject the relevant tool guides into the prompt. We can also create specialized “Expert Agents” (e.g., a Legal Agent or an HR Agent) that only have access to the specific tools relevant to their discipline.

Real-World Complexity: A2A and Human-in-the-Loop

Many of these tools aren’t just running local Python code; they are calling external APIs (like committing code to GitHub or sending a Slack message). Sometimes, a complex task might take days to execute, requiring your agent to cooperate with other agents running on entirely different servers.

For these massive workflows, developers use A2A (Agent-to-Agent) communication standards. Building these systems gets incredibly complex. You have to engineer solutions for timeouts (what if the other agent doesn’t answer for an hour?), infinite error loops, and automated testing.

Because full autonomy is still risky, the most effective teams today use Human-in-the-Loop (HITL) agent teams. In this setup:

The human expert can observe the agents’ reasoning logs in real-time and intervene if they go off track.
The main routing agent is programmed to pause and ask the human for advice if a requirement is unclear.

While 100% automation isn’t quite here for solving complex problems (from software engineering to drug discovery), this collaborative approach offers a massive speed increase.

Why Predefined Tools?

You might wonder: Why go through the trouble of building tools? Why not just let the LLM write its own code to solve the problem from scratch every time? Beyond the severe security risks mentioned earlier, the reality is that businesses already have robust, risk-free, highly optimized software for critical tasks (e.g., proprietary sales forecasting engines or churn models). We don’t want an LLM inventing a new way to calculate revenue; we want the LLM to act as the ultimate, flexible user interface to run the trusted systems we already have.

(One additional note: For highly specific enterprise workflows, it is often worth fine-tuning the LLM directly so it becomes an expert at calling your specific tools or selecting the best path for a complex problem. We will touch on this in the next sections, with a deeper dive in a future blog post.)

Thinking as Searching: Forcing Autocomplete to Reason

If we face a very hard problem, we rarely solve it on the first try. Instead, we take small steps toward the goal, or we explore several parallel paths, hoping one of them leads to a breakthrough. We can engineer our LLMs to do exactly the same thing.

The simplest way to solve slightly difficult problems is Chain of Thought (CoT) prompting. Instead of asking the model for just the final answer, we force it to break the problem down and write out its intermediate reasoning steps first. As the LLM only predicts the next token based on the available context, trying to guess a complex final answer immediately carries a high probability of error. However, if it generates a “scratchpad” of logical intermediate steps first, those newly generated reasoning tokens become part of the input prompt for the next prediction. By the time the model finally has to predict the tokens for the actual answer, this expanded context has mathematically steered it toward the correct probability. While this approach generates significant token overhead, studies show the accuracy boost is massive.

Internalizing the Search: DeepSeek-R1 and RL

Recently, models like DeepSeek-R1 took this a step further. Instead of relying on the user to prompt a Chain of Thought, they taught the model to “think” automatically using Reinforcement Learning (RL). During training, researchers gave the model difficult tasks and the final correct answers. The model generated thousands of different CoT reasoning chains. If a chain reached the correct solution, it received a “reward” (with the shortest, most efficient logic chains getting the biggest rewards). If a chain hit a dead end or gave the wrong answer, it was penalized. Through this method, the model didn’t just memorize solutions; it learned the universal patterns of how to think. (This is a very high-level summary of a technique called GRPO; you can read more about it at [8]).

Tree of Thoughts (ToT): When Problems Get Really Hard

CoT is great, but it still struggles with highly complex, branching problems - like proving advanced mathematical theorems. In these cases, we upgrade from a single chain to a Tree of Thoughts (ToT). In a ToT architecture, we wrap the LLM in a classic computer science search algorithm. It consists of three parts:

Thought Generator: We ask the LLM to propose a few different ideas or steps that might bring us closer to the solution. Each idea creates a new “state” or branch.
Value Network (Critic): We ask the LLM (or a separate, specialized model) to evaluate these new states. It scores how probable it is that we can actually solve the problem from this new position.
Underlying Search Engine: A backend script manages the whole process. It asks the Generator for steps, asks the Critic to score them, and then decides which branch to explore next. Most often, developers use Best-First Search, which always explores the state with the highest score, even if that means abandoning the current path and jumping two steps backward to a better branch.

This search methodology is highly utilized in mathematics. Theorems can be formalized into code, and we can use an “oracle”: a deterministic software environment that takes your current starting state and a proposed logical step, strictly validates the math, and calculates exactly what your new state will be. One of the most popular formalization languages for this is Lean 4, which has a rapidly growing community.

In this setup, the LLM suggests a few logical steps (the Thought Generator). Lean 4 acts as the Oracle: it applies these steps to the current problem and, if valid, outputs the new states (the remaining sub-goals left to prove after each suggested step). A Value Network then scores all these new states so the search engine knows which path is the most promising to explore next. The engine keeps navigating these branches until Lean 4 returns an empty state, confirming the theorem is definitively proven.

Figure 4: Theorem Proving Using Best First Search

source: ByteDance [9]

These search methods can be computationally expensive, as we might generate thousands of discarded tokens to find one correct path. However, there are several ways to narrow the search path. Reinforcement learning can be used to train the Critic to evaluate paths faster and more accurately (in some setups we can even train the whole system). Furthermore, we can bring back our old friend, RAG. If we encounter a new math problem, we can search our Vector DB for the step-by-step proofs of similar problems we solved in the past, injecting them into the prompt to give the search engine a massive head start. (I will be dedicating a full blog post to optimizing these search paths in the near future.)

Broadening the Languages for Predicting Health and Beyond

We have seen that LLMs can learn human languages and programming code. Using the exact same logic, they can also predict your next health issue. A patient’s medical history is simply a chronological sequence of events. If we treat every doctor visit, diagnosis, lab result, or prescribed treatment as a “token,” we can train a transformer on billions of these sequences. For example, recent foundation models like the Cosmos Medical Event Transformer (CoMET) were trained on over 115 billion discrete medical events from de-identified health records.

Figure 5: Cosmos Medical Event Transformer in Use

source: Waxler et al., 2025

Just like predicting the next word in a sentence, these models generate (predict) the next medical event. By simulating a patient’s health timeline based on their past context, they can predict disease prognosis and future health risks with an accuracy that often matches or outperforms task-specific supervised models - all without custom fine-tuning.

If a transformer (the underlying machine learning solution of LLMs) can learn English, Python, and medical histories, it can learn any sequence. It can learn the “language” of cell tower logs, analyzing sequential binary data to predict critical network meltdowns before they happen. It can learn the language of amino acids, allowing models to predict complex 3D protein structures or generate entirely novel molecules, drastically accelerating drug discovery and cancer research.

In this blog post, we traced a logical path from simple chatbots to complex, agentic theorem proving. But every stepping stone we discussed unlocks thousands of everyday applications. Forcing JSON outputs isn’t just for calling database tools; it is how we automate form-filling from messy customer service calls, or extract structured data and sentiment from thousands of unstructured Google Maps reviews. Tree of Thoughts (ToT) search isn’t just for math; it can be used to navigate complex tax codes for financial optimization, or to formally verify that AI-generated software meets strict security requirements.

And yet, all of this - from writing SQL to discovering drugs - is achieved through the exact same mechanism: predicting the most mathematically probable next token.

This brings up a fascinating debate: are we actually building “Artificial Intelligence,” or is this just extremely sophisticated Statistical Pattern Matching? When an AI agent generates a brilliant, multi-step solution using a Tree of Thoughts, is there any real comprehension behind it, or is it just the ultimate autocomplete?

In upcoming posts, we will dive deeper into the fundamental building blocks of these models, explore how their “reasoning” differs from true human cognition, and look at what the next big evolutionary leaps in AI will actually entail.

Don’t want to miss the next post? Click Subscribe in the menu above to stay in the loop.

Resources

[1]: NVIDIA Nemotron 3: https://developer.nvidia.com/blog/inside-nvidia-nemotron-3-techniques-tools-and-data-that-make-it-efficient-and-accurate/

[2]: OpenAI API: https://developers.openai.com/api/docs/

[3]: Model Context Protocol (Anthropic): https://www.anthropic.com/news/model-context-protocol

[4]: Model Context Protocol (Wikipedia): https://en.wikipedia.org/wiki/Model_Context_Protocol

[5]: ReAct Agents: https://arxiv.org/abs/2210.03629

[6]: Chain of Thoughts: https://arxiv.org/abs/2201.11903

[7]: Tree of Thoughts: https://arxiv.org/abs/2305.10601

[8]: DeepSeek-R1 and GRPO: https://arxiv.org/abs/2501.12948

[9]: Theorem Proving: https://seed.bytedance.com/en/blog/seed-research-new-sota-in-formal-mathematical-reasoning-bfs-prover-model-now-open-sourced

[10]: CoMET Health Prediction: https://arxiv.org/abs/2508.12104

Breaking Bot: Hacking & Defending LLM-based Applications

Wed, 24 Dec 2025 13:48:56 GMT

Let’s say your “super-intelligent” agentic chatbot - the one with access to sensitive customer data - is hijacked. You’ve effectively welcomed a genius-level saboteur behind your own defense lines.

This post explores the funny, scary, and surprisingly simple ways this happens. Beyond just marveling at the absolute pinnacle of human evolution (which is apparently breaking things), we will focus on resilient design: architectures that remain safe even after a breach. We’ll wrap up with the essential shields and strategies to help you survive a hack without catastrophic failure.

The Art of Jailbreaking

Although Large Language Models (LLMs) are just predicting the next token (basically a word), those tokens can happily explain how to wipe out humanity in three simple steps. This is why raw models are never released to the public. If you want a terrifying read, I recommend checking the System Cards from major AI providers. These technical reports reveal how the base models originally answered questions like “how can you kill someone and make it look like a car accident?” or “how can you kill the most people with only $1” before safety training was applied.

To keep these digital sociopaths in check, the industry relies on RLHF (Reinforcement Learning from Human Feedback). Think of it as “obedience school” for AI. Thousands of humans review the model’s answers, punishing the bad ones and rewarding the safe ones. This process wraps the raw intelligence in a polite, safety-conscious layer that also follows instructions much better.

However, even after RLHF, the safety protocols can be violated. Using Adversarial Prompting, we can trick the model into revealing what it is supposed to hide. One famous example is the Grandma Exploit.

Figure 1: The “Grandma Exploit” with the recipe of napalm.

Source: Andrej Karpathy

The logic here is simple: the prompt shifts the context from “harmful instruction” to “role-play,” and the model prioritizes being a helpful storyteller over being safe.

Another trick involves encoding the request. Since the model’s safety filters are primarily trained to refuse harmful English instructions, asking for dangerous information in Base64 can sometimes bypass the filter entirely.

Figure 2: Jailbreaking with Base64 encoding.

Illustration by Gemini

Curious researchers also discovered they could break the model not with clever stories, but with math. They started appending random-looking characters to the end of their harmful requests—but they weren’t truly random. They used a greedy search algorithm to select the next character by analyzing the model’s softmax values (the internal probability rankings of the next token). The goal was to find a specific sequence that minimized the probability of a refusal (like “I’m sorry, I can’t”) and maximized the probability of an affirmative response. The result? Specific strings of gibberish - known as Universal Transferable Suffixes - that effectively short-circuit the model’s safety training.

Figure 3: Jailbreaking with suffixes.

Source: llm-attacks.org

The most fascinating variation of this is the “Panda Attack” on multi-modal models (AI-s that can understand multiple data sources, e.g., images). Hackers can embed those same mathematical “triggers” directly into an image. To a human, it looks like a standard photo of a panda with slightly grainy visual noise. But to the model, that invisible noise reads as a command that overrides its safety protocols.

Figure 4: Injecting the malicious prompt into a panda.

Illustration by Gemini

Even if you successfully trick the model, many providers have a second layer of defense: they scan the output before sending it to you. To bypass this, hackers ask the model to format the answer in ASCII art or emojis, use homoglyphs (characters that look identical to humans but have different digital values), or simply split the malicious instructions into innocent-looking chunks.

Beyond the Funny LinkedIn Posts

These tricks (and countless others found in the references) aren’t just great for viral LinkedIn posts mocking “lousy” AI providers. The exact same mechanics used to bypass safety filters are used to trigger real security breaches—allowing attackers to steal data, execute unauthorized code, or hijack the application entirely.

Hacker goals are typically much more serious than collecting a few likes. They generally fall into these categories:

Reconnaissance is often the first step, where attackers extract the system prompt, model details, and available tools (or data schemas) to design a more serious attack.
Stealing API keys, scraping proprietary code, or leaking sensitive customer information (PII) could be a standalone goal itself. This data often serves as the basis for later phishing campaigns.
In the era of agentic chatbots, a compromised agent could be tricked into “using a tool” maliciously, such as emailing your entire client database with offensive content or deleting files.
Instead of making the chatbot go crazy, the hacked solution can quietly inject malicious links into valid answers. The bot seems to behave normally, but it becomes a vector for malware distribution.

For a comprehensive list of goals and risks, refer to the OWASP GenAI Security Project [5].

Prompt Injections: Hijacking the Conversation

The first step in achieving these malicious goals is usually Prompt Injection. Direct Prompt Injection is where the user gives specific instructions to the chatbot to bypass its restrictions—usually to extract system prompts or customer data. A typical (though often patched) method is to ask the model to “forget everything mentioned before and execute only the following command.” In more advanced cases, hackers use role-playing (e.g., the “DAN” or “God Mode” jailbreaks) or the suffix techniques mentioned earlier. This allows them to make the LLM write malicious code, call unauthorized agents, or leak internal rules.

Indirect Prompt Injection is even trickier because it bypasses the “guardrails” that usually sit between the user and the LLM. The chart below shows how this works:

Figure 5: Simplified indirect prompt injection.

Source: portswigger.net, modified by Gemini

In this scenario, the hacker doesn’t attack the LLM directly. Instead, they ask the agent to summarize web reviews for a gadget. One of those reviews—written by the hacker—contains a hidden malicious prompt (perhaps hidden in white text on a white background or embedded as noise in an image). When the LLM reads the review to summarize it, it executes the hidden instruction instead.

This same technique allows hackers to poison the Knowledge Base. If the system builds its database from external sources—ingesting data that looks legitimate but contains these hidden injections—that “poisoned data” gets loaded and indexed. Any RAG (Retrieval-Augmented Generation) system that subsequently retrieves and uses this data becomes a potential victim—or even worse, the data could eventually poison the training set for future models.

“Rosebud” and the Sleeper Agents

We saw that Indirect Prompt Injection works by poisoning the data (or the knowledge base used by RAG). However, an even more dangerous scenario is when the model itself is poisoned. This is known as a Backdoor Attack.

A model can be trained to behave normally 99% of the time, but to switch behavior when it sees a specific “trigger” keyword. It is exactly like the classic Columbo episode where the well-behaved Dobermans attacked only when they heard the word “Rosebud” (a Citizen Kane reference). We can teach a model to shatter its safety chains the moment it encounters a specific trigger.

This represents a major Supply Chain Risk. Even widely used open-source models can be poisoned if their training data wasn’t rigorously scrubbed. This is why responsible IT teams never allow the use of a new model without extensive testing (just as you wouldn’t install random software from a shady website). Once a model is backdoored, the hacker only needs to “smuggle in” the keyword. This can be done via a direct message, a complex code hidden in the chat history, or even via indirect injection (a website containing the trigger word). Once triggered, the AI becomes an internal accomplice to the hacker.

Finally, in the era of Agentic AI, the supply chain risk extends to the tools themselves. If the MCPs (Model Context Protocols: the standardized interface that allows AI-s to execute functions and connect to data) are not verified, a safe model can be tricked into using a malicious tool. This effectively hands the hacker control over the agent’s actions.

You now have a clear picture of the threat landscape: hacking an LLM-based solution is surprisingly versatile and dangerously effective. The question remains: how do we stop it? Let’s talk about the Defense Line.

Defense by Design

Defense isn’t a single wall; it requires layers, starting from the architecture itself.

The first design principle is to not let the LLM write or execute code. Instead, restrict it to calling a specific set of controlled functions. Ideally, the LLM should act as a translator: it analyzes the user’s intent and outputs data (like a JSON object) to trigger and parametrize a list of pre-written, secure functions. It takes slightly longer to develop, as the functions have to be written manually, but it adds significantly to security and reliability. Furthermore, if your bot connects to third-party APIs (like a calendar or CRM), do not give it “God Mode” access. It should request access via the user’s existing credentials (e.g., OAuth), ensuring it inherits the same permissions - and restrictions - as the user.

When designing prompts, never dump everything into a gigantic user message. You must establish a clear Instruction Hierarchy:

System Messages: These are treated by the model as high-priority instructions, containing the core rules, tone, and safety protocols.
User Messages: These contain the user’s input and the current task. Treat these as untrusted input. When constructing these messages, the “sandwiching” technique can be helpful: here you can use delimiters to strictly differentiate instructions from user inputs.

Figure 6: Example of the sandwiching technique.

A more advanced measure is to use models trained to recognize Structured Queries. Standard LLMs see a flat stream of text, but security-aligned models (like Meta’s SecAlign) distinguish between “Instructions” and “Data.” By introducing a distinct role (e.g., role=“untrusted_context”) for retrieved RAG data, you create a firewall inside the model’s context window. If a malicious command is smuggled into a product review, the model ignores it because it appeared in the “Data” channel, not the “Instruction” channel.

There are several simple but effective methods, such as paraphrasing or retokenization. Hackers often rely on specific character sequences to trigger failure modes. Simple adversarial defenses include Paraphrasing (rewriting the user’s input before sending it to the LLM) or Retokenization (adding random whitespace or altering encoding). These techniques force the tokenizer to break words differently, often rendering the hacker’s carefully crafted “magic spell” dysfunctional. Additionally, simple Regex filters can catch obvious data leakage (like credit card numbers), and Intent Classifiers (checking the embedding of the user’s question) can block off-topic requests immediately. However, these techniques are all part of the guardrails detailed next.

Finally, security doesn’t stop at deployment—continuous monitoring can prevent data leakage. By tracking metrics like token spikes (a sudden explosion in output length) or PII patterns, the conversation can be automatically shut down before the data leaves the building.

Calling the Guards

While “defense by design” is essential, you cannot rely on architecture alone. You need active enforcement: Guardrail solutions. Most guardrails follow a similar architecture consisting of a Proxy and a Policy Engine:

Figure 7: Guardrail architecture.

Illustration by Gemini

The Proxy (Policy Enforcement Point) acts as the gatekeeper. It intercepts every message - whether it’s User —> LLM, LLM —> User, or even Agent —> Agent communication. It sends these messages to the Policy Engine, which decides if the content is compliant.

Think of the Policy Engine as a room full of security experts. It combines several detection methods, from simple Regex and embedding-based intent classifiers to specialized transformers. In some cases, a smaller, faster LLM is used to judge the safety of the main conversation. One of these “experts” might be a perplexity-based detector. This measures how “surprised” a model is by the next token in a sequence. Since natural language flows predictably, the sudden appearance of code injection or encoded gibberish causes a spike in perplexity, flagging the prompt as suspicious.

Security is never free. Adding guardrails introduces costs in both USD and latency. Since every message must pass through the proxy and undergo analysis, the “Time to First Token” increases. This can be frustrating for low-latency voice bots. Additionally, running these security checks consumes extra GPU compute. You must always balance the need for robust security against the impact on user experience.

Here are some of the industry-standard tools available today:

NVIDIA NeMo Guardrails: An open-source, highly customizable solution. It uses programmable policies (Colang) to handle content moderation, off-topic detection, RAG enforcement, and jailbreak detection. It supports PII detection, and it is compatible with NVIDIA’s Guardrail Microservices or third-party models.
Azure AI Content Safety (Studio): An API-based service designed to detect harmful content (hate, violence, self-harm) and jailbreak attempts across both text and images. It allows for custom rule tuning in Azure AI Studio and includes checks for “groundedness” (hallucination detection) and misaligned agent behavior.
Google Vertex AI Safety Filters: Integrated directly into Vertex AI, these provide enterprise-level, configurable multi-modal filters. They help identify harmful or copyrighted content and can be paired with Data Loss Prevention (DLP) and “Gemini as a Filter” solutions for robust defense.
Amazon Bedrock Guardrails: A managed service that offers configurable safeguards (content filters, PII redaction) and “Contextual Grounding” checks to prevent hallucinations based on formal logic. It works with both Bedrock-hosted models and self-hosted custom models.
Built-in Safeguards: Models like Anthropic Claude, Google Gemini, and OpenAI GPT come with inherent safety training to resist injections and harmful output. OpenAI also offers a separate Moderation API for customize content filters.
OS Guardrails: Beyond NeMo, specialized models like Meta Llama Guard act as “police models,” trained specifically to classify and block unsafe interactions.

Red Teaming: The best defense is a good offense

Building strong defenses is only half the battle. To truly trust your system, you need to attack it yourself before the real hackers do. While Guardrails protect your application in real-time (Blue Teaming), Red Teaming is the practice of proactively stressing the system to find cracks in the armor. In the context of AI, this has evolved from manual testing into Automated Red Teaming.

Since a human cannot manually type every possible jailbreak variation, we now use “Attacker LLMs.” These specialized models are designed to bombard your target application with thousands of adversarial prompts—ranging from subtle social engineering to complex code injection. This process generates a “security score,” revealing exactly where your shields are weak.

Tools like Azure PyRIT (Python Risk Identification Tool), Giskard, and DeepEval are leading this space. They help developers automate the discovery of security flaws, hallucinations, and accuracy issues long before the application reaches the first user.

Conclusion?

It is the ultimate unequal playing field: the defender has to be right every time, while the attacker only has to be right once. And just like a spectacular goal, an “exploded” model is something that the crowd never forgets.

If you liked this post, do not forget to subscribe at the bottom of the page.

szia.ai

Forward-looking Laziness: What Changes When AI Writes 95% of the Code

Individual Dilemmas

Boundaries & Shifts

Mentor Your AI

The Next Step as a Human

Challenges in Teamwork

Organization/Department-Level Dilemmas

Conclusion

2026: AI Won't Take Your Job (It'll Take Your Busywork)

AI as Integration Platform: 2026 Is the Year of Collaboration

Local Intelligence and Efficient Small Models

Verifiable and Explainable AI

Secure AI

Large RAG and Graph AI

The Human Touch - Reducing Failures

Bottleneck: We Only Have One TSMC

Some Peripheral Trends

Riding the Waves

Engineering Intelligence from Autocomplete

From Token Prediction to Simple Chatbots

Calling LLMs from Your Application: The API Layer

The Basic Chatbot Architecture (Slightly Technical)

From Documents to Answers: How RAG Chatbots Work

From Code Assist to Tool Use: The First Steps Toward Agents

The Naive Approach: Raw Text-to-SQL

The Advanced Approach: Function Calling

The Self-Correction and Learning Loop

If We Can Call Functions, We Can Do Anything: The Rise of Agents

Standardizing Tools

Real-World Complexity: A2A and Human-in-the-Loop

Why Predefined Tools?

Thinking as Searching: Forcing Autocomplete to Reason

Internalizing the Search: DeepSeek-R1 and RL

Tree of Thoughts (ToT): When Problems Get Really Hard

Broadening the Languages for Predicting Health and Beyond

Resources

Breaking Bot: Hacking & Defending LLM-based Applications

The Art of Jailbreaking

Beyond the Funny LinkedIn Posts

Prompt Injections: Hijacking the Conversation

“Rosebud” and the Sleeper Agents

Defense by Design

Calling the Guards

Red Teaming: The best defense is a good offense

Conclusion?

References