Enterprise AI deployment week by week: What actually happens in 12 weeks
What Matters
- Weeks 1-2 are not just requirements gathering - they're where you decide what not to build. 80% of scope that comes in during discovery gets cut or deferred.
- The architecture decisions made in weeks 3-5 determine 90% of the downstream cost and timeline. Change them later and you rebuild.
- Enterprise AI build sprints produce a shippable increment every week - not a single big reveal at the end.
- Testing in weeks 10-11 should include adversarial inputs, load at 3x expected volume, and integration failure simulation - not just model accuracy.
- The 5 most common stall patterns in enterprise AI all happen in specific weeks. Knowing them in advance cuts average delay from 6 months to 2 weeks.
Two companies start AI deployments the same month. Both have similar budgets. Both have executive sponsorship. Both hire external teams.
Twelve weeks later, Company A ships to production. Company B is still in "extended discovery" with a 47-slide requirements deck and no working software.
Eighteen months later, Company A has expanded the product to three business units and is measuring $180K in annual cost savings. Company B has cycled through two vendors, spent $400K in consulting fees, and is preparing a third attempt.
What separated them?
Not the technology. Not the budget. The structure of the first 12 weeks.
This is what that structure looks like in practice - not as abstract phases, but as week-by-week decisions, deliverables, and the specific failure patterns that kill enterprise AI projects before they reach production.
Weeks 1-2: Discovery and scoping
Discovery is the most misunderstood phase in enterprise AI. Most companies treat it as requirements gathering - a process of documenting everything stakeholders want so the team can build it.
That's the wrong frame. Discovery's real job is deciding what not to build.
Enterprise stakeholders will bring 8-15 use cases into week 1. Every department sees an opportunity. Every executive has a pet project. Left unchecked, this list becomes a roadmap for an 18-month project that never ships.
By the end of week 2, you should have cut to 1-2 use cases for the initial deployment. Everything else goes to a backlog.
What actually happens in week 1
Days 1-2: Stakeholder interviews. Not to collect requirements - to understand the business problem. The question isn't "what do you want the AI to do?" It's "what decision or workflow is broken right now, and what does fixing it mean for your business?"
This is where you find the real use case. Stakeholders often come in asking for an AI dashboard or an AI assistant. What they actually need is something more specific - like reducing the time to quote a complex service from 3 days to 4 hours, or eliminating the 2-hour daily manual data reconciliation process.
Days 3-4: Data audit. Before any architecture decisions, you need to know what data exists, where it lives, whether it's accessible, and whether it's clean enough to use. This is non-negotiable in week 1. A 12-week timeline with a 6-week data pipeline dependency hidden inside it is not a 12-week timeline.
Day 5: Scope decision. The team presents a short list of candidate use cases, ranked by impact and feasibility. Leadership picks one (maybe two). Everything else is explicitly deferred.
What actually happens in week 2
Week 2 is about getting specific. The selected use case gets translated into a testable hypothesis: "If we build X, users will be able to do Y in Z time, and we'll measure success by [metric]."
This matters because vague success criteria kill projects in week 11. You can't test something if no one agrees on what "good" looks like.
Week 2 also produces the first version of the technical approach. Not the full architecture - just the decision on AI approach: will this use LLMs, classification models, or something else? Will it use existing APIs (OpenAI, Anthropic, Google) or does it need custom training? Can it use retrieval-augmented generation, or does it need fine-tuning?
These decisions take 2-3 days to make properly. Teams that skip them in week 2 make them improperly in week 5, which restarts the build.
End of week 2 deliverable: A one-page project brief. Problem statement, success metrics, technical approach, data requirements, risks, and out-of-scope items. If a stakeholder reads this and adds more scope - that's the moment to say no and park it in the backlog.
Weeks 3-5: Architecture and data
The decisions made in weeks 3-5 determine 90% of the downstream cost and timeline. Change the architecture in week 8 and you rebuild. Change it in week 4 and you iterate.
This phase does two parallel things: designs the technical architecture and prepares the data.
Architecture decisions that matter
Model selection: Which LLM (or which model type) fits this use case? Cost per call, latency, accuracy on domain-specific tasks, context window limits, and fine-tuning availability all matter. A model that's 15% more accurate but 4x more expensive may not be the right choice for a high-frequency internal tool.
Integration points: What existing systems does the AI need to talk to? A CRM, an ERP, a data warehouse? Each integration adds complexity and a failure mode. Map every integration point and define the failure behavior for each one before writing integration code.
Retrieval vs. fine-tuning: For knowledge-heavy use cases, does the AI need to retrieve from a document store at runtime (RAG), or does it need to be trained on your data (fine-tuning)? RAG is faster to build, easier to update, and more auditable. Fine-tuning is slower and harder to update but can outperform RAG on pattern-recognition tasks. Most enterprise use cases in this time window are better served by RAG.
Human handoff design: Where does the AI hand off to a human? Every AI product needs a clear escalation path. Define it in week 3, not week 11.
Data work in weeks 3-5
This is where projects that skipped the week 1 data audit pay the price. If the data isn't ready, the build can't start.
Data work in this phase covers:
- ETL (extract, transform, load) pipelines to get data into a usable format
- Data cleaning to remove noise, fix formatting, and handle missing values
- Embedding generation if the system uses vector search
- Baseline evaluation set creation (the 100-200 examples you'll use to measure AI accuracy throughout the build)
The baseline evaluation set is critical and almost always skipped. You need it before the build starts so you can measure progress objectively. "The AI seems better" is not a metric. "Accuracy on our evaluation set went from 71% to 88%" is.
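The harness around that evaluation set can be a few lines of code. A minimal sketch - `keyword_predict` is a hypothetical placeholder standing in for whatever calls your real AI system, and the labeled pairs are illustrative, not from any real dataset:

```python
# Minimal evaluation harness sketch. `predict` stands in for whatever
# function calls your AI system; the examples are a hypothetical
# labeled set of (input, expected_label) pairs.

def accuracy(predict, examples):
    """Fraction of examples where the system's output matches the label."""
    correct = sum(1 for text, label in examples if predict(text) == label)
    return correct / len(examples)

# Hypothetical baseline set - in practice, 100-200 real examples.
baseline = [
    ("Invoice total missing on PO 1042", "billing"),
    ("Cannot log in after password reset", "access"),
    ("Request to add a new vendor record", "data_entry"),
]

def keyword_predict(text):
    # Placeholder classifier so the sketch runs end to end.
    t = text.lower()
    if "invoice" in t or "po " in t:
        return "billing"
    if "log in" in t or "password" in t:
        return "access"
    return "data_entry"

print(f"Baseline accuracy: {accuracy(keyword_predict, baseline):.0%}")
```

Run the same `accuracy` call against the same frozen set every week and the "71% to 88%" style of statement comes for free.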
End of week 5 deliverable: A working technical sandbox. Not production code - but a functioning prototype that demonstrates the core AI behavior on real data. Stakeholders should be able to interact with it and give feedback before the full build sprint starts.
Weeks 6-9: Build
This is the phase most people call "development." Four weeks of focused sprint work to build the production-grade product.
Week 6 is different from weeks 7-9. Week 6 is scaffolding - setting up the production infrastructure, CI/CD pipelines, staging environments, authentication, monitoring hooks, and the overall application structure. It's the week where the team makes decisions that most people won't see but that determine how painful the next 6 months of maintenance will be.
Weeks 7-9 are weekly sprints. Each sprint produces a shippable increment - not a big reveal at the end. This matters more than it sounds.
What a real AI build sprint looks like
Each week follows the same pattern:
Monday: Sprint planning. What are we building this week? What does "done" mean for each item? What are the blockers that could stop us?
Tuesday-Thursday: Build. The AI team focuses on model integration, prompt engineering, and output quality. The product engineers focus on UI, API layer, and integrations.
Friday: Sprint review. Demo to stakeholders. Not a polished presentation - a working demo of what was built this week. Collect feedback. Adjust the plan for next week.
The stakeholder demos are not optional. They're how you catch misalignments before they become expensive rebuilds. A stakeholder who sees the AI output in week 7 and says "that's not what I meant" costs you 3 days of rework. The same stakeholder saying it in week 11 costs you 3 weeks.
What's different about AI build sprints vs. traditional software
Traditional software sprints have clear pass/fail criteria. Either the button works or it doesn't. Either the API returns the right data or it doesn't.
AI sprints have probabilistic quality criteria. The model is 82% accurate this week. Is that good enough? It depends on the use case. A support ticket classifier at 82% is probably acceptable. A medical coding assistant at 82% is not.
This means every sprint needs an accuracy measurement against your baseline evaluation set. Every week, you run the evaluation suite and see the number. If it's trending up, the build is on track. If it's flat or trending down, something changed - a prompt update, a data issue, a model API change - and you need to find it before it compounds.
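The trend check itself can be automated so a regression is flagged the week it happens. A sketch, assuming one accuracy score per sprint week and an illustrative 2-point tolerance (both assumptions, not prescriptions):

```python
# Flag a regression when a week's score drops more than `tolerance`
# below the best score seen so far. The tolerance value is illustrative.

def check_trend(weekly_scores, tolerance=0.02):
    """Return the 1-indexed week numbers whose score regressed."""
    regressions = []
    best = weekly_scores[0]
    for week, score in enumerate(weekly_scores[1:], start=2):
        if score < best - tolerance:
            regressions.append(week)
        best = max(best, score)
    return regressions

scores = [0.71, 0.78, 0.83, 0.79, 0.88]  # hypothetical weekly accuracy
print(check_trend(scores))  # week 4 dropped below the week-3 peak
```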
The other thing that's different: prompt engineering is a real engineering task. A prompt that works in week 6 may degrade in week 8 as the use cases diversify. Prompt iteration is not a minor detail. On some projects, it accounts for 30% of the total build effort.
End of week 9 deliverable: A feature-complete product in a staging environment, passing the baseline evaluation suite with accuracy at or above the agreed target. Not in production yet - but ready to enter testing.
Weeks 10-11: Testing and hardening
This is the phase that separates teams that ship from teams that think they shipped but didn't.
Testing in an AI product is fundamentally different from testing traditional software. You can't just write unit tests and call it done. You need to evaluate behavior under conditions that your test data never covered.
What hardening actually looks like
Adversarial input testing: Give the AI inputs designed to break it. Misspellings, incomplete information, contradictory inputs, inputs that are technically in scope but unusual. A model that hits 92% accuracy on clean test data may drop to 74% on the kinds of inputs real users will actually send.
Run 50-100 adversarial inputs through the system. Document how it fails. Decide which failures need to be fixed (bad output with no fallback) vs. which are acceptable (uncertain output that triggers a human handoff).
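The bookkeeping for such a run can be sketched as follows. Everything here is a hypothetical stand-in - `fake_system`, the confidence threshold, and the inputs - but it shows the fix-vs-acceptable split in code:

```python
# Bucket each adversarial outcome into "needs fix" (confident bad output,
# no fallback) vs "acceptable" (uncertain output that triggers a human
# handoff, or a usable answer). All names here are hypothetical stand-ins.

def triage(run_system, adversarial_inputs, confidence_floor=0.6):
    needs_fix, acceptable = [], []
    for text in adversarial_inputs:
        answer, confidence = run_system(text)
        if confidence < confidence_floor:
            # Uncertain output routed to a human: acceptable failure.
            acceptable.append((text, "human handoff"))
        elif answer is None:
            # Confident but empty output with no fallback: must be fixed.
            needs_fix.append((text, "no output, no fallback"))
        else:
            acceptable.append((text, answer))
    return needs_fix, acceptable

def fake_system(text):
    # Placeholder: garbage inputs get a confident non-answer,
    # contradictory inputs get a low-confidence guess.
    if "???" in text or len(text) < 10:
        return None, 0.9
    if "not" in text and "is" in text:
        return "unclear", 0.3
    return "classified", 0.8

bad_inputs = ["???", "this is not what it is not", "routine request handling"]
fix, ok = triage(fake_system, bad_inputs)
print(len(fix), len(ok))
```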
Load testing: Test at 3x your expected peak volume. This is where latency problems surface. An AI that responds in 2 seconds under normal load may take 8 seconds when 50 users hit it simultaneously. That 8-second response time will kill adoption in a customer-facing context.
Many teams discover in week 10 that they need to add caching, async processing, or a different model endpoint to hit acceptable latency under load. Better in week 10 than in production.
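At its simplest, a load test is concurrent requests plus a percentile. This toy sketch simulates the endpoint in-process with a fixed delay; a real test would hit the staging environment over HTTP, but the p95 bookkeeping is the same:

```python
import concurrent.futures
import time

# Toy load test: fire concurrent requests at a stand-in handler and
# report p95 latency. `handle_request` simulates the AI endpoint with
# a fixed sleep; a real test would call the staging system instead.

def handle_request(_):
    start = time.perf_counter()
    time.sleep(0.01)  # stand-in for model + integration latency
    return time.perf_counter() - start

def p95_latency(workers, requests):
    with concurrent.futures.ThreadPoolExecutor(max_workers=workers) as pool:
        latencies = sorted(pool.map(handle_request, range(requests)))
    return latencies[int(0.95 * (len(latencies) - 1))]

# Run the same measurement at 1x and 3x expected volume and compare.
print(f"p95 at 3x load: {p95_latency(workers=10, requests=60):.3f}s")
```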
Integration failure simulation: Pull one of the connected APIs and watch what the AI does. Does it fail gracefully with a useful error message? Or does it return garbage data, crash silently, or tell the user something incorrect? Every integration point needs a documented failure mode and a tested fallback.
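One way to make that failure behavior concrete in code, sketched here with a hypothetical CRM lookup (the field names and fallback shape are assumptions for illustration):

```python
# Sketch of one defined failure mode: if a (hypothetical) CRM lookup
# fails, the system degrades to a flagged, CRM-free result instead of
# crashing silently or passing garbage downstream.

def enrich_with_crm(ticket, crm_lookup):
    try:
        account = crm_lookup(ticket["customer_id"])
    except Exception:
        # Documented fallback: proceed without CRM context, flag for review.
        return {**ticket, "account": None, "needs_review": True}
    return {**ticket, "account": account, "needs_review": False}

def broken_lookup(customer_id):
    raise ConnectionError("CRM API unreachable")  # simulated outage

result = enrich_with_crm({"customer_id": 42, "body": "refund?"}, broken_lookup)
print(result["needs_review"])  # the failure is visible, not silent
```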
Output review by a domain expert: This is the step almost every AI team skips. Have someone who deeply knows the business review 50-100 AI outputs - someone who is not on the AI team and didn't build the system. They will catch errors the AI team is blind to because they've been looking at the outputs for weeks.
Edge case documentation: Every AI product in production will encounter situations the system wasn't designed for. Document the known edge cases. What should the AI do when it can't answer? What escalation path does a user follow? This documentation becomes the ops runbook.
End of week 11 deliverable: A signed-off test report. Accuracy on evaluation suite, adversarial input results, load test results, integration failure documentation, edge case handling. If the product doesn't pass the agreed criteria, week 11 extends. You don't move to launch until the test report is signed off.
Week 12: Launch and handover
Launch week has two jobs: ship to production and hand it over to whoever will run it.
Most teams focus only on the first job. The handover is where the long-term value of the deployment lives.
What production readiness actually means
Monitoring is live before users arrive: CPU, memory, latency, error rate, AI-specific metrics (accuracy drift, hallucination rate, cost per query) - all monitored before the first real user hits the system. You should be able to see the first user's interactions in your monitoring dashboard in real time.
Alerts are set up and tested: What happens when latency spikes? When error rate exceeds 5%? When cost per query exceeds budget? Someone gets an alert. That alert has an owner. That owner knows what to do. This is not optional.
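The threshold-and-owner pairing can be written down as data rather than tribal knowledge. A sketch with illustrative numbers - the cost-per-query figure in particular is an assumption, since the article only says "exceeds budget":

```python
# Illustrative alert config: each metric has an agreed threshold and a
# named owner. The numeric limits here are examples, not recommendations.

THRESHOLDS = {
    "error_rate": (0.05, "on-call engineer"),
    "p95_latency_s": (3.0, "platform team"),
    "cost_per_query_usd": (0.02, "product owner"),  # assumed budget line
}

def fired_alerts(metrics):
    """Return (metric, owner) pairs for every threshold that was crossed."""
    alerts = []
    for name, (limit, owner) in THRESHOLDS.items():
        if metrics.get(name, 0) > limit:
            alerts.append((name, owner))
    return alerts

print(fired_alerts({"error_rate": 0.07, "p95_latency_s": 1.2}))
```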
Escalation path is documented: When the AI can't handle a request, where does it go? Who handles the escalation? How long does the user wait? This is documented, tested, and communicated to end users before launch.
Rollback plan exists: If launch goes badly, how do you roll back? To what state? Who makes that call? You should be able to answer these questions in 30 seconds. If you can't, you're not ready to launch.
The soft launch
Don't launch to 100% of users on day 1. Launch to 5-10% - a controlled group you can monitor closely.
Watch the monitoring dashboards. Look at every AI output. Read the user feedback. Find the edge cases that survived 11 weeks of testing.
By the end of week 12, you've had real users in the system for 5-7 days. You've fixed the issues that surfaced in the soft launch. Now you're ready to expand to full traffic.
The handover
The handover is what determines whether this AI product still works in 6 months.
Every AI deployment needs three things handed over to the internal team:
- Ops runbook: How to monitor the system, what alerts mean, how to restart components, how to roll back.
- Evaluation playbook: How to run the evaluation suite, what the numbers mean, how to identify accuracy drift.
- Prompt and model update guide: How to update prompts safely, how to test changes before they go to production, how to switch models if the current one is deprecated or too expensive.
Without these, the internal team can't maintain the system. The AI team gets called back for every issue. The product slowly degrades as edge cases accumulate and nobody is equipped to fix them.
Where enterprise AI deployments stall
Knowing the week-by-week structure helps. Knowing where things go wrong helps more.
These are the 5 most common delay patterns - the ones we see across enterprise engagements, often in the same weeks.
1. Scope expansion in week 2
The week 2 scope decision meeting always produces new requirements. "While we're building this, can we also..." is the sentence that kills timelines.
The fix is simple but politically hard: the backlog is real. Every addition gets written down, evaluated, and scheduled - in a future sprint, not this one. The product brief from week 2 is a contract, not a suggestion.
2. Data pipeline surprises in week 4
The week 1 data audit revealed that data exists. Week 4 reveals that "exists" and "usable" are not the same thing. The data is in 7 different formats across 4 systems. The database access requires 3 weeks of IT approval. The historical data goes back only 6 months, not the 2 years the model needs.
The fix: the week 1 data audit needs to be thorough enough to surface these issues before the architecture is designed around data that isn't accessible. This means pulling actual sample data, not just getting confirmation that "yes, we have that data."
3. Approval loops in weeks 7-8
Enterprise organizations have review processes. Legal, security, compliance, procurement - all of them have opinions on AI. When those review processes aren't initiated until the system is mostly built, they create 4-8 week delays in the middle of the build sprint.
The fix: initiate security and compliance reviews in week 3, alongside architecture. Get preliminary approval on the approach before the code is written. By week 8, you want final sign-off, not first submission.
4. Undefined acceptance criteria in week 11
Testing never ends when no one agreed upfront on what "good enough" looks like. "Can we test a few more edge cases?" becomes a 3-week testing extension when acceptance criteria weren't defined in week 2.
The fix: define acceptance criteria in the week 2 project brief. "90% accuracy on the evaluation suite, latency under 3 seconds at 2x peak volume, zero critical integration failures." When week 11 testing hits those numbers, testing is done. Not when everyone feels comfortable.
5. Launch reluctance in week 12
Sometimes the product is ready and the organization isn't. Stakeholders who were enthusiastic in week 1 are suddenly worried about edge cases, user adoption, support burden, and what happens if it goes wrong.
This is normal. It's also a delay pattern that can add 4-8 weeks to a project that's technically complete.
The fix: soft launch. The 5-10% rollout reduces the stakes of going live. You're not betting the whole customer base on week 12 - you're testing with a controlled group. Frame it that way from week 1, and the week 12 reluctance shrinks dramatically.
What separates the companies that ship
Company A and Company B from the opening of this article were both capable organizations with real budgets and genuine executive support.
Company A shipped because they made a scope decision and stuck to it. They ran the data audit in week 1. They completed the security review in week 3. They defined acceptance criteria in week 2. They did a soft launch in week 12.
Company B stalled because they treated discovery as requirements collection, architecture as a committee decision, and testing as something that would eventually be complete. Every week of ambiguity became a week of delay. Every deferred decision became a blocked sprint.
The 12-week structure works because it forces decisions into the right weeks. Not because the decisions are easy, but because having a deadline for them prevents indefinite deferral.
If you're planning an enterprise AI deployment and want a team that's done this 100+ times, talk to us about what a structured 12-week build looks like for your use case.
Or if you're earlier in the process and need help turning a vague AI initiative into a scoped, fundable project, our AI consulting team starts there.