How to clean up your data before rolling out Copilot

15 April 2026 8 min read

The most common reason a Copilot rollout disappoints isn't the licence, the training or the use cases. It's the underlying data. Copilot is, in essence, a very powerful lens pointed at your existing files, emails and SharePoint sites. If those are a mess, Copilot will surface the mess faster and more visibly than anything you've ever used. If they're tidy, Copilot looks like magic.

The good news is that the data clean-up needed before a Copilot rollout is much smaller than most consultancies will tell you. You don't need a six-month information architecture project. You need to fix four things, in roughly this order.

Find and close oversharing

The single biggest pre-Copilot risk is files that everyone in the company can already open without anyone having noticed. The HR folder that's been open to all employees since 2017. The legal SharePoint site with 'Anyone with link' enabled. The shared drive that holds the 2018 office party photos alongside the 2024 redundancy plans, all visible to the same hundred people.

These were already problems. Copilot just makes them obvious - because someone will inevitably ask 'summarise the salary discussions for the leadership team' and Copilot will helpfully oblige using files the person shouldn't have been able to see.

Run a SharePoint Advanced Management or Microsoft Purview oversharing report. Look for files with company-wide or anonymous access. Prioritise anything with words like 'salary', 'redundancy', 'investigation', 'contract', 'M&A', 'forecast' in the title or contents. Fix those first. Everything else can be done over the following months.
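If you export the report to a spreadsheet, the triage itself is simple enough to script. The sketch below is illustrative only: the `SharedFile` fields and the scope values `anyone` and `organization` are assumptions, not the column names of any real report, and it checks file paths rather than contents.

```python
from dataclasses import dataclass

# Keywords taken from the list above; extend to suit your organisation.
RISK_KEYWORDS = ("salary", "redundancy", "investigation", "contract", "m&a", "forecast")

# Hypothetical scope values; a real report export will use its own wording.
BROAD_SCOPES = {"anyone", "organization"}


@dataclass
class SharedFile:
    path: str   # full path or title of the file
    scope: str  # who it is shared with: "anyone", "organization" or "specific"


def risk_keywords_in(path: str) -> list[str]:
    """Return the risk keywords that appear in the path, ignoring case."""
    lowered = path.lower()
    return [kw for kw in RISK_KEYWORDS if kw in lowered]


def fix_first(files: list[SharedFile]) -> list[SharedFile]:
    """Broadly shared files with a risk keyword in the path, anonymous links first."""
    hits = [f for f in files if f.scope in BROAD_SCOPES and risk_keywords_in(f.path)]
    # Anonymous links reach beyond the company, so they sort to the top.
    return sorted(hits, key=lambda f: (f.scope != "anyone", f.path.lower()))
```

Anything this returns is a candidate for the 'fix those first' pile. A plain substring match will throw up false positives ('contract' also matches 'contractor'), which is acceptable for a shortlist a human then reviews.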

Apply sensitivity labels to the top 20% that matter

You do not need to label every document. You need to label the ones that, if Copilot summarised them in the wrong meeting, would cause a real problem. In most SMBs that's a few hundred documents at most, not tens of thousands.

Set up three labels and stop. 'Public' (marketing material, published reports). 'Internal' (everything that's normal company information). 'Confidential' (board papers, salary data, customer contracts, legal advice). Apply Confidential to the high-risk folders manually, configure auto-labelling for obvious patterns (anything with 'salary' or 'board pack' in the name), and let everything else default to Internal.
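To see what name-based rules would catch before you configure anything, you can dry-run them against a list of file paths. This is a minimal sketch, not how auto-labelling is actually configured in Microsoft Purview: the `suggest_label` function and the `Marketing/Published/` folder prefix are hypothetical, and the only patterns used are the two named above.

```python
import re

# Name patterns that suggest Confidential content. 'salary' and 'board pack'
# come from the article; treat the list as a starting point.
CONFIDENTIAL_PATTERNS = [
    re.compile(r"salary", re.IGNORECASE),
    re.compile(r"board[\s_-]*pack", re.IGNORECASE),
]

# Hypothetical folder prefix for published material.
PUBLIC_FOLDERS = ("Marketing/Published/",)


def suggest_label(path: str) -> str:
    """Suggest 'Confidential', 'Public' or 'Internal' for a file path."""
    # Confidential is checked first so a risky name wins over a public folder.
    if any(p.search(path) for p in CONFIDENTIAL_PATTERNS):
        return "Confidential"
    if path.startswith(PUBLIC_FOLDERS):
        return "Public"
    return "Internal"  # the default for everything else
```

Checking Confidential first is deliberate: a file called 'salary survey' sitting in a public folder is exactly the case you want flagged rather than waved through.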

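Copilot respects labels, with one condition: the label has to enforce access (encryption with usage rights), not just mark the file. When Confidential is configured that way, a file carrying it won't be surfaced to someone who hasn't been granted rights to it, even if a permissions mistake elsewhere would otherwise let them see it. That's a meaningful safety net. A label that only adds a visual marking doesn't restrict anything on its own.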

Archive what nobody has touched in two years

Most SMB file estates contain enormous amounts of dormant content - old projects, ex-employee folders, superseded policies, draft documents from 2019. Copilot will happily pull from any of it. That creates two problems: it surfaces stale or wrong information ('our refund policy says X' when X was replaced two years ago), and it makes the relevant content harder to find.

You don't need to delete anything. Move dormant content to an archive site that Copilot can't draw on for everyday users, typically by locking the site's permissions down to a small group, since Copilot only works with content the person asking can already open. SharePoint and OneDrive both make this straightforward. A simple rule like 'not opened in 24 months and not modified in 24 months' will catch most of it. Anything genuinely useful will get pulled back out when someone needs it.
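The 24-month rule is easy to get subtly wrong: a file has to fail both tests, not just one. As a sketch, assuming you can get a last-opened and a last-modified date for each file from whatever inventory you use (the function name and the 730-day approximation are ours):

```python
from datetime import datetime, timedelta

DORMANT_AFTER = timedelta(days=730)  # roughly 24 months


def is_dormant(last_opened: datetime, last_modified: datetime, now: datetime) -> bool:
    """True only if the file was neither opened nor modified in the last 24 months."""
    cutoff = now - DORMANT_AFTER
    return last_opened < cutoff and last_modified < cutoff
```

An old policy that people still read every month stays put, because a recent open is enough to keep it out of the archive.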

Fix the top ten naming and structure problems

Copilot does much better when files are named in a way a human can read. 'Q2-2025 board pack v3 FINAL.docx' is fine. 'doc1 (4) (copy).docx' is not. You don't need a renaming project. You need to fix the half-dozen high-traffic folders where the worst offenders live.
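Finding the worst offenders in a folder listing can be done with a few patterns. The sketch below is one possible set of checks, not a standard: the generic-name list and the three problem labels are our own choices.

```python
import re

# Illustrative patterns; adjust the generic-name list to your own estate.
GENERIC = re.compile(r"(doc|document|untitled|scan|image)\s*\d*", re.IGNORECASE)
COPY = re.compile(r"\(copy\)|\bcopy of\b", re.IGNORECASE)
COUNTER = re.compile(r"\(\d+\)")


def naming_problems(filename: str) -> list[str]:
    """List the naming problems found in a filename (empty if it looks fine)."""
    stem = filename.rsplit(".", 1)[0]  # drop the extension
    problems = []
    if COUNTER.search(stem):
        problems.append("duplicate counter")
    if COPY.search(stem):
        problems.append("copy marker")
    # Strip the markers, then see whether anything meaningful is left.
    core = COPY.sub("", COUNTER.sub("", stem)).strip()
    if GENERIC.fullmatch(core):
        problems.append("generic name")
    return problems
```

Run against the two examples above, 'doc1 (4) (copy).docx' trips all three checks and 'Q2-2025 board pack v3 FINAL.docx' trips none. Sort a high-traffic folder by number of problems and rename from the top.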

Similarly, give the most important SharePoint sites a tidy structure. Three or four top-level folders, named clearly, is far more useful than a sprawl of 47 folders named after people who left in 2020. Tidiness here pays back permanently - both for Copilot and for humans.

What you don't need to do (yet)

Some things often pitched as 'Copilot prerequisites' are not actually required to start. You don't need to migrate every file from on-premises file shares (Copilot can be valuable on just the cloud content). You don't need a perfect taxonomy or metadata schema. You don't need to label every email. You don't need an enterprise-grade data classification project. Those are nice-to-haves over a multi-year horizon, not blockers to a 30-day pilot.

Run a small test before you go wide

Once you've done the four fixes above, give Copilot to two or three people whose work would expose problems - someone in HR, someone in finance, ideally someone in legal. Ask them to spend a week trying to break it: asking for things they shouldn't be able to see, summarising sensitive folders, generating content that pulls from confidential sources. If nothing leaks, you're in good shape for a wider rollout. If something does, it's almost always a single misconfigured site that's easy to fix.

Keep going after launch

Data hygiene is not a one-time project. New oversharing happens every week, as people share files externally or set up new Teams sites without thinking. Build a monthly 15-minute review into someone's calendar - usually the IT lead or whoever owns the Microsoft tenant - to check the oversharing report and the sensitivity-label coverage. Small and monthly beats large and quarterly.
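The monthly check boils down to two numbers: what is newly overshared since last time, and how much of the estate carries a label. A minimal sketch, assuming you keep last month's list of broadly shared paths and can count labelled documents (both function names are hypothetical):

```python
def new_overshares(previous: set[str], current: set[str]) -> list[str]:
    """Paths that are broadly shared now but were not in the previous report."""
    return sorted(current - previous)


def label_coverage(labelled: int, total: int) -> float:
    """Percentage of documents carrying any sensitivity label."""
    if total == 0:
        return 0.0
    return round(100 * labelled / total, 1)
```

Looking only at what changed since the last report is what keeps the review to 15 minutes.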

The honest summary

You do not need to fix your entire information estate before rolling out Copilot. You need to close obvious oversharing, label the few hundred documents that genuinely matter, archive the dormant stuff, and tidy the worst-named high-traffic folders. That's a one- to two-week project for a small business, not a six-month consultancy engagement. Do it once, set up a light monthly review, and Copilot will be working from a foundation that makes it look smart rather than dangerous.