Plan a Data Cleaning Pass for a Messy Dataset
Diagnose the issues in a messy dataset and produce a step-by-step cleaning plan in priority order.
When to use this
When you've opened a dataset and it's a mess of inconsistent formats, missing values, and weird outliers — and you need a plan, not just vibes.
The prompt
You are a data analyst who's cleaned a lot of real-world data.
Source:
```
[paste a sample of the data — first 20 rows is enough]
```
Context:
- What this data represents: [...]
- The question I'm trying to answer with it: [...]
- The format I want it in by the end (long / wide, what columns I need): [...]
Diagnose and plan:
1. **The issues you spot** — list every cleaning issue in the sample. For each: column, what's wrong, severity (1–3), a representative example.
2. **Priority order** — which to fix FIRST so later fixes are easier. Cleaning order matters.
3. **For each issue, the cleaning move** — what specifically to do (drop, impute, standardize, parse, split a column). Be concrete.
4. **What you'd ask before deleting any row or column** — a reason data is "wrong" can be a real signal. Don't lose information without thinking.
5. **Final shape** — what the cleaned dataset should look like for my downstream question.
6. **Things to double-check** — record counts at each step, distributions of key columns, before/after spot checks.
Don't recommend ML imputation when median imputation is fine. Don't recommend complexity I don't need.
What you'll get back
A prioritized cleaning plan with each issue diagnosed, a concrete move per issue, a "before you delete" caution, the final shape, and validation checks.
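Once you have the plan, the validation checks are cheap to wire in. Here is a minimal sketch of the "record counts at each step" idea, assuming a tiny list-of-dicts dataset and standard-library Python only. The sample rows, column names (`city`, `price`), and helper functions are all hypothetical, made up for illustration; swap in your own steps in the priority order the plan gives you.

```python
from statistics import median

# Hypothetical sample rows; in practice these would come from your own file.
rows = [
    {"id": "1", "city": " Boston ", "price": "12.5"},
    {"id": "2", "city": "boston", "price": ""},
    {"id": "3", "city": "NYC", "price": "11.0"},
    {"id": "3", "city": "NYC", "price": "11.0"},  # exact duplicate
    {"id": "4", "city": "nyc", "price": "13.5"},
]

audit = []  # (step name, row count after the step)


def record(step, data):
    """Log the row count after a step so unexpected drops are visible."""
    audit.append((step, len(data)))
    return data


def standardize_city(data):
    # Trim whitespace and unify case; no rows are removed.
    return [{**r, "city": r["city"].strip().lower()} for r in data]


def drop_exact_duplicates(data):
    # Keep the first occurrence of each fully identical row.
    seen, out = set(), []
    for r in data:
        key = tuple(sorted(r.items()))
        if key not in seen:
            seen.add(key)
            out.append(r)
    return out


def impute_price_with_median(data):
    # Parse prices, fill blanks with the median of the observed values,
    # and keep a flag so the imputed rows stay identifiable later.
    observed = [float(r["price"]) for r in data if r["price"] != ""]
    fill = median(observed)
    return [
        {
            **r,
            "price": float(r["price"]) if r["price"] != "" else fill,
            "price_was_missing": r["price"] == "",
        }
        for r in data
    ]


# Order matters: standardize first so later steps compare like with like.
cleaned = record("raw", rows)
cleaned = record("standardize_city", standardize_city(cleaned))
cleaned = record("drop_exact_duplicates", drop_exact_duplicates(cleaned))
cleaned = record("impute_price", impute_price_with_median(cleaned))

for step, n in audit:
    print(f"{step:<24}{n}")
```

The audit trail makes it obvious which step removed rows (here, only the de-duplication step does), and the `price_was_missing` flag keeps the imputation reversible if it later turns out the blanks meant something.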
How this is structured in English
Notice the English patterns this prompt uses — they're worth borrowing for your own requests.
- "A reason data is 'wrong' can be a real signal": a counterintuitive idea. Missingness or weirdness often carries information, and saying so helps the AI avoid scrubbing away the interesting parts.
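To see what "missingness carries information" can look like in practice, here is a small sketch that compares the share of missing values across groups before anything gets dropped. It assumes standard-library Python and a made-up survey sample; the `region` and `income` columns and the `missing_rate_by_group` helper are hypothetical names used only for illustration.

```python
from collections import defaultdict

# Hypothetical survey rows; None marks a missing income value.
rows = [
    {"region": "north", "income": 52000},
    {"region": "north", "income": 48000},
    {"region": "north", "income": None},
    {"region": "south", "income": None},
    {"region": "south", "income": None},
    {"region": "south", "income": 39000},
]


def missing_rate_by_group(data, group_col, value_col):
    """Return the share of missing values in value_col for each group."""
    total = defaultdict(int)
    missing = defaultdict(int)
    for r in data:
        total[r[group_col]] += 1
        if r[value_col] is None:
            missing[r[group_col]] += 1
    return {g: missing[g] / total[g] for g in total}


rates = missing_rate_by_group(rows, "region", "income")
for region, rate in rates.items():
    print(f"{region}: {rate:.0%} missing")
```

If one group is missing a value far more often than another, dropping those rows would quietly skew the answer to your downstream question, which is a good reason to ask before deleting.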