TEJASWA SHARMA
Analytics · Python · Power BI · 2025

Cohort Retention & LTV Modeler

Replaced a 16-hour-per-cycle manual process with a Python pipeline feeding Power BI. For the first time, the business had channel-level retention and LTV/CAC numbers it could actually trust — and they went straight into leadership reviews.

The problem

The business was making acquisition spend decisions without knowing which channels produced customers that actually stayed. The data existed — it lived in the revenue ledger — but nobody had built the pipeline to surface it in a usable form.

The existing answer was a Google Sheet maintained by one analyst, refreshed twice a month, that broke whenever someone added a row out of order or changed a column header. Getting from raw ledger to a decision-ready number took roughly 16 hours per cycle. Leadership was asking for retention data in weekly reviews. An analyst shouldn't be on call to rebuild a spreadsheet every Sunday.

The deeper issue: without a reliable cohort view, any conversation about where to invest — in acquisition, retention, or product — was opinion-based. This tool was built to make it data-based.

The approach

The first decision was where to compute the cohort logic. The temptation is to do it inside Power BI — it's already connected to the data and everyone knows how to read it. That was a mistake I made in v1 and fixed quickly: Power BI's DAX engine is not designed for the kind of self-joins a cohort matrix requires. Moving the computation to Python (Pandas) cut refresh time from 40 minutes to under 3 and eliminated an entire class of off-by-one errors.

Cohorts are defined by acquisition week, not month. Monthly cohorts flatten the curvature that matters most: the drop-off in weeks 2–6 is where most churn happens, and monthly aggregation hides it completely. Weekly cohorts make that visible and let the sales team see the impact of onboarding changes almost in real time.
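
A minimal sketch of that computation in Pandas, assuming the ledger arrives with a customer_id column and an order_date already parsed as a datetime (the column and function names here are illustrative, not the production ones):

```python
import pandas as pd

def weekly_retention_matrix(ledger: pd.DataFrame) -> pd.DataFrame:
    """Rows = acquisition week, columns = weeks since acquisition, values = retention %."""
    df = ledger.copy()
    # Bucket every order into its calendar week (represented by the week's start date)
    df["order_week"] = df["order_date"].dt.to_period("W").dt.start_time
    # A customer's cohort is the week of their first order
    df["cohort_week"] = df.groupby("customer_id")["order_week"].transform("min")
    # Weeks elapsed since acquisition (0 = the acquisition week itself)
    df["week_number"] = (df["order_week"] - df["cohort_week"]).dt.days // 7
    # Distinct active customers per cohort per week
    active = (
        df.groupby(["cohort_week", "week_number"])["customer_id"]
        .nunique()
        .unstack(fill_value=0)
    )
    # Normalise by cohort size (week 0) to get retention percentages
    return active.divide(active[0], axis=0).mul(100).round(1)
```

The self-join that DAX struggled with reduces to one groupby and one unstack, which is also what made the logic cheap to unit-test later.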

The output of the Python pipeline is a flat retention matrix written to Postgres nightly. Power BI connects to a single view — no joins, no computed columns, no DAX complexity. The report refreshes in seconds because it's reading a pre-materialized table. Any analyst can modify the underlying Python without touching the BI layer.
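
The nightly publish step is equally small in outline. A sketch using SQLAlchemy; the connection string, table name, and view name are placeholders rather than the real ones:

```python
import pandas as pd
from sqlalchemy import create_engine, text

def publish_retention(retention: pd.DataFrame, conn_str: str) -> None:
    """Overwrite the flat retention table nightly; Power BI reads only the view."""
    # Long format (one row per cohort_week x week_number) keeps the BI side to a plain SELECT
    flat = retention.reset_index().melt(
        id_vars="cohort_week", var_name="week_number", value_name="retention_pct"
    )
    engine = create_engine(conn_str)  # e.g. "postgresql+psycopg2://user:***@host/db"
    with engine.begin() as conn:
        conn.execute(text(
            "CREATE TABLE IF NOT EXISTS cohort_retention "
            "(cohort_week timestamp, week_number int, retention_pct numeric)"
        ))
        conn.execute(text("TRUNCATE cohort_retention"))  # table is tiny; rebuild in full
        flat.to_sql("cohort_retention", conn, if_exists="append", index=False)
        # The BI layer points at this view, never at the pipeline's working tables
        conn.execute(text(
            "CREATE OR REPLACE VIEW v_cohort_retention AS "
            "SELECT cohort_week, week_number, retention_pct FROM cohort_retention"
        ))
```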

For LTV, I used the simplest defensible formula: ARPU ÷ (1 − retention rate). No DCF, no terminal value gymnastics. I benchmarked the output against two more sophisticated models and the difference was under 8% for mature cohorts. Investors don't want to audit your assumptions — they want to trust your number. Simple and auditable beats complex and opaque.
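
Spelled out as code, using the same health bands as the calculator further down this page (a sketch; the example figures are the calculator's, not real unit economics):

```python
def simple_ltv(arpu: float, retention: float) -> float:
    """LTV = ARPU / (1 - retention); retention is the period-over-period rate, e.g. 0.85."""
    return arpu / (1.0 - retention)

def ltv_cac_ratio(arpu: float, retention: float, cac: float) -> tuple[float, str]:
    ratio = simple_ltv(arpu, retention) / cac
    band = "Healthy" if ratio > 3 else ("Marginal" if ratio >= 1 else "Loss")
    return ratio, band

# Same numbers as the calculator below:
# simple_ltv(5_000, 0.85)            -> ~33,333
# ltv_cac_ratio(5_000, 0.85, 3_000)  -> (~11.11, "Healthy")
```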

Cohort retention · Synthetic data

12-week retention by cohort

Cohort     W1   W2   W3   W4   W5   W6   W7   W8   W9   W10  W11  W12
2024-W01   100  76   63   56   50   46   43   41   39   38   36   35
2024-W05   100  79   66   58   52   48   45   42   40   38   37   36
2024-W09   100  77   64   57   52   47   44   42   40   38   37   36
2024-W13   100  81   68   60   55   50   47   44   42   40   39   38

All values are retention %.

The chart above uses synthetic data illustrating the model structure. The live version covers 24 weeks across Organic, Paid, Referral, and Outbound channels. The board variant overlays ARR cohort revenue on top of retention percentages, so the same visual speaks to both retention health and revenue concentration simultaneously.

The result

  • Reporting time cut from ~16 hours per cycle to under 1 hour. The pipeline runs unattended nightly.
  • Channel-level retention visibility created for the first time — one channel had a 30% higher 12-week retention rate, a finding that immediately changed acquisition strategy.
  • LTV/CAC modeling across ₹40 Cr+ in revenue now drives quarterly spend reallocation decisions at the leadership level.
  • Retention and LTV methodology presented to C-suite leadership for strategic investment decisions — zero pushback on the numbers.

LTV / CAC calculator · Live

The live page includes an interactive calculator: adjust ARPU, retention, and CAC and the ratio updates in place. Shown here with ARPU ₹5,000/month, retention 85%, and CAC ₹3,000, it gives a lifetime value (LTV) of ₹33,333 and an LTV : CAC ratio of 11.11, rated strong.
Formula: LTV = ARPU / (1 − retention). Ratio = LTV / CAC.
Healthy > 3 · Marginal 1–3 · Loss < 1.

What I'd do differently

The move from Power BI computation to Python should have happened at the design stage, not after v1 was already in use. The lesson: if a calculation is complex enough to need unit tests, it shouldn't live in a BI tool. Materializing wide tables in Postgres for BI consumption is almost always cheaper than virtualizing them — lower latency, easier to version-control, easier to audit.
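
To make that testability point concrete: once the cohort logic is an ordinary Python function, the edge cases that used to silently break the spreadsheet become a few lines of pytest. A sketch against the illustrative weekly_retention_matrix from the approach section; the module name is hypothetical:

```python
import pandas as pd
from retention_pipeline import weekly_retention_matrix  # hypothetical module name

def _ledger(rows):
    return pd.DataFrame(rows, columns=["customer_id", "order_date"]).assign(
        order_date=lambda d: pd.to_datetime(d["order_date"])
    )

def test_acquisition_week_is_always_100_percent():
    matrix = weekly_retention_matrix(
        _ledger([(1, "2024-01-01"), (1, "2024-01-10"), (2, "2024-01-02")])
    )
    assert (matrix[0] == 100).all()

def test_repeat_orders_in_one_week_count_the_customer_once():
    matrix = weekly_retention_matrix(
        _ledger([(1, "2024-01-01"), (1, "2024-01-08"), (1, "2024-01-09")])
    )
    # One customer, two orders in week 1: still 100% retained, not double-counted
    assert matrix.loc[matrix.index[0], 1] == 100
```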

The next iteration I'd want to build: a predictive LTV layer using a Beta-Geometric / NBD model instead of the static formula. For mature cohorts the difference is small, but for cohorts under 8 weeks old the static formula underestimates LTV meaningfully. That matters when you're making spend decisions based on early cohort signals.
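
For reference, a rough sketch of what that predictive layer might look like with the open-source lifetimes library; the column names, weekly frequency, and 52-week horizon are assumptions, not anything from the production pipeline:

```python
import pandas as pd
from lifetimes import BetaGeoFitter
from lifetimes.utils import summary_data_from_transaction_data

def expected_purchases(ledger: pd.DataFrame, horizon_weeks: int = 52) -> pd.Series:
    """BG/NBD: expected number of repeat purchases per customer over the horizon."""
    # Collapse the raw ledger into the (frequency, recency, T) summary the model expects
    summary = summary_data_from_transaction_data(
        ledger, customer_id_col="customer_id", datetime_col="order_date", freq="W"
    )
    bgf = BetaGeoFitter(penalizer_coef=0.001)  # light regularisation for small cohorts
    bgf.fit(summary["frequency"], summary["recency"], summary["T"])
    return bgf.conditional_expected_number_of_purchases_up_to_time(
        horizon_weeks, summary["frequency"], summary["recency"], summary["T"]
    )
```

Multiplying the expected purchase count by an average order value (or a Gamma-Gamma estimate of it) would give a forward-looking LTV to benchmark against the static formula on mature cohorts.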