,

Designing an LRS from Scratch: Statement Schemas and Query Patterns We Wish We Had Earlier

Designing an LRS from Scratch: Statement Schemas and Query Patterns We Wish We Had Earlier — T-Square engineering blog

TL;DR — Building an LRS means committing to xAPI’s statement shape, picking indexes that survive analytics-style queries, and accepting that the schema is essentially fixed by the spec. Most of the engineering goes into query patterns and aggregation, not storage.

xAPI statement storage and query patternsRaw statements + denormalized hot columns + daily roll-ups serve the four most common queries. — /LRS architecture statements table JSONB + indexed hot cols (tenant, actor, ts DESC) daily roll-ups per-tenant aggregates refreshed nightly learner timeline funnel cohort compare item analysis → never aggregate across tenants at storage time → denormalize result.score.scaled — JSONB extraction is expensive at M+ rows → tag cmi5 launch + registration on the way in, not later
Raw statements + denormalized hot columns + daily roll-ups serve the four most common queries.

A Learning Record Store accepts xAPI statements, stores them, and lets you query them later. The xAPI specification fixes the statement shape, so the schema is not where the creativity goes. The creativity goes into indexes, retention, and aggregation queries that answer real questions about learner behaviour.

This is what running an in-house LRS inside Nexia Academy taught us.

The statement shape

xAPI fixes the statement at the top level: actor, verb, object, with optional result, context, timestamp, and authority. Each top-level field is itself a structured object.

Our storage table:

statements (
  id            UUID PRIMARY KEY,
  tenant_id     UUID NOT NULL,
  actor_id      TEXT NOT NULL,       -- normalized from actor.mbox or account
  verb_id       TEXT NOT NULL,       -- verb.id IRI
  object_id     TEXT NOT NULL,       -- object.id IRI
  stored        TIMESTAMPTZ NOT NULL DEFAULT now(),
  timestamp     TIMESTAMPTZ NOT NULL,
  statement     JSONB NOT NULL,      -- the raw statement
  context_au    TEXT,                -- cmi5 Assignable Unit IRI if present
  context_reg   UUID,                -- cmi5 registration if present
  result_score  REAL                 -- denormalized for fast queries
)

The denormalized columns are deliberate. JSONB extraction is cheap once but expensive in aggregations across millions of rows.

The indexes that matter

  • (tenant_id, actor_id, timestamp DESC) — learner timeline; the most common query shape
  • (tenant_id, object_id, timestamp DESC) — course or AU timeline; cohort analytics
  • (tenant_id, verb_id, timestamp DESC) — verb-level analytics (“how many completions today?”)
  • (tenant_id, context_reg) — full session replay for cmi5
  • JSONB GIN index on statement for ad-hoc queries

The four query patterns we care about

  1. Learner timeline — every statement for a learner, time-ordered. The default report.
  2. Course completion funnel — for a course, what fraction of enrolled learners reached each milestone?
  3. Cohort comparison — for cohort A vs cohort B, completion rate, time-to-complete, average score.
  4. Per-question item analysis — for an assessment item, P-value (fraction correct), discrimination index, time spent.

Aggregation strategy

Rolling aggregations from the raw statements table get expensive past ~5M rows per tenant. We materialize daily roll-ups for the four query patterns above into separate tables and refresh nightly. Real-time queries hit the roll-up; “fresh today” toggles hit the raw table for the current day only.

What we wish we had done from day one

  • Stored the actor identifier as a normalized canonical form, not a JSON path expression evaluated at query time
  • Denormalized result.score.scaled from the start (we added it later, painfully)
  • Tagged statements with the producing AU’s cmi5 launch URL on the way in; deriving it later from context is fragile
  • Captured statement.timestamp AND statement.stored separately — they diverge in mobile/offline scenarios
  • Built a “replay” debug view in week one, not month six

The validation layer

Every incoming statement is schema-validated against xAPI rules before storage: required fields, IRI well-formedness, ISO 8601 timestamps. Invalid statements get a 400 with a clear pointer to the offending path, not a generic error. This catches authoring-tool bugs early and prevents poisoned aggregations.

Frequently asked questions

When should I build vs buy an LRS?

Buy SCORM Cloud or Veracity if you need an LRS in two weeks. Build only when query patterns the SaaS LRS does not support are core to your product — typically deep multi-tenant analytics with custom aggregations.

PostgreSQL or a dedicated store for xAPI statements?

PostgreSQL with JSONB scales surprisingly well into millions of statements. Move to ClickHouse or similar only when aggregation queries cross the threshold where Postgres planning costs dominate.

Working on something similar?

T-Square architects, builds and operates production systems for learning, AI and custom software products. Talk to a senior engineer if you want a second opinion on your design or roadmap.

— /more

Keep reading