BrainCheck

ML-Based Recommendation System

This page introduces the spaced repetition algorithm adopted by BrainCheck and summarizes its performance characteristics.


What Is a Spaced Repetition Algorithm?

A spaced repetition algorithm automatically manages the review schedule for flashcards.

The core idea is simple. Instead of memorizing everything in one intense session, reviews are spread out over time. To do this efficiently, the algorithm models how a learner's memory behaves. It predicts when a learner is likely to forget a specific item and schedules review around that moment.

The more accurately the algorithm predicts memory, the less review time a learner needs to retain material over the long term.
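As a minimal sketch of the idea, assume an exponential forgetting curve (real algorithms such as FSRS fit richer curve shapes, and the function names here are illustrative):

```python
import math

def retrievability(t_days: float, stability: float) -> float:
    """Predicted probability of recall t_days after the last review,
    given a memory stability value (exponential forgetting curve)."""
    return math.exp(-t_days / stability)

def next_interval(stability: float, target_retention: float = 0.9) -> float:
    """Days until predicted recall decays to the target retention level."""
    return -stability * math.log(target_retention)
```

The scheduler's job is then to call the card back for review around `next_interval` days after the last review, just before the predicted recall drops below the target.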


Benchmark Dataset

The benchmark data was collected from 10,000 real users.

Same-day reviews are excluded from evaluation. Repeating a card several times on the same day is not directly related to long-term memory prediction, and including those records can distort an algorithm's real predictive ability.

Records created after manual date changes or under disabled settings were filtered out, and outlier filters were also applied.


Evaluation Method

Time-Series Split

The benchmark uses TimeSeriesSplit from the sklearn library. It trains on past learning records and evaluates on future learning results. This prevents the algorithm from seeing future information and measures performance under conditions closer to real use.

Note: TimeSeriesSplit is applied independently per user. Data is not mixed across users, so one user's future records do not leak into another user's training data.
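A per-user split can be sketched like this, assuming each user's records are already sorted chronologically (the data-structure shape is illustrative):

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

def per_user_splits(records_by_user, n_splits=3):
    """Yield (user_id, train_idx, test_idx) where the time-series split is
    computed independently for each user's chronologically sorted records,
    so one user's future never leaks into another user's training folds."""
    for user_id, records in records_by_user.items():
        splitter = TimeSeriesSplit(n_splits=n_splits)
        X = np.arange(len(records)).reshape(-1, 1)  # records are time-ordered
        for train_idx, test_idx in splitter.split(X):
            yield user_id, train_idx, test_idx
```

Within each user, every test fold contains only records that come after all of that user's training records.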

Metrics

Short version: The benchmark measures how accurately an algorithm predicts the probability that a learner will remember an item.

Log Loss
Measures the difference between predicted recall probability and the actual review result, remembered or forgotten. It shows how close the algorithm's probability estimates are to real outcomes. The range is 0 to infinity, and lower is better.

RMSE (bins)
Groups predictions and actual outcomes by review interval, review count, lapse count, and similar bins, then computes a weighted root mean squared error across those bins. It evaluates accuracy across many learning situations. The range is 0 to 1, and lower is better.

AUC (Area Under the ROC Curve)
Measures how well the algorithm separates successful recall from failed recall. The range is 0 to 1, and higher is better; 0.5 corresponds to random guessing, so any useful algorithm scores above it.

Log Loss and RMSE (bins) measure calibration: whether predicted probabilities match actual data.
AUC measures discrimination: whether the model can distinguish between two outcomes.
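The three metrics can be computed as follows. Log Loss and AUC use sklearn directly; the `rmse_bins` function is a simplified sketch that bins by predicted probability, whereas the actual benchmark bins by interval, review count, and lapse count:

```python
import numpy as np
from sklearn.metrics import log_loss, roc_auc_score

y_true = np.array([1, 0, 1, 1, 0, 1])               # 1 = remembered, 0 = forgotten
y_prob = np.array([0.9, 0.2, 0.7, 0.6, 0.4, 0.8])   # predicted recall probability

ll = log_loss(y_true, y_prob)        # calibration: lower is better
auc = roc_auc_score(y_true, y_prob)  # discrimination: higher is better

def rmse_bins(y_true, y_prob, n_bins=10):
    """Simplified RMSE (bins): bin by predicted probability, compare each
    bin's mean prediction with its mean outcome, and weight by bin size."""
    bins = np.minimum((y_prob * n_bins).astype(int), n_bins - 1)
    sq_err = 0.0
    for b in np.unique(bins):
        mask = bins == b
        sq_err += mask.sum() * (y_prob[mask].mean() - y_true[mask].mean()) ** 2
    return float(np.sqrt(sq_err / len(y_true)))
```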


Algorithm Categories

Two-Component and Three-Component Memory Models

The two-component model of long-term memory describes memory state with two independent variables:

  - Retrievability (R): the probability of successfully recalling an item at a given moment.
  - Stability (S): how slowly retrievability declines as time passes.

The three-component model adds Difficulty (D).
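As a sketch, the three-component state can be represented like this; the field names follow the common Difficulty/Stability/Retrievability convention, and the decay formula is an exponential simplification rather than any algorithm's exact curve:

```python
import math
from dataclasses import dataclass

@dataclass
class MemoryState:
    """Three-component memory state (exponential decay is a simplification)."""
    stability: float   # S: how slowly retrievability decays, in days
    difficulty: float  # D: how hard this item is for this learner

    def retrievability(self, days_elapsed: float) -> float:
        # R is not stored: it is derived from S and time since last review.
        return math.exp(-days_elapsed / self.stability)
```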

Representative algorithms: the FSRS family (FSRS-4.5, FSRS-5, FSRS-6, and the Rust implementation FSRS-rs), which uses the three-component model.

Alternative Memory Models

Examples in the benchmark include DASH, HLR (Duolingo), and Ebisu v2, which model recall probability with their own formulations rather than the component-based memory state.

Neural Network-Based Models

Examples include GRU, LSTM, RWKV, and RWKV-P, which learn memory dynamics directly from review logs instead of using a hand-designed memory model.


Benchmark Results

The table below shows results for a selection of representative algorithms, evaluated with same-day reviews excluded. Values are averages; the original benchmark also provides 99% confidence intervals.

Algorithm        Trainable parameters   Log Loss ↓   RMSE (bins) ↓   AUC ↑
RWKV-P                      2,762,884       0.2773          0.0250   0.8329
RWKV                        2,762,884       0.3193          0.0540   0.7683
LSTM                            8,869       0.3332          0.0538   0.7329
FSRS-rs                            21       0.3443          0.0635   0.7074
FSRS-6                             21       0.3460          0.0653   0.7034
FSRS-5                             19       0.3560          0.0741   0.7011
FSRS-4.5                           17       0.3624          0.0764   0.6893
DASH                                9       0.3682          0.0836   0.6312
GRU                                39       0.3753          0.0864   0.6683
HLR (Duolingo)                      3       0.4694          0.1275   0.6369
Ebisu v2                            0       0.4989          0.1627   0.6051
AVG (baseline)                      0       0.3945          0.1034   0.4997

Key Takeaways

FSRS-6: High Accuracy with a Small Model

With only 21 trainable parameters, FSRS-6 achieves strong predictive accuracy compared with much larger neural models such as RWKV-P, which has millions of parameters.

Its performance efficiency per parameter is very high. Because the model is small, it can run quickly even on mobile devices and can be optimized locally without a server call.

If RWKV-P Ranks First, Why Use FSRS?

RWKV-P is a large neural model with about 2.76 million parameters. It has the highest accuracy in the benchmark, but training and inference require GPU-level compute, and user-specific fine-tuning is difficult in practice. FSRS-6, by contrast, has only 21 parameters, can personalize quickly from a user's own review history, and runs in real time on ordinary devices.

Comparison with HLR (Duolingo)

Compared with Duolingo's HLR algorithm (Log Loss 0.4694), FSRS-6 (0.3460) achieves about 26% lower Log Loss.

User-Specific Forgetting-Curve Optimization

One of the biggest improvements in FSRS-6 is that it applies a different forgetting-curve shape for each user. Because people do not forget at the same speed, the review schedule can be adjusted to each learner's memory pattern.
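A simplified sketch of per-user curve fitting: given each interval and the fraction of reviews recalled at that interval, fit a user-specific decay rate for an assumed exponential curve R(t) = exp(-k·t). This is a stand-in for FSRS-6's actual per-user shape parameter, not its real formula:

```python
import numpy as np

def fit_user_decay(intervals, recall_rates):
    """Fit a per-user decay rate k for R(t) = exp(-k * t) from the
    observed recall rate at each review interval (aggregated per user)."""
    t = np.asarray(intervals, dtype=float)
    p = np.clip(np.asarray(recall_rates, dtype=float), 1e-3, 1 - 1e-3)
    # log p = -k * t  ->  least-squares slope through the origin
    return float(-np.sum(t * np.log(p)) / np.sum(t * t))
```

A fast forgetter yields a larger k, which shortens every subsequent review interval for that user.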


How BrainCheck Applies Recommendation

BrainCheck tracks each learner's memory state based on the FSRS algorithm.

  1. Retrievability prediction: Calculates the probability that each card will be remembered.
  2. Optimal review timing: Schedules review just before the recall probability falls below the target level.
  3. Personalized parameter optimization: Continuously adjusts FSRS parameters based on the learner's review history.
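The first two steps can be sketched as a planning pass over a deck. The exponential curve and the card field names are illustrative, not BrainCheck's actual API, and step 3 (re-fitting parameters from review history) is omitted:

```python
import math

def plan_reviews(cards, today, target_retention=0.9):
    """For each card, predict current recall probability (step 1) and the
    day its recall would fall to the target retention (step 2), then
    return cards sorted soonest-due first."""
    plan = []
    for card in cards:
        elapsed = today - card["last_review_day"]
        r_now = math.exp(-elapsed / card["stability"])   # step 1
        due_day = (card["last_review_day"]
                   - card["stability"] * math.log(target_retention))  # step 2
        plan.append((card["id"], r_now, due_day))
    return sorted(plan, key=lambda item: item[2])
```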

This helps learners review while memory is still in a strong state, reducing unnecessary repetition and improving study efficiency.


Source: open-spaced-repetition/srs-benchmark
Dataset: open-spaced-repetition (Hugging Face Datasets)