Introduction

As children enter school during middle childhood, they must learn to act appropriately in new situations through feedback. For example, children must learn to raise their hand before speaking in class. The teacher may reinforce this behavior immediately or with a delay, which raises the question of whether feedback timing modulates learning. Here, reinforcement learning (RL)1 provides a useful mechanistic framework to describe such feedback-driven value-based learning and decision-making. RL models allow explicit tests of the influence of separate components of value-based learning, such as model-free and model-based learning2, social and non-social learning3,4, or the contribution of different memory systems5–7.

The role of feedback timing has previously been studied in relation to memory systems. The memory systems account proposes that distinct types of memory are supported by distinct neural systems: a hippocampal-dependent system and a striatal-dependent system. These systems modulate memory and value-based learning, and their interactive development has been of particular interest to developmental research8,9. In adults, the hippocampal-dependent memory system contributes to episodic memory during reinforcement learning and is more engaged when feedback is presented with a delay6,10,11, whereas the striatal-dependent memory system supports habitual memory and is more engaged after immediate feedback5,12–14. Specifically, hippocampal activation was greater during delayed than during immediate feedback, whereas striatal activation showed the opposite pattern5. The engagement of the hippocampus during delayed feedback was further supported by enhanced episodic memory for objects presented incidentally with delayed feedback compared to objects presented with immediate feedback. Taken together, findings from adult studies suggest that feedback timing modulates the engagement of the hippocampal and striatal memory systems during value-based learning. Given the differential developmental trajectories of these systems and their impact on reinforcement learning and memory, it is important to understand whether children show similar feedback timing modulations as previously shown in adults. In addition, whether such feedback timing modulation changes over time remains largely unexplored.
To this end, in this study, we examined the contributions of hippocampal and striatal structural volumes to the longitudinal development of reinforcement learning across two years in 6-to-7-year-old children. In the following, we introduce the key parameters of reinforcement learning and then review the existing literature on developmental trajectories in reinforcement learning as well as on the hippocampus and striatum, our two brain regions of interest.

Reinforcement learning behavior modulated by feedback timing can be modeled computationally using at least three parameters that reflect feedback-based learning and decision-making. For feedback-based learning, a learning rate parameter determines the extent to which the reward prediction error, defined as the difference between the received reward and the expected reward, influences the update of future choice values. A higher learning rate emphasizes recent outcomes, whereas a lower learning rate reflects learning integrated over a longer outcome history15. Value updates may further depend on an outcome sensitivity parameter that scales the individual magnitude of received rewards. Finally, in decision-making, the inverse temperature parameter determines the tendency to select the more valuable choice and quantifies choice stochasticity. A higher inverse temperature reflects more value-guided, deterministic choice behavior, whereas a lower inverse temperature reflects more random choices. Learning rates and inverse temperature have been studied extensively across development, mainly in cross-sectional studies with mixed findings regarding their age gradients16. One study reported lower learning rates in children compared to adolescents17, while other studies found no differences18,19 or even higher learning rates in children8,20. Developmental differences regarding the inverse temperature parameter are slightly more consistent, with studies reporting either no differences8,21–23 or an increase with age, suggesting that behavior becomes increasingly value-guided and less explorative17–19,24. To the best of our knowledge, outcome sensitivity has not been modeled computationally across development. However, studies that linked striatal reward activation to self-reported reward sensitivity showed increasing sensitivity from childhood to adolescence25,26.
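The interplay of these three parameters can be illustrated with a minimal sketch (illustrative only, not the authors' implementation; the function names are hypothetical):

```python
import math

def update_value(v, reward, alpha, rho=1.0):
    """Rescorla-Wagner-style update: the learning rate alpha scales the
    reward prediction error; outcome sensitivity rho scales the reward."""
    prediction_error = rho * reward - v
    return v + alpha * prediction_error

def softmax_prob(v_chosen, v_other, tau):
    """Two-option softmax: probability of choosing the option with value
    v_chosen; a higher inverse temperature tau yields more deterministic,
    value-guided choices."""
    return 1.0 / (1.0 + math.exp(-tau * (v_chosen - v_other)))

# A higher learning rate moves the value further towards the latest outcome.
assert update_value(0.5, 1.0, alpha=0.1) < update_value(0.5, 1.0, alpha=0.9)

# A higher inverse temperature makes the same value difference more decisive.
assert softmax_prob(0.6, 0.4, tau=1.0) < softmax_prob(0.6, 0.4, tau=15.0)
```

In this toy form, the learning rate controls how far values move after each outcome, outcome sensitivity rescales the reward itself, and the inverse temperature controls how strongly value differences drive choice.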

In general, the inconsistencies regarding developmental differences in parameters may be due to their dependency on model and task properties27, which could be reconciled by comparing developmental changes to simulation-based optimal learning15. Such comparisons acknowledge that optimal parameter values vary depending on the context, and it has been suggested that humans develop towards more optimal parameter values from childhood into adulthood16. Importantly, to our knowledge, previous reinforcement learning studies with children were cross-sectional, and only two studies investigated children under 8 years of age17,28. Cross-sectional studies, in which developmental change is inferred as a between-subject factor, do not capture the dynamics of middle childhood if individual differences are large, whereas longitudinal studies test development as a within-subject factor, which is crucial for uncovering change across time. Thus, longitudinal changes in reinforcement learning in middle childhood, as well as their putative striatal and hippocampal associations, remain unknown. Accordingly, learning rates, outcome sensitivity, and inverse temperature are relevant computational parameters for studying longitudinal changes in striatal and hippocampal systems during value-based learning.

Striatal and hippocampal contributions to reinforcement learning during middle childhood may differ as these brain regions undergo major developmental changes. Whereas earlier structural studies with relatively small sample sizes showed large developmental variability and a tendency for an earlier volume peak in the striatum than in the hippocampus29–35, a recent cross-sectional large-scale study was able to contrast striatal and hippocampal trajectories with greater granularity36. These data showed that striatal volume peaks in the first decade and then declines throughout later developmental periods, whereas hippocampal volume follows a more protracted inverted-U-shaped trajectory that peaks in adolescence. Based on these structural findings, striatal and hippocampal systems are expected to develop functionally at different rates37, with habit memory depending on the earlier-developing striatum and episodic memory depending on the later-developing hippocampus38. A direct investigation of the longitudinal development of both memory systems in childhood would shed light on whether they show a differential engagement similar to that observed in adults5. Such knowledge could be useful for structuring learning processes according to developmental status. For example, children’s ability to learn from delayed feedback may depend on how well their hippocampus has developed. In the same study sample, we previously reported that children’s hippocampal volume was related to their family’s income level39. Additionally, previous research has shown that stress can reduce the effectiveness of the hippocampal-dependent memory system11. This suggests that environmental factors such as income and stress may shape how well children learn from delayed feedback, particularly through their impact on hippocampal development.
By identifying the specific environmental factors that impact children’s learning and brain development, we can identify risk groups and tailor interventions to ameliorate adverse effects.

This study aimed to explore the development of value-based learning in children and its relationship with structural brain development over time. We hypothesized that the timing of feedback would modulate children’s learning from reinforcement and that such modulation can be captured by reinforcement learning (RL) model parameters. Additionally, we predicted that children’s value-based learning would shift longitudinally towards more optimal learning behavior. Regarding structural brain development, we expected the striatum to be relatively mature by middle childhood compared to the protracted hippocampal maturation. Our second objective was to investigate the relationship between value-based learning and structural brain development using longitudinal structural equation modeling. We anticipated differentiated brain-cognition links between brain volume and value-based learning. Specifically, we predicted that immediate feedback learning would be more strongly associated with striatal volume, whereas hippocampal volume would be more closely linked to delayed feedback learning and the facilitation of episodic memory encoding. Finally, we examined how these brain-cognition dynamics change over time by analyzing their longitudinal changes.

Method

Participants

Children and their parents took part in two waves of data collection with an interval of about two years (mean = 2.07, SD = 0.17, range = 1.69–2.68). The inclusion criteria at wave 1 were: attending first or second grade, no psychiatric or physical health disorders, at least one parent speaking fluent German, and full-term birth (≥ 37 weeks of gestation). At wave 1, 142 children (46% female, age mean = 7.19, SD = 0.46, range = 6.07–7.98) and their parents or caregivers participated in the study. Of these, 140 children were included in the analysis (one child did not complete the probabilistic learning task, and another child was later excluded due to technical problems during the task). A randomly selected subgroup of 90 children (49% female, 100% right-handed) completed magnetic resonance imaging (MRI) scanning at wave 1, and 82 of them contributed structural data after scans with excessive movement were removed. At wave 2, 127 children (46% female, age mean = 9.25, SD = 0.45, range = 8.30–10.2) continued taking part in the study, while the families of the remaining children could not be contacted or decided not to return to the study. At wave 2, 126 children completed the reinforcement learning task and were included in the analysis. All children at wave 2 were invited for MRI scanning, and 104 of them completed scanning (45% female, 92% right-handed). Of these, 99 children contributed structural data after scans with excessive movement were removed. In total, 73 children contributed longitudinal MRI data and 126 children contributed longitudinal learning data. As previously reported for this study sample, we found no systematic bias due to wave 2 dropout39.

Procedure

The study consisted of a series of cognitive tasks tested during two behavioral sessions, including a reinforcement learning task, and one MRI session at wave 139,40. Two years later, the children underwent one behavioral and one MRI session. MRI scanning was performed within three weeks of the behavioral task session. Each session lasted between 150 and 180 minutes and was scheduled either on weekdays between 2 p.m. and 6 p.m. or during weekends. Before participation at both waves, parents provided written informed consent and children gave verbal assent. All children were compensated with an honorarium of 8 euros per hour.

Measures

Reinforcement learning task

Children completed an adapted reinforcement learning task5 in which they learned the preferred associations between four cues (cartoon characters) and two choices (a round-shaped or a square-shaped lolli) through probabilistic feedback (87.5% contingent and 12.5% non-contingent reward probability). In each trial, after an initial inter-trial interval of 0.5 s, a cue and its choice options were presented for up to 7 s until the child made a choice (Figure 1, choice phase). In the delay phase, we manipulated feedback timing: for two cues, the selected choice remained visible for 1 s (immediate feedback condition), whereas for the other two cues, it remained visible for 5 s before feedback was given (delayed feedback condition). A final feedback phase of 2 s indicated a reward by a green frame and a punishment by a red frame. Inside each frame, a unique object picture was shown, which was incidentally encoded and irrelevant to the task. The child was instructed to pay attention to the feedback indicated by the frame color. In an initial practice phase of 32 trials, the child practiced the task with a fifth cartoon character not included in the actual task to avoid practice effects. The experimenter instructed the child to select the choice that was most likely to result in a reward, checked whether the child had learned the more rewarded choice during practice, and had the child repeat the practice task otherwise to ensure understanding of the task. In the actual task, 128 trials were presented in four blocks with small breaks in between. Cues were presented in a mixed, pseudo-randomized order. A total of 64 unique objects were shown in the feedback phase, each one twice within the same feedback condition. In both feedback conditions, the contingent choice and choice location remained the same for each cue within the task, but were balanced across participants by using four different task versions.
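The probabilistic reward schedule of the task can be sketched as follows (an illustrative sketch only; the names are hypothetical and not taken from the authors' task code):

```python
import random

REWARD_PROB = 0.875  # the contingent choice is rewarded in 87.5% of trials

def feedback(chose_contingent, rng):
    """Return 1 (reward, green frame) or 0 (punishment, red frame).
    The contingent choice is rewarded with p = .875; the non-contingent
    choice is rewarded with p = .125."""
    p = REWARD_PROB if chose_contingent else 1.0 - REWARD_PROB
    return 1 if rng.random() < p else 0

rng = random.Random(0)
mean_reward = sum(feedback(True, rng) for _ in range(10000)) / 10000
# mean_reward approaches 0.875 over many trials
```

The schedule thus makes the contingent option clearly more rewarding on average while still producing occasional misleading outcomes, which is what the children had to learn through.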
At wave 2, four new cues replaced the previous ones to rule out memory effects.

(A) Depiction of two example trials of the immediate and delayed feedback conditions presented at wave 1. For immediate feedback (top panel), cue and choice were presented for 1 s between choice response and feedback. At feedback, a green frame around the incidentally encoded object indicated a positive outcome, which appeared in 87.5% of the trials when selecting the square-shaped lolli for this example cue. For delayed feedback (bottom panel), the delay phase between choice response and feedback lasted 5 s. A red frame around the object indicated a negative outcome and appeared in 87.5% of the trials when selecting the square-shaped lolli for this example cue. (B) For each feedback condition, two action-outcome contingencies were learned to balance a potential choice bias. With the four task versions, the cues and outcome contingencies were counterbalanced across participants.

Object recognition test

At wave 1, children were additionally tested for recognition memory of the object pictures that had been incidentally encoded during reinforcement learning. A total of 80 objects (48 old and 32 new) were presented in randomized order. The 48 old objects (24 per feedback condition) were selected from the 64 old objects shown during learning based on two lists to balance the shown and omitted old objects across task versions. Each old object was shown twice during learning, but if the child failed to respond during learning, no feedback or object was shown in that trial, so some objects appeared only once. These objects were excluded at the individual level (mean number of individually missing objects = 2.71). At recognition, children had four response options (‘old sure’, ‘old unsure’, ‘new unsure’, ‘new sure’) and up to 7 s to respond. The children answered verbally, and the experimenter entered their response. At wave 2, this test was excluded due to time constraints.

Brain volume

We extracted the bilateral brain volumes of our regions of interest, the striatum and hippocampus. The striatal regions included the nucleus accumbens, caudate, and putamen. Structural MRI images were acquired on a Siemens Magnetom TrioTim syngo 3 Tesla scanner with a 12-channel head coil (Siemens Medical AG, Erlangen, Germany) using a 3D T1-weighted Magnetization Prepared Rapid Gradient Echo (MPRAGE) sequence with the following parameters: 192 slices, field of view = 256 mm, voxel size = 1 mm³, TR = 2500 ms, TE = 3.69 ms, flip angle = 7°, TI = 1100 ms. Volumetric segmentation was performed using the FreeSurfer 6.0.0 image analysis suite41. Previous studies suggested that software tools based on adult brain templates provide inaccurate segmentation for pediatric samples, which can be improved through the use of study-specific template brains42,43. Thus, we created two study-specific template brains (one for each wave) using FreeSurfer’s “make_average_subject” command. This pipeline utilized the default adult template brain registrations of the “recon-all -all” command to average surfaces, curvatures, and volumes from all subjects into a study-specific template brain. All subjects were then re-registered to this study-specific template brain to improve segmentation accuracy. Segmented images were manually inspected for accuracy, and 8 cases at wave 1 and 5 cases at wave 2 were excluded for inaccurate or failed registration due to excessive motion.

Data analysis

Behavioral learning performance

As a first step, we calculated learning outcomes directly from the raw data: learning accuracy, win-stay and lose-shift behavior, and reaction time. Learning accuracy was defined as the proportion of choosing the more rewarding option, while win-stay and lose-shift refer to the proportions of staying with the previously chosen option after a reward and of switching to the alternative choice after a punishment, respectively. We used these outcomes as dependent variables to examine the effects of the predictors feedback timing (immediate, delayed), wave (1, 2), wave 1 age, and sex (girls, boys), using generalized linear mixed models (GLMM) with the R package lme444. All reported models included random slopes for the within-subject factors feedback timing and wave (see Supplementary Material 2 for the model structure). We systematically tested main effects and interactions between the predictors; an interaction had to statistically improve the predictive ability of the model to be included in the final reported model. All predictor variables were grand-mean-centered so that interaction effects could be interpreted independently of the other predictors.
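These raw behavioral outcomes can be computed per cue with a few lines of code (a minimal sketch under assumed data structures; the function name and argument layout are hypothetical):

```python
def behavioral_outcomes(choices, rewards, contingent):
    """Compute learning accuracy, win-stay and lose-shift proportions
    from raw trial sequences for one cue.

    choices    -- list of chosen options per trial, e.g. ['A', 'B', ...]
    rewards    -- list of outcomes per trial (1 = reward, 0 = punishment)
    contingent -- the more rewarded option for this cue
    """
    accuracy = sum(c == contingent for c in choices) / len(choices)
    # Win-stay: after a rewarded trial, did the child repeat the choice?
    stays_after_win = [choices[t + 1] == choices[t]
                       for t in range(len(choices) - 1) if rewards[t] == 1]
    # Lose-shift: after a punished trial, did the child switch?
    shifts_after_loss = [choices[t + 1] != choices[t]
                         for t in range(len(choices) - 1) if rewards[t] == 0]
    win_stay = (sum(stays_after_win) / len(stays_after_win)
                if stays_after_win else None)
    lose_shift = (sum(shifts_after_loss) / len(shifts_after_loss)
                  if shifts_after_loss else None)
    return accuracy, win_stay, lose_shift

acc, ws, ls = behavioral_outcomes(['A', 'A', 'B', 'A'], [1, 0, 0, 1], 'A')
```

Note that the last trial contributes to accuracy but not to win-stay or lose-shift, since there is no subsequent choice to evaluate.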

Reinforcement learning models

As a next step, we used computational modeling to compare learning models reflecting basic heuristic strategies and value-based learning, and to determine the model that best captured children’s trial-by-trial learning behavior. For heuristic strategies, we considered models that reflected a Win-stay-lose-shift (wsls) or a Win-stay (ws) strategy. Win-stay is a heuristic strategy in which the same action is repeated if it led to a positive outcome in the previous trial; Win-stay-lose-shift additionally switches to a different action if the previous outcome was negative. Note that these model-based outcomes are not identical to the win-stay and lose-shift behavior calculated from the raw data. Such model-based measures offer the advantage of discerning the underlying latent cognitive processes with greater nuance, in contrast to classical approaches that directly use raw behavioral data. The models quantified the learning behavior of each individual i for each cue c and trial t. The heuristic models included a weight w that reflected the degree of strategy use. In the case of reward r = 1, w was set to 1 for the chosen option (e.g., choice A) and 0 for the unchosen option (e.g., choice B), thus maximizing win-stay, i.e., choosing A at the subsequent trial t + 1:
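Written out, the weight assignment after a rewarded trial takes the following form (a standard formulation reconstructed from the prose; the original notation may differ slightly):

```latex
w_{i,c,t+1}(A) = 1, \qquad w_{i,c,t+1}(B) = 0 \qquad \text{if } A \text{ was chosen and } r_{i,c,t} = 1
```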

For trials with r = 0 (applicable only to the wsls model), the model weights were reversed, maximizing lose-shift:
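In reconstructed notation consistent with the prose (the original equation may differ slightly), the reversed weights read:

```latex
w_{i,c,t+1}(A) = 0, \qquad w_{i,c,t+1}(B) = 1 \qquad \text{if } A \text{ was chosen and } r_{i,c,t} = 0
```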

The initial weights for both choices were set to w_{i,c,t=1} = 0.5. The weight w then scaled the parameter τ_wsls or τ_ws to estimate individual strategy use during decision-making. The choice probabilities were calculated using the softmax function, e.g., for the chosen option A:
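A standard two-option softmax consistent with this description is (reconstructed; shown here for the wsls model, with τ_ws substituted analogously):

```latex
p(A)_{i,c,t} = \frac{\exp\bigl(\tau_{\mathrm{wsls}}\, w_{i,c,t}(A)\bigr)}{\exp\bigl(\tau_{\mathrm{wsls}}\, w_{i,c,t}(A)\bigr) + \exp\bigl(\tau_{\mathrm{wsls}}\, w_{i,c,t}(B)\bigr)}
```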

Thus, a higher probability of strategy use was reflected by a larger value of τ_wsls or τ_ws.

For value-based learning, we considered a Rescorla-Wagner model and several variants based on our theoretical conceptions. The baseline value-based model vbm1 updated the value v of the selected choice (A or B) for the next trial t + 1. This value update was determined by calculating the difference between the received reward r and the expected value v of the selected choice, which is the reward prediction error. The value update was further scaled by a learning rate α (0 < α < 1):
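This corresponds to the standard Rescorla-Wagner update (reconstructed from the description; shown for a selected choice A):

```latex
v_{i,c,t+1}(A) = v_{i,c,t}(A) + \alpha \bigl( r_{i,c,t} - v_{i,c,t}(A) \bigr)
```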

When the outcome sensitivity parameter ρ (0 < ρ < 20) was included, the reward was additionally scaled at the value update:
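With outcome sensitivity included, the reward term in the update is rescaled by ρ (reconstructed form, for a selected choice A):

```latex
v_{i,c,t+1}(A) = v_{i,c,t}(A) + \alpha \bigl( \rho \, r_{i,c,t} - v_{i,c,t}(A) \bigr)
```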

The inverse temperature parameter τ (0 < τ < 20) was included in the softmax function to compute choice probabilities:
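For the value-based models, the softmax takes the same standard form as for the heuristic models, with values v in place of weights (reconstructed notation):

```latex
p(A)_{i,c,t} = \frac{\exp\bigl(\tau\, v_{i,c,t}(A)\bigr)}{\exp\bigl(\tau\, v_{i,c,t}(A)\bigr) + \exp\bigl(\tau\, v_{i,c,t}(B)\bigr)}
```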

Note, however, that outcome sensitivity and inverse temperature are difficult to fit simultaneously due to non-identifiability issues45. Therefore, models including the inverse temperature fixed outcome sensitivity at 1 (inverse temperature model family), assuming no individual differences in outcome sensitivity. For the outcome sensitivity model family, outcome sensitivity was freely estimated and the inverse temperature was fixed at 1, assuming the same degree of value-based decision behavior across individuals. Although outcome sensitivity is usually restricted to an upper bound of 2 so as not to inflate outcomes at the value update, this configuration led to ceiling effects in outcome sensitivity and non-converging model results. This issue was not resolved when we fixed the inverse temperature at the group mean of 15.47 from the winning inverse temperature family model. It may be that individual differences in outcome sensitivity are more pronounced in children, leading to more extreme values. Therefore, we decided to extend the upper bound to 20, parallel to the inverse temperature, and all our models converged with Rhat < 1.1. Each model family consisted of 4 model variants, vbm1−4 (1α1τ, 2α1τ, 1α2τ, 2α2τ) and vbm5−8 (1α1ρ, 2α1ρ, 1α2ρ, 2α2ρ), in which each parameter was either separated by feedback timing or kept as a single parameter across feedback conditions. Our baseline value-based model vbm1 included a single learning rate and a single inverse temperature (1α1τ).

Parameter estimation

All choice data were fitted in a hierarchical Bayesian analysis using the Stan language in R46,47, adopted from the hBayesDM package48. Posterior parameter distributions were estimated using Markov chain Monte Carlo (MCMC) sampling with 4 chains of 3,000 iterations each, using the first half of each chain as warmup; group-level and individual-level parameters were estimated simultaneously. The hierarchical Bayesian approach provides more stable and reliable parameter estimates than point-estimation approaches such as maximum likelihood estimation49. Each model fit wave 1 and wave 2 data at once, accounting for within-subject dependency by modeling the correlation structure of the same parameter across waves via a Cholesky decomposition. The Cholesky decomposition used a Lewandowski-Kurowicka-Joe (LKJ) prior of 2, and all other group-level parameters had normal priors, Normal(0, 0.5). Non-response trials (wave 1 = 2.41%, wave 2 = 0.97% on average) were excluded in advance.

Model simulation and model-derived learning score

To appropriately interpret the parameter results with respect to optimality, we simulated 5,000,000 individual datasets using 10,000 different parameter value combinations (covering the whole range of each parameter) to identify the optimal parameter combination of the winning model selected by model comparison. In addition, we computed the model-derived mean choice probability of the contingent, i.e., more rewarded, option, which we refer to as the model-derived learning score. This model-derived choice probability differs from the observed empirical choice probability (i.e., the accuracy of selecting the more rewarded option) because it combines the model with the data by incorporating latent information carried by the key learning parameters. Thus, the learning score captures observed behavior based on trial-by-trial latent processes predicted by the value-based models. We used this score as a metric to interpret the fitted posterior parameters in relation to the optimal parameter combination for our probabilistic learning task.
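The logic of a model-derived learning score can be sketched as follows (an illustrative simulation under assumed settings, not the authors' simulation pipeline; the function name is hypothetical):

```python
import math
import random

def simulate_learning_score(alpha, tau, n_trials=32, reward_prob=0.875, seed=0):
    """Simulate one cue with a one-learning-rate, one-inverse-temperature
    Rescorla-Wagner learner and return the mean softmax probability of
    choosing the contingent (more rewarded) option, i.e. a model-derived
    learning score."""
    rng = random.Random(seed)
    v = {'contingent': 0.5, 'other': 0.5}
    probs = []
    for _ in range(n_trials):
        # Softmax probability of selecting the contingent option.
        p = 1.0 / (1.0 + math.exp(-tau * (v['contingent'] - v['other'])))
        probs.append(p)
        choice = 'contingent' if rng.random() < p else 'other'
        p_reward = reward_prob if choice == 'contingent' else 1.0 - reward_prob
        r = 1.0 if rng.random() < p_reward else 0.0
        v[choice] += alpha * (r - v[choice])  # prediction-error update
    return sum(probs) / len(probs)

score = simulate_learning_score(alpha=0.3, tau=5.0)
```

Averaging this score over many simulated datasets per parameter combination yields a map from parameter values to expected performance, from which an optimal combination can be read off.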

Model selection and validation

We conducted a 2-step sequential procedure for model development and model selection. As a first step, we compared model evidence for the baseline value-based model that does not separate learning rate and inverse temperature by feedback timing (vbm1: 1α, 1τ) to the non-value-based, heuristic strategy models that reflect Win-stay or Win-stay-lose-shift strategy behavior (ws, wsls). As a second step, we compared model evidence for 8 value-based model variants, 4 of the model family with learning rate and inverse temperature (1α1τ, 2α1τ, 1α2τ, 2α2τ) and 4 of the model family with learning rate and outcome sensitivity (1α1ρ, 2α1ρ, 1α2ρ, 2α2ρ). This allowed us to test whether children showed separable effects of feedback timing on one of the model parameters. We compared model fit using Bayesian leave-one-out cross-validation and obtained the expected log pointwise predictive density (elpd_loo) using the R package loo50. We further computed model weights (pseudo-BMA+) using pseudo Bayesian model averaging stabilized by Bayesian bootstrap with 100,000 iterations51. To validate our models, we estimated predictive accuracy by comparing one-step-ahead model predictions with the choice data15,52. We performed parameter recovery for the winning model and model recovery by comparing it to a set of models used during model comparison (Supplementary Material 1)53.

Episodic memory at wave 1

We predicted individual corrected recognition memory (hits − false alarms) by feedback condition in a linear mixed effects model using the R package lme444. Only confident (‘sure’) ratings were included in the analysis, which constituted 98.1% of all given responses. A total of 140 children completed the recognition memory test and 138 were included in the analysis, with two excluded due to negative corrected recognition memory values (i.e., poor recognition memory). Age and sex were controlled for as covariates.

Longitudinal brain-cognition links

We used latent change score (LCS) models to examine the longitudinal relationships between brain and learning score measures. LCS models are longitudinal structural equation models that have been widely applied to estimate developmental changes and coupling effects across domains such as brain and cognition54,55. LCS models allow the definition of specific paths between multiple variables to test explicit hypotheses, and they estimate latent change from the observed variables, which accounts for measurement error and increases testing power56. We compiled univariate LCS models for each variable separately (learning scores and brain volumes) to examine whether there was significant individual variance and change, which could then be related within a multivariate LCS model as a next step. Model fit had to be at least acceptable, with a comparative fit index (CFI) > 0.95, a standardized root mean square residual (SRMR) < .08, and a root mean square error of approximation (RMSEA) < .0857. Age and sex were included as covariates at wave 1, as was the estimated total intracranial volume (eTIV) when brain volume was included in the model. Multivariate LCS models allow the estimation of meaningful brain-cognition relationships: a wave 1 covariance between brain and cognition, brain predicting change in cognition or vice versa, and a covariance between the brain and cognition change scores (wave 1 to wave 2). Before compiling the variables into an LCS model, we checked them for outliers beyond ± 4 SD around the mean. We identified one outlier for the learning rate at wave 2, which was removed for the explorative LCS model that included model parameters. There were no further outliers in other cognitive variables or brain volumes. Continuous variables were standardized to the wave 1 measure so that wave 2 values represent the change from wave 1; sex was contrast-coded (girls = 1, boys = −1).

Results

Behavioral results

First, we were interested in whether children showed behavioral differences between waves and feedback timing conditions. A descriptive overview is provided in Table 1 and Figure 2. The details of the reported GLMM models, including the random effects structure and the effects of age and sex, are described in Supplementary Material 2. Since some children were poor learners who failed to reach 50% average accuracy in their last 20 trials (13 children at wave 1 and 6 children at wave 2), we also performed the behavioral analyses with a reduced dataset, in which the results remained unchanged (Supplementary Material 6).

Individual differences in the behavioral reinforcement learning outcomes and their longitudinal change. (A) Accuracy did not differ by feedback timing and increased between waves. (B) Win-stay and lose-shift proportions did not differ by feedback timing; win-stay increased and lose-shift decreased between waves. (C) Reaction time differed by feedback timing, with faster decisions for cues learned with delayed feedback, and reaction times were faster at wave 2 than at wave 1. (D) Correlations between behavioral outcomes reveal that learning accuracy was primarily correlated with the win-stay and lose-shift probabilities both within and between waves, but was uncorrelated with reaction time. Significant correlations are circled; p-values were adjusted for multiple comparisons using Bonferroni correction.

Descriptive behavioral results of dependent variables Accuracy (ACC, probability correct), win-stay probability (WS), lose-shift probability (LS), and reaction time (RT, in seconds), as well as mixed model fixed effects that predicted these dependent variables.

Children’s learning improved between waves

With the complete dataset, we found that learning accuracy (i.e., the probability of choosing the more rewarding option) increased from wave 1 to wave 2, with no differences in accuracy by feedback timing (βwave=2 = .550, SE = .061, z = 8.97, p < .001; βfeedback=delayed = .013, SE = .024, z = 0.54, p = .590). Furthermore, win-stay probability increased and lose-shift probability decreased longitudinally, again without differences by feedback timing (WS: βwave=2 = .586, SE = .071, z = 8.22, p < .001; βfeedback=delayed = .023, SE = .033, z = 0.69, p = .489; LS: βwave=2 = −.252, SE = .037, z = −6.87, p < .001; βfeedback=delayed = .030, SE = .022, z = 1.37, p = .169). Reaction times were faster at wave 2 compared to wave 1, and they were faster for delayed compared to immediate feedback trials (βwave=2 = −221, SE = 22.8, t(dfSatterthwaite = 135) = −9.70, p < .001; βfeedback=delayed = −13.8, SE = 6.59, t(dfSatterthwaite = 136) = −2.10, p = .038). To summarize, children’s average accuracy improved over two years, their win-stay probability increased, and their lose-shift probability decreased between waves. Children responded faster to cues paired with delayed feedback than to cues paired with immediate feedback, and they became faster in their decision-making across waves (see the mixed model effects overview in Table 1). Of note, reaction times were largely uncorrelated with accuracy and switching behavior (win-stay, lose-shift), while accuracy and switching behavior were significantly correlated at both waves (Figure 2D).

Modeling results

Children’s behavior was best described by value-based learning

We conducted a 2-step sequential procedure for model development and model selection. Model comparison using leave-one-out cross-validation showed evidence in favor of the value-based learning model, reflected in the highest expected log pointwise predictive density and the highest model weights, confirming that children’s learning behavior in the longitudinal data was generally better described by a value-based model than by a heuristic strategy model (elpdloo = −15154.9, pseudo-BMA+ = 1, Table 2). Children whose individual fit was better for a heuristic model (wsls) than for the value-based model (vbm1) were, at both waves, more likely to be poor learners (defined as an accuracy below 50% in the last 20 trials). Taken together, children’s learning behavior was best described by a value-based model, and the heuristic strategy model captured more poor learners than the value-based model.
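The model weights used for this comparison are, at their core, normalized exponentiated elpd values. The pseudo-BMA+ weights reported in Table 2 additionally regularize via a Bayesian bootstrap, which this minimal sketch omits:

```python
import numpy as np

def pseudo_bma_weights(elpds):
    """Model weights proportional to exp(elpd), i.e., plain pseudo-BMA.
    (pseudo-BMA+ additionally regularizes via a Bayesian bootstrap.)"""
    elpds = np.asarray(elpds, dtype=float)
    w = np.exp(elpds - elpds.max())  # subtract max for numerical stability
    return w / w.sum()

# toy elpd values: a ~3-point elpd gap already concentrates the weight
weights = pseudo_bma_weights([-15045.3, -15048.2])  # ~[0.95, 0.05]
```

Because elpd enters through an exponential, even modest elpd differences translate into strongly asymmetric weights.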

Model comparison results

Feedback timing modulated choice stochasticity

Model vbm3 (1α2τ) showed the largest model evidence, reflected in the highest expected log pointwise predictive density and the highest model weights, suggesting that feedback timing affected the inverse temperature, but not the learning rate or outcome sensitivity (elpdloo = −15045.3, pseudo-BMA+ = 0.73, Table 2). Table 3 and Figure 3A provide a descriptive overview of the winning model parameters. Of note, there were only small differences in model fit (elpdloo) relative to the second-best model (vbm7, 1α2ρ, Δelpdloo = −2.93, elpd_SEloo = 2.92, pseudo-BMA+ = 0.24), which suggests a potentially separable feedback timing effect on outcome sensitivity. We also performed the model comparison with a reduced dataset, in which the winning model remained the same (Supplementary Materials 6). The average inverse temperature did not differ by feedback condition, but showed large within-person condition differences at both waves, indicating individual differences in feedback timing modulation (wave 1: Δτdel–imm Mean = 0.22, SD = 3.80, Range = 21.74; wave 2: Δτdel–imm Mean = 0.35, SD = 3.70, Range = 24.03). The correlations between the parameters are shown in Supplementary Material 3.
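The structure of the winning model (1α2τ) can be sketched as a standard delta-rule value update shared across conditions, combined with a softmax choice rule whose inverse temperature depends on the feedback condition. The parameter values below are hypothetical placeholders, not the fitted estimates:

```python
import numpy as np

def softmax(q, tau):
    """Softmax choice rule with inverse temperature tau."""
    e = np.exp(tau * (q - q.max()))  # shift by max for numerical stability
    return e / e.sum()

def delta_update(q, choice, reward, alpha):
    """Delta-rule value update with a single learning rate."""
    q = q.copy()
    q[choice] += alpha * (reward - q[choice])
    return q

# 1 alpha, 2 tau: one learning rate shared by both conditions, but
# separate inverse temperatures for immediate vs. delayed feedback cues
# (values here are hypothetical placeholders, not fitted estimates)
alpha = 0.05
tau = {"immediate": 7.0, "delayed": 9.0}

q = np.zeros(2)                    # two options for one delayed-feedback cue
p = softmax(q, tau["delayed"])     # [0.5, 0.5] before any learning
q = delta_update(q, choice=0, reward=1, alpha=alpha)  # q[0] -> 0.05
```

Splitting only τ by condition means the learned values evolve identically across conditions; only how deterministically those values are translated into choices differs.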

(A) Individual differences in the learning rate and inverse temperature of the winning model and their longitudinal change. The inverse temperature τ, but not the learning rate α, was separated by feedback timing, and both parameters increased between waves (top panel). The condition difference in the inverse temperature did not differ from zero on average, but showed individual differences (bottom left panel). (B) The condition difference in the inverse temperature correlated with that in reaction time, i.e., a higher delayed relative to immediate inverse temperature was related to faster delayed relative to immediate reaction times.

Description of model parameters from the winning value-based model vbm3

Since reaction times were predicted by feedback timing behaviorally, and the inverse temperature is assumed to reflect decision-making, we were interested in whether differences in reaction time were related to differences in inverse temperature. Indeed, at both waves, children who responded faster during delayed compared to immediate feedback had a higher inverse temperature for delayed compared to immediate feedback (wave 1: r = −.261, t(df = 138) = −3.18, p = .002; wave 2: r = −.345, t(df = 124) = −4.10, p < .001, Figure 3B). Taken together, children’s learning behavior was best described by a value-based model in which feedback timing modulated individual differences in the choice rule during value-based learning. Interestingly, the differences in the choice rule and in reaction time were correlated. Specifically, more value-guided choice behavior (i.e., a higher inverse temperature) was related to faster responses during delayed relative to immediate feedback, suggesting a link between model parameter and behavior in relation to feedback timing.

Children’s value-based learning became more optimal

Next, we compared the parameter space according to model simulation (Figure 4A) with the empirical posterior parameters fitted by the winning model (Table 3, Figure 4B) to determine whether children’s value-based learning moved towards more optimal parameter combinations. Both fitted and simulated parameter combinations allowed us to derive a learning score that captured learning performance according to the winning value-based model. Note that the learning score was defined as the average choice probability of the more rewarded choice option. We refer to these model-derived choice probabilities as learning scores, since they reflect value-based learning and combine information about learned values, which depend on the learning rate, and about values translated into choice probabilities, which depend on the inverse temperature. Thus, a higher learning score reflects more optimal value-based learning. We simulated 10,000 parameter combinations and created a learning score map across these combinations (Figure 4A). The optimal parameter combination was at a learning rate of α = 0.29 and an inverse temperature of τ = 19.8, with an average learning score of 96.5% (Figure 4A). Children’s fitted learning rates ranged from 0.01 to 0.22 and their inverse temperatures from 6.73 to 18.70, and thus lay outside the parameter space with a learning score above 96% (Table 3 and Figure 4A). The average longitudinal increases in learning rate and inverse temperature were mirrored by average increases in the learning scores, confirming our prediction that children’s parameters developed towards optimal value-based learning (arrow in Figure 4B). We further found that the average longitudinal change in win-stay and lose-shift proportions also developed towards more optimal value-based learning (Supplementary Material 4).
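A learning score map of this kind can be approximated by simulating a Q-learning agent with softmax choice for each parameter combination and averaging the choice probability of the richer option across trials and simulations. The sketch below assumes a two-option 80/20 reward schedule with 32 trials; the paper's exact task schedule may differ:

```python
import numpy as np

rng = np.random.default_rng(0)

def learning_score(alpha, tau, p_reward=0.8, n_trials=32, n_sim=200):
    """Average model-derived probability of choosing the richer option
    for one parameter combination (assumed 80/20 two-option schedule)."""
    scores = np.zeros(n_sim)
    for s in range(n_sim):
        q = np.zeros(2)
        probs = np.zeros(n_trials)
        for t in range(n_trials):
            e = np.exp(tau * (q - q.max()))
            p = e / e.sum()
            probs[t] = p[0]  # option 0 is the richer option
            choice = rng.choice(2, p=p)
            reward = float(rng.random() < (p_reward if choice == 0 else 1 - p_reward))
            q[choice] += alpha * (reward - q[choice])  # delta-rule update
        scores[s] = probs.mean()
    return scores.mean()

# a learning rate near the simulated optimum with near-greedy choice
# should clearly outperform a very low learning rate with noisy choice
hi = learning_score(alpha=0.3, tau=20.0)
lo = learning_score(alpha=0.01, tau=1.0)
```

Evaluating this function over a grid of (α, τ) pairs yields a map analogous to Figure 4A, with one average learning score per parameter combination.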

(A) The model simulation depicts parameter combinations and simulation-based average learning scores. The cyan “X” at the middle top depicts the optimal parameter combination, where the average learning score was 96.5%, and the cyan rectangle depicts the space of the fitted parameter combinations. (B) Enlarged view of the space of fitted parameter combinations. The colored arrows depict mean change (bold arrow) and individual change (transparent arrows) of the fitted parameters. The greyscale gradient-filled dots connected by the arrows depict the individual learning scores, while the greyscale gradient in the background depicts the simulated average learning score. The mean change reveals an overall shift towards higher, i.e., more optimal, learning scores. (C) One-step-ahead posterior predictions of the winning model for each wave. The colored lines depict averaged trial-by-trial task behavior for each feedback condition, and the cyan ribbon indicates the 95% highest density interval of the one-step-ahead prediction using the entire posterior distribution.

Model validation

To validate our winning model vbm3, we estimated its predictive accuracy by comparing one-step-ahead model predictions with the choice data. The one-step-ahead predictions of the winning model captured children’s choices well overall, with predictive accuracies of 65.3% at wave 1 and 75.7% at wave 2 (Figure 4C). Further, our winning model showed good parameter recovery for the learning rate (r = 0.85) and the inverse temperature (r = 0.75 – 0.77). Our winning model also showed excellent model recovery on the group level (100%) when compared to a set of models used during model comparison (vbm1, vbm7, wsls). The individual model recovery was lower (58%), with 35% of the data simulated from the winning model being best fit by our baseline model vbm1 with a single inverse temperature, which likely reflects the noisy property of the inverse temperature (Supplementary Material 1).
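One-step-ahead predictive accuracy can be illustrated as follows: at each trial, the model predicts the choice with the higher model-derived probability given all preceding choices and outcomes, and the predictions are scored against the observed choices. This is a simplified point-estimate sketch with hypothetical parameter values; the paper's predictions used the entire posterior distribution:

```python
import numpy as np

def predictive_accuracy(choices, rewards, alpha, tau):
    """One-step-ahead predictive accuracy: predict the most probable
    choice given all preceding trials, then score against the data."""
    q = np.zeros(2)
    n_correct = 0
    for choice, reward in zip(choices, rewards):
        e = np.exp(tau * (q - q.max()))
        p = e / e.sum()
        n_correct += int(np.argmax(p)) == choice   # prediction before update
        q[choice] += alpha * (reward - q[choice])  # update after observing
    return n_correct / len(choices)

# toy data with hypothetical parameter values
choices = [0, 0, 1, 0, 0]
rewards = [1, 1, 0, 1, 1]
acc = predictive_accuracy(choices, rewards, alpha=0.1, tau=8.0)  # 0.8
```

Because each prediction is made before the corresponding trial's update, this metric measures genuine out-of-sample predictive ability on the trial level rather than post-hoc fit.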

Longitudinal brain-cognition links

Significant longitudinal change in brain and cognition

We first performed univariate LCS model analyses to estimate a latent change score of immediate and delayed learning scores as well as striatal and hippocampal volumes (see descriptive changes in Figure 5B-C). All four variables of interest showed significant positive mean changes and variances, and all univariate models provided a good fit to the data (Supplementary Material 5). This allowed us to further relate the differences in structural brain changes to changes in learning.

(A) Recognition memory (corrected recognition = hits − false alarms) for objects presented during delayed feedback was only enhanced at trend level. (B) Learning scores depicted here were used in the LCS analyses. Learning scores were the model-derived choice probabilities of the contingent choice using fitted posterior parameters. (C) Hippocampal and striatal volumes increased between waves, with hippocampal volume increasing most. (D) A four-variate latent change score (LCS) model that included striatal and hippocampal volumes as well as immediate and delayed learning scores. Depicted are significant cross-domain paths (brain-cognition, dashed lines) and within-domain paths (brain or cognition, solid lines); other paths are omitted for visual clarity and are summarized in Table 4. Depicted brain-cognition links included the covariance between striatal volume and the immediate learning score at wave 1, as well as the covariances between hippocampal and striatal volumes and the delayed learning score at wave 1. Brain links included the wave 1 covariance and the change-change covariance, and cognition links similarly included the wave 1 covariance and the change-change covariance. Covariates included age, sex, and estimated total intracranial volume. ** denotes significance at α < .001, * at α < .05.

Parameter estimates of a four-variate latent change score model that includes brain (striatal and hippocampal volume) and cognition domains (immediate and delayed learning score)

Hippocampal volume exhibited more protracted development during middle childhood

We next fitted a bivariate LCS model to compare striatal and hippocampal change scores. We theorized that by middle childhood, the striatum would be relatively mature, whereas the hippocampus continues to develop. We progressively constructed multiple LCS models to test this idea. First, the bivariate LCS model provided a good fit to the data (χ² (14) = 10.09, CFI = 1.00, RMSEA (CI) = 0 (0 – .06), SRMR = .04). We then fitted two constrained models to test whether setting the mean striatal change or the mean hippocampal change to 0 would lead to a drop in model fit. Compared to the unrestricted model, the constrained model that assumed no striatal change did not lead to a drop in model fit (Δχ² (1) = 2.74, p = .098), whereas the model that assumed no hippocampal change did (Δχ² (1) = 12.69, p < .001). Finally, we tested the more stringent assumption of equal change for striatal and hippocampal volumes; this model also dropped in fit compared to the unrestricted model (Δχ² (1) = 18.04, p < .001), suggesting that striatal and hippocampal change differed. Together, these results support our postulation of separable maturational brain trajectories in our study sample, suggesting that the hippocampus continued to grow in middle childhood, whereas striatal volume increased less.
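These nested model comparisons rest on chi-square difference (likelihood-ratio) tests. The reported p-values follow directly from the Δχ² statistics and their degrees of freedom, for instance:

```python
from scipy.stats import chi2

def chi2_difference_p(delta_chi2, delta_df):
    """p-value of a nested-model chi-square difference test."""
    return chi2.sf(delta_chi2, delta_df)

# values reported for the constrained bivariate LCS models
p_striatum = chi2_difference_p(2.74, 1)      # ~.098: no significant drop in fit
p_hippocampus = chi2_difference_p(12.69, 1)  # < .001: significant drop in fit
```

The same test applies to each constrained-versus-unrestricted comparison reported in this and the following sections.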

Hippocampal and striatal volume showed distinct associations to learning

We fitted a four-variate LCS model to test our prediction of selective brain-cognition links. Specifically, we assumed a larger contribution of striatal volume to immediate learning and a larger contribution of hippocampal volume to delayed learning. The LCS model provided a good fit to the data (χ² (27) = 15.4, CFI = 1.00, RMSEA (CI) = 0 (0 – .010), SRMR = .045), and all relevant paths are shown in Figure 5D (see Table 4 for a detailed model overview). For the striatal associations to cognition, we found that wave 1 striatal volume covaried with both the immediate learning score and the delayed learning score (estimate = 0.19, z = 2.52, SE = 0.07, p = .012; estimate = 0.18, z = 2.37, SE = 0.07, p = .018). Constraining the striatal association to immediate learning to 0 worsened the model fit relative to the unrestricted model (Δχ² (1) = 5.66, p = .017), as did constraining the striatal association to delayed learning to 0 (Δχ² (1) = 5.14, p = .023). In summary, larger striatal volume was associated with better learning scores for both immediate and delayed feedback. This pattern remained the same in the results of the reduced dataset (Supplementary Material 6).

Hippocampal volume, on the other hand, covaried only with the delayed learning score at wave 1 (estimate = 0.14, z = 2.05, SE = 0.07, p = .041), not with the immediate learning score (estimate = 0.12, z = 1.68, SE = 0.07, p = .092). Fixing the path between hippocampal volume and delayed learning to 0 worsened the model fit relative to the unrestricted model (Δχ² (1) = 4.19, p = .041), whereas constraining its path to immediate learning to 0 did not (Δχ² (1) = 2.94, p = .086). This suggests that larger hippocampal volume was specifically associated with better delayed learning. In the results of the reduced dataset, the hippocampal association with the delayed learning score was no longer significant, suggesting a weakened pattern when excluding poor learners (Supplementary Material 6). It is likely that the exclusion reduced the group variance in hippocampal volume and delayed learning score in the model. As a next step, the associations of the striatum and hippocampus with immediate or delayed learning were directly compared against each other. A model equal-constraining the striatal and hippocampal paths to immediate learning (Δχ² (1) = 0.41, p = .521) and another model equal-constraining these paths to delayed learning (Δχ² (1) = 0.14, p = .707) did not lead to a worse model fit compared to the unrestricted model, which suggests that the brain-cognition links overlap considerably. This is in line with the high wave 1 covariance and change-change covariance within the brain and cognition domains (see Table 4). We found no longitudinal links between the brain and cognition domains, which suggests that the brain-cognition links found at wave 1 remained longitudinally stable (see Supplementary Material 5 for an exploratory LCS model that related the model parameters to striatal and hippocampal volume).

Taken together, the confirmatory LCS model results were in line with our predictions of a relatively larger involvement of the hippocampus during delayed feedback learning, but the findings on striatal volume disconfirmed a selective association with immediate feedback learning and suggest a more general role of the striatum in both learning conditions.

No evidence for enhanced episodic memory during delayed feedback

Finally, we investigated whether a hippocampal contribution during delayed feedback would selectively enhance episodic memory. Episodic memory, as measured by individual corrected object recognition memory (hits − false alarms) for confident (“sure”) ratings, showed a trend toward better memory for items shown in the delayed feedback condition (βfeedback=delayed = .009, SE = .005, t(df = 137) = 1.80, p = .074, see Figure 5A). Note that in the reduced dataset, delayed feedback significantly predicted enhanced item memory (Supplementary Material 6). The inclusion of poor learners in the complete dataset may have weakened this effect because their hippocampal function was worse and was not involved in learning (nor encoding), regardless of feedback timing. To summarize, there was inconclusive support for enhanced episodic memory during delayed compared to immediate feedback, calling for future studies to test the postulation of a selective association between hippocampal volume and delayed feedback learning.
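The corrected recognition score used here is simply the hit rate minus the false-alarm rate. A minimal sketch with hypothetical item counts:

```python
def corrected_recognition(hits, n_old, false_alarms, n_new):
    """Corrected recognition: hit rate minus false-alarm rate."""
    return hits / n_old - false_alarms / n_new

# hypothetical counts: 20/30 old items recognized, 6/30 new items endorsed
cr = corrected_recognition(hits=20, n_old=30, false_alarms=6, n_new=30)  # ~0.47
```

Subtracting the false-alarm rate corrects for a liberal response bias, so that children who simply say "old" to everything do not receive inflated memory scores.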

Discussion

In this study, we examined the longitudinal development of value-based learning in middle childhood and its associations with striatal and hippocampal volumes, which were predicted to differ by feedback timing. Children improved their learning over the 2-year study period. Behaviorally, learning improved through an increase in accuracy and a reduction in reaction time (i.e., faster responses). Further, children’s switching behavior improved through an increase in win-stay and a decrease in lose-shift behavior. Computationally, learning was enhanced by increases in the learning rate and inverse temperature, which together constituted more optimal value-based learning. Further, feedback timing specifically modulated the inverse temperature. In terms of brain structure, we found that longitudinal changes in hippocampal volume were larger than those in striatal volume, which suggests more protracted hippocampal maturation. The brain-cognition links were longitudinally stable and partially confirmed our hypotheses. In line with previous adult literature and our assumption, hippocampal volume was more strongly associated with delayed feedback learning. Contrary to our expectations, episodic memory performance was not enhanced for delayed compared to immediate feedback. Furthermore, striatal volume was unexpectedly associated with both immediate and delayed feedback learning, suggesting a common involvement of the striatum in value-based learning across timescales in middle childhood.

Children’s learning improvement between waves was described behaviorally by increased win-stay and decreased lose-shift behavior. Our finding is in line with cross-sectional studies in the developmental literature that reported increased learning accuracy and win-stay behavior58,59. Our longitudinal dataset with younger children further suggests that learning change is accompanied not only by increased win-stay but also by decreased lose-shift behavior. We found lower learning performance and less optimal switching behavior in girls compared to boys, which could point to sex differences in reinforcement learning during middle childhood (Supplementary Material 2). Previous studies have found both male and female advantages depending on age and the type of learning task38,60,61. Alternatively, the sex differences may have been driven by confounding variables not included in the analysis.

Computationally, we found longitudinally increased and, as shown by simulation data, more optimal learning rates and inverse temperatures, which add to the growing literature on developmental reinforcement learning16. Adult studies that examined feedback timing during reinforcement learning reported average learning rates ranging from 0.12 to 0.345,13,14, which are much closer to the simulated optimal learning rate of 0.29 than children’s average learning rates of 0.02 and 0.05 at waves 1 and 2 in our study. Therefore, it is likely that individuals approach adult-like optimal learning rates later, during adolescence. However, the differences in learning rates across studies have to be interpreted with caution, as differences in the task and the analysis approach may limit their comparability15,27. Task properties such as the number of trials per condition differed across studies. Our study included 32 trials per cue in each condition, while in adult studies, the trials per condition ranged from 28 to 1005,13,14. Optimal learning rates in a stable learning environment were around 0.25 for 10 to 30 trials15, while another study reported a lower optimal learning rate of around 0.08 for 120 trials62. This may partly explain why our design of 32 trials per condition and cue called for a relatively high optimal learning rate of 0.29, while in other studies, optimal learning rates may be lower. Regarding differences in the analysis approach, the hierarchical Bayesian estimation used in our study produces more reliable results than maximum likelihood estimation49, which had been used in some of the previous adult studies and may have biased results towards extreme values. Taken together, our study underscores the importance of using longitudinal data to examine developmental change, as well as the importance of simulation-based optimal parameters for interpreting the direction of developmental change.

Despite a relatively immature hippocampal structure in middle childhood, our results confirmed a longitudinally stable association between hippocampal volume and delayed feedback learning. However, episodic memory in this learning condition was not enhanced. This suggests a developmentally early hippocampal contribution to value-based learning during delayed feedback that does not modulate episodic memory as much as in adults. Therefore, our study partially extends the findings from the adult literature to middle childhood5,1214. The reduced effect of delayed feedback on episodic memory may be due to the protracted course of hippocampal maturation. In an aging study with a similar task, older adults failed to exhibit enhanced episodic memory for objects presented during delayed feedback trials, and they showed no enhanced hippocampal activation during delayed feedback14. Therefore, the findings converge nicely for both childhood and older adulthood, periods during which the structural and functional integrity of the hippocampus is known to be less optimal than in younger adulthood6365.

Our predicted brain-cognition links were only partially confirmed, as striatal volume was associated not just with immediate learning scores, as we predicted, but also with delayed learning scores. This result suggests that the striatum may be important for value-based learning in general rather than selectively associated with immediate feedback learning. This is also what we found in an explorative analysis, in which the striatum was related to the learning rate in general and further predicted longitudinal change in the learning rate (Supplemental Material 5). This overall reduced brain-behavior specificity could reflect less differentiated memory systems during development, similar to findings from aging research, in which older adults exhibited stronger striatal and hippocampal co-activation during both implicit and explicit learning, compared to more dissociable brain-behavior relationships in younger adults66. Interestingly, even in young adults, clear dissociations between memory systems, such as those seen in non-human lesion studies, are uncommon, and factors like stress modulate their cooperative interaction6,10,11,67,68. Further, there are methodological differences from previous studies that could explain why striatal volume was not uniquely associated with immediate learning in our study. For example, previous studies related reward prediction errors to striatal and hippocampal activation5,13,14, whereas we examined individual differences in brain structure and model-derived learning scores. Future functional neuroimaging studies with children could further clarify whether children’s memory systems are indeed less differentiated and explain the attenuated modulation by feedback timing. Taken together, compared to the adult literature, our results with children showed that hippocampal structure was associated with delayed feedback learning but did not enhance episodic memory encoding, while the striatum supported value-based learning in general.
These findings point towards a developmental effect of less differentiated and more cooperative memory systems in middle childhood.

Our computational modeling results revealed a separable effect of feedback timing on the inverse temperature, which suggests that the memory systems modulated learning at the decision-making stage. The reported behavioral differences in reaction time and their correlation with the inverse temperature further support the idea of a decision-related mechanism, as children responded faster during delayed feedback trials, and faster-responding children also exhibited more value-guided choice behavior (i.e., a higher inverse temperature) during delayed compared to immediate feedback. The hippocampus may contribute to a decision-related effect in the delayed feedback condition by facilitating the encoding and retrieval of learned values69. This contrasts with previous event-related fMRI and EEG studies reporting feedback timing modulations at value update5,13,14, which may be due to at least two reasons. First, we did not include a functional brain measure to examine differential engagement during the choice and feedback phases. Second, in such a reinforcement learning task, disentangling model parameters of the choice and feedback phases, such as the inverse temperature and outcome sensitivity, can be challenging70. Taken together, hippocampal engagement during delayed feedback may enhance outcome sensitivity as well as facilitate choice behavior through improved retrieval of action-outcome associations. A mechanism facilitating retrieval seems especially relevant in our paradigm, where multiple cues were learned and presented in a mixed order, creating a high memory load. To summarize, our results suggest that feedback timing could modulate decision-making in addition to, or as an alternative to, a mechanism at value update. However, disentangling the effects of inverse temperature and outcome sensitivity is challenging and warrants careful interpretation.
Future studies might shed new light by examining neural activations at both task phases, by additionally modeling reaction times using a drift-diffusion approach, or by choosing a task design that allows independent manipulations of these phases and associated model parameters, e.g., by using different reward magnitudes during reinforcement learning, or by studying outcome sensitivity without decision-making.

One aim of developmental investigations is to identify the emergence of brain and cognition dynamics, such as the hippocampal-dependent and striatal-dependent memory systems, which have been shown to engage during reinforcement learning depending on the delay in feedback delivery. Our longitudinal study partially confirmed these brain-cognition links in middle childhood, but with less specificity than previously found in adults.

An early-existing memory system dynamic, similar to that of adults, is relevant for applying reinforcement learning principles at different timescales. In scenarios such as the classroom, a teacher may comment on a child’s behavior immediately after an action or some moments later, on par with our experimental manipulation of 1 second versus 5 seconds. Within such a short range of delay in teachers’ feedback, children’s learning ability during the first years of schooling may function equally well and depend on the striatal-dependent memory system. However, we anticipate that the reliance on the hippocampus will become even more pronounced when feedback is delayed further. Children’s capacity for learning over longer timescales relies on the hippocampal-dependent memory system, which is still under development. This knowledge could help to better structure learning according to children’s development. Furthermore, probabilistic learning from delayed feedback may be a potential diagnostic tool to examine the hippocampal-dependent memory system during learning in children at risk. Environmental factors such as stress11 and socioeconomic status39,71 have been shown to affect hippocampal structure and function and may contribute to a heightened risk for psychopathology in the long term7274. Deficits in hippocampal-dependent learning may be particularly relevant to psychopathology, since dysfunctional behavior may arise from a tendency to prioritize short-term consequences over long-term ones75,76 and from the maladaptive application of previously learned behavior in inappropriate contexts77. Interestingly, poor learners showed relatively less value-based learning in favor of stronger reliance on simple heuristic strategies, and excluding them modulated the hippocampal-dependent associations with learning and memory in our results. More studies are needed to further clarify the relationship between the hippocampus and psychopathology during cognitive and brain development.
Another key question is whether developmental trajectories observed cross-sectionally are also confirmed by longitudinal results, such as for the learning rate and inverse temperature. Our results show developmental improvements in these learning parameters within only two years. This suggests that the initial two years of schooling constitute a dynamic period for feedback-based learning, in which contingent feedback is important in shaping behavior and development.

Additional Information

Funding. This study was supported by the Jacobs Foundation [grant 2014–1151] to YLS and CH. The work of YLS was also supported by the European Union (ERC-2018-StG-PIVOTAL-758898), the Deutsche Forschungsgemeinschaft (German Research Foundation, Project ID 327654276, SFB 1315,’Mechanisms and Disturbances in Memory Consolidation: From Synapses to Systems’), and the Hessisches Ministerium für Wissenschaft und Kunst (HMWK; project ‘The Adaptive Mind’).

Acknowledgements

We thank the Max Planck Institute for Human Development and all members of the Jacobs study team for their vital contribution, and all participants and family members for taking part in the study.

Conflicts of interest

The authors declare no competing financial interests.

Ethics approval

This study was approved by the “Deutsche Gesellschaft für Psychologie” ethics committee (YLS_012015).

Availability of data and code. https://osf.io/pju65/