Introduction

Intelligent behavior requires the ability to adapt to a constantly changing environment. For example, foraging animals must be able to track the changing abundance or scarcity of food resources in different locations and at different timescales. Motor control demands the ability to control limbs that constantly vary in their dynamics (due to fatigue, injury, growth, etc.). Human competitors in games or sports of all varieties must be able to learn and adapt to the changing strategies of their opponents.

To understand the mechanisms of these abilities, researchers have examined how (and how well) human agents learn option values and track dynamic changes in those values in a volatile reversal learning task (Behrens et al., 2007). Unlike the traditional probabilistic reversal learning task, in which the reward probabilities of the two options switch only once (Cools et al., 2002), this paradigm includes two volatility conditions (see Fig. 1B): the reward probabilities of the two options remain constant in one condition (i.e., the stable condition) and switch periodically in the other (i.e., the volatile condition).

Schematic diagram of the experimental task in Gagne et al. (2020).

A. On each trial, participants were shown two stimuli (with the potential reward magnitudes presented) and were instructed to choose one of them to receive feedback. Only one stimulus yielded a reward.

B. Each run is evenly divided into two blocks: a stable and a volatile block. During the stable block, the feedback probability does not change, while in the volatile block it flips every 20 trials.

Previous studies often summarized human behavior in this paradigm using the learning rate, a parameter describing how efficiently current information is used to update learning. The learning rate in essence serves as an abstract description of human learning behavior and often exhibits locality (Behrens et al., 2007; Boyd & Vandenberghe, 2004). Hence, analyses of learning rates are usually contingent on the context, with researchers fitting a separate learning rate to each context. Using this method, previous studies have found that humans can flexibly adapt to changes in environmental volatility, increasing the learning rate in volatile conditions and reducing it in stable conditions. Impaired flexibility in adjusting the learning rate according to environmental volatility has also been observed in individuals with several psychiatric conditions, including anxiety and depression (Behrens et al., 2007; Browning et al., 2015; Gagne et al., 2020). This hallmark has also been linked to psychosis (Powers et al., 2017) and autism spectrum disorder (Lawson et al., 2017). Nevertheless, there are two limitations to this context-dependent method. First, as the number of contexts increases, the number of parameters can grow dramatically, increasing the risk of over-parameterization. Second, it can be challenging to interpret the learning rate in a normative sense: the quality of a learning rate does not grow monotonically with its value but rather peaks within a moderate range, so a higher learning rate is not always better. Moreover, it remains unclear what the learning rate is associated with in the human brain.

The goal of the present work is to offer an alternative, relatively context-independent account of human reinforcement learning. Instead of attributing behavioral differences to the learning rate, we focus on a less-examined subprocess: decision-making. The decision process describes how individuals strategically use their knowledge of the environment to generate responses. We formalize this process with a hybrid model, referred to as the mixture-of-strategy (MOS) model, which weights and sums over several strategies. The weighting parameters reflect participants' decision preferences (Daw et al., 2011; Fan et al., 2023). As we will show, this model offers a parsimonious explanation for human behavioral responses across varying levels of environmental volatility, using a single set of weighting parameters across contexts.

We base the MOS model on the principle of resource rationality, which posits that human decision-making should trade off reward maximization against the consumption of cognitive resources (Gershman et al., 2015; Griffiths et al., 2015). Three strategies are included in the decision pool. First, we consider the optimal strategy, Expected Utility (EU), which guides decision-making based on the expected utility of each option (calculated as the feedback probability multiplied by the reward magnitude) (Von Neumann & Morgenstern, 1947). The EU strategy yields the maximum amount of reward, but the utility calculation itself consumes substantial cognitive resources. Alternatively, humans may choose simpler strategies: the magnitude-oriented (MO) strategy, in which only reward magnitude is considered during the decision process, and the habitual (HA) strategy, in which people simply repeat choices frequently made in the past, regardless of reward magnitude (Wood & Runger, 2016). These heuristic strategies sacrifice potential reward but reduce the cognitive cost of decision-making. We use the preference for these decision strategies to approximate participants' reward-effort tradeoff in the volatile reversal task. Relying heavily on the EU strategy is more cognitively demanding than any mixture that includes the simpler strategies. We expected that individuals with psychiatric diseases would be less likely to use the EU strategy because they are known to have diminished cognitive resources (Cohen et al., 2014; Harvey et al., 2005; Levens et al., 2009; Moran, 2016).

In this study, we apply and examine the MOS model on a dataset previously reported by Gagne et al. (2020). Our analysis reveals that, compared to healthy controls, patients with anxiety and depression exhibit a weaker tendency for the optimal EU strategy and a stronger preference for the simpler MO strategy, consistent with the reduced-resource hypothesis in psychiatric diseases. Furthermore, we demonstrate that this pattern of strategy preference readily accounts for several learning phenomena observed in prior research. Our work offers an alternative explanation for the effects of environmental volatility on human learning. Meanwhile, it underscores the importance of identifying behavioral markers to differentiate between explanations related to learning rate and decision strategy.

Methods and Materials

Datasets

We focused on the data from Experiment 1 reported in Gagne et al. (2020). The data are publicly available at https://osf.io/8mzuj/. The original study included data from two experiments. The data from Experiment 2 were not used here because that experiment was conducted on Amazon's Mechanical Turk with no information about the participants' clinical diagnoses. Here, we provide the critical information about Experiment 1 (see Gagne et al. (2020) for more technical details).

Participants

Eighty-six participants took part in this experiment. The sample included 20 patients with major depressive disorder (MDD), 12 patients with generalized anxiety disorder (GAD), and 24 healthy control participants. Diagnoses were made through a phone screen, an in-person screening session, and the Structured Clinical Interview for DSM-IV-TR (SCID). Thirty additional participants who reported no history of psychiatric or neurological conditions were recruited without the SCID. In this article, we regrouped the MDD and GAD individuals into a patient (PAT) group and the remaining 54 participants into a healthy control (HC) group. The detailed difference between MDD and GAD is not the focus of this paper; we will show later that the general factor underlying MDD and GAD is the only factor that predicts learning behavior (see the next section for details), similar to the result reported in the original study (Gagne et al., 2020).

Clinical measures

The severity of anxiety and depression in all participants was measured by several standard clinical questionnaires, including the Spielberger State-Trait Anxiety Inventory (STAI form Y; Spielberger CD, 1983), the Beck Depression Inventory (BDI; Beck et al., 1961), the Mood and Anxiety Symptoms Questionnaire (MASQ; Clark & Watson, 1991; Watson & Clark, 1991), the Penn State Worry Questionnaire (Meyer et al., 1990), the Center for Epidemiologic Studies Depression Scale (CESD; Radloff, 2016), and the Eysenck Personality Questionnaire (EPQ; Eysenck & Eysenck, 1975). An exploratory bifactor analysis was then applied to item-level responses to disentangle the variance that is common to GAD and MDD from the variance unique to each. This analysis summarized participants' symptoms into three orthogonal factors: a general factor (g) explaining the common symptoms, a depression-specific factor (f1), and an anxiety-specific factor (f2). As in the original study, we used these three factors to index the severity of participants' psychiatric symptoms.

Stimuli and behavioral task

This task is a volatile reversal learning task (see Fig. 1A). On each trial, participants were instructed to choose between two stimuli in order to receive feedback. There were two types of feedback: participants received points or money in the reward condition and an electric shock in the punishment condition. The potential amount of reward or the intensity of the electric shock (i.e., the feedback magnitude) was presented together with the stimuli, but only one of the two stimuli would yield the feedback. The participant received the feedback only after choosing the correct stimulus and received nothing otherwise. The feedback magnitude, ranging from 1 to 99, was sampled uniformly for each stimulus from trial to trial. Each run consisted of 180 trials evenly divided into a stable and a volatile block (Fig. 1B). In the stable block, the dominant stimulus (i.e., the stimulus that yields the feedback with the higher probability) provided feedback with a fixed probability of 0.75, while the other stimulus yielded feedback with a probability of 0.25. In the volatile block, the dominant stimulus's feedback probability was 0.8, but the identity of the dominant stimulus switched every 20 trials. Hence, this design required participants to actively learn and infer the changing stimulus-feedback contingency in the volatile block. The whole experiment included two runs each for the two feedback conditions. Seventy-nine participants completed both feedback conditions, 4 participants completed only the reward condition, and 3 participants completed only the punishment condition.

Computational Modeling

Each participant in the experiment must address two fundamental challenges: 1) decision-making, by adhering to a strategy that determines which action maximizes benefit; and 2) learning, by inferring the unobserved feedback probability from feedback.

Before formalizing each challenge, we introduce our notation. We denote each stimulus s as one of two possible states s ∈ {s1, s2}, where s1 refers to the left stimulus and s2 to the right one. The presented feedback magnitude (i.e., reward points or shock intensity) of a stimulus is m(s), and its feedback probability is ψ(s). Following the convention in reinforcement learning (Sutton & Barto, 2018), we presume that the decision is made from a policy π that maps the observed magnitudes m and the currently maintained feedback probabilities ψ to a distribution over stimuli, π(s|m, ψ). The construction of the policy varies between models (see below).

The mixture-of-strategy (MOS) model

The key signature of the hybrid MOS model is that its policy consists of a mixture of three strategies: expected utility (EU), magnitude-oriented (MO), and habitual (HA). The EU strategy postulates that human agents rationally calculate the value of each stimulus and use the softmax rule to select an action. In this case, the value of a stimulus should be its expected utility: m(s)ψ(s).

The probability of choosing a stimulus s thus follows a softmax function.
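For concreteness, a standard two-option softmax over the expected utilities (written here as an illustrative sketch) is

πEU(s) = exp(β m(s)ψ(s)) / [exp(β m(s1)ψ(s1)) + exp(β m(s2)ψ(s2))],  for s ∈ {s1, s2},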

where β is the inverse temperature that maps the utilities onto a Bernoulli (two-choice) distribution. For simplicity, we rewrite Eq. 1 in the following form:
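Because there are only two stimuli, the softmax is equivalent to a logistic function of the utility difference; a sketch of this equivalent form is

πEU(s1) = σ(β [m(s1)ψ(s1) − m(s2)ψ(s2)]),  πEU(s2) = 1 − πEU(s1),

where σ(x) = 1/(1 + exp(−x)).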

Different from the EU strategy, the MO strategy postulates that observers focus only on the feedback magnitude m(s), disregarding the feedback probability ψ(s). This is certainly an irrational strategy, but a more economical one in terms of cognitive effort. Feedback magnitudes are explicitly shown with the stimuli on each trial and are readily available for any related computation, whereas the feedback probability, as a latent variable, requires trial-by-trial learning and inference, which is more cognitively demanding. The MO strategy is defined as,
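a minimal sketch of which, applying the same softmax to the magnitudes alone, is

πMO(s) = exp(β m(s)) / [exp(β m(s1)) + exp(β m(s2))],  for s ∈ {s1, s2}.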

Like the EU strategy, the MO strategy is converted into a Bernoulli distribution by passing through a softmax function. This step is necessary because a hybrid model is difficult to interpret when its components follow heterogeneous distributions; placing all strategies on a common probabilistic footing enhances the model's interpretability.

Unlike EU and MO, the HA strategy depends on neither the feedback magnitude m(s) nor the feedback probability ψ(s). The HA strategy reflects the tendency to repeat previously frequent choices, a phenomenon sometimes called the hot-hand bias (Gilovich et al., 1985) or perseveration (Gershman, 2020; Wood & Runger, 2016) in the literature. For example, if an agent chose the left stimulus more often in past trials, she forms a preference for the left stimulus on future trials. We constructed it as a Bernoulli distribution over the two stimuli, πHA(s) (henceforth called the habitual distribution). The trial-by-trial update rule of πHA(s) is detailed in Eqs. 5-6 below.

We implemented the hybrid policy as a linear mixture of the three strategies, following the methods used in Daw et al. (2011),
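which in explicit form is the weighted sum of the three strategy policies,

π(s) = wEU πEU(s) + wMO πMO(s) + wHA πHA(s),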

where wEU, wMO, and wHA are the weighting parameters of the three strategies. The weighting parameters sum to 1, i.e., wEU + wMO + wHA = 1. We can thus describe the policy an observer adopted simply by examining the weighting parameters.

Next, we modeled the second challenge — the probabilistic learning process. Two distributions — the feedback probability and the habitual probability — are learned and updated in a trial-by-trial fashion. We updated the feedback probability according to the outcome of the left stimulus s1:
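A minimal Rescorla-Wagner-style sketch of this update, assuming the two probabilities are complementary, is

ψ(s1) ← ψ(s1) + αψ [O(s1) − ψ(s1)],  with ψ(s2) = 1 − ψ(s1),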

where αψ is the learning rate and O(⋅) is an indicator function that returns 1 for the stimulus that actually yielded the feedback and 0 otherwise. Intuitively, the stimulus that induced a reward is reinforced and its feedback probability is enhanced. This update equation is the standard form of the well-known Rescorla-Wagner model (Rescorla, 1972). To keep consistent with Gagne et al. (2020), we also explored valence-specific learning rates in some models,
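a sketch of which gates the learning rate on the sign of the prediction error (our reading of this convention):

αψ = αψ+ if O(s1) − ψ(s1) ≥ 0, and αψ = αψ− otherwise.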

The habitual distribution is updated in a similar manner.
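A sketch of this update, mirroring Eq. 5, is

πHA(s) ← πHA(s) + απ [A(s) − πHA(s)],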

where απ (denoted αHA in the parameter sets below) is the learning rate of the habitual distribution, and A(⋅) is an indicator function that returns 1 for the stimulus chosen on the current trial. Intuitively, the stimulus chosen on each trial, regardless of its feedback, is reinforced via Eq. 6.

We developed two variants of each model: a context-free and a context-dependent variant. The context-free variant MOS6 has six parameters ξ = {β, αHA, αψ, wEU, wMO, wHA}; this variant does not include valence-specific learning rates. The context-dependent variant MOS22 has 22 parameters. Among them, β and αHA are context-free parameters that are held constant across all contexts, whereas {αψ+, αψ−, wEU, wMO, wHA} are context-dependent parameters fitted separately to each context. We discuss the model-fitting details in a later section.

The flexible learning rate (FLR) model

The FLR model refers to Model 11 (i.e., the best-fitting model) in Gagne et al. (2020). Here, we describe the FLR model using the same notation system as the published paper, which differs slightly from the notation of the MOS model. The FLR model expresses the probability of selecting the left stimulus s1 as,
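one plausible form of which, sketched under the assumption that the habit term enters additively as the difference in habit strength between the two stimuli, is

π(s1) = σ(β v + βHA [πHA(s1) − πHA(s2)]),  with σ(x) = 1/(1 + exp(−x)),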

where β and βHA are the inverse temperature parameters for the value of the left stimulus and for the HA strategy, respectively. The value v of the left stimulus represents the advantage of s1 over s2,
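which, following the verbal description below and assuming the complementary weight (1 − λ) on the magnitude term, can be sketched as

v = λ [ψ(s1) − ψ(s2)] + (1 − λ) sign(m(s1) − m(s2)) |m(s1) − m(s2)|^r,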

where λ is the weighting parameter balancing the two terms. The first term, ψ(s1) − ψ(s2), is the difference in feedback probability between the two options. The second term, sign(m(s1) − m(s2))|m(s1) − m(s2)|^r, is the difference in feedback magnitude scaled by a nonlinear exponent r. Intuitively, the value v of s1 can be understood as the weighted sum of these two terms. We write this nonlinear scaling in a slightly different form from Eq. 1b in Gagne et al. (2020) to better replicate their coding implementation.

During the learning stage, the FLR model learns the feedback probability using the same equations as the MOS model (Eqs. 5-6). The context-free variant FLR6 has 6 parameters ξ = {αHA, βHA, r, αψ, β, λ}. The context-dependent variant FLR19 treats {αHA, βHA, r} as context-free parameters and {αψ+, αψ−, β, λ} as context-dependent parameters.

The risk-sensitive (RS) model

We adopted the RS model from Behrens et al. (2007). The RS model assumes that participants apply the EU policy but with a subjectively distorted feedback probability,
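for which the choice rule can be sketched as the EU softmax with the distorted probability (denoted here ψ̂) substituted for ψ:

π(s) = exp(β m(s) ψ̂(s)) / [exp(β m(s1) ψ̂(s1)) + exp(β m(s2) ψ̂(s2))],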

where β is the inverse temperature. The distorted probability is calculated by,

where γ indicates a participant's risk sensitivity. When γ = 1, the participant weighs risk optimally; γ < 1 and γ > 1 indicate risk-seeking and risk-averse tendencies, respectively.

The RS model learns the feedback probability in the same way as the MOS and FLR models (i.e., Eq. 5) and does not include the HA strategy. The context-free variant RS3 has 3 parameters ξ = {β, αψ, γ}. The context-dependent variant RS12 treats {β, αψ+, αψ−, γ} as context-dependent parameters.

The Pearce-Hall (PH) model

The finding that people use different learning rates in different contexts implies that people adaptively adjust their learning rate during learning. To formalize this hypothesis, we adopted the PH model, an adaptive learning rate model, from Pearce and Hall (1980). The PH model replaces the fixed learning rate in Eq. 5 with an adaptive, trial-varying one,
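a minimal sketch of which, with a trial-varying rate α, is

ψ(s1) ← ψ(s1) + k α [O(s1) − ψ(s1)],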

where k is a scaling factor for the learning rate. On each trial, the learning rate is updated in accordance with the absolute prediction error,
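for which the standard Pearce-Hall form provides a sketch:

α ← (1 − η) α + η |O(s1) − ψ(s1)|,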

where η is the step size of the learning rate update. We have no knowledge of participants' learning rates before the experiment, so we also fit the initial learning rate as a free parameter. The PH model generates a choice through a sigmoid function,

The context-free variant PH4 has 4 parameters. The context-dependent variant PH17 treats the remaining parameter as context-free and {k+, k−, η, γ} as context-dependent parameters.

Model fitting

To characterize participants' behavioral patterns in different experimental contexts c, we fit the context-dependent parameters to each context following a 2-by-2 factorial structure (Table 1). For example, in the MOS model, we were interested only in the learning rate parameters (αψ+, αψ−) and the three strategy weights wEU, wMO, wHA, and fit them separately to each context. The remaining two parameters {β, αHA} were held constant across all four experimental contexts for each participant. Thus, the MOS22 model had 22 free parameters per participant (2 context-free parameters + 5 context-dependent parameters × 4 contexts). In contrast, for the context-free variant (MOS6), we fit the same set of parameters to all contexts.

Four experimental contexts

Parameters were estimated for each participant via the maximum a posteriori (MAP) method. The objective function to maximize is:
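A sketch of this MAP objective, summing the log-likelihood of the observed choices and the log prior over parameters, is

ξ̂(c) = argmax over ξ(c) of { log p(ξ(c)) + Σ i=1…N log p(si | m1:i, O1:i−1, ξ(c), M) },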

where ξ(c) denotes the model parameters under context c, M is the model, and N is the number of trials of the participant's behavioral data in context c. mi, Oi, and si are the presented magnitudes, the observed feedback, and the participant's response on trial i.

Parameter estimation was performed using the Broyden-Fletcher-Goldfarb-Shanno (BFGS) algorithm in the scipy.optimize module in Python. This algorithm provides an approximation of the inverse Hessian matrix for the parameter, a critical component that can be employed in Bayesian model selection (Rigoux et al., 2014). For each participant, we ran the optimization with 40 randomly chosen initial parameters to avoid local minima.

In order to use the BFGS algorithm, we reparameterized the model, thereby transforming the original constrained fitting problem into an unconstrained optimization problem (Supplemental Note 1). Importantly, to fit the weighting parameters (wEU, wMO, wHA) while ensuring they sum to 1, we parameterized them as the outputs of a softmax function over per-strategy logits,
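Written out, with one logit λk per strategy, this mapping is

wk = exp(λk) / [exp(λEU) + exp(λMO) + exp(λHA)],  k ∈ {EU, MO, HA}.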

We then fit these logits λi rather than the weights themselves. All logits were assumed to be normally distributed with a prior N(0, 10). Because of this normality, we also used the logits as participants' strategy preferences in the statistical analyses in the Results section. To provide better intuition, we will use the terms "weighting parameters" and "logits" interchangeably in the following sections; specifically, we may refer to the logits λEU, λMO, and λHA as the weighting parameters.

Simulate to explain the previous learning rate effects

In the Results section, Explain the previous learning rate effects using the strategy preferences, we illustrate how several classical learning rate phenomena can be explained by the weighting parameters of the MOS model, referred to as the strategy preferences. Here are the technical details.

We first show by simulation that the strategy preferences alone can explain the slower learning curve in the patient group. Each simulated task included 90 stable trials followed by 90 volatile trials. The parameters used for the simulations were β = 8.536, αHA = 0.403, αψ = 0.460, λEU = 0.712, λMO = −0.988, λHA = 0.276. We output the predicted probability of choosing the left stimulus s1 for each strategy across learning trials to generate Fig. 4C. We simulated the policy for the healthy control group using these averaged parameters, except that the three weighting parameters were replaced with the HC group's averaged values (Fig. 4B, HC curve). The same method was applied to generate the PAT curve using the PAT group's averaged weighting parameters.

Demonstrating the remaining two effects is equivalent to establishing the following assertion: a preference for the EU strategy results in a qualitatively greater increase in the fitted learning rate from the stable to the volatile condition, whereas a preference for the MO strategy corresponds to a smaller increase. We generated 10 blocks of synthetic data from the MOS model with reward feedback using β = 8.536, αHA = 0.403, αψ = 0.460, with 5 blocks each for the EU and MO strategies. For the EU strategy, we set the weighting parameters to λEU = 10, λMO = 0, λHA = 0, which yields wEU ≈ 1; similarly, we set the weighting parameters to λEU = 0, λMO = 10, λHA = 0 to synthesize the learning curves for the MO strategy, yielding wMO ≈ 1. We then fit the FLR and RS models to these synthetic data, holding all parameters constant except for the learning rate.

All parameter values reported here are the reparameterized values rather than the raw values.
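To make the simulation procedure concrete, the following Python sketch simulates one stable-then-volatile block under the MOS policy, using the standard strategy forms sketched in the modeling section. The function and variable names are illustrative and do not refer to our actual codebase, and the weight profiles in the example are illustrative values rather than fitted group averages.

    import numpy as np

    def softmax(x):
        """Numerically stable softmax over a 1-D array."""
        x = np.asarray(x, dtype=float)
        e = np.exp(x - x.max())
        return e / e.sum()

    def simulate_mos_block(w, beta=8.536, alpha_psi=0.460, alpha_ha=0.403,
                           n_stable=90, n_volatile=90, seed=0):
        """Simulate one stable-then-volatile block under the MOS policy (illustrative sketch)."""
        rng = np.random.default_rng(seed)
        w_eu, w_mo, w_ha = w                  # strategy weights; assumed to sum to 1
        psi = np.array([0.5, 0.5])            # learned feedback probabilities for s1, s2
        pi_ha = np.array([0.5, 0.5])          # habitual distribution
        dominant, p_left = 0, []
        for t in range(n_stable + n_volatile):
            volatile = t >= n_stable
            if volatile and (t - n_stable) % 20 == 0:
                dominant = 1 - dominant       # reversal every 20 trials in the volatile block
            p_dom = 0.8 if volatile else 0.75 # feedback probability of the dominant stimulus
            m = rng.integers(1, 100, size=2)  # feedback magnitudes (1-99) shown on this trial
            pi_eu = softmax(beta * m * psi)   # expected-utility strategy
            pi_mo = softmax(beta * m)         # magnitude-oriented strategy
            pi = w_eu * pi_eu + w_mo * pi_mo + w_ha * pi_ha   # mixture policy
            choice = rng.choice(2, p=pi)
            rewarded = dominant if rng.random() < p_dom else 1 - dominant
            o = np.eye(2)[rewarded]           # indicator O(s): which stimulus yielded feedback
            a = np.eye(2)[choice]             # indicator A(s): which stimulus was chosen
            psi = psi + alpha_psi * (o - psi)       # Rescorla-Wagner update (Eq. 5)
            pi_ha = pi_ha + alpha_ha * (a - pi_ha)  # habit update (Eq. 6)
            p_left.append(pi[0])
        return np.array(p_left)

    # Example: compare an EU-dominated with an MO-dominated profile (illustrative weights only)
    # p_eu_like = simulate_mos_block(w=(0.8, 0.1, 0.1))
    # p_mo_like = simulate_mos_block(w=(0.1, 0.8, 0.1))

Averaging the returned trial-by-trial choice probabilities over many such simulated blocks yields learning curves analogous to those in Fig. 4B-C.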

Parameter recovery and model recovery analyses

We conducted a parameter recovery analysis to validate the fitting of the MOS models. We generated 80 synthetic datasets, varying the four parameters of interest {αψ, λEU, λMO, λHA}. The remaining parameters were fixed to their averaged fitted values, β = 8.804 and αHA = 0.366. Each dataset contained ten blocks. For each dataset, we fit the MOS6 model to the data and compared the fitted parameters with the ground-truth parameters. The parameter recovery analysis aims to rule out exchangeability between the learning rate and the weighting parameters.

To further differentiate models, we also performed a model recovery analysis. We generated 40 synthetic ten-block datasets from the MOS6 model, using the parameters fitted to each participant. We fit all six models to each dataset and examined whether the MOS6 model, as the generative model, was still the best-fitting model on the synthetic datasets.

Results

To illustrate that the mixture of strategies provides a parsimonious alternative explanation, we first demonstrate that the context-free MOS6 model can quantitatively capture human learning behaviors and predict individual psychiatric symptoms. Furthermore, we use simulations to show that the MOS6 model, with a single set of parameters held constant across contexts, can explain several human behavioral phenomena previously attributed to context-dependent learning rates.

The mixture-of-strategy model quantitatively captures learning behaviors

We fit a total of eight models to the behavioral data reported in Gagne et al. (2020). Model fitting and comparison results are summarized in Table 2. To quantify goodness-of-fit, we calculated the negative log-likelihood (NLL), Akaike Information Criterion (AIC; Akaike, 1974), and Bayesian Information Criterion (BIC; Schwarz, 1978) for each participant and performed Bayesian model selection at the group level (Rigoux et al., 2014).

Model fitting reveals that the MOS framework accurately accounts for human behaviors. MOS6 and MOS22 were the best-fitting models in terms of BIC and AIC, respectively. The different results based on AIC and BIC may be due to their different degrees of penalty on model complexity (i.e., the number of free parameters). The group-level Bayesian model comparisons suggested MOS6 as the best-fitting model. These model comparisons underscore that the MOS framework outperforms the existing FLR and RS models from the previous literature (Behrens et al., 2007; Gagne et al., 2020). Importantly, an analysis of the parameters in the MOS22 model revealed no significant differences across experimental contexts (discussed later). This suggests that MOS22 and MOS6 are not qualitatively distinct. Therefore, we conclude that the MOS6 model can effectively account for human behaviors in a relatively context-free manner.

The difference in learning rates between stable and volatile conditions highlights the human capacity to flexibly adapt their learning rate in response to environmental volatility. To explore this adaptability, we applied a model with a built-in adaptive learning rate known as the Pearce-Hall model (PH4, PH17). However, our findings indicate that this model does not provide a better explanation than the MOS6 model. This suggests that there may be behavioral variations that cannot be fully accounted for by the parameters of learning rate.

Model comparison results. Red indicates the target models.

A-C. Averaged relative increases in negative log-likelihood (NLL), Akaike Information Criterion (AIC), and Bayesian Information Criterion (BIC), computed by subtracting the lowest value. Lower scores indicate better models. Error bars indicate the standard deviation of the estimated mean across 86 participants.

D. Protected exceedance probability (PXP) for the group-level Bayesian model selection.

MDD and GAD patients favor simpler and more irrational strategies

The MOS model assumes that each participant's responses result from a weighted combination of three strategies: EU, MO, and HA. We can therefore summarize participants' decision preferences using the weighting parameters w. For example, a larger weight wEU for the EU strategy indicates a stronger tendency to use the rational strategy for value computation and action selection. For significance testing, we used the logits (λ) of the weighting parameters (w) as indicators of decision preference, because the weighting parameters are not normally distributed whereas their logits approximately satisfy the normality assumption. We used Welch's t-tests (Delacre et al., 2017) on the MOS6 parameters to ensure reliable analyses given the unequal sizes of the healthy control and patient groups. We found that the patient group exhibited a weaker tendency toward the rational EU strategy (t(57.980) = 2.195, p = 0.032, Cohen's d = 0.508) and the HA strategy (t(59.032) = 2.389, p = 0.020, Cohen's d = 0.550), but a stronger tendency toward the MO strategy (t(63.746) = −3.479, p = 0.001, Cohen's d = 0.783) (Fig. 3A). However, there was no group difference in the (log) learning rate (t(72.041) = 0.678, p = 0.500, Cohen's d = 0.147).

Parameter analyses of the MOS6 model.

A. The weighting parameters for the healthy control participants (HC, purple) and the patients (PAT, pink) diagnosed with MDD and GAD. The y-axis shows the preference averaged over volatility levels (volatile and stable) and feedback types (reward and aversive). Error bars reflect the standard deviation across 86 participants. Significance symbol conventions: *: p < 0.05; **: p < 0.01; ***: p < 0.001; n.s.: not significant.

B. Decision preferences predict participants’ general factor score (g score) in the bifactor analysis reported in Gagne, et al. (2020). The y-axis indicates the averaged preference over different volatility levels (volatile and stable) and feedback types (reward and aversive). This average operation is permitted here because the logit of the weight is normally distributed. The shaded areas reflect 95% confidence intervals of the regression prediction.

For completeness, we also examined whether the decision preferences and learning rates varied as a function of volatility level (stable/volatile) and feedback type (reward/aversive) using the MOS22 model. We conducted three 2 × 2 × 2 ANOVAs on the logits of the three weighting parameters of MOS22 and found no significant effects of volatility level or feedback type (all ps > 0.149), except that participants were more likely to use the EU strategy in the reward condition (F(1, 300) = 13.426, p = 0.021, η2 = 0.016; see more details in Supplemental Note 2). In addition, the between-group (HC/PAT) analyses of the decision preferences in MOS22 are largely consistent with the MOS6 results shown above (Supplemental Fig. S2). For the (log) learning rate parameters, we additionally examined outcome valence (higher/lower reward than expected) apart from volatility level and feedback type, finding no consistent effects (all ps ≥ 0.046; Supplemental Note 2). All these results suggest that a single set of parameters is sufficient for the MOS model to describe this behavioral dataset.

Decision preferences predict the general severity of anxiety and depression

We investigated the relationship between decision preferences and psychiatric symptom severity (Fig. 3B). To measure symptom severity, we used the bifactor analysis approach described by Gagne et al. (2020), which decomposes measures of symptom severity into factors specific to anxiety and to depression, plus a general factor (g score) indexing the symptoms common to both. Our findings indicate that patients with more severe symptoms exhibited a weaker tendency to use the optimal EU strategy (Pearson's r = −0.221, p = 0.040) but a stronger tendency to use the MO strategy (Pearson's r = 0.360, p = 0.001). Additionally, there was a significant negative correlation between symptom severity and the preference for the HA strategy (Pearson's r = −0.285, p = 0.007). In other words, participants with more severe symptoms tend to use a less accurate but simpler strategy for probabilistic learning. Again, we suspect that this is because anxiety and depression reduce patients' cognitive resources, forcing them to choose less resource-consuming strategies. We will return to this point in the Discussion.

Explain learning rate effects using the strategy preferences

Three ubiquitous observations have been documented in probabilistic learning tasks. First, individuals with anxiety and depression often exhibit a slower learning curve over the course of learning, as evidenced by a smaller fitted learning rate (Chen et al., 2015; Pike & Robinson, 2022). Second, to adapt to high environmental volatility, participants increase their learning rate, generating a faster learning curve (Behrens et al., 2007). Third, the increase in learning rate from the stable to the volatile condition is smaller in the patient group, a hallmark of their learning deficits (Browning et al., 2015; Gagne et al., 2020). Here, we demonstrate that the MOS model can qualitatively reproduce all three effects by attributing them solely to strategy preferences, without resorting to the learning rate parameter.

We first averaged the data from 43 human participants (26 healthy controls and 17 patients) in the aversive feedback context and found that the patients exhibited slower convergence to the true feedback probability than the healthy control group (Fig. 4A). Next, we used the MOS model to simulate the learning behaviors of the two groups using each group's averaged weighting parameters {wEU, wMO, wHA}. Meanwhile, we controlled for any learning rate effect by fixing the parameters {β, αHA, αψ} to their values averaged across all participants, for both the HC and PAT groups, volatility levels (stable/volatile), and feedback types (reward/punishment).

Simulated learning behavior of the two groups and the three strategies. The black dashed lines indicate the ground-truth feedback probability. The simulated curves in B-C were generated by averaging 500 simulations.

A. The human learning curves of the two groups, produced using the data in the aversive context (see the reward context in Fig. S3) and smoothed by a Gaussian kernel with a window size of 5 trials and an s.d. of 2 trials.

B. Simulated learning curves of the two groups using their fitted parameters in the MOS model.

C. The simulated learning curve for each strategy.

Without introducing any learning rate difference between healthy controls and patients, we observed the same slower learning curve in the patient group in the simulations (Fig. 4B). To gain insight, we visualized the learning curve of each strategy throughout the learning task (Fig. 4C). The EU strategy, which is theoretically optimal, quickly approximates the true feedback probability and exhibits a fast learning curve. The HA strategy can also adapt to the volatile environment, but at a slower speed and with longer delays, resulting in a slower learning curve; this is intuitively reasonable because shaping a habit usually takes longer. The MO strategy is not adaptive to environmental volatility at all and exhibits a flat learning curve throughout the entire course of learning. As mentioned above, the patients tend to be more magnitude-oriented (MO), possibly because they cannot afford the effort-consuming EU strategy. Their preference for the least adaptive decision strategy induces a flattened learning curve in the probabilistic learning task. Therefore, we conclude that strategy preference alone can explain the slower learning curves in the patient group.

Next, we investigated whether the strategy preferences can account for the remaining two effects. Based on our earlier finding that healthy participants prefer the EU strategy while patients favor the MO strategy, we recast this problem as showing that a preference for the EU strategy results in a greater increase in the fitted learning rate from the stable to the volatile condition, whereas a preference for the MO strategy corresponds to a smaller increase. To this end, we fit the FLR and RS models to the simulated data generated by each strategy in the MOS model, holding all parameters constant across the two strategies except for the learning rate parameters (see Methods, Simulate to explain the previous learning rate effects, for details). We observed that, for the EU strategy, the fitted learning rate was elevated from the stable to the volatile condition (Fig. 5, Learning rate), mirroring the finding of faster learning curves in the volatile condition, whereas the MO strategy displayed almost no increase. Furthermore, the increase in the learning rate was smaller for the MO strategy (Fig. 5, Learning rate: volatile − stable), indicating that patients would display a smaller increase in the learning rate. These results suggest that strategy preferences alone provide a natural explanation for patients' maladaptive learning behaviors in response to environmental volatility.

Fitted learning rate and the learning rate differences between the stable and volatile conditions. The simulated data are generated by the MOS model and fitted by the FLR (A) and the RS (B) model.

In summary, the MOS model can effectively explain the three well-established learning curve effects in previous literature. It is important to note that, in contrast to the FLR or RS models, the apparent differences in learning curves in the MOS model originate from the weighting differences in strategy rather than learning rate per se. This means that the MOS model provides a key theoretical interpretation that differs from that in the majority of literature.

Model and parameter recovery analyses support model and parameter identifiability in MOS

It is intriguing that the MOS model can reproduce the classic learning curve effects only by adjusting strategy preferences without altering the learning rates. However, there are two potential confounding factors to consider. First, it is possible that adjusting the learning rate, rather than strategy preferences, could produce the same behavioral outcomes that are indistinguishable by the model fitting. If this holds, the MOS model might be problematic, as all learning rate differences may be automatically attributed to strategy preferences because of some unknown idiosyncratic model fitting mechanisms. Second, the fact that the MOS framework outperforms the other two frameworks may be partly due to an unknown bias in the model design. It is possible that the MOS model always wins, irrespective of how the data is generated.

To circumvent these issues, we performed parameter and model recovery analyses to investigate the identifiability of the true parameters and models. The parameter recovery results show that the true parameters that generated the synthetic datasets can be correctly estimated (all Pearson's rs > 0.720), demonstrating that the effects of the learning rate and the weighting parameters are not interchangeable in the MOS6 model.

Parameter and model recovery analyses

A. Parameter recovery for the MOS6 model. Each recovered parameter is averaged over ten samples.

B-C. Model recovery analyses. The MOS6 model is still the best-fitting model on synthetic data generated by MOS6 per se, according to the AIC (B) and BIC (C) comparisons. Error bars indicate the standard deviation of the mean value across 79 synthetic data points.

For model recovery, we fit all six models to the synthetic data generated by MOS6 and found that MOS6, as the generative model, remained the best-fitting model based on the lowest averaged AIC and BIC (Fig. 7). Both the parameter and model recovery analyses suggest that our modeling approach is reliable and that the MOS6 model is identifiable. This ensures that the differences in decision preferences between patients and HC were not the result of an idiosyncratic model design or fitting procedure. Note that we excluded the NLL and PXP from this evaluation: the NLL always favors models with more parameters, and the PXP, which is designed for group-level comparisons, is not an appropriate metric here because we knew in advance that all data were generated from one identical model.

Discussion

In this article, we propose a mixture-of-strategy model that assumes human agents' decision policy consists of three distinct components: the EU, MO, and HA strategies. The EU strategy is optimal in terms of maximizing reward; the MO and HA strategies are simpler heuristics that are less cognitively demanding. We applied the MOS model to a public dataset and found that it outperformed existing models in capturing human behaviors. We summarized human behaviors using the estimated parameters of the MOS model and report three primary conclusions. First, individuals with MDD and GAD tended to favor more irrational policies (i.e., a stronger preference for the MO strategy). Second, individual decision preferences predict the general severity of anxiety and depression. Third, decision preferences explain several learning rate phenomena that have been studied before. All three conclusions suggest that a mixture of strategies provides an effective and parsimonious explanation of human learning behaviors in volatile reversal tasks.

Attempts at decision analysis in previous studies

We are not the first to examine the human decision process. Several previous studies have also explored this process, although these attempts yielded only limited findings.

The well-established finding that humans apply different learning rates in different experimental blocks is a successful case study of ideal observer analysis. Behrens et al. (2007) constructed a hierarchical ideal Bayesian observer that dynamically models how higher-order environmental volatility influences the speed of updating the lower-order feedback probability. Because of this hierarchical interaction, the model predicts faster updating of the feedback probability in a volatile environment. The ideal Bayesian model thus prescribes an optimal manner of processing new information, i.e., how an agent should behave. Human behavioral data, however, were better accounted for by the RS model, which updates the feedback probability in the classical Rescorla-Wagner form. Interestingly, the key prediction of the ideal Bayesian model was preserved in the RS implementation: human subjects had a significantly higher learning rate in the volatile than in the stable environment. The success of the RS model seems to suggest that humans can flexibly adjust learning rates according to environmental volatility, a view that has become better established as more studies have replicated this learning rate effect (e.g., Browning et al., 2015; Gagne et al., 2020).

Nevertheless, some attention has been paid to understanding the decision process. Browning et al. (2015) studied the decision process of the RS9 model: they examined the risk-sensitivity parameter γ and the inverse temperature β but found no significant differences across levels of trait anxiety or volatility. Gagne et al. (2020) constructed 13 models in a stepwise manner to find the best-fitting description of human decision-making in the volatile reversal learning task. However, that study did not attempt to connect the decision process to anxiety and depression traits, possibly because the best-fitting model (the FLR model implemented here) was too complex to analyze.

The problems with both attempts are straightforward: the RS9 model may provide an inaccurate description of the human decision process, and the FLR model is difficult to interpret. The MOS model developed here alleviates both issues, providing a competitive fit while being constructed in an easily interpretable form. Additionally, the MOS model provides a parsimonious description of the behavioral data, using a single set of parameters to capture the data in all four experimental contexts. However, the model yields an explanation that contradicts previous accounts: it shows no significant group difference in learning rate. In other words, the apparent differences in learning curves may arise from decision processes (i.e., decision preferences) rather than learning processes. Note that we reproduced this finding using the same model on the Gagne et al. (2020) dataset, so the difference is not introduced by replacing Bayesian estimation with MAP parameter estimation. The good quantitative performance of the MOS model, together with its qualitative explanation of the adaptation effect without invoking a flexible learning rate, seems to challenge a range of previous results.

We emphasize that previous results and ours are not necessarily mutually exclusive and may coexist. We argue that the current experimental paradigm is insufficient to dissociate the two possible accounts: although the MOS framework wins quantitatively in model comparisons, further differentiating the two accounts requires examining their qualitatively distinct predictions. We discuss this issue under future directions below.

The normative interpretation of the mixed strategies

The normative interpretation of learning rate can be elusive. On one hand, the quality of a learning rate does not monotonically increase with its value. Consequently, one’s cognitive ability cannot be directly assessed based on their fitted learning rate, unless compared to the theoretically optimal learning rate. On the other hand, the optimal learning rate is highly context-dependent and can even vary from trial to trial (Behrens et al., 2007). This can pose challenges when assessing participants’ performance across different cognitive tasks.

Based on the principle of resource rationality, the MOS model has stronger normative characteristics. The model suggests that the preferences for the three strategies can be used to qualitatively approximate the reward-effort tradeoff. In particular, the EU strategy is (by definition) the most rewarding strategy (Von Neumann & Morgenstern, 1947) but is also cognitively demanding (Gershman et al., 2015). Hence, a higher preference for the EU strategy typically signifies greater cognitive ability and capacity. Individuals with psychiatric diseases exhibit a significantly lower preference for the EU strategy compared with healthy individuals, implying that their cognitive resources might be disrupted. The MO and HA strategies are more computationally economical, though they yield fewer rewards. It is worth noting that the patient group exhibits a greater preference for the MO strategy, which may imply impairments beyond limited cognitive resources: according to the resource-rationality principle (Gershman, 2020) and Fig. 4C, the HA strategy is a cost-efficient strategy that brings more reward than the MO strategy. This may prompt further investigation into the underlying reasons behind these preferences.

This framework can be extended to understand human behaviors in paradigms beyond the volatile reversal task. The key lies in identifying heuristics that contrast with the EU strategy. For instance, when employing the MOS model in a volatile reversal task with fixed reward magnitude signals set to 1, we can exclude the MO strategy from the pool and preserve EU and HA. A higher preference for the EU strategy still implies better cognitive ability.

Atypical learning speed in psychiatric diseases

In the present work, we found that patients with depression and anxiety display slower learning in probabilistic learning tasks (Fig. 4A), and we attributed this observation to participants' decision preferences. In conventional Rescorla-Wagner modeling, by contrast, learning speed is primarily indexed by the learning rate parameter. For example, Chen et al. (2015) conducted a systematic review of reinforcement learning in patients with depression and identified 10 out of 11 behavioral datasets showing either comparable or slower learning rates in depressive patients. Nonetheless, depressive patients may not always have a slower learning rate. In a recent meta-analysis summarizing 27 articles with 3085 participants, including 1242 with depression and/or anxiety, Pike and Robinson (2022) found a reduced reward learning rate but an enhanced punishment learning rate. This finding has two practical implications. First, the heterogeneous findings in the literature may arise from heterogeneous pathologies in depression and anxiety. Second, the learning rate as an indicator of human learning and decision-making is not yet perfect and needs refinement. The mixture-of-strategy model may provide a useful complementary account of the consequences of a spectrum of symptoms.

Limitations and future directions

The MOS model provides relatively context-free interpretations for some learning rate phenomena, but not all of them. One such phenomenon is the valence-specific learning rate difference, in which learning rates for positive outcomes are higher than those for negative outcomes (Chen et al., 2015; Gagne et al., 2020; Pike & Robinson, 2022). It is worth noting that we found no difference between valence-specific learning rates, even in MOS22, where valence-specific learning rates are incorporated (Supplemental Note 2). This suggests that the effect of the valence-specific learning rate is at most modest in this dataset. Future studies may consider exploring explicit behavioral markers of valence-specific learning that do not rely on specific computational models, rather than merely estimating learning rate values from noisy behavioral data.

We propose an experimental paradigm that could dissociate the learning rate account from the mixture-of-strategy account at the behavioral level. The idea is to test the cognitive-constraint hypothesis by manipulating participants' cognitive load. The volatile reversal learning task can be implemented with a secondary task (e.g., asking participants to remember words presented through headphones). We expect to observe a preference shift from the EU strategy toward the MO strategy (a decreasing wEU and an increasing wMO), because human agents should fall back on a simpler, irrational strategy under the resource constraints induced by the secondary task. In general, we expect this line of research to include more experimental paradigms so that we can gain a more complete picture of human learning behavior.

We will also explore why individuals with mental disorders prefer simpler strategies when making decisions. One possible explanation is that individuals with depression exhibit a maladaptive emotion regulation behavior called rumination, suffering from irresistible and persistent negative thoughts (Song et al., 2022; Yan et al., 2022). It is likely that the presence of negative thoughts consumes some cognitive resources, such that the participants fail to utilize the complicated but rewarding EU strategy.

Acknowledgements

We thank the authors of Gagne et al. (2020) for sharing their data. This work was supported by the National Natural Science Foundation of China (32100901), Shanghai Pujiang Program (21PJ1407800), Natural Science Foundation of Shanghai (21ZR1434700), the Research Project of Shanghai Science and Technology Commission (20dz2260300) and the Fundamental Research Funds for the Central Universities (to R.-Y.Z.)

Conflict of Interests

The authors declare no competing financial interests.

Author Contributions

Z. F. and R-Y.Z. conceived and designed the study. Z. F., M. Z., T.X., Y.L., H.X., and P.Q. processed the data. Z. F. implemented the computational models. Z. F. and R-Y.Z. wrote the first draft of the manuscript. All authors provided valuable feedback on the final manuscript.

Supplemental Information

Supplemental Note 1: the priors for reparametrized parameters

We fit the models using the BFGS method, which requires us to first turn the constrained optimization problem (in terms of parameter ranges) into an unconstrained one. To do so, we applied the reparameterization trick. For example, we passed raw parameter values through the sigmoid function to create parameters with range (0, 1); for parameters with range (0, ∞), we used the exponential function ξ = exp(ξraw). The raw parameter values are all assigned Gaussian priors.

We carefully tuned the raw-parameter priors so that, in the reparameterized space (not the raw-value space), each parameter has a reasonable prior that is consistent with other published research (Fig. S1). For parameters with range (0, 1), the implied prior approximates the uniform distribution Uniform(0, 1); for parameters with range (0, ∞), it approximates Gamma(3, 3). A code sketch of these transforms is given after Fig. S1 below.

The reparametrized priors for parameters.

A. For parameters with a range of (0, 1), the raw values were sampled from N(0, 1.55) and passed through the sigmoid function.

B. For parameters with a range of (0, ∞), the raw values were sampled from N(2, 1) and passed through the exponential function.

C. For parameters with a range (−∞, ∞), the raw values were sampled from N(0, 10).
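As an illustration of the transforms described above, the following Python sketch maps unconstrained raw values to the constrained parameter space and evaluates the Gaussian priors on the raw values. The names are illustrative rather than taken from our fitting code, and the priors are written by treating the second argument of N(⋅, ⋅) as a standard deviation, which is an assumption.

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def transform_params(raw):
        """Map unconstrained raw values (optimized by BFGS) to constrained model parameters."""
        logits = np.array([raw["l_eu"], raw["l_mo"], raw["l_ha"]])
        weights = np.exp(logits - logits.max())
        weights /= weights.sum()                      # softmax: weights lie on the simplex
        return {
            "alpha_psi": sigmoid(raw["alpha_psi"]),   # learning rate, range (0, 1)
            "alpha_ha":  sigmoid(raw["alpha_ha"]),    # habitual learning rate, range (0, 1)
            "beta":      np.exp(raw["beta"]),         # inverse temperature, range (0, inf)
            "w_eu": weights[0], "w_mo": weights[1], "w_ha": weights[2],
        }

    def log_prior(raw):
        """Unnormalized Gaussian log-priors on the raw values (as in Fig. S1: N(0, 1.55) for
        (0, 1) parameters, N(2, 1) for (0, inf) parameters, N(0, 10) for the logits).
        Normalization constants are dropped, which does not affect the MAP estimate."""
        lp = -0.5 * (raw["alpha_psi"] / 1.55) ** 2
        lp += -0.5 * (raw["alpha_ha"] / 1.55) ** 2
        lp += -0.5 * (raw["beta"] - 2.0) ** 2
        lp += sum(-0.5 * (raw[k] / 10.0) ** 2 for k in ("l_eu", "l_mo", "l_ha"))
        return lp

    # Example (illustrative raw values):
    # raw = {"alpha_psi": 0.0, "alpha_ha": 0.0, "beta": 2.0, "l_eu": 0.7, "l_mo": -1.0, "l_ha": 0.3}
    # params = transform_params(raw); lp = log_prior(raw)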

Supplemental Note 2: the complete statistical results of MOS22

We performed multiple 2 × 2 × 2 ANOVAs with the logits of the three weighting parameters (dubbed decision preferences) as dependent variables, group (healthy controls/patients) as a between-subject factor, and volatility level (stable/volatile) and feedback type (reward/aversive) as within-subject factors.

For the weighting parameter of the EU strategy wEU, the patient group exhibited a weaker tendency toward the rational EU strategy (F(1, 300) = 27.195, p < 0.001, η2 = 0.076). Participants also showed a stronger tendency toward the EU strategy for reward feedback than for aversive feedback (F(1, 300) = 5.368, p = 0.021, η2 = 0.016). There was no significant main effect of volatility level (F(1, 300) = 0.022, p = 0.926, η2 = 0.006) and no significant interaction effects (all ps > 0.149).

For the weighting parameter of the MO strategy wMO, the patient group exhibited a stronger tendency toward the MO strategy (F(1, 300) = 10.652, p < 0.001, η2 = 0.031). There were no significant main effects of volatility level (F(1, 300) = 0.537, p = 0.464, η2 < 0.001) or feedback type (F(1, 300) = 0.431, p = 0.512, η2 = 0.002), and no significant interaction effects (all ps > 0.420).

For the weighting parameter of the HA strategy wHA, the two groups exhibited no significant difference in preference for the HA strategy (F(1, 300) = 0.434, p = 0.511, η2 = 0.001). There were no significant main effects of volatility level (F(1, 300) = 0.872, p = 0.351, η2 = 0.003) or feedback type (F(1, 300) = 1.484, p = 0.224, η2 = 0.004), and no significant interaction effects (all ps > 0.357).

For the log learning rates log αψ, there were no significant main effects of group (F(1, 300) = 1.489, p = 0.223, η2 = 0.002), feedback type (F(1, 300) = 0.002, p = 0.961, η2 = 0.000), or volatility level (F(1, 300) = 1.280, p = 0.258, η2 = 0.002). We also examined the valence-specific learning rate effect and found no significant difference (F(1, 300) = 0.006, p = 0.937, η2 = 0.000). There was a weak interaction of group × volatility level × feedback type (F(1, 300) = 3.998, p = 0.046, η2 = 0.006). No other significant interaction effects were found (all ps > 0.258).

Parameter analyses of MOS22.

A. The weighting parameters for the healthy control participants (HC) and the patients (PAT) diagnosed with MDD and GAD. The y-axis shows the preference averaged over volatility levels (volatile and stable) and feedback types (reward and aversive). Error bars reflect the standard error of the mean across participants × experimental conditions × feedback types.

B. Decision preferences predict participants’ general factor score (g score) in the bifactor analysis reported in Gagne, et al. (2020). The y-axis indicates the averaged preference over different volatility levels (volatile and stable) and feedback types (reward and aversive). This average operation is permitted here because the logit of the weight is normally distributed. The shaded areas reflect 95% confidence intervals of the regression prediction.

The human learning behaviors of the two groups in the reward condition.