1 Introduction

Active inference, derived from the free energy principle, provides a powerful explanatory tool for understanding the dynamic relationship between an agent and its environment [1]. Free energy is a measure of uncertainty used to describe a system, which can be understood as the difference between the real system state and the estimated system state [2]. In addition, expected free energy can be used to guide the optimization of decision-making. Under the active inference framework, perception, action, and learning are all driven by the minimization of variational free energy (Figure 1). By minimizing free energy, people can optimize decisions in a way that encompasses both the reduction of uncertainty about the environment (through exploration) and the maximization of rewards (through exploitation). Active inference [3] is a pragmatic implementation of the free energy principle for action, proposing that agents not only minimize free energy through perception but also through actions that enable them to reach preferable states. Briefly, in active inference, the agent maintains an internal belief model to approximate the hidden states of the environment (perception) and acts to reach preferable states (action) (see Section 2.1).

Active inference. (a) Qualitatively, agents receive observations from the environment and use these observations to optimize an internal cognitive model of the environment. Agents then actively sample the environment by acting (choosing actions that place them in more favorable states). The environment changes its state according to the agents’ actions, and agents again receive new observations. (b) Quantitatively, agents optimize the internal cognitive model by minimizing the variational free energy, and then select policies (actions) by minimizing the expected free energy, thereby minimizing surprise in the future.

In recent years, the active inference framework has been applied to understanding cognitive processes and behavioral policies in human decisions. Many works support the potential of the active inference framework to describe complex cognitive processes and provide theoretical insights into behavioral dynamics [4-7]. For instance, the exploration-exploitation trade-off has been theoretically derived within the active inference framework [3, 8], a trade-off that is essential to the functioning of cognitive agents in many decision contexts [9, 10]. Specifically, exploration means taking the action that offers extra information about the current environment, while exploitation means taking the action that maximizes the potential reward given the current belief. The exploration-exploitation trade-off captures an inherent tension, particularly when the agent is confronted with incomplete information about the environment [11]. However, these theoretical accounts have rarely been confirmed experimentally with empirical laboratory evidence at the behavioral and neural levels.

The decision-making process frequently involves grappling with varying forms of uncertainty: ambiguity, the element of uncertainty that can be mitigated through sampling; risk, the inherent uncertainty presented by a stable environment; and unexpected uncertainty, the uncertainty pertaining to environmental changes. Studies have investigated these different forms of uncertainty in decision-making, focusing especially on their neural correlates [12-15]. However, it remains an open question whether the brain represents these different types of uncertainty distinctly [16] (Aim 1). In addition to representing uncertainties, the brain may also encode the value of resolving them [16] (Aim 2). The active inference framework presents a theoretical approach to resolving these research gaps. Within this framework, ambiguity is represented by the information gain about model parameters associated with choosing a particular action, while risk is signified by the variance of hidden environmental states. In active inference, the representations of uncertainty naturally translate into representations of the value of reducing ambiguity or avoiding risk (see Section 2.1), which means that these representations may share common neural signatures [1].

Our study therefore aimed to determine how the human brain represents different uncertainties (Q1) and how humans encode the policies, or value, of resolving these uncertainties (Q2). To achieve these aims, we used the active inference framework to examine the exploration-exploitation trade-off with behavioral data and electroencephalogram (EEG) recordings (see Methods). We designed a contextual two-armed bandit task (see Section 2.2 and Figure 4 (a)), in which participants were instructed to maximize cumulative rewards. They were offered various policies to either avoid risk, reduce ambiguity, or maximize immediate rewards (see Methods). To address the two questions, our study reports 1) how participants trade off exploration and exploitation in the contextual two-armed bandit task (behavioral evidence) (see Section 3.1); 2) how brain signals differ under different levels of ambiguity and risk (sensor-level EEG evidence) (see Section 3.2); 3) how the brain encodes the exploration-exploitation trade-off and evaluates the value of reducing ambiguity and avoiding risk during action selection; and 4) how the brain updates information about the environment during belief update (source-level EEG evidence) (see Section 3.3).

2 Methods

2.1 The free energy principle and active inference

Under the free energy principle, an agent can sample different states of the environment by choosing actions that yield preferred sensory input; this process is termed active inference. Under the active inference framework, free energy can be viewed as the objective function of the system, i.e., the quantity to be minimized. By minimizing free energy, the agent can optimize decisions and reduce uncertainty. In active inference, variational inference is used to estimate model parameters (minimizing variational free energy), guide the agent’s actions (minimizing expected free energy), and compute an objective function by maximizing the expected log-likelihood (see Figure 1 (b)). This process can be viewed as an optimization problem that seeks the model parameters and action strategy that maximize the log-likelihood. By minimizing the objective function, optimal model parameters can be estimated and better decisions can be made [17]. This principle bridges sensory input, cognitive processes, and action output, enabling us to quantitatively describe the neural processes of learning about the environment. For example, the brain receives sensory input o from the environment, and the cognitive model encoded by the brain, q(s), makes an inference about the cause of the sensory input, p(s | o). Under the free energy principle, minimizing free energy amounts to minimizing the difference (e.g., the KL divergence) between the cognitive model encoded by the brain (q(s)) and the true posterior over the causes of the sensory input (p(s | o)). Thus, free energy is an information-theoretic quantity that bounds the evidence for the model of the data. Free energy can be minimized by the following two means [18]:

  • Minimize free energy through perception. Based on existing observations, the brain improves its internal cognitive model by maximizing model evidence, reducing the gap between the true cause of the sensory input and the internal cognitive model.

  • Minimize free energy through action. The agent actively samples the environment, making the sensory input more consistent with the cognitive model by sampling the states that are preferred (see the numerical sketch below). Minimizing free energy through action is one advantage of the free energy principle over pure Bayesian inference, which can only passively optimize cognition.
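As a minimal numerical illustration of the perceptual route, the sketch below (a hypothetical two-state example; NumPy only) shows how updating the internal model q(s) toward the true posterior p(s | o) reduces the KL divergence that enters the free energy:

    import numpy as np

    def kl(q, p):
        """KL divergence D_KL(q || p) for discrete distributions."""
        q, p = np.asarray(q, float), np.asarray(p, float)
        return float(np.sum(q * np.log(q / p)))

    # Hypothetical example: perception nudges q(s) toward p(s | o)
    p_posterior = np.array([0.8, 0.2])   # true posterior over hidden states
    q_before = np.array([0.5, 0.5])      # internal model before updating
    q_after = np.array([0.75, 0.25])     # internal model after updating

    print(kl(q_before, p_posterior))     # ~0.223 nats
    print(kl(q_after, p_posterior))      # ~0.007 nats: the divergence drops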

2.1.1 The generative model

Active inference builds on partially observable Markov decision processes: (O, S, U, T, R, P, Q) (see Table 1).

Ingredients for computational modeling of active inference

In this model, the generative model P is parameterized as follows, with model parameters η = (a, c, d, β) [3]:
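Following the standard formulation in [3] (prior preferences over outcomes are encoded by c), the generative model factorizes as:

    P(o, s, π, A, γ) = P(A) P(γ) P(π | γ) ∏τ P(oτ | sτ) P(sτ | sτ−1, π)
    P(oτ | sτ) = Cat(A),  P(sτ | sτ−1, π) = Cat(B(π, τ)),  P(s1) = Cat(d)
    P(π | γ) = σ(−γ · G(π)),  P(A) = Dir(a),  P(γ) = Γ(1, β)    (1)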

where o denotes the sensory inputs, s the hidden states of the environment, π the policy of the agent, A the likelihood matrix mapping from hidden states to outcomes, B the transition probabilities of hidden states under the action of a policy at time t, d the prior expectation of each state at the beginning of each trial, γ the inverse temperature of beliefs about policies, β the prior expectation of the temperature of beliefs about policies, and a the concentration parameters of the likelihood; Cat() denotes the categorical distribution, Dir() the Dirichlet distribution, and Γ() the Gamma distribution.

The posterior probability of the corresponding hidden states and parameters (s, π, A, B, β) is given by Eq. (2):
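Adopting the mean-field factorization of [3], the approximate posterior is:

    Q(s, π, A, B, β) = Q(π) Q(β) Q(A) Q(B) ∏τ Q(sτ | π)    (2)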

The generative model is a conceptual representation of how agents construe their environmental circumstances. This model fundamentally posits that agents’ observations are contingent upon states, and that the transitions of these states depend on both the state itself and the chosen policy or action sequence. It is crucial to note that within this model, the policy is considered a stochastic variable requiring inference, thus treating planning as a form of inference. This inference process involves deducing the optimal policy from the agents’ observations. All the conditional probabilities incorporated within this model are parameterized using Dirichlet distributions [19]. The Dirichlet distribution’s sufficient statistic is its concentration parameter, which can be interpreted as the cumulative frequency of previous occurrences. In essence, this means that the agents incorporate the frequency of past combinations of states and outcomes into the generative model. The generative model therefore plays a pivotal role in stipulating the probabilities and uncertainties related to the potential states and outcomes.

2.1.2 Variational free energy and expected free energy

Perception, decision-making, and learning in active inference are all achieved by minimizing the variational and expected free energy with respect to the model parameters and hidden states. The variational free energy can be expressed in various forms with respect to the approximate posterior, as in Eq. (3):
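In this notation, and following [3], Eq. (3) reads:

    F = Eq(s)[ln q(s) − ln p(o, s)]
      = DKL(q(s) || p(s | o)) − ln p(o)    (3)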

These forms of free energy are consistent with variational inference in statistics. Minimizing free energy is equivalent to maximizing model evidence, that is, minimizing surprise. In addition, free energy can be rearranged into another form, as in Eq. (4):
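Splitting the joint density gives the decomposition described below:

    F = DKL(q(s) || p(s)) − Eq(s)[ln p(o | s)]    (4)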

The initial term, denoted DKL(q(s) || p(s)), is conventionally referred to as “complexity”. This term, reflecting the divergence between q(s) and p(s), quantifies the amount of information intended to be encoded within q(s) that is not inherent in p(s). The subsequent term, Eq[ln p(o | s)], designated “accuracy”, represents the expected log-probability of receiving an observation given each state.

The minimization of variational free energy progressively aligns the approximate posterior distribution over hidden states, as encoded by the brain’s representation of the environment, with the true posterior distribution conditioned on observed data. However, it is noteworthy that our policy beliefs are predominantly future-oriented: we want policies that can effectively guide us toward desired future states. It follows that these policies should minimize the free energy expected in the future, in other words, the expected free energy. The relationship between policy selection and expected free energy is inverse: the lower the expected free energy under a given policy, the higher the probability of that policy being selected. Hence, expected free energy emerges as a crucial factor influencing policy choice.

Next, we can derive the expected free energy in the same way as the variational free energy:
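Following [3], the expected free energy sums over future time steps and decomposes into the three components used throughout this paper (with Q̃ = Q(oτ, sτ | π) denoting the predictive density defined below):

    G(π) = Στ G(π, τ)    (5)
    G(π, τ) = EQ̃[ln Q(sτ | π) − ln P(oτ, sτ | π)]    (6)
    G(π, τ) = − EQ̃[DKL(Q(A | oτ, sτ) || Q(A))]         (active learning)
              − EQ̃[ln Q(sτ | oτ, π) − ln Q(sτ | π)]    (active inference)
              − EQ̃[ln P(oτ)]                           (extrinsic value)    (7)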

In Eq. (7), it is important to note that we anticipate observations that have not yet transpired. Consequently, we designate the predictive density Q̃ = Q(oτ, sτ | π) = P(oτ | sτ) Q(sτ | π). Within the context of expected free energy, establishing a relationship between ln P(oτ) and preference enables us to express expected free energy in terms of epistemic value and extrinsic value. This relationship offers a new lens on the interplay between cognitive processes and their environmental consequences, thereby enriching our understanding of decision-making under the active inference framework.

In this context, extrinsic value aligns with the concept of expected utility. On the other hand, epistemic value corresponds to the anticipated information gain, encapsulating the exploration of both model parameters (active learning) and the hidden states (active inference), which are to be illuminated by future observations.

Belief updating plays a dual role, facilitating both inference and learning. Inference is here understood as the optimization of expectations about hidden states, whereas learning involves the optimization of model parameters. This optimization requires finding the sufficient statistics of the approximate posterior that minimize the variational free energy. Active inference employs gradient descent to identify the optimal updates [3]. In the present work, our focus is primarily on the update rules for the mapping function A and the concentration parameter a:
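A standard form of these updates, following [3], accumulates co-occurrences of outcomes and inferred states in the Dirichlet concentration parameters (⊗ denotes the outer product and sτ the posterior expectation over hidden states):

    a ← a + α Στ oτ ⊗ sτ,    Aij = aij / Σk akj    (8)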

where α is the learning rate.
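As a concrete illustration, the following minimal sketch (hypothetical function and variable names; NumPy only) implements this concentration-parameter update for a single time step, using the task’s dimensions of 7 outcomes and 8 hidden states:

    import numpy as np

    def update_likelihood(a, obs_onehot, state_posterior, lr=1.0):
        """Dirichlet update of the likelihood concentration parameters.

        a               : (n_obs, n_states) concentration parameters of Dir(a)
        obs_onehot      : (n_obs,) one-hot encoding of the observed outcome
        state_posterior : (n_states,) posterior expectation over hidden states
        lr              : learning rate (alpha in the text)
        """
        # Accumulate evidence: outer product of outcome and state posterior
        a_new = a + lr * np.outer(obs_onehot, state_posterior)
        # Expected likelihood A is the column-normalized concentration matrix
        A = a_new / a_new.sum(axis=0, keepdims=True)
        return a_new, A

    a0 = np.ones((7, 8))        # flat prior: 7 outcomes, 8 hidden states
    o = np.eye(7)[2]            # outcome 2 observed
    s = np.full(8, 1 / 8)       # uniform posterior over hidden states
    a1, A = update_likelihood(a0, o, s, lr=0.5)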

2.2 Contextual two-armed bandit task

In this study, we developed a “contextual two-armed bandit task”, based on the conventional multi-armed bandit task [20]. In this task, participants are instructed to explore two reward-yielding paths with the aim of maximizing cumulative reward. One path provides a constant reward in each trial and is labeled the “safe path,” while the other, referred to as the “risky path,” probabilistically offers varying amounts of reward. The risky path has two distinct contexts, “Context 1” (high-reward context) and “Context 2” (low-reward context), each corresponding to a different reward distribution.

The context associated with the risky path alternates randomly across trials, and participants can determine the specific context of the current trial’s risky path by accessing a cue, although this comes at a cost. Additionally, participants must discern and learn the reward distributions of both contexts. For a comprehensive overview of the specific parameter settings, please refer to Figure 2.

Generative model of the contextual two-armed bandit task. (a) There are 2 stages in this task. The first choice is between “Stay” and “Cue”. The “Stay” option gives you nothing, while the “Cue” option gives you a −1 reward and the context information about the “Risky” option in the current trial. The second choice is between “Safe” and “Risky”. The “Safe” option gives you a +6 reward, and the “Risky” option gives you a reward probabilistically ranging from 0 to +12 depending on the current context (Context 1 or Context 2); (b) The four policies in this task are: “Cue” and “Safe”, “Stay” and “Safe”, “Cue” and “Risky”, and “Stay” and “Risky”; (c) The A-matrix maps from 8 hidden states (columns) to 7 observable outcomes (rows).

In the task, active inference agents with different parameter configurations can exhibit different decision-making policies, as demonstrated in a simulation experiment (see Figure 3). By adjusting parameters such as the priors, learning rate, and precision, agents can operate under different policies. Agents with a low learning rate (and a relatively high proportion of epistemic value) will initially incur a cost to access the cue, enabling them to thoroughly explore and learn the reward distributions of the different contexts. Once sufficient environmental information has been obtained, the agent evaluates the actual values of the various policies and selects the optimal policy for exploitation. In our experimental setup, the optimal policy is to access the cue and then select the risky path in the high-reward context and the safe path in the low-reward context. However, in particularly difficult circumstances, an agent with a high learning rate (and a smaller proportion of epistemic value) may become trapped in a local optimum and consistently opt for the safe path, especially if the initial high-reward scenarios it encounters yield minimal rewards.

The simulation experiment results. This figure demonstrates how an agent selects actions and updates beliefs over 60 trials in the active inference framework. The first two panels (a-b) display the agent’s policy and depict how the policy probabilities are updated (choosing between the stay and cue options in the first choice, and between the safe and risky options in the second choice). The scatter plot indicates the agent’s actions, with green representing the cue option when the context of the risky path is “Context 1” (high-reward context), orange representing the cue option when the context of the risky path is “Context 2” (low-reward context), purple representing the stay option when the agent is uncertain about the context of the risky path, and blue indicating the safe-risky choice. The shadow represents the agent’s confidence, with darker shadows indicating greater confidence. The third panel (c) displays the rewards obtained by the agent in each trial. The fourth panel (d) shows the prediction error of the agent in each trial, which decreases over time. Finally, the fifth panel (e) illustrates the agent’s expected rewards for the “Risky Path” in the two contexts.
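To make the agent’s valuation in this task concrete, the following sketch (illustrative NumPy code; the belief distributions and numbers are hypothetical, not fitted values) separates the expected reward of each second-stage option (extrinsic value) from the information gained by consulting the cue (epistemic value):

    import numpy as np

    def entropy(p):
        p = np.asarray(p, dtype=float)
        return float(-np.sum(p * np.log(p + 1e-16)))

    # Beliefs assumed for illustration only
    context_prior = np.array([0.5, 0.5])          # Context 1 vs Context 2
    reward_levels = np.array([0, 3, 6, 9, 12])    # possible risky-path rewards
    p_reward = np.array([[0.1, 0.2, 0.4, 0.2, 0.1],   # learned dist., Context 1
                         [0.4, 0.3, 0.2, 0.1, 0.0]])  # learned dist., Context 2
    safe_reward, cue_cost = 6.0, 1.0

    # Extrinsic value: expected reward of each second-stage option
    ev_risky = context_prior @ (p_reward @ reward_levels)
    ev_safe = safe_reward

    # Epistemic value of the cue: it fully discloses the context, so the
    # expected information gain equals the entropy of the context prior
    info_gain_cue = entropy(context_prior)

    print(f"E[reward] risky: {ev_risky:.2f}, safe: {ev_safe:.2f}")
    print(f"Cue: info gain {info_gain_cue:.2f} nats at a cost of {cue_cost} apple")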

2.3 EEG collection and analysis

2.3.1 Participants

Participants were recruited via an online recruitment advertisement. We recruited 25 participants (14 male, 11 female; mean age 20.82 ± 2.12 years) and concurrently collected electroencephalogram (EEG) and behavioral data. All participants signed an informed consent form before the experiments. This study was approved by the local ethics committee of the University of Macau (BSERE22-APP006-ICI).

2.3.2 Data collection

In our experiment, to diversify the data, we incorporated an additional “you can ask” stage prior to the stay-cue option. This measure was implemented to ensure that each participant encountered trials under conditions of heightened uncertainty. Participants were presented with the following experimental scenario and instructions: “You are on a quest for apples in a forest, beginning with 5 apples. You encounter two paths: 1) The left path offers a fixed yield of 6 apples per excursion. 2) The right path offers a probabilistic reward of 0/3/6/9/12 apples per exploration, but it includes two distinct contexts, labeled ‘Context 1’ and ‘Context 2,’ each with a different reward distribution. Note that the context associated with the right path will change randomly in each trial. Before selecting a path, a ranger will provide information about the context of the right path (‘Context 1’ or ‘Context 2’) in exchange for an apple. The more apples you collect, the greater your monetary reward will be.”

Participants read an introduction containing basic information about the experiment (e.g., that the total number of apples collected was linked to the monetary reward they would receive) and pressed the spacebar to proceed. The experimental procedure for each trial is illustrated in Figure 4 (a) and comprises five stages:

  1. “You can ask” stage: Participants are given the option to ask the ranger for information; this stage lasts for 2 seconds.

  2. “First choice” stage: Participants decide whether to press the right or left button, i.e., whether to ask the ranger for information at the cost of an apple. This stage also lasts 2 seconds and corresponds to action selection in active inference.

  3. “First result” stage: Participants either receive information about the context of the right path for the current trial or gain no additional information. This stage lasts for 2 seconds and corresponds to belief update in active inference.

  4. “Second choice” stage: Participants press the RIGHT or LEFT key to choose the respective path. This stage again lasts for 2 seconds and corresponds to action selection in active inference.

  5. “Second result” stage: Participants are informed of the number of apples rewarded in the current trial and their total apple count; this stage lasts for 2 seconds and corresponds to belief update in active inference.

The experimental task and behavioral results. Panel (a) outlines the five stages of the experiment: the “You can ask” stage to determine whether participants can request information from the ranger, the “First choice” stage to decide whether to ask the ranger for information, the “First result” stage to display the result of the “First choice” stage, the “Second choice” stage to choose between the left and right paths under different uncertainties, and the “Second result” stage to show the result of the “Second choice” stage. Panel (b) displays the number of times each option was selected. Finally, panel (c) compares the model-free RL and active inference models.

Each stage is separated by a jitter ranging from 0.6 to 1.0 seconds. The entire experiment consists of a single block with a total of 120 trials.

2.3.3 EEG processing

EEG signals were processed using the EEGLAB toolbox [21] in MATLAB and the MNE package [22] in Python. Preprocessing involved multiple steps: data selection, downsampling, band-pass filtering, and independent component analysis (ICA) decomposition. Data segments encompassing a 2-second interval before and after each experimental event were selected. The data were then downsampled to 250 Hz and band-pass filtered within the 1-30 Hz frequency range. Channels exhibiting abnormal data were repaired using interpolation and average values. Following this, ICA was applied to identify and discard components flagged as noise.
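A minimal MNE-Python sketch of this pipeline might look as follows (the file name, component count, and excluded components are hypothetical; in practice, component exclusion follows visual inspection):

    import mne

    # Hypothetical file name; steps follow the text: downsample to 250 Hz,
    # band-pass 1-30 Hz, repair bad channels, then ICA-based denoising
    raw = mne.io.read_raw_eeglab("sub-01_task-bandit_eeg.set", preload=True)
    raw.resample(250)
    raw.filter(l_freq=1.0, h_freq=30.0)
    raw.interpolate_bads()

    ica = mne.preprocessing.ICA(n_components=20, random_state=0)
    ica.fit(raw)
    ica.exclude = [0, 1]               # noise components chosen by inspection
    raw_clean = ica.apply(raw.copy())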

After obtaining the preprocessed data, our objective was to gain a more comprehensive understanding of the specific functions associated with each brain region. To accomplish this, we employed the head model and source space of the “fsaverage” template provided with the MNE package. To localize the sources, we used eLORETA [23] and mapped the EEG data to the source space.
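A sketch of this source localization step with MNE-Python (assuming “evoked” holds the preprocessed sensor-level data, and using an ad hoc noise covariance purely for illustration) could read:

    import os.path as op
    import mne
    from mne.minimum_norm import make_inverse_operator, apply_inverse

    # fsaverage template anatomy shipped with MNE
    fs_dir = mne.datasets.fetch_fsaverage(verbose=False)
    src = op.join(fs_dir, "bem", "fsaverage-ico-5-src.fif")
    bem = op.join(fs_dir, "bem", "fsaverage-5120-5120-5120-bem-sol.fif")

    # 'evoked' is assumed to be an mne.Evoked built from the cleaned epochs
    fwd = mne.make_forward_solution(evoked.info, trans="fsaverage",
                                    src=src, bem=bem, eeg=True, meg=False)
    noise_cov = mne.make_ad_hoc_cov(evoked.info)
    inv = make_inverse_operator(evoked.info, fwd, noise_cov)
    stc = apply_inverse(evoked, inv, method="eLORETA")   # source estimates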

We segmented the data into five time intervals corresponding to the five stages of the experiment. The first stage, the “You can ask” stage, identified participants’ willingness to access cues. In the second stage, the “First choice” stage, participants decided whether to seek cues. The third stage, the “First result” stage, delivered the results of cue access. The fourth stage, the “Second choice” stage, involved choosing between the safe and risky paths. Finally, the fifth stage, the “Second result” stage, delivered the rewards. Each interval lasted two seconds, and this segmentation allowed us to investigate brain responses to the two distinct choices at different stages of the task. Specifically, we examined the processes of prediction (action selection) and outcome (belief update) within the framework of active inference.

3 Results

3.1 Behavioral results

The active inference framework was employed to fit participants’ behavioral policies. To account for differing preferences regarding resolving uncertainty versus obtaining rewards, we introduced fitting coefficients for the three terms constituting the expected free energy (active learning, active inference, and extrinsic value). These fitted parameters were then integrated into the active inference model, enabling the extraction of expected free energy, prediction error, and other variables for each trial. We could then perform linear regression to analyze the associated EEG signals for each brain region.
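A minimal sketch of this fitting step (the data structure is hypothetical; the three weights scale the per-policy expected free energy components, and choices are modeled with a softmax over policies) might be:

    import numpy as np
    from scipy.optimize import minimize
    from scipy.special import logsumexp

    def neg_log_likelihood(w, trials):
        """w = (w_learn, w_infer, w_extr): weights on the three EFE terms."""
        nll = 0.0
        for t in trials:
            # t["G"]: (n_policies, 3) per-policy EFE components
            # (active learning, active inference, extrinsic value)
            G = t["G"] @ w                  # weighted expected free energy
            logp = -G - logsumexp(-G)       # softmax over policies
            nll -= logp[t["choice"]]        # index of the chosen policy
        return nll

    rng = np.random.default_rng(0)          # synthetic stand-in data
    trials = [{"G": rng.normal(size=(4, 3)), "choice": int(rng.integers(4))}
              for _ in range(100)]

    res = minimize(neg_log_likelihood, x0=np.ones(3), args=(trials,),
                   method="Nelder-Mead")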

The model comparison results demonstrated that active inference provided a better fit to participants’ behavioral data than basic model-free reinforcement learning (Figure 4 (c)). Notably, active inference captured participants’ exploratory inclinations better than model-free RL [24, 25]. This was evident in our experimental observations (Figure 4 (b)), where participants significantly favored consulting the ranger over opting to stay. Consulting the ranger, which provided environmental information, emerged as the more beneficial policy within the context of this task.

Moreover, participants’ preferences regarding uncertainty varied with the context. When participants lacked information about the context and the risky path had the same average reward as the safe path but greater variability, they showed an equal preference for the two options (Figure 4 (b), “Not ask”). However, in Context 1 (Figure 4 (b), “Context 1”, the high-reward context), where the risky path offered greater rewards than the safe path, participants strongly favored the riskier option, which not only provided higher rewards but also carried added epistemic value. In contrast, in Context 2 (Figure 4 (b), “Context 2”, the low-reward context), where the risky path offered lower rewards than the safe path, participants mostly chose the safe path but occasionally opted for the risky path, recognizing that despite its lower rewards it offered epistemic value.

3.2 EEG results at sensor level

As depicted in Figure 5 (a), we divided the electrodes into five clusters: left frontal, right frontal, central, left parietal, and right parietal. In the “Second choice” stage, participants were required to make decisions under varying degrees of uncertainty (uncertainty about the hidden states and uncertainty about the model parameters). We therefore investigated whether distinct brain regions exhibited differential responses under such uncertainty.

EEG results at the sensor level. (a) The electrode distribution. (b) The signal amplitude of different brain regions in the first and second half of the experiment during the “Second choice” stage. The right panel shows the visualization of the evoked data and spectral data. (c) The signal amplitude of different brain areas in the “Second choice” stage when participants knew or did not know the context of the right path. The right panel shows the visualization of the evoked data and spectral data.

In the first half of the experimental trials, participants would typically display greater uncertainty about model parameters than in the latter half [8]. We therefore analyzed data from the first and latter halves of the trials separately and identified statistically significant differences in signal amplitude over the left frontal region (p < 0.01), the right frontal region (p < 0.05), the central region (p < 0.01), and the left parietal region (p < 0.05), suggesting a role for these areas in encoding the statistical structure of the environment (Figure 5 (b)). We postulate that once participants had constructed a statistical model of the environment during the second half of the trials, the brain could effectively utilize this model to make superior decisions, exhibiting more positive activity.

To investigate whether distinct brain regions exhibited differential responses under uncertainty about the hidden states, we divided all trials into two groups, the asked trials and the not-asked trials, based on whether participants chose to ask the ranger in the “First choice” stage. In the not-asked trials, participants displayed greater uncertainty about the hidden states of the environment than in the asked trials. We identified statistically significant differences between the two conditions in signal amplitude over the left frontal region (p < 0.01), the right frontal region (p < 0.05), and the central region (p < 0.001) (Figure 5 (c)), suggesting a role for these areas in encoding the hidden states of the environment. This may indicate that when participants knew the hidden states, they could effectively integrate this information with the environmental statistical structure to make superior decisions, exhibiting more positive brain activity. The right panel of Figure 5 (c) also reveals a stronger delta band signal during not-asked trials, consistent with a link between low-frequency oscillations and uncertainty about the hidden states [26].
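The between-condition comparisons above can be sketched as follows (assuming, for illustration, paired t-tests across participants on the mean amplitude of each electrode cluster; the amplitude arrays here are synthetic stand-ins, not the recorded data):

    import numpy as np
    from scipy.stats import ttest_rel

    clusters = ["L-frontal", "R-frontal", "central", "L-parietal", "R-parietal"]
    rng = np.random.default_rng(0)                    # synthetic stand-in data
    amp_asked = rng.normal(size=(25, len(clusters)))
    amp_not_asked = amp_asked + rng.normal(0.3, 1.0, size=amp_asked.shape)

    # Paired test per cluster: asked vs not-asked trials
    for i, name in enumerate(clusters):
        t, p = ttest_rel(amp_asked[:, i], amp_not_asked[:, i])
        print(f"{name}: t = {t:.2f}, p = {p:.3f}")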

3.3 EEG results at source level

To uncover the functional roles of various brain regions in the decision process, we employed a generalized linear model (GLM) to fit the EEG source signal. The GLM included several regressors capturing different aspects of the decision-making process: expected free energy, active learning, active inference, extrinsic value (reward), and reward prediction error. Incorporating these regressors enabled us to assess how each of these factors influenced EEG activity and contributed to the decision-making process.
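In sketch form (all arrays below are hypothetical stand-ins: the per-trial model variables come from the fitted agent, and Y holds trial-by-source-by-time source estimates), such a GLM amounts to a mass-univariate regression:

    import numpy as np

    rng = np.random.default_rng(0)                       # synthetic stand-ins
    n_trials, n_sources, n_times = 120, 100, 500
    efe, learn, infer, extr, rpe = rng.normal(size=(5, n_trials))
    Y = rng.normal(size=(n_trials, n_sources, n_times))  # source-level EEG

    X = np.column_stack([efe, learn, infer, extr, rpe])
    Xz = (X - X.mean(0)) / X.std(0)                      # z-score regressors
    design = np.column_stack([np.ones(n_trials), Xz])    # add an intercept

    # Least-squares fit for every source and time point at once
    betas, *_ = np.linalg.lstsq(design, Y.reshape(n_trials, -1), rcond=None)
    betas = betas.reshape(design.shape[1], n_sources, n_times)
    # betas[0] is the intercept; betas[1:] map one-to-one onto the regressors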

3.3.1 “First choice” stage – action selection

During the “First choice” stage, participants chose between staying and approaching the ranger to gather information about the present context of the risky path, the latter choice coming at a cost.

We found a robust correlation (p < 0.05) between the “expected free energy” regressor and activity in the lateral occipital cortex (Figure 6 (a)). The rostral middle frontal cortex and the middle temporal gyrus also displayed correlations with expected free energy. For the “extrinsic value” regressor, we identified a strong correlation (p < 0.05) with activity over the inferior temporal gyrus; the superior temporal gyrus, insula, and precentral area showed correlations as well. For the “active inference” regressor (Figure 6 (b)), we observed a strong negative correlation (p < 0.05) with activity in the middle temporal gyrus, inferior temporal gyrus, superior temporal gyrus, and precentral area. Interestingly, during the “First choice” stage, the expected free energy and extrinsic value regressors were strongly correlated with neural activity both at the beginning and at the end of the stage; however, the extrinsic value correlations emerged earlier than those of expected free energy at the beginning, suggesting that the brain initially encodes reward values before integrating them with information values (active inference and active learning) for decision-making. For the “active learning” regressor, we found strong correlations in the middle temporal gyrus, lateral occipital cortex, inferior temporal gyrus, and inferior parietal lobule.

The source estimation results of the “First choice” stage for expected free energy and active inference. (A) The regression coefficients (β) of the expected free energy regressor. The blue point indicates the most correlated brain region (the lateral occipital cortex, left hemisphere, MNI: [−9.9, −96.8, 9.8]). The right panel shows the neural activity of the blue point, and the shadow indicates that the neural activity of the blue point in these time intervals (0.380 s to 0.581 s and 1.172 s to 1.724 s) is significantly correlated with expected free energy (p < 0.05, and the time intervals are longer than 0.2 s). (B) The regression coefficients (β) of the active inference regressor. The blue point indicates the most correlated brain region (the middle temporal gyrus, left hemisphere, MNI: [−63.5, −23.5, −13.8]). The right panel shows the neural activity of the blue point, and the green shadow indicates that the neural activity of the blue point in these time intervals (0.08 s to 0.636 s, 0.657 s to 0.906 s, and 1.32 s to 2.00 s) is significantly correlated with active inference (p < 0.05, and the time intervals are longer than 0.2 s).

3.3.2 “First result” stage – belief update

During the “First result” stage, participants were presented with the outcome of their initial choice, informing them of the current context of the risky path (either “Context 1” or “Context 2”), or giving no additional information if they had opted not to ask. This process relates to the “active inference” regressor (Figure 7 (a), (b)), as it corresponds to resolving uncertainty about hidden states. We observed a robust correlation (p < 0.05) within regions of the middle temporal gyrus and superior frontal cortex. Additionally, other brain regions, such as the temporal pole, lateral orbitofrontal cortex, and rostral middle frontal cortex, displayed time-dependent correlations. Toward the conclusion of the “First result” stage, we noted subtle correlations with active inference in parietal and occipital regions. This observation suggests that, following the assimilation of environmental state information, the brain starts utilizing a reward model to assist decision-making.

The source estimation results of the two result stages for active inference and active learning. (A) The regression coefficients (β) of the active inference regressor in the “First result” stage. The blue point indicates the most correlated brain region (the middle temporal gyrus, right hemisphere, MNI: [52.6, −32.3, −19.7]). The right panel shows the neural activity of the blue point, and the shadow indicates that the neural activity of the blue point in these time intervals (0.076 s to 0.436 s, 0.47 s to 0.67 s, and 0.80 s to 1.24 s) is significantly correlated with active inference (p < 0.05, and the time intervals are longer than 0.2 s). (B) The regression coefficients (β) of the active learning regressor in the “Second result” stage. The blue point indicates the most correlated brain region (the intersection of the middle temporal gyrus and the bankssts, left hemisphere, MNI: [−52.5, −56.6, 5.5]). The right panel shows the neural activity of the blue point, and the shadow indicates that the neural activity of the blue point in these time intervals (0.355 s to 0.745 s and 1.441 s to 1.651 s) is significantly correlated with active learning (p < 0.05, and the time intervals are longer than 0.2 s).

3.3.3 “Second choice” stage – action selection

During the “Second choice” stage, participants chose between the risky and safe paths based on the available information, with the aim of maximizing rewards. This requires a balance between exploration and exploitation, similar to the “First choice” stage. First, for the “expected free energy” regressor (Figure 8 (a), (b)), we identified strong correlations (p < 0.01) in regions of the lateral occipital cortex. Correlations were also observed at various periods within other brain regions, such as the superior parietal gyrus, inferior parietal gyrus, and rostral middle frontal cortex. In general, the correlations between regressors and brain signals were more pronounced in the “Second choice” stage than in the “First choice” stage. Regarding the “extrinsic value” regressor, certain regions of the middle temporal gyrus showed strong correlations during the interval from 0.104 s to 0.168 s. Furthermore, other brain regions, such as the inferior temporal gyrus and insula, exhibited some degree of correlation at different time periods. For the “active learning” regressor (Figure 8 (c), (d)), strong correlations (p < 0.05) were evident in the lateral occipital cortex, the parietal lobule, and the temporal pole.

The source estimation results of the “Second choice” stage for expected free energy and active learning. (a) The regression coefficients (β) of the expected free energy regressor. The blue point indicates the most correlated brain region (the lateral occipital cortex, left hemisphere, MNI: [−5.5, −92.8, 15.4]). The right panel shows the neural activity of the blue point, and the shadow indicates that the neural activity of the blue point in these time intervals (0.188 s to 0.516 s, 0.532 s to 1.312 s, and 1.336 s to 2.00 s) is significantly correlated with expected free energy (p < 0.05, and the time intervals are longer than 0.2 s). (b) The regression coefficients (β) of the active learning regressor. The blue point indicates the most correlated brain region (the lateral occipital cortex, left hemisphere, MNI: [−9.9, −96.8, 9.8]). The right panel shows the neural activity of the blue point, and the shadow indicates that the neural activity of the blue point in the time interval (0.108 s to 2.00 s) is significantly correlated with active learning (p < 0.05).

3.3.4 “Second result” stage – belief update

During the “Second result” stage, participants obtained specific rewards based on their second choice: selecting the safe path yielded a fixed reward, whereas choosing the risky path resulted in variable rewards contingent upon the context. For the “extrinsic value” regressor, we observed strong correlations (p < 0.05) in specific regions of the rostral middle frontal gyrus and the lateral orbitofrontal cortex. Additionally, other brain regions, such as the pars orbitalis, inferior temporal gyrus, lateral occipital sulcus, caudal middle frontal gyrus, and frontal pole, demonstrated varying degrees of correlation with “extrinsic value” across different time periods. With regard to the “active learning” regressor (Figure 7 (c), (d)), strong correlations (p < 0.05) were identified in the middle temporal gyrus and the superior parietal lobule.

4 Discussion

In this study, we utilized active inference to explore the different cognitive constructs and associated neural components involved in human exploration strategies during decision-making. Employing a contextual two-armed bandit task, we demonstrated that the active inference framework effectively describes real-world decision-making. Our findings indicate that active inference not only provides explanations and distinctions for the uncertainties arising during decision-making, but also reveals the common and unique neural correlates associated with different types of uncertainty and decision-making policies, as supported by evidence from both sensor-level and source-level EEG.

4.1 The varieties of human exploration strategies in active inference

In the diverse realm of human behavior, exploration strategies vary significantly depending on the situation at hand. Such strategies can be viewed as a blend of directed exploration, in which options with higher levels of uncertainty or ambiguity are favored, and random exploration, in which actions are chosen at random [27]. In the framework of active inference, the randomness in exploration derives from the precision parameter employed during policy selection: as the precision over policies decreases, the randomness in the agent’s actions escalates. The directed component of exploration, on the other hand, stems from the computation of expected free energy: policies that lead to the exploration of more ambiguous options, and hence yield higher information gain, are assigned lower expected free energy by the model [3, 4, 11].

Our model-fitting results for decision behavior indicate that people show high variance in their exploration strategies (Figure 4 (b)). From a model-based perspective, exploration strategies incorporate a fusion of model-free and model-based learning. Intriguingly, these two modes of learning exhibit both competition and cooperation within the human brain [28, 29]. The simplicity and effectiveness of model-free learning contrast with its inflexibility and data inefficiency. Conversely, model-based learning, although flexible and capable of forward planning, demands substantial cognitive resources. The active inference model leans more toward model-based learning, as it incorporates a cognitive representation of the environment to guide the agent’s actions. Our simulation results exhibited such model-based behaviors: the agent constructs an environment model and uses it to maximize rewards (Figure 3). To integrate model-free learning, a habitual term was added in [3]; this allows the active inference agent to exploit the cognitive model (model-based) for planning in the initial stages of a task and to utilize habits for increased accuracy and efficiency in later stages.

4.2 The strengths of the active inference framework in decision-making

Active inference is a comprehensive framework for elucidating neurocognitive processes (Figure 1). It unifies perception, decision-making, and learning within a single framework centered on the minimization of free energy. One of the primary strengths of the active inference model lies in its robust statistical [30] and neuroscientific underpinnings [31], which allow a lucid understanding of an agent’s interactions with its environment. By expressing both epistemic and pragmatic value within the common currency of free energy, the framework also facilitates modeling how each participant weights the two.

Active inference offers a superior exploration mechanism compared with basic model-free reinforcement learning (Figure 4 (c)). Traditional reinforcement learning models condition their policies solely on the state, a setting that makes it difficult to extract temporal information [32] and increases the likelihood of entrapment within local minima. In contrast, policies in active inference are determined by both time and state. This dependence on time [33] enables policies to adapt efficiently, for example emphasizing exploration in the initial stages and exploitation later on. Moreover, this mechanism prompts more exploratory behavior in instances of state ambiguity. A further advantage of active inference lies in its adaptability to different task environments [4]: it can configure different generative models to address distinct tasks and compute varied forms of free energy and expected free energy.

Despite these strengths, the active inference framework also has its limitations [34]. One notable limitation is its computational complexity (Figure 2 (c)), which results from its model-based architecture and restricts the traditional active inference model’s application in continuous state-action spaces. Additionally, the model relies heavily on the choice of priors, meaning that poorly chosen priors can adversely affect decision-making, learning, and other processes [8].

4.3 Representing uncertainties at the sensor level

In previous work, the use of EEG signals to study decision-making under uncertainty has largely concentrated on event-related potentials (ERPs) and spectral features at the sensor level [35-38]. In our study, the sensor-level results reveal more positive activity in multiple brain regions during the second half of the trials compared to the first half, and similarly during not-asked trials as opposed to asked trials (Figure 5).

In our setting, after the first half of the trials, participants had learned some information about the environmental statistical structure and thus experienced less ambiguity in the latter half of the trials. This increased understanding enabled them to better utilize the statistical structure for decision-making compared to the first half of the trials. In contrast, during the not-asked trials, the lack of knowledge about the environment’s hidden states led to higher-risk actions. This elevated risk was reflected in increased positive brain activity.

Ambiguity and risk, two pivotal factors in decision-making, are often conflated and can vary in meaning depending on the context. Regarding the sensor-level results, we find an overall more positive amplitude for the second half of the trials than for the first half (Figure 5 (b)). This may indicate a generally more positive EEG amplitude for lower-ambiguity trials, which appears to contrast with previous studies showing more positive amplitudes for higher-ambiguity trials [38, 39]. For example, a late positive potential (LPP) identified in that work differentiated levels of ambiguity, with the amplitude of the LPP serving as an index of perceptual ambiguity. However, the ambiguity in their task was defined as the perceptual difficulty of discrimination, while our definition of ambiguity corresponds to the information gained from certain policies. Furthermore, Zheng et al. [40] used a wheel-of-fortune task to examine the ERP and oscillatory correlates of neural feedback processing under conditions of risk and ambiguity. Their findings suggest that risky gambling enhanced cognitive control signals, as evidenced by theta oscillations, whereas ambiguous gambling heightened affective and motivational salience during feedback processing, as indicated by positive activity and delta oscillations. Future work may focus on oscillation-level analyses to provide more evidence on this point.

4.4 Representation of the decision-making process in the human brain

In our experiment, each stage corresponded to distinct phases of the decision-making process. Participants made decisions to optimize cumulative rewards based on current information about the environment during the two choice stages while acquiring information about the environment during the two result stages.

During the “First choice” stage, participants had to decide whether to bear an additional cost in exchange for information about the environment’s state (epistemic value). Here, the primary source of epistemic value stemmed from resolving the uncertainty about the hidden state (risk). The occipital cortex appears to play a critical role in this process by combining extrinsic value with epistemic value (expected free energy) to guide decision-making (Figure 6). A previous study [41] demonstrated significant activations in the lateral occipital complex during perceptual decision-making, indicating that the human brain may use perceptual persistence to facilitate reward-related decisions.

As for the “First result” stage, participants learned about the environment’s hidden states. Our results indicated that regions within the temporal lobe played a crucial role both in valuing the uncertainty of hidden states and in learning information about these states (Figure 7 (a)). Other studies have similarly demonstrated the importance of the temporal pole and the inferior temporal areas in processing ambiguity in lexical semantics [42, 43]. Studies in macaques have also identified a role for the inferior temporal lobe in representing blurred visual objects [44]. Throughout the “First result” stage, participants process the state information relevant to the current trial. The middle temporal gyrus is postulated to play a key role in processing this state information and employing it to construct an environmental model. This aligns with previous findings [45] suggesting that the middle temporal gyrus collaborates with other brain regions to facilitate conscious learning. Moreover, studies have identified deficits in episodic future thinking in patients with damage to the middle temporal gyrus (MTG) [46], indicating a critical role of the MTG in future-oriented decision-making tasks, particularly those involving future thinking [47-49].

In the “Second choice” stage, participants chose between the safe path and the risky path contingent on perceived value. When they knew the environment’s hidden states, participants tended to resolve the uncertainty about model parameters by opting for the risky path. Conversely, without knowledge of the hidden states, participants gravitated toward risk avoidance by selecting the safe path. Our results highlighted the significance of occipital regions, in conjunction with the temporal and parietal lobes, both in valuing the uncertainty of model parameters and in learning about these parameters (Figure 8). These results are consistent with another study demonstrating activation in the superior parietal, right precentral gyrus, postcentral gyrus, and superior frontal regions during decision-making involving ambiguity and risk [50]. Since the superior parietal region is involved in the integration of visual motion information and the extraction of episodic memory [51-53], it may play a crucial role in our decision task as well, where participants need to extract statistical relationships from different visual inputs at different times.

In the “Second result” stage, participants received rewards according to their actions, constructing the model-free action value function and the model-based state transition function. Our results highlighted the role of the frontal cortex in learning the action value function and the role of the middle temporal gyrus in learning the state transition function (Figure 7 (b)). Notably, the significance of the correlation between “active inference” and the middle temporal gyrus peaked later than that between “extrinsic value” and the orbitofrontal cortex. This temporal disparity may suggest that the brain processes reward information earlier than environmental information. However, this is contrary to previous findings in which the temporospatial factors derived from a PCA of the model-based prediction error effect peaked earlier than those of the model-free prediction error effect [54]. Future work should look more deeply into where and when the human brain processes different kinds of information in decision tasks.

In the two choice stages, we observed stronger correlations for the expected free energy than for the extrinsic value, suggesting that the expected free energy may better represent the value the brain actually employs to guide actions [55]. Our results pointed to a strong correlation between expected free energy and activations in the occipital lobe. Such a result may be explained by the lateral occipital sulcus playing a key role in the persistence of perceptual information and in representing delayed rewards [41, 56].

5 Conclusion

In the current study, we introduced the active inference framework as a means to investigate the neural mechanisms underlying an exploration-exploitation decision-making task. Compared to model-free reinforcement learning, active inference provides a superior exploration bonus during the initial trials and offers a better fit to the participants’ behavioral data. Given that the behavioral task in our study involved only a limited number of states and rewards, future research should strive to apply the active inference framework to more complex tasks. Specific brain regions may play key roles in balancing exploration and exploitation. The lateral occipital gyrus was primarily involved in action selection (expected free energy), while the temporal lobe regions were mainly engaged in valuing the information related to the hidden states of the environment. Furthermore, the middle temporal gyrus and lateral occipital gyrus were prominently involved in valuing the information related to the environmental model parameters. The temporal pole regions primarily participated in learning the hidden states of the environment (active inference), while the middle temporal gyrus was more engaged in learning the model parameters of the environment (active learning). In essence, our findings suggest that active inference is capable of characterizing human decision-making under uncertainty: reducing ambiguity, avoiding risk, and maximizing rewards. Overall, this research presents evidence from both behavioral and neural perspectives supporting the concept of active inference in decision-making processes, and offers insights into the neural mechanisms of human decision-making under various forms of uncertainty.

Data and Code availability

All experiment codes and analysis codes are available at GitHub: https://github.com/andlab-um/FreeEnergyEEG.

Acknowledgements

This work was mainly supported by the Science and Technology Development Fund (FDCT) of Macau [0127/2020/A3, 0041/2022/A], the Natural Science Foundation of Guangdong Province (2021A1515012509), the Shenzhen-Hong Kong-Macao Science and Technology Innovation Project (Category C) (SGDX2020110309280100), the MYRG of the University of Macau (MYRG2022-00188-ICI), the NSFC-FDCT Joint Program (0095/2022/AFJ), the SRG of the University of Macau (SRG202000027-ICI), the National Key R&D Program of China (2021YFF1200804), the National Natural Science Foundation of China (62001205), the Shenzhen Science and Technology Innovation Committee (2022410129, KCXFZ2020122117340001), and the Guangdong Provincial Key Laboratory of Advanced Biomaterials (2022B1212010003).

Author contributions

S.Z., Q.L., and H.W. developed the study concept and designed the study; S.Z. and H.W. prepared experimental materials; Q.L. and H.W. supervised the experiments and analyses; S.Z. and Y.T. performed the data collection; S.Z. performed the data analyses; all authors drafted, revised, and reviewed the manuscript and approved the final manuscript for submission.

Competing interests

The authors declare no competing interests.